-
PDF
- Split View
-
Views
-
Cite
Cite
Wubin Ding, Diljeet Kaur, Steve Horvath, Wanding Zhou, Comparative epigenome analysis using Infinium DNA methylation BeadChips, Briefings in Bioinformatics, Volume 24, Issue 1, January 2023, bbac617, https://doi.org/10.1093/bib/bbac617
- Share Icon Share
Abstract
The arrival of the Infinium DNA methylation BeadChips for mice and other nonhuman mammalian species has outpaced the development of the informatics that supports their use for epigenetics study in model organisms. Here, we present informatics infrastructure and methods to allow easy DNA methylation analysis on multiple species, including domesticated animals and inbred laboratory mice (in SeSAMe version 1.16.0+). First, we developed a data-driven analysis pipeline covering species inference, genome-specific data preprocessing and regression modeling. We targeted genomes of 310 species and 37 inbred mouse strains and showed that genome-specific preprocessing prevents artifacts and yields more accurate measurements than generic pipelines. Second, we uncovered the dynamics of the epigenome evolution in different genomic territories and tissue types through comparative analysis. We identified a catalog of inbred mouse strain-specific methylation differences, some of which are linked to the strains’ immune, metabolic and neurological phenotypes. By streamlining DNA methylation array analysis for undesigned genomes, our methods extend epigenome research to broad species contexts.
Introduction
DNA methylation, an ancient biochemical innovation traced back to a viral defense mechanism in bacteria [1], is a classical epigenetic mark in higher order eukaryotes [2]. It is extensively implicated in transcriptional regulation [3], cell differentiation [4], organismal development [5] and human diseases [6]. The evolution of genome-wide DNA methylation plays a significant role in regulating transposable element expansion in the host genomes [7], shaping the host genome composition [8] and driving genome evolution [9]. Nonhuman model organisms, e.g. inbred laboratory mouse, and comparative approaches have played a significant role in the discovery of methylation biology and the evolution of epigenetic mechanisms in higher order eukaryotes [10–14]. Recently, DNA methylation was also linked to traits and phenotypes in domesticated animals [15, 16] and postulated to serve biomarkers in livestock improvement programs [17, 18].
The Infinium DNA methylation microarray technologies (e.g. EPIC and HM450 arrays) are among the most widely used DNA methylation assay technologies for human DNA [19]. Despite its success in human epigenetic research, this technology had limited applications in non-human species until recently. This lack of adoption is partly attributable to the need to pre-determine the target CpGs during the array design and the human array’s limited coverage of the non-human genomes. Several attempts have been made to co-opt the Infinium array to primate and mammalian species using probes targeting evolutionarily conserved sequences [20–26]. Recently, Infinium arrays for non-human species were developed and have become available, best represented by the MM285 array, which targets 284 860 CpGs in the C57BL/6 J mouse genome, and the HorvathMammalMethylChip40 (Mammal40) array, a mammalian array that targets around 37 488 CpGs relatively conserved across mammalian species [27]. These two arrays were designed for multiple mouse strains or mammalian species. Despite the progress in assay development, using one array for multiple genomes is met with technical challenges in analyzing the array data. Notably, one must accommodate more flexible probe usage depending on the input DNA. For example, the array signal detection success is usually determined by comparing the probe signal intensity to the background. One can derive these background signals from the built-in control probes [28] and the out-of-band signals of Infinium-I chemistry [29]. However, both approaches depend on assumed probe alignment and may not faithfully represent the actual signal background when those assumptions fail to hold in arbitrary genomes. Similarly, these probe mapping assumptions affect background subtraction [30], dye bias correction [29] and other data normalization techniques that assume the detection success of most designed probes [31] or correct annotation of the control probes in every target species [32].
To streamline the use of Infinium arrays in model organisms and farm animals and their use for comparative epigenetic studies, we present a suite of informatics solutions to allow DNA methylation analysis on multiple species and mouse strains. These solutions are implemented in our tool SeSAMe [29] (Version 1.16.0+), which features novel methods and ab initio characterization of the EPICv2, EPIC, MM285 and Mammal40 array probes in over 300 species. These methods include (1) inference of the source species based on signal intensities, (2) determination of the probe utility space, including MM285 array probes whose methylation readings are potentially influenced by the presence of inbred mouse strain-specific SNPs, (3) methods and data infrastructure for Infinium-I color channel specification, (4) support of replicate probe design and daughter-strand probe designs, and (5) genome-specific data preprocessing, including signal background subtraction, dye bias correction and detection P-value calculation. Applying the new computational pipeline to query inbred mouse strain-associated methylation differences, we identified epigenetic links to metabolic and neurological phenotypes, including differential methylation involved in neuron and liver tissue-specific transcription factoring binding and micro-imprinting. We also uncovered principles of epigenetic regulation at various evolutionary scales in diverse mammalian species. By providing probe annotation and informatics for genome-dependent preprocessing, SeSAMe broadens the application scope of the Infinium DNA methylation BeadChip to arbitrary genomes, facilitating comparative epigenome studies at the population scales.
Results
Utility of Infinium BeadChips on vertebrate genomes
To analyze Infinium BeadChip from arbitrary genome sources, we developed an informatics infrastructure composed of (1) computational methods for online annotation of probe masking and color channel (Figure 1A), (2) a database of probe utility space on candidate genomes (Figure 1B), (3) inference methods of species and mouse strains from probe signal intensities and (4) adapted genome-aware background subtraction, dye bias correction and other data preprocessing methods. Our new data processing workflow (openSesame) allows for customizable end-to-end species- and strain-specific data preprocessing using one line of R code (Figure 1A). Targeting 310 (305 vertebrate) species and 37 inbred mouse strains, we defined the probe utility spaces for each candidate genome. This space, maximized at the designed species/mouse strain, shrinks in distantly related genomes (Figure 1B). The shrinkage occurs more rapidly in the MM285 and the EPIC arrays than in the Mammal40 array, which targets more conserved genomic territory by design (Figure 1B, Figure S1A). Genic region probes, particularly at the promoter and 3′-end of genes, are more retained (Figure 1C, Figure S1B), and the retention rate tracks overall sequence conservation (Figure 1D).

Utility of Infinium BeadChips on vertebrate genomes. (A) SeSAMe workflow of species and strain inference and genome-specific preprocessing. (B) Annotation of four Infinium BeadChip probes on the number of functional probes across 310 genomes (showing 33 representative species). (C) Genic view of probe retention of three generations of Infinium BeadChips. (D) Evolutionarily conserved versus non-conserved regions in the cross-species retention rate of probes on the MM285 array. (E) A schematic illustration of six groups of SNP influences on DNA methylation reading (R = A,G; Y=C,T,U; D = A,G,T). (F) Validation of strain-specific SNP influence on DNA methylation readings in 191 inbred mouse strain samples. Three wild-derived strains (CAST_EiJ,PWK_PhJ and MOLF_EiJ) were shown separately due to their higher SNP number.
It is established that adjacent SNPs influence the methylation reading on Infinium BeadChips [33, 34], partially defining probe utility. A naïve multiple regression of an MM285 data set revealed that 34.85% of mouse strain-specific methylation reading is under direct influences of SNPs (Figure S1C, odds ratio = 3.18, P < 2.2e-16). As a positive control, explicitly designed strain-specific SNP probes (rs probes) are highly enriched for strain-specific readings (Figure S1D). We classified SNP influences into six groups (Figure S1E) based on the SNP location, allele information and the probe design, as illustrated in Figures 1E and S1F. In brief, SNPs that alter target CpGs or impact probe hybridization/extension can create artificially high, low or intermediate readings. In part, such suboptimal hybridization/extension is caused by mutated probe target sequences (Figure 1E) and the extension base that flips the measurement color channel (Figure S1G). We reported 63 339 probes whose reading can be influenced by neighboring genetic variants and 15 002 probes on which the SNPs are present with no influence. We validated the prediction of all the putative SNP-influenced probes in a dataset of 191 inbred mouse tissue samples from 20 strains and 6 tissue types. The beta values for artificial high and low methylation reading group probes are close to 1 and 0, respectively. In contrast, the beta values for the probes in the suboptimal hybridization group approach 0.5 (Figure 1F) as both the methylated and the unmethylated allele signals approximate signal background. The SNP-influenced probes are most prevalent in three wild-derived strains—CAST/EiJ, MOLF/EiJ and PWK/PhJ (Figure S1H), highlighting the critical need to consider these artifacts in studying these wild mouse strains. We developed an automated pipeline that annotates SNP influence for other genomes not included in our collection.
Species inference and genome-specific Infinium data preprocessing
Towards a fully automated data-driven workflow, we sought to explore whether the target species could be inferred entirely from the methylation array data. We reasoned that probes with mappable sequences are more likely to emit a signal and be successfully detected. As such, probes with genome-specific mapping differences may carry the power of genome discrimination. Indeed, clustering probes by total signal intensity grouped samples by species origin and to a lesser degree, by tissue type (Figures 2A and S2A). Based on this intuition, we developed the SPIRAL (SPecies Inference fRom Alignment) method based on probe alignment scores and their signal detection P-values (Figure 2B, see section MATERIALS AND METHODS). These two metrics both reflect probe hybridization success and are inversely associated (Figure S2B). In brief, we first selected probes with the strongest and weakest signal detection as determined by their detection P-values. These positive and negative probes were then compared to their alignment scores in the candidate genomes. The species with the maximum area under the curve (AUC) is returned as the predicted species (see Figure S2C for an example). For the designed genome, the number of negative probes can be lacking, leading to genome discrimination primarily driven by other thermodynamic properties of probe hybridization rather than sequence dissimilarity (Figure S2D and E). We empirically determine that the sample has come from the designed genome based on the probe success rate (see section MATERIALS AND METHODS, Figure S2F).

Species inference and genome-specific Infinium data preprocessing. (A) Heatmap showing the total probe signal intensity across 1644 Mammal40 datasets clustered both row and column-wise. We selected top 200 probes (ranked by AS) with AS >47 in each species, combined them to obtain 4569 unique probes. (B) The species inference workflow. (C) Comparison of species prediction accuracy on human (HM450 and EPIC) array datasets. (D) Heatmap of the species prediction AUC in a Mammal40 dataset comparing reported species (column) and predicted species (row). Rows and columns color indicates the family of each species. Only candidate species predicted by at least one sample are shown. (E) Methylation titration comparison on MM285 rat DNA datasets (GSE184410) before and after SeSAMe processing. (F) Scatter plot comparing the mean beta value of each sample to the titrated fraction with generic, species-specific data preprocess and without data preprocessing on the rat DNA titration data. Generic processing does not include species inference and SigDF updates but still relies on detection P-value to perform probe masking. (G) Comparison of correlations between replicate probes and randomly selected probes targeting different loci. We randomly selected 10 000 records from each group for visualization.
We applied SPIRAL to 211 public EPIC/HM450 array data generated on non-human species, including primates and mice. In all test cases, including 23 samples adapted to query 5-hydroxymethylation, SPIRAL correctly identified the species’ origin (Figure 2C). We then applied SPIRAL to a combined Mammal40 dataset of 883 samples from 22 species (Figure 2D) and an MM285 dataset of 30 human, rat and hamster DNA samples (Figure S2G). On the mammal40 array dataset, SPIRAL made accurate predictions in 7 families and only made mistakes between closely related candidate species (Figure 2D). Notable mistakes include flying fox (Pteropodidae) samples being confused for greater horseshoe bats, a species very close to the large flying fox, in 58 of 235 cases (Figure 2D). Similarly, from the MM285 dataset, SPIRAL can correctly identify human, rat and hamster samples (Figure S2G). The SPIRAL algorithm provides inference of the samples’ origin of species from data, which is key to genome-specific preprocessing and normalization (below).
Previous Infinium array data processing methods are established mainly under the assumption that the array has been run on the designed species. Therefore, the chip’s global performance should evaluate the success of the whole experiment. In contrast, when the array is run on an arbitrary genome, one may expect detection success from a subset of the probes. We modified existing Infinium BeadChip data processing methods, e.g. background subtraction, dye bias correction and detection P-value calculation, to work in a genome-specific manner. For example, we use genome-specific out-of-band channel specification to parameterize signal background as is needed by the noob [30] and the pOOBAH algorithms [29]. Similarly, a substantial number of probes are expected to fail in a sample from a non-designed species. Focusing on probes with predicted utility in the target species effectively removed artifactual intermediate methylation readings (Figure S2H) associated with low signal intensity (Figure S2I), and improved the detection rate estimation (Figure S2J). Our pipeline collectively corrects the skewed methylation readings in a calibration experiment running MM285 (Figure 2E and F) on rat DNA of titrated methylation levels. The pipeline is reduced to the generic processing in performance on the designed genome (Figure S2H). Most of the probe replicates targeting the same cytosines or CpGs but with different designs [35] generate largely concordant DNA methylation measurements after preprocessing (Figure 2G), suggesting its robustness.
SeSAMe associates mouse strain-specific methylations to phenotypes
With proper probe masking and signal normalization, we explored the DNA methylation diversity among inbred mouse strains. We analyzed a public data set of 238 liver, spleen, tail and frontal lobe brain tissue samples from 25 mouse strains. The DNA methylomes are primarily grouped by tissue types (Figure 3A). Within each tissue type, the samples are further sub-clustered by mouse strains (Figure 3A), suggesting the prevalence of strain-specific methylation variation. By performing a multiple regression of DNA methylation on tissue, strain and sex, we identified 18 822 strain-specific differentially methylated CpGs (SDMCs, Methods). Most SDMCs strongly depend on tissue type (Figure 3B), with effect sizes highly intertwined (Figure S3A). In contrast, the SNP probes, which measure variant allele frequencies instead of cytosine methylation levels, were predominantly tissue independent (Figure S3B). Among the four tissue types we studied, the spleen has the most SDMCs, followed by the liver, tail and brain, suggesting that methylation variation may be linked to immune and metabolic phenotypes among these mouse strains.

SeSAMe associates mouse strain-specific methylations to phenotypes. (A) t-SNE embedding of methylomes of tissue samples from 25 different strains. SNP-influenced probes are masked. (B) An UpSet plot of strain-specific differential methylated CpGs (SDMCs) showing strong dependence on tissue types. (C) Enrichment of SDMCs shared across tissues in CTCF binding sites (red), consistent intermediate methylation (IM, green) and variably methylated regions (VMR, blue). (D) Enrichment of tissue-specific SDMCs in enhancer regions among different chromatin states (from ENCODE ChromHMM). SDMCs from all strains are merged to perform enrichment test. (E) Example differentially methylated regions at the Nap1l5/Herc3 locus, specific to the BTBR strain. (F) The number of frontal lobe brain SDMCs in different strains. Only strains with both sexes represented are shown. Frontal lobe brain SDMCs with LP/J showed the most methylation loss. (G) Transcription factor binding site (TFBS) enrichment of LP/J-specific hypomethylation. nQ, nD and nO represent the size of query, database and overlap CpGs set. (H) The number of liver SDMCs in different strains. Liver SDMCs with wild-derived strains showed more hyper- than hypomethylation. (I) Enrichment of liver-specific SDMCs at Polycomb repressive complex targets and hepatocyte-associated transcription factors.
We classified the SDMCs based on whether their methylation states depend on the tissue types. Tissue-independent hypermethylated SDMCs were enriched in CTCF binding sites, intermediately methylated CpGs and variably methylated-IAPs (VM-IAPs) (Figure 3C). VM-IAPs were found to have altered methylation levels across mouse individuals [36] (Figure 3C). Our findings support a previous study showing that VM-IAPs are found in multiple tissue types and are frequently adjacent to CTCF binding sites in the genome [36]. In contrast, tissue-independent hypomethylated SDMCs are slightly enriched at quiescent regions, and pseudogene transcription start sites (Figure S3D).
Studying tissue-specific SDMCs, on the other hand, revealed an enrichment at enhancer regions (e.g. ChromHMM states TssFlnk, EnhLo, EnhPr, EnhPois and Enh) but depletion at heterochromatic regions (Het), promoters (Tss) and gene bodies (Tx) (Figure 3D). Figure 3E shows an example of a hypermethylated brain-specific SDMC at the Nap1l5 locus in the BTBR mice. Despite being present in all four tissues studied, this methylation gain spans the most extended genomic region in the frontal lobe brain tissue. In contrast, the hypermethylation occurs at only a few CpGs in the liver tissue (Figure 3E). Nap1l5 is a paternally expressed, micro-imprinted gene nested within a non-imprinted host gene Herc3. Nap1l5 and Herc3 are highly expressed in cortical neurons [37]. The brain-specific, imprinting-associated DMR in BTBR mice may be implicated in their known loss of corpus callosum neurons and autism-related phenotypes [38].
Comparing frontal lobe brain tissue in different strains suggests more hypomethylation in LP/J mice (Figure 3F) with a lower global methylation average (Figure S3E). Investigating hypomethylated CpGs in LP/J mice revealed that these SDMCs are frequently bound by the paired box protein PAX7, NFI gene complex member NFIB and homeobox transcription factors, e.g. LHX3 (Figure 3G). Notably, PAX7 is key to lineage specification of early neural crest development in mammals [39] and nonmammalian vertebrates [40]. LP/J mice have a reduced number of neural crest-derived melanocytes in the coat and choroid layer of the eye. This difference in neural crest development, likely attributed to the Ednrb gene mutation [41], leads to dark eyes in the LP/J mice. The impaired neurogenesis may also be implicated in the increased propensity for audiogenic seizures in LP/J [42]. Similarly, the LIM factor LHX3 plays a crucial role in motor neuron differentiation [43], and NFIB cooperates with NFIA in coordinating fetal mouse forebrain development [44]. Further comparison of LP/J-specific hypomethylated sites with a mouse cell type-specific methylation atlas [45] suggests that this hypomethylation pattern might be due to an enrichment of hippocampus-specific excitatory neurons, e.g. at the dentate gyrus (DG) and hippocampal subfield CA1–3, as well as adult neural precursors (ANP) (Figure S3F).
Comparing the liver methylation profiles in different strains revealed an association with body and fat weight. The three wild-derived strains, CAST/EiJ, MOLF/EiJ, PWK/PhJ, tend to have more hypermethylated than hypomethylated CpGs in the liver (Figure 3H). Most hypermethylations are of small effect sizes (delta≤0.5). These hypermethylated sites in the liver may be associated with the smaller body weight and fat composition in these wild-derived mice [46]. KK/HiJ, a type 2 noninsulin-dependent diabetes and obesity model, has the most liver methylation variation only next to the four wild-derived strains [46], suggesting impaired epigenetic regulation of metabolism. Methylation change in the liver is disproportionately found at the binding sites of Polycomb repressive complexes, including PRC2, PRC1 and other PRC-associated co-factors (Figure 3I). The Polycomb repressive complex is known to pre-mark unmethylated DNA that gains methylation during cell proliferation [47–49]. Hence differential hepatic cell proliferation may explain the strain-specific methylation differences in the liver. Furthermore, transcription factors that govern hepatic cell differentiation and function are also differentially methylated (Figure 3I). Notably, HNF4A, PROX1 and FOXAs are among the TFs most frequently associated with hypermethylation alongside CTCF/cohesion complexes (Figure S3G). HNF4A and FOXAs are central canonical TFs for hepatocyte development and function [50]. PROX1 is an early marker of the developing mouse liver, controlling hepatocyte migration during liver morphogenesis [51]. NFIL3 is highly expressed in the liver and regulates hepatic gluconeogenesis [52]. The binding sites of several metabolic nuclear receptors, such as NR1D2, NR1H2 (Liver X receptor beta) and PPARA [53], were also enriched at strain-specific methylation differences. Intriguingly, the DNA methylation reader MECP2, best known for its role in neurogenesis and causing Rett syndrome when mutated, also regulates lipid metabolism in the liver in coordination with the NCOR1 corepressor complex [54]. NCOR1 and MECP2 are both bound to differential methylated sites across strains in the liver (Figure 3I). In contrast, we only identified two liver hypomethylation-associated TFs: CBX5 and ZFP57. CBX5/HP1A is a heterochromatin protein [55], consistent with methylation loss most occurring at heavily methylated heterochromatic regions.
SeSAMe analysis reveals principles of the epigenome evolution
Changes in DNA methylation contribute to the evolution of transcriptional regulation and phenotypic variation at a level beyond that of the gene sequence evolution [56]. We applied SeSAMe to analyze the methylomes of 857 blood samples from 16 species. First, the global blood DNA methylation levels averaged over different chromatin states are stable from species to species. Active transcription start sites and gene bodies display the least methylation variation featuring the lowest and the highest average methylation levels across ChromHMM states, respectively (Figure 4A). This genic stability (Figure S4A) is consistent with the evolutionary conservation of gene expression patterns across different mammalian species [57]. In contrast, enhancers are more variably methylated than promoters across species, in agreement with these elements having evolved more rapidly [58] and contributed to the gene expression robustness, possibly through partial redundancy [59]. Blood samples of the same species share more significant similarities in the DNA methylome than those of different species (Figure S4B). Within each species group, samples are further segregated by sex (Figure S4B, right panel). Further investigation of a data set of diverse mouse and rat tissue types revealed a joint DNA methylation determination by tissue and species, spanning the two leading principal components of the methylome data (Figure 4B). The tissue-type orders along the tissue-type axis were mainly conserved between mouse and rat samples, pointing to the relative stability of tissue biology across rodents (Figure 4B, right panel). To further evaluate the conservation of tissue-specific methylation, we identified tissue-specific hypo and hypermethylation of eight different tissue types (cecum, colon, esophagus, frontal lobe brain, hindbrain, small intestine, spleen and stomach) in mice. We studied matched tissues in the rat (Figure 4C). Focusing on the MM285 probes compatible with the rat and the mouse genome, we uncovered the conservation of these tissue-specific methylations, with brain-specific methylations most conserved, followed by the spleen. This methylome conservation may implicate transcriptional conservation. For example, the preservation of brain tissue methylation is consistent with a slower transcriptional change in brain tissue across mammals [57,60]. Compared to the brain-specific methylations, the tissue-specific methylation pattern is less conserved in the gastrointestinal (GI) tract, suggesting more rapid evolution of GI-specific methylation signature in the rodent clade.

SeSAMe analysis reveals principles of the epigenome evolution. (A) Comparison of chromatin state methylation average across 857 blood methylomes from 16 species. (B) Principal component analysis of 188 methylomes profiled from the mouse and rat DNA. (C) Tissue-specific hyper- and hypomethylation signatures comparing mouse and rat tissues. (D) The effect sizes of tissue- and species-specific methylation in multiple regression modeling of 703 naked mole-rat and cetaceans Mammal40 datasets. (E) Overlap of species-, tissue- and sex-specific methylation in the 703-sample analysis in 4D. (F) Strictly tissue- and species-specific methylation in the 703-sample analysis in 4E. (G) ChromHMM state enrichment of strict species- and tissue-specific methylation. (H) Comparison of two species groups (naked mole-rat and cetaceans) and two tissue types (blood and skin) on the methylations of 158 X-linked CpGs.
To validate the interplay of tissue type and species in more remotely related species, we compared two Mammal40 datasets of blood and skin tissues of mole rats and cetaceans (whales and dolphins) (see section MATERIALS AND METHODS). We performed a multiple regression analysis to model the roles of tissue type, species group and sex in governing the DNA methylome dynamics. Again, we found highly intertwined tissue-specific and species-specific methylation (Figure 4D). About, 38% of the methylation variation depends on both tissue and species (Methods), while 32%, 28% and 2% of the methylations vary by tissue, species and sex alone (Figure 4E). Despite the extensive statistical interaction between tissue and species, we identified 403 strictly species-specific and 250 strictly tissue-specific probes with an effect size of the corresponding predictor greater than 0.3 but the effect sizes of other predictors smaller than 0.1 (Figure 4F). Querying the genomic distribution of species- and tissue-specific methylation revealed that strictly tissue-specific methylation is slightly enriched in enhancer regions (ChromHMM code Enh, TssAFlnk). In contrast, species-specific methylation differences are strongly enriched in gene bodies (ChromHMM code Tx, TxWk, Figure 4G, see section MATERIALS AND METHODS). This enrichment reflects the role of enhancer regions in lineage specification and its evolutionary conservation across species. On the other hand, species-specific methylation at gene bodies reflects more the difference in gene expression states.
Finally, from the regression, we identified 180 sex-specific methylations from the above Mammal40 datasets, and these CpG sites are all X-linked on the human genome. X-linked gene promoters can be mono-allelically methylated in the female somatic cells (Figure 4H). The monoallelic methylation is associated with the X chromosome inactivation and is largely transcriptionally silenced. Comparing cetaceans and naked mole rats, we found that the inactive X-associated DNA methylation is mainly conserved, with most X-linked promoter probes displaying an intermediate methylation level (Figure 4H). Some CpGs lose intermediate methylation patterns in either of the two species groups and become fully methylated (black boxes in Figure 4H), likely due to gene silencing on both the active and inactive X chromosomes. Some tissue dependencies have also been noted, including CpGs that lost intermediate methylation only in blood but not skin tissue (red boxes in Figure 4H). Further study of these X-linked probes in a more extensive data set of 2141 samples from 44 species revealed that bats have more intermediate methylation loss than cetaceans (Figure S4C). Collectively, we observed preservation of the sex dimorphism associated with X chromosome inactivation, while most male samples are consistently unmethylated at these sites.
Discussion
The Infinium DNA Methylation BeadChip has been one of the most used assay technologies to profile human genome-wide DNA methylation. We presented genome inference methods and adapted genome-specific data preprocessing techniques to streamline the study of DNA methylation from arbitrary genomes using a single platform. Notably, we compared the methylomes of different inbred mouse strains to reveal extensive and intricate strain-specific methylation differences, highly intertwined with sex, tissue type and organismal age as shown previously [35]. We demonstrated that some of these differences might be linked to the corresponding mouse phenotype differences in a related tissue context. For example, we showed that LP/J mice, known to have impaired neural crest development, carry differential methylation in the binding sites of transcription factors involved in neurogenesis. BTBR strain, an autism disease model with documented loss of corpus callosum neurons, possesses unique methylation patterns in genes specifically expressed in neurons. Despite these associations, it is possible that many methylation differences merely reflect the genetic distance without simple phenotypic manifestation. This would be supported by a higher number of SDMCs in wild-derived strains. Schilling compared C57BL/6 J and BALB/cJ and found that most allele-specific DNA methylation is mainly determined by cis-acting sequences [61], highlighting a complex interplay of the genotype, phenotype and DNA methylation in mammalian cells. To enhance the sensitivity of detecting such association and its interaction with other covariates such as age and tissue, one could investigate more diverse mouse populations, such as the collaborative cross and the diverse outbred populations. Our strain-specific methylation catalog represents the epigenetic heterogeneity of the founder strains in these more diverse mouse cohorts.
The evolution of whole genome DNA methylation, which first emerged in early vertebrates [13] and prevailed in most human tissues, is not fully understood. We studied DNA methylation in different species contexts across a diverse group of mammalian species. By studying chromatin-aggregated methylation, we had not observed global methylation increase from other mammals, such as sheep, whales and horses, to humans, indicating that the human-like DNA methylome has appeared before the mammalian evolution and likely appeared more abruptly than progressively, possibly aligned with the evolution of DNA methyltransferases [62].
Our analysis sheds light on the principles of future generations of multi-species array design. We showed that most probes from species-specific arrays, such as the human EPIC and mouse MM285 array, are restricted to a small phylogenetic clade. The number of usable probes decays fast as the target species move away from the designed species, posing challenges to both inter-species comparison and inter-array normalization. The Mammal40 array retains a high fraction of probe utility within the mammalian species, meeting its design objective [27]. However, the Mammal40 array only optimally covers the human genome, and the number of usable probes still decreases as the target genome becomes less closely related to the designed genome. In the Mammal40 array, the support for different alleles was partly achieved by the wobble base design [27], which may expand the species scope of these platforms without increasing the probe number. Multiple versions of the same probe may be included to account for this bias in coverage and allow a fair comparison of DNA methylation on a dinucleotide resolution.
Materials and methods
Species inference
Species- and strain-specific preprocessing
We adapted three preprocessing components, i.e. dye bias correction, detection P-value masking and background subtraction, chained in the openSesame function in the SeSAMe package (Version 1.16.0+). The adaptation includes (1) the out-of-band signals were combined with the internal negative control probes to parameterize the signal background. (2) The SPIRAL species inference was employed before other signal preprocessing components. All the preprocessing uses an updated, genome-specific color channel in the SigDF object. (3) The order of normalization steps defaults to channel inference, dye bias correction, detection P-value masking and background subtraction but can be customized to suit the user’s preference. (4) Infinium-I probes that were non-uniquely mapped were excluded from the out-of-band signal pool. (5) Masked probes are excluded in functions that explicitly equalize the beta value distributions of probes of different design types, such as BMIQ [31]. (6) A color channel inference component was added to supplement alignment-based designation to detect residual color channel switches from data. (7) Quality control-based masking is used before all the other processing components. (8) Detection P-value calculation is done before background subtraction, which will modify the signal and may affect the out-of-band signal assumption used on pOOBAH. Raw detection rate is computed by the number of probes with significant detection P-values divided by the total number of probes. The species-specific detection rate is computed by the number of functional probes with a significant signal divided by the species’ total number of putative functional probes.
Predicting the influence of strain-specific variants
The genetic variants of inbred mouse strains were downloaded from Mouse Genomes Project (ftp://ftp-mouse.sanger.ac.uk/) [64]. We used BEDTools [65] to find strain-specific variants (SNP and Indel) located within 5 bp from the 3′-end of each probe’s extension base. To get reliable variants, we used the following conditions to filter variants: (1) FILTER = PASS, (2) Genotype (GT) was homozygous (1/1, 2/2, 3/3, 4/4 or 5/5), (3) GQ > 20 and (4) DP > 8 [66]. Groups were assigned to each probe based on the SNP position, the SNP type, the probe type, the probe directionality and the strand of bisulfite conversion (Figure S1E). The effect of SNP on probe methylation reading was classified into the following six groups: (1) no effect, (2) artificial low methylation reading, (3) artificial high methylation reading, (4) suboptimal hybridization, (5) G-R (green-to-red channel switch) and (5) R-G (red-to-green channel switch). To illustrate the color channel switch caused by SNP, only CpGs grouped to G-R or R-G were selected, and the U and M intensity for the green and red channels were calculated by SeSAMe readIDATpair function using the mouse array MM285 array dataset.
DNA methylation BeadChip data
Raw IDAT files for the Infinium array data were downloaded from Gene Expression Omnibus using the following accessions. Mammal40 data were downloaded from GSE169218 [67], GSE164127 [68], GSE147004 [69] and GSE173330 [70]. MM285 data were downloaded from GSE184410. One hundred and thirteen chimpanzee data were retrieved from GSE136296; 11 mouse data were downloaded from GSE110600; 61 HM450 samples were downloaded from GSE49177 (including bonobo, chimpanzee, human, mouse and rhesus samples); 23 samples interrogate 5hmCs using TET-mediated cytosine oxidation (GSE49177) and 20 baboon samples were downloaded from GSE101733. IDAT files were preprocessed to the total intensity and the DNA methylation beta value matrices using the openSesame workflow implemented in the SeSAMe package with default options [29].
Strain-specific differential methylation analysis
Raw IDAT files for 238 mouse samples were retrieved from our prior study (GSE184410) before being processed using the openSesame pipeline with the default parameters. In total, 3000 most variable CpGs were selected before using the Rtsne package (https://github.com/jkrijthe/Rtsne) to create a 2D embedding. We performed a multiple regression of strain and sex for each tissue type to identify the strain-specific differential methylated CpGs (SDMCs). The median DNA methylation level across all strains was used as the reference level. Sex-specific probes (with modeled beta difference ≥ 0.01 between the two sexes) were excluded. Probes with an absolute value of slope coefficient ≥ 0.2 are considered SDMCs. Excluding all probes with variants, we calculated the effect sizes of strain and tissue-specific methylation differences (between the maximum and the minimum coefficient). For each strain, we define tissue independent SDMCs as CpGs whose methylation slope coefficients ≥0.2 in at least three of the four tissue types. Tissue-dependent hypermethylated SDMCs are defined as CpG probes that are hypermethylated in only one of the four tissue types. We visualized the Herc3/Nap1l5 region (from 10 kb upstream to 80 kb downstream of the Herc3 gene) using the visualizeGene function in the SeSAMe package. The global methylation average calculated for different strains was compared to C57BL/6 J. Male and female samples were separated.
Single-cell brain whole-genome bisulfite sequencing data
The processed methylation level data from a published dataset [45] was downloaded from GEO with accession ID GSE132489. The methylation beta value in each CpG site was merged with biscuit mergecg (https://github.com/huishenlab/biscuit), and the average beta values were calculated for each cell type. We plotted 2966 probes with LP/J methylation levels lower than C57BL/6 J than 0.35.
Species-specific differential methylation analysis
ChromHMM averages were calculated using the dbStat function in the SeSAMe package. Mammal40 ChromHMM state overlap was calculated using the consensus human chromatin state as a surrogate [71]. Metagene plot was calculated using the KYCG_plotMeta function in the SeSAMe package. For clarity, we randomly sampled three samples from each combination of tissue type and species group, leading to 204 datasets. Horse sexes were missing and inferred from the fraction of intermediate X-linked CpGs (beta value average between 0.3 and 0.7) with a cutoff of 0.35. The principal component analysis was performed using the prcomp function in the R stats package. To select tissue methylation signatures, we performed a nonparametric test comparing beta values of the target tissue with those of the non-target tissues. We only kept CpGs whose methylation can fully separate the two mouse sample groups. We further require the delta beta value of the two sample groups to be greater than 0.3 and the fraction of the missing beta value to be under 30%. The top 50 CpGs with the greatest absolute value of delta beta were kept for visualization. We performed multiple regression modeling tissue, species, sex and tissue–species interaction. Probes with an effect size ≥0.2 and F-test P-value ≤0.05 are considered as significantly methylated CpG specific to that variable. Tissue-specific probes are defined as ones with tissue effect size ≥0.3 and species effect size ≤0.1. Species-specific probes are defined as ones with species effect size ≥0.3 and tissue effect size ≤0.1. We excluded probes with more than 20 samples showing NA in the methylation reading.
CpG enrichment analysis
The enrichment test for strain, tissue, species-specific methylated CpGs were performed using the testEnrichment function in the SeSAMe package. In brief, we manually curated our database sets using diverse sources of publicly available data. To engineer genomic features, we used BEDTools to intersect the Infinium BeadChip manifest with genomic coordinates from ChromHMM [72], GENCODE [73], ReMap [74], and various other genomic and epigenomic annotations [34]. We used Fisher’s exact test to evaluate the statistical significance of set overlaps and report the Benjamini–Hochberg adjusted P-value each curated CpGs set.
This study introduced novel computational methods to process the methylation BeadChip data on arbitrary genomes.
Meta-analysis of strain-specific DNA methylome atlas across 25 inbred mouse strains.
Mouse strain-specific DNA methylome profiles are associated with immune, metabolic and neurological disease phenotypes.
Funding
The NIH/NIGMS (grant R35GM146978 awarded to W.Z.). It was also supported by W.Z.’s startup fund at Children’s Hospital of Philadelphia and research sponsorship by FOXO Bioscience.
Availability
The SeSAMe R package (version 1.16.0+) and documentation are available on the Bioconductor home page at https://bioconductor.org/packages/release/bioc/html/sesame.html. The HM450, EPIC, EPICv2, MM285 and Mammal40 probe annotations are available at http://zwdzwd.github.io/InfiniumAnnotation, https://github.com/zhou-lab/InfiniumAnnotationV1. The mouse strain-specific SNP maskings are available at https://github.com/zhou-lab/InfiniumAnnotationV1/raw/main/Anno/MM285/MM285.mm10.mask.tsv.gz. Python script to annotate the influence of SNPs on DNA methylation reading for future array generations and new genome assemblies is available at https://github.com/zhou-lab/InfiniumManifestAnnotator.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Data Availability
All data generated or analyzed during this study are included in this published article and its supplementary information files.
Competing interests
W.Z. receives research funding from FOXO Bioscience.
Authors’ contributions
W.Z. conceived the study, W.Z. and W.D. implemented the algorithm, D.K. and S.H. helped with the data analysis, W.Z. and W.D. wrote the manuscript, S.H. revised the manuscript, all authors read and approved the final version.
Author Biographies
Wubin Ding is a postdoctoral researcher at the Children’s Hospital of Philadelphia, his research interest includes bioinformatic data analysis and software or algorithm development.
Diljeet Kaur is a Ph.D. student from the University of Pennsylvania. Her research focuses on developmental and computational epigenetics.
Steve Horvath is a professor in Human Genetics and Biostatistics at the University of California, his methodological research area lies at the intersection of biostatistics, bioinformatics, computational biology, cancer research, genetics, epidemiology, machine learning and systems biology.
Wanding Zhou is an assistant professor at the University of Pennsylvania and Children’s Hospital of Philadelphia. His research interests include DNA methylation drift and inference of cell-type-specific epigenetic signals.
References