Abstract

5-Methylation (m5C) on mRNA molecules is a prevalent internal posttranscriptional modification in eukaryotes. Although m5C modification has been reported to regulate some biological processes, whether most mRNA m5C modifications are functional is unknown. To address this question, we analyzed the genome-wide evolutionary characteristics of m5C modifications in protein-coding genes of humans and mice. Our analysis of RNA sequencing data from 13 tissues of both species revealed that (i) the occurrence of m5C modification is exceedingly low, (ii) the fraction of m5Cs decreases with the amount of Cs across genes or tissues, (iii) m5C modifications are mostly unshared between species, and (iv) m5C sites and motifs do not exhibit greater evolutionary conservation. Additionally, we estimate that a large fraction of the observed mRNA m5C modifications may be deleterious. Together, these observations suggest that most m5C modifications in mammalian mRNAs are nonadaptive, which has important implications for understanding the biological significance of m5C and other posttranscriptional modifications.

Introduction

RNA m5C modification refers to the process of adding a methyl group to the fifth carbon atom of cytosine residues within RNA molecules. This modification was discovered as early as 1958 (Amos and Korn 1958) and is one prevalent type of the over 170 known posttranscriptional modifications identified to date (Boccaletto et al. 2018). m5C is distributed on the mRNAs, tRNAs, rRNAs, and ncRNAs in three kingdoms of species, with the highest abundance in eukaryotic tRNAs and rRNAs (Schaefer et al. 2009; Motorin et al. 2010; Tuorto et al. 2012; de Crécy-Lagard et al. 2013; Cui et al. 2017; Huang et al. 2019; Wang and Lin 2023). In vertebrates, m5C modification is dynamically regulated by methyltransferases (i.e. NSUN and DNMT family members, known as the writers) (Bohnsack et al. 2019) and demethylases (i.e. TET families and ALKBH1, known as the erasers) (Fu et al. 2014; Haag et al. 2016; Kawarada et al. 2017). Subsequently, the m5C-modified RNAs could be recognized by binding proteins (i.e. ALYREF and YBX1, known as the readers) (Chang et al. 2013; Chen et al. 2019). Unlike RNA m6A modification, which has a consensus RRACH motif in most species, most m5C modifications across different species do not exhibit such motifs (Cui et al. 2017; Yang et al. 2019a). Nevertheless, two types of m5C sites, although only accounting for a small fraction of the observed m5C sites, were recently identified (Huang et al. 2019; Liu et al. 2021, 2022; Selmi et al. 2021; Zhang et al. 2021), including the type I that contains a downstream G-rich triplet motif and is methylated by NSUN2 and the type II that contains a downstream UCCA motif and is methylated by NSUN6.

With the development of high-throughput sequencing methods at single-nucleotide resolution (e.g. bisulfite sequencing [BS-seq] (Schaefer et al. 2009)), m5Cs have been found to be widely present in the mRNAs of various eukaryotic species, including humans, mice, zebrafish, flies, and yeast (Levin et al. 2010; Huang et al. 2019; Yang et al. 2019b; Liu et al. 2021, 2022). For example, more than 130,000 m5C sites have been uncovered in the human genome (Ma et al. 2022). Additionally, some crucial molecular functions of mRNA m5C modifications have been recently revealed. For instance, disrupting m5C modifications can affect the stability (Chen et al. 2019), export (Yang et al. 2017), and translational efficiency (Li et al. 2017; Janin et al. 2019) of mRNAs. Moreover, m5C modification plays a vital role in processes, such as the maternal-to-zygotic transition in zebrafish (Yang et al. 2019b), mitochondrial dysfunction (Haag et al. 2016), early development of brain and neurons (Johnson et al. 2023), cell tolerance (Tang et al. 2020), and proliferation and migration (Xue et al. 2019). Furthermore, m5C modification is also implicated in various diseases, including cancers (Chen et al. 2019), neurodevelopmental disorders (Blanco et al. 2014), and cardiovascular diseases (Bohnsack et al. 2019). These findings have led to the prevailing view that the mRNA m5C modification is a beneficial and widely used mechanism of posttranscriptional regulation, referred to hereafter as the adaptive hypothesis.

However, the functions mentioned above are mostly identified by examining the writers, erasers, and readers of m5C modification rather than the specific m5Cs themselves. Thus, despite the reported functions of m5C modification, whether these functions are mediated by the vast majority of m5Cs or just a select few remains unclear. If the majority are functional, identifying functional m5Cs should be relatively straightforward. However, this contradicts the current research status, which does not have enough known cases of functional m5C modifications in mRNAs. Thus, it seems more likely that most of the mRNA m5C modifications may be unrelated to the reported functions. Additionally, other mRNA processing events, such as m6A modification (Liu and Zhang 2018b), alternative splicing (Saudemont et al. 2017), A-to-I editing (Xu and Zhang 2014), and backsplicing (Xu and Zhang 2021), have all been found to be error-prone (Zhang and Xu 2022). Therefore, it is possible that the m5C modifications largely arise from molecular errors and thus are mostly nonadaptive in nature, which is referred to as the error hypothesis. Distinguishing between the error and adaptive hypotheses of m5C modification is crucial, as it will shed light on the origin, function, and biological significance of this ubiquitous RNA processing event in eukaryotes and guide future m5C research. In this study, we present a series of distinct predictions based on the error hypothesis regarding the genomic patterns of m5C modifications, which are not expected a priori under the adaptive hypothesis. By analyzing high-throughput RNA BS-seq data from multiple tissues of humans and mice, we provide unequivocal genomic evidence that most mRNA m5Cs in mammals is due to RNA modification error and is selectively disfavored.

Results

Occurrence of mRNA m5C Modifications is Exceedingly low

Under the error hypothesis, m5C modification is considered a type of RNA modification error that is generally expected to be deleterious. Therefore, the m5C rate, defined as the probability of an RNA cytosine being methylated, should have been minimized by natural selection. In contrast, the adaptive hypothesis does not a priori predict a low m5C rate because, under this hypothesis, m5C rates should be high enough to generate sufficient modified RNAs to perform their beneficial functions (Zhang and Xu 2022).

To distinguish between the error and adaptive hypotheses in mRNAs, we investigated m5C rates by using a BS-seq dataset from humans and mice, which includes seven human tissues and six mouse tissues (see Materials and Methods) (Huang et al. 2019). We first counted the number of m5Cs and unmethylated Cs from the BS-seq reads of each gene and then defined the m5C rate of a gene as the number of m5Cs divided by the total number of m5Cs and unmethylated Cs in the gene. To ensure more accurate estimation of m5C rates, we considered only those protein-coding genes for which the expression level is at least 1 transcript per million (TPM) and required at least two reads for a transcribed C site. We found the median fraction of m5C-modified genes is relatively large, which is 52.4% and 30.8% in human and mouse, respectively (Fig. 1a). Despite the high fraction of m5C-modified genes, most of them have only one or two m5C sites (supplementary fig. S1, Supplementary Material online), which is much less than the number of cytosines in the genes. The median fraction of m5C sites is only 0.02% and 0.01% among the transcribed C sites in human and mouse, respectively (Fig. 1b). Moreover, the median rate of m5C, measured by the fraction of Cs that are m5C-modified, is only 0.006% in human and 0.002% in mouse (Fig. 1c). Even when only m5C-modified genes are considered, the median rate of m5C is still very low, at just 0.007% in human and 0.004% in mouse (Fig. 1d). We further examined the distribution of m5C rates among m5C-modified genes. Again, we found in the examined tissues of human and mouse, m5C rates in most genes are below 0.1% (Fig. 1e). Together, these observations indicate that mRNA m5C modification occurs at a generally low rate in human and mouse genes, as expected if m5C modification is a type of RNA modification error.

Low rates of mRNA m5C modifications in mammals. a) Percentage of m5C-modified genes. b) Percentage of C sites that are m5C-modified. c) Percentage of Cs that are m5C-modified. d) Percentage of m5Cs in m5C-modified genes. e) Distribution of the percentage of m5Cs among m5C-modified genes. In each boxplot, the bottom and top edges of a box represent the first (Q1) and third (Q3) quartiles, respectively; the horizontal line inside the box indicates the median (MD); and the whiskers extend to the most extreme values inside inner fences, MD ± 1.5(Q3-Q1).
Fig. 1.

Low rates of mRNA m5C modifications in mammals. a) Percentage of m5C-modified genes. b) Percentage of C sites that are m5C-modified. c) Percentage of Cs that are m5C-modified. d) Percentage of m5Cs in m5C-modified genes. e) Distribution of the percentage of m5Cs among m5C-modified genes. In each boxplot, the bottom and top edges of a box represent the first (Q1) and third (Q3) quartiles, respectively; the horizontal line inside the box indicates the median (MD); and the whiskers extend to the most extreme values inside inner fences, MD ± 1.5(Q3-Q1).

Fraction of mRNA m5Cs Decreases with the Amount of Cs Across Genes

If RNA m5C is primarily caused by the errors in RNA modification, it could be deleterious for at least three reasons. First, it may lower the fraction of normally functional RNAs. Second, it could waste cellular resource and energy in synthesizing and degrading defective RNAs that can achieve the same functionality as normal RNAs. Third, m5C-modified RNAs may be toxic to cells. Given the m5C rate per C residue in a gene's RNAs, the harm of m5C modification associated with the first reason is independent of the total amount of Cs transcribed from the gene (hereafter referred to as the C amount), while the harm associated with the second and third reasons increases with the C amount. Thus, the total harm caused by m5C modification in a gene is expected to increase with the C amount. Consequently, natural selection against m5C modification in a gene should intensify with the C amount of the gene. As a result, the error hypothesis predicts that the m5C rate should decline with the C amount. In contrast, the adaptive hypothesis of m5C modification does not predict this negative correlation a priori because, under this hypothesis, the extent of m5C modification of a gene depends on the specific function and regulation of the gene.

To distinguish between the error and adaptive hypothesis of m5C modification in protein-coding genes, we counted the number of m5Cs and unmethylated Cs from the mRNAs transcribed from an m5C-modified gene. The m5C rate of a gene is then estimated by the proportion of m5Cs among the total amount of Cs (i.e. the sum of m5Cs and unmethylated Cs) produced from the gene (see Materials and Methods). Using the human heart as an example, we first studied the correlation between the m5C rate of a gene and its total C amount. Consistent with the prediction of the error hypothesis, the rank correlation (ρ) between m5C rate of a gene and its C amount is significantly negative (ρ = −0.47, P < 10−16, Fig. 2a). Qualitatively similar results were observed in all other examined tissues of human and mouse (Fig. 2b).

The m5C rate of a gene (or supergene) decreases with the C amount of the gene (or supergene). a) The m5C rate of a gene decreases with the C amount of the gene in the human heart. Each dot represents an m5C-modified gene, and the solid line shows the linear least-squares regression. Spearman's rank correlation (ρ) and associated P value are presented. Outliers above the regression line are marked in red, and outliers below the regression line are marked in blue. b) Spearman's correlation between the m5C rate of a gene and its C amount among m5C-modified genes in each tissue of each mammal examined. All correlations have P < 10−4. c) The m5C rate of a supergene decreases with the median C amount of all genes belonging to the supergene in the human heart. Each dot represents a supergene. All supergenes have the same total C amount. d) Spearman's correlation between the median C amount of a supergene and its m5C rate in each tissue of each mammal examined. All correlations have P < 10−10. e) The m5C rate in the human heart tends to be higher in the paralog with the relatively low C amount compared with the paralog with the relatively high C amount within a paralogous gene pair. The analysis uses original data. Each dot represents a paralogous gene pair. Dots above and below the diagonal are colored red and blue, respectively. The numbers of red and blue dots are indicated in the corresponding colors. The P value is derived from a binomial test of the null hypothesis that there are equal numbers of red and blue dots. f) Proportion of paralogous gene pairs for which the m5C rate of the paralog with the relatively low C amount exceeds that of the paralog with the relatively high C amount in each tissue of each species examined. Both original and downsampled data are used. All fractions are significantly greater than the chance expectation of 50% (P < 10−4).
Fig. 2.

The m5C rate of a gene (or supergene) decreases with the C amount of the gene (or supergene). a) The m5C rate of a gene decreases with the C amount of the gene in the human heart. Each dot represents an m5C-modified gene, and the solid line shows the linear least-squares regression. Spearman's rank correlation (ρ) and associated P value are presented. Outliers above the regression line are marked in red, and outliers below the regression line are marked in blue. b) Spearman's correlation between the m5C rate of a gene and its C amount among m5C-modified genes in each tissue of each mammal examined. All correlations have P < 10−4. c) The m5C rate of a supergene decreases with the median C amount of all genes belonging to the supergene in the human heart. Each dot represents a supergene. All supergenes have the same total C amount. d) Spearman's correlation between the median C amount of a supergene and its m5C rate in each tissue of each mammal examined. All correlations have P < 10−10. e) The m5C rate in the human heart tends to be higher in the paralog with the relatively low C amount compared with the paralog with the relatively high C amount within a paralogous gene pair. The analysis uses original data. Each dot represents a paralogous gene pair. Dots above and below the diagonal are colored red and blue, respectively. The numbers of red and blue dots are indicated in the corresponding colors. The P value is derived from a binomial test of the null hypothesis that there are equal numbers of red and blue dots. f) Proportion of paralogous gene pairs for which the m5C rate of the paralog with the relatively low C amount exceeds that of the paralog with the relatively high C amount in each tissue of each species examined. Both original and downsampled data are used. All fractions are significantly greater than the chance expectation of 50% (P < 10−4).

Because the total C amount of a gene rises with the number of mRNA molecules (i.e. the equivalent of gene expression level) produced by the gene as well as the number of exonic C sites in the gene, we further examined the correlation between the m5C rate and these two variables separately. We measured the expression level of a gene using TPM and found it negatively correlates with the m5C rate of the gene (supplementary fig. S2a, Supplementary Material online). We also found a significantly negative correlation between the m5C rate of a gene and the number of exonic C sites in the gene (supplementary fig. S2b, Supplementary Material online). Moreover, the fraction of exonic C sites that are m5C-modified in a gene is negatively correlated with the number of total exonic C sites in the gene (supplementary fig. S2c, Supplementary Material online). These observations suggest that natural selection resists the harm of m5C modification not only by reducing the number of m5C sites but also by lowering the occurrence of m5C modification at each C site.

The negative correlation between m5C rates and the C amounts is subject to two potential biases. First, because the detectability of m5C increases with the C amount, both low and high m5C rates are observable in genes with high C amounts, whereas only high m5C rates may be observed in genes with low C amounts. Consequently, a negative correlation between m5C rates and C amounts could result simply from this detection bias. Second, because the C amount is used as the denominator in estimating m5C rate, a negative correlation between C amounts and m5C rates may simply arise from the intrinsic inverse relationship between a number and its reciprocal. To avoid these potential biases, we used a supergene method (see Materials and Methods). Specially, we ranked all genes by their C amounts and then grouped them into 100 bins with an equal amount of Cs. We then treated all genes in a bin together as a single supergene and calculated the overall m5C rate of the supergene. In the human heart, the m5C rate of a supergene decreases with the median C amount of all genes in the supergene (ρ = −0.85, P = 6.9 × 10−29; Fig. 2c). Similar results were observed in other examined tissues of human and mouse (Fig. 2d).

All the above analyses compared a heterogeneous set of genes that vary in many properties. To minimize the influence of potential confounding factors, we compared the m5C rates between paralogous genes because paralogs are similar in gene structure, DNA sequence, regulation, and function (Zhang 2013). To ensure sufficient power in the analysis, we required the C amount to be at least two times different between paralogs. Consistent with the prediction of the error hypothesis, the gene with a relatively low C amount tends to have a higher m5C rate in a pair of paralogs. For example, in the human heart, 70.7% of paralogous pairs show such a trend, which is significantly greater than the random expectation of 50% (P < 10−16, binomial test; Fig. 2e). To confirm the robustness of the observation in Fig. 2e, we repeated the above analysis 1000 times; the significant trend was confirmed in all 1,000 replications (supplementary fig. S3, Supplementary Material online). Moreover, the pattern in Fig. 2e holds across other examined tissues of human and mouse (Fig. 2f). Since the supergene method cannot pair paralogous genes, we used a downsampling approach (see Materials and Methods) to address potential statistical issues. Specifically, for each pair of paralogs, we randomly sampled Cs (including both m5Cs and unmethylated Cs) from the paralog with the higher amount of Cs to match the number of Cs observed in the other paralog. The results from the downsampled data were virtually identical to those from the original data (Fig. 2f).

Fraction of mRNA m5Cs Negatively Correlates with the Amount of Cs Across Tissues

The above analyses, which compared m5C modifications among different genes in the same tissue, focused on the influence of cis-acting elements because the genes are in the same environment of trans-acting factors. However, in addition to the cis-acting elements, m5C modifications may also be regulated by trans-acting factors. Since different tissues can provide qualitatively or quantitatively different trans-acting factors, the m5C rate of the same gene may differ across tissues. Therefore, we then compared the same protein-coding gene among different tissues to elucidate the effect of trans-acting factors. The error hypothesis predicts that, for a given gene, natural selection against m5C error should intensify in the tissue where the total C amount of the gene is higher. Consequently, there should be a negative correlation between m5C rate and the total C amount across tissues for individual genes. In contrast, the adaptive hypothesis does not predict a priori such a relationship because, under this hypothesis, m5C rate of a gene across tissues depends on its tissue-specific functions.

Because the m5C rate is very low or even zero in most genes (Fig. 1e), sampling error could obscure the potential signal in the comparisons among tissues of individual genes. To avoid this problem, we randomly grouped every 250 genes into a supergene (Xu and Zhang 2021), except for the last supergene that comprised the remainder of fewer than 250 genes after grouping. The number 250 was selected to ensure that each supergene contains sufficient genes with m5C modifications and that there are sufficient supergenes to allow meaningful statistical analysis. We then examined the m5C rate and the total C amount of each supergene in each tissue. As shown in Fig. 3a, for example, the m5C rate of a particular human supergene in a tissue generally decreases with its total C amount. Overall, in both human and mouse analyzed, significantly more than 50% of supergenes exhibit this negative correlation (Fig. 3b). Because each supergene does not have the same total C amount across tissues, to avoid potential statistical artifacts, we downsampled the C amount of a supergene in a tissue to the lowest observed level among all tissues for that supergene and then recomputed the m5C rate of the supergene in each tissue. The final results remain qualitatively unchanged (Fig. 3b).

Negative correlation between the m5C rate and C amount across tissues. The m5C rate is the number of m5Cs divided by the total number of transcribed Cs in a supergene. a) The m5C rate of a particular supergene in a tissue decreases with the total C amount of the genes belonging to the supergene in the tissue. The virtually superimposed black and red lines are the linear least-squares regressions for the original and downsampled data, respectively. b) Distribution of Spearman's correlation coefficient between m5C rate and the total C amount across tissues for all supergenes. In each boxplot, the lower and upper edges of a box represent Q1 and Q3 quartiles, respectively; the horizontal line inside the box indicates the MD; and the whiskers extend to the most extreme values inside inner fences, MD ± 1.5(Q3 − Q1). Below each boxplot is the fraction of supergenes showing a negative correlation; all fractions significantly exceed 50% (P < 0.05, binomial test).
Fig. 3.

Negative correlation between the m5C rate and C amount across tissues. The m5C rate is the number of m5Cs divided by the total number of transcribed Cs in a supergene. a) The m5C rate of a particular supergene in a tissue decreases with the total C amount of the genes belonging to the supergene in the tissue. The virtually superimposed black and red lines are the linear least-squares regressions for the original and downsampled data, respectively. b) Distribution of Spearman's correlation coefficient between m5C rate and the total C amount across tissues for all supergenes. In each boxplot, the lower and upper edges of a box represent Q1 and Q3 quartiles, respectively; the horizontal line inside the box indicates the MD; and the whiskers extend to the most extreme values inside inner fences, MD ± 1.5(Q3 − Q1). Below each boxplot is the fraction of supergenes showing a negative correlation; all fractions significantly exceed 50% (P < 0.05, binomial test).

mRNA m5C Modifications are Mostly Unshared Between Human and Mouse

Among-species conservation is a commonly used method to determine whether a molecular event has functional significance (Liu and Zhang 2018b; Xu and Zhang 2021). Therefore, we next differentiate between the error and adaptive hypothesis by exploring the conservation of m5C modification between human and mouse. Given the evolutionary divergence between these two species, it is reasonable to assume that an m5C modification in one of them is lost in the other unless the modification is functional and hence conserved. We started by analyzing the heart tissue. From the expressed one-to-one orthologous protein-coding genes, we identified 8,779,033 one-to-one orthologous C sites between the two species, of which 5,852 and 1,514 m5C sites are observed in human and mouse, respectively. However, the number of m5C sites shared by the two species is only 14 (Fig. 4a). Despite such a small number of shared m5C modification, there may still be an overestimation due to the m5C modifications shared by chance. To remove the chance expectation, we randomly sampled the same number of C sites from each m5C gene as that observed in the gene for human and mouse, respectively, and then counted the number of shared ones in the randomly picked C sites between the two species; this number represents the expected number of C sites that would be modified in both species by chance when m5C modifications in the two species are independent. This simulation was repeated 1,000 times to obtain the average number of randomly shared m5C sites. In human heart tissue, for example, we determined an average of 0.68 m5C sites shared by chance (Fig. 4b). Given that there are 5,852 human m5C sites in the one-to-one orthologs between human and mouse, but only 14 − 0.68 = 13.32 of them are shared between the two species, the fraction of human m5C modifications that are selectively conserved is 0.23%. The corresponding fraction in mouse is 0.88%. Analyzing all five tissues showed the median values of 0.47% and 1.3% of the m5C sites are shared in human and mouse, respectively (Fig. 4c and supplementary fig. S4a, Supplementary Material online).

Sharing of mRNA m5C modifications between human and mouse. a) Venn diagram showing the observed number of shared m5C sites in one-to-one orthologous genes between human and mouse heart tissue. b) Frequency distribution of the number of shared m5C sites expected by chance in human upon the control for potential detection bias. The dotted arrow shows the mean of the distribution, whereas the solid arrow indicates the observed number of shared m5C sites. The P value is the fraction of the distribution on the right side of the solid arrow. c) The fraction of shared m5C sites in other human tissues. d) Venn diagram showing the observed number of shared m5C one-to-one orthologous genes between human and mouse heart tissue. e) Frequency distribution of the number of shared m5C-modified genes expected by chance in human upon the control for potential detection bias. The dotted arrow shows the mean of the distribution, whereas the solid arrow indicates the observed number of shared m5C-modified genes. The P value is the fraction of the distribution on the right side of the solid arrow. f) The fraction of shared m5C-modified genes in other human tissues. The fraction of shared m5C sites or m5C-modified genes that are observed, expected by chance, and truly shared after removing chance expectation are displayed in c) and f).
Fig. 4.

Sharing of mRNA m5C modifications between human and mouse. a) Venn diagram showing the observed number of shared m5C sites in one-to-one orthologous genes between human and mouse heart tissue. b) Frequency distribution of the number of shared m5C sites expected by chance in human upon the control for potential detection bias. The dotted arrow shows the mean of the distribution, whereas the solid arrow indicates the observed number of shared m5C sites. The P value is the fraction of the distribution on the right side of the solid arrow. c) The fraction of shared m5C sites in other human tissues. d) Venn diagram showing the observed number of shared m5C one-to-one orthologous genes between human and mouse heart tissue. e) Frequency distribution of the number of shared m5C-modified genes expected by chance in human upon the control for potential detection bias. The dotted arrow shows the mean of the distribution, whereas the solid arrow indicates the observed number of shared m5C-modified genes. The P value is the fraction of the distribution on the right side of the solid arrow. f) The fraction of shared m5C-modified genes in other human tissues. The fraction of shared m5C sites or m5C-modified genes that are observed, expected by chance, and truly shared after removing chance expectation are displayed in c) and f).

Some studies reported that it may not be important which specific sites in a protein are phosphorylated, as long as the protein has sites that undergo phosphorylation (Moses et al. 2007; Nguyen Ba and Moses 2010; Landry et al. 2014). By analogy, the function of m5C modification in a gene may be exerted through the gene as a whole, rather than through specific sites within the gene. In this scenario, a gene that is m5C-modified in one species tends to be m5C-modified in other species, but the specific methylated site(s) in the gene may vary. To clarify this possibility, we analyzed the sharing of m5C-modified genes within the 15,861 one-to-one protein-coding orthologs between human and mouse. In the heart tissue, for example, 2,614 and 1,624 genes are m5C-modified in human and mouse, respectively. The number of m5C-modified genes shared in the two species is 467, which accounts for 17.8% and 28.8% in human and mouse, respectively (Fig. 4d). To remove the chance expectation, we randomly picked the same number of genes in human and mouse as that observed in the two species, respectively, and then counted the number of shared genes. We repeated this simulation 1,000 times and got a mean value of 267.1 genes that are shared by chance (Fig. 4e). Thus, only 467 − 267.1 = 199.9 gene-level m5C modifications may be considered functional, which constitute 199.9/2,614 = 7.6% of m5C-modified genes in human and 199.9/1,624 = 12.3% in mouse. Similarly, small fractions of shared m5C-modified genes were observed in all five tissues, with the median values of 7% in human and 12% in mouse (Fig. 4f and supplementary fig. S4b, Supplementary Material online).

mRNA m5C Sites are not Evolutionarily More Conserved

Under the adaptive hypothesis, m5C modification is generally beneficial, predicting that m5C sites should be evolutionarily more conserved than unmethylated C sites due to the additional selective constraints associated with m5C-specific functions. In contrast, no such prediction is made by the error hypothesis. Therefore, comparing the evolutionary conservation of m5C sites with that of unmethylated C sites allows us to differentiate between the adaptive and error hypotheses. We combined the seven human tissues to compile a nonredundant dataset of m5C sites and unmethylated C sites for protein-coding genes and used PhastCons score to measure their conservations (see Materials and Methods). Since m5C modifications occur in the 5′-UTR, CDS, and 3′-UTR of a protein-coding gene (Huang et al. 2019), we analyzed the three regions separately because of their different levels of background conservation. Briefly, we compared the PhastCons scores of m5C sites with those of unmethylated C sites which are located in the same region. To avoid potential bias from the unequal number of unmethylated C sites and m5C sites, we randomly selected an unmethylated C site from the same region in the same gene to serve as a control for each m5C site. However, inconsistent with the adaptive hypothesis, we found that the mean PhastCons score of m5C sites in CDS regions is 0.668, which is not higher than the 0.672 score of unmethylated C sites in the same region (Fig. 5a). To ensure the comparison's robustness, we repeated the randomization 1,000 times to obtain the distribution of the mean PhastCons scores. Still, no significant difference in conservation was observed between m5C sites and their controls (Fig. 5a). Because conservation varies among codon positions, we also performed the comparisons separately for each of the three codon positions. Even so, no significantly higher conservations were observed for m5C sites (supplementary fig. S5a, Supplementary Material online). Similarly, we did not find significantly higher conservations for m5C sites in the 5′- and 3′-UTRs either (Fig. 5a). Since PhastCons score quantifies conservation based on conserved regions (Siepel et al. 2005), we then redid the above comparisons using PhyloP score that measures conservation base-to-base (Pollard et al. 2010). Nevertheless, the results remain qualitatively unchanged (supplementary fig. S6, Supplementary Material online).

Conservation of mRNA m5C sites. Comparison of a) divergence, b) polymorphism, and c) DAF between m5C sites and unmethylated C sites. The divergence and polymorphism were measured by PhastCons score and SNP density, respectively. The inserted boxplot is an example of the comparison of divergence, polymorphism, or DAF between observed m5C sites and unmethylated C sites of a random sampling, while the distribution shows the results of 1,000 randomizations. For each random sampling, a mean PhastCons score, SNP density, or mean DAF was obtained for unmethylated C sites, and then, mean values from 1,000 randomizations were used to draw a distribution. The solid arrow indicates the mean of the distribution, whereas the dotted arrow indicates the mean PhastCons score, SNP density, or mean DAF of observed m5C sites. P values in the inserted boxplots were calculated by Mann–Whitney U test for PhastCons score and DAF and χ2 test for SNP density. Other P values are the fractions of the distribution on the a) left or b) and c) right side of the dotted arrow. The results of CDS, 5′-UTR, and 3′-UTR are shown separately from the top to the bottom.
Fig. 5.

Conservation of mRNA m5C sites. Comparison of a) divergence, b) polymorphism, and c) DAF between m5C sites and unmethylated C sites. The divergence and polymorphism were measured by PhastCons score and SNP density, respectively. The inserted boxplot is an example of the comparison of divergence, polymorphism, or DAF between observed m5C sites and unmethylated C sites of a random sampling, while the distribution shows the results of 1,000 randomizations. For each random sampling, a mean PhastCons score, SNP density, or mean DAF was obtained for unmethylated C sites, and then, mean values from 1,000 randomizations were used to draw a distribution. The solid arrow indicates the mean of the distribution, whereas the dotted arrow indicates the mean PhastCons score, SNP density, or mean DAF of observed m5C sites. P values in the inserted boxplots were calculated by Mann–Whitney U test for PhastCons score and DAF and χ2 test for SNP density. Other P values are the fractions of the distribution on the a) left or b) and c) right side of the dotted arrow. The results of CDS, 5′-UTR, and 3′-UTR are shown separately from the top to the bottom.

The above among-species conservation analyses cannot completely rule out the functionality of m5C modifications because it is also possible that they have lineage-/species-specific functions. To examine this possibility, we computed single-nucleotide polymorphism (SNP) density and derived allele frequency (DAF) for m5C sites and unmethylated C sites in m5C-modified genes. The adaptive hypothesis predicts that if m5C sites have lineage-/species-specific functions, they should show reduced intraspecific SNP density and DAF when compared with unmethylated C sites. However, the SNP density and DAF of m5C sites are not significantly lower than those of unmethylated C sites in any of the three types of gene regions (Fig. 5b and c). In addition, the same is true when the codon positions are separately analyzed (supplementary fig. S5b and c, Supplementary Material online). Thus, compared with unmethylated C sites, m5C sites are not evolutionarily more conserved either within species or between species. Taken together, these results are inconsistent with the predictions of the adaptive hypothesis, but rather with those of the error hypothesis.

mRNA m5C Motifs are not Evolutionarily More Conserved

It is reported that some mRNA m5C sites can be classified into two types according to their specific methylation writers and dependent sequence motifs (Huang et al. 2019; Liu et al. 2021, 2022; Selmi et al. 2021; Zhang et al. 2021). Briefly, type I m5C sites contain a G-rich triplet motif (i.e. CNGGG with the first C methylated) and are methylated by NSUN2, while the type II m5C sites contain a downstream UCCA motif (i.e. CUCCA with the first C methylated) and are methylated by NSUN6. Although these m5C sites account for only a small fraction (∼5.1% in our dataset, supplementary fig. S7, Supplementary Material online), they are regarded as highly reliable (Liu et al. 2021, 2022; Selmi et al. 2021). Nevertheless, they are still not evolutionarily more conserved than unmethylated C sites (supplementary fig. S8, Supplementary Material online).

For these m5C sites, if they are beneficial, their motifs should be protected by purifying selection. Hence, under the adaptive hypothesis, a higher conservation should be observed for these m5C motifs. To examine this prediction, the CNGGG or CUCCA motifs that are not methylated at the first C site were selected as the controls. To eliminate potential bias due to the unequal number of m5C motifs and their controls, we randomly picked up a control for each m5C motif from the same region in the same gene. Using the PhastCons score to measure the conservation, we found the mean conservation value is 0.647 for m5C motifs in CDS regions, which is not significantly different from 0.641 of the controls. This trend was then confirmed by our 1,000 times of randomizations (Fig. 6a). Similarly, we did not find significantly higher conservations for m5C motifs in the 5′- and 3′-UTRs either (Fig. 6a). Additionally, the patterns remain consistent even using PhyloP score to quantify the conservation (supplementary fig. S9, Supplementary Material online).

Conservation of mRNA m5C motifs. Comparison of a) divergence, b) polymorphism, and c) DAF between the motifs (i.e. CNGGG or CUCCA) of m5C sites and the same motifs of unmethylated C sites. The divergence and polymorphism were measured by PhastCons score and SNP density, respectively. The inserted boxplot is an example of the comparisons, while the distribution shows the comparison results of 1,000 randomizations. For each random sampling, a mean PhastCons score, SNP density, or mean DAF was obtained for the motifs of unmethylated C sites, and then, mean values from 1,000 randomizations were used to draw a distribution. The solid arrow indicates the mean of the distribution, whereas the dotted arrow indicates the mean PhastCons score of the motifs of observed m5C sites. P values in the inserted boxplots were calculated by Mann–Whitney U test for PhastCons score and DAF and χ2 test for SNP density. Other P values are the fractions of the distribution on the a) left or b) and c) right side of the dotted arrow. The results of CDS, 5′-UTR, and 3′-UTR are shown separately from the top to the bottom.
Fig. 6.

Conservation of mRNA m5C motifs. Comparison of a) divergence, b) polymorphism, and c) DAF between the motifs (i.e. CNGGG or CUCCA) of m5C sites and the same motifs of unmethylated C sites. The divergence and polymorphism were measured by PhastCons score and SNP density, respectively. The inserted boxplot is an example of the comparisons, while the distribution shows the comparison results of 1,000 randomizations. For each random sampling, a mean PhastCons score, SNP density, or mean DAF was obtained for the motifs of unmethylated C sites, and then, mean values from 1,000 randomizations were used to draw a distribution. The solid arrow indicates the mean of the distribution, whereas the dotted arrow indicates the mean PhastCons score of the motifs of observed m5C sites. P values in the inserted boxplots were calculated by Mann–Whitney U test for PhastCons score and DAF and χ2 test for SNP density. Other P values are the fractions of the distribution on the a) left or b) and c) right side of the dotted arrow. The results of CDS, 5′-UTR, and 3′-UTR are shown separately from the top to the bottom.

We next compared the SNP density and DAF of m5C motifs with their controls to rule out the possibility of potential lineage-specific functions. The adaptive hypothesis predicts lower SNP density and DAF for the m5C motifs. However, this pattern was not observed in any of the three types of gene regions (Fig. 6b and c). Thus, these observations suggest no additional selective constraint on m5C motifs, which is consistent with the error hypothesis rather than the adaptive hypothesis.

Most mRNA m5C Modifications are Deleterious

Although the above analyses strongly suggest that most mRNA m5C modifications are deleterious, they do not indicate what fraction of m5C modifications are harmful. To address this question, we next used an established method to estimate the fraction of deleterious m5C modifications (Saudemont et al. 2017). This estimation is based on the reasonable assumption that the fitness effect of an m5C site before the action of natural selection is independent of the total amount of Cs at the site. Because the strength of natural selection against m5C modifications increases with the C amounts, we assume that all deleterious m5C modifications have been selectively removed in genes of the highest C amount. In other words, the observed m5C rate in these genes represents the nondeleterious m5C rate (ND). Similarly, we assume that none of the deleterious m5C modifications has been selectively purged in genes of the lowest C amount. Thus, the observed m5C rate in these genes reflects the total m5C rate (T). Consequently, the fraction of deleterious m5C modifications can be calculated as Fdel = (TND)/T = 1 – ND/T. We defined genes of the lowest and highest C amount using a variety of cutoffs. In theory, using more stringent cutoffs makes the estimate of Fdel more accurate but less precise due to the reduction in sample size. When the data from all tissues were combined, we found Fdel to be greater than 90% for each species regardless of any combination of cutoffs (Fig. 7a). For example, when the cutoffs of bottom 1% and top 1% m5C-modified protein-coding genes were used to define genes with the lowest and highest C amounts, respectively, human T = 0.0014 and ND = 3.93 × 10−5, so Fdel = 97.2%. Similarly, under these cutoffs, Fdel = 99.6% for the mouse. Using the same cutoffs, we further estimated the Fdel for each tissue in each species and found all Fdel values were greater than 98%. Although Fdel varies among tissues, we observed a median Fdel of 99.4% and 99.6% in human and mouse, respectively (Fig. 7b), which are similar to the Fdel estimated from all tissues combined (Fig. 7a). Because very slightly deleterious m5C modifications may not have been fully removed by selection in the genes of the highest C amounts, the true ND should be smaller than that our estimate. Similarly, because some strongly deleterious m5C modifications may have been removed by selection even in the genes of the lowest C amounts, the true T should be larger than our estimate. Therefore, the true Fdel should be larger according to its calculation 1 – ND/T, making our Fdel estimates conservative.

Fraction of deleterious m5C modifications in mRNAs. a) Estimated fractions of deleterious m5C modifications for each species when data from all tissues are combined, using different cutoffs based on gene numbers with the lowest and highest C amount. b) Estimated fractions of deleterious m5C modifications in each tissue of each species examined under the cutoffs highlighted in a).
Fig. 7.

Fraction of deleterious m5C modifications in mRNAs. a) Estimated fractions of deleterious m5C modifications for each species when data from all tissues are combined, using different cutoffs based on gene numbers with the lowest and highest C amount. b) Estimated fractions of deleterious m5C modifications in each tissue of each species examined under the cutoffs highlighted in a).

Note that Fdel measures the fraction of m5C modifications that are deleterious before the action of purifying selection. Because some deleterious m5C modifications have been removed by selection, the fraction of observed m5C modifications that are deleterious should be lower. To this end, we used the overall m5C rate from all genes as T and the m5C rate from genes in the top 1% of highest C amounts as ND to re-estimate the deleterious fraction of observed m5C modifications (Odel). When the data from all tissues are combined, Odel is 80% for human and 79% for mouse. The Odel values are 19% to 20% lower than the corresponding Fdel values, suggesting that a substantial number of deleterious m5C modifications have been removed by natural selection. Together, our estimation indicates that the vast majority of all or observed mRNA m5C modifications are deleterious, which is broadly consistent with the above other findings in this study.

Discussion

Genome-scale discoveries of mRNA m5C modifications in mammals and functional explorations of corresponding “writers,” “erasers,” and “readers” prompted the prevailing view that mRNA m5C modifications are generally functionally important and beneficial in regulating biological processes. Nevertheless, in this study, we proposed and tested an alternative hypothesis that most mRNA m5C modifications arise from nonadaptive posttranscriptional modification errors. By analyzing BS-seq data of human and mouse tissues, we demonstrated that (i) m5C modification occurs at a very low rate, (ii) the m5C rate in a gene decreases with the C amount of the gene, (iii) the m5C rate of a gene negatively correlates with its C amount across tissues, (iv) m5C modifications are mostly unshared between species, (v) m5C sites are not evolutionarily more conserved than unmethylated C sites, (vi) no additional purifying selection protects m5C motifs, and (vii) erroneous m5C modifications account for a major proportion. None of these observations were predicted a prior by the adaptive hypothesis, but all fit the predictions of the error hypothesis. Together, these lines of empirical genome-wide evidence strongly suggest that most mammalian mRNA m5C modifications are nonadaptive.

Nevertheless, it is notable that the error hypothesis of mRNA m5C modification does not refute the possibility that a small fraction of m5C modifications may occasionally be functional. In this context, it is highly valuable to identify functional m5C modifications, even if they account for only a small fraction in mRNAs. To identify potentially functional m5C modifications, we conceived a method analogous to the search for genes hosting functional circRNAs (Xu and Zhang 2021; Li et al. 2024). Specifically, we regressed between the m5C rate and C amount across genes in Fig. 2a and calculated Cook's distance for each gene (see Materials and Methods). We then defined a gene as an outlier if its Cook's distance is more than four times the mean Cook's distance of all genes. Because most m5C modifications result from molecular errors, leading to a negative correlation between m5C rates and C amounts, functional m5C modifications are expected to break this rule and show a higher m5C rate than predicted by the C amount. Thus, the m5C modifications in genes that are identified as outliers from the regression analyses are expected to be functional. Using this approach, some genes functioning in an m5C modification manner are indeed discovered in the outliers, such as MYC (Zhang et al. 2023), CHD3 (Meng et al. 2024), RABL6 (Wang et al. 2023), SMOX (Liu et al. 2024), MUC1 (Wang et al. 2024), and NM23-H1(Lu et al. 2024). Identifying functional m5C modifications from a sea of mostly nonfunctional m5C modifications is an important but difficult task. Therefore, a more systematic approach to uncover functional m5C modifications is needed in the future.

In total, our analyses identified 55,508 m5C sites in 10,956 human genes and 15,085 m5C sites in 6,912 mouse genes from the examined tissues, which is higher than that reported by another study (Huang et al. 2019). The primary reason is the different coverage thresholds used to determine m5C sites. We employed a threshold of at least two reads to filter out false-positive m5C sites, whereas other studies used thresholds of 10 or even 20 reads (Huang et al. 2019; Johnson et al. 2022; Liu et al. 2022). While high filtering thresholds help detect potentially functional m5C modifications, they also lead to the loss of most true-positive m5C modifications caused by molecular error. Therefore, it is not appropriate to set a high threshold in our study. Nevertheless, we followed the high filtering thresholds used by Huang et al. (2019) to compile a new set of m5C modifications and repeated the analyses. The resultant patterns still support the error hypothesis (supplementary fig. S10, Supplementary Material online).

Unlike the previous study that focused solely on the coding sequence (CDS) when delineating the nonadaptive features of mRNA m6As (Liu and Zhang 2018b), our analyses encompassed the CDS, 5′-UTR, and 3′-UTR, because studies have indicated the presence of m5Cs across these regions (Yang et al. 2017; Huang et al. 2019; Liu et al. 2022). To remove confounding factors arising from differences between CDSs and UTRs, we also explored each region separately. Indeed, when assessing conservation, we have already calculated the regions separately due to differences in the conservation of the background C bases. And all results continue to fully support our molecular error hypothesis (Fig. 5). Nevertheless, our present study is limited to protein-coding genes and does not examine the characteristics of m5Cs in noncoding genes. Given the substantial occurrence of m5C modifications in noncoding RNAs (Chen et al. 2021), such as tRNA, it remains relevant to investigate the adaptability of m5C modification in noncoding RNAs.

Although the m5C data used in this study are sufficient to examine the error hypothesis, other available BS-seq m5C datasets were not considered. To test the robustness of our findings, we analyzed newly identified m5C sites from developmental stages (Liu et al. 2022), where m5C modification is more abundant compared with tissues. Consequently, we found that the fraction of shared m5C modifications was 1.1% at the site level and 18.5% at the gene level in humans (supplementary fig. S11, Supplementary Material online). We further compared m5C sites and motifs with unmethylated C sites and motifs to assess conservation differences. Consistent with the results observed in tissues, neither the m5C sites (supplementary fig. S12, Supplementary Material online) nor their motifs (supplementary fig. S13, Supplementary Material online) showed higher conservation. Therefore, even with the use of a new dataset enriched with m5C sites, our results remain robust.

Previous studies have shown that sequence context and RNA structure are important factors for m5C modifications (Liu et al. 2021, 2022; Selmi et al. 2021). Specifically, type I m5C modifications, typically catalyzed by NSUN2, occur more frequently in motifs or regions with a higher proportion of Gs, fewer Us, and more base-pairing structures in +1 to +4 motif regions. In contrast, type II m5C modifications, generally catalyzed by NSUN6, are more common in regions with reduced base pairing in the 5 nt flanking regions but increased base pairing in the +10 to +15 nt flanking regions. Although the biogenesis of m5C modifications appears to be influenced by these sequence and structural features, this does not necessarily indicate that m5C modifications are adaptive, because whether these features are favored by natural selection is unclear. To address this question, we compared the ancestral state with current state for the m5C-associated motifs or regions of the NSUN2-dependent m5C sites (i.e. type I) and NSUN6-dependent m5C sites (i.e. type II). Because m5C identification is based on the reference genome, we used the reference alleles as the representation of current alleles. Taking the proportion of Gs in the +1 to +4 region of type I m5C motifs as an example, the adaptive hypothesis predicts that G would be favored by natural selection, resulting in a higher proportion of Gs in the motifs of current state compared with ancestral state. However, our analysis showed that the fraction of Gs in the motifs of current state was 67.5%, which is not significantly different from 67.4% in the motifs of ancestral state (P = 0.99, Fisher's exact test, supplementary fig. S14a, Supplementary Material online). This lack of difference is not attributable to the genomic background, as no significant differences were observed in unmethylated C motifs either (supplementary fig. S14a, Supplementary Material online). Similar results were found for the fraction of Us and the degree of base pairing in the +1 to +4 region of type I m5C motifs (supplementary fig. S14b and c, Supplementary Material online) as well as for the degree of base pairing in the 5 nt and +10 to +15 nt flanking regions of type II m5C motifs (supplementary fig. S14d and e, Supplementary Material online). Thus, no positive selection is observed for m5C-associated features, suggesting a nonadaptive nature of m5C modification.

One might question why m5C remains in biological systems, given its nature of molecular error. Basically, there are two lines of potential explanations for its presence in cells (Zhang and Xu 2022). First, biological process is stochastic in nature, errors cannot be completely avoided. When the m5C enzyme (i.e. the writer) occasionally acts on an incorrect substrate, an erroneous m5C reaction may happen. Second, m5C modification errors have not been eliminated by natural selection for several possible reasons: (i) the reduction of m5C modification errors is not cost-free and natural selection would not favor such reduction if the associated cost exceeds the benefit; (ii) while the net benefit of reducing m5C modification errors may be positive, this benefit diminishes as error rates decrease, ultimately rendering selection for error reduction too weak to overcome the effects of genetic drift; (iii) cells may have evolved mechanisms to lessen the damage caused by the m5C modification error such that the benefit of reducing error is minimized; and (iv) it is possible that no mutation exists that could entirely eliminate m5C modification errors.

Our findings on m5C modification, along with the previous study on m6A (Liu and Zhang 2018b), demonstrate that RNA modifications are generally error-prone. Nevertheless, more than 170 types of posttranscriptional modifications have been identified to date (Boccaletto et al. 2018), and whether other modifications are nonadaptive remains unknown. Thus, it would be interesting to test the error hypothesis for these modifications in the future. Together, these findings echo other recent discoveries that a number of steps in transcription and translation are fallible, including, for example, alternative transcriptional start (Xu et al. 2019), adenosine-to-inosine (A-to-I) editing (Xu and Zhang 2014), cytidine-to-uridine (C-to-U) editing (Liu and Zhang 2018a), alternative linear splicing (Saudemont et al. 2017), backsplicing (Xu and Zhang 2021), alternative polyadenylation (Xu and Zhang 2018), and translational stop-codon read-through (Li and Zhang 2019). Collectively, these studies suggest that cellular processes are fraught with variability and far from the perfection commonly imagined, offering critical insights into our understanding of the fundamentals of biology (Zhang and Yang 2015; Zhang and Xu 2022).

Materials and Methods

m5C Sites

mRNA m5C sites were identified from the dataset generated by Huang et al. (2019). This dataset, containing seven human tissues and six mouse tissues, was produced using BS-seq that is specific for RNA m5C studies. The raw reads were downloaded from the NCBI GEO database (accession number: GSE122260) and then trimmed with Fastp (Chen et al. 2018) to remove adaptor contaminations and low-quality nucleotides. The resulting clean reads were mapped to reference genomes (GRCh37 or GRCm38) using hisat2. Uniquely mapped reads with quality larger than 30 were used to call candidate m5C sites and corresponding coverages by meRanCall (-md 5 -sc 10 -ei 0.1 -cr 0.99 -fdr 0.01 -bed63 -np -gref) module of the meRanTK pipeline (v1.2.1b) (Rieder et al. 2016). The false discovery rate of m5C was controlled at 0.01. m5C sites were identified only from the genes with at least 1 TPM and were further filtered by the minimal coverage of 2 or 3 Cs and 20 minimal coverage as stricter cutoff (Huang et al. 2019).

The NSUN2-dependent m5C sites (i.e. type I) and NSUN6-dependent m5C sites (i.e. type II) were downloaded from the study of Liu et al. (2021), while m5C sites in developmental stages were downloaded from another study of Liu et al. (2022).

C Amount

The C amount of a C site refers to the number of RNA C residues transcribed from the C site, which is equal to the sum of m5C coverage and unmethylated C coverage at the site. The C amount for each C site was acquired by samtools (v1.20) (Li et al. 2009) from the BAM files aligned from BS-seq reads. The C amount of a gene or supergene is the sum of RNA m5C and unmethylated C residues transcribed from the gene or supergene. The m5C rate of a gene is calculated as the total number of RNA m5C residues divided by the total number of transcribed C residues in the gene.

Corrections of Unequal Surveys of m5C Modification Among Genes

Due to the variation of C amounts among genes, the detection of m5C is more sensitive in genes with high C amounts than in those with low C amounts. To eliminate the potential influence of this unequal sensitivity, we used two different approaches. The first is the supergene approach. Briefly, we ranked all genes by their C amounts and then grouped them into 100 bins representing 100 supergenes, ensuring the total C amount per bin to be the same for all bins. The m5C amount and total C amount were summed up across all genes in the bin. The supergene approach cannot be used under certain circumstances, such as the comparison between paralogs. Thus, we used the downsampling as the alternative. Basically, we randomly picked the number of transcribed m5Cs from the gene with a relatively high C amount to match the level observed in the gene with a relatively low C amount. This downsampling equalizes the survey depth of m5C between the two genes. The supergene approach is preferred over downsampling when both can be used, because the former uses all data, while the latter uses only part of the data. In comparing m5C rates among tissues, because many m5C genes would lack m5C coverage in multiple tissues after downsampling, leading to a loss of statistical power, we combined the supergene approach with downsampling, as described in the Results section.

Cook's Distance

Cook's distance (Di) is used in regression analysis to find influential outliers in a set of predictor variables (Cook 1977; RStudio Team 2013; Inc 2022). In other words, the aim of Di is to identify points that negatively affect your regression model. The measurement is a combination of each observation's leverage and residual values; the higher the leverage and residuals, the higher the Cook's distance. It summarizes how much all the values in the regression model change when the ith observation is removed. The formula for Cook's distance is:

where y^j is the jth gene's, junction's, or site's fitted response value, y^j(i) is the jth gene's, junction's, or site's fitted response value when gene, junction, or site i is removed, n is the number of genes, junctions, or sites, MSE is the mean squared error, and p is the number of coefficients in the regression model.

Paralogs and Orthologs

Paralogous genes were downloaded from Ensembl (release 89; May 2017) for human and mouse. We obtained 3,678 human protein-coding gene families with 51,657 pairs of paralogs and 3,856 mouse gene families with 79,968 pairs of paralogs. From each gene family, we randomly selected only one paralogous pair that exhibits a 2-fold or greater difference in the C amount to ensure a sufficient statistical power. Orthologous genes between human and mouse were downloaded from Ensembl (release 89; May 2017). We identified 16,797 one-to-one orthologous genes, which were the only ones considered in our analysis. To identify orthologous C sites between species, we used the UCSC liftOver tool (https://genome.ucsc.edu/util.html) to align the sites of one species with the genome of another species.

Conservation Score and SNP

To examine the evolutionary conservation of m5C sites and their motifs, we downloaded the PhastCons and PhyloP scores from UCSC (http://hgdownload.cse.ucsc.edu/goldenpath/hg19/phastCons46way/placentalMammals/placentalMammals.phastCons46way.bw and https://hgdownload.soe.ucsc.edu/goldenPath/hg19/phyloP46way/placentalMammals.phyloP46way.bw), respectively. These scores are computed from genome alignments of 46 placental mammals, including humans (hg19). SNP data were downloaded from the Ensembl database (GRCh37.75), with insertions and deletions not considered. Thus, a total of 53,336,692 SNPs were used for the analyses. SNP density was calculated as the number of SNP sites divided by the total number of nucleotide sites considered. The nucleotide observed at a SNP was categorized as ancestral if it is the same as the nucleotide of the “AA” field in the polymorphism VCF file; other nucleotides at the SNP are derived. The DAF at a SNP is the frequency of the derived allele at the SNP. Additionally, human ancestral alleles were downloaded from ENSEMBL as well (https://ftp.ensembl.org/pub/release-75/fasta/ancestral_alleles/).

Quantification and Statistical Analysis

All statistical analyses in this study were performed using R software, and the corresponding statistical methods are described in the main text and figure legends.

Supplementary Material

Supplementary material is available at Molecular Biology and Evolution online.

Acknowledgments

The authors thank Dr. Jianheng Liu for sharing the data from Huang et al. (2019) and helpful instructions on analyzing the data.

Author Contributions

C.X. conceived the study. C.X. and Z.L. designed and wrote the paper. Z.L. and K.M. analyzed the data. All authors contributed to the manuscript revision.

Funding

This study was supported by National Science and Technology Innovation 2030 Major Projects for “Brain Science and Brain-Inspired Research” (2022ZD0214400), National Natural Science Foundation of China (32270704, 32100518, and 32472630), and Medical Engineering Crossover Fund of Shanghai Jiao Tong University (YG2022QN084).

Data Availability

mRNA m5C data of human and mouse tissues in this study are available in NCBI GEO database (accession number: GSE122260). The NSUN2-dependent m5C sites (i.e. type I) and NSUN6-dependent m5C sites (i.e. type II) were downloaded from the study of Liu et al. (2021), while m5C sites in developmental stages were downloaded from another study of Liu et al. (2022).

References

Amos
 
H
,
Korn
 
M
.
5-Methyl cytosine in the RNA of Escherichia coli
.
Biochim Biophys Acta
.
1958
:
29
(
2
):
444
445
. .

Blanco
 
S
,
Dietmann
 
S
,
Flores
 
JV
,
Hussain
 
S
,
Kutter
 
C
,
Humphreys
 
P
,
Lukk
 
M
,
Lombard
 
P
,
Treps
 
L
,
Popis
 
M
, et al.  
Aberrant methylation of tRNAs links cellular stress to neuro-developmental disorders
.
EMBO J.
 
2014
:
33
(
18
):
2020
2039
. .

Boccaletto
 
P
,
Machnicka
 
MA
,
Purta
 
E
,
Piątkowski
 
P
,
Bagiński
 
B
,
Wirecki
 
TK
,
de Crécy-Lagard
 
V
,
Ross
 
R
,
Limbach
 
PA
,
Kotter
 
A
, et al.  
MODOMICS: a database of RNA modification pathways. 2017 update
.
Nucleic Acids Res.
 
2018
:
46
(
D1
):
D303
D307
. .

Bohnsack
 
K
,
Höbartner
 
C
,
Bohnsack
 
M
.
Eukaryotic 5-methylcytosine (m5C) RNA methyltransferases: mechanisms, cellular functions, and links to disease
.
Genes (Basel)
.
2019
:
10
(
2
):
102
. .

Chang
 
C-T
,
Hautbergue
 
GM
,
Walsh
 
MJ
,
Viphakone
 
N
,
van Dijk
 
TB
,
Philipsen
 
S
,
Wilson
 
SA
.
Chtop is a component of the dynamic TREX mRNA export complex
.
EMBO J.
 
2013
:
32
(
3
):
473
486
. .

Chen
 
S
,
Zhou
 
Y
,
Chen
 
Y
,
Gu
 
J
.
fastp: an ultra-fast all-in-one FASTQ preprocessor
.
Bioinformatics
.
2018
:
34
(
17
):
i884
i890
. .

Chen
 
X
,
Li
 
A
,
Sun
 
B-F
,
Yang
 
Y
,
Han
 
Y-N
,
Yuan
 
X
,
Chen
 
R-X
,
Wei
 
W-S
,
Liu
 
Y
,
Gao
 
C-C
, et al.  
5-Methylcytosine promotes pathogenesis of bladder cancer through stabilizing mRNAs
.
Nat Cell Biol.
 
2019
:
21
(
8
):
978
990
. .

Chen
 
YS
,
Yang
 
WL
,
Zhao
 
YL
,
Yang
 
YG
.
Dynamic transcriptomic m(5) C and its regulatory role in RNA processing
.
Wiley Interdiscip Rev RNA
.
2021
:
12
(
4
):
e1639
. .

Cook
 
RD
.
Detection of influential observation in linear regression
.
Technometrics
.
1977
:
19
(
1
):
15
. .

Cui
 
X
,
Liang
 
Z
,
Shen
 
L
,
Zhang
 
Q
,
Bao
 
S
,
Geng
 
Y
,
Zhang
 
B
,
Leo
 
V
,
Vardy
 
LA
,
Lu
 
T
, et al.  
5-Methylcytosine RNA methylation in Arabidopsis thaliana
.
Mol Plant.
 
2017
:
10
(
11
):
1387
1399
. .

de Crécy-Lagard
 
V
,
Edelheit
 
S
,
Schwartz
 
S
,
Mumbach
 
MR
,
Wurtzel
 
O
,
Sorek
 
R
.
Transcriptome-wide mapping of 5-methylcytidine RNA modifications in bacteria, archaea, and yeast reveals m5C within archaeal mRNAs
.
PLoS Genet.
 
2013
:
9
(
6
):
e1003602
. .

Fu
 
L
,
Guerrero
 
CR
,
Zhong
 
N
,
Amato
 
NJ
,
Liu
 
Y
,
Liu
 
S
,
Cai
 
Q
,
Ji
 
D
,
Jin
 
S-G
,
Niedernhofer
 
LJ
, et al.  
Tet-mediated formation of 5-hydroxymethylcytosine in RNA
.
J Am Chem Soc.
 
2014
:
136
(
33
):
11582
11585
. .

Haag
 
S
,
Sloan
 
KE
,
Ranjan
 
N
,
Warda
 
AS
,
Kretschmer
 
J
,
Blessing
 
C
,
Hübner
 
B
,
Seikowski
 
J
,
Dennerlein
 
S
,
Rehling
 
P
, et al.  
NSUN3 and ABH1 modify the wobble position of mt-tRNA Met to expand codon recognition in mitochondrial translation
.
EMBO J.
 
2016
:
35
(
19
):
2104
2119
. .

Huang
 
T
,
Chen
 
W
,
Liu
 
J
,
Gu
 
N
,
Zhang
 
R
.
Genome-wide identification of mRNA 5-methylcytosine in mammals
.
Nat Struct Mol Biol.
 
2019
:
26
(
5
):
380
388
. .

Inc T.
 MATLAB version: 9.13. 0 (R2022b). Natick (MA):
The MathWorks Inc
.;
2022
. https://www.mathworks.com/help/stats/cooks-distance.html
.

Janin
 
M
,
Ortiz-Barahona
 
V
,
De Moura
 
MC
,
Martínez-Cardús
 
A
,
Llinàs-Arias
 
P
,
Soler
 
M
,
Nachmani
 
D
,
Pelletier
 
J
,
Schumann
 
U
,
Calleja-Cervantes
 
ME
, et al.  
Epigenetic loss of RNA-methyltransferase NSUN5 in glioma targets ribosomes to drive a stress adaptive translational program
.
Acta Neuropathol.
 
2019
:
138
(
6
):
1053
1074
. .

Johnson
 
Z
,
Xu
 
X
,
Lin
 
Y
,
Xie
 
H
.
Dynamics of RNA m5C modification during brain development
.
Genomics
.
2023
:
115
(
3
):
110604
. .

Johnson
 
Z
,
Xu
 
X
,
Pacholec
 
C
,
Xie
 
H
.
Systematic evaluation of parameters in RNA bisulfite sequencing data generation and analysis
.
NAR Genom Bioinform
.
2022
:
4
(
2
):
lqac045
. .

Kawarada
 
L
,
Suzuki
 
T
,
Ohira
 
T
,
Hirata
 
S
,
Miyauchi
 
K
,
Suzuki
 
T
.
ALKBH1 is an RNA dioxygenase responsible for cytoplasmic and mitochondrial tRNA modifications
.
Nucleic Acids Res.
 
2017
:
45
(
12
):
7401
7415
. .

Landry
 
CR
,
Freschi
 
L
,
Zarin
 
T
,
Moses
 
AM
.
Turnover of protein phosphorylation evolving under stabilizing selection
.
Front Genet
.
2014
:
5
:
245
. .

Levin
 
JZ
,
Yassour
 
M
,
Adiconis
 
X
,
Nusbaum
 
C
,
Thompson
 
DA
,
Friedman
 
N
,
Gnirke
 
A
,
Regev
 
A
.
Comprehensive comparative analysis of strand-specific RNA sequencing methods
.
Nat Methods.
 
2010
:
7
(
9
):
709
715
. .

Li
 
C
,
Zhang
 
J
.
Stop-codon read-through arises largely from molecular errors and is generally nonadaptive
.
PLoS Genet.
 
2019
:
15
(
5
):
e1008141
. .

Li
 
H
,
Handsaker
 
B
,
Wysoker
 
A
,
Fennell
 
T
,
Ruan
 
J
,
Homer
 
N
,
Marth
 
G
,
Abecasis
 
G
,
Durbin
 
R
,
Genome Project Data Processing S
.
The sequence alignment/map format and SAMtools
.
Bioinformatics
.
2009
:
25
(
16
):
2078
2079
. .

Li
 
Q
,
Li
 
X
,
Tang
 
H
,
Jiang
 
B
,
Dou
 
Y
,
Gorospe
 
M
,
Wang
 
W
.
NSUN2-mediated m5C methylation and METTL3/METTL14-mediated m6A methylation cooperatively enhance p21 translation
.
J Cell Biochem.
 
2017
:
118
(
9
):
2587
2598
. .

Li
 
Z
,
Sarker
 
B
,
Zhao
 
F
,
Zhou
 
T
,
Zhang
 
J
,
Xu
 
C
.
COL: a method for identifying putatively functional circular RNAs
.
J Genet Genomics
.
2024
:
51
(
11
):
1338
1341
. .

Liu
 
J
,
Huang
 
T
,
Chen
 
W
,
Ding
 
C
,
Zhao
 
T
,
Zhao
 
X
,
Cai
 
B
,
Zhang
 
Y
,
Li
 
S
,
Zhang
 
L
, et al.  
Developmental mRNA m5C landscape and regulatory innovations of massive m5C modification of maternal mRNAs in animals
.
Nat Commun.
 
2022
:
13
(
1
):
2484
. .

Liu
 
J
,
Huang
 
T
,
Zhang
 
Y
,
Zhao
 
T
,
Zhao
 
X
,
Chen
 
W
,
Zhang
 
R
.
Sequence- and structure-selective mRNA m5C methylation by NSUN6 in animals
.
Natl Sci Rev.
 
2021
:
8
(
6
):
nwaa273
. .

Liu
 
L
,
Chen
 
Y
,
Zhang
 
T
,
Cui
 
G
,
Wang
 
W
,
Zhang
 
G
,
Li
 
J
,
Zhang
 
Y
,
Wang
 
Y
,
Zou
 
Y
, et al.  
YBX1 promotes esophageal squamous cell carcinoma progression via m5C-dependent SMOX mRNA stabilization
.
Adv Sci (Weinh)
.
2024
:
11
(
20
):
e2302379
. .

Liu
 
Z
,
Zhang
 
J
.
Human C-to-U coding RNA editing is largely nonadaptive
.
Mol Biol Evol
.
2018a
:
35
(
4
):
963
969
. .

Liu
 
Z
,
Zhang
 
J
.
Most m6A RNA modifications in protein-coding regions are evolutionarily unconserved and likely nonfunctional
.
Mol Biol Evol
.
2018b
:
35
(
3
):
666
675
. .

Lu
 
Z
,
Liu
 
B
,
Kong
 
D
,
Zhou
 
X
,
Pei
 
D
,
Liu
 
D
.
NSUN6 regulates NM23-H1 expression in an m5C manner to affect epithelial-mesenchymal transition in lung cancer
.
Med Princ Pract
.
2024
:
33
(
1
):
56
65
. .

Ma
 
J
,
Song
 
B
,
Wei
 
Z
,
Huang
 
D
,
Zhang
 
Y
,
Su
 
J
,
de Magalhães
 
JP
,
Rigden
 
DJ
,
Meng
 
J
,
Chen
 
K
.
m5C-Atlas: a comprehensive database for decoding and annotating the 5-methylcytosine (m5C) epitranscriptome
.
Nucleic Acids Res.
 
2022
:
50
(
D1
):
D196
D203
. .

Meng
 
H
,
Miao
 
H
,
Zhang
 
Y
,
Chen
 
T
,
Yuan
 
L
,
Wan
 
Y
,
Jiang
 
Y
,
Zhang
 
L
,
Cheng
 
W
.
YBX1 promotes homologous recombination and resistance to platinum-induced stress in ovarian cancer by recognizing m5C modification
.
Cancer Lett
.
2024
:
597
:
217064
. .

Moses
 
AM
,
Liku
 
ME
,
Li
 
JJ
,
Durbin
 
R
.
Regulatory evolution in proteins by turnover and lineage-specific changes of cyclin-dependent kinase consensus sites
.
Proc Natl Acad Sci U S A
.
2007
:
104
(
45
):
17713
17718
. .

Motorin
 
Y
,
Lyko
 
F
,
Helm
 
M
.
5-Methylcytosine in RNA: detection, enzymatic formation and biological functions
.
Nucleic Acids Res.
 
2010
:
38
(
5
):
1415
1430
. .

Nguyen Ba
 
AN
,
Moses
 
AM
.
Evolution of characterized phosphorylation sites in budding yeast
.
Mol Biol Evol
.
2010
:
27
(
9
):
2027
2037
. .

Pollard
 
KS
,
Hubisz
 
MJ
,
Rosenbloom
 
KR
,
Siepel
 
A
.
Detection of nonneutral substitution rates on mammalian phylogenies
.
Genome Res.
 
2010
:
20
(
1
):
110
121
. .

Rieder
 
D
,
Amort
 
T
,
Kugler
 
E
,
Lusser
 
A
,
Trajanoski
 
Z
.
meRanTK: methylated RNA analysis ToolKit
.
Bioinformatics
.
2016
:
32
(
5
):
782
785
. .

RStudio Team
. RStudio v0.96.230. 2013. https://rpubs.com/DragonflyStats/Cooks-Distance.

Saudemont
 
B
,
Popa
 
A
,
Parmley
 
JL
,
Rocher
 
V
,
Blugeon
 
C
,
Necsulea
 
A
,
Meyer
 
E
,
Duret
 
L
.
The fitness cost of mis-splicing is the main determinant of alternative splicing patterns
.
Genome Biol.
 
2017
:
18
(
1
):
208
. .

Schaefer
 
M
,
Pollex
 
T
,
Hanna
 
K
,
Lyko
 
F
.
RNA cytosine methylation analysis by bisulfite sequencing
.
Nucleic Acids Res.
 
2009
:
37
(
2
):
e12
. .

Selmi
 
T
,
Hussain
 
S
,
Dietmann
 
S
,
Heiß
 
M
,
Borland
 
K
,
Flad
 
S
,
Carter
 
J-M
,
Dennison
 
R
,
Huang
 
Y-L
,
Kellner
 
S
, et al.  
Sequence- and structure-specific cytosine-5 mRNA methylation by NSUN6
.
Nucleic Acids Res.
 
2021
:
49
(
2
):
1006
1022
. .

Siepel
 
A
,
Bejerano
 
G
,
Pedersen
 
JS
,
Hinrichs
 
AS
,
Hou
 
M
,
Rosenbloom
 
K
,
Clawson
 
H
,
Spieth
 
J
,
Hillier
 
LW
,
Richards
 
S
, et al.  
Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes
.
Genome Res
.
2005
:
15
(
8
):
1034
1050
. .

Tang
 
Y
,
Gao
 
C-C
,
Gao
 
Y
,
Yang
 
Y
,
Shi
 
B
,
Yu
 
J-L
,
Lyu
 
C
,
Sun
 
B-F
,
Wang
 
H-L
,
Xu
 
Y
, et al.  
OsNSUN2-mediated 5-methylcytosine mRNA modification enhances rice adaptation to high temperature
.
Dev Cell.
 
2020
:
53
(
3
):
272
286.e7
. .

Tuorto
 
F
,
Liebers
 
R
,
Musch
 
T
,
Schaefer
 
M
,
Hofmann
 
S
,
Kellner
 
S
,
Frye
 
M
,
Helm
 
M
,
Stoecklin
 
G
,
Lyko
 
F
.
RNA cytosine methylation by Dnmt2 and NSun2 promotes tRNA stability and protein synthesis
.
Nat Struct Mol Biol.
 
2012
:
19
(
9
):
900
905
. .

Wang
 
L
,
Lin
 
S
.
Emerging functions of tRNA modifications in mRNA translation and diseases
.
J Genet Genomics
.
2023
:
50
(
4
):
223
232
. .

Wang
 
N
,
Chen
 
RX
,
Deng
 
MH
,
Wei
 
WS
,
Zhou
 
ZH
,
Ning
 
K
,
Li
 
YH
,
Li
 
XD
,
Ye
 
YL
,
Wen
 
JH
, et al.  
m(5)C-dependent cross-regulation between nuclear reader ALYREF and writer NSUN2 promotes urothelial bladder cancer malignancy through facilitating RABL6/TK1 mRNAs splicing and stabilization
.
Cell Death Dis
.
2023
:
14
(
2
):
139
. .

Wang
 
R
,
Xue
 
W
,
Kan
 
F
,
Zhang
 
H
,
Wang
 
D
,
Wang
 
L
,
Wang
 
J
.
NSUN2 affects diabetic retinopathy progression by regulating MUC1 expression through RNA m(5)C methylation
.
J Transl Med
.
2024
:
22
(
1
):
476
. .

Xu
 
C
,
Park
 
J-K
,
Zhang
 
J
.
Evidence that alternative transcriptional initiation is largely nonadaptive
.
PLoS Biol.
 
2019
:
17
(
3
):
e3000197
. .

Xu
 
C
,
Zhang
 
J
.
Alternative polyadenylation of mammalian transcripts is generally deleterious, not adaptive
.
Cell Syst.
 
2018
:
6
(
6
):
734
742.e4
. .

Xu
 
C
,
Zhang
 
J
.
Mammalian circular RNAs result largely from splicing errors
.
Cell Rep.
 
2021
:
36
(
4
):
109439
. .

Xu
 
G
,
Zhang
 
J
.
Human coding RNA editing is generally nonadaptive
.
Proc Natl Acad Sci U S A.
 
2014
:
111
(
10
):
3769
3774
. .

Xue
 
S
,
Xu
 
H
,
Sun
 
Z
,
Shen
 
H
,
Chen
 
S
,
Ouyang
 
J
,
Zhou
 
Q
,
Hu
 
X
,
Cui
 
H
.
Depletion of TRDMT1 affects 5-methylcytosine modification of mRNA and inhibits HEK293 cell proliferation and migration
.
Biochem Biophys Res Commun.
 
2019
:
520
(
1
):
60
66
. .

Yang
 
L
,
Perrera
 
V
,
Saplaoura
 
E
,
Apelt
 
F
,
Bahin
 
M
,
Kramdi
 
A
,
Olas
 
J
,
Mueller-Roeber
 
B
,
Sokolowska
 
E
,
Zhang
 
W
, et al.  
M5c methylation guides systemic transport of messenger RNA over graft junctions in plants
.
Curr Biol.
 
2019a
:
29
(
15
):
2465
2476.e5
. .

Yang
 
X
,
Yang
 
Y
,
Sun
 
B-F
,
Chen
 
Y-S
,
Xu
 
J-W
,
Lai
 
W-Y
,
Li
 
A
,
Wang
 
X
,
Bhattarai
 
DP
,
Xiao
 
W
, et al.  
5-Methylcytosine promotes mRNA export—NSUN2 as the methyltransferase and ALYREF as an m5C reader
.
Cell Res.
 
2017
:
27
(
5
):
606
625
. .

Yang
 
Y
,
Wang
 
L
,
Han
 
X
,
Yang
 
W-L
,
Zhang
 
M
,
Ma
 
H-L
,
Sun
 
B-F
,
Li
 
A
,
Xia
 
J
,
Chen
 
J
, et al.  
RNA 5-methylcytosine facilitates the maternal-to-zygotic transition by preventing maternal mRNA decay
.
Mol Cell
.
2019b
:
75
(
6
):
1188
1202.e11
. .

Zhang
 
H
,
Zhai
 
X
,
Liu
 
Y
,
Xia
 
Z
,
Xia
 
T
,
Du
 
G
,
Zhou
 
H
,
Franziska Strohmer
 
D
,
Bazhin
 
AV
,
Li
 
Z
, et al.  
NOP2-mediated m5C modification of c-Myc in an EIF3A-dependent manner to reprogram glucose metabolism and promote hepatocellular carcinoma progression
.
Research (Wash D C)
.
2023
:
6
:
0184
. .

Zhang
 
J
. Gene duplication. In:
Losos
 
J
, editor.
The Princeton guide to evolution
.
Princeton, NJ
:
Princeton University Press
;
2013
. p.
397
405
.

Zhang
 
J
,
Xu
 
C
.
Gene product diversity: adaptive or not?
 
Trends Genet.
 
2022
:
38
(
11
):
1112
1122
. .

Zhang
 
JZ
,
Yang
 
JR
.
Determinants of the rate of protein sequence evolution
.
Nat Rev Genet.
 
2015
:
16
(
7
):
409
420
. .

Zhang
 
Z
,
Chen
 
T
,
Chen
 
H-X
,
Xie
 
Y-Y
,
Chen
 
L-Q
,
Zhao
 
Y-L
,
Liu
 
B-D
,
Jin
 
L
,
Zhang
 
W
,
Liu
 
C
, et al.  
Systematic calibration of epitranscriptomic maps using a synthetic modification-free RNA library
.
Nat Methods.
 
2021
:
18
(
10
):
1213
1222
. .

Author notes

Conflict of Interest: The authors declare no conflicts of interest with respect to the research, authorship, and/or publication of this article.

This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected] for reprints and translation rights for reprints. All other permissions can be obtained through our RightsLink service via the Permissions link on the article page on our site—for further information please contact [email protected].
Associate Editor: Xuhua Xia
Xuhua Xia
Associate Editor
Search for other works by this author on:

Supplementary data