Computational determination of gene age and characterization of evolutionary dynamics in human

Table 1

Summary of computational methods for gene age identification

Method	Strategy	Criteria	Phylogenetic framework	Hierarchy	Focal species	Duplication inference
Domazet-Loso et al. [16, 18]	SSO^a	BLASTP (E-value < 10⁻³)	From Archaea/Bacteria to Homo sapiens	19	Homo sapiens	No
Wolf et al. [17]	SSO^a	BLASTP (E-value < 10⁻¹ or 10⁻⁶)	From Archaea/Bacteria to Homo sapiens	7, 7 and 7	Homo sapiens; Drosophila melanogaster; Aspergillus fumigatus	No
Cai & Petrov [19]	SSO^a	Phylogenetic profiles from PhyloPat; BLASTP (E-value < 10⁻³)	From Saccharomyces cerevisiae to Homo sapiens	11 and 9	Homo sapiens	No
Capra JA et al. [75]	SSO^a	Searching orthology from the Princeton Protein Orthology Database and PANTHER Database	From Archaea/Bacteria to Homo sapiens	Determined by the uploaded phylogenetic relationships	Multi-species	No
Zhou et al. [20]	SSP^b	BLASTN (E-value < 10⁻⁵; length > 200 bp or coverage > 0.7; identity > 0.8)	Drosophila	4	Drosophila melanogaster	Yes
Zhang et al. [21]	SSP^b	BLASTP (E-value < 10⁻⁶); BLASTN (E-value < 10⁻¹⁶; coverage > 0.7; identity > 0.5)	From Danio rerio to Homo sapiens	14	Homo sapiens	Yes
Yin et al. [22]	SSP^b	BLASTP (E-value < 10⁻³); MCL (I = 1.5)	From Archaea/Bacteria to Homo sapiens	26	Homo sapiens	Yes

a) Similarity searching only; b) similarity searching with a priori information.

In the similarity searching only (SSO) group, sequence similarity is detected solely by the basic local alignment search tool (BLAST) or similar algorithms, so gene age can be inferred on an unconstrained evolutionary scale. The representative method in this group is ‘phylostratigraphy’, which reconstructs macroevolutionary trends from the principle of founder gene formation and the punctuated emergence of protein families, using protein BLAST (BLASTP) [16, 18]. With this approach, the evolutionary origins of human genes were assigned to 19 phylogenetic classes ranging from archaea/bacteria to Homo sapiens [16, 18]. Similar methods were later proposed that categorize human genes into seven age classes [17] and 11 age classes [19]. However, the reliability of the SSO strategy for homolog clustering is highly debated, because it cannot differentiate paralogs effectively and therefore cannot accurately capture duplication events.
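As a minimal sketch of the SSO idea, the following Python snippet assigns a focal gene to the oldest phylostratum in which a significant hit is found. The seven-level stratum list and the function name `assign_age` are illustrative assumptions; they are not the age classes used by any of the methods in Table 1.

```python
from typing import Iterable, List

# Hypothetical, much-shortened list of phylostrata, ordered from oldest to
# youngest. The methods in Table 1 use between 7 and 26 such levels.
PHYLOSTRATA: List[str] = [
    "cellular organisms",
    "Eukaryota",
    "Metazoa",
    "Chordata",
    "Mammalia",
    "Primates",
    "Homo sapiens",
]


def assign_age(hit_strata: Iterable[str], strata: List[str] = PHYLOSTRATA) -> str:
    """Return the oldest stratum in which a homolog of the focal gene was found.

    hit_strata holds the strata with at least one significant similarity hit.
    The focal species (last entry of strata) always counts as a hit, so a gene
    without any other hit is assigned to the youngest stratum.
    """
    rank = {name: index for index, name in enumerate(strata)}
    hits = set(hit_strata) | {strata[-1]}
    # The smallest rank corresponds to the oldest stratum.
    return min(hits, key=lambda name: rank[name])
```

Because the age depends only on the most distant hit, a missed distant homolog makes a gene look younger, and paralogs within a family all receive the same age. This is the limitation of the SSO strategy discussed above.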

In contrast, the similarity searching with a priori information (SSP) group requires prior information (such as phylogeny, gene order on the chromosomes and flanking regions) to infer the divergence of homologs and determine gene ages more accurately. For example, based on comparisons of flanking genes and gene order on the chromosomes, gene ages were determined in Drosophila over a span of ~40 million years [20] and in human by grouping genes into 14 age classes [21]. However, because chromosomal information depends heavily on the conservation of chromosomal structure, the SSP group cannot define gene age in an unconstrained evolutionary framework.

In our recent study, we combined homolog clustering with phylogeny inference to assign gene age on a finer evolutionary time scale and categorized human genes into 26 evolutionary age classes ranging from archaea/bacteria to Homo sapiens [22]. In this procedure, BLASTP and the Markov Cluster algorithm [23], together termed ‘BLAST+MCL’, were used for homolog clustering, and both horizontal and vertical evolution [24–27] were considered in phylogeny inference. As a consequence, parent and child genes can be discriminated and duplication events can be traced back to the corresponding evolutionary time [22].
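To illustrate the clustering step, the sketch below implements the core loop of Markov clustering (expansion followed by inflation) in plain Python on a small similarity matrix. It is a toy version under stated assumptions, not the pipeline of [22]: the function names are ours, the input is a hypothetical symmetric matrix of similarity weights (for example derived from BLASTP scores), and a real analysis would use a dedicated MCL implementation. The inflation value 1.5 follows the setting listed in Table 1.

```python
from typing import List, Set

Matrix = List[List[float]]


def normalize_columns(m: Matrix) -> Matrix:
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(similarity: Matrix, inflation: float = 1.5,
        iterations: int = 50) -> List[Set[int]]:
    """Cluster sequences (matrix indices) by simulated flow on the graph."""
    n = len(similarity)
    # Self-loops are added, as is common practice, to stabilize the flow.
    m = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = matmul(m, m)  # expansion: flow spreads along longer paths
        # inflation: strong flows are boosted, weak flows are damped
        m = normalize_columns([[v ** inflation for v in row] for row in m])
    clusters: List[frozenset] = []
    for i in range(n):
        # A row that still carries flow is an attractor; the columns it
        # attracts form one cluster.
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return [set(c) for c in clusters]
```

For a matrix in which sequences 0, 1 and 2 are mutually similar and sequences 3 and 4 are similar only to each other, the function returns the two groups {0, 1, 2} and {3, 4}. Each resulting cluster is then a candidate homolog family for phylogeny inference.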

Effectiveness of homolog clustering

Homolog clustering is critical for gene age identification. Because it is disputed whether BLAST-induced errors in age identification create biological artifacts [28–32], we evaluate the effectiveness of homolog clustering by comparing two popular methods, namely BLAST and BLAST+MCL (as described above). Following a simulation procedure similar to that in [30], we simulate 6695 protein sequences at 11 different taxonomic levels (Supplementary Figure S1) using TreePuzzle [33] and ROSE [34] (with evolutionary rates drawn from Gamma distributions [35]). We then apply BLAST and BLAST+MCL to cluster homologs among these simulated sequences. If a homolog cannot be detected in a given distant taxon, it is counted as an error. We estimate the error rates of BLAST and BLAST+MCL at three E-value thresholds (Table 2). The error rates of homolog detection using BLAST and BLAST+MCL are 12.22% and 0.074% for E-value < 10⁻¹, 14.78% and 0.089% for E-value < 10⁻³ and 17.49% and 0.104% for E-value < 10⁻⁶, respectively, showing that BLAST+MCL achieves markedly higher accuracy in homolog clustering on these simulated data.
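The error-rate definition used here can be sketched in a few lines of Python. The snippet assumes, hypothetically, that for each simulated protein we have the best E-value of any hit in the most distant taxon (or `None` when there is no hit at all); the function name `error_rates` and this input layout are ours and are not taken from [30].

```python
from typing import Dict, Optional, Sequence

# The three E-value thresholds compared in Table 2.
THRESHOLDS = (1e-1, 1e-3, 1e-6)


def error_rates(best_evalues: Sequence[Optional[float]],
                thresholds: Sequence[float] = THRESHOLDS) -> Dict[float, float]:
    """Fraction of simulated proteins whose distant homolog goes undetected.

    A protein counts as an error at threshold t when it has no hit at all
    (None) or when its best E-value is not below t.
    """
    n = len(best_evalues)
    if n == 0:
        raise ValueError("no simulated proteins given")
    rates: Dict[float, float] = {}
    for t in thresholds:
        missed = sum(1 for e in best_evalues if e is None or e >= t)
        rates[t] = missed / n
    return rates
```

For the four hypothetical best E-values `[1e-20, 1e-4, 5e-2, None]`, the error rates are 0.25, 0.5 and 0.75 at the three thresholds. A stricter threshold can only discard hits, which is consistent with the error rates in Table 2 rising from E < 10⁻¹ to E < 10⁻⁶.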

Table 2

Comparison of error rates of detecting homologs in most distant taxa between two methods on detection of simulated proteins

Method	Error rate (E < 10⁻¹)	Error rate (E < 10⁻³)	Error rate (E < 10⁻⁶)
BLAST	12.22%	14.78%	17.49%
BLAST+MCL	0.074%	0.089%	0.104%

As reported in [31], an accurate method for gene age determination should be little affected by gene properties and should accordingly show little or no correlation between gene age and gene property. It has therefore been argued that the trends in evolutionary rate and protein length with increasing gene age may be caused by BLAST errors [30, 31]. Here we plot gene age against protein length and evolutionary rate for BLAST and BLAST+MCL (Figure 1). Gene ages determined by BLAST are biased and show an artifactual correlation with protein length (rho = 0.107) and evolutionary rate (rho = −0.590), whereas those determined by BLAST+MCL show no such correlation (rho = 0.048 for protein length and rho = −0.038 for evolutionary rate; Figure 1A and 1B). It has also been debated whether sequence conservation influences gene age determination [30]. We further find that BLAST is sensitive to sequence conservation, as shown by its increasing error rate, whereas BLAST+MCL yields very low error rates across a range of sequence conservation (Figure 1C). Collectively, these results demonstrate the robustness of BLAST+MCL, in that gene properties have little or no influence on its gene age estimates.
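The correlation check described above can be reproduced with a rank correlation between the assigned age class and a gene property. The sketch below assumes that rho denotes Spearman's rank correlation coefficient, which the text does not state explicitly; the helper names are ours.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in which tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))
```

Calling `spearman_rho(age_classes, protein_lengths)` on the ages produced by each method then gives one value per method; under the criterion of [31], the method whose rho is closer to zero is the one less affected by that gene property.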

Figure 1

Evaluation of gene age estimation by BLAST and BLAST+MCL, considering (A) protein length, (B) evolutionary rate and (C) length of conserved sequence. Sequences were simulated by the method detailed in [30].

Capturing important evolutionary events

Under a large-scale, refined evolutionary framework, genes of different evolutionary ages can help capture important evolutionary events. As reported in our recent study [22], there was a burst of gene birth at the origin of multicellularity, and the birth of specific genes can be placed on a more precise evolutionary scale. For instance, MYEOV was previously reported to have arisen de novo in the human-specific lineage [36], whereas our age estimation [27], based on a highly refined evolutionary time scale, shows that it instead arose at the origin of Hominoidea [22]. Previous studies based on a coarse evolutionary scale for gene age identification [16, 18] could not accurately trace the evolutionary time points of gene loss events, and accordingly yielded the imprecise finding that 96% of duplicate genes share the same age. In contrast, our age estimation [27], based on a long phylogenetic framework spanning ∼4000 million years from archaea/bacteria to Homo sapiens (Figure 2), reveals that genes were heavily lost at the origin of Scandentia (age class 8) and primates (age class 7), that 11% of duplication events can be traced back to the origin of metazoans (age class 21) and that 16% of duplication events are assigned to the origin of chordates (age class 17). These findings agree well with previous hypotheses that the origin of multicellular species greatly increased hierarchical complexity [37, 38] and that duplication events probably promoted the shaping of vertebrate genomes [39].

Figure 2

Gene loss and duplication events across a range of age classes.

Dominant evolutionary signatures related to gene age

Evidence has accumulated that multiple signatures at different omics levels are related to gene age, including sequence structure, evolutionary rate, expression level and functional essentiality, which are summarized in Table 3. In general, genes of different ages are heterogeneous in multiple evolutionary signatures; compared with old genes, young genes tend to encode shorter proteins, possess fewer introns [17], harbor more premature termination codon mutations [36, 40], be less connected in the protein–protein interaction network (PPIN) [41, 42], evolve more rapidly [17, 19, 28] and experience more variable selection pressure [43]. Although young genes are believed to play less essential functional roles [44], it has also been reported that young genes can quickly become essential, as observed in Drosophila [45].

Table 3

Signatures associated with gene age by comparing young genes against old genes

Signature	Young genes	Old genes	References
Sequence length	Shorter	Longer	[17]
Intron number	Fewer	More	[17]
Premature termination codon mutations	More	Fewer	[36, 40]
Evolutionary rate	More rapid	Slower	[17, 19, 28]
Variable selection pressure	More	Less	[43]
Connections in the interaction network	Fewer	More	[41, 42]
Essentiality	Less essential	More essential	[44, 45]
Expression	Likely to be lowly expressed	Likely to be highly expressed	[19, 46, 47]
Tissue-specificity	More likely	Less likely	[48]
cis-protein QTLs	More	Fewer	[49]

Abbreviation: QTLs, quantitative trait loci.

Young human genes are also likely to be lowly expressed [19, 46, 47] and to show distinct temporal and spatial expression patterns, so that young genes are more likely to be tissue-specific, whereas old genes tend to be housekeeping genes [48]. In addition, young primate-specific genes possess more cis-protein quantitative trait loci (QTLs) and cis-expression QTLs, which have larger effect sizes and are located closer to the transcription start site [49]. Finally, young and old duplicates differ strikingly in DNA methylation patterns [50].

Although multiple signatures are associated with gene age, they are highly interdependent [51, 52], making it difficult to decipher which signatures are dominantly associated with gene age. Our recent study identified guanine-cytosine (GC) content and connectivity in the PPIN as dominant signatures of gene age [22]. Consistently, it has been reported that de novo genes in primates that originated from long noncoding RNAs are GC-rich [36] and that de novo genes in yeast emerge from GC-rich intergenic regions [53], presumably enabling stable open reading frames (ORFs) with a lower chance of nonsense mutations, and that connectivity in the PPIN drives new genes to evolve essential functions and participate in development [42, 45]. In particular, considering the dominance of PPIN connectivity, further investigations could address the interactions between coding genes and de novo genes derived from noncoding RNAs, as well as the potential origination of genes from noncoding genes in testis, given the hypertranscription status of chromatin and rapid positive selection on genes in testis [54, 55]. Taken together, although multiple omics-level signatures are left in the genome during evolution, GC content and connectivity in the PPIN are the dominant signatures closely associated with gene age.

The age landscape of human genes across chromosomes

Regarding the age distribution of human genes across chromosomes, it has been shown that many new retrogenes (genes arising from RNA-based duplication) originate from the X chromosome (known as ‘out of X’) [21, 56, 57] and are fixed on the autosomes, a pattern shaped by inactivation of the X chromosome during and after meiosis [56]. This mechanism was first identified in Drosophila [58] and then in mammals [57, 59]. Further efforts have been made to investigate newly originated genes on the X chromosome [21, 56, 57]. Moreover, young human genes have been found to cluster on chromosome 14 [60].

Based on a more refined evolutionary time scale in which human genes are classified into 26 age classes, here we investigate the age distribution of human genes and identify young gene clusters (YGCs) and old gene clusters (OGCs) on all chromosomes. We detect a total of 29 YGCs and 23 OGCs across all human chromosomes (Figure 3); YGCs are relatively scattered across the sex chromosomes and a variety of autosomes (including chromosome 14, consistent with a previous study [60]; see Table S1 for details), whereas OGCs tend to be enriched on small autosomes (especially chromosomes 13, 18 and 21; see Table S2 for details). Gene ontology (GO) enrichment analyses indicate that YGCs are functionally related to the human immune system, defense response and olfactory signaling pathways (Table S3). By contrast, OGCs are related to functions such as chemical synaptic transmission and cellular response to compounds (Table S4). Using the Database of Essential Genes (DEG) [61], we find that 50.70% of genes in OGCs are essential, whereas only 15.05% of genes in YGCs are essential. Together, the uneven distribution of YGCs and OGCs and their functional significance indicate heterogeneous rates of gene emergence and divergent evolutionary pressures acting on different human chromosomes, which may be explained by previous findings that chromosomal instability contributes to gene birth and gene order change [62, 63].

Figure 3

The age landscape of human genes across chromosomes. On any given chromosome, a YGC is defined as a region containing a set of genes that accounts for >3% of total genes and in which >80% of genes were born after the origin of Mammalia. An OGC is defined as a region containing a set of genes that accounts for >5% of total genes and in which >85% of genes were born before the origin of Metazoa. Furthermore, any individual YGC or OGC must contain more than five genes (n > 5). Gene lists of YGCs and OGCs are summarized in Table S1 and Table S2, respectively. Gene Ontology enrichment analyses were conducted with Metascape (http://metascape.org).
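The cluster definitions in the Figure 3 caption can be written as a simple test on a candidate region. The Python sketch below rests on several assumptions that the text does not settle: ‘total genes’ is read as the gene count of the same chromosome; smaller age-class numbers denote younger genes (as in age class 7 for primates and 21 for metazoans); ‘after’ and ‘before’ are read as strictly younger and strictly older than the boundary class; and the age class of Mammalia, which is not given here, must be supplied by the caller. How candidate regions are delimited is also outside this sketch.

```python
from typing import Sequence


def classify_region(region_ages: Sequence[int],
                    chromosome_gene_count: int,
                    mammalia_class: int,
                    metazoa_class: int = 21) -> str:
    """Label a candidate region as 'YGC', 'OGC' or 'none'.

    region_ages: age class of every gene in the region.
    mammalia_class: age class of the origin of Mammalia (caller-supplied
    assumption). metazoa_class defaults to 21, the class given in the text
    for the origin of metazoans.
    """
    n = len(region_ages)
    if n <= 5 or chromosome_gene_count <= 0:
        return "none"  # a cluster must contain more than five genes
    share = n / chromosome_gene_count
    young = sum(1 for age in region_ages if age < mammalia_class) / n
    old = sum(1 for age in region_ages if age > metazoa_class) / n
    if share > 0.03 and young > 0.80:
        return "YGC"
    if share > 0.05 and old > 0.85:
        return "OGC"
    return "none"
```

As a usage example with a purely hypothetical Mammalia class of 10, a region of seven genes with age classes `[3, 4, 2, 5, 3, 4, 25]` on a chromosome of 200 genes covers 3.5% of the genes and is 86% young, so it is labeled a YGC.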

Figure 4

Relative abundance of human cancer and disease genes across a long phylogenetic time scale spanning ∼4000 million years. (A) CGs versus DGs. (B) MDGs versus PDGs. Let p_i be the percentage of cancer (disease) genes at age class i and p be the percentage of cancer (disease) genes across all age classes. Gene abundance is defined as the ratio p_i/p and indicates the relative abundance of cancer (disease) genes in a given age class: gene abundance >1 means over-representation and gene abundance <1 means under-representation.
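The gene abundance measure in the Figure 4 caption can be computed as follows. This is a minimal sketch under one reading of the definition, namely that p_i is the fraction of genes in age class i that are cancer (disease) genes and p is the corresponding fraction over all genes; the function name and the dictionary inputs are our own choices.

```python
from typing import Dict


def relative_abundance(class_totals: Dict[int, int],
                       class_hits: Dict[int, int]) -> Dict[int, float]:
    """Return p_i / p for every non-empty age class.

    class_totals: number of all genes per age class.
    class_hits: number of cancer (or disease) genes per age class.
    """
    total = sum(class_totals.values())
    hits = sum(class_hits.get(c, 0) for c in class_totals)
    if total == 0 or hits == 0:
        raise ValueError("need at least one gene and one cancer/disease gene")
    p = hits / total  # overall fraction of cancer (disease) genes
    return {c: (class_hits.get(c, 0) / n) / p
            for c, n in class_totals.items() if n > 0}
```

For example, with 100 genes in age class 21 (20 of them cancer genes) and 300 genes in age class 7 (20 of them cancer genes), p is 0.1, so class 21 has an abundance of 2.0 (over-represented) and class 7 an abundance of about 0.67 (under-represented).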

Evolutionary origins of DGs and CGs

Under this more refined evolutionary time scale, human genes can be traced back to different evolutionary origins [11], making it possible to elucidate the evolutionary ages of CGs and DGs, which may provide evolutionary insights into human disease and cancer [64–67]. CGs and DGs have been found to arise early in evolution [18, 38, 68]. Based on our gene age classification, we estimate the ages of CGs (obtained from COSMIC at http://cancer.sanger.ac.uk/cosmic) and DGs (obtained from DG-CST at http://dgcst.ceinge.unina.it/) and compare their evolutionary origins (Figure 4A). Detailed gene lists of CGs and DGs are summarized in Table S5.

It has been reported that the majority of DGs arose before the speciation of bilateria [18, 68], about 600 million years ago (MYA). However, our results show that DGs were already abundant 1500–4000 MYA. Additionally, we find that CGs tend to be younger than DGs and are enriched at 500–1500 MYA, especially at the origin of metazoans, which is consistent with a previous report that CGs are over-represented at the evolutionary transition to multicellularity [38]. Although most CGs can be traced to ancient evolutionary times, it has recently been found that newly born genes are likely to be involved in tumorigenesis [69–71]. As DGs can be further classified into monogenic disease genes (MDGs) and polygenic disease genes (PDGs), we further find that PDGs, which are especially enriched at the origin of chordates, tend to be younger than MDGs (Figure 4B). Considering the association of gene age with the evolutionary signatures discussed above, this result agrees with previous findings that PDGs, compared with MDGs, are under weaker selection pressure, have more disordered residues and lower expression levels [64], are partially recessive [72] and evolve faster [73]. Taken together, evolutionary tracing of DGs and CGs is helpful for in-depth investigation of the complex mechanisms underlying human Mendelian diseases and cancer.

Conclusions and Perspectives

Facilitated by revolutionary high-throughput sequencing technologies, progress has been made in investigating the evolutionary origins of human genes. Here we summarize existing computational methods for gene age identification, evaluate the accuracy of age determination achieved by combining homolog clustering with phylogenetic inference, detect important evolutionary dynamics during human evolution, investigate the age landscape of human genes across all chromosomes and characterize evolutionary features of human DGs and CGs. Given that deciphering the evolutionary dynamics of genes of different ages helps to uncover the complex mechanisms governing biological diversity, continued efforts are needed to determine gene age precisely on a more refined evolutionary time scale. Challenges remain, especially considering the evidence that de novo genes originate from noncoding regions [74] and that CGs and DGs are heterogeneous in their evolutionary origins; these challenges, in turn, could provide opportunities for better understanding the complex mechanisms coupled with novel phenotypes and biological diversity during human evolution.

Key Points

Gene age can be accurately determined by combining homolog clustering with phylogeny inference.
Accurate identification of gene age aids in capturing important evolutionary events.
GC content and protein–protein interaction connectivity are dominant signatures associated closely with gene age.
A total of 29 YGCs and 23 OGCs are identified across human chromosomes.
CGs and DGs exhibit divergent evolutionary origins.

Acknowledgements

We sincerely thank three anonymous reviewers for their valuable comments on this work.

Funding

This study was supported by grants from The Strategic Priority Research Program of the Chinese Academy of Sciences (XDB13040500; XDA08020102), National Key Research Program of China (2017YFC0907502), National Programs for High Technology Research and Development (863 Program; 2015AA020108 and 2012AA020409), National Natural Science Foundation of China (31200978), the 13th Five-year Informatization Plan of Chinese Academy of Sciences (XXH13505–05), International Partnership Program of the Chinese Academy of Sciences (153F11KYSB20160008) and the ‘100-Talent Program’ of Chinese Academy of Sciences.

Hongyan Yin is a postdoctoral associate at Hainan Key Laboratory for Sustainable Utilization of Tropical Bioresources, Institute of Tropical Agriculture and Forestry, Hainan University, China. Her research interests lie in computational evolutionary biology and mechanisms of plant disease resistance.

Mengwei Li is a PhD student at CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences. His research focuses on cancer bioinformatics and DNA methylation.

Lin Xia is a PhD student at CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences. Her research focuses on big data integration and RNA editing.

Chaozu He is a professor at Hainan Key Laboratory for Sustainable Utilization of Tropical Bioresources, Institute of Tropical Agriculture and Forestry, Hainan University, China. His research interests include molecular genetics in plant development and disease resistance.

Zhang Zhang is a professor at BIG Data Center & CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences. His research interests are big data integration and computational evolutionary & health genomics.

References

1.

Ohno

S

.

Evolution by Gene Duplication.

London, New York

:

Allen & Unwin; Springer

,

1970

.

2.

Long

M

,

Betran

E

,

Thornton

K

, et al.

The origin of new genes: glimpses from the young and old

.

Nat Rev Genet

2003

;

4

(

11

):

865

–

75

.

3.

Conant

GC

,

Wolfe

KH

.

Turning a hobby into a job: How duplicated genes find new functions

.

Nat Rev Genet

2008

;

9

(

12

):

938

–

50

.

4.

Crisp

A

,

Boschetti

C

,

Perry

M

, et al.

Expression of multiple horizontally acquired genes is a hallmark of both vertebrate and invertebrate genomes

.

Genome Biol

2015

;

16

(

1

):

50

.

5.

Boucher

Y

,

Douady

CJ

,

Papke

RT

, et al.

Lateral gene transfer and the origins of prokaryotic groups

.

Annu Rev Genet

2003

;

37

:

283

–

328

.

6.

Keeling

PJ

,

Palmer

JD

.

Horizontal gene transfer in eukaryotic evolution

.

Nat Rev Genet

2008

;

9

(

8

):

605

–

18

.

7.

Gilbert

W

.

Why genes in pieces

.

Nature

1978

;

271

(

5645

):

501

.

8.

Toll-Riera

M

,

Bosch

N

,

Bellora

N

, et al.

Origin of primate orphan genes: a comparative genomics approach

.

Mol Biol Evol

2009

;

26

(

3

):

603

–

12

.

9.

Knowles

DG

,

McLysaght

A

.

Recent de novo origin of human protein-coding genes

.

Genome Res

2009

;

19

(

10

):

1752

–

9

.

10.

Kaessmann

H

,

Vinckenbosch

N

,

Long

MY

.

RNA-based gene duplication: mechanistic and evolutionary insights

.

Nat Rev Genet

2009

;

10

(

1

):

19

–

31

.

11.

Chen

SD

,

Krinsky

BH

,

Long

MY

.

New genes as drivers of phenotypic evolution

.

Nat Rev Genet

2013

;

14

(

9

):

645

–

60

.

12.

Long

MY

,

VanKuren

NW

,

Chen

SD

, et al.

New gene evolution: little did we know

.

Annu Rev Genet

2013

;

47

:

307

–

33

.

13.

Kaessmann

H

.

Origins, evolution, and phenotypic impact of new genes

.

Genome Res

2010

;

20

(

10

):

1313

–

26

.

14.

Gao

L

,

Wu

K

,

Liu

Z

, et al.

Chromatin accessibility landscape in human early embryos and its association with evolution

.

Cell

2018

;

173

(

1

):

248

–

59.e15

.

15.

Capra

JA

,

Stolzer

M

,

Durand

D

, et al.

How old is my gene?

Trends Genet

2013

;

29

(

11

):

659

–

68

.

16.

Domazet-Loso

T

,

Brajkovic

J

,

Tautz

D

.

A phylostratigraphy approach to uncover the genomic history of major adaptations in metazoan lineages

.

Trends Genet

2007

;

23

(

11

):

533

–

9

.

17.

Wolf

YI

,

Novichkov

PS

,

Karev

GP

, et al.

The universal distribution of evolutionary rates of genes and distinct characteristics of eukaryotic genes of different apparent ages

.

Proc Natl Acad Sci U S A

2009

;

106

(

18

):

7273

–

80

.

18.

Domazet-Loso

T

,

Tautz

D

.

An ancient evolutionary origin of genes associated with human genetic diseases

.

Mol Biol Evol

2008

;

25

(

12

):

2699

–

707

.

19.

Cai

JJ

,

Petrov

DA

.

Relaxed purifying selection and possibly high rate of adaptation in primate lineage-specific genes

.

Genome Biol Evol

2010

;

2

:

393

–

409

.

20.

Zhou

Q

,

Zhang

GJ

,

Zhang

Y

, et al.

On the origin of new genes in Drosophila

.

Genome Res

2008

;

18

(

9

):

1446

–

55

.

21.

Zhang

YE

,

Vibranovski

MD

,

Landback

P

, et al.

Chromosomal redistribution of male-biased genes in mammalian evolution with two bursts of gene gain on the X chromosome

.

PLoS Biol

2010

;

8

(

10

):

pii:e1000494

.

22.

Yin

H

,

Wang

G

,

Ma

L

, et al.

What signatures dominantly associate with gene age?

Genome Biol Evol

2016

;

8

(

10

):

3083

–

9

.

23.

Li

L

,

Stoeckert

CJ

,

Roos

DS

.

OrthoMCL: identification of ortholog groups for eukaryotic genomes

.

Genome Res

2003

;

13

(

9

):

2178

–

89

.

24.

Doolittle

WF

.

Phylogenetic classification and the universal tree

.

Science

1999

;

284

(

5423

):

2124

–

9

.

25.

Bapteste

E

,

O'Malley

MA

,

Beiko

RG

, et al.

Prokaryotic evolution and the tree of life are two different things

.

Biol Direct

2009

;

4

:

34

.

26.

Martin

WF

.

Early evolution without a tree of life

.

Biol Direct

2011

;

6

:

36

.

27.

Guindon

S

,

Dufayard

JF

,

Lefort

V

, et al.

New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0

.

Syst Biol

2010

;

59

(

3

):

307

–

21

.

28.

Alba

MM

,

Castresana

J

.

Inverse relationship between evolutionary rate and age of mammalian genes

.

Mol Biol Evol

2005

;

22

(

3

):

598

–

606

.

29.

Elhaik

E

,

Sabath

N

,

Graur

D

.

The “Inverse relationship between evolutionary rate and age of mammalian genes” is an artifact of increased genetic distance with rate of evolution and time of divergence

.

Mol Biol Evol

2006

;

23

(

1

):

1

–

3

.

30. Moyers BA, Zhang JZ. Phylostratigraphic bias creates spurious patterns of genome evolution. Mol Biol Evol 2015;32(1):258–67.

31. Moyers BA, Zhang J. Further simulations and analyses demonstrate open problems of phylostratigraphy. Genome Biol Evol 2017;9(6):1519–27.

32. Domazet-Loso T, Carvunis AR, Alba MM, et al. No evidence for phylostratigraphic bias impacting inferences on patterns of gene emergence and evolution. Mol Biol Evol 2017;34(4):843–56.

33. Schmidt HA, Strimmer K, Vingron M, et al. TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 2002;18(3):502–4.

34. Stoye J, Evers D, Meyer F. Rose: generating sequence families. Bioinformatics 1998;14(2):157–63.

35. Zhang JZ, Gu X. Correlation between the substitution rate and rate variation among sites in protein evolution. Genetics 1998;149(3):1615–25.

36. Chen JY, Shen QS, Zhou WZ, et al. Emergence, retention and selection: a trilogy of origination for functional de novo proteins from ancestral LncRNAs in primates. PLoS Genet 2015;11(7):e1005391.

37. Rainey PB. Unity from conflict. Nature 2007;446(7136):616–6.

38. Domazet-Loso T, Tautz D. Phylostratigraphic tracking of cancer genes suggests a link to the emergence of multicellularity in metazoa. BMC Biol 2010;8:66.

39. Blomme T, Vandepoele K, De Bodt S, et al. The gain and loss of genes during 600 million years of vertebrate evolution. Genome Biol 2006;7(5):R43.

40. Yang HW, He BZ, Ma HJ, et al. Expression profile and gene age jointly shaped the genome-wide distribution of premature termination codons in a Drosophila melanogaster population. Mol Biol Evol 2015;32(1):216–28.

41. Kunin V, Pereira-Leal JB, Ouzounis CA. Functional evolution of the yeast protein interaction network. Mol Biol Evol 2004;21(7):1171–6.

42. Zhang WY, Landback P, Gschwend AR, et al. New genes drive the evolution of gene interaction networks in the human and mouse genomes. Genome Biol 2015;16:202.

43. Vishnoi A, Kryazhimskiy S, Bazykin GA, et al. Young proteins experience more variable selection pressures than old proteins. Genome Res 2010;20(11):1574–81.

44. Chen WH, Trachana K, Lercher MJ, et al. Younger genes are less likely to be essential than older genes, and duplicates are less likely to be essential than singletons of the same age. Mol Biol Evol 2012;29(7):1703–6.

45. Chen S, Zhang YE, Long M. New genes in Drosophila quickly become essential. Science 2010;330(6011):1682–5.

46. Lemos B, Bettencourt BR, Meiklejohn CD, et al. Evolution of proteins and gene expression levels are coupled in Drosophila and are independently associated with mRNA abundance, protein length, and number of protein-protein interactions. Mol Biol Evol 2005;22(5):1345–54.

47. Pal C, Papp B, Hurst LD. Highly expressed genes in yeast evolve slowly. Genetics 2001;158(2):927–31.

48. Yin H, Ma L, Wang G, et al. Old genes experience stronger translational selection than young genes. Gene 2016;590:29–34.

49. Popadin KY, Gutierrez-Arcelus M, Lappalainen T, et al. Gene age predicts the strength of purifying selection acting on gene expression variation in humans. Am J Hum Genet 2014;95(6):660–74.

50. Keller TE, Yi SV. DNA methylation and evolution of duplicate genes. Proc Natl Acad Sci U S A 2014;111(16):5932–7.

51. Kim SH, Yi SV. Understanding relationship between sequence and functional evolution in yeast proteins. Genetica 2007;131(2):151–6.

52. Park J, Xu K, Park T, et al. What are the determinants of gene expression levels and breadths in the human genome? Hum Mol Genet 2012;21(1):46–56.

53. Vakirlis NN, Hebert AS, Opulente DA, et al. A molecular portrait of de novo genes in yeasts. Mol Biol Evol 2017.

54. Tautz D, Domazet-Loso T. The evolutionary origin of orphan genes. Nat Rev Genet 2011;12(10):692–702.

55. Kleene KC. Sexual selection, genetic conflict, selfish genes, and the atypical patterns of gene expression in spermatogenic cells. Dev Biol 2005;277(1):16–26.

56. Zhang YE, Vibranovski MD, Krinsky BH, et al. Age-dependent chromosomal distribution of male-biased genes in Drosophila. Genome Res 2010;20(11):1526–33.

57. Potrzebowski L, Vinckenbosch N, Marques AC, et al. Chromosomal gene movements reflect the recent origin and biology of therian sex chromosomes. PLoS Biol 2008;6(4):709–16.

58. Betran E, Thornton K, Long M. Retroposed new genes out of the X in Drosophila. Genome Res 2002;12(12):1854–9.

59. Bradley J, Baltus A, Skaletsky H, et al. An X-to-autosome retrogene is required for spermatogenesis in mice. Nat Genet 2004;36(8):872–6.

60. Neme R, Tautz D. Phylogenetic patterns of emergence of new genes support a model of frequent de novo evolution. BMC Genomics 2013;14:117.

61. Luo H, Lin Y, Gao F, et al. DEG 10, an update of the database of essential genes that includes both protein-coding genes and noncoding genomic elements. Nucleic Acids Res 2014;42(Database issue):D574–80.

62. Charlesworth D. Evolution of recombination rates between sex chromosomes. Philos T R Soc B 2017;372(1736).

63. Eichler EE, Sankoff D. Structural dynamics of eukaryotic chromosome evolution. Science 2003;301(5634):793–7.

64. Podder S, Ghosh TC. Exploring the differences in evolutionary rates between monogenic and polygenic disease genes in human. Mol Biol Evol 2010;27(4):934–41.

65. Zhang HY, Chen LL, Li XJ, et al. Evolutionary inspirations for drug discovery. Trends Pharmacol Sci 2010;31(10):443–8.

66. Wang ZY, Fu LY, Zhang HY. Can medical genetics and evolutionary biology inspire drug target identification? Trends Mol Med 2012;18(2):69–71.

67. Hood L, Tian Q. Systems approaches to biology and disease enable translational systems medicine. Genomics Proteomics Bioinformatics 2012;10(4):181–5.

68. Cheng FX, Jia PL, Wang Q, et al. Studying Tumorigenesis through Network Evolution and Somatic Mutational Perturbations in the Cancer Interactome. Mol Biol Evol 2014;31(8):2156–69.

69. Zhang YE, Long MY. New genes contribute to genetic and phenotypic novelties in human evolution. Curr Opin Genet Dev 2014;29:90–6.

70. Zhang Q, Su B. Evolutionary origin and human-specific expansion of a cancer/testis antigen gene family. Mol Biol Evol 2014;31(9):2365–75.

71. Shang B, Gao A, Pan Y, et al. CT45A1 acts as a new proto-oncogene to trigger tumorigenesis and cancer metastasis. Cell Death Dis 2014;5:e1285.

72. Wright AF, Hastie ND. Complex genetic diseases: controversy over the Croesus code. Genome Biol 2001;2(8):COMMENT2007.

73. Furney SJ, Alba MM, Lopez-Bigas N. Differences in the evolutionary history of disease genes affected by dominant or recessive mutations. BMC Genomics 2006;7:165.

74. Awan HM, Shah A, Rashid F, et al. Primate-specific long non-coding RNAs and microRNAs. Genomics Proteomics Bioinformatics 2017;15(3):187–95.

75. Capra JA, Williams AG, Pollard KS. ProteinHistorian: tools for the comparative analysis of eukaryote protein origin. PLoS Comput Biol 2012;8(6):e1002567.