Abstract

Retrocopies are gene duplicates arising from reverse transcription of mature mRNA transcripts and their insertion back into the genome. While long being regarded as processed pseudogenes, more and more functional retrocopies have been discovered. How the stripped-down retrocopies recover expression capability and become functional paralogs continually intrigues evolutionary biologists. Here, we investigated the function and evolution of retrocopies in the context of 3D genome organization. By mapping retrocopy–parent pairs onto sequencing-based and imaging-based chromatin contact maps in human and mouse cell lines and onto Hi-C interaction maps in 5 other mammals, we found that retrocopies and their parental genes show a higher-than-expected interchromosomal colocalization frequency. The spatial interactions between retrocopies and parental genes occur frequently at loci in active subcompartments and near nuclear speckles. Accordingly, colocalized retrocopies are more actively transcribed and translated and are more evolutionarily conserved than noncolocalized ones. The active transcription of colocalized retrocopies may result from their permissive epigenetic environment and shared regulatory elements with parental genes. Population genetic analysis of retroposed gene copy number variants in human populations revealed that retrocopy insertions are not entirely random in regard to interchromosomal interactions and that colocalized retroposed gene copy number variants are more likely to reach high frequencies, suggesting that both insertion bias and natural selection contribute to the colocalization of retrocopy–parent pairs. Further dissection implies that reduced selection efficacy, rather than positive selection, contributes to the elevated allele frequency of colocalized retroposed gene copy number variants. Overall, our results hint a role of interchromosomal colocalization in the “resurrection” of initially neutral retrocopies.

Introduction

Gene duplication is a major process leading to the birth of new genes, which contributes significantly to the adaptive evolution and phenotypic novelty of species in diverse clades (Ohno 1970; Zhang 2003; Chen et al. 2013). Gene retrotransposition is an RNA-based gene duplication mechanism whereby mRNA derived from protein-coding genes are reverse transcribed to DNA and inserted back into the genome (Kaessmann et al. 2009). In mammals, gene retroduplication is mainly mediated by the non-LTR retrotransposon LINE-1s (L1s) and occasionally by LTR retrotransposons (Esnault et al. 2000; Tan et al. 2016).

Although RNA-based gene duplication is less common than DNA-based duplication (i.e. segmental duplication and whole genome duplication; Kaessmann 2010), thousands of retrotransposed gene copies (retrocopies) have been identified in humans and other mammalian species (Navarro and Galante 2015; Carelli et al. 2016; Rosikiewicz et al. 2017). However, to what extent this large amount of retrocopies contributes to new gene origination and genome evolution is still an open question. On the one hand, retrocopies have been stripped of introns and promoters, rendering them to be “dead on arrival,” and were frequently dismissed as “processed pseudogenes” (Mighell et al. 2000; Zhang et al. 2003). On the other hand, increasing number of protein-coding retrocopies (retrogenes) with explicit functions has been reported in various organisms (Casola and Betran 2017). For example, a recent study reported that a mammalian retrogene HAPSTR2 encodes a protein that retains the biochemical features of its parental gene HAPSTR1 and acts as a safeguard in stress signaling and resilience (Amici et al. 2023). Moreover, accumulating evidence indicates that many retrocopies are robustly expressed, often in a tissue-specific manner (Carelli et al. 2016; Qian et al. 2022). These retrocopy-derived RNAs may function as natural antisense transcripts (NATs), long noncoding RNAs (lncRNAs), or sponging microRNAs to regulate their progenitors or host genes (Kubiak and Makałowska 2017; Ciomborowska-Basheer et al. 2021). Ribosome profiling data further suggest that a sizable proportion of retrocopies may be actively translated (Kim et al. 2014; Ji et al. 2015; Qian et al. 2022). However, the specific proteins they produce and their functions remain subjects for further investigation.

How some retrocopies gain their expression capability and “awake” as functional paralogs continually attracts evolutionary biologists (Vinckenbosch et al. 2006). Previous research discovered that retrocopies may exploit preexisting promoters, evolve new promoters from scratch, or recruit protopromoters from their genomic vicinity (Carelli et al. 2016). For example, the promoter of the aforementioned retrogene HAPSTR2 originated from a nearby protopromoter (Amici et al. 2023). However, current studies mostly focused on the role of local genomic context in the expression of retrocopies (Carelli et al. 2016; Machado and Antunes 2020), while the influences of high-order chromatin organization on retrocopy expression and functionality are largely unexplored.

In eukaryotes, high-order chromatin organization and transcription are tightly linked, with reciprocal influence on each other (Bonev and Cavalli 2016; van Steensel and Furlong 2019). Apart from the well-studied intrachromosomal (cis) structures, such as A/B compartments, topologically associating domains (TADs), and chromatin loops, interchromosomal (trans) interactions also play a critical role in gene regulation (Quinodoz et al. 2018; Maass et al. 2019) and genome evolution (Dai et al. 2014; Xie et al. 2016). For example, human interchromosomal paralogous gene pairs generated by whole genome duplication (i.e. ohnolog pairs) tend to be colocalized in the nucleus, which may associate with stronger dosage balance in those ohnologs (Xie et al. 2016). Likewise, neighboring genes in yeast exhibit interchromosomal spatial proximity after their separation (Dai et al. 2014), which indicates a role for interchromosomal interactions in the coregulation and evolution of functionally related genes. Inspired by these previous studies and the fact that most retrocopies and their parental genes are located on different chromosomes, we sought to ask whether retrocopy–parent pairs show interchromosomal colocalization in the nucleus, and if so, how spatial proximity links to the function and evolution of retrocopies.

Here, we used chromatin contact maps obtained by orthogonal techniques in human and mouse cell lines and Hi-C interaction matrices in chimpanzees, rhesus macaques, marmosets, cows, and dogs to detected significant interchromosomal interactions between retrocopies and their parental genes. Through extensive simulations, we demonstrated that mammalian retrocopy–parent pairs exhibit a higher colocalization frequency than random chromatin fragment pairs under various null models. Compared to noncolocalized retrocopies, colocalized retrocopies are enriched in active subcompartments and speckles, and are more actively transcribed and translated, supporting a connection between spatial proximity and functionality of retrocopies. Such differences might be resulted from the higher number of shared regulatory elements with parental genes and the more active epigenetic landscape of colocalized retrocopies. Furthermore, by investigating retroposed gene copy number variants (retroCNVs) in human populations, we discovered that insertion bias and nonadaptive evolution shaped the colocalization between retrocopies and parental genes. Our results shed lights on the function and evolution of retrocopies in the context of 3D genome organization.

Results

Human Interchromosomal Retrocopy–Parent Pairs Exhibit Widespread Nuclear Colocalization

To determine the extent of spatial colocalization between retrocopies and their parental genes, we retrieved 5,159 human retrocopy–parent pairs from Carelli et al. (2016). After filtering steps (Materials and Methods), 4,427 interchromosomal pairs were retained for subsequent analyses. Using Hi-C contact maps at 250-kb resolution of 7 human cell lines (GM12878, HMEC, HUVEC, IMR90, K562, KBM7, and NEHK; Rao et al. 2014), we found that between 738 (16.7%; in IMR90) and 1,019 (23.0%; in GM12878) retrocopy–parent pairs exhibit significant interchromosomal interactions, i.e. spatial colocalization (Fig. 1a; supplementary fig. S1 and table S1, Supplementary Material online). More than half (2,511; 56.7%) retrocopy–parent pairs are colocalized in at least 1 cell line, and 1,465 (33.1%) pairs exhibit spatial proximity in 2 or more cell lines (Fig. 1b).

Spatial colocalization between retrocopies and their parental genes. a) Circos plot highlighting the chromatin interactions between retrocopies and parental genes in the GM12878 cell line. From outer to inner are the ideogram of human chromosomes, the density of parental genes, the density of retrocopies, and significant chromatin interactions between retrocopies and parental genes. b) Bar plot showing the number of human interchromosomal retrocopy–parent pairs as a function of the number of cell lines wherein the spatial colocalization is detected. c) Distribution of colocalization frequency between chromatin pairs under different null models, while the arrow points to the colocalization frequency of true retrocopy–parent pairs. Simulations and true frequency are based on Hi-C data of GM12878. P < 0.001 in all cases.
Fig. 1.

Spatial colocalization between retrocopies and their parental genes. a) Circos plot highlighting the chromatin interactions between retrocopies and parental genes in the GM12878 cell line. From outer to inner are the ideogram of human chromosomes, the density of parental genes, the density of retrocopies, and significant chromatin interactions between retrocopies and parental genes. b) Bar plot showing the number of human interchromosomal retrocopy–parent pairs as a function of the number of cell lines wherein the spatial colocalization is detected. c) Distribution of colocalization frequency between chromatin pairs under different null models, while the arrow points to the colocalization frequency of true retrocopy–parent pairs. Simulations and true frequency are based on Hi-C data of GM12878. P < 0.001 in all cases.

We reasoned that if the spatial proximity between retrocopies and parental genes had biological significance, then a higher-than-expected colocalization frequency should be observed. Therefore, we performed various simulations to determine the statistical significance of retrocopy–parent colocalization (supplementary fig. S2, Supplementary Material online). First, we simulated 1,000 data sets, in each of which 4,427 random fragment pairs with matching chromosomes and sequence lengths as the retrocopy–parent pairs were sampled. Notably, we found that the colocalization frequency of true retrocopy–parent pairs is significantly higher than simulated chromatin pairs (P < 0.001; Fig. 1c; supplementary fig. S3, Supplementary Material online), suggesting nonrandom colocalization between retrocopies and parental genes. Second, we kept the information of retrocopies unchanged while sampling random coding and noncoding genes with matching chromosomes as parental genes. Third, we kept the information of parental genes unchanged while sampling random fragments with matching chromosomes and sequence lengths as retrocopies. Fourth, similar to the second simulation, except that we constrained the randomly sampled genes to exclusively protein-coding genes. For these simulations, we also observed a significant higher colocalization frequency in true retrocopy–parent pairs (P < 0.05, except for simulation 4 of NHEK; Fig. 1c; supplementary fig. S3, Supplementary Material online). However, compared to the first null model, the colocalization frequencies in these simulated data are higher, indicating that some properties (e.g. local genomic and epigenomic environments) of retrocopies or parental genes may contribute to interchromosomal colocalization.

Lastly, to further investigate the influence of retrocopies and parental genes on spatial proximity, we randomly shuffled the relationship between retrocopies and cognate genes for 1,000 times. Particularly, the observed colocalization frequencies in GM12878, HMEC, HUVEC, K562, and NHEK are still significantly higher than expected (P < 0.05), while the significance is marginal in IMR90 and KBM7 (P = 0.082 and 0.078, respectively; Fig. 1c; supplementary fig. S3, Supplementary Material online). To further evaluate the significance of interchromosomal colocalization between retrocopies and their parental genes, we analyzed contact maps derived from 2 orthogonal methods: (i) split-pool recognition of interactions by tag extension (SPRITE), which is a sequencing-based but ligation-free technique that enables detection of multiway genome-wide chromatin interactions, and (ii) multiplexed error-robust fluorescence in situ hybridization (MERFISH), which is an imaging-based approach to map chromatin structure at genome scale. In GM12878, we observed a significant overlap in the colocalization status of retrocopy–parent pairs derived from Hi-C (Rao et al. 2014) and SPRITE (Quinodoz et al. 2018) when assessed at the same resolution (1 Mb) and significance level (q < 0.05; supplementary fig. S4a, Supplementary Material online). Furthermore, when employing significant contacts identified by SPRITE, retrocopies and their parental genes exhibit a colocalization frequency higher than expected (P < 0.05 in all simulations; supplementary fig. S4b, Supplementary Material online). Notably, in IMR90, retrocopy–parent pairs identified as colocalized by Hi-C display shorter median spatial distances and higher proximity frequencies, as measured by MERFISH (Su et al. 2020), compared to noncolocalized pairs (P = 3.5 × 10−6 and 2.0 × 10−13, respectively; supplementary fig. S5, Supplementary Material online).

Colocalized Retrocopies Are Enriched in Active Subcompartments and Are Close to Nuclear Speckles

We next asked where did retrocopies and parental genes colocalize with respect to the 3D genome organization in the nucleus. By clustering interchromosomal contact matrix of the ultradeep-sequenced GM12878 cell line, Rao et al. (2014) divided the active A and inactive B compartments into 5 primary subcompartments (A1, A2, B1, B2, and B3), which were associated with distinct genomic and epigenomic features. We found that in GM12878, retrocopies are slightly enriched in A2 subcompartment while depleted in B2 and B3 subcompartments (Fig. 2a). Such enrichment of active subcompartments and depletion of inactive subcompartments are more apparent for parental genes. In addition, compared to protein-coding genes as a whole, retrocopies’ parental genes show stronger enrichment of active subcompartments and depletion of inactive subcompartments, which is consistent with previous findings that highly expressed genes are more prone to generate retrocopies (Zhang et al. 2003; Podlaha and Zhang 2009; Sisu et al. 2014).

Chromatin positioning of retrocopies and parental genes in the 3D genome. a) Enrichments of retrocopies, parental genes, unprocessed pseudogenes, and retroCNVs in 5 primary subcompartments and TAD in the GM12878 cell line. b) Sankey diagram showing the connection of colocalized (left panel) and noncolocalized (right panel) retrocopy–parent pairs in the context of subcompartment annotations in GM12878. The height of bar indicates the proportion of loci that fall in respective categories. Lines connect retrocopies and their corresponding parent genes. NA denotes a fragment is not annotated as any of the 5 subcompartments. c) Comparison of SON TSA-Seq signal between colocalized and noncolocalized retrocopies/parental genes in K562 (***P < 0.001; 2-tailed Wilcoxon test). d) Proportion of colocalized and noncolocalized retrocopies/parental genes that overlapped with different SON TSA-Seq deciles in K562. Decile 1 (estimated distance: 0.868 to 1.020 μm) and decile 10 (estimated distance: 0.054 to 0.399 μm) represent the top 10% of chromatin that are farthest and closest to SON, respectively, and other deciles are those that fall between. Coloc_retro, colocalized retrocopies; Noncoloc_retro, noncolocalized retrocopies; Coloc_parent, colocalized parental genes; Noncoloc_parent, noncolocalized parental genes.
Fig. 2.

Chromatin positioning of retrocopies and parental genes in the 3D genome. a) Enrichments of retrocopies, parental genes, unprocessed pseudogenes, and retroCNVs in 5 primary subcompartments and TAD in the GM12878 cell line. b) Sankey diagram showing the connection of colocalized (left panel) and noncolocalized (right panel) retrocopy–parent pairs in the context of subcompartment annotations in GM12878. The height of bar indicates the proportion of loci that fall in respective categories. Lines connect retrocopies and their corresponding parent genes. NA denotes a fragment is not annotated as any of the 5 subcompartments. c) Comparison of SON TSA-Seq signal between colocalized and noncolocalized retrocopies/parental genes in K562 (***P < 0.001; 2-tailed Wilcoxon test). d) Proportion of colocalized and noncolocalized retrocopies/parental genes that overlapped with different SON TSA-Seq deciles in K562. Decile 1 (estimated distance: 0.868 to 1.020 μm) and decile 10 (estimated distance: 0.054 to 0.399 μm) represent the top 10% of chromatin that are farthest and closest to SON, respectively, and other deciles are those that fall between. Coloc_retro, colocalized retrocopies; Noncoloc_retro, noncolocalized retrocopies; Coloc_parent, colocalized parental genes; Noncoloc_parent, noncolocalized parental genes.

Remarkably, colocalized retrocopies show strong enrichment of A1 and A2 subcompartments and strong depletion of B2 and B3 subcompartments, resembling the pattern of parental genes (Fig. 2a). In contrast, noncolocalized retrocopies are depleted in active A1 subcompartment and show little depletion of inactive subcompartments. Dividing colocalized and noncolocalized retrocopies into intergenic and intragenic subclasses does not affect the general pattern (Fig. 2a), indicating that the subcompartment association is not affected by host genes. Besides, parental genes and colocalized retrocopies are slightly enriched in TADs. Contrary to retrocopies, unprocessed pseudogenes are depleted in A1 and A2 subcompartments and TAD while slightly enriched in B1 and B2 subcompartments (Fig. 2a), reflecting different 3D chromatin environments between processed and unprocessed “pseudogenes.” Using subcompartment annotations inferred by deep learning on relatively low-resolution Hi-C maps (Xiong and Ma 2019), we obtained similar results in HMEC, HUVEC, IMR90, and K562 (supplementary fig. S6, Supplementary Material online).

Of note, for colocalized retrocopy–parent pairs, both ends tend to be overlapped with A1 or A2 subcompartments, whereas for noncolocalized pairs, a substantial proportion of retrocopies are overlapped with B2 or B3 subcompartments, despite the fact that the majority of parental genes still overlap with active subcompartments (Fig. 2b; supplementary fig. S7, Supplementary Material online).

Subcompartments are strongly associated with subnuclear structures, such as nuclear lamina, nucleolus, and speckles, which are associated with genome regulation (Chen et al. 2018). Thus, we compared the spatial positioning relative to subnuclear structures of colocalized and noncolocalized retrocopies using tyramide signal amplification (TSA)-Seq data in K562 (Chen et al. 2018). TSA-Seq is a technique that combines TSA and high-throughput sequencing to map chromosome organization relative to subnuclear structures genome-wide. We found that the log2(pull-down/input) signals of SON and pSC35 TSA-Seq of colocalized retrocopies or parental genes are significantly higher than their noncolocalized counterparts, whereas an opposite trend is observed for LaminAC and LaminB TSA-Seq signals (P < 0.001; Fig. 2c; supplementary fig. S8a to c, Supplementary Material online). SON and pSC35 are 2 marker proteins for nuclear speckles while LaminAC and LaminB are markers for nuclear lamina, and high signal means shorter cytological distance to the respective subnuclear structure. Thus, our results indicate that colocalized retrocopies and parental genes are closer to speckles, while their noncolocalized counterparts are closer to lamina.

SON TSA-Seq signals can be further converted to actual distance (in µm) between chromatin and nuclear speckles, and loci can be assigned to different TSA deciles based on their distance to speckles (Chen et al. 2018). Leveraging these data, we found that colocalized retrocopies have an average distance of 0.48 µm to speckles, longer than that of colocalized parental genes but significantly shorter than that of noncolocalized retrocopies (P < 0.001; supplementary fig. S8d, Supplementary Material online). Consistently, colocalized parental genes have the highest proportion of fragments falling in the 9th and 10th SON TSA deciles (chromatin that is mostly close to speckle), followed by colocalized retrocopies and noncolocalized parental genes, while only a small proportion of noncolocalized retrocopies fall in the 9th and 10th deciles (Fig. 2d). Furthermore, by integrating TSA-Seq, DamID, and Hi-C data in the K562 cell line, Wang et al. (2021) partitioned the genome into 10 different spatial compartmentalization states relative to nuclear speckles, lamina, and nucleolus. We found that colocalized retrocopies are enriched in interior active regions and speckle-associated chromatin while depleted in interior repressive regions and lamina-associated chromatin, similar to the pattern of parental genes (supplementary fig. S9, Supplementary Material online).

Since both ends of colocalized retrocopy–parent pairs tend to have shorter spatial distances to speckles (supplementary fig. S10a, Supplementary Material online) in comparison with noncolocalized pairs, we randomly sampled protein-coding genes with matching speckle distance (±0.01 μm away) as true parental genes to form simulated chromatin pairs with retrocopies for 1,000 times in K562. We found that after controlling speckle distances, true retrocopy–parent pairs still exhibit higher-than-expected colocalization frequency (P < 0.001; supplementary fig. S10b, Supplementary Material online), suggesting that speckle proximity cannot fully explain retrocopy–parent colocalization. However, because there are 10 to 50 speckles in a nucleus (Spector and Lamond 2011), we cannot exclude the possibility that retrocopies and their parental genes tend to be close to the same nuclear speckle.

RNA Biogenesis and Shared Regulatory Elements Are Linked to the Interchromosomal Colocalization between Retrocopies and Parental Genes

Although retrocopies are often thought to be processed pseudogenes, most of them show evidence of transcription (Carelli et al. 2016). The distinct positioning of colocalized and noncolocalized retrocopies in the nucleus suggests they may have different transcription profile, and we then asked whether the spatial colocalization with parental genes is associated with the expression of retrocopies. Using RNA-Seq data of GM12878, HUVEC, K562, and NHEK from the ENCODE project (Djebali et al. 2012), we have also detected evidence of transcription for most retrocopies (supplementary table S1, Supplementary Material online), with a substantial proportion of them showing robust expression. For instance, when assessing polyadenylated (polyA+) RNAs in whole cells, there are 772 to 1,350 (17.4% to 30.5%) retrocopies exhibiting fragments per kilobase per million mapped fragments (FPKM) ≥ 1 (supplementary fig. S11a, Supplementary Material online). The proportion of robustly expressed (FPKM ≥ 1) retrocopies for polyA+ RNAs is larger than 15% and 20% in the nucleus and cytosol, respectively (supplementary fig. S11b and c, Supplementary Material online). There are also substantial nonpolyadenylated (polyA−) RNAs transcribed from retrocopies in the nucleus and whole cell, but relatively few in the cytosol (supplementary fig. S11d to f, Supplementary Material online). More importantly, colocalized retrocopies and parental genes have significantly higher expression level than their noncolocalized counterparts in most cell lines, for RNAs in different compartments (whole cell, nucleus, and cytosol), and for both polyA+ and polyA− RNAs (P < 0.01; Fig. 3a; supplementary figs. S12 and S13, Supplementary Material online). In addition, the GRO-Seq and PolII TSA-Seq data in K562 (Chen et al. 2018) show that colocalized retrocopies and parental genes generate more nascent RNAs and are closer to RNA polymerase II than their noncolocalized counterparts (P < 0.001; Fig. 3b), suggesting that RNA biogenesis is more active at colocalized loci.

Expression and regulatory element sharing of retrocopies and parental genes. a) Plots of cumulative fraction of expression of retrocopies (left panels) and parental genes (right panels) in GM12878. The expression levels of polyadenylated (polyA+; top panels) and nonpolyadenylated (polyA−; bottom panels) RNAs are plotted separately. b) Comparison of GRO-Seq signal (left panel) and Pol II TSA-Seq signal (right panel) between colocalized and noncolocalized retrocopies/parental genes in K562. c) Comparison of the number of shared TFBSs between colocalized and noncolocalized retrocopy–parent pairs. ***P < 0.001; 2-tailed Wilcoxon test.
Fig. 3.

Expression and regulatory element sharing of retrocopies and parental genes. a) Plots of cumulative fraction of expression of retrocopies (left panels) and parental genes (right panels) in GM12878. The expression levels of polyadenylated (polyA+; top panels) and nonpolyadenylated (polyA−; bottom panels) RNAs are plotted separately. b) Comparison of GRO-Seq signal (left panel) and Pol II TSA-Seq signal (right panel) between colocalized and noncolocalized retrocopies/parental genes in K562. c) Comparison of the number of shared TFBSs between colocalized and noncolocalized retrocopy–parent pairs. ***P < 0.001; 2-tailed Wilcoxon test.

Enhancer sharing has been hypothesized to account for the spatial colocalization and coregulation of paralogous genes located on the same chromosome (Ibn-Salem et al. 2017). Also, transcription factors (TFs) play a part in interchromosomal interactions (Kim and Shendure 2019). We thus interrogated the relationship between shared regulatory elements and spatial colocalization. As expected, we found that the flanking regions (putative transcription start site, i.e. TSS ± 3 kb) of colocalized retrocopies and parental genes harbor more TF binding sites (TFBSs) than noncolocalized ones in most cell lines (P < 0.05; supplementary fig. S14, Supplementary Material online). In particular, the number of shared TFBSs between colocalized retrocopies and parental genes is significantly greater than that of noncolocalized retrocopy–parent pairs (P < 0.001; Fig. 3c). It should be noted that the choice of flanking regions (e.g. TSS ± 1 kb or upstream 3 kb) does not affect our results (supplementary fig. S15, Supplementary Material online). Significantly, colocalized retrocopies show a higher correlation coefficient of expression with their parental genes across 17,282 GTEx samples (GTEx Consortium 2020) in comparison with noncolocalized pairs (supplementary fig. S16, Supplementary Material online), supporting the involvement of shared TFBSs in retrocopy–parent coexpression. Interestingly, the corresponding TFs of shared motifs are mostly zinc finger proteins, such as ZNF263, ZNF384, ZNF460, and ZNF135, consistent with a recent report in which zinc finger factors are found to play a role in speckle-associated interchromosomal interactions (Joo et al. 2023).

Colocalized and noncolocalized retrocopies are enriched in different subcompartments, which are associated with distinct epigenomic features and may affect the expression of retrocopies. We then explicitly examined the signals of 12 epigenomic features (including DNase I sensitivity, the histone variant H2A.Z, and 10 histone modifications) for retrocopies. Generally, colocalized retrocopies and their flanking regions have stronger signals of activating histone modifications (e.g. H3K4me1, H3K27ac, H3K36me3, and H3K79me2) in comparison with noncolocalized retrocopies (Fig. 4; supplementary figs. S17 to S21, Supplementary Material online), indicating an association between epigenetic landscape and interchromosomal colocalization. Notably, retrocopies, even those noncolocalized ones, harbor stronger activating histone modifications than unprocessed pseudogenes (Fig. 4; supplementary figs. S17 to S21, Supplementary Material online), further supporting the delimitation between processed and unprocessed “pseudogenes” in the context of 3D genome environment.

Epigenetic signals for DNase I hypersensitivity, histone variant H2A.Z, and 10 histone modifications along colocalized retrocopies, noncolocalized retrocopies, and unprocessed pseudogenes and their flanking regions (±3 kb) in the GM12878 cell line. Solid lines represent the mean fold change signal, and shadowed areas denote the standard error. TSS, putative transcription start site; TES, putative transcription end site.
Fig. 4.

Epigenetic signals for DNase I hypersensitivity, histone variant H2A.Z, and 10 histone modifications along colocalized retrocopies, noncolocalized retrocopies, and unprocessed pseudogenes and their flanking regions (±3 kb) in the GM12878 cell line. Solid lines represent the mean fold change signal, and shadowed areas denote the standard error. TSS, putative transcription start site; TES, putative transcription end site.

To evaluate the relative importance of different features in the colocalization between retrocopies and their parental genes, we trained a logistic regression model using 35 genetic and epigenetic features of retrocopies and parental genes as predictors (supplementary table S2, Supplementary Material online) in each cell line where matched data were available. For GM12878, the machine learning model is able to predict retrocopy–parent colocalization with an accuracy of 0.798 and an area under the receiver operating characteristic curve (AUC) of 0.785 (supplementary fig. S22a, Supplementary Material online). Among the features included, the number of shared motifs between retrocopies and parental genes is the most important, with various histone modifications of retrocopies also contributing significantly (supplementary fig. S22b, Supplementary Material online). The number of shared motifs and epigenetic features of retrocopies are also top predictors in models of HUVEC, K562, and NHEK, but the performance is worse for these cell lines (supplementary figs. S23 to S25, Supplementary Material online).

Colocalized Retrocopies Are More Likely to Be Functional and Evolutionarily Conserved

Retrocopies may preserved the coding potentials of their parental genes before degenerative mutations disrupting the open reading frames (ORFs); therefore, gaining expression is the first and most important step of being functional for a retrocopy. The resulting RNAs may serve as functional units per se, such as NATs, lncRNAs, or microRNAs (Kubiak and Makałowska 2017). There are also sporadic cases that retrocopies act as important protein-coding genes in development and disease (Casola and Betran 2017). To further assess the functional relevance of retrocopies at the translation level, we leveraged the ribosome profiling (i.e. Ribo-Seq) data across 101 studies curated on RPFdb (Wang et al. 2019) and found that colocalized retrocopies have significantly higher reads per kilobase per million mapped reads (RPKM) values than noncolocalized retrocopies in 5 of the 7 cell lines (P < 0.001; supplementary fig. S26, Supplementary Material online). The difference in Ribo-Seq signals is statistically marginal between retrocopies that are colocalized in 0 and 1 cell line (P = 7.8 × 10−2), while retrocopies with recurrent colocalized in 2 or more cell lines are covered by significantly more reads than the other 2 categories of retrocopies (P = 5.5 × 10−13 and 3.7 × 10−6, respectively; Fig. 5a). Because Ribo-Seq reads could be derived from genomic fragments randomly copurified with ribosomes, we further screened retrocopy transcripts with actively translated ORFs. These ORFs were detected using RibORF (Ji et al. 2015), which makes use of the 3-nt periodicity and uniformity of read distribution across codons of translated regions, thus reducing spurious signals of translation. When only transcripts with at least 1 detected ORF were considered, colocalized retrocopies still present higher Ribo-Seq signals than noncolocalized ones (supplementary fig. S27, Supplementary Material online). If the higher levels in transcription and translation of colocalized retrocopies did confer any functional significance, they would be more evolutionarily conserved than noncolocalized ones. As expected, we observed that the average phastCons conservation scores of colocalized retrocopies are significantly higher than noncolocalized ones (Fig. 5b; supplementary fig. S28, Supplementary Material online). This pattern is retained when we use LINSIGHT scores (Huang et al. 2017) to measure the conservation of retrocopies (supplementary fig. S29, Supplementary Material online). In addition, colocalized retrocopies and their putative promoters (1 kb upstream of TSSs) are more likely to overlap with National Human Genome Research Institute (NHGRI)-EBI GWAS single nucleotide polymorphisms (SNPs; Buniello et al. 2019) than noncolocalized ones (P < 0.01 in 6 out of the 7 cell lines, Fisher exact test; Fig. 5c), further supporting their higher potential of functionality. We found that the phastCons scores of retrocopies are positively correlated with their levels of RNA-Seq (R = 0.45, P < 2.2 × 10−16) and Ribo-Seq (R = 0.28, P < 2.2 × 10−16; Fig. 5d). However, after accounting for the variation of RNA expression level, the partial correlation coefficient between phastCons score and Ribo-Seq signal is only 0.088 (P = 5.5 × 10−8), while the partial correlation coefficient between phastCons score and RNA expression level is 0.374 (P = 1.8 × 10−12) when controlling for Ribo-Seq signal, suggesting that retrocopies may be of more functional relevance at the RNA level.

Coding potential and functional importance of colocalized and noncolocalized retrocopies. a, b) Comparison of Ribo-Seq RPKM values a) and phastCons scores b) among retrocopies that are colocalized in 0, 1, and ≥2 cell lines. For Ribo-Seq data, only retrocopies with RPKM ≥ 1 were included. c) Percentage of colocalized and noncolocalized retrocopies in the 7 cell lines that are overlapped with at least 1 NHGRI-EBI GWAS SNP. d) Correlation between phastCons score and RNA expression level (left panel) and between phastCons score and Ribo-Seq signal (right panel) of retrocopies. Solid line and shadowed area in each plot denote the linear regression line and its confidence interval. ***P < 0.001, **P < 0.01, *P < 0.05, and NS, not significant; 2-tailed Wilcoxon test in a) and b) and Fisher's exact test in c).
Fig. 5.

Coding potential and functional importance of colocalized and noncolocalized retrocopies. a, b) Comparison of Ribo-Seq RPKM values a) and phastCons scores b) among retrocopies that are colocalized in 0, 1, and ≥2 cell lines. For Ribo-Seq data, only retrocopies with RPKM ≥ 1 were included. c) Percentage of colocalized and noncolocalized retrocopies in the 7 cell lines that are overlapped with at least 1 NHGRI-EBI GWAS SNP. d) Correlation between phastCons score and RNA expression level (left panel) and between phastCons score and Ribo-Seq signal (right panel) of retrocopies. Solid line and shadowed area in each plot denote the linear regression line and its confidence interval. ***P < 0.001, **P < 0.01, *P < 0.05, and NS, not significant; 2-tailed Wilcoxon test in a) and b) and Fisher's exact test in c).

Of particular note is that for 2 sister retrocopies derived from the same parental gene, the colocalized one generally has higher levels of expression, motif sharedness, and conservation than the noncolocalized one in GM12878 (P = 1.4 × 10−2, 1.5 × 10−5, and 3.0 × 10−2, respectively; supplementary fig. S30, Supplementary Material online). But the trends are not apparent in other cell lines (data not shown), possibly due to the small sample size of desired retrocopies and the nosier nature of colocalization in cell lines other than GM12878.

In addition, to ask whether the colocalization pattern of retrocopy–parent pairs and its functional significance is conserved across mammalian species, we performed similar analyses with retrocopy–parent pairs of mice, chimpanzees, rhesus macaques, marmosets, cows, and dogs (supplementary table S3, Supplementary Material online; Carelli et al. 2016). Using Hi-C maps of corresponding species at 250-kb resolution (Rao et al. 2014; Zhang et al. 2019; Luo et al. 2021; Li et al. 2022), we found that retrocopy–parent pairs in 5 out of these 6 species exhibit significantly higher colocalization frequencies than randomly chromatin pairs under various null models (supplementary fig. S31, Supplementary Material online). The result in dogs is an exception (P = 0.132 for simulation 5), possibly due to the large number and short length of dog's chromosomes (Li et al. 2022). In mouse embryonic stem cells (mESCs), we found that colocalized retrocopy–parent pairs determined by Hi-C (Bonev et al. 2017) and SPRITE (Quinodoz et al. 2018) have a great proportion of overlap (supplementary fig. S32a, Supplementary Material online). Leveraging high-resolution genome-wide seqFISH+ imaging data in mESCs (Takei et al. 2023), we also observed a significant shorter mean spatial distance for retrocopy–parent pairs identified as colocalized by Hi-C in comparison with noncolocalized pairs (P = 6.1 × 10−94; supplementary fig. S33a, Supplementary Material online). More importantly, when significant interactions (q < 0.05) identified by SPRITE or the top 10% interchromosomal chromatin pairs with the shortest spatial distances detected by seqFISH+ were used to determine interchromosomal colocalization in mESCs, we found a higher-than-expected colocalization frequency for true retrocopy–parent pairs (supplementary figs. S32b and S33b, Supplementary Material online).

In addition, we observed that colocalized retrocopies and parental genes are also expressed at a higher level than noncolocalized ones in mouse CH12-LX cells (supplementary fig. S34, Supplementary Material online). Comparison of Ribo-Seq signals and phastCons scores indicates mouse colocalized retrocopies are more actively translated and tend to be more conserved than noncolocalized ones (P = 0.002 and 0.012, respectively; supplementary fig. S35, Supplementary Material online). Besides, the conservation of retrocopies is positively correlated with their levels of transcription and translation in mouse, with a greater correlation coefficient between conservation and RNA expression (supplementary fig. S36, Supplementary Material online).

Biased Insertion and Natural Selection Shaped the Interchromosomal Colocalization between Retrocopies and Parental Genes

Lastly, we explored what molecular and evolutionary mechanisms might have shaped the spatial colocalization between retrocopies and parental genes. To this end, we identified 625 polymorphic retrocopies (retroCNVs) in 5 human populations (CEU, CHS, LWK, PEL, and YRI; 491 individuals) using sideRETRO (Miller et al. 2021; supplementary table S4, Supplementary Material online; Materials and Methods). Similar to Zhang et al. (2021), we did not distinguish homozygotes and heterozygotes and only accounted for the presence or absence of each allele in different individuals. Our list captures most retroCNVs identified in other studies (Abyzov et al. 2013; Ewing et al. 2013; Schrider et al. 2013; Zhang et al. 2017; Feng and Li 2021; Batcher et al. 2022; supplementary table S5, Supplementary Material online), and the neighbor-joining tree based on retroCNV presence frequency is topologically the same as the tree built from genome-wide SNPs (supplementary fig. S37, Supplementary Material online), indicating the overall accuracy of retroCNV identification. The presence frequency spectrum of retroCNVs is more skewed toward low frequency in comparison with SNPs (supplementary fig. S38, Supplementary Material online), congruent with the highly deleterious effect of retroCNV insertions. Ninety-eight retroCNVs are present in only 1 individual, which may be enriched in de novo retrocopy insertions. We then performed the same simulations for these presumably de novo retrocopies and their parental genes as did for putative “fixed” retrocopy–parent pairs (supplementary fig. S2, Supplementary Material online) and found that the colocalization frequency between de novo retrocopy insertion sites and their parental genes is only significantly higher than expected in some simulations (Fig. 6a; supplementary fig. S39, Supplementary Material online). This result suggests that the insertion of retrocopies is not random, but nonrandom insertion cannot fully explain the current colocalization pattern between retrocopies and parental genes.

Colocalization and evolution of retroCNVs in humans. a) Density plots showing the distribution of colocalization frequency between chromatin pairs under different null models for singleton retroCNVs (left panel) and high-frequency retroCNVs (right panel). The arrows point to the colocalization frequency between true retroCNVs and their parental genes. Simulations and true frequencies are based on Hi-C data in GM12878. P-values are indicated for simulation 5. b) The presence frequency spectra for retroCNVs that are colocalized in 0, 1, and ≥2 cell lines.
Fig. 6.

Colocalization and evolution of retroCNVs in humans. a) Density plots showing the distribution of colocalization frequency between chromatin pairs under different null models for singleton retroCNVs (left panel) and high-frequency retroCNVs (right panel). The arrows point to the colocalization frequency between true retroCNVs and their parental genes. Simulations and true frequencies are based on Hi-C data in GM12878. P-values are indicated for simulation 5. b) The presence frequency spectra for retroCNVs that are colocalized in 0, 1, and ≥2 cell lines.

Then we investigated the potential role of postinsertion selection in determining retrocopy–parent colocalization. In contrast to de novo retrocopy insertions, high-frequency (occurs in ≥ 28 individuals; n = 96, a sample size that is comparable to singletons) retroCNVs and their parental genes exhibit a higher-than-expected colocalization frequency in all simulations (P < 0.01 except for simulations 4 and 5 in HUVEC; Fig. 6a; supplementary fig. S40, Supplementary Material online), indicating the action of postinsertion selection on retrocopy survival in the context of interchromosomal interactions. Furthermore, we found that the presence frequency spectrum of retroCNVs exhibiting colocalization in 2 or more cell lines is less skewed toward low frequency compared to retroCNVs showing colocalization in 0 or 1 cell line (Fig. 6b), suggesting relaxed negative selection on colocalized retroCNVs. The difference in frequency spectrum holds true for retroCNV subclasses base on colocalization status in individual cell lines (supplementary fig. S41, Supplementary Material online). With respect to the 3D genome organization, recurrently colocalized retroCNVs are enriched in the active A1 subcompartment (Fig. 2a; supplementary fig. S4, Supplementary Material online) and interior active regions and speckles (supplementary fig. S6, Supplementary Material online). These active chromatin regions may have higher density of selected sites and/or more intense selection, which are negatively correlated with local effective population size (Gossmann et al. 2011), leading to reduced selection efficacy on recurrently colocalized retroCNVs. Indeed, the density of conserved sites (phastCons score > 0.5) flanking colocalized retroCNVs (insertion site ± 50 kb) is significantly higher than noncolocalized retroCNVs (supplementary fig. S42, Supplementary Material online). Also, the unfolded site frequency spectrum of SNPs in the flanking regions (insertion site ± 50 kb) of recurrently colocalized retroCNVs is more skewed toward low frequency (supplementary fig. S43, Supplementary Material online).

Because there is also a higher proportion of intermediate-to-high-frequency alleles for recurrently colocalized retroCNVs, we sought to detect whether the elevated frequency of colocalized retroCNVs is driven by positive selection. If a retroCNV insertion was under strong positive selection, selective sweep would (i) reduce the nucleotide diversity (π) among sequences near the insertion for retroCNV carriers compared to noncarriers; (ii) reduce Tajima's D among sequences near the insertion for retroCNV carriers compared to genomic background; and (iii) increase the linkage disequilibrium (LD) between SNPs near the insertion for retroCNV carriers compared to genomic background (Zhang and Tautz 2022). Based on these principles, we scanned signals of positive selection for each retroCNV with allele frequency ≥ 0.5 in individual populations using SNPs from the 1000 Genomes Project (Auton et al. 2015; Materials and Methods). We found that neither colocalized nor noncolocalized high-frequency retroCNVs are under positive selection using the above criteria (supplementary table S6, Supplementary Material online). Therefore, our results indicate that the intermediate to high frequency of both colocalized and noncolocalized retroCNVs is probably the result of nonadaptive process or weak positive selection that left untraceable signal.

Discussion

In the present study, we revealed that retrocopy–parent pairs exhibit higher frequency of interchromosomal colocalization than expected under various null models and that colocalized retrocopies are more active in terms of transcription and translation. We further explored the potential mechanisms, functional consequences, and evolutionary implications of the nonrandom colocalization between retrocopies and their parental genes.

We focused exclusively on interactions between retrocopies and parental genes that are located on different chromosomes, i.e. interchromosomal contacts. Compared to intrachromosomal structures (e.g. TADs and chromatin loops), the mechanisms and functional consequences of interchromosomal interactions are less well understood. Although often being thought as noise or technical artifacts, emerging evidence suggests that interchromosomal interactions are relevant to fundamental cellular activity, such as transcription and splicing (Maass et al. 2019). Using CRISPR/Cas9 live-cell imaging, Maass et al. (2018) showed that interchromosomal contacts are as stable and frequent as intrachromosomal interactions. In another study, Quinodoz et al. (2018) detected 2 hubs of interchromosomal interactions that shape 3D genome organization in the nucleus using SPRITE. Along with accumulating evidences of trans gene regulation in health and in disease (Patel et al. 2014; Bashkirova and Lomvardas 2019; Maass et al. 2019; Lei et al. 2023; Moon et al. 2023), recent progress indicates that interchromosomal interactions play structural and functional roles in cells. Our finding that retrocopy function is linked to trans interactions with parental genes also hints nonrandom organization of interchromosomal contacts.

Interchromosomal contacts often occur at farther spatial distances than cis interactions (Maass et al. 2018), rendering their detection difficult for regular ligation-based methods (e.g. Hi-C). In this study, our main analyses of interchromosomal interactions rely on Hi-C contact maps at 250-kb resolution, given the constraints of sequencing depth. Leveraging the ultrahigh-depth Hi-C data in GM12878, we further detected spatial colocalization between retrocopies and parental genes at 100- and 500 kb resolutions. Under the same criterion (q < 0.05 in FitHiC2), the number of colocalized pairs of retrocopy–parent pairs decreases with the increase of resolution. Nevertheless, the colocalization pairs based on high-resolution data are almost a subset of pairs detected using low-resolution contact maps (supplementary fig. S44, Supplementary Material online). Furthermore, retrocopies and their parental genes always possess higher-than-expected colocalization frequencies at different resolutions (supplementary fig. S45, Supplementary Material online). More importantly, chromatin contact maps derived from orthogonal methods (SPRITE, MERFISH, and seqFISH+) in humans and mice also support the significant colocalization between retrocopy–parent pairs. However, a limitation in our assessment of interchromosomal interactions is the low resolution of certain contact maps and the scarcity of cell lines with complementary data simultaneously available.

Subsequently, we established that colocalized and noncolocalized retrocopies exhibit distinct properties in terms of subcompartment annotations and cytological distance to subnuclear structures (Fig. 2  supplementary figs. S4 to S7, Supplementary Material online). For subcompartments, both A1 and A2 harbor highly expressed genes and activating chromatin marks, and are depleted in nucleolus-associated domains (NADs), but they show difference in replication timing and GC content. B1 represents facultative heterochromatin that exhibits intermediate level of transcriptional activity and high level of H3K27me3. B2 and B3 are associated with nuclear lamina, but B2 is enriched in NADs whereas B3 is strongly depleted in NADs (Rao et al. 2014; Xiong and Ma 2019). Using TSA-Seq data, Chen et al. (2018) further revealed that chromatin in A1 subcompartment is close to nuclear speckles while A2 subcompartment corresponds to region with nonspeckle-associated transcription. Therefore, our observations (Fig. 2; supplementary figs. S4 to S7, Supplementary Material online) imply that many interchromosomal interactions between retrocopies and their parental genes may be mediated by nuclear speckles. Nuclear speckles are domains enriched in pre-mRNA splicing factors (Spector and Lamond 2011) and are also hubs of interchromosomal interactions where active transcription takes place nearby (Galganski et al. 2017; Quinodoz et al. 2018; Chen and Belmont 2019; Kim et al. 2020). Remarkably, zinc finger proteins are found to facilitate the establishment of speckle-associated interchromosomal contacts (Joo et al. 2023), in accordance with our findings that the number of shared regulatory motifs is the most importance feature in predicting retrocopy–parent colocalization and that most shared motifs are bind by TFs in the zinc finger family. Thus, our findings further support a role of TFs in interchromosomal interactions (Kim et al. 2019; Kim and Shendure 2019).

Speckle-associated interchromosomal interactions, along with more active chromatin marks and higher number of shared motifs associated with colocalized retrocopies, signify colocalized retrocopies should be more actively transcribed. The implications of higher expression of colocalized retrocopies could be 2-fold. First, retrocopy-derived RNAs could function as NATs, lncRNAs, or microRNAs in diverse cellular activities, including regulation of their parental genes (Kubiak and Makałowska 2017). Besides, RNAs transcribed from retrocopies may be processed and translated and function at the protein level (Qian et al. 2022). Second, the action of transcription and the resulting RNAs can orchestrate both intra- and interchromosomal chromatin organization (van Steensel and Furlong 2019; Bertero 2021; Quinodoz and Guttman 2022). In the nucleus, 3D genome organization and transcription are intermingled, with reciprocal influence on each other. Recent single-cell Hi-C and RNA-Seq data discovered widespread chromatin interaction rewiring before transcription activation, strongly supporting a regulatory role of chromatin organization on gene expression (Liu et al. 2023). Meanwhile, TFs can bring DNA fragments on the same or different chromosomes to spatial proximity through direct oligomerization or cofactor oligomerization, leading to transcription-mediated interactions in cis or trans (Kim et al. 2019; Kim and Shendure 2019). In addition, noncoding RNAs in the nucleus may serve as chromosome scaffolds or seeds to guide chromatin organization (Melé and Rinn 2016; Quinodoz et al. 2021; Bouwman et al. 2022). However, the actual impact of retrocopy transcription and retrocopy-derived RNAs on interchromosomal interactions needs additional investigations.

We further showed that both biased insertion and natural selection contribute to the nonrandom colocalization between retrocopies and parental genes (Fig. 6). Mechanistically, new L1 insertions are linked to DNA replication, with a preference in early-replicating regions of the genome (Flasch et al. 2019; Sultana et al. 2019). Because early-replicating regions are enriched with active subcompartments and speckles and are more likely to be involved in interchromosomal interactions (Vouzas and Gilbert 2021), if retrocopy insertions have the same property as L1s, this insertion bias would partly explain the higher-than-expected colocalization frequency between de novo retroCNVs and parental genes.

Our population genetic analysis indicates that colocalized retroCNVs are more likely to reach high frequency, but this elevated frequency is probably driven by nonadaptive forces. Then, what is the evolutionary significance of retrocopy–parent colocalization? Given the fact that colocalized “fixed” retrocopies are more actively transcribed and have higher potential of functionality than noncolocalized ones, we argue that the spatial proximity with parental genes may facilitate the “resurrection” of initially neutral retrocopies. This is similar to the situation where new genes frequently originated in the promiscuous testis (Kaessmann 2010). Colocalized retrocopies are also residing in chromatin with activating epigenetic marks, are cytologically close to speckles, and extensively share regulatory elements with their progenitors, all of which may facilitate their exploitation of cellular transcription machinery and the exaptation of these seemingly neutral retroposed gene copies.

However, we are aware of some potential caveats should be considered in interpreting the population genetic analysis of retroCNVs. First, our search for retroCNVs is not exhaustive. Only 5 representative human populations were used to identify retroCNVs, and only 98 singletons were identified to represent potential de novo retroCNVs, which may reduce the statistical power of our analysis. Second, our approach ignored retroCNVs that are present in the reference genomes, but absent in some other individuals. However, such retroCNVs may represent a relatively small proportion of total retroduplication variants (Zhang et al. 2017). Third, we were unable to derive the ancestral state of retroCNVs. Some low-frequency retroCNVs may arise from losses of ancestral retrocopy insertions. Nevertheless, our investigation exemplifies a first yet valuable attempt to understand the evolutionary dynamics of retrocopies in the context of 3D genome organization. In the future, more thorough survey of retroCNVs at large scale, coupled with long-read sequencing (e.g. Feng and Li 2021; Troskie et al. 2021), should give us a deeper understanding of the link between 3D genome organization and the function and evolution of retrocopies.

In conclusion, we demonstrated that retrocopies and their parental genes displayed significant colocalization in the nucleus and characterized the spatial positioning and functional profiling of colocalized and noncolocalized retrocopies in great detail. Our results highlight the importance of 3D genome organization in the function and evolution of stripped-down retrocopies.

Materials and Methods

Retrocopy–Parent Pair Data Sets

The lists of retrocopies and their cognate parental genes of human (GRCh37), mouse (NCBIM37), chimpanzee (CHIMP2.1.4), and rhesus macaque (MMUL_1) were retrieved from the Supplementary Table S1 of Carelli et al. (2016). The retrocopy information of marmoset (C_jacchus3.2.1), cow (UMD3.1), and dog (CanFam3.1) was downloaded from RetroGeneDB2 (Rosikiewicz et al. 2017). Retrocopy–parent pairs that fit any of the following criteria were removed from further analyses: (i) a retrocopy or its parental gene was not located on autosomes or the X chromosome; (ii) a retrocopy and its parental gene were located on the same chromosome; and (iii) the information of parental gene of a retrocopy was inferred from other species (Carelli et al. 2016).

Identification of Colocalized Retrocopy–Parent Pairs Using Hi-C Data

The raw interchromosomal contact matrices at 250-kb resolution for the 7 human cell lines and 1 mouse cell line (CH12-LX) were downloaded from NCBI (GEO accession: GSE63525; Rao et al. 2014). These contact matrices were constructed using Hi-C read pairs that were uniquely mapped to the respective reference genomes (mapping quality ≥ 30). The contact matrices in hic format of mESC (4DNESDXUWBD9; mm10), chimpanzee (panTro6) and marmoset (calJac3) induced pluripotent stem cell (iPSC; GSE116862), and macaque cortex plate (GSE163177; rheMac8) were downloaded from 4DN portal or NCBI and were converted to FitHiC2 format using HiCExplorer v3.7.2 (Wolff et al. 2020). For cows and dogs, we downloaded the raw reads of ear skin Hi-C sequencing from NCBI (GSE167581) and constructed contact matrices using HiC-Pro v3.1.0 (Servant et al. 2015). Only contacts that were supported by high mapping quality read pairs (mapping quality ≥ 30) were retained for downstream analyses.

Significant interchromosomal interactions were identified using FitHiC2 v2.0.7 (Ay et al. 2014; Kaul et al. 2020). The Knight–Ruiz matrix balancing algorithm (Knight and Ruiz 2013) was applied to normalized Hi-C data during identification. Unlike intrachromosomal interactions where the observed contacts can be compared with the null distribution of contacts based on the linear distances between fragments to determine statistical significance, P-values of interchromosomal interactions in FitHiC2 were calculated using a uniform probability model (Duan et al. 2010). At the level of q-value (Benjamini–Hochberg-adjusted P-value) < 0.05, 5,152,588 significant interchromosomal interactions at 250-kb were identified. Due to the much shallow depth of Hi-C sequencing in the other cell lines and species, far fewer significant interactions at q < 0.05 were detected. In order to perform comparable analyses, we selected the top 5 million fragment pairs with lowest P-values as significant interactions for all samples.

The colocalization status between retrocopies and their parental genes was determined from the interaction of their residing genomic fragments using the findOverlaps function of the R package GenomicInteractions v1.30.0 (Harmston et al. 2015). We converted the genomic coordinates of retrocopies to that of Hi-C matrices using the UCSC liftover tool when there were genome assembly mismatches. If a retrocopy and its parental gene were overlapped respectively with 2 ends of an interchromosomal fragment pair that was identified as significant interaction, then they were denoted as “colocalized”; otherwise, they were “noncolocalized.” Visualization of significant interaction between retrocopies and parental genes was plotted with the R package circlize v0.4.14 (Gu et al. 2014).

Statistical Tests on Chromatin Spatial Colocalization of Retrocopy–Parent Pairs

To assess the statistical significance of interchromosomal interactions between retrocopies and parental genes, we performed 5 different simulations. (i) Random fragment pairs with matching chromosomes and sequence lengths as the retrocopy–parent pairs were generated to represent simulated chromatin pairs. (ii) The information of retrocopies was kept unchanged while sampling random genes (including coding and noncoding ones) with matching chromosomes as parental genes to form simulated chromatin pairs. (iii) The information of parental genes was kept unchanged while sampling random fragments with matching chromosomes and sequence lengths as retrocopies to be simulated chromatin pairs. (iv) Similar to the second simulation, but the randomly sampled genes were restricted to protein-coding genes only. (v) The pairing relationship between retrocopies and parental genes was randomly shuffled. For each sampling, the same number of chromatin pairs as true retrocopy–parent pairs was generated, and 1,000 samplings were carried out for each simulation. Colocalization of simulated chromatin pairs was determined using Hi-C data of respective cell lines, and the P-value was calculated as the fraction of simulated data sets that exhibited a colocalization frequency higher than that of true retrocopy–parent pairs (see also supplementary fig. S2, Supplementary Material online for graphic illustrations).

Analysis of SPRITE Data

The SPRITE interchromosomal interaction matrices of GM12878 and mESC were downloaded from NCBI (GSE114242). Similar to the original paper (Quinodoz et al. 2018), we applied FitHiC2 v2.0.7 (Ay et al. 2014; Kaul et al. 2020) to detect significant interactions using contact maps binned at 1 Mb resolution based on SPRITE clusters containing 2 to 1,000 reads without downweighting for cluster size.

Analysis of MERFISH and 2-Layer seqFISH+ Imaging-Based Data

The physical coordinates (x, y, and z) of 1,041 uniformly distributed genomic loci detected by MERFISH in IMR90 cells (Su et al. 2020) were downloaded from Zenodo (doi: 10.5281/zenodo.3928890). The spatial distance between any pair of loci was calculated as the Euclidean distance between their fitted 3D Gaussian centers using the scripts provided by the authors (https://github.com/ZhuangLab/Chromatin_Analysis_2020_cell). The median distance of a given pair of loci was calculated from ∼5,400 cells where both ends were detected. The proximity frequency between 2 loci was computed as the number of cells in which their spatial distance was less than 1 μm divided by the total number of detected cells. The coordinates of DNA spots were extended to 1.5 Mb before overlapping with retrocopies and parental genes.

For 2-layer seqFISH+ data in mESC (Takei et al. 2023), we downloaded the physical coordinates (x, y, and z) of 100,049 genomic loci at 25-kb resolution in 1,076 cells from Zenodo (doi: 10.5281/zenodo.7693825). The spatial distance between any pair of loci binned at 200-kb resolution was calculated using the scripts provided by the authors (https://github.com/CaiGroup/dna-seqfish-plus-multi-omics). The mean distance between any retrocopy–parent pair was obtained by overlapping the mean distance matrix at 200-kb resolution calculated across detected cells.

Genome-Wide Enrichment with Subcompartments, TADs, and SPIN States

Genome-wide annotations of subcompartments in GM12878 and TADs in the 7 human cell lines were downloaded from NCBI (GEO accession: GSE63525; Rao et al. 2014). Subcompartment annotations for HMEC, HUVEC, IMR90, and K562 were retrieved from https://cmu.box.com/s/n4jh3utmitzl88264s8bzsfcjhqnhaa0. In cell lines other than GM12878, subcompartment annotation was achieved through the SNIPER computational approach, in which interchromosomal interactions were imputed and subcompartments were inferred by denoising autoencoder and multilayer perceptron classifier using Hi-C data sets with moderate depth (Xiong and Ma 2019).

The annotation of 10 spatial compartmentalization states for K562 was downloaded from https://github.com/ma-compbio/SPIN.

Enrichment or depletion of retrocopies, parental genes, unprocessed pseudogenes, and retroCNV insertion sites in subcompartments, TAD, and SPIN states was computed using the Genomic Association Tester (GAT; Heger et al. 2013). Only mappable regions of the hg19 genome were included for the sampling of random regions to determine the enrichment or depletion of test genomic elements. Retrocopies and parental genes were divided into “colocalized” and “noncolocalized” ones based on Hi-C data of individual cell lines. To assess the impact of host genes, retrocopies were further divided into “intragenic” and “intergenic” subcategories, based on whether they overlap with protein-coding genes. Protein-coding genes and unprocessed pseudogenes (GENCODE release 39; lifted to GRCh37) were included for comparison. Note that “transcribed unprocessed pseudogenes” annotated by GENCODE were not included. For retroCNVs, we used the estimated coordinates of insertion sites for enrichment analysis.

Analysis of TSA-Seq and Related Data

Genomic tracks of SON, pSC35, lamin B, lamin A/C, and Pol II TSA-Seq signals, GRO-Seq signal, and speckle distance in bigwig format for the K562 cell line were downloaded from NCBI (GEO accession: GSE81553; Chen et al. 2018). The signal strength in 20 kb nonoverlapping bins were calculated using deepTools v3.5.1 (Ramírez et al. 2016) and intersected with retrocopies and parental genes using bedtools v2.30.0 (Quinlan and Hall 2010).

To assess the influence of speckle distance on colocalization, we carried out an additional simulation, in which the information of retrocopies was kept unchanged, while sampling random protein-coding genes with matching speckle distance (±0.01 μm away) as true parental genes to form simulated chromatin pairs. The P-value was calculated as the number of occurrences where the colocalization frequency of simulated pairs is higher than true retrocopy–parent pairs over 1,000 samplings.

RNA-Seq, TFBS, and Histone Modification Analyses

The genomic alignment files in bam format of RNA-Seq for GM12878, HUVEC, K562, and NHEK were downloaded from the ENCODE RNA Dashboard website (https://public-docs.crg.eu/rguigo/Data/jlagarde/encode_RNA_dashboard//hg19/). RNA-Seq data of mouse CH12-LX cell line was downloaded from ENCODE (ENCSR000AJV). Only uniquely mapped reads were kept for downstream analysis. The FPKM values of retrocopies and other genes were estimated using StringTie v2.2.1 (Pertea et al. 2016) and are averaged across replicates.

The position frequency matrices (PFMs) of 949 human TFs were downloaded from the JASPAR2022 database (https://jaspar.genereg.net; Castro-Mondragon et al. 2022), and FIMO v5.4.1 (Grant et al. 2011) was used to scan the occurrences of TFBSs in the flanking sequences (±3, ±1, or upstream 3 kb of putative TSSs) of retrocopies and parental genes. Only TFBSs with q < 0.01 were retained for further analysis. The transcripts per million (TPM) values of human genes and transcripts across 17,382 samples were downloaded from GTEx portal (release V8).

Genome-wide fold change signals of DNase I sensitivity, histone variant H2A.Z, and 10 histone modifications (H3K4me1, H3K4me2, H3K4me3, H3K9ac, H3K9me3, H3K27ac, H3K27me3, H3K36me3, H3K9me2, and H4K20me1) in bigwig format for GM12878, HMEC, HUVEC, IMR90, K562, and NHEK were obtained from the Roadmap Epigenomics Project (https://egg2.wustl.edu/roadmap/web_portal/). The signals of retrocopies and unprocessed pseudogenes and their flanking regions (±3 kb) were computed and visualized with deepTools v3.5.1 (Ramírez et al. 2016).

Logistic Regression and Evaluation of Feature Importance

For GM12878, HUVEC, K562, and NHEK, we built a logistic regression model to predict the colocalization status of retrocopy–parent pairs using 35 genetic and epigenetic features related to retrocopies and parental genes as predictors (supplementary table S2, Supplementary Material online). The model was trained using 70% randomly selected retrocopy–parent pairs, and the remaining 30% of data were used to evaluate the performance of trained model. The R package Boruta v7.0.0 (Kursa and Rudnicki 2010) was applied to assess the importance of individual features.

Coding Potential and Evolutionary Importance of Retrocopies

The RPKM values and ORF information of human and mouse genes inferred from ribosome profiling data were downloaded from RPFdb v2.0 (Wang et al. 2019). For human and mouse, 101 and 52 Ribo-Seq studies were included in the RPFdb database. The maximum RPKM of a given gene among all Ribo-Seq samples was used to compare the coding potential between colocalized and noncolocalized retrocopies, similar to Qian et al. (2022). The ORFs were detected with RibORF (Ji et al. 2015), which combines the 3-nt periodicity and uniformity of read distribution across codons of translated regions to identify translated ORFs from Ribo-Seq data.

The human 46-way phastCons scores and the mouse 60-way phastCons scores were downloaded from UCSC Genome Browser. For both species, the phastCons score of the placental mammal subset was used. For humans, we also downloaded the precomputed LINSIGHT score from http://compgen.cshl.edu/LINSIGHT/LINSIGHT.bw. LINSIGHT leverages both functional and population genomic data to detect deleterious noncoding variants. The average phastCons/LINSIGHT scores of retrocopies were computed with the bigWigAverageOverBed tool of UCSC.

A total of 516,945 GWAS SNPs associating with various diseases were downloaded from the NHGRI GWAS Catalog (Buniello et al. 2019). Duplicated recodes were removed, leaving 254,286 unique SNPs for further analysis.

Identification and Population Genetic Analysis of retroCNVs

To investigate the insertion bias and evolutionary forces of newly inserted retrocopies in relation to interchromosomal interactions, we used sideRETRO v1.1.3 (Miller et al. 2021) to identify retroCNVs using whole genome sequencing (WGS) data from the 1000 Genomes Project. sideRETRO utilize abnormal alignments (i.e. discordantly aligned paired-end reads and split reads) in WGS or whole exome sequencing (WES) to identify unfixed retrocopies absent in the reference genome, but present in the sequenced individual. Considering the computation complexity, only the WGS mapping files of 491 individuals from 5 diverse populations (CEU, CHS, LWK, PEL, and YRI) were used. These individuals were sequenced to a high depth (30×) and were good resources for detecting retroCNVs.

The initial list of retroCNVs was further filtered with the following criteria: (i) retroCNVs that were not located on autosomes and the X chromosome were removed; (ii) if the insertion site of a retroCNV and its parental gene were on the same chromosome, they were excluded from further analysis; and (iii) retroCNVs with a missing rate greater than 20% were discarded.

To detect possible positive selection on retroCNVs, we calculated 3 population genetic statistics for each retroCNV that had a frequency ≥ 50% in any of the 5 populations: (i) the average nucleotide diversity (π) of 100 kb flanking regions centered at the insertion sites of retroCNV carriers over that of noncarriers ( πratio; Schrider et al. 2013); (ii) average Tajima's D of 100 kb flanking regions centered at the insertion sites of retroCNV carriers (Llopart et al. 2002); and (iii) average LD (r2) of 10 kb flanking regions centered at the insertion sites of retroCNV carriers (Cardoso-Moreira et al. 2016). These statistics were computed for individual populations using matching polymorphism data from the 1000 Genomes Project. In order to evaluate the statistical significance of the estimates, we randomly generated “pseudo” insertion site in the genome while keeping the same setup of carriers and noncarriers as the focal retroCNV. A total of 200 samplings were carried out, and the Wilcoxon rank-sum test was used to determine how likely to get a smaller πratio, a smaller Tajima's D, or a greater LD from “pseudo” insertion sites as that from a true retroCNV insertion (Zhang and Tautz 2022).

Supplementary Material

Supplementary material is available at Molecular Biology and Evolution online.

Acknowledgments

We thank members of the Yang laboratory for their discussion. We are grateful to 2 anonymous reviewers for their invaluable suggestions.

Author Contributions

R.Y. conceived and supervised the project. Y.Y. and R.Y. designed the study. Y.Y., Y.T., Z.W., and K.Z. performed data analysis. Y.Y. wrote the manuscript with input from R.Y. and Y.T. All authors reviewed and approved the manuscript.

Funding

This work was supported by the Fund of Northwest A&F University (Z111021404) and the “100-Talent Program” of Shaanxi Province of China (A289021612) to R.Y.

Data Availability

No new data were generated or analyzed in support of this research. The custom written scripts used in this study are available at GitHub (https://github.com/yanyubin/Retrocopy_Evolution). The sources of public data and their corresponding genome assembly versions are summarized in supplementary table S7, Supplementary Material online.

References

Abyzov
 
A
,
Iskow
 
R
,
Gokcumen
 
O
,
Radke
 
DW
,
Balasubramanian
 
S
,
Pei
 
B
,
Habegger
 
L
;
1000 Genomes Project Consortium
,
Lee
 
C
,
Gerstein
 
M
.
Analysis of variable retroduplications in human populations suggests coupling of retrotransposition to cell division
.
Genome Res
.
2013
:
23
(
12
):
2042
2052
. https://doi.org/10.1101/gr.154625.113.

Amici
 
DR
,
Cingoz
 
H
,
Alasady
 
MJ
,
Alhayek
 
S
,
Phoumyvong
 
CM
,
Sahni
 
N
,
Yi
 
SS
,
Mendillo
 
ML
.
The HAPSTR2 retrogene buffers stress signaling and resilience in mammals
.
Nat Commun
.
2023
:
14
(
1
):
152
. https://doi.org/10.1038/s41467-022-35697-1.

Auton
 
A
,
Abecasis
 
GR
,
Altshuler
 
DM
,
Durbin
 
RM
,
Abecasis
 
GR
,
Bentley
 
DR
,
Chakravarti
 
A
,
Clark
 
AG
,
Donnelly
 
P
,
Eichler
 
EE
, et al.  
A global reference for human genetic variation
.
Nature
 
2015
:
526
(
7571
):
68
74
. https://doi.org/10.1038/nature15393.

Ay
 
F
,
Bailey
 
TL
,
Noble
 
WS
.
Statistical confidence estimation for Hi-C data reveals regulatory chromatin contacts
.
Genome Res
.
2014
:
24
(
6
):
999
1011
. https://doi.org/10.1101/gr.160374.113.

Bashkirova
 
E
,
Lomvardas
 
S
.
Olfactory receptor genes make the case for inter-chromosomal interactions
.
Current Opin Genet Dev
.
2019
:
55
:
106
113
. https://doi.org/10.1016/j.gde.2019.07.004.

Batcher
 
K
,
Varney
 
S
,
York
 
D
,
Blacksmith
 
M
,
Kidd
 
JM
,
Rebhun
 
R
,
Dickinson
 
P
,
Bannasch
 
D
.
Recent, full-length gene retrocopies are common in canids
.
Genome Res
.
2022
:
32
(
8
):
1602
1611
. https://doi.org/10.1101/gr.276828.122.

Bertero
 
A
.
RNA biogenesis instructs functional inter-chromosomal genome architecture
.
Front Genet
.
2021
:
12
:
645863
. https://doi.org/10.3389/fgene.2021.645863.

Bonev
 
B
,
Cavalli
 
G
.
Organization and function of the 3D genome
.
Nat Rev Genet
.
2016
:
17
(
11
):
661
678
. https://doi.org/10.1038/nrg.2016.112.

Bonev
 
B
,
Mendelson Cohen
 
N
,
Szabo
 
Q
,
Fritsch
 
L
,
Papadopoulos
 
GL
,
Lubling
 
Y
,
Xu
 
X
,
Lv
 
X
,
Hugnot
 
JP
,
Tanay
 
A
, et al.  
Multiscale 3D genome rewiring during mouse neural development
.
Cell
 
2017
:
171
(
3
):
557
572
. https://doi.org/10.1016/j.cell.2017.09.043.

Bouwman
 
BAM
,
Crosetto
 
N
,
Bienko
 
M
.
RNA gradients: shapers of 3D genome architecture
.
Curr Opin Cell Biol
.
2022
:
74
:
7
12
. https://doi.org/10.1016/j.ceb.2021.12.001.

Buniello
 
A
,
MacArthur
 
JAL
,
Cerezo
 
M
,
Harris
 
LW
,
Hayhurst
 
J
,
Malangone
 
C
,
McMahon
 
A
,
Morales
 
J
,
Mountjoy
 
E
,
Sollis
 
E
, et al.  
The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019
.
Nucleic Acids Res
.
2019
:
47
(
D1
):
D1005
D1012
. https://doi.org/10.1093/nar/gky1120.

Cardoso-Moreira
 
M
,
Arguello
 
JR
,
Gottipati
 
S
,
Harshman
 
LG
,
Grenier
 
JK
,
Clark
 
AG
.
Evidence for the fixation of gene duplications by positive selection in Drosophila
.
Genome Res
.
2016
:
26
(
6
):
787
798
. https://doi.org/10.1101/gr.199323.115.

Carelli
 
FN
,
Hayakawa
 
T
,
Go
 
Y
,
Imai
 
H
,
Warnefors
 
M
,
Kaessmann
 
H
.
The life history of retrocopies illuminates the evolution of new mammalian genes
.
Genome Res
.
2016
:
26
(
3
):
301
314
. https://doi.org/10.1101/gr.198473.115.

Casola
 
C
,
Betran
 
E
.
The genomic impact of gene retrocopies: what have we learned from comparative genomics, population genomics, and transcriptomic analyses?
 
Genome Biol Evol
.
2017
:
9
(
6
):
1351
1373
. https://doi.org/10.1093/gbe/evx081.

Castro-Mondragon
 
JA
,
Riudavets-Puig
 
R
,
Rauluseviciute
 
I
,
Berhanu Lemma
 
R
,
Turchi
 
L
,
Blanc-Mathieu
 
R
,
Lucas
 
J
,
Boddie
 
P
,
Khan
 
A
,
Manosalva Perez
 
N
, et al.  
JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles
.
Nucleic Acids Res
.
2022
:
50
(
D1
):
D165
D173
. https://doi.org/10.1093/nar/gkab1113.

Chen
 
Y
,
Belmont
 
AS
.
Genome organization around nuclear speckles
.
Curr Opin Genet Dev
.
2019
:
55
:
91
99
. https://doi.org/10.1016/j.gde.2019.06.008.

Chen
 
S
,
Krinsky
 
BH
,
Long
 
M
.
New genes as drivers of phenotypic evolution
.
Nat Rev Genet
.
2013
:
14
(
9
):
645
660
. https://doi.org/10.1038/nrg3521.

Chen
 
Y
,
Zhang
 
Y
,
Wang
 
Y
,
Zhang
 
L
,
Brinkman
 
EK
,
Adam
 
SA
,
Goldman
 
R
,
van Steensel
 
B
,
Ma
 
J
,
Belmont
 
AS
.
Mapping 3D genome organization relative to nuclear compartments using TSA-Seq as a cytological ruler
.
J Cell Biol
.
2018
:
217
(
11
):
4025
4048
. https://doi.org/10.1083/jcb.201807108.

Ciomborowska-Basheer
 
J
,
Staszak
 
K
,
Kubiak
 
MR
,
Makalowska
 
I
.
Not so dead genes—retrocopies as regulators of their disease-related progenitors and hosts
.
Cells
 
2021
:
10
(
4
):
912
. https://doi.org/10.3390/cells10040912.

Dai
 
Z
,
Xiong
 
Y
,
Dai
 
X
.
Neighboring genes show interchromosomal colocalization after their separation
.
Mol Biol Evol
.
2014
:
31
(
5
):
1166
1172
. https://doi.org/10.1093/molbev/msu065.

Djebali
 
S
,
Davis
 
CA
,
Merkel
 
A
,
Dobin
 
A
,
Lassmann
 
T
,
Mortazavi
 
A
,
Tanzer
 
A
,
Lagarde
 
J
,
Lin
 
W
,
Schlesinger
 
F
, et al.  
Landscape of transcription in human cells
.
Nature
 
2012
:
489
(
7414
):
101
108
. https://doi.org/10.1038/nature11233.

Duan
 
Z
,
Andronescu
 
M
,
Schutz
 
K
,
McIlwain
 
S
,
Kim
 
YJ
,
Lee
 
C
,
Shendure
 
J
,
Fields
 
S
,
Blau
 
CA
,
Noble
 
WS
.
A three-dimensional model of the yeast genome
.
Nature
 
2010
:
465
(
7296
):
363
367
. https://doi.org/10.1038/nature08973.

Esnault
 
C
,
Maestre
 
J
,
Heidmann
 
T
.
Human LINE retrotransposons generate processed pseudogenes
.
Nat Genet
.
2000
:
24
(
4
):
363
367
. https://doi.org/10.1038/74184.

Ewing
 
AD
,
Ballinger
 
TJ
,
Earl
 
D
,
Harris
 
CC
,
Ding
 
L
,
Wilson
 
RK
,
Haussler
 
D
;
Broad Institute Genome Sequencing and Analysis Program and Platform
,
Harris
 
CC
,
Ding
 
L
, et al.  
Retrotransposition of gene transcripts leads to structural variation in mammalian genomes
.
Genome Biol
.
2013
:
14
(
3
):
R22
. https://doi.org/10.1186/gb-2013-14-3-r22.

Feng
 
X
,
Li
 
H
.
Higher rates of processed pseudogene acquisition in humans and three great apes revealed by long read assemblies
.
Mol Biol Evol
.
2021
:
38
(
7
):
2958
2966
. https://doi.org/10.1093/molbev/msab062.

Flasch
 
DA
,
Macia
 
A
,
Sanchez
 
L
,
Ljungman
 
M
,
Heras
 
SR
,
García-Pérez
 
JL
,
Wilson
 
TE
,
Moran
 
JV
.
Genome-wide de novo L1 retrotransposition connects endonuclease activity with replication
.
Cell
 
2019
:
177
(
4
):
837
851
. https://doi.org/10.1016/j.cell.2019.02.050.

Galganski
 
L
,
Urbanek
 
MO
,
Krzyzosiak
 
WJ
.
Nuclear speckles: molecular organization, biological function and role in disease
.
Nucleic Acids Res
.
2017
:
45
(
18
):
10350
10368
. https://doi.org/10.1093/nar/gkx759.

Gossmann
 
TI
,
Woolfit
 
M
,
Eyre-Walker
 
A
.
Quantifying the variation in the effective population size within a genome
.
Genetics
 
2011
:
189
(
4
):
1389
1402
. https://doi.org/10.1534/genetics.111.132654.

Grant
 
CE
,
Bailey
 
TL
,
Noble
 
WS
.
FIMO: scanning for occurrences of a given motif
.
Bioinform
.
2011
:
27
(
7
):
1017
1018
. https://doi.org/10.1093/bioinformatics/btr064.

GTEx Consortium
.
The GTEx Consortium atlas of genetic regulatory effects across human tissues
.
Science
 
2020
:
369
(
6509
):
1318
1330
. https://doi.org/10.1126/science.aaz1776.

Gu
 
Z
,
Gu
 
L
,
Eils
 
R
,
Schlesner
 
M
,
Brors
 
B
.
Circlize implements and enhances circular visualization in R
.
Bioinform
.
2014
:
30
(
19
):
2811
2812
. https://doi.org/10.1093/bioinformatics/btu393.

Harmston
 
N
,
Ing-Simmons
 
E
,
Perry
 
M
,
Barešić
 
A
,
Lenhard
 
B
.
GenomicInteractions: an R/Bioconductor package for manipulating and investigating chromatin interaction data
.
BMC Genom
.
2015
:
16
(
1
):
963
. https://doi.org/10.1186/s12864-015-2140-x.

Heger
 
A
,
Webber
 
C
,
Goodson
 
M
,
Ponting
 
CP
,
Lunter
 
G
.
GAT: a simulation framework for testing the association of genomic intervals
.
Bioinform
.
2013
:
29
(
16
):
2046
2048
. https://doi.org/10.1093/bioinformatics/btt343.

Huang
 
YF
,
Gulko
 
B
,
Siepel
 
A
.
Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data
.
Nat Genet
.
2017
:
49
(
4
):
618
624
. https://doi.org/10.1038/ng.3810.

Ibn-Salem
 
J
,
Muro
 
EM
,
Andrade-Navarro
 
MA
.
Co-regulation of paralog genes in the three-dimensional chromatin architecture
.
Nucleic Acids Res
.
2017
:
45
(
1
):
81
91
. https://doi.org/10.1093/nar/gkw813.

Ji
 
Z
,
Song
 
R
,
Regev
 
A
,
Struhl
 
K
.
Many lncRNAs, 5′UTRs, and pseudogenes are translated and some are likely to express functional proteins
.
Elife
 
2015
:
4
:
e08890
. https://doi.org/10.7554/eLife.08890.

Joo
 
J
,
Cho
 
S
,
Hong
 
S
,
Min
 
S
,
Kim
 
K
,
Kumar
 
R
,
Choi
 
JM
,
Shin
 
Y
,
Jung
 
I
.
Probabilistic establishment of speckle-associated inter-chromosomal interactions
.
Nucleic Acids Res
.
2023
:
51
(
11
):
5377
5395
. https://doi.org/10.1093/nar/gkad211.

Kaessmann
 
H
.
Origins, evolution, and phenotypic impact of new genes
.
Genome Res
.
2010
:
20
(
10
):
1313
1326
. https://doi.org/10.1101/gr.101386.109.

Kaessmann
 
H
,
Vinckenbosch
 
N
,
Long
 
M
.
RNA-based gene duplication: mechanistic and evolutionary insights
.
Nat Rev Genet
.
2009
:
10
(
1
):
19
31
. https://doi.org/10.1038/nrg2487.

Kaul
 
A
,
Bhattacharyya
 
S
,
Ay
 
F
.
Identifying statistically significant chromatin contacts from Hi-C data with FitHiC2
.
Nat Protoc
.
2020
:
15
(
3
):
991
1012
. https://doi.org/10.1038/s41596-019-0273-0.

Kim
 
S
,
Dunham
 
MJ
,
Shendure
 
J
.
A combination of transcription factors mediates inducible interchromosomal contacts
.
Elife
 
2019
:
8
:
e42499
. https://doi.org/10.7554/eLife.42499.

Kim
 
MS
,
Pinto
 
SM
,
Getnet
 
D
,
Nirujogi
 
RS
,
Manda
 
SS
,
Chaerkady
 
R
,
Madugundu
 
AK
,
Kelkar
 
DS
,
Isserlin
 
R
,
Jain
 
S
, et al.  
A draft map of the human proteome
.
Nature
 
2014
:
509
(
7502
):
575
581
. https://doi.org/10.1038/nature13302.

Kim
 
S
,
Shendure
 
J
.
Mechanisms of interplay between transcription factors and the 3D genome
.
Mol Cell
.
2019
:
76
(
2
):
306
319
. https://doi.org/10.1016/j.molcel.2019.08.010.

Kim
 
J
,
Venkata
 
NC
,
Hernandez Gonzalez
 
GA
,
Khanna
 
N
,
Belmont
 
AS
.
Gene expression amplification by nuclear speckle association
.
J Cell Biol
.
2020
:
219
(
1
):
e201904046
. https://doi.org/10.1083/jcb.201904046.

Knight
 
PA
,
Ruiz
 
D
.
A fast algorithm for matrix balancing
.
IMA J Numer Anal
.
2013
:
33
(
3
):
1029
1047
. https://doi.org/10.1093/imanum/drs019.

Kubiak
 
MR
,
Makałowska
 
I
.
Protein-coding genes’ retrocopies and their functions
.
Viruses
 
2017
:
9
(
4
):
80
. https://doi.org/10.3390/v9040080.

Kursa
 
MB
,
Rudnicki
 
WR
.
Feature selection with the Boruta package
.
J Stat Softw
.
2010
:
36
(
11
):
1
13
. https://doi.org/10.18637/jss.v036.i11.

Lei
 
X
,
Tian
 
X
,
Wang
 
H
,
Xu
 
X
,
Li
 
G
,
Liu
 
W
,
Wang
 
D
,
Xiao
 
Z
,
Zhang
 
M
,
Li
 
MJ
, et al.  
Noncoding SNP at rs1663689 represses ADGRG6 via interchromosomal interaction and reduces lung cancer progression
.
EMBO Rep
.
2023
:
24
(
7
):
e56212
. https://doi.org/10.15252/embr.202256212.

Li
 
D
,
He
 
M
,
Tang
 
Q
,
Tian
 
S
,
Zhang
 
J
,
Li
 
Y
,
Wang
 
D
,
Jin
 
L
,
Ning
 
C
,
Zhu
 
W
, et al.  
Comparative 3D genome architecture in vertebrates
.
BMC Biol
.
2022
:
20
(
1
):
99
. https://doi.org/10.1186/s12915-022-01301-7.

Liu
 
Z
,
Chen
 
Y
,
Xia
 
Q
,
Liu
 
M
,
Xu
 
H
,
Chi
 
Y
,
Deng
 
Y
,
Xing
 
D
.
Linking genome structures to functions by simultaneous single-cell Hi-C and RNA-Seq
.
Science
.
2023
:
380
(
6649
):
1070
1076
. https://doi.org/10.1126/science.adg3797.

Llopart
 
A
,
Comeron
 
JM
,
Brunet
 
FG
,
Lachaise
 
D
,
Long
 
M
.
Intron presence-absence polymorphism in Drosophila driven by positive Darwinian selection
.
Proc Natl Acad Sci U S A
.
2002
:
99
(
12
):
8121
8126
. https://doi.org/10.1073/pnas.122570299.

Luo
 
X
,
Liu
 
Y
,
Dang
 
D
,
Hu
 
T
,
Hou
 
Y
,
Meng
 
X
,
Zhang
 
F
,
Li
 
T
,
Wang
 
C
,
Li
 
M
, et al.  
3D genome of macaque fetal brain reveals evolutionary innovations during primate corticogenesis
.
Cell
 
2021
:
184
(
3
):
723
740
. https://doi.org/10.1016/j.cell.2021.01.001.

Maass
 
PG
,
Barutcu
 
AR
,
Rinn
 
JL
.
Interchromosomal interactions: a genomic love story of kissing chromosomes
.
J Cell Biol
.
2019
:
218
(
1
):
27
38
. https://doi.org/10.1083/jcb.201806052.

Maass
 
PG
,
Barutcu
 
AR
,
Weiner
 
CL
,
Rinn
 
JL
.
Inter-chromosomal contact properties in live-cell imaging and in Hi-C
.
Mol Cell
.
2018
:
69
(
6
):
1039
1045
. https://doi.org/10.1016/j.molcel.2018.02.007.

Machado
 
JP
,
Antunes
 
A
.
The genomic context of retrocopies increases their chance of functional relevancy in mammals
.
Genomics
 
2020
:
112
(
3
):
2410
2417
. https://doi.org/10.1016/j.ygeno.2020.01.013.

Melé
 
M
,
Rinn
 
JL
.
“Cat's cradling” the 3D genome by the act of LncRNA transcription
.
Mol Cell
.
2016
:
62
(
5
):
657
664
. https://doi.org/10.1016/j.molcel.2016.05.011.

Mighell
 
AJ
,
Smith
 
NR
,
Robinson
 
PA
,
Markham
 
AF
.
Vertebrate pseudogenes
.
Febs Lett
.
2000
:
468
(
2-3
):
109
114
. https://doi.org/10.1016/S0014-5793(00)01199-6.

Miller
 
TLA
,
Orpinelli Rego
 
F
,
Buzzo
 
JLL
,
Galante
 
PAF
.
sideRETRO: a pipeline for identifying somatic and polymorphic insertions of processed pseudogenes or retrocopies
.
Bioinform
.
2021
:
37
(
3
):
419
421
. https://doi.org/10.1093/bioinformatics/btaa689.

Moon
 
BS
,
Huang
 
D
,
Gao
 
F
,
Cai
 
M
,
Lyu
 
G
,
Zhang
 
L
,
Chen
 
J
,
Lu
 
W
.
Long range inter-chromosomal interaction of Oct4 distal enhancer loci regulates ESCs pluripotency
.
Cell Death Discov
.
2023
:
9
(
1
):
61
. https://doi.org/10.1038/s41420-023-01363-8.

Navarro
 
FCP
,
Galante
 
PAF
.
A genome-wide landscape of retrocopies in primate genomes
.
Genome Biol Evol
.
2015
:
7
(
8
):
2265
2275
. https://doi.org/10.1093/gbe/evv142.

Ohno
 
S
.
Evolution by gene duplication
.
Berlin, New York
:
Springer-Verlag
;
1970
.

Patel
 
B
,
Kang
 
Y
,
Cui
 
K
,
Litt
 
M
,
Riberio
 
MS
,
Deng
 
C
,
Salz
 
T
,
Casada
 
S
,
Fu
 
X
,
Qiu
 
Y
, et al.  
Aberrant TAL1 activation is mediated by an interchromosomal interaction in human T-cell acute lymphoblastic leukemia
.
Leukemia
 
2014
:
28
(
2
):
349
361
. https://doi.org/10.1038/leu.2013.158.

Pertea
 
M
,
Kim
 
D
,
Pertea
 
GM
,
Leek
 
JT
,
Salzberg
 
SL
.
Transcript-level expression analysis of RNA-Seq experiments with HISAT, StringTie and Ballgown
.
Nat Protoc
.
2016
:
11
(
9
):
1650
1667
. https://doi.org/10.1038/nprot.2016.095.

Podlaha
 
O
,
Zhang
 
J
.
Processed pseudogenes: the ‘fossilized footprints’ of past gene expression
.
Trends Genet
.
2009
:
25
(
10
):
429
434
. https://doi.org/10.1016/j.tig.2009.09.002.

Qian
 
SH
,
Chen
 
L
,
Xiong
 
YL
,
Chen
 
ZX
.
Evolution and function of developmentally dynamic pseudogenes in mammals
.
Genome Biol
.
2022
:
23
(
1
):
235
. https://doi.org/10.1186/s13059-022-02802-y.

Quinlan
 
AR
,
Hall
 
IM
.
BEDTools: a flexible suite of utilities for comparing genomic features
.
Bioinform
.
2010
:
26
(
6
):
841
842
. https://doi.org/10.1093/bioinformatics/btq033.

Quinodoz
 
SA
,
Guttman
 
M
.
Essential roles for RNA in shaping nuclear organization
.
Cold Spring Harb Perspect Biol
.
2022
:
14
:
a039719
. https://doi.org/10.1101/cshperspect.a039719.

Quinodoz
 
SA
,
Jachowicz
 
JW
,
Bhat
 
P
,
Ollikainen
 
N
,
Banerjee
 
AK
,
Goronzy
 
IN
,
Blanco
 
MR
,
Chovanec
 
P
,
Chow
 
A
,
Markaki
 
Y
, et al.  
RNA promotes the formation of spatial compartments in the nucleus
.
Cell
 
2021
:
184
(
23
):
5775
5790
. https://doi.org/10.1016/j.cell.2021.10.014.

Quinodoz
 
SA
,
Ollikainen
 
N
,
Tabak
 
B
,
Palla
 
A
,
Schmidt
 
JM
,
Detmar
 
E
,
Lai
 
MM
,
Shishkin
 
AA
,
Bhat
 
P
,
Takei
 
Y
, et al.  
Higher-order inter-chromosomal hubs shape 3D genome organization in the nucleus
.
Cell
 
2018
:
174
(
3
):
744
757
. https://doi.org/10.1016/j.cell.2018.05.024.

Ramírez
 
F
,
Ryan
 
DP
,
Grüning
 
B
,
Bhardwaj
 
V
,
Kilpert
 
F
,
Richter
 
AS
,
Heyne
 
S
,
Dündar
 
F
,
Manke
 
T
.
deepTools2: a next generation web server for deep-sequencing data analysis
.
Nucleic Acids Res
.
2016
:
44
(
W1
):
W160
W165
. https://doi.org/10.1093/nar/gkw257.

Rao
 
SSP
,
Huntley
 
MH
,
Durand
 
NC
,
Stamenova
 
EK
,
Bochkov
 
ID
,
Robinson
 
JT
,
Sanborn
 
AL
,
Machol
 
I
,
Omer
 
AD
,
Lander
 
ES
, et al.  
A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping
.
Cell
 
2014
:
159
(
7
):
1665
1680
. https://doi.org/10.1016/j.cell.2014.11.021.

Rosikiewicz
 
W
,
Kabza
 
M
,
Kosinski
 
JG
,
Ciomborowska-Basheer
 
J
,
Kubiak
 
MR
,
Makalowska
 
I
.
RetrogeneDB—a database of plant and animal retrocopies
.
Database
 
2017
:
2017
:
bax038
. https://doi.org/10.1093/database/bax038.

Schrider
 
DR
,
Navarro
 
FCP
,
Galante
 
PAF
,
Parmigiani
 
RB
,
Camargo
 
AA
,
Hahn
 
MW
,
de Souza
 
SJ
.
Gene copy-number polymorphism caused by retrotransposition in humans
.
PLoS Genet
.
2013
:
9
(
1
):
e1003242
. https://doi.org/10.1371/journal.pgen.1003242.

Servant
 
N
,
Varoquaux
 
N
,
Lajoie
 
BR
,
Viara
 
E
,
Chen
 
CJ
,
Vert
 
JP
,
Heard
 
E
,
Dekker
 
J
,
Barillot
 
E
.
HiC-Pro: an optimized and flexible pipeline for Hi-C data processing
.
Genome Biol
.
2015
:
16
(
1
):
259
. https://doi.org/10.1186/s13059-015-0831-x.

Sisu
 
C
,
Pei
 
B
,
Leng
 
J
,
Frankish
 
A
,
Zhang
 
Y
,
Balasubramanian
 
S
,
Harte
 
R
,
Wang
 
D
,
Rutenberg-Schoenberg
 
M
,
Clark
 
W
, et al.  
Comparative analysis of pseudogenes across three phyla
.
Proc Natl Acad Sci U S A
.
2014
:
111
(
37
):
13361
13366
. https://doi.org/10.1073/pnas.1407293111.

Spector
 
DL
,
Lamond
 
AI
.
Nuclear speckles
.
Cold Spring Harb Perspect Biol
.
2011
:
3
(
2
):
a000646
. https://doi.org/10.1101/cshperspect.a000646.

Su
 
JH
,
Zheng
 
P
,
Kinrot
 
SS
,
Bintu
 
B
,
Zhuang
 
X
.
Genome-scale imaging of the 3D organization and transcriptional activity of chromatin
.
Cell
 
2020
:
182
(
6
):
1641
1659
. https://doi.org/10.1016/j.cell.2020.07.032.

Sultana
 
T
,
van Essen
 
D
,
Siol
 
O
,
Bailly-Bechet
 
M
,
Philippe
 
C
,
Zine El Aabidine
 
A
,
Pioger
 
L
,
Nigumann
 
P
,
Saccani
 
S
,
Andrau
 
JC
, et al.  
The landscape of L1 retrotransposons in the human genome is shaped by pre-insertion sequence biases and post-insertion selection
.
Mol Cell
.
2019
:
74
(
3
):
555
570
. https://doi.org/10.1016/j.molcel.2019.02.036.

Takei
 
Y
,
Yang
 
Y
,
White
 
J
,
Yun
 
J
,
Prasad
 
M
,
Ombelets
 
L
,
Schindler
 
S
,
Cai
 
L
.
High-resolution spatial multi-omics reveals cell-type specific nuclear compartments. bioRxiv. 2023. https://doi.org/10.1101/2023.05.07.539762
.

Tan
 
S
,
Cardoso-Moreira
 
M
,
Shi
 
W
,
Zhang
 
D
,
Huang
 
J
,
Mao
 
Y
,
Jia
 
H
,
Zhang
 
Y
,
Chen
 
C
,
Shao
 
Y
, et al.  
LTR-mediated retroposition as a mechanism of RNA-based duplication in metazoans
.
Genome Res
.
2016
:
26
(
12
):
1663
1675
. https://doi.org/10.1101/gr.204925.116.

Troskie
 
RL
,
Jafrani
 
Y
,
Mercer
 
TR
,
Ewing
 
AD
,
Faulkner
 
GJ
,
Cheetham
 
SW
.
Long-read cDNA sequencing identifies functional pseudogenes in the human transcriptome
.
Genome Biol
.
2021
:
22
(
1
):
146
. https://doi.org/10.1186/s13059-021-02369-0.

van Steensel
 
B
,
Furlong
 
EEM
.
The role of transcription in shaping the spatial organization of the genome
.
Nat Rev Mol Cell Biol
.
2019
:
20
(
6
):
327
337
. https://doi.org/10.1038/s41580-019-0114-6.

Vinckenbosch
 
N
,
Dupanloup
 
I
,
Kaessmann
 
H
.
Evolutionary fate of retroposed gene copies in the human genome
.
Proc Natl Acad Sci U S A
.
2006
:
103
(
9
):
3220
3225
. https://doi.org/10.1073/pnas.0511307103.

Vouzas
 
AE
,
Gilbert
 
DM
.
Mammalian DNA replication timing
.
Cold Spring Harb Perspect Biol
.
2021
:
13
(
7
):
a040162
. https://doi.org/10.1101/cshperspect.a040162.

Wang
 
H
,
Yang
 
L
,
Wang
 
Y
,
Chen
 
L
,
Li
 
H
,
Xie
 
Z
.
RPFdb v2.0: an updated database for genome-wide information of translated mRNA generated from ribosome profiling
.
Nucleic Acids Res
.
2019
:
47
(
D1
):
D230
D234
. https://doi.org/10.1093/nar/gky978.

Wang
 
Y
,
Zhang
 
Y
,
Zhang
 
R
,
van Schaik
 
T
,
Zhang
 
L
,
Sasaki
 
T
,
Peric-Hupkes
 
D
,
Chen
 
Y
,
Gilbert
 
DM
,
van Steensel
 
B
, et al.  
SPIN reveals genome-wide landscape of nuclear compartmentalization
.
Genome Biol
.
2021
:
22
(
1
):
36
. https://doi.org/10.1186/s13059-020-02253-3.

Wolff
 
J
,
Rabbani
 
L
,
Gilsbach
 
R
,
Richard
 
G
,
Manke
 
T
,
Backofen
 
R
,
Grüning
 
BA
.
Galaxy HiCExplorer 3: a web server for reproducible Hi-C, capture Hi-C and single-cell Hi-C data analysis, quality control and visualization
.
Nucleic Acids Res
.
2020
:
48
(
W1
):
W177
W184
. https://doi.org/10.1093/nar/gkaa220.

Xie
 
T
,
Yang
 
QY
,
Wang
 
XT
,
McLysaght
 
A
,
Zhang
 
HY
.
Spatial colocalization of human ohnolog pairs acts to maintain dosage-balance
.
Mol Biol Evol
.
2016
:
33
(
9
):
2368
2375
. https://doi.org/10.1093/molbev/msw108.

Xiong
 
K
,
Ma
 
J
.
Revealing Hi-C subcompartments by imputing inter-chromosomal chromatin interactions
.
Nat Commun
.
2019
:
10
(
1
):
5069
. https://doi.org/10.1038/s41467-019-12954-4.

Zhang
 
J
.
Evolution by gene duplication: an update
.
Trends Ecol Evol
.
2003
:
18
(
6
):
292
298
. https://doi.org/10.1016/S0169-5347(03)00033-8.

Zhang
 
Z
,
Harrison
 
PM
,
Liu
 
Y
,
Gerstein
 
M
.
Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome
.
Genome Res
.
2003
:
13
(
12
):
2541
2558
. https://doi.org/10.1101/gr.1429003.

Zhang
 
Y
,
Li
 
S
,
Abyzov
 
A
,
Gerstein
 
MB
.
Landscape and variation of novel retroduplications in 26 human populations
.
PLoS Comput Biol
.
2017
:
13
(
6
):
e1005567
. https://doi.org/10.1371/journal.pcbi.1005567.

Zhang
 
Y
,
Li
 
T
,
Preissl
 
S
,
Amaral
 
ML
,
Grinstein
 
JD
,
Farah
 
EN
,
Destici
 
E
,
Qiu
 
Y
,
Hu
 
R
,
Lee
 
AY
, et al.  
Transcriptionally active HERV-H retrotransposons demarcate topologically associating domains in human pluripotent stem cells
.
Nat Genet
.
2019
:
51
(
9
):
1380
1388
. https://doi.org/10.1038/s41588-019-0479-7.

Zhang
 
W
,
Tautz
 
D
.
Tracing the origin and evolutionary fate of recent gene retrocopies in natural populations of the house mouse
.
Mol Biol Evol
.
2022
:
39
(
2
):
msab360
. https://doi.org/10.1093/molbev/msab360.

Zhang
 
W
,
Xie
 
C
,
Ullrich
 
K
,
Zhang
 
YE
,
Tautz
 
D
.
The mutational load in natural populations is significantly affected by high primary rates of retroposition
.
Proc Natl Acad Sci U S A
.
2021
:
118
(
6
):
e2013043118
. https://doi.org/10.1073/pnas.2013043118.

This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
Associate Editor: Katja Nowick
Katja Nowick
Associate Editor
Search for other works by this author on:

Supplementary data