Shuanghui Chen, Chang Lei, Xiaohan Zhao, Yuwen Pan, Dongsheng Lu, Shuhua Xu, AncestryPainter 2.0: Visualizing Ancestry Composition and Admixture History Graph, Genome Biology and Evolution, Volume 16, Issue 11, November 2024, evae249, https://doi.org/10.1093/gbe/evae249
Abstract
The earlier version of AncestryPainter is a Perl program that displays the ancestry composition of numerous individuals using a circular graph. Motivated by user requests arising from practical applications, we updated AncestryPainter to version 2.0 by reimplementing it as an R package, improving the layout, and providing more options and compatible statistical functions for graphing. Apart from improving the visualization functions per se, we added an extra graphing module that visualizes genetic distance through radial bars of varying lengths surrounding a core. Notably, AncestryPainter 2.0 allows multiple pie charts at the center of the graph to display the ancestry composition of more than one target population and implements a method, the admixture history graph, to infer the admixture sequence of multiple ancestry populations. We validated six admixture history graph metrics using both simulated and real data and implemented the best-performing one, a Pearson coefficient-based metric, in AncestryPainter 2.0. Furthermore, a statistical module was implemented to merge ancestry proportion matrices. AncestryPainter 2.0 is freely available at https://github.com/Shuhua-Group/AncestryPainterV2 and https://pog.fudan.edu.cn/#/Software.
The visualization of ancestry composition and genetic distance plays an important role in presenting the findings in population genetic studies, though it can be challenging to display the analysis results of a large number of individuals with proper aesthetic features and statistical functions. We received a lot of requests from users who encouraged us to upgrade AncestryPainter 1.0 for a better visualization and presentation of ancestry analysis. The version 2.0 of AncestryPainter allows multiple pie charts in the circular graph to display the ancestry composition of multiple target populations and provides an additional graphing module to visualize genetic distance, as well as two statistical modules to merge ancestry proportion matrices and infer admixture sequences of multiple ancestry populations through correlation analyses, respectively. AncestryPainter 2.0 is expected to greatly facilitate the visualization and processing of the results in population genetics studies.
Introduction
As the number of sequenced and genotyped genomes grows rapidly, the analysis and depiction of the population structure and genetic affinity of large numbers of human groups have become increasingly common. The visualization of ancestry composition and genetic distance plays a crucial role in presenting the findings of population genetics studies. The conventional method of displaying ancestry composition is to align individuals in a rectangular graph, which can be challenging to print when dealing with a large number of individuals (Lazaridis et al. 2014). To address this issue, we developed a computational program named AncestryPainter, which uses a circular graph to display ancestry composition. Moreover, version 1.0 of AncestryPainter can categorize input populations based on their representative ancestry and automatically sort populations and individuals according to their ancestry proportion. Alternatively, users can specify the population order themselves and modify graphic features such as ancestry colors (Feng et al. 2018).
Although AncestryPainter 1.0 has been applied to many population genetics studies (Sala et al. 2019; Cerny et al. 2021; Guzman-Solis et al. 2021; Khan and Khan 2021; Ma et al. 2021; Urnikyte et al. 2021; Zhang et al. 2021a, 2021c; Aboagye et al. 2022; Changmai et al. 2022; Wang et al. 2022; Li et al. 2024; Su et al. 2024; Sun et al. 2024; Zhou et al. 2024), we received feedback from users who pointed out limitations that hindered its broader application. First, AncestryPainter 1.0 was written mainly in Perl but generates an R script to plot figures; this code structure prevents users from conveniently modifying parameters when calling plotting functions within the R environment. Second, the limited aesthetic parameters and the monotonous layout, which allows only a single pie chart at the center of the plot to highlight the ancestry of a specific population or individual, further restrict its use. Finally, no statistical function compatible with the plotting functions (e.g. the clustering function in the R package “pheatmap” [https://github.com/raivokolde/pheatmap]) is implemented in AncestryPainter.
In this study, we developed version 2.0 of AncestryPainter in the R language. This updated version retains most of the previous features while offering multiple layout styles for targets in a circular graph. Using the same basic plotting functions, we further designed a plot to display genetic distance, in which the bars indicating genetic distance are arranged radially around the central target. To enhance the visual appeal of the plot, AncestryPainter 2.0 provides a variety of aesthetic parameters. Moreover, statistical functions were implemented in this package to merge ancestry composition matrices and infer the admixture topology. Our R package aims to further facilitate the visualization and processing of results in population genetics studies.
Results and Discussion
Visualization of Ancestry Makeup and Genetic Distance
AncestryPainter 2.0 implements a “sectorplot” to visualize the ancestry composition of multiple populations. Users must provide an ancestry matrix with individuals as rows and ancestry proportions as columns, along with annotation that includes the individual ID and group ID. Users can specify the color code of the ancestry components and the population order. Otherwise, the ancestry colors are randomly generated, and the populations are categorized into K (ancestry component number, see Methods) groups and then sorted according to their representative ancestry (i.e. the ancestry accounting for the largest proportion in the population), similar to what is done in AncestryPainter 1.0 (Feng et al. 2018).
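The default grouping described above can be sketched as follows. This is a minimal Python illustration, not the package's R code: the function names and the input layout (a list of `(individual_id, group_id, proportions)` tuples) are hypothetical, and the tie-breaking rule inside each group (decreasing mean proportion) is an assumption, since the text does not specify it.

```python
from collections import defaultdict


def representative_ancestry(rows, k):
    """Return {group: (component_index, mean_proportion)} for each group.

    rows: list of (individual_id, group_id, [p_1, ..., p_K]) tuples
    (hypothetical layout chosen for this sketch).
    """
    sums = defaultdict(lambda: [0.0] * k)
    counts = defaultdict(int)
    for _individual, group, props in rows:
        counts[group] += 1
        for j in range(k):
            sums[group][j] += props[j]
    result = {}
    for group, total in sums.items():
        means = [s / counts[group] for s in total]
        # The representative ancestry is the component with the largest mean.
        best = max(range(k), key=lambda j: means[j])
        result[group] = (best, means[best])
    return result


def order_populations(rows, k):
    """Order groups by representative component, then by decreasing proportion.

    The within-component ordering is an assumption made for illustration.
    """
    rep = representative_ancestry(rows, k)
    return sorted(rep, key=lambda g: (rep[g][0], -rep[g][1]))


rows = [
    ("i1", "P1", [0.8, 0.2]),
    ("i2", "P1", [0.6, 0.4]),
    ("i3", "P2", [0.1, 0.9]),
    ("i4", "P3", [0.9, 0.1]),
]
order = order_populations(rows, k=2)
```

Here P1 and P3 share the first component as their representative ancestry and are therefore placed next to each other, ahead of P2.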
An important new feature of our software is the display of the ancestry composition of multiple target populations using pie charts in the center of the plot. In contrast to AncestryPainter 1.0, which allows only one pie chart indicating one target population in the center, version 2.0 supports multiple target pie charts. This feature was inspired by some users of AncestryPainter 1.0 (Sala et al. 2019; Khan and Khan 2021; Ma et al. 2021; Zhang et al. 2021a). The positions of the target pie charts can be adjusted via two arguments: (i) the distance between the center of the target pie chart and the center of the plot and (ii) the angle between the right horizontal axis of the plot and the line from the center of the plot to the center of the target pie chart. These arguments allow users to design the layout of target pie charts with great freedom, either within or beyond the ring made up of the sectors (Fig. 1a).
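The distance-and-angle placement described above is a polar-to-Cartesian conversion. The sketch below illustrates the geometry only; `target_center` is a hypothetical helper rather than a function of the package, and it assumes the angle is given in degrees and measured counterclockwise from the right horizontal axis.

```python
import math


def target_center(distance, angle_deg, plot_center=(0.0, 0.0)):
    """Convert (distance, angle) relative to the plot center into x/y coordinates.

    Assumes angle_deg is measured counterclockwise from the right horizontal axis.
    """
    theta = math.radians(angle_deg)
    x = plot_center[0] + distance * math.cos(theta)
    y = plot_center[1] + distance * math.sin(theta)
    return x, y


# Three target pie charts spread evenly around the plot center.
centers = [target_center(0.5, angle) for angle in (90, 210, 330)]
```

A distance smaller than the inner radius of the sector ring places a pie chart inside the ring; a larger distance places it outside.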

Fig. 1. Visualization of ancestry composition and genetic distance using AncestryPainterV2. a) Plotted with “sectorplot” is the ancestry composition of 100 randomly picked populations in the Human Origins Dataset given K = 8. The ancestry composition of three target populations (Xuun, French, and Dai) is displayed in pie charts at the center of the plot. See Example 1 (Supplementary material online) for the code. b) Plotted with “radiationplot” is the genetic distance from the Tujia group in the Human Origins Dataset to 14 randomly picked populations from East Asia and Southeast Asia. The genetic distance is indicated by the length of the bars radially surrounding the core indicating the target. See Example 2 (Supplementary material online) for the code.
Moreover, we designed optional graphing elements and features to help annotate or beautify the plot. Users can add arrows from the sector representing a population to the corresponding target pie chart, or legends that display the color code and names of the ancestry components. Users can also modify the font, size, and color of the target labels, the position of the legend, etc. We removed the version 1.0 option to output graphs directly in “.pdf” or “.png” format. Instead, users can generate graphs with R graphics device functions such as “pdf” and “png”.
Another graphing function implemented in AncestryPainter 2.0, “radiationplot”, can be used to visualize the genetic distance from one target population to other populations. This plotting pattern first appeared in a publication on the ancestral origin of Tibetans (Lu et al. 2016). The required input of this plot is a four-column matrix containing populations, regions, genetic distances, and color codes. The plot includes a core indicating the target population, surrounded radially by sectors showing the genetic distance, with outer rings displaying the value range. The number and numeric range of the outer rings can be modified by users as well. Similar to “sectorplot”, the sectors around the core can be automatically sorted according to their values. Moreover, “radiationplot” supports aesthetic and annotation options, such as text size/font and legends (Fig. 1b).
Merging Ancestry Composition Matrices
A statistical function compatible with the graphing functions is also introduced in AncestryPainter 2.0. It merges multiple ancestry proportion matrices estimated with the same dataset and the same ancestry component number (K) to obtain the averaged ancestry proportion for each individual. The required input can be either the file names of the ancestry proportion data frames or the data frames themselves. Each data frame contains (2 + K) columns: two columns of individual and population annotation and an ancestry proportion matrix of K ancestry components. Users can assign any one of the inputs as the reference matrix for merging. The function “ancmerge” outputs an R list including (i) a merged ancestry proportion matrix with annotation, (ii) a data frame showing the representative group (with the largest ancestry proportion of the corresponding component) and the supporting ratio of each ancestry component, (iii) a vector listing the matrices that conform to the reference, and (iv) a vector listing the matrices that conflict with the reference.
Inferring Admixture Topology
Pugach et al. (2016) introduced the admixture history graph (AHG), a method that infers the admixture sequence of multiple ancestry populations by calculating correlation coefficients between ancestry components. It is based on the idea that the admixture proportion of two previously admixed ancestries and that of a third ancestry would be independent in subsequent admixture events. AHG has since been employed in several studies (Feng et al. 2017; Ma et al. 2021; Oliveira et al. 2022; Oliveira et al. 2023; Lei et al. 2024) to infer the admixture sequence of ancestry proxies for the target populations (Table 1; supplementary table S1, Supplementary Material online; Methods), and a few updated versions of the original covariance-based metric have been proposed in these studies. Here, we validated the efficiency of the different variations of AHG with three admixture scenario models (Fig. 2; Methods). For the “(AB)C” model, given initial admixture proportions of A and C varying from 0.1 to 0.9, the metric “cov” obtained an accuracy greater than 0.8 only if the initial proportion of C was not more than 0.6, and the distribution of accuracy values was asymmetric, indicating low robustness of this method. Similarly, the metric “mean_cov” showed a weakness when the ratio of A was extreme (0.1 or 0.9). Compared with “cov” and “mean_cov”, the other four metrics (“cor”, “mean_cor”, “cov_log”, and “cor_log”) showed better performance, although “cor” and “cov_log” could obtain low accuracy (<0.5) if the proportion of C was 0.1. In addition, “mean_cor” and “cor_log” had higher accuracy (>0.6) than the other metrics (Fig. 3a; supplementary table S3, Supplementary Material online).

Fig. 2. Admixture scenario models for the validation of AHG metrics. Simulated populations are marked in the oval frames, and the vertical axis to the left of the graph shows the admixture time. Admixture proportion is marked beside the arrows indicating admixture: a) (AB)C scenario, in which both p1 and p2 vary from 0.1 to 0.9, with a step size of 0.1; b) ((AB)(CD)) scenario, in which both p1 and p2 vary from 0.1 to 0.9, with a step size of 0.1, and p3 varies from 0.2 to 0.5, with a step size of 0.1; c) ((AB)C)D scenario, in which p1 varies from 0.2 to 0.5, with a step size of 0.1, and both p2 and p3 vary from 0.1 to 0.9, with a step size of 0.1. Abbreviation: G ago, Generations ago.

Fig. 3. The accuracy of six AHG metrics on three admixture scenario models. The heatmaps show the accuracy of six AHG metrics on three admixture scenario models: a) (AB)C scenario; b) ((AB)(CD)) scenario; and c) ((AB)C)D scenario. The legend indicating accuracy values is shown on the right of each heatmap. The initial proportion of admixed Population (AB) in b) was specified as 0.2. The initial proportion of Population A in c) was specified as 0.2.
Table 1. Summary of the studies using admixture history graph to infer population admixture and target population description
Author | Year | Population ID | Language family | Description | Admixture sequence of ancestries | AHG metric |
---|---|---|---|---|---|---|
Pugach et al. | 2016 | Evens_Kamchatka | North Tungusic | An Even subgroup from the Kamchatka Peninsula | ([Far Eastern, Central Siberian], East Asian) | cov |
Feng et al. | 2017 | XJU | Turkic | Uyghurs in Northwestern China | ([East Asia, Central Asia Siberia], [South Asia, West Eurasia]) | cor |
Ma et al. | 2021 | NXH | Sino-Tibetan | Huis in Northwestern China | ([{South Asia, West Eurasia}, East Asia], Central Asia Siberia), or ([East Asia, Central Asia Siberia], [{South Asia, West Eurasia}, Central Asia Siberia]) | cor |
Oliveira et al. | 2022 | NTT | … | Ancient individuals from East Nusa Tenggara, Wallacea islands | ([Papuan, Southern Asian], Austronesian) | cov_log |
Oliveira et al. | 2023 | Kwepe | Kwadi | Kwepes from the Angolan Namib Desert | ([Southern Africa, East Africa], West Africa) | cov_log |
Lei et al. | 2024 | KZK | Turkic | Kazakhs in Northwestern China | ([East Asia, Central Asia Siberia], [South Asia, West Eurasia]) | cor_log |
Note: Three studies (Pugach et al. 2016; Oliveira et al. 2022, 2023) involved multiple admixed populations and only one of the populations has been picked as an example for each study in this table. For the full list of target populations, see supplementary table S1, Supplementary Material online.
When we validated the efficiency of the AHG metrics on the (AB)(CD) model, we chose a relatively biased proportion (0.2) of the admixed population (AB) to make the differences between metrics more prominent (Fig. 3a), while the initial proportions of A and C varied from 0.1 to 0.9 (step size: 0.1). The metric “cov” showed the worst performance, and “mean_cov” could obtain a very low accuracy when A had an extremely low or high initial proportion (0.1 or 0.9). The remaining metrics performed better, although an extreme proportion of A or C (0.1 or 0.9) could also reduce their accuracy. Among these metrics, “cor”, “mean_cor”, and “cor_log” had relatively higher accuracy, ranging from 0.44 to 1, while the accuracy of “cov_log” could drop below 0.4 (Fig. 3b; supplementary table S4, Supplementary Material online).
For the ((AB)C)D model, we specified the initial proportion of A as 0.2, with C and D ranging from 0.1 to 0.9. Similar to the results of the (AB)(CD) model, “cov_log”, “cor”, “mean_cor”, and “cor_log” outperformed “cov” and “mean_cov”. Moreover, “cov_log” and “cor_log” had higher median accuracy (>0.4) than “cor” (0.22) and “mean_cor” (0.38), indicating that the “cov_log” and “cor_log” metrics were more likely to yield an accurate admixture topology (Fig. 3c; supplementary table S5, Supplementary Material online).
Overall, the metric “cor_log” showed the best performance among all metrics. We further evaluated its robustness with varying proportions (0.3, 0.4, 0.5) of (AB) in the (AB)(CD) model and of A in the ((AB)C)D model (supplementary fig. S1, Supplementary Material online; supplementary tables S6 and S7, Supplementary Material online). “cor_log” obtained an accuracy greater than 0.7 in most instances for the (AB)(CD) model (supplementary fig. S1a, Supplementary Material online; supplementary table S6, Supplementary Material online). For the ((AB)C)D model, if the extremely biased instances (the proportion of C or D = 0.1; the proportion of C or D = 0.9) were not taken into account, 90% of the “cor_log” results were greater than 0.5 (supplementary fig. S1b, Supplementary Material online; supplementary table S7, Supplementary Material online). Given that admixture scenarios similar to “(AB)(CD)” were more prevalent than “((AB)C)D” in previous studies (Feng et al. 2017; Ma et al. 2021; Lei et al. 2024), the lower performance of “cor_log” on the “((AB)C)D” scenario model might not hinder the application of “cor_log” in population admixture studies.
We evaluated the efficiency of the AHG metrics with a real dataset composed of two African American populations (African Americans in Southwest United States [ASW] and African Caribbeans in Barbados [ACB]) as well as their African and European ancestry proxies. The evaluation was based on the assumptions that (i) due to the latitudinal proximity and navigation systems, Western African populations contributed the major African ancestry to the African offspring living in Caribe-Central/North America (namely, ACB and ASW in this study); (ii) Western African populations were themselves previously admixed and carry multiple ancestry components; and (iii) after the admixture of European and African ancestry components, the African descendants living in America underwent post-admixture homogenization to different degrees, in which the African descendants in the United States were largely affected while the African offspring in Barbados received a much milder influence (Gouveia et al. 2020). We ran ADMIXTURE to infer the ancestry composition of ASW and ACB together with the reference populations across Africa and Europe. The results at K = 5 showed distinctly separated ancestries of African populations (respectively represented by the populations in Western Africa [AFR_W], Western/Central Africa [AFR_C], Eastern Africa [AFR_E], and Southern Africa [AFR_S]) and a highly admixed population structure of African offspring in America (Fig. 4a). We performed AHG tests with 5,000 resamplings on the two major African ancestry components. As expected, strong AHG signals of “(AFR_W, AFR_C), EUR”, meaning that the two African ancestries, AFR_W and AFR_C, admixed before EUR was introduced, were detected by all AHG metrics except “cov”.
However, due to the post-admixture homogenization, more Western/Central African-derived ancestry was likely to be introduced into the gene pool of African offspring in the United States. AHG signals of “(AFR_W, EUR), AFR_C” could therefore be explicitly observed, though with a lower supporting number than “(AFR_W, AFR_C), EUR”. The AHG metrics “cov_log”, “mean_cor”, and “cor_log” detected these signals with supporting numbers of 619, 395, and 253, respectively, while the supporting numbers of the other three metrics were close to 0. Meanwhile, African offspring in Barbados were less affected by the post-admixture homogenization. As a consequence, the supporting numbers of “(AFR_C, EUR), AFR_W” and “(AFR_W, EUR), AFR_C” were expected to be relatively smaller than that of “(AFR_W, AFR_C), EUR” in the ACB population. The metrics “mean_cov”, “cor_log”, and “mean_cor”, with supporting numbers of 256, 735, and 786 for the “(AFR_C, EUR), AFR_W” signal, outperformed the other metrics, whose supporting numbers exceeded 800 (Fig. 4b). Considering the good performance of “cor_log” on both simulated and real data, we employed “cor_log” as the AHG metric of AncestryPainter 2.0.

Fig. 4. The accuracy of six AHG metrics on the real dataset. a) ADMIXTURE results (K = 5) of the real dataset, plotted with the “sectorplot” function in AncestryPainter 2.0; b) Admixture sequence inferred using AHG. The admixture topology “(A,B), C” means ancestry A admixes with ancestry B, and then ancestry C joins in the admixed ancestry. Abbreviations: AFR_W, Western African ancestry; AFR_C, Western/Central African ancestry; AFR_E, Eastern African ancestry; AFR_S, Southern African ancestry; EUR, European ancestry.
Future Developments
In this study, we developed a new version of AncestryPainter which can be used to illustrate the ancestry compositions and genetic distance along with statistical functions to merge multiple ancestry proportion matrices or infer admixture topology. Moreover, we introduced the AHG algorithm into AncestryPainter for the inference of admixture topology. We compared the accuracy of six AHG metrics on three different admixture scenario models using simulated populations. The metric “cor_log” showed an overall better performance than other metrics, and thus, we implemented this metric in the AHG function.
The AHG method is easy to operate and has a high accuracy with the (AB)C and (AB)(CD) admixture scenario models. However, the accuracy of all AHG metrics is low when the proportion of any ancestor is too small. This can be interpreted as the effect of genetic drift, which can be simulated by AdmixSim2 (Zhang et al. 2021b). When descendants are generated, an ancestry component may be lost or drastically decreased due to genetic drift, and the ancestry proportion in the descendants tends to form a truncated normal distribution with large variance, which disturbs the correlation between previously admixed ancestry components. Accordingly, none of the AHG metrics performs well in the ((AB)C)D scenario, a continuous admixture model, which may result from the large variance of each ancestral component after admixture. The AHG accuracy for the ((AB)C)D scenario might increase if all four ancestries contribute a substantial admixture proportion (supplementary fig. S1b, Supplementary Material online). Collectively, AHG can be used as a “preliminary estimate” of the admixture topology and has to be combined with other methods, e.g. the three-population test (f3) (Patterson et al. 2012).
While the graphing module of AncestryPainter has greatly facilitated visualization of ancestry composition in recent population genetic studies (Sala et al. 2019; Cerny et al. 2021; Guzman-Solis et al. 2021; Khan and Khan 2021; Ma et al. 2021; Urnikyte et al. 2021; Zhang et al. 2021a, 2021c; Aboagye et al. 2022; Changmai et al. 2022; Wang et al. 2022; Li et al. 2024; Su et al. 2024; Sun et al. 2024; Zhou et al. 2024) and has been equipped with many new features in this study, there is still room for more functions and features, for instance, using multiple concentric circles in a single image to display ancestry makeup under different numbers of ancestry components, or annotating subgroup information on a finer scale. Furthermore, tree structures or network graphs could be added to display the phylogenetic relationships of populations and the admixture topology.
Materials and Methods
Example of Dataset
To illustrate the utilities of the graphing modules and the “ancmerge” function in AncestryPainter 2.0, we used the genome-wide single-nucleotide polymorphisms (SNPs) of 2,415 modern human individuals in the Human Origins (HO) dataset (Lazaridis et al. 2014) and 7 Kyrgyz individuals from the Estonian Biocentre Human Genome Diversity Panel (EGDP) (Pagani et al. 2016) to generate the example data. We merged the HO and EGDP data with bcftools (Danecek et al. 2021) and ran ADMIXTURE (Alexander et al. 2009) to estimate the ancestry makeup of the individuals in ten repeats with different SNP subsets, specifying the ancestry component number (K) as eight. The ten SNP subsets were generated using the method described in the post-admixture adaptation study of Xinjiang Uyghurs (Pan et al. 2022). In addition, we ran an in-house Python script to calculate the genetic distance (FST) (Weir and Cockerham 1984) between populations.
Using Sectors for Visualization
The graphic functions of AncestryPainter 2.0 are built primarily on the R package “graphics”. The sectors visualizing ancestry proportion or genetic distance are plotted by the function “polygon” in “graphics”. The coordinates of the sectors on the canvas depend on (i) the order of the ancestry components indicated in the input data and (ii) the initial plotting position. The sector size scales with the ancestry proportion or genetic distance. In addition, we use other functions in the “graphics” package, such as “text” and “arrows”, to annotate sectors.
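To make the sector geometry concrete, the sketch below computes the vertex list of one annular sector, which is the kind of input a polygon-drawing routine needs. It is written in Python for illustration and is not the package's R code; the helper names are hypothetical, and the radial stacking of ancestry components within an individual's slice is an assumption about the layout.

```python
import math


def sector_polygon(r_inner, r_outer, start_deg, end_deg, n_steps=30):
    """Vertices of an annular sector: along the outer arc, then back along the inner arc."""
    angles = [
        math.radians(start_deg + (end_deg - start_deg) * i / n_steps)
        for i in range(n_steps + 1)
    ]
    outer = [(r_outer * math.cos(t), r_outer * math.sin(t)) for t in angles]
    inner = [(r_inner * math.cos(t), r_inner * math.sin(t)) for t in reversed(angles)]
    return outer + inner


def stacked_radii(proportions, r_inner, r_outer):
    """Split the radial span [r_inner, r_outer] in proportion to the ancestry values.

    Assumes components are stacked from the inner edge outward in input order.
    """
    bounds = []
    start = r_inner
    for p in proportions:
        end = start + p * (r_outer - r_inner)
        bounds.append((start, end))
        start = end
    return bounds


# One individual occupying a 2-degree slice with two ancestry components.
bands = stacked_radii([0.25, 0.75], r_inner=1.0, r_outer=2.0)
polygons = [sector_polygon(lo, hi, 0.0, 2.0) for lo, hi in bands]
```

Each vertex list traces a closed outline, so it can be passed to any polygon-filling routine with the component's color.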
Merging Ancestry Proportion Matrices
This function is translated from the in-house Python script authored by Pan et al. (2022) (https://github.com/Shuhua-Group/ADMIXTURE.merge). It merges the ancestry proportion matrices (called “target matrices”) estimated by software such as ADMIXTURE with the same dataset and the same ancestry component number (K). The function calculates and compares the correlation (measured by the Pearson coefficient) between one ancestry component in a user-defined reference matrix (i.e. a reference component) and each of the ancestry components in the matrices to be merged (i.e. target components), and then matches the reference component with the target component of the highest correlation coefficient. The function counts the number of target components matched with each reference component and calculates the supporting ratio of every ancestry component in the reference matrix. The supporting ratio is defined as the ratio of the number of matched target components to the total number of target matrices. In the merged matrix, the proportion of an ancestry component for each individual is the average over the group of matched ancestry components. A target matrix in which all ancestry components match those of the reference is defined as a consensus matrix “supporting” the reference; otherwise, it is regarded as a “conflicted” one. A larger number of consensus matrices indicates a more reliable reference, and vice versa.
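The matching procedure described above can be sketched as follows. This is a simplified Python illustration, not the “ancmerge” implementation: `merge_matrices` and its return layout are hypothetical, matrices are plain lists of rows without annotation columns, and two details are assumptions made here because the text leaves them open: a match counts toward the supporting ratio only when no other reference component claims the same target component, and the reference column itself is included in the average.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = (sxx * syy) ** 0.5
    return sxy / denom if denom > 0 else 0.0


def merge_matrices(reference, targets):
    """Match target components to reference components and average them.

    reference, targets[i]: lists of rows, each row holding K proportions.
    Returns (merged_matrix, supporting_ratios, consensus_flags).
    """
    k = len(reference[0])
    ref_cols = [list(c) for c in zip(*reference)]
    groups = [[col] for col in ref_cols]  # columns averaged per reference component
    support = [0] * k
    consensus_flags = []
    for target in targets:
        tgt_cols = [list(c) for c in zip(*target)]
        # Best-correlated target component for each reference component.
        best = [
            max(range(k), key=lambda j: pearson(rc, tgt_cols[j]))
            for rc in ref_cols
        ]
        for i, j in enumerate(best):
            if best.count(j) == 1:  # assumption: only unambiguous matches count
                support[i] += 1
                groups[i].append(tgt_cols[j])
        # Consensus: every target component is matched exactly once.
        consensus_flags.append(sorted(best) == list(range(k)))
    merged = [
        [sum(col[r] for col in groups[i]) / len(groups[i]) for i in range(k)]
        for r in range(len(reference))
    ]
    ratios = [s / len(targets) for s in support]
    return merged, ratios, consensus_flags


reference = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
swapped = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]  # same result, columns permuted
merged, ratios, flags = merge_matrices(reference, [swapped])
```

In this toy example the target matrix has its components in the opposite column order; the correlation-based matching recovers the correspondence, so the merged matrix equals the reference and the target is flagged as a consensus matrix.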
AHG Metrics
In the original AHG test (Pugach et al. 2016), the correlation efficiency is estimated with the covariance of (i) the ratio of the admixture proportion of two random-picking ancestry components and (ii) the admixture proportion of the third ancestry component. For example, an already admixed population with two different ancestry components A and B meets with another episode of admixture bringing into this population a new ancestry component C, and the arrays of ancestry proportion of individuals in Populations A, B, and C are available. The correlation coefficient can be calculated as follows:
This coefficient is expected to be zero. Practically, the admixture topology with the lowest corresponding correlation coefficient among the three combinations, i.e. , , and , can be inferred as the best-fit. The supporting ratio of each admixture topology can be estimated by using ancestry proportion arrays of randomly picked individuals from the given population.
This metric has been modified and then applied to our previous study of the Uyghurs (Feng et al. 2017) and the Huis (Ma et al. 2021) in Northwestern China, in which the covariance was substituted by Pearson coefficient, because the latter can adjust the bias caused by admixture proportion differences among ancestry components:
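However, the correlation coefficient values of the same population combination (e.g. $A/B$ and $C$) can differ if the positions of the ancestry components in the fraction are swapped (e.g. replacing A/B by B/A). To solve this issue, we defined a novel metric as the arithmetic mean of the two covariance or correlation values obtained by swapping the ancestry components:

$$\frac{1}{2}\left[\mathrm{Cov}\left(\frac{A}{B},\, C\right) + \mathrm{Cov}\left(\frac{B}{A},\, C\right)\right]$$

or

$$\frac{1}{2}\left[\mathrm{Cor}\left(\frac{A}{B},\, C\right) + \mathrm{Cor}\left(\frac{B}{A},\, C\right)\right]$$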
In addition, Oliveira et al. (2022) updated the original AHG metric by introducing a logarithmic transformation of the ratio, which eliminates the effect of swapping ancestry component positions.
Drawing on the metrics proposed above, we could also optimize the calculation of the correlation coefficients as:

$$\mathrm{Cor}\left(\log\frac{A}{B},\, C\right)$$
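To make the six variants concrete, the following Python sketch computes them from three arrays of per-individual ancestry proportions, following the verbal definitions in this section (covariance or Pearson coefficient; plain ratio, mean over the swapped ratios, or log-transformed ratio). It is an illustrative sketch rather than the package's implementation. Apart from “mean_cor”, the metric labels and all function names are hypothetical, the natural logarithm is assumed, and choosing the topology whose metric is closest to zero in absolute value is an assumption based on the expectation that the coefficient is zero under the true topology.

```python
from math import log, sqrt


def cov(x, y):
    """Sample covariance of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)


def cor(x, y):
    """Pearson correlation coefficient (assumes neither sequence is constant)."""
    return cov(x, y) / sqrt(cov(x, x) * cov(y, y))


def ahg_metrics(a, b, c):
    """Six AHG-style metrics for the candidate topology (AB)C.

    a, b, c hold strictly positive ancestry proportions, one per individual.
    """
    ab = [x / y for x, y in zip(a, b)]       # A/B
    ba = [y / x for x, y in zip(a, b)]       # B/A (swapped ratio)
    ln_ab = [log(r) for r in ab]             # log-transformed ratio
    return {
        "cov": cov(ab, c),
        "cor": cor(ab, c),
        "mean_cov": (cov(ab, c) + cov(ba, c)) / 2,
        "mean_cor": (cor(ab, c) + cor(ba, c)) / 2,
        "log_cov": cov(ln_ab, c),
        "log_cor": cor(ln_ab, c),
    }


def best_topology(a, b, c, metric="mean_cor"):
    """Pick the topology whose metric is closest to zero (an assumption)."""
    scores = {
        "(AB)C": abs(ahg_metrics(a, b, c)[metric]),
        "(AC)B": abs(ahg_metrics(a, c, b)[metric]),
        "(BC)A": abs(ahg_metrics(b, c, a)[metric]),
    }
    return min(scores, key=scores.get)
```

Swapping the first two arguments leaves the two mean metrics unchanged and only flips the sign of the two log-based metrics, which is the symmetry that motivates them.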
Validation of AHG Metrics Using Simulated Data
To validate the efficiency of these metrics, we examined our methods and previously published ones under three types of admixture scenario models (Fig. 2). These models were based on the one proposed by Feng et al. (2017).
(AB)C scenario: Populations A and B were initially admixed 120 generations ago to form the AB population. After 90 generations of self-evolution, the AB population was then mixed with Population C 30 generations ago, leading to the formation of the initial (AB)C population. This (AB)C population underwent a further 30 generations of self-evolution to arrive at the final (AB)C population (Fig. 2a). When Population A and Population B admixed to form Population (AB), the admixture proportion of Population A was varied incrementally from 0.1 to 0.9, with a step size of 0.1. Similarly, when Population (AB) was subsequently admixed with Population C, the admixture proportion of Population C also varied incrementally from 0.1 to 0.9, with a step size of 0.1.
((AB)(CD)) scenario: Populations A and B underwent admixture 120 generations ago to form the composite Population (AB), while concurrently, Populations C and D were mixed to form the composite Population (CD). Each of these newly formed populations, (AB) and (CD), then proceeded through 90 generations of isolated evolution before admixing 30 generations ago, giving rise to the initial combined population (AB)(CD). This combined population (AB)(CD) then experienced 30 generations of self-evolution to reach its final state (Fig. 2b). During the admixture event between Populations A and B, the contribution from Population A was incrementally set from 0.1 to 0.9 in steps of 0.1. Similarly, for the admixture between Populations C and D, Population C's contribution was also incrementally set from 0.1 to 0.9 in steps of 0.1. During the admixture of composite Populations (AB) and (CD), the proportion of (AB) was set at 0.2, 0.3, 0.4, and 0.5.
((AB)C)D scenario: Populations A and B were initially admixed 150 generations ago to form the composite (AB) population. This AB population then underwent 90 generations of independent evolution before engaging in admixture with Population C 60 generations ago, culminating in the formation of the (AB)C population. After a further 30 generations of self-evolution, this (AB)C population was then admixed with Population D to form the ((AB)C)D population 30 generations ago. The ((AB)C)D population continued to evolve independently for an additional 30 generations to achieve its final genetic composition (Fig. 2c). During the admixture event between Populations A and B, the admixture proportions from Population A were specifically set at 0.2, 0.3, 0.4, and 0.5. When Population (AB) was admixed with Population C, the admixture proportion from Population C varied sequentially from 0.1 to 0.9, with a step increment of 0.1. Subsequently, for the admixture event between Population (AB)C and Population D, Population D's admixture proportion also ranged sequentially from 0.1 to 0.9, with the same step increment of 0.1.
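As a rough aid to reading the three scenarios, the parameter grids and the ancestry proportions expected from the admixture proportions alone can be enumerated as below. This is a sketch under stated assumptions, not part of the AdmixSim2 workflow: the function names are hypothetical, and the expected proportions ignore genetic drift and any other stochastic effect of the forward-time simulation.

```python
from itertools import product

STEPS_01_09 = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
STEPS_02_05 = [i / 10 for i in range(2, 6)]   # 0.2, 0.3, 0.4, 0.5


def expected_abc(a, c):
    """(AB)C: A contributes a to (AB); C contributes c to (AB)C."""
    return {"A": a * (1 - c), "B": (1 - a) * (1 - c), "C": c}


def expected_ab_cd(a, c, ab):
    """(AB)(CD): A contributes a to (AB), C contributes c to (CD),
    and (AB) contributes ab to the final population."""
    return {
        "A": ab * a,
        "B": ab * (1 - a),
        "C": (1 - ab) * c,
        "D": (1 - ab) * (1 - c),
    }


def expected_abc_d(a, c, d):
    """((AB)C)D: A contributes a to (AB), C contributes c to (AB)C,
    and D contributes d to ((AB)C)D."""
    return {
        "A": a * (1 - c) * (1 - d),
        "B": (1 - a) * (1 - c) * (1 - d),
        "C": c * (1 - d),
        "D": d,
    }


# Parameter combinations implied by the step sizes given in the text.
grid_abc = list(product(STEPS_01_09, STEPS_01_09))                  # (a, c)
grid_ab_cd = list(product(STEPS_01_09, STEPS_01_09, STEPS_02_05))   # (a, c, ab)
grid_abc_d = list(product(STEPS_02_05, STEPS_01_09, STEPS_01_09))   # (a, c, d)
```

Under these grids, the (AB)C scenario covers 81 parameter combinations and each of the two four-way scenarios covers 324.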
Populations in the three kinds of admixture scenario models were simulated by AdmixSim2 (Zhang et al. 2021b), an individual-based forward-time simulation tool that can flexibly and efficiently simulate population genomics data under complex evolutionary scenarios. For all three scenarios, the Populations A, B, C, and D were randomly generated in accordance with the AdmixSim2 manual (https://github.com/Shuhua-Group/AdmixSim2/tree/master), without involving specific population information.
During the simulation, we maintained a constant sample size of 5,000 individuals per population per generation. Due to factors such as genetic drift, some ancestral components may be represented at very low frequencies (below 1 × 10⁻⁶) in the outcomes. These minor ancestral components were set to a threshold value of 1 × 10⁻⁶, and the proportions of the remaining ancestral components were adjusted accordingly so that all ancestral component proportions sum to 1. In addition, we forced the proportion of each ancestry component in the final admixed population to be greater than 0.
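The thresholding step can be illustrated with a short Python sketch. The text does not specify how the remaining proportions are adjusted; the example assumes they are rescaled proportionally so that the total is exactly 1, and the function name is hypothetical.

```python
EPS = 1e-6  # threshold value stated in the text


def floor_and_renormalize(props, eps=EPS):
    """Raise components below eps to eps, then rescale the others to sum to 1.

    Assumption: the remaining components are rescaled proportionally.
    """
    is_low = [p < eps for p in props]
    rest = sum(p for p, low in zip(props, is_low) if not low)
    if rest <= 0:
        raise ValueError("at least one component must be at or above the threshold")
    # Mass left for the unfloored components after reserving eps per floored one.
    scale = (1 - eps * sum(is_low)) / rest
    return [eps if low else p * scale for p, low in zip(props, is_low)]
```

Keeping every proportion strictly positive matters for the ratio-based metrics, since a zero in the denominator of A/B, or inside a logarithm, would leave the metric undefined.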
Further, we generated simulated data within the (AB)C scenario to evaluate the impact of sample size on algorithmic performance, using the “mean_cor” metric. When Population A and Population B were mixed to form Population (AB), the admixture proportion of Population A was varied incrementally from 0.1 to 0.5, with a step size of 0.1. Similarly, when Population (AB) was subsequently admixed with Population C, the admixture proportion of Population C was varied incrementally from 0.1 to 0.9, with a step size of 0.1. After the simulation data were prepared, we sampled 25, 50, 75, and 100 individuals from the admixed population, repeating this process 100 times for each sample size, to assess the effect of sample size on the efficacy of the algorithm. We recorded the number of instances in which the algorithm inferred the correct admixture model (accuracy). A larger sample size resulted in higher inference accuracy, and all the methods obtained the highest accuracy when the sample size was 100 (supplementary table S2, Supplementary Material online). Therefore, we sampled 100 individuals from the simulated population 100 times for each admixture scenario to calculate the AHG metrics.
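The resampling procedure can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: the function names are hypothetical, individuals are drawn without replacement within each replicate (the text does not say which sampling scheme was used), and the topology whose “mean_cor” value is closest to zero is taken as the inferred one.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient (assumes neither sequence is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_cor(p, q, r):
    """Mean of Cor(p/q, r) and Cor(q/p, r); proportions must be positive."""
    pq = [x / y for x, y in zip(p, q)]
    qp = [y / x for x, y in zip(p, q)]
    return (pearson(pq, r) + pearson(qp, r)) / 2


def infer_topology(sample):
    """sample: list of (a, b, c) proportions, one tuple per individual."""
    a, b, c = zip(*sample)
    scores = {
        "(AB)C": abs(mean_cor(a, b, c)),
        "(AC)B": abs(mean_cor(a, c, b)),
        "(BC)A": abs(mean_cor(b, c, a)),
    }
    return min(scores, key=scores.get)


def supporting_ratios(individuals, n_draw=100, n_rep=100, seed=0):
    """Share of replicates supporting each topology."""
    rng = random.Random(seed)
    counts = {"(AB)C": 0, "(AC)B": 0, "(BC)A": 0}
    for _ in range(n_rep):
        sample = rng.sample(individuals, n_draw)  # without replacement (assumption)
        counts[infer_topology(sample)] += 1
    return {topology: n / n_rep for topology, n in counts.items()}
```

When the true topology is known, as in the simulations, the accuracy reported in the text corresponds to the supporting ratio of the correct topology.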
Inferring Admixture Sequence of African American Populations Using AHG
We validated the AHG metrics using a real human genome dataset constructed on the basis of a published study of the origin of African American populations (Gouveia et al. 2020). We chose the ASW and ACB individuals in the 1000 Genomes Project phase 3 release (KGP) (1000 Genomes Project Consortium 2015) as target populations and selected individuals of African and European ancestry from the HO and KGP as the reference panel: (i) Eastern African populations: BantuSA, Tswana, Wambo, Ju_hoan_North, Taa_North, and Ju_hoan_South from HO; LWK from KGP. (ii) Southern African populations: Dinka, Luo, Sandawe from HO. (iii) Western African and Central African populations from KGP: ESN, YRI, GWD, and MSL. (iv) European individuals from KGP: Utah residents (CEPH) with Northern and Western European ancestry (CEU) and the Iberian population in Spain (IBS). The SNPs of the selected HO and KGP individuals were merged using bcftools (Danecek et al. 2021) and then pruned with plink (Purcell et al. 2007).
To infer the ancestry proportions of the real dataset for the validation of AHG metrics, we ran ADMIXTURE on the merged dataset with K ranging from 2 to 10. For each K, ADMIXTURE was run ten times with different random seeds, and the output “.Q” files were merged using the “ancmerge” function in AncestryPainter 2.0. The presented results are based on the ADMIXTURE runs with K = 5, which had the lowest cross-validation error (supplementary fig. S2, Supplementary Material online). We excluded outliers with substantial non-African ancestry from the AHG tests. We performed AHG tests using the six metrics for the main ancestry components of ASW and ACB, resampling 30 individuals 5,000 times. AHG metrics of both simulated and real data were calculated and plotted using R.
Supplementary Material
Supplementary material is available at Genome Biology and Evolution online.
Acknowledgments
We thank Dr Qidi Feng for sharing her experience on the first version of this package. We thank Dr Xumin Ni for his suggestions on the evaluation of AHG metrics. We also thank Dr Alec Downie and Dr Iker Rivas-González who provided advice on the presentation of background information. The computational work in this study was supported by the CFFF Computing Platform and the Human Phenome Data Center of Fudan University.
Authors’ Contributions
S.X. conceived and designed the study and supervised the project. S.C., C.L., H.Z., Y.P., and D.L. contributed to the computer code. C.L. developed a key algorithm. H.Z. examined and improved the computer code. S.C. coordinated the computer coding and packaged the software. S.C. and C.L. drafted the manuscript. S.X. revised the manuscript. All authors read and approved the final manuscript.
Funding
This work was supported by the National Key Research and Development Program of China [No. 2023YFC2605400]; the National Natural Science Foundation of China (NSFC) grants [32288101, 32030020]; the Shanghai Science and Technology Commission Program [23JS1410100]; and the Office of Global Partnerships (Key Projects Development Fund).
Data Availability
Example data and source code of AncestryPainter 2.0 are available on GitHub, at https://github.com/Shuhua-Group/AncestryPainterV2 and on the HumPOG lab website, at https://pog.fudan.edu.cn/#/Software.
Author notes
Shuanghui Chen and Chang Lei contributed equally to this work.