AncestryPainter 2.0: Visualizing Ancestry Composition and Admixture History Graph

Abstract

The earlier version of AncestryPainter is a Perl program that displays the ancestry composition of numerous individuals using a rounded graph. Motivated by user requests arising from practical applications, we updated AncestryPainter to version 2.0 by reimplementing it as an R package, improving the layout, and providing more options and compatible statistical functions for graphing. Apart from improving the visualization functions per se, we added an extra graphing module to visualize genetic distance through radial bars of varying lengths surrounding a core. Notably, AncestryPainter 2.0 allows multiple pie charts at the center of the graph to display the ancestry composition of more than one target population and implements the admixture history graph method to infer the admixture sequence of multiple ancestry populations. We validated six admixture history graph metrics using both simulated and real data and implemented the best-performing one, a Pearson coefficient-based metric, in AncestryPainter 2.0. Furthermore, a statistical module was implemented to merge ancestry proportion matrices. AncestryPainter 2.0 is freely available at https://github.com/Shuhua-Group/AncestryPainterV2 and https://pog.fudan.edu.cn/#/Software.

Keywords: population genetics, visualization, ancestry composition, genetic distance, admixture history graph

Significance

The visualization of ancestry composition and genetic distance plays an important role in presenting the findings of population genetic studies, though it can be challenging to display the analysis results of a large number of individuals with proper aesthetic features and statistical functions. We received many requests from users who encouraged us to upgrade AncestryPainter 1.0 for better visualization and presentation of ancestry analyses. Version 2.0 of AncestryPainter allows multiple pie charts in the circular graph to display the ancestry composition of multiple target populations and provides an additional graphing module to visualize genetic distance, as well as two statistical modules that, respectively, merge ancestry proportion matrices and infer the admixture sequence of multiple ancestry populations through correlation analyses. AncestryPainter 2.0 is expected to greatly facilitate the visualization and processing of results in population genetics studies.

Introduction

As the number of sequenced and genotyped genomes grows rapidly, the analysis and depiction of the population structure and genetic affinity of large numbers of human groups have become increasingly common. The visualization of ancestry composition and genetic distance plays a crucial role in presenting the findings of population genetics studies. The conventional method of displaying ancestry composition is to align individuals in a rectangular graph, which can be challenging to print when dealing with a large number of individuals (Lazaridis et al. 2014). To address this issue, we developed a computational program named AncestryPainter, which uses a circular graph to display ancestry composition. Moreover, version 1.0 of AncestryPainter can categorize input populations based on their representative ancestry and automatically sort populations and individuals according to their ancestry proportion. Alternatively, users can specify the population order themselves and modify graphic features such as ancestry colors (Feng et al. 2018).

Although AncestryPainter 1.0 has been applied to many population genetics studies (Sala et al. 2019; Cerny et al. 2021; Guzman-Solis et al. 2021; Khan and Khan 2021; Ma et al. 2021; Urnikyte et al. 2021; Zhang et al. 2021a, 2021c; Aboagye et al. 2022; Changmai et al. 2022; Wang et al. 2022; Li et al. 2024; Su et al. 2024; Sun et al. 2024; Zhou et al. 2024), we received feedback from users who pointed out some limitations that hindered its broader application. First, AncestryPainter 1.0 was written mainly in Perl but generates an R script to plot figures; this code structure prevents users from conveniently modifying parameters when calling plotting functions within the R environment. Second, the limited aesthetic parameters and monotonous layout, which allows only a single pie chart to highlight the ancestry of a specific population or individual at the center of the plot, further restrict its use. Finally, no statistical function compatible with the plotting functions (e.g. the clustering function in the R package “pheatmap” [https://github.com/raivokolde/pheatmap]) is implemented in AncestryPainter 1.0.

In this study, we developed version 2.0 of AncestryPainter in the R language. This updated version retains most of the previous features while offering multiple layout styles for targets in a circular graph. Using the same basic plotting functions, we further designed a plot to display genetic distance, in which the bars indicating genetic distance are arranged radially around the central target. To enhance the visual appeal of the plot, AncestryPainter 2.0 provides a variety of aesthetic parameters. Moreover, statistical functions were implemented in this package to merge ancestry composition matrices and infer the admixture topology. Our R package aims to further facilitate the visualization and processing of results in population genetics studies.

Results and Discussion

Visualization of Ancestry Makeup and Genetic Distance

AncestryPainter 2.0 implements a “sectorplot” to visualize the ancestry composition of multiple populations. Users must provide an ancestry matrix with individuals as rows and ancestry proportions as columns, along with annotation including the individual ID and group ID. Users can specify the color code of the ancestry components and the population order. Otherwise, the ancestry colors are randomly generated, and the populations are categorized into K groups (K being the ancestry component number, see Methods) and then sorted according to their representative ancestry (i.e. the ancestry accounting for the largest proportion in the population), similar to what is done in AncestryPainter 1.0 (Feng et al. 2018).
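The default ordering described above can be illustrated with a minimal Python sketch. It is not the package's R code: the function name `representative_order`, the use of per-population mean proportions, and the choice to sort populations within a group by decreasing proportion of their representative ancestry are all assumptions made for illustration.

```python
from collections import defaultdict


def representative_order(rows, k):
    """Order populations by representative ancestry (hypothetical helper).

    rows: list of (individual_id, population_id, [p_1, ..., p_k]).
    Returns population IDs grouped by representative component.
    """
    # Accumulate per-population sums so that mean proportions can be derived.
    sums = defaultdict(lambda: [0.0] * k)
    counts = defaultdict(int)
    for _ind, pop, props in rows:
        counts[pop] += 1
        for j in range(k):
            sums[pop][j] += props[j]
    means = {pop: [s / counts[pop] for s in sums[pop]] for pop in sums}

    # Representative ancestry: the component with the largest mean proportion.
    rep = {pop: max(range(k), key=lambda j: m[j]) for pop, m in means.items()}

    # Assumption: within a group, sort by decreasing representative proportion.
    return sorted(means, key=lambda pop: (rep[pop], -means[pop][rep[pop]]))


rows = [
    ("i1", "PopA", [0.90, 0.10]),
    ("i2", "PopA", [0.70, 0.30]),
    ("i3", "PopB", [0.20, 0.80]),
    ("i4", "PopC", [0.95, 0.05]),
]
order = representative_order(rows, 2)  # ['PopC', 'PopA', 'PopB']
```

PopC and PopA share the first component as their representative ancestry and are therefore placed next to each other, ahead of PopB.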

An important new feature of our software is the display of the ancestry composition of multiple target populations using pie charts in the center of the plot. In contrast to AncestryPainter 1.0, which allows only one pie chart indicating one target population in the center, version 2.0 supports multiple target pie charts. This feature was inspired by some users of AncestryPainter 1.0 (Sala et al. 2019; Khan and Khan 2021; Ma et al. 2021; Zhang et al. 2021a). The position of each target pie chart can be adjusted via two arguments: (i) the distance between the center of the target pie chart and the center of the plot and (ii) the angle between the line from the center of the plot to the center of the target pie chart and the right horizontal axis of the plot. These arguments allow users to design the layout of target pie charts with great freedom, either within or beyond the ring made up of the sectors (Fig. 1a).
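The two positioning arguments amount to polar coordinates relative to the plot center. A minimal Python sketch of the conversion follows; the function name `pie_center` and the counterclockwise angle direction are assumptions, not the package's interface.

```python
import math


def pie_center(distance, angle_deg, origin=(0.0, 0.0)):
    """Cartesian center of a target pie chart (hypothetical helper).

    distance:  distance from the plot center to the pie center.
    angle_deg: angle from the right horizontal axis, assumed counterclockwise.
    """
    theta = math.radians(angle_deg)
    return (origin[0] + distance * math.cos(theta),
            origin[1] + distance * math.sin(theta))


# Three pies placed evenly around the plot center at the same distance.
centers = [pie_center(0.4, angle) for angle in (90, 210, 330)]
```

A distance of zero places the pie at the plot center, while a distance larger than the ring radius places it beyond the ring of sectors.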

Fig. 1.

Visualization of ancestry composition and genetic distance using AncestryPainterV2. a) Plotted with “sectorplot” is the ancestry composition of 100 randomly picked populations in the Human Origins Dataset given K = 8. The ancestry composition of three target populations (Xuun, French, and Dai) is displayed in pie charts at the center of the plot. See Example 1 (Supplementary material online) for the code. b) Plotted with “radiationplot” is the genetic distance from the Tujia group in the Human Origins Dataset to 14 randomly picked populations from East Asia and Southeast Asia. The genetic distance is indicated by the length of the bars radially surrounding the core indicating the target. See Example 2 (Supplementary material online) for the code.

Moreover, we designed some optional graphing elements and features to help annotate or beautify the plot. Users can add arrows from the sector indicating a population to the corresponding target pie chart, or legends that display the color code and names of the ancestry components. Users can also modify the font, size, and color of the target labels, the position of the legend, etc. For the output figures, we removed the option in version 1.0 to output graphs in “.pdf” or “.png” format directly. Instead, users can generate graphs using built-in R functions such as “pdf” and “png”.

Another graphing function implemented in AncestryPainter 2.0, “radiationplot”, can be used to visualize the genetic distance from one target population to other populations. This plotting pattern was first presented in a publication on the ancestral origin of Tibetans (Lu et al. 2016). The required input of this plot is a four-column matrix containing information on populations, regions, genetic differences, and color codes. The plot includes a core indicating the target population, surrounded radially by sectors showing the genetic distance, with outer rings displaying the value range. The number and numeric range of the outer rings can be modified by users as well. Similar to “sectorplot”, the sectors around the core can be automatically sorted according to their values. Moreover, “radiationplot” supports aesthetic and annotation options, such as text size/font and legends (Fig. 1b).

Merging Ancestry Composition Matrices

A statistical function compatible with the graphing functions is also introduced in AncestryPainter 2.0 for merging multiple ancestry proportion matrices estimated from the same dataset with the same ancestry component number (K), yielding the averaged ancestry proportion for each individual. The input can be either the file names of the ancestry proportion data frames or the data frames themselves. Each data frame contains (2 + K) columns: two columns of individual and population annotation and an ancestry proportion matrix of K ancestry components. Users can assign any one of the inputs as the reference matrix for merging. The function “ancmerge” outputs an R list including (i) a merged ancestry proportion matrix with annotation, (ii) a data frame showing the representative group (the one with the largest ancestry proportion of the corresponding component) and the supporting ratio of each ancestry component, (iii) a vector listing the matrices that conformed to the reference, and (iv) a vector listing the matrices that conflicted with it.

Inferring Admixture Topology

Pugach et al. (2016) introduced the admixture history graph (AHG), a method that infers the admixture sequence of multiple ancestry populations by calculating correlation coefficients between ancestry components. It is based on the idea that the ratio of the admixture proportions of two previously admixed ancestries is independent of the proportion of a third ancestry introduced by a subsequent admixture event. AHG has since been employed in several studies (Feng et al. 2017; Ma et al. 2021; Oliveira et al. 2022; Oliveira et al. 2023; Lei et al. 2024) to infer the admixture sequence of ancestry proxies for the target populations (Table 1; supplementary table S1, Supplementary Material online; Methods), and a few updated versions of the original covariance-based metric have been proposed in these studies. Here, we validated the efficiency of different variations of AHG using three admixture scenario models (Fig. 2; Methods). For the “(AB)C” model, with the initial admixture proportions of A and C each varying from 0.1 to 0.9, the metric “cov” achieved accuracy greater than 0.8 only when the initial proportion of C was no more than 0.6, and the distribution of accuracy values was asymmetric, indicating low robustness. Similarly, the metric “mean_cov” performed poorly at extreme proportions of A (0.1 or 0.9). Compared with “cov” and “mean_cov”, the other four metrics (“cor”, “mean_cor”, “cov_log”, and “cor_log”) performed better, although “cor” and “cov_log” could yield low accuracy (<0.5) when the proportion of C was 0.1. In addition, “mean_cor” and “cor_log” had higher accuracy (>0.6) than the other metrics (Fig. 3a; supplementary table S3, Supplementary Material online).

Fig. 2.

Admixture scenario models for the validation of AHG metrics. Simulated populations are marked in the oval frames, and the vertical axis left to the graph shows the admixture time. Admixture proportion is marked beside the arrows indicating admixture: a) (AB)C scenario, in which both p₁ and p₂ vary from 0.1 to 0.9, with a step size of 0.1; b) ((AB)(CD)) scenario, in which both p₁ and p₂ vary from 0.1 to 0.9, with a step size of 0.1, and p₃ varies from 0.2 to 0.5, with a step size of 0.1; c) ((AB)C)D scenario, in which p₁ varies from 0.2 to 0.5, with a step size of 0.1, and both p₂ and p₃ vary from 0.1 to 0.9, with a step size of 0.1. Abbreviation: G ago, Generations ago.

Fig. 3.

The accuracy of six AHG metrics on three admixture scenario models. The heatmaps show the accuracy of six AHG metrics on three admixture scenario models: a) (AB)C scenario; b) ((AB)(CD)) scenario; and c) ((AB)C)D scenario. The legend indicating accuracy values is shown on the right of each heat map. The initial proportion of admixed Population (AB) in b) was specified as 0.2. The initial proportion of Population A in c) was specified as 0.2.

Table 1

Summary of the studies using admixture history graph to infer population admixture and target population description

Author	Year	Population ID	Language family	Description	Admixture sequence of ancestries	AHG metric
Pugach et al.	2016	Evens_Kamchatka	North Tungusic	An Even subgroup from the Kamchatka Peninsula	([Far Eastern, Central Siberian], East Asian)	cov
Feng et al.	2017	XJU	Turkic	Uyghurs in Northwestern China	([East Asia, Central Asia Siberia], [South Asia, West Eurasia])	cor
Ma et al.	2021	NXH	Sino-Tibetan	Huis in Northwestern China	([{South Asia, West Eurasia}, East Asia], Central Asia Siberia), or ([East Asia, Central Asia Siberia], [{South Asia, West Eurasia}, Central Asia Siberia])	cor
Oliveira et al.	2022	NTT	…	Ancient individuals from East Nusa Tenggara, Wallacea islands	([Papuan, Southern Asian], Austronesian)	cov_log
Oliveira et al.	2023	Kwepe	Kwadi	Kwepes from the Angolan Namib Desert	([Southern Africa, East Africa], West Africa)	cov_log
Lei et al.	2024	KZK	Turkic	Kazakhs in Northwestern China	([East Asia, Central Asia Siberia], [South Asia, West Eurasia])	cor_log

Note: Three studies (Pugach et al. 2016; Oliveira et al. 2022, 2023) involved multiple admixed populations and only one of the populations has been picked as an example for each study in this table. For the full list of target populations, see supplementary table S1, Supplementary Material online.

When validating the AHG metrics on the (AB)(CD) model, we chose a relatively biased proportion (0.2) for the admixed population (AB) to make the differences among metrics more prominent (Fig. 3b), while the initial proportions of A and C varied from 0.1 to 0.9 (step size: 0.1). The metric “cov” showed the worst performance, and “mean_cov” could yield very low accuracy when A had an extremely low or high initial proportion (0.1 or 0.9). The remaining metrics performed better, although an extreme proportion of A or C (0.1 or 0.9) could also reduce their accuracy. Among these metrics, “cor”, “mean_cor”, and “cor_log” had relatively higher accuracy, ranging from 0.44 to 1, while the accuracy of “cov_log” could drop below 0.4 (Fig. 3b; supplementary table S4, Supplementary Material online).

For the ((AB)C)D model, we specified the initial proportion of A as 0.2, with the proportions of C and D ranging from 0.1 to 0.9. Similar to the results for the (AB)(CD) model, “cov_log”, “cor”, “mean_cor”, and “cor_log” outperformed “cov” and “mean_cov”. Moreover, “cov_log” and “cor_log” had higher median accuracy (>0.4) than “cor” (0.22) and “mean_cor” (0.38), indicating that the “cov_log” and “cor_log” metrics were more likely to yield an accurate admixture topology (Fig. 3c; supplementary table S5, Supplementary Material online).

Overall, the metric “cor_log” showed the best performance among all metrics. We further evaluated its robustness with varying proportions (0.3, 0.4, 0.5) of (AB) in the (AB)(CD) model and of A in the ((AB)C)D model (supplementary fig. S1, Supplementary Material online; supplementary tables S6 and S7, Supplementary Material online). “cor_log” obtained an accuracy greater than 0.7 in most instances for the (AB)(CD) model (supplementary fig. S1a, Supplementary Material online; supplementary table S6, Supplementary Material online). For the ((AB)C)D model, if the extremely biased instances (the proportion of C or D = 0.1; the proportion of C or D = 0.9) were not taken into account, 90% of the “cor_log” results were greater than 0.5 (supplementary fig. S1b, Supplementary Material online; supplementary table S7, Supplementary Material online). Given that admixture scenarios similar to “(AB)(CD)” were more prevalent than “((AB)C)D” in previous studies (Feng et al. 2017; Ma et al. 2021; Lei et al. 2024), the lower performance of “cor_log” on the “((AB)C)D” scenario model might not hinder its application in population admixture studies.

We evaluated the efficiency of AHG metrics with a real dataset composed of two African American populations (African Americans in Southwest United States [ASW] and African Caribbeans in Barbados [ACB]) as well as their African and European ancestry proxies. The evaluation was based on the assumptions that (i) due to the latitudinal proximity and navigation systems, Western African populations contributed the major African ancestry to the African offspring living in Caribe-Central/North America (namely, ACB and ASW in this study); (ii) Western African populations were themselves previously admixed and carry multiple ancestry components; and (iii) after the admixture of European and African ancestry components, the African descendants living in America underwent post-admixture homogenization to different degrees, with the African descendants in the United States largely affected and the African offspring in Barbados much more mildly influenced (Gouveia et al. 2020). We ran ADMIXTURE to infer the ancestry composition of ASW and ACB together with the reference populations across Africa and Europe. The results at K = 5 showed distinctly separated ancestries of African populations (respectively represented by the populations in Western Africa [AFR_W], Western/Central Africa [AFR_C], Eastern Africa [AFR_E], and Southern Africa [AFR_S]) and a highly admixed population structure of African offspring in America (Fig. 4a). We performed AHG tests with 5,000 resamplings on the two major African ancestry components. As expected, strong AHG signals of “(AFR_W, AFR_C), EUR”, meaning that the two African ancestries, AFR_W and AFR_C, admixed before the introduction of EUR, were detected by all AHG metrics except “cov”. However, due to the post-admixture homogenization, more Western/Central African-derived ancestry was likely to be introduced into the gene pool of African offspring in the United States; thus, AHG signals of “(AFR_W, EUR), AFR_C” could be explicitly observed, though with a lower supporting number than “(AFR_W, AFR_C), EUR”. The AHG metrics “cov_log”, “mean_cor”, and “cor_log” detected these signals with supporting numbers of 619, 395, and 253, respectively, while the supporting numbers of the other three metrics were close to 0. Meanwhile, African offspring in Barbados were less affected by the post-admixture homogenization. As a consequence, the supporting numbers of “(AFR_C, EUR), AFR_W” and “(AFR_W, EUR), AFR_C” were expected to be relatively smaller than that of “(AFR_W, AFR_C), EUR” in the ACB population. The metrics “mean_cov”, “cor_log”, and “mean_cor”, with supporting numbers of 256, 735, and 786 for the “(AFR_C, EUR), AFR_W” signal, outperformed the other metrics, whose supporting numbers were higher than 800 (Fig. 4b). Considering the good performance of “cor_log” on both simulated and real data, we employed “cor_log” as the AHG metric of AncestryPainter 2.0.

Fig. 4.

The accuracy of six AHG metrics on the real dataset. a) ADMIXTURE results (K = 5) of the real dataset, plotted with the “sectorplot” function in AncestryPainter 2.0; b) Admixture sequence inferred using AHG. The admixture topology “(A,B), C” means ancestry A admixes with ancestry B, and then ancestry C joins in the admixed ancestry. Abbreviation: AFR_W, Western African ancestry; AFR_C, Western/Central African ancestry; AFR_E, Eastern African ancestry; AFR_S, Southern African ancestry; EUR, European ancestry.

Future Developments

In this study, we developed a new version of AncestryPainter which can be used to illustrate the ancestry compositions and genetic distance along with statistical functions to merge multiple ancestry proportion matrices or infer admixture topology. Moreover, we introduced the AHG algorithm into AncestryPainter for the inference of admixture topology. We compared the accuracy of six AHG metrics on three different admixture scenario models using simulated populations. The metric “cor_log” showed an overall better performance than other metrics, and thus, we implemented this metric in the AHG function.

The AHG method is easy to operate and has a high accuracy with the (AB)C and (AB)(CD) admixture scenario models. However, the accuracy of all AHG metrics is low when the proportion of any ancestor is too small. This can be interpreted as the effect of genetic drift, which can be simulated by AdmixSim2 (Zhang et al. 2021b). When descendants are generated, an ancestry component may be lost or drastically decreased due to genetic drift, and the ancestry proportion in the descendants tends to form a truncated normal distribution with large variance, which disturbs the correlation between previously admixed ancestry components. Accordingly, none of the AHG metrics performs well in the ((AB)C)D scenario, a continuous admixture model, which may result from the large variance of each ancestral component after admixture. The AHG accuracy for the ((AB)C)D scenario might increase if all four ancestries contribute a substantial admixture proportion (supplementary fig. S1b, Supplementary Material online). Collectively, AHG can be used as a “preliminary estimate” of the admixture topology and has to be combined with other methods, e.g. the three-population test (f₃) (Patterson et al. 2012).

While the graphing module of AncestryPainter has greatly facilitated the visualization of ancestry composition in recent population genetic studies (Sala et al. 2019; Cerny et al. 2021; Guzman-Solis et al. 2021; Khan and Khan 2021; Ma et al. 2021; Urnikyte et al. 2021; Zhang et al. 2021a, 2021c; Aboagye et al. 2022; Changmai et al. 2022; Wang et al. 2022; Li et al. 2024; Su et al. 2024; Sun et al. 2024; Zhou et al. 2024) and has been equipped with many new features in this study, there is still room for more functions and features, for instance, using multiple concentric circles in a single image to display the ancestry makeup under different numbers of ancestry components, or annotating subgroup information on a finer scale. Furthermore, tree structures or network graphs could be added to display the phylogenetic relationships of populations and the admixture topology.

Materials and Methods

Example of Dataset

To illustrate the utilities of the graphing modules and the “ancmerge” function in AncestryPainter 2.0, we used the genome-wide single-nucleotide polymorphisms (SNPs) of 2,415 modern human individuals in the Human Origins (HO) dataset (Lazaridis et al. 2014) and 7 Kyrgyz individuals from the Estonian Biocentre Human Genome Diversity Panel (EGDP) (Pagani et al. 2016) to generate the example data. We merged the HO and EGDP data using bcftools (Danecek et al. 2021) and ran ADMIXTURE (Alexander et al. 2009) to estimate the ancestry makeup of the individuals in ten repeats with different SNP subsets, specifying the ancestry component number (K) as eight. The ten SNP subsets were generated using the method described in the post-admixture adaptation study of Xinjiang Uyghurs (Pan et al. 2022). In addition, we ran an in-house Python script to calculate the genetic distance (F_ST) (Weir and Cockerham 1984) between populations.

Using Sectors for Visualization

The graphic functions of AncestryPainter 2.0 are built primarily on the R package “graphics”. The sectors visualizing ancestry proportion or genetic distance are plotted with the function “polygon” in “graphics”. The coordinates of the sectors on the canvas depend on (i) the order of the ancestry components in the input data and (ii) the initial plotting position. The sector size is proportional to the ancestry proportion or genetic distance. In addition, we use other functions in the “graphics” package, such as “text” and “arrows”, to annotate sectors.
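To make the geometry concrete, the following Python sketch computes the vertices that a polygon-drawing routine would need for an annular sector. It is only an illustration under stated assumptions: the function names are hypothetical, and stacking one individual's ancestry components radially within a fixed angular slot is assumed here rather than taken from the package source.

```python
import math


def sector_polygon(r_inner, r_outer, start_deg, end_deg, n=50):
    """Vertices of an annular sector approximated by a closed polygon."""
    angles = [math.radians(start_deg + (end_deg - start_deg) * i / n)
              for i in range(n + 1)]
    # Walk along the outer arc, then back along the inner arc.
    outer = [(r_outer * math.cos(a), r_outer * math.sin(a)) for a in angles]
    inner = [(r_inner * math.cos(a), r_inner * math.sin(a))
             for a in reversed(angles)]
    return outer + inner


def stacked_sectors(proportions, r_base, r_max, start_deg, end_deg):
    """Stack components radially in one angular slot (assumed layout).

    Each component's radial thickness is proportional to its proportion,
    in the order given by the input.
    """
    polygons, r = [], r_base
    for p in proportions:
        r_next = r + p * (r_max - r_base)
        polygons.append(sector_polygon(r, r_next, start_deg, end_deg))
        r = r_next
    return polygons


polys = stacked_sectors([0.25, 0.75], r_base=1.0, r_max=2.0,
                        start_deg=0.0, end_deg=10.0)
```

Each returned vertex list could be passed to any polygon-drawing function; for the genetic distance plot, a single sector per population with its outer radius scaled by the distance would play the same role.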

Merging Ancestry Proportion Matrices

This function was translated from the in-house Python script authored by Pan et al. (2022) (https://github.com/Shuhua-Group/ADMIXTURE.merge). It merges the ancestry proportion matrices (called “target matrices”) estimated by software such as ADMIXTURE from the same dataset with the same ancestry component number (K). The function calculates the correlation (measured by the Pearson coefficient) between each ancestry component in a user-defined reference matrix (i.e. a reference component) and each of the ancestry components in the matrices to be merged (i.e. target components), and then matches the reference component with the target component of the highest correlation coefficient. The function counts the number of target components matched with each reference component and calculates the supporting ratio of every ancestry component in the reference matrix. The supporting ratio is defined as the ratio of the number of matched target components to the total number of target matrices. In the merged matrix, the proportion of an ancestry component for each individual is the average over the group of matched ancestry components. A target matrix with all ancestry components matching those of the reference is defined as a consensus matrix “supporting” the reference; otherwise, it is regarded as a “conflicted” one. A larger number of consensus matrices indicates higher reliability of the reference, and vice versa.
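The matching-and-averaging procedure can be sketched in plain Python as follows. This is neither the “ancmerge” source nor the script of Pan et al. (2022); the function names are hypothetical, and several details are assumptions: a reference component counts as matched in a target matrix only when its best-correlated target component is not claimed by another reference component, and only the reference and the consensus matrices enter the average.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def columns(matrix):
    """Transpose a row-major matrix into a list of columns."""
    return [list(col) for col in zip(*matrix)]


def merge_matrices(reference, targets):
    """Merge individuals-by-K matrices against a reference (illustrative)."""
    ref_cols = columns(reference)
    k, n = len(ref_cols), len(reference)
    support = [0] * k
    aligned = [reference]
    conformed, conflicted = [], []
    for t_idx, target in enumerate(targets):
        tgt_cols = columns(target)
        # Best-correlated target component for each reference component.
        best = [max(range(k), key=lambda j: pearson(ref_cols[i], tgt_cols[j]))
                for i in range(k)]
        for i in range(k):
            if best.count(best[i]) == 1:  # assumed definition of "matched"
                support[i] += 1
        if len(set(best)) == k:  # one-to-one matching: consensus matrix
            conformed.append(t_idx)
            aligned.append([[row[best[i]] for i in range(k)] for row in target])
        else:
            conflicted.append(t_idx)
    merged = [[sum(m[r][c] for m in aligned) / len(aligned) for c in range(k)]
              for r in range(n)]
    ratios = [s / len(targets) for s in support]
    return merged, ratios, conformed, conflicted


reference = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.2, 0.8], [0.7, 0.3], [0.9, 0.1]]  # columns permuted
merged, ratios, conformed, conflicted = merge_matrices(reference, [swapped])
```

In the toy input, the target matrix carries the same components in a different column order, so it is realigned to the reference before averaging and reported as a consensus matrix.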

AHG Metrics

In the original AHG test (Pugach et al. 2016), the correlation coefficient is estimated as the covariance of (i) the ratio of the admixture proportions of two randomly picked ancestry components and (ii) the admixture proportion of the third ancestry component. For example, suppose an already admixed population with two ancestry components, A and B, undergoes another episode of admixture that introduces a new ancestry component C, and the arrays of individual ancestry proportions for components A, B, and C are available. The correlation coefficient can then be calculated as follows:

$$\mathrm{coef}(A, B; C) = \left| \mathrm{cov}\left(\frac{A}{B}, C\right) \right| \qquad (1,\ \text{“cov”})$$

Under the true topology (A, B), C, this coefficient is expected to be zero. In practice, the admixture topology with the lowest correlation coefficient among the three combinations, i.e. $\mathrm{coef}(A, B; C)$, $\mathrm{coef}(A, C; B)$, and $\mathrm{coef}(B, C; A)$, is inferred as the best fit. The supporting ratio of each admixture topology can be estimated by using ancestry proportion arrays of randomly picked individuals from the given population.
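The selection rule and the resampling step can be sketched in Python as follows. This is an illustration, not the package's AHG function: the names `coef_cov` and `ahg_support` are hypothetical, resampling individuals with replacement is an assumption, and all proportions are assumed to be strictly positive so that the ratios are defined.

```python
import random


def cov(x, y):
    """Sample covariance of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)


def coef_cov(a, b, c):
    """Original "cov" metric: |cov(A/B, C)|."""
    ratio = [ai / bi for ai, bi in zip(a, b)]
    return abs(cov(ratio, c))


def ahg_support(a, b, c, n_resamples=1000, sample_size=None, seed=0):
    """Count how often each topology has the lowest coefficient."""
    rng = random.Random(seed)
    n = len(a)
    size = sample_size or n
    counts = {"(A,B),C": 0, "(A,C),B": 0, "(B,C),A": 0}
    for _ in range(n_resamples):
        # Assumption: individuals are drawn with replacement.
        idx = [rng.randrange(n) for _ in range(size)]
        sa, sb, sc = ([v[i] for i in idx] for v in (a, b, c))
        scores = {
            "(A,B),C": coef_cov(sa, sb, sc),
            "(A,C),B": coef_cov(sa, sc, sb),
            "(B,C),A": coef_cov(sb, sc, sa),
        }
        counts[min(scores, key=scores.get)] += 1
    return counts
```

Dividing each count by `n_resamples` gives a supporting ratio per topology; the raw counts correspond to the supporting numbers reported for the real dataset.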

This metric was modified and then applied in our previous studies of the Uyghurs (Feng et al. 2017) and the Huis (Ma et al. 2021) in Northwestern China, in which the covariance was replaced by the Pearson coefficient, because the latter can adjust for the bias caused by differences in admixture proportion among ancestry components:

$$\mathrm{coef}(A, B; C) = \left| \mathrm{cor}\left(\frac{A}{B}, C\right) \right| \qquad (2,\ \text{“cor”})$$

However, the correlation coefficient values of the same population combination (e.g. $\mathrm{coef}(A, B; C)$ and $\mathrm{coef}(B, A; C)$) can differ if the positions of the ancestry components in the fraction are swapped (e.g. replacing A/B by B/A). To solve this issue, we defined a novel metric as the arithmetic mean of the two covariance or correlation values obtained by swapping the ancestry components:

$$\mathrm{coef}(A, B; C) = \frac{1}{2} \left| \mathrm{cov}\left(\frac{A}{B}, C\right) \right| + \frac{1}{2} \left| \mathrm{cov}\left(\frac{B}{A}, C\right) \right| \qquad (3,\ \text{“mean\_cov”})$$

\begin{aligned} \begin{matrix} coef (A, B; C) = \frac{1}{2} | cor (\frac{A}{B}, C) | + \frac{1}{2} | cor (\frac{B}{A}, C) | \\ (4, " mean_cor ") \end{matrix} \end{aligned}

In addition, Oliveira et al. (2022) updated the original AHG metric by introducing a logarithmic transformation, which eliminates the effect of swapping the positions of the ancestry components:

$$\mathrm{coef}(A, B; C) = \left| \mathrm{cov}\left( \log \frac{A}{B}, \log C \right) \right| \qquad (5,\ \text{"cov\_log"})$$
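The swap invariance of the log-transformed metric can be checked in one line. Because $\log (B/A) = -\log (A/B)$ and covariance is linear in each argument, swapping the two components only flips the sign inside the absolute value:

$$\left| \mathrm{cov}\left( \log \frac{B}{A}, \log C \right) \right| = \left| -\mathrm{cov}\left( \log \frac{A}{B}, \log C \right) \right| = \left| \mathrm{cov}\left( \log \frac{A}{B}, \log C \right) \right|$$

The same argument applies when the covariance is replaced by the Pearson correlation coefficient, since the standard deviation of $\log (A/B)$ is unchanged by the sign flip.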

Drawing on the metrics above, we also defined a Pearson correlation-based counterpart of the log-transformed metric:

$$\mathrm{coef}(A, B; C) = \left| \mathrm{cor}\left( \log \frac{A}{B}, \log C \right) \right| \qquad (6,\ \text{"cor\_log"})$$
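The six metrics can be written compactly in code. The following is a minimal Python sketch that assumes per-individual ancestry proportions are given as plain lists and that all proportions are strictly positive. It is not the R implementation shipped with AncestryPainter 2.0; the function names (`ahg_coef`, `best_topology`, `toy_abc`) and the toy data generator are hypothetical and serve only as illustration.

```python
import math
import random

def cov(x, y):
    """Sample covariance of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((p - mx) * (q - my) for p, q in zip(x, y)) / (n - 1)

def cor(x, y):
    """Pearson correlation coefficient."""
    return cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))

def ahg_coef(a, b, c, metric="mean_cor"):
    """coef(A, B; C) under one of the six metrics; all proportions must be > 0."""
    ab = [p / q for p, q in zip(a, b)]
    ba = [q / p for p, q in zip(a, b)]
    log_ab = [math.log(r) for r in ab]
    log_c = [math.log(v) for v in c]
    metrics = {
        "cov": lambda: abs(cov(ab, c)),
        "cor": lambda: abs(cor(ab, c)),
        "mean_cov": lambda: 0.5 * abs(cov(ab, c)) + 0.5 * abs(cov(ba, c)),
        "mean_cor": lambda: 0.5 * abs(cor(ab, c)) + 0.5 * abs(cor(ba, c)),
        "cov_log": lambda: abs(cov(log_ab, log_c)),
        "cor_log": lambda: abs(cor(log_ab, log_c)),
    }
    return metrics[metric]()

def best_topology(props, metric="mean_cor"):
    """Return the (X, Y, Z) labelling with the lowest coef(X, Y; Z), plus all scores."""
    names = sorted(props)
    scores = {}
    for z in names:
        x, y = [n for n in names if n != z]
        scores[(x, y, z)] = ahg_coef(props[x], props[y], props[z], metric)
    return min(scores, key=scores.get), scores

def toy_abc(n, rng):
    """Toy (AB)C individuals (assumption): the A/B split is drawn independently of C."""
    props = {"A": [], "B": [], "C": []}
    for _ in range(n):
        r = rng.uniform(0.45, 0.55)  # share of A within the (AB) part
        c = rng.uniform(0.10, 0.50)  # share of the later-arriving component C
        props["A"].append(r * (1 - c))
        props["B"].append((1 - r) * (1 - c))
        props["C"].append(c)
    return props

toy = toy_abc(2000, random.Random(1))
for name in ("cov", "cor", "mean_cov", "mean_cor", "cov_log", "cor_log"):
    print(name, best_topology(toy, name)[0])
```

In the toy data the ratio A/B is independent of C by construction, so each metric should give its lowest value to the labelling in which C is the last component.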

Validation of AHG Metrics Using Simulated Data

To validate the performance of these metrics, we tested our methods and previously published ones on three types of admixture scenario models (Fig. 2). These models were built on the model proposed by Feng et al. (2017).

(AB)C scenario: Populations A and B were initially admixed 120 generations ago to form the AB population. After 90 generations of self-evolution, the AB population was then mixed with Population C 30 generations ago, leading to the formation of the initial (AB)C population. This (AB)C population underwent a further 30 generations of self-evolution to arrive at the final (AB)C population (Fig. 2a). When Population A and Population B admixed to form Population (AB), the admixture proportion of Population A was varied incrementally from 0.1 to 0.9, with a step size of 0.1. Similarly, when Population (AB) was subsequently admixed with Population C, the admixture proportion of Population C also varied incrementally from 0.1 to 0.9, with a step size of 0.1.
((AB)(CD)) scenario: Populations A and B underwent admixture 120 generations ago to form the composite Population (AB), while concurrently, Populations C and D were mixed to form the composite Population (CD). Each of these newly formed populations, (AB) and (CD), then proceeded through 90 generations of isolated evolution before admixing 30 generations ago, giving rise to the initial combined population (AB)(CD). This combined population (AB)(CD) then experienced 30 generations of self-evolution to reach its final state (Fig. 2b). During the admixture event between Populations A and B, the contribution from Population A was incrementally set from 0.1 to 0.9 in steps of 0.1. Similarly, for the admixture between Populations C and D, Population C's contribution was also incrementally set from 0.1 to 0.9 in steps of 0.1. During the admixture of composite Populations (AB) and (CD), the proportion of (AB) was set at 0.2, 0.3, 0.4, and 0.5.
((AB)C)D scenario: Populations A and B were initially admixed 150 generations ago to form the composite (AB) population. This AB population then underwent 90 generations of independent evolution before engaging in admixture with Population C 60 generations ago, culminating in the formation of the (AB)C population. After a further 30 generations of self-evolution, this (AB)C population was then admixed with Population D to form the ((AB)C)D population 30 generations ago. The ((AB)C)D population continued to evolve independently for an additional 30 generations to achieve its final genetic composition (Fig. 2c). During the admixture event between Populations A and B, the admixture proportions from Population A were specifically set at 0.2, 0.3, 0.4, and 0.5. When Population (AB) was admixed with Population C, the admixture proportion from Population C varied sequentially from 0.1 to 0.9, with a step increment of 0.1. Subsequently, for the admixture event between Population (AB)C and Population D, Population D's admixture proportion also ranged sequentially from 0.1 to 0.9, with the same step increment of 0.1.
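The parameter grids described for the three scenarios can be enumerated as in the sketch below. The helper names are hypothetical, and the expected final proportions are a simple deterministic calculation that ignores drift and sampling, so they are not the output of any simulator.

```python
from itertools import product

STEPS_9 = [round(0.1 * i, 1) for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
STEPS_4 = [0.2, 0.3, 0.4, 0.5]

def expected_abc(a, c):
    """(AB)C: a = share of A in (AB); c = share of C in the final admixture."""
    return {"A": a * (1 - c), "B": (1 - a) * (1 - c), "C": c}

def expected_ab_cd(a, c, ab):
    """((AB)(CD)): a = share of A in (AB); c = share of C in (CD); ab = share of (AB)."""
    return {"A": ab * a, "B": ab * (1 - a),
            "C": (1 - ab) * c, "D": (1 - ab) * (1 - c)}

def expected_abc_d(a, c, d):
    """((AB)C)D: a = share of A in (AB); c = share of C in (AB)C; d = share of D."""
    return {"A": a * (1 - c) * (1 - d), "B": (1 - a) * (1 - c) * (1 - d),
            "C": c * (1 - d), "D": d}

# One tuple of admixture proportions per simulated parameter combination.
grids = {
    "(AB)C": list(product(STEPS_9, STEPS_9)),
    "((AB)(CD))": list(product(STEPS_9, STEPS_9, STEPS_4)),
    "((AB)C)D": list(product(STEPS_4, STEPS_9, STEPS_9)),
}

for scenario, grid in grids.items():
    print(scenario, len(grid), "parameter combinations")
```

Under these grids, the number of parameter combinations per scenario follows directly from the step counts (9 × 9, 9 × 9 × 4, and 4 × 9 × 9).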

Populations in the three kinds of admixture scenario models were simulated by AdmixSim2 (Zhang et al. 2021b), an individual-based forward-time simulation tool that can flexibly and efficiently simulate population genomics data under complex evolutionary scenarios. For all three scenarios, the Populations A, B, C, and D were randomly generated in accordance with the AdmixSim2 manual (https://github.com/Shuhua-Group/AdmixSim2/tree/master), without involving specific population information.

During the simulation, we maintained a constant sample size of 5,000 individuals per population per generation. Owing to factors such as genetic drift, some ancestry components may be present at very low proportions (below 1 × 10⁻⁶) in the output. These minor ancestry components were set to a threshold value of 1 × 10⁻⁶, and the proportions of the remaining ancestry components were adjusted accordingly so that all ancestry component proportions sum to 1. In addition, we required the proportion of each ancestry component in the final admixed population to be greater than 0.
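The thresholding step can be sketched as follows. The text does not specify how the remaining components were adjusted; this sketch assumes a proportional rescaling, and the function name `floor_and_renormalize` is hypothetical.

```python
def floor_and_renormalize(props, floor=1e-6):
    """Raise components below `floor` to `floor`, then rescale the others so the total is 1.

    Assumption: the remaining components are rescaled proportionally.
    """
    is_low = [p < floor for p in props]
    n_low = sum(is_low)
    rest = sum(p for p, low in zip(props, is_low) if not low)
    if rest <= 0:
        raise ValueError("at least one component must be at or above the floor")
    # Mass left for the unfloored components after reserving `floor` for each low one.
    scale = (1.0 - n_low * floor) / rest
    return [floor if low else p * scale for p, low in zip(props, is_low)]

# Example: the third component was lost to drift in this individual.
print(floor_and_renormalize([0.7, 0.3, 0.0]))
```

Keeping every component strictly positive matters for the ratio-based and log-transformed metrics, which are undefined when a proportion is exactly zero.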

Further, we generated simulated data under the (AB)C scenario to evaluate the impact of sample size on algorithmic performance, using the “mean_cor” metric. When Population A and Population B were mixed to form Population (AB), the admixture proportion of Population A was varied from 0.1 to 0.5 with a step size of 0.1. Similarly, when Population (AB) was subsequently admixed with Population C, the admixture proportion of Population C was varied from 0.1 to 0.9 with a step size of 0.1. After the simulation data were prepared, we sampled 25, 50, 75, and 100 individuals in turn from the admixed population, repeating the process 100 times for each sample size. We recorded the number of replicates in which the algorithm inferred the correct admixture model (accuracy). A larger sample size resulted in higher inference accuracy, and all the methods reached their highest accuracy at a sample size of 100 (supplementary table S2, Supplementary Material online). Therefore, we sampled 100 individuals from the simulated population 100 times for each admixture scenario to calculate the AHG metrics.
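The resampling procedure can be illustrated with a small Python sketch. It assumes a toy admixed population in which the A/B split is independent of C, rather than AdmixSim2 output, and all function names are hypothetical; the accuracy values it prints are therefore not comparable to those in supplementary table S2.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    return sxy / math.sqrt(sxx * syy)

def mean_cor(a, b, c):
    """The "mean_cor" metric: mean of |cor(A/B, C)| and |cor(B/A, C)|."""
    ab = [p / q for p, q in zip(a, b)]
    ba = [q / p for p, q in zip(a, b)]
    return 0.5 * abs(pearson(ab, c)) + 0.5 * abs(pearson(ba, c))

def infer_last_component(rows):
    """rows: (A, B, C) proportions per individual; returns the component inferred to be last."""
    a, b, c = zip(*rows)
    scores = {"C": mean_cor(a, b, c), "B": mean_cor(a, c, b), "A": mean_cor(b, c, a)}
    return min(scores, key=scores.get)

def accuracy(population, n_sample, n_rep, truth, rng):
    """Fraction of resampling replicates that recover the true last component."""
    hits = sum(
        infer_last_component(rng.sample(population, n_sample)) == truth
        for _ in range(n_rep)
    )
    return hits / n_rep

def toy_population(n, rng):
    """Toy (AB)C population (assumption): A/B split drawn independently of C."""
    rows = []
    for _ in range(n):
        r, c = rng.uniform(0.45, 0.55), rng.uniform(0.10, 0.50)
        rows.append((r * (1 - c), (1 - r) * (1 - c), c))
    return rows

rng = random.Random(7)
population = toy_population(5000, rng)
for n_sample in (25, 50, 75, 100):
    print(n_sample, accuracy(population, n_sample, 100, "C", rng))
```

The same loop yields the supporting ratio of each topology if the inferred labels are tallied instead of being compared with a known truth.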

Inferring Admixture Sequence of African American Populations Using AHG

We validated the AHG metrics using a real human genome dataset constructed following a published study of the origins of African American populations (Gouveia et al. 2020). We chose the ASW and ACB individuals in the 1000 Genomes Project phase 3 release (KGP) (1000 Genomes Project Consortium 2015) as target populations and selected individuals of African and European ancestry from HO and KGP as the reference panel: (i) Eastern African populations: BantuSA, Tswana, Wambo, Ju_hoan_North, Taa_North, and Ju_hoan_South from HO; LWK from KGP; (ii) Southern African populations: Dinka, Luo, and Sandawe from HO; (iii) Western African and Central African populations from KGP: ESN, YRI, GWD, and MSL; and (iv) European populations from KGP: Utah residents (CEPH) with Northern and Western European ancestry (CEU) and Iberian populations in Spain (IBS). The SNPs of the selected HO and KGP individuals were merged using bcftools (Danecek et al. 2021) and then pruned with plink (Purcell et al. 2007).

To infer the ancestry proportions of the real dataset for the validation of AHG metrics, we ran ADMIXTURE on the merged dataset with K ranging from 2 to 10. For each K, ADMIXTURE was run ten times with different random seeds, and the output “.Q” files were merged using the “ancmerge” function in AncestryPainter 2.0. The presented results are based on the ADMIXTURE runs with K = 5, which had the lowest cross-validation error (supplementary fig. S2, Supplementary Material online). We excluded outliers with substantial non-African ancestry from the AHG tests. We performed AHG tests using the six metrics on the main ancestry components of ASW and ACB, resampling 30 individuals 5,000 times. AHG metrics of both simulated and real data were calculated and plotted using R.
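Choosing K by cross-validation error can be sketched as below. The sketch assumes that each ADMIXTURE log contains a line of the form `CV error (K=5): 0.50000` and that the errors of the ten seeds are averaged per K; how the runs were actually summarized is not stated in the text. The function name `best_k` is hypothetical, and the numbers in the example are toy values, not results from this study.

```python
import re
from collections import defaultdict
from statistics import mean

# Assumed format of the cross-validation line in an ADMIXTURE log.
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")

def best_k(log_lines):
    """Return (K with the lowest mean CV error, {K: mean CV error})."""
    errors = defaultdict(list)
    for line in log_lines:
        match = CV_LINE.search(line)
        if match:
            errors[int(match.group(1))].append(float(match.group(2)))
    mean_error = {k: mean(v) for k, v in errors.items()}
    return min(mean_error, key=mean_error.get), mean_error

# Toy input: two seeds per K, made-up values for illustration only.
toy_lines = [
    "CV error (K=4): 0.520", "CV error (K=4): 0.530",
    "CV error (K=5): 0.500", "CV error (K=5): 0.510",
    "CV error (K=6): 0.515", "CV error (K=6): 0.520",
]
k, table = best_k(toy_lines)
print(k, table)
```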

Supplementary Material

Supplementary material is available at Genome Biology and Evolution online.

Acknowledgments

We thank Dr Qidi Feng for sharing her experience on the first version of this package. We thank Dr Xumin Ni for his suggestions on the evaluation of AHG metrics. We also thank Dr Alec Downie and Dr Iker Rivas-González who provided advice on the presentation of background information. The computational work in this study was supported by the CFFF Computing Platform and the Human Phenome Data Center of Fudan University.

Authors’ Contributions

S.X. conceived and designed the study and supervised the project. S.C., C.L., H.Z., Y.P., and D.L. contributed to the computer code. C.L. developed a key algorithm. H.Z. examined and improved the computer code. S.C. coordinated the computer coding and packaged the software. S.C. and C.L. drafted the manuscript. S.X. revised the manuscript. All authors read and approved the final manuscript.

Funding

This work was supported by the National Key Research and Development Program of China [No. 2023YFC2605400]; the National Natural Science Foundation of China (NSFC) grants [32288101, 32030020]; the Shanghai Science and Technology Commission Program [23JS1410100]; and the Office of Global Partnerships (Key Projects Development Fund).

Data Availability

Example data and source code of AncestryPainter 2.0 are available on GitHub, at https://github.com/Shuhua-Group/AncestryPainterV2 and on the HumPOG lab website, at https://pog.fudan.edu.cn/#/Software.

Literature Cited

1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015:526(7571).

Aboagye, Adadey, Esoh, Jonas, de Kock, Amenga-Etego, Awandare, Wonkam. Age estimate of GJB2-p.(Arg143Trp) founder variant in hearing impairment in Ghana, suggests multiple independent origins across populations. Biology (Basel). 2022:476. https://doi.org/10.3390/biology11030476

Alexander, Novembre, Lange. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009:1655–1664. https://doi.org/10.1101/gr.094052.109

Cerny, Fortes-Lima, Triska. Demographic history and admixture dynamics in African Sahelian populations. Hum Mol Genet. 2021:R29–R36.

Changmai, Pinhasi, Pietrusewsky, Stark, Ikehara-Quebral, Reich, Flegontov. Ancient DNA from Protohistoric Period Cambodia indicates that South Asians admixed with local populations as early as 1st–3rd centuries CE. Sci Rep. 2022:22507. https://doi.org/10.1038/s41598-022-26799-3

Danecek, Bonfield, Liddle, Marshall, Ohan, Pollard, Whitwham, Keane, McCarthy, Davies, et al. Twelve years of SAMtools and BCFtools. Gigascience. 2021:giab008. https://doi.org/10.1093/gigascience/giab008

Feng. AncestryPainter: a graphic program for displaying ancestry composition of populations and individuals. Genomics Proteomics Bioinformatics. 2018:382–385. https://doi.org/10.1016/j.gpb.2018.05.002

Feng, Yuan, Yang, Liu, Lou, Ning, Wang, et al. Genetic history of Xinjiang's Uyghurs suggests bronze age multiple-way contacts in Eurasia. Mol Biol Evol. 2017:2572–2582. https://doi.org/10.1093/molbev/msx177

Gouveia, Borda, Leal, Moreira, Bergen, Kehdy, Alvim, Aquino, Araujo, et al. Origins, admixture dynamics, and homogenization of the African gene pool in the Americas. Mol Biol Evol. 2020:1647–1656. https://doi.org/10.1093/molbev/msaa033

Guzman-Solis, Villa-Islas, Bravo-Lopez, Sandoval-Velasco, Wesp, Gomez-Valdes, Moreno-Cabrera, Meraz, Solis-Pichardo, Schaaf, et al. Ancient viral genomes reveal introduction of human pathogenic viruses into Mexico during the transatlantic slave trade. eLife. 2021:e68612.

Khan. Genome-wide population structure inferences of human coxsackievirus-A; insights the genotypes diversity and evolution. Infect Genet Evol. 2021:105068. https://doi.org/10.1016/j.meegid.2021.105068

Lazaridis, Patterson, Mittnik, Renaud, Mallick, Kirsanow, Sudmant, Schraiber, Castellano, Lipson, et al. Ancient human genomes suggest three ancestral populations for present-day Europeans. Nature. 2014:513(7518):409–413.

Lei, Liu, Zhang, Pan, Gao, Yang, Guan, Mamatyusupu, et al. Ancestral origins and admixture history of Kazakhs. Mol Biol Evol. 2024:msae144. https://doi.org/10.1093/molbev/msae144

Wang, Duan, Sun, Chen, Wang, Sun, Yang, Chen, et al. Evolutionary history and biological adaptation of Han Chinese people on the Mongolian Plateau. hLife. 2024:296–313. https://doi.org/10.1016/j.hlife.2024.04.005

Lou, Yuan, Wang, Zhang, Yang, Deng, Zhou, et al. Ancestral origins and genetic history of Tibetan highlanders. Am J Hum Genet. 2016:580–594. https://doi.org/10.1016/j.ajhg.2016.07.002

Yang, Gao, Pan, Chen. Genetic origins and sex-biased admixture of the Huis. Mol Biol Evol. 2021:3804–3819. https://doi.org/10.1093/molbev/msab158

Oliveira, Fehn, Amorim, Stoneking, Rocha. Genome-wide variation in the Angolan Namib Desert reveals unique pre-Bantu ancestry. Sci Adv. 2023:eadh3822. https://doi.org/10.1126/sciadv.adh3822

Oliveira, Nagele, Carlhoff, Pugach, Koesbardiati, Hubner, Meyer, Oktaviana, Takenaka, Katagiri, et al. Ancient genomes from the last three millennia support multiple human dispersals into Wallacea. Nat Ecol Evol. 2022:1024–1034. https://doi.org/10.1038/s41559-022-01775-2

Pagani, Lawson, Jagoda, Morseburg, Eriksson, Mitt, Clemente, Hudjashov, DeGiorgio, Saag, et al. Genomic analyses inform on migration events during the peopling of Eurasia. Nature. 2016:538(7624):238–242.

Pan, Zhang, Ning, Gao, Zhao, Yang, Guan, Mamatyusupu, et al. Genomic diversity and post-admixture adaptation in the Uyghurs. Natl Sci Rev. 2022:nwab124.

Patterson, Moorjani, Luo, Mallick, Rohland, Zhan, Genschoreck, Webster, Reich. Ancient admixture in human history. Genetics. 2012:192:1065–1093. https://doi.org/10.1534/genetics.112.145037

Pugach, Matveev, Spitsyn, Makarov, Novgorodov, Osakovsky, Stoneking, Pakendorf. The complex admixture history and recent southern origins of Siberian populations. Mol Biol Evol. 2016:1777–1795. https://doi.org/10.1093/molbev/msw055

Purcell, Neale, Todd-Brown, Thomas, Ferreira, Bender, Maller, Sklar, de Bakker, Daly, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007:559–575.

Sala, Caputo, Corach. Genetic structure of Mataco-Guaycuru speakers from Argentina and the extent of their genetic admixture with neighbouring urban populations. Sci Rep. 2019:17559. https://doi.org/10.1038/s41598-019-54146-6

Wang, Duan, Sun, Wang, Yang, Huang, Zhong, et al. Population genetic admixture and evolutionary history in the Shandong Peninsula inferred from integrative modern and ancient genomic resources. BMC Genomics. 2024:611. https://doi.org/10.1186/s12864-024-10514-9

Sun, Wang, Duan, Liu, Chen, Wang, Sun, Wang, et al. Differentiated adaptative genetic architecture and language-related demographical history in south China inferred from 619 genomes from 56 populations. BMC Biol. 2024. https://doi.org/10.1186/s12915-024-01854-9

Urnikyte, Molyte, Kucinskas. Genome-wide landscape of north-eastern European populations: a view from Lithuania. Genes (Basel). 2021:1730. https://doi.org/10.3390/genes12111730

Wang, Shi. Whole mitochondrial genome analysis of the Daur ethnic minority from Hulunbuir in the Inner Mongolia Autonomous Region of China. BMC Ecol Evol. 2022. https://doi.org/10.1186/s12862-022-02019-4

Weir, Cockerham. Estimating F-statistics for the analysis of population structure. Evolution. 1984:1358–1370. https://doi.org/10.1111/j.1558-5646.1984.tb05657.x

Zhang, Ning, Scott, Bjorn, Wei, Wang, Fan, Abuduresule, et al. The genomic origins of the Bronze Age Tarim Basin mummies. Nature. 2021a:599(7884):256–261. https://doi.org/10.1038/s41586-021-04052-7

Zhang, Liu, Yuan, Pan. AdmixSim 2: a forward-time simulator for modeling complex population admixture. BMC Bioinformatics. 2021b:506. https://doi.org/10.1186/s12859-021-04415-x

Zhang, Wang, Chen, Wang, et al. Genomic insight into the population admixture history of Tungusic-speaking Manchu people in northeast China. Front Genet. 2021c:754492. https://doi.org/10.3389/fgene.2021.754492

Zhou, Zhang, Liu, Luo. Characterizing genetic variation on the Z chromosome in Schistosoma japonicum reveals host-parasite co-evolution. Parasit Vectors. 2024:207. https://doi.org/10.1186/s13071-024-06250-4

Author notes

Shuanghui Chen and Chang Lei contributed equally to this work.

This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected] for reprints and translation rights for reprints. All other permissions can be obtained through our RightsLink service via the Permissions link on the article page on our site—for further information please contact [email protected].
