Calvin Pan, Joseph Kim, Lamei Chen, Qi Wang, Christopher Lee, The HIV positive selection mutation database, Nucleic Acids Research, Volume 35, Issue suppl_1, 1 January 2007, Pages D371–D375, https://doi.org/10.1093/nar/gkl855
Abstract
The HIV positive selection mutation database is a large-scale database available at Author Webpage that provides detailed selection pressure maps of HIV protease and reverse transcriptase, both of which are molecular targets of antiretroviral therapy. This database makes available for the first time a very large HIV sequence dataset (sequences from ∼50 000 clinical AIDS samples, generously contributed by Specialty Laboratories, Inc.), which makes possible high-resolution selection pressure mapping. It provides information about not only the selection pressure on individual sites but also how selection pressure at one site is affected by mutations on other sites. It also includes datasets from other public databases, namely the Stanford HIV database [S. Y. Rhee, M. J. Gonzales, R. Kantor, B. J. Betts, J. Ravela and R. W. Shafer (2003) Nucleic Acids Res., 31, 298–303]. Comparison between these datasets in the database enables cross-validation with independent datasets and also specific evaluation of the effect of drug treatment.
INTRODUCTION
HIV-1 is the causative agent of AIDS, a growing worldwide epidemic, and is also a fascinating system for studying fundamental scientific questions. One major clinical problem in the treatment of AIDS is HIV's ability to develop resistance to antiviral drugs rapidly, often within weeks of the introduction of a new drug (1–3). Foremost among the factors responsible are the virus' extremely high mutation rate (4,5) and replication rate (3,6–8). For this reason, there is great medical interest both in understanding the specific causes of drug resistance and in predicting fast versus slow evolutionary pathways to multiple drug resistance. At the same time, HIV provides an extraordinary wealth of data about fundamental scientific questions such as the fitness landscape for protein evolution (9,10).
Evolutionary biology has developed a powerful and general approach for investigating such problems: metrics of selection pressure that measure whether a particular genetic change is selected for or against during evolution. Such metrics can reveal important selection forces either constraining or driving evolution of a protein, directly from raw sequence variation data (11,12). One very widely used metric of selection pressure on amino acid mutations is known as Ka/Ks or dn/ds (13,14) and measures the ratio of observed amino acid mutations over observed synonymous mutations, normalized by the ratio expected under a neutral model. Thus a Ka/Ks = 1 value indicates neutral selection. Ordinarily Ka/Ks is ≪ 1, indicating negative selection against amino acid mutations (far fewer observed than expected under a neutral model). Ka/Ks > 1 is referred to as positive selection (i.e. amino acid mutations increase reproductive fitness) and is observed in rare cases where new evolutionary challenges create strong pressure for rapid evolution of a protein (e.g. immune system genes like MHC that are involved in recognizing pathogenic antigens). Ordinarily, a single Ka/Ks value is calculated for a whole gene, but with very large datasets it becomes possible to estimate distinct Ka/Ks values for individual codon positions or amino acid mutations. This yields a ‘selection pressure map’ of a gene, revealing its detailed functional constraints and in rare cases positive selection peaks that signal important new evolutionary pressures such as drug treatment. We used Ka/Ks because it provides a powerful tool for detecting positive selection. Phylogenetic analysis of our HIV sequence dataset using Phylip (15) shows a star-like topology (data are available at Author Webpage, but will be presented in detail elsewhere), in agreement with previous studies (16,17).
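As a minimal sketch of the definition above, the following Python fragment computes Ka/Ks as the observed ratio of amino acid to synonymous mutations divided by the ratio expected under a neutral model. The function names, the example counts and the fixed tolerance used to call a value "neutral" are illustrative assumptions, not the procedure used to build the database; a real analysis would rely on a statistical test rather than a fixed tolerance.

```python
def ka_ks(n_amino_acid_obs, n_synonymous_obs, expected_ratio):
    """Ka/Ks = (observed Na / observed Ns) / (Na/Ns expected under neutrality).

    expected_ratio is the amino-acid-to-synonymous ratio predicted by a
    neutral model; how it is estimated is outside this sketch.
    """
    if n_synonymous_obs <= 0:
        raise ValueError("need at least one observed synonymous mutation")
    if expected_ratio <= 0:
        raise ValueError("expected neutral ratio must be positive")
    observed_ratio = n_amino_acid_obs / n_synonymous_obs
    return observed_ratio / expected_ratio


def classify_selection(value, tolerance=0.05):
    """Label a Ka/Ks value; the tolerance is an arbitrary illustrative choice."""
    if abs(value - 1.0) <= tolerance:
        return "neutral"
    return "positive" if value > 1.0 else "negative"


# Hypothetical counts for three codon positions, with an assumed neutral
# expectation of 3 amino acid mutations per synonymous mutation.
for label, n_aa, n_syn in [("site A", 300, 100),
                           ("site B", 30, 100),
                           ("site C", 900, 100)]:
    value = ka_ks(n_aa, n_syn, expected_ratio=3.0)
    print(label, round(value, 3), classify_selection(value))
```

Applied per codon position rather than per gene, the same calculation gives one value per site, which is what a selection pressure map displays.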
We have assembled a large-scale database that provides researchers with detailed selection pressure maps of HIV proteins involved in drug resistance. These data have many possible applications, including predicting mutations that contribute to drug resistance, distinguishing primary drug resistance mutations from accessory mutations, measuring the rates of fast versus slow evolutionary pathways to multiple drug resistance, and studying the evolutionary dynamics of different types of mutations as the virus moves from untreated to drug-treated conditions and back. This database makes available for the first time a very large HIV sequence dataset (sequences from ∼50 000 clinical AIDS samples), which makes possible high-resolution selection pressure mapping, as well as smaller datasets from other public databases. The methods and most of the data described herein have been published previously (12,18).
DATABASE CONTENT, INTERFACE AND APPLICATIONS
Datasets
The primary dataset consists of sequences for HIV protease and reverse transcriptase (RT) for ∼50 000 clinical AIDS patient samples from the United States, collected during 1999–2003 (12), mostly from patients under drug treatment. These data cover 1.4 kb per sample [300 000 chromatograms; six overlapping reads per sample, including both strands; see (12) for details] and were generously contributed by Specialty Laboratories, Inc. Owing to HIV's high mutation rate, on average each sequence contains 32 mutations/kb [with respect to the Los Alamos reference sequence (12)], for a total of more than 2 million mutation observations in the dataset (12). Over 5000 distinct codon mutations were observed, each with an average count of 364 samples (12). For comparison, this density of polymorphism information is equivalent to sequencing ∼1 million people. This very large dataset, made available publicly for the first time, has made detailed selection pressure mapping possible. Of the samples, 99.3% are subtype B; non-subtype-B samples were excluded from the analysis (12). The dataset is fully HIPAA-compliant; all information concerning the source patients was removed by Specialty.
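The dataset sizes quoted above can be cross-checked with simple arithmetic. The short Python sketch below uses only the rounded figures given in the text (∼50 000 samples, six reads per sample, 1.4 kb per sample, 32 mutations/kb, 5000 codon mutations with a mean count of 364), so the results are order-of-magnitude checks rather than exact counts; the variable names are our own.

```python
# Rounded figures quoted in the text; all are approximate.
samples = 50_000
reads_per_sample = 6
bases_per_sample = 1_400          # 1.4 kb
mutations_per_kb = 32
distinct_codon_mutations = 5_000  # the text says "over 5000"
mean_samples_per_codon_mutation = 364

# Six reads per sample gives the quoted number of chromatograms.
chromatograms = samples * reads_per_sample

# Expected mutation observations: samples x kb per sample x mutations per kb,
# done in integer arithmetic to avoid floating-point rounding.
expected_mutations = samples * bases_per_sample * mutations_per_kb // 1_000

# Lower bound on codon-level observations from the quoted mean count.
codon_observations = distinct_codon_mutations * mean_samples_per_codon_mutation

print(chromatograms)       # 300000
print(expected_mutations)  # 2240000
print(codon_observations)  # 1820000
```

Both mutation totals land near 2 million, in line with the figure of more than 2 million mutation observations reported for the dataset.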
The database currently includes two additional datasets, also covering HIV protease and RT. These datasets were obtained from the Stanford HIV database (19). The Stanford-Treated dataset consists of 1797 subtype B samples with known drug treatments. This dataset provides a useful comparison with the Specialty results, for validating whether a specific mutation is reproducibly selected by drug treatment. The Stanford-Untreated dataset consists of 2628 subtype B samples not under drug treatment. By comparing results from this dataset with Specialty and Stanford-Treated, users can assess whether a specific mutation is more likely to be associated with drug resistance or other types of phenotypic fitness effects (e.g. interactions with the immune system).
The Specialty raw sequence data are available as a gzip'ed FASTA file at Author Webpage.
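For readers who wish to process the raw data, a gzip-compressed FASTA file can be read with the Python standard library alone. The sketch below is a minimal example, not a tool distributed with the database; the function names are our own, and the file name in the usage note is a hypothetical placeholder because the download location is not reproduced here.

```python
import gzip
import io


def parse_fasta(handle):
    """Yield (header, sequence) pairs from an open text-mode FASTA handle."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def read_gzipped_fasta(path_or_fileobj):
    """Yield (header, sequence) pairs from a gzip-compressed FASTA file."""
    with gzip.open(path_or_fileobj, "rt") as handle:
        yield from parse_fasta(handle)


# In-memory demonstration with two made-up records.
demo = io.StringIO(">sample1\nCCTCAG\nATCACT\n>sample2\nCCTCAA\n")
for name, sequence in parse_fasta(demo):
    print(name, len(sequence))
```

With a downloaded file, the call would look like `for name, seq in read_gzipped_fasta("specialty_sequences.fasta.gz"): ...`, where the file name is an assumption. Streaming the records with a generator avoids loading the full set of sequences into memory.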
Amino acid selection pressure mapping


The interface to the positive selection mutation database is a clickable imagemap. Clicking on any codon position performs a query and returns the results in an easy-to-read format. (Specialty dataset is shown.)
Selection pressure interaction mapping


Selection pressure interaction map. The degree to which a mutation at one site X (horizontal axis) affects the selection pressure at another site Y (vertical axis) is shown as the conditional selection ratio for all amino acid mutations at site Y conditioned on any amino acid mutation at site X. The color coding scale indicates increasing values of positive conditional selection ratio. Interactions showing conditional selection ratios >1 (positive conditional selection) with LOD scores >3 are shown, with blue indicating stronger interactions and yellow indicating weaker ones. Clicking any particular square provides details on the numbers used in the calculation.
These data can yield useful insights into HIV drug resistance. For example, the data show a significant interaction between protease site 90 (a known drug resistance mutation site) and site 10 (Figure 3). Amino acid mutations at 90 displayed strong, unconditional positive selection, indicating that they directly cause drug resistance. In contrast, mutations at 10 are negatively selected in the absence of the 90 mutation, but become positively selected in the presence of the 90 mutation (Figure 3). These results closely match previous experimental studies showing that mutations at 90 cause drug resistance, while mutations at 10 have an accessory effect of compensating for the destabilizing effect of mutations at 90 (21). Thus, our database can help users by providing information that can distinguish primary drug-resistance mutations from accessory mutations (18). Users can navigate through links on every result page, to see mutations that strongly select for a given mutation, mutations that are strongly selected for by this mutation, or links to the Stanford (22) and Los Alamos HIV databases (23) giving further information about mutations at this site.
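The reasoning used above for sites 90 and 10 can be expressed as a simple rule on two conditional Ka/Ks values: one measured in the absence of a partner mutation and one in its presence. The Python sketch below is a simplification for illustration only. The function name, the labels and the example values are assumptions, and the rule ignores the statistical support (for example LOD scores) that a real assessment would require.

```python
def classify_mutation(ka_ks_without_partner, ka_ks_with_partner):
    """Rough label for a mutation from its conditional Ka/Ks values.

    ka_ks_without_partner: Ka/Ks in samples lacking the partner mutation.
    ka_ks_with_partner:    Ka/Ks in samples carrying the partner mutation.
    """
    positive_without = ka_ks_without_partner > 1.0
    positive_with = ka_ks_with_partner > 1.0
    if positive_without and positive_with:
        return "primary-like"    # positively selected regardless of partner
    if positive_with:
        return "accessory-like"  # positively selected only with the partner
    if positive_without:
        return "positive only without partner"
    return "not positively selected"


# Hypothetical values, not taken from the database.
print(classify_mutation(5.0, 6.0))  # primary-like
print(classify_mutation(0.5, 8.0))  # accessory-like
```

Under this rule, a mutation behaving like site 90 would be labelled primary-like, and one behaving like site 10 would be labelled accessory-like.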

Figure 3. For the two possible pathways from wild-type protease to the 10/90 double mutant, we computed the conditional Ka/Ks values for each mutation conditioned on the presence or absence of the other mutation (shown as numbers next to each edge in the figure). For example, in the absence of the 10 mutation, the 90 mutation shows strong positive selection in both the Specialty and Stanford-Treated datasets, but was negatively selected in the Stanford-Untreated dataset. Since the steady-state speed of a multistep path is determined by its slowest step, we highlighted the rate-limiting step in each path (boldface). For example, in the Specialty dataset, the steady-state rate of the upper pathway appears to be ∼10-fold faster than that of the lower pathway. (a) Specialty dataset, (b) Stanford-Treated dataset and (c) Stanford-Untreated dataset.
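The comparison of pathways described in this caption can be sketched in a few lines of Python: each pathway is a list of steps with a conditional Ka/Ks value, the rate-limiting step is the one with the smallest value, and two pathways are compared through their rate-limiting steps. The step values below are hypothetical, chosen only to reproduce a 10-fold difference, and are not the numbers shown in the figure; treating Ka/Ks as a proxy for the rate of a step follows the caption's reasoning and is an assumption of this sketch.

```python
def rate_limiting_step(path):
    """Return the (label, conditional Ka/Ks) step with the smallest value."""
    return min(path, key=lambda step: step[1])


# Two routes from wild type to the 10/90 double mutant.
# Values are hypothetical and for illustration only.
path_90_first = [("wild type -> 90", 5.0), ("90 -> 10/90", 6.0)]
path_10_first = [("wild type -> 10", 0.5), ("10 -> 10/90", 8.0)]

slow_90_first = rate_limiting_step(path_90_first)
slow_10_first = rate_limiting_step(path_10_first)

# Ratio of the two rate-limiting values, used here as a proxy for
# the relative steady-state speed of the two pathways.
relative_speed = slow_90_first[1] / slow_10_first[1]

print(slow_90_first)   # ('wild type -> 90', 5.0)
print(slow_10_first)   # ('wild type -> 10', 0.5)
print(relative_speed)  # 10.0
```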
Comparison between the independent datasets in the database can shed additional light on such questions. For example, users can assess whether positively selected mutations in the Specialty dataset are really due to drug resistance, by comparing with the Stanford-Treated and Stanford-Untreated datasets. As shown in Figure 3b and c, the Stanford-Treated data strongly corroborate the Specialty result, while the Stanford-Untreated data show that 90 is indeed involved in drug resistance; it becomes strongly negatively selected in the absence of drug treatment. These data can help users distinguish genuine drug-treatment mutations from those that affect phenotype in other ways, e.g. interactions with the host immune system. Detailed analysis of these datasets demonstrates that the Ka/Ks results are highly reproducible: independent datasets from different sets of patients show strong quantitative agreement (18).
FUTURE ADDITIONS
We are currently working to add new data and features to the database, including a number of new datasets. First, we will add data for additional HIV genes, such as the env gene, which is important for HIV immune evasion (24); although these datasets have smaller numbers of sequences, our analysis has shown that useful Ka/Ks mapping information can be obtained from such counts. Second, we will analyze mutation data from patients under specific drug treatments to compare the selection pressures caused by different drugs. Third, we will add datasets for other HIV subtypes (e.g. subtype C) to reveal where selection pressure patterns appear to be consistent with those seen in subtype B (allowing diagnostic criteria from subtype B to be applied to other subtypes) and where there are important differences. Fourth, we will add a new very large dataset for the Hepatitis C core gene, consisting of approximately 60 000 samples, generously donated by Specialty Laboratories. Lastly, we will add new analyses and graphical interfaces to the database, including phylogenetic analysis and clickable pathway diagrams.
Funding to pay the Open Access publication charges for this article was provided by NIH Grants U54 RR021813 entitled Center for Computational Biology (CCB) and T32-HG002536.
Conflict of interest statement. None declared.