Tamara Goldfarb, Vamsi K Kodali, Shashikant Pujar, Vyacheslav Brover, Barbara Robbertse, Catherine M Farrell, Dong-Ha Oh, Alexander Astashyn, Olga Ermolaeva, Diana Haddad, Wratko Hlavina, Jinna Hoffman, John D Jackson, Vinita S Joardar, David Kristensen, Patrick Masterson, Kelly M McGarvey, Richard McVeigh, Eyal Mozes, Michael R Murphy, Susan S Schafer, Alexander Souvorov, Brett Spurrier, Pooja K Strope, Hanzhen Sun, Anjana R Vatsan, Craig Wallin, David Webb, J Rodney Brister, Eneida Hatcher, Avi Kimchi, William Klimke, Aron Marchler-Bauer, Kim D Pruitt, Françoise Thibaud-Nissen, Terence D Murphy, NCBI RefSeq: reference sequence standards through 25 years of curation and annotation, Nucleic Acids Research, Volume 53, Issue D1, 6 January 2025, Pages D243–D257, https://doi.org/10.1093/nar/gkae1038
Abstract
Reference sequences and annotations serve as the foundation for many lines of research today, from organism and sequence identification to providing a core description of the genes, transcripts and proteins found in an organism's genome. Interpretation of data including transcriptomics, proteomics, sequence variation and comparative analyses based on reference gene annotations informs our understanding of gene function and possible disease mechanisms, leading to new biomedical discoveries. The Reference Sequence (RefSeq) resource created at the National Center for Biotechnology Information (NCBI) leverages both automatic processes and expert curation to create a robust set of reference sequences of genomic, transcript and protein data spanning the tree of life. RefSeq continues to refine its annotation and quality control processes and utilize better quality genomes resulting from advances in sequencing technologies as well as RNA-Seq data to produce high-quality annotated genomes, ortholog predictions across more organisms and other products that are easily accessible through multiple NCBI resources. This report summarizes the current status of the eukaryotic, prokaryotic and viral RefSeq resources, with a focus on eukaryotic annotation, the increase in taxonomic representation and the effect it will have on comparative genomics. The RefSeq resource is publicly accessible at https://www.ncbi.nlm.nih.gov/refseq.

Introduction
For over 25 years, the RefSeq resource (1,2) has served as a trusted collection of high-quality genome, transcript and protein sequences, offering exceptional breadth of representation across diverse taxa. The guiding principle of the RefSeq project is to provide a stable, non-redundant, curated set of reference sequences, with a focus on quality, transparency and value for a vast array of scientific uses, through a combination of computation, manual curation and collaboration with other resources and scientific groups. RefSeq is a product of the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM), located on the campus of the US National Institutes of Health (NIH).
RefSeq sequences and genomes are generally based on, but distinct from, data provided in the archival repositories of the International Nucleotide Sequence Database Collaboration (INSDC) (3), sometimes referred to as GenBank (4). In particular, the annotation provided on RefSeq records is distinct, and many RefSeq transcript and protein records have no equivalent GenBank record. RefSeq genomic, RNA and protein records have unique identifiers that include an underscore (e.g. NC_000001.11) and combine sequence with additional information including nomenclature, publications and feature annotation. The accessions include distinct prefixes [as previously described (1)] reflective of the molecule type (DNA, RNA, protein), genome components that they represent (e.g. chromosome, scaffold or genomic region) and curation status. RefSeq records are integrated into many NCBI resources such as Gene (5), BLAST (6), Orthologs, dbSNP (7), clinical repositories such as ClinVar (8) and visualization resources including Genome Data Viewer (GDV) (9) and Comparative Genome Viewer (CGV) (10). The stability and high quality of RefSeq data make it frequently used as a reference standard for variation reporting in clinical communities as well as input data in a wide variety of bioinformatic studies.
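As an illustration of the identifier structure described above, the following minimal Python sketch splits an accession such as NC_000001.11 into its prefix, number and version. The regular expression and the `parse_accession` helper are assumptions made for this example, not an NCBI-provided API, and the pattern covers only the two-letter-prefix form shown in the text rather than every RefSeq accession format.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern for illustration: two uppercase letters, an underscore,
# digits, and an optional ".version" suffix (as in NC_000001.11).
_ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


class RefSeqAccession(NamedTuple):
    prefix: str             # e.g. "NC"
    number: str             # kept as text to preserve leading zeros
    version: Optional[int]  # None when no ".version" suffix is present


def parse_accession(text: str) -> Optional[RefSeqAccession]:
    """Split an accession into prefix, number and version; None if no match."""
    match = _ACCESSION_RE.match(text.strip())
    if match is None:
        return None
    prefix, number, version = match.groups()
    return RefSeqAccession(prefix, number, int(version) if version else None)
```

A string without the underscore-separated prefix, for example an archival-style identifier, returns `None` under this sketch.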
With the availability of low-cost DNA sequencing and high-quality genomes, comparative genomics has emerged as a powerful analysis tool to study gene function and disease etiology. High-quality RefSeq genome annotations as well as analysis, visualization and access tools provided by NCBI are a part of the toolkit being developed for NIH’s Comparative Genomics Resource (CGR) project (11), which was launched in 2020 to facilitate comparative analysis of eukaryotic genomes to advance biomedical research. CGR aims to help researchers utilize the expanding breadth of eukaryotic data for a wide range of scientific applications, and the RefSeq dataset provides an integral foundation for many CGR resources.
This report focuses on work done since our last comprehensive report (1) to increase the breadth, depth, and quality of the collection, enhance the stability of the dataset for reporting purposes, leverage collaborations to enrich the quality of data and improve data access through web and programmatic methods. While this report touches on the full taxonomic scope of RefSeq, the focus of this paper is on eukaryotes. In-depth descriptions of prokaryotic RefSeq can be found in recent reports (12,13).
Current status of RefSeq
The broad taxonomic scope of the RefSeq project includes genomes from viruses, prokaryotes and eukaryotes, as well as organelles, plasmids and targeted loci. As of July 2024, there are over 28 000 named species with annotated genomes in the RefSeq dataset, including over 6600 virus, 20 000 prokaryote and 1900 eukaryote species, with dramatic growth of all data types throughout the history of the RefSeq project (Figure 1 and Supplementary Figure S1). The number of vertebrate, invertebrate and fungal species annotated by RefSeq has more than doubled in the last 5 years. Reference organelles (e.g. mitochondria and plastids) and RefSeq Targeted Loci (RTL) sequences for ribosomal DNA (primarily prokaryotic 16S sequences), fungi internal transcribed spacers (ITS) (1) and prokaryotic anti-microbial resistance (AMR) genes provide broader species representation now spanning over 31 000 named species for organelles and 39 000 species for RTL (Figure 1).

Figure 1. Growth of species represented in the RefSeq dataset. Count of named species with one or more genome divided by taxonomic group (e.g. MAM: mammals), combined across all groups (GEN: genomes), or an organelle, plasmid or targeted loci sequence. Eukaryotic organisms with genomes are depicted in a zoomed-in view in the panel on the right. The following abbreviations are used in the plots: RTL, RefSeq targeted loci; ORG, organelles; GEN, genomes; PROK, prokaryotes; VRL, virus + bacteriophage; PMD, plasmids; FUN, fungi; VRT, non-mammalian vertebrates; INV, invertebrates; MAM, mammals; PLNT, plants; PRZ, protozoans.
The scope of RefSeq genomes varies by taxonomic groups. For eukaryotes, the dataset is largely restricted to one genome per species, with only a few exceptions, and only includes a subset of eukaryote species with currently available genomes with a focus on those with higher user interest and to increase taxonomic diversity in the collection (Figure 2 and Supplementary Figure S2). The RefSeq virus dataset aims to include at least one genome for each species and also includes over 8000 additional genomes to represent taxa not yet considered formally named by NCBI Taxonomy and diversity within some species such as Dengue viruses. The RefSeq prokaryote dataset includes many more genomes, currently over 370 000, selected from available cultured isolate and metagenome-assembled genomes (MAGs) (12). For eukaryotes and prokaryotes, NCBI picks one genome as the ‘reference genome’ for each species (previously sometimes called the ‘representative genome’), and preferentially uses genomes from the RefSeq dataset, if available; however, archival genomes not currently included in RefSeq may also be labeled as reference genomes, in particular for eukaryotes not currently included in the RefSeq collection. The choice of reference genome for eukaryote species is updated as soon as new assemblies become available, whereas the prokaryote reference genomes are revised three times per year. The selection criteria are available online (https://www.ncbi.nlm.nih.gov/datasets/docs/v2/policies-annotation/genome-processing/reference-selection/) and were previously described for prokaryotes (13).

Figure 2. Phylogenetic tree of eukaryotic organisms represented in the RefSeq database. The tree was constructed using NCBI Phylogenetic Distance Tree tool, with distances computed based on universal proteins from the BUSCO (37) Hidden Markov Model (HMM) libraries. The final tree is a merge of multiple trees while preserving the root and subtree lengths.
In addition to differences in scope, there are differences in the data model used for different taxonomic groups. In eukaryotes, transcript records with distinct accessions (NM_, NR_, XM_, XR_ accession prefixes) are instantiated for the products of RNA features annotated on nuclear genome sequences, whereas prokaryotes, viruses and organelles do not include separate records for RNA products and generally do not have annotated mRNA features per standard conventions used in these taxonomic divisions. Furthermore, prokaryotes use non-redundant protein records (WP_ accession prefix) where the same protein record is used to represent identical protein products produced by coding regions on different genomes or by multiple related genes in the same genome.
The differences in scope, data models and genome assembly characteristics across taxonomic divisions have influenced growth of the genomic, RNA and protein records available in RefSeq, which now exceed 450 million sequence records (Supplementary Figure S1). Inclusion of multiple genomes per species for prokaryotes has led to a more dramatic increase in counts of genomic and protein sequences compared to eukaryotes and viruses. In contrast, RNA records are largely restricted to eukaryotes, with growth reflecting both the increase in species as well as increased representation of alternately spliced RNAs, especially in metazoan genomes. During the past 5 years, the number of RNA and protein records in mammals, fungi and invertebrates has more than doubled, while there has been a quadrupling in these records for non-mammalian vertebrates during the same timeframe (Supplementary Figure S1). Genomic record counts in eukaryotes are largely flat over the last decade, which is a result of upgrading species to higher-quality genomes with fewer total sequences offsetting the increase in total genomes.
Across the RefSeq dataset, there is a strong emphasis on quality in both the choice of included genomes and the annotation provided. Quality control (QC) procedures vary by taxonomic group, but generally include mechanisms to select for higher-quality genomes based on assembly characteristics including sequence counts and N50 values, select against genomes with annotation problems like incomplete gene sets or an excessive number of frameshifted or partial genes, and remove genomes with apparent issues in organism identification (14). Particular effort has been invested in recent years to reduce or eliminate foreign sequence contamination in RefSeq's collection of genomes through a combination of excluding genomes, replacement with higher-quality assemblies available for the species and targeted removal of contaminated sequences in coordination with the original GenBank submitters, leading to a 90% and 50% reduction of contamination in the RefSeq eukaryote and prokaryote datasets, respectively (https://github.com/ncbi/fcs) (15). Reports of the remaining contaminant sequences are also available on RefSeq genomes file transfer protocol (FTP) (https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/) to use for additional filtering. Overall, NCBI RefSeq serves as a high-confidence reference dataset that spans the tree of life to support a wide variety of bioinformatic uses.
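For readers less familiar with the N50 metric mentioned among the assembly characteristics, the sketch below computes it from a list of sequence lengths using the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The function name `n50` is chosen for this example; the text does not state which N50 cutoffs RefSeq applies, so none are assumed here.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")
```

For example, lengths of 2 through 10 sum to 54, and the three longest sequences (10, 9 and 8) reach the halfway point of 27, so `n50([2, 3, 4, 5, 6, 7, 8, 9, 10])` returns 8.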
Annotation of RefSeq genomes and sequences
For all taxonomic groups, RefSeq genomes are created by identifying candidate genome assemblies available in the INSDC archival dataset, replicating the genomic sequences to create an equivalent but distinct RefSeq assembly, and then processing it for annotation. In some cases, genomic sequences are removed (e.g. due to contamination or short length) or alternate organelle records are incorporated, but no genome sequences are changed in the process of creating a RefSeq assembly. The annotation process is tailored to the specific taxonomic group of the genome. Prokaryotic genomes undergo a distinct process compared to the eukaryotic genomes, which in turn differ from viral genomes. Furthermore, within the eukaryotic domain, different annotation strategies are employed for metazoans and plants, versus smaller eukaryotes such as fungi and protists (Table 1). The following sections provide a brief overview of the genome selection and annotation processes for various taxonomic groups, with a more detailed description specifically for higher eukaryotic organisms, such as metazoans.
Pipeline | Taxonomic group |
---|---|
Prokaryotic Genome Annotation Pipeline (PGAP) | Prokaryotes, Archaea and plasmids |
Eukaryotic Genome Annotation Pipeline (EGAP) | Vertebrates, higher plants, arthropods and some invertebrates |
Annotation Propagation Pipeline | Model organisms or strains, fungi, protozoans, viruses, organelles |
Viruses
Creation of Virus RefSeq records can be initiated in two ways: the International Committee on Taxonomy of Viruses (ICTV) (16) selects a new exemplar to be considered as an example of a well-characterized isolate to represent a virus species, or a recommendation is made by a viral expert to create a representative for a species or genotype independently of ICTV. ICTV also provides recommendations for taxonomic nomenclature that is featured in both RefSeq and GenBank records and creation of RefSeq records generally follows within a few months of the release of a new ICTV Virus Metadata Resource document. For segmented viruses, there is a preference for selecting all segments of a complete genome isolated from a single sample; however, it is not a requirement, and some segmented virus genomes may represent a mix of samples. For both single sequence and segmented viruses, a genome assembly record is created to represent the set of RefSeq sequences.
Virus RefSeq records must be created from existing GenBank sequences labeled as virus or bacteriophage, i.e. they must not be spans on a eukaryotic or prokaryotic chromosome. Nominated records should also be of the highest possible quality, with minimal sequencing gaps and the most complete annotation possible. In most cases, the RefSeq annotation is based on the annotation provided on the archival sequence, with automated mechanisms to harmonize and standardize the annotation format across the RefSeq virus dataset. For selected species with significant importance to public health, curators may manually add gene and protein annotation, and additional metadata found in publications. Edited RefSeq virus records include the status REVIEWED REFSEQ in the COMMENT section, such as on NC_045512.2, whereas RefSeqs that have not undergone additional editing or review have the status PROVISIONAL REFSEQ. NCBI Virus appreciates continuing community assistance in selection and improvements of RefSeq records.
Bacteria and archaea
Candidate bacteria and archaea genomes are selected from the archival repositories based on criteria for sequence quality, completeness, accuracy of taxonomic identification and levels of contamination, as most recently described in 2024 (12). Genomes are then annotated with the automated PGAP, to provide consistent high-quality annotation across the dataset (https://www.ncbi.nlm.nih.gov/refseq/annotation_prok/). A small number of model genomes forgo PGAP and use annotation provided by the INSDC submitter, for example the Escherichia coli K-12 substr. MG1655 reference genome GCF_000005845.2, with annotation derived from EcoCyc (17). In addition to its use for RefSeq, PGAP is used to annotate prokaryotic genomes assembled in the Pathogen Detection project for use in surveillance and GenBank genome submissions by request of submitters, and is also available as a stand-alone software package (https://github.com/ncbi/pgap/). A set of plasmids not otherwise associated with a RefSeq genome is also selected and annotated with PGAP to increase representation of this critical and diverse data type within the RefSeq dataset. RefSeq prokaryote genomes are generally reannotated with PGAP once a year on a rolling schedule to incorporate annotation updates due to software or evidence improvements.
PGAP generates both structural annotation (feature locations on the assembled genome) and functional annotation (https://www.ncbi.nlm.nih.gov/refseq/annotation_prok/process/). The structural annotation includes protein-coding genes, non-coding RNAs (ribosomal, transfer, certain select non-coding), CRISPR arrays and pseudogenes. To assign functional annotation, predicted proteins are searched against protein family models (PFMs) (https://www.ncbi.nlm.nih.gov/protfam/), a hierarchical collection of evidence composed of HMMs, BlastRules and domain architectures. Proteins are assigned the name and attributes of the highest-precedence PFM that is hit. Annotation is assessed using completeness and quality checks including a low gene count, an abnormal gene-to-sequence ratio, missing ribosomal RNA (rRNA) or transfer RNA (tRNA) genes, many frameshifted proteins and completeness based on CheckM (18), and genomes with atypical results are excluded from the RefSeq dataset.
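The naming rule described above, in which a protein takes the name of the highest-precedence PFM that it hits, can be sketched as follows. This is a simplified illustration and not PGAP code: the `FamilyHit` record, the numeric `precedence` field (assumed here to be larger for higher precedence) and the placeholder default name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FamilyHit:
    model_id: str    # hypothetical identifier of the protein family model
    name: str        # product name carried by that model
    precedence: int  # assumption: a larger value means higher precedence


def assign_name(hits: Iterable[FamilyHit],
                default: str = "hypothetical protein") -> str:
    """Return the name of the highest-precedence hit, or a placeholder."""
    # max() with default=None handles proteins that hit no model at all.
    best = max(hits, key=lambda hit: hit.precedence, default=None)
    return best.name if best is not None else default
```

With two hits of precedence 10 and 70, the name attached to the precedence-70 model is returned; a protein with no hits falls back to the placeholder.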
Metazoa and viridiplantae
Animal (Metazoan) and land plant (Viridiplantae) genomes are selected for incorporation into the RefSeq dataset and in most cases annotated with NCBI’s EGAP (https://www.ncbi.nlm.nih.gov/refseq/annotation_euk/) to provide consistent, high-quality gene annotation. A subset of eukaryotic genome assemblies in GenBank that have met certain quality criteria such as high contiguity and accuracy are selected and replicated to create a RefSeq assembly. To be considered, candidate species must also generally have accompanying short- or long-read transcriptomic data available in the NCBI Sequence Read Archive (SRA) (19), as the annotation process heavily relies on this data. Typically, RefSeq annotates only one genome per species, selected based on its quality and importance to the scientific community, and updates to newer, higher-quality genomes as they become available. As the number of high-quality genomes submitted to INSDC repositories grows, RefSeq prioritizes candidates for annotation based on factors such as community interest, economic significance and medical importance. A majority of the genomes added to RefSeq in recent years have been in response to requests received from individuals in the scientific community and submitted through the NCBI Help Desk (https://support.nlm.nih.gov).
Over the past two decades, NCBI RefSeq has used EGAP to annotate >1700 animal and plant genomes from over 1200 species represented in the current dataset, including over 240 genomes annotated in 2023 alone (Figure 3). The species annotated with EGAP represent a wide variety of plants, vertebrates, arthropods and other metazoans (Figure 2 and Supplementary Figure S2). Some lineages such as mammals (Eutheria) and flies (Diptera) have higher representation due to a combination of research significance and availability, whereas other lineages are more sparsely represented with longer branch lengths between included species (e.g. Arachnida), generally reflecting lower availability of candidate genomes (Figure 2 and Supplementary Figure S2).

Figure 3. Count of eukaryote genomes annotated per year. Reannotation counts include reannotation of the same or new assemblies for the same species. Only January–August is included for 2024.
The EGAP utilizes a modular framework that handles all annotation tasks, starting from retrieving data from public repositories, through sequence alignment and gene model prediction, to the final step of releasing annotation products to NCBI resources. The pipeline is actively maintained and continues to be enhanced with new features, and incorporates components developed at both NCBI and by external, third-party sources.
EGAP is an evidence-based annotation process that leverages transcript and protein alignments to the genome to inform gene prediction and structural annotation. Gnomon (https://www.ncbi.nlm.nih.gov/refseq/annotation_euk/gnomon/), the gene prediction tool developed at NCBI, combines multiple types of alignments into overlapping, non-conflicting chains that form the basis of transcript and protein models, which are supplemented with an ab initio HMM-based algorithm. While Gnomon has the capability to produce gene model predictions entirely ab initio, this feature is primarily used to address rare gaps in predictions that lack full-length evidence. This rigorous approach ensures the high accuracy of the resulting annotation products.
The primary sources of evidence for annotation are threefold: complementary DNAs (cDNAs) and Expressed Sequence Tags (ESTs) from GenBank, transcriptomic data (including both short- and long-read sequences) from SRA, and protein sequences from GenBank and RefSeq. Traditional transcript sequences, such as cDNAs and ESTs, are aligned to the genome using Splign (20), while protein sequences are aligned with ProSplign. The availability of transcripts and proteins for alignment varies significantly across species. To maximize coverage, we align transcripts from closely related species when necessary. With the decreased cost of sequencing, the majority of evidence data now comes from short-read transcriptomic studies, particularly RNA-Seq, which are aligned to the genome using Splign or, more recently, using STAR (21). Although STAR produces highly accurate alignments, we have observed a small number of chimeric alignments that span paralogous genes located adjacent to one another, which can lead to erroneous readthrough gene models. To address this issue, EGAP incorporates a post-processing step that utilizes the compart algorithm from Splign to reliably select accurate, non-chimeric alignments. In addition to short-read RNA-Seq, EGAP also aligns long-read transcriptomic data generated using PacBio Iso-Seq or Oxford Nanopore technologies using minimap2 (22). A similar post-processing step is employed to reduce chimeric alignments.
EGAP is equipped to utilize specialized short-read sequencing data to enhance the accuracy of gene models. CAGE-Seq (Cap Analysis of Gene Expression) (23) has been employed as a high-throughput method to determine precise transcription start sites (TSSs) on a genome-wide scale (24,25). When available, CAGE-Seq reads are aligned to the genome, clustered to identify plausible TSS locations, and used to refine the 5′ termini of transcript models. Similarly, polyadenylated (polyA) tails, a hallmark of 3′ termini of mRNA, are identified in transcript alignments, including those from both short- and long-read sequences, and used to refine transcript 3′ termini, with markup included on the transcript record (e.g. XM_005596074.4).
Some sequencing technologies are prone to an elevated rate of small insertions or deletions (indels), which can result in frameshifts if they occur in coding regions. Nonsense sequence changes (internal stop codons in a coding region) and frameshifting indels are identified based on protein alignments incorporated into Gnomon models, which are categorized as either pseudogenes or protein-coding models based on modeling of nonsense and frameshift distributions across the genome and homology characteristics. Gene models that include frameshifts or internal stop codons but have one-to-one orthologs are expected to be under selective constraint and classified as protein-coding with a ‘LOW QUALITY PROTEIN’ label to indicate potential issues with the sequence and details of the gene model. Gene models with internal stop codons that are translated as selenocysteine are also identified through protein alignments and properly annotated to account for the atypical biology (e.g. XP_002100490.3) (26).
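To make the notion of an internal stop codon concrete, the following sketch scans a coding sequence in frame and reports stop codons that occur before the final codon. It is a deliberately naive illustration, not the Gnomon approach, which identifies these events from protein alignments. It assumes the standard genetic code and an input that starts in frame, and it cannot tell a premature stop from a TGA codon that is translated as selenocysteine.

```python
from typing import List

# Stop codons of the standard genetic code (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def internal_stop_positions(cds: str) -> List[int]:
    """Return 0-based codon indices of in-frame stops before the last codon."""
    seq = cds.upper()
    n_codons = len(seq) // 3  # trailing partial codon, if any, is ignored
    return [
        i for i in range(n_codons - 1)  # skip the final (terminating) codon
        if seq[3 * i:3 * i + 3] in STOP_CODONS
    ]
```

For the sequence `ATGTGAAAATAA` (codons ATG, TGA, AAA, TAA), the function reports codon index 1, and the terminal TAA is not counted.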
In addition to Gnomon models for protein-coding, long non-coding RNA and pseudogenes, EGAP integrates models from additional sources: (i) curated RefSeq transcripts (see below), (ii) miRNAs imported from miRBase (27), (iii) tRNAs predicted using the tRNAscan-SE tool (28) and (iv) rRNAs, small nucleolar RNAs (snoRNAs), small nuclear RNAs (snRNAs) and other short non-coding RNA (ncRNA) types identified through searches of eukaryotic RFAM HMMs against the genome using Infernal’s cmsearch tool (29,30). Models from different sources are evaluated for overlap and supporting evidence and consolidated into a non-redundant set of high-quality models. Alternatively spliced genes may be represented by integrating a combination of curated RefSeq transcripts and Gnomon models backed by a single full-length transcript or RNA-Seq reads from a single BioSample. Stable identifiers for genes (GeneID Dbxrefs), model transcripts and proteins from Gnomon and Rfam (XM_, XR_ or XP_ accession prefixes) and curated RefSeq transcripts and proteins (NM_, NR_ or NP_ prefixes) are assigned and tracked across assembly and annotation updates for a species to ensure FAIR (Findable, Accessible, Interoperable, Reusable) (31) data access.
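The distinction between model and curated accession prefixes described above lends itself to a simple lookup. The sketch below is an illustration only: the `PREFIX_INFO` table and `describe_accession` helper are names invented for this example, and the table covers just the six eukaryotic transcript and protein prefixes named in the text.

```python
from typing import Dict, Optional, Tuple

# (molecule type, record origin) for the prefixes named in the text.
PREFIX_INFO: Dict[str, Tuple[str, str]] = {
    "NM": ("transcript", "curated"),
    "NR": ("transcript", "curated"),
    "NP": ("protein", "curated"),
    "XM": ("transcript", "model"),
    "XR": ("transcript", "model"),
    "XP": ("protein", "model"),
}


def describe_accession(accession: str) -> Optional[Tuple[str, str]]:
    """Return (molecule type, origin) for a known prefix, else None."""
    prefix, separator, _ = accession.partition("_")
    if not separator:  # no underscore, so not a RefSeq-style accession
        return None
    return PREFIX_INFO.get(prefix)
```

Under this sketch, `describe_accession("XP_002100490.3")` yields `("protein", "model")`, while a genomic accession such as NC_045512.2 returns `None` because genomic prefixes are outside the table.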
The structural annotation produced by EGAP is enriched with functional information including nomenclature for genes and proteins. RefSeq relies on nomenclature authorities such as HUGO Gene Nomenclature Committee (HGNC) (32), Mouse Genome Informatics (MGI) (33) and Rat Genome Database (RGD) (34) as primary sources for gene symbols and descriptions (see the ‘Collaborations’ section below). For organisms lacking a dedicated nomenclature entity, EGAP assigns appropriate names based on orthologs computed versus a well annotated reference organism using a process based on protein alignments and local synteny (https://www.ncbi.nlm.nih.gov/kis/info/how-are-orthologs-calculated/). RefSeq is currently expanding the taxonomic scope of ortholog calculations beyond vertebrates and arthropods to a larger set of organisms across the tree of life. These efforts will not only enhance gene nomenclature for RefSeq annotations but also facilitate comparative genomics analyses. UniProtKB/SwissProt (35) is used as the authority for protein names where available, either through orthology or homology alignments.
Annotation quality is assessed based on alignment quality and incorporation into gene models, expected counts of protein coding genes and orthologs, fraction of gene models based in part or entirely on ab initio prediction and alignment of proteins compared to reference sets derived from UniProtKB/SwissProt, FlyBase (36) or other sources. Genomes with a high rate of models with frameshifts or internal stop codons (>10% of protein-coding genes) are generally excluded from the RefSeq dataset and reported back to the genome submitter for improvement. BUSCO (37) is a popular tool for assessing quality, and is run on all EGAP annotations in ‘protein’ mode using one longest protein per protein-coding gene and assessing completeness and duplication rates in the final RefSeq annotation dataset compared to the closest available BUSCO lineage dataset. In protein mode, BUSCO provides an assessment of a combination of the genome and RefSeq annotation quality. More than half of current EGAP annotations have a completeness score of 98% or higher, with a mean of 97.3%, providing an independent metric for the overall quality of EGAP annotations. Note, BUSCO scores are best interpreted relative to related genomes and species. For some genomes, BUSCO scores may be artificially low due to significant evolutionary divergence from the organisms included in the preparation of the BUSCO reference set.
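The exclusion rule stated above (more than 10% of protein-coding genes with frameshifts or internal stop codons) amounts to a simple ratio check, sketched below. The function and parameter names are hypothetical, and because the text says such genomes are "generally" excluded, the result is best read as a flag for review rather than a definitive decision.

```python
def exceeds_defect_threshold(n_protein_coding: int,
                             n_with_defects: int,
                             max_fraction: float = 0.10) -> bool:
    """True if the share of frameshifted/internal-stop models is too high."""
    if n_protein_coding <= 0:
        raise ValueError("n_protein_coding must be positive")
    if not 0 <= n_with_defects <= n_protein_coding:
        raise ValueError("n_with_defects must lie in [0, n_protein_coding]")
    # Strictly greater than the cutoff, matching the '>10%' wording.
    return n_with_defects / n_protein_coding > max_fraction
```

A genome with 20 000 protein-coding genes would be flagged at 2500 affected models (12.5%) but not at exactly 2000 (10%).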
Fungi, protozoa and other model organisms
NCBI uses the Eukaryotic Annotation Propagation Pipeline to broaden the coverage of annotated eukaryotes in RefSeq by leveraging good-quality annotations archived in GenBank. This pipeline propagates user-submitted annotations to RefSeq, applying logic to enhance the quality of the original annotation, resulting in standardized formatting and improved product names. The genomes in this group include eukaryotes that are significant for research related to diseases (primarily fungi, protozoa and select protostomia) and model organisms with actively curated annotations from Model Organism Databases including Arabidopsis thaliana (TAIR) (38), Caenorhabditis elegans (WormBase) (39), Drosophila melanogaster (FlyBase) (36), Saccharomyces cerevisiae (SGD) (40) and Schizosaccharomyces pombe (PomBase) (41).
Fungi assemblies selected for promotion to the RefSeq collection usually represent one genome per species; however, in clades with more genetic diversity below species level (e.g. variety and forma specialis) more assemblies are selected. The selection process currently involves evaluation of parameters including verification of taxonomic identity, contamination, annotation coverage and assembly quality metrics including contig counts and N50 lengths. Taxonomic identity is verified by leveraging type material (the specimen upon which a new species description is based) sequences as references and a combination of the following methods: phylogenetic placement in a protein distance tree (https://github.com/ncbi/tree-tool), in a whole genome k-mer tree, average nucleotide identities and marker loci identities. Annotation coverage is evaluated against the fungi BUSCO dataset (fungi_odb10), and we avoid selecting assemblies that score with fewer than 90% complete genes or >10% duplicates.
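The BUSCO-based part of this selection can be expressed as a small filter, sketched below under the thresholds quoted above (at least 90% complete genes and no more than 10% duplicates). The `BuscoSummary` record and function name are hypothetical, and the other criteria in the paragraph (taxonomic identity, contamination, assembly metrics) are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuscoSummary:
    complete_pct: float    # percentage of BUSCO genes found complete
    duplicated_pct: float  # percentage of BUSCO genes found duplicated


def passes_fungi_busco_filter(summary: BuscoSummary,
                              min_complete: float = 90.0,
                              max_duplicated: float = 10.0) -> bool:
    """Apply the completeness and duplication cutoffs quoted in the text."""
    return (summary.complete_pct >= min_complete
            and summary.duplicated_pct <= max_duplicated)
```

An assembly scoring 96.5% complete with 1.2% duplicates passes, whereas one at 88% complete, or one with 15% duplicates, does not.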
Fungi RefSeq assemblies cover 7 phyla consisting of 33 classes, of which about half are within the Ascomycota (Supplementary Figure S2). All species in the critical and high priority pathogen groups as defined by the World Health Organization (WHO) (42) are represented in the fungi RefSeq assembly dataset. Recently the fungi RefSeq assembly dataset was used in a comprehensive analysis of microbial content in whole-genome sequencing samples from The Cancer Genome Atlas project by Ge et al. (43). Several fungi RefSeq assemblies are also part of the EukCC reference database (44), which is being used in metagenomic analysis.
Protozoa are a multiphyletic group of single-celled organisms informally defined at NCBI as eukaryotes other than metazoa, viridiplantae or fungi. The protozoa RefSeq dataset currently includes 99 species and is highly diverse, including disease-associated genera such as Plasmodium, Leishmania, Trypanosoma and others. While protozoa are not yet in scope for EGAP annotation, efforts are underway to expand their coverage in RefSeq, including import of annotation from other sources such as VEuPathDB (45).
Organelles and RefSeq targeted loci
Organelles
Complete mitochondrion, plastid and other organelle genomes are integrated into the RefSeq dataset primarily through propagation of the INSDC submission. Mammalian mitochondria annotation is often supplemented with manual curation. RefSeq metazoan mitochondria records incorporate a standardized set of gene symbols for protein-coding genes (ATP6, ATP8, COX1, COX2, COX3, CYTB, ND1, ND2, ND3, ND4, ND4L, ND5, ND6) that are also used on over one hundred thousand mitochondrion records in GenBank.
Fungi RefSeq targeted loci
The most frequently sequenced region in the fungal genome is the ribosomal cistron, which is used for species identification and evolutionary analyses. Sequence records from the ribosomal cistron represent the highest diversity of fungal organisms in GenBank compared to any other locus. The small subunit (SSU) rRNA gene is the most conserved region in the ribosomal cistron, and the internal transcribed spacer (ITS) region the least, while the large subunit (LSU) rRNA gene is of intermediate conservation. The ITS region collectively refers to two divergent regions, ITS1 and ITS2, that are separated by the short, conserved 5.8S rRNA gene. The SSU rRNA has poor species-level resolution but is useful in characterizing rapidly evolving groups. In contrast, the ITS region has the highest probability of successful species-level identification for the broadest range of fungi and has been widely accepted as the first universal barcode marker (46).
The Fungi RTL project at NCBI consists of curated SSU, ITS and LSU sequence records derived from a type specimen and accessioned into a public biorepository. The RTL dataset is accessible via BioProject PRJNA41209, FTP download or via the custom BLAST databases for each locus type. The ITS RefSeq dataset (1) is connected to >470 entries in the NCBI BioCollections resource (47) and a quarter of the associated biorepositories are large collections which hold the majority (∼91%) of type specimens in the reference sequence dataset. RTL records are interconnected with the NCBI Taxonomy resource (48), and collectively result in verified name–strain–sequence type associations. The value of publicly available and validated strains and reference sequences is exemplified in our recent collaboration on Colletotrichum identification, and the resulting curation project captured in RefSeq ITS sequence records (49).
Since the first announcement of the ITS RefSeq project (50) and subsequent reports on the ITS and LSU RefSeq records (1) in 2016, thousands more records were added, and improvements made, resulting in > 19 600 fungi species represented. For example, the software package Ribovore was implemented to evaluate potential SSU and LSU sequences for RefSeq inclusion and improved curation efficiency (51).
Amplification and sequencing of the ITS region from microbiome and environmental DNA sample studies remain a popular method to describe fungal diversity. Despite the dramatic increase of genome assembly sequences, reference databases of the ribosomal RNA loci still play an important role in molecular identification of fungi species. The RTL dataset covers critical and high priority fungal pathogens as defined by the WHO, >90% of the genera listed in Clinical ATLAS of fungi (52) and genera listed as being the 100 most cited fungal genera (52,53).
Prokaryotes RefSeq targeted loci
In prokaryotes, the 16S rRNA sequence has traditionally been a standard molecular marker utilized in the description of a new species and remains in common use even with a growing shift to genome-based classification (54,55). The prokaryote RTL dataset provides a curated set of over 27 000 16S rRNA marker sequences, of which over 98% are from type strains. RTL records indicate their GenBank source record in the COMMENT section (e.g. NR_175563.1 is identical to CP011371.1:2 032 893–2 034 431), and NCBI Datasets provides convenient links to both RefSeq and GenBank sequences and genomes from type material (e.g. https://www.ncbi.nlm.nih.gov/datasets/taxonomy/1768/). The bacterial and archaeal 16S rRNA datasets are available from NCBI BioProject (PRJNA33175 and PRJNA33317, respectively). A custom BLAST database, named ‘16S ribosomal RNA sequences (Bacteria and Archaea)’, is also available.
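As a worked example of the source-record notation above, the following sketch parses the accession and coordinates from the COMMENT-style string and computes the length of the source interval. The function name is hypothetical, and the coordinates are assumed to be 1-based and inclusive, as is conventional for GenBank locations; the digit-grouping spaces and en dash are typographic and are normalized before parsing.

```python
import re


def parse_source_range(text):
    """Parse 'ACCESSION.VERSION:START-END' into (accession, start, end, length).

    Whitespace used for digit grouping is removed and an en dash is treated
    as a plain hyphen. Coordinates are assumed 1-based and inclusive.
    """
    compact = re.sub(r"\s+", "", text).replace("\u2013", "-")
    match = re.fullmatch(r"([A-Z_]+\d+\.\d+):(\d+)-(\d+)", compact)
    if match is None:
        raise ValueError(f"unrecognized range: {text!r}")
    accession = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    return accession, start, end, end - start + 1


accession, start, end, length = parse_source_range(
    "CP011371.1:2 032 893\u20132 034 431"
)
# length is 2034431 - 2032893 + 1 = 1539 nucleotides
```

A span of 1539 nt is in the expected size range for a full-length 16S rRNA gene.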
AMR is a significant public health threat, with a variety of resistance mechanisms conveyed by different genes. A stand-alone tool called AMRFinderPlus (56) (https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/AMRFinder/) has been developed to identify genetic elements in the genomes of prokaryotic pathogens as part of the Pathogen Detection project (https://www.ncbi.nlm.nih.gov/pathogens/) (56). The full set of the underlying reference sequences including >9700 genes and point mutations is currently represented in the Pathogen Detection Reference Gene Catalog (https://www.ncbi.nlm.nih.gov/pathogens/refgene/). RefSeq provides the core genes as genomic records (NG_ accession prefix), including those assigned beta-lactamase, mobile colistin resistance (mcr) and Qnr alleles by request from the scientific community (57,58,59) under BioProject PRJNA313047.
RefSeq annotation products
Individual RefSeq records integrate sequence, annotation, organism and metadata information. RefSeq genomes are composed of sets of annotated genomic sequences and are assigned an assembly accession using the GCF_ prefix, e.g. GCF_000001405.40 for human GRCh38.p14 (also referred to as hg38), which is paired with, but distinct from the INSDC assembly with a GCA_ prefix (e.g. GCA_000001405.29). Since the same assembly may be annotated multiple times, many RefSeq assemblies incorporate an annotation name, including EGAP and newer PGAP annotations, which is based on the RefSeq assembly accession and annotation date (e.g. GCF_000001405.40-RS_2024_08). We recommend including the annotation name in publications to describe the dataset used in accordance with FAIR best practices.
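The naming scheme above is regular enough to parse. The sketch below splits an assembly accession or annotation name into its parts. It rests on assumptions drawn only from the examples in the text (a `GCF_` or `GCA_` prefix, nine digits, a version, and an optional `-RS_YYYY_MM` suffix); the class and function names are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class AnnotationName(NamedTuple):
    prefix: str                     # "GCF" (RefSeq) or "GCA" (INSDC)
    number: str                     # numeric part of the accession
    version: int                    # assembly version
    annotation_date: Optional[str]  # "YYYY-MM", or None for a bare accession


# Pattern inferred from the examples in the text; treat it as an assumption.
_NAME = re.compile(r"(GC[AF])_(\d{9})\.(\d+)(?:-RS_(\d{4})_(\d{2}))?")


def parse_annotation_name(name):
    match = _NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not an assembly accession or annotation name: {name!r}")
    prefix, number, version, year, month = match.groups()
    date = f"{year}-{month}" if year else None
    return AnnotationName(prefix, number, int(version), date)


refseq = parse_annotation_name("GCF_000001405.40-RS_2024_08")
insdc = parse_annotation_name("GCA_000001405.29")
```

In the example from the text, the paired GCF_ and GCA_ accessions share the numeric part (`000001405`) but carry different versions, which is why the full versioned accession and annotation name should be quoted when describing a dataset.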
A variety of data formats are available for the genomic, transcript and protein data associated with a RefSeq genome, including sequences in FASTA format for genome, transcript, protein and annotated RNA and coding sequence (CDS) features; feature annotation in NCBI flatfile (GBFF or GPFF), GFF3 or GTF formats; and summary reports with assembly sequences, feature counts, locations and metadata in tabular, XML or JSONL format through NCBI Datasets (60) or FTP (see below).
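For readers working with the GFF3 annotation files mentioned above, a minimal sketch of tallying feature types follows. It relies only on the GFF3 convention of nine tab-separated columns with the feature type in the third; the sample records are invented for illustration and do not come from any RefSeq file.

```python
import io
from collections import Counter


def count_feature_types(handle):
    """Count GFF3 features by type (column 3), skipping comments and directives."""
    counts = Counter()
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # ignore malformed rows rather than guessing
        counts[columns[2]] += 1
    return counts


# Hypothetical records, for illustration only.
SAMPLE_GFF3 = (
    "##gff-version 3\n"
    "chrDemo\tExample\tgene\t100\t900\t.\t+\t.\tID=gene0\n"
    "chrDemo\tExample\tmRNA\t100\t900\t.\t+\t.\tID=rna0;Parent=gene0\n"
    "chrDemo\tExample\tCDS\t150\t449\t.\t+\t0\tID=cds0;Parent=rna0\n"
    "chrDemo\tExample\tCDS\t600\t800\t.\t+\t0\tID=cds0;Parent=rna0\n"
)

counts = count_feature_types(io.StringIO(SAMPLE_GFF3))
```

The same function can be pointed at an opened, decompressed annotation file; the feature-count summary reports from NCBI Datasets serve a similar purpose without any parsing.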
Additional data is available for specific groups of RefSeq genomes. Information on one-to-one orthologs is available from NCBI Gene and NCBI Orthologs, and downloadable through NCBI Datasets. Functional information in the form of gene ontology (GO) terms including descriptions of probable molecular function, biological processes and cellular location of products is available for many genomes. For prokaryotes, GO terms are associated with PFMs used for associating functional information with proteins, and incorporated into annotated CDSs and proteins in the GBFF, GFF3 or GTF formatted genome annotation files (12). For eukaryotes, GO terms are provided for genes in GO Annotation File format using InterProScan (61) to identify and propagate GO terms for the majority of protein-coding genes included in the annotation. This process utilizes the latest versions of InterProScan, which incorporate analyses based on PANTHER reference trees (62), resulting in broader, more accurate GO mappings. GO data for eukaryotes is also available for search and download through NCBI Gene. Many EGAP annotations generated since June 2022 include gene expression information as tabular files computed using featureCounts (63) with the short-read RNA-Seq data aligned for the structural annotation, providing data on gene abundance across different samples. Coverage graphs of the expression data are also available in bigWig format.
Specialized annotation products are also available for some genomes. RefSeq sequences are frequently used as reporting standards by the clinical research and diagnostics community. In order to support this critical group of users, we provide periodic annotation updates for the previous GRCh37.p13 assembly (also referred to as hg19), most recently in September 2024 (GCF_000001405.25-RS_2024_09), and alignment files of all RefSeq transcripts, both current and historical data, in BAM format to support remapping of variation data between genome and transcript sequences (https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/historical/). We are also exploring approaches to provide annotation for additional genomes for a species where it may be useful for understanding haplotype diversity. Initial pilot annotations are currently available for several human and rat (64) genomes (https://ftp.ncbi.nlm.nih.gov/genomes/all/pilot/).
Expert curation for eukaryote genomes
In addition to the annotation produced through automated pipelines, manual curation performed by expert curators in the RefSeq group improves the structural and functional annotation available for a wide variety of organisms. Curation activities for prokaryotes have been described recently (12). In eukaryotes, RefSeq curators review structural and functional annotation in human and key model organisms including mouse, rat, chicken and zebrafish. Areas of curation focus include fixing annotation errors flagged by QC tests, improving gene nomenclature and ortholog calls in collaboration with nomenclature groups, and providing accurate annotation of genes exhibiting exceptional biology (for example, non-AUG start codons and genes encoding proteins that contain selenocysteine). The types of data typically reviewed by curators include, but are not limited to, transcript data, conservation data, peptide, ribosome profiling, protein structure prediction data, epigenetic information, CAGE data, tissue-specific gene expression and polyA-seq. Literature reviews are also performed by curators to consider all experimental evidence in making annotation decisions. Curated RefSeq transcripts and proteins are incorporated into subsequent genome annotation updates and can also inform Gnomon models produced in other species.
The effect of improvements in assembly and annotation methods is exemplified by the evolution of the human genome annotation since the beginning of the RefSeq project (Figure 4). Early automated annotations suffered from poor methods to identify protein-coding genes, resulting in large variations and over-annotation. Improvements in methodology and incorporation of expert curation improved the dataset over time, which has now stabilized with 20 078 protein-coding and 17 063 pseudogenes in the latest annotation (GCF_000001405.40-RS_2024_08). However, the incorporation of RNA-Seq and long-read transcriptomic evidence—beginning in 2013 and 2016, respectively—has resulted in a steady rise in the annotation of various types of ncRNA genes, now totaling 22 175, along with alternatively spliced transcripts for most genes. CAGE and polyA data have been used to annotate experimentally validated TSSs and 3′ termini for > 85% of transcripts, providing a robust substrate for promoter and expression studies. As advancements in transcript sequencing, proteomics, protein structure prediction and other technologies continue to progress, the RefSeq team of curation experts is well-positioned to enhance the annotation of model organisms and extend these improvements to automated EGAP annotations across eukaryotes.

Figure 4. Improvements to human genome annotation by RefSeq. The graph depicts the number of protein-coding, non-coding and pseudogenes annotated with each iteration, dating back to 2001. Each bar represents an individual annotation. The X-axis labels represent an approximate timescale of the annotation dates.
In addition to conventional genes, the human and mouse RefSeq genome annotations include a set of nongenic elements such as enhancers, silencers and protein-binding sites based on experimental evidence curated from the literature as part of the RefSeq Functional Elements (RefSeqFE) project (65). The RefSeqFE dataset has grown tremendously since it was first published in 2022, now including 157 985 features from 137 141 loci in the human GRCh38.p14 annotation, and 43 687 features from 40 212 loci in the mouse GRCm39 annotation. The data is extensively linked to the experimental literature, and provides a valuable foundation for understanding gene regulation and other biological processes. In addition to incorporation of features into the primary RefSeq annotation files, the RefSeqFE annotation is available as track hub files (https://ftp.ncbi.nlm.nih.gov/refseq/FunctionalElements/trackhub/data/) in bigBed format, and also includes functional interaction data in bigInteract format.
A significant amount of curation is involved in providing specialized annotation products. For example, RefSeq provides a representative transcript and protein (RefSeq Select) for each protein-coding gene in the human, mouse and rat genome annotations, as well as RefSeq Select proteins for prokaryote reference genomes, and is working on approaches to extend this dataset to other eukaryotes. The mouse annotation (GCF_000001635.27-RS_2024_02) has 21 162 RefSeq Select transcripts while the rat annotation (GCF_036323735.1-RS_2024_02) contains 23 167 RefSeq Select transcripts. The human RefSeq Select set is included in the Matched Annotation from NCBI and EMBL-EBI (MANE) collaboration and is called the MANE Select set (described below).
Collaborations
NCBI works collaboratively with many groups and other resources to improve annotation. For example, improvements and changes to nomenclature in multiple organisms are made in consultation with expert nomenclature authorities including HGNC (32), Vertebrate Gene Nomenclature Committee (66), MGI (33), RGD (34), XenBase (67), Chicken Gene Nomenclature Committee (68) and Zebrafish Information Network (69). In addition to improving gene nomenclature assignment for these key model organisms, gene symbols and names from key organisms (human, zebrafish and Drosophila) are used to inform nomenclature in related species through orthology. RefSeq curators also collaborated with the Saccharomyces Genome Database (SGD) (40) to assign descriptive but concise protein names following the international protein nomenclature guidelines (https://www.ncbi.nlm.nih.gov/genbank/internatprot_nomenguide/) referenced in (35).
NCBI has continued with existing collaborations and initiated new ones with external groups to provide high-value annotation products. The Consensus Coding Sequence (CCDS) (70) project, a longstanding collaboration between NCBI, Ensembl, HGNC and MGI, has incrementally added coding regions identically annotated by RefSeq and Ensembl in human and mouse reference genomes. The latest human (Release 24, 2022) and mouse (Release 23, 2019) CCDS datasets include 19 107 and 20 486 protein-coding genes, respectively. The MANE collaboration was initiated in 2018 to provide a set of representative transcripts (MANE Select) (71), one for each protein-coding gene in the human genome, to serve as the default transcript to report known clinical variants and for display in genome resources. The current RefSeq annotation of the human reference genome incorporates MANE v1.3, which contains MANE Select transcripts covering >99% of protein-coding genes, including all genes from the American College of Medical Genetics and Genomics Secondary Findings list (ACMG SF v3.2) (72). Additionally, transcripts called MANE Plus Clinical are provided for 63 genes where the MANE Select alone is not sufficient to report clinical variants.
RefSeq curators also work with members of the scientific community to improve both structural and functional annotation across a wide range of species. For example, curators worked with members of the insect vector research community to improve annotation of key genes including proteases, G protein-coupled receptors and chemosensory receptors (73) in the genome of Aedes aegypti, a prolific mosquito which is a vector transmitting several diseases worldwide. More recently, as part of a study elucidating the structure of sex chromosomes and analyzing their evolution in six non-human ape species (74), RefSeq curators reviewed the annotation of gene families on the Y-chromosomes of these organisms and provided improved annotation. RefSeq curators also analyzed community annotation of proviral genes in four wasps (Fopius arisanus, Venturia canescens, Microplitis demolitor and Chelonus insularis) (75) which are now incorporated into updated genome annotations for those species.
T2T-CHM13 genome annotation
In 2022, the T2T Consortium released the first complete sequence of a human genome from the uniformly homozygous cell line CHM13 (76), which both corrected sequences found in GRCh38 and also added about 238 Mbp of sequence missing in the reference GRCh38 human assembly. The long-read sequencing technologies used in T2T-CHM13 resolve most of the gaps, mis-assembly and sequencing errors found in GRCh38. While the T2T-CHM13 assembly is technically superior, GRCh38.p14 is still considered the reference genome for human due to its extensive usage in bioinformatics and by the clinical community.
The T2T-CHM13 and GRCh38.p14 assemblies were co-annotated using EGAP, with the most recent update in August 2024 including 55 376 genes (19 638 protein-coding) annotated on both genomes, 3046 (433 coding) unique to T2T-CHM13 and 4339 (440 coding) unique to GRCh38.p14 (Supplementary Figure S3). Differences in gene content can arise from changes in copy number in arrays of paralogous genes, genes found in subtelomeric, pericentric or other locations absent in GRCh38, genes affected by mis-assembly or haplotype differences or from regions with substantial sequence differences that inhibit identification of allelic genes. Most genes uniquely annotated on only one genome are based on automated methods, and review by RefSeq curators is ongoing.
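The per-assembly totals implied by these counts follow from simple addition, sketched below. The only inputs are the numbers quoted in the paragraph above; "coding" is assumed to mean protein-coding, and the variable names are illustrative.

```python
# Counts quoted in the text for the August 2024 co-annotation.
shared = {"genes": 55376, "protein_coding": 19638}
unique_t2t = {"genes": 3046, "protein_coding": 433}
unique_grch38 = {"genes": 4339, "protein_coding": 440}


def assembly_totals(shared_counts, unique_counts):
    """Total per category for one assembly: shared plus assembly-specific."""
    return {key: shared_counts[key] + unique_counts[key] for key in shared_counts}


t2t_totals = assembly_totals(shared, unique_t2t)
grch38_totals = assembly_totals(shared, unique_grch38)
# t2t_totals    -> 58 422 genes, 20 071 protein-coding
# grch38_totals -> 59 715 genes, 20 078 protein-coding
```

The GRCh38.p14 protein-coding total of 20 078 agrees with the figure given for the latest human annotation (GCF_000001405.40-RS_2024_08) earlier in the article.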
Some examples of differences in gene content are described below. LOC105379278 (GeneID:105379278) is supported by full-length transcripts that align to the T2T-CHM13 assembly, encoding a putative tyrosine-protein phosphatase with protein isoforms >500 aa in length (Figure 5). This locus is located in a pericentric genomic region that is flanked by assembly gaps in GRCh38 and is inverted relative to T2T-CHM13, with most of the gene, including the entire coding region, absent in GRCh38. While the latest GRCh38 annotation (GCF_000001405.40-RS_2024_08) did not annotate LOC105379278, this locus was annotated on previous GRCh38 annotations, being defined as a non-coding gene due to the partial gene deletion.

Figure 5. LOC105379278 is a novel protein-coding gene on T2T-CHM13. (A) A CGV view of human chr 13 alignment between T2T-CHM13 (top) and GRCh38 (bottom). The purple ribbon depicts aligned regions on opposite strands. LOC105379278 is annotated as a protein-coding gene on T2T-CHM13 with a complete representation of the coding region. (B) LOC105379278 is absent from the latest RefSeq annotation of GRCh38 (GCF_000001405.40-RS_2024_08). It was represented as a non-coding locus in previous annotations, including GCF_000001405.40-RS_2023_10, owing to a truncated coding region resulting from assembly issues in GRCh38, including an inversion and a gap compared to the corresponding region in T2T-CHM13. (C) A closeup of the genomic region on GRCh38 around LOC105379278 that is disrupted by a gap in the assembly, denoted by a string of ‘N’s in the sequence.
Nonsense mutations or small frameshifting indels can result in haplotypic differences in gene biotype sometimes referred to as polymorphic pseudogenes. For example, some alleles of SPEGNB (GeneID:100996693) contain a frameshifting single nucleotide insertion, resulting in a protein-coding allele on GRCh38 (represented by transcript NM_001286811.2; protein NP_001273740.1) and a non-coding pseudogene allele on T2T-CHM13 (represented by transcript NR_178039.1) (Figure 6). Small allelic differences can also result in significant changes in gene products without necessarily resulting in an inactive or pseudogene allele. An example of this is CACNG7 (GeneID:59284), where a frameshifting indel in a string of ‘G’s, rs11324363, results in a 305 aa isoform encoded on GRCh38, while the more common allele represented on T2T-CHM13, encodes a 269 aa isoform with an alternate C-terminus (Supplementary Figure S4).
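To illustrate why a single-nucleotide insertion can change a gene's biotype, the toy example below shows how one inserted base shifts the reading frame and exposes a premature stop codon. The sequence is invented for illustration and is unrelated to SPEGNB or CACNG7; the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons(seq):
    """Split a sequence into complete codons, dropping any trailing bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def first_stop_index(seq):
    """Index of the first in-frame stop codon, or None if there is none."""
    for index, codon in enumerate(codons(seq)):
        if codon in STOP_CODONS:
            return index
    return None


# Toy open reading frame: ATG GCC CTT GAC GGC TAA
reference = "ATGGCCCTTGACGGCTAA"
# Insert a single 'A' after the second codon.
with_insertion = reference[:6] + "A" + reference[6:]

reference_stop = first_stop_index(reference)       # stop is the sixth codon
insertion_stop = first_stop_index(with_insertion)  # stop moves up to the fourth codon
```

Every codon downstream of the insertion is read in a different frame, so the encoded product is altered or truncated. Whether the resulting allele is annotated as a pseudogene or as an alternate isoform depends on the evidence, as the SPEGNB and CACNG7 examples show.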

Figure 6. SPEGNB is a polymorphic pseudogene annotated as a protein-coding gene on GRCh38 assembly and as a pseudogene on T2T-CHM13 assembly. (A) A GDV view of the SPEGNB locus on GRCh38 (top) and T2T-CHM13 (bottom). The darker shading denotes the coding region with a closeup of rs11437065 on the right. The reference allele retains the reading frame, with NM_001286811.2 encoding a 238 aa protein, NP_001273740.1. (B) The T2T-CHM13 assembly contains the insA allele of rs11437065, shown in closeup view on the right. The single nucleotide ‘A’ insertion is a frameshifting allele, with a minor allele frequency of 0.06. The resulting non-coding transcript, NR_178039.1, is annotated on T2T-CHM13.
Genome repeats are susceptible to biological deletions, duplications and gene conversion events or mis-assembly that can result in differences in copy number of paralogous genes. The AMY1 gene cluster is one example of a paralog expansion, with six AMY1-like genes annotated as a cluster on GRCh38, and twice as many AMY1-like genes clustered on T2T-CHM13 (Supplementary Figure S5). RefSeq is exploring methods to annotate differences in gene content found in additional genomes and represent a human pangenome with loci shared between, or unique to, specific assemblies.
Accessing RefSeq data
RefSeq sequences and annotated genomes are continuously released, and the data can be accessed through various NCBI portals, including Nucleotide, Protein, Gene (5) and NCBI Datasets (60) for genome and gene data. Users can also access the data programmatically via NCBI Datasets, E-utilities and EntrezDirect APIs and command line tools, or by bulk downloads from FTP. Additionally, graphical tools such as the Sequence Viewer, Genome Data Viewer (9) and the newly released Comparative Genome Viewer (10) facilitate visual browsing and analysis of RefSeq data. Stay up to date on significant updates to RefSeq and other resources through NCBI Insights (https://ncbiinsights.ncbi.nlm.nih.gov).
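As one example of programmatic access, the sketch below builds an E-utilities `efetch` request URL for a RefSeq transcript in FASTA format. It only constructs the URL and does not send a request. The endpoint and the `db`, `id`, `rettype` and `retmode` parameters are standard E-utilities usage rather than something specified in this article, and the helper name is hypothetical.

```python
from urllib.parse import urlencode

# Public E-utilities endpoint for record retrieval.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nuccore", rettype="fasta"):
    """Return an efetch URL for one accession, requesting plain-text output."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


# NM_001286811.2 is the SPEGNB transcript cited earlier in the article.
url = build_efetch_url("NM_001286811.2")
```

The resulting URL can be fetched with any HTTP client. For larger jobs, NCBI's usage guidelines on request rates and API keys apply, and bulk download from FTP or NCBI Datasets is usually the better choice.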
Bulk download and metadata retrieval
RefSeq genomic, transcript and protein data can be viewed and accessed in different formats; the best choice depends on the intended application. The RefSeq Homepage (www.ncbi.nlm.nih.gov/refseq/) provides links for accessing the data, along with related resources.
The NCBI Datasets resource (www.ncbi.nlm.nih.gov/datasets) (60) can be used for finding and retrieving RefSeq sequence and metadata, and is highly customizable with both web and command-line interfaces. Packages of RefSeq-annotated genomes can be retrieved as well as data packages for genes, where specific individual files can be requested, including those for coding sequences and 5′ or 3′ untranslated regions. Gene packages from the full ortholog set for a given gene can also be easily accessed, or individual taxonomic groups can be selected for data retrieval. RefSeq data is also available using FTP to access files for individual genomes (ftp.ncbi.nlm.nih.gov/genomes) or the entire RefSeq dataset (ftp.ncbi.nlm.nih.gov/refseq/). Archived files from suppressed or replaced RefSeq assemblies and previous EGAP annotations are also available on the genomes FTP site.
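When scripting FTP access to individual genomes, it helps to know how assembly directories are laid out. The sketch below derives the directory prefix for an assembly accession, assuming the layout used under `genomes/all/`, where the nine accession digits are split into three-digit path segments. This layout is an observation about the FTP site and is not described in the article; the final directory name also includes the assembly name, which this helper does not attempt to guess.

```python
import re

GENOMES_ALL = "https://ftp.ncbi.nlm.nih.gov/genomes/all/"


def genomes_ftp_prefix(accession):
    """Return the genomes/all/ directory prefix for an assembly accession.

    Assumes accessions of the form GCF_#########.# or GCA_#########.#.
    """
    match = re.fullmatch(r"(GC[AF])_(\d{9})\.\d+", accession)
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    prefix, digits = match.groups()
    segments = [digits[i:i + 3] for i in range(0, 9, 3)]
    return GENOMES_ALL + "/".join([prefix] + segments) + "/"


human_prefix = genomes_ftp_prefix("GCF_000001405.40")
```

Listing that prefix shows one subdirectory per assembly version. The NCBI Datasets command-line tools avoid the need to construct paths by hand.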
Graphical sequence displays
The Genome Data Viewer (www.ncbi.nlm.nih.gov/gdv) allows visualization of RefSeq eukaryotic genomes and annotations, including a wide variety of tracks such as sequence, epigenetics, expression and variation tracks available for configuration through the ‘Tracks’ gear icon (9). Displays are available for different scales, from as large as an entire chromosome to the sequence level, and comparative tracks to other assemblies from the same or other species can be included. Different preselected recommended track sets are available for quick viewing and are available for selection using the ‘NCBI recommended Track Sets’ option. The breadth and variety of tracks available differs from organism to organism, depending on the availability of public data. The Graphical Sequence Viewer can also be accessed for individual nucleotide and protein records to visualize annotation data on prokaryote and virus sequences.
The Comparative Genome Viewer (www.ncbi.nlm.nih.gov/cgv) is an interactive NCBI resource that enables global or local visualization of structural variation and its effect on annotation between pairs of genomes from either the same species or cross species (10); >1000 alignments are available for >400 species and >800 assemblies. Inversions, translocations, deletions and duplications can be easily seen at both micro- and macro-level.
Future directions
Advancements in sequencing technologies and the application of genome sequencing to a wide range of uses and organisms have led to an explosion of data in the INSDC archives, now exceeding 2.5 million genomes and 35 terabases (Tb) of sequence, making the need for high quality reference datasets greater than ever. The RefSeq project has expanded over the last 25 years to meet this demand – currently including over 395 000 genomes and 3.3 Tb of sequence – with scope tailored to meet the needs of different user communities, and an overarching emphasis on providing high-quality data across all taxonomic domains.
Coverage of prokaryote genomes in RefSeq will continue to grow, including expanding incorporation of MAG genomes with more general taxonomic classification. Further improvements to PGAP will allow the system to be more inclusive of these uncertain MAGs and improve representation of a broader diversity of prokaryotes.
With growing interest in comparative genomics and an anticipated surge in the number of eukaryote genomes from international biodiversity sequencing projects, RefSeq as part of the CGR project is working to make its trusted EGAP pipeline available as a public resource for researchers to annotate their genome assemblies themselves. A preliminary release of the pipeline (EGAPx) is already available to users in GitHub (https://github.com/ncbi/egapx), with refinements ongoing based on user feedback. In addition to enabling researchers to generate high-quality and robust genome annotation, EGAPx is expected to allow RefSeq to focus on the most high-impact eukaryotic genomes that are being used to further research and therapeutics to improve human health. RefSeq will continue to refine annotation pipelines, leverage new near-complete genome assemblies and use annotation quality metrics to provide high-quality genome annotations and ortholog calculations to support key organisms as well as provide diverse coverage of major taxonomic groups.
RefSeq Select has been a valuable resource in human (where it is called MANE Select), mouse, rat and prokaryotes for use in human clinical variant reporting and to provide a focused dataset for comparative genomics. We plan to expand this set to cover additional model organisms and eventually extend it to all organisms that are in scope of RefSeq annotation. We are also exploring criteria for highlighting additional tiers of data, for example subsets of alternatively spliced transcripts from complex genes or subsets of genomes within a taxonomic group, to support analyses that would benefit from focusing on a smaller slice of data. Development of tools to improve data quality, access and analysis for comparative genomics and pangenomics will continue as part of the NIH CGR project to support the scientific community in their research and to foster better exchange of data between NCBI’s resources and those of model organism communities. In keeping with FAIR principles (31), NCBI will continue its efforts to improve data attribution and provenance, including expanded linkage and reporting of evidence data with archival datasets in SRA and other repositories.
In total, the last 25 years have been an exciting time for the RefSeq project, and we look forward to providing reference sequence standards for the next 25 years and beyond!
Data availability
The data underlying this article are available in the NCBI Nucleotide, Protein, Genome and Gene repositories at https://www.ncbi.nlm.nih.gov. Links and documentation can be found at https://www.ncbi.nlm.nih.gov/refseq/, with data available for bulk download at https://ftp.ncbi.nlm.nih.gov/refseq/ and https://ftp.ncbi.nlm.nih.gov/genomes/refseq/.
Supplementary data
Supplementary Data are available at NAR Online.
Acknowledgements
Thanks to Yuri Wolf for suggestions on how to root the merged eukaryotic tree; Azat Badretdin, George Coulouris, Mike DiCuccio, Scott Durkin, Daniel Haft, Eric Jovenitti, Wenjun Li and Kathleen O’Neill for work on curation and production of prokaryote RefSeq genomes; and Farideh Chitsaz, Noreen Gonzales, Marc Gwadz, Gabriele Marchler, James Song, Roxanne Yamashita and Chanjuan Zheng for work on curation of protein conserved domains to inform functional annotation. Thanks to Laurie Breen for providing artwork used in the graphical abstract. The authors would like to thank all product owners, technical experts, developers and subject matter experts within groups at NCBI who support the RefSeq project. We would also like to thank members of the scientific community for constructive feedback and for submitting data to archives, and all collaborating databases for their contribution towards maintaining the high quality of RefSeq products.
Author contributions: Conceptualization was performed by K.D.P., T.D.M., S.P. and F.T.N. Data curation was performed by C.M.F., T.G., D.H., E.H., J.H., J.D.J., V.S.J., V.K., D.K., P.M., K.M.M., R.M., M.R.M., T.D.M., D.H.O., B.R., S.S.S., P.K.S. and A.R.V. Formal analysis was performed by V.K., V.B., D.H.O., T.D.M. and T.G. Software by A.A., O.E., W.H., A.K., E.M., A.S., H.S. and C.W. Writing (original draft) was done by T.G., V.K., S.P., T.D.M., B.R., V.S.J. and D.K. Writing (review and editing) was done by E.H., F.T.N. and K.D.P. Supervision was performed by T.D.M., S.P., F.T.N., J.R.B., B.K., A.M.B., and E.H.
Funding
National Center for Biotechnology Information of the National Library of Medicine, National Institutes of Health. Funding for open access charge: NCBI.
Conflict of interest statement. None declared.
References
Author notes
The first two authors should be regarded as Joint First Authors.