The Protein Naming Utility: a rules database for protein nomenclature

ABSTRACT

Generation of syntactically correct and unambiguous names for proteins is a challenging, yet vital task for functional annotation processes. Proteins are often named based on homology to known proteins, many of which have problematic names. To address the need to generate high-quality protein names, and capture our significant experience correcting protein names manually, we have developed the Protein Naming Utility (PNU, http://www.jcvi.org/pn-utility). The PNU is a web-based database for storing and applying naming rules to identify and correct syntactically incorrect protein names, or to replace synonyms with their preferred name. The PNU allows users to generate and manage collections of naming rules, optionally building upon the growing body of rules generated at the J. Craig Venter Institute (JCVI). Since communities often enforce disparate conventions for naming proteins, the PNU supports grouping rules into user-managed collections. Users can check their protein names against a selected PNU rule collection, generating both statistics and corrected names. The PNU can also be used to correct GenBank table files prior to submission to GenBank. Currently, the database features 3080 manual rules that have been entered by JCVI Bioinformatics Analysts as well as 7458 automatically imported names.

INTRODUCTION

During the annotation phase of a typical modern genomics project, functional names are assigned to identified genes and proteins in an automated or semi-automated fashion. Ideally, before such names are submitted to public sequence databases, they should be manually reviewed by experts to ensure that they are consistent, syntactically correct and unambiguous. However, with the scale of genomic data produced by next-generation sequencing technology and with increasingly automated functional annotation processes, the manual correction of names is no longer feasible. This issue is further complicated by the prevalence of ambiguous names resulting from the lack of interspecies naming conventions (1). New proteins are often named based on homology to existing proteins and many existing proteins have syntactically incorrect or ambiguous names, producing transitive annotation errors. Consequently, poor-quality names have proliferated in both public databases and the scientific literature.

The need for consistent and unambiguous names has led to the development of a number of conventions for naming genes and proteins [UniProt protein nomenclature (2), HUGO human gene name nomenclature (3) and various other model organism databases (4–7)]. In addition, the biological text mining community has created dictionaries to resolve gene/protein synonyms to improve the identification of genes and proteins in scientific articles (1,8).

The Broad Institute has developed BioNames, a tool to resolve these difficulties using collections of hard-coded regular expressions (https://sourceforge.net/projects/microbiomeutil). Here, we present our solution to this problem in the form of the Protein Naming Utility (PNU), a web-based database to store and apply customizable sets of naming rules to correct and standardize gene and protein names within an annotated genome or metagenome. The database provides an intuitive web interface that allows users to create and maintain their own naming rules and organize these rules in projects that can be shared with the community.

NAMING RULES AND DATA

The PNU does not distinguish between protein and gene names: the term ‘name’ is used for either. The PNU features two distinct types of naming rules: ‘full matches’ and ‘partial matches’. A ‘full match’ replaces full name A (nonpreferred name, e.g. a synonym or misspelling) with full name B (preferred name), while a ‘partial match’ matches only a component of the name. Partial matches trigger either a partial name change or a ‘warning’. A ‘warning’ allows the user to flag a matched name as suspicious and enter an alternative name when checking names (Figure 1B). A summary of all ‘partial match’ actions is given in Table 1.
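The ‘full match’ concept can be illustrated with a small data-structure sketch. This is not the PNU implementation: the class `FullMatchEntry`, the functions `build_index` and `apply_full_match`, and the nonpreferred names in the example are hypothetical and chosen only for illustration; the preferred name and the EC number are the ones shown in Figure 1A.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FullMatchEntry:
    """One 'full match' entry: several nonpreferred names map to one preferred name."""
    preferred: str
    nonpreferred: List[str] = field(default_factory=list)
    reference: Optional[str] = None  # optional external reference, e.g. an EC number


def build_index(entries: List[FullMatchEntry]) -> Dict[str, str]:
    """Build a lookup table from each nonpreferred name to its preferred name."""
    index: Dict[str, str] = {}
    for entry in entries:
        for alias in entry.nonpreferred:
            index[alias] = entry.preferred
    return index


def apply_full_match(name: str, index: Dict[str, str]) -> str:
    """Return the preferred name only if the WHOLE input equals a nonpreferred name."""
    return index.get(name, name)


# Illustrative data: the nonpreferred names below are assumptions, not PNU content.
ENTRIES = [
    FullMatchEntry(
        preferred="enoyl-[acyl-carrier-protein] reductase (NADH)",
        nonpreferred=["enoyl-ACP reductase", "enoyl-acp reductase"],
        reference="EC 1.3.1.9",
    )
]
INDEX = build_index(ENTRIES)
```

Because the lookup compares the complete name, an input such as "enoyl-ACP reductase family protein" is left untouched by this sketch; changing a component of a name is the job of a ‘partial match’.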

Figure 1.

Screenshots of the user interface. (A) Full match entry: this ‘full match’ entry links four nonpreferred names to one preferred name, here ‘enoyl-[acyl-carrier-protein] reductase (NADH)’. The preferred name may be linked to an external reference, here EC 1.3.1.9 of the IUBMB (9). (B) PNU report: the report provides basic statistics in the heading. The table contains five columns: the number of entries for the respective input name, the input name, the PNU naming suggestion, a user confirmation check box and a link to further details. The bottom row in the figure represents a warning. If the user chooses to change the name associated with the warning, they can input the new name in the blank field under ‘Enter Suggested Name’. Checked and entered names will then be used to correct and update the imported file.


Figure 2.

Overview of PNU use cases and project customization. Rules are entered via the web interface (either one by one or in a batch) and may be organized into groups, procedures and projects. Projects specify the set of rules that are used to correct names in input files. The PNU report allows users to verify name changes before correcting names (Figure 1B).


Table 1.

List of partial match actions

Action	Match Value	Replace Value	Example Input	Example Output
full replace	DUF	conserved hypothetical protein	hypothetical protein (DUF 1092)	conserved hypothetical protein
partial replace	7-DHC	7-dehydrocholesterol	7-DHC reductase	7-dehydrocholesterol reductase
remove	homolog	N/A	putative repressor homolog	putative repressor
merge duplicates	putative	N/A	putative kinase putative	putative kinase
move to beginning	putative	N/A	acyltransferase, putative	putative acyltransferase
move to end	putative	N/A	putative calicivirin	calicivirin, putative
regular expr. warning	/Salmonella/i	N/A	Salmonella invasin chaperone	WARNING
regular expr. local	/acyl-[cC]o[aA]/acyl-CoA/		acyl-coa dehydrogenase	acyl-CoA dehydrogenase
regular expr. global	/[Gg]nat family/GNAT family/g		acetyltransferase, Gnat family	acetyltransferase, GNAT family

For full and partial replace actions, users need to enter two input fields (match and replace value), while the other actions need only one input field.

Perl-styled regular expressions can be used for the three regular expression actions. The example input and output columns demonstrate the respective action. All may match multiple names.
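The actions in Table 1 can be approximated with a few short functions. The sketch below is one possible reading of the table, not the PNU source code: all function names are hypothetical, plain substring matching is assumed for the non-regex actions, and the Perl-style `/pattern/replacement/flags` values are passed to Python's `re` module as separate pattern and replacement arguments.

```python
import re


def _tidy(name):
    """Collapse repeated whitespace and strip stray spaces/commas at the ends."""
    name = re.sub(r"\s+", " ", name)
    name = re.sub(r"\s+,", ",", name)
    return name.strip(" ,")


def full_replace(name, match, replacement):
    """Replace the whole name when the match value occurs anywhere in it."""
    return replacement if match in name else name


def partial_replace(name, match, replacement):
    """Replace only the matching component of the name."""
    return name.replace(match, replacement)


def remove(name, match):
    """Delete the match value and clean up leftover separators."""
    if match not in name:
        return name
    return _tidy(name.replace(match, ""))


def merge_duplicates(name, match):
    """Keep the first occurrence of the match value and drop later ones."""
    first = name.find(match)
    if first == -1:
        return name
    end = first + len(match)
    return _tidy(name[:end] + name[end:].replace(match, ""))


def move_to_beginning(name, match):
    """Move the match value to the front of the name."""
    if match not in name:
        return name
    return _tidy(match + " " + _tidy(name.replace(match, "")))


def move_to_end(name, match):
    """Move the match value to the end of the name, after a comma."""
    if match not in name:
        return name
    return _tidy(name.replace(match, "")) + ", " + match


def regex_warning(name, pattern, flags=0):
    """Return True when the name should be flagged with a warning."""
    return re.search(pattern, name, flags) is not None


def regex_local(name, pattern, replacement):
    """Substitute only the first regular-expression match."""
    return re.sub(pattern, replacement, name, count=1)


def regex_global(name, pattern, replacement):
    """Substitute every regular-expression match (Perl's /g flag)."""
    return re.sub(pattern, replacement, name)
```

For example, `move_to_end("putative calicivirin", "putative")` yields the Table 1 output "calicivirin, putative". Substring matching is a simplification: under this assumption the ‘remove’ rule for "homolog" would also alter "homologue", which a production rule set would probably need to handle with word boundaries.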

The deployed public version of the PNU comes preloaded with 11 115 rules (577 ‘partial matches’ and 10 538 ‘full matches’). Of these, 3080 have been manually curated by expert annotators and 7458 ‘full matches’ are synonym pairs imported from the IUBMB database (9). JCVI analysts continuously add and improve rules and make them available through the PNU. Users can enter and modify rules by setting up their own PNU account via the web interface, detailed below.

USER INTERFACE

Entering rules

Users can create their own PNU account, which allows them to customize their work environment. During the account-creation process, users have the option either to enter their rules from scratch or to build upon the most current JCVI rules in the PNU database, which are then imported into the user's profile. After this initial setup, users can create their own rules (entered one by one or uploaded in batch) or modify existing ones (Figure 1A).

Organizing rules

Rules are organized into PNU ‘projects’, with the goal of helping the user to organize, share and apply rules. A project may contain several ‘groups’ and ‘procedures’ (Figure 2). Each ‘group’ contains several ‘full matches’, while each ‘procedure’ contains several ‘partial matches’. The order of ‘partial matches’ in a ‘procedure’ matters because ‘partial matches’ are executed sequentially, i.e. the output of the first ‘partial match’ becomes the input of the second ‘partial match’ and so forth. The interface allows users to adjust the order of ‘procedures’ and ‘partial matches’. The following constraint applies to ‘full matches’: a nonpreferred name cannot match an existing preferred name and vice versa. For all other types (‘partial matches’, ‘groups’, ‘procedures’ and ‘projects’), the name must be unique. Users can share a project with the community by checking its ‘public project’ attribute.
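The sequential execution of ‘partial matches’ and the constraint on ‘full matches’ can be sketched as follows. This is a minimal illustration under stated assumptions, not the PNU code: `run_procedure` and `check_full_match` are hypothetical names, the second rule (`expand_coa`) is invented to show order dependence, and the ‘full match’ pair used in the usage note is illustrative only.

```python
import re
from typing import Callable, Dict, List

Rule = Callable[[str], str]


def run_procedure(name: str, rules: List[Rule]) -> str:
    """Apply rules in order; each rule receives the previous rule's output."""
    for rule in rules:
        name = rule(name)
    return name


def fix_coa_case(name: str) -> str:
    """'regular expr. local' rule from Table 1: normalize the case of acyl-CoA."""
    return re.sub(r"acyl-[cC]o[aA]", "acyl-CoA", name, count=1)


def expand_coa(name: str) -> str:
    """Hypothetical 'partial replace' rule, invented to demonstrate ordering."""
    return name.replace("acyl-CoA", "acyl-coenzyme A")


def check_full_match(nonpreferred: str, preferred: str,
                     existing: Dict[str, str]) -> List[str]:
    """Return the constraint violations a new 'full match' would cause.

    `existing` maps nonpreferred names to preferred names.
    """
    problems = []
    if nonpreferred in set(existing.values()):
        problems.append("nonpreferred name is already a preferred name")
    if preferred in existing:
        problems.append("preferred name is already a nonpreferred name")
    return problems
```

With the input "acyl-coa dehydrogenase", running `[fix_coa_case, expand_coa]` produces "acyl-coenzyme A dehydrogenase", whereas the reverse order produces "acyl-CoA dehydrogenase", because `expand_coa` finds nothing to replace until the case has been normalized. For the constraint, given an existing pair mapping "7-DHC reductase" to "7-dehydrocholesterol reductase", a new rule whose nonpreferred name is "7-dehydrocholesterol reductase" would be rejected; a plausible reason for the constraint is that it prevents chains or cycles of replacements.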

Correcting names

The web interface provides an easy-to-use reporting tool to check names against a set of naming rules stored in a PNU project. By default, the JCVI project is selected. Users can apply their own custom PNU project or select from other shared projects. The PNU report lists the overall number of matches, ‘full matches’, ‘partial matches’ and ‘warnings’ that have been found among the set of unique input names (Figure 1B). Each row represents a suggested naming operation, including the number of input entries with the respective name and the PNU suggested name. For each ‘warning’, the user can enter an alternative name in a text box. After the user has accepted relevant replacements and entered alternative names for ‘warnings’, a file can be downloaded with the original names corrected and replaced.
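The reporting step described above can be sketched roughly as follows. The functions `build_report` and `apply_confirmed` and the row layout are hypothetical, and the precedence used here (‘full match’ first, then ‘warning’, then ‘partial matches’) is an assumption made for the example; the text does not specify how the PNU resolves a name that triggers several rule types.

```python
import re
from collections import Counter


def build_report(input_names, full_matches, partial_rules, warning_patterns):
    """Summarize suggested changes over the set of unique input names.

    full_matches:     dict mapping nonpreferred name -> preferred name
    partial_rules:    ordered list of functions str -> str
    warning_patterns: list of compiled regular expressions
    """
    counts = Counter(input_names)  # one report row per unique name
    stats = {"unique": len(counts), "full": 0, "partial": 0, "warning": 0}
    rows = []
    for name, n_entries in counts.items():
        if name in full_matches:
            kind, suggestion = "full", full_matches[name]
        elif any(p.search(name) for p in warning_patterns):
            kind, suggestion = "warning", None  # the user supplies a name
        else:
            suggestion = name
            for rule in partial_rules:
                suggestion = rule(suggestion)
            if suggestion == name:
                continue  # no rule fired, so the name is not listed
            kind = "partial"
        stats[kind] += 1
        rows.append({"count": n_entries, "input": name,
                     "suggestion": suggestion, "kind": kind})
    stats["matches"] = stats["full"] + stats["partial"] + stats["warning"]
    return stats, rows


def apply_confirmed(input_names, confirmed):
    """Rewrite the input list using only the changes the user accepted."""
    return [confirmed.get(name, name) for name in input_names]
```

Here `confirmed` stands for the mapping from input names to the names the user checked or typed in the report; names without a confirmed change pass through unchanged, mirroring the user confirmation check box in Figure 1B.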

DISCUSSION

In this article, we have presented the PNU, a new web-based database for storing and applying protein naming rules. The PNU allows users to correct names in an automated fashion, leveraging curated JCVI names and incorporating their own. This will help relieve researchers from extensive manual curation of their genomes. The option to correct names in GenBank table files will aid researchers in submitting GenBank-acceptable names on the first attempt. We are reviewing past and current genome submissions for common issues flagged by GenBank to constantly improve the JCVI rule base. However, the JCVI project is only one take on naming and others are entitled to create and share their own projects. To allow users to apply rules programmatically, we plan to implement a PNU web services interface. Finally, users are requested to suggest additional features of interest.

AVAILABILITY OF THE DATABASE

A database schema (Supplementary Data), a MySQL dump file and a tab-delimited list of JCVI ‘full matches’ are available for download at: http://www.jcvi.org/pn-utility/download.php.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Funding for open access charge: National Institute of Allergy and Infectious Disease (contract HHSN266200400038C).

Conflict of interest statement. None declared.

REFERENCES

1. Fundel, Zimmer. Gene and protein nomenclature in public databases. BMC Bioinformatics (2006) 372.

2. The UniProt Consortium. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res. (2009) D169–D174.

3. Eyre, Ducluzeau, Sneddon, Povey, Bruford, Lush. The HUGO Gene Nomenclature Database, 2006 updates. Nucleic Acids Res. (2006) D319–D321.

4. Dwight, Balakrishnan, Christie, Costanzo, Dolinski, Engel, Feierbach, Fisk, Hirschman, Hong, et al. Saccharomyces genome database: underlying principles and organisation. Brief Bioinform. (2004).

5. Drysdale, Crosby. FlyBase: genes and gene models. Nucleic Acids Res. (2005) D390–D395.

6. Bult, Eppig, Kadin, Richardson, Blake. The Mouse Genome Database (MGD): mouse biology and model systems. Nucleic Acids Res. (2008) D724–D728.

7. Dwinell, Worthey, Shimoyama, Bakir-Gungor, DePons, Laulederkind, Lowry, Nigram, Petri, Smith, et al. The Rat Genome Database 2009: variation, ontologies and pathways. Nucleic Acids Res. (2009) D744–D749.

8. Liu, Zhang. BioThesaurus: a web-based thesaurus of protein and gene names. Bioinformatics (2006) 103–105.

9. McDonald, Boyce, Tipton. ExplorEnz: the primary source of the IUBMB enzyme list. Nucleic Acids Res. (2009) D593–D597.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
