SmProt: a database of small proteins encoded by annotated coding and non-coding RNA loci

Hao, Yajing; Zhang, Lili; Niu, Yiwei; Cai, Tanxi; Luo, Jianjun; He, Shunmin; Zhang, Bao; Zhang, Dejiu; Qin, Yan; Yang, Fuquan; Chen, Runsheng

doi:10.1093/bib/bbx005

Abstract

Small proteins are proteins shorter than 100 amino acids. Identification and functional studies of small proteins have advanced rapidly in recent years, and several studies have shown that small proteins play important roles in diverse functions including development, muscle contraction and DNA repair. Identification and characterization of previously unrecognized small proteins may contribute in important ways to cell biology and human health. Current databases are deficient in that they either have not collected small proteins systematically or contain only predictions of small proteins in a limited number of tissues and species. Here, we present a specifically designed, web-accessible database of small proteins, SmProt (http://bioinfo.ibp.ac.cn/SmProt). The current release of SmProt incorporates 255 010 small proteins computationally or experimentally identified in 291 cell lines/tissues derived from eight popular species. The database provides a variety of data including basic information (sequence, location, gene name, organism, etc.) as well as specific information (experiment, function, disease type, etc.). To facilitate data extraction, SmProt supports multiple search options, including species, genome location, gene name and aliases, cell line/tissue, ORF type, gene type, PubMed ID and SmProt ID. SmProt also incorporates a BLAST alignment search service and provides a local UCSC Genome Browser. Additionally, SmProt defines a high-confidence set of small proteins and predicts the functions of the small proteins.

small proteins, ribosome profiling, non-coding RNA, sORFs

Introduction

Identification of coding elements in the genome is a fundamental step towards understanding the building blocks of living systems. Previous genome annotation pipelines mainly focused on proteins longer than 100 amino acids [1, 2]. However, recent studies have shown that many proteins shorter than 100 amino acids (small proteins) also play important roles in diverse functions including development [3], muscle contraction [4, 5] and DNA repair [6]. For example, the functional protein TAL influences development in Drosophila melanogaster [7], and recently, a small protein, MLN, encoded by a putative long non-coding RNA (lncRNA), has been found to regulate muscle performance in human [4]. Identification of the previously neglected small proteins may contribute in important ways to cellular and organismal biology, emphasizing the need for an unbiased and comprehensive strategy to evaluate translation empirically [8].

Deep transcriptome sequencing has revealed the existence of many transcripts that lack long or conserved open reading frames (ORFs), and such transcripts have generally been termed as lncRNAs [9]. Some well-known lncRNAs have been shown to play key roles in diverse biological processes including chromatin remodelling and imprinting [10], cancer metastasis [11] and cell proliferation [12]. Although a number of lncRNAs have established regulatory functions, the vast majority of lncRNAs do not have known functions. While their existence is undisputed, their coding potential and functionality have remained controversial [13–15]. To distinguish lncRNAs from coding mRNAs, previous studies mainly relied on computational approaches [16–20] by evaluating the transcript features, such as ORF length, ORF coverage and protein sequence conservation. However, these approaches may give rise to misclassifications in that lncRNAs containing short conserved regions might be misclassified as protein-coding transcripts (false negatives), whereas actual protein-coding transcripts containing short or weakly conserved ORFs might be misclassified as non-coding (false positives).

In recent years, the use of comparative genomics [21, 22], proteomics [23, 24] and a combination of evolutionary conservation and ribosome profiling data [25, 26] has indicated that translation is far more pervasive than hitherto acknowledged and has identified many coding transcripts previously assumed to be non-coding RNAs [27–29]. For example, Ingolia et al. showed that many ribosomes occupied regions of the transcriptome assumed to be non-coding [30]. Bazzini et al. leveraged the periodicity of ribosome movement on the mRNA to define actively translated ORFs by ribosome footprinting, and found several hundred small open reading frames (smORFs, also called sORFs) encoded by transcripts without defined coding sequences [26]. In addition, Sebastiaan et al. performed systematic RNA sequencing on ribosome-associated RNA pools obtained through ribosomal fractionation and demonstrated that lncRNAs may have a function in ribosome complexes [29]. Recently, Calviello et al. used a rigorous statistical approach to identify translated regions on the basis of the characteristic three-nucleotide periodicity of Ribo-seq data and found distinct ribosomal signatures for several hundred upstream ORFs and ORFs in annotated non-coding genes [31]. Additionally, several small proteins apparently originating from intergenic regions have been shown to be functional [3–6]. These small proteins have diverse regulatory roles; therefore, a small protein database will provide valuable information on small proteins for the scientific community as well as offer new avenues of research into the functions of what have hitherto been regarded as lncRNAs.

Currently, a few databases contain small proteins, but none of them focuses on small proteins encoded by non-coding RNA loci. UniProt [32] and CCDS [33–35] concentrate on proteins longer than 100 amino acids, although they contain some small proteins. sORFs.org [36] only harbours sORFs calculated from ribosome profiling data in three cell lines. Although these existing databases offer some information about small proteins, an in-depth investigation of small proteins across popular species is still lacking.

Here, we present a specifically designed database of small proteins, SmProt. The current release of SmProt incorporates 255 010 small proteins computationally or experimentally identified in 291 cell types/tissues originating from eight species (human, mouse, rat, fruit fly, zebrafish, yeast, Caenorhabditis elegans and Escherichia coli). SmProt provides a user-friendly website for users to search, browse, submit, BLAST, export and download data on small proteins. The BLAST alignment search service and an integrated local UCSC Genome Browser, which visualizes the genomic locations of small proteins, greatly improve the user experience. SmProt also defines a high-confidence set of small proteins and predicts the functions of the small proteins curated from ribosome profiling calculation and literature mining. Altogether, we strongly believe that the current SmProt database will facilitate further investigations on small proteins.

Methods and materials

The pipeline for construction of SmProt is depicted in Figure 1. The small proteins curated in SmProt were obtained from four different processing pipelines. The genes encoding the collected small proteins were re-annotated with specific IDs, and the small proteins were categorized based on their genomic locations and data sources. The detailed procedure is explained in the following sections.

Figure 1

Pipeline for construction of the SmProt database (see main text for details).

Manual curation of small proteins from the scientific literature and other databases

To obtain small proteins from the literature, we searched PubMed using a set of keywords (listed in Supplementary Materials). We found 5475 articles (union set) published up to 23 December 2015 and obtained their abstracts from the PubMed database. Three people then reviewed each abstract independently to identify the papers containing information on small proteins, and we integrated the results. An article was retained if all three reviewers agreed to keep it. If an abstract received only two votes out of three, we double-checked its eligibility by reading the full article. The remaining articles were then divided into ‘high-throughput’ and ‘low-throughput’, the former term applied to articles dealing with batch discovery of small proteins, and the latter applied to papers focusing on a smaller number of specific small proteins. The basic information on the small proteins was extracted manually, and only small proteins with strong support from experimental evidence were kept for further consideration. From the ‘high-throughput’ literature, we extracted detailed information according to what was provided in each article. From the ‘low-throughput’ literature, we extracted information such as experimental design, start codons, general description of the small proteins, functions and possible associated diseases. We have attempted to make the information in SmProt as comprehensive as possible, and have included up to 28 pieces of information for each small protein entry. We also obtained small proteins from the CCDS and UniProt databases, so as not to miss small proteins that lack support from publications because they were submitted directly to the databases. For UniProt, we retained only small proteins that had been manually annotated and reviewed by UniProtKB curators and whose existence was supported by evidence at either the transcript level or the protein level. For the small proteins shorter than 100 amino acids in the CCDS database, the feature information was adopted with no additional changes, as the proteins curated in CCDS are already of high quality.
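The abstract-screening rule described above can be sketched as a small decision function. This is only an illustration of the stated voting logic, not the authors' tooling: the names `Decision` and `triage_abstract` are hypothetical, and the handling of abstracts with fewer than two votes is an assumption, because the text does not say what happened to them.

```python
from enum import Enum


class Decision(Enum):
    KEEP = "keep"
    FULL_TEXT_REVIEW = "full-text review"
    DISCARD = "discard"


def triage_abstract(votes):
    """Map three independent keep/reject votes to a screening decision."""
    if len(votes) != 3:
        raise ValueError("expected exactly three reviewer votes")
    keep_votes = sum(bool(v) for v in votes)
    if keep_votes == 3:
        # All three reviewers agreed to keep the article.
        return Decision.KEEP
    if keep_votes == 2:
        # Two of three: eligibility is re-checked by reading the full article.
        return Decision.FULL_TEXT_REVIEW
    # Assumption: articles with zero or one vote were dropped.
    return Decision.DISCARD
```

For instance, `triage_abstract([True, True, False])` routes the article to full-text review rather than keeping or dropping it outright.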

Small proteins predicted in silico from ribosome profiling data sets

Ribosome profiling, that is, deep sequencing of ribosome-protected RNA fragments, has emerged as an efficient technique for comprehensive and quantitative assaying of translation activity. A variety of algorithms and metrics have been developed to use ribosome profiling data for annotation of translated regions of the genome. To construct the database, we identified small proteins in publicly available ribosome profiling data sets, which we located in the GEO database [37] and the European Nucleotide Archive (http://www.ebi.ac.uk/ena) using a set of keywords (listed in Supplementary Materials). First, we downloaded 60 ribosome profiling data sets covering 26 cell lines or tissues from eight different species (Supplementary Table S1). We also downloaded RNA-seq data sets from the GEO database or the European Nucleotide Archive that originated from the same cell lines/tissues and species. The ribosome profiling sequencing (Ribo-seq) reads were stripped of adaptor sequences using Trimmomatic [38], and reads shorter than 28 bases were discarded before removing reads aligning to rRNA sequences using Bowtie2 [39] with the default parameters. Additionally, using the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/), reads were discarded if they contained a base with a quality score of ≤20. The RNA-seq reads were pre-processed likewise. Pre-processed Ribo-seq and RNA-seq reads were aligned to the relevant genome (hg19, mm10, rn6, sacCer3, dr7, dm3, EB1 and ce10) with the split-aware aligner STAR [40]. A maximum of four mismatches was allowed, and multimapping to up to eight different positions was permitted. RiboTaper was used to obtain small proteins as previously described [31], retaining only small proteins that passed the multimapping filter.
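The two read-level filters stated above (minimum length of 28 bases, and rejection of any read containing a base with quality ≤20) can be illustrated with a short sketch. It is not the Trimmomatic or FASTX-Toolkit invocation used for SmProt, and it omits adaptor trimming and rRNA removal. The function names are hypothetical, and the Phred+33 quality encoding is an assumption.

```python
MIN_LENGTH = 28    # reads shorter than this are discarded
MIN_QUALITY = 20   # a read is discarded if any base has quality <= this value
PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def passes_filters(sequence, quality_string):
    """Return True if the read satisfies both the length and quality rules."""
    if len(sequence) < MIN_LENGTH:
        return False
    return all(ord(ch) - PHRED_OFFSET > MIN_QUALITY for ch in quality_string)


def filter_fastq(lines):
    """Yield (header, sequence, quality) for four-line FASTQ records that pass.

    A truncated trailing record (fewer than four lines) is silently ignored.
    """
    it = iter(lines)
    for header, sequence, _separator, quality in zip(it, it, it, it):
        sequence, quality = sequence.strip(), quality.strip()
        if passes_filters(sequence, quality):
            yield header.strip(), sequence, quality
```

Applying the length filter first avoids scanning the quality string of reads that would be discarded anyway.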

Table 1

Types of data sources included in the SmProt database

Data sources	Description
Low-throughput literature mining	Literature was retrieved from PubMed using a set of keywords, and detailed information was extracted manually from articles that focused on a few specific small proteins.
High-throughput literature mining	Literature was retrieved from PubMed using a set of keywords, and detailed information was extracted manually from articles that focused on batch discovery of small proteins.
MS data	MS data sets were collected from the ENCODE project, the EMBL-EBI PRIDE Archive or our lab, and were analysed to obtain small proteins encoded by ncRNAs.
Ribosome profiling	Ribosome profiling data sets were collected from the GEO or ENA database, and RiboTaper was used for small protein identification.
Databases	Small proteins were also collected from databases. Only reliable small proteins (such as those manually annotated and reviewed) were retained, and they were reprocessed according to the flow chart.

Small proteins calculated from mass spectrometry data sets

Mass spectrometry (MS) data can provide direct evidence for small proteins. In principle, evidence from MS can be regarded as more reliable than evidence obtained by other methods. To obtain small proteins that are encoded by annotated non-coding transcripts and supported by MS data sets, we downloaded the genome locations of small proteins in four different cell types (H1-hESC, K562, GM12878 and in vitro differentiated neural cells, male embryo) from the ENCODE project [41]. We also obtained raw MS data sets from our lab and from the EMBL-EBI PRIDE Archive [42]. Peppy [43] with default parameters was used to match the MS/MS spectra to the reference genome and obtain the best genomic location for each spectrum. We obtained the non-coding RNA transcripts from the NONCODE database [44–48] and, using intersectBed from BEDTools [49], retained only small proteins mapping to sites located entirely within the locus of an annotated non-coding RNA. The MS data in SmProt are additional evidence for the presence of small proteins encoded by non-coding RNAs. If small proteins obtained from literature mining or ribosome profiling prediction are also identified in MS data, the MS data provide the stronger evidence. We also provide the genomic locations of all the MS/MS spectra in the genome browser web page, regardless of whether they intersect non-coding RNA loci.
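The containment criterion above (a small protein locus must lie entirely within an annotated non-coding RNA locus) can be sketched in a few lines. This illustrates the criterion only and is not the intersectBed command used for SmProt. The names `Interval`, `is_contained` and `loci_within_ncrna` are hypothetical, and the sketch assumes BED-style 0-based half-open coordinates, unspliced loci and no strand check.

```python
from collections import namedtuple

# Assumption: BED-style coordinates (0-based start, exclusive end).
Interval = namedtuple("Interval", "chrom start end name")


def is_contained(inner, outer):
    """True if `inner` lies entirely within `outer` on the same chromosome."""
    return (inner.chrom == outer.chrom
            and outer.start <= inner.start
            and inner.end <= outer.end)


def loci_within_ncrna(peptide_loci, ncrna_loci):
    """Keep peptide loci fully contained in at least one ncRNA locus."""
    by_chrom = {}
    for locus in ncrna_loci:
        by_chrom.setdefault(locus.chrom, []).append(locus)
    return [p for p in peptide_loci
            if any(is_contained(p, nc) for nc in by_chrom.get(p.chrom, []))]
```

A locus that only partially overlaps a non-coding RNA is dropped, which is the stricter reading of "located entirely within".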

Re-annotation and re-organization

After obtaining small proteins from the four data sources mentioned above, we used gene symbols to annotate the genes encoding the small proteins. We also assigned a NONCODE ID, RefSeq ID [50] and ENSEMBL ID [51] to each of the genes. In addition, we re-organized the types of small proteins based on their location relative to the genes encoding them. Small proteins were assigned as sORF if they were encoded by annotated non-coding RNAs or were not translated entirely from the 5ʹ UTR or 3ʹ UTR of protein-coding genes. Small proteins translated entirely within the 5ʹ UTR (or the 3ʹ UTR) of an mRNA were assigned as uORF (or dORF). As the different processing pipelines carry different levels of confidence, we re-organized them and defined five different data sources, as described in Table 1, and assigned a data source to each small protein. The small protein entries in SmProt were named systematically. Small proteins from the same organism, whatever their data source, were assigned an ID consisting of ‘SPRO’, a symbol representing the organism and a number. For example, ‘SPROMUS000018’ denotes a small protein from mouse (‘MUS’ standing for ‘Mus musculus’). In addition, we defined a high-confidence set of small proteins, comprising entries obtained from low-throughput literature mining, from databases, from high-throughput literature mining supported by MS data or from ribosome profiling supported by MS data; these represent the highest-quality small protein entries in the database.
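The ORF-type assignment and the high-confidence rule described above can be written as two small predicates. This is a sketch of the stated rules only: the function names, argument names and the lower-case source labels are hypothetical and do not reflect how SmProt stores these values.

```python
# Data sources that qualify for the high-confidence set on their own.
ALWAYS_HIGH = {"low-throughput literature mining", "databases"}
# Data sources that qualify only when supported by MS data.
HIGH_IF_MS = {"high-throughput literature mining", "ribosome profiling"}


def orf_type(gene_is_noncoding, within_5utr=False, within_3utr=False):
    """Assign sORF / uORF / dORF from the location of the small protein."""
    if gene_is_noncoding:
        return "sORF"  # encoded by an annotated non-coding RNA
    if within_5utr:
        return "uORF"  # translated entirely within the 5' UTR of an mRNA
    if within_3utr:
        return "dORF"  # translated entirely within the 3' UTR of an mRNA
    return "sORF"      # any other location in a protein-coding gene


def is_high_confidence(data_source, ms_supported):
    """Apply the high-confidence definition to one entry."""
    source = data_source.strip().lower()
    return source in ALWAYS_HIGH or (source in HIGH_IF_MS and bool(ms_supported))
```

Under this reading, an entry whose only data source is MS data is not in the high-confidence set by itself, because the definition lists MS data only as support for two other sources.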

Functional annotation of small proteins

To give researchers a better understanding of small protein functions, we used InterProScan [52] with the default parameters to predict the functions of small proteins obtained from databases, high-throughput literature mining and ribosome profiling. Nearly all small proteins obtained from low-throughput literature mining already had specific functions described, while the small proteins obtained from MS data serve as evidence for the presence of small proteins encoded by non-coding RNAs. All results have been added to the SmProt database.

Results

Database content

Currently, SmProt contains 255 010 small protein entries representing eight popular species (Table 2). Each small protein entry in the SmProt database has three main data components: General Information, Detailed Information and References. The General Information provides users with basic information including small protein ID, predicted functions, sequence, length, genomic location, ORF type, transcript ID, gene symbol, gene type, organism, gene IDs in the NONCODE, RefSeq and ENSEMBL databases, tissue or cell line and data source. The References component contains the PubMed ID (PMID), the article title, the authors and the journal in which the article was published.

Table 2

SmProt Small protein statistics

Species	Number
Human	167 785
Fruit fly	39 015
Caenorhabditis elegans	18 357
Mouse	15 581
Rat	8128
Zebrafish	2994
Yeast	1875
Escherichia coli	1275
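As a quick consistency check, the per-species counts in Table 2 can be summed and compared with the stated total of 255 010 entries. The dictionary below simply transcribes the table; the variable names are illustrative.

```python
# Per-species counts transcribed from Table 2.
counts = {
    "Human": 167785,
    "Fruit fly": 39015,
    "Caenorhabditis elegans": 18357,
    "Mouse": 15581,
    "Rat": 8128,
    "Zebrafish": 2994,
    "Yeast": 1875,
    "Escherichia coli": 1275,
}

total = sum(counts.values())
print(total)  # 255010, matching the total stated in the text
```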

The Detailed Information component provides information on each small protein according to its data source:

Low-throughput literature mining: The Detailed Information tables for entries obtained from low-throughput literature mining include the start codon of the small protein, the experimental method(s) used to obtain or characterize it, its function, the disease(s) in which it may be involved and its description in the literature.
High-throughput literature mining: The Detailed Information tables for entries curated from high-throughput literature mining vary according to the article from which the information was obtained. The fields are described in the user manual on the Help web page.
MS data: The Detailed Information tables for entries curated from MS data include Raw Score, Spectrum ID, Peptide Rank and Peptide Repeat Count.
Ribosome profiling data: The Detailed Information tables for entries curated from ribosome profiling data contain a variety of data, including transcripts per million for both the Ribo-seq and RNA-seq data, the (relative) positions of start and stop codons for both the small protein and any annotated CDS (coding sequence) in the transcript, the number of sequence reads for the small protein in both the Ribo-seq and the RNA-seq data, the number of P sites and RNA sites in the small protein, P values for the small protein (calculated by the multitaper method) for both the Ribo-seq and the RNA-seq data, the number of exons in the small protein and the ribosome profiling data set ID.
Databases: The Detailed Information tables for entries obtained from databases include the original ID and detailed information about the small protein, such as evidence and experiment.

Web interface

All data were organized into a set of relational MySQL tables. The database query and user interface were developed using HTML, PHP (http://www.php.net/), CSS and JavaScript. Figure 2 illustrates the user interface of the database.

Figure 2

User interface of the SmProt database. (A) A quick search box on the home page. (B) A detailed search function provided in the Search web page. (C) The detailed information of search results. (D) The detailed information of small proteins curated in SmProt. (E) The Browse web page. (F) The Download web page. (G) The Submit web page. (H) The BLAST service provided in the Blast web page. (I) The local UCSC Genome Browser for visualization.

Searching and browsing

The SmProt website includes several user-friendly search boxes, which make data retrieval easy and efficient. A quick search box is available on the home page for fast searching by Gene Symbol or Gene IDs from related databases (e.g. NONCODE, RefSeq, ENSEMBL), cell line or tissue, PMID, ORF type and gene type (Figure 2A). A detailed search function is also provided (Figure 2B). The search function is divided into three parts. (1) Keyword search, allowing searching by a variety of keywords, including gene symbol, gene ID, cell line/tissue, gene type or ORF type. Keyword search results can be filtered by data sources and species. (2) Location search, by which small protein loci overlapping biological features of interest (e.g. chromosome, species, genomic region) will be obtained. (3) The option for ID search can be used if the small protein-related ID exists in any of the major databases (i.e. NONCODE, ENSEMBL, RefSeq or PubMed), or if the ID is already created in the SmProt database. The search results are obtained by clicking the ‘submit’ button (Figure 2C). If users want to view the detailed information of any particular small protein occurring in the search results, they can click the corresponding SmProt_ID, which links to the Detailed Information table (Figure 2D).

The database can be browsed by clicking the ‘Browse’ tab on the navigation menu. The Browse web page displays basic information including SmProt_ID, protein size (SmProt_length), protein type (ORF_Type), the name/ID (Gene) or type (gene type) of the genomic loci encoding the small proteins, organism, the data source of the small proteins (Data Source) and PMID (Figure 2E). Users can select the species, ORF type, gene type and data source of interest and click the Browse button to retrieve results, which are shown below the selection panel. The result list can be navigated either by changing the number of records shown or by clicking the page numbers at the bottom right corner of the table. The result list can also be sorted on a column by clicking the ‘∧’ or ‘∨’ in that column. Users can export detailed information using the export button and click the ‘SmProt_ID’ to obtain the detailed information on any specific small protein.

Download, export and submit

Specific information and sequence information of the small proteins stored in the database can be downloaded in TXT or FASTA format from the Download web page (Figure 2F). The high-confidence set in the SmProt database is also provided on the Download web page. Downloading can be performed either from the Download page or while browsing and searching specific data. The queried data can be exported as a TXT or EXCEL file, using the export button at the top right of each data table. To maintain an up-to-date and comprehensive resource, SmProt also encourages users to submit newly published small proteins on the Submit web page in the requested data format (Figure 2G).
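A downloaded FASTA file can be read with a few lines of standard-library Python. The sketch below assumes only the generic FASTA layout (a header line starting with ‘>’ followed by one or more sequence lines). The exact header fields in SmProt's download files are not specified here, so the parser returns the header text unchanged; `read_fasta` is a hypothetical helper name.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def shorter_than(records, max_length=100):
    """Keep records whose sequence is shorter than `max_length` residues."""
    return [(h, s) for h, s in records if len(s) < max_length]
```

With a real file, `read_fasta(open(path))` can be passed straight to `shorter_than` to confirm that every entry respects the 100 amino acid limit used to define small proteins.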

Integration with a service for the BLAST alignment search and a UCSC genome browser

In the SmProt database, we have integrated an online BLAST service (NCBI wwwBLAST version 2.2.24), which allows sequence similarity searches of both nucleotides and proteins to be run on the Blast web page (Figure 2H). Importantly, a small protein with no recognized gene name or ID can therefore be searched for in SmProt simply by its sequence. SmProt has also integrated a local UCSC Genome Browser (http://genome.ucsc.edu/) for visualization of the genomic locations of the small proteins in the SmProtTable track (Figure 2I). Small proteins curated from MS data are shown as an independent track in the genome browser. A small protein with no recognized gene name or ID can also be searched for by its genomic location in the genome browser. Associated tracks such as NONCODE lncRNA, NONCODE Gene, RefSeq Genes and Ensembl Genes are also shown in the genome browser.

Discussion and future development

Comparison with existing databases

By integrating data on small proteins from the literature, MS data and ribosome profiling data, SmProt provides easy access to unbiased and comprehensive sets of small proteins derived from eight species. The existing small protein database, sORFs.org [36], is a repository of small open reading frames (sORFs) identified by ribosome profiling and only harbours sORFs calculated from ribosome profiling data in three cell lines. Compared with it, the SmProt database excels in the following aspects: (i) Multiple data sources. SmProt not only collected small proteins that have been computationally predicted from ribosome profiling data but also manually curated information from the scientific literature and known databases. Small protein loci were also predicted based on MS data obtained from public databases or generated experimentally in our laboratory. (ii) Substantially expanded data volumes. The current release of SmProt incorporates small proteins computationally or experimentally identified in 291 cell types and tissues derived from eight species. In comparison, sORFs.org used only ribosome profiling data from three cell lines from three species. (iii) A more stringent data analysis approach, yielding more reliable prediction results. In SmProt, we used RiboTaper (published in Nature Methods in 2016) to identify translated regions on the basis of the characteristic three-nucleotide periodicity of the Ribo-seq data. (iv) Comprehensive annotation of all small proteins. Basic information about each small protein as well as more specific and source-dependent information is made easily available. SmProt provides up to 28 pieces of information for each small protein, which helps users to better evaluate the search results. (v) Multiple options for keyword searches. To facilitate data extraction, SmProt provides multiple search options, including species, genome location, gene name, cell type/tissue and ORF type. Moreover, SmProt allows users to search by gene symbol, ENSEMBL ID, RefSeq ID and NONCODE ID. In contrast, sORFs.org only allows users to search by ENSEMBL ID in the gene term. (vi) A user-friendly website. SmProt incorporates a BLAST alignment search service and integrates a local UCSC Genome Browser for the visualization of the small protein genomic locations.

Compared with general databases that contain small proteins, SmProt is a specialized database with a specific focus on small protein collection, and is the first database to pay attention to small proteins encoded by loci annotated as non-coding RNAs. The proteins curated in UniProt and CCDS are mainly proteins longer than 100 amino acids encoded by known protein-coding genes. For the eight species selected in SmProt, the number of small proteins with high confidence in CCDS and UniProt is 1517 and 1481, respectively. However, 95.8% of these small proteins were already present in SmProt, and the remaining 4.3%, which were absent from SmProt, were mainly submitted directly to the databases. So as not to overlook these small proteins, we also obtained the small proteins from these two databases and processed them to add the SmProt-specific features.

Future development

The overall goal of the SmProt database is to provide a comprehensive resource to facilitate further studies of small proteins and their functions. SmProt will also provide a new tool for exploring the functional mechanisms of annotated lncRNAs. The small protein research field is probably only beginning to unfold. In the future, we will continue to update the database and integrate more species. Furthermore, we will continue to expand the storage space and improve the server performance for storing and analysing these data. We expect that, through these continuous efforts to develop and improve SmProt, we will contribute to the general understanding of small proteins and their roles in cellular function.

Key Points

To provide an informative data source and valuable information on small proteins for the whole scientific community, we developed a small protein database that integrates 255 010 small proteins curated from five different data sources in eight popular species.
SmProt annotates all small proteins comprehensively. Basic information about each small protein as well as more specific and source-dependent information is well described. Importantly, we also provide predicted functions for the small proteins.
SmProt provides a user-friendly website, incorporating a BLAST alignment search service and integrating a local UCSC Genome Browser for the visualization of the small protein genomic locations. Additionally, SmProt defines a high-confidence set of small proteins, which can be downloaded from the Download web page.
SmProt also curates small proteins encoded by non-coding regions, which offers new avenues of research into the functions of what have hitherto been regarded as lncRNAs.

Supplementary Data

Supplementary data are available online at http://bib.oxfordjournals.org/.

Acknowledgement

We are very grateful to Dr Geir Skogerbø for helpful suggestions and critical reading of this manuscript. We also thank Dr Xiaowei Chen and Dr Zhen Fan for discussion on high-throughput sequencing.

Funding

National Natural Science Foundation of China (grant number 31520103905) and the National High Technology Research and Development Program (‘863’ Program) of China (grant numbers 2015AA020108 and 2014AA021502).

Yajing Hao is a PhD candidate at the Key Laboratory of RNA Biology, Institute of Biophysics and University of the Chinese Academy of Sciences, Beijing, China. Her work focuses on small proteins, especially those encoded by lncRNAs.

Lili Zhang is a master's candidate at the Key Laboratory of RNA Biology, Institute of Biophysics and University of the Chinese Academy of Sciences, Beijing, China.

Yiwei Niu is a master's candidate at the Key Laboratory of RNA Biology, Institute of Biophysics and University of the Chinese Academy of Sciences, Beijing, China.

Tanxi Cai is an associate professor at Key Laboratory of Protein and Peptide Pharmaceuticals and Laboratory of Proteomics, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China. His research interests are focused on proteomics and lipidomics.

Jianjun Luo is an associate professor at Key Laboratory of RNA Biology, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China. His research interests are focused on systematic identification and functional studies of non-coding RNAs and small proteins.

Shunmin He is an associate professor at Key Laboratory of the Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China. His research interests are focused on transcriptomics and bioinformatics.

Bao Zhang is a master's candidate at the Key Laboratory of RNA Biology, Institute of Biophysics and University of the Chinese Academy of Sciences, Beijing, China.

Dejiu Zhang is an assistant professor at Key Laboratory of RNA Biology, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China. His research interest is focused on ribosome-related studies.

Yan Qin is a professor at Key Laboratory of RNA Biology, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China. Her research interests are focused on the process of translation/protein biosynthesis.

Fuquan Yang is a professor at Key Laboratory of Protein and Peptide Pharmaceuticals and Laboratory of Proteomics, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China. His research interests are focused on proteomics and lipidomics.

Runsheng Chen is a professor at Key Laboratory of RNA Biology, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China. His research interests are focused on transcriptomics and bioinformatics.

Author notes

Yajing Hao and Lili Zhang contributed equally to this work.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
