Abstract

Annotation of the genome sequence of SARS-CoV (severe acute respiratory syndrome-associated coronavirus) is indispensable for understanding its evolution and pathogenesis. We have performed a full annotation of the SARS-CoV genome sequence using annotation programs that are publicly available or were developed by ourselves. In total, 21 open reading frames (ORFs) of genes or putative uncharacterized proteins (PUPs) were predicted. Seven PUPs had not been reported previously, and two of them were predicted to contain transmembrane regions. Eight ORFs partially overlap with or are embedded in those of known genes, revealing that the SARS-CoV genome is small and compact, with overlapping coding regions. The most striking finding is an ORF located on the minus strand. We have also annotated the non-coding regions and identified the transcription regulating sequences (TRSs) in the intergenic regions. The analysis of the TRSs supports the minus-strand extending transcription mechanism of coronaviruses. SNP analysis of different isolates reveals that sequence mutations do not affect the ORF predictions.

Introduction

Severe acute respiratory syndrome-associated coronavirus (SARS-CoV), the pathogen of SARS, is a positive-sense single-stranded RNA virus. It is classified as a member of the family Coronaviridae because its physical profile and genome organization are similar to those of other known coronaviruses (1–3).

Five proteins encoded by the SARS-CoV genome, R (replicase), S (spike), E (envelope), M (membrane) and N (nucleocapsid), are homologous to those of other well-characterized coronaviruses (4). The remaining ones were previously called PUPs (putative uncharacterized proteins) because their structural and functional features are unknown and they are dissimilar to known sequences. However, some of the PUPs have been found to match entries in the NCBI database (5).

Coronaviruses carry out a distinctive transcription process known as discontinuous RNA synthesis (6, 7), which is correlated with the primary and secondary structures of the TRS (transcription regulating sequence). Two prevailing but contradictory models, leader-primed transcription and minus-strand extending transcription, have been proposed to explain this mechanism (1, 8, 9, 10). They differ mainly in the temporal order of transcription and in the existence of the subgenomic mRNAs.

In this paper, we report the annotation of the SARS-CoV genome, with the complete sequence of Isolate BJ01 as reference (11), and the exploration of its transcription mechanism.

Results

Initial annotation of the SARS-CoV genome

The annotation combines predictions from several gene identification methods. FGENESV predicted 14 ORFs (open reading frames), named F1~F14 (Table 1). Two of them (F2 and F14) had not been reported previously, and F14 is located on the minus strand. With parameters trained on the known genes (R, S, E, M, and N), Glimmer (Version 2) predicted nine ORFs, named G1~G9. BGFV identified another nine genes, named B1~B9. In addition to the computational predictions, we manually identified five more ORFs (BGI-PUP-S-1~S-5) as candidates. Each of these candidates has an upstream region matching the TRS pattern and a translated sequence longer than 40 amino acids. All the ORFs mentioned above are listed in the order of their start positions along the genome sequence, together with their predicted physiochemical properties (Figure 1; Table 1).
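The length criterion applied to the manual candidates (a translated sequence longer than 40 amino acids) can be illustrated with a minimal ORF scan. This is only a sketch: it is not the algorithm of FGENESV, Glimmer or BGFV, and the function name `find_orfs`, the restriction to ATG/AUG start codons and the omission of the TRS check are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_len_aa=40):
    """Scan the three forward frames for ATG-initiated ORFs.

    Hypothetical helper, for illustration only.
    Returns a sorted list of (start, end, n_aa) tuples, where start and end
    are 1-based inclusive coordinates (end includes the stop codon) and n_aa
    is the number of translated residues, excluding the stop codon.
    Only ORFs with n_aa > min_len_aa are kept.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA letters
    orfs = []
    for frame in range(3):
        start = None  # 0-based index of the first in-frame ATG seen
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_aa = (i - start) // 3
                if n_aa > min_len_aa:
                    orfs.append((start + 1, i + 3, n_aa))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)
```

In this sketch, an ORF of 135 nt (45 codons including the stop codon) passes the threshold with 44 residues, whereas an ORF of 120 nt (39 residues) would need the threshold to be lowered.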

Table 1

Predicted ORFs and Their Physiochemical Characteristics in the SARS-CoV Genome (Isolate BJ01)

| ORF | Position | Length (nt) | GC content (%) | Average MW (kDa) | pI | Hydrophobicity (%) | Hydrophilicity (%) | Charge (+) (%) | Charge (−) (%) |
|---|---|---|---|---|---|---|---|---|---|
| R | 246-13,379; 13,379-21,466 | 21,222 | 40.8 | 790.28 | 6.3 | 30.8 | 44.3 | 11.8 | 10.5 |
| BGI-PUP-R-1 | 715-1,206 | 492 | 46.7 | 17.74 | 11.5 | 33.1 | 49.1 | 16.0 | 1.2 |
| S | 21,473-25,240 | 3,768 | 38.7 | 139.17 | 5.5 | 30.4 | 44.8 | 9.1 | 9.2 |
| BGI-PUP-S-1 | 21,936-22,082 | 147 | 32.6 | 5.64 | 9.7 | 47.9 | 37.5 | 12.5 | 2.1 |
| BGI-PUP-S-2 | 22,461-22,595 | 135 | 36.2 | 4.99 | 9.6 | 47.7 | 38.6 | 11.4 | 2.3 |
| BGI-PUP-S-3 | 23,238-23,384 | 147 | 40.1 | 5.75 | 9.3 | 50.0 | 35.4 | 12.5 | 2.1 |
| BGI-PUP-S-4 | 24,798-24,998 | 201 | 38.8 | 7.43 | 11.0 | 39.4 | 53.0 | 16.7 | 0.0 |
| BGI-PUP-S-5 | 25,188-25,310 | 123 | 34.9 | 4.91 | 9.2 | 30.0 | 50.0 | 15.0 | 2.5 |
| PUP1 | 25,249-26,073 | 825 | 40.3 | 30.90 | 5.6 | 34.7 | 39.1 | 8.4 | 8.0 |
| PUP2 | 25,670-26,134 | 465 | 40.6 | 17.72 | 11.0 | 37.0 | 51.9 | 19.5 | 0.6 |
| E | 26,098-26,328 | 231 | 40.3 | 8.36 | 6.0 | 47.4 | 32.9 | 5.3 | 5.3 |
| M | 26,379-27,044 | 666 | 45.2 | 25.06 | 9.3 | 40.7 | 36.2 | 10.9 | 5.9 |
| PUP3 | 27,055-27,246 | 192 | 31.2 | 7.54 | 4.7 | 47.6 | 42.9 | 11.1 | 15.9 |
| PUP4 | 27,254-27,622 | 369 | 40.1 | 13.94 | 8.3 | 33.6 | 42.6 | 13.1 | 8.2 |
| BGI-PUP4-1 | 27,619-27,753 | 135 | 31.8 | 5.30 | 3.9 | 61.4 | 27.3 | 2.3 | 13.6 |
| PUP-Int-1 | 27,760-27,879 | 120 | 39.1 | 4.38 | 9.1 | 35.9 | 43.6 | 17.9 | 5.1 |
| PUP-Int-2 | 27,845-28,099 | 255 | 40.0 | 9.56 | 9.4 | 31.0 | 41.7 | 15.5 | 3.6 |
| N | 28,101-29,369 | 1,269 | 48.4 | 46.03 | 10.1 | 17.3 | 54.0 | 15.4 | 8.5 |
| PUP5 | 28,111-28,407 | 297 | 51.8 | 10.80 | 4.9 | 32.7 | 46.9 | 9.2 | 11.2 |
| PUP-N-1 | 28,564-28,776 | 213 | 53.5 | 7.85 | 6.3 | 34.3 | 35.7 | 12.9 | 10.0 |
| BGI-PUP-Neg-1 | 29,523-29,678 | 156 | 44.2 | 5.90 | 11.8 | 52.9 | 29.4 | 13.7 | 0.0 |

MW: molecular weight; nt: nucleotide; pI: isoelectric point.


Fig. 1

The genome organization of the SARS-CoV (Isolate BJ01).

The major physiochemical properties vary considerably among the ORFs. For example, their GC contents range from 31.2% to 53.5%, and the proportion of negatively charged residues ranges from 0 to 15.9%.

Homological and structural analysis of ORFs

The compositional and functional features of all the genes or ORFs, including the known non-structural and structural proteins (R, S, E, M, and N), have been explored (5). Here we focus on the PUPs identified in the viral genome. Three of the PUPs were predicted to have transmembrane domains.

PUP1 is equivalent to ORF3 in Isolate Tor2 (5). A BLAST search against GenBank returned 11 hits, two of which were putative transmembrane proteins. One was cytochrome b-561 (195 amino acids) of Ralstonia solanacearum, aligned with 97 amino acids of PUP1; the other was a protein of Sinorhizobium meliloti, aligned with 94 amino acids. The identities were 28% and 25%, respectively. TMHMM predicted three transmembrane domains in PUP1 (Figure 2).

Fig. 2

Predicted transmembrane structure of PUP1 (TMHMM). Red blocks on the top line are predicted transmembrane domains. The abscissa represents the position on sequence, and the ordinate represents the probability of prediction.

PUP4 is equivalent to ORF8 in Isolate Tor2 (5). It aligned with a hypothetical protein of Cytophaga hutchinsonii, with 31% identity over a segment of 51 amino acids. TMHMM predicted a transmembrane region at its C-terminus (Figure 3).

Fig. 3

Predicted transmembrane structure of PUP4 (TMHMM).

BGI-PUP4-1 overlaps PUP4 by four nt at its N-terminal end. In a BLAST search, it aligned over 41 amino acids with a hypothetical protein of Clostridium perfringens (50 amino acids in size) with an identity of 36%, and over 31 amino acids with a putative sterol-C5-desaturase of Arabidopsis thaliana with an identity of 38%. TMHMM identified one transmembrane domain in BGI-PUP4-1 (Figure 4), which covers about half of the total length (23 of 45 codons). This ORF has a counterpart (ORF9) in Isolate Tor2.

Fig. 4

Predicted transmembrane structure of BGI-PUP4-1 (TMHMM).

BGI-PUP-Neg-1 is the only ORF detected on the minus strand of the viral genome. It encodes 51 amino acids, has a TRS-like sequence upstream, and is predicted by TMHMM to have a transmembrane domain at amino acids 21-43 (Figure 5).

Fig. 5

Predicted transmembrane structure of BGI-PUP-Neg-1 (TMHMM).

BGI-PUP-R-1 is entirely embedded in the R protein and is predicted to encode a protein of 163 amino acids. A BLASTp search showed limited similarity to two segments, one of 125 amino acids in Streptococcus cristatus and one of 137 amino acids in Caenorhabditis elegans. Both alignments have identities near 24%.

BGI-PUP-S-1~S-5 are embedded in (BGI-PUP-S-1~S-4) or overlap with (BGI-PUP-S-5) the S protein. They were identified by their length (>40 amino acids) and by their relatively conserved upstream TRSs. Only two of them, BGI-PUP-S-4 and BGI-PUP-S-5, returned hits in a BLASTp search against GenBank. The former matched a 41-amino-acid segment of a putative ethylene receptor of Pyrus communis with an identity of 39%, and a putative nuclear protein family member of C. elegans with 33% identity over an alignment of 69 amino acids; the latter matched a 69-amino-acid segment of the putative nuclear protein of C. elegans with an identity of 33%.

PUP2 has a counterpart in Isolate Tor2, ORF4. It matched segments of four different entries in GenBank: 138 amino acids of NADH dehydrogenase subunit 2 of Laudakia stoliczkana, 137 amino acids of a hypothetical protein of Methanosarcina barkeri, 85 amino acids of myosin IXb of Homo sapiens, and 85 amino acids of MY9B_HUMAN myosin IXb. All of these alignments have an identity of 28%.

PUP3 returned no hit in GenBank, and no transmembrane or other characteristic domain was predicted with the available software. It is a typical ORF of 63 amino acids and has all the other features of a gene, such as a TRS and start and stop codons. It is equivalent to ORF7 in Isolate Tor2 (5).

PUP-Int-1 is so named because it is a PUP located in the intergenic region between PUP4 and PUP5. It returned no hit in GenBank, and no characteristic structure was predicted.

PUP-Int-2 encodes a protein of 84 amino acids and follows PUP-Int-1 in the same intergenic region. It matched a putative protein of C. elegans (25 amino acids, 48% identity) and a hypothetical protein, MGC28705, of Mus musculus (40 amino acids, 42% identity). PUP-Int-1 and PUP-Int-2 are equivalent to ORF10 and ORF11 in Isolate Tor2, respectively.

PUP5 is equivalent to ORF13 in Isolate Tor2. It aligned over 69 amino acids (26% identity) with XP_225244, a hypothetical protein of Rattus norvegicus, and over 82 amino acids (24% identity) with the retinoblastoma-associated protein RAP140 of Homo sapiens.

PUP-N-1 is entirely embedded in the N protein. A 64-amino-acid segment of it aligned with DEC-205 of Mus musculus (28% identity) and with lymphocyte antigen 75 of Homo sapiens (25% identity).

Characterization of substitutions

A total of 338 nucleotide variations among 42 isolates have been reported (Jianfei Hu, personal communication); because some ORFs overlap, one nucleotide may be counted twice or more. After a thorough survey, we found that these variations do not affect our prediction of ORFs.

Ka and Ks are the rates of non-synonymous and synonymous substitutions, respectively, and their ratio (Ka/Ks) indicates the selection pressure acting on a gene. A ratio greater than one suggests positive (diversifying) selection, a ratio less than one suggests purifying selection, and a ratio close to one is consistent with neutral evolution. The substitutions and the ratios for the 21 identified ORFs are displayed in Table 2.
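The arithmetic behind two columns of Table 2 can be sketched as follows. The function names are hypothetical, and the sketch only reproduces the substitution rate and the Ka/Ks ratio from values already listed in the table; it does not estimate Ka or Ks from alignments. Table 2 lists 0 where Ks is 0.00, whereas this sketch returns `None` in that case because the ratio is undefined, which is a choice made for this example.

```python
def substitution_rate_percent(substitutions, size_nt):
    """Substitutions per 100 nucleotides of the ORF."""
    return 100.0 * substitutions / size_nt

def ka_ks_ratio(ka, ks):
    """Return Ka/Ks, or None when Ks is zero and the ratio is undefined."""
    if ks == 0:
        return None
    return ka / ks

# Values copied from the R, S and N rows of Table 2.
rows = {
    "R": {"size": 21222, "subs": 223, "ka": 0.91, "ks": 0.98},
    "S": {"size": 3768, "subs": 47, "ka": 1.14, "ks": 0.94},
    "N": {"size": 1269, "subs": 9, "ka": 0.36, "ks": 1.49},
}

for name, r in rows.items():
    rate = substitution_rate_percent(r["subs"], r["size"])
    ratio = ka_ks_ratio(r["ka"], r["ks"])
    print(f"{name}: rate={rate:.2f}%  Ka/Ks={ratio:.2f}")
```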

Table 2

Substitution Status of ORFs in the Genome of SARS-CoV

| ORF | Size (nt) | Substitutions | Non-synonymous substitutions | Substitution rate (%) | Ka | Ks | Ka/Ks |
|---|---|---|---|---|---|---|---|
| R | 21,222 | 223 | 171 | 1.05 | 0.91 | 0.98 | 0.93 |
| BGI-PUP-R-1 | 492 | 3 | 1 | 0.61 | 0.57 | 0.52 | 1.09 |
| S | 3,768 | 47 | 38 | 1.25 | 1.14 | 0.94 | 1.21 |
| BGI-PUP-S-1 | 147 | 0 | 0 | 0 | 0.00 | 0.00 | 0 |
| BGI-PUP-S-2 | 135 | 2 | 1 | 1.48 | 3.72 | 0.79 | 4.70 |
| BGI-PUP-S-3 | 147 | 0 | 0 | 0 | 0.00 | 0.00 | 0 |
| BGI-PUP-S-4 | 201 | 3 | 0 | 1.49 | 0.00 | 1.89 | 0 |
| BGI-PUP-S-5 | 123 | 5 | 1 | 4.07 | 3.18 | 3.69 | 0.86 |
| PUP1 | 825 | 25 | 21 | 3.03 | 2.88 | 1.93 | 1.49 |
| PUP2 | 465 | 14 | 10 | 3.01 | 2.45 | 3.33 | 0.73 |
| E | 231 | 2 | 2 | 0.87 | 1.02 | 0.00 | 0 |
| M | 666 | 8 | 4 | 1.2 | 0.69 | 2.23 | 0.31 |
| PUP3 | 192 | 8 | 7 | 4.17 | 3.99 | 2.34 | 1.71 |
| PUP4 | 369 | 3 | 3 | 0.81 | 0.94 | 0.00 | 0 |
| PUP4-1 | 135 | 0 | 0 | 0 | 0.00 | 0.00 | 0 |
| PUP-Int-1 | 120 | 5 | 0 | 4.17 | 0.00 | 4.62 | 0 |
| PUP-Int-2 | 255 | 2 | 0 | 0.78 | 0.00 | 0.91 | 0 |
| N | 1,269 | 9 | 4 | 0.71 | 0.36 | 1.49 | 0.24 |
| PUP-N-1 | 213 | 2 | 0 | 0.94 | 0.00 | 1.29 | 0 |
| PUP5 | 297 | 2 | 2 | 0.67 | 0.77 | 0.00 | 0 |
| BGI-PUP-Neg-1 | 156 | 0 | 0 | 0 | 0.00 | 0.00 | 0 |

Regulatory elements in the non-coding regions

The 5’ UTR of the genome contains a special segment, known as the leader, whose size varies between 65 and 90 nt among coronavirus species. It is immediately followed by a segment called the leader-mRNA junction (12). Both are crucial components of the discontinuous transcription model of coronaviruses. To study them further, we aligned the upstream sequences of all ORFs. These intergenic segments were relatively conserved and shared a core consensus, CUAAACGAA, identical to the junction segment mentioned above, which supports the discontinuous transcription model. The conserved segments, or TRSs, and their multiple alignment are shown in Figure 6, in which the core consensus is evident. Most distances between a TRS and the start site of its corresponding gene are less than 100 nt (for R, the distance is 131 nt). In some cases, two overlapping ORFs share the same TRS. Analysis of these segments can help to elucidate the transcriptional mechanism of the coronavirus.

Fig. 6

The TRS sequences in the SARS-CoV genome (Isolate BJ01). *This refers to the number of nucleotides between the first nucleotide of the TRSs and the first letter of the start codon of the corresponding ORFs.
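The search for a TRS upstream of an ORF can be sketched as a scan for the core consensus CUAAACGAA within a fixed window before the start codon. This is an illustrative sketch, not the alignment procedure used for Figure 6: the function name `find_upstream_trs`, the mismatch tolerance and the toy sequence are assumptions, and the distance is taken as the offset from the first nucleotide of the TRS to the first nucleotide of the start codon, which is one reading of the figure legend.

```python
CORE_TRS = "CUAAACGAA"

def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_upstream_trs(genome, orf_start, core=CORE_TRS,
                      max_dist=100, max_mismatch=2):
    """Find TRS-like segments that lie entirely upstream of an ORF.

    genome       : RNA or DNA string (T is read as U)
    orf_start    : 1-based position of the first nucleotide of the start codon
    max_dist     : largest allowed offset from TRS start to start codon
    max_mismatch : tolerated differences from the core consensus (assumed)
    Returns (trs_start, distance, mismatches) tuples, nearest first.
    """
    genome = genome.upper().replace("T", "U")
    k = len(core)
    start0 = orf_start - 1                 # 0-based index of the start codon
    first = max(0, start0 - max_dist)      # left edge of the search window
    hits = []
    for i in range(first, start0 - k + 1):
        mismatches = count_mismatches(genome[i:i + k], core)
        if mismatches <= max_mismatch:
            hits.append((i + 1, start0 - i, mismatches))
    return sorted(hits, key=lambda h: h[1])

# Toy sequence (not from the SARS-CoV genome): a perfect core 20 nt upstream.
toy = "G" * 20 + CORE_TRS + "G" * 11 + "AUG" + "GCU" * 5 + "UAA"
print(find_upstream_trs(toy, orf_start=41))
```

For the replicase gene, the window in this sketch would have to be widened with `max_dist=131` to match the distance quoted in the text.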

Another remarkable phenomenon is that its 5’ upstream contains a segment that is similar to the 5’ end region of the plus strand (Figure 7), which was first detected in the AIBV (3).

Fig. 7

Homological comparison of the 5’ end and 3’ end of the SARS-CoV genome (Isolate BJ01).

The 3’ UTR of the genome is also required for transcription: truncation of this region can completely inhibit the transcription of subgenomic mRNAs, even though minus-strand RNAs are still synthesized (13).

The s2m in the 3’ UTR is a motif found in members of the Order Nidovirales, such as bovine, porcine, and ovine coronaviruses. It is also thought to be a common feature of Coronaviridae (14). The identification of the motif in these genomes provides supplementary evidence for genetic taxonomy, although the motif may have been acquired through RNA recombination rather than inherited from a common ancestor. The genome of SARS-CoV (Isolate BJ01) has a sequence homologous to the s2m at positions 29,567 to 29,607 nt.

The poly(A) tail forms the 3’ end of the genome, and each subgenomic mRNA acquires it as a fused tail during transcription. Poly(A)-binding proteins (PABPs) from the host cell interact with this terminal region to initiate transcription and to enhance the stability of the subgenomic mRNAs. Experiments have indicated that, under functional and selective pressure, shortened poly(A) tails are repaired or their missing parts restored, and that the longer the poly(A) tail is (compared with the wild-type isolate), the more efficient transcription is (15).

Discussion

Comparison of the gene prediction software

The genetic information of any organism is preserved in its genome, and annotation is the first step in decoding the sequence. The SARS-CoV genome is about 30 kb long, yet a complement of only five structural or non-structural genes does not accord with the compactness and density of genetic information typical of viral genomes. In addition to these genes, the genome may encode further non-structural proteins that still lack experimental support, probably because they exist only briefly before being degraded. Encouraged by this supposition, we employed four different programs to predict genes in SARS-CoV (Isolate BJ01).

Glimmer (Version 2) predicted two ORFs that start with UUG (G5) and GUG (G8), respectively, instead of the usual initiation codon. This challenges the prevalent view that all ORFs start with AUG. The hypothetical minus-sense ORF identified by FGENESV (from 48 to 203 nt on the minus strand, or 29,523 to 29,678 nt on the plus strand) may be a false positive, but the possible existence of minus-strand ORFs should not be ruled out entirely.
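The two coordinate ranges quoted for the minus-strand ORF are related through the genome length: a 1-based position p on one strand corresponds to L + 1 − p on the other. The sketch below shows the conversion; the function names are hypothetical, and the genome length of 29,725 nt is inferred from the two quoted ranges (48 + 29,678 − 1 = 203 + 29,523 − 1) rather than stated in the text.

```python
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA string (uppercase A, C, G, T only)."""
    return seq.translate(DNA_COMPLEMENT)[::-1]

def minus_to_plus(start, end, genome_len):
    """Convert a 1-based inclusive minus-strand range to plus-strand numbering."""
    return genome_len + 1 - end, genome_len + 1 - start

def implied_genome_length(minus_range, plus_range):
    """Genome length implied by the same segment given in both numberings."""
    minus_start, minus_end = minus_range
    plus_start, plus_end = plus_range
    from_start = minus_start + plus_end - 1
    from_end = minus_end + plus_start - 1
    if from_start != from_end:
        raise ValueError("the two ranges do not describe the same segment")
    return from_start

length = implied_genome_length((48, 203), (29523, 29678))
print(length)                           # inferred genome length
print(minus_to_plus(48, 203, length))   # plus-strand range of the ORF
```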

The results of the four prediction approaches on the genome sequence of SARS-CoV (Isolate BJ01) were compared. The combined result was then compared with the annotations of isolates from four different areas, one sample per city (Table 3; ref. 5, 11, 16, 17).

Table 3

Comparison of Prediction and Annotation of SARS-CoV (Isolate BJ01)

| Prediction: FGENESV | Prediction: Glimmer | Prediction: ZCURVE_CoV | Prediction: BGFV | Combined result: BJ01 | Annotation: Tor2 | Annotation: Urbani | Annotation: SIN2500 |
|---|---|---|---|---|---|---|---|
| | | | | R | ORF1 | R | R |
| F1 | G1 | orf1a | B1 | ORF1a | ORF1a | orf1a | orf1 |
| F2 | | | | BGI-PUP-R-1 | | | |
| F3 | G2 | orf1b | B2 | ORF1b | ORF1b | | |
| F4 | G3 | S | B3 | S | S | S | S |
| | | | | BGI-PUP-S-1 | | | |
| | | | | BGI-PUP-S-2 | | | |
| | | | | BGI-PUP-S-3 | | | |
| | | | | BGI-PUP-S-4 | | | |
| | | | | BGI-PUP-S-5 | | | |
| F5 | G4 | Sars274 | B4 | PUP1 | ORF3 | X1 | PUP1 |
| F6 | | | | PUP2 | ORF4 | X2 | PUP2 |
| F7 | | E | | E | E | E | E |
| F8 | G5# | M | B5 | M | M | M | M |
| F9 | | Sars63 | B6 | PUP3 | ORF7 | X3 | PUP3 |
| F10 | G6 | Sars122 | B7 | PUP4 | ORF8 | X4 | PUP4 |
| | G7 | Sars44 | | PUP4-1 | ORF9 | | |
| F11 | | Sars39 | | PUP-Int-1 | ORF10 | | |
| F12 | G8# | Sars84 | B8 | PUP-Int-2 | ORF11 | X5 | |
| F13 | G9 | N | B9 | N | N | N | N |
| | | | | PUP5 | ORF13 | | PUP5 |
| | | | | PUP-N-1 | ORF14 | | |
| F14* | | | | BGI-PUP-Neg-1 | | | |

*The ORF on the minus strand, predicted by FGENESV.

#ORFs predicted by Glimmer (Version 2) that do not start with AUG.


The GC contents of the five well-characterized proteins (R, S, E, M and N) range from 38.7% to 48.4% (Table 1). Most of the other ORFs have similar values, whereas the GC contents of BGI-PUP-S-1, PUP3, and PUP4-1 are 32.6%, 31.2% and 31.8%, respectively. Furthermore, BGI-PUP-S-1 and PUP4-1 both have a Ka/Ks ratio of zero. These facts may suggest that these two ORFs are less likely than the others to encode proteins.

Our study was based on two presumptions. Firstly, the five identified proteins alone cannot explain the pathogenesis and the relevant observations. Secondly, the unexplained features might be accounted for either by single proteins carrying multiple functional domains (“one protein, multiple functional domains”) or by overlapping independent genes that have not yet been explored. Most PUPs (14 out of 16) are embedded in or overlap with at least one other ORF, indicating the compactness of the viral genome.
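The count of embedded or overlapping PUPs can be checked directly from the positions in Table 1 with a simple interval test. In the sketch below the coordinates are copied from Table 1, R is treated as a single span from 246 to 21,466, strand is ignored (BGI-PUP-Neg-1 is given in plus-strand numbering), and the function names are hypothetical.

```python
# ORF positions from Table 1 (1-based, inclusive, plus-strand numbering).
ORFS = {
    "R": (246, 21466),
    "BGI-PUP-R-1": (715, 1206),
    "S": (21473, 25240),
    "BGI-PUP-S-1": (21936, 22082),
    "BGI-PUP-S-2": (22461, 22595),
    "BGI-PUP-S-3": (23238, 23384),
    "BGI-PUP-S-4": (24798, 24998),
    "BGI-PUP-S-5": (25188, 25310),
    "PUP1": (25249, 26073),
    "PUP2": (25670, 26134),
    "E": (26098, 26328),
    "M": (26379, 27044),
    "PUP3": (27055, 27246),
    "PUP4": (27254, 27622),
    "BGI-PUP4-1": (27619, 27753),
    "PUP-Int-1": (27760, 27879),
    "PUP-Int-2": (27845, 28099),
    "N": (28101, 29369),
    "PUP5": (28111, 28407),
    "PUP-N-1": (28564, 28776),
    "BGI-PUP-Neg-1": (29523, 29678),
}
KNOWN_GENES = {"R", "S", "E", "M", "N"}

def spans_overlap(a, b):
    """True if two inclusive (start, end) spans share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]

def split_pups(orfs, known):
    """Separate PUPs that overlap any other ORF from those that do not."""
    overlapping, isolated = [], []
    for name, span in orfs.items():
        if name in known:
            continue
        hit = any(other != name and spans_overlap(span, other_span)
                  for other, other_span in orfs.items())
        (overlapping if hit else isolated).append(name)
    return overlapping, isolated

overlapping, isolated = split_pups(ORFS, KNOWN_GENES)
print(len(overlapping), "of", len(overlapping) + len(isolated), "PUPs overlap")
print("no overlap:", isolated)
```

With these coordinates the sketch reports 14 of 16 PUPs as overlapping, and PUP3 and BGI-PUP-Neg-1 as the two without any overlap.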

Human coronavirus HCoV-OC43 has a hemagglutinin-esterase (HE) protein, whereas another well-characterized virus, HCoV-229E, does not. Although classified as a coronavirus that can infect humans, SARS-CoV does not appear to contain an ORF coding for the HE protein, because there is no space between the R and S genes, the very position where HE would exist independently. Relics of the HE protein were found to have spread to neighboring regions (Jianfei Hu, personal communication), which suggests that large-scale recombination might have taken place.

Furthermore, we applied FGENESV to the sequences of MHV (NC_001846 in NCBI) and AIBV (NC_001451 in NCBI) and compared the results with their previous annotations. All the ORFs for their structural proteins were predicted, whereas some hypothetical ORFs were not.

Apart from the leader-mRNA junction segments found on the plus strand, we also detected segments similar to the consensus on the minus-strand sequence. They lie ahead of the complementary sequences of the terminal codons of positive-sense ORFs (Figure 8). This is unlikely to be a mere coincidence, but the reason for it is unknown and deserves further exploration.

Fig. 8

Comparison of TRSs in SARS-CoV Isolate BJ01 (minus sense). Core segments (CUAAACGAA) are marked in bold. The distance in part B is the number of nucleotides between the last letter of a TRS and the terminal codon of its corresponding ORF.

Transcriptomics — two models of transcription

The conserved TRS is one of the characteristics of the discontinuous transcription that members of Coronaviridae perform during replication. Currently, two different models are used to explain the transcription mechanism of coronaviruses: the leader-primed discontinuous transcription model and the minus-strand extending transcription model with subgenomic mRNAs.

Because minus-strand subgenomic RNAs had previously escaped detection, leader-primed transcription has been generally accepted. In this model, a full-length minus-strand RNA acts as the template for transcription of all subgenomic mRNAs. Copies of the leader sequence leave the 3’ end of the template and move to the intergenic regions upstream of each mRNA on the minus-strand template. Once the leader copies have paired with the complementary TRSs on the body sequence (so called to distinguish it from the leader portion), discontinuous transcription is triggered.

Sawicki et al. proposed another model, the minus-strand extending transcription model, after subgenome-length negative-sense segments were detected in infected cells (18, 19). In this model, a subgenome-length minus strand is derived directly from the genomic RNA during transcription, acquires the counterpart of the terminal TRS from the body sequence, and then fuses at the TRS region, where it is completed by acquiring the counterpart of the leader. The completed minus-strand RNAs serve as templates for subgenomic mRNAs (9, 10).

In general, coronavirus mRNAs are synthesized in amounts inversely related to their sizes, and the N protein is more abundant than any other protein in infected cells. This gradient in subgenomic mRNA abundance suggests that the large mRNAs tend to terminate prematurely and yield less protein than the small subgenomic mRNAs (20). Site-directed mutations in TRSs along the genome decrease transcriptional efficiency or even abolish the synthesis of the subgenomic RNAs. Mutations introduced at different sites demonstrate that the upstream TRSs do not affect the downstream ones, whereas the downstream ones affect the replication of the upstream ones. This provides a possible explanation of why the downstream proteins are more abundant in infected cells than the upstream ones (21). Although mutations in TRSs reduce their chance of being used in transcription, some subgenomic mRNAs are still synthesized; with the mutated sites serving as markers, the origin of the fused region (here, the TRS) can be traced. These results show that all the TRSs come from the body sequences rather than from the leader, and provide evidence for the minus-strand extending transcription model (22).

Even though the proteins of SARS-CoV were predicted by software, and related transcriptional mechanisms were described, further experiments are still required to prove these hypotheses.

Materials and Methods

The annotation and subsequent analysis were mainly performed on the complete genome sequence of Isolate BJ01 (Accession No. AY278488 in GenBank). FGENESV, a gene prediction program provided by Softberry Inc. (Mount Kisco, USA) through a web-based interface, has been specially modified and trained with parameters for viruses (http://www.softberry.com/berry.phtml?topic=gfindv). Glimmer (Version 2), from TIGR (The Institute for Genomic Research), is a gene identification program with high performance on small genomes such as those of bacteria and archaea (23, 24). ZCURVE_CoV, developed by researchers at Tianjin University, recognizes ORFs on the basis of the Z-curve theory (25). BGFV is a program developed by the Beijing Genomics Institute, based on the self-organizing theory (http://arxiv.org/abs/physics/0102048). The fundamental principle of BGFV is the compositional discrepancy between coding and non-coding regions, which is relatively distinctive in simpler species. The predictions were compared with previous annotations of other isolates for cross-checking. The length threshold for ORFs is the same as that applied by Marra et al. to Isolate Tor2 (5): an ORF is postulated to be a protein-coding region if its translated sequence is longer than 40 amino acids. In addition, a unique segment, the leader-mRNA junction (26), should exist upstream of the transcription initiation site within a distance of 100 nt, except for the R protein (for R, the distance is 131 nt).

For nomenclature, most of the previously reported ORFs keep their original names (11), while some PUPs were given suffixes. “PUP-Int-1”, for example, refers to a PUP located in an intergenic region. ORFs first reported in this paper were named with the prefix “BGI”. If an ORF is embedded in or overlaps with a known one, the name of the known host ORF is inserted in its name and a sequential number is appended. BGI-PUP-R-1, for example, stands for the first identified ORF that overlaps with the R protein.

Physicochemical features, such as molecular weight (MW) and isoelectric point (pI), were calculated with a program from Dr. Yan Li, Beijing Genomics Institute (personal communication). The transmembrane domains of proteins were identified by TMHMM (27); DAS (28) provided similar results (figures from DAS are not shown).
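Since the program used for MW and pI is not publicly described, the following Python sketch only illustrates how such values are commonly estimated; it is not the program referred to above. MW is taken as the sum of average residue masses plus one water molecule, and pI is found by bisection on the net charge computed from the Henderson-Hasselbalch equation. The pKa set is one commonly used choice and is an assumption here; other sets give slightly different pI values. The function names are hypothetical.

```python
WATER = 18.01528  # average mass of H2O in Da

# Average residue masses in Da (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594,
    "N": 114.1038, "D": 115.0886, "Q": 128.1307, "K": 128.1741,
    "E": 129.1155, "M": 131.1926, "H": 137.1411, "F": 147.1766,
    "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}

# Assumed pKa values; this particular set is an illustrative choice.
PKA_NTERM, PKA_CTERM = 8.6, 3.6
PKA_POS = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEG = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def molecular_weight(seq: str) -> float:
    """Average molecular weight of an unmodified linear peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq.upper()) + WATER


def net_charge(seq: str, ph: float) -> float:
    """Net charge at a given pH from the termini and ionizable side chains."""
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_NTERM))
    charge -= 1.0 / (1.0 + 10 ** (PKA_CTERM - ph))
    for aa in seq.upper():
        if aa in PKA_POS:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POS[aa]))
        elif aa in PKA_NEG:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEG[aa] - ph))
    return charge


def isoelectric_point(seq: str, tol: float = 1e-4) -> float:
    """Bisection on pH in [0, 14]; net charge decreases monotonically."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

With these assumptions, a single glycine has a computed MW of about 75.07 Da and a pI of about 6.1, the midpoint of the two terminal pKa values.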

Acknowledgements

The authors thank the Ministry of Science and Technology of China, the Chinese Academy of Sciences, and the National Science Foundation of China for financial support. We are grateful to Jinsong Liu for providing the BGFV program and mathematical calculations.

References

1. Lai M.M. et al. Studies on the mechanism of RNA synthesis of a murine coronavirus. Adv. Exp. Med. Biol. 1984;173:187–200.

2. Baric R.S. et al. Studies into the mechanism of MHV transcription. Adv. Exp. Med. Biol. 1987;218:137–149.

3. Boursnell M.E. et al. Completion of the sequence of the genome of the coronavirus avian infectious bronchitis virus. J. Gen. Virol. 1987;68:57–77.

4. de Vries A.A. et al. The genome organization of the Nidovirales: similarities and differences between Arteri-, Toro-, and coronaviruses. Seminars in Virology 1997;8:33–47.

5. Marra M.A. et al. The genome sequence of the SARS-associated coronavirus. Science 2003;300:1399–1404.

6. Lin Y.J. et al. Identification of the cis-acting signal for minus-strand RNA synthesis of a murine coronavirus: implications for the role of minus-strand RNA in RNA replication and transcription. J. Virol. 1994;68:8131–8140.

7. Jeong Y.S., Makino S. Evidence for coronavirus discontinuous transcription. J. Virol. 1994;68:2615–2623.

8. Baric R.S. et al. Characterization of leader-related small RNAs in coronavirus-infected cells: further evidence for leader-primed mechanism of transcription. Virus Res. 1985;3:19–33.

9. Sawicki S.G., Sawicki D.L. Coronaviruses use discontinuous extension for synthesis of subgenome-length negative strands. Adv. Exp. Med. Biol. 1995;380:499–506.

10. Sawicki S.G., Sawicki D.L. A new model for coronavirus transcription. Adv. Exp. Med. Biol. 1998;440:215–219.

11. Qin E.D. et al. A complete sequence and comparative analysis of a SARS-associated virus (Isolate BJ01). Chin. Sci. Bull. 2003;48:941–948.

12. Hofmann M.A. et al. Leader-mRNA junction sequences are unique for each subgenomic mRNA species in the bovine coronavirus and remain so throughout persistent infection. Virology 1993;196:163–171.

13. Lin Y.J. et al. The 3’ untranslated region of coronavirus RNA is required for subgenomic mRNA transcription from a defective interfering RNA. J. Virol. 1996;70:7236–7240.

14. Jonassen C.M. et al. A common RNA motif in the 3’ end of the genomes of astroviruses, avian infectious bronchitis virus and an equine rhinovirus. J. Gen. Virol. 1998;79:715–718.

15. Spagnolo J.F., Hogue B.G. Host protein interactions with the 3’ end of bovine coronavirus RNA and the requirement of the poly(A) tail for coronavirus defective genome replication. J. Virol. 2000;74:5053–5065.

16. Rota P.A. et al. Characterization of a novel coronavirus associated with severe acute respiratory syndrome. Science 2003;300:1394–1399.

17. Ruan Y.J. et al. Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection. Lancet 2003;361:1779–1785.

18. Sethna P.B. et al. Coronavirus subgenomic minus-strand RNAs and the potential for mRNA replicons. Proc. Natl. Acad. Sci. USA 1989;86:5626–5630.

19. Sawicki S.G., Sawicki D.L. Coronavirus transcription: subgenomic mouse hepatitis virus replicative intermediates function in RNA synthesis. J. Virol. 1990;64:1050–1056.

20. van Marle G. et al. Regulation of coronavirus mRNA transcription. J. Virol. 1995;69:7851–7856.

21. Pasternak A.O. et al. Sequence requirements for RNA strand transfer during nidovirus discontinuous subgenomic RNA synthesis. EMBO J. 2001;20:7220–7228.

22. van Marle G. et al. Arterivirus discontinuous mRNA transcription is guided by base pairing between sense and antisense transcription-regulating sequences. Proc. Natl. Acad. Sci. USA 1999;96:12056–12061.

23. Delcher A.L. et al. Improved microbial gene identification with GLIMMER. Nucleic Acids Res. 1999;27:4636–4641.

24. Salzberg S.L. et al. Microbial gene identification using interpolated Markov models. Nucleic Acids Res. 1998;26:544–548.

25. Chen L.L. et al. ZCURVE_CoV: a new system to recognize protein coding genes in coronavirus genomes, and its applications in analyzing SARS-CoV genomes. Biochem. Biophys. Res. Commun. 2003;307:382–388.

26. Makino S. et al. Leader sequences of murine coronavirus mRNAs can be freely reassorted: evidence for the role of free leader RNA in transcription. Proc. Natl. Acad. Sci. USA 1986;83:4204–4208.

27. Krogh A. et al. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. 2001;305:567–580.

28. Cserzo M. et al. Prediction of transmembrane alpha-helices in prokaryotic membrane proteins: the dense alignment surface method. Protein Eng. 1997;10:673–676.

Author notes

These authors contributed equally to this work.

This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 license (https://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial use of the work as published, without adaptation or alteration, provided the work is fully attributed.