JCcirc: circRNA full-length sequence assembly through integrated junction contigs Free

Results of existing assembly methods on human simulated circRNA datasets

Data	Precision (%)			Recall (%)			F1 score (%)
Data	M1	M2	M3	M1	M2	M3	M1	M2	M3
D10150	96.35	57.60	61.73	0.39	28.51	29.25	0.78	38.14	39.69
D20150	92.79	62.45	64.28	3.83	33.91	35.01	7.35	43.96	45.33
D30150	90.30	64.13	63.79	7.13	36.03	36.58	13.22	46.14	46.50
D50150	87.16	66.45	62.70	11.16	38.74	37.32	19.79	48.94	46.79
D70150	87.04	66.94	61.82	14.33	39.89	37.22	24.61	49.99	46.47
D100150	84.74	67.36	60.50	16.67	40.73	36.95	27.86	50.76	45.88
D30075	77.28	55.53	61.67	20.50	27.42	36.33	32.40	36.71	45.72
D30100	81.18	59.09	62.89	16.43	31.40	36.98	27.33	41.00	46.57
D30125	86.46	61.28	63.55	10.96	33.68	36.81	19.45	43.47	46.62
D30150	90.30	64.13	63.79	7.13	36.03	36.58	13.22	46.14	46.50
D30200	87.64	65.68	65.29	2.41	37.59	34.19	4.69	47.81	44.88

Data	Precision (%)			Recall (%)			F1 score (%)
Data	M1	M2	M3	M1	M2	M3	M1	M2	M3
D10150	96.35	57.60	61.73	0.39	28.51	29.25	0.78	38.14	39.69
D20150	92.79	62.45	64.28	3.83	33.91	35.01	7.35	43.96	45.33
D30150	90.30	64.13	63.79	7.13	36.03	36.58	13.22	46.14	46.50
D50150	87.16	66.45	62.70	11.16	38.74	37.32	19.79	48.94	46.79
D70150	87.04	66.94	61.82	14.33	39.89	37.22	24.61	49.99	46.47
D100150	84.74	67.36	60.50	16.67	40.73	36.95	27.86	50.76	45.88
D30075	77.28	55.53	61.67	20.50	27.42	36.33	32.40	36.71	45.72
D30100	81.18	59.09	62.89	16.43	31.40	36.98	27.33	41.00	46.57
D30125	86.46	61.28	63.55	10.96	33.68	36.81	19.45	43.47	46.62
D30150	90.30	64.13	63.79	7.13	36.03	36.58	13.22	46.14	46.50
D30200	87.64	65.68	65.29	2.41	37.59	34.19	4.69	47.81	44.88

M1: CircAST, M2: CIRI-full, M3: CircRNAfull.

Table 1

Results of existing assembly methods on human simulated circRNA datasets

Data	Precision (%)			Recall (%)			F1 score (%)
Data	M1	M2	M3	M1	M2	M3	M1	M2	M3
D10150	96.35	57.60	61.73	0.39	28.51	29.25	0.78	38.14	39.69
D20150	92.79	62.45	64.28	3.83	33.91	35.01	7.35	43.96	45.33
D30150	90.30	64.13	63.79	7.13	36.03	36.58	13.22	46.14	46.50
D50150	87.16	66.45	62.70	11.16	38.74	37.32	19.79	48.94	46.79
D70150	87.04	66.94	61.82	14.33	39.89	37.22	24.61	49.99	46.47
D100150	84.74	67.36	60.50	16.67	40.73	36.95	27.86	50.76	45.88
D30075	77.28	55.53	61.67	20.50	27.42	36.33	32.40	36.71	45.72
D30100	81.18	59.09	62.89	16.43	31.40	36.98	27.33	41.00	46.57
D30125	86.46	61.28	63.55	10.96	33.68	36.81	19.45	43.47	46.62
D30150	90.30	64.13	63.79	7.13	36.03	36.58	13.22	46.14	46.50
D30200	87.64	65.68	65.29	2.41	37.59	34.19	4.69	47.81	44.88

Data	Precision (%)			Recall (%)			F1 score (%)
Data	M1	M2	M3	M1	M2	M3	M1	M2	M3
D10150	96.35	57.60	61.73	0.39	28.51	29.25	0.78	38.14	39.69
D20150	92.79	62.45	64.28	3.83	33.91	35.01	7.35	43.96	45.33
D30150	90.30	64.13	63.79	7.13	36.03	36.58	13.22	46.14	46.50
D50150	87.16	66.45	62.70	11.16	38.74	37.32	19.79	48.94	46.79
D70150	87.04	66.94	61.82	14.33	39.89	37.22	24.61	49.99	46.47
D100150	84.74	67.36	60.50	16.67	40.73	36.95	27.86	50.76	45.88
D30075	77.28	55.53	61.67	20.50	27.42	36.33	32.40	36.71	45.72
D30100	81.18	59.09	62.89	16.43	31.40	36.98	27.33	41.00	46.57
D30125	86.46	61.28	63.55	10.96	33.68	36.81	19.45	43.47	46.62
D30150	90.30	64.13	63.79	7.13	36.03	36.58	13.22	46.14	46.50
D30200	87.64	65.68	65.29	2.41	37.59	34.19	4.69	47.81	44.88

M1: CircAST, M2: CIRI-full, M3: CircRNAfull.

As anticipated, the performance of the assembly methods exhibited improvement with increasing read coverage. With a 10-fold increase in coverage, CircAST showed an increase in the F1 score by 0.27, while circRNAfull’s F1 score only increased by 0.06. In addition, we also observed a significant impact of read length on the performance of circRNA assemblers. Consistent with prior findings, CircAST demonstrated superior performance with shorter read lengths, achieving a higher F1 score at 75 bp (F1 score: 0.32) compared to 200 bp (F1 score: 0.04) [19]. Conversely, CIRI-full exhibited better performance in longer read lengths. Read length increased from 75 to 200 bp, resulting in a 0.11 increase in the F1 score for CIRI-full. circRNAfull shows less dependence on read length in terms of its performance.

We also conducted an analysis on human circRNA simulated datasets to determine the assembly number and assembly rate of circRNA sequences across different length ranges. As illustrated in Figure 1, all methods exhibited better performance on shorter sequences compared to longer ones. When comparing results for sequences below 400 bp to those above 1200 bp, the average correct assembly rates for CircAST, CIRI-full and circRNAfull decreased by 0.12, 0.55 and 0.33, respectively. For sequences shorter than 400 bp, CIRI-full demonstrated the highest correct assembly rate (average: 0.55) (Figure 1A), while for sequences longer than 400 bp, circRNAfull outperformed the other two methods (Figure 1B–D). Overall, circRNAfull exhibited a more stable performance across all sequence lengths. In addition, we also categorized the assembled circRNA sequences. The results, presented in Table 2, indicate that current methods perform more effectively in assembling exonic circRNAs but perform worse in exon–intron, intronic and intergenic circRNAs.

Figure 1

Assembly results of full-length circRNA sequences in different length intervals. The bar chart represents the number of correctly assembled circRNA sequences. The length intervals are as follows: (A) circRNA length between 100 bp and 400 bp, (B) circRNA length between 400 bp and 800 bp, (C) circRNA length between 800 bp and 1200 bp and (D) circRNA length between 1200 bp and 4000 bp.

Table 2

Average proportions of circRNA assembly sequence categories on human simulated circRNA datasets

Assembly method	Exonic	Exonic–intronic	Intronic	Intergenic
CircAST	12.64%	0.00%	0.00%	0.00%
CIRI-full	42.20%	0.92%	0.72%	1.16%
circRNAfull	43.41%	0.00%	0.00%	0.00%

Assembly method	Exonic	Exonic–intronic	Intronic	Intergenic
CircAST	12.64%	0.00%	0.00%	0.00%
CIRI-full	42.20%	0.92%	0.72%	1.16%
circRNAfull	43.41%	0.00%	0.00%	0.00%

Table 2

Average proportions of circRNA assembly sequence categories on human simulated circRNA datasets

Assembly method	Exonic	Exonic–intronic	Intronic	Intergenic
CircAST	12.64%	0.00%	0.00%	0.00%
CIRI-full	42.20%	0.92%	0.72%	1.16%
circRNAfull	43.41%	0.00%	0.00%	0.00%

Assembly method	Exonic	Exonic–intronic	Intronic	Intergenic
CircAST	12.64%	0.00%	0.00%	0.00%
CIRI-full	42.20%	0.92%	0.72%	1.16%
circRNAfull	43.41%	0.00%	0.00%	0.00%

In summary, our findings suggest that different methods are suitable for different datasets. However, the overall performance of existing methods is hindered by imbalanced precision and recall, or low precision and recall. Except for CIRI-full on D100150, the F1 scores for all other datasets remain below 0.5. Furthermore, although all methods achieve relatively high assembly rates for short circRNA sequences, the highest correct assembly rate is only 0.55. In addition, correctly assembling non-exonic circRNA sequences continues to pose a significant challenge.

The impact of junction contigs from different de novo assemblers

In this study, we present a novel method called JCcirc, which incorporates JC and enables the efficient reconstruction of all types of circRNA full-length sequences and their alternative isoforms using splice graphs and fragment coverage.

JC is assembled by a de novo transcript assembler. To analyze the effect of JC assembled by different de novo assemblers on JCcirc, we compared the performance of four assemblers (Trinity [20], SOAPdenovo [21], SPAdes [22] and ABySS [23]) combined with JCcirc on simulated circRNA datasets. These datasets include four animal species (human, mouse, fish and fly) and two plant species (Arabidopsis thaliana and rice) (see details in Method). The selection of these assemblers was based on their superior performance in previous evaluations conducted by Holzer et al. [24].

Our results showed that Trinity and SPAdes exhibited similar average F1 scores in each species and slightly outperformed SOAP and ABySS. The average F1 score of ABySS is the lowest in each species (as depicted in Figure 2). For example, on the mouse datasets, the average F1 scores for Trinity, SPAdes, SOAP and ABySS were 0.57, 0.55, 0.57 and 0.52, respectively. For human datasets, the precision and recall of SOAP, SPAdes and Trinity performed variably on different datasets. However, for datasets of other species, the precision and recall rankings of the four assemblers combined with JCcirc, from high to low, were Trinity, SPAdes, SOAP and ABySS. This trend was consistent with the changes in F1 scores (Figure S1).

Figure 2

Performance comparison of four de novo transcript assemblers combined with JCcirc. The numbers in each grid cell indicate the F1 score of the corresponding assembly method. The last column ‘Average’ of each subgraph represents the average of F1 score for 10 datasets on each species.

It is evident that the choice of de novo assemblers only have a slight impact on the final assembly results in JCcirc, except for ABySS. This can be attributed to the role played by JC in the circRNA sequence, which could support fragments that are not covered by BSJ or add the coverage of these fragments (Figure S2). The contribution of JC to sequence assembly is primarily through aligned fragments rather than the entire JC sequence. Hence, the JC sequence obtained from different de novo assemblers has a slight influence on the final results.

In addition, the data size of assembly results obtained by the de novo assemblers can affect the runtime of JCcirc. Taking the circRNA simulation data D50150 of mouse as an example, Trinity has the largest data size of assembly results and runtime is 647 s, while ABySS has the smallest data size and runtime is 613 s (Table S1). This can also be an index for selecting an assembly tool. Here, we recommend utilizing Trinity for de novo assembly in JCcirc, and all subsequent analyses were performed using Trinity.

The effect of junction contigs on JCcirc

Our findings demonstrate that the inclusion of JC significantly improves the performance of JCcirc on human simulated datasets. F1 scores increased significantly when JC was added, with the maximum improvement of 0.27 observed in data D30075 (Figure 3A). Moreover, the precision and recall of the assembly results for all 10 datasets exhibited the same trend (Figure S3). However, it is worth noting that the positive effect of JC is influenced by the read coverage and length. As the read coverage and length increase, the impact of JC on the assembly results gradually diminishes (Figure 3A). In addition, the incorporation of JC into the assembly process benefits the assembly of longer circRNA sequences. Across the 10 datasets, we observed the median length of the 10 datasets increased by 4 to 18 bp in the assembled sequences (Figure 3B, Table S2). Notably, the highest increase of 18 bp was observed when the read coverage was the lowest (D10150, with a coverage of 10).

Figure 3

The impact of JC on the assembly performance of JCcirc across 10 simulated human circRNA datasets. (A) F1 scores of the assembly results obtained from JCcirc with and without JC. (B) The length distribution of circRNA assembly sequences obtained from JCcirc with and without JC. (C) The effects of incorporating JC on the assembly performance of low-expression and high-expression circRNA. The bar graph represents the difference in F1 scores between the inclusion and exclusion of JC.

Moreover, we compared the effects of incorporating JC on assembling low-expression circRNA (number of BSJ < 6) and high-expression circRNA (number of BSJ ≥ 6). The results indicate that JC has a more pronounced effect on improving the assembly rate of low-expression circRNA compared to high-expression circRNA. On human datasets, the average F1 score of low-expression circRNA increased by 0.12, whereas the average F1 score of high-expression circRNA only increased by 0.04 (Figure 3C, Figure S3C and D). For low-expression circRNA, there are few BSJs available for sequence assembly, and the proportion of JC is similar to BSJ, which means that JC is not easily filtered out in subsequent steps. By incorporating JC, we can effectively increase the total information available for assembly, leading to a noticeable improvement in correct assembly rates. However, high-expression circRNAs have a smaller proportion of JC compared to BSJ, making them more susceptible to being filtered out in subsequent steps. Therefore, for high-expression circRNAs, the impact of adding JC is relatively minor compared to the significant benefits for low-expression circRNAs.

The impact of linear RNA reads on assembly results

Current research on circRNA often constructs circRNA libraries using RNase R to eliminate linear RNA and enrich circRNA. However, certain linear RNAs exhibit resistance to RNase R [25], and circRNA enrichment libraries frequently contain a mixture of circRNA reads and linear RNA reads. To assess the impact of linear RNA reads on assembly tools, we compared their performance on simulated human circRNA datasets with and without linear RNA reads.

As depicted in Figure S4, linear RNA reads have a minor influence on JCcirc, with differences in F1 score, precision and recall not exceeding 0.02 across all datasets. Thus, the residual linear RNA reads in the circRNA library have negligible consequences on the assembly results of JCcirc. We also examined the impact of linear RNA reads on other existing circRNA full-length assemblers (Figure S5). Interestingly, linear RNA reads had a minor influence on CIRI-full and circRNAfull but exerted a significant impact on CircAST. The impact of linear RNA reads on the circRNA assembly methods suggests that the results of real sequencing data may include false-positive sequences.

JCcirc outperforms on human simulated circRNA datasets

We next compared the performance of JCcirc with existing circRNA sequence assemblers on human circRNA simulated datasets. JCcirc exhibited the highest average F1 score of 0.58, primarily due to its superior recall. The average F1 score of JCcirc was higher by 0.13 compared to CIRI-full and circRNAfull, and 0.40 compared to CircAST (Figure 4A, Table S3). However, CircAST achieved the highest average precision at 0.87, surpassing JCcirc by 0.17. This advantage can be attributed to the stringent threshold for BSJ employed by CircAST, while JCcirc does not set a threshold for BSJ.

Figure 4

Performance of four assembly methods for assembling circRNA full-length sequences on simulated human datasets. (A) Precision and recall values of the assembly methods. (B) Overlap of assembly results from different circRNA full-length sequence assemblers. (C) The assembly rate of circRNA full-length sequences by four assembly methods in different length intervals. (D) Assembly results of alternative splicing circRNA. The bar chart represents the number of correctly assembled circRNA full-length sequences for each method, while the line chart depicts the assembly rate of alternative splicing circRNA.

In addition, JCcirc consistently demonstrated a superior correct assembly rate in every length interval compared to existing methods (Figure 4C). Previous analysis revealed that CIRI-full performed best for sequences shorter than 400 bp, while circRNAfull was optimal for sequences longer than 400 bp. Here, JCcirc achieved an average correct assembly rate of over 0.68 for circRNA sequences shorter than 400 bp, increased by 0.13 compared to CIRI-full. Furthermore, compared to circRNAfull, the average correct assembly rates of JCcirc increased by 0.09, 0.04 and 0.03 in the length intervals of 400–800 bp, 800–1200 bp and above 1200 bp, respectively.

We also compared the circRNA sequences assembled by JCcirc and other methods. In total, CircAST, CIRI-full, circRNAfull and JCcirc correctly assembled 25 954, 66 492, 65 959 and 102 922 circRNA sequences, respectively (Figure 4B). Notably, the JCcirc’s results covered a significant proportion of the results assembled by other methods, accounting for 95.40% of CircAST, 93.00% of CIRI-full and 74.93% of circRNAfull. This indicates that the assembly results of JCcirc are more comprehensive.

Alternative splicing of circRNA was analyzed from each method, considering four types of alternative splicing: skipping exons (SE), retained intron (RI), alternative 5′ splice site (A5SS) and alternative 3′ splice site (A3SS). The presence of RI was negligible in all four assembly methods, while A5SS and A3SS exhibited a high assembly rate of detection by each method. JCcirc displayed the highest average assembly rate for both A5SS (62.97%) and A3SS (65.16%), while CIRI-full achieved the highest rate for SE (5.66%) (Figure 4D).

Previous results showed that the majority of correctly assembled circRNAs belong to exonic circRNA. On human simulated circRNA datasets, the average assembly rates of JCcirc for different circRNA types exceeded existing methods, exonic circRNA is 60.32%, exon–intron circRNA is 17.13%, intronic circRNA is 9.12% and intergenic circRNA is 8.56%.

Comparison of assembly results between JCcirc and other assembly methods in multiple species

In addition to its exceptional performance on human simulated datasets, JCcirc also has superiority over existing methods on five other species, as depicted in Figure S6. Across all analyzed species, JCcirc consistently achieves the highest recall and F1 scores, with precision second only to CircAST. Moreover, in the case of mouse, fish and fly, the average recall and F1 scores of circRNAfull were slightly superior to those of CIRI-full, while the average recall and F1 scores of CIRI-full were slightly higher than circRNAfull in rice and A. thaliana. circRNAfull exhibits superior performance in animal datasets, but it lags behind CIRI-full in plant datasets.

Benchmark real datasets

The standard for evaluating assembled circRNA sequences derived from real short-read RNA-seq data is utilizing circRNA-enriched long-read sequencing libraries obtained from the same cell or tissue. Long-read sequencing allows for the direct detection of RNA sequences longer than 1000 bp. In order to evaluate the assembly tools for circRNA full-length sequences, we utilized publicly available Illumina circRNA-seq data samples along with matching long-read circRNA-seq data (Nanopore) to ensure consistency (Table S5). These validation datasets included samples from mouse brain tissue and various human cell lines.

We assessed the performance of five circRNA sequence assemblers, JCcirc, CIRI-full, CircAST, circRNAfull and circseq_cup, on the real RNA-seq datasets. The performance of the assembly methods on the real datasets is consistent with the findings on the simulated datasets (Table S4). CircAST exhibited the highest precision (average: 0.81, Figure 5A), and JCcirc demonstrated the highest recall (average: 0.41, Figure 5B) and the highest F1 scores (average: 0.45, Figure 5C).

Figure 5

Performance of circRNA full-length sequence assembly methods on the real RNA-seq datasets. Results of the analysis on the 13 Illumina RNA-seq datasets. (A) precision, (B) recall and (C) F1 score.

Furthermore, the performance of CircAST, circseq_cup, CIRI-full and circRNAfull was notably influenced by the experimental datasets. circseq_cup demonstrated the highest precision in experiment 1 datasets (average: 0.79) while performing moderately in the experiment 2 datasets (average precision: 0.38) and experiment 3 datasets (average precision: 0.34) (Figure 5A). Excluding JCcirc, circRNAfull achieved the highest F1 scores in experiment 1 datasets (average: 0.04), CircAST obtained the highest F1 scores in experiment 2 datasets (average: 0.51) and CIRI-full obtained the highest F1 scores in experiment 3 datasets (average: 0.53). Besides, Figure 5 demonstrates that the recall and F1 scores of experiment 1 were significantly lower than the other two groups. This can be attributed to the low sequencing depth of the long-read sequencing data in experiment 1, which resulted in fewer reference circRNAs. In other words, some circRNAs may have been correctly assembled, but they could not be verified by reference circRNA sequences. Thus, experiment 1 exhibited lower recall and F1 values compared to the other two groups.

DISCUSSION

In recent years, researchers have increasingly recognized the significant role of circRNAs in various diseases [26], and the functional properties of circRNAs heavily rely on their sequence composition. To enhance the correct assembly rate of circRNA sequences, we have developed a novel assembly tool known as JCcirc. This tool integrates BSJ and JC to facilitate the correct assembly of circRNA full-length sequences. Our results demonstrated that JCcirc achieves a high assembly rate while maintaining high precision, surpassing other existing assembly methods. JCcirc also possesses the advantage of effectively assembling all types of circRNAs, including exonic, intronic, exonic–intronic and intergenic circRNAs.

JCcirc introduced non-BSJ for the first time during the assembly of full-length circRNA sequences, significantly enhancing the assembly performance. In this study, we assessed the potential for false positives resulting from including non-BSJ using circRNA simulated datasets that incorporated linear reads. It was observed that the negative impact of introduced non-BSJ on the assembly results was minimal, with the majority of these reads originating from circRNA sequences. This is because we searched for highly correlated reads with BSJ when introducing non-BSJ. We obtained JC through a de novo assembly tool, which served as a reliable bridge connecting BSJ and high-correlation non-BSJ. Furthermore, we aligned the JC to the circRNA genome sequences, disregarding fragments that failed to align. Therefore, while some non-BSJ were introduced, they rarely caused false positives.

In the analysis of real datasets, we used long-read sequencing data to validate the assembled circRNAs sequences. We used three sets of long-read sequencing data, employing two distinct experimental methods aimed at reducing error rates. The first approach is rolling circle amplification followed by nanopore long-read sequencing [27], and another approach utilizes rolling circular reverse transcription in conjunction with nanopore long-read sequencing [28, 29]. Both of these methods sequenced the reads after generating multiple replicates of transcripts, effectively lowering the sequencing error rate. Also, we employed two circRNA prediction methods: CIRI-long and isoCirc. CIRI-long generated cyclic consensus sequences (CCS) for each nanopore read, with CCS accuracy ranging from 92.6 to 98.1%. isoCirc detected tandem repeats within raw reads and generated consensus sequences, and the error rate drops to less than 5% when consensus sequences have more than four copies.

In the results of real datasets, JCcirc consistently maintained high recall and F1 scores across different datasets (Figure 6). However, the performance of other methods depends on the specific datasets. For instance, in experiment 3 datasets, which involved mouse brain tissue, the differences in F1 scores among the methods were smaller than in the experiment 2 datasets obtained from human cell lines. Interestingly, circseq_cup’s performance exhibited unexpected variations across different datasets. In experiment 1 datasets, its precision surpassed that of CircAST. Conversely, in experiment 2 and experiment 3 datasets, circseq_cup consistently exhibited the lowest precision and recall compared to other methods. Further investigation revealed that experiment 1 and experiment 2 datasets are both derived from human cell lines, with experiment 1 having shorter read lengths (100 bp) than experiment 2 (150 bp), but twice the coverage compared to experiment 2. It suggested that the sequencing depth significantly impacts the performance of circseq_cup.

Figure 6

(A) Workflow of JCcirc. (B) Illustration of the junction contig. (C) Formation of alternative splicing circRNA.

Although JCcirc demonstrates superior performance in assembling circRNA sequences, the assembly of some specific circRNA types remains challenging, with the correct assembly rate for intronic and intergenic circRNAs still falling below 10%. Moreover, despite the utilization of JC in JCcirc to extend the length of assembled sequences, the correct assembly rate of sequences exceeding 1200 bp remains low. In this study, we introduced some non-BSJ through JC to assemble the full-length sequence of circRNA. Based on the JC, we can further expand some weakly correlated non-BSJ that are associated with highly correlated non-BSJ. These reads may prove beneficial for the assembly of long-sequence circRNAs. However, the probability of weakly correlated non-BSJ still derived from circRNA is lower than that of non-BSJ in JC, necessitating further filtering to ensure their reliability.

The coexistence of circRNA reads and linear RNA reads within sequencing data presents a significant challenge in the assembly of circRNA full-length sequences. Existing methods typically rely on BSJ for assembly, which restricts the length of the assembled sequences. In contrast, JCcirc introduces JC, which leverages non-BSJ to enhance the assembly rate of circRNA sequences. The inclusion of JC has indeed yielded improvements in the assembly rate, particularly for low-expression circRNAs. Nevertheless, there remains scope for further advancements in assembling long-sequence circRNAs and other circRNA types beyond exonic. It is still important to develop assembly tools for long-sequence circRNAs and non-exonic circRNAs.

METHOD

Datasets: simulated datasets

In order to address the limitation of CIRI_simulator, which focuses exclusively on simulating exonic circRNA sequencing reads, we have developed CircSimu based on CIRI_simulator [30]. CircSimu is capable of generating simulated sequencing data for four types of circRNA (exonic, exonic–intronic, intronic and intergenic). The source code for this software tool can be accessed at https://github.com/cbbzhang/JCcirc.

The utilization of simulated data allows for the acquisition of all information regarding circRNA, facilitating a comprehensive evaluation of assembly tools. To assess the performance of the assembly methods, we generated simulated data comprising paired-end reads of both circRNAs and linear RNAs. CircSimu was employed to generate the circRNA reads, and the linear RNA reads were generated using Polyester [31].

Two groups of circRNA reads were used to evaluate the performance of the assembly methods. The first group (G1) of simulated circRNA datasets comprises six datasets with varying sequencing depths (10×, 20×, 30×, 50×, 70×, 100×). All datasets in G1 have a read length of 150 bp and an average insert length of 450 bp with a standard deviation of 70 bp. The second group (G2) of simulated circRNA datasets consists of five datasets with different read lengths (75 bp, 100 bp, 125 bp, 150 bp and 200 bp). The sequence depth for G2 is set to 30×, while the average insert length and standard deviation of the insert length remain the same as in G1 (450 bp and 70 bp). The read lengths of the linear RNA simulated data in Polyester are the same as circRNA, with the ‘reads_per_transcript’ parameter set to 300 [32].

To assess the robustness of the assembly methods, we generated simulated circRNA datasets for six different model species using their reference genomes. The species included in the study were human (Homo sapiens, GRCh38/hg38), mouse (Mus musculus, GRCm38/mm10), fruit fly (Drosophila melanogaster, BDGP R6/dm6), zebrafish (Brachydanio rerio, GRCz11/danRer11), A. thaliana (Arabidopsis thaliana, TAIR10) and rice (Oryza sativa Japonica, IRGSP-1.0). Genome information for animals was obtained from UCSC, and genome information for plants was sourced from Ensembl.

Datasets: real datasets

Moreover, we collected three groups of human and mouse datasets from the Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra) and the National Genomics Data Center (https://ngdc.cncb.ac.cn/). Each group included both Illumina RNA-seq data (short reads) and Nanopore sequencing data (long reads) derived from the same experimental samples. The reference genomes for human (GRCh38/hg38) and mouse (GRCm38/mm10) were obtained from UCSC. The detailed information of these datasets can be found in Table S5.

Description of JCcirc

Existing assembly methods heavily rely on BSJ for the reconstruction of circRNA full-length sequences. However, short reads pose challenges in achieving high assembly accuracy and effectively assembling long sequences. To improve the correct assembly rate, we have developed JCcirc. This tool introduces the concept of JC, an extension of the BSJ, and integrates both BSJ and JC to reconstruct full-length circRNA sequences. JCcirc is accessible at https://github.com/cbbzhang/JCcirc and consists of four steps, which are visually depicted in Figure 6.

First step: extraction of BSJ and JC

BSJ was predicted by circRNA prediction algorithms, and JC was assembled by de novo assembly algorithms from RNA-seq data. The assembly contigs are classified into two distinct groups: JC, assembled from both BSJ and non-BSJ (Figure 6B); and non-junction contigs, assembled solely from non-BSJ.

Second step: alignment of BSJ and JC

Once the circRNA junction sites are identified, we extract the genomic sequence between the two sites. The two genomic sequences are then concatenated, forming a single sequence where the junction represents a back-splicing site. We employ BWA with default parameters to align the BSJ and JC to this concatenated reference sequence. The alignment process yields a SAM file containing the mapping results. Based on these mapping results, we obtain candidate fragments related to the full-length sequences of circRNAs.

Third step: filtering of low-confidence fragments

First of all, we performed a fine-tuning process for fragment locations based on the provided annotation file. When the distance between a fragment site and an exon site was ≤20 nt, we fine-tuned the fragment site to the exon site. For two fragments situated within the same exon, we generated a new fragment with the start site set as the smallest among the fragments’ sites and the end site set as the largest.

Subsequently, we applied several criteria to merge or delete fragments based on the following rules: (i) Fragment shorter than 20 nt was excluded. (ii) In cases where one fragment was entirely covered by another fragment, we retained the fragment with higher coverage. (iii) Fragments with partial overlap were merged to form a longer fragment. (iv) Fragments with a coverage of at least 3 or coverage equal to or greater than the average coverage minus 2 were retained. By implementing these filtering criteria, we ensured the selection of highly reliable fragments for subsequent reconstruction.

Fourth step: generation of alternative splicing circRNA

Primarily, we generate potential splice graphs based on shared BSJ IDs or JC IDs, where candidate fragments serve as nodes, and BSJ and JC represent edges in these graphs. If the same BSJ or JC supports two fragments, these two nodes may be adjacent. We then generate the longest full-length sequences according to the longest splice graph. Next, we filter out low-confidence fragments based on fragment coverage to identify reliable splice graphs as potential circRNA alternative splicing isoforms. Fragments with low coverage are excluded as they may not provide sufficient support for multiple circRNAs. Given that our sequence alignment primarily relies on BSJ and JC, higher coverage at both ends of the sequence compared to the middle region. Assuming a circRNA has N alignment fragments, we make the following assumptions:

(i) The coverage of fragment 1 is approximately equal to the coverage of fragment N, the coverage of fragment 2 is approximately equal to the coverage of fragment N − 1, and so forth.
(ii) The coverage of fragment 2 is less than that of fragment 1.

Based on these assumptions, we apply the following rules:

If the coverage of two corresponding fragments, i and j, differ by three or more (coverage_i − coverage_j ≥ 3), and (coverage_j + D) < coverage_j + 1 and (coverage_j + D) < coverage_j − 1, we exclude fragment j in circRNA isoforms. D can be set to 0, 1 or 2; a larger value means stricter screening. This parameter can also be adjusted based on the intron length of the reference genome; longer introns could set smaller values.

When N is an odd number, we compare the coverage of the median fragment with the coverage of the fragments median − 1 and median + 1. If coverage_median < coverage_{median − 1}/2 and coverage_median < coverage_{median + 1}/2, we exclude fragment median in circRNA isoforms.

Finally, if the length of the assembled sequence is greater than 5000 bp, we delete the sequence.

Prediction of circRNA and assembly of circRNA full-length sequences

To analyze the sequencing data obtained from experimental samples, we employed several existing tools. For RNA-Seq data (short reads), we utilized CIRI2 (version 2.0.6) [30] in combination with BWA (version 0.7.17) [33], CIRCexplorer2 (version 2.3.0) [34] in combination with BWA, and DCC (version 0.5.0) [35] in combination with STAR (version 2.7.6a) [36] to predict circRNAs. The circRNAs identified by all three methods simultaneously were considered for subsequent assembly. For Nanopore sequencing data, we employed isoCirc (version 1.0.2) [27] or CIRI-long (version 1.0.1) [29] to predict circRNA sequences. isoCirc is more suitable for long reads longer than 1000 bp, while CIRI-long is more suitable for long reads shorter than 1000 bp.

To generate the circRNA full-length sequences, we utilized five different assemblers: CircAST combined with Tophat2 (version 2.1.0) [37], circseq_cup (version 1.0) combined with Tophat2, CIRI-full (version 2.0) combined with BWA, circRNAfull combined with STAR, and our proposed JCcirc. All of these software tools were executed using their default parameters.

Evaluation metrics

In order to assess the correctness of the assembled circRNA sequences, we conducted comparisons with simulated sequences or sequences predicted from third-generation sequencing. We considered an assembly sequence to be correct if this sequence has the same number of fragments as the reference sequence, and each fragment of the circRNA assembly sequence aligned perfectly with the corresponding reference fragment. A tolerance of up to 6 bp displacement was allowed between the start and end sites of each fragment and the reference fragment sites.

To evaluate the performance of circRNA full-length sequence assemblers, we utilized several metrics, namely precision, recall and F1 score. Higher scores indicate superior performance across all of these metrics. The calculation formulas for these metrics are defined as follows:

$$precision=\frac{TP}{TP+ FP}$$

$$recall=\frac{TP}{TP+ FN}$$

$$F1\ score=\frac{2\times precision\times recall}{precision+ recall}$$

where TP indicates true positives, FP is false positives, TN refers to true negatives and FN represents false negatives.

Key Points

We have developed an efficient tool named JCcirc, which is used for reconstructing full-length circRNAs through integrated junction contigs.
JCcirc consistently outperforms other existing methods on simulated datasets of multiple model organisms and real datasets.
In comparison to other methods, JCcirc exhibits superior assembly rates for all types of circRNAs, including exonic circRNAs, intronic circRNAs, exonic–intronic circRNAs and intergenic circRNAs.

FUNDING

This work is partly supported by the Key Research and Development Project of Guangdong Province under grant no. 2021B0101310002, the National Science Foundation of China under grant no. 62272449, the Strategic Priority CAS Project under grant no. XDB38050100, the National Key Research and Development Program of China under grant no. 2021YFF1200104, the Shenzhen Basic Research Fund under grant nos. RCYX20200714114734194, KQTD20200820113106007 and ZDSYS20220422103800001, and by the Youth Innovation Promotion Association (Y2021101), CAS to Yanjie Wei.

DATA AVAILABILITY

All required links or identifiers for our data are present in the manuscript and supplementary data.

Jingjing Zhang is a PhD student at the University of Chinese Academy of Sciences, Beijing, China. Her research interest is bioinformatics and circular RNAs.

Huiling Zhang is currently a lecturer at South China Agricultural University. Her research interest lies in the fields of big data, machine learning, computational biology and bioinformatics.

Zhen Ju is a PhD student at the University of Chinese Academy of Sciences, Beijing, China. His current research interests include heterogeneous computing and high-performance computing.

Yin Peng is an associate professor in department of Pathology at Shenzhen University, China. Her long-standing research interests lie in elucidating the molecular mechanisms of cell signaling transduction pathway underlying tumorigenesis.

Yi Pan is a Chair Professor and the Dean of the College of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, China, and a Regents’ Professor Emeritus with Georgia State University, USA. His current research interests include bioinformatics and health informatics using big data analytics, cloud computing, and machine learning technologies.

Wenhui Xi is an associate professor at Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences. His research interests include bioinformatics and molecular simulation of proteins and peptides.

Yanjie Wei is a professor and the director in Center for High Performance Computing, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences. His research focuses on high performance computing and computational biology/bioinformatics.

References

1.

Liu

CX

,

Chen

LL

.

Circular RNAs: characterization, cellular roles, and applications

.

Cell

2022

;

185

(

13

):

2016

–

34

.

2.

Chen

Y

,

Wang

J

,

Wang

C

, et al.

Deep learning models for disease-associated circRNA prediction: a review

.

Brief Bioinform

2022

;

23

(6):bbac364.

3.

Ashwal-Fluss

R

,

Meyer

M

,

Pamudurti

NR

, et al.

circRNA biogenesis competes with pre-mRNA splicing

.

Mol Cell

2014

;

56

(

1

):

55

–

66

.

4.

Zhang

P

,

Dai

M

,

Genomics

DM

.

CircRNA: a rising star in plant biology

.

J Genet Genomics

2022

;

49

(

12

):

1081

–

92

.

5.

Niu

M

,

Ju

Y

,

Lin

C

,

Zou

Q

.

Characterizing viral circRNAs and their application in identifying circRNAs in viruses

.

Brief Bioinform

2022

;

23

(1):bbab404.

6.

Di Timoteo

G

,

Dattilo

D

,

Centron-Broco

A

, et al.

Modulation of circRNA metabolism by m6A modification

.

Cell Rep

2020

;

31

(

6

):

107641

.

7.

Liu

R

,

Zhang

L

,

Zhao

X

, et al.

circRNA: regulatory factors and potential therapeutic targets in inflammatory dermatoses

.

J Cell Mol Med

2022

;

26

(

16

):

4389

–

400

.

8.

He

Y

,

Huang

H

,

Jin

L

, et al.

CircZNF609 enhances hepatocellular carcinoma cell proliferation, metastasis, and stemness by activating the hedgehog pathway through the regulation of miR-15a-5p/15b-5p and GLI2 expressions

.

Cell Death Dis

2020

;

11

(

5

):

358

.

9.

Puri

S

,

Hu

J

,

Sun

Z

, et al.

Identification of circRNAs linked to Alzheimer’s disease and related dementias

.

Alzheimers Dement

2023

;

19

(

8

):

3389

–

405

.

10.

Chen

L

,

Shan

G

.

CircRNA in cancer: fundamental mechanism and clinical potential

.

Cancer Lett

2021

;

505

:

49

–

57

.

11.

Huang

W

,

Wu

Y

,

Qiao

M

, et al.

CircRNA–miRNA networks in regulating bone disease

.

J Cell Physiol

2022

;

237

(

2

):

1225

–

44

.

12.

Du

WW

,

Zhang

C

,

Yang

W

, et al.

Identifying and characterizing circRNA-protein interaction

.

Theranostics

2017

;

7

(

17

):

4183

–

91

.

13.

Shi

Y

,

Jia

X

,

Xu

J

, et al.

The new function of circRNA: translation

.

Clin Transl Oncol

2020

;

22

(

12

):

2162

–

9

.

14.

Zhang

X-O

,

Dong

R

,

Zhang

Y

, et al.

Diverse alternative back-splicing and alternative splicing landscape of circular RNAs

.

Genome Res

2016

;

26

(

9

):

1277

–

87

.

15.

Wu

J

,

Li

Y

,

Wang

C

, et al.

CircAST: full-length assembly and quantification of alternatively spliced isoforms in circular RNAs

.

Genomics Proteomics Bioinformatics

2019

;

17

(

5

):

522

–

34

.

16.

Ye

C-Y

,

Zhang

X

,

Chu

Q

, et al.

Full-length sequence assembly reveals circular RNAs with diverse non-GT/AG splicing signals in rice

.

RNA Biol

2017

;

14

(

8

):

1055

–

63

.

17.

Zheng

Y

,

Ji

P

,

Chen

S

, et al.

Reconstruction of full-length circular RNAs enables isoform-level quantification

.

Genome Med

2019

;

11

(

1

):

2

.

18.

Hossain

MT

,

Zhang

J

,

Reza

MS

, et al.

Reconstruction of full-length circRNA sequences using chimeric alignment information

.

Int J Mol Sci

2022

;

23

(

12

):

6776

.

19.

Li

X

,

Zhang

B

,

Li

F

, et al.

The mechanism and detection of alternative splicing events in circular RNAs

.

PeerJ

2020

;

8

:

e10032

.

20.

Grabherr

MG

,

Haas

BJ

,

Yassour

M

, et al.

Full-length transcriptome assembly from RNA-Seq data without a reference genome

.

Nat Biotechnol

2011

;

29

(

7

):

644

–

52

.

21.

Li

R

,

Zhu

H

,

Ruan

J

, et al.

De novo assembly of human genomes with massively parallel short read sequencing

.

Genome Res

2010

;

20

(

2

):

265

–

72

.

22.

Bankevich

A

,

Nurk

S

,

Antipov

D

, et al.

SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing

.

J Comput Biol

2012

;

19

(

5

):

455

–

77

.

23.

Robertson

G

,

Schein

J

,

Chiu

R

, et al.

De novo assembly and analysis of RNA-seq data

.

Nat Methods

2010

;

7

(

11

):

909

–

12

.

24.

Hölzer

M

,

Marz

M

.

De novo transcriptome assembly: a comprehensive cross-species comparison of short-read RNA-Seq assemblers

.

Gigascience

2019

;

8

(5):giz039.

25.

Kristensen

LS

,

Andersen

MS

,

Stagsted

LV

, et al.

The biogenesis, biology and characterization of circular RNAs

.

Nat Rev Genet

2019

;

20

(

11

):

675

–

91

.

26.

Gao

JL

,

Chen

G

,

He

HQ

, et al.

CircRNA as a new field in human disease research

.

Zhongguo Zhong Yao Za Zhi

2018

;

43

(

3

):

457

–

62

.

PubMed

27.

Xin

R

,

Gao

Y

,

Gao

Y

, et al.

isoCirc catalogs full-length circular RNA isoforms in human transcriptomes

.

Nat Commun

2021

;

12

(

1

):

266

.

28.

Liu

Z

,

Tao

C

,

Li

S

, et al.

circFL-seq reveals full-length circular RNAs with rolling circular reverse transcription and nanopore sequencing

.

Elife

2021

;

10

:

e69457

.

29.

Zhang

J

,

Hou

L

,

Zuo

Z

, et al.

Comprehensive profiling of circular RNAs with nanopore sequencing and CIRI-long

.

Nat Biotechnol

2021

;

39

(

7

):

836

–

45

.

30.

Gao

Y

,

Wang

J

,

Zhao

F

.

CIRI: an efficient and unbiased algorithm for de novo circular RNA identification

.

Genome Biol

2015

;

16

(

1

):

4

.

31.

Frazee

AC

,

Jaffe

AE

,

Langmead

B

,

Leek

JT

.

Polyester: simulating RNA-seq datasets with differential transcript expression

.

Bioinformatics

2015

;

31

(

17

):

2778

–

84

.

32.

Gaffo

E

,

Buratin

A

,

Dal Molin

A

,

Bortoluzzi

S

.

Sensitive, reliable and robust circRNA detection from RNA-seq with CirComPara2

.

Brief Bioinform

2022

;

23

(1):bbab418.