A deep-learning-based RNA-seq germline variant caller

Author Notes

Abstract

Summary

RNA sequencing (RNA-seq) can be applied to diverse tasks including quantifying gene expression, discovering quantitative trait loci and identifying gene fusion events. Although RNA-seq can detect germline variants, the complexities of variable transcript abundance, target capture and amplification introduce challenging sources of error. Here, we extend DeepVariant, a deep-learning-based variant caller, to learn and account for the unique challenges presented by RNA-seq data. Our DeepVariant RNA-seq model produces highly accurate variant calls from RNA-sequencing data, and outperforms existing approaches such as Platypus and GATK. We examine factors that influence accuracy, how our model addresses RNA editing events and how additional thresholding can be used to facilitate our models’ use in a production pipeline.

Supplementary information

Supplementary data are available at Bioinformatics Advances online.

1 Introduction

RNA-seq is a widely used method for transcriptome analysis. The technology is often used to study gene expression in a variety of contexts. Expression can be examined over time (Zhang et al., 2014), in response to disease or environmental changes (Kakumanu et al., 2012; Ren et al., 2012), or to examine differences across cell types, tissues or species (Gerstein et al., 2014; The GTEx Consortium, 2015; Villani et al., 2017; Zhang et al., 2022). In addition to gene expression, RNA-seq can be used for identifying quantitative trait loci (QTL) (Vigorito et al., 2021), quantifying isoform and allele-specific expression (Li and Dewey, 2011; Miao et al., 2018; Raghupathy et al., 2018), detecting gene fusion events (Haas et al., 2017), and in identification of RNA-editing events (Bahn et al., 2012; Lo Giudice et al., 2020). Interestingly, it is also possible to use RNA-seq to perform germline variant calling (Brouard et al., 2019). However, although germline variant calling with RNA-seq data could be beneficial in many studies, it is an underutilized application of RNA sequencing (Curry et al., 2021).

The lack of studies using RNA-seq for germline variant calling is likely driven by a number of challenges not present with traditional DNA sequencing (DNA-seq) approaches. For example, alignment coverage, a proxy for gene expression, exhibits more dynamic range in RNA-seq than DNA-seq. Coverage can range from little to none to excessively high levels. Low coverage can reduce or eliminate evidence of germline variants and complicate genotype classification (Kukurba and Montgomery, 2015; Sims et al., 2014). Additionally, RNA splicing, allele-specific expression and RNA editing can skew allele frequencies, introduce artifacts or remove variant signals (Guo et al., 2017; Jehl et al., 2021). Sample preparation can also introduce errors. For example, PCR can skew allele frequencies due to amplification bias (Parekh et al., 2016). Finally, contamination by RNAse enzymes can degrade RNA, and RNA-seq often requires an additional reverse transcription step to produce a cDNA library that can introduce errors (Schroeder et al., 2006). These issues can be challenging to address in clinical and research settings.

Despite these challenges, calling germline variants from RNA-seq data provide a number of advantages over traditional DNA-seq approaches. Foremost is an economic consideration: RNA-seq can provide germline variant calls where one might otherwise sequence an additional DNA sample (Ozsolak and Milos, 2011). This is especially useful when characterizing coding variation, which is captured with most RNA-seq approaches, providing a rapid and cost-effective way to interrogate coding regions. Additionally, the versatility of RNA-seq allows germline variation data to be combined with information about transcription and enables QTL discovery (Sun and Hu, 2013).

Existing RNA-seq variant callers use statistical or algorithmic approaches (DePristo et al., 2011; Oikkonen and Lise, 2017). However, these methods are challenging to implement given the numerous additional sources of error and uncertainty in RNA-seq data, and the complex interactions between them. In contrast, deep-learning approaches are capable of learning distinct patterns, extracted from sequence data, that are associated with the validity of candidate variants. In recent years, deep-learning approaches have been shown to be highly competitive and often exceed the performance of statistical and algorithmic approaches (Olson et al., 2022). Here, we introduce a new method for performing variant calling on RNA-seq data. We adapt DeepVariant, a deep-learning-based variant caller, for use in calling germline variants from RNA-seq data. Our best performing DeepVariant RNA-seq model is trained using data from the Genotype-Tissue Expression (GTEx) consortium (The GTEx Consortium, 2015, 2020), and can overcome challenges associated with RNA-seq variant calling and accurately classify germline variants. Our GTEx-based DeepVariant RNA-seq model outperforms existing approaches, achieving an F1 score of 0.933 on coding sequences (CDS) and 0.921 on exonic sequences.

The DeepVariant RNA-seq model and code are available under an open-source license at https://github.com/google/deepvariant.

2 Methods

2.1 Software implementation

DeepVariant is a deep-learning-based variant caller that uses a convolutional neural network to classify the genotype of a position in the genome by learning how read data correlate with real and spurious variants (Poplin et al., 2018). To classify variants, DeepVariant first scans a BAM file to identify evidence of SNP or INDEL variation. For example, a candidate variant SNP is identified when two or more bases differ from the reference at a particular site. When candidate sites are identified, DeepVariant will proceed to generate an example, or model input. Examples are constructed using read pileups by extracting a series of sequence features into channels. Each channel represents a sequence feature such as base, base quality or mapping quality. These channels are stacked to form a model input for training or inference. Following training, DeepVariant can use the resulting model to classify candidate variants as being homozygous-REF, heterozygous or homozygous-ALT.

We adapted the data preprocessing stage of DeepVariant to allow for processing of RNA-seq alignment data (BAMs). RNA-seq BAMs, in contrast to DNA-seq BAMs, contain SKIP CIGAR operations which are used to represent skipped regions (Supplementary Fig. S1) in the reference genome. These are generally caused by splicing events, although structural variation and mismapping of read segments are other possible explanations. Previously, DeepVariant would incorporate these SKIP operations during the local realignment step, resulting in a substantial increase in the computation time. To address this issue, we introduced a new flag (—split_skip_reads) that splits reads into multiple segments by removing their SKIP operations. This change allowed DeepVariant to process RNA-seq data efficiently (Supplementary Fig. S2). With these changes in place, model training and inference operate similar to existing approaches that use DNA-seq.

2.2 Model training and evaluation

Once we updated DeepVariant to process RNA-seq data, we trained models using two datasets. For the first model, we used samples from the Genome in a Bottle consortium (GIAB) (Poplin et al., 2018; Wagner et al., 2022; Zook et al., 2016). The GIAB consortium has developed a high quality set of genotypes for seven human cell lines (Wagner et al., 2022), derived from two sets of trios and a singleton. These genotypes can be used as training labels or for benchmarking purposes. To train the DeepVariant RNA-seq GIAB model (‘DV RNA-seq [GIAB]’), we extracted RNA from cell pellets of lymphoblastoid lines HG001 (GM12878; 10 replicates), and HG002 (GM26105; three replicates) obtained from the Coriell institute. This RNA was sequenced with the Tempus xT assay, which includes exome-capture regions that cover the exons of 19 433 human genes (34 Mb target region of the human genome) (Beaubier et al., 2019). Sequenced reads were aligned using STAR v2.5.4. The resulting BAMs were used for training, using data from chr2–19; chr21 and chr22 were used for hyperparameter optimization, with chr1 and chr20 held out for testing. For training we used the variants from the truth set from Genome-in-a-Bottle v4.1 and restricting to high confidence regions(Cleary_undated-zh). The GIAB dataset allowed us to train a model using high-quality labels.

We used RNA-seq data from GTEx V8 (phs000424.v8.p2) to train our second model (‘DV RNA-seq [GTEx]’). The GTEx consortium has extensively characterized human tissue samples from postmortem donors, and provides whole-genome DNA-seq and RNA-seq from a large collection of tissues. RNA-seq data from the GTEx project was sequenced using the Illumina TruSeq non-stranded polyA+ selection protocol. We assigned GTEx donors to a train or test group, and randomly selected 1000 RNA-seq samples from our train donor group (526 unique donors across 41 tissue types; Supplementary Table S1). Similarly, we randomly selected 200 RNA-seq samples for testing (74 unique donors across 36 tissue types; Supplementary Table S1). The GTEx dataset allowed us to train a model with a large number and variety of samples. Both models were trained using Inception V3, a convolutional neural network model (Szegedy et al., 2015). Parameters are detailed in Supplementary Table S2.

To derive a truth dataset for training the GTEx model, we ran DeepVariant’s highly accurate WGS model on GTEx DNA-seq samples and used the resulting genotypes as labels for training and testing our RNA-seq model. We refer to this training data as a ‘silver truth’ dataset because our labels are unlikely to be as accurate as GIAB for training. Nevertheless, this approach generated over 80.4 million training examples, 314× more than was used for the GIAB dataset, and provided a set of training examples with more diversity across tissue and ancestry. We also partitioned the genomes in our silver truth dataset: chr1–19 were used for training, chr21 and chr22 were used for tuning, and chr20 was held out for testing. We performed training by subsetting on exonic regions, and by warm starting from an existing whole-exome sequence trained DeepVariant model. A summary table of the datasets for both the GTEx and GIAB models is provided in Supplementary Table S3. We used prealigned BAMs and CRAMs provided by the GTEx consortium. The V8 release RNA-seq BAMs aligned using STAR v2.5.3a, and DNA-seq BAMs aligned using BWA (v0.7.13-r1126).

Separately, we also sequenced HG002 (GM24385, GM26105, GM27730) and HG005 (GM26107) on the NovaSeq platform using an mRNA poly-A-based library preparation. These samples were processed using the nf-core/rnaseq pipeline (Ewels et al., 2020) (10.5281/zenodo.1400710). We used HG002 with our GTEx model to establish filtering cutoff thresholds. HG005, which was held out from training for both GIAB and GTEx models, was used to compare performance of both models. Benchmarking was performed using GIAB Truth set v4.2.1. We used hap.py with the RTG vcfeval engine to perform benchmarking (https://github.com/Illumina/hap.py). Variants were required to have a minimum depth of 3× in the RNA-seq sample in order to be considered.

3 Results

3.1 DeepVariant RNA-seq produces accurate germline variant calls

We trained two DeepVariant RNA-seq models: ‘DV RNA-seq [GTEx]’ and ‘DV RNA-seq [GIAB]’. We compared the performance of the GTEx and GIAB models using HG005 (GM26107; poly-A capture), which was held out from training for both models. Performance was stratified across CDS, exon, transcript, gene, high-confidence GIAB regions (GIAB-hc), chr20 (which was held out from all training data) and regions with a minimum of 3× depth. Our initial analysis here examines F1 scores as they are sensitive to both precision and recall and enable comparison of models using a single value. Our subsequent analyses consider precision and sensitivity separately. The GTEx model outperforms the GIAB model, except for INDEL exon regions (Table 1). We chose to focus most of our subsequent analysis on DV RNA-seq [GTEx] given that it outperformed the GIAB model and fits a desirable use case of calling SNPs in CDS regions.

Table 1.

Open in new tab

Performance of GTEx and GIAB trained models on HG005 (GM26107)

Model	CDS	Exon	Transcript	Gene	GIAB-hc	chr20	3×
SNP
DV RNA-seq [GTEx]	0.934	0.918	0.835	0.831	0.857	0.837	0.82
DV RNA-seq [GIAB]	0.925	0.832	0.719	0.713	0.734	0.719	0.702
INDEL
DV RNA-seq [GTEx]	0.692	0.646	0.557	0.557	0.663	0.564	0.537
DV RNA-seq [GIAB]	0.676	0.66	0.541	0.541	0.624	0.53	0.527

Model	CDS	Exon	Transcript	Gene	GIAB-hc	chr20	3×
SNP
DV RNA-seq [GTEx]	0.934	0.918	0.835	0.831	0.857	0.837	0.82
DV RNA-seq [GIAB]	0.925	0.832	0.719	0.713	0.734	0.719	0.702
INDEL
DV RNA-seq [GTEx]	0.692	0.646	0.557	0.557	0.663	0.564	0.537
DV RNA-seq [GIAB]	0.676	0.66	0.541	0.541	0.624	0.53	0.527

Note: F1 scores are shown for DV RNA-seq [GTEx] and DV RNA-seq [GIAB] for sample HG005 across different regions. All regions are intersected with a 3× minimum coverage bed file from the HG005 RNA-seq sample. Bolded numbers indicate the best performing model by region based on F1 score for each variant type (SNP or INDEL).

Table 1.

Open in new tab

Performance of GTEx and GIAB trained models on HG005 (GM26107)

Model	CDS	Exon	Transcript	Gene	GIAB-hc	chr20	3×
SNP
DV RNA-seq [GTEx]	0.934	0.918	0.835	0.831	0.857	0.837	0.82
DV RNA-seq [GIAB]	0.925	0.832	0.719	0.713	0.734	0.719	0.702
INDEL
DV RNA-seq [GTEx]	0.692	0.646	0.557	0.557	0.663	0.564	0.537
DV RNA-seq [GIAB]	0.676	0.66	0.541	0.541	0.624	0.53	0.527

Model	CDS	Exon	Transcript	Gene	GIAB-hc	chr20	3×
SNP
DV RNA-seq [GTEx]	0.934	0.918	0.835	0.831	0.857	0.837	0.82
DV RNA-seq [GIAB]	0.925	0.832	0.719	0.713	0.734	0.719	0.702
INDEL
DV RNA-seq [GTEx]	0.692	0.646	0.557	0.557	0.663	0.564	0.537
DV RNA-seq [GIAB]	0.676	0.66	0.541	0.541	0.624	0.53	0.527

3.2 Comparison of DeepVariant RNA-seq to other open source callers

We compared the performance of our GTEx and GIAB models with several other RNA-seq variant calling approaches using our GTEx-derived ‘silver truth’ dataset. Our comparison includes the warm-start model trained on whole-exome data and used for training (DV WES; DeepVariant v1.1.0) to see how much training on RNA-seq data improved our model. We also tested GATK HaplotypeCaller (v4.2.4.0) (DePristo et al., 2011) based on a RNA-seq workflow developed by the GATK team. Finally, we tested Platypus (v0.8.1.2) (Rimmer et al., 2014). Platypus requires the BAM input to be preprocessed with a tool called Opossum (v1.0), which performs filtering and processing of RNA-seq BAMs to optimize them for RNA-seq variant calling (Oikkonen and Lise, 2017). We also tested other callers with and without Opossum-preprocessed BAMs.

We used our GTEx test data to compare performance across variant callers. Similar to our model comparison analysis, we stratified across several regions intersected at a 3× minimum coverage. We observe that DV RNA-seq [GTEx] achieves the highest median F1 score across all region stratifications (Table 2), performing the best overall in CDS regions (Fig. 1). We observe the same trend with INDELs as well (Supplementary Fig. S3). Interestingly, we found that preprocessing with Opossum lowered the performance of DeepVariant, likely because the alignment data were substantially modified from our training dataset. However, our modifications to DeepVariant obviate the need for preprocessing while still resulting in more accurate results.

Fig. 1.

SNP F1 scores in CDS regions for 200 GTEx RNA-seq samples. SNP F1 scores in CDS regions are shown for 200 GTEx RNA-seq test samples across Platypus, GATK, DV WES, DV RNA-seq [GIAB], and DV RNA-seq [GTEx]. Data that were preprocessed with Opossum is labeled as such

Open in new tab Download slide

Table 2.

Open in new tab

Performance of variant callers on 200 GTEx samples across genomic regions

Variant caller	CDS	Exon	Transcript	Gene	chr20	3×
SNP
DV RNA-seq [GTEx]	0.933	0.921	0.846	0.842	0.83	0.83
DV RNA-seq [GIAB]	0.909	0.845	0.745	0.737	0.738	0.722
DV RNA-seq [GIAB] (Opossum)	0.903	0.84	0.742	0.733	0.739	0.719
GATK (Opossum)	0.903	0.852	0.759	0.75	0.753	0.737
DV WES	0.898	0.801	0.678	0.667	0.661	0.649
GATK	0.898	0.847	0.751	0.742	0.741	0.727
Platypus (Opossum)	0.898	0.833	0.726	0.717	0.716	0.703
DV WES (Opossum)	0.895	0.809	0.692	0.682	0.679	0.666
DV RNA-seq [GTEx] (Opossum)	0.805	0.785	0.701	0.696	0.687	0.685
INDEL
DV RNA-seq [GTEx]	0.73	0.675	0.558	0.554	0.544	0.539
DV RNA-seq [GIAB]	0.652	0.6	0.494	0.49	0.49	0.477
DV RNA-seq [GIAB] (Opossum)	0.592	0.558	0.436	0.432	0.443	0.418
DV RNA-seq [GTEx] (Opossum)	0.524	0.531	0.437	0.434	0.429	0.419
GATK (Opossum)	0.504	0.565	0.482	0.477	0.487	0.463
GATK	0.486	0.574	0.491	0.487	0.495	0.471
DV WES	0.423	0.581	0.493	0.489	0.502	0.477
Platypus (Opossum)	0.414	0.468	0.398	0.394	0.397	0.374
DV WES (Opossum)	0.392	0.533	0.44	0.435	0.452	0.423

Variant caller	CDS	Exon	Transcript	Gene	chr20	3×
SNP
DV RNA-seq [GTEx]	0.933	0.921	0.846	0.842	0.83	0.83
DV RNA-seq [GIAB]	0.909	0.845	0.745	0.737	0.738	0.722
DV RNA-seq [GIAB] (Opossum)	0.903	0.84	0.742	0.733	0.739	0.719
GATK (Opossum)	0.903	0.852	0.759	0.75	0.753	0.737
DV WES	0.898	0.801	0.678	0.667	0.661	0.649
GATK	0.898	0.847	0.751	0.742	0.741	0.727
Platypus (Opossum)	0.898	0.833	0.726	0.717	0.716	0.703
DV WES (Opossum)	0.895	0.809	0.692	0.682	0.679	0.666
DV RNA-seq [GTEx] (Opossum)	0.805	0.785	0.701	0.696	0.687	0.685
INDEL
DV RNA-seq [GTEx]	0.73	0.675	0.558	0.554	0.544	0.539
DV RNA-seq [GIAB]	0.652	0.6	0.494	0.49	0.49	0.477
DV RNA-seq [GIAB] (Opossum)	0.592	0.558	0.436	0.432	0.443	0.418
DV RNA-seq [GTEx] (Opossum)	0.524	0.531	0.437	0.434	0.429	0.419
GATK (Opossum)	0.504	0.565	0.482	0.477	0.487	0.463
GATK	0.486	0.574	0.491	0.487	0.495	0.471
DV WES	0.423	0.581	0.493	0.489	0.502	0.477
Platypus (Opossum)	0.414	0.468	0.398	0.394	0.397	0.374
DV WES (Opossum)	0.392	0.533	0.44	0.435	0.452	0.423

Note: Each row lists the median F1 score across 200 GTEx test samples for the given variant caller when using the original BAM or an Opossum preprocessed BAM in each region type. Performance is sorted by F1 score descending. Bolded numbers indicate the best performing model by region based on F1 score for each variant type (SNP or INDEL).

Table 2.

Open in new tab

Performance of variant callers on 200 GTEx samples across genomic regions

Variant caller	CDS	Exon	Transcript	Gene	chr20	3×
SNP
DV RNA-seq [GTEx]	0.933	0.921	0.846	0.842	0.83	0.83
DV RNA-seq [GIAB]	0.909	0.845	0.745	0.737	0.738	0.722
DV RNA-seq [GIAB] (Opossum)	0.903	0.84	0.742	0.733	0.739	0.719
GATK (Opossum)	0.903	0.852	0.759	0.75	0.753	0.737
DV WES	0.898	0.801	0.678	0.667	0.661	0.649
GATK	0.898	0.847	0.751	0.742	0.741	0.727
Platypus (Opossum)	0.898	0.833	0.726	0.717	0.716	0.703
DV WES (Opossum)	0.895	0.809	0.692	0.682	0.679	0.666
DV RNA-seq [GTEx] (Opossum)	0.805	0.785	0.701	0.696	0.687	0.685
INDEL
DV RNA-seq [GTEx]	0.73	0.675	0.558	0.554	0.544	0.539
DV RNA-seq [GIAB]	0.652	0.6	0.494	0.49	0.49	0.477
DV RNA-seq [GIAB] (Opossum)	0.592	0.558	0.436	0.432	0.443	0.418
DV RNA-seq [GTEx] (Opossum)	0.524	0.531	0.437	0.434	0.429	0.419
GATK (Opossum)	0.504	0.565	0.482	0.477	0.487	0.463
GATK	0.486	0.574	0.491	0.487	0.495	0.471
DV WES	0.423	0.581	0.493	0.489	0.502	0.477
Platypus (Opossum)	0.414	0.468	0.398	0.394	0.397	0.374
DV WES (Opossum)	0.392	0.533	0.44	0.435	0.452	0.423

Variant caller	CDS	Exon	Transcript	Gene	chr20	3×
SNP
DV RNA-seq [GTEx]	0.933	0.921	0.846	0.842	0.83	0.83
DV RNA-seq [GIAB]	0.909	0.845	0.745	0.737	0.738	0.722
DV RNA-seq [GIAB] (Opossum)	0.903	0.84	0.742	0.733	0.739	0.719
GATK (Opossum)	0.903	0.852	0.759	0.75	0.753	0.737
DV WES	0.898	0.801	0.678	0.667	0.661	0.649
GATK	0.898	0.847	0.751	0.742	0.741	0.727
Platypus (Opossum)	0.898	0.833	0.726	0.717	0.716	0.703
DV WES (Opossum)	0.895	0.809	0.692	0.682	0.679	0.666
DV RNA-seq [GTEx] (Opossum)	0.805	0.785	0.701	0.696	0.687	0.685
INDEL
DV RNA-seq [GTEx]	0.73	0.675	0.558	0.554	0.544	0.539
DV RNA-seq [GIAB]	0.652	0.6	0.494	0.49	0.49	0.477
DV RNA-seq [GIAB] (Opossum)	0.592	0.558	0.436	0.432	0.443	0.418
DV RNA-seq [GTEx] (Opossum)	0.524	0.531	0.437	0.434	0.429	0.419
GATK (Opossum)	0.504	0.565	0.482	0.477	0.487	0.463
GATK	0.486	0.574	0.491	0.487	0.495	0.471
DV WES	0.423	0.581	0.493	0.489	0.502	0.477
Platypus (Opossum)	0.414	0.468	0.398	0.394	0.397	0.374
DV WES (Opossum)	0.392	0.533	0.44	0.435	0.452	0.423

We also evaluated the performance on chr20 that was held out from training and again found DV RNA-seq [GTEx] to outperform alternative variant callers (Supplementary Table S3).

3.3 Factors impacting RNA-seq model performance

Next, we investigated factors that impact DV RNA-seq [GTEx] performance, focusing on the use case of calling SNPs in CDS regions. We examined F1 scores after restricting variant calling to sites at a series of coverage thresholds in CDS regions (Fig. 2A). For each threshold, we indicate the relative precision and sensitivity.

Fig. 2.

Impact of depth, RIN, and tissue on DV RNA-seq performance. (A) F1 scores are shown for SNPs in CDS regions for a series of coverage thresholds. For each coverage threshold, we show the distribution of F1 scores for 200 test RNA-seq samples. Below each box plot are the values for precision (P), sensitivity (S) at the given threshold. (B) Correlation plot between RIN and F1 scores for CDS SNPs. (C) F1-score distributions for SNPs in CDS regions across a collection of GTEx tissues. The number of RNA-seq samples is indicated for each tissue at the top (n). We filtered tissues with >5 samples in our test set

Open in new tab Download slide

Assuming variants are detectable at a minimum depth of 3×, we asked what proportion of these detectable variants remain across minimum coverage thresholds (Supplementary Fig. S4). Although our examination here will be dependent on coverage for a given sample, our analysis suggests that a modest coverage threshold of 8× sample still captures 75% of CDS label variants on average in GTEx samples.

We also examined the correlation between the RNA integrity number (RIN) and RNA-seq sample accuracy. An RIN score can be calculated for a given RNA sample using an algorithm that examines the ratio of 28S and 18S ribosomal RNA. RIN scores range from 1 (totally degraded) to 10 (intact) (Schroeder et al., 2006), and are used as an indicator of RNA quality. We observe a weak correlation between RIN and CDS SNP F1 scores (R = 0.26; P = 0.00015; Fig. 2B), suggesting that DeepVariant RNA-seq can be used on samples with low RIN scores. We also observe a weak correlation for INDELs as well (R = 0.16, P = 0.021; Supplementary Fig. S5).

Tissue type can strongly influence RNA-seq data. We examined how GTEx tissues impacted DeepVariant RNA-seq performance in CDS regions. We observed the highest F1 score with skin and the lowest performance in blood (Fig. 2C). However, performance for SNPs in CDS regions was relatively consistent across tissues. One major factor that may drive differences in performance across tissue is the proportion of abundant versus rare transcripts, which in turn affects the coverage at variant positions.

Unlike DNA, RNA-seq reads possess splicing boundaries. We reasoned that error rates might increase near exon boundaries, where read context around candidate variants will drop off and make accurate variant classification more difficult. Unsurprisingly, we observe this effect across several RNA-seq variant calling methods (Fig. 3). However, DV RNA-seq [GTEx] appears to largely eliminate this effect for false-positive (FP) SNP variants at exon boundaries, and achieves low false-negative (FN) levels as well.

Fig. 3.

SNP errors relative to exon boundary. Average counts of false-negative (FN; top panel) and false positive (FP; bottom panel) errors per GTEx RNA sample (n = 200) are shown based on a called SNPs position relative to the nearest exon boundary for several variant callers. Negative positions correspond to SNPs outside of exons, whereas positive positions indicate SNPs falling within exons. A value of 0 indicates a position falling on an exon boundary

Open in new tab Download slide

3.4 Our model learns to ignore RNA-editing events

RNA is subject to editing by ADAR (adenosine deaminases acting on RNA), which can result in Adenosine-to-inosine (A-to-I) changes (Walkley and Li, 2017). These A-to-I changes are observed in RNA-seq datasets as apparent A-to-G and T-to-C changes which do not reflect real variants in the DNA, and can increase the germline FP rate. We sought to examine how our GTEx-based model treated these types of events. To do this, we annotated variants that belonged to an RNA editing database called REDIportal (Mansi et al., 2021). REDIportal has identified and annotated genomic positions where RNA editing has been observed using GTEx data.

We asked how the variants compared with the label dataset based on the base change observed and whether they were present in the REDIportal database. Interestingly, we observe that the DV RNA-seq [GTEx] model has a large reduction in the number of FPs at REDIportal sites compared to the DV WES model. In Figure 4, we show a representative example of this for GTEx sample GTEX-ZZ64-1126-SM-5GZXY. We also observe this phenomenon in an aggregate analysis (Supplementary Fig. S6). The REDIportal FP panel shows a considerable reduction in the FP rate at sites where RNA editing has been observed. The reduction in FPs does appear to slightly increase the FN rate at these sites. However, we observe an overall reduction in errors, suggesting that our model learns to ignore RNA-edited sites. We suspect that our model picks up on sequence context and allele frequency differences at these sites as a basis for filtering them out.

Fig. 4.

RNA-editing events are ignored. Exonic variants classified as false-negative (FN), false positive (FP) or true positive (TP) are shown by base pair change and whether they are present within the REDIportal database for GTEx sample GTEX-ZZ64-1126-SM-5GZXY

Open in new tab Download slide

3.5 Selecting a cutoff for practical implementation

When implementing a variant caller into a production analysis pipeline, it is convenient to select a cutoff based on quality scores to maintain a desired true positive rate (TPR) or false discovery rate (FDR) based on the application. For example, the 1000 Genomes Project FDR target was 5% (although ultimately the project delivered calls at 1% FDR) (Siva, 2008). In this example, we will similarly illustrate a threshold that favors precision over sensitivity.

Variability between samples is expected due to both random sampling during the sequencing process and variation among samples. Therefore, cutoff scores should be established by examining multiple replicates or samples using validated variants. We examined 3 HG002 cell lines: GM24385, a lymphoblastoid cell line (LCL), and two induced pluripotent stem cell lines, GM26105 and GM27730. These cell lines were evaluated with the DV RNA-seq [GTEx] model in high-confidence GIAB CDS regions to identify an appropriate cutoff score that would maintain an FDR of 1.5% while retaining high sensitivity.

We considered two quality scores to use for setting a cutoff: Genotype quality (GQ) and quality (QUAL). GQ describes the probability of a variant call being incorrect, whereas QUAL describes the probability a given site is not a variant. We found GQ worked well for both SNPs and INDELs for thresholding in comparison to QUAL scores, and decided on a cutoff of GQ ≥ 18 (GQ: Fig. 5; QUAL: Supplementary Fig. S6). This threshold achieves an FDR of <1.5% for both SNPs and INDELs while maintaining high sensitivity for SNPs, and modest sensitivity INDELs.

Fig. 5.

Selection of a QUAL cut off to maintain FDR of 1.5%. The FDR (x-axis) is plotted against the TPR (y-axis) for both SNP and INDELs across 3 HG002 cell lines in high confidence GIAB regions. Each point represents a given GQ cutoff. The red line indicates an FDR of 1.5%. Larger markers indicate the selected cutoff of GQ $\geq$ 18

Open in new tab Download slide

Finally, we evaluated the GQ ≥ 18 threshold using an LCL line from HG005 (GM26107; poly-A selected) to see how it would perform against unfiltered models and other RNA-seq variant callers. We again examined performance in high-confidence GIAB regions intersected with CDS regions. We find that DV RNA-seq [GTEx] GQ ≥ 18 is able to achieve a precision of 0.998 on SNPs and 0.989 on INDELs with similar levels of sensitivity we observed when deriving our cutoff (Table 3).

Table 3.

Open in new tab

Precision and recall in high confidence CDS regions of HG005 (GM26107)

Variant caller	Precision	Recall	F1	# FP	# TP
SNP
DV RNA-seq [GTEx] GQ $\geq$ 18	0.998	0.809	0.893	18	9077
DV RNA-seq [GTEx]	0.978	0.942	0.96	236	10 579
Platypus (Opossum)	0.96	0.93	0.945	431	10 438
GATK	0.959	0.919	0.938	444	10 301
DV WES	0.938	0.946	0.942	702	10 624
INDEL
DV RNA-seq [GTEx] GQ $\geq$ 18	0.989	0.509	0.672	1	86
DV RNA-seq [GTEx]	0.831	0.805	0.818	28	138
Platypus (Opossum)	0.696	0.604	0.646	45	103
GATK	0.404	0.805	0.538	220	149
DV WES	0.223	0.828	0.352	487	140

Variant caller	Precision	Recall	F1	# FP	# TP
SNP
DV RNA-seq [GTEx] GQ $\geq$ 18	0.998	0.809	0.893	18	9077
DV RNA-seq [GTEx]	0.978	0.942	0.96	236	10 579
Platypus (Opossum)	0.96	0.93	0.945	431	10 438
GATK	0.959	0.919	0.938	444	10 301
DV WES	0.938	0.946	0.942	702	10 624
INDEL
DV RNA-seq [GTEx] GQ $\geq$ 18	0.989	0.509	0.672	1	86
DV RNA-seq [GTEx]	0.831	0.805	0.818	28	138
Platypus (Opossum)	0.696	0.604	0.646	45	103
GATK	0.404	0.805	0.538	220	149
DV WES	0.223	0.828	0.352	487	140

Note: The DeepVariant RNA-seq [GTEx] model with a GQ ≥ 18 cutoff is listed with alternative RNA-seq variant calling methods. For each variant calling method, we show the precision, recall, F1 score, number of false positives (#FP), and number of true positives (#TP) in high-confidence GIAB regions intersected with CDS regions. Bolded values indicate the best performing variant caller for each column, grouped by variant type (SNP or INDEL).

Table 3.

Open in new tab

Precision and recall in high confidence CDS regions of HG005 (GM26107)

Variant caller	Precision	Recall	F1	# FP	# TP
SNP
DV RNA-seq [GTEx] GQ $\geq$ 18	0.998	0.809	0.893	18	9077
DV RNA-seq [GTEx]	0.978	0.942	0.96	236	10 579
Platypus (Opossum)	0.96	0.93	0.945	431	10 438
GATK	0.959	0.919	0.938	444	10 301
DV WES	0.938	0.946	0.942	702	10 624
INDEL
DV RNA-seq [GTEx] GQ $\geq$ 18	0.989	0.509	0.672	1	86
DV RNA-seq [GTEx]	0.831	0.805	0.818	28	138
Platypus (Opossum)	0.696	0.604	0.646	45	103
GATK	0.404	0.805	0.538	220	149
DV WES	0.223	0.828	0.352	487	140

Variant caller	Precision	Recall	F1	# FP	# TP
SNP
DV RNA-seq [GTEx] GQ $\geq$ 18	0.998	0.809	0.893	18	9077
DV RNA-seq [GTEx]	0.978	0.942	0.96	236	10 579
Platypus (Opossum)	0.96	0.93	0.945	431	10 438
GATK	0.959	0.919	0.938	444	10 301
DV WES	0.938	0.946	0.942	702	10 624
INDEL
DV RNA-seq [GTEx] GQ $\geq$ 18	0.989	0.509	0.672	1	86
DV RNA-seq [GTEx]	0.831	0.805	0.818	28	138
Platypus (Opossum)	0.696	0.604	0.646	45	103
GATK	0.404	0.805	0.538	220	149
DV WES	0.223	0.828	0.352	487	140

3.6 Runtime

We examined the runtime performance of our DeepVariant RNA-seq models against existing approaches (Supplementary Fig. S8). All runtimes were measured using GTEx RNA-seq BAM files on Google Cloud n1-standard-64 virtual machines (64 cores, 240 Gb ram). Using Opossum preprocessed bams generally reduced the required runtime of variant callers, but would add an additional runtime of 15.6 min to preprocess BAM files. DeepVariant RNA-seq models are faster than GATK (Table 4). Although Platypus runs faster than DeepVariant, running Platypus requires preprocessing with Opossum. The total runtime of Platypus + Opossum results in a similar runtime to DeepVariant RNA-seq models without Opossum preprocessing (Table 4).

Table 4.

Open in new tab

Runtime performance of variant callers and Opossum on GTEx BAMs

Model	Opossum	Caller	Total
Platypus (Opossum)	932s	479s	1410s
DV RNA-seq [GTEx]		1426s	1426s
DV RNA-seq [GIAB]		1427s	1427s
DV WES		1434s	1434s
DV WES (Opossum)	937s	1006s	1943s
DV RNA-seq [GTEx] (Opossum)	937s	1011s	1948s
DV RNA-seq [GIAB] (Opossum)	937s	1017s	1954s
GATK		2868s	2868s

Model	Opossum	Caller	Total
Platypus (Opossum)	932s	479s	1410s
DV RNA-seq [GTEx]		1426s	1426s
DV RNA-seq [GIAB]		1427s	1427s
DV WES		1434s	1434s
DV WES (Opossum)	937s	1006s	1943s
DV RNA-seq [GTEx] (Opossum)	937s	1011s	1948s
DV RNA-seq [GIAB] (Opossum)	937s	1017s	1954s
GATK		2868s	2868s

Note: The duration required to run opossum, the variant caller and the total duration are listed.

Table 4.

Open in new tab

Runtime performance of variant callers and Opossum on GTEx BAMs

Model	Opossum	Caller	Total
Platypus (Opossum)	932s	479s	1410s
DV RNA-seq [GTEx]		1426s	1426s
DV RNA-seq [GIAB]		1427s	1427s
DV WES		1434s	1434s
DV WES (Opossum)	937s	1006s	1943s
DV RNA-seq [GTEx] (Opossum)	937s	1011s	1948s
DV RNA-seq [GIAB] (Opossum)	937s	1017s	1954s
GATK		2868s	2868s

Model	Opossum	Caller	Total
Platypus (Opossum)	932s	479s	1410s
DV RNA-seq [GTEx]		1426s	1426s
DV RNA-seq [GIAB]		1427s	1427s
DV WES		1434s	1434s
DV WES (Opossum)	937s	1006s	1943s
DV RNA-seq [GTEx] (Opossum)	937s	1011s	1948s
DV RNA-seq [GIAB] (Opossum)	937s	1017s	1954s
GATK		2868s	2868s

Note: The duration required to run opossum, the variant caller and the total duration are listed.

4 Discussion

This study introduces DeepVariant RNA-seq, an extension of DeepVariant that enables germline variant calling using Illumina short-read RNA-seq data. To the best of our knowledge, this is the first deep-learning-based RNA-seq germline variant caller. However, an existing tool called SmartRNASeqCaller (Bosio et al., 2019) is a machine-learning-based post-processing pipeline that operates on variants called with germline RNA-seq variant callers such as GATK. DeepVariant RNA-seq can operate in a standalone manner, without requiring pre- or postprocessing.

Additionally, we have demonstrated that a ‘silver-truth set’ can be used to generate highly accurate models when used appropriately. The GTEx-based model was trained using label variants derived from the DeepVariant whole-genome sequencing (DV WGS). Note that we only use a fraction of these labels for training where we observe adequate RNA-seq coverage. Although these labels are likely of poorer quality than GIAB-based truth sets, we observed that this approach allowed for a substantially larger and more diverse training dataset than the GIAB-based model. This approach ultimately yielded improved performance compared with the GIAB-based training approach, suggesting that a silver truth may be appropriate in cases which can dramatically expand gold standard datasets of limited size.

Our examination of RNA editing events suggests our model learns to ignore these types of events, because it is trained on germline variant calls only. Conversely, this suggests that we could train a model capable of calling these RNA-editing events by configuring our model to produce an additional output (e.g. probability of a site being an RNA editing event). Such functionality could integrate germline and RNA-editing events into a single variant call format file, and allow for investigations of RNA editing and germline variant interactions. Consider the result when a heterozygous variant occurs adjacent to an RNA editing event within a codon. For example, [heterozygous genotype: T/C] + [RNA edit: A/G] + [C] could produce four different protein isoforms: UAC (Tyr), UGC (Cys), CAC (His) and CGC (Arg). Although CDS editing is rare, this extension of our work would allow for these types of events to be identified, in addition to long-range combinatorial effects of genotype and RNA edits.

We expect our models to be sensitive to sample type, library preparation and/or bioinformatic methods. For example, we observed considerable differences in terms of performance when Opossum was applied to preprocess RNA-seq bams. Similarly, we expect using alternative RNA-seq aligners (other than STAR) may impact results. We will note, however, that DeepVariant performs a realignment step that may mitigate this effect somewhat.

By releasing the DeepVariant RNA-seq models, we hope to provide reliable methods to add RNA-seq variant detection to typical RNA analysis use cases. We welcome community feedback to further enhance DeepVariant RNA-seq.

Acknowledgements

We thank the anonymous reviewers for their reviews and valuable feedback. We are grateful for the data and resources made available from the Genome-in-a-Bottle (GIAB) consortium and the Genotype-Tissue Expression (GTEx) Project. The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH and NINDS.

Author contributions

Daniel E. Cook (Data curation, Formal analysis, methodology, software, validation, visualization, writing—original draft, writing—review & editing [equal]), Aarti Venkat (data curation, formal analysis, methodology, validation, visualization [equal]), Dennis Yelizarov (formal analysis, visualization [equal]), Yannick Pouliot (data curation, formal analysis, validation [equal]), Pi-Chuan Chang (methodology, software, supervision, writing—review & editing [equal]), Andrew Carroll (conceptualization, supervision, writing—review & editing [equal]), and Francisco M. De La Vega (conceptualization, supervision, writing—original draft, Writing—review & editing [equal])

Software and data availability

The DeepVariant RNA-seq GTEx model and documentation are available at https://github.com/google/deepvariant. See the RNA-seq case study for a tutorial on how to use the RNA-seq model. URLs for GIAB RNA-seq FASTQs and BAMs are listed in Supplementary Table S5.

Funding

This work was supported by Google LLC and Tempus.

Conflict of Interest: DEC, DY, PCC, and AC are employees of Google LLC and own Alphabet stock as part of the standard compensation package. AV, YP, and FMDLV are employees of Tempus Labs.

References

Bahn

J.H.

et al. (

2012

)

Accurate identification of A-to-I RNA editing in human by transcriptome sequencing

Genome Res

142

–

150

Month:	Total Views:
June 2023	124
July 2023	280
August 2023	244
September 2023	224
October 2023	216
November 2023	289
December 2023	248
January 2024	311
February 2024	250
March 2024	349
April 2024	206
May 2024	229
June 2024	221
July 2024	216
August 2024	252
September 2024	188
October 2024	303
November 2024	256
December 2024	170
January 2025	284
February 2025	253
March 2025	303
April 2025	213

Article Contents

A deep-learning-based RNA-seq germline variant caller

Abstract

1 Introduction

2 Methods

2.1 Software implementation

2.2 Model training and evaluation

3 Results

3.1 DeepVariant RNA-seq produces accurate germline variant calls

3.2 Comparison of DeepVariant RNA-seq to other open source callers

3.3 Factors impacting RNA-seq model performance

3.4 Our model learns to ignore RNA-editing events

3.5 Selecting a cutoff for practical implementation

3.6 Runtime

4 Discussion

Acknowledgements

Author contributions

Software and data availability

Funding

References

Author notes

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only