Figure 1.
(A) The framework of ViraLM. The input sequence is tokenized and fed into the transformer block; the binary classification layer then aggregates the output of the transformer block to generate the final prediction. (B) The performance of each tool on contigs of various lengths, where negative samples consist only of prokaryotic sequences (bacteria, archaea, and plasmids). (C) Comparison of performance on prokaryotic (bacteria, archaea, and plasmids) and eukaryotic (fungi, protozoa, insects, bats, and humans) genomes. (D) The performance of each tool in distinguishing viruses from eukaryotic contigs (fungi, protozoa, insects, bats, and humans). (E) Sensitivity of virus identification on contigs with various protein densities, grouped by contig length. X-axis: percentage of viruses identified (sensitivity). Y-axis: number of proteins.
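As an illustration of how the quantity plotted in panel (E) could be computed, the sketch below groups known viral contigs by length bin and protein count and reports the fraction identified as viral in each group (sensitivity, i.e. true positives over all true viral contigs). The function name `sensitivity_by_group`, the record layout, and the length-bin labels are hypothetical and are not taken from the ViraLM implementation.

```python
from collections import defaultdict


def sensitivity_by_group(records):
    """Compute per-group sensitivity for known viral contigs.

    records: iterable of (length_bin, n_proteins, predicted_viral) tuples,
    one per true viral contig (hypothetical layout, for illustration only).
    Returns a dict mapping (length_bin, n_proteins) to the fraction of
    contigs in that group that were predicted as viral.
    """
    hits = defaultdict(int)    # true positives per group
    totals = defaultdict(int)  # all true viral contigs per group
    for length_bin, n_proteins, predicted_viral in records:
        key = (length_bin, n_proteins)
        totals[key] += 1
        if predicted_viral:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


# Toy data with made-up length bins and predictions (not real results).
example = [
    ("short", 0, True),
    ("short", 0, False),
    ("short", 1, True),
    ("long", 2, True),
]
result = sensitivity_by_group(example)
```

Each key of `result` corresponds to one bar in a plot of this kind: a contig-length group combined with a protein count.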