Qing Li, Zhihang Hu, Yixuan Wang, Lei Li, Yimin Fan, Irwin King, Gengjie Jia, Sheng Wang, Le Song, Yu Li, Progress and opportunities of foundation models in bioinformatics, Briefings in Bioinformatics, Volume 25, Issue 6, November 2024, bbae548, https://doi.org/10.1093/bib/bbae548
Abstract
Bioinformatics has undergone a paradigm shift in artificial intelligence (AI), particularly through foundation models (FMs), which address longstanding challenges in bioinformatics such as limited annotated data and data noise. These AI techniques have demonstrated remarkable efficacy across various downstream validation tasks, effectively representing diverse biological entities and heralding a new era in computational biology. The primary goal of this survey is to conduct a general investigation and summary of FMs in bioinformatics, tracing their evolutionary trajectory, current research landscape, and methodological frameworks. Our primary focus is on elucidating the application of FMs to specific biological problems, offering insights to guide the research community in choosing appropriate FMs for tasks like sequence analysis, structure prediction, and function annotation. Each section delves into the intricacies of the targeted challenges, contrasting the architectures and advancements of FMs with conventional methods and showcasing their utility across different biological domains. Further, this review scrutinizes the hurdles and constraints encountered by FMs in biology, including issues of data noise, model interpretability, and potential biases. This analysis provides a theoretical groundwork for understanding the circumstances under which certain FMs may exhibit suboptimal performance. Lastly, we outline prospective pathways and methodologies for the future development of FMs in biological research, facilitating ongoing innovation in the field. This comprehensive examination not only serves as an academic reference but also as a roadmap for forthcoming explorations and applications of FMs in biology.
Introduction
Bioinformatics endeavors to extract meaningful insights from diverse biological data sources such as amino acid sequences, protein structures, single-cell transcriptomics, biomedical text, and images. These efforts underpin critical applications including disease detection, drug design, and novel therapy discovery. However, conventional approaches often require extensive customization to specific datasets, an effort that has spanned decades [1, 2]. Artificial intelligence (AI), powered by increasing data availability and computational resources, presents an alternative avenue for elucidating biological phenomena through deep learning mechanisms [3] such as the multilayer perceptron (MLP) for nonlinear features [4], the convolutional neural network (CNN) for image features [5], the recurrent neural network for time-series features [6], the transformer for natural language features [7], the graph neural network (GNN) for graph-represented features [8], and the graph attention network for graph features weighted by learned attention [9]. Foundation models (FMs) are inherently versatile, pretrained on a wide range of data to serve multiple downstream tasks without requiring reinitialization of parameters. This broad pretraining, which focuses on universal rather than task-specific learning goals, ensures adaptability in fine-tuning, few-shot, or zero-shot scenarios, significantly enhancing performance and garnering growing attention within the AI community [10]. General FMs undergo pretraining on digital data and are subsequently fine-tuned for specific computer applications of interest [Fig. 1 (i)]. They have emerged as the state-of-the-art approach for question answering [11], video games [12], AI education [13], medical AI [14], and other applications in computer science.
Figure 1: FMs in artificial intelligence (AI) and bioinformatics. (i) FMs in AI. General FMs are predominantly pretrained on diverse digital data and fine-tuned for various computer applications, such as question-answering systems, image design, and computer games. (ii) FMs in bioinformatics. FMs in bioinformatics primarily focus on core biological problems including biological sequence analysis, biological structure construction, and biological function prediction, encompassing both annotated and unannotated biological datasets. They can undergo pretraining on multiple phases of biological data for diverse downstream tasks. Based on the pretraining architecture, FMs can be classified into discriminative FMs, which capture complex patterns and relationships within annotated data through masking strategies for classification or regression tasks, and generative FMs, which focus on generating semantic features and context from novel data or predicting their associations. (iii) Deep learning modules. Deep learning modules are the cornerstone of building encoders and decoders in FMs. Commonly used modules include MLP, CNN, AutoEncoder (input is consistent with output), GCN (input with graph structure), and transformer (rectangle represents attention mechanism). All these modules can be trained in an end-to-end manner, enhancing computational efficiency through parallel processing.
Recently, FMs have revealed immense potential in bioinformatics. Compared with large language models (LLMs) that extract meaningful insights from biological texts using natural language processing (NLP) techniques, FMs provide the essential groundwork for domain-specific applications and serve as building blocks for specialized LLMs. They tailor LLMs to biological problems through fine-tuning, narrowing the representation gap between the general domain and the biological domain. A key strength of FMs lies in their capacity to learn accurate representations of intricate biological datasets. This is facilitated by data-intensive pretraining, which researchers can readily exploit for downstream tasks with limited data through fine-tuning mechanisms (e.g. transfer learning for biological targets of varying scale on a pretrained FM). This flexibility allows researchers to employ pretrained embeddings obtained by others to solve their own target problems. In terms of reproducibility, discriminative FMs offer more robust generalization in translating experimental inputs to outputs. Their flexibility and robustness surpass conventional methods, giving researchers an advantage in deciphering complex biological systems. Moreover, generative FMs can significantly improve biological performance by deeply probing a plethora of underutilized resources that are either unannotated or laden with noise [15]. Their robust and reliable nature, strong exploration and exploitation capacities, and flexible adaptability to diverse downstream tasks make FMs a compelling approach for addressing various biological challenges, including predicting unknown 3D structures or function annotations from sequences [16], discovering rare diseases from limited data [17], and analyzing data affected by noise or overlap in bioinformatics.
Despite the widespread adoption of FMs across various research fields and industries, a significant gap remains in the thorough comprehension of FMs in bioinformatics. Previous approaches have often relied on general domains of text/image/video/graph to analyze targets with the help of NLP, image processing, or graph learning methods [18]. However, these methods often fall short of meeting the needs of biological researchers. There are at least three critical challenges that hinder the effective application of FMs to biological problems. Firstly, general FMs, designed primarily for computer applications without biological expertise requirements, frequently suffer from model overfitting in bioinformatics. Secondly, most FMs heavily depend on versatile large-scale training datasets, which may not be applicable to domain-specific biological data [19]. As illustrated by the futility theorem, predicted binding sites may not translate into functional outcomes in vivo, despite a high likelihood of binding to sequences in vitro [20]. In many cases, solving a specific biological problem requires incorporating specialized insights derived from biological domain knowledge. Finally, the nonlinear deep features extracted by FMs for biological tasks may face challenges regarding biological interpretability and model reliability due to their complex structures and the diverse nature of biological targets.
In this context, a review that encapsulates the state-of-the-art applications of FMs in bioinformatics is valuable and necessary to bridge the existing gap in a comprehensive understanding of these models, shedding light on their operational mechanisms, the contributions they have made, and the challenges and opportunities they present for future endeavors. This review specifically focuses on macromolecules rather than chemical micromolecules, addressing core biological problems such as sequence analysis, structure construction, and function prediction. In response to these challenges, we provide an extensive survey of FMs in bioinformatics. These models, trained in either a discriminative or generative manner, exhibit versatility in their applicability to downstream tasks, including biological multimedia analysis, core biological problems, scMultiomics analysis, and the integration of multimodal biological data. These challenges are intricately tied to various types of biological data, including DNA, RNA, proteins, and single-cell multi-omics data, as well as knowledge graphs/networks, and text/image data, as depicted in Table 1. Further, the evolutionary trajectory of FMs in bioinformatics, as presented in Fig. 2, underscores the development of network structures within deep learning, general FMs in AI, and FMs in bioinformatics with remarkable milestones.
Biological problems and their associated data in biological FMs. This table provides an overview of five distinct problems in bioinformatics addressed by biological FMs: multimedia analysis, core biological problems (sequence analysis, structure construction, function prediction), single-cell multi-omics (scMultiomics) analysis, and multimodal integration. These problems involve one or more categories of biological data, including DNA, RNA, proteins, scMultiomics, knowledge graphs/networks, and biological text/images. Biological multimedia analysis primarily focuses on biological text/image or video data. Core biological problems involve genes and mutations, biological phenomena data, and their relations and interactions. Multimodal integration biological problems may utilize multiple data types, such as biomedical text/images and proteins.
| Problems/Data | Multimedia analysis | Sequence analysis | Structure construction | Function prediction | Multimodal integration |
|---|---|---|---|---|---|
| DNA | √ | √ | | | |
| RNA | √ | √ | | | |
| Protein | √ | √ | √ | √ | |
| scMultiomics | √ | √ | √ | √ | |
| KGs/Net. | √ | √ | | | |
| Text/Image | √ | √ | √ | | |
Figure 2: Timeline of FMs in bioinformatics and their background in deep learning. The emergence of FMs in bioinformatics coincided with the ascent of deep learning, gaining significant momentum as these models showcased remarkable advancements in the era of big data. Landmark achievements such as AlphaGo, the first program to defeat top professional Go players, significantly enriched the landscape of deep learning. Subsequent developments, exemplified by AlphaFold and AlphaFold2, revolutionized protein structure prediction from biological sequences. The introduction of GPT-4 marked a pivotal moment, catalyzing a surge in the application of FMs. These strides propelled FMs (including discriminative and generative FMs) in bioinformatics to acquire salient information for practical applications in biology.
This review delivers an in-depth exploration of recent advancements and challenges related to FMs, with a focus on cultivating a comprehensive understanding of biological issues. While the primary shift involves transitioning FMs from general domains to specialized biological multimedia domains, the review primarily concentrates on three core biological problems critical for analyzing the sequence, structure, and function of biological targets. Additionally, it highlights the significance of analyzing single-cell multi-omics and addressing multimodal integration biological problems, which involve the integration of multiple types of biological data to enhance performance further. The review concludes by deliberating on potential directions in light of current challenges and opportunities. In summary, the exploration of recent FMs in bioinformatics is structured across the following subsections: (i) FM architectures; (ii) biological FMs tailored for five types of biological problems, including the introduction of distinct problems and datasets, data preprocessing, and downstream tasks; (iii) challenges and opportunities; and (iv) conclusions.
Foundation model architectures
FMs first emerged in NLP [21] and subsequently permeated computer vision and various other domains of deep learning [22]. In bioinformatics, FMs trained on massive biological data offer unparalleled predictive capabilities through fine-tuning mechanisms. Based on their pretraining modules, FMs in bioinformatics can be divided into two main categories. Discriminative FMs are primarily designed to capture the semantic or biological meaning of entire sequences by constructing encoders that extract intricate patterns and relationships within annotated data through masked language modeling, resulting in meaningful embeddings. These models excel at tasks like classification and regression, where accurate predictions are derived from well-structured inputs. Generative FMs, on the other hand, rely on autoregressive methods to generate semantic features and contextual information from unannotated data. These models produce rich representations that are valuable for various downstream applications, particularly generation tasks where the model must synthesize new data based on learned patterns. The complementary strengths of discriminative and generative FMs highlight their versatility across a wide array of applications, from precise predictive modeling to creative content generation.
Discriminative pretrained foundation models
Conventional AI models are typically designed to train neural networks for specific tasks in an end-to-end manner, focusing on optimizing the model for a single task at a time. However, this approach often lacks generalizability and requires significant retraining when applied to new tasks. The advancements in NLP and word embeddings introduced a significant shift in this paradigm, exemplified by BERT (Bidirectional Encoder Representations from Transformers), which marked a breakthrough in embedding capabilities. BERT demonstrated substantial improvements in tasks like summarization, semantic understanding, and sentence classification, underscoring its ability to capture the overall semantic meaning of sequences.
As a discriminative model, BERT [23] leverages variations of masked language modeling during pretraining, where a portion of tokens is masked, and the model is trained to predict these masked tokens. The corresponding loss function is typically cross-entropy, applied to the masked tokens. This bidirectional context allows embeddings to fully capture the semantic nuances of a sequence. Discriminative pre-trained foundation models are particularly effective for classification and regression tasks, enabling a deeper understanding of complex biological processes in both normal and pathological states. These models are typically structured with encoder-only deep learning architectures that employ masking strategies on labeled inputs—such as words and characters—to extract relevant features. These features are then processed through self-attention mechanisms to effectively capture and interpret intricate relationships within the data, enhancing model performance across various applications.
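To make the masking objective concrete, the following is a minimal sketch of BERT-style masked-token pretraining in PyTorch. It is an illustrative toy (generic transformer encoder, random 15% masking, toy vocabulary), not the configuration of any published model: a fraction of tokens is replaced with a mask token and cross-entropy is computed only at the masked positions.

```python
import torch
import torch.nn as nn

VOCAB, PAD_ID, MASK_ID = 30, 0, 1  # toy vocabulary (e.g. residues plus special tokens)

class TinyEncoder(nn.Module):
    """Generic encoder-only model: embeddings -> transformer -> token-level logits."""
    def __init__(self, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)  # predicts the original token at each position

    def forward(self, tokens):                 # tokens: (batch, seq)
        return self.head(self.encoder(self.embed(tokens)))

def mlm_loss(model, tokens, mask_prob=0.15):
    """Mask a random fraction of tokens and score predictions only at masked positions."""
    mask = (torch.rand(tokens.shape) < mask_prob) & (tokens != PAD_ID)
    corrupted = tokens.clone()
    corrupted[mask] = MASK_ID
    logits = model(corrupted)                  # (batch, seq, vocab)
    targets = tokens.clone()
    targets[~mask] = -100                      # positions ignored by cross-entropy
    return nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), targets.reshape(-1), ignore_index=-100)

tokens = torch.randint(2, VOCAB, (8, 32))      # eight toy sequences of length 32
loss = mlm_loss(TinyEncoder(), tokens)
loss.backward()
```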
Building on BERT’s success, discriminative pretrained FMs have been adapted for specialized domains. For example, in the biomedical field, models like BioBERT [24], BLURB [25], and DNABERT [26] extend BERT’s pipeline to pretrain encoders specifically on biomedical text corpora. These models are designed to capture correlations and associations within large-scale biomedical data, supporting a wide range of downstream tasks such as entity recognition, relation extraction (RE), document classification, and question answering. ProteinBERT [27], for instance, replaces tokens and injects false annotations during pretraining, compelling the model to recover the correct sequence information even under challenging conditions. These advancements illustrate how discriminative pretrained FMs continue to evolve, offering robust solutions for increasingly complex and domain-specific applications.
Generative pretrained foundation models
Generative pretrained foundation models focus on generating coherent sequences by modeling the underlying distribution of the data. Unlike discriminative models, which primarily concentrate on understanding and classifying inputs, generative models are designed to predict the next token in a sequence, making them suitable for tasks that involve generating new content, such as text completion, translation, and content creation. These models are typically trained with autoregressive techniques, where each token is predicted from the previously generated tokens. A well-known example is GPT (Generative Pretrained Transformer), which generates text by predicting each word one at a time, conditioned on all the previous words. In bioinformatics, generative models can be fine-tuned to produce meaningful sequences, such as protein or RNA structures, from a given prompt. By learning from vast unannotated datasets, these models can generate outputs that are contextually rich and informative for various downstream applications. Generative pretrained FMs excel at capturing complex relationships within data, making them versatile tools for a wide array of tasks beyond classification, including synthesis, creativity, and exploratory data generation. Generative pretrained FMs usually adopt decoder-only modules, such as GPT-3 [28] and GPT-4V [29], or encoder–decoder modules, such as ESM-2 [30], ESM3 [31], and T5 models [32], and are adept at simultaneously understanding and generating content. These tasks are achieved by optimizing an autoregressive blank-filling objective over bidirectional inputs, a permuted language modeling approach [33]. For instance, the encoder–decoder model Enformer [34] predicts promoter–enhancer interactions directly from DNA sequences by enlarging the receptive field of its convolutional layers. ProtST [35] integrates mask prediction and representation alignment into the pretraining task, facilitating the modeling of paired protein sequences and textual descriptions. ESM3 demonstrates the potential of transformers to generate protein sequences and structures with new functions by training on data produced by evolution. Among decoder-only models, ProtGPT2 [36] generates protein sequences that exhibit amino acid and disorder properties comparable to those of natural proteins yet remain distinct from the existing protein space. Additionally, xTrimoPGLM [37] employs four distinct masking strategies to redesign the sequence of complementarity-determining region 3 (CDR3).
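For contrast with the masking objective above, the sketch below shows the autoregressive (next-token) loss that GPT-style generative FMs optimize: each position is asked to predict the token that follows it. The decoder `model` is assumed to return per-position logits with causal masking handled internally; it stands in for any generative FM rather than a specific published architecture.

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(model, tokens, pad_id=0):
    """Next-token prediction: condition on tokens[0..T-2], predict tokens[1..T-1]."""
    logits = model(tokens[:, :-1])              # (batch, seq-1, vocab), causal attention inside
    targets = tokens[:, 1:].clone()             # shift targets left by one position
    targets[targets == pad_id] = -100           # padding positions are ignored
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), ignore_index=-100)
```

At generation time the same model is applied iteratively: each sampled token is appended to the input and the model is queried again, which is how ProtGPT2-style models emit new sequences token by token.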
As shown in Fig. 1 (iii), FMs are assembled from various neural network modules. The MLP, a feedforward network with multiple hidden layers, is suitable for regression and classification tasks. CNNs, such as ResNet, excel at processing grid-like data for vision tasks. The graph convolutional network (GCN) handles graph-structured data, such as molecular graphs, by aggregating information from neighbors. The AutoEncoder reduces data to low-dimensional representations using encoder and decoder architectures. The transformer, initially proposed for NLP tasks, combines positional encodings, self-attention, and multihead attention mechanisms. FMs leverage these diverse modules for specific biological tasks. For example, the encoder–decoder model Bingo uses a transformer and a GNN for protein function prediction. Decoder-only models such as ProGen, xTrimoPGLM, and CLAPE-DB [38] combine transformers with CNNs for protein sequence, structure, and function analysis. In contrast, encoder-only models, e.g. BioBERT and HyenaDNA, use either a transformer or an MLP paired with a CNN to address biological text and DNA analysis tasks.
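As a small illustration of how these modules are composed, the sketch below stacks a CNN (for local sequence motifs) under a transformer encoder (for long-range dependencies), mirroring the CNN-plus-transformer combinations mentioned above. All hyperparameters are illustrative assumptions rather than settings from any cited model.

```python
import torch
import torch.nn as nn

class ConvTransformerEncoder(nn.Module):
    def __init__(self, vocab=8, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=9, padding=4)  # local motif detector
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)          # global context

    def forward(self, tokens):                                    # tokens: (batch, seq)
        x = self.embed(tokens)                                    # (batch, seq, d_model)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)          # Conv1d expects (batch, channels, seq)
        return self.transformer(x)                                # (batch, seq, d_model)

out = ConvTransformerEncoder()(torch.randint(0, 8, (2, 100)))
print(out.shape)  # torch.Size([2, 100, 128])
```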
Remarkably, the choice of pretraining strategy holds significant importance in attaining optimal performance [39]. For example, CLAPE-DB combines a pretrained protein language model with contrastive learning to predict DNA-binding residues [40]. Similarly, HyenaDNA uses a sequence length scheduling technique to stabilize model pretraining and leverages longer contexts to better adapt to novel tasks [41]. Both discriminative and generative pretrained FMs update and pretrain their networks via back-propagation, driven by the discrepancy between target variables and their estimates.
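The sequence length scheduling mentioned for HyenaDNA can be pictured as a simple curriculum over context length: pretraining begins with short windows and the window grows as training progresses. The schedule below is a simplified, assumed illustration of that idea; the stage size and doubling rule are not taken from the paper.

```python
def length_schedule(step, start_len=1024, max_len=65536, steps_per_stage=10000):
    """Double the context length every `steps_per_stage` optimizer steps, up to a cap."""
    return min(start_len * (2 ** (step // steps_per_stage)), max_len)

# Inside a pretraining loop, each batch would be truncated to the current context:
#   ctx = length_schedule(step)
#   loss = causal_lm_loss(model, batch[:, :ctx])
```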
Tuning with foundation models
FMs, particularly in the context of LLMs and biological applications, provide several strategies for interaction depending on the specific needs of a task. These strategies range from zero-shot approaches requiring no additional training to more sophisticated methods that involve fine-tuning or conditional training [24, 25, 35, 41]. Understanding these mechanisms is key to maximizing the effectiveness of FMs across various applications, including bioinformatics.
Zero-shot learning (no-tuning)
Zero-shot learning enables FMs to perform tasks without any additional training or fine-tuning [25, 42]. In this approach, the model leverages the knowledge it has acquired during pretraining to handle entirely new tasks by relying solely on prompts or queries provided by the user. For instance, in a biological FM like Geneformer [17], BioBERT [24], or scGPT [42], zero-shot learning might be applied to infer the function of a protein sequence based solely on pre-existing general knowledge encoded in the model. This approach is particularly useful when there are limited annotated data available for fine-tuning, as it capitalizes on the model’s broad understanding of biological concepts.
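A minimal zero-shot workflow might look like the sketch below: embeddings are extracted from a pretrained protein language model with no gradient updates, and a query sequence inherits the label of its nearest annotated neighbour. The Hugging Face checkpoint name, the mean-pooling step, and the nearest-neighbour transfer are illustrative assumptions, not the protocol of any paper cited here.

```python
import torch
from transformers import AutoTokenizer, AutoModel

NAME = "facebook/esm2_t6_8M_UR50D"          # assumed publicly available protein-LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = AutoModel.from_pretrained(NAME).eval()

def embed(seq: str) -> torch.Tensor:
    """Mean-pooled residue embeddings; no fine-tuning, so this is zero-shot."""
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # (1, tokens, d_model)
    return hidden.mean(dim=1).squeeze(0)

query = embed("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
references = {"hydrolase": embed("MSTNPKPQRKTKRNTNRRPQDVKFPGG"),
              "kinase": embed("MGSNKSKPKDASQRRRSLEPAENVHGA")}
# Assign the query the label of the most similar annotated sequence.
label = max(references, key=lambda k: torch.cosine_similarity(query, references[k], dim=0))
print(label)
```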
Few-shot fine-tuning
Few-shot fine-tuning involves providing the model with a small number of labeled examples specific to a task [26, 27]. The model then adapts to the task through minimal additional training. In biological contexts, few-shot fine-tuning is valuable when a model needs to be specialized for niche applications, such as identifying rare genetic mutations or predicting highly specific protein–protein interactions (PPIs). Biological FMs like GMAI [14], BLURB [25], and Enformer [34] can be fine-tuned with just a few examples, making this approach both efficient and effective for tasks where obtaining large datasets is challenging.
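In code, few-shot fine-tuning often amounts to freezing the pretrained encoder and training only a small task head on the handful of labelled examples. The sketch below assumes per-sequence embeddings have already been produced by a frozen FM (for instance with an `embed` function like the one above); the embedding size, labels, and number of epochs are illustrative.

```python
import torch
import torch.nn as nn

D_MODEL, N_CLASSES = 320, 2
head = nn.Linear(D_MODEL, N_CLASSES)                      # only these weights are trained
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

# A handful of labelled examples: (frozen FM embedding, class label)
few_shot = [(torch.randn(D_MODEL), 0), (torch.randn(D_MODEL), 1),
            (torch.randn(D_MODEL), 0), (torch.randn(D_MODEL), 1)]

for _ in range(50):                                       # brief additional training
    for x, y in few_shot:
        loss = nn.functional.cross_entropy(head(x).unsqueeze(0), torch.tensor([y]))
        optimizer.zero_grad(); loss.backward(); optimizer.step()
```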
Conditional training and prompt engineering
Conditional training involves adapting foundation models using carefully designed prompts or instructions that guide the model’s output based on task-specific conditions [35, 41]. In the realm of large language models and bioinformatics, this could mean designing prompts that instruct the model to generate hypotheses about molecular structures or to predict biological interactions based on a given sequence. Biological FMs such as ProtST [35], HyenaDNA [41], and ProGen [43] can be used in this approach, often combined with few-shot learning, to allow for more controlled outputs that are aligned with the desired application. Prompt engineering and conditional training are particularly effective when nuanced control over the model’s response is required.
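The sketch below illustrates prompt-conditioned generation: a control tag describing the desired property is prepended to the input, and a generative FM continues the sequence. The checkpoint is a generic stand-in and the tag format is an invented example of the idea; ProGen-style control tags are not literally spelled this way.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

NAME = "gpt2"                                     # stand-in for a generative biological FM
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForCausalLM.from_pretrained(NAME).eval()

prompt = "[FAMILY=lysozyme] [FUNCTION=hydrolase] M"   # condition + start of the sequence
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(out[0]))
```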
Foundation models for biological problems
To apply FMs in bioinformatics appropriately, we first outline the biological problems and datasets, together with the relevant data preprocessing and downstream tasks. As this review concentrates on biological macromolecules (DNA, RNA, and proteins), single-cell genomics, knowledge graphs/networks, and text/images, the FMs described in this part are generally employed to solve problems in macromolecular biology. We introduce FMs as versatile tools capable of addressing practical biological problems, including multimedia analysis, sequence analysis, structure construction, function prediction, single-cell multi-omics analysis, and multimodal integration. Each of these areas represents a unique challenge within bioinformatics, and FMs provide innovative approaches to these complex issues. Recent FMs in bioinformatics are summarized in Tables 2 and 3.
A summary of foundation models in bioinformatics. This table summarizes the model categories, targets, deep module types, and technical advancement of FMs for tackling biological problems (MA, multimedia analysis; SA, sequence analysis; SC, structure construction; FP, function prediction; MI, multimodal integration). FMs are categorized by their pretraining process: DM, discriminative model; GM, generative model. Target biological data types include DNA, RNA, protein, scMultiomics, biomedical text/image/video, and knowledge graph/network. Various deep modules enhance the performance or interpretability of FMs, such as MLP, multilayer perceptron; CNN, convolutional neural network; GNN, graph neural network, and transformer.
| Model name | Biological problem | Model category | Targets | Deep module type | Technical advancement | Author name, publication year |
|---|---|---|---|---|---|---|
| BioBERT | MA | DM | Biomedical text | Transformer | Adapt for biomedical corpora by pretraining BERT on large-scale biomedical corpora | Lee et al., 2020 |
| BioELECTRA | MA | DM | Biomedical text | Transformer | A biomedical domain-specific language model introducing a replaced-token prediction pretraining task with generator and discriminator networks | Kanakarajan et al., 2021 |
| BLURB | MA | DM | Biomedical text | Transformer | Pretrain a biomedical language model from scratch for a wide range of biomedical NLP tasks instead of using complex tagging schemes | Gu et al., 2021 |
| BioBART | MA | GM | Biomedical text | Transformer | A bidirectional and auto-regressive generative language model for biomedical natural language generation tasks along with corresponding data | Yuan et al., 2022 |
| Med-PaLM | MA | GM | Biomedical text | Transformer | Introduce the HealthSearchQA dataset, propose a human evaluation framework, and present instruction prompt tuning for aligning LLMs to new domains using a few exemplars | Singhal et al., 2023 |
| MSA | MA | GM | Biomedical graph | MLP | A medical image segmentation model that fine-tunes the pretrained SAM by integrating medical-specific domain knowledge | Wu et al., 2023 |
| GMAI | MA | GM | Biomedical text, graph, video | Transformer | Adapts to new tasks by accepting inputs and producing outputs with varying combinations of data modalities | Moor et al., 2023 |
| DNABERT | SA, FP | DM | DNA | Transformer | Use pretrained bidirectional encoder representations to capture a global and transferable understanding of genomic DNA sequences | Ji et al., 2021 |
| Enformer | SA | GM | DNA | Transformer | Use a larger receptive field to improve gene expression and promoter–enhancer interaction prediction | Avsec et al., 2021 |
| HyenaDNA | SA, SC | DM | DNA | MLP/CNN | Use a sequence length scheduling technique to stabilize training and leverage longer context length to adapt to novel tasks | Nguyen et al., 2023 |
| Nucleotide Transformer | SA | DM | DNA | Transformer | Build and pretrain foundational language models in genomics across different genomic datasets and parameter sizes | Dalla-Torre et al., 2023 |
| ProteinBERT | SA, SC | DM | Protein | Transformer | Pretrain a protein language model with a gene ontology annotation prediction task for both local and global representations | Brandes et al., 2022 |
| ProtGPT2 | SA, SC | GM | Protein | Transformer | A generative language model trained on protein space to learn the protein language and produce sequences sampling any region | Ferruz et al., 2022 |
| DNABERT-2 | SA, FP | DM | DNA | Transformer | Adapt byte pair encoding to improve computational efficiency and employ multiple strategies to overcome input length constraints | Zhou et al., 2023 |
| ProGen | SA, SC | GM | Protein | CNN/Transformer | A protein language model trained on millions of raw protein sequences that generates artificial proteins across multiple families and functions | Madani et al., 2023 |
| xTrimoPGLM | SA, SC, FP | GM | Protein | CNN/Transformer | A pretraining framework to address protein understanding and generation tasks with joint optimization of the two types of objectives | Chen et al., 2023 |
| CLAPE-DB | SA, SC | DM | Protein | CNN/Transformer | Combines a pretrained protein language model and contrastive learning to predict DNA-binding residues | Liu et al., 2023 |
| Geneformer | SA, FP | DM | scMultiomics | Transformer | A context-aware, attention-based deep learning model pretrained on a large-scale corpus that can be transferred to diverse fine-tuning tasks | Theodoris et al., 2022 |
| scGPT | SA, SC, FP | GM | scMultiomics | Transformer | A single-cell foundation model built through generative pretraining on over 10 million cells stored in an in-memory data structure | Cui et al., 2023 |
| ESM-1b | SC, FP | GM | Protein | Transformer | Use an unsupervised deep language model to acquire protein structure and function directly from 250 million protein sequences | Rives et al., 2021 |
| AlphaFold2 | SC | DM | Protein | Transformer | Improve AlphaFold by employing an SE(3)-equivariant transformer with an attention mechanism to represent residue interactions and distances | Jumper et al., 2021 |
| RGN2 | SC | DM | Protein | Transformer | Combine a differentiable recurrent geometric network (RGN) with a transformer-based AminoBERT protein language model to generate backbone structures from unaligned proteins before refinement | Chowdhury et al., 2022 |
| Uni-Mol | SC | GM | Protein | Transformer | A 3D position prediction model built on a 3D molecular pretraining framework, with candidate protein pretraining for various downstream tasks | Zhou et al., 2023 |
| RNA-FM | SC, FP | DM | RNA | Transformer | Use self-supervised learning to train on 23 million non-coding RNA sequences and infer their sequential and evolutionary information | Chen et al., 2022 |
| UNI-RNA | SC, FP | DM | RNA | Transformer | A context-aware foundation model pretrained on an unprecedented scale of RNA sequences, unraveling evolutionary and structural information | Wang et al., 2023 |
| RNA-MSM | SC, FP | GM | RNA | Transformer | An RNA language model effective at capturing evolutionary information from homologous sequences using a stack of MSA transformer blocks | Zhang et al., 2024 |
| Bingo | FP | GM | Protein | GNN/Transformer | A large language model and graph neural network (LLM-GNN) based adversarial training method for protein-coding gene prediction | Ma et al., 2024 |
| scFoundation | FP | GM | scMultiomics | Transformer | An extensive single-cell foundation model pretrained on a dataset of over 50 million single-cell data points with 100 million parameters | Hao et al., 2023 |
| scHyena | FP | GM | scMultiomics | Transformer | Full-length scRNA-seq analysis in the brain using a linear adaptor layer and a bidirectional Hyena operator without losing raw data information | Oh et al., 2023 |
| scBERT | FP | DM | scMultiomics | Transformer | Use self-supervised learning on large-scale unannotated scRNA-seq data to improve the model's generalizability and overcome batch effects | Yang et al., 2022 |
| ProtST | FP, MI | GM | Protein, biomedical text | CNN/Transformer | A pretraining framework with three tasks over both protein sequences and biomedical text to boost protein sequence understanding | Xu et al., 2023 |
| Model name | Model size | Model task | Model name | Model size | Model task |
|---|---|---|---|---|---|
BioBERT | 110 M/340 M | Biomedical text mining (NER, RE, QA) | ProGen | 1.2B | Stability prediction, remote homology detection, secondary structure prediction |
BioELECTRA | 109 M | Biomedical text mining (NER, RE, QA) | ProGen2 | 6.4B | Functional sequence generation, protein fitness prediction |
BLURB | Unknown | Biomedical NLP benchmark (QA, NER, parsing, etc.) | CLAPE-DB | Unknown | Protein–ligand-binding site prediction |
BioBART | 139 M/400 M | Biomedical text generation (dialogue, summarization, NER) | Geneformer | 30 M | Sequence-based prediction |
Med-PaLM | 12B/84B/562B | Medical question answering | scGPT | Unknown | Multibatch integration, multi-omic integration, cell-type annotation, genetic perturbation prediction, gene network inference |
MSA | 30 M/100 M | Arabic NLP tasks (NER, POS tagging, sentiment analysis, etc.) | ESM-1b | 650 M | Supervised prediction of mutational effect and secondary structure |
GMAI | Unknown | Generalist medical AI (multimodal tasks) | AlphaFold2 | 21 M | Protein structure prediction |
DNABERT | 110 M | DNA sequence prediction (promoters, TFBSs, splice sites) | AlphaFold3 | 93 M | Protein structure prediction, structure of protein–protein interaction prediction |
Enformer | Unknown | Gene expression prediction | RGN2 | 110 M | Protein design and analysis of allelic variation or disease mutations |
HyenaDNA | 7 M | Genomic sequence modeling (regulatory elements, chromatin profiles) | Uni-Mol | 1.1B | 3D position recovery, masked atom prediction, molecular property prediction |
Nucleotide Transformer | 500 M ~ 2.5B | DNA sequence analysis | RNA-FM | 99.52 M | RNA secondary structure prediction, distance regression task |
ProteinBERT | 16 M | Bidirectional language modeling of protein sequences, Gene Ontology (GO) annotation prediction | UNI-RNA | 25 M/85 M/169 M/400 M | RNA structure and function prediction |
ProtGPT | 1.6 M/25.2 M | Protein sequence generation | RNA-MSM | Unknown | RNA structure and function prediction |
ProtGPT2 | 738 M | Protein sequence generation, structural similarity detection, stability prediction | Bingo | 8 ~ 15 M | Filling in randomly masked amino acids, generating residue-level feature matrix and protein contact map |
xTrimoPGLM | 100B | Protein understanding and generation | scFoundation | 100 M | Gene expression enhancement, tissue drug response prediction, single-cell drug response classification, single-cell perturbation prediction |
DNABERT-2 | 117 M | DNA sequence prediction | scHyena | Unknown | Cell type classification, scRNA-seq imputation |
scBERT | Unknown | Single-cell RNA sequencing analysis | ProtST | 650 M | Unimodal mask prediction, multimodal representation alignment, multimodal mask prediction |
Biological problems and datasets
FMs can solve practical biological problems that fall within six categories: multimedia analysis, sequence analysis, structure construction, function prediction, single-cell multi-omics analysis, and multimodal integration. Core biological problems are sequence analysis, structure construction, and function prediction. Sequence analysis extracts salient gene information, such as binding sites (commonly encoded as position weight matrices), PPIs, and gene expression, from gene and mutation sequence data, including DNA, RNA (in which each of the four nucleotides is encoded as a one-hot vector such as [1, 0, 0, 0]), protein, and genome data (the complete genetic information of an organism). The information obtained from these models can then be utilized in various downstream tasks. For instance, gene expression embodies the functional regulation process of cells, and differences observed across single-cell genomic profiles pave the way for the discovery of new cell types. Similarly, PPIs and their analogs (such as protein–nucleic acid interactions, protein–ligand interactions, and protein–small molecule interactions) encapsulate the physical binding information between molecules, providing valuable insights into their interactions. Sequence data are often included as annotations of higher-level data that further capture their synergistic or catalytic interactions.
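As a small illustration of the one-hot encoding mentioned above, the following sketch (written for this survey, not taken from any cited model) maps a nucleotide sequence to a matrix of indicator vectors; the base-to-index mapping is an assumption.

```python
# Minimal sketch: one-hot encoding of a DNA sequence, where each nucleotide
# maps to a 4-dimensional indicator vector such as A -> [1, 0, 0, 0].
import numpy as np

NUC_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed base ordering

def one_hot_encode(seq):
    """Return a (len(seq), 4) one-hot matrix; unknown bases (e.g. 'N') stay all-zero."""
    mat = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = NUC_INDEX.get(base)
        if j is not None:
            mat[i, j] = 1.0
    return mat

print(one_hot_encode("ACGTN"))
```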
Structure construction focuses on predicting the structures of proteins and RNA, from the secondary structure to the quaternary structure, based on the primary structure (the linear sequence of amino acids in a peptide or protein); the secondary structure contains α-helix, β-sheet with three strands, β-bend, Ω-loop, random coil architecture, and topology targets; the tertiary structure involves hydrogen bonds, hydrophobic interactions, and tertiary contacts; and the quaternary structure represents the assembly of multiple chains into a complex molecule. As these structures can be represented by statistical information among amino acid residues [44], many structure construction efforts draw on amino acid data from DNA, single-cell genomics, and homologous protein families. Moreover, biological sequence data at different positions may have different functions; in this context, they can also be categorized under multiple sequence alignment (MSA) [16].
Function prediction in biomedicine enables understanding the functions of targets such as proteins and variants, for example to predict polypharmacy side effects. Core biological data for solving this problem are proteins, individual genes, and their spatial interactions, commonly encapsulated within knowledge graphs or networks. These networks capture diverse information, including gene interaction networks, disease pathways, patient networks, and networks of similarities between cells. Notably, the prediction of biological function is intrinsically linked to the outcomes of gene expression analysis, given that protein functionality is influenced by the degree of gene expression.
Multimedia analysis addresses biological problems by transferring principles from natural language processing and computer vision into biological areas. Hence, biomedical text, such as that used by BioBERT [24], and 2D and 3D biomedical images, such as microscopy images [19], make up the major data for solving domain-specific problems. Multimodal integration problems map multiple types of biological data, encompassing multi-omics and morphological histology data, etc.
Data preprocessing
Data preprocessing is paramount to ensure satisfactory performance before building a model. Original biological datasets may contain multiple inconsistencies caused by varying purposes and acquisition technologies, preventing them from being analyzed directly. Appropriately curated and preprocessed data, free of excessive overlap, deficiency, noise interference, or other unexplored confounds, can improve a model's computational efficiency and representational ability, and thus its performance. Examples include doublet removal, cell-cycle variance removal, data imputation and denoising, dimensionality reduction, representation learning, batch effect removal, normalization, background correction, etc.
Doublet removal avoids mapping duplicate overlapping data with different identifiers, or different data that share the same identifying barcode, and plays a significant role in constructing a unique set of relations between entities [45, 46]. Cell-cycle variance removal focuses on removing spurious variations in gene expression between cells that arise along the cell cycle by subtracting out the cell-cycle influence [47]. Data imputation and denoising are difficult because only 6%–30% of values can be captured under different chemistry versions and depths, and deciphering "true" from "false" (so-called "dropout") zeros among the >70% missing values is a prerequisite for further identification. To a certain degree, these data can be refined by leveraging similarities with other datasets or through the construction of multiple subneural networks for imputation [48]. Dimensionality reduction of wide gene expression profiles represented in high feature dimensions, also known as representation learning, aims at the construction of embeddings that facilitate the identification of data elements. Systematic variations specific to each batch tend to raise challenges in data integration and lead to significant data wastage. Tran et al. benchmarked batch effect removal methods such as scGen (a variational autoencoder model with a latent space), Scanorama (mutual nearest neighbors and panoramic stitching), and MMD-ResNet (a residual neural network for calibration) to effectively reduce the variations and batch effects in data captured at different times or with different types of equipment or technologies [49]. Therefore, customized fine-tuning can correct sequencing batches from multiple datasets [42]. Protein data can be compared on a common scale by normalization to adjust the measurements [50], and background correction aims to correct for any background noise in the protein data [51].
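For orientation, the following is a minimal, hypothetical preprocessing sketch using the Scanpy toolkit; the input file name, thresholds, and the placement of doublet removal and batch correction are illustrative assumptions rather than a prescription from the cited benchmarks.

```python
# Illustrative single-cell preprocessing sketch with Scanpy (assumed installed).
import scanpy as sc

adata = sc.read_h5ad("raw_counts.h5ad")            # hypothetical input file

sc.pp.filter_cells(adata, min_genes=200)           # drop near-empty barcodes
sc.pp.filter_genes(adata, min_cells=3)             # drop rarely detected genes
# Doublet removal would typically be run here (e.g. with a tool such as Scrublet).

sc.pp.normalize_total(adata, target_sum=1e4)       # normalization to a common scale
sc.pp.log1p(adata)                                 # variance-stabilizing transform
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var["highly_variable"]].copy()  # dimensionality reduction, step 1
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, n_comps=50)                       # dimensionality reduction, step 2
# Batch-effect correction (e.g. Harmony or Scanorama) would follow on the PCA space.
```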
With preprocessed biological data, the data analysis model can be efficiently employed and mitigate or even eliminate obstacles in biological tasks such as doublet detection and cell-cycle variance annotation. As a result, the judicious utilization of biological data and their corresponding embeddings can significantly enhance the performance of downstream tasks.
Downstream tasks
In bioinformatics, downstream tasks are analyzed by applying fine-tuning strategies that adapt the pretrained biological knowledge in FMs to accurately address the biological problems of interest. Fine-tuning can greatly reduce computational time and barriers to implementation and is capable of solving biological tasks related to sequence analysis, structure construction, function prediction, domain exploration, scMultiomics analysis, and multimodal integration.
For sequence analysis, besides traditional sequence alignment analysis [52], homology detection [53], and molecular evolutionary genetics analysis tasks [54], there are promoter prediction, enhancer–promoter interaction prediction, variant identification, variant effect prediction, signal peptide prediction, gene dosage sensitivity prediction, genetic perturbation prediction, protein understanding, DNA replication, stability prediction, etc. Promoter prediction identifies promoter regions and their motifs around transcription start sites in genome-wide sequences. Nonpromoter samples can then be constructed by splitting promoter sequences, shuffling and retaining different parts, and keeping the lengths matched, as sketched below. Enhancer–promoter interaction (EPI) prediction is essential in cell differentiation and can interpret noncoding mutations with potential pathogenicity [55]. EPIs are determined by chromatin conformation and thereby can be inferred by chromatin conformation capture-based techniques or other genetic approaches. In addition, promoters and enhancers are also known as initial and distal regulatory elements, respectively [56].
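The negative-sample construction described above can be sketched as follows; the fragment count and the shuffling rule are illustrative assumptions, not the exact procedure of any particular model.

```python
# Build a length-matched "non-promoter" negative sample by splitting a promoter
# sequence into fragments, shuffling them, and rejoining them.
import random

def make_nonpromoter(promoter, n_parts=20, seed=0):
    """Shuffle equal-length fragments of a promoter to build a length-matched negative."""
    rng = random.Random(seed)
    step = max(1, len(promoter) // n_parts)
    parts = [promoter[i:i + step] for i in range(0, len(promoter), step)]
    rng.shuffle(parts)
    return "".join(parts)

promoter = "TATAAAAGGCGC" * 25        # toy 300-bp-like positive sequence
negative = make_nonpromoter(promoter)
assert len(negative) == len(promoter)  # length and base composition are preserved
```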
Variant identification discloses human diseases and traits by distinguishing causal from noncausal variants [34]. Variant effect prediction focuses on determining functionally important variants and prioritizing them [57]. Signal peptide prediction is a binary protein sequence analysis task that predicts the presence of signal peptides and locates their cleavage sites [27]. Gene dosage sensitivity prediction identifies genes that are sensitive to changes in their dosage, helping interpret copy number variants in genetic diagnosis [17]. Genetic perturbation prediction aims to forecast perturbed original values or perturbed gene expression values in certain tasks [42]. Protein understanding requires accurate representation at the residue or protein level to capture the biological information encoded within proteins [37]. The process of DNA replication is governed by specific initiation and termination sites, with the function of the origin of replication being modulated by epigenetic factors; this intricate process can be studied at a population level by leveraging nontransformed, highly proliferative, and karyotypically stable pluripotent stem cells. Stability prediction calls for statistical representations of protein informatics such as natural language–inspired representations [43].
Structure construction commonly performs secondary or tertiary structure prediction in downstream tasks. Secondary structure prediction was originally achieved by thermodynamic and alignment methods to determine homologous sequences and their alignments [58]. 3D structures, by contrast, need further exploration due to the lack of 3D structure data and may be tackled with emerging deep learning methods. Moreover, other tasks related to DNA, RNA, protein, and genomics, such as predicting DNA-binding residues, protein–RNA binding preference, protein–ligand binding pose, splicing junction prediction, neuropeptide cleavage, genome structure and evolution, and gene networks, also underpin the discovery of structural information. Predicting DNA- and RNA-binding proteins is essential for analyzing genetic variants [59]. Transcription factors (TFs) are binding proteins that regulate gene expression by binding motifs (specific DNA sequences) to control transcription. Generally, pathogenic functional variants in complex neurodegenerative diseases occur with changes in TF-binding intensities [5]. PPI prediction aims at revealing bindings between proteins with transient or stable physical connections. Protein–small molecule and protein–nucleic acid interactions are significant prediction tasks that dominate organism activities [60].
Splicing junction prediction is crucial for protein synthesis and genetic disease identification, and variant effects on splicing can be predicted by integrating process-specific scores [61]. Neuropeptide cleavage prediction is a binary post-translational modification task; the maturation of neuropeptides is associated with molecular variability underlying behavioral and physiological states [27]. Genome structure refers to the secondary structures of genome regulatory elements, and evolution denotes the evolutionary trend of virus variants [58]. Gene network prediction can map networks based on learned connections between genes. Recently, a transfer learning method has been proposed to learn these connections with limited task-specific data, showing promise for the analysis of rare diseases [17].
Function prediction captures various properties of RNA/protein/gene functions; discovers (novel) cell types, functional genetic variants, and functional modules; and describes gene expression regulation, in silico treatment analysis, fitness landscape inference, trajectory inference, etc. Functional property prediction performs the classification of RNA/protein/gene into several functional groups. For instance, gene function prediction, from classifying gene and protein functions to analyzing genome-wide experimental data with multiple statistical tests, relies on the coverage and accuracy of annotation data such as Gene Ontology (GO) annotations [62]. Cell type annotation describes heterogeneity in tissues following cell clustering, yielding further insights into biology and pathology [42]. Functional genetic variant identification probes functional variants located inside regions of interest and subsequently repeats the prediction with altered alleles [26]. Functional module detection takes networks and functional features as inputs, outputs protein complexes, and evaluates the overlap between the predicted module and the known complex [8].
Gene expression regulation models the biological process in which the genetic blueprint within a gene is harnessed to synthesize a functional product. Chromatin state analysis is commonly used for detecting annotation and regulation of the genome and for further nucleosome-level function prediction with gene expression and other related data [63]. Gene expression profiling facilitates therapeutic discovery through gene expression similarities measured by a distance metric and through the clinical importance of a given gene at the expression level, e.g. identifying a tumor-associated gene relative to normal groups [64]. In silico treatment analysis is applied to model human disease by detecting candidate therapeutic targets, such as in cardiomyopathy, and determining the related genes [17]. Fitness landscape inference is developed to map protein fitness under given environments and navigate residue mutation effects along evolutionary trajectories [35]. Trajectory inference, also known as pseudo-time analysis, predicts the order or "progress" of single cells from the original to the final cell state based on genome-wide omics data [65]. Noticeably, the cell ordering, topology, stability, and usability of trajectory inference methods highly depend on the dimension of the dataset and the topology of the trajectory.
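As a toy illustration of measuring gene expression similarity with a distance metric, the sketch below compares a query profile against two synthetic reference signatures using correlation distance; the profiles and signature names are synthetic and for illustration only.

```python
# Compare a query expression profile against reference signatures with
# correlation distance (1 - Pearson correlation); smaller means more similar.
import numpy as np
from scipy.spatial.distance import correlation

query = np.array([5.1, 0.2, 3.3, 7.8, 0.0, 2.4])        # toy log-expression profile
references = {
    "tumor_signature":  np.array([5.0, 0.1, 3.0, 8.0, 0.2, 2.0]),
    "normal_signature": np.array([1.0, 4.0, 0.5, 1.2, 6.0, 0.3]),
}
for name, profile in references.items():
    print(name, round(correlation(query, profile), 3))
```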
Multimedia analysis leverages biomedical text, images, video, etc., for biological domain-specific analysis such as named entity recognition, medical image extraction, and complementary medical applications [24]. Prevalent NLP text-processing techniques have made numerous efforts to advance biomedical text mining for named entity recognition, RE, sentence similarity, document classification, natural language inference, evidence-based medical information extraction, abstractive summarization, question answering (QA), multiple-choice question answering, etc. Analyzing terms and expressions in the biological domain corpus is pivotal for these tasks. For instance, relation extraction on PubMed enables the discovery of chemical–protein interactions, where the majority of relation instances consist of single sentences. In medical vision, tasks specialize in visual recognition, image captioning, and medical image segmentation. Other domain-specific analyses focus on complementary and alternative medical applications, for instance, grounded radiology reports, bedside decision support, and augmented procedures.
scMultiomics analysis serves not only single-omics downstream tasks, such as cell-type annotation and genetic perturbation prediction, but also multi-omics tasks like multi-omics integration relevant to cellular dynamics and disease. Multimodal integration deciphers manifold biological understanding across data modalities, such as cross-modal retrieval and multimodal understanding. Besides the aforementioned downstream analysis tasks, many other tasks are not listed or remain to be further studied with FMs, such as chemical–genetic interaction prediction and other modality-relevant tasks for future biological problems.
Foundation models for biological multimedia analysis
FMs have extensively explored knowledge within NLP and computer vision [40, 66]. A series of methods such as BERT [23], K-BERT [67], GPT-3 [68], and Dragon [69] utilized FMs to map text, images, knowledge graphs, or their combinations, drawing on data such as Wikidata [70], BookCorpus [71], and ConceptNet [72], to curate comprehensive language representations and complementary domain information. Thereby, biological multimedia data, i.e. biomedical text, images, and knowledge graphs/networks built from diagnosis records [73] or other large health datasets [74], can be analyzed in the same way.
Semantic-level biological information encoded within biological text has been a central focus for FMs. Models like BioBERT [24], Med-PaLM [75], BioELECTRA [76], CodonBERT [77], and BioBART [78] efficiently shift from the general domain to the biological multimedia domain through effective tokenization. BioBERT identifies a multitude of proper nouns in biomedical texts, leveraging its final layer representations to compute token-level BIO2 probabilities exclusively. It also performs sentence classification via a single output layer using BERT's [CLS] token representation for RE and follows SQuAD [79], with the same architecture as BERT, for the QA task. With minimal architectural modifications, BioBERT accomplishes these tasks by pretraining BERT on large-scale biomedical corpora, such as PubMed [80], resulting in improved performance in biomedical text mining. Med-PaLM combines seven professional medical QA datasets (MMLU clinical topics, LiveQA, MedicationQA, MedQA, MedMCQA, PubMedQA, HealthSearchQA) for aligning the model to new domains using a few exemplars. BioELECTRA, pretrained on PubMed and PMC full-text articles, introduces a replaced-token prediction pretraining task with a generator and discriminator network. BLURB pretrains a biomedical language model from scratch on unannotated biomedicine text, eliminating the need for complex tagging schemes. Lastly, BioBART, a bidirectional and auto-regressive generative language model designed specifically for biomedical natural language generation tasks, completes the suite with corresponding data.
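A hedged sketch of the fine-tuning pattern described for BioBERT (a single classification head over the [CLS] representation for relation-style sentence classification) is given below; the Hugging Face checkpoint name, the entity-anonymized input format, and the two-label setup are assumptions for illustration, not a prescription from the original work.

```python
# Sentence-level relation classification with a BERT-style encoder:
# the classification head operates on the [CLS] token representation.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "dmis-lab/biobert-base-cased-v1.1"   # assumed BioBERT checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

sentence = "@CHEMICAL$ inhibits the activity of @GENE$ in hepatocytes."
inputs = tokenizer(sentence, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits               # [CLS]-based sentence logits
print(logits.softmax(dim=-1))                     # probability of each relation class
```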
Besides these biological text-based domain-shift explorations, FMs also incorporate multiple modalities of data. For instance, MSA [81] overcomes the lack of training data through a medical-specific domain knowledge–integrated adaptation technique. Similarly, GMAI facilitates easy adaptation to new tasks by accepting inputs with varying combinations of data modalities. With minimal or no task-specific annotated data, GMAI can perform a wide array of tasks, including constructing a comprehensive perspective of a patient's health status by integrating various modalities, from unstructured symptom descriptions to continuous glucose monitor readings and patient-supplied medication logs.
FMs in bioinformatics exhibit competitive efficacy in exploration tasks, including biomedical text mining (such as named entity recognition, relation extraction, and question answering), PICO (Participants, Interventions, Comparisons, and Outcomes entities) extraction, and vision-language extraction. For instance, BioBERT [24], pretrained on biological PubMed abstracts totaling 4.5 billion words and PubMed Central full-text articles totaling 13.5 billion words, outperforms the general-domain foundation model BERT [23]. Similarly, BioELECTRA, pretrained from scratch on PubMed abstracts, achieves superior mean test results across all datasets in BLURB, outperforming PubMedBERT.
Model capacity is pivotal to biological multimedia analysis as well. For instance, Med-PaLM, with 540 billion model parameters, achieves a medical QA accuracy of 67.6% on MedQA, surpassing PubMedBERT (38.1%, 100 million parameters) [82], DRAGON (47.5%, 360 million parameters) [69], and PubMed GPT (50.3%, 2.7 billion parameters) [83]. Further, models like MSA, when fine-tuned, achieve the best results compared with SOTA segmentation methods. BioBART also demonstrates competitive performance on biomedical summarization datasets, surpassing BART-large by 1.93/1.31/2.1 ROUGE-1/2/L points on MeQSum. Noticeably, the absence of a standard dataset for training and different training splits could result in lower scores, while the large scale of the models may also present technical obstacles.
Foundation models for biological sequence analysis
Biological sequence analysis handles exponentially growing sequence data related to genes, mutations, and various biological phenomena. Traditional models typically train identifiers using handcrafted features, necessitating an extra step of manual feature extraction. In contrast, recent works leverage implicit medical knowledge from FMs to tackle specialized tasks and even unseen tasks from unknown sequences. These advancements provide superior prediction results across various tasks with the constraints of correlated biological theory.
Deciphering the language of noncoding DNA to understand how DNA sequence encodes phenotypes represents a major challenge for the next phase of genome biology research. Enformer [34] improves gene expression prediction accuracy, noncoding variant effect prediction, and candidate enhancer prioritization from DNA sequence by integrating long-range interactions within a larger receptive field. Due to polysemy and distant semantic relationships in noncoding DNA, especially in data-scarce scenarios, the gene regulatory code is highly complex. DNABERT [26] is applied to proximal promoter region identification by fine-tuning two models, DNABERT-Prom-300 and DNABERT-Prom-scan, on TATA and non-TATA human promoters of 10 000 bp in length from the Eukaryotic Promoter Database (EPDnew). It captures a global and transferrable understanding of DNA sequences after fine-tuning on small task-specific annotated data, enabling visualization of semantic relationships. When dealing with sequences that extend beyond 512 bp in length, DNABERT segments them into manageable parts and combines their representations to yield the final composite representation, as sketched below. DNABERT-2 [84] further enhances efficiency by incorporating a more capable tokenizer and strategies to handle input length limitations, optimizing time and memory consumption while boosting model capabilities. When extracting semantic-level genome representations, existing processes tend to rely on manual design and generate unsatisfactory rather than refined representations, which would otherwise demand costly database explorations. CLAPE-DB [40] leverages pretraining and contrastive learning on vast unannotated data with the ability to handle imbalanced data.
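The two mechanics mentioned above, overlapping k-mer tokenization and the segment-and-combine treatment of sequences longer than 512 tokens, can be sketched as follows; the averaging rule and the placeholder encoder are assumptions, not DNABERT's actual implementation.

```python
# (i) turn a DNA sequence into overlapping k-mer tokens;
# (ii) split long token lists into chunks and combine chunk representations.
import numpy as np

def kmer_tokenize(seq, k=6):
    """Overlapping k-mer tokens, e.g. 'ACGTAC' -> ['ACGTAC'] for k=6."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def embed_long_sequence(seq, max_tokens=512, k=6):
    tokens = kmer_tokenize(seq, k)
    chunks = [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]
    # Placeholder encoder: a real model would map each chunk to a learned vector.
    chunk_vectors = [np.full(8, fill_value=len(chunk), dtype=float) for chunk in chunks]
    return np.mean(chunk_vectors, axis=0)          # composite representation

print(embed_long_sequence("ACGT" * 400).shape)     # (8,)
```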
The translation of DNA into proteins, governed by the universal genetic code, relies heavily on the vast information encoded within the genome rather than mere sequential order. HyenaDNA [41] addresses this complexity, leveraging genome sequences across various data lengths and model sizes. Protein sequences across large protein families could be generated through language models. Nucleotide Transformer [57] incorporates information from 3202 diverse human genomes and 850 genomes from a broad spectrum of species, demonstrating that increased diversity improves performance compared to increased model size.
The synthesis of proteins holds immense application potential in areas such as pharmaceutical design and protein engineering. ProGen [43] successfully generates a million artificial sequences after fine-tuning on a curated lysozyme dataset and generates artificial proteins across multiple families and functions. Interactions between proteins and DNA play pivotal roles in vital biological processes such as replication, transcription, and splicing. xTrimoPGLM [37] pretrains a transformer framework with 100 billion parameters to address protein understanding and generation tasks with joint optimization of the two types of objectives. Systematic prioritization obtained from a sequence-based CNN, instead of a binary outcome, can accurately predict TF-binding intensities and measure the impact of noncoding variants on TFs. Further, ProteinBERT [27] enables meticulous fine-tuning across an extensive spectrum of protein-related tasks in a remarkably short span of minutes. ProtGPT2 [36] generates sequences whose secondary structure content (48.64% alpha-helix, 39.70% beta-sheet, and 11.66% coil) is comparable to that of the natural space (45.19%, 41.87%, and 12.93%, respectively).
Accurate identification of splice sites is pivotal for ensuring precise protein translation. Among these endeavors, DNABERT outperforms SpliceFinder [85] on benchmark data, with a multiclass accuracy of 0.923, an F1 score of 0.919, and a Matthews correlation coefficient (MCC) of 0.871, compared to SpliceFinder's reported accuracy of 0.833, F1 score of 0.828, and MCC of 0.724. In functional variant prediction, ProGen aligns more accurately with experimentally measured assay data from the chorismate mutase (CM) and malate dehydrogenase (MDH) protein datasets, with an area under the curve (AUC) of 0.94, compared to sequence generation methods specifically designed for these families such as ProteinGAN [86] with an AUC of 0.87. HyenaDNA establishes a new state of the art across all datasets, surpassing previous benchmarks such as GenomicBenchmarks [87] by substantial margins and achieving an improvement of up to 20 percentage points in human enhancer identification.
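The comparisons above rely on standard classification metrics; the short sketch below shows how accuracy, macro-F1, MCC, and AUC can be computed with scikit-learn on toy labels and scores (the values are purely illustrative).

```python
# Standard classification metrics on toy predictions.
from sklearn.metrics import (accuracy_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

y_true = [0, 1, 2, 1, 0, 2, 1, 0]            # toy multiclass labels (e.g. splice-site types)
y_pred = [0, 1, 2, 1, 0, 1, 1, 0]
print("accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
print("MCC     :", matthews_corrcoef(y_true, y_pred))

y_bin  = [0, 1, 1, 0, 1, 0]                   # toy binary labels for AUC
scores = [0.1, 0.9, 0.7, 0.3, 0.8, 0.4]       # model probabilities
print("AUC     :", roc_auc_score(y_bin, scores))
```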
Protein sequences, akin to natural languages, encapsulate structure and function in their amino acid order. ProtGPT2-generated sequences exhibit a distribution of ordered and disordered regions comparable to natural datasets when assessed with IUPred3. Notably, the proportion of ordered amino acids in ProtGPT2 and natural datasets is 79.71% and 82.59%, respectively, underscoring the similarity in their composition. Specifically, xTrimoPGLM achieves a template modeling score (TM-score) of 0.961 in predicting variable heavy (VH) and variable light (VL) structures in antibodies, surpassing AlphaFold2 (0.951) and other advanced methods including OmegaFold [88] (0.946), ESMFold [89] (0.943), IgFold [90] (0.945), and xTrimoAbFold [91] (0.958). While FMs are not expected to generate an entirely different distribution or domain, they can expand the variety of sequences sampled by evolution, thereby enhancing model performance.
Foundation models for biological structure construction
Understanding secondary and 3D biological structures is vital for medical interventions such as vaccine creation, which involves determining the messenger RNA (mRNA) structure. Conventional methods rely on physics-oriented methods like cryogenic electron microscopy, thermodynamic methods supported by experimentally determined thermodynamic parameters, and alignment-oriented methods [85]. The lack of structure datasets and structural instability of genes like RNA have prompted significant efforts in developing computational methods. FMs allow for the creation of a learned, RNA/protein-specific neural network to accurately predict biological structure through token manipulation and position embedding.
FMs offer a scalable combination of data and model capacity for downstream tasks in biological structure construction [92]. For secondary structure prediction, ProteinBERT [27] recovers corrupted inputs by randomly replacing tokens and adding random false annotations, forcing the model to predict annotations from the sequence alone; ProGen [43], trained on millions of raw protein sequences, generates artificial proteins with structural divergence; AlphaFold [44] derives distance maps and torsion distributions between pairs of residues from protein sequences; AlphaFold2 [93] further improves the accuracy using a noisy student self-distillation approach and generates a new dataset of predicted structures; ESM-1b [94] trains a deep contextual language model on 86 billion amino acids across 250 million protein sequences; NetSurf [95] replaces the conventional logistic regression linear layer with a deep neural network; xTrimoPGLM [37] pretrains the classification task on helices, strands, and various turns like coils; and CLAPE-DB [40] combines the pretrained protein language model ProtBert [96] with contrastive learning to discover a representation space for predicting ligand-binding sites in a protein sequence.
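The corruption-and-recovery objective attributed to ProteinBERT above can be sketched roughly as follows; the replacement rate and the decision to substitute random residues (rather than a dedicated mask token) are illustrative assumptions.

```python
# Randomly corrupt amino-acid tokens and record what the model should recover.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def corrupt_sequence(seq, rate=0.15, seed=0):
    """Return (corrupted_seq, target_positions) for a masked-recovery objective."""
    rng = random.Random(seed)
    chars, targets = list(seq), []
    for i in range(len(chars)):
        if rng.random() < rate:
            targets.append((i, chars[i]))          # what the model must recover
            chars[i] = rng.choice(AMINO_ACIDS)     # replace with a random residue
    return "".join(chars), targets

corrupted, targets = corrupt_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(corrupted, targets[:3])
```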
For tertiary structure, FMs can directly predict the positions of biological targets. For instance, Uni-Mol [97] predicts 3D positions by utilizing two pretrained models: a molecular model pretrained on 209 million molecular conformations and a pocket model pretrained on 3 million candidate protein pocket data. It completes self-supervised pretraining on selected positions with minimal delta positions from random positions, avoiding the need for a masking strategy to recover the correct 3D position [98]. Rapid prediction of protein structure is also indispensable. RGN2 [99] achieves a remarkable reduction in computation time, by up to a million-fold, outperforming AlphaFold2 in the analysis of orphan proteins and various classes of designed proteins. Guo et al. [100] propose a pretraining model to learn hierarchical structure embeddings from protein tertiary structures to improve prediction efficiency. Moreover, McDermott et al. [101] impose relational structure constraints on the pretraining framework and incorporate a pretraining graph as an auxiliary input, whose performance is supported by theoretical results.
FMs in biological structure construction greatly surpass the limits of conventional structure prediction methods. RNA-FM achieves F1 scores of 94.1% and 70.4% on the ArchiveII600 and bpRNA-TS0 datasets, respectively, surpassing SPOT-RNA [102] by 22.8% and 7.5% and notably outperforming the SOTA UFold [103] by 3.4% and 4.0%, respectively. AlphaFold combines bioinformatics and physical approaches to build components from PDB data, enabling the handling of complex physical contexts in challenging cases such as intertwined homomers. As for 3D pose prediction, Uni-Mol predicts 80.35% of binding poses with a root mean square deviation (RMSD) ≤2 Å, better than popular docking methods, as illustrated below. When dealing with small-scale data tasks, RNA-FM confirms that transfer learning employing pretrained parameters of ResNet32 on bpRNA-1m improves task performance by another 20 points compared to a simple ResNet32 with RNA-FM. RNA-FM's 3D distance prediction attains a Pearson product-moment correlation coefficient (PMCC) of 0.8313 when combining sequence encoding, MSA covariances, and RNA-FM embeddings.
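For reference, the RMSD criterion used in the docking comparison can be computed as in the minimal sketch below, assuming the predicted and reference poses share the same atom ordering and frame of reference.

```python
# Root mean square deviation between matched atomic coordinates (toy example).
import numpy as np

def rmsd(pred, ref):
    """RMSD over matched atoms, in the coordinates' units (e.g. Å)."""
    return float(np.sqrt(np.mean(np.sum((pred - ref) ** 2, axis=1))))

ref  = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0]])
pred = ref + np.random.default_rng(0).normal(scale=0.5, size=ref.shape)
print(f"RMSD = {rmsd(pred, ref):.2f} Å")   # a pose would count as "good" if <= 2 Å
```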
The potential of FMs extends beyond structure prediction. CLAPE-DB demonstrates superior performance in DNA-binding site prediction, with AUC values of 0.871 and 0.881 on the two benchmark datasets TE46 and TE129. It outperforms the latest advanced sequence-based models, including DNAPred [104] with AUC values of 0.845 and 0.730, NCBRPred [105] with AUC values of 0.823 and 0.713, and SVMnuc [106] with AUC values of 0.812 and 0.715. With the increasing availability of genomic profiling data and 3D genome contact maps, more types of binding sites can be further identified.
Foundation models for biological function prediction
Biological functions have garnered significant attention and interest within the domain of bioinformatics. Traditional function prediction models mainly classify targets into one or more categories of collected function datasets such as GO [107], which delineates functions by hierarchical ontologies covering molecular function, biological process, and cellular component [108, 109]. Although GO has >50 000 classes, the existing function taxonomy is immature, incomplete, and imbalanced. Further, highly variable genes (HVGs) are mainly selected based on expression variance across the entire dataset. This selection also removes genes that are stable within the dataset, even though they may be crucial to specific cell states in that dataset relative to all other possible cell states. For instance, in a brain dataset, a neuron-specific transcription factor might be excluded due to its lack of high variability. However, this gene, specific to neurons, is critical for the model's accurate understanding, distinguishing neurons from other cell types encountered during pretraining. Here, biological function prediction FMs offer a solution to these challenges.
For instance, Geneformer [17], a context-aware, attention-based deep learning model, leverages pretraining on a corpus of 29.9 million transcriptomes to accurately predict disease genes and their targets. It can be fine-tuned for a variety of downstream tasks related to chromatin and network dynamics. DNABERT [26] provides an accurate prediction of functional genetic variant candidates from ~700 million short variants in dbSNP. xTrimoPGLM [37] employs four distinct masking strategies to redesign the selected sequence and evaluates the implications of synthesized protein sequences associated with specific biological functions. ProtST [35] proposes a multimodal integration pretraining framework on both protein sequences and biomedical texts, outperforming the sequence-based model ESM-1b [95] on protein function annotation. RNA-FM [58] leverages embeddings pretrained on noncoding RNAs to model the function of the 5′ untranslated region in mRNA, showing its versatility in handling noncoding sequences.
Existing methods often require preprocessing of raw data due to their limited capability to model high-dimensional data efficiently. To overcome this challenge, scBERT [110], pretrained on 1 million unannotated single-cell RNA sequencing profiles, tackles batch effects, increases sequence length, and enhances model generalizability by employing Performer [111], a matrix-decomposition-based transformer. It provides dense embeddings encoded from large-scale unlabeled raw data. To simplify the comparison between pretrained and fine-tuned models, scGPT [42] performs HVG selection as well as a binning technique to model high-dimensional data. Specifically, it employs an in-memory data structure tailored for nonsequential omic data, enabling storage of hundreds of datasets and facilitating rapid access to large-scale data. Pretrained on over 10 million cells stored within this in-memory data structure, scGPT converts all expression counts into relative values using a novel value binning technique, partitioning the expression matrix into bins after selecting HVGs with log1p transformation. Similar to creating sentences in natural language that are grammatically and semantically correct on a range of topics, ProtST [35] creates protein sequences with predictable functions across diverse protein families, enabling adaptation to a wide array of protein families. L2P-GNN [112] employs a dual adaptation mechanism at both node and graph levels to encode local and global information, facilitating biological function prediction using 88 000 annotated subgraphs for 40 binary classification tasks.
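A hedged sketch of the value-binning idea described for scGPT follows; the bin count, the within-cell quantile edges, and the reservation of bin 0 for zero counts are assumptions made for illustration, not the published implementation.

```python
# Map one cell's expression counts to a small number of relative bins so that
# cells from different sequencing depths become comparable.
import numpy as np

def bin_expression(cell_counts, n_bins=51):
    """Map one cell's counts to integer bins [0, n_bins-1] by within-cell quantiles."""
    binned = np.zeros_like(cell_counts, dtype=int)
    nonzero = cell_counts > 0
    if nonzero.any():
        logged = np.log1p(cell_counts[nonzero])
        edges = np.quantile(logged, np.linspace(0, 1, n_bins))
        binned[nonzero] = np.digitize(logged, edges[1:-1]) + 1   # bin 0 reserved for zeros
    return binned

cell = np.array([0, 0, 3, 10, 0, 150, 7, 1])
print(bin_expression(cell, n_bins=5))
```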
Another limitation of available training data is class imbalance and the presence of extremely similar subtypes. For instance, in the cell type annotation task on the Zheng68k peripheral blood mononuclear cell dataset, accuracy could not surpass 0.71 with traditional methods. In contrast, scBERT achieves an accuracy of 0.759 in the same scenario. Meanwhile, scBERT effectively captures long-range interactions and achieves higher performance on both known and unknown classes. For unknown classes, scBERT achieves an accuracy of 0.329, surpassing SciBet [113] (0.174) and scmap_cluster [114] (0.174), and, for known classes, scBERT achieves an accuracy of 0.942, outperforming SciBet (0.784) and scmap_cluster (0.666). Geneformer significantly boosts the prediction of central versus peripheral factors (AUC 0.81) compared to other methods (AUC 0.59–0.69). While functional tasks, tied to the distribution of pretrained data, have not yet surpassed structural tasks in terms of improvements, biological function prediction FMs consistently outperform baselines by a large margin and achieve higher accuracy.
Foundation models for biological scMultiomics analysis
Single-cell multi-omics provides significant insights into cellular dynamics, gene regulation, and disease mechanisms, incorporating complex cellular modalities and states at once [115]. However, integrating diverse data sources often suffers from data scarcity, integration intricacies, and the limited robustness and clarity of many modalities such as epigenomics and proteomics, hindering traditional models from maximizing their benefits. FMs bridge this gap by connecting disparate data modalities. For single-cell biology, scGPT [42] integrates multiple sequence data modalities such as DNA, RNA, and protein with a generative pretrained model across a repository of >10 million cells, offering a holistic view of cellular states. It supports joint optimization of multi-omics tokens, i.e. condition-specific tokens, from paired data in the flexible embedding architecture. Compared with scGLUE and Seurat v4, which produce a merged cluster of B cells spanning three major types, scGPT can differentiate these three types of B cells into distinct groups and further provides a clear separate cluster for CD8 naive cells, with superior overall clustering performance (AvgBIO = 0.767).
Besides single-cell multi-omics, other multi-omics FMs also hold immense promise, bridging the gap between data modalities and propelling us toward a deeper understanding of cellular dynamics and disease. For instance, scFoundation [116] serves as a versatile FM, harmonizing diverse biological data sources including sequence information, structural features, and chemical compounds. It integrates the Cancer Cell Line Encyclopedia (CCLE) and Genomics of Drug Sensitivity in Cancer (GDSC) datasets for input cell line gene expression, drugs, and IC50 labels, extracting transcriptome features fed into subsequent prediction modules. scFoundation enables transferring gene relationships to bulk-level expression data, improving IC50 prediction accuracy. As the amount of publicly available multi-omics data continues to expand, future FMs may enable more meaningful predictions in elusive tasks even with limited task-specific data.
Foundation models for multimodal integration biological problems
Recent language understanding research suggests that models pretrained on text corpora are surprisingly effective, shedding light on the potential for multimodal analysis [117]. Traditional biological models mainly focus on unimodal information and have difficulty handling multimodal or multilevel data [35, 118, 119]. While FMs offer a broader understanding of targets, they are also vulnerable to perturbation and necessitate specific fine-tuning approaches. Multimodal integration enables a deeper understanding of diverse topics for a systematic study of biological samples [58]. For instance, scGPT [42] enables multi-omics prediction through generative AI, integrating expression and new condition tokens to extend the embedding architecture to accommodate multiple modalities, batches, and states. Similarly, ProtST [35] improves the original representational capacity of the protein language model through the application of multimodal representation alignment and multimodal mask prediction.
It is imperative for biological FMs to incorporate a more diverse range of data types, e.g. temporal data and perturbation data. For instance, analysis of knowledge graphs often suffers from low accuracy due to the data imbalance problem, where richer entities possess more relations and information, but the scarce ones will not be fully represented with limited information. A possible solution is to incorporate varying forms of biological data such as sequence data, structure data, and chemical compounds. Leveraging exponentially growing biological data alongside advanced FMs holds promise for achieving both clinically and biologically significant outcomes.
Challenges and opportunities
FMs are a double-edged sword, presenting both opportunities and challenges [120, 121]. While they make large biological data analysis possible, they also demand substantial computing resources, entail a vast number of model parameters, and suffer from low explainability and reliability. These challenges, along with potential opportunities for advancing promising biological areas, are depicted separately in Fig. 3 and are further elaborated below.

Figure 3. Challenges and opportunities in applying FMs for biological problems. FMs for addressing biological problems face hurdles related to biological data, model structures, and their social influence, which concurrently catalyze opportunities in bioinformatics due to the increasing availability of biological data, advancements in FMs, and their versatile real-world applications. The top half of the figure outlines challenges such as data noise and sparsity, increasing data diversity, long sequence length, and multimodality in biological data collection. Additionally, challenges in training efficiency, model explainability, and establishing evaluation standards in model design and construction are depicted. Social influences, including ethics and fairness, privacy concerns, potential misuse, and social bias, further compound these challenges. Conversely, the bottom half of the figure illustrates emerging opportunities driven by the proliferation of diverse biological data types and volumes, including RNA, DNA, scMultiomics, proteins, and knowledge graphs/networks. The enhancement of FMs, particularly through pretrained mechanisms, presents another avenue for progress. Moreover, a wide range of applications spanning surgery, hormonal therapy, immunotherapy, radiotherapy, personalized therapy, chemotherapy, bone marrow transplant, drug discovery, and online healthcare underscores the potential impact of FMs in bioinformatics. These developments signal a promising trajectory for the application of FMs in addressing biological complexities.
Challenges
Data noise, sparsity, and diversity
FMs that derive embeddings to support other biological tasks still grapple with sparse or corrupted, noisy data [122, 123]. The sparsity of biological data typically stems from data collection deficiencies, with variations in chemistry versions and depths of data capture, as well as imbalances in research focus that tend to concentrate on well-known phenomena. Noise and biases often arise from different selection strategies, experimental conditions, and other factors. While some FMs indicate that this issue could be alleviated through careful review of the data and deep investigation of the phenomenon, they remain susceptible to data corruption present in existing evaluation sets or in future real-world data capture scenarios. Increasing data diversity holds promise for improving model performance efficiently [95]. However, diversity contributed by task-unrelated data in real-world scenarios may not readily transfer to downstream tasks [124, 125]. To derive accurate representations from diverse biological data, FMs can emphasize both feature-level and semantic-level training strategies to harmonize biological knowledge across modalities [126].
Long sequence length
Biological sequences hold great potential for addressing diverse biological challenges, but their extensive length poses significant challenges during model training. Consider, for instance, a single human genome with ~3 billion nucleotides, a bacterium with 5 million, or even a virus with 30 000. These lengthy sequences introduce considerable gradient variance, leading to increased instability and reduced training efficiency. Inspired by Sortformer [127], Li et al. [128] have endeavored to achieve stable training with reduced computational cost by improving data efficiency. However, the method's use of varying sequence lengths, obtained directly through truncation, entails sacrificing the information in the dropped data. Causal relationships, prerequisites, or other significant factors have yet to be adequately represented and validated. This could be addressed by further leveraging data localization, structural and functional aspects, or other chemical and biological rules and relationships.
Training efficiency
Pretraining constitutes a vital step for most FMs to maintain coherence within each shot. However, the high computational cost on huge amounts of data remains a significant barrier to widespread implementation. For instance, AlphaFold2 requires several weeks of training on up to 200 graphics processing units (GPUs). To improve efficiency in analyzing big data, previous approaches have employed various strategies, including attention mechanisms such as FlashAttention [129] and multi-query attention [130], kernels [131], sparse activation [65], and other advanced mechanisms to reduce both training time and inference time; a minimal multi-query attention sketch is given below. Similarly, substantial redundant computation could be cut from FMs with advanced technologies that focus on removing unimportant parameters, reducing memory consumption, enhancing convergence rates, parallelizing data processing, and fully exploiting the generative and adaptive capabilities of models [92]. In summary, further efforts along these lines are essential for continually improving the efficiency of FMs, thereby enhancing their applicability in diverse domains.
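As one concrete example of these efficiency mechanisms, the sketch below implements a minimal multi-query attention layer (one shared key/value head across all query heads) in PyTorch; the dimensions and the omission of masking and dropout are simplifications for illustration.

```python
# Multi-query attention: many query heads share a single key/value head,
# shrinking the KV projections and the KV cache relative to standard attention.
import torch
import torch.nn.functional as F
from torch import nn

class MultiQueryAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_proj = nn.Linear(d_model, 2 * self.d_head)   # single shared K/V head
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                                     # x: (batch, seq, d_model)
        b, s, _ = x.shape
        q = self.q_proj(x).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        k, v = self.kv_proj(x).chunk(2, dim=-1)               # (b, s, d_head) each
        k, v = k.unsqueeze(1), v.unsqueeze(1)                 # broadcast over heads
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, s, -1)
        return self.out_proj(out)

x = torch.randn(2, 16, 256)
print(MultiQueryAttention()(x).shape)    # torch.Size([2, 16, 256])
```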
Model explainability
It is also challenging to provide interpretability for FMs at each step and to support their outputs with logical evidence in bioinformatics. Clear and robust explainability and interpretability, alongside highly competitive prediction accuracy, are significant factors enabling a wide range of biomedical and healthcare applications, allowing models and results to be explained to consumers and researchers [132]. Efforts have been made to explain FMs in biological applications; scBERT, for example, elucidates the contribution of genes and their interactions by analyzing attention weights within the self-attention mechanism for gene exploration and decision-making tasks. Thereby, the top genes can be visualized by their weights and analyzed in subsequent stages [110]. However, this approach relies solely on structural results, which neither indicate the importance of each node nor provide explanations for the model's reliable results. We envision that FMs can dramatically improve interpretability and explainability by incorporating knowledge graphs and networks to narrow the gap between FMs and experts for solving more complex biological problems.
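The attention-based explanation strategy described for scBERT can be approximated as follows; the aggregation rule (averaging attention received over heads and query positions) is an assumption for illustration rather than the model's exact procedure.

```python
# Rank gene tokens by the attention they receive in a transformer layer.
import numpy as np

def rank_genes_by_attention(attn, gene_names, top_k=5):
    """attn: (n_heads, n_tokens, n_tokens) attention weights for one cell."""
    received = attn.mean(axis=0).mean(axis=0)        # average attention received per token
    order = np.argsort(received)[::-1][:top_k]
    return [(gene_names[i], float(received[i])) for i in order]

rng = np.random.default_rng(0)
genes = [f"gene_{i}" for i in range(10)]
attn = rng.random((8, 10, 10))
attn /= attn.sum(axis=-1, keepdims=True)             # normalize rows like softmax output
print(rank_genes_by_attention(attn, genes))
```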
Evaluation standard
Traditional AI-based models designed for specific tasks in computer vision or natural language processing are typically evaluated based on predefined metrics, making it straightforward to assess their performance. However, FMs face various downstream tasks as well as unseen tasks, making it uniquely challenging to anticipate all of the modes and set an evaluation standard for these methods. Current qualitative evaluations often focus on certain modules such as Machine Reading Comprehension within a complete QA pipeline, rather than assessing previously unseen tasks, such as diagnosing disease in brain magnetic resonance imaging [14]. Additionally, the general domain evaluation may overlook the impact of rich biological regulations, such as biomedical synonymous relationships [133]. To accurately evaluate model performance and output quality, which, in turn, prevents the occurrence of overly confident assertions, it is essential to consider model uncertainty and incorporate biological knowledge from domains such as radiology, pathology, and oncology.
Opportunities
Biological data
The exponential growth of available biological data, such as datasets for RNA secondary structure prediction like bpRNA-1m [134] (102 318 sequences from 2588 RNA families), is expected to significantly enhance the performance of FMs on downstream tasks. Despite the availability of extensive datasets, there remains a vast amount of untapped data that FMs have yet to fully utilize. Complex combinations of biological information or conditional data offer promising avenues for further analysis by FMs.
Foundation model architecture
Transitioning from single-purpose architectures to frameworks that address multiple objectives at once presents difficulties. In this respect, FMs can be considered from two perspectives. First, when biological data and model size are manageable, designing dedicated strategies for different data types and tasks becomes feasible. Second, particularly large models that encode extensive biological information and contain a massive number of parameters require training procedures that are both fast and stable. Despite existing efforts, the current cognitive abilities of FMs still fall short of expectations. To this end, developing new training strategies for FMs is of paramount significance; one illustrative strategy is sketched below.
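One widely used family of strategies for adapting very large models quickly and stably is parameter-efficient fine-tuning. The minimal low-rank adapter below, written in the style of LoRA and named here purely for illustration, freezes the pretrained weight and learns only a small rank-r update.

```python
# Minimal sketch of a low-rank adapter (LoRA-style) for parameter-efficient
# fine-tuning: the pretrained linear layer is frozen and only the small
# low-rank update A, B is trained. Names and hyperparameters are illustrative.
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    def __init__(self, linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = linear
        for p in self.base.parameters():
            p.requires_grad_(False)          # keep pretrained weights frozen
        self.A = nn.Parameter(torch.zeros(rank, linear.in_features))
        self.B = nn.Parameter(torch.zeros(linear.out_features, rank))
        nn.init.normal_(self.A, std=0.01)    # B stays zero, so the initial update is zero
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LowRankAdapter(nn.Linear(512, 512))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # trainable parameters only
```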
Feasible applications
FMs enable various bioinformatics applications, particularly in disease understanding and therapy, drug discovery, and personalized medicine. In cancer therapy, FMs could provide insights into the physiological functions of therapeutic targets and potentially replace traditional analytical methods for building cancer prognosis models [135]. In drug discovery, FMs can help uncover the relevant underlying phenomena, while in personalized medicine they can support the design of drugs tailored to a patient's genetics, genome, and health history. Additionally, in online healthcare, FMs can serve as the central repository and backbone of healthcare systems, powering question-answering systems and healthcare-assistive robots that leverage medical data and resources effectively [136].
Conclusions
In conclusion, with the rapid iteration and development of AI, the surge in data availability has brought unexpected opportunities and challenges for the field of bioinformatics. The vast increase in both annotated and unannotated biological data, coupled with AI advancements, provides an ideal environment for leveraging foundation models to transform computational biology. Below, we outline the key contributions of this survey:
Enhancements in bioinformatics through FMs: This survey illustrates how FMs have significantly advanced bioinformatics by addressing challenges with abundant unannotated data. Through pretraining on large and diverse datasets, FMs have demonstrated a remarkable capacity for capturing intricate biological representations, achieving state-of-the-art results in the analysis of biological sequences, structures, and functions. Their ability to generalize knowledge across various biological contexts makes FMs powerful tools for advancing research in computational biology.
Comparative strengths and challenges of FMs: We emphasize the advantages of FMs over traditional bioinformatics methods, highlighting their adaptability, superior performance, and ability to represent complex biological information. Unlike conventional models that often require task-specific training, FMs can be flexibly fine-tuned or used in zero-shot settings, making them versatile across a wide range of biological tasks. However, FMs also face notable challenges, including dealing with data sparsity, handling noisy or incomplete data, managing the computational costs of training on long biological sequences, and providing interpretable results in biological applications. Addressing these challenges is critical for further advancing their utility in the field.
Guidance and future prospects for innovations: This survey offers valuable insights for future research by summarizing the current applications and achievements of FMs in tackling biological challenges. We outline potential pathways for advancing FMs in bioinformatics, emphasizing the integration of domain-specific knowledge to narrow the gap between general AI and biological applications. Moreover, we advocate for the development of advanced training strategies and improved model architectures to enhance interpretability, efficiency, and overall performance. These prospects guide the research community toward overcoming current limitations and unlocking new opportunities for innovation in computational biology.
Conflict of interest: None declared.
Funding
This research was partially funded by the Research Grants Council of the Hong Kong Special Administrative Region, China [Project No.: CUHK 24204023] and by the Innovation and Technology Commission of the Hong Kong Special Administrative Region, China [Project No.: GHP/065/21SZ]. Additional support came from The Chinese University of Hong Kong (CUHK) under award numbers 4937025, 4937026, 5501517, 5501329, 8601603, and 8601663, as well as the Research Grants Council of the Hong Kong SAR [Project Nos.: CUHK 24204023, CUHK 14222922, and RGC GRF 2151185]. The Innovation and Technology Commission of the Hong Kong SAR also provided funding through Project No.: GHP/065/21SZ to Y.L. Moreover, this work was supported in part by the Shenzhen-Hong Kong Joint Funding Project (Category A) under Grant No.: SGDX20230116092056010 to S.W.
Data availability statement
No datasets have been utilized in this review paper.
Ethical statement
There are no ethical issues.