Abstract

Metagenomic data, comprising mixed multi-species genomes, are prevalent in diverse environments like oceans and soils, significantly impacting human health and ecological functions. However, current research relies on K-mer representations, which limit the capture of structurally and functionally relevant gene contexts. Moreover, these approaches struggle to encode biologically meaningful genes and fail to address the one-to-many and many-to-one relationships inherent in metagenomic data. To overcome these challenges, we introduce FGeneBERT, a novel metagenomic pre-trained model that employs a protein-based gene representation as a context-aware and structure-relevant tokenizer. FGeneBERT incorporates masked gene modeling to enhance the understanding of inter-gene contextual relationships and triplet enhanced metagenomic contrastive learning to elucidate gene sequence–function relationships. Pre-trained on over 100 million metagenomic sequences, FGeneBERT demonstrates superior performance on metagenomic datasets at four levels, spanning gene, functional, bacterial, and environmental levels and ranging from 1 to 213k input sequences. Case studies of ATP synthase and gene operons highlight FGeneBERT's capability for functional recognition and its biological relevance in metagenomic research.

Introduction

Metagenomics, the study of the mixed genomes of microbial communities in the environment (e.g. gut microbiomes or soil ecosystems) [1–3], has revealed the critical role these communities play in fundamental biological processes like enzyme synthesis, gene expression regulation, and immune function [4–6]. This deepened understanding highlights the need to accurately interpret the intricate genetic information contained within these diverse communities. Consequently, deciphering the complex sequences of multiple species in metagenomics is vital for unraveling life's mechanisms [7, 8] and advancing biotechnology [9, 10], with important applications in biodiversity conservation and epidemiology [11, 12].

Unlike traditional genomics focused on single species, metagenomics involves genetic material drawn directly from environmental samples, posing significant challenges due to sample diversity and species abundance [13, 14]. As shown in Fig. 1, the typical challenges in metagenomics are the one-to-many (OTM) and many-to-one (MTO) problems. The OTM problem indicates that a single gene can exhibit various functions in different genomic contexts, underscoring the significance of inter-gene interactions in function regulation [15]. For example, ATP synthase displays distinct functionalities in diverse organisms such as bacteria, plants, and humans (Fig. 1a). Conversely, the MTO problem implies that different genes can share the same function, emphasizing expression commonality [16]. For example, the CRISPR immune mechanism [16, 17] involves various proteins like Cpf1, Cas1, and Cas13, each contributing to the same defensive function (Fig. 1b) [18, 19]. (CRISPR-Cas systems are essential adaptive immune components in microorganisms that defend against mobile genetic elements and form the basis of advanced genome engineering technologies. Unlike the commonly used DNA-targeting Cas9 or Cas12 systems, Cas13 is an RNA-guided, programmable RNA-targeting system that enables gene manipulation at the transcriptional level.)

Figure 1. Motivation. Two types of complex relationships between gene sequences and functions in metagenomics. The One-to-Many problem means that the same gene may display different functions depending on the genomic context; for example, ATP synthase works differently in plants, heterotrophic bacteria, and humans. The Many-to-One problem shows that multiple genes may perform the same function; for instance, different genes from different bacteria, e.g. Cpf1, Cas1, etc., produce the same resistance function within the CRISPR immune system.

Recently, various computational methods have emerged for genomic and metagenomic data analysis. Traditional alignment-based methods like MetaPhlAn [20] aim to match similarities between query sequences and known reference genomes and are common for taxonomic profiling. Advancements in deep learning have led to new methods like CNN-MGP [21] and DeepVirFinder [22], which use CNNs with one-hot encoding for gene and viral classification. K-mer tokenization [23], employed in approaches like MDL4Microbiome [24], is a standard for DNA sequence characterization. Additionally, Virtifier [25] maps nucleotide sequences with a codon dictionary combined with an LSTM to predict viral genes. DeepMicrobes [26] employs a self-attention mechanism, DeepTE [27] uses K-mer inputs with CNNs for transposable element classification, and Genomic-nlp [28] applies word2vec for gene function analysis. MetaTransformer [29] uses K-mer embeddings for species classification with a Transformer. Among pre-training models, LookingGlass [30] uses a three-layer LSTM for functional prediction on short DNA reads. ViBE [31] employs a K-mer token-based BERT model pre-trained with masked language modeling for virus identification. The BERT model, effective for DNA sequence characterization, is limited by the Transformer architecture's computational burden. LOGO [32] addresses this by segmenting long sequences into 1–2 kb sub-sequences. Enformer [33] combines convolution with Transformers for long human genomic data. GenSLMs [34] introduce hierarchical language models for whole-genome modeling. DNABERT [35] is the first pre-trained model on the human genome focused on extracting efficient genomic representations; its successor, DNABERT2 [36], uses byte pair encoding on multi-species genomic data. NT [37] is trained on nucleotide sequences from humans and other species and evaluated on 18 genome prediction tasks. HyenaDNA [38] presents a long-range genomic model operating at single-nucleotide resolution on the human reference genome.

Despite these advancements, current approaches exhibit three critical limitations when analyzing metagenomic data. First, the semantic tokenizer problem: most machine learning-based methods [26, 28, 30] take K-mer counts, frequencies, or embeddings as input features, providing efficient alternatives to traditional homology searches against reference genome databases. However, K-mer features often have limited representational ability and fail to capture global information. Second, the function-driven modeling problem: although recent Transformer models excel at modeling complex DNA contexts through long-range dependencies, they are predominantly tailored for single-species genomic analysis [36, 37, 39]. Some models have also been developed for long-range DNA-related tasks, demonstrating promising performance in long-range species classification [40, 41], although the practical applications of this problem remain poorly defined [42]. These tasks, however, do not adequately address the OTM and MTO challenges. This limitation impedes their ability to accurately model the intricate relationships between genes, their functions across diverse genomic environments, and their connections among sequences with similar functions. Third, the low generalization problem persists in metagenomic models, with recent work [43] proposing optimized strategies to enhance model adaptability. Models like MetaTransformer [29] and ViBE [31], designed for specific tasks such as read classification and virus category prediction, fail to grasp the broader biological complexities of metagenomic data, limiting their generalization across diverse metagenomic tasks.

To address these challenges, we propose FGeneBERT (Function-Driven Pre-trained Gene Language Model for Metagenomics), a novel metagenomic pre-trained model designed to encode contextually aware and functionally relevant representations of metagenomic sequences. First, to solve the problem of encoding gene sequences with biological meaning, we propose a protein-based gene representation as a context-aware tokenizer, allowing a flexible token vocabulary for longer metagenomic sequences. This strategy leverages the protein functional and structural information inherently encoded in metagenomic data [44], overcoming the limitations of K-mer methods and maintaining functional consistency despite potential mutations [45]. Second, we propose two pre-training tasks for function-driven modeling, masked gene modeling (MGM) and triplet enhanced metagenomic contrastive learning (TMC), to enhance the co-representation learning of metagenomic gene sequences and functions. Third, FGeneBERT is pre-trained on over 100 million sequences, showcasing robust performance across diverse datasets spanning gene, functional, bacterial, and environmental levels.

In this work, we identify three key challenges in metagenomic analysis and propose FGeneBERT to address them. To the best of our knowledge, this is the first metagenomic pre-trained model encoding context-aware and function-relevant representations of metagenomic sequences. To summarize: (i) we introduce the new idea of protein-based gene representations to learn a biologically meaningful tokenization of long sequences; (ii) we propose MGM to model inter-gene relationships and TMC to learn the complex relationships between gene sequences and functions; (iii) we conduct extensive experiments across various downstream tasks, spanning gene, functional, bacterial, and environmental levels with input sizes from 1 to 213k sequences. FGeneBERT achieves state-of-the-art performance.

Methods

In this section, we provide a detailed description of the proposed pre-training model FGeneBERT, which contains the MGM and TMC components as depicted in Fig. 2.

Figure 2. Overview of FGeneBERT. A metagenomic sequence $\mathcal{X}$ is converted into ordered protein-based gene representations $\mathcal{G}$ via a context-aware tokenizer. Next, we pre-train a gene encoder with $\mathcal{L}_{\text{MGM}}$: 15% of these tokens are masked and predicted against labels $\mathcal{Y}$. Meanwhile, we introduce $\mathcal{L}_{\text{Tri}}$ to distinguish gene sequences; the data augmentation and negative sampling modules generate positive samples $\mathcal{G}^{+}$ and negative samples $\mathcal{G}^{-}$, respectively. Finally, after fine-tuning, FGeneBERT can handle various downstream tasks.

Datasets

We pre-train and evaluate FGeneBERT on distinct datasets ranging from thousands to hundreds of thousands of sequences. For pre-training, we utilize the MGnify database [updated February 2023 (https://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2023_02/)], comprising 2 973 257 435 protein sequences aggregated from 30 FASTA files of diverse microbial communities. Table 1 details the cataloged microbial species and genomes across environments (e.g. human gut: 289 232 genomes, 4744 species; marine: 1504 genomes, 1496 species), which are distinct from the protein sequences used for pre-training. For downstream evaluation, we test FGeneBERT across four task levels: (i) gene structure analysis, (ii) functional gene prediction, (iii) pathogenicity potential assessment, and (iv) nitrogen cycle prediction, using datasets split into training, validation, and test sets (8:1:1) with five-fold cross-validation. As detailed in Table 2: E-K12 [46] (4315 operons) for gene structure; CARD [47] (1966 sequences) for functional prediction across 269 AMR families (CARD-A), 37 drug classes (CARD-D), and 7 resistance mechanisms (CARD-R); VFDB core [48] (8945 sequences, 15 VF categories) for virulence factors; ENZYME [49] (5761 sequences, 7 categories) for enzyme functions with unique EC numbers [50]; PATRIC core [51] (5000 sequences, 110 classes) for pathogenicity; and NCycDB [52] (213 501 sequences, 68 gene families) for nitrogen cycle processes (see Supplementary section Overview of Downstream Tasks).
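As a concrete illustration of this evaluation protocol, the following is a minimal sketch of the 8:1:1 split with stratified five-fold cross-validation, assuming sequences and labels are already loaded as arrays; the function names and the use of scikit-learn are our own illustrative choices, not the authors' released pipeline.

```python
# Minimal sketch of the 8:1:1 train/validation/test split with stratified
# five-fold cross-validation; illustrative only, not the released pipeline.
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

def split_811(sequences, labels, seed=0):
    # Hold out 10% for test, then 1/9 of the remainder for validation,
    # yielding an overall 8:1:1 ratio.
    x_rest, x_test, y_rest, y_test = train_test_split(
        sequences, labels, test_size=0.1, stratify=labels, random_state=seed)
    x_train, x_val, y_train, y_val = train_test_split(
        x_rest, y_rest, test_size=1 / 9, stratify=y_rest, random_state=seed)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)

def five_folds(labels, seed=0):
    # Stratified five-fold indices over the training portion.
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return list(skf.split(np.zeros(len(labels)), labels))
```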

Table 1. Microbial species and genome numbers in each environment in the MGnify dataset. Columns denote environmental catalogues, microbial species, and assembled genomes, respectively.

Catalogues           Species   Genomes
Human gut            4744      289 232
Human oral           452       1225
Cow rumen            2729      5578
Marine               1496      1504
Pig gut              1376      3972
Zebrafish fecal      79        101
Non-model fish gut   172       196
Table 2. Description of experimental datasets. #Seq. and #Class denote the number of sequences and categories in each dataset.

Task (level)                                      Dataset   Description        #Seq.     #Class
Gene structure prediction (gene level)            E-K12     Gene operons       4315      1379
Functional prediction (functional level)          CARD-A    AMR family         1966      269
                                                  CARD-D    Drug class         1966      37
                                                  CARD-R    Resistance mech.   1966      7
                                                  VFDB      Virulence fact.    8945      15
                                                  ENZYME    Enzyme func.       5761      7
Pathogenicity prediction (bacteria level)         PATRIC    Pathogenic genes   5000      110
Nitrogen cycle prediction (environmental level)   NCycDB    Cycling genes      213 501   68

Notation

Given a dataset of metagenomic long sequences $\{\mathcal{X}_{i}\}_{i=1}^{m}$, we simplify each $\mathcal{X}_{i}$ into a set of shorter gene sequences $\{x_{i}\}_{i=1}^{n_{i}}$ using ultrasonic fragmentation and assembly techniques [53], where $n_{i}$ represents the variable number of gene sequences derived from each $\mathcal{X}_{i}$. Each gene sequence $x_{i}$ is tokenized into a vector $g_{i} \in \mathbb{R}^{d}$, where $d$ is the token dimension. Each sequence is associated with a reduced-dimensional representation $y_{i}\in \mathbb{R}^{100}$. Suppose a gene group $\mathcal{G}=\{g_{i}\}_{i=1}^{N}$ is formed by concatenating $N$ gene vectors sequentially. During the pre-training phase, each gene token $g_{i}\in \mathcal{G}$ is processed by the context-aware genome language encoder $\mathcal{F}(\cdot)$, generating the knowledge representation $z_{i}=\mathcal{F}(g_{i})$. These representations are then transformed by a fully connected layer into $\hat{z}_{i}=\mathcal{H}(z_{i})$, where $\mathcal{H}(\cdot)$ is a multi-classification head. In addition, we incorporate contrastive learning into the methodology: for each gene $x_{i}$, we introduce a data augmentation module to generate positive samples $x_{j(i)}$ and a hard negative sampling strategy to construct negative samples $x_{k(i)}$.

Context-aware masked gene modeling for the one-to-many problem

Context-aware tokenizer

To develop a biologically meaningful tokenization of long sequences, we design a context-aware tokenizer built on a protein language model (PLM), ESM-2 [54] with 15 billion parameters, integrating biological prior knowledge. As illustrated in Fig. 3, the tokenizer first extracts DNA gene sequences $\{x_{i}\}_{i=1}^{n_{i}}$ from a metagenomic sequence $\mathcal{X}_{i}$ using the European Nucleotide Archive (ENA) software [55]. This conversion enhances the flexibility of analyzing longer metagenomic sequences.

Figure 3. Framework of the context-aware tokenizer. Gene sequences $\{x_{i}\}_{i=1}^{n_{i}}$ extracted from a metagenomic sequence $\mathcal{X}_{i}$ are translated into amino acid sequences and encoded into ESM-2 representations $\{g_{i}\}_{i=1}^{n_{i}}$. These representations are then concatenated sequentially, every $N$ representations forming a gene group $\mathcal{G}$.

Secondly, these DNA sequences $x_{i}$ are translated into amino acid (AA) sequences using the Transeq software [56]. This translation addresses the issue of degenerate DNA codes, where certain non-"ATCG" symbols such as "Y" or "S" can represent multiple nucleotides (e.g. "Y" can be "C" or "T"). Translating to AA sequences eliminates this redundancy, as different DNA sequences can map to the same AA sequence, ensuring consistency in representing biologically equivalent sequences [57, 58]. Thirdly, these AA sequences are transformed into 1280-dimensional normalized ESM-2 representations, to which an additional gene orientation vector is appended, yielding 1281-dimensional gene representations $\{g_{i}\}_{i=1}^{n_{i}}$. The utility of ESM-2 lies in its ability to reveal the gene relationships and functional information inherent in metagenomic data [44], preserving important intra-gene contextually aware information. Finally, every $N$ consecutive representations are concatenated to form gene groups $\mathcal{G}$, which serve as the basis for subsequent modeling tasks.
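To make this pipeline concrete, below is a minimal sketch of the tokenizer, assuming Biopython for the DNA-to-AA translation step (standing in for Transeq) and the public fair-esm package with a smaller 1280-dimensional ESM-2 checkpoint (esm2_t33_650M_UR50D) in place of the 15-billion-parameter model; mean pooling and the scalar orientation flag are illustrative simplifications.

```python
# Minimal sketch of the context-aware tokenizer: DNA -> AA -> ESM-2 embedding
# (+ orientation) -> gene groups. Biopython and the 650M ESM-2 checkpoint are
# stand-ins for Transeq and the 15B model used in the paper.
import torch
import esm
from Bio.Seq import Seq

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()  # 1280-d representations
batch_converter = alphabet.get_batch_converter()
model.eval()

def gene_token(dna_seq: str, orientation: float) -> torch.Tensor:
    """Translate one gene to amino acids, embed with ESM-2, append orientation."""
    aa_seq = str(Seq(dna_seq).translate(to_stop=True))
    _, _, tokens = batch_converter([("gene", aa_seq)])
    with torch.no_grad():
        out = model(tokens, repr_layers=[33])
    emb = out["representations"][33][0, 1:-1].mean(dim=0)  # pool over residues
    emb = emb / emb.norm()                                 # normalize to unit length
    return torch.cat([emb, torch.tensor([orientation])])   # -> 1281-d g_i

def gene_groups(dna_seqs, orientations, N=128):
    """Stack consecutive gene tokens g_i into groups G of at most N tokens."""
    g = torch.stack([gene_token(s, o) for s, o in zip(dna_seqs, orientations)])
    return g.split(N)
```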

Masked gene modeling

We propose MGM to enhance the model's understanding of the relationships between genes within metagenomic sequences and their function regulation across diverse genomic environments (the OTM problem). During pre-training, each gene token is masked with a 15% probability and predicted based on its unmasked genome context $\mathcal{G}_{/M}$:

$\mathcal{L}_{\text{mask}} = -\sum_{i \in M} \log P\left(g_{i} \mid \mathcal{G}_{/M}\right), \qquad (1)$

where $M$ denotes the index set of the masked genes.
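A minimal sketch of this masking step is shown below, assuming gene groups are already tokenized; the learned [MASK] embedding and the 15% rate follow the description above, while everything else is illustrative.

```python
# Sketch of MGM masking: sample the masked index set M at a 15% rate and
# replace those gene tokens with a learned [MASK] embedding.
import torch

def mask_gene_group(G: torch.Tensor, mask_token: torch.Tensor, p: float = 0.15):
    """G: (N, d) gene group; mask_token: (d,) learned embedding."""
    M = torch.rand(G.size(0)) < p       # index set of masked genes
    G_masked = G.clone()
    G_masked[M] = mask_token            # these positions are predicted from G_/M
    return G_masked, M
```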

In addition, considering genetic polymorphism [59, 60], the MGM component focuses on detecting multiple genes that could coexist at a single genomic site, denoted as $\hat{z}_{i}=[\hat{z}_{1},\hat{z}_{2},\hat{z}_{3},\hat{z}_{4}]$. This requires the model not only to predict individual genes but also to identify the various combinations of genes occurring at the same site. Thus, we enhance the model with a comprehensive loss function, $\mathcal{L}_{\text{MGM}}$, which incorporates feature reconstruction and probability prediction:

$\mathcal{L}_{\text{MGM}} = \sum_{i \in M} \gamma\left(\hat{z}_{i}, y_{i}\right) + \alpha \sum_{i \in M} \ell\left(\tilde{z}_{i}, \tilde{y}_{i}\right), \qquad (2)$

where $\gamma$ is a reconstruction loss with the scaled cosine error, and $\alpha$ is a weighting factor balancing the two loss terms. The first term, the feature reconstruction loss (FRL), quantifies the distance between the model prediction $\hat{z}_{i}$ and its corresponding label $y_{i}$. The second term, the probability prediction loss (PPL), evaluates the discrepancy between the predicted embedding probability $\tilde{z}_{i}=\frac{e^{\hat{z}_{i}}}{\sum_{j=1}^{C}e^{\hat{z}_{j}}}$ and the true category probability $\tilde{y}_{i} = \frac{e^{y_{i}}}{\sum_{j=1}^{C}e^{y_{j}}}$, both obtained via the softmax function. $C$ denotes the number of gene combinations and is set to 4.
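A rough sketch of this loss is given below; the exponent used for the scaled cosine error and the soft cross-entropy form of the PPL term are our assumptions for illustration, with α = 0.4 taken from the sensitivity analysis later in the paper.

```python
# Rough sketch of the MGM loss (Eq. 2): a scaled-cosine feature reconstruction
# term (FRL) plus an alpha-weighted probability prediction term (PPL). Exact
# functional forms are assumptions for illustration.
import torch
import torch.nn.functional as F

def mgm_loss(z_hat: torch.Tensor, y: torch.Tensor, alpha=0.4, gamma=2.0):
    """z_hat, y: (batch, C) predictions and label embeddings, with C = 4."""
    # FRL: scaled cosine error between prediction and label.
    frl = (1.0 - F.cosine_similarity(z_hat, y, dim=-1)).pow(gamma).mean()
    # PPL: cross-entropy between softmaxed prediction and softmaxed label.
    log_z_tilde = F.log_softmax(z_hat, dim=-1)
    y_tilde = F.softmax(y, dim=-1)
    ppl = -(y_tilde * log_z_tilde).sum(dim=-1).mean()
    return frl + alpha * ppl
```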

Triplet metagenomic contrastive framework for the many-to-one problem

Contrastive learning

The MGM component enhances the model's ability to learn contextual relationships between genes, helping to alleviate the OTM problem present in metagenomic data. However, when faced with the MTO problem, the model's ability to describe the relationships between sequences sharing the same function remains weak. For instance, when annotating Enzyme Commission (EC) numbers for partial metagenomic sequences, we observe that sequences sharing the same EC number cluster closely in feature space, while those with distinct EC numbers are more separated [61], with recent advancements [62] optimizing feature extraction for improved clustering. Therefore, we introduce a contrastive learning technique to capture the functional relationships between gene classes, enabling different genes with similar functions to cluster together and further optimizing model training. Generally speaking, the objective of contrastive learning is to learn an embedding function $\mathcal{F}$ such that the distance between positive pairs is smaller than the distance between negative pairs:

$d\left(\mathcal{F}(x_{i}), \mathcal{F}(x_{j(i)})\right) < d\left(\mathcal{F}(x_{i}), \mathcal{F}(x_{k(i)})\right), \qquad (3)$

where $d(\cdot,\cdot)$ is a distance function (e.g. Euclidean distance) defined on the embedding space. We adopt the SupCon-Hard loss [63], which incorporates multiple positive and negative samples per anchor and enhances robustness by mining difficult samples, with recent work [64] extending its application to biological contexts. Additionally, data augmentation and negative sampling modules are included to create positive and negative samples, further improving the model's capacity to recognize commonalities among gene classes.
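The sketch below illustrates a SupCon-style loss with multiple positives and negatives per anchor, as a simplified stand-in for the SupCon-Hard loss adopted here; embeddings are assumed L2-normalized.

```python
# Simplified SupCon-style loss with several positives and negatives per anchor,
# standing in for the SupCon-Hard loss; embeddings assumed L2-normalized.
import torch

def supcon_loss(anchor, positives, negatives, tau=0.1):
    """anchor: (d,); positives: (P, d); negatives: (K, d)."""
    pos_sim = positives @ anchor / tau          # similarities to positives (P,)
    neg_sim = negatives @ anchor / tau          # similarities to negatives (K,)
    logits = torch.cat([pos_sim, neg_sim])
    # Pull each positive toward the anchor against all candidates.
    log_prob = pos_sim - torch.logsumexp(logits, dim=0)
    return -log_prob.mean()
```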

Positive sampling

The strategy for sampling triplets is crucial to learning a well-organized embedding space. For each anchor gene $x_{i}$ of a gene group $\mathcal{G}$ within a gene batch $I$, we propose a mutation strategy that augments orphan sequences (i.e. functions associated with a single sequence) to generate a large number of positive sample pairs $x_{j(i)}\in \mathcal{G}_{i}^{+}$, where $\mathcal{G}_{i}^{+}$ is the set of positive samples for anchor $x_{i}$. Specifically, 10 random mutations are performed for each gene sequence, with mutation ratios randomly generated according to a standard normal distribution; the number of mutated positions is calculated from the sequence length. This process generates new sequences that are functionally similar to the original but sequentially different, providing additional training data to improve the predictive power and accuracy for orphan EC numbers.
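A minimal sketch of this augmentation is given below; the 1% scaling of the |N(0, 1)| draw into a mutation ratio is our assumption, since only the distribution is specified above.

```python
# Sketch of mutation-based positive sampling: 10 random mutants per gene, with
# the mutation ratio drawn from |N(0, 1)| and scaled (the 1% factor is assumed).
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutate_positives(aa_seq: str, n_mutants: int = 10, seed: int = 0):
    rng = random.Random(seed)
    mutants = []
    for _ in range(n_mutants):
        ratio = abs(rng.gauss(0.0, 1.0)) * 0.01      # per-mutant mutation ratio
        n_mut = max(1, int(len(aa_seq) * ratio))     # number of mutated positions
        seq = list(aa_seq)
        for pos in rng.sample(range(len(seq)), n_mut):
            seq[pos] = rng.choice(AMINO_ACIDS.replace(seq[pos], ""))
        mutants.append("".join(seq))
    return mutants
```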

Hard negative sampling

Previous work [65] established that balancing triplet triviality and hardness is key to effective contrastive learning, and recent studies [66, 67] further refine this with advanced sampling strategies and efficiency improvements. For the negative sample pairs $x_{k(i)}\in \mathcal{G}_{i}^{-}$, where $\mathcal{G}_{i}^{-}$ represents the set of negative samples for the anchor $x_{i}$, this balance is particularly important. We determine the center of each functional group by averaging the embeddings of all sequences within that group and compute Euclidean distances $d(\cdot)$ with respect to these centers. For negative sample selection, we choose samples that are similar to the anchor in latent space but belong to different clusters, thus increasing the learning difficulty compared to random selection. The triplet loss $\mathcal{L}_{\text{Tri}}( x_{i}, \{ x_{j(i)} \}_{j=1}^{N_{j}}, \{ x_{k(i)} \}_{k=1}^{N_{k}})$ is defined as:

$\mathcal{L}_{\text{Tri}} = -\frac{1}{N_{j}} \sum_{j=1}^{N_{j}} \log \frac{\exp\left(S(x_{i}, x_{j(i)})/\tau\right)}{\exp\left(S(x_{i}, x_{j(i)})/\tau\right) + \sum_{k=1}^{N_{k}} \exp\left(S(x_{i}, x_{k(i)})/\tau\right)}. \qquad (4)$

For all negative samples in $\mathcal{G}_{i}^{-}$, the probability of selecting each negative sample $x_{k}$ for the anchor $x_{i}$ is given as follows:

$p_{x_{k}} = \frac{w_{x_{k}}}{\sum_{x_{m} \in \mathcal{G}_{i}^{-}} w_{x_{m}}}, \quad w_{x_{k}} = \frac{1}{d(x_{i}, x_{k})}, \qquad (5)$

where $\tau$ is the temperature hyper-parameter and $S$ is the similarity function, typically cosine similarity. The term $p_{x_{k}}$ represents the probability of selecting the negative sample $x_{k}$ for the anchor $x_{i}$, calculated as $p_{x_{k}} = {w_{x_{k}}}/{\sum_{x_{m} \in \mathcal{G}_{i}^{-}}w_{x_{m}}}$ with $w_{x_{k}} = \frac{1}{d(x_{i}, x_{k})}$.
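The following sketch implements this inverse-distance weighting; batching and cluster bookkeeping are omitted, and all names are illustrative.

```python
# Sketch of hard-negative sampling (Eq. 5): negatives closer to the anchor in
# latent space (but from other functional clusters) are drawn more often.
import torch

def sample_hard_negatives(anchor, candidates, n_samples=8):
    """anchor: (d,); candidates: (K, d) embeddings from different clusters."""
    dist = torch.cdist(anchor[None], candidates)[0]   # Euclidean d(x_i, x_k), (K,)
    w = 1.0 / dist.clamp(min=1e-8)                    # w_k = 1 / d(x_i, x_k)
    p = w / w.sum()                                   # p_k = w_k / sum_m w_m
    idx = torch.multinomial(p, n_samples, replacement=False)
    return candidates[idx]
```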

Finally, MGM and TMC constitute a unified pre-training framework with a total loss:

$\mathcal{L} = \mathcal{L}_{\text{MGM}} + \lambda \mathcal{L}_{\text{Tri}}, \qquad (6)$

where $\lambda$ is a hyper-parameter tuning the influence between the two loss functions.

Results and discussion

Experimental results on four levels of downstream tasks

Level 1 Task A: Gene structure analysis $\rightarrow$ gene operon prediction. This task aims to identify the transcription factor binding sites that are strongly correlated with operon regulation in the gene regulatory network, enhancing our understanding of gene regulation mechanisms. We utilize the Escherichia coli K12 RegulonDB dataset (E-K12) [46], a comprehensive resource of operon annotations, with further details provided in Supplementary Table S2.

Results analysis. The attention heatmap in Fig. 4 shows that the gene operon $\color{orange}{tolC}$ has high attention weights with the operons $\color{blue}{ygiB}$, $\color{blue}{ygiC}$, and $\color{blue}{yqiA}$. This suggests a significant interaction among these operons, indicating the presence of a shared genetic operon tolC-ygiB. This inference finds support in biological research, which associates these operons with the DUF1190 domain-containing protein YgiB [68].

Figure 4. Visualization of attention. The attention values are from the first head of the last (19th) attention layer. Darker shading indicates higher attention weight.

Table 3. Classification results on CARD-R.

Class name                      Total count   Correct num   Correct ratio
Antibiotic inactivation         252           238           94.44%
Antibiotic target alteration    70            59            84.29%
Antibiotic target protection    28            27            96.43%
Antibiotic efflux               25            18            72.00%
Antibiotic target replacement   14            12            85.72%

Level 2 Task B: Functional gene prediction $\rightarrow$ antimicrobial resistance gene (ARG) prediction. This task is crucial for understanding ARGs and facilitates the identification of resistance mechanisms. However, existing methods suffer from high false-positive rates and category bias [69, 70], necessitating deep learning methods that rapidly and accurately detect ARG presence in metagenomic data. We utilize the CARD dataset [47], which provides a structured classification framework for resistance profiles, and perform classification over its three annotation categories.

Results analysis. FGeneBERT’s performance on CARD-A is significant, as shown in Table 4. This category’s broad range (269 classifications) creates a long-tail distribution, necessitating an understanding of the biological properties of gene sequences for accurate annotation [71]. To mitigate this issue, we adjust the data sampling strategy to increase the frequency of fewer samples, improving the model’s prediction accuracy. Table 3 demonstrates FGeneBERT’s high prediction accuracy for the CARD-R category, exhibiting superior classification results for both majority and minority classes, with over 85% accuracy. Supplementary Table S8 reveals FGeneBERT has better performance for all 269 AMR Gene Family categories, with 100% classification accuracy for majority categories like CTX, ADC, and CMY, as well as for minor ones like AXC, CRH, and KLUC.

Table 4. Macro F1 (% ↑) and Weighted F1 (% ↑) on eight downstream tasks: gene operon prediction on E-K12, ARG prediction on the three CARD categories, virulence factor classification on VFDB, enzyme function annotation on ENZYME, microbial pathogen detection on PATRIC, and nitrogen cycle process prediction on NCycDB. RF denotes Random Forest and VT denotes Vanilla Transformer. The highest results are highlighted in boldface; the second highest are underlined. Each cell reports M.F1 / W.F1.

Method     E-K12         CARD-A        CARD-D        CARD-R        VFDB          ENZYME        PATRIC        NCycDB
RF         20.2 / 34.8   22.4 / 35.3   36.1 / 49.0   47.8 / 57.6   22.4 / 38.5   33.6 / 41.2   25.3 / 29.8   67.0 / 71.7
SVM        38.6 / 45.2   27.6 / 40.5   33.6 / 47.2   43.3 / 66.2   28.0 / 41.4   31.3 / 43.6   26.6 / 31.2   66.9 / 70.3
KNN        39.9 / 41.0   36.9 / 54.4   36.4 / 51.3   36.2 / 63.5   27.3 / 47.1   31.4 / 42.9   11.0 / 27.4   68.8 / 73.2
LSTM       40.4 / 42.5   47.1 / 60.3   39.1 / 62.3   47.5 / 84.2   36.7 / 66.3   42.8 / 51.0   41.3 / 49.7   71.9 / 81.2
BiLSTM     38.2 / 43.8   47.4 / 61.9   43.5 / 58.1   58.9 / 80.3   46.1 / 72.1   38.7 / 50.2   43.3 / 48.5   82.0 / 88.4
VT         43.3 / 47.8   57.1 / 70.0   49.8 / 68.1   55.7 / 86.4   58.0 / 81.0   68.2 / 75.8   49.8 / 57.3   84.5 / 90.7
HyenaDNA   42.4 / 47.1   50.9 / 68.2   53.6 / 78.1   66.2 / 88.1   61.0 / 70.4   79.6 / 83.6   51.1 / 57.6   92.4 / 96.0
ESM-2      38.2 / 42.5   57.2 / 71.4   56.0 / 82.1   68.2 / 90.0   60.7 / 84.4   92.5 / 96.7   56.0 / 67.5   95.8 / 96.1
NT         45.1 / 44.8   58.5 / 72.0   56.2 / 80.2   68.0 / 90.3   58.3 / 71.6   74.1 / 76.7   46.1 / 61.9   75.1 / 86.5
DNABERT2   51.7 / 52.4   65.2 / 79.8   51.5 / 78.7   61.2 / 88.6   58.2 / 82.3   85.4 / 85.2   52.9 / 60.6   88.6 / 95.7
Ours       61.8 / 65.4   78.6 / 90.1   57.4 / 85.2   69.4 / 91.4   75.7 / 90.2   99.1 / 98.8   99.3 / 99.0   99.5 / 99.2

Level 2 Task C: Functional gene prediction $\rightarrow$ virulence factor (VF) prediction. This task aims to detect microbial elements like bacterial toxins, which enhance pathogen infectivity and exacerbate antimicrobial resistance. Existing methods for metagenomic analysis, particularly those co-predicting ARGs and VFs, are inadequate and suffer from threshold sensitivity issues [72]. We employ the VFDB core dataset [48], a detailed repository of VF characteristics from well-characterized bacterial pathogens, to predict these critical microbial elements. Results analysis. Our model achieves SOTA results on VFDB, as reported in Table 4. It significantly outperforms genomic pre-trained models; for instance, its M.F1 and W.F1 scores improve by 30% and 9.5%, respectively, over DNABERT2. This highlights the limitations of directly applying genomic pre-trained models to metagenomic data for precise functional annotation. Conversely, ESM-2, as a PLM, excels by leveraging the intrinsic protein information in metagenomic data, highlighting its effectiveness.

Level 2 Task D: Functional gene prediction $\rightarrow$ enzyme function prediction. This task is critical for understanding metabolism and disease mechanisms in organisms. While traditional methods rely on time-consuming and labor-intensive biochemical experiments, advanced computational technologies offer efficient and accurate predictions for large-scale genomic data. We leverage the ENZYME core dataset [49], a comprehensive repository of enzyme nomenclature, to facilitate these predictions. Results analysis. Our experimental results demonstrate FGeneBERT's superior performance on the ENZYME dataset. It outperforms ESM-2, the second-highest method, by approximately 6.62% in M.F1 and 2.09% in W.F1, demonstrating its ability to discern distinct enzyme function characteristics. This observation highlights that our model not only captures gene–protein contextual relationships but also effectively models the relationships between sequences and functions within metagenomic data.

Level 3 Task E: Pathogenicity potential assessment $\rightarrow$ genome pathogen prediction. This task assesses the pathogenic potential of pathogens to cope with the public health risks posed by newly emerging pathogens. Accurate deep learning algorithms are key to the precise identification of pathogens, improving the ability to respond to drug resistance threats. We use the PATRIC core dataset [51], which contains 5000 pathogenic bacterial sequences across 110 classes. Results analysis. Table 4 shows FGeneBERT's classification of pathogenic bacterial species within PATRIC, demonstrating superior performance over baselines by recognizing crucial genus-level features. The PATRIC dataset presents a significant challenge due to its large number of categories and sparse data. Baselines generally underperform because they require more data to discern the subtle differences between numerous categories. In contrast, FGeneBERT stands out with M.F1 and W.F1 scores of 99.27% and 99.03%, respectively. This robust performance indicates its advanced learning capability, making it well suited for high-dimensional classification tasks and highlighting the benefits of protein-based gene representations for enhanced functional annotation accuracy.

Level 4 Task F: Nitrogen cycle prediction $\rightarrow$ nitrogen (N) cycling process prediction. This task focuses on the functional genes related to the N cycle, linking them to environmental and ecological processes. We utilize the NCycDB dataset [52], a comprehensive resource spanning key N cycle gene families and processes, to facilitate this analysis. Results analysis. Table 4 presents FGeneBERT's classification results on NCycDB, demonstrating its ability to recognize key features of particular N cycle processes and to improve gene family classification by recognizing domains. Although baselines show improved performance on NCycDB compared to PATRIC, owing to the larger amount of data per category aiding discrimination among diverse categories, FGeneBERT still leads with macro F1 (M.F1) and weighted F1 (W.F1) scores of 99.49% and 99.22%, respectively. However, pre-trained baselines require more time and memory for tokenizing large datasets, as analyzed in the Model efficiency analysis section.

Ablation study

Ablation study on the performance of MGM and TMC

We perform an ablation study to investigate the effectiveness of our proposed components. Table 5 compares the performance of FGeneBERT without the MGM and TMC modules across four datasets. Specifically, MGM is designed to address the OTM problem, while TMC is tailored to handle the MTO problem. The decrease in M.F1 score after removing MGM and TMC, respectively, highlights their roles in enhancing model performance across different tasks. While TMC contributes to performance, MGM evidently has the more substantial impact. Fig. 5 illustrates four more metrics on the CARD dataset.

Table 5. Ablation study of M.F1 on E-K12 and the three CARD categories. w/o. MGM denotes FGeneBERT without MGM; w/o. Triplet denotes FGeneBERT without TMC. Bold values highlight the best (maximum) results in each column; negative values indicate how much lower the corresponding result is than the best result.

Method         E-K12   CARD-A   CARD-D   CARD-R
FGeneBERT      61.8    78.7     57.4     69.4
w/o. MGM       -8.1    -6.7     -10.5    -6.7
w/o. Triplet   -7.4    -5.4     -5.8     -3.4

Figure 5. Ablation studies of our proposed modules on four downstream tasks.

Ablation study on visualization results of MGM

We conduct a visualization experiment to validate the effectiveness of MGM in the OTM scenario. ATP synthases can exhibit different functions in different organisms to adapt to their respective environmental conditions, even though the basic functions are the same [73].

We collect 1177 ATP synthase sequences from UniProt [74] and color them according to six taxonomies: Hydrolases (35.6%, dark blue), Isomerases (3.6%, orange), Ligases (35.7%, green), Lyases (10.7%, red), Oxidoreductases (8.9%, light purple), and Transferases (5.3%, brown), where the numbers in parentheses indicate category ratios. Figure 6(a) shows the clustering of ATPase protein embeddings without genome contextual analysis, where genes in different genome contexts are dispersed across clusters. Figure 6(b) presents the clustering of ATPase embeddings after our designed protein-based gene embedding, where genes belonging to the same category (the same color) clearly cluster together. Concretely, Isomerases (orange) and Transferases (brown), which account for the smallest percentages, cluster together, whereas in the left plot these two are scattered. This demonstrates that our proposed MGM can resolve the One-to-Many problem between sequences and functions.
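For reference, the visualization in Fig. 6 can be reproduced along the following lines, assuming the 1177 embeddings and their six-way class labels are already computed and saved; the file paths and library choices are illustrative.

```python
# Sketch of the t-SNE projection used in Fig. 6. `embeddings` (n, d) and the
# integer `labels` for the six enzyme classes are assumed precomputed.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.load("atp_synthase_embeddings.npy")   # placeholder path
labels = np.load("atp_synthase_labels.npy")           # placeholder path

coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=5)
plt.title("ATP synthase embeddings (t-SNE)")
plt.show()
```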

Figure 6. t-SNE visualization of different embeddings for ATP synthases. Each dot denotes a sequence and is grouped according to function.

Ablation study on clustering results of TMC

We further explore the impact of integrating TMC into our model on gene operon prediction in the Many-to-One (MTO) scenario. Initially, we evaluate FGeneBERT without the TMC module. Afterward, we add the TMC module and perform K-means clustering on the embeddings from the network's intermediate layers. We compute the normalized mutual information (NMI), adjusted Rand index (ARI), and silhouette coefficient to measure clustering quality. The results in Table 6 show significant improvements in ARI and silhouette coefficient with TMC, highlighting its effectiveness in enhancing clustering performance across various metrics.
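A minimal sketch of this evaluation is shown below, assuming the intermediate-layer embeddings and ground-truth operon labels are available as arrays; the scikit-learn calls mirror the three reported metrics.

```python
# Sketch of the clustering evaluation: K-means on intermediate-layer embeddings,
# scored with NMI, ARI, and the silhouette coefficient.
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score, silhouette_score)

def clustering_scores(embeddings, labels):
    k = len(set(labels))  # one cluster per ground-truth class
    pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    return {
        "NMI": normalized_mutual_info_score(labels, pred),
        "ARI": adjusted_rand_score(labels, pred),
        "Silhouette": silhouette_score(embeddings, pred),
    }
```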

Table 6. Clustering results of TMC for the many-to-one problem. Bold values indicate the best performance in each column.

Method       NMI    ARI    Silhouette coefficient
+Tokenizer   0.44   0.34   0.70
+MGM         0.66   0.51   0.72
+TMC         0.72   0.59   0.75

Ablation study on context-aware tokenizer

We replace the FGeneBERT’s tokenizer with ESM2 and BPE representations and assessed performance across downstream tasks. Table 7 shows that FGeneBERT outperforms both methods in capturing complex gene sequence-function relationships. Although DNABERT2 compresses sequences effectively with BPE, our context-aware tokenizer demonstrates superior accuracy in metagenomic data analysis. For complete results, see Supplementary Table S6.

Table 7. Ablation study using protein-based gene representation as a context-aware tokenizer. Bold values indicate the best performance; underlined values indicate the second-best.

Dataset   Method   Acc    M.Pre   M.Recall   M.F1   W.Pre   W.Recall   W.F1
E-K12     FGBERT   0.68   0.68    0.61       0.61   0.77    0.67       0.65
          ESM2     0.54   0.51    0.48       0.49   0.61    0.51       0.52
          BPE      0.59   0.58    0.55       0.54   0.67    0.54       0.55
CARD-A    FGBERT   0.91   0.77    0.80       0.78   0.90    0.91       0.90
          ESM2     0.82   0.74    0.76       0.73   0.87    0.82       0.81
          BPE      0.84   0.75    0.77       0.75   0.88    0.86       0.86

Model efficiency analysis

We analyze the time complexity and memory efficiency of our tokenizer compared with four genomic pre-trained methods on six datasets in Fig. 7. Our tokenizer demonstrates superior efficiency, achieving a significant reduction in time and memory usage. Notably, on the NCyc dataset (brown) with 213 501 sequences, ours reduces processing time by 31.05% and memory usage by 94.33% compared with DNABERT2. On the CARD dataset (orange) with 1966 sequences, time and memory usage decrease by 61.70% and 58.53%, respectively. Although HyenaDNA uses less memory on the Operons, CARD, VFDB, and ENZYME datasets, it underperforms ours in time cost and overall performance.

Figure 7. Comparative analysis of tokenization efficiency: time (s) vs. memory (MB). Each point denotes a specific dataset, with its size indicating the dataset's scale.

To further explore model scalability, we train three FGBERT variants (FGBERT-T, FGBERT-S, and FGBERT-B), differing in the number of layers, hidden dimensions, and attention heads. The variants are trained on 50 million, 100 million, and 150 million sequences, respectively, to assess the impact on time and memory during pre-training, as shown in Table 8.

Table 8. Time complexity, memory efficiency, and scalability of FGBERT during pre-training. P and H denote the amount of pre-training data and the hidden dimension; #L and #H denote the numbers of layers and attention heads. Each task cell reports results at 500 / 1000 / 1500 epochs.

Model      P      #L   H      #H   Operons              CARD-A               CARD-D               CARD-R
FGBERT-T   50M    10   640    5    50.9 / 61.5 / 63.5   74.6 / 88.6 / 90.7   72.4 / 83.7 / 86.4   77.3 / 90.1 / 91.1
FGBERT-S   100M   19   1280   10   55.1 / 65.4 / 65.9   80.4 / 90.1 / 91.2   76.9 / 85.2 / 87.5   81.4 / 91.4 / 93.0
FGBERT-B   150M   25   2560   25   57.1 / 66.7 / 67.6   82.7 / 91.1 / 93.2   81.0 / 87.6 / 90.0   83.7 / 93.4 / 94.8

Sensitivity analysis

Our sensitivity analysis indicates that FGeneBERT can be optimized effectively using a small batch size $b$ without large performance degradation, as shown in Fig. 8. This is important for resource-constrained scenarios, highlighting that our model maintains good performance even with limited data; we choose a batch size of 1000. Next, we analyze the balance ratio $\alpha$ on the CARD dataset. FGeneBERT obtains good performance for different values of $\alpha$, demonstrating its robustness and insensitivity to this hyper-parameter; we set $\alpha$ to 0.4.

Figure 8. Sensitivity w.r.t. hyper-parameters $\alpha$ and $b$ on the CARD dataset (AMR gene family).

Model configurations

All experiments are performed on four NVIDIA V100 GPUs using the PyTorch framework. The encoder of FGeneBERT is initialized from RoBERTa [75]. For pre-training, the parameters are set as follows: a batch size of 1000, 19 encoder layers, 10 attention heads, an embedding dimension of 1280, and relative position encoding. During the pre-training stage, the model is trained for 500 epochs using the AdamW optimizer [76] with a weight decay of 0.02. The learning rate starts at 1e-5, warms up to 1e-4 over the first 5000 steps, and then decreases back to 1e-5 following a cosine decay schedule. The complete model consists of 954.73 million parameters and requires 2.55 billion FLOPs, as detailed in Supplementary Table S4.
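The schedule described above can be sketched as follows; the total step count is an assumption, and the stand-in module only serves to make the snippet self-contained.

```python
# Sketch of the pre-training optimizer: AdamW (weight decay 0.02) with a linear
# 1e-5 -> 1e-4 warmup over 5000 steps, then cosine decay back to 1e-5. The total
# step count and the stand-in module are assumptions.
import math
import torch

model = torch.nn.Linear(1280, 1280)  # stand-in for the FGeneBERT encoder
opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.02)

def lr_factor(step, warmup=5000, total=500_000, lr_min=1e-5, lr_max=1e-4):
    if step < warmup:  # linear warmup from lr_min to lr_max
        lr = lr_min + (lr_max - lr_min) * step / warmup
    else:              # cosine decay from lr_max back down to lr_min
        t = (step - warmup) / max(1, total - warmup)
        lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
    return lr / lr_max  # LambdaLR multiplies the base lr (1e-4) by this factor

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_factor)
```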

Conclusion

In this paper, we propose the new idea of protein-based gene representation, preserving essential biological characteristics within each gene sequence. Building on this context-aware tokenizer, we propose MGM, a gene group-level pre-training task designed to learn the interactions between genes. Additionally, we develop TMC, a contrastive learning module that generates multiple positive and negative samples to distinguish gene sequences. Together, MGM and TMC constitute the joint pre-training model FGeneBERT for metagenomic data. Our experiments and visualizations demonstrate the superior performance of our model. In future work, we will explore how to incorporate multi-omics data, such as metabolomics, into our metagenomic pre-trained model.

Key Points
  • Metagenomic analysis faces three key challenges: the lack of context-aware representations, limited capture of inter-gene relationships, and inefficient modeling of gene-function associations in long sequences.

  • We introduce FGeneBERT, the first metagenomic pre-trained model that encodes biologically meaningful, context-aware, and function-relevant representations of metagenomic data.

  • FGeneBERT employs protein-based gene representations for efficient tokenization of long sequences and uses MGM and TMC modules to model inter-gene relationships and gene-function associations, respectively.

  • Comprehensive experiments show that FGeneBERT consistently achieves SOTA performance in metagenomic analysis.

Conflict of interest: No competing interest is declared.

Funding

This work was supported by National Science and Technology Major Project (No. 2022ZD0115101), National Natural Science Foundation of China Project (No. U21A20427), Project (No. WU2022A009) from the Center of Synthetic Biology and Integrated Bioengineering of Westlake University, Project (No. WU2023C019) from the Westlake University Industries of the Future Research Funding. We thank the AI Station of Westlake University for the support of GPUs. This work was supported by the InnoHK program and Ant Group through CAAI-Ant Research Fund.

Data availability

All data processed in this study were obtained exclusively from public sources and are available at https://github.com/jane-pyc/OpenMeta/datasets. Source codes are freely available at https://github.com/jane-pyc/OpenMeta/models/FGBERT.

References

1

Mathieu
 
A
,
Leclercq
 
M
,
Sanabria
 
M
. et al.  
Machine learning and deep learning applications in metagenomic taxonomy and functional annotation
.
Front Microbiol
 
2022
;
13
:
811495
.

2

De
 
D
,
Nayak
 
T
,
Das
 
G
. et al.  
Metagenomics and bioinformatics in microbial ecology: current status and beyond
. In: Thatoi H, Pradhan SK, Kumar U. (Eds.),
Applications of Metagenomics
.
Amsterdam, Netherlands: Elsevier
,
2024
, pp.
359
85
. .

3

Han
 
J
,
Zhang
 
H
,
Ning
 
K
.
Techniques for learning and transferring knowledge for microbiome-based classification and prediction: review and assessment
.
Brief Bioinform
 
2025
;
26
:
bbaf015
.

4

Duan
 
CR
,
Zang
 
Z
,
Li
 
S
. et al.  
Phylogen: language model-enhanced phylogenetic inference via graph structure generation
.
Adv Neural Inform Process Syst
 
2024
;
37
:
131676
703
.

5

Ariaeenejad
 
S
,
Gharechahi
 
J
,
Shahraki
 
MF
. et al.  
Precision enzyme discovery through targeted mining of metagenomic data
.
Na Products Bioprospect
 
2024
;
14
:7.

6

Teukam
 
YGN
,
Zipoli
 
F
,
Laino
 
T
. et al.  
Integrating genetic algorithms and language models for enhanced enzyme design
.
Brief Bioinform
 
2025
;
26
:bbae675.

7

Albertsen
 
M
.
Long-read metagenomics paves the way toward a complete microbial tree of life
.
Nat Methods
 
2023
;
20
:
30
1
.

8

Lin
 
Z
,
Akin
 
H
,
Rao
 
R
. et al.  
Evolutionary-scale prediction of atomic-level protein structure with a language model
.
Science
 
2023
;
379
:
1123
30
.

9

Liu
 
S
,
Moon
 
CD
,
Zheng
 
N
. et al.  
Opportunities and challenges of using metagenomic data to bring uncultured microbes into cultivation
.
Microbiome
 
2022
;
10
:76.

10

Aplakidou
 
E
,
Vergoulidis
 
N
,
Chasapi
 
M
. et al.  
Visualizing metagenomic and metatranscriptomic data: a comprehensive review
.
Comput Struct Biotechnol J
 
2024
;
23
:
2011
33
.

11

Sarumi
 
OA
,
Heider
 
D
.
Large language models and their applications in bioinformatics
.
Comput Struct Biotechnol J
 
2024
;
23
:
3498
505
.

12

Duan
 
CR
,
Zang
 
Z
,
Li
 
S
. et al.  
Phylogen: language model-enhanced phylogenetic inference via graph structure generation
.
Adv Neural Inform Process Syst
 
2025
;
37
:
131676
703
.

13

Zhang
 
Z
,
Duan
 
C
,
Lin
 
T
. et al.  
Gvfom: a novel external force for active contour based image segmentation
.
Inform Sci
 
2020
;
506
:
1
18
.

14

Hongyuan
 
L
,
Diaz
 
DJ
,
Czarnecki
 
NJ
. et al.  
Machine learning-aided engineering of hydrolases for pet depolymerization
.
Nature
 
2022
;
604
:
662
7
.

15

Yang
 
C
,
Chowdhury
 
D
,
Zhang
 
Z
. et al.  
A review of computational tools for generating metagenome-assembled genomes from metagenomic sequencing data
.
Comput Struct Biotechnol J
 
2021
;
19
:
6301
14
.

16

Al-Shayeb
 
B
,
Skopintsev
 
P
,
Soczek
 
KM
. et al.  
Diverse virus-encoded crispr-cas systems include streamlined genome editors
.
Cell
 
2022
;
185
:
4574
4586.e16
.

17

Yanping
 
H
,
Chen
 
Y
,
Jing
 
X
. et al.  
Metagenomic discovery of novel crispr-cas13 systems
.
Cell Discov
 
2022
;
8
:
107
.

18

Yang
 
H
,
Patel
 
DJ
.
Structures, mechanisms and applications of rna-centric crispr–cas13
.
Nat Chem Biol
 
2024
;
20
:
673
88
.

19

Zilberzwige-Tal
 
S
,
Altae-Tran
 
H
,
Kannan
 
S
. et al.  
Reprogrammable rna-targeting crispr systems evolved from rna toxin-antitoxins
.
Cell
 
2025
. Advance online publication.

20

Segata
 
N
,
Waldron
 
L
,
Ballarini
 
A
. et al.  
Metagenomic microbial community profiling using unique clade-specific marker genes
.
Nat Methods
 
2012
;
9
:
811
4
.

21

Al-Ajlan
 
A
,
El Allali
 
A
.
Cnn-mgp: convolutional neural networks for metagenomics gene prediction
.
Interdiscipl Sci: Comput Life Sci
 
2019
;
11
:
628
35
.

22

Ren
 
J
,
Song
 
K
,
Deng
 
C
. et al.  
Identifying viruses from metagenomic data using deep learning
.
Quant Biol
 
2020
;
8
:
64
77
.

23

Fiannaca A, La Paglia L, La Rosa M, et al. Deep learning models for bacteria taxonomic classification of metagenomic data. BMC Bioinform 2018;19:61–76.

24

Lee SJ, Rho M. Multimodal deep learning applied to classify healthy and disease states of human microbiome. Sci Rep 2022;12:824.

25

Miao Y, Liu F, Hou T, et al. Virtifier: a deep learning-based identifier for viral sequences from metagenomes. Bioinformatics 2022;38:1216–22.

26

Liang Q, Bible PW, Liu Y, et al. Deepmicrobes: taxonomic classification for metagenomics with deep learning. NAR Genomics Bioinform 2020;2:lqaa009.

27

Yan H, Bombarely A, Li S. Deepte: a computational method for de novo classification of transposons with convolutional neural network. Bioinformatics 2020;36:4269–75.

28

Miller D, Stern A, Burstein D. Deciphering microbial gene function using natural language processing. Nat Commun 2022;13:5731.

29

Wichmann A, Buschong E, Müller A, et al. Metatransformer: deep metagenomic sequencing read classification using self-attention models. NAR Genomics Bioinform 2023;5:lqad082.

30

Hoarfrost A, Aptekmann A, Farfañuk G, et al. Deep learning of a bacterial and archaeal universal language of life enables transfer learning and illuminates microbial dark matter. Nat Commun 2022;13:2606.

31

Gwak H-J, Rho M. Vibe: a hierarchical bert model to identify eukaryotic viruses using metagenome sequencing data. Brief Bioinform 2022;23:bbac204.

32

Yang M, Huang L, Huang H, et al. Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution. Nucleic Acids Res 2022;50:e81.

33

Avsec Ž, Agarwal V, Visentin D, et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods 2021;18:1196–203.

34

Zvyagin M, Brace A, Hippe K, et al. Genslms: genome-scale language models reveal sars-cov-2 evolutionary dynamics. Int J High Perform Comput Appl 2022;37:683–705.

35

Ji Y, Zhou Z, Liu H, et al. Dnabert: pre-trained bidirectional encoder representations from transformers model for dna-language in genome. Bioinformatics 2021;37:2112–20.

36

Zhou Z, Ji Y, Li W, et al. Dnabert-2: efficient foundation model and benchmark for multi-species genome. arXiv preprint arXiv:2306.15006. 2023.

37

Dalla-Torre H, Gonzalez L, Mendoza-Revilla J, et al. Nucleotide transformer: building and evaluating robust foundation models for human genomics. Nat Methods 2024;21:1–11.

38

Nguyen E, Poli M, Faizi M, et al. Hyenadna: long-range genomic sequence modeling at single nucleotide resolution. Adv Neural Inform Process Syst 2023;36:43177–201.

39

Wolf T, Debut L, Sanh V, et al. Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2020, 38–45.

40

Fishman V, Kuratov Y, Shmelev A, et al. Gena-lm: a family of open-source foundational dna language models for long sequences. Nucleic Acids Res 2025;53:gkae1310.

41

Ma M, Liu G, Cao C, et al. Hybridna: a hybrid transformer-mamba2 long-range dna language model. 2025.

42

Cheng W, Song Z, Yang Z, et al. Dnalongbench: a benchmark suite for long-range dna prediction tasks. bioRxiv 2025;2025.01.

43

Gündüz HA, Mreches R, Moosbauer J, et al. Optimized model architectures for deep learning on genomic data. Commun Biol 2024;7:516.

44

Pavlopoulos GA, Baltoumas FA, Liu S, et al. Unraveling the functional dark matter through global metagenomics. Nature 2023;622:594–602.

45

D’Onofrio DJ, Abel DL. Redundancy of the genetic code enables translational pausing. Front Genet 2014;5:140.

46

Salgado H, Martínez-Flores I, Bustamante VH, et al. Using regulondb, the Escherichia coli k-12 gene regulatory transcriptional network database. Curr Protoc Bioinformatics 2018;61:1–32.

47

Jia B, Raphenya AR, Alcock B, et al. Card 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Res 2017;45:D566–73.

48

Chen L, Yang J, Yu J, et al. Vfdb: a reference database for bacterial virulence factors. Nucleic Acids Res 2005;33:D325–8.

49

Bairoch A. The enzyme database in 2000. Nucleic Acids Res 2000;28:304–5.

50

McDonald AG, Tipton KF. Enzyme nomenclature and classification: the state of the art. FEBS J 2023;290:2214–31.

51

Gillespie JJ, Wattam AR, Cammer SA, et al. Patric: the comprehensive bacterial bioinformatics resource with a focus on human pathogenic species. Infect Immun 2011;79:4286–98.

52

Tu Q, Lin L, Cheng L, et al. Ncycdb: a curated integrative database for fast and accurate metagenomic profiling of nitrogen cycling genes. Bioinformatics 2019;35:1040–8.

53

Kusters KA, Pratsinis SE, Thoma SG, et al. Ultrasonic fragmentation of agglomerate powders. Chem Eng Sci 1993;48:4119–27.

54

Lin Z, Akin H, Rao R, et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv 2022:500902.

55

Gruenstaeudl M. annonex2embl: automatic preparation of annotated dna sequences for bulk submissions to ena. Bioinformatics 2020;36:3841–8.

56

McWilliam H, Li W, Uludag M, et al. Analysis tool web services from the embl-ebi. Nucleic Acids Res 2013;41:W597–600.

57

Lawson CL, Swigon D, Murakami KS, et al. Catabolite activator protein: dna binding and transcription activation. Curr Opin Struct Biol 2004;14:10–20.

58

Jain S, Hassanzadeh FF, Schwartz M, et al. Duplication-correcting codes for data storage in the dna of living organisms. IEEE Trans Inform Theory 2017;63:4996–5010.

59

Pastinen T, Ge B, Hudson TJ. Influence of human genome polymorphism on gene expression. Hum Mol Genet 2006;15:R9–R16.

60

Zhang Z, Han Z, Yuqing W, et al. Metagenomics assembled genome scale analysis revealed the microbial diversity and genetic polymorphism of Lactiplantibacillus plantarum in traditional fermented foods of Hainan, China. Food Res Int 2021;150:110785.

61

Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with alphafold. Nature 2021;596:583–9.

62

Yang Y, Jerger A, Feng S, et al. Improved enzyme functional annotation prediction using contrastive learning with structural inference. Commun Biol 2024;7:1690.

63

Khosla P, Teterwak P, Wang C, et al. Supervised contrastive learning. Adv Neural Inform Process Syst 2020;33:18661–73.

64

Mikhael PG, Chinn I, Barzilay R. Clipzyme: reaction-conditioned virtual screening of enzymes. In: Proceedings of the 41st International Conference on Machine Learning (ICML'24). Vienna, Austria: JMLR.org, 2024, Article No. 1453, 1–17.

65

Hermans A, Beyer L, Leibe B. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737. 2017.

66

Hu H, Wang X, Zhang Y, et al. A comprehensive survey on contrastive learning. Neurocomputing 2024;610:128645.

67

Naeve Z, Mitchell L, Reed C, et al. Introducing dynamic token embedding sampling of large language models for improved inference accuracy. Authorea Preprints 2024.

68

Karp PD, Billington R, Caspi R, et al. The biocyc collection of microbial genomes and metabolic pathways. Brief Bioinform 2019;20:1085–93.

69

Jian Z, Zeng L, Taojie X, et al. Antibiotic resistance genes in bacteria: occurrence, spread, and control. J Basic Microbiol 2021;61:1049–70.

70

Arnold BJ, Huang I-T, Hanage WP. Horizontal gene transfer and adaptive evolution in bacteria. Nat Rev Microbiol 2022;20:206–18.

71

McArthur AG, Waglechner N, Nizam F, et al. The comprehensive antibiotic resistance database. Antimicrob Agents Chemother 2013;57:3348–57.

72

Yang Y, Jiang X, Chai B, et al. Args-oap: online analysis pipeline for antibiotic resistance genes detection from metagenomic data using an integrated structured arg-database. Bioinformatics 2016;32:2346–51.

73

Hong S, Pedersen PL. Atp synthase and the actions of inhibitors utilized to study its roles in human health, disease, and other scientific areas. Microbiol Mol Biol Rev 2008;72:590–641.

74

The UniProt Consortium. Uniprot: the universal protein knowledgebase in 2023. Nucleic Acids Res 2023;51:D523–31.

75

Devlin J, Chang M-W, Lee K, et al. Bert: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, 2019, 4171–86.

76

Loshchilov I, Hutter F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. 2017.


Supplementary data