Jiabei Cheng, Xiaoyong Pan, Yi Fang, Kaiyuan Yang, Yiming Xue, Qingran Yan, Ye Yuan, GexMolGen: cross-modal generation of hit-like molecules via large language model encoding of gene expression signatures, Briefings in Bioinformatics, Volume 25, Issue 6, November 2024, bbae525, https://doi.org/10.1093/bib/bbae525
Abstract
Designing de novo molecules with specific biological activity is an essential task, as it holds the potential to bypass the exploration of target genes, an initial step in the modern drug discovery paradigm. However, traditional methods mainly screen molecules by comparing the desired molecular effects against documented experimental results. This process is constrained by the available datasets and cannot perform direct cross-modal comparisons. Therefore, we propose a solution based on cross-modal generation called GexMolGen (Gene Expression-based Molecule Generator), which generates hit-like molecules from gene expression signatures alone. These signatures are calculated from a pair of inputs: the control and the desired gene expression states. Our model GexMolGen adopts a “first-align-then-generate” strategy, aligning gene expression signatures and molecules within a shared mapping space to ensure a smooth cross-modal transition. The transformed molecular embeddings are then decoded into molecular graphs. In addition, we employ an advanced single-cell large language model for input flexibility and pre-train a scaffold-based molecular model so that all generated molecules are 100% valid. Empirical results show that our model can produce molecules highly similar to known references, whether fed in-domain or out-of-domain transcriptome data, and that it can also serve as a reliable tool for cross-modal screening.
Introduction
Designing novel molecules based on genetic information is a challenging but promising task [1]. Traditional search methods like CMap [2] compare observed gene expression changes, referred to as gene expression signatures, with those recorded in molecular perturbation experiments, which leads to a highly restricted search space that cannot accommodate the vast possibilities within the molecular space. Furthermore, they cannot effectively narrow down this search space by directly comparing cross-modal similarities.
Several existing in silico methods attempt to overcome these shortcomings. Some models, like DLEPS [3], take an indirect approach: they first predict gene expression signatures as drug effects and then use these signatures to screen molecules, as traditional methods do. Other methods, such as the Wasserstein generative adversarial network (WGAN) [4] and Gex2SGen [5], directly take gene expression signatures as input and generate candidate molecules; WGAN builds on GANs, while Gex2SGen uses variational autoencoders (VAEs). However, these models still have limited generalization ability. For example, they require input gene expression signatures that match the format used during training; otherwise, re-training is required. Additionally, they mostly represent molecules with the simplified molecular-input line-entry system (SMILES). SMILES is a popular one-dimensional representation that can be treated as a kind of “chemical text”, which makes it easy to exploit advanced sequence models such as the Transformer [6]. However, this representation fails to preserve structural information and has limited potential to extend to higher dimensions [7]. Meanwhile, the hypothesis behind this task is that molecules inducing similar gene expression signatures are likely to share structural features such as scaffolds or pharmacophoric attributes; discarding structural information therefore makes it harder to probe the underlying mechanisms [8]. Therefore, exploring new forms of modal encoders and modal interaction is necessary.
For molecular encoders, there are various graph-based approaches. For example, molecular autoencoders represent a molecule as a graph, where vertices correspond to atoms and edges to chemical bonds [9, 10]. These methods intuitively capture molecular topology but suffer from low validity in generation, which limits their efficiency. Another line of research involves flow-based methods like MolFlow [11], which achieve a perfect reconstruction rate but can be very computationally expensive for long chains. Scaffold-based approaches, such as hierVAE [12], use scaffolds as building blocks instead of individual atoms and bonds, ensuring high validity at a modest computational cost.
For genetic encoders, given the remarkable success of foundation models across many fields, including omics data in the life sciences, single-cell large language models hold great potential as general encoders in this domain. These models are pre-trained on large, unlabeled single-cell expression datasets and then fine-tuned for specific tasks [13]. High-quality models such as scBERT [14], Geneformer [15], TOSICA [16], scGPT [17], scFoundation [18], SCimilarity [19], and GeneCompass [20] can accept flexible gene expression inputs owing to their strong generalization ability.
As to modal interaction, the “first-align-then-generate” strategy, inspired by models like DALL·E [21, 22], has gained great interest in biology. Recent work, including text-molecule matching [23], text-protein sequence generation [24], and PLIP, a pathology language and image pre-training method [25], has achieved impressive performance.
Therefore, we propose GexMolGen, a cross-modal generation-based approach for designing molecules with specific biological activities.
Materials and methods
Overview of GexMolGen
GexMolGen is a general framework capable of generating hit-like molecules from two specific gene expression inputs: initial and desired or post-perturbed states. The underlying principle is that molecular structure correlates with its resulting differential gene expression, also called gene expression signatures. GexMolGen adopts a strategy of “first-align-then-generate” for cross-modal generation. As shown in Fig. 1A, the pipeline generally consists of four steps.

The workflow of GexMolGen employs a "first-align-then-generate" strategy (A), consisting of four steps: Encoding (B and C), Aligning (A), Transforming (E) and Decoding (D).
The first step involves the encoding of each modality. For the molecular encoder (illustrated in Fig. 1B), we pre-train a scaffold-based molecular generator [12] on the ChEMBL [26] dataset and then use its encoder to produce molecular embeddings, thereby capturing structural information more efficiently than SMILES-based encoders [3, 4]. For the genetic encoder (visualized in Fig. 1C), we apply an advanced single-cell foundation model, scGPT [17], to encode pre- and post-perturbation gene expressions. We then calculate the difference within the gene embedding space to represent expression signatures.
The second step, as presented in Fig. 1C, is to fine-tune each encoder through alignment. As complex negative examples are crucial to the final alignment quality, we employ a hierarchical gene-molecule matching approach that enhances contrastive learning with iteratively refined granularity. The alignment projects gene expression signatures and molecules into a unified space, preparing for the transition from genetic embeddings to molecular ones.
The third step, as depicted in Fig. 1E, is implemented by a module entitled “Facilitator”, which is trained using a teacher-forcing technique and applied in an auto-regressive manner. Molecular embeddings are discretized into molecular tokens through principal component analysis (PCA) and binning. Beam search is adopted in this module to optimize the diversity of the generation.
Finally, the inferred tokens are mapped back to continuous molecular embeddings via retrieval and inverse PCA. They are then fed into the molecular decoder (as shown in Fig. 1D) to generate molecular graphs. During the training phase, this module computes the reconstruction loss to close the training loop within the framework.
The molecular encoder and the alignment phase are needed only during training. For generation, only the initial and desired gene expressions and the number of molecular candidates are required. For the retrieval function, additional molecules are required as a screening list. To our knowledge, GexMolGen is the first work that applies large language models to generate molecules cross-modally.
Dataset pre-processing
Dataset used for training
The datasets used in GexMolGen are derived from the L1000 public dataset (GEO ID: GSE70138) [27]. The L1000 technology uses microarray measurements to record the gene expression signatures caused by over 25 000 perturbations, including around 19 800 small molecules, in cell lines. Each perturbation is performed on multiple cell lines. This dataset is processed hierarchically and divided into five levels: levels 1–3 record the absolute expression values of each gene in each bead, while levels 4 and 5 report the differential expression of each gene obtained by comparing samples within each bead.
Specifically, we first collect the perturbed and control data from level 3, ensuring they originate from the same bead. These data belong to six major cell lines: VCAP, PC3, A549, A375, HT29, and MCF7. We average each molecule’s perturbed gene expression profiles, ignoring the cell line information, resulting in a final dataset size of |$2\times N \times M$|. Here, 2 signifies the experimental and control groups, |$N$| represents the number of molecules, and |$M$| indicates the number of gene symbols. Subsequently, we classify the structures based on the main layer of the InChIKey, resulting in a total of 6360 classes. We randomly select 300 structure classes from this set, which include |$N_{2}$| types of small molecules, to form the hold-out set. Additionally, we choose another 300 structure classes, including |$N_{3}$| types of small molecules, to create the validation set. The remaining data, with |$N_{1}$| types of small molecules, are used for training the model.
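As an illustration of this construction, the short pandas sketch below averages replicate perturbed profiles per molecule, keeps the matched controls, and derives structure classes from the InChIKey main layer; the column names and the assumption that controls are already matched to their perturbations are hypothetical, not taken from the L1000 release.

```python
# Hypothetical sketch of the dataset construction; column names are illustrative.
import pandas as pd

def build_expression_pairs(level3: pd.DataFrame):
    """level3: one row per sample with columns 'pert_inchikey', 'group' ('trt'/'ctl'),
    plus M gene-expression columns; controls are assumed pre-matched to perturbations."""
    gene_cols = [c for c in level3.columns if c not in ("pert_inchikey", "group")]
    perturbed = (level3[level3["group"] == "trt"]
                 .groupby("pert_inchikey")[gene_cols].mean())   # N x M perturbed matrix
    control = (level3[level3["group"] == "ctl"]
               .groupby("pert_inchikey")[gene_cols].mean())     # matched control matrix
    # Structure class from the InChIKey main (connectivity) layer: first 14 characters.
    structure_class = perturbed.index.str[:14]
    return perturbed, control, structure_class
```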
We randomly sample |$N_{p}$| molecules from ChEMBL [26] as datasets for the molecular pre-training stage.
Dataset used for evaluating
In addition to the hold-out compound-induced gene expression data, we follow the same procedure as WGAN [4], our benchmarking method, to gather known inhibitors and genetic inputs for the comparative analysis of molecules generated from knock-out gene expressions. Initially, we select known inhibitors of 10 human genes from the ExCape dataset [28]. These genes are AKT1, AKT2, AURKB, CTSK, EGFR, HDAC1, MTOR, PIK3CA, SMAD3, and TP53. Our selection criteria are a pIC50 value greater than or equal to 5, absence of the same InChIKey from the training set, and compliance with valence bond rules. Eventually, we collect 2153, 1293, 2122, 1609, 4244, 1939, 2858, 2292, 23 758, and 12 860 known inhibitors for these genes, respectively.
For each of these 10 genes, we extract 100 transcriptome profiles from both the perturbed and control groups in the CRISPR knock-out perturbation data of the L1000 dataset [27].
Molecular encoder
For the molecular encoder (see Fig. 1B), hierVAE [12] regards motifs as the smallest building blocks for generating small molecules. This approach achieves 100% validity in molecular generation. It encodes the input molecule hierarchically to form an atom layer |$\mathcal{H}_{\mathcal G}^{g}$|, an attachment layer |$\mathcal{H}_{\mathcal G}^{a}$|, and a motif layer |$\mathcal{H}_{\mathcal G}^{s}$|. Each successive layer utilizes the previous one as its encoding condition.
Atom layer
The input for the atom layer |$\mathcal{H}_{\mathcal G}^{g}$| is the molecular graph |$\mathcal{G}$|, the edge features |$\{\mathrm{e}\left (\mathrm{b}_{\mathrm{uv}}\right )\}$|, and the node features |$\{\mathrm{e}\left (\mathrm{a}_{\mathrm{u}}\right )\}$|. Through the message passing network [29, 30] |$M P N_{\psi }(\cdot )$|, we obtain the encoded features:
where |$\mathrm{h}_{\mathrm{v}}$| represents the updated atomic features.
Attachment layer
The attachment layer |$\mathcal{H}_{\mathcal G}^{a}$| updates its nodes using the following equation:
where |$\mathrm{e}\left (\mathrm{d}_{\mathrm{ij}}\right )$| denotes the edge features for the attachment layer, and |${f_{\mathcal{A}_{i}}}$| can be calculated based on atom vectors |$\mathrm{h}_{\mathrm{v}}$| and nodes |$\mathcal{A}_{i}$| in the attachment layer:
where |${emb}_{\mathcal{A}_{i}}(\cdot )$| represents a fully connected layer and |${e}(\cdot )$| symbolizes a conventional embedding layer. The node |$\mathcal{A}_{i}$| is the set of overlapping atoms between adjacent motifs.
Motif layer
The motif layer |$\mathcal{H}_{\mathcal G}^{s}$| updates its nodes using the following equation:
where |$\mathrm{e}\left (\mathrm{d}_{\mathrm{ij}}\right )$| denotes the edge features for the motif layer, and |${f_{\mathcal{S}_{i}}}$| can be calculated based on |$\mathrm{h}_{\mathcal{A}_{i}}$| and |$\mathcal{S}_{i}$|:
where |${emb}_{\mathcal{S}_{i}}(\cdot )$| represents a fully connected layer and |${e}(\cdot )$| symbolizes a conventional embedding layer. Nodes |$\mathcal{S}_{i}$| in the motif layer represent the type of the motifs.
Finally, the embedding of the |$i$|th molecule is defined as
where |$\mu (\cdot )$| and |$\exp (\cdot )$| are both fully connected layers representing the mean and log variance, respectively, and |$\mathcal{S}_{1}$| represents the root motif, which is also the first motif for decoding.
Molecular decoder
During decoding (see Fig. 1D), we continue to denote the molecular embedding as |$z_{m}$| and the motif-layer encoded vectors as |$h_{\mathcal{S}_{k}}$|, where |$k$| signifies the decoding step. Initially, we set |$k=1$|. The hierarchical decoding of the molecular graph then proceeds as follows.
Motif prediction
This step predicts the next motif |$\mathcal{S}_{t}$| to be attached to |$\mathcal{S}_{k}$|. It is implemented as a classification task over the motif vocabulary |$\mathcal{V}_{\mathcal S}$|:
Attachment prediction
Similar to motif prediction, but applying to the attachment vocabulary |$\mathcal{V}_{\mathcal A}(\mathcal{S}_{t})$|, this step predicts the attachment configuration between |$\mathcal{S}_{t}$| and |$\mathcal{S}_{k}$|. In other words, it determines the type of atom that connects |$\mathcal{S}_{t}$| and |$\mathcal{S}_{k}$|:
Graph prediction
This step decides how |$\mathcal{S}_{t}$| will be attached to |$\mathcal{S}_{k}$|. Attachment prediction only narrows down the type of connecting atom but does not specify the exact position. The attachment between |$\mathcal{S}_{t}$| and |$\mathcal{S}_{k}$| is defined as atom pairs |$\mathcal{M}_{tk}=\{(u_{j},v_{j})\,|\,u_{j}\in \mathcal{A}_{k}, v_{j}\in \mathcal{A}_{t}\}$|, where atoms |$u_{j}$| and |$v_{j}$| are attached together. The probability of a candidate attachment |$M$| is calculated based on the atom vectors |$h_{u_{j}}$| and |$h_{v_{j}}$|:
Training
During training, this module needs to minimize the negative ELBO:
to complete the training loop.
Validation
During validation, we utilize the Fréchet ChemNet Distance [31] as a selection criterion for models. This metric evaluates whether the generated molecules show diversity and possess chemical and biological properties similar to reference molecules.
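For reference, the sketch below shows the Gaussian Fréchet distance underlying the FCD; it assumes the ChemNet activations of the generated and reference molecules have already been computed, which is outside the scope of this snippet.

```python
# Sketch of the Frechet distance underlying the Frechet ChemNet Distance (FCD) [31].
# Assumes ChemNet activation matrices (n_molecules x n_features) are already computed.
import numpy as np
from scipy import linalg

def frechet_distance(act_gen: np.ndarray, act_ref: np.ndarray) -> float:
    mu_g, mu_r = act_gen.mean(axis=0), act_ref.mean(axis=0)
    cov_g = np.cov(act_gen, rowvar=False)
    cov_r = np.cov(act_ref, rowvar=False)
    covmean = linalg.sqrtm(cov_g @ cov_r)           # matrix square root of the covariance product
    if np.iscomplexobj(covmean):                    # strip tiny imaginary parts from numerics
        covmean = covmean.real
    diff = mu_g - mu_r
    return float(diff @ diff + np.trace(cov_g + cov_r - 2.0 * covmean))
```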
Genetic encoder
In Fig. 1C, a single-cell large language model known as scGPT [17] is employed as the genetic encoder. Trained on an extensive amount of unlabeled single-cell expression data, this model shows potential in capturing underlying biological signals. The pre-trained gene tokens from scGPT can uniquely identify each gene in the input data, allowing for flexible gene inputs during the encoding process. Our model receives two sets of gene expression data: the initial and the desired or post-perturbation data. Both sets of data independently undergo the same encoding pipeline. Using the |$i$|th gene expression data point as an example, we initially retrieve its corresponding gene tokens from the scGPT pre-trained gene dictionary:
where |$t_{g}^{(i)}$| represents the gene token encoding for the |$i$|th data point, and |$id\left (g_{j}^{(i)}\right )$| represents the unique integer identifier of gene |$g_{j}$| in the |$i$|th data point.
Next, we pre-process the expression values by implementing transcripts per thousand normalization and |$log1p$| transformation:
The gene embedding is then defined as follows:
where |$e m b_{g}\left (\cdot \right )$| refers to the conventional embedding layers, and |$e m b_{x}\left (\cdot \right )$| represents the fully connected layer.
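A minimal sketch of this pre-processing and of combining the two embeddings is shown below; the layer names are placeholders, and summing the token and value embeddings follows scGPT's published design rather than code released with this paper.

```python
# Sketch of expression pre-processing (transcripts-per-thousand + log1p) and gene
# embedding; emb_g / emb_x are placeholder modules, summation follows scGPT's design.
import numpy as np
import torch
import torch.nn as nn

def preprocess_expression(counts: np.ndarray) -> np.ndarray:
    """counts: (n_samples, n_genes) raw expression values."""
    per_thousand = counts / counts.sum(axis=1, keepdims=True) * 1_000.0
    return np.log1p(per_thousand)

class GeneEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.emb_g = nn.Embedding(vocab_size, d_model)   # conventional embedding layer
        self.emb_x = nn.Linear(1, d_model)               # fully connected layer for values

    def forward(self, gene_tokens, expr_values):
        """gene_tokens: (B, M) integer ids; expr_values: (B, M) pre-processed expression."""
        return self.emb_g(gene_tokens) + self.emb_x(expr_values.unsqueeze(-1))
```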
We further utilize the masked language model (MLM) task to fine-tune scGPT. Instead of the commonly used mean squared loss, we use the autofocus direction-aware loss from GEARS [32]. This loss automatically assigns greater weight to differentially expressed genes by amplifying the exponent of the error. Given a minibatch of |$B_{1}$| data points, we randomly sample 95% of the genes to mask and compute the autofocus loss |$\mathcal{L}_{\mathrm{autofocus}}$| on them:
Moreover, we introduce an additional direction-aware loss to increase the predicted data’s sensitivity to the direction relative to the control group, which is defined as follows:
The optimization objective for the MLM task, denoted as |$\mathcal{L}_{\mathrm{MLM}}$|, is ultimately calculated by taking a weighted sum of the values of |$\mathcal{L}_{\mathrm{autofocus}}$| and |$\mathcal{L}_{\mathrm{direction}}$|:
where |$\lambda $| functions to adjust the weight for the directional loss.
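The following hedged PyTorch sketch illustrates the two terms: an autofocus term that up-weights differentially expressed genes by raising the absolute error to an exponent |$2+\gamma$|, and a direction-aware term that penalizes predictions whose change relative to the control disagrees in sign with the true change. The exact formulations and the weight |$\lambda$| follow GEARS [32]; the values here are illustrative.

```python
# Hedged sketch of the fine-tuning objective (autofocus + direction-aware terms).
import torch

def mlm_loss(pred, truth, ctrl, gamma=1.0, lam=0.1):
    """pred, truth, ctrl: (B1, n_genes) predicted, observed and control expression."""
    # Autofocus: |error|^(2 + gamma) gives differentially expressed genes more weight.
    autofocus = (pred - truth).abs().pow(2.0 + gamma).mean()
    # Direction-aware: penalize sign disagreement of the change relative to control.
    direction = (torch.sign(truth - ctrl) - torch.sign(pred - ctrl)).pow(2).mean()
    return autofocus + lam * direction
```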
We then represent the gene expression signatures for subsequent alignment by calculating the difference in post-encoding embedding between the desired and control states:
Given that some other single-cell foundation models are available, we conducted an ablation study in Appendix D to demonstrate scGPT’s advantages in efficiency and alignment accuracy.
Alignment
Alignment is the second step (see Fig. 1A) and aims to extract the mutual information between genetic and small molecular modalities, thus enhancing the controllability of the generation model. In this process, we adopt contrastive learning [33, 34] for multi-modal matching and introduce a hierarchical gene-molecule matching strategy to overcome model collapse.
Hierarchical clustering
Hierarchical clustering merges data into clusters step by step by calculating their similarity until a complete tree-like structure is formed. We use this strategy to cluster the small molecule embeddings into |$C=\{c_{1},c_{2},c_{3}\}$| classes. The detailed process is shown in Algorithm 1.
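A compact sketch of this clustering step, complementing Algorithm 1, is given below; the number of clusters at each granularity level is an illustrative choice rather than a value reported in the paper.

```python
# Sketch: agglomerative clustering of molecule embeddings, cut at three nested
# granularities C = {c1, c2, c3}; cluster counts per level are illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def hierarchical_labels(z_m: np.ndarray, levels=(10, 50, 200)):
    """z_m: (n_molecules, d) molecule embeddings -> cluster labels at each granularity."""
    tree = linkage(z_m, method="ward")                   # agglomerative merge tree
    return {f"c{i + 1}": fcluster(tree, t=k, criterion="maxclust")
            for i, k in enumerate(levels)}
```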

Hierarchical gene-molecule matching
Through hierarchical clustering, molecules are grouped into clusters, and the model can be guided by a cross-entropy loss at a coarse resolution. Given that the small molecules have been clustered into |$C=\{c_{1},c_{2},c_{3}\}$| classes, we require the genetic encoder to classify the corresponding perturbations based on gene expression signatures. The quality of this task is measured by the cross-entropy loss |$\mathcal{L}_{\mathrm{ce}:c} = -\sum_{i} y_{c,i}\log p_{c,i}$|,
where |$y_{c, i}$| and |$p_{c, i}$| denote the truth labels and predicted labels for the |$c$|th cluster of small molecules based on gene embeddings.
Contrastive learning alignment
The final layer |$z_{g}$| output in the genetic encoder is aligned with small molecular vectors |$z_{m}$| via contrastive learning. The optimization objective is to minimize the InfoNCE loss function [35]:
where |${z_{m}}^{\prime }$| and |${z_{g}}^{\prime }$| are unpaired molecule-gene negative sample pairs in a minibatch composed of |$B_{2}$| data points, and |$E(.,.)$| is an energy (scalar) function defined on these two modalities. We use vector dot product for representation; |$ \tau $| is a learnable coefficient. The InfoNCE loss aims to minimize the distance between positive samples and maximize the distance between negative samples.
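A minimal PyTorch sketch of this loss is given below, using a dot-product energy and a learnable temperature |$\tau$|; symmetrizing over both matching directions is an implementation choice of this sketch, not necessarily the paper's exact form.

```python
# Minimal sketch of the InfoNCE alignment loss with dot-product energy and learnable tau.
import torch
import torch.nn.functional as F

def info_nce(z_m: torch.Tensor, z_g: torch.Tensor, log_tau: torch.Tensor) -> torch.Tensor:
    """z_m, z_g: (B2, d) paired molecule / gene-signature embeddings from one minibatch."""
    logits = (z_g @ z_m.t()) / log_tau.exp()              # E(z_g, z_m) = dot product, scaled by tau
    targets = torch.arange(z_m.size(0), device=z_m.device)
    return 0.5 * (F.cross_entropy(logits, targets) +      # genes -> molecules
                  F.cross_entropy(logits.t(), targets))   # molecules -> genes
```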
Overall, during the learning process, our genetic encoder optimizes two objectives: the InfoNCE loss |$\mathcal{L}_{\mathrm{InfoNCE}}$| for contrastive learning and the cross-entropy loss |$\mathcal{L}_{\mathrm{ce}:c}$| for hierarchical gene-molecule matching, which can also serve as a hard sample mining task:
Facilitator
Assuming our dataset consists of sample pairs |$(m, g)$|, where |$m$| represents a small molecule and |$g$| represents its corresponding gene expression signatures, we model the transition from gene expression signatures to molecules, denoted as |$P(m|g)$|, using a two-phase process involving modality facilitation and model generation [36]. Let |$z_{m}$| and |$z_{g}$| continue to represent the small molecule and gene embeddings, respectively. Thus, |$P(m|g)$| can be decomposed as |$P(m|g) = P(m, z_{m}|g) = P(m|z_{m}, g)\,P(z_{m}|g)$|.
The first equality holds because |$z_{m}$| is a deterministic function of the small molecule |$m$|, and the second follows from the chain rule [21, 22].
When the small molecule decoder |$P\left (m\middle | z_{m},g\right )$| has been determined [12], the primary focus is on optimizing the facilitator |$P\left (z_{m}\middle | g\right )$| (see Fig. 1E). We adopt an auto-regressive model for modality transformation [37–39].
Since |$z_{m}$| is a continuous value, the cross-entropy loss function cannot be directly applied. To solve this, we adopt the approach used in DALL·E 2 [22]: we reduce the dimension of |$z_{m}$| from |$d$| to |$L$| using PCA. This reduction retains nearly all the information (with a reconstruction mean squared error below 1%). Subsequently, we discretize the |$L$|-dimensional values into |$N_{m}$| bins, assuming that these values are uniformly distributed.
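The sketch below illustrates this tokenization and its inverse (used later before decoding); the concrete values of |$L$| and |$N_{m}$| are placeholders.

```python
# Sketch of discretizing molecular embeddings into tokens (PCA + uniform binning)
# and of the inverse mapping used before decoding. L and n_bins are placeholders.
import numpy as np
from sklearn.decomposition import PCA

def fit_tokenizer(z_m: np.ndarray, L: int = 64):
    pca = PCA(n_components=L).fit(z_m)
    reduced = pca.transform(z_m)                    # (n_molecules, L)
    lo, hi = reduced.min(axis=0), reduced.max(axis=0)
    return pca, lo, hi

def tokenize(z_m, pca, lo, hi, n_bins: int = 256):
    reduced = pca.transform(z_m)
    width = (hi - lo) / n_bins
    return np.clip(((reduced - lo) / width).astype(int), 0, n_bins - 1)  # tokens in [0, n_bins - 1]

def detokenize(tokens, pca, lo, hi, n_bins: int = 256):
    width = (hi - lo) / n_bins
    centers = lo + (tokens + 0.5) * width           # bin centres back to continuous values
    return pca.inverse_transform(centers)           # approximate z_m for the decoder
```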
Figure 1E provides the mechanism of the Facilitator. It takes as input |$M$| genes, a special symbol [start], and |$L$| small molecule tokens and sequentially produces inferred tokens |$z_{m}^{d}$|. The loss function is defined as follows:
where |$t_{i}$| represents the |$i$|th true token, and |$p(z_{m}^{d(i)}|z_{m}^{d(0:i-1)})$| represents the inferred token of the small molecules.
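A hedged sketch of this teacher-forcing objective is shown below; the `facilitator` model interface is abstracted here and is not the paper's actual implementation.

```python
# Hedged sketch of the Facilitator training objective: teacher forcing with a
# cross-entropy over the L molecular tokens, conditioned on the gene tokens and [start].
import torch
import torch.nn.functional as F

def facilitator_loss(facilitator, gene_tokens, mol_tokens, start_id):
    """gene_tokens: (B, M) gene inputs; mol_tokens: (B, L) ground-truth discretized z_m."""
    start = torch.full_like(mol_tokens[:, :1], start_id)
    inputs = torch.cat([start, mol_tokens[:, :-1]], dim=1)    # shift right (teacher forcing)
    logits = facilitator(gene_tokens, inputs)                  # (B, L, vocab) next-token scores
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           mol_tokens.reshape(-1))
```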
Beam search optimization
Beam search is a commonly used search algorithm in sequence generation tasks [9, 40]. It applies dynamic programming principles by keeping track of a set of candidate sequences, known as the beam, with a beam width of |$K$|, to preserve promising results. Beam search effectively manages the search space and improves the accuracy and diversity of the generated sequences. Specifically, the algorithm proceeds as follows (a minimal code sketch is given after the list):
1. Initialize the beam with a set of candidate sequences, each consisting of |$M$| gene tokens and [start].
2. At each time step:
   (a) Calculate the likelihood of every candidate token becoming the next token, given the current sequence.
   (b) Select the top |$K$| candidate sequences based on their scores, where |$K$| is the beam width.
   (c) Expand each selected candidate sequence by appending the molecular tokens and updating its score.
3. Repeat step 2 until a complete sequence of |$L$| small molecule tokens is formed.
4. Return the top-ranked sequence from the beam as the final output.
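The following minimal sketch, referenced above, illustrates these steps; `step_logprobs` stands in for the Facilitator's next-token log-probabilities and is a placeholder rather than the paper's API.

```python
# Minimal beam-search sketch over molecular tokens with beam width K.
import torch

def beam_search(step_logprobs, gene_tokens, L, K, start_id):
    device = gene_tokens.device
    beams = [(torch.tensor([start_id], device=device), 0.0)]    # (sequence, cumulative log-prob)
    for _ in range(L):
        candidates = []
        for seq, score in beams:
            logp = step_logprobs(gene_tokens, seq)               # (vocab,) log-probabilities
            top_vals, top_ids = logp.topk(K)
            for v, i in zip(top_vals.tolist(), top_ids.tolist()):
                new_seq = torch.cat([seq, torch.tensor([i], device=device)])
                candidates.append((new_seq, score + v))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:K]  # keep the best K
    return beams                                                 # K candidate token sequences
```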
After the beam search process, the final output tokens |$z_{m}^{d}$| are transformed back into |$z_{m}$| using the inverse of PCA. The continuous |$z_{m}$| values are then fed into the molecular decoder |$P(m|z_{m},g)$| [12]. This decoding procedure results in |$K$| sets of molecule candidates. We choose the value of |$K$| by trading off molecular generation quality against time consumption; the detailed sensitivity analysis is in Appendix C.
Results
Generating hit-like molecules from compound-induced gene expression signatures
In this section, we assess the performance of our model GexMolGen in generating hit-like molecules from compound-induced gene expression signatures. We feed a set of |$N_{3}$| hold-out samples into the model and compare the generated molecules with the reference molecules in the hold-out samples.
In Fig. 2, we illustrate this comparison (for more details, please refer to Table A.1). The results indicate that the generated molecules can successfully reproduce some complex scaffold structures that appeared in the reference molecules, such as long chains (see Fig. 2(3)(6)), cyclic structures (see Fig. 2(1)), and cross-like structures (see Fig. 2(7)(9)).

Examples of generated molecules using a compound-induced gene expression signature.

Examples of generated molecules using knock-out gene expression signatures.
We compare our model with two traditional methods: searching molecules in the training set according to the cosine similarity or the Euclidean distance among gene expression signatures [4]. We also include a pre-trained hierVAE [12] that generates directly from samples in the molecular latent space, to demonstrate the guiding influence of genetic information. We adopt Tanimoto similarity based on Morgan fingerprints (radius 3, 1024 bits) as the global similarity metric, and Tanimoto similarity based on MACCS keys and Fraggle similarity as the local structural metrics.
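The sketch below shows how these three scores can be computed with RDKit; it is an illustrative reimplementation, not the authors' evaluation script, and minor Fraggle API details may vary across RDKit versions.

```python
# Illustrative computation of Morgan-Tanimoto, MACCS-Tanimoto and Fraggle similarity.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, MACCSkeys
from rdkit.Chem.Fraggle import FraggleSim

def similarity_scores(smiles_gen, smiles_ref):
    m1, m2 = Chem.MolFromSmiles(smiles_gen), Chem.MolFromSmiles(smiles_ref)
    morgan = DataStructs.TanimotoSimilarity(
        AllChem.GetMorganFingerprintAsBitVect(m1, 3, nBits=1024),
        AllChem.GetMorganFingerprintAsBitVect(m2, 3, nBits=1024))
    maccs = DataStructs.TanimotoSimilarity(MACCSkeys.GenMACCSKeys(m1),
                                           MACCSkeys.GenMACCSKeys(m2))
    fraggle, _ = FraggleSim.GetFraggleSimilarity(m1, m2)   # returns (similarity, match)
    return {"morgan": morgan, "maccs": maccs, "fraggle": fraggle}
```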

Examples of generated molecules based on knock-out gene expression signatures, extending the findings from Fig. 3.
Figure 5A shows the distribution of these metrics for the different methods; detailed results can be found in Tables A.1, A.2, and A.3. Our model GexMolGen demonstrates a significant improvement over the other three baseline methods (P-value |$\lt 0.001$|, one-sided Mann–Whitney U-test). Specifically, our model achieves a mean Fraggle score of |$0.57$|, a mean Morgan-based Tanimoto score of |$0.30$|, and a mean MACCS-based Tanimoto score of |$0.70$|. These results indicate that the molecules generated by our model are highly similar to the reference molecules in both fingerprints and substructures. Moreover, across all three metrics, the Euclidean distance search method is consistently inferior to the cosine similarity method. This discrepancy can be attributed to the curse of dimensionality: as the dimension increases, the Euclidean distance becomes less sensitive to differences in gene expression signatures. We also notice that the unconditional generation baseline has the poorest overall performance. Although it occasionally produces molecules with high similarity scores (e.g. |$1.19\%$| with a similarity |$\gt 0.29$| in Fig. 5A(b)), it also generates a larger proportion of irrelevant molecules (e.g. |$26.27\%$| with a similarity |$\lt 0.08$| in Fig. 5A(b)), which highlights the importance of guidance from gene expression signatures in generating molecules with the desired properties.
Generating hit-like molecules from knock-out gene expression signatures
To test the generalization ability of our model GexMolGen, we evaluate its capacity to generate potential gene inhibitors from knock-out gene expression data, which also serves as an out-of-distribution test. The underlying hypothesis is that inhibitors of a specific gene induce gene expression changes similar to those caused by knocking out that gene.
We use the same evaluation metrics and comparison methods described in the previous section. The collection of knock-out data and the corresponding known inhibitors is described in the Dataset used for evaluating subsection. Each method generates 100 candidate molecules for each gene based on the given knock-out transcriptome, and we record the scores between each candidate and its most similar known inhibitor.
Figures 3 and 4 present a visualization of the molecules generated by our model alongside the most similar known inhibitors, with similarity measured by Morgan-based Tanimoto scores. Compared with WGAN [4], which has an average score of |$0.26$| (see Table A.11 for details), our model generates molecules that not only exhibit a higher overall similarity to known inhibitors but also successfully reconstruct complete, contiguous skeletal structures.

Quantitative results of generated molecules. A. Comparison results in generating hit-like molecules from compound-induced gene expression signatures. B. Comparison results of potential gene inhibitors generated from knock-out gene expression signatures.
Table 1. Performance of other deep learning methods on similar tasks, compared with ours

| Name¹ | Data² | Form³ | Valid⁴ | Metric⁵ |
|---|---|---|---|---|
| WGAN [4] | L1000 | SMILES | 10% | 0.15–0.70 |
| Bidirectional Adversarial AE [41] | L1000 | SMILES | 76% | 0.30–0.48 |
| TRIOMPHE [42] | L1000 | SMILES/SELFIES | 16.89% / 93.96% | 0.29–0.68 |
| PaccMann [43] | GDSC/CCLE | SMILES | 96.2% | 0.27–0.45 |
| Deep Generative with RL [44] | TCGA project | SMILES | 95.9% | 0.47–0.83 |
| Ours | L1000 | Graph | 100% | 0.72–1.00 |

¹ Name: the name and source of the method. ² Data: the type of dataset. ³ Form: the input format of the molecular data. ⁴ Valid: the average validity of the molecular generator. ⁵ Metric: the range of Tanimoto similarity.
The results of these experiments are shown in Fig. 5B. Overall, our model significantly surpasses the three baseline methods, with higher average scores and greater diversity in most cases. Furthermore, owing to the limited size of the search database, the molecule candidates of the traditional methods exhibit low diversity; for instance, the MACCS-based scores for CTSK collapse into a nearly linear pattern. On the other hand, the non-conditional pre-trained hierVAE [12] produces the widest range of scores because of its unconditional molecular sampling, which simultaneously degrades molecular properties: it occasionally produces molecules with high similarity scores but also generates many molecules with very low correlation to known inhibitors, as observed for AKT2 in Fig. 5B. By leveraging genetic information, our model narrows down the generation space while still allowing some diversity, striking a balance between structural similarity and molecular diversity. We also summarize how other deep learning methods perform on similar tasks in Table 1. Despite some differences in the tested genes, our model consistently produces competitive results. Its notable advantages are its efficiency, with 100% validity of generated molecules, and its high input flexibility, both of which are missing in other methods. Detailed results can be found in Tables A.4, A.5, and A.6.

Results of screening. A. The flowchart using only our model’s retrieving function. B. Examples of searched molecules using knock-out gene expression signatures.
Screening molecules using cross-modal similarity
Our model GexMolGen also offers a well-defined cross-modal similarity for measuring the correlation between gene expression signatures and molecular structures. Therefore, it also supports screening molecules using gene expression profiles alone.
We evaluate the effectiveness of GexMolGen on the task described in the previous subsection. Specifically, our model first encodes the molecules from the reference dataset into |$Z_{m} = [z_{m_{1}},z_{m_{2}},...]$| and the gene expression data into |$Z_{g} = [z_{g_{1}},z_{g_{2}},...]$|, then calculates their alignment scores |${Z_{m}}^{T}\cdot Z_{g}$| to find the molecule with the highest score for each genetic input. The screening pipeline is visualized in Fig. 6A. This procedure is similar to traditional search methods, but the latter compute similarities only within a single modality; for instance, in CMap [2], gene expression queries are compared with recorded gene expression profiles induced by molecular perturbations.
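A minimal sketch of this screening step is given below, assuming the embeddings have already been produced by the aligned encoders.

```python
# Sketch of cross-modal screening: rank library molecules for each query signature
# by the alignment score Z_m^T . Z_g computed in the shared embedding space.
import numpy as np

def screen(z_m: np.ndarray, z_g: np.ndarray, top_k: int = 5):
    """z_m: (n_molecules, d) molecule embeddings; z_g: (n_queries, d) signature embeddings."""
    scores = z_g @ z_m.T                                   # (n_queries, n_molecules)
    order = np.argsort(-scores, axis=1)[:, :top_k]         # best-matching molecules per query
    return order, np.take_along_axis(scores, order, axis=1)
```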
Figure 6B displays some examples of screening from the same molecule datasets, and Table 2 summarizes the statistics. Both indicate that the molecules screened using our cross-modal similarity are much more similar to known inhibitors than those found by the traditional methods, reflecting the advantage of direct cross-modal similarity calculation. The supporting materials are given in Table A.7.
Table 2. MACCS-based similarity scores between screened molecules and known inhibitors, for our cross-modal screening (Ours) and traditional cosine (Cos.) and Euclidean (Euc.) searches

| Knock-out gene | MACCS scores (Ours) | MACCS scores (Cos.) | MACCS scores (Euc.) |
|---|---|---|---|
| AKT1 | 0.745 ± 0.056 | 0.710 ± 0.091 | 0.662 ± 0.130 |
| AKT2 | 0.690 ± 0.065 | 0.679 ± 0.098 | 0.661 ± 0.113 |
| AURKB | 0.694 ± 0.047 | 0.600 ± 0.024 | 0.621 ± 0.025 |
| CTSK | 0.658 ± 0.086 | 0.603 ± 0.069 | 0.590 ± 0.054 |
| EGFR | 0.752 ± 0.054 | 0.693 ± 0.051 | 0.724 ± 0.017 |
| HDAC1 | 0.649 ± 0.058 | 0.647 ± 0.028 | 0.651 ± 0.017 |
| MTOR | 0.740 ± 0.087 | 0.652 ± 0.045 | 0.652 ± 0.035 |
| PIK3CA | 0.684 ± 0.053 | 0.560 ± 0.030 | 0.588 ± 0.070 |
| SMAD3 | 0.820 ± 0.069 | 0.811 ± 0.065 | 0.782 ± 0.050 |
| TP53 | 0.793 ± 0.061 | 0.780 ± 0.055 | 0.766 ± 0.062 |
Discussion
This paper proposes GexMolGen, a method to generate hit-like molecules from signatures calculated as the desired gene expression relative to the initial state. We adopt a “first-align-then-generate” [21] strategy for cross-modal generation. Specifically, we utilize an advanced single-cell large language model, scGPT [17], as the genetic encoder and pre-train hierVAE [12], a graph-based molecular model, for molecular encoding and decoding. To enhance contrastive learning, we integrate a hierarchical gene-molecule matching approach into the alignment, serving as a hard negative sample mining task. During generation, the facilitator first converts genetic embeddings into molecular tokens; these tokens are then transformed back into continuous molecular embeddings and fed into the decoder to generate hit-like molecules.
Our method demonstrates superior performance compared with traditional search methods in generating molecules from both in-domain compound-induced expression signatures and out-of-domain gene knock-outs. Additionally, our model is efficient, ensuring 100% validity, and supports flexible inputs, surpassing other deep learning methods. It can also be used as a screening tool, since it provides an explicit metric for measuring the similarity between the genetic and molecular modalities.
Despite these strengths, our model faces certain limitations, particularly in the representation of genes and molecules. While current single-cell foundation models represent genes in a sequence format, gene interactions are more accurately depicted in a network form. Utilizing a graph neural network to aggregate information from network neighbors based on predefined biological networks might better mimic biological activity. Furthermore, the optimal molecular representation remains an unresolved issue. SMILES-based models [45–47] can generate large volumes of molecules quickly but may lose critical information due to their one-dimensional format. Emerging 3D molecular generation models [48, 49] offer more detailed representations but are computationally intensive. Integrating higher dimensional information, such as the angles of chemical bonds, into existing 2D molecular generators may strike a balance between molecular expressiveness and generation efficiency.
Key Points
We propose a novel cross-modal generation-based approach for designing molecules with specific biological activities.
GexMolGen is general and accepts flexible input formats by fully leveraging the power of foundation models.
GexMolGen provides an explicit metric for measuring the correlation between inputs and outputs, which also supports retrieval.
GexMolGen achieves 100% validity thanks to its scaffold-based molecular generation strategy.
Funding
This work was supported by grants from the National Natural Science Foundation of China 62103262 (to Y.Y.) and the Shanghai Pujiang Programme (no. 21PJ1407700 to Y.Y.).
Data availability
The data underlying this article are available in the article and its online supplementary material. The curated data can be found at https://zenodo.org/records/11100665?token=eyJhbGciOiJIUzUxMiIsImlhdCI6MTcxNDYyNzUzOSwiZXhwIjoxNzk4Njc1MTk5fQ.eyJpZCI6IjM3ZmFkNGU4LWViMDYtNGNkNy1iOTc4LWI0ZTBkMDk2OWI0YyIsImRhdGEiOnt9LCJyYW5kb20iOiJhOGJjZDM5NWFkY2ZiNDAwNjAzYzIwMTg2ODNjYWI2NCJ9.dADyS-0PBsFKr_z1yDdcDnoGoY5PFOSbnYtt6aIz4RLoNxykoIQffAlzQDPbFgqnZJmp7PmjNXPCXHkDMuZHuA. The raw data are downloaded from https://clue.io/data/CMap2020#LINCS2020, https://www.ebi.ac.uk/chembl/ and https://solr.ideaconsult.net/search/excape/#.
Author contributions
J.C., Q.Y., and Y.Y. conceived the concept of this work. The experiments were performed by J.C., with the assistance of X.P., Y.F., K.Y., and Y.X. All authors participated in analyzing the results and writing the manuscript. The project was supervised by X.P., Q.Y., and Y.Y.
Code availability
The source code of GexMolGen is available on Zenodo (https://zenodo.org/records/11092780?token=eyJhbGciOiJIUzUxMiIsImlhdCI6MTcxNDQ4MjYwOCwiZXhwIjoxNzk4Njc1MTk5fQ.eyJpZCI6IjQwMjc0ZTlkLTQ0YjItNDY0Mi1hYmRjLTlkM2VkMjVhNWE5OSIsImRhdGEiOnt9LCJyYW5kb20iOiI5MzhjMjdmZjdmNmRlZmMxZTU0MTNhMGNmZGIzMDNmOCJ9.p3lxB8h_BDhuHm_1yqr4L1ZjRCTXmftMzpNdYaexmvycISDkoUu1cizUFrmcRZ52QovkPeQDjsHBrdScRkVqiw), together with usage documentation and the setup environment.
Competing financial interests
The authors declare no competing financial interests.
References
Goodfellow IJ, Pouget-Abadie J, Mirza M, et al.
Bresson X, Laurent T.
De Cao N, Kipf T.
Jin W, Barzilay R, Jaakkola T.
Cui H, Wang C, Maan H, et al.
Heimberg G, Kuo T, DePianto D, et al.
Yang X, Liu G, Feng G, et al.
Ramesh A, Pavlov M, Goh G, et al.
Ramesh A, Dhariwal P, Nichol A, et al.
Liu S, Li Y, Li Z, et al.
Dai H, Dai B, Song L.
Fey M, Yuen J-G, Weichert F.
Roohani Y, Huang K, Leskovec J.
Khosla P, Teterwak P, Wang C.
van den Oord A, Li Y, Vinyals O.
Ding M, Yang Z, Hong W, et al.
Radford A, Wu J, Child R, et al.
Lewis M, Liu Y, Goyal N, et al.
Raffel C, Shazeer N, Roberts A, et al.
OpenAI, Achiam J, Adler S, et al.
Kaitoh K, Yamanishi Y.
Pereira T, Abbasi M, Oliveira RI, et al.