Jiabei Cheng, Xiaoyong Pan, Yi Fang, Kaiyuan Yang, Yiming Xue, Qingran Yan, Ye Yuan, GexMolGen: cross-modal generation of hit-like molecules via large language model encoding of gene expression signatures, Briefings in Bioinformatics, Volume 25, Issue 6, November 2024, bbae525, https://doi.org/10.1093/bib/bbae525
Abstract
Designing de novo molecules with specific biological activity is an essential task, as it holds the potential to bypass the exploration of target genes, an initial step in the modern drug discovery paradigm. However, traditional methods mainly screen molecules by comparing the desired molecular effects against documented experimental results. This process is constrained by the available datasets and cannot perform direct cross-modal comparisons. Therefore, we propose a solution based on cross-modal generation called GexMolGen (Gene Expression-based Molecule Generator), which generates hit-like molecules from gene expression signatures alone. These signatures are calculated from a pair of inputs: the control and the desired gene expression states. Our model GexMolGen adopts a “first-align-then-generate” strategy, aligning gene expression signatures and molecules within a shared mapping space to ensure a smooth cross-modal transition. The transformed molecular embeddings are then decoded into molecular graphs. In addition, we employ an advanced single-cell large language model for input flexibility and pre-train a scaffold-based molecular model so that all generated molecules are 100% valid. Empirical results show that our model can produce molecules highly similar to known references, whether fed in-domain or out-of-domain transcriptome data, and that it can also serve as a reliable tool for cross-modal screening.
Introduction
Designing novel molecules based on genetic information is a challenging but promising task [1]. Traditional search methods like CMap [2] compare observed gene expression changes, referred to as gene expression signatures, with those recorded in molecular perturbation experiments, which leads to a highly restricted search space that cannot accommodate the vast possibilities within the molecular space. Furthermore, they cannot effectively narrow down this search space by directly comparing cross-modal similarities.
Several existing in silico methods attempt to overcome these shortcomings. Some models, like DLEPS [3], take an indirect approach: they first predict gene expression signatures as drug effects and then use these signatures to screen molecules, as traditional methods do. Other methods, such as the Wasserstein generative adversarial network (WGAN) [4] and Gex2SGen [5], directly take gene expression signatures as input and generate candidate molecules; WGAN builds on GANs, while Gex2SGen uses variational autoencoders (VAEs). However, these models still have limited generalization ability. For example, they require input gene expression signatures that match the format used during training; otherwise, re-training is required. Additionally, they mostly represent molecules with the simplified molecular-input line-entry system (SMILES). SMILES is a popular one-dimensional representation that can be treated as a kind of “chemical text”, which makes it easy to exploit advanced sequence models such as the Transformer [6]. However, this representation fails to preserve structural information and has limited potential to extend to higher dimensions [7]. Meanwhile, the hypothesis behind this task is that molecules inducing similar gene expression signatures are likely to share structural features such as scaffolds or pharmacophoric attributes; discarding structural information therefore makes it harder to probe the underlying mechanisms [8]. Therefore, exploring new forms of modal encoders and modal interaction is necessary.
For molecular encoders, there are various graph-based approaches. For example, molecular autoencoders represent a molecule as a graph, where vertices correspond to atoms and edges to chemical bonds [9, 10]. These methods intuitively capture molecular topology but suffer from low validity in generation, which limits their efficiency. Another line of research involves flow-based methods like MolFlow [11], which achieve a perfect reconstruction rate but can be very computationally expensive for long chains. Scaffold-based approaches, such as hierVAE [12], use scaffolds as building blocks instead of individual atoms and bonds, ensuring high validity at a modest computational cost.
For genetic encoders, given the remarkable success of foundation models across many fields, including omics data in the life sciences, single-cell large language models hold great potential as general encoders in this domain. These models are pre-trained on large, unlabeled single-cell expression datasets and then fine-tuned for specific tasks [13]. High-quality models such as scBERT [14], Geneformer [15], TOSICA [16], scGPT [17], scFoundation [18], SCimilarity [19], and GeneCompass [20] can accept flexible gene expression inputs owing to their strong generalization ability.
As to modal interaction, the “first-align-then-generate” strategy, inspired by models like DALL·E [21, 22], has gained great interest in biology. Recent work, including text-molecule matching [23], text-protein sequence generation [24], and PLIP, a pathology language and image pre-training method [25], has achieved impressive performance.
Therefore, we propose GexMolGen, a cross-modal generation-based approach for designing molecules with specific biological activities.
Materials and methods
Overview of GexMolGen
GexMolGen is a general framework capable of generating hit-like molecules from two specific gene expression inputs: initial and desired or post-perturbed states. The underlying principle is that molecular structure correlates with its resulting differential gene expression, also called gene expression signatures. GexMolGen adopts a strategy of “first-align-then-generate” for cross-modal generation. As shown in Fig. 1A, the pipeline generally consists of four steps.

The workflow of GexMolGen employs a "first-align-then-generate" strategy (A), consisting of four steps: Encoding (B and C), Aligning (A), Transforming (E) and Decoding (D).
The first step involves the encoding of each modality. For the molecular encoder (illustrated in Fig. 1B), we pre-train a scaffold-based molecular generator [12] on the ChEMBL [26] dataset and then use its encoder to produce molecular embeddings, thereby capturing structural information more efficiently than SMILES-based encoders [3, 4]. For the genetic encoder (visualized in Fig. 1C), we apply an advanced single-cell foundation model, scGPT [17], to encode pre- and post-perturbation gene expressions. We then calculate the difference within the gene embedding space to represent expression signatures.
The second step, as presented in Fig. 1C, is to fine-tune each encoder through alignment. As complex negative examples are crucial to the final alignment quality, we employ a hierarchical gene-molecule matching approach that enhances contrastive learning with iteratively refined granularity. The alignment projects gene expression signatures and molecules into a unified space, preparing for the transition from genetic embeddings to molecular ones.
The third step, as depicted in Fig. 1E, is implemented by a module entitled “Facilitator”, which is trained using a teacher-forcing technique and applied in an auto-regressive manner. Molecular embeddings are discretized into molecular tokens through principal component analysis (PCA) and binning. Beam search is adopted in this module to optimize the diversity of the generation.
Finally, the inferred tokens are mapped back to continuous molecular embeddings via retrieval and inverse PCA. They are then fed into the molecular decoder (as shown in Fig. 1D) to generate molecular graphs. During the training phase, this module computes the reconstruction loss to close the training loop within the framework.
The molecular encoder and the alignment phase are needed only during training. For generation, only the initial and desired gene expressions and the number of molecular candidates are required. For the retrieval function, additional molecules are required as a screening list. To our knowledge, GexMolGen is the first work that applies large language models to generate molecules cross-modally.
Dataset pre-processing
Dataset used for training
The datasets used in GexMolGen are derived from the L1000 public dataset (GEO ID: GSE70138) [27]. The L1000 technology uses microarray measurements to record the gene expression signatures caused by over 25 000 perturbations, including around 19 800 small molecules, in cell lines. Each perturbation is performed on multiple cell lines. This dataset is processed hierarchically and divided into five levels: levels 1–3 record the absolute expression values of each gene in each bead, while levels 4 and 5 report the differential expression of each gene obtained by comparing samples within each bead.
Specifically, we first collect the perturbed and control data from level 3, ensuring they originate from the same bead. These data belong to six major cell lines: VCAP, PC3, A549, A375, HT29, and MCF7. We average each molecule’s perturbed gene expression profiles, ignoring the cell line information, resulting in a final dataset size of |$2\times N \times M$|. Here, 2 signifies the experimental and control groups, |$N$| represents the number of molecules, and |$M$| indicates the number of gene symbols. Subsequently, we classify the structures based on the main layer of the InChIKey, resulting in a total of 6360 classes. We randomly select 300 structure classes from this set, which include |$N_{2}$| types of small molecules, to form the hold-out set. Additionally, we choose another 300 structure classes, including |$N_{3}$| types of small molecules, to create the validation set. The remaining data, with |$N_{1}$| types of small molecules, are used for training the model.
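As an illustration of this construction, the short pandas sketch below averages replicate perturbed profiles per molecule, keeps the matched controls, and derives structure classes from the InChIKey main layer; the column names and the assumption that controls are already matched to their perturbations are hypothetical, not taken from the L1000 release.

```python
# Hypothetical sketch of the dataset construction; column names are illustrative.
import pandas as pd

def build_expression_pairs(level3: pd.DataFrame):
    """level3: one row per sample with columns 'pert_inchikey', 'group' ('trt'/'ctl'),
    plus M gene-expression columns; controls are assumed pre-matched to perturbations."""
    gene_cols = [c for c in level3.columns if c not in ("pert_inchikey", "group")]
    perturbed = (level3[level3["group"] == "trt"]
                 .groupby("pert_inchikey")[gene_cols].mean())   # N x M perturbed matrix
    control = (level3[level3["group"] == "ctl"]
               .groupby("pert_inchikey")[gene_cols].mean())     # matched control matrix
    # Structure class from the InChIKey main (connectivity) layer: first 14 characters.
    structure_class = perturbed.index.str[:14]
    return perturbed, control, structure_class
```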
We randomly sample |$N_{p}$| molecules from ChEMBL [26] as datasets for the molecular pre-training stage.
Dataset used for evaluating
In addition to the hold-out compound-induced gene expression data, we follow the same procedure as WGAN [4], our benchmarking method, to gather known inhibitors and genetic inputs for the comparative analysis of molecules generated from knock-out gene expressions. Initially, we select known inhibitors of 10 human genes from the ExCape dataset [28]. These genes are AKT1, AKT2, AURKB, CTSK, EGFR, HDAC1, MTOR, PIK3CA, SMAD3, and TP53. Our selection criteria are a pIC50 value greater than or equal to 5, absence of the same InChIKey from the training set, and compliance with valence bond rules. Eventually, we collect 2153, 1293, 2122, 1609, 4244, 1939, 2858, 2292, 23 758, and 12 860 known inhibitors for these genes, respectively.
For each of these 10 genes, we extract 100 transcriptome profiles from both the perturbed and control groups in the CRISPR knock-out perturbation data of the L1000 dataset [27].
Molecular encoder
For the molecular encoder (see Fig. 1B), hierVAE [12] regards motifs as the smallest building blocks for generating small molecules. This approach achieves 100% validity in molecular generation. It encodes the input molecule hierarchically to form an atom layer |$\mathcal{H}_{\mathcal G}^{g}$|, an attachment layer |$\mathcal{H}_{\mathcal G}^{a}$|, and a motif layer |$\mathcal{H}_{\mathcal G}^{s}$|. Each successive layer utilizes the previous one as its encoding condition.
Atom layer
The input for the atom layer |$\mathcal{H}_{\mathcal G}^{g}$| is the molecular graph |$\mathcal{G}$|, the edge features |$\{\mathrm{e}\left (\mathrm{b}_{\mathrm{uv}}\right )\}$|, and the node features |$\{\mathrm{e}\left (\mathrm{a}_{\mathrm{u}}\right )\}$|. Through the message passing network [29, 30] |$M P N_{\psi }(\cdot )$|, we obtain the encoded features:
where |$\mathrm{h}_{\mathrm{v}}$| represents the updated atomic features.
Attachment layer
The attachment layer |$\mathcal{H}_{\mathcal G}^{a}$| updates its nodes using the following equation:
where |$\mathrm{e}\left (\mathrm{d}_{\mathrm{ij}}\right )$| denotes the edge features for the attachment layer, and |${f_{\mathcal{A}_{i}}}$| can be calculated based on atom vectors |$\mathrm{h}_{\mathrm{v}}$| and nodes |$\mathcal{A}_{i}$| in the attachment layer:
where |${emb}_{\mathcal{A}_{i}}(\cdot )$| represents a fully connected layer and |${e}(\cdot )$| symbolizes a conventional embedding layer. The node |$\mathcal{A}_{i}$| is the set of overlapping atoms between adjacent motifs.
Motif layer
The motif layer |$\mathcal{H}_{\mathcal G}^{s}$| updates its nodes using the following equation:
where |$\mathrm{e}\left (\mathrm{d}_{\mathrm{ij}}\right )$| denotes the edge features for the motif layer, and |${f_{\mathcal{S}_{i}}}$| can be calculated based on |$\mathrm{h}_{\mathcal{A}_{i}}$| and |$\mathcal{S}_{i}$|:
where |${emb}_{\mathcal{S}_{i}}(\cdot )$| represents a fully connected layer and |${e}(\cdot )$| symbolizes a conventional embedding layer. Nodes |$\mathcal{S}_{i}$| in the motif layer represent the type of the motifs.
Finally, the embedding of the |$i$|th molecule is defined as
where |$\mu (\cdot )$| and |$\exp (\cdot )$| are both fully connected layers representing the mean and log variance, respectively, and |$\mathcal{S}_{1}$| represents the root motif, which is also the first motif for decoding.
Molecular decoder
During decoding (see Fig. 1D), we continue to denote the molecular embedding as |$z_{m}$| and the motif-layer encoded vectors as |$h_{\mathcal{S}_{k}}$|, where |$k$| signifies the decoding step. Initially, we set |$k=1$|. The hierarchical decoding of the molecular graph then proceeds as follows.
Motif prediction
This step predicts the next motif |$\mathcal{S}_{t}$| to be attached to |$\mathcal{S}_{k}$|. It is implemented as a classification task over the motif vocabulary |$\mathcal{V}_{\mathcal S}$|:
Attachment prediction
Similar to motif prediction, but applying to the attachment vocabulary |$\mathcal{V}_{\mathcal A}(\mathcal{S}_{t})$|, this step predicts the attachment configuration between |$\mathcal{S}_{t}$| and |$\mathcal{S}_{k}$|. In other words, it determines the type of atom that connects |$\mathcal{S}_{t}$| and |$\mathcal{S}_{k}$|:
Graph prediction
This step decides how |$\mathcal{S}_{t}$| will be attached to |$\mathcal{S}_{k}$|. Attachment prediction only narrows down the type of connecting atom but does not specify the exact position. The attachment between |$\mathcal{S}_{t}$| and |$\mathcal{S}_{k}$| is defined as atom pairs |$\mathcal{M}_{tk}=\{(u_{j},v_{j})\,|\,u_{j}\in \mathcal{A}_{k}, v_{j}\in \mathcal{A}_{t}\}$|, where atoms |$u_{j}$| and |$v_{j}$| are attached together. The probability of a candidate attachment |$M$| is calculated based on the atom vectors |$h_{u_{j}}$| and |$h_{v_{j}}$|:
Training
During training, this module needs to minimize the negative ELBO:
to complete the training loop.
Validation
During validation, we utilize the Fréchet ChemNet Distance [31] as a selection criterion for models. This metric evaluates whether the generated molecules show diversity and possess chemical and biological properties similar to reference molecules.
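For reference, the sketch below shows the Gaussian Fréchet distance underlying the FCD; it assumes the ChemNet activations of the generated and reference molecules have already been computed, which is outside the scope of this snippet.

```python
# Sketch of the Frechet distance underlying the Frechet ChemNet Distance (FCD) [31].
# Assumes ChemNet activation matrices (n_molecules x n_features) are already computed.
import numpy as np
from scipy import linalg

def frechet_distance(act_gen: np.ndarray, act_ref: np.ndarray) -> float:
    mu_g, mu_r = act_gen.mean(axis=0), act_ref.mean(axis=0)
    cov_g = np.cov(act_gen, rowvar=False)
    cov_r = np.cov(act_ref, rowvar=False)
    covmean = linalg.sqrtm(cov_g @ cov_r)           # matrix square root of the covariance product
    if np.iscomplexobj(covmean):                    # strip tiny imaginary parts from numerics
        covmean = covmean.real
    diff = mu_g - mu_r
    return float(diff @ diff + np.trace(cov_g + cov_r - 2.0 * covmean))
```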
Genetic encoder
In Fig. 1C, a single-cell large language model known as scGPT [17] is employed as the genetic encoder. Trained on an extensive amount of unlabeled single-cell expression data, this model shows potential in capturing underlying biological signals. The pre-trained gene tokens from scGPT can uniquely identify each gene in the input data, allowing for flexible gene inputs during the encoding process. Our model receives two sets of gene expression data: the initial and the desired or post-perturbation data. Both sets of data independently undergo the same encoding pipeline. Using the |$i$|th gene expression data point as an example, we initially retrieve its corresponding gene tokens from the scGPT pre-trained gene dictionary:
where |$t_{g}^{(i)}$| represents the gene token encoding for the |$i$|th data point, and |$id\left (g_{j}^{(i)}\right )$| represents the unique integer identifier of gene |$g_{j}$| in the |$i$|th data point.
Next, we pre-process the expression values by implementing transcripts per thousand normalization and |$log1p$| transformation:
The gene embedding is then defined as follows:
where |$e m b_{g}\left (\cdot \right )$| refers to the conventional embedding layers, and |$e m b_{x}\left (\cdot \right )$| represents the fully connected layer.
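A minimal sketch of this pre-processing and of combining the two embeddings is shown below; the layer names are placeholders, and summing the token and value embeddings follows scGPT's published design rather than code released with this paper.

```python
# Sketch of expression pre-processing (transcripts-per-thousand + log1p) and gene
# embedding; emb_g / emb_x are placeholder modules, summation follows scGPT's design.
import numpy as np
import torch
import torch.nn as nn

def preprocess_expression(counts: np.ndarray) -> np.ndarray:
    """counts: (n_samples, n_genes) raw expression values."""
    per_thousand = counts / counts.sum(axis=1, keepdims=True) * 1_000.0
    return np.log1p(per_thousand)

class GeneEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.emb_g = nn.Embedding(vocab_size, d_model)   # conventional embedding layer
        self.emb_x = nn.Linear(1, d_model)               # fully connected layer for values

    def forward(self, gene_tokens, expr_values):
        """gene_tokens: (B, M) integer ids; expr_values: (B, M) pre-processed expression."""
        return self.emb_g(gene_tokens) + self.emb_x(expr_values.unsqueeze(-1))
```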
We further utilize the masked language model (MLM) task to fine-tune scGPT. Instead of the commonly used mean squared loss, we use the autofocus direction-aware loss from GEARS [32]. This loss automatically assigns greater weight to differentially expressed genes by amplifying the exponent of the error. Given a minibatch of |$B_{1}$| data points, we randomly sample 95% of the genes to mask and compute the autofocus loss |$\mathcal{L}_{\mathrm{autofocus}}$| on them:
Moreover, we introduce an additional direction-aware loss to increase the predicted data’s sensitivity to the direction relative to the control group, which is defined as follows:
The optimization objective for the MLM task, denoted as |$\mathcal{L}_{\mathrm{MLM}}$|, is ultimately calculated by taking a weighted sum of the values of |$\mathcal{L}_{\mathrm{autofocus}}$| and |$\mathcal{L}_{\mathrm{direction}}$|:
where |$\lambda $| functions to adjust the weight for the directional loss.
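The following hedged PyTorch sketch illustrates the two terms: an autofocus term that up-weights differentially expressed genes by raising the absolute error to an exponent |$2+\gamma$|, and a direction-aware term that penalizes predictions whose change relative to the control disagrees in sign with the true change. The exact formulations and the weight |$\lambda$| follow GEARS [32]; the values here are illustrative.

```python
# Hedged sketch of the fine-tuning objective (autofocus + direction-aware terms).
import torch

def mlm_loss(pred, truth, ctrl, gamma=1.0, lam=0.1):
    """pred, truth, ctrl: (B1, n_genes) predicted, observed and control expression."""
    # Autofocus: |error|^(2 + gamma) gives differentially expressed genes more weight.
    autofocus = (pred - truth).abs().pow(2.0 + gamma).mean()
    # Direction-aware: penalize sign disagreement of the change relative to control.
    direction = (torch.sign(truth - ctrl) - torch.sign(pred - ctrl)).pow(2).mean()
    return autofocus + lam * direction
```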
We then represent the gene expression signatures for subsequent alignment by calculating the difference in post-encoding embedding between the desired and control states:
Given that some other single-cell foundation models are available, we conducted an ablation study in Appendix D to demonstrate scGPT’s advantages in efficiency and alignment accuracy.
Alignment
Alignment is the second step (see Fig. 1A) and aims to extract the mutual information between genetic and small molecular modalities, thus enhancing the controllability of the generation model. In this process, we adopt contrastive learning [33, 34] for multi-modal matching and introduce a hierarchical gene-molecule matching strategy to overcome model collapse.
Hierarchical clustering
Hierarchical clustering merges data into clusters step by step by calculating their similarity until a complete tree-like structure is formed. We use this strategy to cluster the small molecule embeddings into |$C=\{c_{1},c_{2},c_{3}\}$| classes. The detailed process is shown in Algorithm 1.
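A compact sketch of this clustering step, complementing Algorithm 1, is given below; the number of clusters at each granularity level is an illustrative choice rather than a value reported in the paper.

```python
# Sketch: agglomerative clustering of molecule embeddings, cut at three nested
# granularities C = {c1, c2, c3}; cluster counts per level are illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def hierarchical_labels(z_m: np.ndarray, levels=(10, 50, 200)):
    """z_m: (n_molecules, d) molecule embeddings -> cluster labels at each granularity."""
    tree = linkage(z_m, method="ward")                   # agglomerative merge tree
    return {f"c{i + 1}": fcluster(tree, t=k, criterion="maxclust")
            for i, k in enumerate(levels)}
```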

Hierarchical gene-molecule matching
Through hierarchical clustering, molecules are grouped into clusters, and the model can be guided by a cross-entropy loss at a coarse resolution. Given that the small molecules have been clustered into |$C=\{c_{1},c_{2},c_{3}\}$| classes, we require the genetic encoder to classify the corresponding perturbations based on gene expression signatures. The quality of this task is measured by the cross-entropy loss |$\mathcal{L}_{\mathrm{ce}:c} = -\sum_{i} y_{c,i}\log p_{c,i}$|,
where |$y_{c, i}$| and |$p_{c, i}$| denote the truth labels and predicted labels for the |$c$|th cluster of small molecules based on gene embeddings.
Contrastive learning alignment
The final layer |$z_{g}$| output in the genetic encoder is aligned with small molecular vectors |$z_{m}$| via contrastive learning. The optimization objective is to minimize the InfoNCE loss function [35]:
where |${z_{m}}^{\prime }$| and |${z_{g}}^{\prime }$| are unpaired molecule-gene negative sample pairs in a minibatch composed of |$B_{2}$| data points, and |$E(.,.)$| is an energy (scalar) function defined on these two modalities. We use vector dot product for representation; |$ \tau $| is a learnable coefficient. The InfoNCE loss aims to minimize the distance between positive samples and maximize the distance between negative samples.
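A minimal PyTorch sketch of this loss is given below, using a dot-product energy and a learnable temperature |$\tau$|; symmetrizing over both matching directions is an implementation choice of this sketch, not necessarily the paper's exact form.

```python
# Minimal sketch of the InfoNCE alignment loss with dot-product energy and learnable tau.
import torch
import torch.nn.functional as F

def info_nce(z_m: torch.Tensor, z_g: torch.Tensor, log_tau: torch.Tensor) -> torch.Tensor:
    """z_m, z_g: (B2, d) paired molecule / gene-signature embeddings from one minibatch."""
    logits = (z_g @ z_m.t()) / log_tau.exp()              # E(z_g, z_m) = dot product, scaled by tau
    targets = torch.arange(z_m.size(0), device=z_m.device)
    return 0.5 * (F.cross_entropy(logits, targets) +      # genes -> molecules
                  F.cross_entropy(logits.t(), targets))   # molecules -> genes
```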
Overall, during the learning process, our genetic encoder optimizes two objectives: the InfoNCE loss |$\mathcal{L}_{\mathrm{InfoNCE}}$| for contrastive learning and the cross-entropy loss |$\mathcal{L}_{\mathrm{ce}:c}$| for hierarchical gene-molecule matching, which can also serve as a hard sample mining task:
Facilitator
Assuming our dataset consists of sample pairs |$(m, g)$|, where |$m$| represents a small molecule and |$g$| represents its corresponding gene expression signatures, we model the transition from gene expression signatures to molecules, denoted as |$P(m|g)$|, using a two-phase process involving modality facilitation and model generation [36]. Let |$z_{m}$| and |$z_{g}$| continue to represent the small molecule and gene embeddings, respectively. Thus, |$P(m|g)$| can be decomposed as |$P(m|g) = P(m, z_{m}|g) = P(m|z_{m}, g)\,P(z_{m}|g)$|.
The first equality holds because |$z_{m}$| is a deterministic function of the small molecule |$m$|, and the second follows from the chain rule [21, 22].
When the small molecule decoder |$P\left (m\middle | z_{m},g\right )$| has been determined [12], the primary focus is on optimizing the facilitator |$P\left (z_{m}\middle | g\right )$| (see Fig. 1E). We adopt an auto-regressive model for modality transformation [37–39].
Since |$z_{m}$| is a continuous value, the cross-entropy loss function cannot be directly applied. To solve this, we adopt the approach used in DALL·E 2 [22]: we reduce the dimension of |$z_{m}$| from |$d$| to |$L$| using PCA. This reduction retains nearly all the information (with a reconstruction mean squared error below 1%). Subsequently, we discretize the |$L$|-dimensional values into |$N_{m}$| bins, assuming that these values are uniformly distributed.
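The sketch below illustrates this tokenization and its inverse (used later before decoding); the concrete values of |$L$| and |$N_{m}$| are placeholders.

```python
# Sketch of discretizing molecular embeddings into tokens (PCA + uniform binning)
# and of the inverse mapping used before decoding. L and n_bins are placeholders.
import numpy as np
from sklearn.decomposition import PCA

def fit_tokenizer(z_m: np.ndarray, L: int = 64):
    pca = PCA(n_components=L).fit(z_m)
    reduced = pca.transform(z_m)                    # (n_molecules, L)
    lo, hi = reduced.min(axis=0), reduced.max(axis=0)
    return pca, lo, hi

def tokenize(z_m, pca, lo, hi, n_bins: int = 256):
    reduced = pca.transform(z_m)
    width = (hi - lo) / n_bins
    return np.clip(((reduced - lo) / width).astype(int), 0, n_bins - 1)  # tokens in [0, n_bins - 1]

def detokenize(tokens, pca, lo, hi, n_bins: int = 256):
    width = (hi - lo) / n_bins
    centers = lo + (tokens + 0.5) * width           # bin centres back to continuous values
    return pca.inverse_transform(centers)           # approximate z_m for the decoder
```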
Figure 1E provides the mechanism of the Facilitator. It takes as input |$M$| genes, a special symbol [start], and |$L$| small molecule tokens and sequentially produces inferred tokens |$z_{m}^{d}$|. The loss function is defined as follows:
where |$t_{i}$| represents the |$i$|th true token, and |$p(z_{m}^{d(i)}|z_{m}^{d(0:i-1)})$| represents the inferred token of the small molecules.
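A hedged sketch of this teacher-forcing objective is shown below; the `facilitator` model interface is abstracted here and is not the paper's actual implementation.

```python
# Hedged sketch of the Facilitator training objective: teacher forcing with a
# cross-entropy over the L molecular tokens, conditioned on the gene tokens and [start].
import torch
import torch.nn.functional as F

def facilitator_loss(facilitator, gene_tokens, mol_tokens, start_id):
    """gene_tokens: (B, M) gene inputs; mol_tokens: (B, L) ground-truth discretized z_m."""
    start = torch.full_like(mol_tokens[:, :1], start_id)
    inputs = torch.cat([start, mol_tokens[:, :-1]], dim=1)    # shift right (teacher forcing)
    logits = facilitator(gene_tokens, inputs)                  # (B, L, vocab) next-token scores
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           mol_tokens.reshape(-1))
```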
Beam search optimization
Beam search is a commonly used search algorithm in sequence generation tasks [9, 40]. It applies dynamic programming principles by keeping track of a set of candidate sequences, known as the beam, with a beam width of |$K$|, to preserve promising results. Beam search effectively manages the search space and improves the accuracy and diversity of the generated sequences. Specifically, the algorithm proceeds as follows (a minimal code sketch is given after the list):
1. Initialize the beam with a set of candidate sequences, each consisting of |$M$| gene tokens and [start].
2. At each time step:
   (a) Calculate the likelihood of every candidate token becoming the next token, given the current sequence.
   (b) Select the top |$K$| candidate sequences based on their scores, where |$K$| is the beam width.
   (c) Expand each selected candidate sequence by appending the molecular tokens and updating its score.
3. Repeat step 2 until a complete sequence of |$L$| small molecule tokens is formed.
4. Return the top-ranked sequence from the beam as the final output.
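The following minimal sketch, referenced above, illustrates these steps; `step_logprobs` stands in for the Facilitator's next-token log-probabilities and is a placeholder rather than the paper's API.

```python
# Minimal beam-search sketch over molecular tokens with beam width K.
import torch

def beam_search(step_logprobs, gene_tokens, L, K, start_id):
    device = gene_tokens.device
    beams = [(torch.tensor([start_id], device=device), 0.0)]    # (sequence, cumulative log-prob)
    for _ in range(L):
        candidates = []
        for seq, score in beams:
            logp = step_logprobs(gene_tokens, seq)               # (vocab,) log-probabilities
            top_vals, top_ids = logp.topk(K)
            for v, i in zip(top_vals.tolist(), top_ids.tolist()):
                new_seq = torch.cat([seq, torch.tensor([i], device=device)])
                candidates.append((new_seq, score + v))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:K]  # keep the best K
    return beams                                                 # K candidate token sequences
```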
After the beam search process, the final output tokens |$z_{m}^{d}$| are transformed back into |$z_{m}$| using the inverse of PCA. The continuous |$z_{m}$| values are then fed into the molecular decoder |$P(m|z_{m},g)$| [12]. This decoding procedure results in |$K$| sets of molecule candidates. We choose the value of |$K$| by trading off molecular generation quality against time consumption; the detailed sensitivity analysis is in Appendix C.
Results
Generating hit-like molecules from compound-induced gene expression signatures
In this section, we assess the performance of our model GexMolGen in generating hit-like molecules from compound-induced gene expression signatures. We feed a set of |$N_{3}$| hold-out samples into the model and compare the generated molecules with the reference molecules in the hold-out samples.
In Fig. 2, we illustrate this comparison (for more details, please refer to Table A.1). The results indicate that the generated molecules can successfully reproduce some complex scaffold structures that appeared in the reference molecules, such as long chains (see Fig. 2(3)(6)), cyclic structures (see Fig. 2(1)), and cross-like structures (see Fig. 2(7)(9)).

Examples of generated molecules using a compound-induced gene expression signature.

Examples of generated molecules using knock-out gene expression signatures.
We compare our model with two traditional methods: searching molecules in the training set according to the cosine similarity or the Euclidean distance among gene expression signatures [4]. We also include a pre-trained hierVAE [12] that generates directly from samples in the molecular latent space, to demonstrate the guiding influence of genetic information. We adopt Tanimoto similarity based on Morgan fingerprints (radius 3, 1024 bits) as the global similarity metric, and Tanimoto similarity based on MACCS keys and Fraggle similarity as the local structural metrics.
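The sketch below shows how these three scores can be computed with RDKit; it is an illustrative reimplementation, not the authors' evaluation script, and minor Fraggle API details may vary across RDKit versions.

```python
# Illustrative computation of Morgan-Tanimoto, MACCS-Tanimoto and Fraggle similarity.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, MACCSkeys
from rdkit.Chem.Fraggle import FraggleSim

def similarity_scores(smiles_gen, smiles_ref):
    m1, m2 = Chem.MolFromSmiles(smiles_gen), Chem.MolFromSmiles(smiles_ref)
    morgan = DataStructs.TanimotoSimilarity(
        AllChem.GetMorganFingerprintAsBitVect(m1, 3, nBits=1024),
        AllChem.GetMorganFingerprintAsBitVect(m2, 3, nBits=1024))
    maccs = DataStructs.TanimotoSimilarity(MACCSkeys.GenMACCSKeys(m1),
                                           MACCSkeys.GenMACCSKeys(m2))
    fraggle, _ = FraggleSim.GetFraggleSimilarity(m1, m2)   # returns (similarity, match)
    return {"morgan": morgan, "maccs": maccs, "fraggle": fraggle}
```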

Examples of generated molecules based on knock-out gene expression signatures, extending the findings from Fig. 3.
Figure 5A shows the distribution of these metrics for the different methods; detailed results can be found in Tables A.1, A.2, and A.3. Our model GexMolGen demonstrates a significant improvement over the other three baseline methods (P-value |$\lt 0.001$|, one-sided Mann–Whitney U-test). Specifically, our model achieves a mean Fraggle score of |$0.57$|, a mean Morgan-based Tanimoto score of |$0.30$|, and a mean MACCS-based Tanimoto score of |$0.70$|. These results indicate that the molecules generated by our model are highly similar to the reference molecules in both fingerprints and substructures. Moreover, across all three metrics, the Euclidean distance search method is consistently inferior to the cosine similarity method. This discrepancy can be attributed to the curse of dimensionality: as the dimension increases, the Euclidean distance becomes less sensitive to differences in gene expression signatures. We also notice that the unconditional generation baseline has the poorest overall performance. Although it occasionally produces molecules with high similarity scores (e.g. |$1.19\%$| with a similarity |$\gt 0.29$| in Fig. 5A(b)), it also generates a larger proportion of irrelevant molecules (e.g. |$26.27\%$| with a similarity |$\lt 0.08$| in Fig. 5A(b)), which highlights the importance of guidance from gene expression signatures in generating molecules with the desired properties.
Generating hit-like molecules from knock-out gene expression signatures
To test the generalization ability of our model GexMolGen, we evaluate its capacity to generate potential gene inhibitors from knock-out gene expression data, which also serves as an out-of-distribution test. The underlying hypothesis is that inhibitors of a specific gene induce gene expression changes similar to those caused by knocking out that gene.
We use the same evaluation metrics and comparison methods described in the previous section. The collection of knock-out data and the corresponding known inhibitors is described in the Dataset used for evaluating subsection. Each method generates 100 candidate molecules for each gene based on the given knock-out transcriptome, and we record the scores between each candidate and its most similar known inhibitor.
Figures 3 and 4 present a visualization of the molecules generated by our model alongside the most similar known inhibitors, with similarity measured by Morgan-based Tanimoto scores. Compared with WGAN [4], which has an average score of |$0.26$| (see Table A.11 for details), our model generates molecules that not only exhibit a higher overall similarity to known inhibitors but also successfully reconstruct complete, contiguous skeletal structures.

Quantitative results of generated molecules. A. Comparison results in generating hit-like molecules from compound-induced gene expression signatures. B. Comparison results of potential gene inhibitors generated from knock-out gene expression signatures.
Table 1. Performance of other deep learning methods on similar tasks, compared with ours

| Name¹ | Data² | Form³ | Valid⁴ | Metric⁵ |
|---|---|---|---|---|
| WGAN [4] | L1000 | SMILES | 10% | 0.15–0.70 |
| Bidirectional Adversarial AE [41] | L1000 | SMILES | 76% | 0.30–0.48 |
| TRIOMPHE [42] | L1000 | SMILES/SELFIES | 16.89% / 93.96% | 0.29–0.68 |
| PaccMann [43] | GDSC/CCLE | SMILES | 96.2% | 0.27–0.45 |
| Deep Generative with RL [44] | TCGA project | SMILES | 95.9% | 0.47–0.83 |
| Ours | L1000 | Graph | 100% | 0.72–1.00 |

¹ Name: the name and source of the method. ² Data: the type of dataset. ³ Form: the input format of the molecular data. ⁴ Valid: the average validity of the molecular generator. ⁵ Metric: the range of Tanimoto similarity.
The results of these experiments are shown in Fig. 5B. Overall, our model significantly surpasses the three baseline methods, with higher average scores and greater diversity in most cases. Furthermore, owing to the limited size of the search database, the molecule candidates of the traditional methods exhibit low diversity; for instance, the MACCS-based scores for CTSK collapse into a nearly linear pattern. On the other hand, the non-conditional pre-trained hierVAE [12] produces the widest range of scores because of its unconditional molecular sampling, which simultaneously degrades molecular properties: it occasionally produces molecules with high similarity scores but also generates many molecules with very low correlation to known inhibitors, as observed for AKT2 in Fig. 5B. By leveraging genetic information, our model narrows down the generation space while still allowing some diversity, striking a balance between structural similarity and molecular diversity. We also summarize how other deep learning methods perform on similar tasks in Table 1. Despite some differences in the tested genes, our model consistently produces competitive results. Its notable advantages are its efficiency, with 100% validity of generated molecules, and its high input flexibility, both of which are missing in other methods. Detailed results can be found in Tables A.4, A.5, and A.6.

Results of screening. A. The flowchart using only our model’s retrieving function. B. Examples of searched molecules using knock-out gene expression signatures.
Screening molecules using cross-modal similarity
Our model GexMolGen also offers a well-defined cross-modal similarity for measuring the correlation between gene expression signatures and molecular structures. Therefore, it also supports screening molecules using gene expression profiles alone.
We evaluate the effectiveness of GexMolGen on the task described in the previous subsection. Specifically, our model first encodes the molecules from the reference dataset into |$Z_{m} = [z_{m_{1}},z_{m_{2}},...]$| and the gene expression data into |$Z_{g} = [z_{g_{1}},z_{g_{2}},...]$|, then calculates their alignment scores |${Z_{m}}^{T}\cdot Z_{g}$| to find the molecule with the highest score for each genetic input. The screening pipeline is visualized in Fig. 6A. This procedure is similar to traditional search methods, but the latter compute similarities only within a single modality; for instance, in CMap [2], gene expression queries are compared with recorded gene expression profiles induced by molecular perturbations.
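A minimal sketch of this screening step is given below, assuming the embeddings have already been produced by the aligned encoders.

```python
# Sketch of cross-modal screening: rank library molecules for each query signature
# by the alignment score Z_m^T . Z_g computed in the shared embedding space.
import numpy as np

def screen(z_m: np.ndarray, z_g: np.ndarray, top_k: int = 5):
    """z_m: (n_molecules, d) molecule embeddings; z_g: (n_queries, d) signature embeddings."""
    scores = z_g @ z_m.T                                   # (n_queries, n_molecules)
    order = np.argsort(-scores, axis=1)[:, :top_k]         # best-matching molecules per query
    return order, np.take_along_axis(scores, order, axis=1)
```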
Figure 6B displays some examples of screening from the same molecule datasets, and Table 2 summarizes the statistics. Both indicate that the molecules screened using our cross-modal similarity are much more similar to known inhibitors than those found by the traditional methods, reflecting the advantage of direct cross-modal similarity calculation. The supporting materials are given in Table A.7.
Table 2. MACCS-based similarity scores between screened molecules and known inhibitors, for our cross-modal screening (Ours) and traditional cosine (Cos.) and Euclidean (Euc.) searches

| Knock-out gene | MACCS scores (Ours) | MACCS scores (Cos.) | MACCS scores (Euc.) |
|---|---|---|---|
| AKT1 | 0.745 ± 0.056 | 0.710 ± 0.091 | 0.662 ± 0.130 |
| AKT2 | 0.690 ± 0.065 | 0.679 ± 0.098 | 0.661 ± 0.113 |
| AURKB | 0.694 ± 0.047 | 0.600 ± 0.024 | 0.621 ± 0.025 |
| CTSK | 0.658 ± 0.086 | 0.603 ± 0.069 | 0.590 ± 0.054 |
| EGFR | 0.752 ± 0.054 | 0.693 ± 0.051 | 0.724 ± 0.017 |
| HDAC1 | 0.649 ± 0.058 | 0.647 ± 0.028 | 0.651 ± 0.017 |
| MTOR | 0.740 ± 0.087 | 0.652 ± 0.045 | 0.652 ± 0.035 |
| PIK3CA | 0.684 ± 0.053 | 0.560 ± 0.030 | 0.588 ± 0.070 |
| SMAD3 | 0.820 ± 0.069 | 0.811 ± 0.065 | 0.782 ± 0.050 |
| TP53 | 0.793 ± 0.061 | 0.780 ± 0.055 | 0.766 ± 0.062 |
Discussion
This paper proposes GexMolGen, a method to generate hit-like molecules from signatures calculated as the desired gene expression relative to the initial state. We adopt a “first-align-then-generate” [21] strategy for cross-modal generation. Specifically, we utilize an advanced single-cell large language model, scGPT [17], as the genetic encoder and pre-train hierVAE [12], a graph-based molecular model, for molecular encoding and decoding. To enhance contrastive learning, we integrate a hierarchical gene-molecule matching approach into the alignment, serving as a hard negative sample mining task. During generation, the facilitator first converts genetic embeddings into molecular tokens; these tokens are then transformed back into continuous molecular embeddings and fed into the decoder to generate hit-like molecules.
Our method demonstrates superior performance compared with traditional search methods in generating molecules from both in-domain compound-induced expression signatures and out-of-domain gene knock-outs. Additionally, our model is efficient, ensuring 100% validity, and supports flexible inputs, surpassing other deep learning methods. It can also be used as a screening tool, since it provides an explicit metric for measuring the similarity between the genetic and molecular modalities.
Despite these strengths, our model faces certain limitations, particularly in the representation of genes and molecules. While current single-cell foundation models represent genes in a sequence format, gene interactions are more accurately depicted in a network form. Utilizing a graph neural network to aggregate information from network neighbors based on predefined biological networks might better mimic biological activity. Furthermore, the optimal molecular representation remains an unresolved issue. SMILES-based models [45–47] can generate large volumes of molecules quickly but may lose critical information due to their one-dimensional format. Emerging 3D molecular generation models [48, 49] offer more detailed representations but are computationally intensive. Integrating higher dimensional information, such as the angles of chemical bonds, into existing 2D molecular generators may strike a balance between molecular expressiveness and generation efficiency.
Key Points
We propose a novel cross-modal generation-based approach for designing molecules with specific biological activities.
GexMolGen is general and accepts flexible input formats by fully leveraging the power of foundation models.
GexMolGen provides an explicit metric for measuring the correlation between inputs and outputs, which also supports retrieval.
GexMolGen achieves 100% validity thanks to its scaffold-based molecular generation strategy.
Funding
This work was supported by grants from the National Natural Science Foundation of China 62103262 (to Y.Y.) and the Shanghai Pujiang Programme (no. 21PJ1407700 to Y.Y.).
Data availability
The data underlying this article are available in the article and its online supplementary material. The curated data can be found at https://zenodo.org/records/11100665?token=eyJhbGciOiJIUzUxMiIsImlhdCI6MTcxNDYyNzUzOSwiZXhwIjoxNzk4Njc1MTk5fQ.eyJpZCI6IjM3ZmFkNGU4LWViMDYtNGNkNy1iOTc4LWI0ZTBkMDk2OWI0YyIsImRhdGEiOnt9LCJyYW5kb20iOiJhOGJjZDM5NWFkY2ZiNDAwNjAzYzIwMTg2ODNjYWI2NCJ9.dADyS-0PBsFKr_z1yDdcDnoGoY5PFOSbnYtt6aIz4RLoNxykoIQffAlzQDPbFgqnZJmp7PmjNXPCXHkDMuZHuA. The raw data are downloaded from https://clue.io/data/CMap2020#LINCS2020, https://www.ebi.ac.uk/chembl/ and https://solr.ideaconsult.net/search/excape/#.
Author contributions
J.C., Q.Y., and Y.Y. conceived the concept of this work. The experiments were performed by J.C., with the assistance of X.P., Y.F., K.Y., and Y.X. All authors participated in analyzing the results and writing the manuscript. The project was supervised by X.P., Q.Y., and Y.Y.
Code availability
The source code of GexMolGen is available on Zenodo (https://zenodo.org/records/11092780?token=eyJhbGciOiJIUzUxMiIsImlhdCI6MTcxNDQ4MjYwOCwiZXhwIjoxNzk4Njc1MTk5fQ.eyJpZCI6IjQwMjc0ZTlkLTQ0YjItNDY0Mi1hYmRjLTlkM2VkMjVhNWE5OSIsImRhdGEiOnt9LCJyYW5kb20iOiI5MzhjMjdmZjdmNmRlZmMxZTU0MTNhMGNmZGIzMDNmOCJ9.p3lxB8h_BDhuHm_1yqr4L1ZjRCTXmftMzpNdYaexmvycISDkoUu1cizUFrmcRZ52QovkPeQDjsHBrdScRkVqiw), together with usage documentation and the setup environment.
Competing financial interests
The authors declare no competing financial interests.
References
Goodfellow IJ, Pouget-Abadie J, Mirza M, et al.
Bresson X, Laurent T.
De Cao N, Kipf T.
Jin W, Barzilay R, Jaakkola T.
Cui H, Wang C, Maan H, et al.
Heimberg G, Kuo T, DePianto D, et al.
Yang X, Liu G, Feng G, et al.
Ramesh A, Pavlov M, Goh G, et al.
Ramesh A, Dhariwal P, Nichol A, et al.
Liu S, Li Y, Li Z, et al.
Dai H, Dai B, Song L.
Fey M, Yuen J-G, Weichert F.
Roohani Y, Huang K, Leskovec J.
Khosla P, Teterwak P, Wang C.
van den Oord A, Li Y, Vinyals O.
Ding M, Yang Z, Hong W, et al.
Radford A, Wu J, Child R, et al.
Lewis M, Liu Y, Goyal N, et al.
Raffel C, Shazeer N, Roberts A, et al.
OpenAI, Achiam J, Adler S, et al.
Kaitoh K, Yamanishi Y.
Pereira T, Abbasi M, Oliveira RI, et al.