Predicting the structure of unexplored novel fentanyl analogues by deep learning model

Zhang, Yuan; Jiang, Qiaoyan; Li, Ling; Li, Zutan; Xu, Zhihui; Chen, Yuanyuan; Sun, Yang; Liu, Cheng; Mao, Zhengsheng; Chen, Feng; Li, Hualan; Cao, Yue; Pian, Cong

doi:10.1093/bib/bbac418

Abstract

Fentanyl and its analogues are psychoactive substances, and concern about fentanyl abuse has existed for decades. Because the structure of fentanyl is easy to modify, criminals may synthesize new fentanyl analogues to evade supervision. Drug supervision relies on matching structures against a database, and the database contains too few kinds of fentanyl analogues, so it is necessary to identify more potential fentanyl analogues and expand their sample space. In this study, we introduced two deep generative models (SeqGAN and MolGPT) to generate potential fentanyl analogues, and a total of 11 041 valid molecules were obtained. The results showed that the generated molecules not only have property distributions similar to those of the original data but also include potential fentanyl analogues that are not closely similar to any of the original data. Ten molecules were selected according to the rules for fentanyl analogues for NMR, MS and IR validation. The results indicated that these molecules are all unreported fentanyl analogues. This study is the first to apply deep learning to the generation of fentanyl analogues; it greatly expands the explored space of fentanyl analogues and supports the supervision of fentanyl.

Keywords: drug design, molecule generation, SeqGAN, MolGPT, fentanyl and fentanyl analogues, scaffolds

Issue Section:

Problem solving protocol

Introduction

Fentanyl is a phenylpiperidine opioid synthesized by the Belgian scientist Paul Janssen. It is used to treat many kinds of pain during and after surgery, to prevent or alleviate postoperative delirium and, in combination with anesthetics, as an adjunct to anesthesia. Fentanyl analogues are molecules in which certain groups of the original fentanyl structure are replaced by other specific groups. More precisely, a fentanyl analogue is a substance whose chemical structure, compared with that of fentanyl, meets one or more of the following four criteria. First, the propionyl group is replaced by another acyl group. Second, the phenyl group directly attached to the nitrogen atom is replaced by any substituted or unsubstituted monocyclic aromatic group. Third, the piperidine ring carries alkyl, alkenyl, alkoxy, ester, ether, hydroxyl, halogen, haloalkyl, amino or nitro substituents. Finally, the phenylethyl group is replaced by any other group (except a hydrogen atom). Fentanyl and its analogues were originally developed as effective analgesics, but their availability in clinics and pharmacies and the dependence they induce have led to abuse, and they have become third-generation drugs [1]. In recent years, forensic practice has repeatedly encountered criminal cases arising from drug abuse, and deaths from fentanyl overdose have become common [2, 3]. Supervision of psychoactive fentanyl analogues is therefore particularly important for protecting lives. New fentanyl analogues are new psychoactive substances that retain physiological activity and are obtained by modifying fentanyl and its existing abused analogues. Such analogues are ‘designer drugs’, which are often more toxic than the original structure. The problem is compounded by the fact that the finished product is sometimes adulterated with toxic impurities by drug manufacturers, making fatal poisoning more likely [4].
Because too few fentanyl analogues are recorded in public security databases, criminals can evade supervision by modifying the chemical structure of fentanyl and its analogues. Identifying potential fentanyl analogues and expanding the existing drug database can therefore address the problem before it arises, improve the efficiency of forensic doctors and anti-narcotics police in detecting new fentanyl-type drug cases and make it harder for criminals to escape legal judgement.

Fentanyl analogues have a large potential chemical space; that is, many molecules meet the four criteria for fentanyl analogues. Exploring this space by experimental synthesis alone is time-consuming and inefficient. If deep learning techniques can be used to generate potential fentanyl analogues, they can guide researchers who study these analogues experimentally and reduce the time spent exploring novel fentanyl molecular structures. Deep learning has evolved rapidly in recent years, with great success in areas such as computer vision and natural language processing. A deep generative model is a deep learning model that generates random samples by learning the probability density of observable data. Many new deep generative models have been proposed, and in recent years a variety of them have made progress in generating chemical molecules. These models all rely on some representation of chemical molecules. The most widely used string representation is the simplified molecular input line entry system (SMILES) [5], but there are other string representations, such as DeepSMILES [6], SMARTS and SELFIES [7]. In addition to string representations, chemical molecules can be represented by molecular fingerprints and molecular graphs. There are several main types of deep generative models. The first is the variational autoencoder, which is based on variational inference, together with its variants; a number of researchers have used this model to generate molecules [8–12]. The second is the generative adversarial network, one of the most popular research directions in machine learning and deep learning [13].
Because generative adversarial networks were originally designed for image generation, they are difficult to train for sequence generation. To solve this problem, Lantao Yu et al. proposed SeqGAN, which combines a generative adversarial network with reinforcement learning. The model uses reinforcement learning to bypass the problems caused by sampling discrete tokens, so that a generative adversarial network can be used to generate text sequences [14]. Many researchers have proposed generative adversarial networks for generating chemical molecules [15–17]. The third type is the deep generative model based on a recurrent neural network (RNN), which adds atoms one by one until a complete molecule is formed [18, 19]. In addition to atom-by-atom generation, researchers have begun to study the generation of new molecules based on molecular scaffolds [20–22]. Because chemical molecules can be represented as SMILES strings, some models from natural language processing can be used for molecular generation and for the analysis of chemical reactions [23]. For example, MolGPT, proposed by Viraj Bagal et al., uses the Transformer decoder to generate molecules, and generation can be conditioned on molecular properties and scaffolds [24]. With so many deep generative models proposed for molecule generation, comparing new models with earlier ones has become a challenge. The benchmarks MOSES and GuacaMol were proposed to provide researchers with datasets (filtered molecules from the ZINC Clean Leads collection and a standardized subset of ChEMBL24, respectively) and evaluation metrics, which make it convenient to compare a new generative model with previous ones [25, 26]. Many researchers have applied the above generative models to the generation of chemical molecules.
Josep Arús-Pous et al. used the RNN-based model CharRNN to expand the GDB-13 dataset [27]. Michael A. Skinnider et al. automatically elucidated the chemical structures of unidentified new psychoactive substances using a deep generative model and mass spectrometry data [28]. When generating molecules of a specific class, we often face the problem of insufficient data. Michael Moret et al. proposed three solutions to insufficient training data: data augmentation [29–31], temperature sampling and transfer learning [32].

In this study, two generative models were used to generate potential fentanyl analogues. The first scheme uses the deep generative model SeqGAN. We trained SeqGAN on existing data for fentanyl and its analogues and sampled 10 000 molecules from the trained model. After a series of filtering steps, 546 valid molecules were obtained. The second scheme uses the MolGPT model together with the scaffolds of fentanyl and its analogues. We pretrained MolGPT on the dataset used by MOSES and extracted 14 scaffolds from the 55 molecules of fentanyl and its analogues. For each scaffold, 10 000 molecules were sampled, giving 140 000 molecules in total. After a series of filtering steps, 10 495 valid molecules were obtained. Taking the intersection of the molecules generated by the two schemes shows that 164 molecules appear in the results of both. We chose 10 molecules based on the rules for fentanyl analogues: 5 from the 546 valid molecules generated by SeqGAN and 5 from the 10 495 valid molecules generated by MolGPT. These were verified by nuclear magnetic resonance (NMR), mass spectrometry (MS) and infrared spectroscopy (IR), and the results showed that all of them were unreported fentanyl analogues. These results indicate that both schemes can learn the distribution underlying the data for fentanyl and its analogues and generate potential fentanyl analogues. We adopted two schemes because the molecules generated by different models differ and have few intersections: the models generate molecules in different ways and therefore have different focuses.
The results show that both schemes help us explore potential fentanyl analogues. We hope that these models can learn the distribution underlying the data for fentanyl and its analogues, generate novel fentanyl molecules that have not been reported and provide new ideas for expanding drug databases.

Materials and methods

Figure 1 illustrates the main workflow of this paper. The method on the left of Step 1 uses SeqGAN to generate molecules and comprises three processes: encoding, generation and decoding. The method on the right of Step 1 generates molecules using the MolGPT model and the scaffold information of fentanyl and its analogues. Finally, we evaluate the generated molecules and select some of them for NMR, MS and IR verification.

Figure 1

The workflow. In the generation stage, SeqGAN model and MolGPT model are used for generation. In the molecular evaluation stage, the properties of the generated molecules are calculated and compared with the data of original fentanyl and its analogues. In the validation stage, the selected molecules are verified by NMR, mass spectrometry and IR spectroscopy.

Data

We downloaded data for 8 molecules (fentanyl and its analogues) from the PubChem database (https://pubchem.ncbi.nlm.nih.gov/), collected data for 63 fentanyl analogues from the Chinese and English literature and obtained 55 molecules after removing duplicates.

SeqGAN is used to generate potential fentanyl analogues

Encoding

We use a dictionary to transform the SMILES strings of fentanyl and its analogues into sequences of numbers from 0 to 44. The correspondence between SMILES characters and numbers is listed in Table 1. In our dictionary, single symbols represent atom notations with multiple characters, multiple hydrogen atoms and charges. For example, the symbol Q represents the atom Cl, W represents the atom Br, Z represents two hydrogen atoms (H₂), X represents three hydrogen atoms (H₃), ~ represents −, ! represents the charge −2, & represents the charge −3, u represents the charge +2 and y represents the charge +3. The SMILES strings of fentanyl and its analogues differ in length. To generate data with the deep generative model SeqGAN, we pad all sequences to the same length (1.5 times the maximum length of all sequences) by appending the symbol ‘_’ to the end of each sequence.

Table 1

Dictionary. The bold values represent the encoding of possible characters in SMILES.

Number	0	1	2	3	4	5	6	7	8
Vocabulary	^	H	B	c	C	n	N	o	O
Number	9	10	11	12	13	14	15	16	17
Vocabulary	p	P	s	S	F	Q	W	I	[
Number	18	19	20	21	22	23	24	25	26
Vocabulary	]	+	u	y	~	!	&	Z	X
Number	27	28	29	30	31	32	33	34	35
Vocabulary	−	=	#	.	(	)	1	2	3
Number	36	37	38	39	40	41	42	43	44
Vocabulary	4	5	6	7	8	@	/	\	_

Generation

The generative adversarial network is a generative model proposed by Goodfellow et al. in 2014 that has been successfully applied in computer vision. It consists of two main networks, a generator and a discriminator. Both are neural networks, which can be as simple as a multilayer perceptron or can be convolutional or recurrent neural networks. The generator attempts to capture the distribution underlying the data and to generate data that can deceive the discriminator, whereas the discriminator tries to distinguish whether the data are real or produced by the generator. The two networks compete with each other until a Nash equilibrium is reached. When the discriminator, despite having become a capable classifier, can no longer determine the source of the data, the generator has learned the real data distribution. The objective function for training a generative adversarial network is as follows:

\min_{G} \max_{D} V(D, G) = E_{x \sim p_{data}(x)} [\log D(x)] + E_{z \sim p_{z}(z)} [\log (1 - D(G(z)))]

where x represents real data and z comes from the latent variable distribution.
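As a numerical illustration of the objective above, the following minimal sketch estimates V(D, G) from discriminator outputs on a batch of real samples and a batch of generated samples. The function name `gan_value` and the example scores are illustrative assumptions and are not taken from the original work.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from generated data outputs 0.5
# everywhere, which gives V = -2 log 2.
v_equilibrium = gan_value([0.5, 0.5], [0.5, 0.5])

# A confident, correct discriminator pushes V towards 0 from below.
v_confident = gan_value([0.99, 0.98], [0.01, 0.02])
```

The discriminator is trained to increase this value and the generator to decrease it, which is the min-max structure of the objective.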

Despite the success of generative adversarial networks in computer vision, they are difficult to train on sequence data because the tokens are discrete. There are three common ways to address the discreteness of text data: first, feed the output of the generator directly to the discriminator; second, use Gumbel-softmax instead of softmax; third, bypass the problems caused by sampling through reinforcement learning. Following the third approach, Lantao Yu et al. proposed SeqGAN, which combines a generative adversarial network with reinforcement learning so that generative adversarial networks can be used to generate text sequences. The generator of SeqGAN is a recurrent neural network with long short-term memory (LSTM) units, and the discriminator is a convolutional neural network. The discriminator is trained on the data for fentanyl and its analogues together with sequences produced by the generator, and learns to distinguish whether a sequence comes from the real data or from the generator. Because the generator is a recurrent neural network, it generates a sequence token by token. Suppose the generator has produced the first t tokens; these tokens constitute the current state. At this point the generator needs feedback from the discriminator before it generates the next token. However, if the discriminator were given the incomplete sequence directly, it could easily tell it apart from real data, because the structure of an intermediate state differs greatly from that of a complete real sequence. Therefore, after the generator produces the next token, it uses Monte Carlo search with a rollout policy to sample the remaining tokens and obtain a complete sequence.
The discriminator calculates the reward after receiving the complete sequence. Since the generator produces discrete data, SeqGAN uses the policy gradient algorithm from reinforcement learning to pass the reward from the discriminator to the generator. In practice, to improve the efficiency of adversarial training, the generator and discriminator are first pretrained, so that the generator can produce realistic data and the discriminator knows what kind of data the generator produces. We use the encoded data as the real data for both pretraining and adversarial training of SeqGAN: the generator learns to generate similar encoded data, and the discriminator determines whether its input is encoded real data or generated data. After training, the generator is used to produce a further set of data, which is our final generated data.
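The Monte Carlo rollout step can be sketched as follows. This is a toy illustration under stated assumptions, not the SeqGAN implementation: the rollout policy here is uniform random sampling (SeqGAN rolls out with the generator itself), and `toy_discriminator` is a hypothetical scoring function standing in for the trained convolutional discriminator.

```python
import random

def rollout_reward(prefix, seq_len, vocab, discriminator, n_rollouts, rng):
    """Estimate the reward of a partial sequence by Monte Carlo rollouts.

    Each rollout completes `prefix` to `seq_len` tokens and scores the
    complete sequence with `discriminator`; the reward is the average score.
    """
    total = 0.0
    for _ in range(n_rollouts):
        sequence = list(prefix)
        while len(sequence) < seq_len:
            # Assumption: uniform random rollout policy, for illustration only.
            sequence.append(rng.choice(vocab))
        total += discriminator(sequence)
    return total / n_rollouts

def toy_discriminator(sequence):
    """Hypothetical scorer: fraction of tokens equal to 'C'."""
    return sum(1 for tok in sequence if tok == "C") / len(sequence)

rng = random.Random(0)
reward = rollout_reward(["C", "C"], seq_len=4, vocab=["C", "N"],
                        discriminator=toy_discriminator,
                        n_rollouts=2000, rng=rng)
```

In a policy gradient update, a reward of this kind weights the log-probability of the token that the generator has just chosen, so tokens that tend to lead to convincing complete sequences become more likely.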

Decoding

The generated number sequences are decoded into SMILES format using the dictionary, and the padding symbol ‘_’ is removed. The correspondence between numbers and SMILES characters is listed in Table 1. During decoding, each single symbol that replaced a multi-character notation in the encoding phase is restored to its original form to give the final generated data.
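The encoding, padding and decoding steps can be sketched as a round trip. This is a minimal sketch that assumes the vocabulary order of Table 1 and the substitutions named in the Encoding section; the function names, the rounding of the padded length and the simple example molecules are illustrative assumptions. The plain string replacement used here is naive and could misread, for example, a bond followed by a ring-closure digit as a charge.

```python
# Vocabulary in the order of Table 1 (index = code). ASCII '-' stands for the
# minus sign at position 27; '_' (code 44) is the padding symbol.
VOCAB = list("^HBcCnNoOpPsSFQWI[]+uy~!&ZX-=#.()12345678@/\\_")
CHAR_TO_ID = {ch: i for i, ch in enumerate(VOCAB)}
PAD_ID = CHAR_TO_ID["_"]

# Multi-character notations and their single-symbol replacements.
MULTI_TO_SINGLE = [("Cl", "Q"), ("Br", "W"), ("H2", "Z"), ("H3", "X"),
                   ("-2", "!"), ("-3", "&"), ("+2", "u"), ("+3", "y")]
SINGLE_TO_MULTI = {single: multi for multi, single in MULTI_TO_SINGLE}

def encode(smiles):
    """Replace multi-character notations, then map each character to its code."""
    for multi, single in MULTI_TO_SINGLE:
        smiles = smiles.replace(multi, single)
    return [CHAR_TO_ID[ch] for ch in smiles]

def pad_all(sequences):
    """Pad every sequence to 1.5 times the maximum length (rounded down here)."""
    target = int(1.5 * max(len(seq) for seq in sequences))
    return [seq + [PAD_ID] * (target - len(seq)) for seq in sequences]

def decode(ids):
    """Drop padding and restore multi-character notations token by token."""
    chars = [VOCAB[i] for i in ids if i != PAD_ID]
    return "".join(SINGLE_TO_MULTI.get(ch, ch) for ch in chars)

smiles_list = ["ClCCBr", "c1ccccc1", "C[NH3+]"]  # simple example molecules
padded = pad_all([encode(s) for s in smiles_list])
restored = [decode(seq) for seq in padded]
```

Decoding works token by token rather than by string replacement, so a restored notation such as Br is never re-interpreted as the single characters it contains.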

Data augmentation

The SMILES format is the most popular molecular representation in molecule generation and molecular property prediction. Although each molecule has only one canonical SMILES string, many different randomized SMILES strings can correspond to the same molecule. Data augmentation means representing the same molecule by several different randomized SMILES strings. In our experiment, 10-fold data augmentation is used; the process is shown in Figure 2. We take different atoms of the same molecule as starting sites (the red numbers in Figure 2) and traverse the molecule in different directions to obtain the augmented SMILES strings. We used the RDKit package in Python to perform 10-fold data augmentation on fentanyl and its analogues, yielding 550 SMILES strings. SeqGAN was then trained using the data for the 55 original molecules and the 10-fold augmented data.

Figure 2

Data augmentation. The red numbers indicate the traversal start sites of the randomized SMILES. RDKit is a Python library for processing chemical molecular information.

Model training of SeqGAN models

We built a Python (2.7.18) and tensorflow-gpu (1.10.0) environment. Model training and data generation took 2.5 days on Tesla V100S GPUs. The input to the generator is a random variable drawn from a normal distribution, batch_size is set to 1 and the generator is pretrained for 120 epochs. The generator optimizer is Adam with learning rate 0.01, beta1 = 0.9, beta2 = 0.999 and epsilon = 1e-08. The discriminator is pretrained for 3 epochs per round, and this pretraining round is repeated 50 times. The discriminator optimizer is Adam with learning rate 0.0001, beta1 = 0.9, beta2 = 0.999 and epsilon = 1e-08. Formal adversarial training runs for 20 epochs.

Potential fentanyl analogues were generated by MolGPT model and the scaffolds’ information of fentanyl and its analogues

The previous scheme generates molecules character by character, so the model may not learn the scaffold information of fentanyl and its analogues during generation. Because the scaffold plays an important role in fentanyl analogues, we use MolGPT together with scaffold information to generate potential fentanyl analogues. The MolGPT model is shown in Figure 1 (Step 1, right). It is essentially a small version of the Generative Pretraining Transformer (GPT) model and consists of the decoder part of the Transformer. MolGPT is composed of stacked decoder blocks, each of which contains a masked self-attention layer and a fully connected layer. It is a language model trained to predict the next token given the preceding tokens. We tokenized the SMILES strings using a dictionary and extracted molecular scaffolds using RDKit. All SMILES tokens are mapped to 256-dimensional vectors by an embedding layer, and the scaffold information is also converted into 256-dimensional vectors. During training, the scaffold information is passed to the model along with the SMILES strings of the molecules. At generation time, we provide the model with a starting token, and the model repeatedly predicts the next token until a molecule is generated. First, we pretrain MolGPT on the MOSES benchmark dataset. Second, we extract the scaffolds of fentanyl and its analogues with RDKit; the 14 Murcko scaffolds are listed in Table 2. Finally, we use MolGPT and the 14 Murcko scaffolds to generate molecules for each scaffold. Ten thousand molecules were sampled per scaffold, for a total of 140 000 molecules.
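The masking in the self-attention layer is what allows a decoder-only model to be trained on next-token prediction. The following minimal sketch shows single-head scaled dot-product attention with a causal mask in plain Python. It is an illustration only: the learned projections and multiple heads of a real Transformer decoder are omitted, and the function name and the toy vectors are assumptions.

```python
import math

def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention in which position t attends only to
    positions 0..t, so later tokens cannot influence earlier outputs."""
    dim = len(queries[0])
    outputs = []
    for t, query in enumerate(queries):
        # Scores for visible positions only (the causal mask).
        scores = [sum(q * k for q, k in zip(query, keys[s])) / math.sqrt(dim)
                  for s in range(t + 1)]
        # Numerically stable softmax over the visible positions.
        top = max(scores)
        weights = [math.exp(score - top) for score in scores]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        # Weighted sum of the visible value vectors.
        outputs.append([sum(w * values[s][j] for s, w in enumerate(weights))
                        for j in range(len(values[0]))])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy token embeddings
attended = causal_self_attention(tokens, tokens, tokens)
```

Because the first position can see only itself, its output equals its own value vector, and changing a later token leaves all earlier outputs unchanged.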

Table 2

Scaffolds of the original fentanyl and its analogues

Number	Smiles format for scaffolds of fentanyl and its analogues
1	O=C(C1CC1)N(c1ccccc1)C1CCN(CCc2ccccc2)CC1
2	O=C(C1CCCO1)N(c1ccccc1)C1CCN(CCc2ccccc2)CC1
3	O=C(CCc1ccccc1)N(c1ccccc1)C1CCN(CCc2ccccc2)CC1
4	O=C(c1ccc2c(c1)OCO2)N(c1ccccc1)C1CCN(CCc2ccccc2)CC1
5	O=C(c1ccccc1)N(c1ccccc1)C1CCN(CCc2ccccc2)CC1
6	O=C(c1ccco1)N(c1ccccc1)C1CCN(CCc2ccccc2)CC1
7	O=C1CCC2C3Cc4cccc5c4C2(CCN3)C1O5.c1ccc(CCN2CCC(Nc3ccccc3)CC2)cc1
8	O=c1[nH]nnn1CCN1CCC(Nc2ccccc2)CC1
9	c1ccc(CCN2CCC(Nc3ccccc3)CC2)cc1
10	c1ccc(CC[NH+]2CCC(Nc3ccccc3)CC2)cc1
11	c1ccc(CN2CCC(Nc3ccccc3)CC2)cc1
12	c1ccc(NC2CCN(CCc3cccs3)CC2)cc1
13	c1ccc(NC2CCNCC2)cc1
14	c1ccccc1

Model training of MolGPT models

We built a Python (3.7.9) and PyTorch (1.8.1) environment. Model training and data generation took 1 day on Tesla V100S GPUs. The batch_size was set to 384 in the pretraining phase, 16 in the training phase and 384 in the generation phase. The number of epochs was set to 10 in all phases. The optimizer was AdamW with learning rate 6e-3, beta1 = 0.9 and beta2 = 0.95.

Metrics and the evaluation of molecular property

We use three criteria and the properties of generated molecules to measure the quality of our generated data. The three criteria are defined as follows:

\text{validity} = \frac{\text{number of valid molecules}}{\text{number of all generated molecules}}

\text{uniqueness} = \frac{\text{number of unique molecules}}{\text{number of valid molecules}}

\text{novelty} = \frac{\text{number of valid molecules that are not in the dataset}}{\text{number of valid molecules}}

Validity: Validity is the proportion of valid molecules among all generated molecules. A generated molecule is valid if its SMILES string can be converted into a molecular graph; otherwise it is considered invalid. We used RDKit to check the validity of each molecule. Validity measures the ability of the model to learn SMILES syntax.

Uniqueness: Uniqueness is the proportion of unique molecules in the set of valid molecules. A molecule is unique if no other generated molecule is identical to it.

Novelty: Novelty is the proportion of valid generated molecules that are not present in the training set. A novel molecule is therefore both valid and absent from the training dataset. Low novelty indicates overfitting.
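The three criteria can be computed as in the following minimal sketch. Several choices here are assumptions made for illustration: molecules are compared as strings (which presumes canonical SMILES), uniqueness is computed as the fraction of distinct strings among the valid ones (a common convention), and `toy_is_valid` is a hypothetical stand-in for the RDKit parsing check, testing only that parentheses are balanced.

```python
def generation_metrics(generated, training_set, is_valid):
    """Return (validity, uniqueness, novelty) for a list of generated strings."""
    valid = [s for s in generated if is_valid(s)]
    if not generated or not valid:
        return 0.0, 0.0, 0.0
    validity = len(valid) / len(generated)
    # Assumption: unique molecules are counted as distinct strings.
    uniqueness = len(set(valid)) / len(valid)
    # Valid molecules that are absent from the training data.
    novelty = sum(1 for s in valid if s not in training_set) / len(valid)
    return validity, uniqueness, novelty

def toy_is_valid(smiles):
    """Hypothetical validity check: balanced parentheses only."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

generated = ["CCO", "CCO", "C(C", "CCN", "c1ccccc1"]  # toy examples
training = {"CCO"}
validity, uniqueness, novelty = generation_metrics(generated, training, toy_is_valid)
```

In this toy example four of the five strings pass the check, three of those four are distinct and two of the four are absent from the training set.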

Molecular properties are introduced as follows:

Molecular weight: Molecular weight is the sum of the relative atomic masses of all atoms in the molecule. It can be calculated with the RDKit package [33].

logP [34]: The partition coefficient is the ratio of the concentrations of a solute in two immiscible solvents at equilibrium, and logP is its logarithm. If one solvent is water and the other is oil, logP is the oil–water partition coefficient, which reflects the hydrophilicity and hydrophobicity of a chemical molecule.

Synthetic accessibility (SA) [35]: This measures how difficult a chemical molecule is to synthesize. SA scores range from 1 to 10; values closer to 1 indicate easier synthesis and values closer to 10 indicate more difficult synthesis.

Natural product-likeness (NP) [36]: This index measures the similarity of chemical molecules to natural products. The NP score lies between −5 and 5, with larger values indicating that a molecule is more likely to be a natural product.

Quantitative estimate of drug-likeness (QED) [37]: QED is a method to quantify drug-likeness as a value between 0 and 1.

sp³ hybridization ratio: The proportion of sp³-hybridized carbon atoms in each molecule.

The measures of the difference between two distributions are as follows:

KL divergence: KL divergence is an asymmetric measure of the difference between two probability distributions. Let P(x) and Q(x) be two probability distributions of a random variable x. For discrete and continuous random variables, respectively, KL divergence is defined as follows:

KL(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}

KL(P \| Q) = \int P(x) \log \frac{P(x)}{Q(x)} dx

JS divergence: JS divergence is a symmetric measure of the difference between two probability distributions. Its value lies between 0 and 1 when the logarithm is taken to base 2. JS divergence is defined as follows:

JS(P \| Q) = \frac{1}{2} KL(P \| \frac{1}{2}(P + Q)) + \frac{1}{2} KL(Q \| \frac{1}{2}(P + Q))
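For discrete distributions, both divergences can be computed directly from their definitions. The sketch below is illustrative: the function names are assumptions, the distributions are given as lists of probabilities over the same outcomes, and terms with P(x) = 0 are treated as contributing zero.

```python
import math

def kl_divergence(p, q, base=math.e):
    """Discrete KL(P || Q). Requires q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi, base)
               for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Discrete JS divergence with base-2 logarithms, so the value is in [0, 1]."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture, 2) + 0.5 * kl_divergence(q, mixture, 2)

p = [0.5, 0.5]
q = [0.25, 0.75]
kl_pq = kl_divergence(p, q)   # differs from kl_qp: KL is asymmetric
kl_qp = kl_divergence(q, p)
js_pq = js_divergence(p, q)   # equals js_divergence(q, p)
js_disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])  # maximal value
```

The JS divergence stays finite even when the two distributions have disjoint support, because each is compared with their mixture rather than with the other distribution.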

Wasserstein distance: Wasserstein distance is a family of measures of the distance between two probability distributions. It is defined on a metric space (M, ρ), where ρ(x, y) is the distance between two instances x and y in the set M. The p-th Wasserstein distance between two probability distributions P and Q is defined as follows:

W_{p}(P, Q) = (\inf_{μ \in Γ(P, Q)} \int ρ(x, y)^{p} dμ(x, y))^{\frac{1}{p}}

where Γ(P, Q) is the set of all joint distributions on M \times M whose marginal distributions are P and Q.
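For one-dimensional samples of equal size with ρ(x, y) = |x − y|, the optimal coupling pairs the sorted values of the two samples, so the distance has a simple closed form. The sketch below covers only this special case; the function name and the sample values are illustrative assumptions.

```python
def wasserstein_1d(xs, ys, p=1):
    """p-th Wasserstein distance between two equal-size 1-D empirical samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    # In one dimension the optimal coupling matches sorted values.
    total = sum(abs(x - y) ** p for x, y in zip(sorted(xs), sorted(ys)))
    return (total / len(xs)) ** (1.0 / p)

shifted = wasserstein_1d([0.0, 1.0, 2.0], [3.0, 1.0, 2.0])  # sample shifted by 1
spread_p1 = wasserstein_1d([0.0, 0.0], [1.0, 3.0], p=1)
spread_p2 = wasserstein_1d([0.0, 0.0], [1.0, 3.0], p=2)
```

A uniform shift of a sample by a constant gives a distance equal to that constant for every p, whereas samples that differ in spread give larger values for larger p.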

Results

Our experiments consist of generating molecules with deep learning models, evaluating the generated molecules and verifying selected molecules experimentally. In the first scheme, we pretrained the generator on the data for the 55 molecules of fentanyl and its analogues so that it would be guided towards realistic data. We generated 10 000 sequences and, after a series of filtering steps and the removal of duplicates, obtained 546 valid molecules that are not in the training data. We also trained the model on the augmented data and again generated 10 000 sequences; after filtering and the removal of duplicates, we obtained 875 valid molecules that are not in the training data. In the second scheme, we generated molecules based on the Murcko scaffolds of fentanyl and its analogues. We extracted 14 Murcko scaffolds using the RDKit library in Python and used MolGPT to generate molecules for each scaffold. Ten thousand molecules were sampled per scaffold, for a total of 140 000 molecules. After a series of filtering steps, 10 495 valid molecules that are not in the training data were obtained.

Figure 3

Results of SeqGAN without data augmentation: the molecular similarity heatmaps and the molecular property distributions of the original fentanyl and its analogues and of the 546 valid generated molecules. (A) Heatmap of the similarities among the original fentanyl and its analogues. (B) Heatmap of the similarities between the 546 valid molecules generated by the model and the original fentanyl and its analogues. Each row represents one original molecule, and each column represents one of the 546 valid generated molecules. The color indicates the degree of similarity: the darker the color, the higher the similarity. (C–H) Distributions of the properties (molecular weight, logP, SA, NP and proportion of sp³-hybridized carbon atoms in each molecule) of the original fentanyl and its analogues and of the generated molecules.

Validity, uniqueness and novelty analyses

We evaluate the generated molecules with three criteria commonly used for molecular generation models: validity, uniqueness and novelty. For the SeqGAN model, the results are as follows. When the generator is pretrained on fentanyl and its analogues, the validity of the generated molecules is 0.63 and the novelty is 0.6, but the uniqueness is low (0.1), which means that many of the generated molecules are repeated. When the generator is instead pretrained on other drug data, the uniqueness of the generated molecules is 0.82. We therefore attribute the high repetition observed when pretraining on fentanyl and its analogues to the inherent similarity of these molecules (Figure 3A) and to their concentrated distribution in high-dimensional space. When the model is trained on the augmented data, the validity of the generated molecules drops to 0.15, but the novelty (0.87) and uniqueness (0.59) increase; that is, training on augmented data yields more diverse molecules. For the MolGPT model, the validity, uniqueness and novelty obtained for each Murcko scaffold of fentanyl and its analogues are listed in Table 3. The mean validity, uniqueness and novelty over the scaffolds are 0.459, 0.274 and 0.971, respectively. The novelty of the scaffold-based molecules generated by MolGPT is thus higher than that of the molecules generated by SeqGAN. In other words, a larger share of the molecules generated by MolGPT are not in the original fentanyl and fentanyl analogue data.
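As a minimal sketch of how the three metrics can be computed, the example below assumes the common MOSES-style definitions: validity is the fraction of generated strings that parse, uniqueness is the fraction of valid strings that are distinct, and novelty is the fraction of distinct valid strings absent from the training set. These definitions are an assumption here, since this section does not restate them. The function `toy_canonicalize` is a hypothetical stand-in for a real SMILES parser and canonicalizer such as the one in RDKit.

```python
def generation_metrics(generated, training, canonicalize):
    """Return (validity, uniqueness, novelty) under MOSES-style definitions.

    `canonicalize(s)` must return a canonical string, or None if `s` is invalid.
    """
    canonical = [canonicalize(s) for s in generated]
    valid = [c for c in canonical if c is not None]
    unique = set(valid)
    known = {canonicalize(t) for t in training}
    novel = unique - known

    validity = len(valid) / len(generated) if generated else 0.0
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = len(novel) / len(unique) if unique else 0.0
    return validity, uniqueness, novelty


def toy_canonicalize(smiles):
    """Hypothetical validity check: accept only non-empty strings with
    balanced parentheses. A real pipeline would parse the molecule."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return None
    return smiles if smiles and depth == 0 else None


# Toy data with simple, unrelated molecules (ethanol, ethylamine, benzene).
generated = ["CCO", "CCO", "CC(", "CCN", "c1ccccc1"]
training = ["CCO"]
validity, uniqueness, novelty = generation_metrics(generated, training, toy_canonicalize)
print(validity, uniqueness, novelty)
```

With these definitions a model can score high on novelty and still low on uniqueness, which is the pattern reported above for SeqGAN pretrained on the small fentanyl dataset.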

Table 3

Validity, uniqueness and novelty of MolGPT generation for each Murcko scaffold of the original fentanyl and its analogues

Murcko scaffolds	Validity	Uniqueness	Novelty
O=C(C1CC1)N(c1ccccc1)C1CCN(CCc2ccccc2)CC1	0.388	0.279	0.973
O=C(C1CCCO1)N(c1ccccc1)C1CCN(CCc2ccccc2)CC1	0.291	0.377	0.976
O=C(CCc1ccccc1)N(c1ccccc1)C1CCN(CCc2ccccc2)CC1	0.504	0.22	0.972
O=C(c1ccc2c(c1)OCO2)N(c1ccccc1)C1CCN(CCc2ccccc2)CC1	0.204	0.344	0.965
O=C(c1ccccc1)N(c1ccccc1)C1CCN(CCc2ccccc2)CC1	0.512	0.226	0.974
O=C(c1ccco1)N(c1ccccc1)C1CCN(CCc2ccccc2)CC1	0.454	0.259	0.977
O=C1CCC2C3Cc4cccc5c4C2(CCN3)C1O5.c1ccc(CCN2CCC(Nc3ccccc3)CC2)cc1	0.032	0.58	0.926
O=c1[nH]nnn1CCN1CCC(Nc2ccccc2)CC1	0.233	0.313	0.963
c1ccc(CCN2CCC(Nc3ccccc3)CC2)cc1	0.686	0.182	0.973
c1ccc(CC[NH+]2CCC(Nc3ccccc3)CC2)cc1	0.689	0.184	0.975
c1ccc(CN2CCC(Nc3ccccc3)CC2)cc1	0.686	0.175	0.971
c1ccc(NC2CCN(CCc3cccs3)CC2)cc1	0.603	0.215	0.981
c1ccc(NC2CCNCC2)cc1	0.618	0.238	0.98
c1ccccc1	0.53	0.249	0.986
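The mean values quoted in the text can be checked directly from the rows of Table 3. The sketch below copies the three numeric columns in row order; the variable names are chosen here for illustration.

```python
from statistics import mean

# (validity, uniqueness, novelty) for the 14 scaffolds, in Table 3 row order.
table3 = [
    (0.388, 0.279, 0.973),
    (0.291, 0.377, 0.976),
    (0.504, 0.220, 0.972),
    (0.204, 0.344, 0.965),
    (0.512, 0.226, 0.974),
    (0.454, 0.259, 0.977),
    (0.032, 0.580, 0.926),
    (0.233, 0.313, 0.963),
    (0.686, 0.182, 0.973),
    (0.689, 0.184, 0.975),
    (0.686, 0.175, 0.971),
    (0.603, 0.215, 0.981),
    (0.618, 0.238, 0.980),
    (0.530, 0.249, 0.986),
]

# Column-wise means, rounded to three decimals as in the text.
mean_validity, mean_uniqueness, mean_novelty = (
    round(mean(column), 3) for column in zip(*table3)
)
print(mean_validity, mean_uniqueness, mean_novelty)
```

The rounded column means agree with the values 0.459, 0.274 and 0.971 reported in the text.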

Similarity analyses of generated molecules and original fentanyl and its analogues

We calculate the similarity between the original fentanyl data and the 546 valid molecules; the resulting heatmap is shown in Figure 3B. Most of the 546 valid molecules have a similarity greater than 0.5 to the original fentanyl and its analogues, which suggests that the deep generative model has learned the distribution of the original fentanyl analogue data in high-dimensional space. It is therefore reasonable to use deep generative models to generate new fentanyl analogues. We also calculate the similarity between the original fentanyl data and the 875 valid molecules generated by the model trained on augmented data; the result is shown in Supplementary Figure S1A in Supplementary 1. Comparing Figure 3B with Supplementary Figure S1A shows that the molecules generated by the model trained on augmented data are more diverse, with the majority of the 875 valid molecules having a similarity of < 0.8 to the original fentanyl data. This indicates that our model is likely to be able to expand the range of known fentanyl analogues and discover more potential fentanyl analogues. All similarities above are calculated from molecular fingerprints.
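The section states only that similarity is computed from molecular fingerprints. The sketch below assumes the Tanimoto coefficient, which is the usual choice for bit fingerprints, and represents each fingerprint as the set of its on-bit indices. The toy bit sets are invented for illustration and do not correspond to real molecules.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_matrix(reference, generated):
    """Rows are reference fingerprints, columns are generated fingerprints,
    matching the layout of the heatmaps described in the text."""
    return [[tanimoto(r, g) for g in generated] for r in reference]


def best_match_per_column(matrix):
    """Highest similarity of each generated molecule to any reference molecule."""
    return [max(column) for column in zip(*matrix)]


# Toy fingerprints (hypothetical on-bit indices).
reference = [{1, 2, 3, 4}, {2, 3, 5}]
generated = [{1, 2, 3}, {7, 8}, {2, 3, 5, 6}]

matrix = similarity_matrix(reference, generated)
best = best_match_per_column(matrix)
share_above_half = sum(s > 0.5 for s in best) / len(best)
print(best, share_above_half)
```

The per-column maximum is one way to express a statement such as "most generated molecules have a similarity greater than 0.5 to the original data"; other summaries (for example the column mean) would give different numbers.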

Properties analyses of generated molecules and original fentanyl and its analogues

We calculate several physical and chemical properties of the original fentanyl and its analogues and of the 546 newly generated valid molecules. Figure 3C–H compares the distributions of molecular weight (C), logP (D), SA score (E), NP score (F), QED score (G) and the proportion of sp³-hybridized carbon atoms in each molecule (Percent sp³ carbons; H) between the original fentanyl data and the generated molecules. These distributions show that the generated molecules have properties (molecular weight, logP, QED, Percent sp³ carbons) similar to those of the original fentanyl analogues. The distributions in Figure 3E and F look quite different, but the measured distances between the distributions of the original data and of the 546 valid generated molecules are not large: the KL divergence is 0.178 for NP and 0.317 for SA.

Supplementary Figure S1B–G in Supplementary 1 shows, respectively, the corresponding comparisons of molecular weight, logP, SA, NP, QED and proportion of sp³-hybridized carbon atoms in each molecule between the original fentanyl data and the 875 generated molecules. Comparing Table 4 with Supplementary Table S1 shows that the property distributions of the molecules generated by the model trained on augmented data are more diverse: the values of the three metrics between the distributions are larger than those obtained without data augmentation. Data augmentation therefore allows molecules with a wider range of properties to be generated. We also reduce the Morgan fingerprints to two dimensions with t-SNE. The original fentanyl and its analogues and the 546 generated molecules are plotted in Figure 4A; the 546 valid molecules overlap most of the original fentanyl and fentanyl analogue data.
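The three distances reported in Table 4 (KL divergence, JS divergence and Wasserstein distance) can be illustrated on histograms of a single property. This is a minimal sketch under stated assumptions: the binning, the smoothing constant `eps` and the histogram-based Wasserstein estimate are choices made here, because this section does not specify how the authors computed the values. The sample data are invented.

```python
import math


def histogram(values, lo, hi, bins):
    """Normalised bin frequencies of `values`, all assumed to lie in [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        index = min(int((v - lo) / width), bins - 1)  # put v == hi in the last bin
        counts[index] += 1
    return [c / len(values) for c in counts]


def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) with a small constant to avoid division by zero."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def wasserstein_1d(p, q, bin_width):
    """1-D Wasserstein distance between two histograms on the same bins,
    computed as the area between their cumulative distributions."""
    distance = cdf_p = cdf_q = 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        distance += abs(cdf_p - cdf_q) * bin_width
    return distance


# Invented property values scaled to [0, 1].
reference_values = [0.1, 0.1, 0.3, 0.5]
generated_values = [0.3, 0.5, 0.5, 0.7]
p = histogram(reference_values, 0.0, 1.0, 5)
q = histogram(generated_values, 0.0, 1.0, 5)

print(kl_divergence(p, q), js_divergence(p, q), wasserstein_1d(p, q, 0.2))
```

KL divergence is asymmetric and sensitive to empty bins, whereas the Wasserstein distance depends on the scale of the property, which is one reason the three rows of a table such as Table 4 are not directly comparable with each other.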

Table 4

Distances between the property distributions of the original fentanyl and its analogues and those of the 546 valid generated molecules

	Molecular weight	logP	SA	NP	QED	Percent sp³ carbons
KL	0.279	0.316	0.317	0.178	0.132	0.166
JS	0.181	0.147	0.273	0.198	0.156	0.129
Wasserstein distance	0.000143	0.010270	0.024356	0.098023	0.184813	0.193828

Figure 4

t-SNE plot of chemical molecular distribution.

Murcko scaffolds analyses of generated molecules and original fentanyl and its analogues

We extract the Murcko scaffolds [38] of the original fentanyl and its analogues, 14 in total, and the Murcko scaffolds of the 546 valid molecules, 143 in total. We then calculate the similarity between these two sets of scaffolds; the results are presented in Figure 5A. Most Murcko scaffolds extracted from the generated molecules have a similarity above 0.5 to at least one of the Murcko scaffolds extracted from the original fentanyl and its analogues. However, some scaffolds of the generated molecules are not very similar to any scaffold of the original fentanyl and its analogues. Figure 5B shows a selection of scaffolds. The Murcko scaffolds extracted from the generated molecules include scaffolds that also occur in the original fentanyl and its analogues, as well as scaffolds with high or low similarity to them.
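The three groups of generated scaffolds described above (identical to a reference scaffold, similar to one, or dissimilar to all) can be sketched as a simple bucketing of each scaffold's best similarity score. The 0.5 cut-off comes from the text; the function name, the bucket labels and the toy scores are assumptions made for illustration.

```python
import math


def bucket_scaffolds(best_similarity, threshold=0.5):
    """Count generated scaffolds by their highest similarity to any reference scaffold.

    shared:     similarity of 1.0 (the scaffold also occurs in the reference set)
    similar:    similarity above `threshold` but below 1.0
    dissimilar: similarity at or below `threshold`
    """
    buckets = {"shared": 0, "similar": 0, "dissimilar": 0}
    for score in best_similarity:
        if math.isclose(score, 1.0):
            buckets["shared"] += 1
        elif score > threshold:
            buckets["similar"] += 1
        else:
            buckets["dissimilar"] += 1
    return buckets


# Toy scores: one value per generated scaffold (hypothetical).
example = bucket_scaffolds([1.0, 0.8, 0.5, 0.2])
print(example)
```

The "dissimilar" bucket is the one of most interest for expanding a reference database, since those scaffolds are the least likely to be matched by existing entries.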

Figure 5

Analyses of Murcko scaffolds. (A) Similarity heatmap between the Murcko scaffolds of the original fentanyl and its analogues and the Murcko scaffolds of the generated molecules. Each row represents a scaffold extracted from the original fentanyl and fentanyl analogue data, and each column represents a scaffold extracted from the 546 valid generated molecules. The color indicates the degree of similarity: the darker the color, the higher the similarity. (B) A selection of Murcko scaffolds.

Properties analyses based on MolGPT model and fentanyl and its analogues’ Murcko scaffolds

We reduce the Morgan fingerprints to two dimensions with t-SNE. The original fentanyl and its analogues and the 10 495 generated molecules are plotted in Figure 4B; the 10 495 valid molecules overlap almost completely with the original fentanyl and fentanyl analogue data. A total of 14 scaffolds are extracted from the original fentanyl and its analogues, which are listed in Table 2. For each of the 14 scaffolds, 10 000 molecules are generated, giving 140 000 molecules in total, and 10 495 generated molecules remain after screening and removal of duplicates. Supplementary Figure S2A in Supplementary 1 shows the similarity heatmap between the 55 original fentanyl and fentanyl analogue structures and the 10 495 generated molecules. Most of the valid generated molecules have a similarity > 0.5 to the original fentanyl and its analogues. Supplementary Figure S2B–G in Supplementary 1 compares the distributions of molecular weight, logP, SA, NP, QED score and proportion of sp³-hybridized carbon atoms in each molecule between the 55 original structures and the 10 495 generated molecules. The figure shows that the scaffold-based molecules also reproduce the properties of the original fentanyl and its analogues.

The results and validation of the two methods

The two schemes yielded 546 and 10 495 valid molecules, respectively, and a total of 164 molecules appear in the results of both schemes. We attribute the differences between the two sets of results to the following: (i) the generator of the SeqGAN model is based on an RNN structure, whereas MolGPT is based on the attention mechanism, so MolGPT attends to more comprehensive information; (ii) the MolGPT model is pretrained on the dataset of the benchmark platform MOSES; (iii) the MolGPT model uses scaffold information from fentanyl and its analogues. Both schemes can generate molecules whose properties are similar to those of the original fentanyl and its analogues, and the resulting molecules also include potential fentanyl analogues that are not very similar to any of the 55 original structures. The molecules generated by the two schemes differ and have little intersection because the two models generate molecules in different ways and therefore have different focuses. These results show that both schemes help us explore potential fentanyl analogues. We selected 5 molecules (Figure 6A) from the 546 generated molecules and 5 molecules (Figure 6B) from the 10 495 generated molecules for NMR, mass spectrometry (MS) and infrared (IR) verification, and the results indicated that these generated molecules are all unreported fentanyl analogues. Figure 7 shows the NMR, MS and IR results of CCCC(=O)N(C1CCN(CC1)CCC2=CC=CC=C2)C3=CC=CC=C3F, selected from the 546 generated molecules; the NMR, MS and IR results of the other molecules are shown in Supplementary 2. The following analysis of these spectra shows that the molecule is a fentanyl analogue. ¹H and ¹³C NMR spectra were recorded on a JNM-ECZ400R/S1 (¹H, 400 MHz; ¹³C, 100 MHz) in DMSO-d₆. High-resolution mass spectra (ESI) were recorded on a Waters Vion IMS QTof. Infrared spectra were recorded on a Thermo Nicolet iZ 10.
C₂₃H₂₉FN₂O·HCl was obtained from Shanghai Yuansi Standard Science and Technology Co., Ltd. The ¹H-NMR spectrum (Figure 7A), ¹³C-NMR spectrum (Figure 7B), ESI mass spectrum (Figure 7C) and IR spectrum (Figure 7D) indicate that the molecular formula of the substance is C₂₃H₂₉FN₂O·HCl, which belongs to the fentanyl analogues. C₂₃H₂₉FN₂O·HCl conforms to the definition of a fentanyl analogue, with the following substitutions relative to fentanyl (N-(1-phenethylpiperidin-4-yl)-N-phenylpropionamide): (a) another acyl group is used instead of the propionyl group; (b) the phenyl group directly attached to the nitrogen atom is replaced by a substituted or unsubstituted monocyclic aromatic group; (c) alkyl groups are present on the piperidine ring. ¹H NMR (400 MHz, DMSO-d₆): δ 10.51 (s, 1H, HCl), 7.54 (tdd, J = 7.0, 5.1, 2.0 Hz, 1H, ArH), 7.49–7.28 (m, 5H, ArH), 7.24 (dq, J = 7.8, 2.9 Hz, 3H, ArH), 4.75 (tt, J = 12.1, 3.8 Hz, 1H, CH), 3.53 (t, J = 11.6 Hz, 2H, NCH₂), 3.16–3.10 (m, 4H, ArCH₂ and NCH₂), 3.03–2.94 (m, 2H, NCH₂), 1.99–1.61 (m, 6H, CH₂CO and NCHCH₂), 1.44 (qd, J = 7.3, 1.4 Hz, 2H, CH₂CH₃), 0.74 (t, J = 7.4 Hz, 3H, CH₃). ¹³C NMR (100 MHz, DMSO-d₆): δ 171.53, 159.84, 157.38, 137.10, 132.76, 131.24, 131.15, 128.65, 126.79, 125.69, 125.55, 125.46, 125.42, 116.83, 116.62, 56.37, 50.83, 50.79, 49.58, 35.62, 29.41, 27.33, 26.42, 17.90, 13.54. HRMS (ESI): m/z calcd for C₂₃H₂₉FN₂O·HCl [M + H]⁺ 369.23367, found 369.23419.
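The HRMS comparison at the end of the paragraph can be reproduced from the molecular formula. The sketch below computes the monoisotopic m/z of the protonated free base, C₂₃H₂₉FN₂O, and the mass error of the measured value in parts per million. The isotope masses are standard reference values rounded to eight decimals; the function names are introduced here for illustration.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307401,
    "O": 15.99491462,
    "F": 18.99840316,
}
PROTON = 1.00727647  # mass of H+ (hydrogen atom minus one electron)


def monoisotopic_mass(formula):
    """Monoisotopic mass of a neutral molecule from a formula such as 'C23H29FN2O'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


def mz_protonated(formula):
    """m/z of the singly charged [M + H]+ ion."""
    return monoisotopic_mass(formula) + PROTON


def ppm_error(found, calculated):
    """Relative mass error in parts per million."""
    return (found - calculated) / calculated * 1e6


calculated = mz_protonated("C23H29FN2O")
error = ppm_error(369.23419, calculated)
print(f"calcd {calculated:.5f}, error {error:+.2f} ppm")
```

The calculated value rounds to the 369.23367 quoted in the text, and the measured 369.23419 lies within about 1.5 ppm of it. The formula passed to the function omits HCl because the [M + H]⁺ ion is formed from the free base.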

Figure 6

(A) shows validated molecules selected from 546 valid molecules, and (B) shows validated molecules selected from 10 495 valid molecules.

Figure 7

NMR (A, B), MS (C) and IR (D) spectra of the verified molecule; (A) is the ¹H-NMR spectrum and (B) is the ¹³C-NMR spectrum.

Discussion and Conclusions

In this paper, we investigated how to use deep learning models to generate new fentanyl analogues and carried out a series of analyses of the properties of the generated molecules. We selected 5 molecules from the 546 valid molecules generated by SeqGAN and 5 molecules from the 10 495 valid molecules generated by MolGPT for NMR, MS and IR verification. All 10 selected molecules were verified to be new fentanyl analogues, which shows that deep learning models can generate new fentanyl analogues and expand the set of known ones. Because the two models generate molecules in different ways, learn different patterns and produce sets of molecules with few intersections, both schemes are useful for exploring potential fentanyl analogues. However, our research can still be improved. First, the data on fentanyl and its analogues are scarce and highly similar to one another, so the diversity of the generated data could be improved. Second, our generation is based on the SMILES format and does not take the spatial information of molecules into account; this information could be incorporated in future work.

Key Points

Identifying potential fentanyl analogues and broadening the existing drug database can prevent problems before they occur, improve the efficiency of forensic doctors and anti-narcotics police in detecting new fentanyl-type drug cases and make it difficult for criminals to escape the law.
This study is the first to apply a deep learning model to the generation of potential fentanyl analogues, which means that the generation of other drug molecules can draw on this method, providing new ideas for expanding drug databases.
After generation and filtering, we obtained 11 041 valid molecules. We selected 10 generated molecules from 11 041 valid molecules for NMR, MS and IR verification, and the results showed that these generated molecules were all unreported fentanyl analogues.

Authors’ contributions

Y.Z., Y.C. and C.P. conceived and designed the study; Y.Z., L.L., Q.Y.J., Z.T.L., Z.H.X., Y.S., C.L. and H.L.L. performed the experiments, analyzed the data, prepared figures and tables; Y.Y.C., Z.S.M., F.C., Y.C. and C.P. authored or reviewed drafts of the paper, and approved the final draft.

Supporting information

Owing to the sensitivity of the data and the potential for misuse, the original fentanyl and fentanyl analogue data are not available to the public for unrestricted download. A demo dataset of 100 SMILES strings for drug-like small molecules and the source code of this article are provided at https://github.com/xueyuanyuan0410/potential_fentanyl_generation/tree/main/code; the demo data can be used to try out how the code works.

Acknowledgements

We thank Shanghai Yuansi Standard Science and Technology Co., Ltd for help.

Funding

Fundamental Research Funds for the Central Universities [No. JCQY202108]; Startup Foundation for Advanced Talents at Nanjing Agricultural University [No. 050/804009]; the National Natural Science Foundation of China [21904068]; the Natural Science Foundation of Jiangsu Province [BK20201351]; Shanghai Key Lab of Forensic Science, Ministry of Justice, China (Academy of Forensic Science Open subject) [KF202006]; and Introduction of Talent Research Start Fund of Nanjing Medical University [KY101RC20190007].

Yuan Zhang is a Bioinformatics Master Student at Nanjing Agricultural University. Her research interest is in Computational Biology and Deep Learning.

Qiaoyan Jiang is a Postgraduate Student at Nanjing Medical University. Her research interest is in Forensic Medicine and SERS.

Ling Li is a Senior Engineer at Zhijiang Laboratory. Her research interest is Computer Vision.

Zutan Li is a Bioinformatics Doctoral Student at Nanjing Agricultural University. His research interest is in Epigenetics and Deep Learning.

Zhihui Xu is a Researcher at Simcere Diagnostics Co., Ltd. His main interest is in DNA/RNA Modifications and Corresponding Functions.

Yuanyuan Chen is an Associate Professor of College of Sciences at Nanjing Agricultural University. Her research area is Computational Biology.

Yang Sun is a Postgraduate Student at Nanjing Medical University. Her research interest is in Surface Enhanced Raman Scattering.

Cheng Liu is a Postgraduate Student in the Department of Forensic Medicine, College of Basic Medical Science at Nanjing Medical University. His research area is the Sudden Death from Cardiac Disease.

Zhengsheng Mao is a Lecturer of Forensic Science Department at Nanjing Medical University. His research area is Forensic Toxicology.

Feng Chen is a Professor of Nanjing Medical University. His research area is Forensic Medicine and Genetics.

Hualan Li is a Bioinformatics Master Student at Nanjing Agricultural University. Her research interest is in Computational Biology.

Yue Cao is an Associate Professor of Nanjing Medical University. Her research area is Forensic Toxicology, SERS and Electrochemistry.

Cong Pian is an Associate Professor of College of Sciences at Nanjing Agricultural University. His research area is Computational Biology.

References

1. Palmer RB. Fentanyl in postmortem forensic toxicology. Clin Toxicol 2010;48(8):771–84.
2. Cunningham SM, Haikal NA, Kraner JC. Fatal intoxication with acetyl fentanyl. J Forensic Sci 2016;61:S276–80.
3. Marinetti LJ, Ehlers BJ. A series of forensic toxicology and drug seizure cases involving illicit fentanyl alone and in combination with heroin, cocaine or heroin and cocaine. J Anal Toxicol 2014;38(8):592–8.
4. Weaver MF, Hopper JA, Gunderson EW. Designer drugs 2015: assessment and management. Addict Sci Clin Pract 2015;10(1):1–9.
5. David W. SMILES: A chemical language and information system. J Chem Inf Comput Sci 1988;28(1):31–36.
6. O'Boyle N, Dalke A. DeepSMILES: an adaptation of SMILES for use in machine-learning of chemical structures. Chemrxiv preprint, Chemrxiv. 2018. https://doi.org/10.26434/chemrxiv.7097960.v1.
7. Krenn M, Häse F, Nigam AK, et al. SELFIES: a robust representation of semantically constrained graphs with an example application in chemistry. arXiv preprint, arXiv: 1905.13741v1. 2019.
8. Kingma DP, Welling M. Auto-encoding Variational Bayes. arXiv preprint, arXiv: 1312.6114. 2014.
9. Blaschke T, Olivecrona M, Engkvist O, et al. Application of generative autoencoder in de novo molecular design. arXiv preprint, arXiv: 1711.07839. 2017.
10. Simonovsky M, Komodakis N. GraphVAE: towards generation of small graphs using variational autoencoders. arXiv preprint, arXiv: 1802.03480. 2018.
11. Gómez-Bombarelli R, Wei JN, Duvenaud D, et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Sci 2018;4(2):268–76.
12. Winter R, Montanari F, Noé F, et al. Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chem Sci 2019;10(6):1692–701.
13. Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets. arXiv preprint, arXiv: 1406.2661v1. 2014.
14. Lantao Y, Weinan Z, Jun W, et al. SeqGAN: sequence generative adversarial nets with policy gradient. arXiv preprint, arXiv: 1609.05473. 2016.
15. Guimaraes GL, Sanchez-Lengeling B, Outeiral C, et al. Objective-reinforced generative adversarial networks (ORGAN) for sequence generation models. arXiv preprint, arXiv: 1705.10843. 2017.
16. Prykhodko O, Johansson SV, Kotsias PC, et al. A de novo molecular generation method using latent vector based generative adversarial network. J Chem 2019;11(1):74.
17. Cao ND, Kipf T. MolGAN: an implicit generative model for small molecular graphs. arXiv preprint, arXiv: 1805.11973. 2018.
18. Segler MHS, Kogej T, Tyrchan C, et al. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Central Sci 2018;4(1):120–31.
19. Blaschke T, Arús-Pous J, Chen H, et al. Reinvent 2.0: an AI tool for de novo drug design. J Chem Inf Model 2020;60(12):5918–22.
20. Arús-Pous J, Patronov A, Bjerrum EJ, et al. SMILES-based deep generative scaffold decorator for de-novo drug design. J Chem 2020;12(1):38.
21. Lim J, Hwang SY, Moon S, et al. Scaffold-based molecular design with a graph generative model. Chem Sci 2019;11(4):1153–64.
22. Kaitoh K, Yamanishi Y. Scaffold-retained structure generator to exhaustively create molecules in an arbitrary chemical space. J Chem Inf Model 2022;62(9):2212–25.
23. Schwaller P, Laino T, Gaudin T, et al. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Central Sci 2019;5(9):1572–83.
24. Bagal V, Aggarwal R, Vinod PK, et al. MolGPT: molecular generation using a transformer-decoder model. J Chem Inf Model 2022;62(9):2064–76.
25. Polykovskiy D, Zhebrak A, Sanchez-Lengeling B, et al. Molecular sets (MOSES): a benchmarking platform for molecular generation models. Front Pharmacol 2020;11:1–10.
26. Brown N, Fiscato M, Segler MHS, et al. GuacaMol: benchmarking models for de novo molecular design. J Chem Inf Model 2019;59(3):1096–1108.
27. Arús-Pous J, Blaschke T, Ulander S, et al. Exploring the GDB-13 chemical space using deep generative models. J Chem 2019;12(1):20.
28. Skinnider MA, Wang F, Pasin D, et al. A deep generative model enables automated structure elucidation of novel psychoactive substances. Nat Mach Intell 2021;3(11):973–84.
29. Moret M, Friedrich L, Grisoni F, et al. Generative molecular design in low data regimes. Nat Mach Intell 2020;2(3):171–80.
30. Bjerrum EJ. SMILES enumeration as data augmentation for neural network modeling of molecules. arXiv preprint, arXiv: 1703.07076. 2017.
31. Arús-Pous J, Johansson SV, Prykhodko O, et al. Randomized SMILES strings improve the quality of molecular generative models. J Chem 2019;11(1):71.
32. Cai C, Wang S, Xu Y, et al. Transfer learning for drug discovery. J Med Chem 2020;63(16):8683–94.
33. Landrum G. RDKit: Open-source cheminformatics. 2020; http://www.rdkit.org (1 November 2021, date last accessed).
34. Wildman SA, Crippen GM. Prediction of physicochemical parameters by atomic contributions. J Chem Inf Comput Sci 1999;39(5):868–73.
35. Ertl P, Schuffenhauer A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J Chem 2009;1(1):8.
36. Ertl P, Roggo S, Schuffenhauer A. Natural product-likeness score and its application for prioritization of compound libraries. J Chem Inf Model 2008;48(1):68–74.
37. Bickerton GR, Paolini GV, Besnard J, et al. Quantifying the chemical beauty of drugs. Nat Chem 2012;4(2):90–98.
38. Bemis GW, Murcko MA. The properties of known drugs. 1. molecular frameworks. J Med Chem 1996;39(15):2887–93.

Author notes

Yuan Zhang, Qiaoyan Jiang, and Ling Li contributed equally to this work.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

The results and validation of the two methods

Discussion and Conclusions

Authors’ contributions

Supporting information

Acknowledgements

Funding

References

Author notes

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only