Abstract

Methodologies for evaluating similarities between the gene expression profiles of different perturbagens are key to understanding the mechanisms of action (MoAs) of unknown compounds and finding new indications for existing drugs. The L1000-based next-generation Connectivity Map (CMap) data is more than a thousand-fold scale-up of the CMap pilot dataset. Although several systematic evaluations have assessed the accuracy of these methodologies on the CMap pilot dataset, their performance needs to be re-evaluated for the L1000 data. Here, using the drug–drug similarities from the Drug Repurposing Hub database as a benchmark standard, we evaluated six popular published methods for their performance in predicting drug–drug relationships, based on the partial area under the receiver operating characteristic (ROC) curve at false positive rates of 0.001, 0.005 and 0.01 (AUC0.001, AUC0.005 and AUC0.01). The similarity evaluating algorithm called ZhangScore was generally superior to the other methods and exhibited the highest accuracy at gene signature sizes ranging from 10 to 200. Further, we tested these methods with an experimentally derived gene signature related to estrogen in breast cancer cells, and the results confirmed that ZhangScore was more accurate than the other methods. Moreover, based on the ZhangScore results for the gene signature of TOP2A knockdown, we identified, in addition to well-known TOP2A inhibitors, a number of potential inhibitors, at least two of which were the subject of previous investigation. Our study provides potential guidelines for researchers choosing a suitable connectivity method. The six connectivity methods used in this report have been implemented in an R package (https://github.com/Jasonlinchina/RCSM).

Introduction

The Library of Integrated Network-Based Cellular Signatures (LINCS) Program has released over 1.3 million transcriptomic profiles using the L1000 technology [1]. The L1000 data is more than a thousand-fold scale-up of the CMap pilot dataset and comprises more than 20 000 unique perturbagens in multiple human cell lines. This large catalogue of L1000 data provides enormous opportunities for understanding the mechanisms of action (MoAs) of unknown compounds and finding new indications for existing drugs.

Methodologies for evaluating similarities between the gene expression signatures of different perturbagens are critically important for applying an existing therapeutic to a new disease indication and for discovering potential MoAs of unknown compounds. A nonparametric, rank-based Kolmogorov–Smirnov (KS) statistic was used to connect disease gene expression signatures to drug expression profiles in the initial CMap paper [2]. Several CMap methodologies were modified from this initial method [3–5]. Another KS-like method, proposed by the Gene Set Enrichment Analysis (GSEA) group [6], has been used as the core of many drug repurposing methods. Iorio et al. [4] developed an automatic and robust approach to predict similarities in drug effects and MoAs based on this GSEA method. Subramanian et al. [1] also used the weighted KS enrichment statistic to compute similarities in their studies. In addition to KS-like methods, a number of alternatives have been proposed, including methods based on a signed-rank statistic (ZhangScore) [7], the eXtreme Sum score (XSum) [8] and many others [9–18].

Using the CMap pilot dataset, systematic evaluations have been performed to assess the accuracy of these methodologies. Iskar et al. [19] performed a quantitative evaluation of CMap methods for identifying compounds that have the same indications. Cheng et al. [20] used the Anatomical Therapeutic Chemical (ATC) classification as the benchmark to compare similarity metrics using two data processing methods, and later extended this work [21] by evaluating various CMap similarity metrics across different feature sizes. Cheng et al. [8] also evaluated CMap performance in predicting drug–disease relationships based on the partial area under the receiver operating characteristic (ROC) curve at false positive rates of 0.1 and 0.01 (FPR = 0.1 and 0.01). Early retrieval performance was measured because it is only practical for researchers to investigate a few top-ranked hypotheses. However, almost no systematic evaluation of these methodologies exists for the much larger, newly generated L1000 dataset.

The quality of the benchmark set is vital for quantitatively estimating the accuracy of these methods. Previous studies [4, 8, 19, 20] have used the ATC classification, which is based on the therapeutic and chemical properties of compounds, to define the true positives. In this study, the drug–drug relationships were compiled from a more comprehensive library of clinical compounds curated by the Drug Repurposing Hub database [22]. The compounds in this database were comprehensively annotated based on the FDA Orange Book, prescribing labels, ClinicalTrials.gov, PubMed and other Internet resources. Therefore, the drug–drug relationships based on this database are expected to be more accurate in terms of MoAs for calculating the AUC values.

Herein, we compiled the benchmark standard of drug–drug relationships from the Drug Repurposing Hub and evaluated six popular published methods for their performance in predicting drug–drug relationships by measuring the AUC0.001, AUC0.005 and AUC0.01 in nine core cell lines of the L1000 project. ZhangScore achieved a higher level of accuracy than the other methods at gene signature sizes ranging from 10 to 200. Given the diversity of chemical perturbations, genetic perturbations and cell types in the L1000 data, our study provides potential guidelines for researchers choosing a suitable connectivity method.

Methods

Data sources and compilation of true drug–drug relationships (benchmark standard)

We downloaded the level 5 data of L1000 (GCTx format) from the Gene Expression Omnibus (accession number: GSE92742), which contains 473 647 replicate-consensus signatures (RCSs) generated by the official data pre-processing pipeline. The level 5 data of L1000 have been normalized, and the LINCS team suggests their direct use without extra processing. Each RCS represents the moderated z-score value of 12 328 genes for one profile. The GCTx file was parsed by an R package [23], and all the names of RCSs related to treatments of small molecules for nine touchstone cell lines (A375, A549, HA1E, HCC515, HEPG2, HT29, MCF7, PC3 and VCAP) were obtained (the touchstone cell lines were defined by Subramanian et al. [1]).

A total of 6113 compounds with annotation information, including compound name, clinical phase, MoAs and protein targets, were downloaded from the Drug Repurposing Hub database (https://clue.io/repurposing, archived version: 5/16/2018). We selected the 4356 compounds that had both MoA and protein target annotations. After filtering out compounds that did not share MoAs or targets with any of the other 4356 compounds, 1919 compounds were retained for compilation of the benchmark set. Two compounds that share the same MoAs and protein targets are defined as a true positive compound pair; otherwise, they are defined as a true negative compound pair.
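The pair-labelling rule can be illustrated with a minimal Python sketch. The compound names and annotations below are hypothetical toy data, and the rule used here (at least one MoA in common and at least one protein target in common) is one possible reading of "share the same MoAs and protein targets", not necessarily the exact rule implemented in the study.

```python
from itertools import combinations

# Hypothetical toy annotations: compound -> (set of MoAs, set of protein targets).
annotations = {
    "drugA": ({"topoisomerase inhibitor"}, {"TOP2A"}),
    "drugB": ({"topoisomerase inhibitor"}, {"TOP2A", "TOP2B"}),
    "drugC": ({"HDAC inhibitor"}, {"HDAC1"}),
    "drugD": ({"HDAC inhibitor"}, {"TOP2A"}),
}


def is_true_positive(a, b, annotations):
    """Assumed rule: the pair shares at least one MoA AND at least one target."""
    moa_a, targets_a = annotations[a]
    moa_b, targets_b = annotations[b]
    return bool(moa_a & moa_b) and bool(targets_a & targets_b)


def label_pairs(annotations):
    """Label every unordered compound pair as positive (True) or negative (False)."""
    return {
        (a, b): is_true_positive(a, b, annotations)
        for a, b in combinations(sorted(annotations), 2)
    }


labels = label_pairs(annotations)
positives = [pair for pair, is_pos in labels.items() if is_pos]
```

In this toy set, drugC and drugD share an MoA but no target, so only the pair sharing both annotation types is labelled positive.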

By intersecting the names of the replicate-consensus signatures with the 1919 compounds from the Drug Repurposing Hub database, we obtained the positive drugs used to compile the true drug–drug relationships for each cell line. There were 493 (900 RCSs), 490 (1619 RCSs), 485 (1029 RCSs), 453 (910 RCSs), 341 (544 RCSs), 490 (841 RCSs), 591 (2721 RCSs), 581 (1754 RCSs) and 485 (1350 RCSs) compounds for the A375, A549, HA1E, HCC515, HEPG2, HT29, MCF7, PC3 and VCAP cell lines, respectively. The replicate-consensus signatures corresponding to these compounds for each cell line were extracted from the GCTx file.

Pairwise similarity evaluating algorithms

In this study, six methods (GSEAweight0 [6], GSEAweight1 [6], GSEAweight2 [6], KS [2], XSum [8] and ZhangScore [7]) were used to measure the similarities between drug pairs. Figure 1 shows the similarities and differences of these methods. The core algorithms of GSEAweight0, GSEAweight1, GSEAweight2 and KS are all derived from the KS-like statistic. ZhangScore assigns rank-based weights to all genes in a gene signature, whereas XSum focuses on the top genes ranked by fold change of gene expression. These algorithms are briefly described as follows:

Figure 1

A classification diagram showing the similarities and differences of the six connectivity methods.

GSEAweight0, GSEAweight1 and GSEAweight2: the GSEAPreranked algorithm of the GSEA package [6] contains three scoring schemes for calculating the weighted KS enrichment statistic (ES): p = 0, p = 1 and p = 2. Here, we calculated ES0 (p = 0), ES1 (p = 1) and ES2 (p = 2) for GSEAweight0, GSEAweight1 and GSEAweight2, respectively. We use the GSEAweight0 method as an example to show the calculation process.

Enrichment score (ES):

  • 1. Start with a ranked list of genes (L = {g1, g2, …, gN}) that are in (‘hits’) or not in (‘misses’) a gene set (S), using gene expression fold change (FC) as the metric.

  • 2. $P_{hit}(S,i)=\sum_{j\le i,\ g_j\in S}\frac{{|FC_{g_j}|}^{p}}{N_R}$, where $N_R=\sum_{g_j\in S}{|FC_{g_j}|}^{p}$; i = 1, 2, …, N; p = 0, 1 and 2 for GSEAweight0, GSEAweight1 and GSEAweight2, respectively.

  • 3. $P_{miss}(S,i)=\sum_{j\le i,\ g_j\notin S}\frac{1}{N-N_H}$; i = 1, 2, …, N, where N is the number of genes in L and $N_H$ is the number of genes in S.

  • 4. ES = the maximum deviation from zero of $P_{hit}-P_{miss}$.
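The four steps of the enrichment score can be sketched in Python as follows. This is a minimal illustration using NumPy, not the GSEA or RCSM implementation; the function name `enrichment_score` and its arguments are hypothetical, and the handling of degenerate inputs (no hits, no misses, or zero total weight) is an assumption.

```python
import numpy as np


def enrichment_score(ranked_genes, fold_changes, gene_set, p=0):
    """Weighted KS-like enrichment score of gene_set in a ranked gene list.

    ranked_genes: gene IDs sorted by fold change (descending).
    fold_changes: fold-change values in the same order as ranked_genes.
    p: weight exponent (0, 1 or 2 for GSEAweight0/1/2).
    """
    in_set = np.array([g in gene_set for g in ranked_genes])
    n_total, n_hits = len(ranked_genes), int(in_set.sum())
    if n_hits == 0 or n_hits == n_total:
        return 0.0  # assumption: no score without both hits and misses
    weights = np.abs(np.asarray(fold_changes, dtype=float)) ** p
    hit_weights = np.where(in_set, weights, 0.0)
    n_r = hit_weights.sum()
    if n_r == 0:
        return 0.0  # assumption: all hit genes have zero fold change
    p_hit = np.cumsum(hit_weights) / n_r             # running sum over hits
    p_miss = np.cumsum(~in_set) / (n_total - n_hits)  # running sum over misses
    deviation = p_hit - p_miss
    # Keep the sign of the largest absolute deviation from zero.
    return float(deviation[np.argmax(np.abs(deviation))])


genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
fc = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top = enrichment_score(genes, fc, {"g1", "g2"}, p=0)
es_bottom = enrichment_score(genes, fc, {"g5", "g6"}, p=0)
```

A gene set concentrated at the top of the ranked list gives a positive ES, and one concentrated at the bottom gives a negative ES.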

For drug pair A–B:

  • UpInDrugA = Top N up-regulated genes from replicate-consensus signature of drug A.

  • DownInDrugA = Top N down-regulated genes from replicate-consensus signature of drug A.

  • ES0up = the ES0 score between UpInDrugA and complete replicate-consensus signature of drug B.

  • ES0down = the ES0 score between DownInDrugA and complete replicate-consensus signature of drug B.

  • GSEAweight0(A-B) = ES0up − ES0down if ES0up and ES0down have different algebraic signs; otherwise, GSEAweight0(A-B) = 0.

For drug pair B–A, the GSEAweight0(B-A) could be calculated the same way.

The final similarity score for drug A and drug B: GSEAweight0(A&B) = (GSEAweight0(A-B) + GSEAweight0(B-A))/2.
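The combination of the up and down scores and the averaging over both directions can be written as a short Python sketch. The function names are hypothetical, and treating a score of exactly zero as "not a different sign" is an assumption, since the text does not cover that case.

```python
def directional_score(es_up, es_down):
    """Return es_up - es_down when the two scores have opposite signs, else 0.

    Assumption: if either score is exactly zero, the signs are not
    considered different and the result is 0.
    """
    if es_up * es_down < 0:
        return es_up - es_down
    return 0.0


def symmetric_score(score_ab, score_ba):
    """Average the A-B and B-A directional scores into one similarity score."""
    return (score_ab + score_ba) / 2.0


score_ab = directional_score(0.8, -0.6)   # opposite signs: 0.8 - (-0.6)
score_ba = directional_score(0.5, 0.2)    # same sign: 0
similarity = symmetric_score(score_ab, score_ba)
```

The same two-step combination applies to the KS-based score, with the KS statistic in place of the ES.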

ZhangScore: this was part of the method proposed by Zhang et al. [7]. We only considered the condition when gene signatures were unordered.

For drug pair A–B:

UpInDrugA and DownInDrugA are defined as above.

  • R = complete replicate-consensus signature of drug B.

  • s = UpInDrugA ∪ DownInDrugA.

ZhangScore(A-B) = $\sum_{i=1}^{m}R(g_i)\,s(g_i)\big/\sum_{i=1}^{m}(M-i+1)$, where $g_i$ represents the ith gene in s, $s(g_i)$ is 1 for up-regulated genes or −1 for down-regulated genes and $R(g_i)$ is this gene’s signed rank in R. m is the length of s, and M is the length of R.

For drug pair B–A, the ZhangScore(B-A) could be calculated the same way.

The final similarity score for drug A and drug B: ZhangScore(A&B) = (ZhangScore(A-B) + ZhangScore(B-A))/2.
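A minimal Python sketch of the directional ZhangScore for an unordered signature is given below. It assumes, following the original formulation of Zhang and Gant [7], that the signed rank orders the genes of R by absolute value (rank M for the largest absolute value, rank 1 for the smallest) and carries the sign of the value; the function name and toy data are hypothetical.

```python
import numpy as np


def zhang_score(up_genes, down_genes, ref_genes, ref_values):
    """Unordered-signature ZhangScore of a query signature against reference R."""
    values = np.asarray(ref_values, dtype=float)
    big_m = len(ref_genes)
    # Assumed ranking: rank 1 = smallest |value|, rank M = largest |value|.
    order = np.argsort(np.abs(values), kind="stable")
    abs_rank = np.empty(big_m, dtype=int)
    abs_rank[order] = np.arange(1, big_m + 1)
    signed_rank = dict(zip(ref_genes, np.sign(values) * abs_rank))
    # s(g) = +1 for up-regulated query genes, -1 for down-regulated ones.
    signature = {g: 1 for g in up_genes}
    signature.update({g: -1 for g in down_genes})
    m = len(signature)
    raw = sum(signed_rank[g] * s for g, s in signature.items() if g in signed_rank)
    max_raw = sum(big_m - i + 1 for i in range(1, m + 1))  # best possible score
    return float(raw / max_raw)


ref_genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
ref_values = [3.0, 1.5, 0.5, -0.2, -1.0, -2.5]
same = zhang_score(["g1"], ["g6"], ref_genes, ref_values)
opposite = zhang_score(["g6"], ["g1"], ref_genes, ref_values)
```

The normalization by the maximum attainable sum keeps the score between −1 (perfectly reversed) and 1 (perfectly matched).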

KS: KS(A-B) and KS(B-A) are calculated as described by Cheng et al. [8].

Kolmogorov–Smirnov (KS) statistic:

  • 1. Start with ranked list of genes (L = {g1, g2, …, gN}) and a gene set (S) with t genes.

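  • 2. Construct a vector V of the positions (1, 2, …, N) in L of the genes in S, sorted in ascending order such that V(j) is the position of gene j, where j = 1, 2, …, t. Compute the following two values, as defined in the original CMap method [2]:

    $$a=\max_{j=1}^{t}\left[\frac{j}{t}-\frac{V(j)}{N}\right],\qquad b=\max_{j=1}^{t}\left[\frac{V(j)}{N}-\frac{j-1}{t}\right].$$

  • 3. The KS statistic is then:

    $$KS=\begin{cases}a, & \mathrm{if}\ a>b,\\ -b, & \mathrm{if}\ b>a.\end{cases}$$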
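The KS statistic can be sketched in Python as below. The sketch assumes the definitions of a and b from the original CMap method [2], namely a = max over j of [j/t − V(j)/N] and b = max over j of [V(j)/N − (j − 1)/t]; the function name is hypothetical, and returning 0 when a equals b or when no gene of S is found in L is an assumption.

```python
def ks_statistic(ranked_genes, gene_set):
    """CMap-style KS statistic of gene_set within the ranked list ranked_genes."""
    n = len(ranked_genes)
    position = {g: i + 1 for i, g in enumerate(ranked_genes)}  # 1-based ranks
    v = sorted(position[g] for g in gene_set if g in position)
    t = len(v)
    if t == 0:
        return 0.0  # assumption: no overlap between S and L
    # j in the formulas is 1-based; the loop index here is 0-based.
    a = max((j + 1) / t - v[j] / n for j in range(t))
    b = max(v[j] / n - j / t for j in range(t))
    if a > b:
        return a
    if b > a:
        return -b
    return 0.0  # assumption: tie between a and b


genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
ks_top = ks_statistic(genes, {"g1", "g2"})
ks_bottom = ks_statistic(genes, {"g5", "g6"})
```

As with the ES, genes clustered near the top of L give a positive value and genes clustered near the bottom give a negative value.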

For drug pair A–B:

UpInDrugA and DownInDrugA are defined as above.

  • KSup = the KS score between UpInDrugA and complete replicate-consensus signature of drug B.

  • KSdown = the KS score between DownInDrugA and complete replicate-consensus signature of drug B.

  • KS(A-B) = KSup − KSdown if KSup and KSdown have different algebraic signs; otherwise, KS(A-B) = 0.

Figure 2

Workflow of connectivity method evaluation for L1000 data. MoAs: mechanisms of actions. RCS: replicate-consensus signature. AUC: area under the curve.

For drug pair B–A:

The KS(B-A) could be calculated the same way.

The final similarity score for drug A and drug B: KS(A&B) = (KS(A-B) + KS(B-A))/2.

XSum: XSum(A-B) and XSum(B-A) are also calculated as described by Cheng et al. [8].

For drug pair A–B:

UpInDrugA and DownInDrugA are defined as above.

  • ChangedByDrugB = Top N up-regulated and N down-regulated genes from replicate-consensus signature of drug B.

  • XUpInDrugA = UpInDrugA ∩ ChangedByDrugB.

  • XDownInDrugA = DownInDrugA ∩ ChangedByDrugB.

  • sum (XUpInDrugA) = sum of drug B gene expression fold change values in the set of XUpInDrugA.

  • sum (XDownInDrugA) = sum of drug B gene expression fold change values in the set of XDownInDrugA.

  • XSum(A-B) = sum (XUpInDrugA) − sum (XDownInDrugA).
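The directional XSum score can be sketched in Python as follows. The function name, argument names and toy profile are hypothetical; the profile of drug B is represented as a dictionary from gene to fold change, which is an illustrative choice rather than the data structure used in the study.

```python
def xsum(up_a, down_a, profile_b, n_top):
    """XSum(A-B), where profile_b maps each gene to drug B's fold change."""
    ranked = sorted(profile_b, key=profile_b.get, reverse=True)
    # Top n_top up-regulated and n_top down-regulated genes of drug B.
    changed_by_b = set(ranked[:n_top]) | set(ranked[-n_top:])
    x_up = set(up_a) & changed_by_b
    x_down = set(down_a) & changed_by_b
    return sum(profile_b[g] for g in x_up) - sum(profile_b[g] for g in x_down)


profile_b = {"g1": 3.0, "g2": 1.5, "g3": 0.5, "g4": -0.2, "g5": -1.0, "g6": -2.5}
score = xsum(["g1", "g3"], ["g6", "g4"], profile_b, n_top=2)
```

Here g3 and g4 are not among the extreme genes of drug B, so only g1 and g6 contribute to the score.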

Table 1

Statistical results of AUC0.01 (partial ROC curve at the FPR = 0.01). The highest AUC0.01 values for each cell line at each gene signature size are in bold. n: the number of genes in the gene signature. ROC curve: receiver operating characteristic curve. FPR: false positive rate.

| Scoring method | n | A375 | HA1E | HT29 | VCAP | HEPG2 | MCF7 | A549 | HCC515 | PC3 |
|---|---|---|---|---|---|---|---|---|---|---|
| GSEAweight0 | 10 | 4.46E-04 | 3.63E-04 | 4.34E-04 | 2.56E-04 | 3.84E-04 | 4.23E-04 | 7.80E-04 | 8.15E-04 | 4.78E-04 |
| GSEAweight0 | 40 | 4.50E-04 | 4.33E-04 | 4.32E-04 | 3.61E-04 | 5.01E-04 | 4.58E-04 | 9.05E-04 | 7.29E-04 | 5.19E-04 |
| GSEAweight0 | 100 | 4.48E-04 | 4.36E-04 | 4.45E-04 | 3.82E-04 | 5.10E-04 | 4.28E-04 | 8.80E-04 | 6.45E-04 | 5.15E-04 |
| GSEAweight0 | 200 | 4.23E-04 | 3.98E-04 | 4.14E-04 | 3.41E-04 | 4.25E-04 | 4.04E-04 | 7.41E-04 | 5.29E-04 | 4.89E-04 |
| GSEAweight0 | 500 | 3.56E-04 | 3.45E-04 | 3.86E-04 | 3.18E-04 | 3.62E-04 | 3.55E-04 | 5.90E-04 | 4.02E-04 | 4.40E-04 |
| GSEAweight0 | 1000 | 3.35E-04 | 3.16E-04 | 3.54E-04 | 2.83E-04 | 3.12E-04 | 3.05E-04 | 4.32E-04 | 3.34E-04 | 3.57E-04 |
| GSEAweight1 | 10 | 4.44E-04 | 3.87E-04 | 4.30E-04 | 2.26E-04 | 3.64E-04 | 3.85E-04 | 7.47E-04 | 9.89E-04 | 4.95E-04 |
| GSEAweight1 | 40 | 4.46E-04 | 4.91E-04 | 4.67E-04 | 3.23E-04 | 4.53E-04 | 4.21E-04 | 9.39E-04 | 9.36E-04 | 4.80E-04 |
| GSEAweight1 | 100 | 4.61E-04 | 4.62E-04 | 4.65E-04 | 3.85E-04 | 5.59E-04 | 4.42E-04 | 9.73E-04 | 8.84E-04 | 5.15E-04 |
| GSEAweight1 | 200 | 4.45E-04 | 4.46E-04 | 4.62E-04 | 4.01E-04 | 5.13E-04 | 4.30E-04 | 9.11E-04 | 8.15E-04 | 5.18E-04 |
| GSEAweight1 | 500 | 4.25E-04 | 4.22E-04 | 4.20E-04 | 3.57E-04 | 4.59E-04 | 4.18E-04 | 7.74E-04 | 6.36E-04 | 4.96E-04 |
| GSEAweight1 | 1000 | 3.79E-04 | 3.92E-04 | 3.93E-04 | 3.20E-04 | 4.11E-04 | 3.81E-04 | 6.47E-04 | 4.73E-04 | 4.56E-04 |
| GSEAweight2 | 10 | 3.10E-04 | 3.38E-04 | 3.59E-04 | 1.82E-04 | 2.06E-04 | 2.60E-04 | 5.98E-04 | 8.24E-04 | 3.77E-04 |
| GSEAweight2 | 40 | 4.19E-04 | 4.63E-04 | 4.00E-04 | 2.57E-04 | 3.01E-04 | 2.69E-04 | 7.79E-04 | 9.81E-04 | 2.97E-04 |
| GSEAweight2 | 100 | 4.37E-04 | 4.82E-04 | 4.15E-04 | 3.04E-04 | 4.38E-04 | 3.50E-04 | 9.39E-04 | 9.84E-04 | 3.61E-04 |
| GSEAweight2 | 200 | 4.40E-04 | 4.76E-04 | 4.54E-04 | 3.54E-04 | 4.67E-04 | 4.18E-04 | 9.67E-04 | 9.97E-04 | 4.40E-04 |
| GSEAweight2 | 500 | 4.57E-04 | 4.59E-04 | 4.64E-04 | 4.18E-04 | 5.01E-04 | 4.48E-04 | 9.63E-04 | 9.20E-04 | 5.14E-04 |
| GSEAweight2 | 1000 | 4.35E-04 | 4.44E-04 | 4.37E-04 | 3.59E-04 | 4.64E-04 | 4.26E-04 | 8.70E-04 | 7.70E-04 | 5.07E-04 |
| KS | 10 | 4.46E-04 | 3.63E-04 | 4.34E-04 | 2.57E-04 | 3.84E-04 | 4.23E-04 | 7.80E-04 | 8.15E-04 | 4.78E-04 |
| KS | 40 | 4.51E-04 | 4.34E-04 | 4.33E-04 | 3.61E-04 | 4.98E-04 | 4.58E-04 | 9.06E-04 | 7.30E-04 | 5.18E-04 |
| KS | 100 | 4.50E-04 | 4.36E-04 | 4.45E-04 | 3.83E-04 | 5.11E-04 | 4.28E-04 | 8.81E-04 | 6.44E-04 | 5.14E-04 |
| KS | 200 | 4.23E-04 | 3.98E-04 | 4.14E-04 | 3.40E-04 | 4.26E-04 | 4.05E-04 | 7.40E-04 | 5.29E-04 | 4.89E-04 |
| KS | 500 | 3.57E-04 | 3.47E-04 | 3.86E-04 | 3.19E-04 | 3.65E-04 | 3.55E-04 | 5.90E-04 | 4.02E-04 | 4.40E-04 |
| KS | 1000 | 3.34E-04 | 3.16E-04 | 3.54E-04 | 2.83E-04 | 3.11E-04 | 3.06E-04 | 4.31E-04 | 3.32E-04 | 3.55E-04 |
| XSum | 10 | 3.88E-04 | 2.84E-04 | 3.70E-04 | 2.41E-04 | 3.06E-04 | 2.62E-04 | 7.65E-04 | 8.75E-04 | 4.10E-04 |
| XSum | 40 | 4.60E-04 | 3.44E-04 | 4.01E-04 | 2.77E-04 | 3.71E-04 | 3.88E-04 | 8.32E-04 | 8.55E-04 | 3.98E-04 |
| XSum | 100 | 4.73E-04 | 3.58E-04 | 3.20E-04 | 2.81E-04 | 3.83E-04 | 3.73E-04 | 7.97E-04 | 5.60E-04 | 4.33E-04 |
| XSum | 200 | 4.27E-04 | 3.35E-04 | 2.96E-04 | 2.69E-04 | 3.69E-04 | 3.68E-04 | 7.22E-04 | 3.72E-04 | 4.20E-04 |
| XSum | 500 | 4.02E-04 | 3.22E-04 | 3.13E-04 | 2.41E-04 | 3.71E-04 | 3.52E-04 | 6.24E-04 | 3.38E-04 | 4.05E-04 |
| XSum | 1000 | 3.78E-04 | 3.06E-04 | 2.99E-04 | 2.29E-04 | 3.44E-04 | 3.43E-04 | 5.22E-04 | 3.08E-04 | 3.39E-04 |
| ZhangScore | 10 | 4.21E-04 | 4.06E-04 | 4.53E-04 | 2.90E-04 | 4.36E-04 | 3.45E-04 | 6.98E-04 | 7.92E-04 | 4.41E-04 |
| ZhangScore | 40 | 4.94E-04 | 4.29E-04 | 4.48E-04 | 3.67E-04 | 4.55E-04 | 4.38E-04 | 7.56E-04 | 8.09E-04 | 5.29E-04 |
| ZhangScore | 100 | 5.38E-04 | 4.61E-04 | 4.73E-04 | 3.80E-04 | 5.50E-04 | 4.55E-04 | 8.91E-04 | 9.74E-04 | 5.13E-04 |
| ZhangScore | 200 | 5.72E-04 | 5.33E-04 | 5.37E-04 | 4.61E-04 | 5.18E-04 | 4.54E-04 | 8.93E-04 | 9.66E-04 | 5.25E-04 |
| ZhangScore | 500 | 4.16E-04 | 4.06E-04 | 4.14E-04 | 3.53E-04 | 4.12E-04 | 3.88E-04 | 6.75E-04 | 5.97E-04 | 4.91E-04 |
| ZhangScore | 1000 | 3.54E-04 | 3.28E-04 | 3.76E-04 | 3.12E-04 | 3.26E-04 | 3.25E-04 | 5.08E-04 | 3.97E-04 | 3.97E-04 |
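To help read the magnitudes in Table 1, the following Python sketch computes a partial area under the ROC curve up to a given false positive rate. It assumes the reported values are raw (unnormalized) areas, which the text does not state explicitly, and it breaks tied scores arbitrarily; the function name and toy data are hypothetical. Under this assumption, a perfect ranking gives an area equal to the FPR cutoff (0.01 for AUC0.01), and a random ranking gives roughly half the squared cutoff (5 × 10⁻⁵ for AUC0.01).

```python
import numpy as np


def partial_auc(scores, labels, max_fpr):
    """Unnormalized area under the step ROC curve for FPR in [0, max_fpr]."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = np.asarray(labels, dtype=bool)[order]
    n_pos, n_neg = hits.sum(), (~hits).sum()
    # Each negative adds a horizontal ROC step of width 1/n_neg whose height
    # is the fraction of positives ranked above that negative.
    tpr_at_negative = np.cumsum(hits)[~hits] / n_pos
    left = np.arange(n_neg) / n_neg
    right = np.arange(1, n_neg + 1) / n_neg
    # Keep only the part of each step that lies left of the FPR cutoff.
    width = np.clip(np.minimum(right, max_fpr) - left, 0.0, None)
    return float(np.sum(tpr_at_negative * width))


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]   # similarity scores of drug pairs
labels = [1, 0, 1, 0, 0, 0]               # 1 = true positive pair
full_auc = partial_auc(scores, labels, max_fpr=1.0)
early_auc = partial_auc(scores, labels, max_fpr=0.5)
```

Restricting the area to a small FPR rewards methods that place true drug–drug pairs at the very top of the ranking, which matches the early-retrieval motivation of the benchmark.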
Table 2

Statistical results of AUC0.005 (partial ROC curve at the FPR = 0.005). The highest AUC0.005 values for each cell line at each gene signature size are in bold. n: the number of genes in the gene signature. ROC curve: receiver operating characteristic curve. FPR: false positive rate.

| Scoring method | n | A375 | HA1E | HT29 | VCAP | HEPG2 | MCF7 | A549 | HCC515 | PC3 |
|---|---|---|---|---|---|---|---|---|---|---|
| GSEAweight0 | 10 | 1.66E-04 | 1.48E-04 | 1.74E-04 | 8.50E-05 | 1.60E-04 | 1.72E-04 | 2.83E-04 | 2.73E-04 | 2.08E-04 |
| GSEAweight0 | 40 | 1.76E-04 | 1.71E-04 | 1.71E-04 | 1.20E-04 | 1.95E-04 | 1.75E-04 | 3.39E-04 | 2.27E-04 | 2.13E-04 |
| GSEAweight0 | 100 | 1.80E-04 | 1.65E-04 | 1.76E-04 | 1.33E-04 | 1.99E-04 | 1.73E-04 | 3.36E-04 | 2.03E-04 | 2.08E-04 |
| GSEAweight0 | 200 | 1.63E-04 | 1.59E-04 | 1.69E-04 | 1.25E-04 | 1.62E-04 | 1.62E-04 | 2.84E-04 | 1.74E-04 | 2.01E-04 |
| GSEAweight0 | 500 | 1.42E-04 | 1.40E-04 | 1.58E-04 | 1.20E-04 | 1.39E-04 | 1.37E-04 | 2.13E-04 | 1.39E-04 | 1.67E-04 |
| GSEAweight0 | 1000 | 1.32E-04 | 1.31E-04 | 1.40E-04 | 1.02E-04 | 1.17E-04 | 1.15E-04 | 1.44E-04 | 1.10E-04 | 1.28E-04 |
| GSEAweight1 | 10 | 1.59E-04 | 1.57E-04 | 1.71E-04 | 6.70E-05 | 1.30E-04 | 1.49E-04 | 2.79E-04 | 3.92E-04 | 1.94E-04 |
| GSEAweight1 | 40 | 1.55E-04 | 1.80E-04 | 1.78E-04 | 9.60E-05 | 1.67E-04 | 1.51E-04 | 3.73E-04 | 3.45E-04 | 1.86E-04 |
| GSEAweight1 | 100 | 1.64E-04 | 1.69E-04 | 1.76E-04 | 1.20E-04 | 2.05E-04 | 1.72E-04 | 3.85E-04 | 3.00E-04 | 2.03E-04 |
| GSEAweight1 | 200 | 1.70E-04 | 1.65E-04 | 1.76E-04 | 1.33E-04 | 1.93E-04 | 1.71E-04 | 3.59E-04 | 2.63E-04 | 2.12E-04 |
| GSEAweight1 | 500 | 1.68E-04 | 1.62E-04 | 1.73E-04 | 1.23E-04 | 1.79E-04 | 1.66E-04 | 3.01E-04 | 1.99E-04 | 2.08E-04 |
| GSEAweight1 | 1000 | 1.48E-04 | 1.51E-04 | 1.64E-04 | 1.22E-04 | 1.59E-04 | 1.48E-04 | 2.55E-04 | 1.65E-04 | 1.85E-04 |
| GSEAweight2 | 10 | 1.18E-04 | 1.28E-04 | 1.35E-04 | 6.00E-05 | 6.30E-05 | 8.80E-05 | 2.23E-04 | 3.16E-04 | 1.20E-04 |
| GSEAweight2 | 40 | 1.49E-04 | 1.87E-04 | 1.52E-04 | 7.80E-05 | 1.03E-04 | 8.00E-05 | 2.99E-04 | 3.88E-04 | 8.40E-05 |
| GSEAweight2 | 100 | 1.50E-04 | 1.79E-04 | 1.52E-04 | 9.10E-05 | 1.66E-04 | 1.30E-04 | 3.69E-04 | 3.77E-04 | 1.14E-04 |
| GSEAweight2 | 200 | 1.51E-04 | 1.70E-04 | 1.68E-04 | 1.04E-04 | 1.75E-04 | 1.54E-04 | 3.84E-04 | 3.71E-04 | 1.69E-04 |
| GSEAweight2 | 500 | 1.74E-04 | 1.67E-04 | 1.86E-04 | 1.25E-04 | 1.88E-04 | 1.74E-04 | 3.72E-04 | 3.35E-04 | 2.13E-04 |
| GSEAweight2 | 1000 | 1.74E-04 | 1.65E-04 | 1.79E-04 | 1.27E-04 | 1.86E-04 | 1.70E-04 | 3.39E-04 | 2.60E-04 | 2.14E-04 |
| KS | 10 | 1.65E-04 | 1.48E-04 | 1.74E-04 | 8.50E-05 | 1.60E-04 | 1.72E-04 | 2.83E-04 | 2.74E-04 | 2.08E-04 |
| KS | 40 | 1.76E-04 | 1.71E-04 | 1.71E-04 | 1.20E-04 | 1.93E-04 | 1.76E-04 | 3.39E-04 | 2.29E-04 | 2.13E-04 |
| KS | 100 | 1.80E-04 | 1.65E-04 | 1.76E-04 | 1.33E-04 | 1.99E-04 | 1.73E-04 | 3.36E-04 | 2.02E-04 | 2.08E-04 |
| KS | 200 | 1.63E-04 | 1.60E-04 | 1.69E-04 | 1.24E-04 | 1.63E-04 | 1.62E-04 | 2.84E-04 | 1.73E-04 | 2.01E-04 |
| KS | 500 | 1.42E-04 | 1.40E-04 | 1.59E-04 | 1.19E-04 | 1.40E-04 | 1.37E-04 | 2.13E-04 | 1.39E-04 | 1.67E-04 |
| KS | 1000 | 1.32E-04 | 1.30E-04 | 1.40E-04 | 1.03E-04 | 1.17E-04 | 1.15E-04 | 1.45E-04 | 1.10E-04 | 1.27E-04 |
| XSum | 10 | 1.51E-04 | 1.18E-04 | 1.41E-04 | 8.60E-05 | 1.09E-04 | 9.80E-05 | 2.90E-04 | 3.71E-04 | 1.64E-04 |
| XSum | 40 | 1.78E-04 | 1.42E-04 | 1.58E-04 | 9.40E-05 | 1.42E-04 | 1.46E-04 | 3.16E-04 | 3.05E-04 | 1.63E-04 |
| XSum | 100 | 1.76E-04 | 1.46E-04 | 1.32E-04 | 9.70E-05 | 1.52E-04 | 1.46E-04 | 2.76E-04 | 1.60E-04 | 1.54E-04 |
| XSum | 200 | 1.53E-04 | 1.37E-04 | 1.20E-04 | 9.90E-05 | 1.45E-04 | 1.32E-04 | 2.41E-04 | 1.04E-04 | 1.47E-04 |
| XSum | 500 | 1.40E-04 | 1.33E-04 | 1.24E-04 | 9.90E-05 | 1.40E-04 | 1.31E-04 | 1.98E-04 | 1.01E-04 | 1.38E-04 |
| XSum | 1000 | 1.30E-04 | 1.31E-04 | 1.16E-04 | 9.80E-05 | 1.31E-04 | 1.30E-04 | 1.66E-04 | 9.30E-05 | 1.10E-04 |
| ZhangScore | 10 | 1.58E-04 | 1.59E-04 | 1.88E-04 | 1.04E-04 | 1.76E-04 | 1.38E-04 | 2.64E-04 | 2.87E-04 | 1.86E-04 |
| ZhangScore | 40 | 1.94E-04 | 1.72E-04 | 1.91E-04 | 1.25E-04 | 1.72E-04 | 1.74E-04 | 2.57E-04 | 2.87E-04 | 2.22E-04 |
| ZhangScore | 100 | 2.11E-04 | 1.85E-04 | 1.86E-04 | 1.30E-04 | 2.09E-04 | 1.80E-04 | 3.28E-04 | 3.50E-04 | 2.24E-04 |
| ZhangScore | 200 | 2.13E-04 | 2.11E-04 | 2.18E-04 | 1.65E-04 | 1.97E-04 | 1.78E-04 | 3.36E-04 | 3.52E-04 | 2.19E-04 |
| ZhangScore | 500 | 1.67E-04 | 1.57E-04 | 1.72E-04 | 1.29E-04 | 1.64E-04 | 1.57E-04 | 2.67E-04 | 1.95E-04 | 2.07E-04 |
| ZhangScore | 1000 | 1.39E-04 | 1.34E-04 | 1.59E-04 | 1.17E-04 | 1.30E-04 | 1.26E-04 | 1.72E-04 | 1.39E-04 | 1.50E-04 |
Scoring methodnCell line
A375HA1EHT29VCAPHEPG2MCF7A549HCC515PC3
GSEAweight0
101.66E-041.48E-041.74E-048.50E-051.60E-041.72E-042.83E-042.73E-042.08E-04
401.76E-041.71E-041.71E-041.20E-041.95E-041.75E-043.39E-042.27E-042.13E-04
1001.80E-041.65E-041.76E-041.33E-041.99E-041.73E-043.36E-042.03E-042.08E-04
2001.63E-041.59E-041.69E-041.25E-041.62E-041.62E-042.84E-041.74E-042.01E-04
5001.42E-041.40E-041.58E-041.20E-041.39E-041.37E-042.13E-041.39E-041.67E-04
10001.32E-041.31E-041.40E-041.02E-041.17E-041.15E-041.44E-041.10E-041.28E-04
GSEAweight1
101.59E-041.57E-041.71E-046.70E-051.30E-041.49E-042.79E-043.92E-041.94E-04
401.55E-041.80E-041.78E-049.60E-051.67E-041.51E-043.73E-043.45E-041.86E-04
1001.64E-041.69E-041.76E-041.20E-042.05E-041.72E-043.85E-043.00E-042.03E-04
2001.70E-041.65E-041.76E-041.33E-041.93E-041.71E-043.59E-042.63E-042.12E-04
5001.68E-041.62E-041.73E-041.23E-041.79E-041.66E-043.01E-041.99E-042.08E-04
10001.48E-041.51E-041.64E-041.22E-041.59E-041.48E-042.55E-041.65E-041.85E-04
GSEAweight2
101.18E-041.28E-041.35E-046.00E-056.30E-058.80E-052.23E-043.16E-041.20E-04
401.49E-041.87E-041.52E-047.80E-051.03E-048.00E-052.99E-043.88E-048.40E-05
1001.50E-041.79E-041.52E-049.10E-051.66E-041.30E-043.69E-043.77E-041.14E-04
2001.51E-041.70E-041.68E-041.04E-041.75E-041.54E-043.84E-043.71E-041.69E-04
5001.74E-041.67E-041.86E-041.25E-041.88E-041.74E-043.72E-043.35E-042.13E-04
10001.74E-041.65E-041.79E-041.27E-041.86E-041.70E-043.39E-042.60E-042.14E-04
KS
101.65E-041.48E-041.74E-048.50E-051.60E-041.72E-042.83E-042.74E-042.08E-04
401.76E-041.71E-041.71E-041.20E-041.93E-041.76E-043.39E-042.29E-042.13E-04
1001.80E-041.65E-041.76E-041.33E-041.99E-041.73E-043.36E-042.02E-042.08E-04
2001.63E-041.60E-041.69E-041.24E-041.63E-041.62E-042.84E-041.73E-042.01E-04
5001.42E-041.40E-041.59E-041.19E-041.40E-041.37E-042.13E-041.39E-041.67E-04
10001.32E-041.30E-041.40E-041.03E-041.17E-041.15E-041.45E-041.10E-041.27E-04
XSum
101.51E-041.18E-041.41E-048.60E-051.09E-049.80E-052.90E-043.71E-041.64E-04
401.78E-041.42E-041.58E-049.40E-051.42E-041.46E-043.16E-043.05E-041.63E-04
1001.76E-041.46E-041.32E-049.70E-051.52E-041.46E-042.76E-041.60E-041.54E-04
2001.53E-041.37E-041.20E-049.90E-051.45E-041.32E-042.41E-041.04E-041.47E-04
5001.40E-041.33E-041.24E-049.90E-051.40E-041.31E-041.98E-041.01E-041.38E-04
10001.30E-041.31E-041.16E-049.80E-051.31E-041.30E-041.66E-049.30E-051.10E-04
ZhangScore
101.58E-041.59E-041.88E-041.04E-041.76E-041.38E-042.64E-042.87E-041.86E-04
401.94E-041.72E-041.91E-041.25E-041.72E-041.74E-042.57E-042.87E-042.22E-04
1002.11E-041.85E-041.86E-041.30E-042.09E-041.80E-043.28E-043.50E-042.24E-04
2002.13E-042.11E-042.18E-041.65E-041.97E-041.78E-043.36E-043.52E-042.19E-04
5001.67E-041.57E-041.72E-041.29E-041.64E-041.57E-042.67E-041.95E-042.07E-04
10001.39E-041.34E-041.59E-041.17E-041.30E-041.26E-041.72E-041.39E-041.50E-04
Table 2

Statistical results of AUC0.005 (partial ROC curve at the FPR = 0.005). The highest AUC0.005 values for each cell line at each gene signature size are in bold. n: the number of genes in the gene signature. ROC curve: receiver operating characteristic curve. FPR: false positive rate

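| Scoring method | n | A375 | HA1E | HT29 | VCAP | HEPG2 | MCF7 | A549 | HCC515 | PC3 |
|---|---|---|---|---|---|---|---|---|---|---|
| GSEAweight0 | 10 | 1.66E-04 | 1.48E-04 | 1.74E-04 | 8.50E-05 | 1.60E-04 | 1.72E-04 | 2.83E-04 | 2.73E-04 | 2.08E-04 |
| GSEAweight0 | 40 | 1.76E-04 | 1.71E-04 | 1.71E-04 | 1.20E-04 | 1.95E-04 | 1.75E-04 | 3.39E-04 | 2.27E-04 | 2.13E-04 |
| GSEAweight0 | 100 | 1.80E-04 | 1.65E-04 | 1.76E-04 | 1.33E-04 | 1.99E-04 | 1.73E-04 | 3.36E-04 | 2.03E-04 | 2.08E-04 |
| GSEAweight0 | 200 | 1.63E-04 | 1.59E-04 | 1.69E-04 | 1.25E-04 | 1.62E-04 | 1.62E-04 | 2.84E-04 | 1.74E-04 | 2.01E-04 |
| GSEAweight0 | 500 | 1.42E-04 | 1.40E-04 | 1.58E-04 | 1.20E-04 | 1.39E-04 | 1.37E-04 | 2.13E-04 | 1.39E-04 | 1.67E-04 |
| GSEAweight0 | 1000 | 1.32E-04 | 1.31E-04 | 1.40E-04 | 1.02E-04 | 1.17E-04 | 1.15E-04 | 1.44E-04 | 1.10E-04 | 1.28E-04 |
| GSEAweight1 | 10 | 1.59E-04 | 1.57E-04 | 1.71E-04 | 6.70E-05 | 1.30E-04 | 1.49E-04 | 2.79E-04 | 3.92E-04 | 1.94E-04 |
| GSEAweight1 | 40 | 1.55E-04 | 1.80E-04 | 1.78E-04 | 9.60E-05 | 1.67E-04 | 1.51E-04 | 3.73E-04 | 3.45E-04 | 1.86E-04 |
| GSEAweight1 | 100 | 1.64E-04 | 1.69E-04 | 1.76E-04 | 1.20E-04 | 2.05E-04 | 1.72E-04 | 3.85E-04 | 3.00E-04 | 2.03E-04 |
| GSEAweight1 | 200 | 1.70E-04 | 1.65E-04 | 1.76E-04 | 1.33E-04 | 1.93E-04 | 1.71E-04 | 3.59E-04 | 2.63E-04 | 2.12E-04 |
| GSEAweight1 | 500 | 1.68E-04 | 1.62E-04 | 1.73E-04 | 1.23E-04 | 1.79E-04 | 1.66E-04 | 3.01E-04 | 1.99E-04 | 2.08E-04 |
| GSEAweight1 | 1000 | 1.48E-04 | 1.51E-04 | 1.64E-04 | 1.22E-04 | 1.59E-04 | 1.48E-04 | 2.55E-04 | 1.65E-04 | 1.85E-04 |
| GSEAweight2 | 10 | 1.18E-04 | 1.28E-04 | 1.35E-04 | 6.00E-05 | 6.30E-05 | 8.80E-05 | 2.23E-04 | 3.16E-04 | 1.20E-04 |
| GSEAweight2 | 40 | 1.49E-04 | 1.87E-04 | 1.52E-04 | 7.80E-05 | 1.03E-04 | 8.00E-05 | 2.99E-04 | 3.88E-04 | 8.40E-05 |
| GSEAweight2 | 100 | 1.50E-04 | 1.79E-04 | 1.52E-04 | 9.10E-05 | 1.66E-04 | 1.30E-04 | 3.69E-04 | 3.77E-04 | 1.14E-04 |
| GSEAweight2 | 200 | 1.51E-04 | 1.70E-04 | 1.68E-04 | 1.04E-04 | 1.75E-04 | 1.54E-04 | 3.84E-04 | 3.71E-04 | 1.69E-04 |
| GSEAweight2 | 500 | 1.74E-04 | 1.67E-04 | 1.86E-04 | 1.25E-04 | 1.88E-04 | 1.74E-04 | 3.72E-04 | 3.35E-04 | 2.13E-04 |
| GSEAweight2 | 1000 | 1.74E-04 | 1.65E-04 | 1.79E-04 | 1.27E-04 | 1.86E-04 | 1.70E-04 | 3.39E-04 | 2.60E-04 | 2.14E-04 |
| KS | 10 | 1.65E-04 | 1.48E-04 | 1.74E-04 | 8.50E-05 | 1.60E-04 | 1.72E-04 | 2.83E-04 | 2.74E-04 | 2.08E-04 |
| KS | 40 | 1.76E-04 | 1.71E-04 | 1.71E-04 | 1.20E-04 | 1.93E-04 | 1.76E-04 | 3.39E-04 | 2.29E-04 | 2.13E-04 |
| KS | 100 | 1.80E-04 | 1.65E-04 | 1.76E-04 | 1.33E-04 | 1.99E-04 | 1.73E-04 | 3.36E-04 | 2.02E-04 | 2.08E-04 |
| KS | 200 | 1.63E-04 | 1.60E-04 | 1.69E-04 | 1.24E-04 | 1.63E-04 | 1.62E-04 | 2.84E-04 | 1.73E-04 | 2.01E-04 |
| KS | 500 | 1.42E-04 | 1.40E-04 | 1.59E-04 | 1.19E-04 | 1.40E-04 | 1.37E-04 | 2.13E-04 | 1.39E-04 | 1.67E-04 |
| KS | 1000 | 1.32E-04 | 1.30E-04 | 1.40E-04 | 1.03E-04 | 1.17E-04 | 1.15E-04 | 1.45E-04 | 1.10E-04 | 1.27E-04 |
| XSum | 10 | 1.51E-04 | 1.18E-04 | 1.41E-04 | 8.60E-05 | 1.09E-04 | 9.80E-05 | 2.90E-04 | 3.71E-04 | 1.64E-04 |
| XSum | 40 | 1.78E-04 | 1.42E-04 | 1.58E-04 | 9.40E-05 | 1.42E-04 | 1.46E-04 | 3.16E-04 | 3.05E-04 | 1.63E-04 |
| XSum | 100 | 1.76E-04 | 1.46E-04 | 1.32E-04 | 9.70E-05 | 1.52E-04 | 1.46E-04 | 2.76E-04 | 1.60E-04 | 1.54E-04 |
| XSum | 200 | 1.53E-04 | 1.37E-04 | 1.20E-04 | 9.90E-05 | 1.45E-04 | 1.32E-04 | 2.41E-04 | 1.04E-04 | 1.47E-04 |
| XSum | 500 | 1.40E-04 | 1.33E-04 | 1.24E-04 | 9.90E-05 | 1.40E-04 | 1.31E-04 | 1.98E-04 | 1.01E-04 | 1.38E-04 |
| XSum | 1000 | 1.30E-04 | 1.31E-04 | 1.16E-04 | 9.80E-05 | 1.31E-04 | 1.30E-04 | 1.66E-04 | 9.30E-05 | 1.10E-04 |
| ZhangScore | 10 | 1.58E-04 | 1.59E-04 | 1.88E-04 | 1.04E-04 | 1.76E-04 | 1.38E-04 | 2.64E-04 | 2.87E-04 | 1.86E-04 |
| ZhangScore | 40 | 1.94E-04 | 1.72E-04 | 1.91E-04 | 1.25E-04 | 1.72E-04 | 1.74E-04 | 2.57E-04 | 2.87E-04 | 2.22E-04 |
| ZhangScore | 100 | 2.11E-04 | 1.85E-04 | 1.86E-04 | 1.30E-04 | 2.09E-04 | 1.80E-04 | 3.28E-04 | 3.50E-04 | 2.24E-04 |
| ZhangScore | 200 | 2.13E-04 | 2.11E-04 | 2.18E-04 | 1.65E-04 | 1.97E-04 | 1.78E-04 | 3.36E-04 | 3.52E-04 | 2.19E-04 |
| ZhangScore | 500 | 1.67E-04 | 1.57E-04 | 1.72E-04 | 1.29E-04 | 1.64E-04 | 1.57E-04 | 2.67E-04 | 1.95E-04 | 2.07E-04 |
| ZhangScore | 1000 | 1.39E-04 | 1.34E-04 | 1.59E-04 | 1.17E-04 | 1.30E-04 | 1.26E-04 | 1.72E-04 | 1.39E-04 | 1.50E-04 |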
Table 3

Statistical results of AUC0.001 (partial ROC curve at the FPR = 0.001). The highest AUC0.001 values for each cell line at each gene signature size are in bold. n: the number of genes in the gene signature. ROC curve: receiver operating characteristic curve. FPR: false positive rate

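| Scoring method | n | A375 | HA1E | HT29 | VCAP | HEPG2 | MCF7 | A549 | HCC515 | PC3 |
|---|---|---|---|---|---|---|---|---|---|---|
| GSEAweight0 | 10 | 1.59E-05 | 1.84E-05 | 1.88E-05 | 8.50E-06 | 1.95E-05 | 1.94E-05 | 1.80E-05 | 1.44E-05 | 2.61E-05 |
| GSEAweight0 | 40 | 2.15E-05 | 1.93E-05 | 2.10E-05 | 7.50E-06 | 2.21E-05 | 1.96E-05 | 1.94E-05 | 9.80E-06 | 2.63E-05 |
| GSEAweight0 | 100 | 2.12E-05 | 2.06E-05 | 2.18E-05 | 8.49E-06 | 2.13E-05 | 1.94E-05 | 2.38E-05 | 1.11E-05 | 2.30E-05 |
| GSEAweight0 | 200 | 2.05E-05 | 2.06E-05 | 2.19E-05 | 8.53E-06 | 1.91E-05 | 1.90E-05 | 2.39E-05 | 9.67E-06 | 2.13E-05 |
| GSEAweight0 | 500 | 1.79E-05 | 2.09E-05 | 2.00E-05 | 1.02E-05 | 1.77E-05 | 1.79E-05 | 1.97E-05 | 8.20E-06 | 1.65E-05 |
| GSEAweight0 | 1000 | 1.49E-05 | 1.97E-05 | 1.79E-05 | 1.09E-05 | 1.41E-05 | 1.63E-05 | 1.54E-05 | 6.16E-06 | 1.35E-05 |
| GSEAweight1 | 10 | 1.70E-05 | 1.89E-05 | 1.77E-05 | 7.41E-06 | 1.15E-05 | 1.65E-05 | 2.00E-05 | 2.69E-05 | 2.27E-05 |
| GSEAweight1 | 40 | 2.02E-05 | 1.89E-05 | 1.77E-05 | 6.59E-06 | 1.79E-05 | 1.74E-05 | 2.51E-05 | 2.02E-05 | 2.45E-05 |
| GSEAweight1 | 100 | 1.97E-05 | 1.96E-05 | 1.96E-05 | 7.00E-06 | 2.06E-05 | 1.97E-05 | 2.45E-05 | 1.46E-05 | 2.39E-05 |
| GSEAweight1 | 200 | 1.95E-05 | 2.05E-05 | 2.14E-05 | 7.44E-06 | 2.09E-05 | 1.86E-05 | 2.61E-05 | 1.20E-05 | 2.51E-05 |
| GSEAweight1 | 500 | 1.95E-05 | 2.12E-05 | 2.19E-05 | 8.71E-06 | 2.01E-05 | 1.91E-05 | 2.42E-05 | 1.08E-05 | 2.17E-05 |
| GSEAweight1 | 1000 | 1.88E-05 | 2.14E-05 | 2.08E-05 | 9.85E-06 | 1.95E-05 | 1.90E-05 | 2.21E-05 | 9.77E-06 | 1.78E-05 |
| GSEAweight2 | 10 | 1.56E-05 | 1.43E-05 | 1.33E-05 | 5.16E-06 | 7.60E-06 | 7.16E-06 | 1.86E-05 | 2.45E-05 | 9.84E-06 |
| GSEAweight2 | 40 | 1.93E-05 | 1.98E-05 | 1.51E-05 | 5.75E-06 | 1.22E-05 | 5.82E-06 | 2.75E-05 | 3.00E-05 | 6.33E-06 |
| GSEAweight2 | 100 | 1.89E-05 | 1.92E-05 | 1.68E-05 | 6.95E-06 | 1.75E-05 | 1.37E-05 | 2.62E-05 | 2.34E-05 | 9.40E-06 |
| GSEAweight2 | 200 | 1.73E-05 | 1.90E-05 | 1.78E-05 | 6.44E-06 | 1.88E-05 | 1.60E-05 | 2.47E-05 | 1.72E-05 | 1.96E-05 |
| GSEAweight2 | 500 | 2.01E-05 | 2.04E-05 | 2.24E-05 | 7.90E-06 | 2.09E-05 | 1.85E-05 | 2.46E-05 | 1.48E-05 | 2.67E-05 |
| GSEAweight2 | 1000 | 2.04E-05 | 2.17E-05 | 2.24E-05 | 8.62E-06 | 2.10E-05 | 1.95E-05 | 2.55E-05 | 1.28E-05 | 2.45E-05 |
| KS | 10 | 1.59E-05 | 1.85E-05 | 1.88E-05 | 8.54E-06 | 1.95E-05 | 1.94E-05 | 1.80E-05 | 1.44E-05 | 2.61E-05 |
| KS | 40 | 2.15E-05 | 1.94E-05 | 2.11E-05 | 7.56E-06 | 2.20E-05 | 1.96E-05 | 1.96E-05 | 9.81E-06 | 2.61E-05 |
| KS | 100 | 2.12E-05 | 2.07E-05 | 2.17E-05 | 8.48E-06 | 2.13E-05 | 1.95E-05 | 2.40E-05 | 1.11E-05 | 2.28E-05 |
| KS | 200 | 2.05E-05 | 2.06E-05 | 2.19E-05 | 8.49E-06 | 1.90E-05 | 1.90E-05 | 2.39E-05 | 9.77E-06 | 2.13E-05 |
| KS | 500 | 1.78E-05 | 2.09E-05 | 1.99E-05 | 1.02E-05 | 1.76E-05 | 1.79E-05 | 1.99E-05 | 8.19E-06 | 1.66E-05 |
| KS | 1000 | 1.48E-05 | 1.97E-05 | 1.78E-05 | 1.10E-05 | 1.40E-05 | 1.64E-05 | 1.54E-05 | 6.15E-06 | 1.35E-05 |
| XSum | 10 | 1.58E-05 | 1.68E-05 | 1.47E-05 | 6.41E-06 | 1.37E-05 | 1.15E-05 | 2.60E-05 | 3.62E-05 | 2.20E-05 |
| XSum | 40 | 1.98E-05 | 2.15E-05 | 1.68E-05 | 7.91E-06 | 2.17E-05 | 1.70E-05 | 2.50E-05 | 1.57E-05 | 2.16E-05 |
| XSum | 100 | 1.77E-05 | 1.97E-05 | 1.47E-05 | 8.59E-06 | 2.05E-05 | 1.69E-05 | 1.80E-05 | 9.50E-06 | 1.28E-05 |
| XSum | 200 | 1.48E-05 | 1.79E-05 | 1.18E-05 | 8.85E-06 | 1.97E-05 | 1.50E-05 | 1.54E-05 | 6.92E-06 | 1.03E-05 |
| XSum | 500 | 1.40E-05 | 1.67E-05 | 1.29E-05 | 9.58E-06 | 1.85E-05 | 1.48E-05 | 1.36E-05 | 5.70E-06 | 1.01E-05 |
| XSum | 1000 | 1.30E-05 | 1.75E-05 | 1.24E-05 | 1.07E-05 | 1.74E-05 | 1.53E-05 | 1.31E-05 | 4.91E-06 | 8.87E-06 |
| ZhangScore | 10 | 1.87E-05 | 1.68E-05 | 2.17E-05 | 6.15E-06 | 1.56E-05 | 1.32E-05 | 1.87E-05 | 2.27E-05 | 2.08E-05 |
| ZhangScore | 40 | 2.59E-05 | 2.17E-05 | 2.59E-05 | 9.70E-06 | 2.51E-05 | 2.14E-05 | 1.62E-05 | 1.65E-05 | 2.99E-05 |
| ZhangScore | 100 | 2.39E-05 | 2.08E-05 | 1.90E-05 | 8.01E-06 | 1.96E-05 | 1.66E-05 | 1.62E-05 | 1.69E-05 | 2.81E-05 |
| ZhangScore | 200 | 2.36E-05 | 2.34E-05 | 2.03E-05 | 9.36E-06 | 2.36E-05 | 1.77E-05 | 2.22E-05 | 1.78E-05 | 2.77E-05 |
| ZhangScore | 500 | 2.18E-05 | 2.21E-05 | 2.13E-05 | 8.82E-06 | 2.09E-05 | 1.93E-05 | 2.32E-05 | 1.49E-05 | 2.52E-05 |
| ZhangScore | 1000 | 1.78E-05 | 2.09E-05 | 1.94E-05 | 8.18E-06 | 1.68E-05 | 1.72E-05 | 1.62E-05 | 8.80E-06 | 1.68E-05 |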
Figure 3

Comparison of six connectivity methods in nine cell lines. Compounds that have the same MoAs and targets are counted as true positives. AUC0.01 (partial ROC curve at the FPR = 0.01) values were measured for the six methods at the gene signature size of 200 in the A375, HA1E, HT29, VCAP, HEPG2, MCF7, A549, HCC515 and PC3 cell lines, respectively.

For drug pair B-A, XSum(B-A) is calculated in the same way.

The final similarity score for drug A and drug B: XSum(A&B) = (XSum(A-B) + XSum(B-A))/2.

For this study, N was set to 5, 20, 50, 100, 250 and 500 in all metrics, so the actual gene signature sizes were 10, 40, 100, 200, 500 and 1000, respectively.
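The averaging of the two directional scores, and the reason N top genes in each direction yield a signature of 2N genes, can be illustrated with a minimal Python sketch. It assumes a simplified directional score (the sum of the target profile's values over the query's top-N up-regulated genes minus the sum over its top-N down-regulated genes). The function names, the toy profiles and this simplification are illustrative assumptions, not the RCSM implementation.

```python
def top_n_signature(profile, n):
    """Return (up, down): the n genes with the largest and the n with the smallest values."""
    ranked = sorted(profile, key=profile.get, reverse=True)
    return ranked[:n], ranked[-n:]


def directional_score(query, target, n):
    """Simplified stand-in for XSum(query-target); an assumption, not the exact definition."""
    up, down = top_n_signature(query, n)
    return sum(target[g] for g in up) - sum(target[g] for g in down)


def symmetric_score(a, b, n):
    """Average of the two directional scores, as in XSum(A&B)."""
    return (directional_score(a, b, n) + directional_score(b, a, n)) / 2


# Toy expression profiles (gene -> differential expression value); hypothetical data.
drug_a = {"g1": 3.0, "g2": 1.5, "g3": 0.2, "g4": -0.4, "g5": -1.8, "g6": -2.5}
drug_b = {"g1": 2.0, "g2": 0.1, "g3": 1.0, "g4": -0.3, "g5": -2.2, "g6": -1.0}

n = 2  # N genes in each direction
up, down = top_n_signature(drug_a, n)
signature_size = len(up) + len(down)  # 2 * N genes in total

score_ab = directional_score(drug_a, drug_b, n)
score_ba = directional_score(drug_b, drug_a, n)
score = symmetric_score(drug_a, drug_b, n)
print(signature_size, score_ab, score_ba, score)
```

The two directional scores generally differ because each drug contributes a different signature, which is why the final score averages them. The sketch requires 2N to be no larger than the number of genes so that the up and down lists do not overlap.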

AUC and P values

We utilized the partial area under the ROC curve at false positive rates of 0.001, 0.005 and 0.01 (AUC0.001, AUC0.005 and AUC0.01) to evaluate the performance of these six methods. Different treatment conditions of the same compound were treated as individual profiles when calculating the AUC0.001, AUC0.005 and AUC0.01.
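As a rough illustration of the metric, the sketch below computes an unnormalized partial area under the ROC curve up to a chosen false positive rate. It treats the ROC curve as a step function over the ranked drug pairs and assumes there are no tied scores. The function name and the toy data are hypothetical, and this is not the code used in the study.

```python
def partial_auc(scores, labels, max_fpr):
    """Area under the ROC curve for FPR in [0, max_fpr], assuming no tied scores."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, is_pos in ranked if is_pos)
    n_neg = len(ranked) - n_pos
    budget = max_fpr * n_neg  # number of negatives (possibly fractional) under the FPR cap
    seen_pos = 0
    seen_neg = 0
    area = 0.0
    for _, is_pos in ranked:
        if is_pos:
            seen_pos += 1  # the curve moves up
            continue
        # Each negative is a horizontal step of width 1 / n_neg at the current TPR.
        width = min(1.0, budget - seen_neg)
        if width <= 0:
            break
        area += (seen_pos / n_pos) * (width / n_neg)
        seen_neg += 1
    return area


# Toy example: 1 marks a true drug-drug relationship, 0 a false one.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 0, 0, 1]
print(partial_auc(scores, labels, 1.0))  # full AUC
print(partial_auc(scores, labels, 0.5))  # partial AUC up to FPR = 0.5
```

Because the area is not rescaled, its upper bound equals the FPR threshold itself (0.01 for AUC0.01), which is consistent with the small magnitudes reported in Tables 2 to 4.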

To compare the partial ROC curves (FPR ranging from 0 to 0.001, 0.005 or 0.01) of different methods, P values were computed with the R package pROC [24]. To determine the statistical significance of the AUC0.001, AUC0.005 and AUC0.01 results for each method, 10 000 random permutations of the true drug–drug relationships were performed to generate non-parametric P values.
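The permutation procedure can be sketched as follows. The labels marking true drug–drug relationships are shuffled repeatedly, the statistic is recomputed each time, and the P value is the fraction of shuffles that reach the observed value. In the study the statistic is the partial AUC. To keep the example short, a simpler stand-in (the number of true pairs among the top-k scores) is used here. The add-one correction, the function names and the toy data are assumptions of this sketch.

```python
import random


def permutation_p_value(statistic, scores, labels, n_perm=10000, seed=0):
    """Non-parametric P value: share of label shuffles whose statistic >= the observed one."""
    rng = random.Random(seed)
    observed = statistic(scores, labels)
    shuffled = list(labels)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(scores, shuffled) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero (a choice of this sketch).
    return (at_least + 1) / (n_perm + 1)


def hits_in_top_k(scores, labels, k=5):
    """Stand-in statistic: number of true pairs among the k highest-scoring pairs."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k])


# Toy data: 100 drug pairs, with the 5 true pairs receiving the 5 highest scores.
scores = [float(i) for i in range(100, 0, -1)]
labels = [1] * 5 + [0] * 95
p_value = permutation_p_value(hits_in_top_k, scores, labels, n_perm=2000, seed=1)
print(p_value)
```

With 10 000 permutations, the smallest P value this estimator can return is 1/10 001, which is just below the "P < 0.0001" level reported in the tables.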

Implementation of R package RCSM

The R package RCSM offers functions for measuring connectivity with the six methods (GSEAweight0, GSEAweight1, GSEAweight2, KS, XSum and ZhangScore) and returns the scores, P values and adjusted P values for the chosen method. Parallelization is also available on computers with multiple cores. The package can be easily installed from GitHub (https://github.com/Jasonlinchina/RCSM). This user-friendly package allows both computational and experimental researchers to quickly and effectively explore the connectivity between their gene signatures of interest and different perturbagens.

Results

Comparison of connectivity methods

To compare the early retrieval performance of the six connectivity methods, we measured the partial AUC at false positive rates of 0.001, 0.005 and 0.01 (AUC0.001, AUC0.005 and AUC0.01). The benchmark set for constructing the ROC curve was compiled from the Drug Repurposing Hub database, and the workflow of the performance evaluation is shown in Figure 2. The similarity scores (from ZhangScore) of true positive drug pairs were significantly higher than those of true negative drug pairs, indicating the effectiveness of this benchmark set (Supplementary Figure 1). AUC0.001, AUC0.005 and AUC0.01 were measured for the six connectivity methods in each of the nine cell lines. For each method, six gene signature sizes (n = 10, 40, 100, 200, 500 and 1000 genes) were considered to construct the ROC curves. The AUC0.01 values from ZhangScore were the highest in four, three, three and seven of the nine cell lines at gene signature sizes of 10, 40, 100 and 200, respectively. Among 36 events (four gene signature sizes ranging from 10 to 200 in nine cell lines), ZhangScore had the highest AUC0.01 values in 17 events (GSEAweight0 in 6 events, GSEAweight1 in 9 events, GSEAweight2 in 5 events, KS in 4 events and XSum in none), more than any other method (Table 1). Similarly, ZhangScore had the highest AUC0.005 values in 21 events (GSEAweight0 in 5 events, GSEAweight1 in 3 events, GSEAweight2 in 5 events, KS in 4 events and XSum in 1 event) (Table 2) and the highest AUC0.001 values in 18 events (GSEAweight0 in 8 events, GSEAweight1 in 3 events, GSEAweight2 in 4 events, KS in 6 events and XSum in 3 events) (Table 3). These results indicate that ZhangScore was superior to the other methods at gene signature sizes ranging from 10 to 200. At gene signature sizes of 500 and 1000, GSEAweight2 outperformed the other methods.
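The event counting described above can be reproduced for a single signature size with a short Python sketch. It uses the AUC0.005 values at n = 200 from Table 2 and counts, for each cell line, every method that reaches the column maximum. Counting all tied methods is an assumption of this sketch, and the function name is hypothetical.

```python
from collections import Counter

cell_lines = ["A375", "HA1E", "HT29", "VCAP", "HEPG2", "MCF7", "A549", "HCC515", "PC3"]

# AUC0.005 at n = 200, copied from Table 2 (one value per cell line, in the order above).
auc_0005_n200 = {
    "GSEAweight0": [1.63e-4, 1.59e-4, 1.69e-4, 1.25e-4, 1.62e-4, 1.62e-4, 2.84e-4, 1.74e-4, 2.01e-4],
    "GSEAweight1": [1.70e-4, 1.65e-4, 1.76e-4, 1.33e-4, 1.93e-4, 1.71e-4, 3.59e-4, 2.63e-4, 2.12e-4],
    "GSEAweight2": [1.51e-4, 1.70e-4, 1.68e-4, 1.04e-4, 1.75e-4, 1.54e-4, 3.84e-4, 3.71e-4, 1.69e-4],
    "KS":          [1.63e-4, 1.60e-4, 1.69e-4, 1.24e-4, 1.63e-4, 1.62e-4, 2.84e-4, 1.73e-4, 2.01e-4],
    "XSum":        [1.53e-4, 1.37e-4, 1.20e-4, 9.90e-5, 1.45e-4, 1.32e-4, 2.41e-4, 1.04e-4, 1.47e-4],
    "ZhangScore":  [2.13e-4, 2.11e-4, 2.18e-4, 1.65e-4, 1.97e-4, 1.78e-4, 3.36e-4, 3.52e-4, 2.19e-4],
}


def count_wins(values_by_method):
    """Count, per method, the columns (cell lines) in which it attains the maximum."""
    wins = Counter()
    n_columns = len(next(iter(values_by_method.values())))
    for col in range(n_columns):
        best = max(values[col] for values in values_by_method.values())
        for method, values in values_by_method.items():
            if values[col] == best:  # every tied method is counted
                wins[method] += 1
    return wins


wins = count_wins(auc_0005_n200)
print(dict(wins))  # ZhangScore: 7, GSEAweight2: 2
```

At this signature size ZhangScore is highest in seven cell lines and GSEAweight2 in the remaining two (A549 and HCC515).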

Considering that gene signatures with more than 500 genes are uncommon in practice, we chose a gene signature size of 200 to further compare the methods. The AUC0.01 values from ZhangScore were 5.72e-4 (P < 0.0001), 5.33e-4 (P < 0.0001), 5.37e-4 (P < 0.0001), 4.61e-4 (P < 0.0001), 5.18e-4 (P < 0.0001), 4.54e-4 (P < 0.0001), 8.93e-4 (P < 0.0001), 9.66e-4 (P < 0.0001) and 5.25e-4 (P < 0.0001) in the A375, HA1E, HT29, VCAP, HEPG2, MCF7, A549, HCC515 and PC3 cell lines, respectively. At the 0.01 false positive rate level, the fold enrichment scores from ZhangScore were 7.96, 7.61, 6.72, 6.77, 7.67, 5.83, 12.1, 13.0 and 6.53 in the above cell lines, respectively. The AUC0.01 and fold enrichment values of ZhangScore were significantly higher than those of the other methods in the A375, HA1E, HT29 and VCAP cell lines. In the HCC515 cell line, ZhangScore performed comparably to GSEAweight2 and better than the four other methods. In the HEPG2, MCF7 and A549 cell lines, ZhangScore performed comparably to GSEAweight1 and GSEAweight2 and better than the three other methods (Figure 3, Table 4).

Testing with the estrogen gene signature

The estrogen gene signature was first reported in the initial CMap study [2]. This gene signature consists of 189 Affymetrix probe-set IDs. We obtained 40 up- and 89 down-regulated genes after mapping these Affymetrix probe-set IDs to the Entrez gene IDs used by L1000. Using our R package RCSM, we scored the touchstone compounds against this gene signature in the breast cancer cell line MCF7. For each method, the compounds positively and negatively correlated with the estrogen gene signature were counted using a unified threshold (adjusted P value < 0.05). ZhangScore detected more known estrogen receptor agonists (Figure 4a) and more known estrogen receptor antagonists (Figure 4b) than the other methods, suggesting that it achieved a higher level of accuracy for the estrogen gene signature.

Figure 4

Results for the estrogen gene signature. The known (A) estrogen receptor agonists and (B) estrogen receptor antagonists based on the estrogen gene signature identified by six methods in the MCF7 cell line.

Testing with the gene signature of TOP2A knockdown

The LINCS program has generated over 20 000 gene expression profiles from mRNA knockdown experiments. These large-scale profiles of genetic perturbagens complement the CMap pilot dataset. To test ZhangScore on the newly generated gene knockdown data from the L1000 project, we focused on the TOP2A (DNA topoisomerase II alpha) gene, which is associated with the prognosis of multiple cancer types [25, 26]. The gene signature of TOP2A knockdown was derived from the level 5 data of L1000. This signature contains the top 100 up-regulated and top 100 down-regulated genes, extracted from the gene expression profile of TOP2A knockdown in the liver cancer cell line HEPG2. ZhangScore was used to measure the connectivity between this gene signature and the gene expression profiles of the touchstone compounds in the HEPG2 cell line. Table 5 shows the top 10 compounds that mimic the regulatory mechanisms of TOP2A knockdown. Among them, doxorubicin, daunorubicin and irinotecan are well-known TOP2A inhibitors. Two of the other seven newly identified potential TOP2A inhibitors, diflunisal and SIB-1983, are also supported by other studies. Diflunisal was shown to inhibit topo II ATPase and prevent topo II-mediated DNA cleavage [27]. SIB-1983 was predicted to be a topoisomerase inhibitor in a computational study [28]. The five other connectivity methods were also used to identify potential TOP2A inhibitors based on the gene signature of TOP2A knockdown (Supplementary Figure 2a), and ZhangScore identified more known TOP2A inhibitors than these five methods (Supplementary Figure 2b). In summary, ZhangScore achieved a higher level of accuracy in identifying known TOP2A inhibitors than the other methods and was able to find new candidate compounds for gene signatures derived from gene knockdown data.

Table 4

Statistical results for Figure 3. AUC0.01 (partial ROC curve at the FPR = 0.01) and the random permutation P value were measured. The statistical significance of the difference between ZhangScore and each of the five other methods was also measured. The results of ZhangScore are shown in bold.

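| Cell line | Scoring method | Fold enrichment | AUC0.01 (P value a) | P value b |
|---|---|---|---|---|
| A375 | GSEAweight0 | 5.73 | 4.23e-4 (P < 0.0001) | 1.16e-5 |
| A375 | GSEAweight1 | 6.69 | 4.45e-4 (P < 0.0001) | 1.53e-4 |
| A375 | GSEAweight2 | 7.17 | 4.40e-4 (P < 0.0001) | 1.65e-4 |
| A375 | KS | 5.71 | 4.23e-4 (P < 0.0001) | 2.01e-5 |
| A375 | XSum | 6.31 | 4.27e-4 (P < 0.0001) | 1.77e-3 |
| A375 | ZhangScore | 7.96 | 5.72e-4 (P < 0.0001) | n/a |
| HA1E | GSEAweight0 | 5.32 | 3.98e-4 (P < 0.0001) | 2.40e-5 |
| HA1E | GSEAweight1 | 6.25 | 4.46e-4 (P < 0.0001) | 4.03e-4 |
| HA1E | GSEAweight2 | 7.02 | 4.76e-4 (P < 0.0001) | 0.0555 |
| HA1E | KS | 5.30 | 3.98e-4 (P < 0.0001) | 5.40e-5 |
| HA1E | XSum | 4.62 | 3.35e-4 (P < 0.0001) | 1.30e-5 |
| HA1E | ZhangScore | 7.61 | 5.33e-4 (P < 0.0001) | n/a |
| HT29 | GSEAweight0 | 5.59 | 4.14e-4 (P < 0.0001) | 6.27e-4 |
| HT29 | GSEAweight1 | 6.23 | 4.62e-4 (P < 0.0001) | 5.21e-3 |
| HT29 | GSEAweight2 | 6.15 | 4.54e-4 (P < 0.0001) | 0.0102 |
| HT29 | KS | 5.56 | 4.14e-4 (P < 0.0001) | 6.50e-4 |
| HT29 | XSum | 3.91 | 2.96e-4 (P < 0.0001) | 4.90e-7 |
| HT29 | ZhangScore | 6.72 | 5.37e-4 (P < 0.0001) | n/a |
| VCAP | GSEAweight0 | 5.06 | 3.41e-4 (P < 0.0001) | 9.01e-3 |
| VCAP | GSEAweight1 | 6.27 | 4.01e-4 (P < 0.0001) | 0.0258 |
| VCAP | GSEAweight2 | 6.27 | 3.54e-4 (P < 0.0001) | 1.22e-3 |
| VCAP | KS | 5.08 | 3.40e-4 (P < 0.0001) | 9.82e-3 |
| VCAP | XSum | 3.95 | 2.69e-4 (P < 0.0001) | 2.82e-5 |
| VCAP | ZhangScore | 6.77 | 4.61e-4 (P < 0.0001) | n/a |
| HEPG2 | GSEAweight0 | 6.07 | 4.25e-4 (P < 0.0001) | 0.0148 |
| HEPG2 | GSEAweight1 | 6.87 | 5.13e-4 (P < 0.0001) | 0.876 |
| HEPG2 | GSEAweight2 | 7.00 | 4.67e-4 (P < 0.0001) | 0.336 |
| HEPG2 | KS | 5.96 | 4.26e-4 (P < 0.0001) | 0.0200 |
| HEPG2 | XSum | 5.15 | 3.69e-4 (P < 0.0001) | 9.70e-3 |
| HEPG2 | ZhangScore | 7.67 | 5.18e-4 (P < 0.0001) | n/a |
| MCF7 | GSEAweight0 | 5.51 | 4.04e-4 (P < 0.0001) | 0.0437 |
| MCF7 | GSEAweight1 | 5.83 | 4.30e-4 (P < 0.0001) | 0.320 |
| MCF7 | GSEAweight2 | 6.20 | 4.18e-4 (P < 0.0001) | 0.357 |
| MCF7 | KS | 5.53 | 4.05e-4 (P < 0.0001) | 0.0486 |
| MCF7 | XSum | 5.09 | 3.68e-4 (P < 0.0001) | 0.0207 |
| MCF7 | ZhangScore | 5.83 | 4.54e-4 (P < 0.0001) | n/a |
| A549 | GSEAweight0 | 10.3 | 7.41e-4 (P < 0.0001) | 8.13e-4 |
| A549 | GSEAweight1 | 11.9 | 9.11e-4 (P < 0.0001) | 0.578 |
| A549 | GSEAweight2 | 12.2 | 9.67e-4 (P < 0.0001) | 0.0564 |
| A549 | KS | 10.4 | 7.40e-4 (P < 0.0001) | 7.83e-4 |
| A549 | XSum | 10.7 | 7.22e-4 (P < 0.0001) | 2.16e-5 |
| A549 | ZhangScore | 12.1 | 8.93e-4 (P < 0.0001) | n/a |
| HCC515 | GSEAweight0 | 9.03 | 5.29e-4 (P < 0.0001) | 4.24e-11 |
| HCC515 | GSEAweight1 | 12.2 | 8.15e-4 (P < 0.0001) | 2.99e-4 |
| HCC515 | GSEAweight2 | 13.4 | 9.97e-4 (P < 0.0001) | 0.473 |
| HCC515 | KS | 8.99 | 5.29e-4 (P < 0.0001) | 5.72e-11 |
| HCC515 | XSum | 7.74 | 3.72e-4 (P < 0.0001) | 6.32e-13 |
| HCC515 | ZhangScore | 13.0 | 9.66e-4 (P < 0.0001) | n/a |
| PC3 | GSEAweight0 | 6.30 | 4.89e-4 (P < 0.0001) | 0.0902 |
| PC3 | GSEAweight1 | 6.55 | 5.18e-4 (P < 0.0001) | 0.705 |
| PC3 | GSEAweight2 | 6.43 | 4.40e-4 (P < 0.0001) | 8.01e-3 |
| PC3 | KS | 6.28 | 4.89e-4 (P < 0.0001) | 0.0991 |
| PC3 | XSum | 6.22 | 4.20e-4 (P < 0.0001) | 1.21e-3 |
| PC3 | ZhangScore | 6.53 | 5.25e-4 (P < 0.0001) | n/a |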

Fold enrichment is calculated as the ratio between true positive rate and false positive rate at FPR = 0.01.

a: Random permutation P value for AUC0.01.

b: The statistical significance between ZhangScore and five other methods.
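The fold enrichment defined above can be sketched in Python as follows. The ranked list is walked until the number of false positives allowed at the chosen FPR is used up, and the true positive rate reached at that point is divided by the false positive rate. Ties are ignored, and the function name and toy data are hypothetical. How the study handles the exact cut-off is not stated, so this is only one reasonable reading.

```python
def fold_enrichment(scores, labels, fpr_level=0.01):
    """Ratio of TPR to FPR at the point where FPR reaches fpr_level (no tied scores)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, is_pos in ranked if is_pos)
    n_neg = len(ranked) - n_pos
    # Number of false positives tolerated; the small epsilon guards against float rounding.
    allowed_fp = int(fpr_level * n_neg + 1e-9)
    if n_pos == 0 or allowed_fp == 0:
        raise ValueError("need positives and at least 1 / fpr_level negatives")
    tp = 0
    fp = 0
    for _, is_pos in ranked:
        if is_pos:
            tp += 1
        elif fp == allowed_fp:
            break  # the next negative would exceed the FPR level
        else:
            fp += 1
    return (tp / n_pos) / (fp / n_neg)


# Toy ranking: 10 true pairs and 200 false pairs, listed from highest to lowest score.
labels = [1, 0, 1, 1, 0, 0] + [1] * 7 + [0] * 197
scores = list(range(len(labels), 0, -1))
enrichment = fold_enrichment(scores, labels, fpr_level=0.01)
print(enrichment)  # 3 of 10 true pairs recovered at FPR = 0.01, so about 30
```

A random ranking gives a fold enrichment of about 1, so the values of roughly 6 to 13 reported in Table 4 indicate that true pairs are strongly concentrated at the top of the ranking.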

Table 4

Statistical results for Figure 3. AUC0.01 (partial ROC curve at the FPR = 0.01) and random permutation P value were measured. The statistical significance comparing ZhangScore and five other methods were also measured. The results of ZhangScore are shown in bold

Cell lines: A375, HA1E and HT29
Scoring method | A375: Fold enrichment | A375: AUC0.01 (P value^a) | A375: P value^b | HA1E: Fold enrichment | HA1E: AUC0.01 (P value^a) | HA1E: P value^b | HT29: Fold enrichment | HT29: AUC0.01 (P value^a) | HT29: P value^b
GSEAweight0 | 5.73 | 4.23e-4 (P < 0.0001) | 1.16e-5 | 5.32 | 3.98e-4 (P < 0.0001) | 2.40e-5 | 5.59 | 4.14e-4 (P < 0.0001) | 6.27e-4
GSEAweight1 | 6.69 | 4.45e-4 (P < 0.0001) | 1.53e-4 | 6.25 | 4.46e-4 (P < 0.0001) | 4.03e-4 | 6.23 | 4.62e-4 (P < 0.0001) | 5.21e-3
GSEAweight2 | 7.17 | 4.40e-4 (P < 0.0001) | 1.65e-4 | 7.02 | 4.76e-4 (P < 0.0001) | 0.0555 | 6.15 | 4.54e-4 (P < 0.0001) | 0.0102
KS | 5.71 | 4.23e-4 (P < 0.0001) | 2.01e-5 | 5.30 | 3.98e-4 (P < 0.0001) | 5.40e-5 | 5.56 | 4.14e-4 (P < 0.0001) | 6.50e-4
XSum | 6.31 | 4.27e-4 (P < 0.0001) | 1.77e-3 | 4.62 | 3.35e-4 (P < 0.0001) | 1.30e-5 | 3.91 | 2.96e-4 (P < 0.0001) | 4.90e-7
ZhangScore | 7.96 | 5.72e-4 (P < 0.0001) | \ | 7.61 | 5.33e-4 (P < 0.0001) | \ | 6.72 | 5.37e-4 (P < 0.0001) | \

Cell lines: VCAP, HEPG2 and MCF7
Scoring method | VCAP: Fold enrichment | VCAP: AUC0.01 (P value^a) | VCAP: P value^b | HEPG2: Fold enrichment | HEPG2: AUC0.01 (P value^a) | HEPG2: P value^b | MCF7: Fold enrichment | MCF7: AUC0.01 (P value^a) | MCF7: P value^b
GSEAweight0 | 5.06 | 3.41e-4 (P < 0.0001) | 9.01e-3 | 6.07 | 4.25e-4 (P < 0.0001) | 0.0148 | 5.51 | 4.04e-4 (P < 0.0001) | 0.0437
GSEAweight1 | 6.27 | 4.01e-4 (P < 0.0001) | 0.0258 | 6.87 | 5.13e-4 (P < 0.0001) | 0.876 | 5.83 | 4.30e-4 (P < 0.0001) | 0.320
GSEAweight2 | 6.27 | 3.54e-4 (P < 0.0001) | 1.22e-3 | 7.00 | 4.67e-4 (P < 0.0001) | 0.336 | 6.20 | 4.18e-4 (P < 0.0001) | 0.357
KS | 5.08 | 3.40e-4 (P < 0.0001) | 9.82e-3 | 5.96 | 4.26e-4 (P < 0.0001) | 0.0200 | 5.53 | 4.05e-4 (P < 0.0001) | 0.0486
XSum | 3.95 | 2.69e-4 (P < 0.0001) | 2.82e-5 | 5.15 | 3.69e-4 (P < 0.0001) | 9.70e-3 | 5.09 | 3.68e-4 (P < 0.0001) | 0.0207
ZhangScore | 6.77 | 4.61e-4 (P < 0.0001) | \ | 7.67 | 5.18e-4 (P < 0.0001) | \ | 5.83 | 4.54e-4 (P < 0.0001) | \

Cell lines: A549, HCC515 and PC3
Scoring method | A549: Fold enrichment | A549: AUC0.01 (P value^a) | A549: P value^b | HCC515: Fold enrichment | HCC515: AUC0.01 (P value^a) | HCC515: P value^b | PC3: Fold enrichment | PC3: AUC0.01 (P value^a) | PC3: P value^b
GSEAweight0 | 10.3 | 7.41e-4 (P < 0.0001) | 8.13e-4 | 9.03 | 5.29e-4 (P < 0.0001) | 4.24e-11 | 6.30 | 4.89e-4 (P < 0.0001) | 0.0902
GSEAweight1 | 11.9 | 9.11e-4 (P < 0.0001) | 0.578 | 12.2 | 8.15e-4 (P < 0.0001) | 2.99e-4 | 6.55 | 5.18e-4 (P < 0.0001) | 0.705
GSEAweight2 | 12.2 | 9.67e-4 (P < 0.0001) | 0.0564 | 13.4 | 9.97e-4 (P < 0.0001) | 0.473 | 6.43 | 4.40e-4 (P < 0.0001) | 8.01e-3
KS | 10.4 | 7.40e-4 (P < 0.0001) | 7.83e-4 | 8.99 | 5.29e-4 (P < 0.0001) | 5.72e-11 | 6.28 | 4.89e-4 (P < 0.0001) | 0.0991
XSum | 10.7 | 7.22e-4 (P < 0.0001) | 2.16e-5 | 7.74 | 3.72e-4 (P < 0.0001) | 6.32e-13 | 6.22 | 4.20e-4 (P < 0.0001) | 1.21e-3
ZhangScore | 12.1 | 8.93e-4 (P < 0.0001) | \ | 13.0 | 9.66e-4 (P < 0.0001) | \ | 6.53 | 5.25e-4 (P < 0.0001) | \

Fold enrichment is calculated as the ratio between true positive rate and false positive rate at FPR = 0.01.

^a Random permutation P value for AUC0.01.

^b The statistical significance between ZhangScore and five other methods.
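
The two quantities reported in the table, AUC0.01 and the fold enrichment at FPR = 0.01, can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions and not the analysis code used in this study: the function names are hypothetical, tied scores are not treated specially, and the partial area is left unnormalized, so its maximum possible value equals the FPR cutoff.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> Tuple[List[float], List[float]]:
    """Return (fpr, tpr) along a ranking in which higher scores mean 'more similar'.

    labels: 1 for a benchmark (true) drug-drug pair, 0 otherwise.
    Assumption: ties are broken by input order, not averaged.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative pair")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def partial_auc(fpr: Sequence[float], tpr: Sequence[float], max_fpr: float) -> float:
    """Unnormalized area under the step ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    for k in range(1, len(fpr)):
        left, right = fpr[k - 1], min(fpr[k], max_fpr)
        if right > left:
            # Horizontal step: TPR is constant across it.
            area += (right - left) * tpr[k]
        if fpr[k] >= max_fpr:
            break
    return area


def fold_enrichment(fpr: Sequence[float], tpr: Sequence[float], target_fpr: float) -> float:
    """Ratio between the TPR reached at target_fpr and target_fpr itself."""
    tpr_at_target = max(t for f, t in zip(fpr, tpr) if f <= target_fpr)
    return tpr_at_target / target_fpr
```

With a realistic number of negative pairs, `partial_auc(fpr, tpr, 0.01)` and `fold_enrichment(fpr, tpr, 0.01)` correspond to the AUC0.01 and fold enrichment columns; a random ranking would give a fold enrichment close to 1.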

Table 5

Top 10 compounds matched to a gene signature of TOP2A knockdown in the HEPG2 cell line identified by ZhangScore

Top 10 ranked compounds based on ZhangScore | Whether TOP2A inhibitor
Formoterol | U
Doxorubicin | Y
Methylergometrine | U
Daunorubicin | Y
Diflunisal | U
NSC-95397 | U
SIB-1893 | U
Thiethylperazine | U
Irinotecan | Y
L-655240 | U

Y: well-known TOP2A inhibitor; U: unknown

Discussion and conclusion

In this study, we systematically evaluated six CMap methodologies by assessing their performance using L1000-based next-generation Connectivity Map data. Our results suggested that ZhangScore was generally superior to other methods and exhibited the highest accuracy for gene signature sizes ranging from 10 to 200. To our knowledge, ours is the first report to evaluate the early retrieval performance of connectivity methods based on L1000 data. Furthermore, we have developed a user-friendly R package RCSM to easily implement the six connectivity methods.

The number of genes to include in a gene signature has always been an open question. Gene signatures longer than 500 genes are generally not recommended for analysis [7, 20]. The choice of gene signature size may also depend on the specific biological condition being investigated. Six gene signature sizes covering the common range of lengths were tested in our study. We chose a gene signature size of 200 for the comparison because too many or too few genes might give a biased representation of a biological state. Choosing a size of 200 was a compromise, and more evaluations and experiments should be carried out to provide more detailed guidance on choosing the length of a gene signature.

In previous studies [8, 20], the XCos and XSum methods performed best at predicting drug–drug relationships and drug-indication pairs, respectively, from the CMap pilot dataset. However, the preferred method changed when the comparison was extended to the much larger dataset. There are at least three possible reasons for this change. First, the benchmark set in this study was different from that of previous studies. Second, the methods compared were different from those of previous studies. Third, the L1000 assay differs from the previously used microarray technology in both the experimental processes and the data pre-processing pipelines. These differences affect the distributions of gene expression fold changes for all genes, which may influence the performance of these methods. There are various data pre-processing pipelines for the L1000 data. Duan et al. [29] proposed a geometrical multivariate approach (the CD method) to identify differentially expressed genes from the L1000 data. Iorio et al. [4] provided a method (the MANTRA method) that deals with different rankings of genes across replicates by merging the replicate profiles with a heuristic-based algorithm. Future work should include a comparison of different data pre-processing pipelines for L1000 data.

Besides the six methods mentioned above, the performance of other similar methods, such as the XCos [20], TES [4, 30] and DIPS [19] methods, was also evaluated (Supplementary Table 1). The XCos method performed far worse than the other methods. We did not evaluate this method in depth because its input differs from that of the other methods (KS, XSum, GSEAweight0, GSEAweight1, GSEAweight2 and ZhangScore): it requires that the genes in a gene signature be ordered and given weights (Supplementary Figure 3). The TES method had the same performance as the GSEAweight1 method (Supplementary Table 1). The principles of the TES method and the GSEAweight1 method are very similar; the only difference is that the TES method does not require the scores of the up and down signatures to have a consistent direction. The DIPS method is based on the TES method, and the principles of these two methods are almost the same.

The weight for the gene expression fold changes of each gene was a strong factor in characterizing one perturbagen. The core algorithm of ZhangScore measures the ranks for each gene according to the absolute value of the fold change. This rank-based weight of ZhangScore may be more appropriate for measuring the similarities between gene expression profiles of different perturbagens from the level 5 data of L1000. Since the gene signature sizes ranging from 10 to 200 are commonly utilized, ZhangScore might be the appropriate choice for measuring the similarities based on the L1000 data. If one has a gene signature with more than 500 genes, GSEAweight2 is also worth testing.
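
The rank-based weighting described above can be sketched in Python as follows. This is a minimal sketch, assuming the unordered-signature form of the connection score described by Zhang and Gant [7]: each gene in the reference profile receives a signed rank based on the absolute value of its fold change, the query signature contributes +1 for up-regulated and -1 for down-regulated genes, and the sum is divided by the maximum attainable value. The function names are hypothetical and the code is not the RCSM implementation; ties in absolute fold change are not handled specially.

```python
from typing import Dict, Iterable


def signed_ranks(fold_changes: Dict[str, float]) -> Dict[str, int]:
    """Rank genes by |fold change| (largest gets rank N) and attach the sign."""
    ordered = sorted(fold_changes, key=lambda gene: abs(fold_changes[gene]))
    ranks = {}
    for position, gene in enumerate(ordered, start=1):
        value = fold_changes[gene]
        sign = 1 if value > 0 else -1 if value < 0 else 0
        ranks[gene] = sign * position
    return ranks


def zhang_score(
    fold_changes: Dict[str, float],
    up_genes: Iterable[str],
    down_genes: Iterable[str],
) -> float:
    """Connection score in [-1, 1] between a reference profile and a query signature."""
    ranks = signed_ranks(fold_changes)
    n_genes = len(ranks)
    # Genes absent from the reference profile are ignored (an assumption).
    up = [g for g in up_genes if g in ranks]
    down = [g for g in down_genes if g in ranks]
    m = len(up) + len(down)
    if m == 0:
        return 0.0
    strength = sum(ranks[g] for g in up) - sum(ranks[g] for g in down)
    # Maximum strength: the m signature genes occupy the m top ranks
    # with matching signs, i.e. N + (N - 1) + ... + (N - m + 1).
    max_strength = sum(n_genes - i for i in range(m))
    return strength / max_strength
```

A score near 1 indicates that the reference perturbagen changes the signature genes in the same direction as the query, and a score near -1 indicates reversal.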

How the connectivity methods assign weights to the gene expression fold changes has a significant influence on their performance. The AUC0.001, AUC0.005 and AUC0.01 values of the KS method and the GSEAweight0 method are very close to each other because both methods use weakened weights for fold changes. In addition, both are Kolmogorov–Smirnov-derived methods.
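
The effect of the weight can be illustrated with the running-sum enrichment statistic of GSEA [6]. The Python sketch below assumes that the suffix of GSEAweight0, GSEAweight1 and GSEAweight2 corresponds to the exponent applied to the absolute fold change of each hit; with an exponent of 0 every hit contributes equally and the statistic reduces to a Kolmogorov–Smirnov-type statistic. The function name is hypothetical, and the way the connectivity methods combine the enrichment scores of the up and down signatures is not reproduced here.

```python
from typing import Dict, Set


def enrichment_score(fold_changes: Dict[str, float], gene_set: Set[str], weight: float) -> float:
    """Weighted running-sum enrichment score of gene_set in a ranked profile.

    Genes are ranked from the largest to the smallest fold change. Each hit
    increases the running sum in proportion to |fold change| ** weight, each
    miss decreases it by 1 / (number of misses). The score is the running-sum
    value with the largest absolute deviation from zero.
    """
    ranked = sorted(fold_changes, key=lambda gene: -fold_changes[gene])
    hits = [gene in gene_set for gene in ranked]
    n_miss = len(ranked) - sum(hits)
    hit_weights = [
        abs(fold_changes[gene]) ** weight if is_hit else 0.0
        for gene, is_hit in zip(ranked, hits)
    ]
    total = sum(hit_weights)
    if total == 0 or n_miss == 0:
        return 0.0
    running = 0.0
    best = 0.0
    for w, is_hit in zip(hit_weights, hits):
        running += w / total if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best
```

For a gene set whose members sit at both strong and weak positions in the profile, increasing the exponent shifts the score toward the strongly changed genes, whereas with an exponent of 0 only the positions of the genes matter.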

Taken together, evaluation of CMap methodologies has been extended to multiple cell types and more types of perturbagens. The conclusions from this study provide potential guidelines for researchers to choose the suitable method for drug repurposing based on L1000 data. Moreover, in addition to L1000 technology, there are other high-throughput technologies, such as HTS2 (high-throughput sequencing-based high-throughput screening) [31, 32], which can also generate large-scale gene profiling datasets, for which ZhangScore might also be appropriate.

Key Points
  • We utilized the partial area under the ROC curve (AUC0.001, AUC0.005 and AUC0.01) at false positive rates of 0.001, 0.005 and 0.01 (FPR = 0.001, FPR = 0.005 and FPR = 0.01) to evaluate the performance of six connectivity methods.

  • Systematic evaluations of CMap methodologies have been extended to the L1000 data that contain the expression profiles of multiple human cell lines and various types of perturbagens.

  • ZhangScore is generally superior to other methods and exhibits the highest accuracy with the gene signature sizes ranging from 10 to 200.

  • The six methods used in this study have been implemented in an R package that is freely available at https://github.com/Jasonlinchina/RCSM.

Funding

National Natural Science Foundation of China (81673460); Key Projects of Science and Technology Plan of Inner Mongolia Autonomous Region (201802115); Tsinghua-Peking Joint Center of Life Sciences.

Kequan Lin is a PhD candidate in School of Life Sciences, Tsinghua University. His research focuses on developing bioinformatics pipelines for cancer research and drug discovery.

Lu Li is a PhD candidate in School of Life Sciences, Tsinghua University. Her research focuses on drug discovery.

Yifei Dai is a PhD candidate at the Department of Basic Medical Sciences, School of Medicine, Tsinghua University. His research focuses on Traditional Chinese Medicine.

Huili Wang is a PhD candidate at the Department of Basic Medical Sciences, School of Medicine, Tsinghua University. Her research focuses on long non-coding RNA (lncRNA) and cancer.

Shuaishuai Teng is a PhD candidate at the Department of Basic Medical Sciences, School of Medicine, Tsinghua University. Her research focuses on epigenetics and cancer.

Xilinqiqige Bao is a chief pharmacist in Innovative Mongolian Pharmaceutical Preparations Laboratory, Inner Mongolia International Mongolian Hospital, whose research focuses on pharmacological research of Mongolian herbs.

Zhi John Lu is a professor in School of Life Sciences, Tsinghua University, whose lab is interested in bioinformatics studies for lncRNA and cancer.

Dong Wang is a professor in School of Basic Medical Sciences, Chengdu University of Traditional Chinese Medicine, whose lab is interested in cancer epigenetics and drug discovery.

References

1. Subramanian A, Narayan R, Corsello SM, et al. A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell 2017;171:1437–52.

2. Lamb J, Crawford ED, Peck D, et al. The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease. Science 2006;313:1929–35.

3. Sirota M, Dudley JT, Kim J, et al. Discovery and preclinical validation of drug indications using compendia of public gene expression data. Sci Transl Med 2011;3:96ra77–7.

4. Iorio F, Bosotti R, Scacheri E, et al. Discovery of drug mode of action and drug repositioning from transcriptional responses. Proc Natl Acad Sci U S A 2010;107:14621–6.

5. Chen B, Ma L, Paik H, et al. Reversal of cancer gene expression correlates with drug efficacy and reveals therapeutic targets. Nat Commun 2017;8:16022.

6. Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 2005;102:15545–50.

7. Zhang SD, Gant TW. A simple and robust method for connecting small-molecule drugs using gene-expression signatures. BMC Bioinformatics 2008;9:258.

8. Cheng J, Yang L, Kumar V, et al. Systematic evaluation of connectivity map for disease indications. Genome Med 2014;6:95.

9. Tenenbaum JD, Walker MG, Utz PJ, et al. Expression-based pathway signature analysis (EPSA): mining publicly available microarray data for insight into human disease. BMC Med Genomics 2008;1:51.

10. Yi YJ, Li C, Miller C, et al. Strategy for encoding and comparison of gene expression signatures. Genome Biol 2007;8:R133.

11. Gower AC, Spira A, Lenburg ME. Discovering biological connections between experimental conditions based on common patterns of differential gene expression. BMC Bioinformatics 2011;12:381.

12. Engreitz JM, Chen R, Morgan AA, et al. ProfileChaser: searching microarray repositories based on genome-wide patterns of differential expression. Bioinformatics 2011;27:3317–8.

13. Sartor MA, Leikauf GD, Medvedovic M. LRpath: a logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics 2009;25:211–7.

14. Vencio RZN, Shmulevich I. ProbCD: enrichment analysis accounting for categorization uncertainty. BMC Bioinformatics 2007;8:383.

15. Tanner SW, Agarwal P. Gene vector analysis (Geneva): a unified method to detect differentially-regulated gene sets and similar microarray experiments. BMC Bioinformatics 2008;9:348.

16. Freudenberg JM, Sivaganesan S, Phatak M, et al. Generalized random set framework for functional enrichment analysis using primary genomics datasets. Bioinformatics 2011;27:70–7.

17. Segal MR, Xiong H, Bengtsson H, et al. Querying genomic databases: refining the connectivity map. Stat Appl Genet Mol Biol 2012;11.

18. Musa A, Ghoraie LS, Zhang SD, et al. A review of connectivity map and computational approaches in pharmacogenomics (bbw112, 2017). Brief Bioinform 2017;18:903–3.

19. Iskar M, Campillos M, Kuhn M, et al. Drug-induced regulation of target expression. PLoS Comput Biol 2010;6:e1000925.

20. Cheng J, Xie Q, Kumar V, et al. Evaluation of analytical methods for connectivity map data. Biocomputing 2013. World Scientific 2013;5–16.

21. Cheng J, Yang L. Comparing gene expression similarity metrics for connectivity map. In: 2013 IEEE International Conference on Bioinformatics and Biomedicine. 2013, 165–70. IEEE.

22. Corsello SM, Bittker JA, Liu ZH, et al. The drug repurposing hub: a next-generation drug library and information resource. Nat Med 2017;23:405.

23. Enache OM, Lahr DL, Natoli TE, et al. The GCTx format and cmap {Py, R, M, J} packages: resources for optimized storage and integrated traversal of annotated dense matrices. Bioinformatics 2018.

24. Robin X, Turck N, Hainard A, et al. pROC: an open-source package for R and S plus to analyze and compare ROC curves. BMC Bioinformatics 2011;17:12.

25. Wong N, Yeo W, Wong WL, et al. TOP2A overexpression in hepatocellular carcinoma correlates with early age onset, shorter patients survival and chemoresistance. Int J Cancer 2009;124:644–52.

26. Brase JC, Schmidt M, Fischbach T, et al. ERBB2 and TOP2A in breast cancer: a comprehensive analysis of gene amplification, RNA levels, and protein expression and their influence on prognosis and prediction. Clin Cancer Res 2010;16:2391–401.

27. Bau JT, Kurz EU. Structural determinants of the catalytic inhibition of human topoisomerase IIα by salicylate analogs and salicylate-based drugs. Biochem Pharmacol 2014;89:464–76.

28. Liu T-P, Hsieh Y-Y, Chou C-J, et al. Systematic polypharmacology and drug repurposing via an integrated L1000-based connectivity map database mining. R Soc Open Sci 2018;5:181321.

29. Duan Q, Reid SP, Clark NR, et al. L1000CDS2: LINCS L1000 characteristic direction signatures search engine. NPJ Syst Biol Appl 2016;2:16015.

30. Iorio F, Tagliaferri R, Di Bernardo D. Identifying network of drug mode of action by gene expression profiling. J Comput Biol 2009;16:241–51.

31. Li H, Zhou H, Wang D, et al. Versatile pathway-centric approach based on high-throughput sequencing to anticancer drug discovery. Proc Natl Acad Sci 2012;109:4609–14.

32. Shao W, Li S, Li L, et al. Chemical genomics reveals inhibition of breast cancer lung metastasis by Ponatinib via c-Jun. Protein Cell 2019;10:161–77.

Author notes

Kequan Lin and Lu Li contributed equally to this work.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Supplementary data