Abstract

CCancer is an automatically collected database of gene lists reported mostly by experimental studies in various biological and clinical contexts. At the moment, the database covers 3369 gene lists extracted from 2644 papers published in ∼80 peer-reviewed journals. As input, CCancer accepts a gene list. An enrichment analysis is implemented to generate, as output, a survey of recently published studies that report gene lists which significantly intersect with the query gene list. A report on gene pairs from the input list that were frequently reported together by other biological studies is also provided. CCancer is freely available at http://mips.helmholtz-muenchen.de/proj/ccancer.

INTRODUCTION

At the moment, various high-throughput experimental platforms are employed intensively to provide new insights into the molecular mechanisms underlying a variety of biological phenomena (1,2). An increasing number of biological or clinical studies report differentially expressed genes, epigenetically silenced genes, frequently mutated genes, genes with copy number variations or other gene lists involved in common biological processes. Although publicly available, this type of information is scattered across hundreds of papers. The only practical way to collect these valuable data is to use automatic text-mining systems.

Text-mining systems are employed by biomedical researchers to automatically extract relevant information from the literature [see ref. (3) for a review]. For example, PolySearch (4) is a generic text mining system for extracting relationships between genes and diseases. Several other databases, which are based on text mining, focus on specialized research areas: PubMeth (5) and MeInfoText (6) collect information on gene methylation in cancer. DDOC (7) and DDEC (8) collect heterogeneous information about genes differentially expressed in ovarian and esophageal cancer, such as manually curated information about the promoter regions and associated transcription factors, as well as text-mined reports.

Recently, we have developed the PLIPS database, a collection of protein lists extracted from proteomics studies by text mining (9). PLIPS also provides a statistical framework for the interpretation of a protein list. Generating the PLIPS database required relatively little text-mining effort, since the majority of proteomics studies are published in a few highly specialized proteomics journals. PLIPS covers only five major proteomics journals (Proteomics, Journal of Proteome Research, Molecular and Cellular Proteomics, Proteomics—Clinical Applications) and ∼1000 different protein lists extracted from 800 independent studies.

In contrast to proteomics, high-throughput genomic technologies are used more frequently and their results are published in a much wider spectrum of journals. Gene lists characterized as playing key roles in the molecular mechanisms of a variety of biological phenomena are regularly reported in general biological journals, as well as in highly specific medical journals. Thus, automatic extraction of this information requires considerable additional effort.

Here, we present a database, termed CCancer, which provides a collection of 3369 gene lists automatically extracted from tables in 2644 studies covering ∼80 peer-reviewed journals. Cancer is a major focus of biomedical research. According to our estimates, more than half of the gene lists stored in CCancer were extracted from cancer-related studies, hence the name of the database.

CCancer is not only a database but also a web-based analysis platform, which employs an enrichment analysis framework (10–14) to interpret a given user-defined gene list. As input, CCancer accepts a gene/protein list. As output, it provides a catalogue of previously published studies that report a table of genes/proteins which significantly intersects with the query list. Thus, CCancer supports the interpretation of the functional context of an experimentally derived gene list. To illustrate the kind of information that the user can obtain from the CCancer database, we present several examples of data analyses.

MATERIALS AND METHODS

Text mining

We collected all articles (∼150 000) published in 80 peer-reviewed journals over the last 5–7 years. The articles were screened for tables which report gene identifiers. The search algorithm was implemented to recognize a table with gene/protein identifiers within the paper text. If a table reported at least 10 unique gene identifiers of the same type [i.e. ‘Entrez Gene IDs’, ‘Gene Symbols’, ‘RefSeq’, ‘UNIGENE’, ‘ENSEMBL’, ‘Affymetrix Probes’, ‘IPISYN (International Protein Identifier)’, ‘Uniprot, SwissProt’], then the paper was selected. In total, 3369 gene/protein lists were identified from 2644 papers. All gene lists were mapped to ‘Entrez Gene IDs’. The data in CCancer cover ∼20 000 unique ‘Entrez Gene IDs’.
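The selection rule above (at least 10 unique identifiers of a single type per table) can be illustrated with a minimal Python sketch. This is not the search algorithm used to build the database: the function name `detect_gene_table`, the input format (a flat list of table cell strings) and the three regular expressions are assumptions made for illustration, and they cover only a subset of the identifier types listed.

```python
import re

# Hypothetical, simplified patterns for three of the identifier types;
# the patterns actually used to build the database are not described.
PATTERNS = {
    "ENSEMBL": re.compile(r"^ENS[A-Z]*G\d{11}$"),
    "RefSeq": re.compile(r"^[NX][MRP]_\d+(\.\d+)?$"),
    "UNIGENE": re.compile(r"^[A-Z][a-z]\.\d+$"),
}

def detect_gene_table(cells, min_unique=10):
    """Return (identifier_type, identifiers) if the table cells contain at
    least `min_unique` unique identifiers of one type, otherwise None."""
    found = {name: set() for name in PATTERNS}
    for cell in cells:
        token = cell.strip()
        for name, pattern in PATTERNS.items():
            if pattern.match(token):
                found[name].add(token)
                break
    # Pick the identifier type with the most unique matches.
    best = max(found, key=lambda name: len(found[name]))
    if len(found[best]) >= min_unique:
        return best, found[best]
    return None

# Usage: a toy table with 12 Ensembl gene IDs and two unrelated cells.
cells = [f"ENSG{n:011d}" for n in range(12)] + ["fold change", "0.5"]
result = detect_gene_table(cells)
```

In this toy input the table passes the threshold; a table with only five identifiers would be rejected.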

The top journals, in terms of the number of extracted gene lists, were highly specific journals in cancer research: ‘Cancer Research’ (327 papers and 411 gene lists), ‘Oncogene’ (214 papers and 278 gene lists), ‘Clinical Cancer Research’ (149 papers and 178 gene lists) and ‘International Journal of Cancer’ (109 papers and 143 gene lists). The full list of journals is accessible on the web server (http://mips.helmholtz-muenchen.de/proj/ccancer/journals).

We would like to point out that the data were collected automatically. A gene list in the database may be incomplete (in comparison to the list originally reported in the paper) and might contain false positive gene identifiers. In our estimate, based on 100 randomly selected records, ∼60% of records are of high quality (<10% false negative and false positive genes when the CCancer record is compared with the original table in the paper), ∼20% of records are of good quality (containing 10–25% false negatives and not more than 10% false positives) and ∼15% of records contain ∼35–75% of the genes actually reported in the paper table. About 3–5% of the records may represent artefacts, i.e. the result of a wrongly recognized table which does not actually report a gene list.

Automatic annotation of gene lists with human disease terms

A comprehensive hierarchical controlled vocabulary for human disease (http://do-wiki.nubic.northwestern.edu/index.php/Main_Page) was used to link articles to human disease terms. First, we computed the background distribution for each disease term using all available articles (∼150 000). For each disease term ‘A’, we selected the subset of articles in which the term was present at least once and counted, for each article in this subset, the number of times the term ‘A’ occurred in the paper. We then computed the average number of times the term ‘A’ was mentioned per paper across this subset. If a paper mentioned the term ‘A’ twice as many times as the computed average value, the paper was annotated with the term ‘A’.
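The annotation rule described above can be sketched in a few lines of Python. The function name `annotate_articles` and the input format (a dictionary of per-article term counts) are hypothetical, and the sketch assumes that "twice as many times" means "at least twice the average"; the original implementation may differ on both points.

```python
from collections import defaultdict

def annotate_articles(term_counts, factor=2.0):
    """term_counts maps article_id -> {disease_term: number of mentions}.
    Returns article_id -> set of terms the article is annotated with."""
    totals = defaultdict(int)      # total mentions of each term
    n_articles = defaultdict(int)  # articles mentioning the term at least once
    for counts in term_counts.values():
        for term, count in counts.items():
            if count > 0:
                totals[term] += count
                n_articles[term] += 1
    # Background: average mentions per paper, over papers that use the term.
    mean = {term: totals[term] / n_articles[term] for term in totals}
    # Annotate when the count reaches `factor` times the background average
    # (assumption: the threshold is inclusive).
    return {
        article: {t for t, c in counts.items()
                  if c > 0 and c >= factor * mean[t]}
        for article, counts in term_counts.items()
    }

# Usage: the average for "leukemia" is (10 + 1 + 1) / 3 = 4, so only the
# article with 10 mentions reaches the threshold of 8.
counts = {"p1": {"leukemia": 10}, "p2": {"leukemia": 1}, "p3": {"leukemia": 1}}
annotations = annotate_articles(counts)
```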

The gene lists in the CCancer database were annotated with ‘human disease terms’ based on the annotation of the paper from which each list was extracted. We would like to point out that the biological context of the genes reported in a table may not correspond to the context of the terms overrepresented in the paper text. In each case, manual analysis is required.

Statistical analysis: intersection of gene lists

To statistically link a given gene list (the query list) to the lists from the database, we implemented a standard enrichment analysis. For each gene/protein list L in the database, the number I of genes common to the query list and the list L is counted. The null hypothesis H0, ‘Genes from the query list (size NQ) and from the list L (size NL) have at least I common genes by chance’, is tested. The hypergeometric test, adjusted for multiple testing by a Monte–Carlo simulation procedure, is employed to assess the significance of the intersection I. The estimated P-value corresponds exactly to the definition of an experiment-wise Westfall and Young P-value (11,12,15–17).
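As an illustration of the unadjusted step of this test, the following Python sketch computes the hypergeometric tail probability of observing at least I common genes. The function name `hypergeom_sf` and the parameter `universe` (the number of genes from which both lists are assumed to be drawn) are assumptions; the text does not state which background size the server uses, and the multiple-testing adjustment is not included here.

```python
from math import comb

def hypergeom_sf(i, universe, n_query, n_list):
    """P(X >= i), where X is the number of genes shared by a random query
    of size n_query and a fixed list of size n_list, both taken from a
    background of `universe` genes (unadjusted hypergeometric P-value)."""
    upper = min(n_query, n_list)
    total = comb(universe, n_query)
    tail = sum(
        comb(n_list, x) * comb(universe - n_list, n_query - x)
        for x in range(i, upper + 1)
    )
    return tail / total

# Usage with illustrative numbers: 8 shared genes between a query of 54
# genes and a database list of 27 genes, assuming a 20 000-gene background.
p_raw = hypergeom_sf(8, 20000, 54, 27)
```

Exact integer arithmetic via `math.comb` avoids a dependency on external statistics libraries, at the cost of speed for very large lists.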

Statistical analysis: significantly associated gene pairs

Based on the data from the CCancer database, we identified gene pairs which were significantly associated (frequently reported together). Let N denote the total number of tables in the CCancer database and Cj the number of tables in which gene j was reported. For each pair of genes (j and k), we compute Cjk, the number of tables in which both gene j and gene k were reported. The number (intersection) Cjk follows a hypergeometric distribution with parameters N, Cj and Ck (‘Cj’ balls are drawn without replacement from an urn containing ‘N’ balls in total, ‘Ck’ of which are white). A Monte–Carlo simulation procedure was employed to adjust the P-value for multiple testing (for each gene we tested K hypotheses, where K is the total number of genes). At the significance level P < 0.05, each gene from the CCancer database was associated on average with ∼20 other genes.
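The co-occurrence counts and the unadjusted pair P-values can be sketched as follows. The function name `pair_pvalues` and the representation of the database as a list of gene sets are assumptions for illustration; the Monte–Carlo adjustment for multiple testing is deliberately left out.

```python
from collections import Counter
from itertools import combinations
from math import comb

def pair_pvalues(tables):
    """tables: list of gene sets, one per database table.
    Returns {(gene_j, gene_k): unadjusted P(C_jk >= observed)}."""
    n = len(tables)
    # C_j: number of tables reporting each gene.
    single = Counter(g for t in tables for g in set(t))
    # C_jk: number of tables reporting both genes of a pair.
    pairs = Counter(
        p for t in tables for p in combinations(sorted(set(t)), 2)
    )
    pvalues = {}
    for (j, k), cjk in pairs.items():
        cj, ck = single[j], single[k]
        # Hypergeometric tail: cj draws from n tables, ck of which are "white".
        tail = sum(
            comb(ck, x) * comb(n - ck, cj - x)
            for x in range(cjk, min(cj, ck) + 1)
        )
        pvalues[(j, k)] = tail / comb(n, cj)
    return pvalues

# Usage: genes A and B always appear together in a toy database of 4 tables.
toy = [{"A", "B"}, {"A", "B"}, {"C"}, {"D"}]
p = pair_pvalues(toy)  # p[("A", "B")] is 1/6 for this toy input
```

Enumerating all pairs within each table is quadratic in table size, so a production implementation would likely need a more economical counting scheme for large tables.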

Cross-linking of gene lists from the database

Extracted gene/protein lists were mapped to NCBI Entrez Gene IDs. Each gene/protein list L in the database was used in turn as a ‘query’ list to identify the other lists in the database that share a significant number of genes with it. Thus, we cross-linked all pairs of gene lists that share a significant number (P < 0.01) of common genes. This information can be browsed online (http://mips.helmholtz-muenchen.de/proj/ccancer/journals/).

Linking gene lists to gene ontology terms

Each gene/protein list from the CCancer database was linked to the gene ontology (GO) terms which were significantly (P < 0.01) enriched in the list (13). In analogy to the calculation of the intersection P-value, a hypergeometric test, adjusted for multiple testing using a Monte–Carlo simulation approach, was employed to estimate the statistical significance of a GO category.

Monte–Carlo simulation to adjust P-values for multiple testing

All three cases considered (intersection between a gene list and CCancer records, GO enrichment of a gene list and significantly associated gene pairs) can be described by the same statistical model, ‘sampling without replacement’. In this model, k balls are drawn without replacement from an urn containing N balls in total, n of which are white (all others are black). The number k1 of white balls drawn from the urn then follows a hypergeometric distribution with parameters N, k and n. In our cases, however, the balls are multicolored and we actually test multiple hypotheses at the same time: ‘white balls were drawn randomly’, ‘blue balls were drawn randomly’, ‘red balls were drawn randomly’ and so on. Because we select the most enriched color, whichever it is (say, red), the P-value estimated from the hypergeometric distribution does not reflect the probability of drawing k1 balls of any one color; it reflects the probability of drawing k1 red balls. To obtain the probability of drawing k1 balls of any one color, we need to adjust the P-value for multiple testing. One way to do this is to use a Monte–Carlo simulation to estimate this probability directly, based on, say, 1000 simulations.

In this procedure, we simulate a random drawing of k balls 1000 times and each time compute the hypergeometric P-value for the best (most enriched) color. This yields a distribution of 1000 best P-values for random drawings of k balls, which we compare with the P-value for the best color in our original drawing of k balls. The adjusted P-value is estimated as the share of random simulations in which the best P-value was equal to or smaller than the best P-value of the original drawing.
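The simulation procedure described above can be sketched in Python for the gene-list intersection case, where each database list plays the role of one ‘color’. All names (`hypergeom_sf`, `best_p`, `adjusted_p`), the explicit `universe` of genes and the fixed random seed are assumptions for illustration; the sketch is not the server's implementation.

```python
import random
from math import comb

def hypergeom_sf(i, universe, n_query, n_list):
    """Unadjusted hypergeometric tail probability P(X >= i)."""
    tail = sum(
        comb(n_list, x) * comb(universe - n_list, n_query - x)
        for x in range(i, min(n_query, n_list) + 1)
    )
    return tail / comb(universe, n_query)

def best_p(query, lists, universe_size):
    """Smallest unadjusted P-value over all database lists ('colors')."""
    q = set(query)
    return min(
        hypergeom_sf(len(q & lst), universe_size, len(q), len(lst))
        for lst in lists
    )

def adjusted_p(query, lists, universe, n_sim=1000, seed=0):
    """Share of random queries whose best P-value is <= the observed one."""
    rng = random.Random(seed)
    genes = sorted(universe)
    k = len(set(query))
    observed = best_p(query, lists, len(genes))
    hits = sum(
        best_p(rng.sample(genes, k), lists, len(genes)) <= observed
        for _ in range(n_sim)
    )
    return hits / n_sim

# Usage on a toy universe of 200 genes and three database lists.
universe = range(200)
lists = [set(range(0, 20)), set(range(50, 70)), set(range(100, 120))]
p_adj = adjusted_p(list(range(10)), lists, universe, n_sim=200)
```

With a finite number of simulations the estimate is limited to multiples of 1/n_sim, so very small adjusted P-values are reported as 0 unless more simulations are run.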

RESULTS

CCancer: interpretation of gene list

The user can submit a list of gene/protein identifiers to find statistically significant links to previously published studies, as well as to identify gene pairs from the submitted list which were frequently reported together (P < 0.05). As input, CCancer accepts most types of gene and protein identifiers, such as ‘Gene Symbol’, ‘Entrez Gene Id’, ‘UniProt/Swiss-Prot’, ‘UniGene’, ‘Ensembl’, ‘RefSeq Protein ID’, ‘RefSeq Transcript ID’ and ‘Affymetrix probe codes’. As output, a catalogue of previously published studies that report a table of genes which significantly intersects with the query list is provided.

After a list of potential studies is generated, all interesting hits need to be checked manually. First, the list of gene IDs common to the query and the database list needs to be checked (a link ‘Mapping protocol’ is provided on the results page). Additionally, by looking into the corresponding study (a link is provided on the results page), one can better understand the functional context of the ‘database’ gene list. As already mentioned in the ‘Text mining’ section, the database was collected automatically and, thus, some hits may represent artefacts.

CCancer: browsing for gene lists with common biological properties

CCancer also provides an interface to browse gene lists from the database with a common property. At the moment, the user can select gene lists which are statistically linked to either a particular GO biological process (http://mips.helmholtz-muenchen.de/proj/ccancer/go_bp), molecular function (http://mips.helmholtz-muenchen.de/proj/ccancer/go_mf) or cellular component (http://mips.helmholtz-muenchen.de/proj/ccancer/go_cc). The possibility to browse gene lists based on their local properties is going to be extended in the future.

CCancer: examples of possible applications

Next, we present examples of analyses of experimental data by CCancer to illustrate the potential utility of our database. In principle, the interpretation of a gene list using CCancer is based on the widely accepted guilt-by-association principle: significant similarities between gene/protein lists can be indicative of similarity in molecular mechanisms between the corresponding phenomena. The next example aims to illustrate this application of CCancer.

Example 1: Relationship between the senescence phenotype and cancer

A study by Young et al. (18) provided evidence that autophagy-related genes mediate the acquisition of the senescence phenotype. The authors studied 53 autophagy- and senescence-related genes, which were up- or down-regulated after Ras induction. A screen in CCancer for studies reporting gene sets that significantly intersect with the genes reported in ref. (18) identified several related papers (P < 0.01, Table 1). For example, CCancer detected a study in which Staber et al. (19) report genes associated with recurrent acute myeloid leukemia after high-dose chemotherapy. The genes which were differentially expressed in patients with acute myeloid leukemia prior to high-dose chemotherapy and after relapse significantly intersect with the senescence-related genes reported in ref. (18).

Table 1.

An example of a CCancer output: a query list of autophagy- and senescence-related genes

| P-value | lf | Size of query list | Size of database list | Mapping protocol | Study: ‘Table number’, the number and type of the protein/gene recognized, the title of the article | Human disease ontology terms over-represented in the article text |
| --- | --- | --- | --- | --- | --- | --- |
| 0.01 | 8 | 54 | 27 | Map | Table 3, 25 UNIGENE_1 in epithelial and stromal Cathepsin K and CXCL14 expression in breast tumour progression | Carcinoma in situ, breast cancer, radial scar, breast tumor, ductal carcinoma, cancer, carcinoma, scar |
| 0.01 | 8 | 54 | 33 | Map | Table 3, 33 REFSEQ_RNA_NAV in effects of imatinib on monocyte-derived dendritic cells are mediated by inhibition of nuclear factor-{kappa}B and Akt signalling pathways | Chronic myeloid leukemia, myeloid leukemia, leukemia |
| 0.01 | 5 | 54 | 30 | Map | Table 1, 32 synonyms in identification of 491 proteins in the tear fluid proteome reveals a large number of proteases and protease inhibitors | Tear, disease |
| 0.01 | 5 | 54 | 45 | Map | Table 2, 40 synonyms in in vivo p53 response and immune reaction underlie highly effective low-dose radiotherapy in follicular lymphoma | Follicular lymphoma, disease, lymphoma |
| 0.01 | 7 | 54 | 99 | Map | Table 1, 86 synonyms in gene expression analysis of prostate hyperplasia in mice overexpressing the prolactin gene specifically in the prostate | Benign prostatic hyperplasia, intraepithelial neoplasia, dna fragmentation, hyperplastic, hyperplasia, cancer |
| 0.01 | 4 | 54 | 41 | Map | Table 1, 41 REFSEQ_RNA_NAV in cytoglobin, the newest member of the globin family, functions as a tumor suppressor gene | Breast cancer, cancer |
| 0.015 | 4 | 54 | 44 | Map | Table 2, 44 REFSEQ_RNA_NAV in common alterations in gene expression and increased proliferation in recurrent acute myeloid leukemia | Acute myeloid leukemia, myeloid leukemia, leukemia, disease |
| 0.015 | 5 | 54 | 75 | Map | Table 1, 72 synonyms in identification of epigenetically silenced genes in tumor endothelial cells | Cancer |
| 0.015 | 5 | 54 | 75 | Map | Table 2, 60 synonyms in molecular signatures of self-renewal, differentiation, and lineage choice in multipotential hemopoietic progenitor cells in vitro | Nephroblastoma |
| 0.015 | 6 | 54 | 114 | Map | Table 1, 95 synonyms in knowledge-based gene expression classification via matrix factorization | Disease |
| 0.015 | 5 | 54 | 84 | Map | Table 3, 73 synonyms in genetic expression profiles and biologic pathway alterations in head and neck squamous cell carcinoma | Squamous cell carcinoma, neck cancer, oral cancer, squamous carcinoma, cancer, disease, carcinoma |
| 0.015 | 4 | 54 | 74 | Map | Table 1, 74 REFSEQ_RNA_NAV in dlk1/FA1 regulates the function of human bone marrow mesenchymal stem cells by modulating gene expression of pro-inflammatory cytokines and immune response-related factors | |
| 0.015 | 4 | 54 | 77 | Map | Table 1, 67 SYNONYMS in Comparative analysis of genes regulated by PML/RAR (20) and PLZF/RAR (20) in response to retinoic acid using oligonucleotide arrays | Leukemia |
| 0.015 | 4 | 54 | 78 | Map | Table 2, 67 UNIGENE_1 in genetic expression profiles and biologic pathway alterations in head and neck squamous cell carcinoma | Squamous cell carcinoma, neck cancer, oral cancer, squamous carcinoma, cancer, disease, carcinoma |
| 0.015 | 4 | 54 | 85 | Map | Table 1, 76 synonyms in genomic and proteomic analysis of the myeloid differentiation program | Leukemia |
| 0.015 | 4 | 54 | 93 | Map | Table 3, 83 synonyms in differentiation of lung adenocarcinoma, pleural mesothelioma, and nonmalignant pulmonary tissues using DNA methylation profiles | Lung adenocarcinoma, malignant mesothelioma, pleural mesothelioma, cancer, adenocarcinoma, adenocarcinomas, mesothelioma, confusion |
| 0.015 | 6 | 54 | 227 | Map | Table 3, 190 synonyms in inflammatory manifestations of experimental lymphatic insufficiency | Retinoblastoma, disease, lymphedema |
| 0.015 | 4 | 54 | 106 | Map | Table 1, 92 synonyms in overexpression of the Axl tyrosine kinase receptor in cutaneous SCC-derived cell lines and tumors | Skin cancer, cancer, carcinoma |
| 0.025 | 7 | 54 | 360 | Map | Table W2, 433 AFFY in chromosomal instability is associated with higher expression of genes implicated in epithelial-mesenchymal transition, cancer invasiveness, and metastasis and with lower expression of genes involved in cell cycle checkpoints, DNA repair, and chromatin maintenance | Squamous cell carcinoma, neoplastic transformation, leukemia, cancer, lipoma |
| 0.025 | 5 | 54 | 206 | Map | Table 2, 179 synonyms in sequential gene expression changes in cancer cell lines after treatment with the demethylation agent 5-Aza-2-deoxycytidine | Cancer, carcinoma, hepatoma |
| 0.025 | 4 | 54 | 137 | Map | Table 1, 119 synonyms in transcriptional response of lymphoblastoid cells to ionizing radiation - genome research | DNA damage |

lf: number of genes in the intersection.

Two other studies detected by CCancer covered related topics. One study (21) described genes related to macrophage activation and the TH1 immune response that were induced by low-dose radiation therapy in follicular lymphoma. A second study (22) identified genes differentially expressed in response to ionizing radiation in lymphoblastoid cells.

A relationship between cancer- and senescence-related genes or pathways would certainly have been expected. Interestingly, some topics that emerged from the comparison of senescence-related genes with the studies in the CCancer database were related to cancer therapy or prognosis. For instance, the studies commonly reported Cathepsin B and D (CTSB/D) and Cathepsin L1 (CTSL1), which participate in protein degradation and turnover (23,24).

Another output from CCancer (Table 2) reports pairs of genes that were frequently reported together. For example, the already mentioned genes Cathepsin L1 (CTSL1) and Cathepsin B and D (CTSB/D) were reported together much more frequently than would be expected by chance. CTSL1 and CTSB were reported together in 12 CCancer records, while CTSB and CTSL1 were individually reported 78 and 49 times, respectively (P < 0.05). Interestingly, 7 of these 12 papers were related to ‘cancer’ and 3 to ‘leukemia’.
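To give a feeling for why 12 joint reports is far more than chance would produce, the following minimal sketch compares the observed count with a simple null model. It assumes, purely for illustration, that each of the 78 and 49 reports corresponds to a distinct gene list among the 3369 lists in the database, and that co-occurrence under independence follows a hypergeometric distribution. The text does not state which test CCancer uses for gene pairs, so the function `co_report_p_value` is hypothetical and not the tool's actual procedure.

```python
from math import comb


def co_report_p_value(n_lists, n_a, n_b, n_both):
    """Upper-tail hypergeometric probability (illustrative null model).

    Probability that at least n_both lists contain both genes when one
    gene occurs in n_a lists, the other in n_b lists, and the two are
    distributed over n_lists lists independently of each other.
    """
    total = comb(n_lists, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_lists - n_a, n_b - k)
        for k in range(n_both, min(n_a, n_b) + 1)
    )
    return tail / total


N_LISTS = 3369  # number of gene lists in the database (from the abstract)

# Expected number of joint reports if the two genes were independent.
expected = 78 * 49 / N_LISTS
p_value = co_report_p_value(N_LISTS, 78, 49, 12)
print(f"expected by chance: {expected:.2f}, observed: 12, P = {p_value:.2e}")
```

Under these assumptions only about one joint report would be expected by chance, which is consistent with the statement that the pair is reported together much more often than expected.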

Table 2.

An example of a CCancer output: a query list of autophagy- and senescence-related genes

Gene ID | Associated gene | Number of times reported together | Disease terms associated with gene pair
CTSL1 | CTSB | 12 (78,49) | Leukemia (3), cancer (7), carcinoma (3)
CTSL1 | CTSS | 5 (49,29) | Cancer (3), carcinoma (2)
CTSL1 | CTSD | 9 (66,49) | Leukemia (4), cancer (3), disease (3), myeloid leukemia (2)
CTSL1 | CTSA | 5 (49,20) | Leukemia (2), cancer (3)

Gene pairs frequently reported together (a part of the table).

Example 2: Searching for novel potential clinical applications of drugs

CCancer can further be exploited to find hints for novel clinical applications of known drugs (25) or of drugs under development, provided that the list of the drug's molecular targets is known. For a variety of cell disorders (including most cancer subtypes), CCancer stores lists of genes found to be in a different state in disordered cells than in normal cells. This type of information can be mined to identify potential new therapeutic applications. A significant number of genes shared between the drug targets and a database gene list related to some cell disorder may indicate that the drug is applicable to the corresponding disease.

For example, Bosutinib is a novel promiscuous kinase inhibitor. We extracted a reported list of direct interactors of Bosutinib identified by chemical proteomics (26). We used CCancer to identify previously reported gene lists that share a significant number of genes/proteins with this list (http://mips.helmholtz-muenchen.de/proj/ccancer/example.html) and, thus, to identify potential physiological conditions in which an application of Bosutinib might be effective. In addition to already known applications, such as different types of ‘leukemia’ (27,28), CCancer suggests other specific cancer types, such as ‘oral squamous cell carcinoma’ (28).

DISCUSSION

Here, we have generated a comprehensive collection of gene lists reported in papers from 80 peer-reviewed journals. Tables in articles usually present gene lists that are, to some extent, pre-processed and selected for significance by experts. To our knowledge, no existing text-mining system provides a similarly accessible and comparable collection of experimentally derived gene lists for analysis.

The CCancer database provides a computational interface that generates a highly informative survey of recently published cancer-related studies reporting gene lists that significantly intersect with the query. As we have demonstrated, this automatic analysis can reveal links to previous studies that are sometimes unexpected. Gaining these insights by a manual analysis of the literature would be a tedious, if not impossible, task for experimental researchers. Articles that contain significantly intersecting gene lists are not necessarily listed as ‘related articles’ in PubMed. In addition, CCancer implements a robust statistical treatment of the intersection between a query and a database gene list, and provides a valid estimate of the P-values by a Monte Carlo simulation procedure. The P-values reflect the probability of getting an intersection of the same size, in terms of the number of genes, for any random query gene list.
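The idea behind such a simulation-based P-value can be sketched as follows. This is a simplified illustration for a single database list, not the procedure implemented in CCancer: the function name `monte_carlo_p_value`, the background gene `universe`, the number of iterations and the choice to count random intersections that are at least as large as the observed one are all assumptions made for the example.

```python
import random


def monte_carlo_p_value(query, db_list, universe, n_iter=2000, seed=0):
    """Estimate how often a random query of the same size overlaps
    db_list at least as strongly as the real query does.

    query, db_list : iterables of gene identifiers
    universe       : background genes from which random queries are drawn
                     (a hypothetical choice; the real background may differ)
    """
    rng = random.Random(seed)
    background = sorted(set(universe))
    query_set = set(query) & set(background)
    db_set = set(db_list)
    observed = len(query_set & db_set)

    hits = 0
    for _ in range(n_iter):
        # Draw a random query list with the same number of genes.
        sample = rng.sample(background, len(query_set))
        if sum(gene in db_set for gene in sample) >= observed:
            hits += 1

    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_iter + 1)


# Toy usage with made-up gene identifiers.
universe = [f"g{i}" for i in range(1000)]
db_list = universe[:50]
obs_hit, p_hit = monte_carlo_p_value(universe[:10], db_list, universe)
obs_miss, p_miss = monte_carlo_p_value(universe[500:510], db_list, universe)
```

In the toy usage, the first query lies entirely inside the database list and receives a small estimated P-value, whereas the second query shares no genes with it and receives a P-value of 1. A full implementation would repeat this over all database lists and account for multiple testing.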

FUNDING

This work was supported by the Helmholtz Alliance on Systems Biology (project ‘CoReNe’). Funding for open access charge: Helmholtz Center Munich – German Research Center for Environmental Health (GmbH).

Conflict of interest statement. None declared.

REFERENCES

1. Fernandes TG, Diogo MM, Clark DS, Dordick JS, Cabral JM. High-throughput cellular microarray platforms: applications in drug discovery, toxicology and stem cell research. Trends Biotechnol., 2009, vol. 27 (pg. 342-349)
2. Powell AK, Zhi ZL, Turnbull JE. Saccharide microarrays for high-throughput interrogation of glycan-protein binding interactions. Methods Mol. Biol., 2009, vol. 534 (pg. 313-329)
3. Krallinger M, Valencia A, Hirschman L. Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol., 2008, vol. 9 Suppl. 2 (pg. S8)
4. Cheng D, Knox C, Young N, Stothard P, Damaraju S, Wishart DS. PolySearch: a web-based text mining system for extracting relationships between human diseases, genes, mutations, drugs and metabolites. Nucleic Acids Res., 2008, vol. 36 (pg. W399-W405)
5. Ongenaert M, Van Neste L, De Meyer T, Menschaert G, Bekaert S, Van Criekinge W. PubMeth: a cancer methylation database combining text-mining and expert annotation. Nucleic Acids Res., 2008, vol. 36 (pg. D842-D846)
6. Fang YC, Huang HC, Juan HF. MeInfoText: associated gene methylation and cancer information from text mining. BMC Bioinformatics, 2008, vol. 9 (pg. 22)
7. Kaur M, Radovanovic A, Essack M, Schaefer U, Maqungo M, Kibler T, Schmeier S, Christoffels A, Narasimhan K, Choolani M, et al. Database for exploration of functional context of genes implicated in ovarian cancer. Nucleic Acids Res., 2009, vol. 37 (pg. D820-D823)
8. Essack M, Radovanovic A, Schaefer U, Schmeier S, Seshadri SV, Christoffels A, Kaur M, Bajic VB. DDEC: Dragon database of genes implicated in esophageal cancer. BMC Cancer, 2009, vol. 9 (pg. 219)
9. Antonov AV, Dietmann S, Wong P, Igor R, Mewes HW. PLIPS, an automatically collected database of protein lists reported by proteomics studies. J. Proteome. Res., 2009, vol. 8 (pg. 1193-1197)
10. Huang dW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res., 2009, vol. 37 (pg. 1-13)
11. Dietmann S, Georgii E, Antonov A, Tsuda K, Mewes HW. The DICS repository: module-assisted analysis of disease-related gene lists. Bioinformatics, 2009, vol. 25 (pg. 830-831)
12. Berriz GF, King OD, Bryant B, Sander C, Roth FP. Characterizing gene sets with FuncAssociate. Bioinformatics, 2003, vol. 19 (pg. 2502-2504)
13. Antonov AV, Schmidt T, Wang Y, Mewes HW. ProfCom: a web tool for profiling the complex functionality of gene groups identified from high-throughput data. Nucleic Acids Res., 2008, vol. 36 (pg. W347-W351)
14. Antonov AV, Dietmann S, Wong P, Lutter D, Mewes HW. GeneSet2miRNA: finding the signature of cooperative miRNA activities in the gene lists. Nucleic Acids Res., 2009, vol. 37 (pg. W323-W328)
15. Westfall PN, Young SS. Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. 1993, New York, John Wiley & Sons Inc
16. Antonov AV, Dietmann S, Mewes HW. KEGG spider: interpretation of genomics data in the context of the global gene metabolic network. Genome Biol., 2008, vol. 9 (pg. R179)
17. Antonov AV, Dietmann S, Rodchenkov I, Mewes HW. PPI spider: a tool for the interpretation of proteomics data in the context of protein-protein interaction networks. Proteomics, 2009, vol. 9 (pg. 2740-2749)
18. Young AR, Narita M, Ferreira M, Kirschner K, Sadaie M, Darot JF, Tavare S, Arakawa S, Shimizu S, Watt FM, et al. Autophagy mediates the mitotic senescence transition. Genes Dev., 2009, vol. 23 (pg. 798-803)
19. Staber PB, Linkesch W, Zauner D, Beham-Schmid C, Guelly C, Schauer S, Sill H, Hoefler G. Common alterations in gene expression and increased proliferation in recurrent acute myeloid leukemia. Oncogene, 2004, vol. 23 (pg. 894-904)
20. Gronborg M, Bunkenborg J, Kristiansen TZ, Jensen ON, Yeo CJ, Hruban RH, Maitra A, Goggins MG, Pandey A. Comprehensive proteomic analysis of human pancreatic juice. J. Proteome. Res., 2004, vol. 3 (pg. 1042-1055)
21. Knoops L, Haas R, de Kemp S, Majoor D, Broeks A, Eldering E, de Boer JP, Verheij M, van Ostrom C, de Vries A, et al. In vivo p53 response and immune reaction underlie highly effective low-dose radiotherapy in follicular lymphoma. Blood, 2007, vol. 110 (pg. 1116-1122)
22. Jen KY, Cheung VG. Identification of novel p53 target genes in ionizing radiation response. Cancer Res., 2005, vol. 65 (pg. 7666-7673)
23. Benes P, Vetvicka V, Fusek M. Cathepsin D–many functions of one aspartic protease. Crit Rev. Oncol. Hematol., 2008, vol. 68 (pg. 12-28)
24. Cordes C, Bartling B, Simm A, Afar D, Lautenschlager C, Hansen G, Silber RE, Burdach S, Hofmann HS. Simultaneous expression of Cathepsins B and K in pulmonary adenocarcinomas and squamous cell carcinomas predicts poor recurrence-free and overall survival. Lung Cancer, 2009, vol. 64 (pg. 79-85)
25. Antonov AV. Mining protein lists from proteomics studies: applications for drug discovery. Expert Opin. Drug Discovery, 2010, vol. 5 (pg. 322-331)
26. Fernbach NV, Planyavsky M, Muller A, Breitwieser FP, Colinge J, Rix U, Bennett KL. Acid elution and one-dimensional shotgun analysis on an Orbitrap mass spectrometer: an application to drug affinity chromatography. J. Proteome. Res., 2009, vol. 8 (pg. 4753-4765)
27. Walters DK, Goss VL, Stoffregen EP, Gu TL, Lee K, Nardone J, McGreevey L, Heinrich MC, Deininger MW, Polakiewicz R, et al. Phosphoproteomic analysis of AML cell lines identifies leukemic oncogenes. Leuk. Res., 2006, vol. 30 (pg. 1097-2004)
28. Sticht C, Freier K, Knopfle K, Flechtenmacher C, Pungs S, Hofele C, Hahn M, Joos S, Lichter P. Activation of MAP kinase signaling through ERK5 but not ERK1 expression is associated with lymph node metastases in oral squamous cell carcinoma (OSCC). Neoplasia, 2008, vol. 10 (pg. 462-470)

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
