GO-function: deriving biologically relevant functions from statistically significant functions

Wang, Jing; Zhou, Xianxiao; Zhu, Jing; Gu, Yunyan; Zhao, Wenyuan; Zou, Jinfeng; Guo, Zheng

doi:10.1093/bib/bbr041

Abstract

In high-throughput studies of diseases, terms enriched with disease-related genes based on Gene Ontology (GO) are routinely found. However, most current algorithms used to find significant GO terms cannot handle the redundancy that results from the dependencies of GO terms. Simply based on some numerical considerations, current algorithms developed for reducing this redundancy may produce results that do not account for biologically interesting cases. In this article, we present several rules used to design a tool called GO-function for extracting biologically relevant terms from statistically significant GO terms for a disease. Using one gene expression profile for colorectal cancer, we compared GO-function with four algorithms designed to treat redundancy. Then, we validated results obtained in this data set by GO-function using another data set for colorectal cancer. Our analysis showed that GO-function can identify disease-related terms that are more statistically and biologically meaningful than those found by the other four algorithms.

Gene Ontology, enrichment analysis, redundancy, statistically significant term, biologically relevant term

INTRODUCTION

To study a disease based on a high-throughput technology, a list of ‘interesting genes’ is usually extracted. Depending on the goal of a study, ‘interesting genes’ often represents genes related to a disease such as genes significantly differentially expressed in disease samples in comparison with normal controls or genes whose mutations elicit a phenotype of interest. Then, terms significantly enriched with the extracted interesting genes are routinely found using the Gene Ontology (GO) database [1], based on the assumption that terms in which the frequencies of interesting genes are significantly higher than expected by chance are likely to be relevant to the disease. We could also assume that the frequency of interesting genes in a GO term can measure the term’s relevance to a given disease. At least 68 enrichment analysis tools based on GO were developed prior to 2008 [2], and new tools continue to appear [3, 4]. The most often applied tools belong to a class of tools called singular enrichment analysis (SEA), which calculates enrichment P-values for all GO terms based on a list of interesting genes [2]. However, because each gene annotated by a term is also annotated by its ancestor terms according to GO’s ‘true path’ rule [1, 5], many terms found by SEA tools exhibit a redundant ancestor–offspring relationship. One popular approach to address this problem is to simply select the most specific GO terms from these terms [6–11], assuming that specific terms might be more biologically meaningful. However, a complex disease such as cancer could be a ‘systems disease’ with global molecular changes [12, 13]. Thus, whether a ‘specific’ or a ‘general’ term is more biologically relevant should be judged on the basis of the data rather than a priori assumption.

Several tools have been developed to address the redundancy problem [2]. For example, Grossmann et al. [14] developed the parent–child union (PCU) algorithm to analyse the significance of each GO term using genes in its parent term(s) as the background. However, the frequency of interesting genes in a term selected by PCU is not guaranteed to be significantly higher than the random background (i.e. the frequency of interesting genes in all genes under study) if the frequency of interesting genes in its parent term(s) happens to be lower than the random background (see examples in the ‘Results’ section). Another frequently used algorithm, elim [15], analyses terms iteratively from the most specific level to the most general level of GO: once a GO term is determined to be statistically significant, all genes associated with it are removed in the following analysis of its ancestor terms. However, elim may miss significant terms because excluding genes in significant terms from subsequent analysis affects the ranks of all terms in the FDR control procedure. These two algorithms combine the process of identifying statistically significant terms with the process of treating redundant terms, which may introduce the above-mentioned problems. Thus, we believe that it is more appropriate to treat redundancy among the statistically significant terms pre-selected using a traditional enrichment analysis tool. As shown in Figure 1A, an ancestor term and an offspring term can both be detected as statistically significant in several situations. First, changes in genes in only the offspring term may actually be related to the disease, such that non-random signals in this term may lead to the entire ancestor term being detected as statistically significant. In this situation, one should only retain the offspring term (Case 1-1).
Second, changes in genes in the entire ancestor term may be truly related to the disease, which may result in one of its offspring terms being detected as statistically significant. In this situation, two cases might be biologically meaningful: either the frequency of interesting genes in the offspring term is not different from that in the other genes of the ancestor term after removing genes in the offspring term (Case 1-2), or it is different from that in the other genes of the ancestor term (Case 1-3). In Case 1-2, we should only retain the entire ancestor term. In Case 1-3, we could select both terms because their relationship might be biologically meaningful (see the Materials and Methods section for details). Because PCU and elim are only based on numerical considerations, they may produce results that do not account for these biologically interesting cases.

Figure 1:

(A) Cases of GO structure with two statistically significant terms in an ancestor–offspring relationship; (B) cases of multi-function genes with two statistically significant terms with overlapping genes. See explanations for each case in the ‘Introduction’ section.

Another concern in gene enrichment analysis is the treatment of genes with multiple annotations to terms with no ancestor–offspring relationship. Some proposed methods penalize overlapping terms to treat this problem, based on the assumption that one of the overlapping terms could numerically explain the subset of genes belonging to any of these terms. For example, GENerative GO analysis (GenGO) [3] maximizes a log-likelihood function by penalizing the overall number of active terms. Similarly, model-based gene set analysis (MGSA) [4] uses the prior probabilities of terms being in the active state to promote parsimonious explanations of the data. Neither algorithm uses statistical tests for term selection; thus, they may identify terms that are not statistically relevant to the disease under study. As shown in Figure 1B, two terms sharing genes can both be identified as statistically significant in several situations. First, if changes in genes in only one term are truly related to a disease, another irrelevant term may be detected as statistically significant (Case 2-1). Second, genes with multiple annotations may actually play important roles in multiple pathways [7, 16–19], such that both terms are related to the disease (Case 2-2) (see the Materials and Methods section for details). As demonstrated in the ‘Results’ section, the two algorithms described above may miss most biologically relevant terms.

In this work, we designed a tool called GO-function for deriving biologically relevant, non-redundant terms from statistically significant terms for a disease. First, GO-function pre-selects statistically significant GO terms at a false discovery rate (FDR) control level. Then, from these statistically significant terms, it searches for terms which are biologically relevant to a disease according to several explicit rules for the cases described in Figure 1. Using one gene expression profile for colorectal cancer, we compared GO-function with the four above-mentioned algorithms. Compared with these four tools, GO-function can find GO terms that are more statistically and biologically meaningful. Finally, we used another gene expression profile for colorectal cancer to validate results obtained by GO-function in the previous gene expression profile. The GO-function R package can be found at its project homepage at http://bioinfo.hrbmu.edu.cn/fcensus/GO_function.jsp.

MATERIALS AND METHODS

Data sets

We mainly analysed a gene expression profile (referred to as the Sabates-Bellver data set) based on 32 pairs of colorectal adenoma tissue and adjacent normal mucosa [20]. Another gene expression profile (referred to as the Hong data set) [21], including 70 early stage colorectal cancer samples and 12 normal control samples, was used for validating results obtained from the Sabates-Bellver data set. Each of the two data sets was preprocessed using the robust multi-array average (RMA) package [22]. The SOURCE database [23] (April 2009) was used for the translation of ProbeIDs to GeneIDs. For each data set, differentially expressed (DE) genes between the disease samples and the normal samples were selected by the significance analysis of microarrays (SAM) algorithm [24] at a given FDR level.

In this work, we analysed the biological process (BP) GO categories obtained from Bioconductor GO.db on 21 March 2010 [25]. GO-function is currently applicable to 18 organisms including human (see details in the Document at http://bioinfo.hrbmu.edu.cn/fcensus/GO_function.jsp) and it will be readily applicable to other organisms when their gene annotation packages become available in Bioconductor. GO-function uses the Rgraphviz package [25] to output the hierarchy structure of the identified terms. Because the Rgraphviz package is dependent on the non-R tool graphviz (http://www.graphviz.org/), users should install the graphviz tool before running GO-function.

Deriving biologically relevant terms from statistically significant terms

At a given FDR control level, GO-function first selects statistically significant terms from all terms based on the hypergeometric distribution model [26], with enrichment probabilities (i.e. P-values) corrected by the Benjamini and Yekutieli FDR procedure [27]. For these terms, it treats local redundancy between ancestor and offspring terms produced by GO’s ‘true path’ rule [1, 5]. The comparison between an ancestor term and its offspring terms is to find whether the entire ancestor term or only its offspring terms is related to the disease. For all the three types of ancestor–offspring relationships, which are ‘Part-of’, ‘Is-a’, and ‘Regulate’, the genes in the offspring term are a subset of all the genes in the ancestor term according to the ‘true path’ rule (http://www.geneontology.org/GO.ontology.relations.shtml). Finally, for the terms retained in the second step, GO-function treats the global redundancy between non-ancestor–offspring terms with overlapping genes.
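The pre-selection step described above can be sketched in a few lines. This is a minimal Python illustration, not the GO-function R implementation: the function names are hypothetical, and the one-sided (upper-tail) form of the hypergeometric test is an assumption about how enrichment is scored.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for a term with n genes, k of them interesting, when the
    background holds N genes of which K are interesting."""
    total = comb(N, n)
    upper = min(n, K)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def benjamini_yekutieli(pvalues):
    """Benjamini-Yekutieli adjusted P-values, returned in the input order."""
    m = len(pvalues)
    c_m = sum(1.0 / i for i in range(1, m + 1))  # harmonic correction factor
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    # Walk from the largest P-value down to enforce monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A term would then be pre-selected when its adjusted P-value falls below the chosen FDR level (5% in the Results).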

Local redundancy treatment

To reduce local redundancy among the pre-selected, statistically significant GO terms, GO-function processes these terms iteratively from the most specific level to the most general level of GO according to three rules as described below. An ancestor term is removed after comparison with its significant offspring term(s) if there is no evidence that it is still related to the disease after excluding genes in its significant offspring terms. Formally, an ancestor term is removed if (i) after the removal of genes in its significant offspring term(s), the frequency of interesting genes in the remaining genes is not significantly higher than the random background at a given significance level (P-value cutoff), and (ii) this frequency is lower than that in its significant offspring terms (Rule 1-1 for Case 1-1 in Figure 1A). Otherwise, the ancestor term is retained. In this situation, the offspring terms are removed if at a given significance level these terms are not significantly enriched with interesting genes when using the ancestor term as the background (Rule 1-2 for Case 1-2 in Figure 1A). Otherwise, the offspring terms are also retained (Rule 1-3 for Case 1-3 in Figure 1A).
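The three rules can be read as a small decision procedure. The sketch below is a simplified Python illustration for one ancestor term and the pooled genes of its significant offspring term(s); it does not reproduce the iterative level-by-level traversal, and the names as well as the use of a one-sided hypergeometric test for both comparisons are assumptions made for illustration.

```python
from math import comb


def upper_tail(k, n, K, N):
    """One-sided hypergeometric P(X >= k); see the pre-selection step."""
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / comb(N, n)


def local_redundancy_rule(ancestor, offspring, background, alpha=0.05):
    """Each argument is a (genes, interesting_genes) pair of counts.

    Returns (keep_ancestor, keep_offspring, rule_label).
    """
    anc_n, anc_k = ancestor
    off_n, off_k = offspring
    bg_n, bg_k = background

    # Genes of the ancestor that are not in the significant offspring term(s).
    rest_n, rest_k = anc_n - off_n, anc_k - off_k
    rest_freq = rest_k / rest_n if rest_n else 0.0
    off_freq = off_k / off_n
    rest_p = upper_tail(rest_k, rest_n, bg_k, bg_n) if rest_n else 1.0

    # Rule 1-1: the remainder is neither enriched nor as dense as the offspring.
    if rest_p >= alpha and rest_freq < off_freq:
        return (False, True, "Rule 1-1")

    # Rule 1-2: offspring is not enriched relative to the ancestor as background.
    off_p = upper_tail(off_k, off_n, anc_k, anc_n)
    if off_p >= alpha:
        return (True, False, "Rule 1-2")

    # Rule 1-3: both terms carry their own signal.
    return (True, True, "Rule 1-3")
```

Passing counts rather than gene sets keeps the sketch short; a fuller version would take gene identifiers so that the offspring union can be computed when an ancestor has several significant offspring terms.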

An offspring term that is considered redundant with one of its ancestor terms is still retained for comparison with other ancestor terms; it is finally removed after comparison with all ancestor–offspring terms. Notably, because the statistical power of detecting significance for a smaller set of genes tends to be lower, elim may miss some ancestor terms. GO-function remedies this problem to some extent with Rules 1-2 and 1-3 (see the ‘Results’ section for examples). However, some significant offspring terms may be removed simply because the statistical tests used for decision making have insufficient power.

Global redundancy treatment

GO-function offers an option for treating global redundancy among the terms retained after treating for local redundancy. For a pair of terms with overlapping genes, GO-function only retains one term if the following conditions apply (Rule 2-1 for Case 2-1 in Figure 1B): (i) there is evidence that the non-overlapping genes in a term may be related to the disease if at a given significance level the frequency of interesting genes in the non-overlapping genes is not significantly different from that in the overlapping genes or is significantly higher than the random background; and (ii) there is no such evidence for the non-overlapping genes of another term. In all other situations, GO-function retains both terms (Rule 2-2 for Case 2-2 in Figure 1B).
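Rule 2-1 and Rule 2-2 can be sketched in the same spirit. The Python below is only an illustration under stated assumptions: the text does not name the test used to compare the non-overlapping genes with the overlapping genes, so a two-sided Fisher exact test is swapped in here, and all function names are hypothetical. Both terms are assumed to have at least one non-overlapping gene.

```python
from math import comb


def upper_tail(k, n, K, N):
    """One-sided hypergeometric P(X >= k)."""
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / comb(N, n)


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact P-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    denom = comb(r1 + r2, c1)

    def prob(x):
        return comb(r1, x) * comb(r2, c1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def has_evidence(non_overlap, overlap, background, alpha):
    """Condition (i): non-overlapping genes look disease-related on their own."""
    non_n, non_k = non_overlap
    ov_n, ov_k = overlap
    bg_n, bg_k = background
    enriched = upper_tail(non_k, non_n, bg_k, bg_n) < alpha
    same_as_overlap = fisher_two_sided(non_k, non_n - non_k,
                                       ov_k, ov_n - ov_k) >= alpha
    return enriched or same_as_overlap


def global_redundancy_rule(a_only, b_only, overlap, background, alpha=0.05):
    """Return the labels of the terms to retain; inputs are (genes, interesting)."""
    ev_a = has_evidence(a_only, overlap, background, alpha)
    ev_b = has_evidence(b_only, overlap, background, alpha)
    if ev_a and not ev_b:
        return ["A"]          # Rule 2-1
    if ev_b and not ev_a:
        return ["B"]          # Rule 2-1
    return ["A", "B"]         # Rule 2-2: all other situations
```

Note that when neither term shows evidence, both are retained, which mirrors the conservative stance the text takes towards deleting overlapping terms.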

We highlight that local and global redundancy treatments should be considered with different confidence. When comparing an ancestor term with its offspring terms, we would rather be certain that the entire ancestor term is either redundant or non-redundant. On the other hand, for terms with overlapping genes, we would rather be cautious about deleting a term determined to be redundant only on the basis of numerical evidence, because many genes with multiple annotations may truly play roles in multiple functions [7, 16–19]. Thus, to facilitate a comprehensive interpretation of the enrichment results, GO-function outputs the following: (i) all statistically significant terms, organized according to their ancestor–offspring relationship; (ii) any terms remaining after treating for local redundancy; and (iii) the terms remaining after treating for global redundancy.

Comparison

We compared GO-function with PCU [14] and elim [15]. The P-values obtained from both PCU and elim were corrected using the Benjamini and Yekutieli FDR procedure [27]. We also compared GO-function with GenGO [3] and MGSA [4]. Because MGSA uses a Monte Carlo random walk algorithm, we considered that a term was significant only if it was significant in at least five of ten runs of MGSA.
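The run-consensus criterion applied to MGSA can be expressed compactly. The following Python sketch assumes that each run has already been reduced to the set of terms it called significant; how an individual MGSA run decides significance is not covered here, and the function name is hypothetical.

```python
from collections import Counter


def consensus_terms(runs, min_runs=5):
    """Keep terms reported as significant in at least `min_runs` of the runs.

    `runs` is an iterable of collections of term identifiers, one per run.
    """
    # set(run) guards against a term being counted twice within one run.
    counts = Counter(term for run in runs for term in set(run))
    return {term for term, count in counts.items() if count >= min_runs}
```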

We used the percentage of overlapping (PO) score [28] to measure the consistency between two lists of terms found by two algorithms: if k terms are shared by term list 1 with length L₁ and term list 2 with length L₂, then the PO score is calculated as PO = [(k/L₁+k/L₂)/2] × 100%.
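As a quick check of the formula, the PO score can be computed directly from two term lists. This is a minimal Python sketch with a hypothetical function name; it treats each list as a set of unique term identifiers.

```python
def po_score(terms_1, terms_2):
    """Percentage-of-overlapping score between two lists of terms, in percent."""
    set_1, set_2 = set(terms_1), set(terms_2)
    if not set_1 or not set_2:
        raise ValueError("both term lists must be non-empty")
    k = len(set_1 & set_2)  # number of shared terms
    return (k / len(set_1) + k / len(set_2)) / 2 * 100
```

For instance, two lists of 33 and 52 terms that share 17 terms give (17/33 + 17/52)/2 × 100%, which is about 42.1%.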

RESULTS

Comparing local redundancy test to PCU and elim

In the analysis of the Sabates-Bellver microarray data on colorectal cancer, we used SAM with an FDR of 1% to identify 9201 genes as DE genes between the colorectal adenoma tissue samples and the adjacent normal mucosa samples. Using GO-function, 108 GO terms were identified as statistically significant with a 5% FDR. We set the P-value cutoff to 0.05 for local redundancy treatment with respect to ancestor–offspring relationships. After treatment for local redundancy, 33 GO terms remained (see Table 1 and Supplementary Figure S1).

Table 1:

Analysis of terms found by GO-function in the Sabates-Bellver microarray data for colorectal cancer

GOID	Name	Global redundancy	PCU	Elim
GO:0000075	Cell cycle checkpoint	N^a	Rule 1-1^b	Overlap^c
GO:0000226	Microtubule cytoskeleton organization	Y	Overlap	Overlap
GO:0006260	DNA replication	N	Overlap	Overlap
GO:0006261	DNA-dependent DNA replication	N	Rule 1-1	Overlap
GO:0006396	RNA processing	N	Overlap	Rule 1-2 and 1-3^b
GO:0006412	Translation	N	Overlap	Overlap
GO:0006974	Response to DNA damage stimulus	N	Overlap	Rule 1-2 and 1-3
GO:0007005	Mitochondrion organization	N	Rule 1-1	Rule 1-2 and 1-3
GO:0007049	Cell cycle	N	Overlap	Rule 1-2 and 1-3
GO:0007059	Chromosome segregation	N	Overlap	Overlap
GO:0007067	Mitosis	N	Rule 1-1	Overlap
GO:0007346	Regulation of mitotic cell cycle	Y	Rule 1-1	Overlap
GO:0009451	RNA modification	N	Overlap	Overlap
GO:0010212	Response to ionizing radiation	N	Overlap	Overlap
GO:0010564	Regulation of cell cycle process	N	Rule 1-1	Rule 1-2 and 1-3
GO:0022403	Cell cycle phase	N	Rule 1-1	Rule 1-2 and 1-3
GO:0022618	Ribonucleoprotein complex assembly	N	Rule 1-1	Rule 1-2 and 1-3
GO:0031145	Anaphase-promoting complex-dependent proteasomal ubiquitin-dependent protein catabolic process	Y	Rule 1-1	Overlap
GO:0032268	Regulation of cellular protein metabolic process	Y	Overlap	Rule 1-2 and 1-3
GO:0032446	Protein modification by small protein conjugation	Y	Rule 1-1	Rule 1-2 and 1-3
GO:0032508	DNA duplex unwinding	N	Rule 1-1	Rule 1-2 and 1-3
GO:0034621	Cellular macromolecular complex subunit organization	N	Overlap	Rule 1-2 and 1-3
GO:0042221	Response to chemical stimulus	N	Overlap	Rule 1-2 and 1-3
GO:0042254	Ribosome biogenesis	N	Rule 1-1	Overlap
GO:0043161	Proteasomal ubiquitin-dependent protein catabolic process	Y	Overlap	Rule 1-2 and 1-3
GO:0044419	Interspecies interaction between organisms	N	Overlap	Rule 1-2 and 1-3
GO:0048518	Positive regulation of biological process	Y	Rule 1-1	Rule 1-2 and 1-3
GO:0050658	RNA transport	N	Rule 1-1	Overlap
GO:0051301	Cell division	Y	Overlap	Overlap
GO:0051351	Positive regulation of ligase activity	N	Overlap	Rule 1-2 and 1-3
GO:0051436	Negative regulation of ubiquitin–protein ligase activity during mitotic cell cycle	N	Rule 1-1	Overlap
GO:0051726	Regulation of cell cycle	N	Overlap	Rule 1-2 and 1-3
GO:0065003	Macromolecular complex assembly	N	Rule 1-1	Rule 1-2 and 1-3

^a‘N’ indicates that there was evidence that the significance of this term was not simply due to the overlapping genes, and ‘Y’ indicates that there was no such evidence; ^b‘Rule 1-1’ and ‘Rule 1-2 and 1-3’ indicate that a term was retained by GO-function according to the corresponding rules described in the ‘Materials and Methods’ section but was missed by the other method named in the column header; ^c‘Overlap’ indicates that a term found by GO-function was also found by the other method.

Using PCU with a 5% FDR, 52 terms enriched with DE genes were found. Among these, seven terms were not statistically significant and were removed by GO-function, 17 were also identified by GO-function, and 28 were replaced by their offspring or ancestor terms identified by GO-function during local redundancy treatment. The detailed results of terms found by PCU are shown in Supplementary Table S1. The seven removed terms were not statistically relevant to the disease, but PCU detected them as significant because the frequencies of DE genes in their corresponding parent terms used as the background by PCU were lower than the random background. For example, the term ‘locomotory behavior’ (GO:0007626) contained 265 genes, among which 149 were DE genes, and its parent term ‘behavior’ (GO:0007610) contained 47.1% DE genes, which was lower than the 51.6% random background. Thus, the child term ‘locomotory behavior’ was significant when its parent term was set as the background by PCU (adjusted P = 4.9E-3), whereas it was not significant when all of the annotated genes measured in the microarray data were set as the background (adjusted P = 1.0). As shown in Supplementary Table S1, the other 27 terms found by PCU were removed by GO-function based on Rule 1-1 during local redundancy treatment. For each of these removed terms, after the removal of the genes in its significant offspring term(s), the frequency of DE genes in the remaining genes was lower than that in its corresponding offspring terms; moreover, this frequency was not significantly higher than the random background (P > 0.05). For example, as shown in Figure 2A, ‘cellular response to stress’ (GO:0033554) was identified as a significant term by PCU (adjusted P = 9.1E-4). 
However, after removing the genes in its child term ‘response to DNA damage stimulus’ (GO:0006974), the remaining genes associated with ‘cellular response to stress’ had a lower frequency of DE genes (54.2%) than the child term (71.1%) and were not enriched with DE genes (P = 0.22). Thus, the term ‘cellular response to stress’ was removed by GO-function. As another example, ‘organelle fission’ (GO:0048285) was found by PCU (adjusted P = 5.5E-6). However, after removing the genes in its offspring term ‘mitosis’ (GO:0007067), the remaining genes associated with ‘organelle fission’ had a lower frequency of DE genes (62.5%) than the offspring term (75.1%) and were not significantly enriched with DE genes (P = 0.40). Thus, ‘organelle fission’ was removed by GO-function.

Figure 2:

Some terms identified by PCU or elim but removed by GO-function. (A) The term ‘cellular response to stress’ found by PCU was replaced by its child term ‘response to DNA damage stimulus’ identified by GO-function during local redundancy treatment. Here, we used different grey scales to represent the frequencies of DE genes in different terms. After removing genes in the child term (i.e. the smaller grey ellipse in the larger ellipse), the remaining genes of ‘cellular response to stress’ (i.e. the light grey part of the larger ellipse) had a lower frequency of DE genes than the child term and were not enriched with DE genes; (B) the term ‘positive regulation of ubiquitin–protein ligase activity during mitotic cell cycle’ was identified by elim; it was replaced by its ancestor term ‘positive regulation of ligase activity’ identified by GO-function. After removing genes in the offspring term (i.e. the dark grey ellipse part of the largest ellipse), the frequency of the DE genes in the remaining genes in ‘positive regulation of ligase activity’ (i.e. the black part of the largest ellipse) was higher than the frequency in the offspring term.

As shown in Table 1, among the 33 terms identified by GO-function, 17 terms were also found by PCU, and another 16 terms that were retained by GO-function according to Rule 1-1 were missed by PCU because the frequencies of DE genes in their parent terms were higher than the random background. For example, the term ‘protein modification by small protein conjugation’ (GO:0032446) was annotated with 235 genes, including 159 DE genes; the term was significant when the background was set to include all annotated genes measured in the microarray data (adjusted P = 3.9E-4). However, it was not significant when using its parent term ‘protein modification by small protein conjugation or removal’ (GO:0070647), which was annotated with 265 genes, including 178 DE genes, as the background (adjusted P = 1.0). Moreover, after the removal of genes associated with the term ‘protein modification by small protein conjugation’, the remaining genes associated with ‘protein modification by small protein conjugation or removal’ had a lower frequency of DE genes (63.3%) than did ‘protein modification by small protein conjugation’ (67.7%) and were not enriched with DE genes (P = 0.14). Thus, only ‘protein modification by small protein conjugation’ was retained by GO-function.
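The frequencies quoted in this example follow directly from the reported gene counts. The short Python check below uses only the counts given in the text; the helper name is hypothetical.

```python
def remainder_counts(ancestor, offspring):
    """Counts left in the ancestor after removing the offspring term's genes.

    Each argument is a (de_genes, all_genes) pair.
    """
    return (ancestor[0] - offspring[0], ancestor[1] - offspring[1])


child = (159, 235)    # 'protein modification by small protein conjugation'
parent = (178, 265)   # '... by small protein conjugation or removal'

rest = remainder_counts(parent, child)        # 19 DE genes among 30 genes
child_freq = 100 * child[0] / child[1]        # about 67.7%
rest_freq = 100 * rest[0] / rest[1]           # about 63.3%
```

The remainder therefore comprises only 30 genes, which helps explain why its enrichment test has little power.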

Using elim with an FDR of 5%, 24 terms enriched with DE genes were found, among which 15 terms were also found by GO-function. Another nine terms were replaced by their ancestor terms identified by GO-function during local redundancy treatment according to Rule 1-2. All terms found by elim are described in Supplementary Table S2. Of the 33 terms (Table 1) identified by GO-function, 15 were also found by elim, and 18 were the ancestor terms of some terms found by elim and retained by GO-function according to Rules 1-2 and 1-3. For example, as shown in Figure 2B, when using elim, the term ‘positive regulation of ubiquitin–protein ligase activity during mitotic cell cycle’ (GO:0051437) was found to be significant, and its ancestor term, ‘positive regulation of ligase activity’ (GO:0051351), was removed due to redundancy. After removing the genes in the term ‘positive regulation of ubiquitin–protein ligase activity during mitotic cell cycle’, only six genes remained associated with ‘positive regulation of ligase activity’; all six were DE genes, yet this remainder was not significant according to elim (adjusted P = 1.0). However, the frequency of DE genes among the remaining genes in the ancestor term was 100%, which was higher than the frequency (87.7%) of DE genes in the offspring term ‘positive regulation of ubiquitin–protein ligase activity during mitotic cell cycle’. Thus, the non-significance of ‘positive regulation of ligase activity’ was simply due to the lower statistical power of detecting the significance of the enrichment of DE genes in a smaller gene set. Meanwhile, the frequency of DE genes in the offspring term was not significantly different from that in the ancestor term (P = 1.0). Thus, GO-function retained only the ancestor term ‘positive regulation of ligase activity’ as the disease-relevant term. As another example, elim found the term ‘rRNA processing’ (GO:0006364) (adjusted P = 4.2E-7).
After removing genes in this term, the frequency of DE genes among the remaining genes in its ancestor term ‘ribosome biogenesis’ (GO:0042254) was 93.6%, which was higher than the frequency (88.3%) of DE genes in the offspring term ‘rRNA processing’. Meanwhile, the frequency of DE genes in the offspring term was not significantly different from that in the ancestor term (P = 0.94). Thus, GO-function retained only the ancestor term ‘ribosome biogenesis’.
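The loss of power in small gene sets can be made concrete. The sketch below computes the smallest P-value that a one-sided hypergeometric test can return for a set of n genes, which is reached when all n genes are DE. The background size and the number of DE genes are hypothetical placeholders, because the actual values for the Sabates-Bellver data are not restated in this section, so the output is illustrative only.

```python
from math import comb


def min_attainable_p(n, K, N):
    """Smallest possible upper-tail hypergeometric P-value for a set of n genes:
    the probability that all n genes are DE when K of N background genes are DE."""
    return comb(K, n) / comb(N, n)


# Hypothetical background (an assumption, not taken from the paper):
# 10 000 measured genes, half of which are DE.
N_BACKGROUND, K_DE = 10_000, 5_000

for n in (6, 20, 60):
    print(n, min_attainable_p(n, K_DE, N_BACKGROUND))
```

Under these assumed values, a six-gene set cannot reach a P-value much below 0.5^6 (about 0.016) even when every gene is DE, so it has little chance of surviving a multiple-testing correction over many terms, whereas larger sets can reach far smaller P-values.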

Then, we analysed the data by GO-function using various P-value cutoffs for local redundancy treatment, and compared the results with the terms found by PCU and elim, respectively. Setting the P-value cutoff as 0.005, 0.01, 0.05 and 0.1, we found that the PO score (see ‘Materials and Methods’ section) took values ranging from 40.4 to 42.1% when comparing GO-function with PCU and from 54 to 57.8% when comparing GO-function with elim. Taken together, the above results indicated that terms found by PCU and elim are rather different from the terms found by GO-function.

Finally, we compared the terms found by GO-function using the P-value cutoff of 0.05 with the terms found by GO-function using other P-value cutoffs (0.005, 0.01 and 0.1) for local redundancy treatment. The results showed that the PO score took values ranging from 75.6% to 93.4%. When using various P-value cutoffs (0.005, 0.01 and 0.1) for both local and global redundancy treatments, the PO score between the final term lists took values ranging from 74.5% to 94.6%. These results indicated that GO-function is rather robust when using different P-value cutoffs.

Comparing global redundancy test to MGSA and GenGO

For two terms with overlapping genes, we used a P-value cutoff of 0.05 to test whether the non-overlapping genes of each term were significantly enriched with DE genes and whether their frequency of DE genes was significantly different from that of the overlapping genes. As shown in Table 1, of the 33 terms retained after local redundancy treatment, eight terms were removed according to Rule 2-1. For example, the terms ‘positive regulation of biological process’ (GO:0048518) and ‘cell cycle’ (GO:0007049) contained 319 overlapping genes, among which 69.6% were DE genes. After removing the overlapping genes from the term ‘positive regulation of biological process’, the frequency of DE genes among the remaining genes (53%) was not significantly higher than the random background (P = 0.11) and was significantly lower than the frequency in the overlapping genes (P = 2.4E-8). In contrast, the non-overlapping genes of the term ‘cell cycle’ were enriched with DE genes (P < 2.2E-16), and their frequency of DE genes was not significantly different from that in the overlapping genes (P = 0.19). According to Rule 2-1, the term ‘cell cycle’ was retained, whereas the term ‘positive regulation of biological process’ was not, because we could not find evidence that the significance of the latter was not due to the overlapping genes.
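The comparison described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not GO-function's implementation: the function names, the toy gene sets and the choice of one-sided hypergeometric (Fisher-type) tests are all assumptions, and the decision rule is a simplified reading of the example in the text (a term is flagged when its non-overlapping genes are not enriched and have a significantly lower DE frequency than the overlapping genes).

```python
from math import comb


def upper_tail(k, n, K, N):
    """P(X >= k): n genes drawn from N genes, K of which are DE."""
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1)) / comb(N, n)


def lower_tail(k, n, K, N):
    """P(X <= k) for the same hypergeometric setting."""
    lo = max(0, n - (N - K))
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(lo, k + 1)) / comb(N, n)


def is_globally_redundant(term, other, de, background, alpha=0.05):
    """Assumed simplification of the Rule 2-1 comparison for one term of a pair."""
    overlap, rest = term & other, term - other
    de_rest, de_overlap = len(rest & de), len(overlap & de)
    # (a) are the non-overlapping genes enriched against the whole background?
    p_enrich = upper_tail(de_rest, len(rest), len(de & background), len(background))
    # (b) is their DE frequency lower than in the overlapping genes?
    p_lower = lower_tail(de_rest, len(rest), de_rest + de_overlap, len(rest) + len(overlap))
    return p_enrich >= alpha and p_lower < alpha


# Toy data (hypothetical): 200 genes, the first 60 are DE.
background = {f"g{i}" for i in range(200)}
de_genes = {f"g{i}" for i in range(60)}
specific = {f"g{i}" for i in range(40)}                                       # all DE
general = {f"g{i}" for i in range(20, 40)} | {f"g{i}" for i in range(100, 140)}

print(is_globally_redundant(specific, general, de_genes, background))
print(is_globally_redundant(general, specific, de_genes, background))
```

In this toy setting, the general term owes all of its DE genes to the genes it shares with the specific term, so only the general term is flagged, which mirrors the ‘positive regulation of biological process’ versus ‘cell cycle’ example.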

Using the DE genes found in the Sabates-Bellver data set, MGSA identified only two terms from all annotated BP terms, which were replaced by their significant offspring or ancestor terms identified by GO-function during local redundancy treatment (see Supplementary Table S3). GenGO found only five GO terms from all annotated BP terms (see Supplementary Table S4), among which (i) two terms were also found by GO-function, (ii) one term was replaced by its significant offspring terms identified by GO-function during local redundancy treatment, and (iii) another two terms were not significantly enriched with DE genes. The adjusted P-values of the latter two terms were 0.18 and 1.0, respectively, based on the hypergeometric distribution test with a 5% FDR. GenGO and MGSA tend to over-penalize overlapping terms and miss many biologically relevant terms, as shown in Table 1. Similarly, GenGO and MGSA missed most of the terms found by GO-function with other P-value cutoffs (0.005, 0.01 or 0.1) for treating local and global redundancy.

Because both GenGO and MGSA are based on pre-set optimization goals rather than statistical criteria, it is difficult to interpret their results statistically. We consider this a shortcoming of these two algorithms.

Validation of application results using an independent experimental data set

One major advantage of GO-function is that the comparison between an ancestor term and its significant offspring terms can provide information for determining whether the entire ancestor term or only its offspring term(s) is likely to be biologically related to the disease. This may provide novel insight into the disease mechanism at the systems biology level. Thus, we tested whether the terms found in the Sabates-Bellver data set by GO-function, as well as the ancestor–offspring comparison results, could be reproduced in the Hong data set [21].

In the Hong data set, we used the SAM with an FDR of 1% to identify 7900 genes as DE genes between the colorectal cancer samples and the normal control samples. Using GO-function, 62 GO terms were identified as statistically significant with a 5% FDR. After treatment for local redundancy, 17 GO terms remained (see Table 2 and Supplementary Figure S2), among which eight terms (47.1%) were also found in the Sabates-Bellver data set. Then, we did a randomization experiment to test the null hypothesis that the number (eight) of overlapping terms would be expected for random genes unrelated to the disease. We randomly selected 7900 (i.e. the number of DE genes) genes from the Hong data set and used GO-function with an FDR of 5% to find terms enriched with these genes. Then, we counted the overlaps of the found terms with the terms found in the Sabates-Bellver data set. This process was repeated 10 000 times. The result showed that the average number of overlapping terms was 0.0001, which was significantly smaller than the number (eight) of overlapping terms observed for the same number of DE genes extracted from the Hong data set (P < 1.0E-4). Thus, we could reject the null hypothesis that the observed number (eight) of overlapping terms would be expected for random genes. After treatment for global redundancy, two terms were removed in the Hong data set, and both of them were also removed after treatment for global redundancy of the Sabates-Bellver data set.
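The randomization experiment follows a generic pattern that can be sketched as below. The function names, the toy annotation and the toy term-calling rule are hypothetical stand-ins: in the actual analysis, the term-finding step would be a full GO-function run on the randomly selected genes, which is not reproduced here.

```python
import random


def overlap_null_distribution(genes, n_selected, reference_terms, find_terms,
                              n_iter=10_000, seed=0):
    """Overlap counts between reference_terms and the terms found for random gene lists."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_iter):
        random_genes = set(rng.sample(genes, n_selected))
        counts.append(len(find_terms(random_genes) & reference_terms))
    return counts


def empirical_p(observed, null_counts):
    """Fraction of random repetitions with at least the observed overlap."""
    return sum(c >= observed for c in null_counts) / len(null_counts)


# Toy stand-in for an enrichment tool (an assumption for illustration only):
# 20 terms of 10 genes each; a term is 'found' if at least 9 of its genes are selected.
annotation = {f"term{t}": {f"g{t * 10 + j}" for j in range(10)} for t in range(20)}
genes = sorted(set().union(*annotation.values()))


def toy_find_terms(selected):
    return {t for t, members in annotation.items() if len(members & selected) >= 9}


reference = {"term0", "term1", "term2"}
null = overlap_null_distribution(genes, 50, reference, toy_find_terms, n_iter=1000)
print(empirical_p(3, null))
```

With this estimator, an empirical P-value below 1.0E-4 from 10 000 repetitions corresponds to no random gene list reaching the observed overlap.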

Table 2: Terms found by GO-function in the Hong microarray data for colorectal cancer

GOID	Name	Global redundancy
GO:0000226	Microtubule cytoskeleton organization	Y
GO:0006260	DNA replication	N
GO:0006399	tRNA metabolic process	N
GO:0006403	RNA localization	N
GO:0006974	Response to DNA damage stimulus	N
GO:0007049	Cell cycle	N
GO:0007059	Chromosome segregation	N
GO:0008380	RNA splicing	N
GO:0012501	Programmed cell death	N
GO:0015931	Nucleobase, nucleoside, nucleotide and nucleic acid transport	N
GO:0016071	mRNA metabolic process	N
GO:0022618	Ribonucleoprotein complex assembly	N
GO:0033554	Cellular response to stress	N
GO:0034470	ncRNA processing	N
GO:0042254	Ribosome biogenesis	N
GO:0044267	Cellular protein metabolic process	N
GO:0051301	Cell division	Y

The notations are the same as described in the footnote of Table 1.

In addition, most (83.1%) of the 83 ancestor–offspring comparison results in the Sabates-Bellver data set could be reproduced in the Hong data set. For example, in the Sabates-Bellver data set, the term ‘cellular response to stress’ was redundant in comparison with its offspring term ‘response to DNA damage stimulus’ according to Rule 1-1. The same result was observed in the Hong data set. This result indicated that the offspring term ‘response to DNA damage stimulus’ is more likely to be closely related to cancer [29, 30]. As a second example, in both the Sabates-Bellver data set and the Hong data set, the term ‘organelle fission’ was redundant in comparison with its offspring term ‘mitosis’ according to Rule 1-1. GO-function kept the term ‘mitosis’, which has also been shown to be closely related to cancer [31]. Notably, the ancestor terms in these two examples were found by PCU but removed by GO-function. As a third example, in both the Sabates-Bellver data set and the Hong data set, the term ‘positive regulation of ubiquitin–protein ligase activity during mitotic cell cycle’ was redundant in comparison with its ancestor term ‘positive regulation of ligase activity’ according to Rule 1-2. Many genes annotated in the ancestor term but not in the offspring term, such as PIN1 [32] and XRCC4 [33], are closely related to cancer. Thus, we could assume that the entire ancestor term is disturbed in the disease. As a fourth example, in both the Sabates-Bellver data set and the Hong data set, ‘rRNA processing’ was redundant in comparison with its ancestor term ‘ribosome biogenesis’ according to Rule 1-2. Studies have also suggested that many genes annotated in the ancestor term but not in the offspring term, such as AATF [34] and MINA [35], are closely related to cancer. We note that the offspring terms in the third and fourth examples were found by elim but removed by GO-function.

DISCUSSION

GO-function is designed to extract biologically relevant GO terms from statistically significant terms for a given disease. In comparison with GO-function, PCU tends to choose more ‘general’ terms whose significance might be simply due to their significant offspring terms. Elim tends to choose ‘specific’ terms, even if the frequencies of interesting genes in these terms are not different from or are even lower than the frequencies in their corresponding ancestor terms. In such a case, changes in genes across the entire BPs, rather than only in the sub-processes, may actually be related to the disease. Neither PCU nor elim treats global redundancy. For the two algorithms that do treat global redundancy, GenGO and MGSA, we found that they tend to miss most biologically relevant terms and that it is difficult to interpret their results statistically. In contrast to these other methods, GO-function can find GO terms that are more statistically and biologically meaningful. As shown in the ‘Results’ section, GO-function is rather robust when using different P-value cutoffs for local and global redundancy treatment. Notably, even after redundancy treatment, which tends to exclude terms that include many irrelevant genes, a term identified as relevant to a disease is still based on statistical evidence of the enrichment of interesting genes in this term. Thus, we still cannot infer that every gene in a selected term is relevant to the disease. However, we could assume that the changes of some ‘non-interesting genes’ in this term are also likely to be relevant to the disease because the power of statistical testing for determining ‘interesting genes’ may be insufficient in the presence of large biological variation.

GO-function belongs to the class of tools called modular enrichment analysis (MEA) [2], which calculates enrichment P-values of ‘interesting genes’ for all GO terms. One major limitation of this type of method is that it relies on thresholds for defining genes as ‘interesting genes’. In contrast, another type of method, the threshold-free methods, can handle this difficulty by using continuous measures [36]. However, the threshold-free methods also have their own limitations, as discussed in [2]. Taking the tool named GSEA (gene set enrichment analysis) [36] as an example, a few highly changed genes in a term can produce a small enrichment P-value even if most genes in this term are randomly changed [2]. When using GSEA, users still need to choose parameters that can affect the output. For example, for the Sabates-Bellver data set, no significant term could be found with an FDR of 5% when using GSEA with the ‘phenotype’ permutation, but 94 terms could be found with the ‘gene_set’ permutation. The ‘Min size’ and ‘Max size’ parameters of GSEA can also affect the output. In general, as Huang et al. [2] commented, ‘current enrichment analysis is more of an exploratory procedure, with the aid of enrichment P-value, rather than a pure statistical solution. The notion that the enriched terms should make sense based on a priori biological knowledge of the study is the most important guideline to help users in adjusting analytic thresholds.’ Finally, threshold-free methods also face the redundancy problem, which needs to be addressed in our future work.

One common problem in the evaluation of enrichment analysis tools is that there are no gold standards for phenotype-associated pathways by which to assess the results. Some studies [3, 4, 14, 15] have attempted to address this problem using simulated data. However, simulation studies are heavily dependent on the models themselves for generating the simulated data, and as such, they often risk producing biased data preferentially favorable to specific algorithms. For example, we found that although PCU and elim performed well with certain simulated data [14, 15], the precisions of these two algorithms decreased to below 0.1 based on simulated data used to evaluate MGSA [4]. Thus, in this work, we did not use simulated data to compare GO-function with other methods. Instead, we used certain rules to compare results from different algorithms. The logic behind these rules is simple and can be interpreted by users. In addition, we have validated some interesting results obtained by GO-function in one data set using another independent laboratory data set for colorectal cancer.

Currently, a serious problem in studying disease-associated pathways is that the criteria for defining pathways vary greatly among pathway databases [37–40]. For example, the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [41] tends to define general pathways, whereas the BioCarta database (http://www.biocarta.com/) tends to define specific pathways [37]. Based on these pathway data sources, it would be hard, if not impossible, to determine whether only a part of a pathway or the entire pathway is likely to be disturbed in a disease. In contrast, GO [1] defines functions at various levels of specificity in a hierarchical manner. As such, GO is relatively well suited to addressing the question of whether only an offspring term or the entire term is likely to be related to a disease. We recognize that BPs defined in GO are ‘series of events accomplished by one or more ordered assemblies of molecular functions’ [1], which usually cannot be simply mapped to traditional pathways. Nevertheless, some GO terms could be mapped to traditional pathways or parts of these pathways. For example, most of the genes in the GO term ‘cell cycle’ can be mapped to the ‘cell cycle’ pathway collected in KEGG. GO contains not only the ‘cell cycle’ term but also its offspring terms such as ‘mitosis’, which can be largely mapped to the M phase of the ‘cell cycle’ pathway in KEGG (see Supplementary Figure S3). In the Sabates-Bellver data set for colorectal cancer, we found that the frequency of DE genes in the ‘mitosis’ term was significantly higher than that in the ‘cell cycle’ term. Therefore, both terms were retained by GO-function according to Rule 1-3. This result indicated that ‘mitosis’ may play a key role in the cancer mechanism [31]. Notably, this difference could not be found based on KEGG because no sub-pathways of the ‘cell cycle’ pathway are described in KEGG. However, as with other current GO-based enrichment analysis tools, how to interpret the selected GO terms in the context of traditional pathways remains a problem that deserves future study.

SUPPLEMENTARY DATA

Supplementary data are available online at http://bib.oxfordjournals.org/.

Key Points

Most current algorithms used to find significant GO terms cannot handle the redundancy that results from the dependencies of GO terms.
Simply based on some numerical considerations, current algorithms developed for reducing this redundancy produce results that do not account for biologically interesting cases.
GO-function extracts disease-related GO terms from pre-selected statistically significant terms and the results are interpretable based on several explicit rules.

FUNDING

The National Natural Science Foundation of China (grant numbers 91029717, 30770558, 30970668 and 81071646); Excellent Youth Foundation of Heilongjiang Province of China (grant number JC200808); Natural Science Foundation of Heilongjiang Province of China (grant number QC2010012); and Scientific Research Fund of Heilongjiang Provincial Education Department of China (grant number 11541156).

References

1. Ashburner M, Ball CA, Blake JA, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 2000, vol. 25 (pg. 25-9)
2. Huang da W, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res, 2009, vol. 37 (pg. 1-13)
3. Lu Y, Rosenfeld R, Simon I, et al. A probabilistic generative model for GO enrichment analysis. Nucleic Acids Res, 2008, vol. 36 pg. e109
4. Bauer S, Gagneur J, Robinson PN. GOing Bayesian: model-based gene set analysis of genome-scale data. Nucleic Acids Res, 2010, vol. 38 (pg. 3523-32)
5. Camon E, Magrane M, Barrell D, et al. The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res, 2004, vol. 32 (pg. D262-6)
6. Zhu J, Wang J, Guo Z, et al. GO-2D: identifying 2-dimensional cellular-localized functional modules in Gene Ontology. BMC Genomics, 2007, vol. 8 pg. 30
7. Ma W, Yang D, Gu Y, et al. Finding disease-specific coordinated functions by multi-function genes: insight into the coordination mechanisms in diseases. Genomics, 2009, vol. 94 (pg. 94-100)
8. Tchagang AB, Gawronski A, Berube H, et al. GOAL: a software tool for assessing biological significance of genes groups. BMC Bioinformatics, 2010, vol. 11 pg. 229
9. Buske FA, Boden M, Bauer DC, et al. Assigning roles to DNA regulatory motifs using comparative genomics. Bioinformatics, 2010, vol. 26 (pg. 860-6)
10. Piper EK, Jonsson NN, Gondro C, et al. Immunological profiles of Bos taurus and Bos indicus cattle infested with the cattle tick, Rhipicephalus (Boophilus) microplus. Clin Vaccine Immunol, 2009, vol. 16 (pg. 1074-86)
11. Tedder PM, Bradford JR, Needham CJ, et al. Gene function prediction using semantic similarity clustering and enrichment analysis in the malaria parasite Plasmodium falciparum. Bioinformatics, 2010, vol. 26 (pg. 2431-7)
12. Yeang CH, McCormick F, Levine A. Combinatorial patterns of somatic gene mutations in cancer. Faseb J, 2008, vol. 22 (pg. 2605-22)
13. Hanahan D, Weinberg RA. The hallmarks of cancer. Cell, 2000, vol. 100 (pg. 57-70)
14. Grossmann S, Bauer S, Robinson PN, et al. Improved detection of overrepresentation of Gene-Ontology annotations with parent child analysis. Bioinformatics, 2007, vol. 23 (pg. 3024-31)
15. Alexa A, Rahnenfuhrer J, Lengauer T. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics, 2006, vol. 22 (pg. 1600-7)
16. Moloshok TD, Klevecz RR, Grant JD, et al. Application of Bayesian decomposition for analysing microarray data. Bioinformatics, 2002, vol. 18 (pg. 566-75)
17. Kim T, Yoon J, Cho H, et al. Downregulation of lipopolysaccharide response in Drosophila by negative crosstalk between the AP1 and NF-kappaB signaling modules. Nat Immunol, 2005, vol. 6 (pg. 211-8)
18. Han JD, Bertin N, Hao T, et al. Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature, 2004, vol. 430 (pg. 88-93)
19. Sgambato A, Cittadini A, Faraglia B, et al. Multiple functions of p27(Kip1) and its alterations in tumor cells: a review. J Cell Physiol, 2000, vol. 183 (pg. 18-27)
20. Sabates-Bellver J, Van der Flier LG, de Palo M, et al. Transcriptome profile of human colorectal adenomas. Mol Cancer Res, 2007, vol. 5 (pg. 1263-75)
21. Hong Y, Downey T, Eu KW, et al. A ‘metastasis-prone’ signature for early-stage mismatch-repair proficient sporadic colorectal cancer patients and its implications for possible therapeutics. Clin Exp Metastasis, 2010, vol. 27 (pg. 83-90)
22. Irizarry RA, Hobbs B, Collin F, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 2003, vol. 4 (pg. 249-64)
23. Diehn M, Sherlock G, Binkley G, et al. SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data. Nucleic Acids Res, 2003, vol. 31 (pg. 219-23)
24. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA, 2001, vol. 98 (pg. 5116-21)
25. Gentleman RC, Carey VJ, Bates DM, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol, 2004, vol. 5 pg. R80
26. Gonin HT. The use of factorial moments in the treatment of the hypergeometric distribution and in tests for regression. Philosophical Magazine Series 7, 1936, vol. 21 (pg. 215-26)
27. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Ann Stat, 2001, vol. 29 (pg. 1165-88)
28. Zhang M, Yao C, Guo Z, et al. Apparently low reproducibility of true differential expression discoveries in microarray studies. Bioinformatics, 2008, vol. 24 (pg. 2057-63)
29. Bartek J, Lukas J. Mammalian G1- and S-phase checkpoints in response to DNA damage. Curr Opin Cell Biol, 2001, vol. 13 (pg. 738-47)
30. Mailand N, Falck J, Lukas C, et al. Rapid destruction of human Cdc25A in response to DNA damage. Science, 2000, vol. 288 (pg. 1425-9)
31. Weaver BA, Cleveland DW. Decoding the links between mitosis, cancer, and chemotherapy: The mitotic checkpoint, adaptation, and cell death. Cancer Cell, 2005, vol. 8 (pg. 7-12)
32. Lu KP, Zhou XZ. The prolyl isomerase PIN1: a pivotal new twist in phosphorylation signalling and disease. Nat Rev Mol Cell Biol, 2007, vol. 8 (pg. 904-16)
33. Wang Y, Wang L, Li X, et al. Polymorphisms of XRCC4 are involved in reduced colorectal cancer risk in Chinese schizophrenia patients. BMC Cancer, 2010, vol. 10 pg. 523
34. Haanpaa M, Reiman M, Nikkila J, et al. Mutation analysis of the AATF gene in breast cancer families. BMC Cancer, 2009, vol. 9 pg. 457
35. Zhang Y, Lu Y, Yuan BZ, et al. The Human mineral dust-induced gene, mdig, is a cell growth regulating gene associated with lung cancer. Oncogene, 2005, vol. 24 (pg. 4873-82)
36. Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA, 2005, vol. 102 (pg. 15545-50)
37. Gu Y, Yang D, Zou J, et al. Systematic interpretation of comutated genes in large-scale cancer mutation profiles. Mol Cancer Ther, 2010, vol. 9 (pg. 2186-95)
38. Lu LJ, Sboner A, Huang YJ, et al. Comparing classical pathways and modern networks: towards the development of an edge ontology. Trends Biochem Sci, 2007, vol. 32 (pg. 320-31)
39. Soh D, Dong D, Guo Y, et al. Consistency, comprehensiveness, and compatibility of pathway databases. BMC Bioinformatics, 2010, vol. 11 pg. 449
40. Korcsmaros T, Farkas IJ, Szalay MS, et al. Uniformly curated signaling pathways reveal tissue-specific cross-talks and support drug target discovery. Bioinformatics, 2010, vol. 26 (pg. 2042-50)
41. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res, 2000, vol. 28 (pg. 27-30)
