Abstract

Biologists very often use enrichment methods based on statistical hypothesis tests to identify gene properties that are significantly over-represented in a given set of genes of interest, by comparison with a ‘background’ set of genes. These enrichment methods, although based on rigorous statistical foundations, are not always the best single option to identify patterns in biological data. In many cases, one can also use classification algorithms from the machine-learning field. Unlike enrichment methods, classification algorithms are designed to maximize measures of predictive performance and are capable of analysing combinations of gene properties, instead of one property at a time. In practice, however, the majority of studies use either enrichment or classification methods (rather than both), and there is a lack of literature discussing the pros and cons of both types of method. The goal of this paper is to compare and contrast enrichment and classification methods, offering two contributions. First, we discuss the (to some extent complementary) advantages and disadvantages of both types of methods for identifying gene properties that discriminate between gene classes. Second, we provide a set of high-level recommendations for using enrichment and classification methods. Overall, by highlighting the strengths and the weaknesses of both types of methods we argue that both should be used in bioinformatics analyses.

Introduction

Given a predefined set of genes (or gene products) associated with some known process or disease (the seed set), a common bioinformatics task is to find biological properties shared by the genes, or gene products like proteins, in the set. This gene set could be, for instance, the set of over-expressed genes from an RNA-Seq differential expression analysis or a compilation of genes associated with some disease of interest. Common characteristics can help biologists understand the underlying biological process being studied and also help identify other genes, not present in the original set, that may also be associated with the phenotype of interest.

The most common approach to achieve this goal is to use enrichment analysis techniques to identify significantly over-represented gene properties in the seed set. Most enrichment methods work by using a set of seed genes that are associated with a phenotype (e.g. are differentially expressed, genetically associated with a disease or linked to a target phenotype). Next, some type of statistical analysis is performed to find gene properties that are over-represented in the set of ‘seed’ genes with respect to some ‘background’ set of genes at a statistically significant level. Sometimes, another set of genes called ‘candidate genes’ is also defined; these are genes that might be candidates for the phenotype of interest. The candidate genes can be filtered using the enriched properties (or some other descriptor derived from these properties, e.g. biological pathways associated with the enriched properties) as an inclusion criterion and used as possible targets for further research.

Enrichment analysis techniques have several layers of complexity; they are usually based on tests of statistical significance that are, by themselves, nuanced and difficult to interpret [1]. Also, commonly used gene/protein descriptors have their own set of caveats and complexities [2]. The Gene Ontology (GO), for instance, can be easily misused by inexperienced users [3] who fail to take its hierarchical structure into account. In addition, selecting the appropriate statistical test to find common patterns in the set of enriched genes is a problem in itself, as each technique has its own biases and limitations [4], as discussed in the next section. Some authors even recommend trying several types of statistical tests and selecting the results of the ones that make more biological sense [2], which is a highly controversial approach due to the risk of unintentional ‘$p$-hacking’ [5, 6]. If this approach is taken, all statistical tests that were not discarded due to issues with their underlying assumptions should be reported to the readers, and the analysis should be limited to exploratory studies.

The objective of this paper is to contrast commonly used enrichment method types with the increasingly popular (though still less widely used) approach of using classification algorithms from the area of machine learning to find candidate genes for further analysis and extract useful knowledge from the available data [7, 8], such as a list of predictive gene properties or rules that predict a phenotype of interest defined by the user. Classification algorithms work by using a ‘training set’ to learn a classification model that predicts the value of a class variable. The training set comprises instances (genes) usually represented as a numerical feature vector and a class variable with two (or more) possible class labels (the gene phenotype). In the binary case, usually one class label is considered the ‘positive’ label (the gene is associated with the phenotype), whereas the other label is the ‘negative’ one (the gene is not associated with the phenotype). The classification algorithm treats the class variable as the ‘ground truth’, meaning that it assumes that each instance is deterministically associated with a class label, which is not always the case due to the complex nature of biological processes. In any case, the classification model can then be used to classify instances in a ‘test set’ that have an unknown class label (e.g. to classify a new gene as ‘associated with disease’ or not). The reader should be aware that the machine-learning nomenclature is not completely standardized; what we call ‘test set’ here is sometimes called ‘validation set’.

We stress that in this work we assume that the genes under study are pre-labelled with discrete class labels. Ranked gene lists (e.g. expression ranks), genes with continuous target variables (e.g. absolute expression values) and unlabelled gene lists are out of the scope of this paper.

Note that the training set used by classification algorithms is conceptually similar to the union of the ‘seed’ and ‘background’ genes in the enrichment setting, since both sets are used as input knowledge by the methods. In the case of enrichment methods, the ‘positive’ instances (the instances annotated with a positive class label) come from the ‘seed’ set, and the ‘negative’ instances come from the ‘background’ set. Note, however, that classification algorithms assume that instances annotated with the negative class label are necessarily not associated with the phenotype, while the background set is often the whole genome. Also, the ‘test set’ can be thought of as being similar to the ‘candidate genes’ set, which, like the ‘test set’, is a set of genes with unknown class labels that may or may not be associated with the phenotype of interest (the positive class label), with two differences, as follows. First, the intersection between the test set and the training set in the classification setting is necessarily empty, while in the enrichment setting there may be some candidate genes in the background set [2]. Second, it is assumed that the genes in the test and training sets are random samples from the same gene population, whereas the candidate genes are normally chosen because they are more likely to have the phenotype of interest than the other genes in the genome, according to expert knowledge.

In this work, we compare classification and enrichment methods by analysing their approaches for finding over-represented gene properties, contrasting the underlying assumptions of both methods. We also comment briefly on the use of classification algorithms to perform gene prioritization tasks, which is a direct by-product of the machine-learning workflow studied here. The contribution of this paper is twofold. First, we discuss the advantages and disadvantages of applying classification algorithms and enrichment methods to identify biological patterns—in particular, identifying gene properties that discriminate between gene classes. Second, we provide high-level recommendations for using enrichment and classification methods.

The remainder of this paper is organized as follows: The Background section gives an overview of both enrichment and classification methods for bioinformatics. The Enrichment methods versus classification methods from machine learning section discusses the advantages and disadvantages of classification and enrichment methods in bioinformatics (our 1st contribution). Lastly, the Conclusions and recommendations section presents our conclusions and gives high-level recommendations for using enrichment and classification methods in bioinformatics (our 2nd contribution).

Background

Overview of enrichment methods for bioinformatics

Enrichment analysis methods are popularly divided into three categories [2, 9, 10]: Singular Enrichment Analysis (SEA), Gene Set Enrichment Analysis (GSEA) and Modular Enrichment Analysis (MEA). Briefly, these categories group enrichment methods based on the type of statistical tests used and what corrections, if any, are made.

SEA methods calculate an enrichment $p$-value for each tested term based on its representation in a user-defined seed gene set, often using the hypergeometric distribution [2]. Next, the subset of terms with statistically significant $p$-values (after correcting for multiple hypothesis testing) is considered ‘enriched’ in the gene set.
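
The core calculation behind many SEA tools can be sketched in a few lines. The sketch below uses the hypergeometric distribution via SciPy, with purely hypothetical counts (background size, term annotations and seed-set size) chosen only for illustration; a real analysis would repeat this for every tested term and correct the resulting $p$-values for multiple testing, as noted above.

```python
from scipy.stats import hypergeom

# Hypothetical counts, for illustration only
N = 20000   # genes in the background set
K = 300     # background genes annotated with the term under test
n = 150     # genes in the seed set
k = 12      # seed genes annotated with the term

# Probability of observing k or more annotated genes in a seed set of size n,
# drawn without replacement from the background (the SEA null hypothesis)
p_value = hypergeom.sf(k - 1, N, K, n)
print(f"enrichment p-value for the term: {p_value:.2e}")
```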

GSEA methods do not require a user-defined seed gene set, instead calculating an ‘enrichment score’ for each term based on its distribution along a list of all the genes studied in the experiment, ranked by some experimental measure such as fold change or significance of differential expression. The original GSEA method used a Kolmogorov–Smirnov-like statistic as the enrichment score [11], whereby the algorithm walks down the ranked list of genes, increasing a running statistic each time a gene is annotated with the term of interest and decreasing it each time a gene is not. The enrichment score is then given as the maximum deviation from zero that the Kolmogorov–Smirnov-like statistic reaches.
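
To make the running-sum idea concrete, the sketch below implements a simplified, unweighted version of this statistic; full GSEA implementations typically also weight the steps by the ranking metric and assess significance by permutation. The ranked list and the annotation set are hypothetical.

```python
def running_enrichment_score(ranked_genes, term_genes):
    """Unweighted, Kolmogorov-Smirnov-like running sum: walk down the ranked
    list, stepping up at genes annotated with the term and down otherwise;
    return the maximum deviation from zero."""
    term_genes = set(term_genes)
    n_hits = sum(g in term_genes for g in ranked_genes)
    n_misses = len(ranked_genes) - n_hits
    step_up, step_down = 1.0 / n_hits, 1.0 / n_misses  # steps sum to zero overall
    running = peak = 0.0
    for gene in ranked_genes:
        running += step_up if gene in term_genes else -step_down
        if abs(running) > abs(peak):
            peak = running
    return peak

# Hypothetical list ranked by, e.g., fold change, and a hypothetical annotation
ranked = ['g1', 'g2', 'g3', 'g4', 'g5', 'g6', 'g7', 'g8']
print(running_enrichment_score(ranked, term_genes={'g1', 'g2', 'g4'}))  # 0.8
```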

MEA methods build on the SEA and GSEA approaches by incorporating corrections for the network structures of the data, for instance by accounting for the hierarchical nature of GO terms or correlations between genes themselves [2]. A subset of MEA, introduced by [4], is pathway topology-based approaches (PTA). These methods focus on incorporating network and pathway interaction information from knowledge bases such as KEGG [12], Reactome [13] and RegulonDB [14]. PTA, similarly to GSEA methods, use a ranked list of genes instead of using a predefined score cutoff, with the difference that the structure of the biological pathways is taken into consideration when computing gene-level statistics, not just the fact that the gene is in the pathway.

Recently, an ensemble method has been developed that combines methods from all three categories, referred to as the ensemble of gene set enrichment analyses [15]. This approach calculates a range of gene set statistics using multiple methods and then computes a score based on these statistics with which to rank the gene set. Although this is only one approach, it does constitute a potentially new ensemble category of enrichment analyses.

Although MEA is the most sophisticated non-ensemble approach, given its incorporation of knowledge on the complex networks common to biological systems, it is not necessarily the best option. SEA methods have been shown to give equally good or better results on real data sets [16], while GSEA may be more appropriate for experimental designs where it is difficult to provide user-defined gene sets. One example is an RNA-Seq experiment producing very few significantly differentially expressed genes, which would likely result in very few (or no) significantly enriched GO categories when testing by MEA. Conversely, an RNA-Seq experiment producing thousands of significantly differentially expressed genes would result in a large number of significantly enriched GO categories when testing by MEA, leaving the biological interpretation of the results open to a large amount of bias based on the expertise of the researcher analysing them [11].

All of these approaches have proved popular in the analysis of high-throughput data. SEA methods have been used to good effect for focusing investigations into differentially expressed gene lists, for instance focusing a study on multiple sclerosis principally onto the differentially expressed genes involved in oxidative phosphorylation and synaptic transmission [17]. Further, SEA methods have been used to link high-throughput results to an observed phenotype, as in an analysis of a colon and rectal cancer data set that was able to link the enrichment of ’response to wounding’ proteins to poor prognosis in these cancers [18]. GSEA methods were instrumental in establishing the pathways affected by resveratrol, a drug of interest for its effects on metabolism and lifespan [19, 20], and along with MEA methods continue to see wide use, for instance in the determination of pathways involved in cancer [21, 22].

Overview of classification methods (from machine learning) for bioinformatics

The classification task is the computational problem of inducing a classification model that maps given instances to classes using the (typically) numerical features of each instance (Box 1 gives a complete glossary of the main machine-learning terms used in this paper). We will now illustrate this with a hypothetical computational experiment wherein the instances are genes, the features are the GO terms associated with each gene and the class label to be predicted is ‘change in expression with age’. Thus, the purpose of the experiment is to induce a model that, for a given gene, predicts whether that gene will be differentially expressed with age based on its associated GO terms.

Box 1

Glossary of machine-learning terms.

To perform this experiment, two sets of genes are required—a training set and a test set. The training set contains genes for which the class label is already known; in this case, it would be a set of genes known to either be differentially expressed (the positive class label) or not differentially expressed (the negative class label) with age. The test set, on the other hand, contains genes for which the class label is not known, so in this case, it would be all the genes for which the expression change with age was not known.

Once these sets are established, the classification model can be constructed based on the training data—a model is created that predicts whether a given gene will be differentially expressed with age based on its annotated GO terms. Before applying this model to the test set, however, it should first be validated. Validation is an important step to estimate the predictive performance (generalization ability) of the model and thus estimate its accuracy. In this case, a validation set should be used consisting of genes with known class labels that were not present in the training set. It is important that the validation set does not overlap with the training set to prevent overestimation of the accuracy of the model.
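
The sketch below illustrates this protocol with scikit-learn on hypothetical data: a random binary feature matrix stands in for GO-term annotations, and random labels stand in for ‘differentially expressed with age or not’, so the point is the workflow (fit on the training genes, estimate performance on held-out genes) rather than the numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 40))   # hypothetical binary GO-term features
y = rng.integers(0, 2, size=500)         # hypothetical class labels

# Hold out labelled genes that the model never sees during training
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Estimate generalization ability on the held-out (validation) genes only
auc = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])
print(f"validation AUC: {auc:.2f}")      # close to 0.5 here, as the labels are random
```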

Once the model has been validated, if its estimated predictive performance is satisfactory, then it can be used to classify the test set and thus predict hitherto unknown expression changes with age that can later be validated experimentally (the task of selecting the genes for further validation is called ‘gene prioritization’). Note that, in academic studies, very often there is no test set in the aforementioned sense: no ‘real’ prediction is made, and conclusions about predictive power are based on the validation set only, with no empirical confirmation. In addition, machine-learning terminology is not completely standardized, and what is here called the validation set is often called the test set in the literature.

Classification methods have been extensively used in bioinformatics [8, 23, 24]. In this context, usually both the training and validation sets contain a list of genes with the phenotype of interest (instances with the positive class label) together with a list of genes without the phenotype of interest. Note that, usually, the latter list is actually a list of genes that are not known to be associated with the phenotype. The test set is usually a set of genes that could be associated with the phenotype of interest (e.g. the whole genome excluding the genes in the training and validation sets or a subset of genes selected using expert knowledge).

Another important aspect of using classification algorithms for gene prioritization is how to define the numerical features describing the instances (genes). Popular approaches include the use of experimentally derived gene properties, such as GO terms [25], Protein–Protein Interactions from BioGrid [26], functional protein associations from STRING [27] and pathway information from databases like KEGG [12]. The features encoding these properties are normally binary, where a feature value of ‘1’ (the positive feature value) indicates that the property is associated with the gene, while a value of ‘0’ (the negative feature value) means that the property is not currently known to be associated with the gene. Note that these features suffer from a high level of ‘research bias’, that is, highly researched genes tend to have more positive annotations than less popular genes. In addition, the negative value of a feature is much less informative than the positive value, since the negative value usually indicates ‘lack of evidence’ rather than ‘evidence of absence’ [28]. There are ‘lower-level’ features, which are less impacted by research bias, e.g. unbiased gene descriptors, such as physicochemical gene properties [29], gene co-expression scores measured using genome-wide methods [30, 31] and gene-expression levels across tissues [32]. The values of these lower-level features, however, are harder to interpret, i.e. usually it is more useful to know that a gene is involved in a given biological pathway (higher-level feature) than to know its expression level (lower-level feature).
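
A minimal sketch of how such binary features are typically assembled, using hypothetical gene names and annotations (in practice the annotations would come from resources such as GO, BioGrid, STRING or KEGG):

```python
import pandas as pd

# Hypothetical annotations: properties currently known for each gene
annotations = {
    'geneA': {'GO:0006914', 'GO:0050896'},
    'geneB': {'GO:0050896'},
    'geneC': set(),   # no known annotations: lack of evidence, not evidence of absence
}
all_terms = sorted(set().union(*annotations.values()))

# 1 = property known to be associated with the gene; 0 = not currently known
features = pd.DataFrame(
    [[int(term in props) for term in all_terms] for props in annotations.values()],
    index=list(annotations.keys()), columns=all_terms)
print(features)
```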

In principle, the issues of research bias and uninformative negative values affect both enrichment and classification methods. Standard classification methods, however, seem more vulnerable since they tend to compound these effects while making a prediction, potentially using several unreliable feature values. Decision trees, for instance, may use several properties with negative values to predict the class of a single instance, perhaps with no property being a reliable predictor. Also, the ‘enrichment’ statistics, as the name suggests, are focused on calculating how probable the observed gene properties are (the properties with positive feature values) given the null hypothesis and not the unobserved gene properties (the properties with negative feature values).

Some classification models, besides being useful for predicting the class labels of unknown-function instances (genes), can also be used to gain knowledge about the underlying classification problem. For instance, decision-tree models are relatively easy to interpret, being capable of generating rules involving several features to classify instances, and have been used in bioinformatics to generate potentially interesting biological knowledge [23]. Note that the kind of knowledge extracted from classification models depends on the type of model being used; while decision trees are capable of generating easily interpretable rules, other types of classification models (e.g. Bayesian networks) generate models that can be interpreted with some effort [33]. Other types of classification models (e.g. deep neural networks) are hardly interpretable at all, requiring post-processing methods to be analysed [34].

One can also use the output of feature selection methods to get insights about important features (gene properties) [35]. Feature selection methods are typically used to rank the features (or feature subsets) in terms of discriminative power, placing redundant features and features with low discriminative power lower in the rank than more discriminative ones. Note, however, that the insight that feature selection methods can provide is limited. Analysing a simple list of ranked features does not explicitly show complex feature relationships or value-dependent conditions, which is the kind of more detailed insight provided by analysing classification models such as sets of IF-THEN prediction rules or decision trees. For instance, in [36] the authors report the following classification rule:

IF GO:0050896 (response to stimulus) = yes
AND GO:0048518 (positive regulation of biological process) = yes
AND number_of_protein_interacting_partners > 15
THEN class is ageing-related DNA repair gene

which means that if a gene satisfies all three conditions in its IF part (i.e. a case of feature interaction), the gene is classified as an ageing-related DNA repair gene.
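
Rules of this kind can be read directly out of a trained decision tree. The sketch below uses scikit-learn on randomly generated stand-in data (the feature names echo the rule above, but the values and labels are synthetic), so the printed rules illustrate the mechanism rather than any biological finding.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
feature_names = ['GO:0050896', 'GO:0048518', 'n_protein_interacting_partners']
X = np.column_stack([rng.integers(0, 2, 300),       # annotated with GO:0050896?
                     rng.integers(0, 2, 300),       # annotated with GO:0048518?
                     rng.integers(0, 60, 300)])     # number of interaction partners
# Synthetic labels that depend on an interaction of all three features
y = ((X[:, 0] == 1) & (X[:, 1] == 1) & (X[:, 2] > 15)).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
# Each root-to-leaf path of the printed tree corresponds to one IF-THEN rule
print(export_text(tree, feature_names=feature_names))
```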

In recent years, the use of deep-learning neural networks has been growing significantly in almost every field where labelled data is abundant, including biology [37]. Deep neural networks differ from traditional neural networks mainly in their highly flexible model, capable of automatically creating higher-level representations of the data that, in many cases, result in very good predictive performance. However, one should note that this potential is usually realized when the training sets are considered ‘large’. As a rule of thumb, training sets should have more than 20 000 instances, and each class label should annotate at least 5000 instances [38], which is much more data than available in many biological data sets.

A good example of the practical limitation of deep learning when the number of instances per class is not large can be found in [39]. In that work, deep learning was applied to a more complex variation of the classification task called hierarchical classification [40], where there is a large number of class labels organized into a hierarchical structure, with generalization/specialization relationships among the class labels. More precisely, in [39] two hierarchical classification tasks were addressed, where the class labels to be predicted are GO terms and protein families (as defined in the UniProt database). However, instead of trying to predict all possible class labels, deep learning was used to predict only the labels associated with at least 200 instances (genes) in the case of GO terms and at least 150 genes in the case of UniProt (super/sub) families. This reduced the number of class labels to be predicted to ‘only’ 983 for the GO terms and 698 for UniProt families. Although these are large numbers of class labels, they represent a relatively small proportion of the available class labels, and importantly, in general, they represent the class labels at higher levels of the class hierarchy, i.e. more generic GO terms or UniProt families. These are, broadly speaking, the easiest class labels to predict, because they annotate so many instances. It would be much harder to predict the numerous most specific GO terms and UniProt families, which are annotated with fewer (often far fewer) than 200 and 150 instances, respectively.

The studies [28, 41] are examples of classification techniques applied to find biological patterns. In [28] the authors proposed an approach to identify important features to predict ageing-related classes using random forests (an ensemble of decision trees). The authors interpreted the biological meaning of the extracted patterns and concluded that they are indeed related to ageing. In [41], a relatively simple classification model (also based on decision trees) was able to achieve high predictive performance while classifying human genes as ageing related or non-ageing related. The authors identified new candidate proteins having strong computational evidence of their role in ageing and also found a small set of highly predictive features to classify the genes as ageing related.

Enrichment methods versus classification methods from machine learning

Enrichment analysis is strongly based on the concept of statistical significance. This concept is related to the concept of predictive power in the classification task of machine learning since both enrichment and machine-learning approaches tend to give more importance to gene properties that are over-represented in one of the experimental conditions (or class labels). Rule-induction algorithms, for instance, in general will choose to use a predictive feature value that is over-represented in one class, rather than choosing an under-represented feature value in that class, since the former has better predictive power. However, a high degree of enrichment (significance) does not necessarily imply high predictive power, and vice versa, high predictive power does not necessarily imply statistical significance. An example of each of these two cases is discussed next.

Suppose we have two classes of genes, say overexpressed (positive class) and not overexpressed (negative class), each gene annotated with many GO terms that can be used as features. Suppose these classes have prior probabilities (before observing any GO term) of 10% for the positive class and 90% for the negative class. Hence, if a certain GO term shows no correlation with the class variable, we would expect, by chance, that out of the genes annotated with that GO term, 10% belong to the positive class, and 90% belong to the negative class. Suppose now that we observe, in the data, that 50% of the genes annotated with that GO term belong to the positive class, and the other 50% belong to the negative class. Assume the actual number of genes with these annotations is large enough for this result to be statistically significant, indicating a significant enrichment of that GO term in the positive, over-expressed, class. Now, if we use only the presence of that GO term annotation to predict the class of a gene, that occurrence of that GO term has low predictive power; given the information that the gene is annotated with that GO term, there is a 50% chance of the gene belonging to each class.

Note that this does not mean that the GO term is completely useless for classification. After all, the probability of observing that GO term in the over-expressed class is five times higher than by chance. So, if we combine the occurrence of that GO term with the occurrence of other GO terms that are also significantly enriched for the over-expressed class, it is quite possible that a combination of those GO terms increases the probability of the over-expressed class to substantially higher than 50%. This is why it is important to consider GO term interactions by doing a multivariate analysis.

In the above example, the problem is that the relatively large increase in the probability of the class given that we observe an enriched GO term is not enough to compensate for the very low relative frequency of the positive (over-expressed) class. That is, despite statistical significance, the ‘signal’ is not strong enough to predict the over-expressed class. Also, broadly speaking, the result of a statistical significance test tends to be quite sensitive to the size of the sample. Even if the data is nearly random (with a very small effect size), if an extremely large sample is used, the test will tend to return a significant result anyway [5].

The above example shows that a statistically significantly enriched GO term may not have a strong predictive power by itself. Let us now consider the opposite case. Suppose that in the data set there are only 20 genes annotated with a certain GO term, and 18 of these genes belong to the positive class (overexpressed), with two genes belonging to the negative class. Assume these small numbers are not enough to achieve statistical significance, so the GO term would not be considered to be (significantly) enriched in the positive class. Despite that, this GO term has a high predictive power; given that we observe the occurrence of that GO term in a gene to be classified, if we classify that gene based just on that GO term, there is a 90% probability of the gene having the over-expressed class, a 9-fold increase in the prior probability of the class. The small number of genes with that GO term clearly means that only that GO term, by itself, would not be able to classify many genes, so again we would need to do a multivariate analysis of the data (considering many other GO terms as features) in order to reliably classify many more genes. However, the example showcases that, when looking at each GO term separately, without considering interaction with other GO terms, a non-enriched (i.e. not statistically significant) GO term can have a lot more predictive power than an enriched (significant) GO term. Figure 1 illustrates the above two cases graphically.
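
The arithmetic of these two hypothetical cases can be checked directly. The sketch below uses counts chosen only to match the proportions in the text (a 10%/90% class prior and, for the first case, an arbitrary background of 10 000 genes): Fisher's exact test for the common GO term of the first case, and the precision and lift of the rare GO term of the second case.

```python
from scipy.stats import fisher_exact

# Case A: a common GO term, 10 000 genes with a 10%/90% class prior; the term
# annotates 400 genes, 200 over-expressed and 200 not (counts are illustrative).
table_a = [[200, 200],    # annotated:     over-expressed, not over-expressed
           [800, 8800]]   # not annotated: over-expressed, not over-expressed
p_a = fisher_exact(table_a, alternative='greater')[1]
precision_a = 200 / (200 + 200)
print(f"common term: p = {p_a:.1e}, precision if used alone = {precision_a:.0%}")

# Case B: a rare GO term annotating 18 over-expressed genes and 2 other genes.
precision_b = 18 / 20
lift_b = precision_b / 0.10   # relative to the 10% prior probability of the class
print(f"rare term: precision if used alone = {precision_b:.0%} "
      f"({lift_b:.0f}-fold above the class prior)")
```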

Figure 1

Graphical representation of a hypothetical statistically significantly enriched GO term with poor predictive performance (Subfigure A, on the left-hand side) and a non-statistically significantly enriched GO term with high predictive performance (Subfigure B, on the right-hand side). Although the hypothetical GO term in Subfigure A is significantly enriched, it has poor predictive power (only 50% of the genes annotated with it are over-expressed). On the other hand, the non-significantly enriched hypothetical GO term in Subfigure B (which annotates fewer genes) has good predictive power (90% of the genes annotated with it are over-expressed). This shows that significantly enriched gene properties are not necessarily good predictors, and vice versa.

It is clear that the importance given to gene properties by enrichment and classification methods is misaligned. This is not surprising, as the objectives of these methods are, although related, fundamentally different. Enrichment methods are designed to, given two or more sets of genes, find properties that are significantly over-represented in one of those sets, with respect to the other set(s). Classification methods, on the other hand, seek to ‘explain’ how the gene sets associated with the classes given by the user were created. In other words, a classification algorithm aims to create a model that reproduces, as closely as possible, the division of the genes into the mutually exclusive classes given by the user, using gene properties. This model can be used to classify previously ‘unseen’ genes into the groups (classes). For this reason, a gene property that correctly classifies 18 out of 20 genes (as in the previous example) is much more valuable to the classification model than properties that are over-represented in many genes but misclassify the majority of them. Recall that, even though a rare highly predictive GO term is not enough to correctly classify many genes, the combination of many such relatively rare GO terms can correctly classify many genes, as mentioned earlier.

In summary, the main similarity between enrichment and machine-learning methods to find biological patterns rests on their principles for finding gene properties (features) that are over-represented in the class label (phenotype) of interest. That is, both types of methods tend to rank or select features based essentially on their degree of over-representation in the class of interest. However, enrichment and machine-learning methods have different biases when ranking the features. Enrichment methods rank features (gene properties) according to their statistical support independent of their predictive power, while machine learning tends to give more importance to features (gene properties) with greater predictive power. In concrete terms, this means that highly predictive features (features that can differentiate the instances among the classes with high predictive accuracy) will tend to be considered more important than features with high statistical support (features whose correlation with the class variable is statistically significant but that do not necessarily have high predictive power).

Also, these methods differ greatly in how they search for over-represented features and how they measure a feature’s over-representation score. Machine-learning methods usually employ some type of optimization procedure to search for over-represented features that, when taken together, are good predictors of the class label, whereas enrichment methods usually follow a simpler procedure that considers only one feature (gene property) at a time, ignoring feature interactions. For this reason, machine-learning methods are capable of providing richer results than a list of gene properties ranked by their importance. Decision tree algorithms, for instance, can return classification rules involving several gene properties that can be much more predictive than any individual feature taken in isolation. Enrichment methods, on the other hand, have the advantage of firmer statistical support for their findings.

In Table 1 we show a real example contrasting the results of using traditional enrichment methods and machine-learning approaches to identify biological patterns. These results were taken from the supplementary material of [35] (Supplementary files ‘GO-Terms-rankings-biological-process.xls’ and ‘Pro longevity.xlsx’, available at https://github.com/maglab/genage-analysis/blob/master/Dataset_2_data_mining.zip and https://github.com/maglab/genage-analysis/blob/master/Dataset_1_functional_enrichment.zip (respectively)), where the authors used traditional enrichment methods and also a feature selection method that takes into consideration the hierarchical structure of the GO to find the GO terms most related to pro-longevity in the Caenorhabditis elegans model organism. Feature selection is a widely used machine-learning technique that seeks to find a subset of highly predictive features, eliminating the uninformative ones. It is clear from Table 1 that the top 10 GO terms identified by the ‘traditional’ enrichment approach contain several redundancies (e.g. positive regulation of growth is a type of regulation of growth), whereas the results of the feature selection method contain more distinct terms. This is expected, as removing redundancies among features to improve predictive power is one of the aims of feature selection algorithms.

Table 1

Comparison of the top 10 GO terms associated with ‘pro-longevity’ genes in the worm model organism according to enrichment and machine-learning (feature selection) methods. Each sub-table shows the GO term identifier, the full GO term name and the $p$-value used to rank the GO terms (note that these $p$-values are not directly comparable, since they test different hypotheses). The results were taken from the supplementary materials of [35]. Note that although the two GO term sets are distinct at first glance, they have important similarities. For instance, GO terms that appear in the machine-learning set (GO:0001708, GO:0045138, GO:0010172) and in the enrichment method set (GO:0010259, GO:0007568, GO:0002119, GO:0002164, GO:0040024) are related to developmental processes.

Machine-learning (feature selection) method

Rank  GO Id.       GO term name                                    $p$-value
1     GO:0006914   autophagy                                       1.53E-03
2     GO:0051094   positive regulation of developmental process    3.56E-03
3     GO:0001708   cell fate specification                         5.19E-03
4     GO:0008285   negative regulation of cell proliferation       2.46E-02
5     GO:0044262   cellular carbohydrate metabolic process         2.46E-02
6     GO:0045138   tail tip morphogenesis                          2.46E-02
7     GO:0070265   necrotic cell death                             2.46E-02
8     GO:0018991   oviposition                                     4.77E-02
9     GO:0010172   embryonic body morphogenesis                    6.21E-02
10    GO:0006352   DNA-templated transcription, initiation         6.21E-02

Enrichment method

Rank  GO Id.       GO term name                                    $p$-value
1     GO:0010259   multicellular organismal aging                  1.69E-48
2     GO:0008340   determination of adult life span                1.69E-48
3     GO:0007568   aging                                           1.69E-48
4     GO:0002119   nematode larval development                     4.49E-39
5     GO:0002164   larval development                              4.97E-39
6     GO:0009791   post-embryonic development                      8.13E-39
7     GO:0040007   growth                                          4.05E-25
8     GO:0040024   dauer larval development                        2.76E-21
9     GO:0040008   regulation of growth                            1.81E-19
10    GO:0045927   positive regulation of growth                   2.88E-19

Note also that, overall, the top 10 GO terms identified by the enrichment and feature selection approaches are quite different, although there is some overlap (e.g. for GO terms related to development). This difference reinforces the motivation to use both approaches, since they make different assumptions and have, to some extent, complementary pros and cons, as discussed next. Using both approaches we have more opportunities to discover biological patterns, and patterns identified by both approaches (like development-related GO terms in the above example) can be considered particularly strong.

Advantages and disadvantages of classification methods from machine learning

The main advantages of classification methods are as follows.

First, most modern classification methods are non-parametric in the statistical sense—i.e. they do not assume that the data are distributed in a certain way. Instead, they adapt the learned model to the characteristics of the problem automatically during their training phase. Therefore, in principle, most classification algorithms can be used to discover very different types of relationships among variables in the data, including the discovery of highly non-linear correlations between the features (gene properties) and the class labels (the phenotype of interest). Most enrichment methods, on the other hand, are parametric in the statistical sense, and each method performs the same statistical calculations regardless of the extent to which the data satisfies the assumptions of the statistical test used.

Second, some types of classification models (e.g. decision trees) are relatively easily interpretable by users [42]. Such models can be used both for predictions and to gain insights about how the class label is related to the features in a relatively human-friendly fashion.

Third, most classification methods consider multivariate interactions between the features and the class label. On the other hand, most enrichment methods analyse one feature at a time, ignoring the fact that, sometimes, two or more gene properties, when taken at the same time, can be much more predictive (or enriched) than the individual properties.
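
A toy numerical illustration of this point: in the sketch below, two hypothetical binary gene properties are individually useless (each is split 50/50 between the classes, so a one-property-at-a-time test finds nothing), yet their interaction determines the class exactly, and a decision tree that models both features together classifies perfectly.

```python
import numpy as np
from scipy.stats import fisher_exact
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Two hypothetical binary properties; the class is their XOR, and every
# combination of property values appears 500 times.
X = np.repeat([[0, 0], [0, 1], [1, 0], [1, 1]], 500, axis=0)
y = X[:, 0] ^ X[:, 1]

# One property at a time: neither property is over-represented in either class
for name, feat in [('property A', X[:, 0]), ('property B', X[:, 1])]:
    table = [[np.sum((feat == 1) & (y == 1)), np.sum((feat == 1) & (y == 0))],
             [np.sum((feat == 0) & (y == 1)), np.sum((feat == 0) & (y == 0))]]
    print(name, 'Fisher p-value:', fisher_exact(table)[1])    # 1.0: no association

# Both properties together: the interaction is fully predictive
acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()
print('decision-tree cross-validated accuracy:', acc)         # 1.0
```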

The main disadvantages of classification methods are as follows.

First, some classification methods lack a formal statistical basis: several classification algorithms cannot make principled statistical assessments regarding the data. That is, the predictions are made without confidence intervals or $p$-values.

Second, many classification methods are very computationally intensive. For instance, deep neural networks are very computationally demanding, often requiring the use of specialized hardware to run in reasonable times [7]. Note, however, that some well-known classification methods, like most decision tree algorithms and Naive Bayes, are relatively fast [43].

Third, hyper-parameter setting is not trivial. Recall that most classification algorithms have settings (hyper-parameters) that control important aspects of the learning process. A poor hyper-parameter choice can lead to low (even close to random) predictive performance. Many classification algorithms are very sensitive to these settings, requiring either expert knowledge or computationally expensive hyper-parameter tuning methods. These tuning methods usually work by running the classification algorithm several times, with different hyper-parameter settings, and estimating the predictive performance of the constructed models to determine which setting is the best one. When performing this tuning, one must be careful not to measure predictive performance on the ‘validation set’, where the final predictive performance estimate will be computed, but rather on a subset of the ‘training set’. The predictive power of classification algorithms will very likely be grossly overestimated if the ‘validation set’ is used to tune the algorithm’s hyper-parameters.
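
The sketch below shows one common way to respect this separation with scikit-learn, on hypothetical data: a cross-validated grid search chooses the hyper-parameters using the training set only, and the untouched validation set is used once, for the final performance estimate.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(600, 50))   # hypothetical binary gene-property features
y = rng.integers(0, 2, size=600)         # hypothetical class labels

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Hyper-parameters are chosen by cross-validation inside the training set only
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': [100, 300], 'max_depth': [3, None]},
    scoring='roc_auc', cv=5)
search.fit(X_train, y_train)

# The validation set is used exactly once, for the final performance estimate
auc = roc_auc_score(y_valid, search.best_estimator_.predict_proba(X_valid)[:, 1])
print(search.best_params_, f'validation AUC: {auc:.2f}')
```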

Fourth, bioinformatics data sets often have two important particularities that can negatively impact the predictive performance of traditional classification algorithms: high class imbalance and structured biological descriptors. Regarding the issue of class imbalance, the data sets are often very unbalanced towards the negative class label: most whole-genome enrichment analyses involve thousands of genes without the phenotype of interest and only a few dozen with it. Most classification algorithms do not cope well with this high level of class imbalance. However, there has been extensive research on methods for improving the performance of classification algorithms in this scenario, including over-sampling of the minority class or under-sampling of the majority class to create a more balanced training set [44] (see the sketch below). Regarding the issue of structured biological descriptors, some descriptors (e.g. GO and FunCat terms) have a hierarchical structure. However, most classification algorithms treat them as unstructured, which may lead to problems due to the high correlation between terms. Exceptions are classification algorithms for hierarchical classification [40] and hierarchical feature selection methods for classification [45].
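
A minimal sketch of two of these mitigations on a hypothetical, heavily imbalanced gene data set: re-weighting the classes during training, and randomly over-sampling the minority class (both shown with scikit-learn; a real analysis would also evaluate with imbalance-aware metrics such as precision/recall or AUC).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))           # hypothetical gene features
y = np.r_[np.ones(50), np.zeros(4950)]    # 50 phenotype genes vs 4950 background genes

# Option 1: keep the data as-is but give errors on the rare class a larger weight
clf_weighted = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X, y)

# Option 2: randomly over-sample the minority class to balance the training set
pos, neg = X[y == 1], X[y == 0]
pos_upsampled = resample(pos, replace=True, n_samples=len(neg), random_state=0)
X_bal = np.vstack([pos_upsampled, neg])
y_bal = np.r_[np.ones(len(neg)), np.zeros(len(neg))]
clf_oversampled = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
```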

Advantages and disadvantages of enrichment methods

Enrichment methods are an extremely popular approach to summarize the functional characteristics of seed gene sets. These methods present several advantages when compared to other approaches, as follows. First, they are quick and computationally light, often able to analyse large gene sets using only a laptop computer, especially given the large number of web tools available. This makes enrichment analysis very suitable for small labs that may not have access to high-power computing clusters or machine-learning experts or for situations where a quick summary of gene set functionality is sufficient and a more sophisticated method would be unnecessary and overly time consuming.

Second, there are a wide variety of tools available covering multiple statistical methods. Many of these tools (for instance the highly popular DAVID tool [46]) are very user friendly with good documentation and clear explanations of their methodology to allow users to determine the best method for their data. These tools tend to use methods based on classical statistical tests that non-statisticians are likely to have at least some understanding of.

Finally, although less popular, Bayesian statistical methods have been incorporated into some enrichment analysis tools, allowing a more sophisticated statistical approach. The oldest of these is BayGO, which uses a Bayesian inference method to incorporate Goodman and Kruskal’s Gamma score of association. The association of differential expression to each GO term is measured, and Monte Carlo simulations are employed to determine the probability of randomly observing a stronger level of GO term enrichment than the measured level [47]. Other Bayesian tools are GO-Bayes [48], model-based gene set analysis [49] and multi-level ontology analysis [50], which all attempt to infer the probability that a given GO term is associated with a supplied gene set. These methods alleviate some of the concerns affecting most enrichment analysis methods, since the probability estimations account for some of the network characteristics inherent in biological data, while also considering all terms simultaneously, thus removing the need for multiple hypothesis testing correction. Most Bayesian methods also have the advantage of not relying on classical tests of statistical significance, whose limitations were discussed earlier. Instead, they are based on the prior probability (before building the model) and the probability of observing the data given the model, which are, arguably, easier concepts for most people to grasp than $p$-values.

The main disadvantages of enrichment methods are as follows.

First, most enrichment methods are heavily based on tests of significance using $p$-values as the decision criterion. However, $p$-values by themselves are not adequate as the main basis for scientific conclusions, since they do not measure the effect size, importance and reproducibility of a result. For this reason, they should not be taken as definitive evidence for the existence or size of an effect [1, 51]. Instead, researchers should use $p$-values to help guide a broader analysis, avoiding absolute conclusions based on them.

In [2] the authors point out that the $p$-values of enrichment methods are often treated as a score of ‘interestingness’, and that the sensitivity and specificity of the list of ‘interesting’ properties are seldom estimated. That is, little importance is given to the actual predictive power of the properties, with more value given to differences in relative frequencies instead. The authors also make the interesting point that the definition of the seed genes (for SEA methods) and gene rankings (for GSEA methods) is based on the assumption that the higher the differential expression of a gene, the more important the gene should be considered in the analysis. This is often a valid assumption, but not always; a small change in the expression of a regulatory gene may be much more biologically relevant than larger changes in, for instance, a metabolism-related gene.

Second, most ‘traditional’ statistical tests assume that the sampling units are independent. This is clearly not the case in most gene-expression experiments (where the sampling unit is usually a gene), a common application of enrichment methods, since several regulatory genes modulate the expression of other genes. When this assumption is not satisfied, the tests tend to make more type I errors (incorrect rejections of the null hypothesis) than expected [52].
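
This inflation can be illustrated with a toy simulation, sketched below under a deliberately extreme form of dependence: genes enter the ‘seed’ set in perfectly co-regulated blocks of 10, the tested property annotates entire blocks, and the property is independent of expression, so every rejection is a false positive. The hypergeometric test, which assumes independent genes, then rejects far more often than its nominal 5% level (all numbers are illustrative).

```python
import numpy as np
from scipy.stats import hypergeom

rng = np.random.default_rng(0)
n_blocks, block_size = 100, 10                    # 1000 genes in 100 co-regulated blocks
n_genes = n_blocks * block_size
block_of_gene = np.repeat(np.arange(n_blocks), block_size)
annotated = np.isin(block_of_gene, rng.choice(n_blocks, 20, replace=False))

n_sim, alpha, false_positives = 2000, 0.05, 0
for _ in range(n_sim):
    # Null model: expression is independent of the annotation but perfectly
    # correlated within each block (all genes in a block share one score)
    gene_scores = np.repeat(rng.normal(size=n_blocks), block_size)
    seed = gene_scores >= np.sort(gene_scores)[-100]   # top 100 genes form the seed set
    k = int((seed & annotated).sum())
    # One-sided hypergeometric enrichment test, which assumes independent genes
    p = hypergeom.sf(k - 1, n_genes, int(annotated.sum()), int(seed.sum()))
    false_positives += p < alpha

print(f"observed type I error rate: {false_positives / n_sim:.2f} (nominal level {alpha})")
```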

Third, SEA and GSEA enrichment methods (see the Overview of enrichment methods for bioinformatics section) ignore correlations between gene properties, analysing their enrichment significance independently. However, there are normally strong correlations among the gene properties; it is common that a gene annotated with one property is much more likely to be annotated with a second property. This is particularly common when using GO terms, which are hierarchically structured (e.g. every gene annotated with the term ‘detection of stimulus’ is, by definition, also annotated with the term ‘response to stimulus’). Arguably, this is less detrimental to enrichment methods than the inter-gene correlation mentioned in the previous paragraph [53], but it is still an important source of bias.

Table 2 lists the advantages and disadvantages of classification and enrichment methods to identify gene properties.

Table 2

Summary of advantages and disadvantages for classification and enrichment methods to identify biological patterns

Classification methods

Advantages:
  • Most classification models are non-parametric in the statistical sense; they do not assume the data follows a certain type of probability distribution.
  • Some models are interpretable. For instance, decision trees and logistic regression models can be easily interpreted by the user in many cases.
  • Most classification algorithms consider multivariate interactions between the features and class labels.

Disadvantages:
  • Many classification algorithms lack a formal statistical basis.
  • Many methods are very computationally expensive.
  • Hyper-parameter setting is not trivial.
  • Many methods do not cope well with high class imbalance and structured feature types (e.g. GO and FunCat), common in bioinformatics data sets, although there are methods to mitigate both issues.

Enrichment methods

Advantages:
  • Computationally light.
  • There is a wide variety of tools, many with good documentation and a clear methodology.
  • Some tools use Bayesian methods instead of classical statistical significance tests (whose problems were discussed earlier).

Disadvantages:
  • Tests of statistical significance based on $p$-values (used by many enrichment methods) are difficult to interpret and provide limited information.
  • The assumption made by most enrichment methods that the genes are independent seldom holds in the bioinformatics setting.
  • There are strong correlations between gene properties, which also violates the assumptions of many traditional tests of statistical significance.


Conclusions and recommendations

Conclusions

Given a list of genes associated with a phenotype of interest (seed genes), enrichment methods have been extensively used by biologists to retrieve properties associated with the seed genes and, sometimes, to retrieve non-seed genes for further investigation. Enrichment methods have several desirable characteristics: they are usually computationally inexpensive to run, produce principled, statistically based importance scores, are easily accessible and are popular among bioinformatics researchers.

However, in some scenarios, machine learning–based classification algorithms may be better suited to the task of identifying patterns in genomic data. Unlike enrichment methods, classification approaches aim to maximize ‘predictive performance’: they build a classification model to discriminate between gene classes, with measures of predictive performance estimated using different gene sets for training and validation. Most enrichment methods, on the other hand, aim at finding statistically significantly enriched properties in the seed genes; these properties by themselves may not have good predictive power.

Besides the focus on maximizing predictive power, some classification algorithms, like decision tree learners, output an interpretable classification model, which can be analysed by the user, potentially giving insights into the underlying biological processes. Also, most machine-learning methods are capable of finding non-linear relationships and of combining different gene properties to make a prediction.

Table 3. Summary of recommendations for classification and enrichment methods to identify biological patterns

Classification

  • One should carefully study the characteristics of the biological data. For example, GO term annotations vary in confidence and have an underlying hierarchical structure, which ideally should be taken into account by the algorithms.

  • Choosing the best kind of classification algorithm is important. The user should consider aspects like model interpretability, training time and predictive power.

  • Testing several types of classification algorithm is always recommended (since different algorithms learn different types of classification models), always being careful to estimate their predictive performance properly. When comparing the performance of multiple algorithms via statistical significance tests, use appropriate multiple hypothesis correction methods.

Enrichment

  • One should carefully choose which enrichment method to use (checking its assumptions) rather than trying several methods and choosing the preferred result, which could lead to the ‘p-hacking’ problem.

  • According to some authors, the use of a seed gene set is usually preferable to ranked lists of genes [16].

  • If the creation of seed gene sets is too difficult, consider using a GSEA approach or an MEA approach with multiple thresholds.

  • The results of enrichment methods are mainly descriptive. If gene prioritization is required, consider using guilt-by-association or machine-learning approaches.


Recommendations

One of the main practical challenges faced by biologists when applying machine-learning techniques to biological problems is how to construct the classification data sets. While most enrichment tools have built-in data sources, machine-learning algorithms often require file inputs. Note that having a built-in data source clearly facilitates the use of the tool but, on the other hand, may lead to the unintentional use of low-quality data (due to an out-of-date data source or to the use of low-confidence annotations). Fortunately, most bioinformatics databases have a link for downloading the entire database or Web APIs that can be used to extract the desired data. Also, there are Python (https://github.com/biopython), R [54] and Perl (https://www.ncbi.nlm.nih.gov/books/NBK25501/) libraries that can be used to obtain gene and protein data from several online resources. Biologists should also keep in mind the characteristics of the data they are using. For instance, not all gene annotations have the same level of confidence, and the lack of an annotation does not guarantee the absence of that property [3, 28]. These aspects should be carefully weighed when building classification data sets and when interpreting the results of both enrichment tools and classification models.
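
To make the data set construction step concrete, the minimal Python sketch below builds a binary gene-by-GO-term feature matrix and class labels from a small, made-up annotation table; in practice the annotation table would come from one of the downloads or APIs mentioned above. The gene names, term identifiers and seed list are purely illustrative.

```python
import pandas as pd

# Made-up long-format annotation table; in a real analysis this would be
# obtained from a database download or a Web API (e.g. via the libraries above).
annotations = pd.DataFrame({
    "gene":    ["geneA", "geneA", "geneB", "geneC", "geneC", "geneD"],
    "go_term": ["GO:0001", "GO:0002", "GO:0002", "GO:0001", "GO:0003", "GO:0003"],
})
seed_genes = {"geneA", "geneC"}  # hypothetical genes of interest

# Binary gene-by-term feature matrix: 1 if the gene is annotated with the term.
features = pd.crosstab(annotations["gene"], annotations["go_term"]).clip(upper=1)

# Class labels for a classification algorithm: 1 = seed gene, 0 = background gene.
labels = features.index.to_series().isin(seed_genes).astype(int)

print(features)
print(labels)
```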

Choosing the right type of classification method for the task at hand is essential. For instance, classification model interpretability is often desirable when working with biological data [42]. If that is the case, the user can focus on interpretable classification models. Note that ‘interpretability’ is subjective and highly dependent on the background knowledge of the user of the classification system. Having said that, decision trees, rule-based classifiers, naive Bayes and logistic regression classifiers are commonly considered ‘interpretable’. When high predictive power is more important than interpretability, we suggest using ‘black-box’ models, which are very difficult to interpret but tend to have better predictive performance. Support Vector Machines and Deep Neural Network classifiers are popular examples of such models.
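
As an illustration of this trade-off (a sketch only, using scikit-learn on a synthetic stand-in for a real gene annotation matrix), the code below fits a shallow decision tree, whose rules can be printed and read, and an RBF-kernel Support Vector Machine, which is typically treated as a black box:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for a gene-by-annotation matrix: 200 'genes', 20 binary features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 20))
y = (X[:, 0] | X[:, 3]) & X[:, 7]  # made-up 'seed vs background' labels
feature_names = [f"GO_{i}" for i in range(X.shape[1])]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Interpretable model: a shallow decision tree whose decision rules can be inspected.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(export_text(tree, feature_names=feature_names))
print("decision tree accuracy:", tree.score(X_test, y_test))

# 'Black-box' model: an RBF-kernel SVM, usually harder to inspect directly.
svm = SVC(kernel="rbf").fit(X_train, y_train)
print("SVM accuracy:", svm.score(X_test, y_test))
```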

It is common to use ensembles of classification algorithms [55] to improve the predictive performance of the classification system. Ensemble methods combine the predictions of several ‘base’ classification models to output the final prediction of the ensemble. Ensembles tend to have better predictive performance than the base models but have the drawbacks of increased training and testing times and reduced interpretability [56]. Random forests (ensembles of a type of decision tree), in particular, are a popular approach in bioinformatics; they usually achieve high predictive performance [57] while remaining somewhat interpretable, offering a good compromise between predictive power and interpretability.
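
The sketch below (again on synthetic data, purely for illustration) shows this compromise with scikit-learn's random forest: predictive performance is estimated by cross-validation, and the impurity-based feature importances give a partial, but useful, window into which features the ensemble relied on.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a gene-by-annotation matrix (see the earlier sketches).
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(300, 25))
y = (X[:, 2] & X[:, 5]) | X[:, 9]

forest = RandomForestClassifier(n_estimators=500, random_state=0)
print("mean cross-validated accuracy:", cross_val_score(forest, X, y, cv=5).mean())

# Impurity-based feature importances: a rough indication of which annotations
# the forest relied on most (more faithful interpretation methods exist [28]).
forest.fit(X, y)
top_features = np.argsort(forest.feature_importances_)[::-1][:5]
print("most important feature indices:", top_features)
```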

For readers with no machine-learning expertise who are interested in more information about these (and other) machine-learning topics, we recommend the comprehensive book by Witten et al. [43], which covers these topics, providing an accessible theoretical basis and practical examples in the Java programming language using the popular WEKA software tool [58]. The newer scikit-learn software tool (https://scikit-learn.org/) is another option for readers interested in applying machine learning to their data using the Python programming language. The scikit-learn tool has several extensions that implement advanced machine-learning approaches and is arguably a better option for users looking for state-of-the-art algorithms.

When possible, we recommend testing a range of classification algorithms and hyper-parameter settings for the problem at hand. This can be done either manually, using expert knowledge, or automatically, using Automated Machine Learning approaches (Auto-ML) [59, 60]. In either case, it is important to compare the predictive performance of the models using statistical tests of significance, always being careful to apply the correct test and adjust the alpha (significance) values if multiple hypothesis comparisons are made [61]—in order to avoid the risk of unintentional p-hacking. Note that these statistical tests can be applied regardless of the underlying assumptions of the classification algorithms; the tests treat the models as ‘black boxes’ capable of making predictions.
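
A hedged sketch of this kind of comparison is shown below: two classifiers are evaluated with repeated stratified cross-validation, compared with a paired Wilcoxon signed-rank test, and the significance level is adjusted with a simple Bonferroni correction (the data, the choice of test and the assumed number of pairwise comparisons are all illustrative).

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic, noisy stand-in for a real gene classification data set.
rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(300, 25))
y = X[:, 1] | X[:, 4]
noise = rng.random(y.shape) < 0.2           # 20% label noise so fold scores differ
y = np.where(noise, 1 - y, y)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores_rf = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
scores_lr = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# Paired test over the matched folds; when several pairwise comparisons are made,
# the alpha level must be adjusted (here: Bonferroni for 3 hypothetical comparisons).
statistic, p_value = wilcoxon(scores_rf, scores_lr)
corrected_alpha = 0.05 / 3
print(f"p-value: {p_value:.4f}, corrected alpha: {corrected_alpha:.4f}")
```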

Choosing the right enrichment method is equally important. It is essential to consider carefully which method to use, checking its assumptions, rather than trying multiple approaches and choosing the one whose results ‘make sense’, which would lead to over-optimistic |$p$|-values (|$p$|-value ‘hacking’ [5, 6]). The first consideration is whether to use an approach requiring a seed gene set or an approach that tests all genes simultaneously based on a ranked list. Seed gene set–based approaches have been shown to perform better in many cases [62, 63] and so should be used when possible. However, the creation of a seed gene set is not always easy, and seed gene set–based approaches can be extremely sensitive to the thresholds used for inclusion in the seed gene set [64, 65]. Creating a seed gene set based purely on statistically significant expression changes, for instance, often requires setting arbitrary cut-off values. For example, when dealing with large sample sizes, ‘popular’ |$p$|-value cut-offs will lead to inflated seed gene sets, so fold-change cut-offs are also necessary. However, choosing a fold-change cut-off has its own problems, as genes with low mean expression and high expression variance may erroneously meet the cut-off (note that there are methods to alleviate this issue [66]). If there is not strong evidence behind a seed list, then consider using either a GSEA approach, which tests all genes simultaneously, or an MEA approach with multiple thresholds for inclusion in the seed gene set, treating enriched terms that overlap between the tests as likely true positives.
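
To make the mechanics of an MEA-style over-representation test concrete, the sketch below runs a one-sided Fisher's exact test per term against a background set and applies a Benjamini-Hochberg correction. All gene and term names are made up, and a real analysis would also have to respect the structure and confidence of the annotations, as discussed earlier.

```python
from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

# Made-up annotations: term -> set of annotated genes.
annotations = {
    "GO:0001": {"g1", "g2", "g3", "g7"},
    "GO:0002": {"g2", "g4", "g8", "g9", "g10"},
    "GO:0003": {"g1", "g3", "g5"},
}
background = {f"g{i}" for i in range(1, 21)}  # all genes considered in the study
seed = {"g1", "g2", "g3", "g5"}               # seed genes of interest

terms, p_values = [], []
for term, annotated in annotations.items():
    table = [
        [len(seed & annotated), len(seed - annotated)],
        [len((background - seed) & annotated), len((background - seed) - annotated)],
    ]
    _, p = fisher_exact(table, alternative="greater")  # over-representation only
    terms.append(term)
    p_values.append(p)

# Benjamini-Hochberg correction across all tested terms.
rejected, adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
for term, p, q, significant in zip(terms, p_values, adjusted, rejected):
    print(term, f"p={p:.3g}", f"q={q:.3g}", "enriched" if significant else "")
```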

Enrichment analysis is mainly descriptive (rather than predictive) in nature, and so the results should be interpreted as such. Being able to describe the characteristics of a seed list or ranked gene list is useful for understanding the mechanisms behind a response to a perturbation, drug treatment or disease; however, it is not sufficient evidence for the prioritization of candidate genes for further study. For this purpose, there is a wide range of further tools, ranging from guilt-by-association methods [13, 31, 67] to the machine-learning methods previously discussed. Combining enrichment analysis with predictive analysis tools is thus a powerful way to identify the biological response to a perturbation and subsequently identify potential novel candidates for manipulating that response.
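
As a deliberately naive illustration of the guilt-by-association idea (not a specific published method), the sketch below ranks candidate genes by the fraction of their network neighbours that belong to the seed set; the toy network is made up.

```python
# Made-up gene network: gene -> set of interaction/co-expression partners.
network = {
    "candidate1": {"seedA", "seedB", "gene1"},
    "candidate2": {"gene1", "gene2", "gene3"},
    "candidate3": {"seedA", "gene2"},
}
seed_genes = {"seedA", "seedB", "seedC"}

# Guilt-by-association score: fraction of a candidate's neighbours that are seed genes.
scores = {
    gene: len(neighbours & seed_genes) / len(neighbours)
    for gene, neighbours in network.items()
}
for gene, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(gene, round(score, 2))
```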

Overall, we reinforce that none of the approaches discussed here is the best for all problems. Nonetheless, we recommend the addition of machine-learning classification methods to the toolset of biologists when exploring their data. In addition, machine-learning principles (such as the concept of separate training and validation sets for predictive performance estimation; see the Glossary in Figure 1) should be considered when extracting candidate genes from the data. Finally, if enrichment methods are used, one should be aware of the limitations of the underlying statistical methods and of how to properly interpret the |$p$|-value statistics [1], which are not easy to fully grasp.

A summary of the above recommendations is provided in Table 3.

Key Points

  • If enrichment methods are used, the limitations of null hypothesis significance testing should be considered. Also, |$p$|-value statistics, which are not easy to fully grasp, should be properly interpreted.

  • We recommend the addition of machine-learning classification methods to the toolset of biologists when exploring their data.

  • No single machine-learning or enrichment approach is the best for all problems.

Funding

This work was supported by a Leverhulme Trust Research Grant (Ref. No. RPG-2016-015).

Fabio Fabris is a postdoctoral research associate applying data mining to ageing research. He completed his doctoral thesis on graphical models applied to ageing-related classification tasks under the supervision of Alex A. Freitas.

Daniel Palmer is a postdoctoral research associate currently working in the Integrative Genomics of Ageing Group at the University of Liverpool. He has previously studied the mechanisms of longevity in long-lived strains of Drosophila melanogaster.

João Pedro de Magalhães is a reader at the University of Liverpool where he leads the Integrative Genomics of Ageing Group (http://pcwww.liv.ac.uk/aging/). The group’s research integrates experimental and computational strategies to help decipher the human genome and how it regulates complex processes like ageing.

Alex Freitas is a Professor of Computational Intelligence at the University of Kent, UK. He has a PhD in Computer Science (1997) and a master’s degree (MPhil) in Biological Sciences (2011). His main research interests are machine learning and the biology of ageing.

References

1. Goodman S. A dirty dozen: twelve P-value misconceptions. Semin Hematol 2008;45(3):135–140.
2. Huang DW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 2009;37(1):1–13.
3. Gaudet P, Dessimoz C. Gene ontology: pitfalls, biases, and remedies. In: The Gene Ontology Handbook, Vol. 1446. Springer, 2017, 189–205.
4. Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol 2012;8(2):1–10.
5. Head ML, Holman L, Lanfear R, et al. The extent and consequences of p-hacking in science. PLoS Biol 2015;13(3):1–15.
6. Cumming G. The new statistics: why and how. Psychol Sci 2014;25(1):7–29.
7. Camacho DM, Collins KM, Powers RK, et al. Next-generation machine learning for biological networks. Cell 2018;173(7):1–12.
8. Libbrecht MW, Noble WS. Machine learning in genetics and genomics. Nat Rev Gen 2017;16(6):321–332.
9. Villavicencio-Diaz TN, Rodríguez-Ulloa A, Guirola-Cruz O, et al. Bioinformatics tools for the functional interpretation of quantitative proteomics results. Curr Top Med Chem 2014;14(3):435–449.
10. Yan J, Risacher SL, Shen L, et al. Network approaches to systems biology analysis of complex disease: integrative methods for multi-omics data. Brief Bioinform 2017;19(6):1370–1381.
11. Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 2005;102(43):15545–15550.
12. Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000;28(1):27–30.
13. Fabregat A, Jupe S, Matthews L, et al. The Reactome pathway knowledgebase. Nucleic Acids Res 2018;46(D1):D649–55.
14. Gama-Castro S, Salgado H, Santos-Zavaleta A, et al. RegulonDB version 9.0: high-level integration of gene regulation, coexpression, motif clustering and beyond. Nucleic Acids Res 2016;44:D133–D143.
15. Alhamdoosh M, Ng M, Wilson NJ, et al. Combining multiple tools outperforms individual methods in gene set enrichment analyses. Bioinformatics 2017;33(3):414–24.
16. Bayerlová M, Jung K, Kramer F, et al. Comparative study on gene set and pathway topology-based enrichment methods. BMC Bioinformatics 2015;16(1):334.
17. Dutta R, McDonough J, Yin X, et al. Mitochondrial dysfunction as a cause of axonal degeneration in multiple sclerosis patients. Ann Neurol 2006;59(3):478–89.
18. Zhang B, Wang J, Wang X, et al. Proteogenomic characterization of human colon and rectal cancer. Nature 2014;513(7518):382.
19. Baur JA, Pearson KJ, Price NL, et al. Resveratrol improves health and survival of mice on a high-calorie diet. Nature 2006;444(7117):337.
20. Lagouge M, Argmann C, Gerhart-Hines Z, et al. Resveratrol improves mitochondrial function and protects against metabolic disease by activating SIRT1 and PGC-1|$\alpha$|. Cell 2006;127(6):1109–22.
21. Schaub FX, Dhankani V, Berger AC, et al. Pan-cancer alterations of the MYC oncogene and its proximal network across the Cancer Genome Atlas. Cell Syst 2018;6(3):282–300.
22. Cheng H-W, Chen Y-F, Wong J-M, et al. Cancer cells increase endothelial cell tube formation and survival by activating the PI3K/Akt signalling pathway. J Exp Clin Cancer Res 2017;36(1):27.
23. Fabris F, de Magalhães JP, Freitas AA. A review of supervised machine learning applied to ageing research. Biogerontology 2017;18(2):171–188.
24. Jiang Y, Oron TR, Clark WT, et al. An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol 2016;17(184):1–70.
25. Ashburner M, Ball CA, Blake JA, et al. Gene ontology: tool for the unification of biology. Nat Genet 2000;25(1):25.
26. Chatr-Aryamontri A, Oughtred R, Boucher L, et al. The BioGRID interaction database: 2017 update. Nucleic Acids Res 2017;45(D1):D369–D379.
27. Szklarczyk D, Morris JH, Cook H, et al. The STRING database in 2017: quality-controlled protein–protein association networks, made broadly accessible. Nucleic Acids Res 2017;45(D1):D362–D368.
28. Fabris F, Doherty A, Palmer D, et al. A new approach for interpreting random forest models and its application to the biology of ageing. Bioinformatics 2018;34(14):2449–56.
29. Silla CN Jr, Freitas AA. Selecting different protein representations and classification algorithms in hierarchical protein function prediction. Intelligent Data Analysis 2011;15(6):979–99.
30. van Dam S, Craig T, de Magalhaes JP. GeneFriends: a human RNA-seq–based gene and transcript co-expression database. Nucleic Acids Res 2014;43(D1):D1124–D1132.
31. van Dam S, Võsa U, van der Graaf A, et al. Gene co-expression analysis for functional classification and gene–disease predictions. Brief Bioinform 2018;19(4):575–92.
32. Carithers LJ, Ardlie K, Barcus M, et al. A novel approach to high-quality postmortem tissue procurement: the GTEx project. Biopreserv Biobank 2015;13(5):311–319.
33. Koller D, Friedman N. Probabilistic Graphical Models: Principles and Techniques. Cambridge, MA: MIT Press, 2009.
34. Montavon G, Samek W, Müller K-R. Methods for interpreting and understanding deep neural networks. Digit Signal Process 2017;73:1–15.
35. Fernandes M, Wan C, Tacutu R, et al. Systematic analysis of the gerontome reveals links between aging and age-related diseases. Hum Mol Genet 2016;25(21):4804–18.
36. Freitas AA, Vasieva O, de Magalhães JP. A data mining approach for classifying DNA repair genes into ageing-related or non-ageing-related. BMC Genomics 2011;12(27):11.
37. Min S, Lee B, Yoon S. Deep learning in bioinformatics. Brief Bioinform 2017;18(5):851–69.
38. Goodfellow I, Bengio Y, Courville A. Deep Learning, Vol. 1. Cambridge: MIT Press, 2016.
39. Szalkai B, Grolmusz V. Near perfect protein multi-label classification with deep neural networks. Methods 2018;132:50–56.
40. Silla CN Jr, Freitas AA. A survey of hierarchical classification across different application domains. Data Min Knowl Discov 2011;44(1–2):31–72.
41. Kerepesi C, Daróczy B, Sturm Á, et al. Prediction and characterization of human ageing-related proteins by using machine learning. Sci Rep 2018;8(4094):13.
42. Freitas AA, Wieser DC, Apweiler R. On the importance of comprehensible classification models for protein function prediction. IEEE/ACM Trans Comput Biol Bioinform 2010;7(1):172–182.
43. Witten IH, Frank E, Hall MA, Pal CJ. Data Mining: Practical Machine Learning Tools and Techniques. Cambridge, MA: Morgan Kaufmann, 2016.
44. Haixiang G, Yijing L, Shang J, et al. Learning from class-imbalanced data: review of methods and applications. Expert Syst Appl 2017;73:220–39.
45. Wan C, Freitas AA, de Magalhães JP. Predicting the pro-longevity or anti-longevity effect of model organism genes with new hierarchical feature selection methods. IEEE/ACM Trans Comput Biol Bioinform 2015;12(2):262–275.
46. Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 2009;4(1):44–57.
47. Vêncio RZN, Koide T, Gomes SL, de B Pereira CA. BayGO: Bayesian analysis of ontology term enrichment in microarray data. BMC Bioinformatics 2006;7(86):1–11.
48. Zhang S, Cao J, Kong YM, et al. GO-Bayes: Gene Ontology–based overrepresentation analysis using a Bayesian approach. Bioinformatics 2010;26:905–911.
49. Bauer S, Gagneur J, Robinson PN. GOing Bayesian: model-based gene set analysis of genome-scale data. Nucleic Acids Res 2010;38:3523–3532.
50. Sass S, Buettner F, Mueller NS, et al. A modular framework for gene set analysis integrating multilevel omics data. Nucleic Acids Res 2013;41(21):9622–9633.
51. Wasserstein RL, Lazar NA. The ASA's statement on p-values: context, process, and purpose. Am Stat 2016;70(2):129–133.
52. Goeman JJ, Bühlmann P. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 2007;23(8):980–987.
53. Gold DL, Coombes KR, Wang J, et al. Enrichment analysis in high-throughput genomics – accounting for dependency in the NULL. Brief Bioinform 2006;8(2):71–77.
54. Durinck S, Spellman PT, Birney E, et al. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat Protoc 2009;4(8):1184.
55. Zhou Z-H. Ensemble Methods: Foundations and Algorithms. Boca Raton, FL: Chapman and Hall/CRC, 2012.
56. Yang P, Hwa Yang Y, Zhou BB, et al. A review of ensemble methods in bioinformatics. Curr Bioinform 2010;5(4):296–308.
57. Zhang C, Liu C, Zhang X, et al. An up-to-date comparison of state-of-the-art classification algorithms. Expert Syst Appl 2017;82:128–150.
58. Hall M, Frank E, Holmes G, et al. The WEKA data mining software: an update. SIGKDD Explor 2009;11(1):10–8.
59. Feurer M, Klein A, Eggensperger K, et al. Efficient and robust automated machine learning. In: Advances in Neural Information Processing Systems 28, 2015, 2962–70.
60. Kotthoff L, Thornton C, Hoos HH, et al. Auto-WEKA 2.0: automatic model selection and hyperparameter optimization in WEKA. J Mach Learn Res 2017;18(25):1–5.
61. Japkowicz N, Shah M. Evaluating Learning Algorithms: A Classification Perspective. Cambridge, UK: Cambridge University Press, 2011.
62. Narise T, Sakurai N, Obayashi T, et al. Co-expressed pathways database for tomato: a database to predict pathways relevant to a query gene. BMC Genomics 2017;18(1):437.
63. Ackermann M, Strimmer K. A general modular framework for gene set enrichment analysis. BMC Bioinformatics 2009;10(1):47.
64. Pan K-H, Lih C-J, Cohen SN. Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proc Natl Acad Sci U S A 2005;102(25):8961–8965.
65. Liu Z, Li X, Yuan Y-C, et al. Comprehensive comparison of gene set analysis tools. In: Proceedings of the International Conference on Bioinformatics & Computational Biology (BIOCOMP), 2011, 4.
66. Mutch DM, Berger A, Mansourian R, et al. The limit fold change model: a practical approach for selecting differentially expressed genes from microarray data. BMC Bioinformatics 2002;3(1):17.
67. van Dam S, Craig T, de Magalhaes JP. GeneFriends: a human RNA-seq-based gene and transcript co-expression database. Nucleic Acids Res 2014;43(D1):D1124–D1132.
