Yves Sucaet, Taru Deva, Evolution and applications of plant pathway resources and databases, Briefings in Bioinformatics, Volume 12, Issue 5, September 2011, Pages 530–544, https://doi.org/10.1093/bib/bbq083
Abstract
Plants are important sources of food and plant products are essential for modern human life. Plants are increasingly gaining importance as drug and fuel resources, bioremediation tools and as tools for recombinant technology. Considering these applications, database infrastructure for plant model systems deserves much more attention. Study of plant biological pathways, the interconnection between these pathways and plant systems biology on the whole has in general lagged behind human systems biology. In this article we review plant pathway databases and the resources that are currently available. We lay out trends and challenges in the ongoing efforts to integrate plant pathway databases and the applications of database integration. We also discuss how progress in non-plant communities can serve as an example for the improvement of the plant pathway database landscape and thereby allow quantitative modeling of plant biosystems. We propose Good Database Practice as a possible model for collaboration and to ease future integration efforts.
INTRODUCTION
A biological pathway is a programmed sequence of molecular events in a cell. This chain of events executes a particular cellular function or brings about a specific biological effect. Knowledge of an organism’s pathways is essential to understand a biological system at different levels, from simple metabolism to complex regulatory reactions. Many pathways are complex and hierarchical and are themselves interconnected to form, to participate in, or to regulate a network of events. Over the last couple of decades, there has been an exponential increase in the information on these pathways, their components and their functions [1]. This stems from the biotechnological advancements in genomics and proteomics and high throughput technologies like microarray and two-hybrid screens. For numerous species, this has increased our knowledge about normal pathways as well as rogue/aberrant pathways that lead to a variety of diseases. Examples include pathways that lead to cancer [2] or pathways that lead to aberrant leaf development in plants [3]. Production of large amounts of data necessitates the creation of pathway databases and repositories, where information about the pathways along with their molecular components and reactions is stored. These data sets often become data-sources in their own right, and are shared with the public, explaining in part the large number of databases that exist today [1].
Simultaneously, technological advancements that allow access to and discovery of novel pathway information have resulted in the creation of many more pathway databases [1] that target different organisms, processes and mechanisms. Availability of such vast amounts of information in an ordered format has led us to ask new questions. Ideker and colleagues [4] have raised questions pertinent to evolutionary and comparative biology, e.g. ‘considering that the protein sequences and structures are conserved, could the protein-interaction networks be conserved as well? Is there a minimal set of pathways that is required by all living organisms? Can the evolutionary distance be measured at the network connectivity level rather than at the DNA or protein level?’ Answers to these and other questions will lead to an increased understanding of living systems, which in turn may result in more questions, at other levels, that are currently unimaginable. Information aggregated from different pathway databases is often more useful than information from individual databases. Integration of information from various pathway databases can be used to reveal novel information about a system.
Information from pathway databases has been used for different purposes. Information analysis and data mining holds the potential for discovery of orthologous/analogous pathways and pathway components in other related organisms [5]. For example, organisms which are difficult to cultivate in vitro and therefore are less amenable to laboratory studies could be examined in silico through a study of orthologs. Iterative expansion of pathway data can be utilized to build models of biological mechanisms based on the hypotheses derived from these initial data; see Bumgarner and Yeung [6] for a recent review. Models can (and should) in turn generate experimentally verifiable predictions.
Pathway database analysis can be used to find patterns in the pathways that are related to a disease [7] and aid in the identification of new drug targets [8]. Another idea is targeted drug discovery by screening the complete pathway as compared to a single pathway component [9]. Pathway analysis can also be used to identify molecular switches that lead to disease and to efficiently turn them off to silence them without affecting the rest of the system. A recent study on riboswitches illustrates how one can reengineer components of a pathway to control expression of multiple genes [10].
Compared to the exponential increase in human/animal pathway databases, development of plant pathway databases has been modest, and fewer applications have resulted. Plant pathway databases have remained relatively under-utilized. This gap is all the more concerning considering that plants are important as food crops and as sources of fiber and plant-based fuel. Examples from non-plant resources and their applications can serve as inspiration for plant scientists who wish to control pathways, for instance, to produce crops with longer shelf life or to enhance immunity to plant pathogens.
In this review, we provide an overview of existing plant pathway databases and look at current progress, at how the information contained in the databases has been used in the past and at how it can be used in the future. We use examples from the existing plant pathway databases to showcase the potential of database integration. Non-plant integration applications are discussed to suggest future potential. Finally, we discuss how already existing information can be further enriched, organized and utilized for practical applications. We also highlight the acute need for robust, long-term, user-friendly and interactive databases.
The pathway database landscape
Pathguide [1], an online pathway resource meta-database, provides an overview of more than 300 biological pathway resources that have been developed to date. These include pathway databases, tools for data analysis, visualization and data extrapolation and other (peripheral) databases that can be linked with pathway databases to provide additional information. Some databases are specific to a particular organism, e.g. AraCyc [11] deals with the metabolic pathways of Arabidopsis thaliana. Some pathway databases are specific to a certain disorder or disease, e.g. the Human Cancer Protein Interaction Network (HCPIN) [12]; others contain information about a certain system in an organism, e.g. InnateDB [13], a repository for pathways involved in the innate immune system of humans and mice.
Plant pathway databases, when compared to human pathway databases, are fewer in number (Figure 1) and much less diverse. There is an increasing awareness about the importance of plants as food crops, but it appears that only limited resources have been devoted to uncovering and understanding plant pathways. A comparison of the number of genomes sequenced to date for mammals and higher plants (Figure 2) shows that plants receive less attention from the sequencing community when compared to other organisms. The absolute numbers differ between the databases (some sites are kept more current than others), but the trend remains the same. There are many biologically, medically and economically important plants that differ in their physiology. In addition, secondary metabolism is important from a pharmacological point of view. Therefore, there is a need for many more genomes to be sequenced, proteomes to be studied and pathways to be uncovered for the optimal utilization of plants. While lower numbers of genome sequencing data do not completely explain the lack of pathway databases, they certainly contribute to it.

Figure 1: Pathway resources with plants and humans annotated as major organisms, out of the 328 resources available in Pathguide. Inclusive: databases containing several other major organisms apart from plant or human; dedicated: databases dedicated to plants or humans; other: databases for other organisms, or databases for numerous organisms which may also include human and plant information, and pathway tools. Numbers indicate the actual number of resources available for each category in Pathguide.

Figure 2: A comparison of genomes sequenced for mammals and higher plants. Data from the NCBI Genome Database (http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html), the Genome Pages at EBI (http://www.ebi.ac.uk/genomes/eukaryota.html) and the GOLD database (http://www.genomesonline.org/cgi-bin/GOLD/bin/gold.cgi?page_requested=Complete+Published) are compared. Numbers in the bars indicate the number of genomes sequenced.
Most plant pathway databases contain information on the networks in their own right, e.g. metabolic or regulatory networks in A. thaliana or soybean. However, there are no specialized databases yet that deal with pathways for plant immunity, plant growth or for controlling the size of plant organs.
For the purpose of this review, pathway databases are broadly classified into four types: metabolic pathways, gene regulatory networks, protein–protein interaction networks, and signaling pathways.
‘Metabolic pathways’ are the earliest discovered and best studied pathways. Metabolic pathways are represented by a series of enzymatic reactions that take place at the level of small molecules. These have been elaborated and characterized for many organisms. Table 1 presents an overview of available metabolic pathway databases dedicated to different plant species and the sites that host them. Metabolic pathway databases like MetaCyc [14] contain experimentally verified metabolic pathways and enzyme information for more than 2000 organisms and can be used to predict orthologous pathways in another organism for which the genome has been sequenced and annotated. A dedicated portal for plant metabolic pathway databases is SolCyc (available at http://solcyc.solgenomics.net/). SolCyc is a Pathway Tools-based (and thus MetaCyc-inferred) pathway genome database (PGDB) currently containing small molecule metabolism data for five plants belonging to the family Solanaceae: tomato, potato, tobacco, pepper and petunia.
The pathways section of Gramene database [15] (a database for grasses such as rice, maize, sorghum, barley, oats, wheat and rye) contains the known and predicted biochemical pathways of rice (RiceCyc) and sorghum (SorghumCyc), both of which are curated by the Gramene database and were built using the Pathway Tools’ PathoLogic module. The website also mirrors the known and predicted biochemical pathways from SolCyc, AraCyc, EcoCyc and the MetaCyc reference databases.
The ‘gold standard’ AraCyc for A. thaliana was built using the Pathway Tools' PathoLogic module with MetaCyc. AraCyc, in addition, uses manual curation to enrich its data. The trade-off is slower progress in completing the network, yet the end result is highly documented and has a more accurate structure. One can argue that databases are of higher quality when domain experts scrutinize the available literature and manually curate them. They can add their scientific experience and intuition to find facts in a way that no algorithm can yet mimic. However, this all depends on the availability of such experts, and for genome-wide projects it is certainly challenging to gather all the experts potentially involved.
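The automated inference step that precedes manual curation can be pictured with a deliberately simplified sketch. It is not the PathoLogic algorithm: it only assumes that a reference pathway is a set of EC numbers and that a pathway is predicted when a chosen fraction of those enzymes is annotated in the target genome. The pathway names, the `min_coverage` threshold and the annotation set are hypothetical.

```python
# Toy reference pathways, each described by the EC numbers of its enzymes.
# The groupings are illustrative and are not taken from MetaCyc.
REFERENCE_PATHWAYS = {
    "toy_upper_glycolysis": {"2.7.1.1", "5.3.1.9", "2.7.1.11"},
    "toy_two_step": {"1.1.1.1", "4.1.1.1"},
}


def predict_pathways(annotated_ec, reference, min_coverage=0.6):
    """Return pathways whose enzyme coverage reaches min_coverage.

    annotated_ec: set of EC numbers found in the annotated genome.
    reference:    mapping of pathway name -> set of required EC numbers.
    """
    predictions = {}
    for name, required in reference.items():
        coverage = len(required & annotated_ec) / len(required)
        if coverage >= min_coverage:
            predictions[name] = round(coverage, 2)
    return predictions


# Hypothetical annotation of a newly sequenced genome.
genome_ec = {"2.7.1.1", "2.7.1.11", "1.1.1.1"}
print(predict_pathways(genome_ec, REFERENCE_PATHWAYS))
```

A prediction of this kind says nothing about which missing enzymes are true gaps and which are annotation errors, which is where the manual curation described above adds value.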
The success of AraCyc has led to a broader plant-centric rather than organism-centric initiative, the Plant Metabolic Network (PMN) (available at http://www.plantcyc.org/). This is a collaborative project to build a broad network of plant metabolic pathway databases. PlantCyc, which incorporates some data from MetaCyc, is the central feature of PMN. It is a database containing manually curated or reviewed information about shared metabolic pathways present in more than 300 plant species. PlantCyc serves as a reference database, while PMN also contains single species/taxon based databases. Additionally, PMN has a small number of pathways that are known to be present in other organisms and are predicted to exist in plants.
‘Gene regulatory networks’ consist of transcription factors and the genes that they regulate. These networks comprise protein–DNA interactions and may also include sRNA/miRNA and sRNA/miRNA target gene regulation. A regulatory network is formed by a series of events where regulation of one gene leads to the control of another. An example of a regulatory network database is the Arabidopsis Gene Regulatory Information Server (AGRIS) [16], which contains information on the transcription factors and cis-regulatory elements that are regulated by them in A. thaliana. AGRIS presently consists of three databases: AtcisDB, AtTFDB and AtRegNet. AtcisDB contains upstream regions of annotated A. thaliana genes and describes the experimentally validated and predicted cis-regulatory elements. AtTFDB holds information on the transcription factors grouped into 50 conserved domain families. AtRegNet describes direct interactions between transcription factors and target genes. AGRIS also contains a Regulatory Networks Interaction Module (ReIN), which allows creation, visualization and identification of regulatory networks in A. thaliana. While AGRIS contains data from sequence annotations, TRANSFAC [17] is a gene regulatory network database that contains data on transcription factors, their experimentally proven binding sites and the genes they regulate in 300 species. TRANSFAC is one of the few proprietary plant database resources in Pathguide.
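The notion that regulation of one gene leads to the control of another can be sketched as a traversal of a directed graph. The example below is a minimal illustration, assuming that direct interactions are stored as transcription factor to target edges; the gene names are hypothetical placeholders and are not records from AtRegNet.

```python
from collections import deque

# Hypothetical direct interactions: regulator -> list of target genes.
REGULATES = {
    "TF_A": ["TF_B", "GENE_1"],
    "TF_B": ["GENE_2", "GENE_3"],
    "GENE_1": [],
}


def downstream_genes(network, start):
    """Return every gene reachable from `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        gene = queue.popleft()
        for target in network.get(gene, []):
            if target not in seen:  # guards against feedback loops
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


print(downstream_genes(REGULATES, "TF_A"))
# prints ['TF_B', 'GENE_1', 'GENE_2', 'GENE_3']
```

The `seen` set matters in practice because regulatory networks often contain feedback loops, which would otherwise make the traversal run forever.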
PlantCARE [18] is a database of plant cis-acting regulatory elements where the data on the transcription sites are extracted from literature supplemented with predicted data. PlantCARE provides levels of confidence for experimental evidence, functional information and position of the promoter. Additionally, a plant DNA query sequence can be searched for cis-regulatory elements using a query tool in PlantCARE.
PlantTFDB [19] is a recently constructed database that contains transcription factors from 49 plant species, grouped into 58 families. Each transcription factor is comprehensively annotated with respect to functional domains, 3D structures, gene ontology, gene expression information from expressed sequence tags (ESTs) and microarrays and annotations from other databases.
AthaMap [20] is a genome-wide map of published or experimentally determined transcription factor binding sites (TFBS) in A. thaliana. It also includes predicted sites. AthaMap allows searching for a genomic sequence or a gene to display the potential TFBS. It also provides search functionality for user defined potential co-localization elements. Genes of interest can be analyzed for identification of common TFBSs. Conversely, genes that harbor specific TFBS can also be identified using AthaMap.
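Searching a query sequence for binding sites can be illustrated with an exact-match scan of both strands. This is only a sketch: resources such as AthaMap and PlantCARE rely on curated motif collections and more tolerant matching, and the four-base motif and the query sequence used here are placeholders.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(sequence, motif):
    """Return (position, strand) for each exact motif match.

    Positions are 0-based and refer to the forward strand. For a
    palindromic motif only the '+' hit is reported at each position.
    """
    sequence = sequence.upper()
    motif = motif.upper()
    rc_motif = reverse_complement(motif)
    hits = []
    for i in range(len(sequence) - len(motif) + 1):
        window = sequence[i:i + len(motif)]
        if window == motif:
            hits.append((i, "+"))
        elif window == rc_motif:
            hits.append((i, "-"))
    return hits


# Placeholder query sequence and motif, for illustration only.
print(find_sites("AATGACCCGTCAT", "TGAC"))
# prints [(2, '+'), (8, '-')]
```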
Gene co-expression network databases for plants are under development. Such databases contain information on co-expression of genes after examining a large number of experimental conditions. These can be used for identification of genes involved in a certain function, identification of cis-regulatory elements, construction of regulatory networks (although co-expression does not necessarily mean co-regulation [21]) and assist in many other biological problems. Some examples of gene co-expression networks and their applications are discussed in the Supplementary Data.
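A co-expression network of this kind can be sketched by correlating expression profiles across conditions and keeping gene pairs above a threshold. The example assumes Pearson correlation and a cut-off of 0.9, both of which are illustrative choices; the expression values are invented.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Invented expression values for four genes across four conditions.
EXPRESSION = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.0, 4.0, 6.0, 8.0],
    "gene_c": [4.0, 3.0, 2.0, 1.0],
    "gene_d": [1.0, 3.0, 2.0, 3.0],
}


def coexpression_edges(expr, threshold=0.9):
    """Return (gene1, gene2, r) for positively correlated gene pairs.

    An edge only records similar expression; as noted in the text,
    it does not by itself demonstrate co-regulation.
    """
    edges = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if r >= threshold:
            edges.append((g1, g2, round(r, 3)))
    return edges


print(coexpression_edges(EXPRESSION))
```

With these values only `gene_a` and `gene_b` are linked; `gene_c` is perfectly anti-correlated with both and is excluded because the sketch keeps positive correlations only.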
‘Protein–protein interaction pathways’ contain all interactions, stable or transient, between same or different proteins that are important for the functioning of a cell. Protein–protein interactions take place during protein modification, protein transport, protein oligomerization for activity/non-activity, chaperone assisted protein folding, signal transduction, etc. Protein–protein interaction pathways contain information on all these interactions. The A. thaliana protein interactome database (AtPID) is one such database [22]. It contains protein interaction pairs found through manual text mining or in silico predictions using various bioinformatics methods, along with protein pairs that have been confirmed.
It is now recognized that the experiments required to generate protein interaction data (e.g. yeast-two-hybrid systems) often give false positives as well as false negatives and hence it is important to use this type of data with caution. To discern whether a certain result is reliable, one needs to know the type of experiment and the conditions used, as well as details about the results. A rational assessment as to whether an interaction is truly possible in vivo can be made based on a variety of factors, including the domains involved in interaction and the type of interaction. The IntAct database [23], which contains protein–protein interaction information on several organisms including plant systems, includes such high level details.
Another database, the Predicted Arabidopsis Interactome Resource (PAIR)[24], predicts the potential interactions in A. thaliana using a support vector machine (SVM) model (a machine learning approach) and careful preparation of example data, selection of indirect evidence and a tight control of false positives. We believe that the PAIR database is currently the most accurate and comprehensive database on A. thaliana protein–protein interactions.
Combining interaction data generated through experimental and predictive methods increases the coverage of an interactome and can lead to more reliable information. When the same data is obtained through different methods one can reasonably expect more accurate data. STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) [25] is a multi-organism (not limited to the kingdom Plantae) database that includes all available protein–protein interactions. It scores and weighs this information and augments it with predicted interactions and automated text-mining results. STRING includes both physical and functional information on the interactions. This adds an extra measure of reliability to the interaction data.
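Why agreement between methods raises confidence can be shown with a small calculation. The sketch below assumes that each evidence channel yields an independent probability that an interaction is real and combines them as one minus the product of the individual doubts. This is an illustrative rule and is not presented here as the exact scoring scheme of STRING.

```python
def combine_scores(scores):
    """Combine independent evidence probabilities for one interaction.

    Each score is the probability that the interaction is real according
    to one method. Independence between methods is an assumption.
    """
    remaining_doubt = 1.0
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must lie between 0 and 1")
        remaining_doubt *= 1.0 - score
    return 1.0 - remaining_doubt


# Two moderately reliable methods that agree give a higher combined score.
print(combine_scores([0.5, 0.5]))
# prints 0.75
```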
‘Signaling pathways’ comprise molecular networks in the signal transduction cascade. These are involved in transmission of information from one part of the cell to another (intracellular, e.g. from the cytoplasm to the nucleus) or from one cell to another (intercellular, e.g. from one neuron to another). Extracellular stimuli can also bring about the activation or inhibition of a pathway and thus a change in the cellular environment. Signaling pathways often involve protein–protein interactions at different levels like protein modification (e.g. protein phosphorylation), protein translocation and protein complex formation or dissociation. Several signaling pathway databases, for example SPIKE [26], exist for non-plant eukaryotes. INOH (hosted at http://www.inoh.org/) is a signaling pathway database for Drosophila melanogaster. SignaLink (hosted at http://signalink.org/) is a cross-species database that includes pathways from human, D. melanogaster and Caenorhabditis elegans. In contrast, few plant signaling pathway databases exist, and they lack the quality and efficiency of their non-plant counterparts. The DRASTIC [27] database (a database resource for analysis of signal transduction in cells), developed by the Scottish Crop Research Institute (SCRI), was one of the first relational databases in this area. It included ESTs and genes regulated in response to various environmental factors like pathogens, chemical exposure, drought, salt and low temperature. The data were collected from refereed journals. However, this reference resource is no longer available.
Recently, a stress response transcription factor database, STIFDB [28], has been created for A. thaliana. It contains the abiotic stress response genes that were found upregulated in microarray experiments, with options to identify possible transcription factor binding sites. PathoPlant [29, 30] is another relational database that contains components of signal transduction pathways related to plant pathogenesis. It also contains microarray data of genes expressed in response to pathogens.
There is a glaring need for plant signaling pathway databases that contain and regularly update all proven and potential/putative signaling pathways in plants as these are discovered. MAPK signaling cascades were discovered more than 15 years ago in plants [31]. Analogues of pathways that were only known in animals are now being found as well. For example, glutamate receptors (iGluRs) that are involved in excitatory neurotransmission pathways have been extensively studied in the animal kingdom and have been included in several pathway databases. Glutamate receptor-like proteins (GLRs) were reported in 1998 in A. thaliana [32]. Since then, transgenic plant studies and pharmacological studies have suggested that these proteins, in A. thaliana and other plants, are involved in a wide array of pathways. Suggested functions include Ca2+ allocation [33], carbon/nitrogen sensing [34], regulation of abscisic acid and water balance [35], coordinating mitosis in root apical meristem [36], light signal transduction [37] and resistance to fungal infections [38]. Both MAPKs and glutamate receptor-like proteins from A. thaliana are included in a few plant pathway databases like AtPID. However, it is difficult for a biologist looking for pathways involved in resistance to fungal infections, for example, to come across the glutamate receptor-like system immediately or, conversely, to find all the plant pathways that glutamate receptor-like proteins are involved in by using a keyword. Dedicated signaling pathway databases would be essential to ‘de-specialize’ information and make it available to a wider range of scientists. This also highlights the need for such databases to be freely available, so that biologists with an interest in a particular pathway can retrieve all the relevant information available, irrespective of the system/field that they work with (plant, animal, microbial and so on).
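The two-way lookup described above (from a pathway to its components, and from a component to every pathway it takes part in) can be sketched with an inverted index. The records below are illustrative placeholders: the GLR entries loosely follow the functions suggested in the text, whereas the MAPK memberships and the drought entry are hypothetical and are not database content.

```python
from collections import defaultdict

# Illustrative pathway -> component records (placeholders, not curated data).
PATHWAY_COMPONENTS = {
    "resistance to fungal infection": ["GLR", "MAPK"],
    "light signal transduction": ["GLR"],
    "carbon/nitrogen sensing": ["GLR"],
    "hypothetical drought response": ["MAPK"],
}


def build_component_index(pathways):
    """Invert pathway -> components into component -> set of pathways."""
    index = defaultdict(set)
    for pathway, components in pathways.items():
        for component in components:
            index[component].add(pathway)
    return index


INDEX = build_component_index(PATHWAY_COMPONENTS)

# From a component to all pathways it participates in.
print(sorted(INDEX["GLR"]))
# From a pathway to its components.
print(PATHWAY_COMPONENTS["resistance to fungal infection"])
```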
Signaling pathway mechanisms like sugar signaling [39], light signaling [40], jasmonate signaling [41] and their components have been discovered in plants and call for dedicated pathway databases. Looking at the signaling pathways and the properties that these affect in plants, it can be concluded that these pathways cross-connect. It is important to understand these pathways and to integrate this information with other databases in order to obtain a more complete picture which would then enable plant scientists to modulate certain plant properties without affecting other mechanisms and pathways.
Pathway visualization tools
Visualization of pathway data is important not only to understand the data, but also to analyze and to build valid hypotheses based on these data. To address these requirements, many pathway/network visualization tools have been constructed with different functionalities. The level of visualization that these tools offer ranges from simple two-dimensional pathway maps like those provided by KEGG to three-dimensional and hierarchical visualizations in immersive virtual reality (C6) environments like those provided by MetNetGE [42]. Interactive visualization allows users to analyze, edit and modify the pathways based on their own experimental data, as is provided by GenMAPP [43]. Gehlenborg et al. [44] have thoroughly reviewed available pathway visualization tools and have broadly divided them into two partly overlapping categories: tools focused on automated methods for interpreting and exploring large biological networks, and tools focused on assembly and curation of pathways. Many of these tools integrate with public databases, allowing the users to analyze and visualize their own data. Another exhaustive overview of visualization tools has been presented by Suderman and Hallett [45]. For a critical evaluation of the requirements for biological visualization tools based on interviews conducted to understand the needs for pathway analysis, see ref. [46].
Pathway database evolution through integration
An individual pathway database holds a variety of information. This has proved to be challenging for scientists who want to access and use this information. Information is scattered across various databases that differ not only in the type of data they contain, but also in the form in which the data exist. Additionally, in an actual living cell, the pathways are vastly interconnected. Integration of pathway databases thus becomes imperative in order to understand a biological mechanism in its entirety. Researchers interested in a particular biological mechanism should be able to easily find and access all the data they need, without having to go through the difficult process of sifting data from different databases that are based on different platforms.
One of the biggest challenges to the integration of databases is their diversity. The existing databases have syntactic differences in the form of data file formats and retrieval methods and semantic differences in the terminologies and data models [47]. Several pathway database resources listed in Pathguide are not machine-readable. Machine-readability is an essential requirement for automatic data retrieval and processing. Recognition of these challenges has demanded increased efforts to establish pathway ontology standards for defining models. Systems Biology Markup Language (SBML) has presented itself as one such standard for storing and sharing of computational models of biological networks [48]. Another, named BioPAX [49], was developed for detailed pathway depiction and for permitting data exchange, as used in the development of MetNet [50]. PSI-MI [51] allows data exchange for protein–protein interactions, while CellML [52] enables storage and exchange of computer-based mathematical models. Other data exchange formats exist that are peripherally associated with network data and can certainly serve as input for other software packages that determine such networks. The Chemical Markup Language (CML) can be used to describe small molecules and ligands that participate in networks [53], whereas the Protein Markup Language (ProML), along with its predecessor PDB, can be used to characterize larger binding partners [54]. The Microarray Gene Expression Markup Language (MAGE-ML) can be used as input to determine gene co-expression networks under various conditions [55]. The Ondex eXchange Language (OXL) format claims superiority over a range of formats [56], but is more general and requires more coding to implement correctly. Finally, an Application Programming Interface (API) can be provided [57], but each API applies to only one particular database and its peculiarities require some study as well.
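What machine-readability buys in practice can be illustrated by reading reactions from a small SBML-style document with a standard XML parser. The sketch below uses only the Python standard library and a hand-written toy model; the model content and identifiers are invented, and a dedicated SBML library would be the more robust choice for real files because it also validates them.

```python
import xml.etree.ElementTree as ET

# A toy model in SBML Level 2 Version 4 syntax (content is invented).
SBML_TEXT = """<sbml xmlns="http://www.sbml.org/sbml/level2/version4" level="2" version="4">
  <model id="toy_pathway">
    <listOfCompartments>
      <compartment id="cytosol"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="glucose" compartment="cytosol"/>
      <species id="g6p" compartment="cytosol"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="hexokinase_step">
        <listOfReactants><speciesReference species="glucose"/></listOfReactants>
        <listOfProducts><speciesReference species="g6p"/></listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""

# Namespace prefix used in the search paths below.
NS = {"sbml": "http://www.sbml.org/sbml/level2/version4"}


def read_reactions(sbml_text):
    """Return {reaction id: (reactant ids, product ids)} from SBML text."""
    root = ET.fromstring(sbml_text)
    result = {}
    for rxn in root.findall(".//sbml:reaction", NS):
        reactants = [ref.get("species") for ref in
                     rxn.findall("sbml:listOfReactants/sbml:speciesReference", NS)]
        products = [ref.get("species") for ref in
                    rxn.findall("sbml:listOfProducts/sbml:speciesReference", NS)]
        result[rxn.get("id")] = (reactants, products)
    return result


print(read_reactions(SBML_TEXT))
# prints {'hexokinase_step': (['glucose'], ['g6p'])}
```

The same few lines would not work on a pathway stored as a static drawing, which is the practical meaning of the machine-readability requirement discussed above.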
Providing an easy-to-use interface for end-users is challenging with formats that allow too many options. All standards are now being used by at least some pathway databases and are certainly steps in the right direction. While laudable efforts in their own right, the proliferation of different data formats creates its own problems: providers need to decide which formats to support and each format represents a laborious and resource-intensive effort. Therefore, many times data formats still need to be converted from one format into another [58].
Ongoing efforts to automate data access and retrieval make the process much simpler for a biologist. KEGG [59] is a comprehensive resource for metabolic pathways. Its data were originally curated manually from the literature, and the pathways existed as simple drawings. All pathway maps in KEGG have since been redrawn using KegSketch. The resulting KGML+ files [60] are machine-readable and editable.
Plant pathway database integration is a challenge as far fewer plant genomes have been sequenced compared to other life forms (which makes it more difficult to base inferences on homology) and the data resources on plant pathways are more dispersed [61]. The uniqueness of secondary metabolism that exists in many systems adds another layer of complexity. It is, therefore, even more important for plant pathway databases to start incorporating and supporting already existing standard formats for better integration of information and knowledge extraction. The positive side of having a limited number of plant pathway databases is that standardization needs to be applied to a smaller number of pathways. This entails less work than what would be required in other settings.
Supplementary Table S1 shows plant database resources available to date with a short description and other information like the availability of these databases, included organisms, whether the database is included in Pathguide, access to the database, data sources and standard formats supported (if any). As can be seen from Figures 1 and 2, Table 1 and Supplementary Table S1, plant databases are still far from being overwhelmed by the volume and diversity of information. This makes their standardization and implementation efforts much more realistic than for other systems. Furthermore, this in itself can pave the way for other systems to follow suit by learning from the successes and challenges of plant pathway database integration projects. It would therefore be a tremendously useful exercise for all upcoming plant pathway databases to follow universal standards right from their conception. Perhaps journals should only accept the publication of databases that conform to what we term Good Databasing Practice (GDbP) standards (Table 2), thereby making these standard practice. Such practices have already been incorporated for microarray and sequencing results.
Good databasing practice | Usefulness |
---|---|
Easy user access | Easy access even for non-specialists |
Integrated visualization tools | Easier understanding and analysis of large data sets |
Standard ontology | Ease of data exchange |
Possibility to integrate data from other databases | Expansion of available information |
Proper documentation of stored data; provision of source and reliability of original data | Possibility to get back to the original source if required; enables judgment of the accuracy of inferred information |
Provision of risk factors and probability of error propagation when deriving orthologs in another species | Using particular data with caution when inferring a pathway or an ortholog |
Good user support | Good response time to user queries |
Regular update/maintenance | Updated information; removal of errors and bugs |
Regular and professional data curation and annotation | Manual curation of the data/annotations to remove errors generated by automatic data retrieval; annotations, both derived from source and inferred, help describe an entity or an event |
Applications of pathway database integration
Pathway database integration yields many potential advantages for the biologist and software developer alike. If successful, numerous applications will follow, many of which will be surprising or even unthinkable today. To better appreciate the potential of integration, a few case studies from other fields are presented.
One study [62] integrated data from three metabolic pathways—fatty acid synthesis genes from Arabidopsis Lipid Gene Database [63] (http://lipids.plantbiology.msu.edu/), starch metabolism genes from Starch Metabolism Network project (http://www.starchmetnet.org/) and the original references for leucine catabolism—with transcriptomics data, leading to a picture that no individual study was able to show by itself. The integration revealed that each of these pathways is structured as a co-expressed module with the possibility that these modules exist in a hierarchical organization. The transcripts from each module co-accumulate over a wide range of environmental and genetic perturbations and developmental stages.
In another case study [61], A. thaliana pathways from protein interaction databases were integrated with co-expression data using the Ondex system (http://www.ondex.org/). This method enabled the determination of co-expression of the interacting protein partners and the levels of expression.
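The underlying idea, checking whether interacting protein partners are also co-expressed, can be illustrated with a minimal sketch. This is not the Ondex implementation: the gene identifiers, expression values, the 0.8 correlation threshold and the function names are all hypothetical and chosen purely for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    denom = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        return 0.0  # flat profile: correlation is undefined, reported as 0 here
    return sum(a * b for a, b in zip(dx, dy)) / denom

def coexpressed_interactions(interactions, expression, threshold=0.8):
    """Return (gene_a, gene_b, r) for interacting pairs with r >= threshold.

    Pairs for which either partner lacks expression data are skipped.
    """
    result = []
    for a, b in interactions:
        if a in expression and b in expression:
            r = pearson(expression[a], expression[b])
            if r >= threshold:
                result.append((a, b, r))
    return result

# Toy data: hypothetical gene identifiers and made-up expression values
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
interactions = [("geneA", "geneB"), ("geneA", "geneC"), ("geneB", "geneD")]
supported = coexpressed_interactions(interactions, expression)
```

In this toy example only the geneA/geneB pair is retained: geneA and geneC are anti-correlated, and geneD has no expression profile. A real analysis would also need to account for multiple testing and for the choice of expression compendium.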
An interesting example of using database integration to obtain enhanced information about a system is AraGEM [64]. AraGEM is an attempt at building a genome-scale reconstruction of the primary metabolic network in A. thaliana. It used A. thaliana metabolic genome information from KEGG as a core, enriched with information on the cellular compartmentalization of metabolic pathways from the literature and from databases such as AraPerox [65] and The Arabidopsis Information Resource (TAIR) [66], among others. A total of 75 essential primary metabolism reactions were identified for which genetic information was unknown. The resulting genome-scale model was then used to construct a metabolic flux model of plant metabolism representing both photosynthetic and non-photosynthetic cell types. The model was validated by simulation of plant metabolic functions inferred from the literature. AraGEM exemplifies how genome-scale models can first be built and then used to explore highly complex and compartmentalized eukaryotic networks and to construct and examine testable, non-trivial hypotheses.
A thorough literature search on plant pathways and newly discovered mechanisms can enable design of new applications through database integration. In plants, for example, hormonal and defense signaling pathways have been found to cross-talk through identical components [67]. An integration of these two types of information can point towards new targets to counteract the microbial components that decrease plant resistance and lead to disease.
Additional examples of applications of database integration are presented in the supplementary material.
Non-plant references and opportunities for the future
Human databases have already benefitted from integration of information from different pathway databases. For example, a meta-analysis study of Type-2 diabetes was conducted to find different genes that are involved in the disease. Various types of data were used: medical reviews, phenotype information, proteome analysis results, candidate gene lists from previous studies, differential gene expression and time series microarray studies [68]. The study also incorporated information from several pathway databases including KEGG, Reactome [69], BioCyc [70], GO [71], IntAct and TRANSFAC to add pathway information and to derive cellular network information on these genes. This allowed identification of 213 genes with overall disease relevance indicating common, tissue-independent processes related to the disease and also identified genes showing changes with respect to a single study.
In another study [72], an integrated human interactome network was constructed using physical and direct binary protein–protein interactions. Data were retrieved from a variety of sources: Biomolecular Interaction Database (BIND), BioGRID, DIP, GeneRIF, IntAct, MINT and Reactome. Each of these plays a particular role in the integration scheme. BIND [73] contains data from large-scale cell mapping studies and molecular interactions in PDB. BioGRID [74] has protein and genetic interaction information as well as information from primary literature. DIP [75] contains experimentally determined protein–protein interactions. Gene reference into function (GeneRIF) [76] contains short text about curated articles that are relevant to known genes. IntAct contains highly curated interaction data from literature or direct deposition by experienced curators. MINT [77] focuses on experimentally verified protein–protein interactions and Reactome is a knowledgebase containing interaction data in different pathways. The Hepatitis C virus (HCV)-host infection network that was generated experimentally and from text mining was also incorporated on top of this integrated interactome network, a type of meta-integration. This led to the identification of previously unknown functional pathways of HCV biology and its pathogenesis. One could extrapolate the advantages of a similar approach to crop plant systems and their pathogens, which could then reveal information on plant host–pathogen interactions and the pathways involved in pathogenesis. This could lead to the development of methods to bestow pathogen resistance on crop plants or to target these pathways against the pathogen.
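Combining binary interactions from several sources, as described above, amounts at its simplest to taking the union of the interaction lists while recording which sources support each pair. The sketch below illustrates this under stated assumptions: it is not the method of the cited study, the database names and protein identifiers are placeholders, and real data would first require identifier mapping between sources.

```python
from collections import defaultdict

def merge_interactions(sources):
    """Merge binary interactions from several databases.

    `sources` maps a database name to an iterable of (protein_a, protein_b)
    pairs. Pairs are treated as undirected, so (a, b) and (b, a) collapse
    into one entry. Returns {frozenset pair: sorted list of database names}.
    """
    merged = defaultdict(set)
    for db_name, pairs in sources.items():
        for a, b in pairs:
            if a == b:
                continue  # self-interactions are skipped in this sketch
            merged[frozenset((a, b))].add(db_name)
    return {pair: sorted(dbs) for pair, dbs in merged.items()}

def supported_by(merged, min_sources=2):
    """Keep only the interactions reported by at least `min_sources` databases."""
    return {pair: dbs for pair, dbs in merged.items() if len(dbs) >= min_sources}

# Toy data: placeholder database names and protein identifiers
sources = {
    "db_one": [("P1", "P2"), ("P2", "P3")],
    "db_two": [("P2", "P1"), ("P3", "P4")],
    "db_three": [("P1", "P2"), ("P4", "P4")],
}
merged = merge_interactions(sources)
robust = supported_by(merged, min_sources=2)
```

Keeping the list of supporting sources for every interaction preserves provenance, so that a user can trace an edge in the integrated network back to the original databases.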
Not only can plant science benefit from the examples of animal pathway databases and their integration; animal biologists can in turn benefit from the study of plant pathways. They can ask whether pathways discovered only in plants to date also exist in animals, or how similar or different the pathway networks are that exist in both plants and animals. Many opportunities become available through such a feedback loop: can we unlock more evolutionary secrets? Can we become better at harnessing plants for our use, or could human diseases be experimentally modeled in plants if common pathways do indeed exist for plants and animals? Applications are endless and the potential for knowledge creation is immense.
A survey of integrated pathway databases and tools
Two approaches exist to perform database integration: through the use of tools and through already integrated databases [78] (that hopefully get rebuilt periodically to stay current). Pathway database integration tools, along with integrated pathway databases, play a very important role in easing data integration for biologists. These tools can also be used for various other purposes such as data visualization, pathway prediction, pathway gap-filling and biological network analysis. Applications of pathway databases and tools help further our knowledge of pathways and of the inner workings of living systems.
Pathway database tools for plant systems are important because of the widely dispersed information within several databases and a lack of consistency among these databases. A growing need exists to bring this information together in a standard format to aid access and model-building. Plants show more heterogeneity among different species (e.g. in terms of secondary metabolism [79]). This makes it even more important to integrate pathway data for all important plant species and to design tools that would aid in pointing out interspecies similarities and differences.
Arabidopsis Reactome [80], a separate version of Reactome, is a knowledgebase of biological processes in A. thaliana and several other plant species. It integrates pathway information curated in-house, as well as from KEGG and AraCyc. It also provides a platform to navigate and discover interconnected pathways in A. thaliana. The data model of Arabidopsis Reactome uses reactions and their interconnections; it treats protein modifications, proteins localized in different compartments and protein complexes as entities in their own right. It furthermore allows generalization of protein isoforms, paralogues and splice variants, with the possibility of tracing these components back. The model contains both real and inferred data, along with annotations that allow the two to be distinguished.
Tools like CORNET [81] help integrate A. thaliana related microarray expression data. The data sets for CORNET were obtained from Gene Expression Omnibus (GEO) [82] and from experiments carried out on Affymetrix ATH1 arrays. Also retrieved were the corresponding meta-data (which are unstructured and hence cumbersome to retrieve and parse automatically), including information about sample tissues, treatments and sampling time points, protein interaction data, localization data and functional information. The meta-data have manually assigned ontology terms from the Plant Ontology [83–85], the Microarray gene expression data (MGED) ontology (MO) [86] and the Plant environmental ontology (EO) (www.gramene.org/plant_ontology/index.html#eo). Protein–protein interactions (PPIs) were obtained from BIND, IntAct, BioGRID, DIP, MINT and TAIR. Predicted PPIs were obtained from the BAR Arabidopsis interaction viewer [87] and AtPID. Information was also obtained from the CORNET authors' own study [88]. Localization data were obtained from SUBA [89], iPSORT [90], LOCtree [91], MITOPRED [92], MitoProt [93], MultiLoc [94], PeroxiP [95], Predotar [96], SubLoc [97], TargetP [98] and WoLF_PSORT [99]; Table S2 provides a short description of these resources. CORNET includes all available data along with related meta-data. The tool then provides a reliability score for each result based on the search options, parameters and thresholds supplied by the user. A visualization tool additionally allows users to distinguish more reliable predictions from less reliable ones.
CORNET aims to provide functional context to genes and conversely, to provide an ability to predict functions of genes that have unknown functions. It is a tool that could also, in the future, use the information on A. thaliana to extrapolate networks in other plant species.
Many pathway resources use only general localization predictors. In contrast, CORNET has made an attempt to also use species-specific localization information. Thus, CORNET uses localization data both from ‘general’ localization predictors and from SUBA, an A. thaliana specific localization database that was the only species-specific resource available at the time. SUBA contains data retrieved from literature, experiments and prediction tools. It has become clearer over time that combining organism-specific predictors with multiple (general) predictors is likely to lead to more accurate predicted localization [100–103]. General predictors may not be suitable for predicting localization in an individual organism, as these tools are trained on proteins from a variety of organisms (and can suffer from sampling bias). Localization data from any single predictor need to be treated with caution, keeping in mind that including false positives in integrated databases would amplify the wrong information. Fortunately for plants, some organism-specific localization predictors have recently become available, e.g. AtSubP (Arabidopsis) [103] and RSLpred (rice) [104]. These should be used while integrating pathway information for the respective species. If a tool similar to CORNET is developed for rice, RSLpred would be an important resource for protein localization data. The need for localization predictors specific to a variety of plants cannot be emphasized enough for a more reliable extrapolation of networks.
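One simple way to combine several localization predictors, and to give extra weight to an organism-specific one, is a weighted vote. The sketch below is only an illustration of that idea: it is not how CORNET or any of the cited tools combine predictions, and the predictor names, the weight of 2.0 and the predicted compartments are hypothetical.

```python
from collections import defaultdict

def consensus_localization(predictions, weights=None):
    """Combine per-predictor localization calls by weighted vote.

    `predictions` maps a predictor name to its predicted compartment.
    `weights` optionally maps a predictor name to a vote weight (default 1.0).
    Returns (compartment, support), where support is the fraction of the
    total vote weight that backs the winning compartment.
    """
    if not predictions:
        raise ValueError("no predictions supplied")
    weights = weights or {}
    score = defaultdict(float)
    for predictor, compartment in predictions.items():
        score[compartment] += weights.get(predictor, 1.0)
    total = sum(score.values())
    # Iterating over sorted names makes tie-breaking deterministic (alphabetical)
    best = max(sorted(score), key=lambda c: score[c])
    return best, score[best] / total

# Toy example: hypothetical predictor names and calls for a single protein
predictions = {
    "general_predictor_1": "chloroplast",
    "general_predictor_2": "mitochondrion",
    "general_predictor_3": "chloroplast",
    "species_specific_predictor": "chloroplast",
}
weights = {"species_specific_predictor": 2.0}  # assumed weight, for illustration
compartment, support = consensus_localization(predictions, weights)
```

Reporting the support fraction alongside the consensus call lets an integrated database flag low-agreement predictions instead of silently propagating them.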
The ‘MetNet’ platform contains both metabolic and regulatory networks of A. thaliana, soybean [50] and grapevine. It is an attempt to integrate metabolic data from AraCyc and regulatory data from AGRIS, with additional manually curated signal transduction pathways (in A. thaliana). The pathway information is integrated with other resources that supply gene-related information, such as TAIR, GO classifications (retrieved through TAIR) and MapMan [105]. Protein information is obtained from PPDB [106], AMPDB [107], AtNoPDB [108], AraPerox, PLprot [109], SUBA and BRENDA [110]. These also provide the subcellular localization information for the entities. Metabolite data from ChEBI [111], PubChem [112], KEGG and NCI [113] are also integrated into the database. As there are large gaps in the information on the function of a large number of genes in A. thaliana, MetNet is aimed at formulating testable hypotheses. MetNet supports various types of users and data retrieval methods. MetNet Online (available at http://metnetonline.org/) is an online interface to MetNet. MetNetAPI is an Application Programming Interface to the platform that facilitates automated data retrieval [57], and a plug-in exists for the CellDesigner environment [114].
‘VitisNet’ [115] is a web-based tool for grapevine (Vitis vinifera) that integrates metabolomic, proteomic and transcriptomic pathway information within molecular networks like metabolic or signaling networks and presents a molecular network model. VitisNet allows visualization of genes and biochemical pathways involved in growth, fruiting cycles and environmental stress response. Data from VitisNet is now also available in MetNet.
‘MetaCrop’ [116] contains manually curated metabolic pathway information in crop plants (with special emphasis on seeds and tubers), along with a wide variety of other factors like reactions, location, transport processes, kinetics, taxonomy and literature. MetaCrop has an easy-to-use web interface and allows automatic export of information for the creation of metabolic models.
Pathway database maintenance—an easily overlooked detail
Although Pathguide lists more than 300 pathway resources, at least 30 of these databases and resources are no longer functional. At the time of writing this review (October 2010), inaccessible databases ‘not’ marked as non-functional in Pathguide include aMAZE [117,118], Sentra [119] and EMP [120] among others. Other databases may change location. During the preparation of this article, this happened with AtPID. The publication on AtPID is now destined to refer to an incorrect URL. Several of these databases contained high quality data and unavailability of databases is a loss from several angles. For example, aMAZE boasted an excellent data model. It could deal with metabolic, protein–protein interaction, gene regulation, sub-cellular localization, signal transduction and transport and thus had the capacity to integrate a large variety of data. Its current absence is a significant loss to the scientific community at large. While papers do exist for many of these projects, the technical details of an implementation can often only be obtained through communication with the implementing team. This effectively means that if anyone else ventures to do the same elsewhere in the world, they will have to retrace the time and steps to achieve the quality of aMAZE. Similarly, Arabidopsis Reactome is another dedicated database on A. thaliana, which is currently no longer being developed as the continuation of this project requires new funding initiatives.
Due to their ever expanding and evolving nature, pathway databases (like any other scientific database) need to be maintained, curated and developed on a long-term basis. Finding financial support for long-term maintenance of pathway databases is a challenging task. One possibility is to raise funds by establishing license purchase requirements for the use of databases, but this restricts open access to the information contained therein and can thus hinder the development of the field [121]. In addition, this is unfeasible for smaller projects that attract limited attention, but may be useful as part of integration efforts. Solutions are needed to ensure provision of continued funding for especially promising databases (without promoting an uncontrolled proliferation of new platforms) and avoid the loss of valuable information in established resources. Loss of such databases is not only a loss to the scientific community, but also is a waste of resources that have been spent on the creation and development of an excellent database in the first place. Funding agencies could, for example, provide continued funding to the database projects that they have already funded provided that the projects follow the GDbP standards which are continually and rigorously monitored and reported by an independent workgroup. Another solution could be an integration of especially promising databases into more permanent structures such as Gramene or NCBI.
The funding of The Arabidopsis Information Resource (TAIR) can serve as a recent example of the search for alternative funding sources. NSF funding for TAIR is to be phased out over the next 3 years (http://www.nature.com/news/2009/091118/full/462258b.html). For its continued maintenance, TAIR has recently come up with a corporate sponsorship program. The idea is to avoid subscription requirements for the corporate sector and thus keep the resource open and free of login requirements, thereby allowing continued open access to the data for all scientists. TAIR has already secured several corporate sponsors through this program. Such programs would certainly help the survival of at least some databases. However, this is not a real alternative to public funding, as such a solution could end up introducing a corporate bias into the system: only those databases that are able to find corporate sponsorship would survive. Various funding models for these community resources (that are not necessarily research projects in their own right) have recently received more attention [122, 123]. These could be applied to plant pathway database integration and maintenance. Funding a community resource requires a different approach compared to more conventional research projects. Various scenarios for databases need to be discussed and changed, a recommendation also posited by Bastow and Leonelli [123].
CONCLUSION
Pathway databases play an important role in advancing our knowledge of the biological functions and mechanisms. Increased understanding of living systems as a whole can, in turn, aid successful application design in silico, in vitro and in vivo.
Plants are important as veritable food, drug and fuel sources, as well as bioremediation and biotechnological tools. This provides a strong incentive to create better, more integrated and easily accessible plant pathway databases. Such efforts would lead to discovery and elucidation of the yet unknown components involved in various pathways and their function. This would also result in the creation of testable models that can further enrich the knowledge on plant systems. This then could lead to the design of more specialized intervention technologies along with potential commercial applications: innovation as a result of integration.
SUPPLEMENTARY DATA
Supplementary data are available online at http://bib.oxfordjournals.org/.
Key Points
Considering the importance of plants in human life and, more generally, for all life on Earth, plant pathway databases, and support for their continued existence after creation, deserve more attention than they have thus far received.
Plant pathway databases, being fewer in number, are more amenable to format standardization and data integration. Upcoming plant pathway databases should strive to follow standard formats and aim at database integration from the start. This can be facilitated through the formulation of Good Database Practice (GDbP).
Plant pathway database integration could help expand both basic and applied aspects of information utilization.
It is important that pathway databases/tools/resources be regularly curated and maintained. This demands improved strategies to provide continual financial support. The loss of a good database is expensive in terms of the years of hard work required to make a good database/tool/resource in the first place.
Acknowledgements
The authors are thankful to Prof. Eve Syrkin Wurtele and Dr Julie Dickerson at Iowa State University, who supported us in a number of ways with encouragement, sound advice, guidance and lots of good ideas. The authors thank the anonymous reviewers for their constructive criticisms and suggestions which have helped improve this article.
FUNDING
This material is based upon work supported by the National Science Foundation under Awards EEC-0813570 and MCB-0951170.