Abstract

Coeliac disease (CD) is a complex, multifactorial pathology caused by different factors, such as nutrition, immunological response and genetic factors. Many autoimmune diseases are comorbidities for CD, and a comprehensive and integrated analysis with bioinformatics approaches can help in evaluating the interconnections among all the selected pathologies. We first performed a detailed survey of gene expression data available in public repositories on CD and less commonly considered comorbidities. Then we developed an innovative pipeline that integrates gene expression, cell-type data and online resources (e.g. a list of comorbidities from the literature), using bioinformatics methods such as gene set enrichment analysis and semantic similarity. Our pipeline is written in R language, available at the following link: http://bioinformatica.isa.cnr.it/COELIAC_DISEASE/SCRIPTS/. We found a list of common differential expressed genes, gene ontology terms and pathways among CD and comorbidities and the closeness among the selected pathologies by means of disease ontology terms. Physicians and other researchers, such as molecular biologists, systems biologists and pharmacologists can use it to analyze pathology in detail, from differential expressed genes to ontologies, performing a comparison with the pathology comorbidities or with other diseases.

Introduction

Bioinformatics tools offer a wide spectrum of resources to analyze omics data, as for instance gene expression data. Public databases contain such data in a structured way and constitute an impressive amount of information, in most cases not completely exploited. The integration of different bioinformatics tools makes it possible a synergism of the analysis methods thus offering the opportunity to extract more information from the available data. This approach can be very useful in particular cases, as human pathologies with multifactorial aspects.

Coeliac disease (CD) is a chronic condition and an autoimmune disorder that causes an inflammatory reaction localized in the human small intestine. Two factors are significant: the genetic and the environmental components. It occurs in predisposed individuals with genetic variants, in which an autoimmune response from the small intestine is activated by gluten, a storage protein fraction found in different cereals, such as wheat, rye and barley. Gliadins and glutenins are the main components of gluten, responsible for triggering the autoimmune response after the crossing of the small intestine epithelium [1, 2]. The key aspects of known disease mechanism can be summarized as follows:

CD comorbidity map. The autoimmune pathologies associated to CD are clustered for typologies: neurological diseases (ochre, 1), endocrine diseases (green, 2), cardiological disease (red, 3), liver diseases (cyan, 4), dermatological diseases (purple, 5), rheumatological and connective tissue diseases (pink, 6) and various diseases (gray, 7). The little squares inside the human silhouette represent the principal zones of the body affected by the single autoimmune pathology. (The pathologies with * are localized all over the internal/external part of the body.)
Figure 1

CD comorbidity map. The autoimmune pathologies associated to CD are clustered for typologies: neurological diseases (ochre, 1), endocrine diseases (green, 2), cardiological disease (red, 3), liver diseases (cyan, 4), dermatological diseases (purple, 5), rheumatological and connective tissue diseases (pink, 6) and various diseases (gray, 7). The little squares inside the human silhouette represent the principal zones of the body affected by the single autoimmune pathology. (The pathologies with * are localized all over the internal/external part of the body.)

  • (i)

    Gluten induces enterocytes to upregulate zonulin (haptoglobin 2 precursor), which loosens enterocytes tight junctions, increasing the permeability of the small intestine. The interaction between gliadin and an overexpressed C-X-C Motif Chemokine Receptor 3 is essential for the release of zonulin [3];

  • (ii)

    After the crossing toward lamina propria, gluten causes cytokines secretion, which induces the activation of intraepithelial lymphocytes against the enterocytes. In more detail, Interleukin 15 is a proinflammatory cytokine that regulates T and natural killer cell activation and proliferation, activating cytotoxicity by NKG2D+ cells and interferon-gamma (IFNγ) production by intraepithelial lymphocytes [4];

  • (iii)

    Damaged enterocytes release tissue transglutaminase, which deamidates selected glutamines in gluten fragments. Then, the immune mechanisms recognize them by means of immunoglobulin A (IgA) antibodies and activate antigen-presenting cells by joining to HLA-DQ2 or HLA-DQ8 molecules [5];

  • (iv)

    Helper T cells (CD4+) recognize the complex and begin to damage enterocytes, both directly by means of secreted cytokines and killer T cells and indirectly by means of B cells. Mature B cells generate antibodies against tTG and gluten, persisting with the damage of the enterocytes and contributing to the final atrophy of the villi [6].

Genetic analyses show that CD is associated with HLA class II molecules that are encoded by DQA1*0501/DQB1*02 (DQ2) or DQA1*03/DQB1*0302 (DQ8) genes [7]. HLA-DQ2 and HLA-DQ8 recognize the deamidated gliadin. Nevertheless, there is a genetic risk due to an unknown number of non-human leucocyte antigens (HLA) genes [8].

The inherent multifactorial nature characterizes the pathology as complex disease [6, 9]. This disease is very often associated with autoimmune diseases; the comorbidity may be triggered by the activation of similar biological pathways, related to immune response and inflammation development. Moreover, the atrophy of the villi and the increased permeability can contribute to the occurrence of indirect-related disease [10]. From this point of view, not only the CD is limited to the gastrointestinal zone of the body but also it is a systemic disease. The importance to study the comorbidities of a complex disease is highlighted in [11]. In Figure 1, we present the CD comorbidity map with particular focus on autoimmune diseases associated to CD.

Another issue in the pathology is the onset time, which is related to its etiology. For a long time, CD has been diagnosed in child and considered a gastrointestinal disorder of childhood. Nowadays, it is often diagnosed in adults, associated to many symptoms and conditions. A significant amount of individuals is asymptomatic at diagnosis. Furthermore, endoscopic and histological evaluations are the best ways of assessing small intestine healing, despite their invasiveness and costs. The well-known treatment is a lifelong gluten-free diet (GFD), but it can be ineffective for a large percentage of the patients [12–14].

The availability of the gene expression data for the CD and other related pathologies in public repositories makes this disease approachable using bioinformatics and ontology approaches and testing how an integrative pipeline may offer a wider view of the pathology itself.

Methods

Survey of available data

We retrieved experimental data sets from the Gene Expression Omnibus (GEO) database [15], available from the National Center for Biotechnology Information (NCBI), and the Array Express database [16], available from the European Bioinformatics Institute. Four hundred twenty-seven query results about CD are available in GEO repository (June 2018). In order to eliminate the studies with a low number of samples, we selected a lower bound of 10 samples, thus the number of results decreases to 18 from which only 7 are strictly related to the disease or not redundant. This information about the quantity of data is reported to show how few collected CD data are usable, in comparison with all the available data sets. Indeed, there are four principal issues on selecting microarray data sets:

  • (i)

    Redundancy. Some data sets are replicated in similar condition or explored with different methods, but the samples are the same, thus there is no necessity to take them into account more than once;

  • (ii)

    Typology. Some data sets are not in a form suitable for performing a simple differential expression evaluation, such as sequence data, thus they are not useful for the integration;

  • (iii)

    Relevance. Some data sets are annotated with information related to a certain pathology, with particular emphasis on biological relationships, but the samples are not about the pathology itself, thus they are not suitable;

  • (iv)

    Species. some data sets come from non-human species, thus the analysis is not compatible.

The search within the Array Express database did not add more data sets with respect to GEO.

Gene expression data sets research for CD returned the following studies from GEO: GSE11501, a study of genetic variation about primary leucocytes [17]; GSE87629, a genome-wide analysis of lymphocytes gene expression during a 6 week gluten challenge, looking for changes correlated to villi atrophy [18]; GSE72625, a combined study on intraepithelial lymphocytes between common variable immunodeficiency and CD [19]; GSE106260, an immunopathology study in children, particularly on intestinal epithelial cells [20]; GSE61849, an expression profiling of selected genes from children intestinal mucosa, among sick, healthy and GFD-treated children [21]; GSE76168, a compared study between adult and children about differential gene expression signatures [22]; and GSE84729, an ‘immunoancestry-based’ study about novel candidate genes related to CD [23]. In our study, the 4th data set is divided into intestinal epithelial cells and intraepithelial lymphocytes; the 5th and the 7th data sets are divided into two different analyses: CD versus control and CD versus GFD-treated; furthermore, the 5th data set is reduced to the children case because of the unbalanced number of the adult individuals.

The selected studies on autoimmune disease, for the comparison with CD, are GSE45512, a transcriptomic description of affected alopecic scalp skin [24]; GSE68801, a gene expression profiling from a whole genome analysis on alopecia areata samples [25]; GSE63425, a study on expression data of vascular smooth muscle cell in patients with suspected giant cell arteritis [26]; GSE71956, an analysis of epigenetic profile in T cells from Graves’ disease patients [27]; GSE11971, an evaluation of the chronic inflammation on gene expression in skeletal muscle from children with juvenile dermatomyositis [28]; GSE1551, a study of the innate response of some cytokines, from B and T helper cell muscle infiltration, in dermatomyositis [29]; GSE5370, an overall analysis of idiopathic inflammatory myopathies (polymyositis and dermatomyositis) [30]; GSE93170, an analysis of microRNA and mRNA expressions in CD4+ T cells to investigate primary biliary cirrhosis pathogenesis [31]; GSE31014, a study on the transcriptional profile of leukocytes in the patients with Guillain–Barré syndrome [32]; GSE95849, a study on the identification of mechanisms underlying the pathogenesis of diabetes mellitus and diabetic peripheral neuropathy [33]; GSE17755, an evaluation of gene expression profiling of peripheral blood cells from patients with different autoimmune diseases, with a focus on systemic lupus erythematosus [34]; GSE4588, gene expression studies performed on peripheral blood mononuclear cell from systemic lupus erythematosus patients and other autoimmune diseases [35]; GSE90880, a study of the autoimmune response in vitiligo regarding the systemic changes underlying skin-specific manifestation [36]; and GSE53148, the analysis of some cytokines and chemokines in patients with vitiligo [37]. In our study, the 3rd data set takes into account histological-proven and not histological-proven patients; the 4th study is divided into CD4 and CD8 T cells; the 11th study is divided into juvenile idiopathic arthritis and canonical rheumatoid arthritis. For completeness and to our best knowledge, the following pathologies have no data set on GEO: Addison’s disease, dermatitis herpetiformis, gluten ataxia, autoimmune pericarditis and microscopic colitis. Type I diabetes has a strong correlation with CD, especially in children, as claimed in [38], but we preferred to focus our work on less commonly pathologies studied together with CD.

Ontologies

The gene ontology (GO) project [39] provides comprehensive structured information of biological systems, from the molecular level to larger pathways, cellular- and organism-level systems. There are three GO domains: cellular component, molecular function and biological process (BP). Our approach is based on BP.

The disease ontology (DO) project is conceptually similar to the previous one, described in [40]; the mission is to provide an open-source ontology for integrated biomedical data, associated with human disease. The hierarchical organization takes into account clinical, etiological and location information and trying to provide an online comprehensive theory of disease. The DO terms for the selected diseases are alopecia areata DOID:986, cerebral arteritis DOID:11390, temporal arteritis DOID:13375, Takayasu arteritis DOID:2508, polyarteritis nodosa DOID:9810, autoimmune thyroid disease DOID:7188, dermatomyositis DOID:10223, autonomic peripheral neuropathy DOID:0060054, primary biliary cirrhosis DOID:12236, rheumatoid arthritis DOID:7148, juvenile rheumatoid arthritis DOID:676, vitiligo DOID:12306 and celiac disease DOID:10608.

Gene set enrichment analysis and semantic similarity

Gene set enrichment analysis (GSEA) is a set of statistical methods to classify genes in groups, taking into account common biological functions, chromosomal location or regulation [41]. Genes from microarrays or next generation sequencing (NGS) technologies can be analyzed by their differential expression among several conditions, selected as representative for a certain disease and correlated to GO terms, finding possible pathways linked to phenotype changes, e.g. disease development. The power of these methods lies in overcoming the possibility of masking single gene information, such as the pathways disease-related in which it is involved. Gene sets are more relevant of a single gene, especially when correlated to phenotype information.

Semantic similarity is a method to measure the closeness between two terms, according to a given ontology. Many methods are based on the annotation statistics of their common ancestor terms, i.e. on the information content of each term. In this work, the connections among the terms (genes, GO and DO) are more important than single comparisons. The Wang method fits for the purpose because it is a graph-based method constructed on the topology inherited by the selected ontology. In a nutshell, an ontology term can be represented with a directed acyclic graph |${DAG}_X=\left(X,{T}_X,{E}_X\right)$|⁠, where |$X$| is the term, |${T}_X$| is the set of ancestors of |$X$| and |${E}_X$|is the set of semantic relations (edges) among the terms. The semantic value of |$X$| is
(1)
where |$t$| is a generic term, |${t}^{\prime }$| is a child term and |${w}_e$| is the semantic contribution between |$t$| and |${t}^{\prime }$|⁠. The global semantic value for |$X$| is
(2)
and the semantic similarity between two terms is
(3)
where |$X\ \mathrm{and}\ Y$| are two different terms. The most important advantage of this method is the calculation of a semantic similarity measure without taking into account any annotation statistics [42, 43]. The best-match average method calculates the semantic similarity between two different sets of terms [44]. Given two sets of terms |${T}_1=\left\{{t}_{11},{t}_{12},\dots {t}_{1l}\right\}$| and |${T}_2=\left\{{t}_{21},{t}_{22},\dots {t}_{2m}\right\}$| where |$l$| is the length of the 1st set and |$m$| is the length of the 2nd set, the semantic similarity between the two sets is
(4)
with |$i\ \mathrm{and}\ j$| indices on |${T}_1\ \mathrm{and}\ {T}_2$| terms.

Designing of pipeline

Figure 2 shows the steps of the pipeline:

Block diagram of the pipeline. Different colors indicate different levels of analysis, from the output of the previous step.
Figure 2

Block diagram of the pipeline. Different colors indicate different levels of analysis, from the output of the previous step.

Table 1

CD statistics summary. The features from left to right are number of raw genes, number of genes with significant P-value, number of genes with significant adjusted P-value, number of differential expressed genes, number of raw GO terms and number of significant GO terms. Statistics related to LogFC with threshold 2 are between brackets

Data setCell sourceRaw genesP-valueAdjusted P-valueLogFCRaw GSEAFisher GSEA
GSE11501Peripheral blood18 98123411561 (0)570012 (0)
GSE87629Peripheral blood20 624785014 (0)492826 (0)
GSE72625Intestinal biopsy47 32376323358260 (23)6128296 (44)
GSE106260aIntestinal biopsy24 52629800267 (10)5907106 (18)
GSE106260bIntestinal biopsy24 526235406818 (2120)5783109 (28)
GSE61849aIntestinal biopsy46934 (2)1760 (0)
GSE61849bIntestinal biopsy46308 (3)1760 (0)
GSE76168Intestinal biopsy401042 (2)2320 (0)
GSE84729aIntestinal biopsy65322911 (4)3270 (0)
GSE84729bIntestinal biopsy65201411 (2)3270 (0)
Data setCell sourceRaw genesP-valueAdjusted P-valueLogFCRaw GSEAFisher GSEA
GSE11501Peripheral blood18 98123411561 (0)570012 (0)
GSE87629Peripheral blood20 624785014 (0)492826 (0)
GSE72625Intestinal biopsy47 32376323358260 (23)6128296 (44)
GSE106260aIntestinal biopsy24 52629800267 (10)5907106 (18)
GSE106260bIntestinal biopsy24 526235406818 (2120)5783109 (28)
GSE61849aIntestinal biopsy46934 (2)1760 (0)
GSE61849bIntestinal biopsy46308 (3)1760 (0)
GSE76168Intestinal biopsy401042 (2)2320 (0)
GSE84729aIntestinal biopsy65322911 (4)3270 (0)
GSE84729bIntestinal biopsy65201411 (2)3270 (0)
Table 1

CD statistics summary. The features from left to right are number of raw genes, number of genes with significant P-value, number of genes with significant adjusted P-value, number of differential expressed genes, number of raw GO terms and number of significant GO terms. Statistics related to LogFC with threshold 2 are between brackets

Data setCell sourceRaw genesP-valueAdjusted P-valueLogFCRaw GSEAFisher GSEA
GSE11501Peripheral blood18 98123411561 (0)570012 (0)
GSE87629Peripheral blood20 624785014 (0)492826 (0)
GSE72625Intestinal biopsy47 32376323358260 (23)6128296 (44)
GSE106260aIntestinal biopsy24 52629800267 (10)5907106 (18)
GSE106260bIntestinal biopsy24 526235406818 (2120)5783109 (28)
GSE61849aIntestinal biopsy46934 (2)1760 (0)
GSE61849bIntestinal biopsy46308 (3)1760 (0)
GSE76168Intestinal biopsy401042 (2)2320 (0)
GSE84729aIntestinal biopsy65322911 (4)3270 (0)
GSE84729bIntestinal biopsy65201411 (2)3270 (0)
Data setCell sourceRaw genesP-valueAdjusted P-valueLogFCRaw GSEAFisher GSEA
GSE11501Peripheral blood18 98123411561 (0)570012 (0)
GSE87629Peripheral blood20 624785014 (0)492826 (0)
GSE72625Intestinal biopsy47 32376323358260 (23)6128296 (44)
GSE106260aIntestinal biopsy24 52629800267 (10)5907106 (18)
GSE106260bIntestinal biopsy24 526235406818 (2120)5783109 (28)
GSE61849aIntestinal biopsy46934 (2)1760 (0)
GSE61849bIntestinal biopsy46308 (3)1760 (0)
GSE76168Intestinal biopsy401042 (2)2320 (0)
GSE84729aIntestinal biopsy65322911 (4)3270 (0)
GSE84729bIntestinal biopsy65201411 (2)3270 (0)
Table 2

Autoimmune CD-related pathologies statistics summary. The features from left to right are number of raw genes, number of genes with significant P-value, number of genes with significant adjusted P-value, number of differential expressed genes, number of raw GO terms and number of significant GO terms. Statistics related to LogFC with threshold 2 are between brackets. Data set legend: AA—GSE45512 and GSE68801; AR—GSE63425; AT—GSE71956; DM—GSE11971, GSE1551 and GSE5370; PB—GSE93170; PN—GSE31014 and GSE95849; RA—GSE17755 and GSE4588; VI—GSE90880 and GSE53148

Data setCell sourceRaw genesP-valueAdjusted P-valueLogFCRaw GSEAFisher GSEA
GSE45512Skin biopsy (from scalp)54 675290076 (10)39016 (0)
GSE68801Skin biopsy (from scalp)54 67521 28815 5081198 (163)6118406 (95)
GSE63425aTemporal artery biopsy22 74112600142 (13)6088222 (13)
GSE63425bTemporal artery biopsy22 7411138046 (2)6088107 (6)
GSE71956aPeripheral blood47 32316340470 (3)612853 (32)
GSE71956bPeripheral blood47 3239960122 (0)6128123 (0)
GSE11971Skeletal muscle biopsy22 28310 47883087314 (1687)576247 (179)
GSE1551Muscle biopsy22 283709618207876 (543)576293 (322)
GSE5370Skeletal muscle biopsy22 283845047376522 (965)57626 (26)
GSE93170miRNA and mRNA profile40 848106953313780 (26)6101152 (175)
GSE31014Peripheral blood22 28333830455 (6)5762128 (42)
GSE95849Peripheral blood31 74116 41013 9638641 (1663)6152105 (133)
GSE17755aPeripheral blood30 33611 844882066 (5)581928 (31)
GSE17755bPeripheral blood30 33617 12715 605393 (26)581913 (0)
GSE4588Peripheral blood54 6755819445185 (234)611842 (91)
GSE90880Peripheral blood12 625517043 (4)527499 (22)
GSE53148Skin biopsy25 326105206349 (816)589918 (151)
Data setCell sourceRaw genesP-valueAdjusted P-valueLogFCRaw GSEAFisher GSEA
GSE45512Skin biopsy (from scalp)54 675290076 (10)39016 (0)
GSE68801Skin biopsy (from scalp)54 67521 28815 5081198 (163)6118406 (95)
GSE63425aTemporal artery biopsy22 74112600142 (13)6088222 (13)
GSE63425bTemporal artery biopsy22 7411138046 (2)6088107 (6)
GSE71956aPeripheral blood47 32316340470 (3)612853 (32)
GSE71956bPeripheral blood47 3239960122 (0)6128123 (0)
GSE11971Skeletal muscle biopsy22 28310 47883087314 (1687)576247 (179)
GSE1551Muscle biopsy22 283709618207876 (543)576293 (322)
GSE5370Skeletal muscle biopsy22 283845047376522 (965)57626 (26)
GSE93170miRNA and mRNA profile40 848106953313780 (26)6101152 (175)
GSE31014Peripheral blood22 28333830455 (6)5762128 (42)
GSE95849Peripheral blood31 74116 41013 9638641 (1663)6152105 (133)
GSE17755aPeripheral blood30 33611 844882066 (5)581928 (31)
GSE17755bPeripheral blood30 33617 12715 605393 (26)581913 (0)
GSE4588Peripheral blood54 6755819445185 (234)611842 (91)
GSE90880Peripheral blood12 625517043 (4)527499 (22)
GSE53148Skin biopsy25 326105206349 (816)589918 (151)
Table 2

Autoimmune CD-related pathologies statistics summary. The features from left to right are number of raw genes, number of genes with significant P-value, number of genes with significant adjusted P-value, number of differential expressed genes, number of raw GO terms and number of significant GO terms. Statistics related to LogFC with threshold 2 are between brackets. Data set legend: AA—GSE45512 and GSE68801; AR—GSE63425; AT—GSE71956; DM—GSE11971, GSE1551 and GSE5370; PB—GSE93170; PN—GSE31014 and GSE95849; RA—GSE17755 and GSE4588; VI—GSE90880 and GSE53148

Data setCell sourceRaw genesP-valueAdjusted P-valueLogFCRaw GSEAFisher GSEA
GSE45512Skin biopsy (from scalp)54 675290076 (10)39016 (0)
GSE68801Skin biopsy (from scalp)54 67521 28815 5081198 (163)6118406 (95)
GSE63425aTemporal artery biopsy22 74112600142 (13)6088222 (13)
GSE63425bTemporal artery biopsy22 7411138046 (2)6088107 (6)
GSE71956aPeripheral blood47 32316340470 (3)612853 (32)
GSE71956bPeripheral blood47 3239960122 (0)6128123 (0)
GSE11971Skeletal muscle biopsy22 28310 47883087314 (1687)576247 (179)
GSE1551Muscle biopsy22 283709618207876 (543)576293 (322)
GSE5370Skeletal muscle biopsy22 283845047376522 (965)57626 (26)
GSE93170miRNA and mRNA profile40 848106953313780 (26)6101152 (175)
GSE31014Peripheral blood22 28333830455 (6)5762128 (42)
GSE95849Peripheral blood31 74116 41013 9638641 (1663)6152105 (133)
GSE17755aPeripheral blood30 33611 844882066 (5)581928 (31)
GSE17755bPeripheral blood30 33617 12715 605393 (26)581913 (0)
GSE4588Peripheral blood54 6755819445185 (234)611842 (91)
GSE90880Peripheral blood12 625517043 (4)527499 (22)
GSE53148Skin biopsy25 326105206349 (816)589918 (151)
Data setCell sourceRaw genesP-valueAdjusted P-valueLogFCRaw GSEAFisher GSEA
GSE45512Skin biopsy (from scalp)54 675290076 (10)39016 (0)
GSE68801Skin biopsy (from scalp)54 67521 28815 5081198 (163)6118406 (95)
GSE63425aTemporal artery biopsy22 74112600142 (13)6088222 (13)
GSE63425bTemporal artery biopsy22 7411138046 (2)6088107 (6)
GSE71956aPeripheral blood47 32316340470 (3)612853 (32)
GSE71956bPeripheral blood47 3239960122 (0)6128123 (0)
GSE11971Skeletal muscle biopsy22 28310 47883087314 (1687)576247 (179)
GSE1551Muscle biopsy22 283709618207876 (543)576293 (322)
GSE5370Skeletal muscle biopsy22 283845047376522 (965)57626 (26)
GSE93170miRNA and mRNA profile40 848106953313780 (26)6101152 (175)
GSE31014Peripheral blood22 28333830455 (6)5762128 (42)
GSE95849Peripheral blood31 74116 41013 9638641 (1663)6152105 (133)
GSE17755aPeripheral blood30 33611 844882066 (5)581928 (31)
GSE17755bPeripheral blood30 33617 12715 605393 (26)581913 (0)
GSE4588Peripheral blood54 6755819445185 (234)611842 (91)
GSE90880Peripheral blood12 625517043 (4)527499 (22)
GSE53148Skin biopsy25 326105206349 (816)589918 (151)
  • (i)

    ‘Data retrieval’. The selection of GEO data is carried out by GEO sample (GSM) records, downloaded with the platform and matrices information and converted into expression set class for the analysis. In order to evaluate the genes’ differential expression, healthy controls or treated patients with gluten-free diet are necessary for CD case;

  • (ii)

    ‘Sample selection’. An automatic selection for the GEO samples is not straightforward because sick and healthy patients are often mixed together; therefore, we performed a prior check over GSM records. This step helps in creating the design model, factorizing the classes of the patients (sick, healthy, treated). The design model is coeliac versus control or treated in CD cases, autoimmune disease versus control in the other cases;

  • (iii)

    ‘Differential expression’. After a linear and a Bayesian filtering over the design model, the results are tabulated. Three parameters determine the choice about the selection of candidate genes: P-value, adjusted P-value (Benjamini–Hochberg correction) and base 2 logarithm of fold change (logFC), sorted by logFC values;

  • (iv)

    ‘GO terms test’. We created the ‘topGOdata’ class by selecting the GO domain, using a filtering function to establish selected genes (out of genes universe) and providing the annotation for the mapping. We performed the Fisher’s exact test in order to obtain a filtering on GO terms and their relationships with the selected genes;

  • (v)

    ‘Semantic similarity’. After extracting the correspondences gene/GO term, we applied semantic similarity to all the autoimmune pathologies, using genes, GO terms and DO terms for the comparisons, in order to quantify the closeness among all the selected datasets;

  • (vi)

    ‘Cluster comparison’. Furthermore, we performed an enrichment test based on hypergeometric distribution for differential expressed genes and KEGG pathways [45], to find the principal pathways or pathologies for the CD and its comorbidities data sets;

  • (vii)

    ‘Results’. Finally, the pipeline output consisted of stastical summary, gene/GO term correspondences, GO graph topologies, semantic similarity matrices and dendrograms (for genes, GO terms and DO terms), and KEGG enrichment graph with the list of pathways/pathologies. Moreover, we used the list of differential expressed genes in common between CD and comorbidities to generate a gene network, with information on involved pathways/pathologies.

The pipeline is a combination of two main scripts in R language, available at the following link: http://bioinformatica.isa.cnr.it/COELIAC_DISEASE/SCRIPTS/. They build up the pipeline using the following Bioconductor packages [46]: ‘GEOquery’ [47] for data transformation in expression set class; ‘limma’ [48] for data analysis, linear models and differential expression on microarray data; ‘genefilter’ [49] for basic functions on filtering genes; ‘topGO’ [50] for the test of GO terms and the topology arrangement of the GO graph; ‘GOSemSim’ [51] for the semantic similarity evaluation among the diseases; ‘DOSE’ [52] for the semantic similarity evaluation among DO terms; and ‘clusterProfiler’ [53] for the enrichment analysis with KEGG pathways. We would point out that ‘GEOquery’ package provides the possibility to download GEO data sets directly but last updates show some issues in interaction with the other packages. Thus, we use a call to GEO File Transfer Protocol (FTP). Finally, we report the software and packages version used for the work: ‘R’ 3.3.1, ‘R Studio’ 1.0.136, ‘Bioconductor’ 3.4, ‘GEOquery’ 2.40.0, ‘limma’ 3.30.8, ‘genefilter’ 1.56, ‘topGO 2.26’, ‘GOSemSim’ 2.0.4, ‘DOSE’ 3.0.10 and ‘clusterProfiler’ 3.2.14.

Results

Statistics and GO terms tree

Table 1 shows some statistics for the selected CD studies. The 2nd, 3rd, 4th and 6th columns are the numbers obtained from the comparison with the threshold of 0.01 for the P-values. The standard threshold for logFC is 1.00 (among the brackets the number of left differential expressed genes with a threshold of 2.00). In Table 1, it is possible to notice that the studies are not standardized. The top four have a raw filter on microarray genes, based on unused or defected probes, whereas the following three are already curated. The step for searching for the differential expression of the genes is not trivial. The scores (adjusted P-value, P-value and logFC) are strictly related to data set dimension; furthermore, logFC is dependent from the kind of data set. For example, the 3rd data set has 23 genes with |logFC| > 2 and this subset should be considered as selected genes for the analysis. Moreover, the application of the Fisher’s exact test is biased for the selection of the GO terms because of the dimension of the data set and, probably, the performed manipulations. In such circumstances, we chose the candidate genes’ subset with a trade off among all the scores; thus, the creation of a GO graph eludes the Fisher’s exact test, keeping its biological importance to define process clusters but losing in statistical power.

Table 2 shows a selection of statistics for the autoimmune pathologies.

Example of GO graph with GSEA on GSE72625 data set. The complete graph (on the left) and a zoom (on the right) are shown. The rectangles represent the top five GO terms after the test. The red and orange colors indicate the most significant GO terms. In this case, an inflammatory sub-graph is linked to the interferon gamma cytokine and response to drug.
Figure 3

Example of GO graph with GSEA on GSE72625 data set. The complete graph (on the left) and a zoom (on the right) are shown. The rectangles represent the top five GO terms after the test. The red and orange colors indicate the most significant GO terms. In this case, an inflammatory sub-graph is linked to the interferon gamma cytokine and response to drug.

Network on common differential expressed genes between CD and its comorbidities. Color legend: gray, consolidated pathway; pink, physical interaction; light violet, co-expression; dark yellow, predicted interaction; indigo, co-localization; light blue, other pathway; light green, genetic interaction; beige, shared protein domain. Circle legend: striated circle, common genes; normal circle, genes added after enrichment. The cluster network has been performed with the online tool ‘GeneMania’ [57].
Figure 4

Network on common differential expressed genes between CD and its comorbidities. Color legend: gray, consolidated pathway; pink, physical interaction; light violet, co-expression; dark yellow, predicted interaction; indigo, co-localization; light blue, other pathway; light green, genetic interaction; beige, shared protein domain. Circle legend: striated circle, common genes; normal circle, genes added after enrichment. The cluster network has been performed with the online tool ‘GeneMania’ [57].

Semantic similarity matrix for differential expressed genes (from the first five GO terms). The two-letter suffix before the GSE codes are referred to the following: AA, alopecia areata; AR, arteritis; AT, autoimmune thyroid disease; DM, dermatomyositis; PB, primary biliary cirrhosis; PN, peripheral neuropathy; RA, rheumatoid arthritis; VI, vitiligo. The number after the two letters indicates the logFC threshold.
Figure 5

Semantic similarity matrix for differential expressed genes (from the first five GO terms). The two-letter suffix before the GSE codes are referred to the following: AA, alopecia areata; AR, arteritis; AT, autoimmune thyroid disease; DM, dermatomyositis; PB, primary biliary cirrhosis; PN, peripheral neuropathy; RA, rheumatoid arthritis; VI, vitiligo. The number after the two letters indicates the logFC threshold.

Semantic similarity matrix for GO terms (1st five GO terms). Codes are same as in Figure 5.
Figure 6

Semantic similarity matrix for GO terms (1st five GO terms). Codes are same as in Figure 5.

Semantic similarity matrix for DO terms. Legend: AA, alopecia areata; ARa, cerebral arteritis; ARb, temporal arteritis; ARc, Takayasu arteritis; ARd, polyarteritis nodosa; AT, autoimmune thyroid disease; CD, coeliac disease;
DM, dermatomyositis; PB, primary biliary cirrhosis; PN, autonomic peripheral neuropathy; RAa, rheumatoid arthritis; RAb, juvenile rheumatoid arthritis;
VI, vitiligo.
Figure 7

Semantic similarity matrix for DO terms. Legend: AA, alopecia areata; ARa, cerebral arteritis; ARb, temporal arteritis; ARc, Takayasu arteritis; ARd, polyarteritis nodosa; AT, autoimmune thyroid disease; CD, coeliac disease; DM, dermatomyositis; PB, primary biliary cirrhosis; PN, autonomic peripheral neuropathy; RAa, rheumatoid arthritis; RAb, juvenile rheumatoid arthritis; VI, vitiligo.

Enrichment analysis with KEGG pathways from differential expressed genes. KEGG pathways related to diseases are listed in the rows; data sets (with logFC threshold) are listed in the columns. The dimension of the circles shows gene prevalence in the pathway; the color of the circles shows the statistical validation, with an upper bound P-value of 0.05. CD disease data sets are inside the vertical lines. The circles indicate the KEGG pathways for CD (dark green) and for comorbidities (light green).
Figure 8

Enrichment analysis with KEGG pathways from differential expressed genes. KEGG pathways related to diseases are listed in the rows; data sets (with logFC threshold) are listed in the columns. The dimension of the circles shows gene prevalence in the pathway; the color of the circles shows the statistical validation, with an upper bound P-value of 0.05. CD disease data sets are inside the vertical lines. The circles indicate the KEGG pathways for CD (dark green) and for comorbidities (light green).

Table 3 reports a summary of the data sets and differential expressed genes analyzed. The hierarchy of the GO terms reveals that some of them are encapsulated; thus, not all the listed GO terms are fundamental. An example of hierarchy with a zoom on the most important GO terms is shown in Figure 3.

Table 3

Summary of results along the pipeline steps for the selected pathologies. The features from left to right are selected disease, main location in the body, number of data sets, number of selected data sets and number of upregulated differential expressed genes (DEG up) and downregulated differential expressed genes (DEG down)

DiseaseOrgan/tissueData setSelected data setDEG upDEG down
CDSmall intestine427756*43*
Alopecia areataSkin (scalp)722132
ArteritisArteries24124
Autoimmune thyroid diseaseThyroid713840
DermatomyositisSkin, muscle223636
Primary biliary cirrhosisLiver38112
Peripheral neuropathyNerves752184
Rheumatoid arthritisJoints15922118
VitiligoSkin925417
DiseaseOrgan/tissueData setSelected data setDEG upDEG down
CDSmall intestine427756*43*
Alopecia areataSkin (scalp)722132
ArteritisArteries24124
Autoimmune thyroid diseaseThyroid713840
DermatomyositisSkin, muscle223636
Primary biliary cirrhosisLiver38112
Peripheral neuropathyNerves752184
Rheumatoid arthritisJoints15922118
VitiligoSkin925417

*Numbers referred to differential expressed genes with strong evidences in CD.

Table 3

Summary of results along the pipeline steps for the selected pathologies. The features from left to right are selected disease, main location in the body, number of data sets, number of selected data sets and number of upregulated differential expressed genes (DEG up) and downregulated differential expressed genes (DEG down)

DiseaseOrgan/tissueData setSelected data setDEG upDEG down
CDSmall intestine427756*43*
Alopecia areataSkin (scalp)722132
ArteritisArteries24124
Autoimmune thyroid diseaseThyroid713840
DermatomyositisSkin, muscle223636
Primary biliary cirrhosisLiver38112
Peripheral neuropathyNerves752184
Rheumatoid arthritisJoints15922118
VitiligoSkin925417
DiseaseOrgan/tissueData setSelected data setDEG upDEG down
CDSmall intestine427756*43*
Alopecia areataSkin (scalp)722132
ArteritisArteries24124
Autoimmune thyroid diseaseThyroid713840
DermatomyositisSkin, muscle223636
Primary biliary cirrhosisLiver38112
Peripheral neuropathyNerves752184
Rheumatoid arthritisJoints15922118
VitiligoSkin925417

*Numbers referred to differential expressed genes with strong evidences in CD.

Pathways

We suggest the following framework on the BP involved in each study on CD:

  • (i)

    GSE11501: peptidyl-tyrosine phosphorylation, phosphatidylinositol 3-kinase signaling and response to endoplasmic reticulum stress;

  • (ii)

    GSE87629: mitosis regulation, microtubule cytoskeleton organization and protein destabilization;

  • (iii)

    GSE72625: signaling pathway and cellular response about IFNγ;

  • (iv)

    GSE106260a: signaling pathway and cellular response about IFNγ (and immune response);

  • (v)

    GSE106260b: leukocyte migration and cardiac conduction;

  • (vi)

    GSE61849a: immune response and immune system development;

  • (vii)

    GSE61849b: protein phosphorylation, apoptotic process and regulation of cell adhesion;

  • (viii)

    GSE76168: cytokine-mediated signaling pathways;

  • (ix)

    GSE84729a: cytokine-mediated signaling pathways and response to virus;

  • (x)

    GSE84729b: cytokine-mediated signaling pathways and response to virus.

In summary, most of the previous pathways are directly connected to the pathology, with a particular interest in the cases with the cytokine signaling pathway and the immune response topics, i.e. inflammation and autoimmune response generated by CD, as confirmed from the candidate genes. In addition, we underline interesting functional annotations, such as response to endoplasmic reticulum stress [54], cardiac conduction [55] and response to virus [56].

The comparison of the differential expressed genes between the CD data sets and the autoimmune diseases shows that the following genes are in common: HLA-DRB1, CXCL10, IFI27, STAT1, GBP1, BST2, IFITM1, IFITM2, IFITM3, IRF1, ISG15, OAS2, CCL20, CCL5, HLA-DPA1, CTLA4, ICOS, CXCL11, GPR65, GZMA, GZMB, XCL1, MMP12, DUSP10, CXCR4, IFNG, KLRF1, RGS1, MASP1, CCR2, TLR4, CD3D, CD3G, CD8A and HAVCR2. A cluster network is generated on the list of genes taking into account co-expression, consolidated pathways, co-localization, shared protein domain, predicted and physical interactions and shown in Figure 4. The most important clusters are related to the following pathways and pathologies, with the percentage of coverage: immune system (23.53%), chemokine signaling pathway (8.70%), rheumatoid arthritis (7.54%), influenza A (6.71%), interferon alpha/beta signaling (5.58%), Biocarta CTLA4 pathway (5.29%), Cytokine Signaling in Immune System (4.52%), chemokine receptors bind chemokines (4.08%) and IL12-mediated signaling events (2.71%).

In the same way, the comparison between GO terms from CD list and autoimmune disorders list highlights the presence of terms in common:

  • (i)

    GO:0002376: immune system process;

  • (ii)

    GO:0009615: response to virus;

  • (iii)

    GO:0006955: immune response;

  • (iv)

    GO:0019221: cytokine-mediated signaling pathway;

  • (v)

    GO:0051607: defense response to virus;

  • (vi)

    GO:0034340: response to type I interferon;

  • (vii)

    GO:0060337: type I interferon signaling pathway;

  • (viii)

    GO:0071357: cellular response to type I interferon.

Semantic similarity and KEGG enrichment

Figure 5 shows the semantic similarity matrix for differential expressed genes. Over a semantic similarity value of 0.7, CD1_GSE106260a/b is connected to many selected comorbidities, in particular with dermatomyositis, arteritis and vitiligo. Considering the other strong evidences from CD1_GSE84729a and CD2_GSE106260b, peripheral neuropathy, primary biliary cirrhosis and alopecia areata should be taken into account (for the selected data sets). In addition, giant cell arteritis (AR2_GSE63425a/b) and rheumatoid arthritis (RA2_GSE17755b) are outliers, with 2 as logFC threshold.

Figure 6 shows the semantic similarity matrix for GO terms. Over a semantic similarity value of 0.7, CD1_GSE106260a, CD1_GSE84729a, CD2_GSE106260a and CD2_GSE72625 are well clustered with all dermatomyositis data sets, normal and juvenile. Moreover, it is possible to notice from the little square in the center of the matrix that CD1_GSE84729a and VI1_GSE90880 (vitiligo) data sets have a spurious similarity.

Figure 7 shows the semantic similarity matrix for DO terms. From the CD row, alopecia areata, autoimmune thyroid disease, primary biliary cirrhosis and vitiligo are the most connected pathologies (over a threshold of 0.3). In details, vitiligo has a similarity value of 0.55.

Figure 8 represents the relationships between KEGG pathways and all the selected data sets. Recurring pathways among CD data sets, with at least three evidences, are natural killer cell-mediated cytotoxicity, autoimmune thyroid disease, chemokine signaling pathway, T cell receptor signaling pathway, herpes simplex infection, influenza A, cell adhesion molecules, NOD-like receptor signaling pathway, hepatitis C, measles, cytokine–cytokine receptor interaction and prolactin signaling pathway. In addition to the previous list of pathways, recurring pathways in common between CD and the others pathologies, with at least one CD evidence, are allograft rejection, graft versus host disease, type I diabetes mellitus, malaria, Chagas’ disease, antigen processing and presentation, hematopoietic cell lineage, herpes simplex infection, human papilloma infection, hepatitis B, prion diseases, asthma and intestinal immune network for IgA production.

Discussion

The aim of this work is to evidence the effectiveness of a pipeline of bioinformatics analyses in extracting information from public databases and investigating the relationships of a complex disease and its comorbidities. We studied the state-of-the-art CD and some autoimmune pathologies, in order to select gene expression data from microarray experiments available online. GSEA is a method suitable for giving a direction in studying CD, especially in terms of involved processes, inflammation pathways, relationships with other disease and interconnection bridge among different kinds of omic data by means of different ontology, such as GO and DO. Furthermore, we remark the usefulness of the semantic similarity method that quantifies the closeness among different data sets on the basis of the selected ontology without taking into account any other statistic or parameter.

We applied the described pipeline both to CD and to some of its comorbidities. Therefore, this work is methodologically general and could be applied to the following: (1) other CD data sets, (2) other CD comorbidities and (3) other complex pathologies. The only constraint is the typology of data, downloaded by GEO repository (or by Array Express repository), that consist of gene expression sets from sick and healthy patients, even GFD-treated in CD case. The threshold of 10 samples is considered as at least five individuals with active state of the disease in comparison with at least five individuals as healthy control (or under a treatment). This cutoff is a bound on the difference of samples’ number among the pathologies, with no more than one order of magnitude for all the considered microarrays. In this way, we tried to control the overrepresentation of one data set over another. Furthermore, even if selecting an objective number of samples is a heuristic choice, very small numbers affects statistical validations, as happen often for biological samples.

Briefly, we started from the set of differential expressed genes and applied GSEA to them, in order to obtain the most important GO terms. Then, we used a semantic similarity method on genes, GO terms and DO terms, making a comparison among all the results. The pipeline in this work is implemented in R language with packages from Bioconductor repository but it can be simply rewritten in other programming languages. It is important to highlight the choice of using different logFC thresholds; in this way, we have an augmentation of the data, not only in terms of quantity (when data are few) but even in terms of quality because we can extract different sets of differential expressed genes and GO terms, improving the research of common genes and pathways, in the same disease or among several diseases. From a methodological point of view, one strong point of the pipeline is the possibility of reusing available data. There are many projects with published results, which can be grasped for further analyses, but it is not simple to work on them. In other cases, such as in many CD comorbidities, microarray data are not publicly available because of legal/ethical reasons or difficulties in making the experiments. For this reason, our pipeline could help in obtaining novel evidences from old available data. It is important to underline how the selection of the microarray data sets can influence the evidences. The proposed method is data-driven; thus, the choice of the data sets is up to the scientists. However, we started from CD data and found results strengthened by the scientific literature, showing how the pipeline highlights the most important aspects involved in the selected pathologies. Obviously, we expect that the use of more and more microarray data sets may enrich the set of evidences, without affecting the previous ones. Therefore, as shown in this work, we suggest taking into account the data sets from different sources and different cell types, in order to integrate data and get evidences that are more robust for the single disease and among several diseases. One possible study target could be the analysis of the differential expression for genes of HLA, according to the importance claimed in [58]. In detail, the role of HLA class II genes is not completely described, but their importance is well known in autoimmune diseases. The availability of suitable microarray data sets would be a good seed for our pipeline, taking into account not only the classic case of patients versus control but also different states coming from genetic variants, in order to provide a support in defining the risk of having a disease. Moreover, novel microarray data sets with the concurrency of autoimmune diseases, e.g. patients with CD and one comorbidity versus patients with CD and another comorbidity, could help in supporting our finding or suggesting a new direction of investigation.

An inner issue with the typology of data is the comparison. The standardization is not similar all over the data set, for the kind of the data (conditions, tissues, normalization) or for the lack of curated records. It is meaningful to underline that GEO series (GSE) needs more effort in preparing data for the analysis because they are not curated. A possible direction in improving the pipeline can be the inclusion of a quality control on microarray data sets. Usually, researchers upload microarray data sets after a personal quality control step, in order to provide data sets suitable for semi-automated analysis. Nevertheless, some tools are available online to perform locally the quality control. One example is ‘arrayQualityMetrics’ package from Bioconductor repository, which provides a semi-interactive report on microarray quality evaluation, for both raw and normalized data sets, and an R object, suitable for the upstream integration in a pipeline [59]. The abovementioned ‘limma’ package provides the MA-plot, based on the comparison of the signal intensities, for making decisions on normalizing data. Furthermore, CD microarray data have a high sensitivity to the used filters, and therefore, it is up to the investigator finding a good trade off among the scores, making pretty difficult the realization of a standard protocol. From this perspective, a good work is explained in [60]. Nevertheless, our approach provides a good way to analyze, compare and integrate the described data. As future work, a complete workflow could be useful to automate the process, once we have coped with the manual selection of the patients (GSM records).

The integration of data from different studies, also obtained with different technologies, represent the most promising approach for investigating in details the complex disease. Our work proposes a pipeline suitable for analyzing different data sets from pathologies and their comorbidities. Novel omics data can be a further source of information for our pipeline, providing the opportunity to create a more complete perspective of the multifactorial disorders themselves.

Bioinformatics statistical analysis of CD and its comorbidities is a thriving direction of medical research. It has two important benefits: the improved understanding of the CD that shape the network of comorbidity diseases and the introduction of new workflow of analyses and statistical estimators for complex diseases.

Finally, our work is important from the clinical bioinformatics point of view because we provide methodology for comorbidity analysis and improved visualization based on developing bits of code to harmonize existing software into an effective pipeline that could be used by trained physicians. Although many molecular biologists use bioinformatics R scripts (or similar) in the research work, the impact of bioinformatics in clinical investigation is still at the beginning. We believe that our bioinformatics approach could be useful to researchers not only in CD but also in other complex diseases.

Key Points

  • Comprehensive bioinformatics analysis of a complex disease and its comorbidities can help in looking for novel findings or in strengthening old ones.

  • CD is a complex disease that needs many other efforts to cope with, in order to find important biomarkers and help clinicians and physicians in their evaluations.

  • A bioinformatics approach by means of some methodologies, such as gene set enrichment analysis and semantic similarity, is suitable for these purposes.

  • A straightforward pipeline in R language is useful to analyze common genes, gene ontology and disease ontology terms and pathways, by reusing online available microarray data.

Acknowledgements

A.F. activity has been supported by the Flagship “InterOmics” project (PB.P05) funded by the Italian Ministry of Education, University and Research and Italian National Research Council organizations.

Eugenio Del Prete is a PhD student in Applied Biology at the University of Basilicata, Potenza, Italy. His work for this article has been performed also at the Laboratory of Bioinformatics and Computational Biology, Institute of Food Sciences, Italian National Research Council, Avellino, Italy, and at the Computer Laboratory, Department of Computer Science, University of Cambridge, UK. He has an MSc in Telecommunications Engineering. His research interests include bioinformatics, biostatistics and computational biology.

Angelo Facchiano is a senior researcher at the Bioinformatics and Computational Biology Laboratory, Institute of Food Sciences, Italian National Research Council, Avellino, Italy. He is as a degree in Chemistry with specialization in Biotechnology and PhD in Cellular Biochemistry. His research interests include biochemistry, genetics and molecular biology, bioinformatics and computational biology, pharmacology, toxicology and pharmaceutics.

Pietro Liò is a reader in Computational Biology at the Computer Laboratory, a department of Computer Science at the University of Cambridge. He has an MA from Cambridge, a PhD in complex systems and non-linear dynamics and a PhD in theoretical genetics. His research interests include biomolecular and clinical data integration, computational medicine, complex disease dynamics and comorbidities, bioengineering, ICT and social innovation technologies.

References

1.

Bergamo
P
,
Maurano
F
,
Mazzarella
G
, et al.
Immunological evaluation of the alcohol-soluble protein fraction from gluten-free grains in relation to celiac disease
.
Mol Nutr Food Res
2011
;
55
:
1266
70
.

2.

Nijhawan
S
,
Goyal
G
.
Celiac disease review
.
J Gastrointest Dig Syst
2015
;
5
:
350
.

3.

Fasano
A
.
Zonulin, regulation of tight junctions, and autoimmune diseases
.
Ann N Y Acad Sci
2012
;
1258
(
1
):
25
33
.

4.

Heyman
M
,
Menard
S
.
Pathways of gliadin transport in celiac disease
.
Ann N Y Acad Sci
2009
;
1165
:
274
8
.

5.

Sollid
LM
,
Lundin
KE
.
Diagnosis and treatment of celiac disease
.
Mucosal Immunol
2009
;
2
:
3
7
.

6.

Alaedini
A
,
Green
PH
.
Narrative review: coeliac disease: understanding a complex autoimmune disorder
.
Ann Intern Med
2005
;
142
(
4
):
289
98
.

7.

Louka
AS
,
Sollid
LM
.
HLA in coeliac disease: unravelling the complex genetics of a complex disorder
.
Tissue Antigens
2003
;
61
(
2
):
105
17
.

8.

Fasano
A
.
Surprises from celiac disease
.
Sci Am
2009
;
301
:
54
61
.

9.

Mitchell
KJ
.
What is complex about complex disorders?
Genome Biol
2012
;
13
(
1
):
237
.

10.

Lauret
E
,
Rodrigo
L
.
Celiac disease and autoimmune-associated conditions
.
Biomed Res Int
2013
;
2013
:
127589
.

11.

Capobianco
E
,
Liò
P
.
Comorbidity: a multidimensional approach
.
Trends Mol Med
2013
;
19
(
9
):
515
21
.

12.

Ryan
D
,
Newnham
ED
,
Prenzler
PD
, et al.
Metabolomics as a tool for diagnosis and monitoring in coeliac disease
.
Metabolomics
2015
;
11
:
980
90
.

13.

Östensson
M
,
Montén
C
,
Bacelis
J
, et al.
A possible mechanism behind autoimmune disorders discovered by genome-wide linkage and association analysis in celiac disease
.
PLoS One
2013
;
8
(
8
):
e70174
.

14.

Buddrick
O
,
Cornell
HJ
,
Small
DM
.
Reduction of toxic gliadin content of wholegrain bread by the enzyme caricain
.
Food Chem
2015
;
170
:
343
7
.

15.

Wilhite
SE
,
Barrett
T
.
Strategies to explore functional genomics data sets in NCBI’s GEO database
.
Methods Mol Biol
2012
;
802
:
729
43
.

16.

Kolesnikov
N
,
Hastings
E
,
Keays
M
, et al.
ArrayExpress update—simplifying data submissions
.
Nucleic Acids Res
2015
;
43
(
Database issue
):
D1113
6
.

17.

Heap
GA
,
Trynka
G
,
Jansen
RC
, et al.
Complex nature of SNP genotype effects on gene expression in primary human leucocytes
.
BMC Med Genomics
2009
;
2
:
1
.

18.

Garber
ME
,
Saldanha
A
,
Parker
JS
, et al.
A B-cell gene signature correlates with the extent of gluten-induced intestinal injury in celiac disease
.
Cell Mol Gastroenterol Hepatol
2017
;
4
(
1
):
1
17
.

19.

Jorgensen
SF
,
Reims
HM
,
Freydenlund
D
, et al.
A cross-sectional study of the prevalence of gastrointestinal symptoms and pathology in patients with common variable immunodeficiency
.
Am J Gastroenterol
2016
;
111
(
10
):
1467
75
.

20.

Pietz
G
,
De
R
,
Hedberg
M
, et al.
Immunopathology of childhood celiac disease—key role of intestinal epithelial cells
.
PLoS One
2017
;
12
(
9
):
e0185025
.

21.

Plaza-Izurieta
L
,
Fernandez-Jimenez
N
,
Irastorza
I
, et al.
Expression analysis in intestinal mucosa reveals complex relations among genes under the association peaks in celiac disease
.
Eur J Hum Genet
2015
;
23
(
8
):
1100
5
.

22.

Pascual
V
,
Medrano
LM
,
López-Palacios
N
, et al.
Different gene expression signatures in children and adults with celiac disease
.
PLoS One
2016
;
11
(
2
):
e0146276
.

23.

Garcia-Etxebarria
K
,
Jauregi-Miguel
A
,
Romero-Garmendia
I
, et al.
Ancestry-based stratified analysis of Immunochip data identifies novel associations with celiac disease
.
Eur J Hum Genet
2016
;
24
(
12
):
1831
4
.

24.

Xing
L
,
Dai
Z
,
Jabbari
A
, et al.
Alopecia areata is driven by cytotoxic T lymphocytes and is reversed by JAK inhibition
.
Nat Med
2014
;
20
(
9
):
1043
9
.

25.

Jabbari
A
,
Cerise
JE
,
Chen
JC
, et al.
Molecular signatures define alopecia areata subtypes and transcriptional biomarkers
.
EBioMedicine
2016
;
7
:
240
7
.

26.

Regent
A
,
Mouthon
I
. Expression Data of Vascular Smooth Muscle Cell in Patients With Suspected Giant Cell Arteritis (GCA). .

27.

Limbach
M
,
Saare
M
,
Tserel
L
, et al.
Epigenetic profiling in CD4+ and CD8+ T cells from Graves’ disease patients reveals changes in genes associated with T cell receptor signaling
.
J Autoimmun
2016
;
67
:
46
56
.

28.

Chen
YW
,
Shi
R
,
Geraci
N
, et al.
Duration of chronic inflammation alters gene expression in muscle from untreated girls with juvenile dermatomyositis
.
BMC Immunol
2008
;
9
:
43
.

29.

Greenberg
SA
,
Pinkus
JL
,
Pinkus
GS
, et al.
Interferon-alpha/beta-mediated innate immune mechanisms in dermatomyositis
.
Ann Neurol
2005
;
57
(
5
):
664
78
.

31.

Nakagawa
R
,
Muroyama
R
,
Saeki
C
, et al.
miR-425 regulates inflammatory cytokine production in CD4+ T cells via N-Ras upregulation in primary biliary cholangitis
.
J Hepatol
2017
;
66
(
6
):
1223
30
.

32.

Chang
KH
,
Chuang
TJ
,
Lyu
RK
, et al.
Identification of gene networks and pathways associated with Guillain–Barré syndrome
.
PLoS One
2012
;
7
(
1
):
e29506
.

33.

Luo
L
,
Zhou
WH
,
Cai
JJ
, et al.
Gene expression profiling identifies downregulation of the neurotrophin-MAPK signaling pathway in female diabetic peripheral neuropathy patients
.
J Diabetes Res
2017
;
2017
:
8103904
.

34.

Lee
HM
,
Sugino
H
,
Aoki
C
, et al.
Underexpression of mitochondrial-DNA encoded ATP synthesis-related genes and DNA repair genes in systemic lupus erythematosus
.
Arthritis Res Ther
2011
;
13
(
2
):
R63
.

35.

Lauwerys
BR
,
Louahed
J
,
Knoops
L
, et al. Identification of Genes Specifically Over-Expressed in Lupus CD4 T and B cells. .

36.

Dey-Rao
R
,
Sinha
AA
.
Vitiligo blood transcriptomics provides new insights into disease mechanisms and identifies potential novel therapeutic targets
.
BMC Genomics
2017
;
18
(
1
):
109
.

37.

Rashighi
M
,
Agarwal
P
,
Richmond
JM
, et al.
CXCL10 is critical for the progression and maintenance of depigmentation in a mouse model of vitiligo
.
Sci Transl Med
2014
;
6
(
223
):
223ra23
.

38.

Kurppa
K
,
Laitinen
A
,
Agardh
D
.
Coeliac disease in children with type 1 diabetes
.
Lancet Child Adolesc Health
2018
;
2
(
2
):
133
43
.

39.

Gene Ontology Consortium
.
Gene Ontology Consortium: going forward
.
Nucleic Acids Res
2015
;
43
(
Database issue
):
D1049
56
.

40.

Kibbe
WA
,
Arze
C
,
Felix
V
, et al.
Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data
.
Nucleic Acids Res
2015
;
43
(
Database issue
):
D1071
8
.

41.

Subramanian
A
,
Tamayo
P
,
Mootha
VK
, et al.
Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles
.
Proc Natl Acad Sci U S A
2005
;
102
(
43
):
15545
50
.

42.

Wang
JZ
,
Ali
F
,
Srimani
PK
. Efficient method to measure the semantic similarity of ontologies. In:
Wu
S
,
Yang
LT
,
Xu
TL
(eds).
Advances in Grid and Pervasive Computing, GPC 2008, Lecture Notes in Computer Science,
Vol. 5036.
Berlin, Heidelberg
:
Springer
,
2008
.

43.

Moni
MA
,
Liò
P
.
comoR: a software for disease comorbidity risk assessment
.
J Clin Bioinforma
2014
;
4
:
8
.

44.

Pesquita
C
,
Faria
D
,
Bastos
H
, et al.
Metrics for GO based protein semantic similarity: a systematic evaluation
.
BMC Bioinformatics
2008
;
9
(
Suppl 5
):
S4
.

45.

Kanehisa
M
,
Sato
Y
,
Kawashima
M
, et al.
KEGG as a reference resource for gene and protein annotation
.
Nucleic Acids Res
2016
;
44
:
D457
62
.

46.

Huber
W
,
Carey
VJ
,
Gentleman
R
, et al.
Orchestrating high-throughput genomic analysis with Bioconductor
.
Nat Methods
2015
;
12
(
2
):
115
21
.

47.

Davis
S
,
Meltzer
P
.
GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor
.
Bioinformatics
2007
;
23
(
14
):
1846
7
.

48.

Ritchie
ME
,
Phipson
B
,
Wu
D
, et al.
limma powers differential expression analyses for RNA-sequencing and microarray studies
.
Nucleic Acids Res
2015
;
43
(
7
):
e47
.

49.

Gentleman
R
,
Carey
V
,
Huber
W
,
Hahne
F
. genefilter: methods for filtering genes from high-throughput experiments.
R package version 1.60.0
,
2017
.

50.

Alexa
A
,
Rahnenfuhrer
J
topGO: enrichment analysis for gene ontology.
R package version 2.30.0
,
2016
.

51.

Yu
G
,
Li
F
,
Qin
Y
, et al.
GOSemSim: an R package for measuring semantic similarity among GO terms and gene products
.
Bioinformatics
2010
;
26
(
7
):
976
8
.

52.

Yu
G
,
Wang
LG
,
Yan
GR
, et al.
DOSE: an R/Bioconductor package for disease ontology semantic and enrichment analysis
.
Bioinformatics
2015
;
31
(
4
):
608
9
.

53.

Yu
G
,
Wang
LG
,
Han
Y
, et al.
clusterProfiler: an R package for comparing biological themes among gene clusters
.
OMICS
2012
;
16
(
5
):
284
7
.

54.

Marrè
ML
,
James
EA
,
Piganelli
JD
.
β cell ER stress and the implications for immunogenicity in type 1 diabetes
.
Front Cell Dev Biol
2015
;
3
:
67
.

55.

Demirtas
K
,
Yayla
C
,
Yüksel
M
, et al.
Tp-e interval and Tp-e/QT ratio in patients with celiac disease
.
Rev Clin Esp
2017
;
217
(
8
):
439
45
.

56.

Bouziat
R
,
Hinterleitner
R
,
Brown
JJ
, et al.
Reovirus infection triggers inflammatory responses to dietary antigens and development of celiac disease
.
Science
2017
;
356
(
6333
):
44
50
.

57.

Warde-Farley
D
,
Donaldson
SL
,
Comes
O
, et al.
The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function
.
Nucleic Acids Res
2010
;
38
(
Web Server issue
):
W214
20
.

58.

Gianfrani
C
,
Pisapia
L
,
Picascia
S
, et al.
Expression level of risk genes of MHC class II is a susceptibility factor for autoimmunity: new insights
.
J Autoimmun
2018
;
89
:
1
10
.

59.

Kauffmann
A
,
Gentleman
R
,
Huber
W
.
arrayQualityMetrics—a bioconductor package for quality assessment of microarray data
.
Bioinformatics
2009
;
25
(
3
):
415
6
.

60.

Quinn
EM
,
Coleman
C
,
Molloy
B
, et al.
Transcriptome analysis of CD4+ T cells in coeliac disease reveals imprint of BACH2 and IFNγ regulation
.
PLoS One
2015
;
10
(
10
):
e0140049
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)