Bioinformatics methodologies for coeliac disease and its comorbidities

Del Prete, Eugenio; Facchiano, Angelo; Liò, Pietro

doi:10.1093/bib/bby109

Abstract

Coeliac disease (CD) is a complex, multifactorial pathology caused by different factors, such as nutrition, immunological response and genetic factors. Many autoimmune diseases are comorbidities for CD, and a comprehensive and integrated analysis with bioinformatics approaches can help in evaluating the interconnections among all the selected pathologies. We first performed a detailed survey of gene expression data available in public repositories on CD and less commonly considered comorbidities. Then we developed an innovative pipeline that integrates gene expression, cell-type data and online resources (e.g. a list of comorbidities from the literature), using bioinformatics methods such as gene set enrichment analysis and semantic similarity. Our pipeline is written in R language, available at the following link: http://bioinformatica.isa.cnr.it/COELIAC_DISEASE/SCRIPTS/. We found a list of common differential expressed genes, gene ontology terms and pathways among CD and comorbidities and the closeness among the selected pathologies by means of disease ontology terms. Physicians and other researchers, such as molecular biologists, systems biologists and pharmacologists can use it to analyze pathology in detail, from differential expressed genes to ontologies, performing a comparison with the pathology comorbidities or with other diseases.

coeliac disease, coeliac disease comorbidities, clinical bioinformatics, ontologies, gene set enrichment analysis, semantic similarity

Introduction

Bioinformatics tools offer a wide spectrum of resources to analyze omics data, as for instance gene expression data. Public databases contain such data in a structured way and constitute an impressive amount of information, in most cases not completely exploited. The integration of different bioinformatics tools makes it possible a synergism of the analysis methods thus offering the opportunity to extract more information from the available data. This approach can be very useful in particular cases, as human pathologies with multifactorial aspects.

Coeliac disease (CD) is a chronic condition and an autoimmune disorder that causes an inflammatory reaction localized in the human small intestine. Two factors are significant: the genetic and the environmental components. It occurs in predisposed individuals with genetic variants, in which an autoimmune response from the small intestine is activated by gluten, a storage protein fraction found in different cereals, such as wheat, rye and barley. Gliadins and glutenins are the main components of gluten, responsible for triggering the autoimmune response after the crossing of the small intestine epithelium [1, 2]. The key aspects of known disease mechanism can be summarized as follows:

Figure 1

CD comorbidity map. The autoimmune pathologies associated to CD are clustered for typologies: neurological diseases (ochre, 1), endocrine diseases (green, 2), cardiological disease (red, 3), liver diseases (cyan, 4), dermatological diseases (purple, 5), rheumatological and connective tissue diseases (pink, 6) and various diseases (gray, 7). The little squares inside the human silhouette represent the principal zones of the body affected by the single autoimmune pathology. (The pathologies with * are localized all over the internal/external part of the body.)

Open in new tab Download slide

(i)
Gluten induces enterocytes to upregulate zonulin (haptoglobin 2 precursor), which loosens enterocytes tight junctions, increasing the permeability of the small intestine. The interaction between gliadin and an overexpressed C-X-C Motif Chemokine Receptor 3 is essential for the release of zonulin [3];
(ii)
After the crossing toward lamina propria, gluten causes cytokines secretion, which induces the activation of intraepithelial lymphocytes against the enterocytes. In more detail, Interleukin 15 is a proinflammatory cytokine that regulates T and natural killer cell activation and proliferation, activating cytotoxicity by NKG2D+ cells and interferon-gamma (IFNγ) production by intraepithelial lymphocytes [4];
(iii)
Damaged enterocytes release tissue transglutaminase, which deamidates selected glutamines in gluten fragments. Then, the immune mechanisms recognize them by means of immunoglobulin A (IgA) antibodies and activate antigen-presenting cells by joining to HLA-DQ2 or HLA-DQ8 molecules [5];
(iv)
Helper T cells (CD4+) recognize the complex and begin to damage enterocytes, both directly by means of secreted cytokines and killer T cells and indirectly by means of B cells. Mature B cells generate antibodies against tTG and gluten, persisting with the damage of the enterocytes and contributing to the final atrophy of the villi [6].

Genetic analyses show that CD is associated with HLA class II molecules that are encoded by DQA1*0501/DQB1*02 (DQ2) or DQA1*03/DQB1*0302 (DQ8) genes [7]. HLA-DQ2 and HLA-DQ8 recognize the deamidated gliadin. Nevertheless, there is a genetic risk due to an unknown number of non-human leucocyte antigens (HLA) genes [8].

The inherent multifactorial nature characterizes the pathology as complex disease [6, 9]. This disease is very often associated with autoimmune diseases; the comorbidity may be triggered by the activation of similar biological pathways, related to immune response and inflammation development. Moreover, the atrophy of the villi and the increased permeability can contribute to the occurrence of indirect-related disease [10]. From this point of view, not only the CD is limited to the gastrointestinal zone of the body but also it is a systemic disease. The importance to study the comorbidities of a complex disease is highlighted in [11]. In Figure 1, we present the CD comorbidity map with particular focus on autoimmune diseases associated to CD.

Another issue in the pathology is the onset time, which is related to its etiology. For a long time, CD has been diagnosed in child and considered a gastrointestinal disorder of childhood. Nowadays, it is often diagnosed in adults, associated to many symptoms and conditions. A significant amount of individuals is asymptomatic at diagnosis. Furthermore, endoscopic and histological evaluations are the best ways of assessing small intestine healing, despite their invasiveness and costs. The well-known treatment is a lifelong gluten-free diet (GFD), but it can be ineffective for a large percentage of the patients [12–14].

The availability of the gene expression data for the CD and other related pathologies in public repositories makes this disease approachable using bioinformatics and ontology approaches and testing how an integrative pipeline may offer a wider view of the pathology itself.

Methods

Survey of available data

We retrieved experimental data sets from the Gene Expression Omnibus (GEO) database [15], available from the National Center for Biotechnology Information (NCBI), and the Array Express database [16], available from the European Bioinformatics Institute. Four hundred twenty-seven query results about CD are available in GEO repository (June 2018). In order to eliminate the studies with a low number of samples, we selected a lower bound of 10 samples, thus the number of results decreases to 18 from which only 7 are strictly related to the disease or not redundant. This information about the quantity of data is reported to show how few collected CD data are usable, in comparison with all the available data sets. Indeed, there are four principal issues on selecting microarray data sets:

(i)
Redundancy. Some data sets are replicated in similar condition or explored with different methods, but the samples are the same, thus there is no necessity to take them into account more than once;
(ii)
Typology. Some data sets are not in a form suitable for performing a simple differential expression evaluation, such as sequence data, thus they are not useful for the integration;
(iii)
Relevance. Some data sets are annotated with information related to a certain pathology, with particular emphasis on biological relationships, but the samples are not about the pathology itself, thus they are not suitable;
(iv)
Species. some data sets come from non-human species, thus the analysis is not compatible.

The search within the Array Express database did not add more data sets with respect to GEO.

Gene expression data sets research for CD returned the following studies from GEO: GSE11501, a study of genetic variation about primary leucocytes [17]; GSE87629, a genome-wide analysis of lymphocytes gene expression during a 6 week gluten challenge, looking for changes correlated to villi atrophy [18]; GSE72625, a combined study on intraepithelial lymphocytes between common variable immunodeficiency and CD [19]; GSE106260, an immunopathology study in children, particularly on intestinal epithelial cells [20]; GSE61849, an expression profiling of selected genes from children intestinal mucosa, among sick, healthy and GFD-treated children [21]; GSE76168, a compared study between adult and children about differential gene expression signatures [22]; and GSE84729, an ‘immunoancestry-based’ study about novel candidate genes related to CD [23]. In our study, the 4th data set is divided into intestinal epithelial cells and intraepithelial lymphocytes; the 5th and the 7th data sets are divided into two different analyses: CD versus control and CD versus GFD-treated; furthermore, the 5th data set is reduced to the children case because of the unbalanced number of the adult individuals.

The selected studies on autoimmune disease, for the comparison with CD, are GSE45512, a transcriptomic description of affected alopecic scalp skin [24]; GSE68801, a gene expression profiling from a whole genome analysis on alopecia areata samples [25]; GSE63425, a study on expression data of vascular smooth muscle cell in patients with suspected giant cell arteritis [26]; GSE71956, an analysis of epigenetic profile in T cells from Graves’ disease patients [27]; GSE11971, an evaluation of the chronic inflammation on gene expression in skeletal muscle from children with juvenile dermatomyositis [28]; GSE1551, a study of the innate response of some cytokines, from B and T helper cell muscle infiltration, in dermatomyositis [29]; GSE5370, an overall analysis of idiopathic inflammatory myopathies (polymyositis and dermatomyositis) [30]; GSE93170, an analysis of microRNA and mRNA expressions in CD4+ T cells to investigate primary biliary cirrhosis pathogenesis [31]; GSE31014, a study on the transcriptional profile of leukocytes in the patients with Guillain–Barré syndrome [32]; GSE95849, a study on the identification of mechanisms underlying the pathogenesis of diabetes mellitus and diabetic peripheral neuropathy [33]; GSE17755, an evaluation of gene expression profiling of peripheral blood cells from patients with different autoimmune diseases, with a focus on systemic lupus erythematosus [34]; GSE4588, gene expression studies performed on peripheral blood mononuclear cell from systemic lupus erythematosus patients and other autoimmune diseases [35]; GSE90880, a study of the autoimmune response in vitiligo regarding the systemic changes underlying skin-specific manifestation [36]; and GSE53148, the analysis of some cytokines and chemokines in patients with vitiligo [37]. In our study, the 3rd data set takes into account histological-proven and not histological-proven patients; the 4th study is divided into CD4 and CD8 T cells; the 11th study is divided into juvenile idiopathic arthritis and canonical rheumatoid arthritis. For completeness and to our best knowledge, the following pathologies have no data set on GEO: Addison’s disease, dermatitis herpetiformis, gluten ataxia, autoimmune pericarditis and microscopic colitis. Type I diabetes has a strong correlation with CD, especially in children, as claimed in [38], but we preferred to focus our work on less commonly pathologies studied together with CD.

Ontologies

The gene ontology (GO) project [39] provides comprehensive structured information of biological systems, from the molecular level to larger pathways, cellular- and organism-level systems. There are three GO domains: cellular component, molecular function and biological process (BP). Our approach is based on BP.

The disease ontology (DO) project is conceptually similar to the previous one, described in [40]; the mission is to provide an open-source ontology for integrated biomedical data, associated with human disease. The hierarchical organization takes into account clinical, etiological and location information and trying to provide an online comprehensive theory of disease. The DO terms for the selected diseases are alopecia areata DOID:986, cerebral arteritis DOID:11390, temporal arteritis DOID:13375, Takayasu arteritis DOID:2508, polyarteritis nodosa DOID:9810, autoimmune thyroid disease DOID:7188, dermatomyositis DOID:10223, autonomic peripheral neuropathy DOID:0060054, primary biliary cirrhosis DOID:12236, rheumatoid arthritis DOID:7148, juvenile rheumatoid arthritis DOID:676, vitiligo DOID:12306 and celiac disease DOID:10608.

Gene set enrichment analysis and semantic similarity

Gene set enrichment analysis (GSEA) is a set of statistical methods to classify genes in groups, taking into account common biological functions, chromosomal location or regulation [41]. Genes from microarrays or next generation sequencing (NGS) technologies can be analyzed by their differential expression among several conditions, selected as representative for a certain disease and correlated to GO terms, finding possible pathways linked to phenotype changes, e.g. disease development. The power of these methods lies in overcoming the possibility of masking single gene information, such as the pathways disease-related in which it is involved. Gene sets are more relevant of a single gene, especially when correlated to phenotype information.

Semantic similarity is a method to measure the closeness between two terms, according to a given ontology. Many methods are based on the annotation statistics of their common ancestor terms, i.e. on the information content of each term. In this work, the connections among the terms (genes, GO and DO) are more important than single comparisons. The Wang method fits for the purpose because it is a graph-based method constructed on the topology inherited by the selected ontology. In a nutshell, an ontology term can be represented with a directed acyclic graph |${DAG}_X=\left(X,{T}_X,{E}_X\right)$|⁠, where |$X$| is the term, |${T}_X$| is the set of ancestors of |$X$| and |${E}_X$|is the set of semantic relations (edges) among the terms. The semantic value of |$X$| is

$$\begin{equation} \left\{\begin{array}{ll}{S}_X(X)=1& t=X\\ {}{S}_X(t)=\mathit{\max}\left\{{w}_e\ast {S}_X\left({t}^{\prime}\right)|{t}^{\prime}\in children\ of\ (t)\right\}& t\ne X\end{array}\right. \end{equation}$$

(1)

where |$t$| is a generic term, |${t}^{\prime }$| is a child term and |${w}_e$| is the semantic contribution between |$t$| and |${t}^{\prime }$|⁠. The global semantic value for |$X$| is

$$\begin{equation} SV(X)=\sum_{ t\in {T}_X}{S}_X(t) \end{equation}$$

(2)

and the semantic similarity between two terms is

$$\begin{equation} sim\left(X,Y\right)=\frac{\sum_{t\in {T}_X\cap {T}_Y}\left[{S}_X(t)+{S}_Y(t)\right]}{SV(X)+ SV(Y)} \end{equation}$$

(3)

where |$X\ \mathrm{and}\ Y$| are two different terms. The most important advantage of this method is the calculation of a semantic similarity measure without taking into account any annotation statistics [42, 43]. The best-match average method calculates the semantic similarity between two different sets of terms [44]. Given two sets of terms |${T}_1=\left\{{t}_{11},{t}_{12},\dots {t}_{1l}\right\}$| and |${T}_2=\left\{{t}_{21},{t}_{22},\dots {t}_{2m}\right\}$| where |$l$| is the length of the 1st set and |$m$| is the length of the 2nd set, the semantic similarity between the two sets is

$$\begin{align} &{sim}_{BMA}\left({T}_1,{T}_2\right) = \frac{\sum_{i=1}^l\max_{1\le j\le m} sim\left({t}_{1i},{t}_{2j}\right)+\sum_{j=1}^m\max_{1\le i\le l} sim\left({t}_{1i},{t}_{2j}\right)}{l+m} \end{align}$$

(4)

with |$i\ \mathrm{and}\ j$| indices on |${T}_1\ \mathrm{and}\ {T}_2$| terms.

Designing of pipeline

Figure 2 shows the steps of the pipeline:

Figure 2

Block diagram of the pipeline. Different colors indicate different levels of analysis, from the output of the previous step.

Open in new tab Download slide

Table 1

Open in new tab

CD statistics summary. The features from left to right are number of raw genes, number of genes with significant P-value, number of genes with significant adjusted P-value, number of differential expressed genes, number of raw GO terms and number of significant GO terms. Statistics related to LogFC with threshold 2 are between brackets

Data set	Cell source	Raw genes	P-value	Adjusted P-value	LogFC	Raw GSEA	Fisher GSEA
GSE11501	Peripheral blood	18 981	2341	156	1 (0)	5700	12 (0)
GSE87629	Peripheral blood	20 624	785	0	14 (0)	4928	26 (0)
GSE72625	Intestinal biopsy	47 323	7632	3358	260 (23)	6128	296 (44)
GSE106260a	Intestinal biopsy	24 526	2980	0	267 (10)	5907	106 (18)
GSE106260b	Intestinal biopsy	24 526	2354	0	6818 (2120)	5783	109 (28)
GSE61849a	Intestinal biopsy	46	9	3	4 (2)	176	0 (0)
GSE61849b	Intestinal biopsy	46	3	0	8 (3)	176	0 (0)
GSE76168	Intestinal biopsy	40	10	4	2 (2)	232	0 (0)
GSE84729a	Intestinal biopsy	65	32	29	11 (4)	327	0 (0)
GSE84729b	Intestinal biopsy	65	20	14	11 (2)	327	0 (0)

Data set	Cell source	Raw genes	P-value	Adjusted P-value	LogFC	Raw GSEA	Fisher GSEA
GSE11501	Peripheral blood	18 981	2341	156	1 (0)	5700	12 (0)
GSE87629	Peripheral blood	20 624	785	0	14 (0)	4928	26 (0)
GSE72625	Intestinal biopsy	47 323	7632	3358	260 (23)	6128	296 (44)
GSE106260a	Intestinal biopsy	24 526	2980	0	267 (10)	5907	106 (18)
GSE106260b	Intestinal biopsy	24 526	2354	0	6818 (2120)	5783	109 (28)
GSE61849a	Intestinal biopsy	46	9	3	4 (2)	176	0 (0)
GSE61849b	Intestinal biopsy	46	3	0	8 (3)	176	0 (0)
GSE76168	Intestinal biopsy	40	10	4	2 (2)	232	0 (0)
GSE84729a	Intestinal biopsy	65	32	29	11 (4)	327	0 (0)
GSE84729b	Intestinal biopsy	65	20	14	11 (2)	327	0 (0)

Table 1

Open in new tab

CD statistics summary. The features from left to right are number of raw genes, number of genes with significant P-value, number of genes with significant adjusted P-value, number of differential expressed genes, number of raw GO terms and number of significant GO terms. Statistics related to LogFC with threshold 2 are between brackets

Data set	Cell source	Raw genes	P-value	Adjusted P-value	LogFC	Raw GSEA	Fisher GSEA
GSE11501	Peripheral blood	18 981	2341	156	1 (0)	5700	12 (0)
GSE87629	Peripheral blood	20 624	785	0	14 (0)	4928	26 (0)
GSE72625	Intestinal biopsy	47 323	7632	3358	260 (23)	6128	296 (44)
GSE106260a	Intestinal biopsy	24 526	2980	0	267 (10)	5907	106 (18)
GSE106260b	Intestinal biopsy	24 526	2354	0	6818 (2120)	5783	109 (28)
GSE61849a	Intestinal biopsy	46	9	3	4 (2)	176	0 (0)
GSE61849b	Intestinal biopsy	46	3	0	8 (3)	176	0 (0)
GSE76168	Intestinal biopsy	40	10	4	2 (2)	232	0 (0)
GSE84729a	Intestinal biopsy	65	32	29	11 (4)	327	0 (0)
GSE84729b	Intestinal biopsy	65	20	14	11 (2)	327	0 (0)

Data set	Cell source	Raw genes	P-value	Adjusted P-value	LogFC	Raw GSEA	Fisher GSEA
GSE11501	Peripheral blood	18 981	2341	156	1 (0)	5700	12 (0)
GSE87629	Peripheral blood	20 624	785	0	14 (0)	4928	26 (0)
GSE72625	Intestinal biopsy	47 323	7632	3358	260 (23)	6128	296 (44)
GSE106260a	Intestinal biopsy	24 526	2980	0	267 (10)	5907	106 (18)
GSE106260b	Intestinal biopsy	24 526	2354	0	6818 (2120)	5783	109 (28)
GSE61849a	Intestinal biopsy	46	9	3	4 (2)	176	0 (0)
GSE61849b	Intestinal biopsy	46	3	0	8 (3)	176	0 (0)
GSE76168	Intestinal biopsy	40	10	4	2 (2)	232	0 (0)
GSE84729a	Intestinal biopsy	65	32	29	11 (4)	327	0 (0)
GSE84729b	Intestinal biopsy	65	20	14	11 (2)	327	0 (0)

Table 2

Open in new tab

Autoimmune CD-related pathologies statistics summary. The features from left to right are number of raw genes, number of genes with significant P-value, number of genes with significant adjusted P-value, number of differential expressed genes, number of raw GO terms and number of significant GO terms. Statistics related to LogFC with threshold 2 are between brackets. Data set legend: AA—GSE45512 and GSE68801; AR—GSE63425; AT—GSE71956; DM—GSE11971, GSE1551 and GSE5370; PB—GSE93170; PN—GSE31014 and GSE95849; RA—GSE17755 and GSE4588; VI—GSE90880 and GSE53148

Data set	Cell source	Raw genes	P-value	Adjusted P-value	LogFC	Raw GSEA	Fisher GSEA
GSE45512	Skin biopsy (from scalp)	54 675	290	0	76 (10)	390	16 (0)
GSE68801	Skin biopsy (from scalp)	54 675	21 288	15 508	1198 (163)	6118	406 (95)
GSE63425a	Temporal artery biopsy	22 741	1260	0	142 (13)	6088	222 (13)
GSE63425b	Temporal artery biopsy	22 741	1138	0	46 (2)	6088	107 (6)
GSE71956a	Peripheral blood	47 323	1634	0	470 (3)	6128	53 (32)
GSE71956b	Peripheral blood	47 323	996	0	122 (0)	6128	123 (0)
GSE11971	Skeletal muscle biopsy	22 283	10 478	8308	7314 (1687)	5762	47 (179)
GSE1551	Muscle biopsy	22 283	7096	1820	7876 (543)	5762	93 (322)
GSE5370	Skeletal muscle biopsy	22 283	8450	4737	6522 (965)	5762	6 (26)
GSE93170	miRNA and mRNA profile	40 848	10695	3313	780 (26)	6101	152 (175)
GSE31014	Peripheral blood	22 283	3383	0	455 (6)	5762	128 (42)
GSE95849	Peripheral blood	31 741	16 410	13 963	8641 (1663)	6152	105 (133)
GSE17755a	Peripheral blood	30 336	11 844	8820	66 (5)	5819	28 (31)
GSE17755b	Peripheral blood	30 336	17 127	15 605	393 (26)	5819	13 (0)
GSE4588	Peripheral blood	54 675	5819	44	5185 (234)	6118	42 (91)
GSE90880	Peripheral blood	12 625	517	0	43 (4)	5274	99 (22)
GSE53148	Skin biopsy	25 326	1052	0	6349 (816)	5899	18 (151)

Data set	Cell source	Raw genes	P-value	Adjusted P-value	LogFC	Raw GSEA	Fisher GSEA
GSE45512	Skin biopsy (from scalp)	54 675	290	0	76 (10)	390	16 (0)
GSE68801	Skin biopsy (from scalp)	54 675	21 288	15 508	1198 (163)	6118	406 (95)
GSE63425a	Temporal artery biopsy	22 741	1260	0	142 (13)	6088	222 (13)
GSE63425b	Temporal artery biopsy	22 741	1138	0	46 (2)	6088	107 (6)
GSE71956a	Peripheral blood	47 323	1634	0	470 (3)	6128	53 (32)
GSE71956b	Peripheral blood	47 323	996	0	122 (0)	6128	123 (0)
GSE11971	Skeletal muscle biopsy	22 283	10 478	8308	7314 (1687)	5762	47 (179)
GSE1551	Muscle biopsy	22 283	7096	1820	7876 (543)	5762	93 (322)
GSE5370	Skeletal muscle biopsy	22 283	8450	4737	6522 (965)	5762	6 (26)
GSE93170	miRNA and mRNA profile	40 848	10695	3313	780 (26)	6101	152 (175)
GSE31014	Peripheral blood	22 283	3383	0	455 (6)	5762	128 (42)
GSE95849	Peripheral blood	31 741	16 410	13 963	8641 (1663)	6152	105 (133)
GSE17755a	Peripheral blood	30 336	11 844	8820	66 (5)	5819	28 (31)
GSE17755b	Peripheral blood	30 336	17 127	15 605	393 (26)	5819	13 (0)
GSE4588	Peripheral blood	54 675	5819	44	5185 (234)	6118	42 (91)
GSE90880	Peripheral blood	12 625	517	0	43 (4)	5274	99 (22)
GSE53148	Skin biopsy	25 326	1052	0	6349 (816)	5899	18 (151)

Table 2

Open in new tab

Autoimmune CD-related pathologies statistics summary. The features from left to right are number of raw genes, number of genes with significant P-value, number of genes with significant adjusted P-value, number of differential expressed genes, number of raw GO terms and number of significant GO terms. Statistics related to LogFC with threshold 2 are between brackets. Data set legend: AA—GSE45512 and GSE68801; AR—GSE63425; AT—GSE71956; DM—GSE11971, GSE1551 and GSE5370; PB—GSE93170; PN—GSE31014 and GSE95849; RA—GSE17755 and GSE4588; VI—GSE90880 and GSE53148

Data set	Cell source	Raw genes	P-value	Adjusted P-value	LogFC	Raw GSEA	Fisher GSEA
GSE45512	Skin biopsy (from scalp)	54 675	290	0	76 (10)	390	16 (0)
GSE68801	Skin biopsy (from scalp)	54 675	21 288	15 508	1198 (163)	6118	406 (95)
GSE63425a	Temporal artery biopsy	22 741	1260	0	142 (13)	6088	222 (13)
GSE63425b	Temporal artery biopsy	22 741	1138	0	46 (2)	6088	107 (6)
GSE71956a	Peripheral blood	47 323	1634	0	470 (3)	6128	53 (32)
GSE71956b	Peripheral blood	47 323	996	0	122 (0)	6128	123 (0)
GSE11971	Skeletal muscle biopsy	22 283	10 478	8308	7314 (1687)	5762	47 (179)
GSE1551	Muscle biopsy	22 283	7096	1820	7876 (543)	5762	93 (322)
GSE5370	Skeletal muscle biopsy	22 283	8450	4737	6522 (965)	5762	6 (26)
GSE93170	miRNA and mRNA profile	40 848	10695	3313	780 (26)	6101	152 (175)
GSE31014	Peripheral blood	22 283	3383	0	455 (6)	5762	128 (42)
GSE95849	Peripheral blood	31 741	16 410	13 963	8641 (1663)	6152	105 (133)
GSE17755a	Peripheral blood	30 336	11 844	8820	66 (5)	5819	28 (31)
GSE17755b	Peripheral blood	30 336	17 127	15 605	393 (26)	5819	13 (0)
GSE4588	Peripheral blood	54 675	5819	44	5185 (234)	6118	42 (91)
GSE90880	Peripheral blood	12 625	517	0	43 (4)	5274	99 (22)
GSE53148	Skin biopsy	25 326	1052	0	6349 (816)	5899	18 (151)

Data set	Cell source	Raw genes	P-value	Adjusted P-value	LogFC	Raw GSEA	Fisher GSEA
GSE45512	Skin biopsy (from scalp)	54 675	290	0	76 (10)	390	16 (0)
GSE68801	Skin biopsy (from scalp)	54 675	21 288	15 508	1198 (163)	6118	406 (95)
GSE63425a	Temporal artery biopsy	22 741	1260	0	142 (13)	6088	222 (13)
GSE63425b	Temporal artery biopsy	22 741	1138	0	46 (2)	6088	107 (6)
GSE71956a	Peripheral blood	47 323	1634	0	470 (3)	6128	53 (32)
GSE71956b	Peripheral blood	47 323	996	0	122 (0)	6128	123 (0)
GSE11971	Skeletal muscle biopsy	22 283	10 478	8308	7314 (1687)	5762	47 (179)
GSE1551	Muscle biopsy	22 283	7096	1820	7876 (543)	5762	93 (322)
GSE5370	Skeletal muscle biopsy	22 283	8450	4737	6522 (965)	5762	6 (26)
GSE93170	miRNA and mRNA profile	40 848	10695	3313	780 (26)	6101	152 (175)
GSE31014	Peripheral blood	22 283	3383	0	455 (6)	5762	128 (42)
GSE95849	Peripheral blood	31 741	16 410	13 963	8641 (1663)	6152	105 (133)
GSE17755a	Peripheral blood	30 336	11 844	8820	66 (5)	5819	28 (31)
GSE17755b	Peripheral blood	30 336	17 127	15 605	393 (26)	5819	13 (0)
GSE4588	Peripheral blood	54 675	5819	44	5185 (234)	6118	42 (91)
GSE90880	Peripheral blood	12 625	517	0	43 (4)	5274	99 (22)
GSE53148	Skin biopsy	25 326	1052	0	6349 (816)	5899	18 (151)

(i)
‘Data retrieval’. The selection of GEO data is carried out by GEO sample (GSM) records, downloaded with the platform and matrices information and converted into expression set class for the analysis. In order to evaluate the genes’ differential expression, healthy controls or treated patients with gluten-free diet are necessary for CD case;
(ii)
‘Sample selection’. An automatic selection for the GEO samples is not straightforward because sick and healthy patients are often mixed together; therefore, we performed a prior check over GSM records. This step helps in creating the design model, factorizing the classes of the patients (sick, healthy, treated). The design model is coeliac versus control or treated in CD cases, autoimmune disease versus control in the other cases;
(iii)
‘Differential expression’. After a linear and a Bayesian filtering over the design model, the results are tabulated. Three parameters determine the choice about the selection of candidate genes: P-value, adjusted P-value (Benjamini–Hochberg correction) and base 2 logarithm of fold change (logFC), sorted by logFC values;
(iv)
‘GO terms test’. We created the ‘topGOdata’ class by selecting the GO domain, using a filtering function to establish selected genes (out of genes universe) and providing the annotation for the mapping. We performed the Fisher’s exact test in order to obtain a filtering on GO terms and their relationships with the selected genes;
(v)
‘Semantic similarity’. After extracting the correspondences gene/GO term, we applied semantic similarity to all the autoimmune pathologies, using genes, GO terms and DO terms for the comparisons, in order to quantify the closeness among all the selected datasets;
(vi)
‘Cluster comparison’. Furthermore, we performed an enrichment test based on hypergeometric distribution for differential expressed genes and KEGG pathways [45], to find the principal pathways or pathologies for the CD and its comorbidities data sets;
(vii)
‘Results’. Finally, the pipeline output consisted of stastical summary, gene/GO term correspondences, GO graph topologies, semantic similarity matrices and dendrograms (for genes, GO terms and DO terms), and KEGG enrichment graph with the list of pathways/pathologies. Moreover, we used the list of differential expressed genes in common between CD and comorbidities to generate a gene network, with information on involved pathways/pathologies.

The pipeline is a combination of two main scripts in R language, available at the following link: http://bioinformatica.isa.cnr.it/COELIAC_DISEASE/SCRIPTS/. They build up the pipeline using the following Bioconductor packages [46]: ‘GEOquery’ [47] for data transformation in expression set class; ‘limma’ [48] for data analysis, linear models and differential expression on microarray data; ‘genefilter’ [49] for basic functions on filtering genes; ‘topGO’ [50] for the test of GO terms and the topology arrangement of the GO graph; ‘GOSemSim’ [51] for the semantic similarity evaluation among the diseases; ‘DOSE’ [52] for the semantic similarity evaluation among DO terms; and ‘clusterProfiler’ [53] for the enrichment analysis with KEGG pathways. We would point out that ‘GEOquery’ package provides the possibility to download GEO data sets directly but last updates show some issues in interaction with the other packages. Thus, we use a call to GEO File Transfer Protocol (FTP). Finally, we report the software and packages version used for the work: ‘R’ 3.3.1, ‘R Studio’ 1.0.136, ‘Bioconductor’ 3.4, ‘GEOquery’ 2.40.0, ‘limma’ 3.30.8, ‘genefilter’ 1.56, ‘topGO 2.26’, ‘GOSemSim’ 2.0.4, ‘DOSE’ 3.0.10 and ‘clusterProfiler’ 3.2.14.

Results

Statistics and GO terms tree

Table 1 shows some statistics for the selected CD studies. The 2nd, 3rd, 4th and 6th columns are the numbers obtained from the comparison with the threshold of 0.01 for the P-values. The standard threshold for logFC is 1.00 (among the brackets the number of left differential expressed genes with a threshold of 2.00). In Table 1, it is possible to notice that the studies are not standardized. The top four have a raw filter on microarray genes, based on unused or defected probes, whereas the following three are already curated. The step for searching for the differential expression of the genes is not trivial. The scores (adjusted P-value, P-value and logFC) are strictly related to data set dimension; furthermore, logFC is dependent from the kind of data set. For example, the 3rd data set has 23 genes with |logFC| > 2 and this subset should be considered as selected genes for the analysis. Moreover, the application of the Fisher’s exact test is biased for the selection of the GO terms because of the dimension of the data set and, probably, the performed manipulations. In such circumstances, we chose the candidate genes’ subset with a trade off among all the scores; thus, the creation of a GO graph eludes the Fisher’s exact test, keeping its biological importance to define process clusters but losing in statistical power.

Table 2 shows a selection of statistics for the autoimmune pathologies.

Figure 3

Example of GO graph with GSEA on GSE72625 data set. The complete graph (on the left) and a zoom (on the right) are shown. The rectangles represent the top five GO terms after the test. The red and orange colors indicate the most significant GO terms. In this case, an inflammatory sub-graph is linked to the interferon gamma cytokine and response to drug.

Open in new tab Download slide

Figure 4

Network on common differential expressed genes between CD and its comorbidities. Color legend: gray, consolidated pathway; pink, physical interaction; light violet, co-expression; dark yellow, predicted interaction; indigo, co-localization; light blue, other pathway; light green, genetic interaction; beige, shared protein domain. Circle legend: striated circle, common genes; normal circle, genes added after enrichment. The cluster network has been performed with the online tool ‘GeneMania’ [57].

Open in new tab Download slide

Figure 5

Semantic similarity matrix for differential expressed genes (from the first five GO terms). The two-letter suffix before the GSE codes are referred to the following: AA, alopecia areata; AR, arteritis; AT, autoimmune thyroid disease; DM, dermatomyositis; PB, primary biliary cirrhosis; PN, peripheral neuropathy; RA, rheumatoid arthritis; VI, vitiligo. The number after the two letters indicates the logFC threshold.

Open in new tab Download slide

Figure 6

Semantic similarity matrix for GO terms (1st five GO terms). Codes are same as in Figure 5.

Open in new tab Download slide

Figure 7

Semantic similarity matrix for DO terms. Legend: AA, alopecia areata; ARa, cerebral arteritis; ARb, temporal arteritis; ARc, Takayasu arteritis; ARd, polyarteritis nodosa; AT, autoimmune thyroid disease; CD, coeliac disease; DM, dermatomyositis; PB, primary biliary cirrhosis; PN, autonomic peripheral neuropathy; RAa, rheumatoid arthritis; RAb, juvenile rheumatoid arthritis; VI, vitiligo.

Open in new tab Download slide

Figure 8

Enrichment analysis with KEGG pathways from differential expressed genes. KEGG pathways related to diseases are listed in the rows; data sets (with logFC threshold) are listed in the columns. The dimension of the circles shows gene prevalence in the pathway; the color of the circles shows the statistical validation, with an upper bound P-value of 0.05. CD disease data sets are inside the vertical lines. The circles indicate the KEGG pathways for CD (dark green) and for comorbidities (light green).

Open in new tab Download slide

Table 3 reports a summary of the data sets and differential expressed genes analyzed. The hierarchy of the GO terms reveals that some of them are encapsulated; thus, not all the listed GO terms are fundamental. An example of hierarchy with a zoom on the most important GO terms is shown in Figure 3.

Table 3

Open in new tab

Summary of results along the pipeline steps for the selected pathologies. The features from left to right are selected disease, main location in the body, number of data sets, number of selected data sets and number of upregulated differential expressed genes (DEG up) and downregulated differential expressed genes (DEG down)

Disease	Organ/tissue	Data set	Selected data set	DEG up	DEG down
CD	Small intestine	427	7	56^*	43^*
Alopecia areata	Skin (scalp)	7	2	21	32
Arteritis	Arteries	24	1	2	4
Autoimmune thyroid disease	Thyroid	7	1	38	40
Dermatomyositis	Skin, muscle	22	3	63	6
Primary biliary cirrhosis	Liver	38	1	1	2
Peripheral neuropathy	Nerves	75	2	18	4
Rheumatoid arthritis	Joints	159	2	21	18
Vitiligo	Skin	9	2	54	17

Disease	Organ/tissue	Data set	Selected data set	DEG up	DEG down
CD	Small intestine	427	7	56^*	43^*
Alopecia areata	Skin (scalp)	7	2	21	32
Arteritis	Arteries	24	1	2	4
Autoimmune thyroid disease	Thyroid	7	1	38	40
Dermatomyositis	Skin, muscle	22	3	63	6
Primary biliary cirrhosis	Liver	38	1	1	2
Peripheral neuropathy	Nerves	75	2	18	4
Rheumatoid arthritis	Joints	159	2	21	18
Vitiligo	Skin	9	2	54	17

^*Numbers referred to differential expressed genes with strong evidences in CD.

Table 3

Open in new tab

Summary of results along the pipeline steps for the selected pathologies. The features from left to right are selected disease, main location in the body, number of data sets, number of selected data sets and number of upregulated differential expressed genes (DEG up) and downregulated differential expressed genes (DEG down)

Disease	Organ/tissue	Data set	Selected data set	DEG up	DEG down
CD	Small intestine	427	7	56^*	43^*
Alopecia areata	Skin (scalp)	7	2	21	32
Arteritis	Arteries	24	1	2	4
Autoimmune thyroid disease	Thyroid	7	1	38	40
Dermatomyositis	Skin, muscle	22	3	63	6
Primary biliary cirrhosis	Liver	38	1	1	2
Peripheral neuropathy	Nerves	75	2	18	4
Rheumatoid arthritis	Joints	159	2	21	18
Vitiligo	Skin	9	2	54	17

Disease	Organ/tissue	Data set	Selected data set	DEG up	DEG down
CD	Small intestine	427	7	56^*	43^*
Alopecia areata	Skin (scalp)	7	2	21	32
Arteritis	Arteries	24	1	2	4
Autoimmune thyroid disease	Thyroid	7	1	38	40
Dermatomyositis	Skin, muscle	22	3	63	6
Primary biliary cirrhosis	Liver	38	1	1	2
Peripheral neuropathy	Nerves	75	2	18	4
Rheumatoid arthritis	Joints	159	2	21	18
Vitiligo	Skin	9	2	54	17

^*Numbers referred to differential expressed genes with strong evidences in CD.

Pathways

We suggest the following framework on the BP involved in each study on CD:

(i)
GSE11501: peptidyl-tyrosine phosphorylation, phosphatidylinositol 3-kinase signaling and response to endoplasmic reticulum stress;
(ii)
GSE87629: mitosis regulation, microtubule cytoskeleton organization and protein destabilization;
(iii)
GSE72625: signaling pathway and cellular response about IFNγ;
(iv)
GSE106260a: signaling pathway and cellular response about IFNγ (and immune response);
(v)
GSE106260b: leukocyte migration and cardiac conduction;
(vi)
GSE61849a: immune response and immune system development;
(vii)
GSE61849b: protein phosphorylation, apoptotic process and regulation of cell adhesion;
(viii)
GSE76168: cytokine-mediated signaling pathways;
(ix)
GSE84729a: cytokine-mediated signaling pathways and response to virus;
(x)
GSE84729b: cytokine-mediated signaling pathways and response to virus.

In summary, most of the previous pathways are directly connected to the pathology, with a particular interest in the cases with the cytokine signaling pathway and the immune response topics, i.e. inflammation and autoimmune response generated by CD, as confirmed from the candidate genes. In addition, we underline interesting functional annotations, such as response to endoplasmic reticulum stress [54], cardiac conduction [55] and response to virus [56].

The comparison of the differential expressed genes between the CD data sets and the autoimmune diseases shows that the following genes are in common: HLA-DRB1, CXCL10, IFI27, STAT1, GBP1, BST2, IFITM1, IFITM2, IFITM3, IRF1, ISG15, OAS2, CCL20, CCL5, HLA-DPA1, CTLA4, ICOS, CXCL11, GPR65, GZMA, GZMB, XCL1, MMP12, DUSP10, CXCR4, IFNG, KLRF1, RGS1, MASP1, CCR2, TLR4, CD3D, CD3G, CD8A and HAVCR2. A cluster network is generated on the list of genes taking into account co-expression, consolidated pathways, co-localization, shared protein domain, predicted and physical interactions and shown in Figure 4. The most important clusters are related to the following pathways and pathologies, with the percentage of coverage: immune system (23.53%), chemokine signaling pathway (8.70%), rheumatoid arthritis (7.54%), influenza A (6.71%), interferon alpha/beta signaling (5.58%), Biocarta CTLA4 pathway (5.29%), Cytokine Signaling in Immune System (4.52%), chemokine receptors bind chemokines (4.08%) and IL12-mediated signaling events (2.71%).

In the same way, the comparison between GO terms from CD list and autoimmune disorders list highlights the presence of terms in common:

(i)
GO:0002376: immune system process;
(ii)
GO:0009615: response to virus;
(iii)
GO:0006955: immune response;
(iv)
GO:0019221: cytokine-mediated signaling pathway;
(v)
GO:0051607: defense response to virus;
(vi)
GO:0034340: response to type I interferon;
(vii)
GO:0060337: type I interferon signaling pathway;
(viii)
GO:0071357: cellular response to type I interferon.

Semantic similarity and KEGG enrichment

Figure 5 shows the semantic similarity matrix for differential expressed genes. Over a semantic similarity value of 0.7, CD1_GSE106260a/b is connected to many selected comorbidities, in particular with dermatomyositis, arteritis and vitiligo. Considering the other strong evidences from CD1_GSE84729a and CD2_GSE106260b, peripheral neuropathy, primary biliary cirrhosis and alopecia areata should be taken into account (for the selected data sets). In addition, giant cell arteritis (AR2_GSE63425a/b) and rheumatoid arthritis (RA2_GSE17755b) are outliers, with 2 as logFC threshold.

Figure 6 shows the semantic similarity matrix for GO terms. Over a semantic similarity value of 0.7, CD1_GSE106260a, CD1_GSE84729a, CD2_GSE106260a and CD2_GSE72625 are well clustered with all dermatomyositis data sets, normal and juvenile. Moreover, it is possible to notice from the little square in the center of the matrix that CD1_GSE84729a and VI1_GSE90880 (vitiligo) data sets have a spurious similarity.

Figure 7 shows the semantic similarity matrix for DO terms. From the CD row, alopecia areata, autoimmune thyroid disease, primary biliary cirrhosis and vitiligo are the most connected pathologies (over a threshold of 0.3). In details, vitiligo has a similarity value of 0.55.

Figure 8 represents the relationships between KEGG pathways and all the selected data sets. Recurring pathways among CD data sets, with at least three evidences, are natural killer cell-mediated cytotoxicity, autoimmune thyroid disease, chemokine signaling pathway, T cell receptor signaling pathway, herpes simplex infection, influenza A, cell adhesion molecules, NOD-like receptor signaling pathway, hepatitis C, measles, cytokine–cytokine receptor interaction and prolactin signaling pathway. In addition to the previous list of pathways, recurring pathways in common between CD and the others pathologies, with at least one CD evidence, are allograft rejection, graft versus host disease, type I diabetes mellitus, malaria, Chagas’ disease, antigen processing and presentation, hematopoietic cell lineage, herpes simplex infection, human papilloma infection, hepatitis B, prion diseases, asthma and intestinal immune network for IgA production.

Discussion

The aim of this work is to evidence the effectiveness of a pipeline of bioinformatics analyses in extracting information from public databases and investigating the relationships of a complex disease and its comorbidities. We studied the state-of-the-art CD and some autoimmune pathologies, in order to select gene expression data from microarray experiments available online. GSEA is a method suitable for giving a direction in studying CD, especially in terms of involved processes, inflammation pathways, relationships with other disease and interconnection bridge among different kinds of omic data by means of different ontology, such as GO and DO. Furthermore, we remark the usefulness of the semantic similarity method that quantifies the closeness among different data sets on the basis of the selected ontology without taking into account any other statistic or parameter.

We applied the described pipeline both to CD and to some of its comorbidities. Therefore, this work is methodologically general and could be applied to the following: (1) other CD data sets, (2) other CD comorbidities and (3) other complex pathologies. The only constraint is the typology of data, downloaded by GEO repository (or by Array Express repository), that consist of gene expression sets from sick and healthy patients, even GFD-treated in CD case. The threshold of 10 samples is considered as at least five individuals with active state of the disease in comparison with at least five individuals as healthy control (or under a treatment). This cutoff is a bound on the difference of samples’ number among the pathologies, with no more than one order of magnitude for all the considered microarrays. In this way, we tried to control the overrepresentation of one data set over another. Furthermore, even if selecting an objective number of samples is a heuristic choice, very small numbers affects statistical validations, as happen often for biological samples.

Briefly, we started from the set of differential expressed genes and applied GSEA to them, in order to obtain the most important GO terms. Then, we used a semantic similarity method on genes, GO terms and DO terms, making a comparison among all the results. The pipeline in this work is implemented in R language with packages from Bioconductor repository but it can be simply rewritten in other programming languages. It is important to highlight the choice of using different logFC thresholds; in this way, we have an augmentation of the data, not only in terms of quantity (when data are few) but even in terms of quality because we can extract different sets of differential expressed genes and GO terms, improving the research of common genes and pathways, in the same disease or among several diseases. From a methodological point of view, one strong point of the pipeline is the possibility of reusing available data. There are many projects with published results, which can be grasped for further analyses, but it is not simple to work on them. In other cases, such as in many CD comorbidities, microarray data are not publicly available because of legal/ethical reasons or difficulties in making the experiments. For this reason, our pipeline could help in obtaining novel evidences from old available data. It is important to underline how the selection of the microarray data sets can influence the evidences. The proposed method is data-driven; thus, the choice of the data sets is up to the scientists. However, we started from CD data and found results strengthened by the scientific literature, showing how the pipeline highlights the most important aspects involved in the selected pathologies. Obviously, we expect that the use of more and more microarray data sets may enrich the set of evidences, without affecting the previous ones. Therefore, as shown in this work, we suggest taking into account the data sets from different sources and different cell types, in order to integrate data and get evidences that are more robust for the single disease and among several diseases. One possible study target could be the analysis of the differential expression for genes of HLA, according to the importance claimed in [58]. In detail, the role of HLA class II genes is not completely described, but their importance is well known in autoimmune diseases. The availability of suitable microarray data sets would be a good seed for our pipeline, taking into account not only the classic case of patients versus control but also different states coming from genetic variants, in order to provide a support in defining the risk of having a disease. Moreover, novel microarray data sets with the concurrency of autoimmune diseases, e.g. patients with CD and one comorbidity versus patients with CD and another comorbidity, could help in supporting our finding or suggesting a new direction of investigation.

An inner issue with the typology of data is the comparison. The standardization is not similar all over the data set, for the kind of the data (conditions, tissues, normalization) or for the lack of curated records. It is meaningful to underline that GEO series (GSE) needs more effort in preparing data for the analysis because they are not curated. A possible direction in improving the pipeline can be the inclusion of a quality control on microarray data sets. Usually, researchers upload microarray data sets after a personal quality control step, in order to provide data sets suitable for semi-automated analysis. Nevertheless, some tools are available online to perform locally the quality control. One example is ‘arrayQualityMetrics’ package from Bioconductor repository, which provides a semi-interactive report on microarray quality evaluation, for both raw and normalized data sets, and an R object, suitable for the upstream integration in a pipeline [59]. The abovementioned ‘limma’ package provides the MA-plot, based on the comparison of the signal intensities, for making decisions on normalizing data. Furthermore, CD microarray data have a high sensitivity to the used filters, and therefore, it is up to the investigator finding a good trade off among the scores, making pretty difficult the realization of a standard protocol. From this perspective, a good work is explained in [60]. Nevertheless, our approach provides a good way to analyze, compare and integrate the described data. As future work, a complete workflow could be useful to automate the process, once we have coped with the manual selection of the patients (GSM records).

The integration of data from different studies, also obtained with different technologies, represent the most promising approach for investigating in details the complex disease. Our work proposes a pipeline suitable for analyzing different data sets from pathologies and their comorbidities. Novel omics data can be a further source of information for our pipeline, providing the opportunity to create a more complete perspective of the multifactorial disorders themselves.

Bioinformatics statistical analysis of CD and its comorbidities is a thriving direction of medical research. It has two important benefits: the improved understanding of the CD that shape the network of comorbidity diseases and the introduction of new workflow of analyses and statistical estimators for complex diseases.

Finally, our work is important from the clinical bioinformatics point of view because we provide methodology for comorbidity analysis and improved visualization based on developing bits of code to harmonize existing software into an effective pipeline that could be used by trained physicians. Although many molecular biologists use bioinformatics R scripts (or similar) in the research work, the impact of bioinformatics in clinical investigation is still at the beginning. We believe that our bioinformatics approach could be useful to researchers not only in CD but also in other complex diseases.

Key Points

Comprehensive bioinformatics analysis of a complex disease and its comorbidities can help in looking for novel findings or in strengthening old ones.
CD is a complex disease that needs many other efforts to cope with, in order to find important biomarkers and help clinicians and physicians in their evaluations.
A bioinformatics approach by means of some methodologies, such as gene set enrichment analysis and semantic similarity, is suitable for these purposes.
A straightforward pipeline in R language is useful to analyze common genes, gene ontology and disease ontology terms and pathways, by reusing online available microarray data.

Acknowledgements

A.F. activity has been supported by the Flagship “InterOmics” project (PB.P05) funded by the Italian Ministry of Education, University and Research and Italian National Research Council organizations.

Eugenio Del Prete is a PhD student in Applied Biology at the University of Basilicata, Potenza, Italy. His work for this article has been performed also at the Laboratory of Bioinformatics and Computational Biology, Institute of Food Sciences, Italian National Research Council, Avellino, Italy, and at the Computer Laboratory, Department of Computer Science, University of Cambridge, UK. He has an MSc in Telecommunications Engineering. His research interests include bioinformatics, biostatistics and computational biology.

Angelo Facchiano is a senior researcher at the Bioinformatics and Computational Biology Laboratory, Institute of Food Sciences, Italian National Research Council, Avellino, Italy. He is as a degree in Chemistry with specialization in Biotechnology and PhD in Cellular Biochemistry. His research interests include biochemistry, genetics and molecular biology, bioinformatics and computational biology, pharmacology, toxicology and pharmaceutics.

Pietro Liò is a reader in Computational Biology at the Computer Laboratory, a department of Computer Science at the University of Cambridge. He has an MA from Cambridge, a PhD in complex systems and non-linear dynamics and a PhD in theoretical genetics. His research interests include biomolecular and clinical data integration, computational medicine, complex disease dynamics and comorbidities, bioengineering, ICT and social innovation technologies.

References

1.

Bergamo

P

,

Maurano

F

,

Mazzarella

G

, et al.

Immunological evaluation of the alcohol-soluble protein fraction from gluten-free grains in relation to celiac disease

.

Mol Nutr Food Res

2011

;

55

:

1266

–

70

.

2.

Nijhawan

S

,

Goyal

G

.

Celiac disease review

.

J Gastrointest Dig Syst

2015

;

5

:

350

.

Google Scholar

Crossref

WorldCat

3.

Fasano

A

.

Zonulin, regulation of tight junctions, and autoimmune diseases

.

Ann N Y Acad Sci

2012

;

1258

(

1

):

25

–

33

.

4.

Heyman

M

,

Menard

S

.

Pathways of gliadin transport in celiac disease

.

Ann N Y Acad Sci

2009

;

1165

:

274

–

8

.

5.

Sollid

LM

,

Lundin

KE

.

Diagnosis and treatment of celiac disease

.

Mucosal Immunol

2009

;

2

:

3

–

7

.

6.

Alaedini

A

,

Green

PH

.

Narrative review: coeliac disease: understanding a complex autoimmune disorder

.

Ann Intern Med

2005

;

142

(

4

):

289

–

98

.

7.

Louka

AS

,

Sollid

LM

.

HLA in coeliac disease: unravelling the complex genetics of a complex disorder

.

Tissue Antigens

2003

;

61

(

2

):

105

–

17

.

8.

Fasano

A

.

Surprises from celiac disease

.

Sci Am

2009

;

301

:

54

–

61

.

9.

Mitchell

KJ

.

What is complex about complex disorders?

Genome Biol

2012

;

13

(

1

):

237

.

10.

Lauret

E

,

Rodrigo

L

.

Celiac disease and autoimmune-associated conditions

.

Biomed Res Int

2013

;

2013

:

127589

.

11.

Capobianco

E

,

Liò

P

.

Comorbidity: a multidimensional approach

.

Trends Mol Med

2013

;

19

(

9

):

515

–

21

.

12.

Ryan

D

,

Newnham

ED

,

Prenzler

PD

, et al.

Metabolomics as a tool for diagnosis and monitoring in coeliac disease

.

Metabolomics

2015

;

11

:

980

–

90

.

Google Scholar

Crossref

WorldCat

13.

Östensson

M

,

Montén

C

,

Bacelis

J

, et al.

A possible mechanism behind autoimmune disorders discovered by genome-wide linkage and association analysis in celiac disease

.

PLoS One

2013

;

8

(

8

):

e70174

.

14.

Buddrick

O

,

Cornell

HJ

,

Small

DM

.

Reduction of toxic gliadin content of wholegrain bread by the enzyme caricain

.

Food Chem

2015

;

170

:

343

–

7

.

15.

Wilhite

SE

,

Barrett

T

.

Strategies to explore functional genomics data sets in NCBI’s GEO database

.

Methods Mol Biol

2012

;

802

:

729

–

43

.

Google Scholar

OpenURL Placeholder Text

WorldCat

16.

Kolesnikov

N

,

Hastings

E

,

Keays

M

, et al.

ArrayExpress update—simplifying data submissions

.

Nucleic Acids Res

2015

;

43

(

Database issue

):

D1113

–

6

.

17.

Heap

GA

,

Trynka

G

,

Jansen

RC

, et al.

Complex nature of SNP genotype effects on gene expression in primary human leucocytes

.

BMC Med Genomics

2009

;

2

:

1

.

18.

Garber

ME

,

Saldanha

A

,

Parker

JS

, et al.

A B-cell gene signature correlates with the extent of gluten-induced intestinal injury in celiac disease

.

Cell Mol Gastroenterol Hepatol

2017

;

4

(

1

):

1

–

17

.

19.

Jorgensen

SF

,

Reims

HM

,

Freydenlund

D

, et al.

A cross-sectional study of the prevalence of gastrointestinal symptoms and pathology in patients with common variable immunodeficiency

.

Am J Gastroenterol

2016

;

111

(

10

):

1467

–

75

.

20.

Pietz

G

,

De

R

,

Hedberg

M

, et al.

Immunopathology of childhood celiac disease—key role of intestinal epithelial cells

.

PLoS One

2017

;

12

(

9

):

e0185025

.

21.

Plaza-Izurieta

L

,

Fernandez-Jimenez

N

,

Irastorza

I

, et al.

Expression analysis in intestinal mucosa reveals complex relations among genes under the association peaks in celiac disease

.

Eur J Hum Genet

2015

;

23

(

8

):

1100

–

5

.

22.

Pascual

V

,

Medrano

LM

,

López-Palacios

N

, et al.

Different gene expression signatures in children and adults with celiac disease

.

PLoS One

2016

;

11

(

2

):

e0146276

.

23.

Garcia-Etxebarria

K

,

Jauregi-Miguel

A

,

Romero-Garmendia

I

, et al.

Ancestry-based stratified analysis of Immunochip data identifies novel associations with celiac disease

.

Eur J Hum Genet

2016

;

24

(

12

):

1831

–

4

.

24.

Xing

L

,

Dai

Z

,

Jabbari

A

, et al.

Alopecia areata is driven by cytotoxic T lymphocytes and is reversed by JAK inhibition

.

Nat Med

2014

;

20

(

9

):

1043

–

9

.

25.

Jabbari

A

,

Cerise

JE

,

Chen

JC

, et al.

Molecular signatures define alopecia areata subtypes and transcriptional biomarkers

.

EBioMedicine

2016

;

7

:

240

–

7

.

26.

Regent

A

,

Mouthon

I

. Expression Data of Vascular Smooth Muscle Cell in Patients With Suspected Giant Cell Arteritis (GCA).

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE63425

.

27.

Limbach

M

,

Saare

M

,

Tserel

L

, et al.

Epigenetic profiling in CD4+ and CD8+ T cells from Graves’ disease patients reveals changes in genes associated with T cell receptor signaling

.

J Autoimmun

2016

;

67

:

46

–

56

.

28.

Chen

YW

,

Shi

R

,

Geraci

N

, et al.

Duration of chronic inflammation alters gene expression in muscle from untreated girls with juvenile dermatomyositis

.

BMC Immunol

2008

;

9

:

43

.

29.

Greenberg

SA

,

Pinkus

JL

,

Pinkus

GS

, et al.

Interferon-alpha/beta-mediated innate immune mechanisms in dermatomyositis

.

Ann Neurol

2005

;

57

(

5

):

664

–

78

.

30.

Kanneboyina

N

. Myositis.

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5370

.

31.

Nakagawa

R

,

Muroyama

R

,

Saeki

C

, et al.

miR-425 regulates inflammatory cytokine production in CD4+ T cells via N-Ras upregulation in primary biliary cholangitis

.

J Hepatol

2017

;

66

(

6

):

1223

–

30

.

32.

Chang

KH

,

Chuang

TJ

,

Lyu

RK

, et al.

Identification of gene networks and pathways associated with Guillain–Barré syndrome

.

PLoS One

2012

;

7

(

1

):

e29506

.

33.

Luo

L

,

Zhou

WH

,

Cai

JJ

, et al.

Gene expression profiling identifies downregulation of the neurotrophin-MAPK signaling pathway in female diabetic peripheral neuropathy patients

.

J Diabetes Res

2017

;

2017

:

8103904

.

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

34.

Lee

HM

,

Sugino

H

,

Aoki

C

, et al.

Underexpression of mitochondrial-DNA encoded ATP synthesis-related genes and DNA repair genes in systemic lupus erythematosus

.

Arthritis Res Ther

2011

;

13

(

2

):

R63

.

35.

Lauwerys

BR

,

Louahed

J

,

Knoops

L

, et al. Identification of Genes Specifically Over-Expressed in Lupus CD4 T and B cells.

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE4588

.

36.

Dey-Rao

R

,

Sinha

AA

.

Vitiligo blood transcriptomics provides new insights into disease mechanisms and identifies potential novel therapeutic targets

.

BMC Genomics

2017

;

18

(

1

):

109

.

37.

Rashighi

M

,

Agarwal

P

,

Richmond

JM

, et al.

CXCL10 is critical for the progression and maintenance of depigmentation in a mouse model of vitiligo

.

Sci Transl Med

2014

;

6

(

223

):

223ra23

.

38.

Kurppa

K

,

Laitinen

A

,

Agardh

D

.

Coeliac disease in children with type 1 diabetes

.

Lancet Child Adolesc Health

2018

;

2

(

2

):

133

–

43

.

39.

Gene Ontology Consortium

.

Gene Ontology Consortium: going forward

.

Nucleic Acids Res

2015

;

43

(

Database issue

):

D1049

–

56

.

PubMed

OpenURL Placeholder Text

WorldCat

40.

Kibbe

WA

,

Arze

C

,

Felix

V

, et al.

Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data

.

Nucleic Acids Res

2015

;

43

(

Database issue

):

D1071

–

8

.

41.

Subramanian

A

,

Tamayo

P

,

Mootha

VK

, et al.

Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles

.

Proc Natl Acad Sci U S A

2005

;

102

(

43

):

15545

–

50

.

42.

Wang

JZ

,

Ali

F

,

Srimani

PK

. Efficient method to measure the semantic similarity of ontologies. In:

Wu

S

,

Yang

LT

,

Xu

TL

(eds).

Advances in Grid and Pervasive Computing, GPC 2008, Lecture Notes in Computer Science,

Vol. 5036.

Berlin, Heidelberg

:

Springer

,

2008

.

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

43.

Moni

MA

,

Liò

P

.

comoR: a software for disease comorbidity risk assessment

.

J Clin Bioinforma

2014

;

4

:

8

.

44.

Pesquita

C

,

Faria

D

,

Bastos

H

, et al.

Metrics for GO based protein semantic similarity: a systematic evaluation

.

BMC Bioinformatics

2008

;

9

(

Suppl 5

):

S4

.

45.

Kanehisa

M

,

Sato

Y

,

Kawashima

M

, et al.

KEGG as a reference resource for gene and protein annotation

.

Nucleic Acids Res

2016

;

44

:

D457

–

62

.

46.

Huber

W

,

Carey

VJ

,

Gentleman

R

, et al.

Orchestrating high-throughput genomic analysis with Bioconductor

.

Nat Methods

2015

;

12

(

2

):

115

–

21

.

47.

Davis

S

,

Meltzer

P

.

GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor

.

Bioinformatics

2007

;

23

(

14

):

1846

–

7

.

48.

Ritchie

ME

,

Phipson

B

,

Wu

D

, et al.

limma powers differential expression analyses for RNA-sequencing and microarray studies

.

Nucleic Acids Res

2015

;

43

(

7

):

e47

.

49.

Gentleman

R

,

Carey

V

,

Huber

W

,

Hahne

F

. genefilter: methods for filtering genes from high-throughput experiments.

R package version 1.60.0

,

2017

.

50.

Alexa

A

,

Rahnenfuhrer

J

topGO: enrichment analysis for gene ontology.

R package version 2.30.0

,

2016

.

51.

Yu

G

,

Li

F

,

Qin

Y

, et al.

GOSemSim: an R package for measuring semantic similarity among GO terms and gene products

.

Bioinformatics

2010

;

26

(

7

):

976

–

8

.

52.

Yu

G

,

Wang

LG

,

Yan

GR

, et al.

DOSE: an R/Bioconductor package for disease ontology semantic and enrichment analysis

.

Bioinformatics

2015

;

31

(

4

):

608

–

9

.

53.

Yu

G

,

Wang

LG

,

Han

Y

, et al.

clusterProfiler: an R package for comparing biological themes among gene clusters

.

OMICS

2012

;

16

(

5

):

284

–

7

.

54.

Marrè

ML

,

James

EA

,

Piganelli

JD

.

β cell ER stress and the implications for immunogenicity in type 1 diabetes

.

Front Cell Dev Biol

2015

;

3

:

67

.

55.

Demirtas

K

,

Yayla

C

,

Yüksel

M

, et al.

Tp-e interval and Tp-e/QT ratio in patients with celiac disease

.

Rev Clin Esp

2017

;

217

(

8

):

439

–

45

.

56.

Bouziat

R

,

Hinterleitner

R

,

Brown

JJ

, et al.

Reovirus infection triggers inflammatory responses to dietary antigens and development of celiac disease

.

Science

2017

;

356

(

6333

):

44

–

50

.

57.

Warde-Farley

D

,

Donaldson

SL

,

Comes

O

, et al.

The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function

.

Nucleic Acids Res

2010

;

38

(

Web Server issue

):

W214

–

20

.

58.

Gianfrani

C

,

Pisapia

L

,

Picascia

S

, et al.

Expression level of risk genes of MHC class II is a susceptibility factor for autoimmunity: new insights

.

J Autoimmun

2018

;

89

:

1

–

10

.

59.

Kauffmann

A

,

Gentleman

R

,

Huber

W

.

arrayQualityMetrics—a bioconductor package for quality assessment of microarray data

.

Bioinformatics

2009

;

25

(

3

):

415

–

6

.

60.

Quinn

EM

,

Coleman

C

,

Molloy

B

, et al.

Transcriptome analysis of CD4+ T cells in coeliac disease reveals imprint of BACH2 and IFNγ regulation

.

PLoS One

2015

;

10

(

10

):

e0140049

.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Download all slides

Month:	Total Views:
November 2018	49
December 2018	29
January 2019	22
February 2019	11
March 2019	11
April 2019	16
May 2019	9
June 2019	7
July 2019	3
August 2019	3
September 2019	12
October 2019	20
November 2019	9
December 2019	10
January 2020	5
February 2020	26
March 2020	20
April 2020	36
May 2020	3
June 2020	9
July 2020	6
August 2020	7
September 2020	3
October 2020	3
November 2020	1
December 2020	5
January 2021	5
February 2021	7
April 2021	13
May 2021	32
June 2021	13
July 2021	48
August 2021	20
September 2021	3
October 2021	6
November 2021	8
December 2021	10
January 2022	17
February 2022	28
March 2022	24
April 2022	34
May 2022	19
June 2022	27
July 2022	42
August 2022	37
September 2022	38
October 2022	77
November 2022	23
December 2022	6
January 2023	24
February 2023	20
March 2023	33
April 2023	31
May 2023	18
June 2023	7
July 2023	15
August 2023	22
September 2023	16
October 2023	21
November 2023	21
December 2023	43
January 2024	24
February 2024	21
March 2024	23
April 2024	13
May 2024	18
June 2024	18
July 2024	22
August 2024	8
September 2024	18
October 2024	22
November 2024	21
December 2024	20
January 2025	11
February 2025	7
March 2025	40
April 2025	25

Article Contents

Bioinformatics methodologies for coeliac disease and its comorbidities

Abstract

Introduction

Methods

Survey of available data

Ontologies

Gene set enrichment analysis and semantic similarity

Designing of pipeline

Results

Statistics and GO terms tree

Pathways

Semantic similarity and KEGG enrichment

Discussion

Acknowledgements

References

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Article Contents

Bioinformatics methodologies for coeliac disease and its comorbidities

Abstract

Introduction

Methods

Survey of available data

Ontologies

Gene set enrichment analysis and semantic similarity

Designing of pipeline

Results

Statistics and GO terms tree

Pathways

Semantic similarity and KEGG enrichment

Discussion

Acknowledgements

References

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only