Anup Kumar Halder, Abhishek Agarwal, Karolina Jodkowska, Dariusz Plewczynski, A systematic analyses of different bioinformatics pipelines for genomic data and its impact on deep learning models for chromatin loop prediction, Briefings in Functional Genomics, Volume 23, Issue 5, September 2024, Pages 538–548, https://doi.org/10.1093/bfgp/elae009
Abstract
Genomic data analysis has witnessed a surge in complexity and volume, primarily driven by the advent of high-throughput technologies. In particular, studying chromatin loops and structures has become pivotal in understanding gene regulation and genome organization. This systematic investigation explores the realm of specialized bioinformatics pipelines designed specifically for the analysis of chromatin loops and structures. Our investigation incorporates two protein (CTCF and Cohesin) factor-specific loop interaction datasets from six distinct pipelines, amassing a comprehensive collection of 36 diverse datasets. Through a meticulous review of existing literature, we offer a holistic perspective on the methodologies, tools and algorithms underpinning the analysis of this multifaceted genomic feature. We illuminate the vast array of approaches deployed, encompassing pivotal aspects such as data preparation pipeline, preprocessing, statistical features and modelling techniques. Beyond this, we rigorously assess the strengths and limitations inherent in these bioinformatics pipelines, shedding light on the interplay between data quality and the performance of deep learning models, ultimately advancing our comprehension of genomic intricacies.
INTRODUCTION
The field of genomics has experienced a significant shift, which is marked by the growing intricacy and quantity of data. The rapid development in high-throughput technology has played a key role in demonstrating that chromatin exhibits a precisely arranged configuration within the eukaryotic nucleus. This organisation is tightly controlled and influenced by factors such as the cell cycle stage, environmental signals, and pathological states [1–5]. The 3D configuration of chromatin is crucial for various fundamental biological functions, such as controlling gene expression and DNA replication [6–9]. Hence, it is crucial to conduct a thorough examination of chromatin organisation in order to understand better the impact of chromatin organization and looping occurrences on cell-specific biological functions [10–12]. So far, Hi-C has been the primary method used to identify chromatin contacts across the whole genome [13, 14]. Recent methods, such as Chromatin Interaction Analysis with Paired-End Tag sequencing (ChIA-PET) and HiChIP, have been developed to capture genome-wide highly specific chromatin interactions that are facilitated by specific proteins [15–18].
Recognising the expanding importance of chromatin loops and structures, this systematic review deeply explores the specialised domain of bioinformatics pipelines specifically intended for their investigation. Our research distinguishes itself through the examination of chromatin loops facilitated by two critical protein factors, CTCF and cohesin, drawing from three different data sources. Consequently, we consider six datasets in our analysis, processed by six independent bioinformatics approaches: ChIA-PET biased pipeline (ChIA-PIPE) [19], ChIA-PET Tool V3 [20], FitHiChIP [21], MAPS [22], Juicer [14, 23] and HiC-Pro [24]. This methodology generates a rich and diverse dataset of 36 variations, highlighting an extensive collection of computational strategies and methods applied in the study of genome architecture.
Our exploration extends beyond the mere enumeration of these datasets; we engage in an exhaustive review of the existing literature, casting light on the methodologies, tools and overall pipeline underpinning chromatin loops and structures analysis. This review spans the entire spectrum of approaches in this domain, covering essential aspects such as data preparation pipelines, preprocessing techniques, statistical features and advanced modelling methodologies. By scrutinizing the intricacies of each approach, we aim to provide a holistic perspective that outlines their diversity and emphasizes their individual contributions to the field.
Moreover, our investigation takes a significant step forward by critically assessing the strengths and limitations inherent in these bioinformatics pipelines. We delve into the interplay between data quality and the performance of deep learning models, recognizing the pivotal role played by data integrity in pursuing accurate and biologically relevant results. In doing so, our systematic review contributes to the ongoing efforts to enhance our understanding of genomics intricacies. It paves the way for further advancements in chromatin loop and structure analysis.
CHROMATIN INTERACTION
The emergence of chromosome conformation capture (3C) technology marked a transformative leap in our ability to deduce higher-order chromosome folding [25]. By pinpointing spatial proximity between distant genomic sequences, 3C provided a holistic understanding of genome topology. As we delve into the intricate realm of 3D chromatin organization, architectural protein factors, such as CTCF and cohesin, emerge as key players orchestrating the formation of topologically associated domains (TADs) and chromatin contact domains (CCDs) [16]. These architectural proteins play a pivotal role in delineating the spatial organization of the genome, contributing to the establishment of compartments and the intricacies of chromatin loops. The interplay between these structural elements shapes the three-dimensional (3D) landscape of the genome, unveiling a dynamic network of interactions that govern gene regulation and cellular function.
With the escalation in sequencing throughput, a global assessment of chromatin interactions within a population has become feasible through diverse methods such as 3C [25], 4C [26], 5C [27] and Hi-C [28] representing one-to-one, one-to-all, many-to-many and all-to-all type interactions, respectively. This involves sequencing all 3C ligation products or a selectively chosen subset [13, 14, 29]. The heightened sequencing capabilities not only empower comprehensive insights into chromatin architecture but also offer a nuanced exploration of intricate genomic interactions. In the following section, several loop-calling pipelines are discussed with data processing strategies.
LOOP CALLING PIPELINES
To record genome-wide chromatin interactions at high throughput and gain important insights into 3D chromatin interactions, various experimental procedures have been developed. The rapid evolution of experimental technologies is a driving force behind advancing bioinformatics tools for loop-calling methods for 3C-based experimental data. These experimental approaches can be categorized into three groups, each associated with biases originating from distinct initial data derived from three different experiments: ChIA-PET biased pipeline, Hi-C Biased Pipeline and HiChIP biased pipeline. These tools offer diverse functionalities for the analysis and processing of Hi-C and chromatin interaction data. Table 1 provides a succinct description of these pipelines, including their respective repository details.
ChIA-PIPE [19] is a Python-based pipeline for joint analysis of ChIP-seq and Hi-C data. Mango [30] facilitates Hi-C data visualization and exploration with interactive 3D views implemented in JavaScript and Python. ChIA-PET2 [31] analyzes Chromatin Interaction Analysis by Paired-End Tag Sequencing (ChIA-PET) data, implemented in Perl. ChIA-PET Tool V3 [20], implemented in Java, Perl and R, is a suite for analyzing ChIA-PET data. cLoops [32] is a versatile tool employed for calling loops in various chromatin interaction datasets, including ChIA-PET, HiChIP and high-resolution Hi-C data. cLoops2 [33] is designed with essential features such as peak-calling and loop-calling, accompanied by supplementary modules for interaction resolution, data similarity, feature quantification, aggregation analysis and visualization in 3D chromatin interaction datasets. ChIAPoP [34] employs a zero-truncated Poisson distribution and corrects for genomic distance and sequence bias; by incorporating distinct background distributions, it is capable of discerning both interchromosomal and intrachromosomal loops.
HiC-Pro [24], implemented in Python, is a comprehensive pipeline covering mapping, normalization and quality control. Juicer is a versatile Hi-C suite implemented in Java and Python, offering preprocessing, normalization and interaction calling [14, 23]. The hicpipe [43] is a Python-based Hi-C data processing pipeline, while hiclib [36] is a Python library for mapping, filtering, and chromatin interaction analysis. HOMER [44], implemented in Perl and C++, offers motif discovery, functional annotation, and Hi-C data analysis. HiFive, implemented in Python and Cython, features functionalities for contact matrix normalization, TAD calling and differential analysis [41]. HiCdat [38], an R package, aids in visualizing and analyzing Hi-C data, exploring chromatin interactions and structure. HiTC, an R package, provides statistical tools and interactive visualization for Hi-C data analysis [37]. HiC-inspector is an interactive tool for exploring and visualizing Hi-C data, implemented in JavaScript [42]. HiC-bench [39], a benchmarking framework, is implemented in Perl and R. HiCUP [45] efficiently handles Hi-C data processing with a focus on mapping and quality control, implemented in Perl. TADbit [40], a Python library, is designed for analyzing TADs and chromatin interactions. Finally, MUSTACHE [35], a method for detecting loops based on local enrichment, relies on the surrounding pixels of the contact map, and the fundamental concept involves representing image data at multiple scales through the utilization of a Gaussian kernel.
FitHiChIP [21] accurately calls peaks and performs statistical analysis on significant chromatin interactions from HiChIP data, implemented in R. MAPS [22] is a computational method implemented in R and Perl for identifying TADs and chromatin interactions from Hi-C data. Lastly, hichipper [46] is a Python-based HiChIP data processing pipeline with tools for mapping, filtering and interaction calling.
In this comprehensive review, we utilized only six distinct pipelines for loop-calling methodologies on 3C-based experimental data, selecting two from each group. This approach aims to elucidate the biases mitigated by these pipelines and provides insights into their functionality within the performance evaluation of deep learning models. The pipelines employed for dataset processing include the ChIA-PET biased pipeline (ChIA-PIPE [19] and ChIA-PET Tool V3 [20]), the Hi-C biased pipeline (Juicer [14, 23] and HiC-Pro [24]) and the HiChIP biased pipeline (FitHiChIP [21] and MAPS [22]). In the following section, we will delve into the details of these six pipelines and thoroughly explore the data processing strategies employed. Figure 1 demonstrates the hierarchical categorization of these pipelines considering CTCF and Cohesin protein factors.

Figure 1. Hierarchical categorization of bioinformatics pipelines with two protein factors, CTCF and Cohesin, considering different experiment types and sources.
ChIA-PET biased pipeline
ChIA-PIPE stands as a groundbreaking achievement in the realm of chromatin interaction data analysis and visualization, offering a fully automated pipeline designed for comprehensive ChIA-PET studies [19]. This pipeline streamlines the entire process, from data processing to analysis and visualization, providing a seamless and efficient solution. ChIA-PIPE sets itself apart by assimilating distinctive functionalities not found in earlier tools like ChIA-PET tool [47], Mango [30] and ChIA-PET2 [31]. It introduces a binding peak-calling module with heightened sensitivity and specificity, comprehensive support for 2D contact map visualization tools like HiGlass [48] and Juicebox [49], and a novel genome browser called BASIC Browser, optimized for high-resolution exploration of peaks, loops and strand-specific RNA sequencing. In addition, ChIA-PIPE facilitates downstream structural interpretation analysis, including the calling and visualization of CCDs, annotation of enhancer–promoter (E-P) interactions, and the identification of haplotype-specific loops and peaks. With its fully automated approach and advanced features, ChIA-PIPE emerges as an invaluable tool for researchers engaged in ChIA-PET data analysis, promising efficiency, accuracy, and accessibility in understanding complex chromatin interactions.
In this assessment, after merging all the sample replicates, the data were processed through the ChIA-PIPE pipeline with the default parameters, including the Linker Sequence (GTTGGATAAG) and Peak-calling Algorithm (MACS2) [50]. As a result of this pipeline, a high-resolution 2D contact matrix in '.hic' file format was generated, along with annotated chromatin loops and their corresponding binding peak overlaps.
ChIA-PET Tool V3
The ChIA-PET Tool V3 [20] is a sophisticated tool for processing both short-read and long-read ChIA-PET data from paired-end reads. It excels in the generation of enriched binding peaks and the precise identification of chromatin interactions linked to proteins of interest. The tool’s strength lies in its robust error handling and quality assessment mechanisms, which encompass the creation of multiple log files and comprehensive statistics. This ensures meticulous traceability and confidence in the integrity of the processed data.
In the ChIA-PET Tool pipeline (version 3), flanking sequences after linker trimming were aligned to the human reference genome (hg38) using BWA-MEM (version 0.7.7) and only paired-end tags (PETs) that uniquely mapped (with mapping qualities
Juicer
Juicer is a software tool commonly used for analyzing the 3D structure of chromatin in a cell’s nucleus [14, 23]. This pipeline plays a key role in studying the genome organization, particularly how different regions of DNA are folded and interact with each other. Juicer works by processing high-throughput sequencing data, often generated through techniques like Hi-C. One of the key analyses performed by Juicer is loop calling. It identifies loops or specific interactions between different genomic regions. Loops can be indicative of functional elements in the genome, such as E-P interactions.
Juicer implements a normalization method known as Knight-Ruiz (KR) balancing to correct biases in Hi-C data [51]. This includes correction for GC content, mappability, and other factors. The normalization step aims to remove systematic biases, enhancing the accuracy of interaction matrices and subsequent loop calling. In this analytical study, all sample data underwent processing through Juicer [14, 52] and alignment against the hg38 reference genome. For subsequent analyses, all contact matrices were subject to KR-normalization using Juicer. Loops were identified by HiCCUPS [14, 52] within the .hic files, and loop calling was performed at a resolution of 5 kb, with subsequent merging in accordance with the procedure outlined in [14]. Loops with robust signals were delineated as those meeting the criteria of FDR
HiC-Pro
HiC-Pro is a versatile and efficient pipeline for the processing of Hi-C data [24]. It distinguishes itself by its proficiency in handling high-resolution datasets and by providing an optimized format for sharing contact maps. HiC-Pro applies rigorous quality controls and covers the full processing of Hi-C data, from raw sequencing reads to the generation of normalized, ready-to-use genome-wide contact maps. It accommodates data from both restriction enzyme and nuclease digestion protocols. The intra- and inter-chromosomal contact maps produced by HiC-Pro closely resemble those generated by the hiclib package. The pipeline incorporates an optimized iterative correction algorithm, which significantly accelerates the normalization of Hi-C data.
HiC-Pro [24] normalizes interaction matrices through iterative correction, mitigating biases from library preparation and sequencing to enhance loop calling accuracy. It incorporates loop calling via the Directionality Index method and clustering, identifying loops through interaction directionality. Reads are aligned to GRCh38, with duplicates removed and reads allocated to MboI fragments, filtering for valid interactions and generating binned matrices. Loops are called using HiCCUPS with default settings, refining loop detection.
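The idea behind iterative correction can be illustrated with a minimal sketch. The function name `iterative_correction`, the toy 3 x 3 matrix and the fixed iteration count are assumptions made for illustration; this is a plain-Python rendering of the general row-sum balancing idea, not HiC-Pro's optimized implementation.

```python
def iterative_correction(matrix, n_iter=50):
    """Balance a symmetric contact matrix so that all rows have equal sums.

    Hypothetical helper for illustration only; real pipelines work on
    sparse genome-wide matrices and stop on a convergence criterion.
    """
    n = len(matrix)
    m = [row[:] for row in matrix]   # work on a copy of the input
    bias = [1.0] * n                 # accumulated per-bin bias estimate
    for _ in range(n_iter):
        row_sums = [sum(row) for row in m]
        nonzero = [s for s in row_sums if s > 0]
        mean = sum(nonzero) / len(nonzero)
        # Correction factor per bin; empty bins are left untouched.
        factor = [s / mean if s > 0 else 1.0 for s in row_sums]
        for i in range(n):
            for j in range(n):
                m[i][j] /= factor[i] * factor[j]
        bias = [b * f for b, f in zip(bias, factor)]
    return m, bias


# Toy contact matrix with uneven coverage per bin (made-up numbers).
toy = [[10.0, 4.0, 2.0],
       [4.0, 8.0, 3.0],
       [2.0, 3.0, 6.0]]
balanced, bias = iterative_correction(toy)
print([round(sum(row), 4) for row in balanced])
```

After balancing, every bin has the same total number of contacts, so differences between cells reflect relative contact frequency rather than per-bin coverage.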
FitHiChIP
FitHiChIP [21] efficiently analyzes HiChIP data to detect chromatin interactions at protein binding sites, modelling read dependencies for significant statistical insights. Its data-driven methodology adeptly manages variable genomic coverage, identifying differential interactions across conditions to reveal chromatin architecture dynamics. With its user-friendly interface and efficient algorithms, FitHiChIP contributes to advancing our understanding of the intricate relationship between chromatin structure and transcriptional regulation.
HiC-Pro [24] processes raw FASTQ files to generate contact maps and validates ligation products, mapping them to the hg38 reference genome. Loop calling is then conducted in FitHiChIP [21] at a 5-kb resolution, ensuring uniform loop sizes. To align with FitHiChIP’s 1D peak prerequisites, MACS2 (version 2.2.9.1) [50] is utilized for peak calling in matching tissue conditions.
MAPS
MAPS [22], a computational approach named Model-based Analysis of PLAC-seq and HiChIP, has been devised to process experimental data and discern long-range chromatin interactions. Employing a zero-truncated Poisson regression framework [53], MAPS effectively removes systematic biases inherent in PLAC-seq and HiChIP datasets. Following this correction, the method utilizes normalized chromatin contact frequencies to pinpoint significant chromatin interactions anchored at genomic regions bound by the protein of interest. Significantly, MAPS exhibits superior performance when compared to existing software tools in the analysis of chromatin interactions across diverse PLAC-seq and HiChIP datasets associated with various transcription factors and histone marks.
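To make the zero-truncated Poisson idea concrete, the following sketch computes its probability mass function and an upper-tail probability for an observed contact count. The function names and the example values are hypothetical; in MAPS the expected count would come from a fitted regression on bias covariates, whereas here `lam` is simply assumed to be given.

```python
import math


def ztp_pmf(k, lam):
    """P(X = k | X > 0) for X ~ Poisson(lam); zero for k < 1."""
    if k < 1:
        return 0.0
    # Poisson log-probability, computed in log space for stability.
    log_p = k * math.log(lam) - lam - math.lgamma(k + 1)
    # Renormalize by the probability of a non-zero count.
    return math.exp(log_p) / (1.0 - math.exp(-lam))


def ztp_upper_tail(k_obs, lam):
    """P(X >= k_obs | X > 0): chance of at least k_obs contacts."""
    if k_obs <= 1:
        return 1.0
    below = sum(ztp_pmf(k, lam) for k in range(1, k_obs))
    return max(0.0, 1.0 - below)


# Illustrative values only: expected count 2.5, observed count 9.
p_value = ztp_upper_tail(9, 2.5)
print(f"upper-tail probability: {p_value:.3g}")
```

The truncation matters because bin pairs with zero reads are never observed in the contact list, so an ordinary Poisson model would misjudge how likely small counts are.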
The model-based analysis of long-range chromatin interactions is applied to call significant chromatin interactions at a resolution of 5 kb. In this pipeline, 'bwa-mem' [54] was used to map the two ends of each paired-end read to the hg38 reference genome. Subsequently, the reads that mapped validly were retained, and PCR duplicates were eliminated using 'samtools rmdup'. Following this, intra-chromosomal reads were segregated into two categories: short- (
In this study, we investigate how the chromatin loops produced by different loop-calling algorithms vary, in order to assess dataset variability across the bioinformatics pipelines listed above. In particular, we examine whether the loop sets produced by these pipelines are biased towards a specific type of experimental data, for instance Hi-C biased (HiCCUPS, HiC-Pro), HiChIP biased (FitHiChIP, MAPS) or ChIA-PET biased (ChIA-PIPE, ChIA-PET V3). The goal is to understand and quantify this variability within the sets of significant interactions that serve as input for the deep learning model.
Pipelines within the same category differ significantly. For example, ChIA-PIPE utilizes a statistical model based on the non-central hypergeometric distribution [55], incorporating genomic distance, whereas ChIA-PET V3 employs a hypergeometric distribution for interaction
Dataset variability across different pipelines and sources
In this study, we performed a comprehensive analysis of the dataset, classifying chromatin interaction data according to protein factor, experiment source and the specific pipeline employed. The loop data were collected and evaluated using these pipelines, focusing on the two protein factors, CTCF and Cohesin. In this assessment, we considered two publicly available cell lines, GM12878 and HG00731, both representing B-lymphocyte cell types. The GM12878 datasets were acquired through two experimental techniques, ChIA-PET [29] and HiChIP [59], whereas the HG00731 datasets were obtained solely via the HiChIP method. To exemplify the variability of the datasets, the experimental data for each of the two proteins of interest were collected from three different sources: CTCF data are sourced from Mumbach [59], LGFS (in-house) and 4DN [60], while Cohesin data are obtained from Mumbach, LGFS (in-house) and ENCODE [61]. Consequently, this assessment comprised a total of 18 (6 pipelines

Correlation (Pearson) between the datasets of Chromatin Interaction Loops among
In-depth statistical information and dataset variability presented with a hierarchical arrangement of categorization factors
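Protein factor | Experiment | Data source | Cell line | Applied pipeline | # Chromatin regions | # Interactions |
---|---|---|---|---|---|---|
CTCF | HiChIP | Mumbach | GM12878 | P1_ChIA-PIPE | 459 341 | 236 996 |
CTCF | HiChIP | Mumbach | GM12878 | P2_ChIA-PET-V3 | 519 079 | 5 015 273 |
CTCF | HiChIP | Mumbach | GM12878 | P3_FitHiChIP | 8662 | 6928 |
CTCF | HiChIP | Mumbach | GM12878 | P4_MAPS | 110 461 | 178 535 |
CTCF | HiChIP | Mumbach | GM12878 | P5_Juicer | 64 696 | 59 751 |
CTCF | HiChIP | Mumbach | GM12878 | P6_HiC-Pro | 26 050 | 92 948 |
CTCF | HiChIP | LGFS | HG00731 | P1_ChIA-PIPE | 90 178 | 45 881 |
CTCF | HiChIP | LGFS | HG00731 | P2_ChIA-PET-V3 | 547 229 | 11 860 210 |
CTCF | HiChIP | LGFS | HG00731 | P3_FitHiChIP | 54 215 | 82 132 |
CTCF | HiChIP | LGFS | HG00731 | P4_MAPS | 26 212 | 28 013 |
CTCF | HiChIP | LGFS | HG00731 | P5_Juicer | 35 813 | 26 590 |
CTCF | HiChIP | LGFS | HG00731 | P6_HiC-Pro | 32 690 | 210 336 |
CTCF | ChIA-PET | 4DN | GM12878 | P1_ChIA-PIPE | 203 475 | 102 680 |
CTCF | ChIA-PET | 4DN | GM12878 | P2_ChIA-PET-V3 | 70 973 | 328 562 |
CTCF | ChIA-PET | 4DN | GM12878 | P3_FitHiChIP | 29 348 | 30 528 |
CTCF | ChIA-PET | 4DN | GM12878 | P4_MAPS | 96 670 | 135 921 |
CTCF | ChIA-PET | 4DN | GM12878 | P5_Juicer | 37 421 | 23 639 |
CTCF | ChIA-PET | 4DN | GM12878 | P6_HiC-Pro | 33 894 | 183 934 |
Cohesin | HiChIP | Mumbach | GM12878 | P1_ChIA-PIPE | 184 111 | 92 248 |
Cohesin | HiChIP | Mumbach | GM12878 | P2_ChIA-PET-V3 | 249 566 | 5 339 103 |
Cohesin | HiChIP | Mumbach | GM12878 | P3_FitHiChIP | 66 223 | 89 784 |
Cohesin | HiChIP | Mumbach | GM12878 | P4_MAPS | 32 519 | 36 536 |
Cohesin | HiChIP | Mumbach | GM12878 | P5_Juicer | 29 732 | 19 478 |
Cohesin | HiChIP | Mumbach | GM12878 | P6_HiC-Pro | 40 524 | 1 048 575 |
Cohesin | HiChIP | LGFS | HG00731 | P1_ChIA-PIPE | 299 871 | 151 404 |
Cohesin | HiChIP | LGFS | HG00731 | P2_ChIA-PET-V3 | 528 015 | 10 825 471 |
Cohesin | HiChIP | LGFS | HG00731 | P3_FitHiChIP | 185 931 | 827 288 |
Cohesin | HiChIP | LGFS | HG00731 | P4_MAPS | 61 352 | 80 488 |
Cohesin | HiChIP | LGFS | HG00731 | P5_Juicer | 51 253 | 43 148 |
Cohesin | HiChIP | LGFS | HG00731 | P6_HiC-Pro | 236 854 | 11 053 952 |
Cohesin | ChIA-PET | ENCODE | GM12878 | P1_ChIA-PIPE | 48 487 | 69 955 |
Cohesin | ChIA-PET | ENCODE | GM12878 | P2_ChIA-PET-V3 | 80 995 | 272 053 |
Cohesin | ChIA-PET | ENCODE | GM12878 | P3_FitHiChIP | 133 000 | 276 277 |
Cohesin | ChIA-PET | ENCODE | GM12878 | P4_MAPS | 8803 | 7205 |
Cohesin | ChIA-PET | ENCODE | GM12878 | P5_Juicer | 15 415 | 8767 |
Cohesin | ChIA-PET | ENCODE | GM12878 | P6_HiC-Pro | 121 776 | 3 553 840 |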
Even with an identical experiment, data source and cell line, we observe differences in the number of chromatin loops. For example, for GM12878 CTCF from Mumbach et al., the interaction coverage differs significantly across the six pipelines. These differences arise for several reasons: each pipeline uses a different statistical model for predicting chromatin loops; some pipelines use ChIP-seq data from other experimental samples to locate loop anchors; the distance cutoff used to distinguish intra-ligation from inter-ligation differs; and different clustering algorithms are used to identify loop anchors and PETs.
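The size of this spread can be checked directly from the table above. The short sketch below uses the interaction counts reported for CTCF HiChIP (Mumbach, GM12878); the variable names are illustrative only.

```python
# Interaction counts for CTCF HiChIP (Mumbach, GM12878), copied from
# the dataset table in this section.
interactions = {
    "ChIA-PIPE": 236_996,
    "ChIA-PET-V3": 5_015_273,
    "FitHiChIP": 6_928,
    "MAPS": 178_535,
    "Juicer": 59_751,
    "HiC-Pro": 92_948,
}

largest = max(interactions, key=interactions.get)
smallest = min(interactions, key=interactions.get)
fold = interactions[largest] / interactions[smallest]

print(f"{largest} reports about {fold:.0f} times as many "
      f"interactions as {smallest} on the same input data")
```

A difference of more than two orders of magnitude on the same input is the kind of variability that a downstream model inherits when trained on one pipeline's output.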
Contact map resolution is another important factor for loop-calling methods: for a given pipeline, it may strongly affect the number of loops called across datasets. In our study, FitHiChIP, MAPS, HiCCUPS and HiC-Pro use fixed-size genomic bins (5 kb), whereas the ChIA-PET based pipelines generate contact matrix files at multiple resolutions. Also, Hi-C anchors typically range from 5 to 100 kb in length, with rare cases involving extremely deep sequencing resulting in anchors as small as 1 kb [63, 64]. HiChIP and ChIA-PET anchors, on the other hand, typically span several kilobase pairs [47]. The difference in resolution may introduce considerable noise into training datasets; consequently, we opted to directly utilize the chromatin interaction anchors.
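A small sketch shows why fixed-size binning changes loop counts. The helper names, the midpoint convention and the example coordinates are assumptions for illustration; individual pipelines may assign anchors to bins differently.

```python
BIN_SIZE = 5_000  # 5 kb, the fixed bin width mentioned in the text


def to_bin(chrom, start, end, bin_size=BIN_SIZE):
    """Map an anchor interval to the fixed-size bin holding its midpoint."""
    mid = (start + end) // 2
    bin_start = (mid // bin_size) * bin_size
    return (chrom, bin_start, bin_start + bin_size)


def bin_loop(anchor_a, anchor_b, bin_size=BIN_SIZE):
    """Represent a loop as a pair of binned anchors."""
    return (to_bin(*anchor_a, bin_size), to_bin(*anchor_b, bin_size))


# Two hypothetical loops with distinct, variable-width anchors.
loop_1 = (("chr1", 12_300, 13_900), ("chr1", 251_000, 252_400))
loop_2 = (("chr1", 10_200, 11_000), ("chr1", 253_100, 254_800))

# After binning, both loops fall into the same pair of 5 kb bins.
print(bin_loop(*loop_1) == bin_loop(*loop_2))
```

Loops that are distinct at anchor level can merge into a single bin pair, which is one reason counts from bin-based and anchor-based outputs are not directly comparable.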
In silico chromatin loop prediction approaches
A thorough investigation into the DNA sequence and epigenomic features, encompassing elements such as transcription factor binding sites, chromatin accessibility and histone modification, greatly aids in pinpointing active cis-regulatory elements (CREs) [65, 66]. Numerous computational approaches have surfaced for forecasting chromatin interactions within CREs, taking into consideration factors such as evolutionary traits [67] and the unique arrangement of nucleosomes [68]. Recent investigations propose that valuable information for predicting 3D chromatin architecture can be derived from both DNA sequence and 1D epigenomic modifications. In response to these insights, researchers have applied machine learning (ML) and polymer physics simulation methods to anticipate 3D chromatin organizations [69]. We have presented a brief overview of these approaches in Table 3.
Method | Applied models/architecture | Description |
---|---|---|
CLNN-loop [70] | Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) | Predicts chromatin loops across various cell lines and CTCF-binding site (CBS) pair types by integrating multiple sequence-based features. |
DeepLUCIA [71] | CNN | Predicts chromatin loops in tissues utilizing epigenomic data. |
Block Copolymer Model [72] | Polymer Physics Simulation | Explains TADs formation and dynamics using the epigenomic landscape. |
Peakachu [73] | Random Forest | Predicts chromatin loops from contact maps, identifying unique interactions across species, supported by 56 Hi-C datasets. |
Pierro et al. [74] | Histone Modification Analysis | Shows that histone modification data can predict chromatin arrangement. |
Akita [75] | CNN | Predicts genome folding from DNA sequence, emphasizing CTCF binding sites orientation for in silico analysis. |
DeepC [76] | Transfer Learning | Predicts genome folding from DNA sequences, identifying domain boundaries and impacts of genetic variations. |
Sefer and Kingsford [77] | Histone Marker Analysis | Reveals that incorporating sequence features significantly enhances chromatin organization prediction. |
CTCF-MP [78] | Word2vec and Boosted Trees | Leverages CTCF ChIP-seq and DNase-seq data to predict chromatin loops. |
Lollipop [79] | Random Forest | Distinguishes CTCF-mediated loops from noninteracting ones using genomic and epigenomic data. |
DeepMILO [80] | CNN and RNN | Models CTCF-mediated insulator loops and predicts effects of noncoding variants using DNA sequences. |
3DEpiLoop [81] | Random Forest | Predicts 3D chromatin interaction in TADs at 1 kb resolution using epigenomic profiles. |
DeepTACT [82] | Deep Learning | Integrates genome sequence and chromatin data to predict enhancer-promoter interactions. |
DNABERT [83] | Transformer-based NLP | Classifies, annotates and analyses DNA sequences, capturing complex patterns for genomics research. |
DHL algorithm [84] | Transformer and ML Algorithms | Combines DNABERT with RF, SVM and KNN for sophisticated chromatin loop prediction. |
ccLoopER [85] | DNA Language Model | Utilizes a DNA-based transformer model to predict chromatin loops mediated by CTCF and Cohesin. |
Deep learning model specific assessment
In the realm of forecasting chromatin interactions and loops, deep learning models have risen as a promising area of research. These models employ advanced neural network architectures and undergo training on substantial volumes of genomic data to effectively predict the spatial relationships between distant genomic elements. A major strength of these models lies in their ability to capture intricate, non-linear associations among genomic characteristics, a challenge that traditional statistical methods often struggle to overcome.
Deep transformer models such as BERT [86] have significantly improved genomic sequence analysis, particularly chromatin interaction prediction. Their ability to learn from large datasets and identify complex DNA patterns yields highly accurate models. Moreover, BERT’s adaptability to specific research tasks, from identifying regulatory elements to predicting chromatin interactions, showcases its versatility in genomics. In this case study, we used DNABERT [83] as the pre-trained model and fine-tuned it on the datasets described above. Fine-tuning was performed for the final classification task, using sequence fragments from two distinct chromatin regions.
Genomic interactions are represented as connections between distant points within the genome, where these distant regions serve as anchor points. In each chromatin interaction, we identified the highest-scoring CTCF motifs and their respective orientations within these anchor regions. As for the Cohesin protein factor, we incorporated the cohesin peaks with the highest strength to select significant sub-regions from the anchors. In this experiment concerning the Cohesin protein, three sets of peak information were utilized for selecting significant regions from three sources: Mumbach, LGFS and ENCODE. Interactions lacking motif regions in both anchors are excluded from the learning process. The resulting interactions are presented in Table 2.
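The anchor-selection step described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the `Motif` record and the `best_motif` and `filter_interactions` helpers are hypothetical names, and the rule that an interaction is kept only when both anchors contain a motif is one reading of the exclusion criterion.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Motif:
    """Hypothetical record for a scored CTCF motif hit."""
    chrom: str
    start: int
    end: int
    score: float
    strand: str  # '+' or '-' orientation


Anchor = Tuple[str, int, int]  # (chrom, start, end)


def best_motif(anchor: Anchor, motifs: List[Motif]) -> Optional[Motif]:
    """Return the highest-scoring motif fully contained in the anchor, if any."""
    chrom, start, end = anchor
    inside = [m for m in motifs
              if m.chrom == chrom and m.start >= start and m.end <= end]
    return max(inside, key=lambda m: m.score) if inside else None


def filter_interactions(interactions: List[Tuple[Anchor, Anchor]],
                        motifs: List[Motif]) -> List[Tuple[Motif, Motif]]:
    """Keep interactions whose two anchors each carry a motif (assumed rule)."""
    kept = []
    for left, right in interactions:
        lm, rm = best_motif(left, motifs), best_motif(right, motifs)
        if lm is not None and rm is not None:
            kept.append((lm, rm))
    return kept
```

For Cohesin, the same pattern would apply with peak strength in place of motif score; the linear scan over `motifs` is kept for clarity and would normally be replaced by an interval index on real data.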
The process extracts 250-base-pair genomic subsequences from the anchor regions around CTCF motifs and Cohesin peaks. Each chromatin loop is represented as a pair of these subsequences separated by a [SEP] token, for a total of 500 base pairs. For the negative dataset, genomic regions are chosen at random, ensuring no overlap with positive or transitive interactions. Interactions from 22 chromosomes are used for training, with chromosome 9 held out for validation, and a 1:1 ratio of positive to negative examples is maintained to keep the training data balanced.
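The dataset construction can be illustrated with a short sketch. It assumes that each anchor is reduced to a window centred on the motif or peak position, that the genome is available as an in-memory dictionary of strings, and that negatives are drawn from a list of candidate loci. All helper names are hypothetical, and the check against transitive interactions is omitted for brevity.

```python
import random
from typing import Dict, List, Tuple

WINDOW = 250  # base pairs taken from each anchor

Pair = Tuple[str, int, int]  # (chrom, left_center, right_center)


def window_around(center: int, length: int = WINDOW) -> Tuple[int, int]:
    """Half-open [start, end) window of `length` bp centred on `center`."""
    start = max(0, center - length // 2)
    return start, start + length


def extract(genome: Dict[str, str], chrom: str, center: int) -> str:
    """Slice the window from the chromosome (shorter near chromosome ends)."""
    start, end = window_around(center)
    return genome[chrom][start:end].upper()


def make_pair(genome: Dict[str, str],
              left: Tuple[str, int], right: Tuple[str, int]) -> str:
    """Join the two anchor subsequences with a [SEP] token."""
    return f"{extract(genome, *left)} [SEP] {extract(genome, *right)}"


def sample_negatives(positives: List[Pair],
                     loci_by_chrom: Dict[str, List[int]],
                     seed: int = 0) -> List[Pair]:
    """Draw as many random non-positive pairs as there are positives (1:1).

    Assumes enough candidate pairs exist; otherwise the loop would not end.
    """
    rng = random.Random(seed)
    forbidden = set(positives)
    chroms = sorted(c for c, loci in loci_by_chrom.items() if len(loci) >= 2)
    negatives = set()
    while len(negatives) < len(positives):
        chrom = rng.choice(chroms)
        a, b = sorted(rng.sample(loci_by_chrom[chrom], 2))
        if (chrom, a, b) not in forbidden:
            negatives.add((chrom, a, b))
    return sorted(negatives)


def split_by_chromosome(examples: List[Pair], held_out: str = "chr9"):
    """Hold out one chromosome for validation; train on the rest."""
    train = [e for e in examples if e[0] != held_out]
    valid = [e for e in examples if e[0] == held_out]
    return train, valid
```

Splitting by chromosome rather than at random keeps sequences from the held-out chromosome entirely unseen during training, which avoids leakage between overlapping anchors.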
In this analytical assessment, models were trained on each of the 36 individual datasets generated from six different pipelines, considering two protein factors and three sources. The performances are plotted in Figure 3. In CTCF-based datasets, ChIA-PET V3 and FitHiChIP show higher accuracy and area under the receiver operating characteristic curve (AUC), followed by Juicer. However, concerning data sources, LGFS shows significantly better AUC compared to Mumbach and 4DN, with the exception of MAPS. In terms of Cohesin-based datasets, Mumbach performs better in almost all pipelines, with only two exceptions: Juicer and HiC-Pro.

Figure 3. Chromatin interaction prediction performance across six pipelines with two protein factors, CTCF and Cohesin, considering different experiment types and sources.
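For reference, the two reported metrics can be computed as in the sketch below. It is a plain-Python illustration, not the evaluation code used in the study: the 0.5 decision threshold is an assumption, and the AUC is computed through its pairwise-ranking interpretation, that is, the probability that a randomly chosen positive example scores higher than a randomly chosen negative one.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of examples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise form is quadratic in the number of examples, so a rank-based implementation from a standard library is preferable for large validation sets.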
For the assessment of the overall performance of all datasets based on CTCF and Cohesin, a subset of 1030 chromatin interaction data common to all pipelines and sources is extracted for validation purposes. The performance of individual pipelines is presented in Figure 4. A three-level consensus strategy is used to enhance the coverage of the interaction.

Figure 4. Chromatin interaction prediction on the set of shared interactions and the predicted performance across all pipelines with three-level consensus-based performance.
In the first level, the consensus strategy focuses on harmonizing the data from three different sources for both CTCF and Cohesin—namely, Mumbach, LGFS and 4DN for CTCF; and Mumbach, LGFS and ENCODE for Cohesin—across all six pipelines. This approach allows for the integration of diverse datasets, mitigating source-specific biases and enhancing the reliability of the interaction data. The outcomes of this initial consensus are visually represented in Figure 4 through Venn diagrams, which delineate the coverage of interactions specific to each source as well as those identified through the consensus approach.
The analysis adds complexity in the next consensus level by addressing experimental biases from different technologies, such as ChIA-PET (via ChIA-PIPE and ChIA-PET Tool V3), HiChIP (via FitHiChIP and MAPS) and Hi-C (via Juicer and HiC-Pro). It applies a consensus process for each technology group, using pre-computed models to balance their unique biases and strengths. This is crucial for creating a consensus model that accurately represents a wide range of experimental interactions.
The final consensus level, a result of this detailed process, merges insights from the previous steps, markedly improving chromatin interaction coverage.
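The three levels can be sketched as nested voting over sets of predicted interactions. The text does not state the exact combination rule, so the sketch below makes it an explicit assumption: each level keeps an interaction when at least `min_votes` of its inputs predict it, and `min_votes=1` reduces to a union, which matches the stated goal of improving coverage. The function names and the shape of the `predictions` dictionary are hypothetical; the technology grouping follows the description above.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Set

# Technology groups as described in the text.
TECH_GROUPS = {
    "ChIA-PET": ["ChIA-PIPE", "ChIA-PET Tool V3"],
    "HiChIP": ["FitHiChIP", "MAPS"],
    "Hi-C": ["Juicer", "HiC-Pro"],
}


def consensus(sets: Iterable[Set[Hashable]], min_votes: int = 1) -> Set[Hashable]:
    """Keep items predicted by at least `min_votes` of the input sets."""
    votes = Counter(item for s in sets for item in s)
    return {item for item, v in votes.items() if v >= min_votes}


def three_level_consensus(predictions: Dict[str, Dict[str, Set[Hashable]]],
                          min_votes: int = 1) -> Set[Hashable]:
    """predictions[pipeline][source] -> set of interactions predicted positive."""
    # Level 1: combine the three sources within each pipeline.
    per_pipeline = {p: consensus(by_source.values(), min_votes)
                    for p, by_source in predictions.items()}
    # Level 2: combine the pipelines within each technology group.
    per_tech = {t: consensus([per_pipeline[p] for p in pipes if p in per_pipeline],
                             min_votes)
                for t, pipes in TECH_GROUPS.items()}
    # Level 3: combine the technology groups.
    return consensus(per_tech.values(), min_votes)
```

Raising `min_votes` trades coverage for agreement, so the same structure can express either an inclusive or a strict consensus.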
The deployment of a multi-level consensus strategy in assessing the performance of different datasets on CTCF and Cohesin interactions not only facilitates a more accurate and inclusive detection of chromatin interactions but also showcases the potential of collaborative approaches in overcoming the limitations inherent to individual experimental pipelines. This holistic assessment methodology provides critical insights into the chromatin interaction landscape, paving the way for further advancements in the field of genomics and epigenomics.
CONCLUSION
Our study conducted a thorough analysis of chromatin interaction datasets, systematically categorizing the data based on protein factors, experiment sources, and the specific pipelines used. We focused on the key proteins CTCF and Cohesin and considered two publicly available cell lines, GM12878 and HG00731, to represent B-lymphocyte cell types. Experimental data were obtained from different sources, emphasizing the variability in datasets. The assessment included a total of 36 distinct datasets (18 each for CTCF and Cohesin) generated from six pipelines across three sources. ChIA-PET Tool V3 consistently extracted a greater number of interactions, while the in-house LGFS strategy demonstrated superior coverage of chromatin interactions, especially for CTCF. A BERT-based deep learning model was applied to the resulting 36 individual datasets, and their performances were assessed in terms of accuracy and AUC to gain insights into the potential of these datasets for in silico loop prediction. ChIA-PET V3 and FitHiChIP demonstrated higher accuracy in CTCF-based datasets, with LGFS exhibiting superior AUC among the data sources. Mumbach consistently performed well in Cohesin-based datasets across most pipelines. The overall assessment, based on a subset of 1030 chromatin interactions common to all pipelines and sources, revealed robust interaction coverage through a three-level consensus strategy. This strategy, involving source-specific consensus and subsequent experimental bias-based consensus, resulted in a substantial improvement in interaction coverage for both CTCF and Cohesin, surpassing individual pipeline-based coverages. These findings underscore the significance of integrating multiple sources and employing consensus strategies to enhance the reliability and coverage of chromatin interaction predictions in genomics research.
In future work, we intend to examine learning strategies for both homogeneous and heterogeneous datasets and to perform further validations of the robustness of our methodology, including cross-learning and self-learning evaluations of the model.
We have outlined bioinformatics pipelines tailored for chromatin interaction and loop calling, focusing on experimental bias sources related to CTCF and Cohesin proteins.
Chromatin interaction coverage was assessed for two proteins, CTCF and Cohesin, across 18 combinations of pipelines and sources.
We explored and discussed in depth the predictive approaches for in silico chromatin loop identification.
Utilizing a deep transformer-based model, we evaluated prediction performance on the datasets and conducted performance analyses through a three-level consensus strategy across 18 combinations.
AUTHOR CONTRIBUTIONS
Anup Kumar Halder (Conceptualization-Equal, Data curation-Supporting, Formal analysis-Lead, Investigation-Lead, Methodology-Lead, Validation-Equal, Visualization-Lead, Writing – original draft-Lead, Writing – review & editing-Equal), Abhishek Agarwal (Conceptualization-Supporting, Data curation-Lead, Formal analysis-Supporting, Investigation-Supporting, Writing – original draft-Equal), Karolina Jodkowska (Data production-Lead, Investigation-Supporting, Validation-Supporting, Writing – review & editing-Supporting), Dariusz Plewczynski (Conceptualization-Equal, Project administration-Lead, Supervision-Lead, Writing – review & editing-Equal).
FUNDING
The research was co-funded by Warsaw University of Technology within the Excellence Initiative: Research University (IDUB) programme. The work has been co-supported by the EU-funded Marie Sklodowska-Curie Action (MSCA) Innovative Training Network named Enhpathy (www.enhpathy.eu), Molecular Basis of Human enhanceropathies, and by the National Institute of Health USA 4DNucleome grant 1U54DK107967-01 “Nucleome Positioning System for Spatiotemporal Genome Organization and Regulation”. This work has also been co-supported by the Polish National Science Centre (2019/35/O/ST6/02484, 2020/37/B/NZ2/03757). Computations were performed thanks to the Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, using the Artificial Intelligence HPC platform financed by the Polish Ministry of Science and Higher Education (decision no. 7054/IA/SP/2020 of 2020-08-28).
Author Biographies
Anup Kumar Halder is a postdoctoral fellow at the Faculty of Mathematics and Information Sciences, Warsaw University of Technology, Warsaw, Poland. He is also associated with the Laboratory of Functional and Structural Genomics at the Center of New Technologies (CeNT), University of Warsaw, Poland.
Abhishek Agarwal is a PhD student in the Laboratory of Functional and Structural Genomics at the Center of New Technologies (CeNT), University of Warsaw, Poland.
Karolina Jodkowska is a postdoctoral fellow in the Laboratory of Functional and Structural Genomics at the Center of New Technologies (CeNT), University of Warsaw, Warsaw, Poland.
Dariusz Plewczynski is a Professor of Exact and Natural Sciences at Warsaw University of Technology as Principal Investigator and Head. He is also associated with the University of Warsaw in the Center of New Technologies CeNT, Warsaw, Poland, as the head of the Laboratory of Functional and Structural Genomics.
References