Abstract

Genomic data analysis has witnessed a surge in complexity and volume, primarily driven by the advent of high-throughput technologies. In particular, studying chromatin loops and structures has become pivotal in understanding gene regulation and genome organization. This systematic investigation explores the realm of specialized bioinformatics pipelines designed specifically for the analysis of chromatin loops and structures. Our investigation incorporates two protein (CTCF and Cohesin) factor-specific loop interaction datasets from six distinct pipelines, amassing a comprehensive collection of 36 diverse datasets. Through a meticulous review of existing literature, we offer a holistic perspective on the methodologies, tools and algorithms underpinning the analysis of this multifaceted genomic feature. We illuminate the vast array of approaches deployed, encompassing pivotal aspects such as data preparation pipeline, preprocessing, statistical features and modelling techniques. Beyond this, we rigorously assess the strengths and limitations inherent in these bioinformatics pipelines, shedding light on the interplay between data quality and the performance of deep learning models, ultimately advancing our comprehension of genomic intricacies.

INTRODUCTION

The field of genomics has experienced a significant shift, which is marked by the growing intricacy and quantity of data. The rapid development in high-throughput technology has played a key role in demonstrating that chromatin exhibits a precisely arranged configuration within the eukaryotic nucleus. This organisation is tightly controlled and influenced by factors such as the cell cycle stage, environmental signals, and pathological states [1–5]. The 3D configuration of chromatin is crucial for various fundamental biological functions, such as controlling gene expression and DNA replication [6–9]. Hence, it is crucial to conduct a thorough examination of chromatin organisation in order to understand better the impact of chromatin organization and looping occurrences on cell-specific biological functions [10–12]. So far, Hi-C has been the primary method used to identify chromatin contacts across the whole genome [13, 14]. Recent methods, such as Chromatin Interaction Analysis with Paired-End Tag sequencing (ChIA-PET) and HiChIP, have been developed to capture genome-wide highly specific chromatin interactions that are facilitated by specific proteins [15–18].

Recognising the expanding importance of chromatin loops and structures, this systematic review deeply explores the specialised domain of bioinformatics pipelines specifically intended for their investigation. Our research distinguishes itself through the examination of chromatin loops facilitated by two critical protein factors, CTCF and cohesin, drawing from three different data sources. Consequently, we consider six datasets in our analysis, processed by six independent bioinformatics approaches: ChIA-PET biased pipeline (ChIA-PIPE) [19], ChIA-PET Tool V3 [20], FitHiChIP [21], MAPS [22], Juicer [14, 23] and HiC-Pro [24]. This methodology generates a rich and diverse dataset of 36 variations, highlighting an extensive collection of computational strategies and methods applied in the study of genome architecture.

Our exploration extends beyond the mere enumeration of these datasets; we engage in an exhaustive review of the existing literature, casting light on the methodologies, tools and overall pipeline underpinning chromatin loops and structures analysis. This review spans the entire spectrum of approaches in this domain, covering essential aspects such as data preparation pipelines, preprocessing techniques, statistical features and advanced modelling methodologies. By scrutinizing the intricacies of each approach, we aim to provide a holistic perspective that outlines their diversity and emphasizes their individual contributions to the field.

Moreover, our investigation takes a significant step forward by critically assessing the strengths and limitations inherent in these bioinformatics pipelines. We delve into the interplay between data quality and the performance of deep learning models, recognizing the pivotal role played by data integrity in pursuing accurate and biologically relevant results. In doing so, our systematic review contributes to the ongoing efforts to enhance our understanding of genomics intricacies. It paves the way for further advancements in chromatin loop and structure analysis.

CHROMATIN INTERACTION

The emergence of chromosome conformation capture (3C) technology marked a transformative leap in our ability to deduce higher-order chromosome folding [25]. By pinpointing spatial proximity between distant genomic sequences, 3C provided a holistic understanding of genome topology. As we delve into the intricate realm of 3D chromatin organization, architectural protein factors, such as CTCF and cohesin, emerge as key players orchestrating the formation of topologically associated domains (TADs) and chromatin contact domains (CCDs) [16]. These architectural proteins play a pivotal role in delineating the spatial organization of the genome, contributing to the establishment of compartments and the intricacies of chromatin loops. The interplay between these structural elements shapes the three-dimensional (3D) landscape of the genome, unveiling a dynamic network of interactions that govern gene regulation and cellular function.

With the escalation in sequencing throughput, a global assessment of chromatin interactions within a population has become feasible through diverse methods such as 3C [25], 4C [26], 5C [27] and Hi-C [28] representing one-to-one, one-to-all, many-to-many and all-to-all type interactions, respectively. This involves sequencing all 3C ligation products or a selectively chosen subset [13, 14, 29]. The heightened sequencing capabilities not only empower comprehensive insights into chromatin architecture but also offer a nuanced exploration of intricate genomic interactions. In the following section, several loop-calling pipelines are discussed with data processing strategies.

LOOP CALLING PIPELINES

To record genome-wide chromatin interactions at high throughput and gain important insights into 3D chromatin interactions, various experimental procedures have been developed. The rapid evolution of experimental technologies is a driving force behind advancing bioinformatics tools for loop-calling methods for 3C-based experimental data. These experimental approaches can be categorized into three groups, each associated with biases originating from distinct initial data derived from three different experiments: ChIA-PET biased pipeline, Hi-C Biased Pipeline and HiChIP biased pipeline. These tools offer diverse functionalities for the analysis and processing of Hi-C and chromatin interaction data. Table 1 provides a succinct description of these pipelines, including their respective repository details.

Table 1

Pipeline details and their corresponding protocol with available scripts


ChIA-PIPE [19] is a Python-based pipeline for joint analysis of ChIP-seq and Hi-C data. Mango [30] facilitates Hi-C data visualization and exploration with interactive 3D views implemented in JavaScript and Python. ChIA-PET2 [31] analyzes Chromatin Interaction Analysis by Paired-End Tag Sequencing (ChIA-PET) data, implemented in Perl. ChIA-PET Tool V3 [20], implemented in Java, Perl and R, is a suite for analyzing ChIA-PET data. cLoops [32] is a versatile tool for calling loops in various chromatin interaction datasets, including ChIA-PET, HiChIP and high-resolution Hi-C data. cLoops2 [33] is designed with essential features such as peak-calling and loop-calling, accompanied by supplementary modules for interaction resolution, data similarity, feature quantification, aggregation analysis and visualization in 3D chromatin interaction datasets. ChIAPoP [34] employs a zero-truncated Poisson distribution and corrects for genomic distance and sequence bias; by incorporating distinct background distributions, it is capable of discerning both interchromosomal and intrachromosomal loops.

HiC-Pro [24], implemented in Python, is a comprehensive pipeline covering mapping, normalization and quality control. Juicer is a versatile Hi-C suite implemented in Java and Python, offering preprocessing, normalization and interaction calling [14, 23]. hicpipe [43] is a Python-based Hi-C data processing pipeline, while hiclib [36] is a Python library for mapping, filtering and chromatin interaction analysis. HOMER [44], implemented in Perl and C++, offers motif discovery, functional annotation and Hi-C data analysis. HiFive, implemented in Python and Cython, features functionalities for contact matrix normalization, TAD calling and differential analysis [41]. HiCdat [38], an R package, aids in visualizing and analyzing Hi-C data, exploring chromatin interactions and structure. HiTC, an R package, provides statistical tools and interactive visualization for Hi-C data analysis [37]. HiC-inspector is an interactive tool for exploring and visualizing Hi-C data, implemented in JavaScript [42]. HiC-bench [39], a benchmarking framework, is implemented in Perl and R. HiCUP [45] efficiently handles Hi-C data processing with a focus on mapping and quality control, implemented in Perl. TADbit [40], a Python library, is designed for analyzing TADs and chromatin interactions. Finally, MUSTACHE [35] detects loops based on local enrichment relative to the surrounding pixels of the contact map; its fundamental concept is to represent the contact map as image data at multiple scales using a Gaussian kernel.

FitHiChIP [21] accurately calls peaks and performs statistical analysis on significant chromatin interactions from HiChIP data, implemented in R. MAPS [22] is a computational method implemented in R and Perl for identifying long-range chromatin interactions from PLAC-seq and HiChIP data. Lastly, hichipper [46] is a Python-based HiChIP data processing pipeline with tools for mapping, filtering and interaction calling.

In this comprehensive review, we utilized only six distinct pipelines for loop-calling methodologies on 3C-based experimental data, selecting two from each group. This approach aims to elucidate the biases mitigated by these pipelines and provides insights into their functionality within the performance evaluation of deep learning models. The pipelines employed for dataset processing include the ChIA-PET biased pipeline (ChIA-PIPE [19] and ChIA-PET Tool V3 [20]), the Hi-C biased pipeline (Juicer [14, 23] and HiC-Pro [24]) and the HiChIP biased pipeline (FitHiChIP [21] and MAPS [22]). In the following section, we delve into the details of these six pipelines and explore the data processing strategies employed. Figure 1 demonstrates the hierarchical categorization of these pipelines considering the CTCF and Cohesin protein factors.

Hierarchical categorization of bioinformatics pipeline with two protein factors CTCF and Cohesin considering different experiment types and sources.
Figure 1


ChIA-PET biased pipeline

ChIA-PIPE stands as a groundbreaking achievement in the realm of chromatin interaction data analysis and visualization, offering a fully automated pipeline designed for comprehensive ChIA-PET studies [19]. This pipeline streamlines the entire process, from data processing to analysis and visualization, providing a seamless and efficient solution. ChIA-PIPE sets itself apart by assimilating distinctive functionalities not found in earlier tools like ChIA-PET tool [47], Mango [30] and ChIA-PET2 [31]. It introduces a binding peak-calling module with heightened sensitivity and specificity, comprehensive support for 2D contact map visualization tools like HiGlass [48] and Juicebox [49], and a novel genome browser called BASIC Browser, optimized for high-resolution exploration of peaks, loops and strand-specific RNA sequencing. In addition, ChIA-PIPE facilitates downstream structural interpretation analysis, including the calling and visualization of CCDs, annotation of enhancer–promoter (E-P) interactions, and the identification of haplotype-specific loops and peaks. With its fully automated approach and advanced features, ChIA-PIPE emerges as an invaluable tool for researchers engaged in ChIA-PET data analysis, promising efficiency, accuracy and accessibility in understanding complex chromatin interactions.

In this assessment, after merging all the sample replicates, the data were processed through the ChIA-PIPE pipeline with the default parameters, including the linker sequence (GTTGGATAAG) and the peak-calling algorithm (MACS2) [50]. This pipeline generated a high-resolution 2D contact matrix in .hic file format, along with annotated chromatin loops and their corresponding binding peak overlaps.

ChIA-PET Tool V3

The ChIA-PET Tool V3 [20] is a sophisticated tool for processing both short-read and long-read ChIA-PET data from paired-end reads. It excels in the generation of enriched binding peaks and the precise identification of chromatin interactions linked to proteins of interest. The tool’s strength lies in its robust error handling and quality assessment mechanisms, which encompass the creation of multiple log files and comprehensive statistics. This ensures meticulous traceability and confidence in the integrity of the processed data.

In the ChIA-PET Tool pipeline (version 3), flanking sequences after linker trimming were aligned to the human reference genome (hg38) using BWA-MEM (version 0.7.7), and only uniquely mapped paired-end tags (PETs), selected with a mapping quality threshold of 30, were retained. Self-ligation PETs were employed for binding site identification, while inter-ligation PETs were utilized for detecting long-range interactions.
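The filtering and classification step described above can be sketched as follows. This is a minimal illustration and not the ChIA-PET Tool V3 code: the `PET` record, the function name, the inclusive treatment of the mapping quality threshold and the 8 kb self-ligation span cutoff are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PET:
    """Hypothetical record for one paired-end tag (both ends already aligned)."""
    chrom1: str
    pos1: int
    mapq1: int
    chrom2: str
    pos2: int
    mapq2: int


def classify_pet(pet, min_mapq=30, self_ligation_cutoff=8000):
    """Return 'discard', 'self-ligation' or 'inter-ligation' for one PET.

    min_mapq follows the threshold quoted in the text (treated as inclusive
    here, which is an assumption); self_ligation_cutoff is an illustrative
    value, not a documented default.
    """
    # Drop PETs where either end falls below the mapping quality threshold.
    if pet.mapq1 < min_mapq or pet.mapq2 < min_mapq:
        return "discard"
    # Ends on different chromosomes cannot come from a single short fragment.
    if pet.chrom1 != pet.chrom2:
        return "inter-ligation"
    # Short genomic span: both ends likely derive from the same DNA fragment.
    span = abs(pet.pos2 - pet.pos1)
    return "self-ligation" if span < self_ligation_cutoff else "inter-ligation"
```

In this sketch, PETs labelled `self-ligation` would feed peak (binding site) calling and PETs labelled `inter-ligation` would feed interaction calling.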

Juicer

Juicer is a software tool commonly used for analyzing the 3D structure of chromatin in a cell’s nucleus [14, 23]. This pipeline plays a key role in studying the genome organization, particularly how different regions of DNA are folded and interact with each other. Juicer works by processing high-throughput sequencing data, often generated through techniques like Hi-C. One of the key analyses performed by Juicer is loop calling. It identifies loops or specific interactions between different genomic regions. Loops can be indicative of functional elements in the genome, such as E-P interactions.

Juicer implements a normalization method known as Knight-Ruiz (KR) balancing to correct biases in Hi-C data [51]. This includes correction for GC content, mappability and other factors. The normalization step aims to remove systematic biases, enhancing the accuracy of interaction matrices and subsequent loop calling. In this analytical study, all sample data were processed through Juicer [14, 52] and aligned against the hg38 reference genome. For subsequent analyses, all contact matrices were KR-normalized using Juicer. Loops were identified by HiCCUPS [14, 52] within the .hic files, and loop calling was performed at a resolution of 5 kb, with subsequent merging in accordance with the procedure outlined in [14]. Loops with robust signals were defined using an FDR threshold of 0.01 and an observed-count threshold of 5.

HiC-Pro

HiC-Pro emerges as a versatile and highly efficient pipeline meticulously crafted for the processing of Hi-C data [24]. HiC-Pro distinguishes itself by its proficiency in handling high-resolution datasets and providing an optimized format conducive to seamless contact map sharing. With a thoughtfully designed user interface, HiC-Pro not only administers rigorous quality controls but also effortlessly navigates the processing of Hi-C data, from the raw intricacies of sequencing reads to the generation of normalized, ready-to-use genome-wide contact maps. Its adaptability spans across diverse data origins, accommodating both restriction enzyme and nuclease digestion protocols with ease. The contact maps, both intra- and inter-chromosomal, crafted by HiC-Pro exhibit a remarkable resemblance to those generated by the hiclib package. The pipeline incorporates an intricately optimized iteration correction algorithm, a feature that significantly accelerates and refines the normalization process of Hi-C data.

HiC-Pro [24] normalizes interaction matrices through iterative correction, mitigating biases from library preparation and sequencing to enhance loop calling accuracy. It incorporates loop calling via the Directionality Index method and clustering, identifying loops through interaction directionality. Reads are aligned to GRCh38, with duplicates removed and reads allocated to MboI fragments, filtering for valid interactions and generating binned matrices. Loops are called using HiCCUPS with default settings, refining loop detection.
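To make the idea of iterative correction concrete, the following is a minimal dense-matrix sketch of matrix balancing. It is not HiC-Pro's optimized sparse implementation; the function name, defaults and convergence test are assumptions chosen for illustration.

```python
import numpy as np


def iterative_correction(matrix, n_iter=100, tol=1e-8):
    """Balance a symmetric contact matrix so non-empty rows have equal sums.

    Returns the balanced matrix and the per-bin bias vector such that
    raw[i, j] == balanced[i, j] * bias[i] * bias[j].
    """
    m = np.asarray(matrix, dtype=float).copy()
    bias = np.ones(m.shape[0])
    for _ in range(n_iter):
        row_sums = m.sum(axis=1)
        mask = row_sums > 0              # ignore bins with no contacts
        if not mask.any():
            break
        scale = np.ones_like(row_sums)
        # Relative coverage of each bin compared with the average bin.
        scale[mask] = row_sums[mask] / row_sums[mask].mean()
        # Divide every contact by the coverage factors of both of its bins.
        m /= np.outer(scale, scale)
        bias *= scale
        if np.abs(scale[mask] - 1.0).max() < tol:
            break
    return m, bias
```

After convergence every non-empty bin has the same total coverage, which is the equal-visibility assumption behind this family of normalizations. Matrices dominated by their diagonal can converge slowly with this simple update.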

FitHiChIP

FitHiChIP [21] efficiently analyzes HiChIP data to detect chromatin interactions at protein binding sites, modelling read dependencies for significant statistical insights. Its data-driven methodology adeptly manages variable genomic coverage, identifying differential interactions across conditions to reveal chromatin architecture dynamics. With its user-friendly interface and efficient algorithms, FitHiChIP contributes to advancing our understanding of the intricate relationship between chromatin structure and transcriptional regulation.

HiC-Pro [24] processes raw FASTQ files to generate contact maps and validates ligation products, mapping them to the hg38 reference genome. Loop calling is then conducted in FitHiChIP [21] at a 5-kb resolution, ensuring uniform loop sizes. To align with FitHiChIP’s 1D peak prerequisites, MACS2 (version 2.2.9.1) [50] is utilized for peak calling in matching tissue conditions.

MAPS

MAPS [22], a computational approach named Model-based Analysis of PLAC-seq and HiChIP, has been devised to process experimental data and discern long-range chromatin interactions. Employing a zero-truncated Poisson regression framework [53], MAPS effectively removes systematic biases inherent in PLAC-seq and HiChIP datasets. Following this correction, the method utilizes normalized chromatin contact frequencies to pinpoint significant chromatin interactions anchored at genomic regions bound by the protein of interest. Significantly, MAPS exhibits superior performance when compared to existing software tools in the analysis of chromatin interactions across diverse PLAC-seq and HiChIP datasets associated with various transcription factors and histone marks.
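The zero-truncated Poisson distribution at the core of this framework can be written down in a few lines. The sketch below covers only the distribution, with the expected count `lam` taken as given; in MAPS that expectation comes from a fitted regression, which is not reproduced here, and the function names are hypothetical.

```python
import math


def zero_truncated_poisson_pmf(k, lam):
    """P(Y = k | Y > 0) for a Poisson variable with rate lam > 0."""
    if k < 1:
        return 0.0
    # log of lam^k * exp(-lam) / (k! * (1 - exp(-lam)))
    log_p = (k * math.log(lam) - lam - math.lgamma(k + 1)
             - math.log1p(-math.exp(-lam)))
    return math.exp(log_p)


def zero_truncated_poisson_mean(lam):
    """Mean of the zero-truncated Poisson distribution."""
    return lam / (1.0 - math.exp(-lam))


def upper_tail_p_value(observed, lam):
    """P(Y >= observed | Y > 0): evidence that a bin pair is enriched."""
    below = sum(zero_truncated_poisson_pmf(k, lam) for k in range(1, observed))
    return max(0.0, 1.0 - below)
```

Truncation at zero reflects that only bin pairs with at least one observed contact enter the model, so the ordinary Poisson probabilities are renormalized by `1 - exp(-lam)`.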

The model-based analysis of long-range chromatin interactions is applied to call significant chromatin interactions at a resolution of 5 kb. In this pipeline, bwa-mem [54] was used to map the two ends of each paired-end read to the hg38 reference genome. Subsequently, the validly mapped reads were retained, and PCR duplicates were eliminated using samtools rmdup. Following this, intra-chromosomal reads were separated by a 1 kb distance cutoff into short-range and long-range reads, and the long-range reads were employed to study protein-mediated chromatin interactions.

In this study, we investigate the variety in the chromatin loops produced by different loop-calling algorithms in order to assess dataset variability across the bioinformatics pipelines listed above. The aim is to find variations in the sets of loops processed by pipelines that are biased towards a specific experimental data type, for instance Hi-C biased (HiCCUPS, HiC-Pro), HiChIP biased (FitHiChIP, MAPS) or ChIA-PET biased (ChIA-PIPE, ChIA-PET Tool V3), and to understand and quantify this variability within the sets of significant interactions that are input for the deep learning model.

Pipelines within the same category differ significantly. For example, ChIA-PIPE utilizes a statistical model based on the non-central hypergeometric distribution [55], incorporating genomic distance, whereas ChIA-PET Tool V3 employs a hypergeometric distribution for interaction P-value calculations. For HiChIP analyses, FitHiChIP applies a regression model to correct biases related to coverage and genomic distances, followed by a binomial distribution for loop identification, contrasting with MAPS, which uses a zero-truncated Poisson regression model for bias adjustment. In the realm of Hi-C analyses, different statistical models are employed: HiCCUPS leverages Poisson regression to eliminate known biases for loop calling, while HiC-Pro utilizes a fast, sparse iterative correction technique. These distinctions are further amplified by other factors such as alignment tools (e.g. Bowtie2 [56], BWA [57], GEM [58]) and the programming languages used.
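As an illustration of a hypergeometric interaction P-value, the sketch below computes an upper-tail probability from counts. The mapping of PET counts onto the hypergeometric parameters (total tag ends as the population, tag ends at each anchor as the marginals) is an assumption for the example, and the exact formulation used by ChIA-PET Tool V3 may differ.

```python
from math import comb


def hypergeom_upper_tail(k_obs, population, successes, draws):
    """P(X >= k_obs) for X ~ Hypergeometric(population, successes, draws)."""
    k_max = min(successes, draws)
    total = comb(population, draws)
    # Sum the probability mass of every outcome at least as extreme as k_obs.
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(k_obs, k_max + 1)
    )
    return tail / total


def interaction_p_value(pets_between, ends_at_a, ends_at_b, total_ends):
    """Chance of seeing >= pets_between links if tag ends paired at random.

    Assumed setup: ends_at_a tag ends fall in anchor A, ends_at_b in anchor B,
    out of total_ends tag ends genome-wide.
    """
    return hypergeom_upper_tail(pets_between, total_ends, ends_at_a, ends_at_b)
```

A small P-value means the two anchors are linked by more PETs than their individual coverage would suggest, which is the intuition shared by the hypergeometric and non-central hypergeometric models mentioned above.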

Dataset variability across different pipelines and sources

In this study, we performed a comprehensive analysis of the dataset, classifying chromatin interaction data according to protein factor, experiment source and the specific pipeline employed. The loop data were collected and evaluated using these pipelines, focusing on the two protein factors, CTCF and Cohesin. We considered two publicly available cell lines, GM12878 and HG00731, both representing B-lymphocyte cell types. The GM12878 datasets were acquired through ChIA-PET [29] and HiChIP [59], whereas the HG00731 datasets were obtained solely via HiChIP. To exemplify the variability of the datasets, the experimental data for each protein were collected from three different sources: CTCF data are sourced from Mumbach [59], LGFS (in-house) and 4DN [60], while Cohesin data are obtained from Mumbach, LGFS (in-house) and ENCODE [61]. Consequently, this assessment comprised 18 (6 pipelines × 3 sources) datasets for CTCF and 18 (6 pipelines × 3 sources) datasets for Cohesin protein-specific chromatin interactions, yielding a total of 36 distinct datasets. The detailed statistics of the data, along with their corresponding categorizations, are outlined in Table 2. Out of the six pipelines, ChIA-PET Tool V3 consistently extracts a greater number of interactions across all three sources. However, MAPS exhibits a lower coverage of interactions. Notably, the in-house LGFS strategy delivers superior coverage of chromatin interactions, followed by Mumbach and then by 4DN for CTCF and ENCODE for Cohesin. To illustrate the relationship in terms of interaction coverage for both protein factors, a Pearson correlation heatmap is shown in Figure 2. The heatmaps depict a high correlation among pipelines for both CTCF and Cohesin protein-specific interactions. Specifically, Juicer and MAPS represent a subset of ChIA-PET Tool V3 in terms of chromatin interaction coverage, with a correlation of 1. The intersection between each pair of datasets is computed using BEDTools [62] to obtain a unique count of overlapping interacting regions.
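The overlap counting and correlation described here can be sketched in plain Python. This is a naive quadratic illustration rather than the BEDTools-based procedure used in the study; the loop representation (a pair of `(chrom, start, end)` anchors with half-open coordinates), the rule that both anchors must overlap, and all function names are assumptions.

```python
def anchors_overlap(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def loops_overlap(loop1, loop2):
    """Two loops match when both anchors overlap, in either orientation."""
    (a1, a2), (b1, b2) = loop1, loop2
    same = anchors_overlap(a1, b1) and anchors_overlap(a2, b2)
    swapped = anchors_overlap(a1, b2) and anchors_overlap(a2, b1)
    return same or swapped


def count_shared_loops(set_a, set_b):
    """Number of loops in set_a matched by at least one loop in set_b."""
    return sum(any(loops_overlap(x, y) for y in set_b) for x in set_a)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)
```

For datasets with millions of interactions, an interval index or a dedicated tool is needed in place of the nested scan used in `count_shared_loops`.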

Correlation (Pearson) between the datasets of Chromatin Interaction Loops among 6 (pipelines) × 3 (sources) datasets for CTCF and Cohesin protein factors.
Figure 2


Table 2

In-depth statistical information and dataset variability presented with a hierarchical arrangement of categorization factors

| Protein factor | Experiment | Data source | Cell line | Applied pipeline | # Chromatin regions | Interactions |
|---|---|---|---|---|---|---|
| CTCF | HiChIP | Mumbach | GM12878 | P1_ChIA-PIPE | 459 341 | 236 996 |
| CTCF | HiChIP | Mumbach | GM12878 | P2_ChIA-PET-V3 | 519 079 | 5 015 273 |
| CTCF | HiChIP | Mumbach | GM12878 | P3_FitHiChIP | 8662 | 6928 |
| CTCF | HiChIP | Mumbach | GM12878 | P4_MAPS | 110 461 | 178 535 |
| CTCF | HiChIP | Mumbach | GM12878 | P5_Juicer | 64 696 | 59 751 |
| CTCF | HiChIP | Mumbach | GM12878 | P6_HiC-Pro | 26 050 | 92 948 |
| CTCF | HiChIP | LGFS | HG00731 | P1_ChIA-PIPE | 90 178 | 45 881 |
| CTCF | HiChIP | LGFS | HG00731 | P2_ChIA-PET-V3 | 547 229 | 11 860 210 |
| CTCF | HiChIP | LGFS | HG00731 | P3_FitHiChIP | 54 215 | 82 132 |
| CTCF | HiChIP | LGFS | HG00731 | P4_MAPS | 26 212 | 28 013 |
| CTCF | HiChIP | LGFS | HG00731 | P5_Juicer | 35 813 | 26 590 |
| CTCF | HiChIP | LGFS | HG00731 | P6_HiC-Pro | 32 690 | 210 336 |
| CTCF | ChIA-PET | 4DN | GM12878 | P1_ChIA-PIPE | 203 475 | 102 680 |
| CTCF | ChIA-PET | 4DN | GM12878 | P2_ChIA-PET-V3 | 70 973 | 328 562 |
| CTCF | ChIA-PET | 4DN | GM12878 | P3_FitHiChIP | 29 348 | 30 528 |
| CTCF | ChIA-PET | 4DN | GM12878 | P4_MAPS | 96 670 | 135 921 |
| CTCF | ChIA-PET | 4DN | GM12878 | P5_Juicer | 37 421 | 23 639 |
| CTCF | ChIA-PET | 4DN | GM12878 | P6_HiC-Pro | 33 894 | 183 934 |
| Cohesin | HiChIP | Mumbach | GM12878 | P1_ChIA-PIPE | 184 111 | 92 248 |
| Cohesin | HiChIP | Mumbach | GM12878 | P2_ChIA-PET-V3 | 249 566 | 5 339 103 |
| Cohesin | HiChIP | Mumbach | GM12878 | P3_FitHiChIP | 66 223 | 89 784 |
| Cohesin | HiChIP | Mumbach | GM12878 | P4_MAPS | 32 519 | 36 536 |
| Cohesin | HiChIP | Mumbach | GM12878 | P5_Juicer | 29 732 | 19 478 |
| Cohesin | HiChIP | Mumbach | GM12878 | P6_HiC-Pro | 40 524 | 1 048 575 |
| Cohesin | HiChIP | LGFS | HG00731 | P1_ChIA-PIPE | 299 871 | 151 404 |
| Cohesin | HiChIP | LGFS | HG00731 | P2_ChIA-PET-V3 | 528 015 | 10 825 471 |
| Cohesin | HiChIP | LGFS | HG00731 | P3_FitHiChIP | 185 931 | 827 288 |
| Cohesin | HiChIP | LGFS | HG00731 | P4_MAPS | 61 352 | 80 488 |
| Cohesin | HiChIP | LGFS | HG00731 | P5_Juicer | 51 253 | 43 148 |
| Cohesin | HiChIP | LGFS | HG00731 | P6_HiC-Pro | 236 854 | 11 053 952 |
| Cohesin | ChIA-PET | ENCODE | GM12878 | P1_ChIA-PIPE | 48 487 | 69 955 |
| Cohesin | ChIA-PET | ENCODE | GM12878 | P2_ChIA-PET-V3 | 80 995 | 272 053 |
| Cohesin | ChIA-PET | ENCODE | GM12878 | P3_FitHiChIP | 133 000 | 276 277 |
| Cohesin | ChIA-PET | ENCODE | GM12878 | P4_MAPS | 8803 | 7205 |
| Cohesin | ChIA-PET | ENCODE | GM12878 | P5_Juicer | 15 415 | 8767 |
| Cohesin | ChIA-PET | ENCODE | GM12878 | P6_HiC-Pro | 121 776 | 3 553 840 |

With an identical experiment, data source and cell line, we observe differences in the number of chromatin loops. For example, in GM12878 CTCF from Mumbach et al., the interaction coverage differs considerably across the six pipelines. These differences for the same cell line arise because each pipeline uses a different statistical model for predicting chromatin loops, may utilize ChIP-seq data from other experimental samples to locate loop anchors, applies a different distance cutoff to recognize intra-ligation or inter-ligation, and uses a different clustering algorithm to identify loop anchors and PETs.

Regarding differences in the number of loops that a single pipeline produces for different datasets, contact map resolution is an important factor for loop-calling methods and may have a significant impact on loop numbers. In our study, FitHiChIP, MAPS, HiCCUPS and HiC-Pro use fixed-size genomic bins (5 kb), whereas the ChIA-PET based pipelines generate the contact matrix file at multiple resolutions. Also, Hi-C anchors typically range from 5 to 100 kb in length, with rare cases involving extremely deep sequencing resulting in anchors as small as 1 kb [63, 64]. On the other hand, HiChIP and ChIA-PET anchors typically span several kilobase pairs [47]. The difference in resolution may introduce considerable noise into training datasets; consequently, we opted to directly utilize the chromatin interaction anchors.
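The effect of fixed-size binning on anchors can be seen in a small sketch. The function name is hypothetical, coordinates are assumed to be half-open, and the 5 kb default mirrors the bin size quoted in the text.

```python
def bins_for_anchor(start, end, resolution=5000):
    """Return the fixed-size bins (start, end) touched by a half-open anchor."""
    first = start // resolution
    last = (end - 1) // resolution   # end is exclusive
    return [(b * resolution, (b + 1) * resolution)
            for b in range(first, last + 1)]


# An anchor narrower than one bin is widened to the full 5 kb bin, while a
# 7 kb anchor is spread over three bins.
narrow = bins_for_anchor(12_300, 12_900)
wide = bins_for_anchor(4_000, 11_000)
```

The same interaction can therefore be reported with different anchor widths, or split across several bin pairs, depending on the pipeline, which is one way resolution affects loop counts.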

In silico chromatin loop prediction approaches

A thorough investigation into the DNA sequence and epigenomic features, encompassing elements such as transcription factor binding sites, chromatin accessibility and histone modification, greatly aids in pinpointing active cis-regulatory elements (CREs) [65, 66]. Numerous computational approaches have surfaced for forecasting chromatin interactions within CREs, taking into consideration factors such as evolutionary traits [67] and the unique arrangement of nucleosomes [68]. Recent investigations propose that valuable information for predicting 3D chromatin architecture can be derived from both DNA sequence and 1D epigenomic modifications. In response to these insights, researchers have applied machine learning (ML) and polymer physics simulation methods to anticipate 3D chromatin organizations [69]. We have presented a brief overview of these approaches in Table 3.

Table 3

Computational methods for predicting 3D chromatin organizations

| Method | Applied models/architecture | Description |
|---|---|---|
| CLNN-loop [70] | Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) | Predicts chromatin loops across various cell lines and CTCF-binding site (CBS) pair types by integrating multiple sequence-based features. |
| DeepLUCIA [71] | CNN | Predicts chromatin loops in tissues utilizing epigenomic data. |
| Block Copolymer Model [72] | Polymer Physics Simulation | Explains TAD formation and dynamics using the epigenomic landscape. |
| Peakachu [73] | Random Forest | Predicts chromatin loops from contact maps, identifying unique interactions across species, supported by 56 Hi-C datasets. |
| Pierro et al. [74] | Histone Modification Analysis | Shows that histone modification data can predict chromatin arrangement. |
| Akita [75] | CNN | Predicts genome folding from DNA sequence, emphasizing CTCF binding site orientation for in silico analysis. |
| DeepC [76] | Transfer Learning | Predicts genome folding from DNA sequences, identifying domain boundaries and impacts of genetic variations. |
| Sefer and Kingsford [77] | Histone Marker Analysis | Reveals that incorporating sequence features significantly enhances chromatin organization prediction. |
| CTCF-MP [78] | Word2vec and Boosted Trees | Leverages CTCF ChIP-seq and DNase-seq data to predict chromatin loops. |
| Lollipop [79] | Random Forest | Distinguishes CTCF-mediated loops from noninteracting ones using genomic and epigenomic data. |
| DeepMILO [80] | CNN and RNN | Models CTCF-mediated insulator loops and predicts effects of noncoding variants using DNA sequences. |
| 3DEpiLoop [81] | Random Forest | Predicts 3D chromatin interactions in TADs at 1 kb resolution using epigenomic profiles. |
| DeepTACT [82] | Deep Learning | Integrates genome sequence and chromatin data to predict enhancer-promoter interactions. |
| DNABERT [83] | Transformer-based NLP | Classifies, annotates and analyses DNA sequences, capturing complex patterns for genomics research. |
| DHL algorithm [84] | Transformer and ML Algorithms | Combines DNABERT with RF, SVM and KNN for sophisticated chromatin loop prediction. |
| ccLoopER [85] | DNA Language Model | Utilizes a DNA-based transformer model to predict chromatin loops mediated by CTCF and Cohesin. |

Deep learning model specific assessment

In the realm of forecasting chromatin interactions and loops, deep learning models have risen as a promising area of research. These models employ advanced neural network architectures and undergo training on substantial volumes of genomic data to effectively predict the spatial relationships between distant genomic elements. A major strength of these models lies in their ability to capture intricate, non-linear associations among genomic characteristics, a challenge that traditional statistical methods often struggle to overcome.

Deep transformer models such as BERT [86] have significantly improved genomic sequence analysis, particularly chromatin interaction prediction. Their ability to learn from large datasets and identify complex DNA patterns results in highly accurate models. Moreover, BERT’s adaptability to specific research tasks, from identifying regulatory elements to predicting chromatin interactions, showcases its versatility in genomics. In this case study, we used DNABERT [83] as the pre-trained model and fine-tuned it on the datasets described above. Fine-tuning targeted the final classification task, using adaptable sequence fragments from two distinct chromatin regions.

Genomic interactions are represented as connections between distant points within the genome, where these distant regions serve as anchor points. In each chromatin interaction, we identified the highest-scoring CTCF motifs and their respective orientations within these anchor regions. As for the Cohesin protein factor, we incorporated the cohesin peaks with the highest strength to select significant sub-regions from the anchors. In this experiment concerning the Cohesin protein, three sets of peak information were utilized for selecting significant regions from three sources: Mumbach, LGFS and ENCODE. Interactions lacking motif regions in both anchors are excluded from the learning process. The resulting interactions are presented in Table 2.
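The anchor-annotation step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `Motif` record, the `best_motif` and `annotate` helpers, and the in-memory list of motifs are all hypothetical, and the sketch assumes that an interaction is kept only when a motif is found in both anchors.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Motif:
    """Hypothetical record for one CTCF motif hit."""
    chrom: str
    start: int
    end: int
    score: float
    strand: str  # "+" or "-", i.e. the motif orientation


Anchor = Tuple[str, int, int]  # (chrom, start, end)


def best_motif(anchor: Anchor, motifs: List[Motif]) -> Optional[Motif]:
    """Return the highest-scoring motif fully contained in the anchor, if any."""
    chrom, start, end = anchor
    inside = [m for m in motifs
              if m.chrom == chrom and m.start >= start and m.end <= end]
    return max(inside, key=lambda m: m.score) if inside else None


def annotate(interactions: List[Tuple[Anchor, Anchor]], motifs: List[Motif]):
    """Attach the best motif to each anchor.

    Assumption: interactions without a motif in either anchor are dropped.
    """
    kept = []
    for left, right in interactions:
        left_motif = best_motif(left, motifs)
        right_motif = best_motif(right, motifs)
        if left_motif is None or right_motif is None:
            continue
        kept.append((left, right, left_motif, right_motif))
    return kept
```

For Cohesin, the same selection could be applied with peak strength in place of motif score; the orientation field would then be unused.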

The process extracts 250-base-pair genomic subsequences from the anchor regions around CTCF motifs and Cohesin peaks. Each chromatin loop is represented as a pair of these subsequences separated by a [SEP] token, for a total of 500 base pairs. For the negative dataset, genomic regions are chosen at random such that they do not overlap positive or transitive interactions. Training uses interactions from 22 chromosomes, with chromosome 9 held out for validation, and a 1:1 ratio of positive to negative examples keeps the training data balanced.
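A rough sketch of this data preparation is given below. It is an illustration under stated assumptions, not the published code: the genome is assumed to be a plain dictionary mapping chromosome names to sequence strings, windows are assumed to be centred on the motif or peak position, and the helper names (`window`, `make_pair`, `split_by_chromosome`, `balance`) are hypothetical. Negative sampling with the overlap check is not shown.

```python
import random

WINDOW = 250   # subsequence length per anchor, as stated in the text
SEP = "[SEP]"  # separator token between the two anchor subsequences


def window(genome, chrom, center, size=WINDOW):
    """Return a size-bp slice of genome[chrom] centred on `center` (0-based)."""
    seq = genome[chrom]
    start = center - size // 2
    # Clamp so the window stays inside the chromosome (assumed edge handling).
    start = max(0, min(start, len(seq) - size))
    return seq[start:start + size]


def make_pair(genome, chrom, left_center, right_center):
    """Join the two anchor windows with the separator token."""
    left = window(genome, chrom, left_center)
    right = window(genome, chrom, right_center)
    return f"{left} {SEP} {right}"


def split_by_chromosome(examples, held_out="chr9"):
    """Hold out one chromosome for validation; train on the rest."""
    train = [e for e in examples if e["chrom"] != held_out]
    val = [e for e in examples if e["chrom"] == held_out]
    return train, val


def balance(positives, negatives, seed=0):
    """Downsample the larger class to reach a 1:1 ratio."""
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n), rng.sample(negatives, n)
```

Splitting by chromosome rather than by random example keeps sequences from the validation chromosome entirely unseen during training.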

In this analytical assessment, models were trained on all 36 individual datasets generated from six different pipelines, considering two protein factors and three sources. Performance is plotted in Figure 3. In CTCF-based datasets, ChIA-PET V3 and FitHiChIP show higher accuracy and area under the receiver operating characteristic curve (AUC), followed by Juicer. Across data sources, LGFS shows significantly better AUC than Mumbach and 4DN, with the exception of MAPS. In Cohesin-based datasets, Mumbach performs better in almost all pipelines, with only two exceptions: Juicer and HiC-Pro.
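For readers unfamiliar with the two reported metrics, the following sketch shows one way they can be computed from binary labels and model scores. It is an illustration only: the 0.5 decision threshold is an assumption, and the AUC is computed with the pairwise (Mann-Whitney) formulation, which is quadratic in the number of examples and meant for clarity rather than speed.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    correct = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return correct / len(labels)


def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win, which matches the area under the ROC curve.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the two can rank pipelines differently.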

Figure 3

Chromatin Interaction Prediction performance across six pipelines with two protein factors, CTCF and Cohesin, considering different experiment types and sources.

To assess the overall performance of all CTCF- and Cohesin-based datasets, a subset of 1030 chromatin interactions common to all pipelines and sources was extracted for validation. The performance of individual pipelines is presented in Figure 4. A three-level consensus strategy is used to enhance interaction coverage.

Figure 4

Chromatin Interaction Prediction on the set of shared interactions and the predicted performance across all pipelines with three-level consensus-based performance.

In the first level, the consensus strategy focuses on harmonizing the data from three different sources for both CTCF and Cohesin—namely, Mumbach, LGFS and 4DN for CTCF; and Mumbach, LGFS and ENCODE for Cohesin—across all six pipelines. This approach allows for the integration of diverse datasets, mitigating source-specific biases and enhancing the reliability of the interaction data. The outcomes of this initial consensus are visually represented in Figure 4 through Venn diagrams, which delineate the coverage of interactions specific to each source as well as those identified through the consensus approach.

The analysis adds complexity in the next consensus level by addressing experimental biases from different technologies, such as ChIA-PET (via ChIA-PIPE and ChIA-PET Tool V3), HiChIP (via FitHiChIP and MAPS) and Hi-C (via Juicer and HiC-Pro). It applies a consensus process for each technology group, using pre-computed models to balance their unique biases and strengths. This is crucial for creating a consensus model that accurately represents a wide range of experimental interactions.

The final consensus level merges the outputs of the previous two steps and markedly improves chromatin interaction coverage, to 93.11% for CTCF and 88.55% for Cohesin, as shown in Figure 4. This represents a substantial enhancement over individual pipelines, demonstrating the consensus approach’s effectiveness in providing a fuller view of chromatin interactions.
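The three levels described above can be sketched as nested voting over sets of predicted interactions. The text does not specify the combination rule, so this sketch makes a loud assumption: an interaction is counted at a given level when at least `min_votes` members predict it, with a default of one vote (a union), which is one rule under which coverage can only grow. The function names and the dictionary layout keyed by `(pipeline, source)` are hypothetical.

```python
from collections import Counter

# Technology groups as listed in the text.
TECH_GROUPS = {
    "ChIA-PET": ["ChIA-PIPE", "ChIA-PET Tool V3"],
    "HiChIP": ["FitHiChIP", "MAPS"],
    "Hi-C": ["Juicer", "HiC-Pro"],
}


def vote(sets, min_votes=1):
    """Keep interactions predicted by at least `min_votes` of the given sets."""
    counts = Counter(item for s in sets for item in s)
    return {item for item, c in counts.items() if c >= min_votes}


def three_level_consensus(predictions, min_votes=(1, 1, 1)):
    """predictions maps (pipeline, source) -> set of predicted interaction ids."""
    # Level 1: combine the sources within each pipeline.
    pipelines = {pipeline for pipeline, _ in predictions}
    per_pipeline = {
        p: vote([hits for (pp, _), hits in predictions.items() if pp == p],
                min_votes[0])
        for p in pipelines
    }
    # Level 2: combine the pipelines within each technology group.
    per_tech = {
        tech: vote([per_pipeline.get(p, set()) for p in members], min_votes[1])
        for tech, members in TECH_GROUPS.items()
    }
    # Level 3: combine the technology groups.
    return vote(list(per_tech.values()), min_votes[2])


def coverage(recovered, shared):
    """Fraction of the shared interaction set that was recovered."""
    return len(recovered & shared) / len(shared)
```

With one vote required at every level the hierarchy reduces to a flat union; the levels only behave differently when a stricter rule, such as a majority, is applied at one of them.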

The deployment of a multi-level consensus strategy in assessing the performance of different datasets on CTCF and Cohesin interactions not only facilitates a more accurate and inclusive detection of chromatin interactions but also showcases the potential of collaborative approaches in overcoming the limitations inherent to individual experimental pipelines. This holistic assessment methodology provides critical insights into the chromatin interaction landscape, paving the way for further advancements in the field of genomics and epigenomics.

CONCLUSION

Our study conducted a thorough analysis of chromatin interaction datasets, systematically categorizing the data based on protein factors, experiment sources, and specific pipelines utilized. We focused on the key proteins CTCF and Cohesin and considered two publicly available cell lines, GM12878 and HG00731, to represent B-lymphocyte cell types. Experimental data were obtained from different sources, emphasizing the variability in datasets. The assessment included a total of 36 distinct datasets (18 each for CTCF and Cohesin) generated from six pipelines across three sources. ChIA-PET Tool V3 consistently extracted a greater number of interactions, while the in-house LGFS strategy demonstrated superior coverage of chromatin interactions, especially for CTCF. A BERT-based deep learning model was applied to the resulting 36 individual datasets, and performance was assessed in terms of accuracy and AUC to gain insights into the potential of these datasets for in silico loop prediction. ChIA-PET V3 and FitHiChIP demonstrated higher accuracy in CTCF-based datasets, with LGFS exhibiting superior AUC among the data sources. Mumbach consistently performed well in Cohesin-based datasets across most pipelines. The overall assessment, based on a subset of 1030 chromatin interactions common to all pipelines and sources, revealed robust interaction coverage through a three-level consensus strategy. This strategy, involving source-specific consensus and subsequent experimental bias-based consensus, resulted in a substantial improvement in interaction coverage for both CTCF and Cohesin, surpassing individual pipeline-based coverages. These findings underscore the significance of integrating multiple sources and employing consensus strategies to enhance the reliability and coverage of chromatin interaction predictions in genomics research.
In future work, we intend to examine learning strategies for both homogeneous and heterogeneous datasets and to perform validations that evaluate the robustness of our methodology. This exploration will extend into both cross-learning and self-learning evaluations and validations of the model.

Key Points
  • We have outlined bioinformatics pipelines tailored for chromatin interaction and loop calling, focusing on experimental bias sources related to CTCF and Cohesin proteins.

  • Chromatin interaction coverage was assessed for two proteins, CTCF and Cohesin, across 18 combinations of pipelines and sources.

  • We explored and discussed in depth the predictive approaches for in silico chromatin loop identification.

  • Utilizing a deep transformer-based model, we evaluated prediction performance on the datasets and conducted performance analyses through a three-level consensus strategy across 18 combinations.

AUTHOR CONTRIBUTIONS

Anup Kumar Halder (Conceptualization-Equal, Data curation-Supporting, Formal analysis-Lead, Investigation-Lead, Methodology-Lead, Validation-Equal, Visualization-Lead, Writing—original draft-Lead, Writing—review & editing-Equal), Abhishek Agarwal (Conceptualization-Supporting, Data curation-Lead, Formal analysis-Supporting, Investigation-Supporting, Writing—original draft-Equal), Karolina Jodkowska (Data production-Lead, Investigation-Supporting, Validation-Supporting, Writing—review & editing-Supporting), Dariusz Plewczynski (Conceptualization-Equal, Project administration-Lead, Supervision-Lead, Writing—review & editing-Equal).

FUNDING

The research was co-funded by Warsaw University of Technology within the Excellence Initiative: Research University (IDUB) programme. The work has been co-supported by EU-funded Marie Sklodowska-Curie Action (MSCA) Innovative Training Network named Enhpathy (www.enhpathy.eu) Molecular Basis of Human enhanceropathies, and by National Institute of Health USA 4DNucleome grant 1U54DK107967-01 “Nucleome Positioning System for Spatiotemporal Genome Organization and Regulation”. This work has also been co-supported by the Polish National Science Centre (2019/35/O/ST6/02484, 2020/37/B/NZ2/03757). Computations were performed thanks to the Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, using the Artificial Intelligence HPC platform financed by the Polish Ministry of Science and Higher Education (decision no. 7054/IA/SP/2020 of 2020-08-28).

Author Biographies

Anup Kumar Halder is a postdoctoral fellow at the Faculty of Mathematics and Information Sciences, Warsaw University of Technology, Warsaw, Poland. He is also associated with the Laboratory of Functional and Structural Genomics at the Center of New Technologies (CeNT), University of Warsaw, Poland.

Abhishek Agarwal is a PhD student in the Laboratory of Functional and Structural Genomics at the Center of New Technologies (CeNT), University of Warsaw, Poland.

Karolina Jodkowska is a postdoctoral fellow in the Laboratory of Functional and Structural Genomics at the Center of New Technologies (CeNT), University of Warsaw, Warsaw, Poland.

Dariusz Plewczynski is a Professor of Exact and Natural Sciences at Warsaw University of Technology as Principal Investigator and Head. He is also associated with the University of Warsaw in the Center of New Technologies CeNT, Warsaw, Poland, as the head of the Laboratory of Functional and Structural Genomics.

References

1.

Pederson
T
.
Chromatin structure and the cell cycle
.
Proc Natl Acad Sci
 
1972
;
69
(
8
):
2224
8
.

2.

Dixon
JR
,
Jie
X
,
Dileep
V
, et al. .  
Integrative detection and analysis of structural variation in cancer genomes
.
Nat Genet
 
2018
;
50
(
10
):
1388
98
.

3.

Dileep
V
,
Ay
F
,
Sima
J
, et al. .  
Topologically associating domains and their long-range contacts are established during early G1 coincident with the establishment of the replication-timing program
.
Genome Res
 
2015
;
25
(
8
):
1104
13
.

4.

Beagrie
RA
,
Pombo
A
.
Continuous chromatin changes
.
Nature
 
2017
;
547
(
7661
):
34
5
.

5.

Chiliński
M
,
Sengupta
K
,
Plewczynski
D
.
From DNA human sequence to the chromatin higher order organisation and its biological meaning: using biomolecular interaction networks to understand the influence of structural variation on spatial genome organisation and its functional effect
. In:
Seminars in Cell & Developmental Biology
, Vol.
121
.
Academic: Elsevier
,
2022
,
171
85
.

6.

Roy
SS
,
Mukherjee
AK
,
Chowdhury
S
.
Insights about genome function from spatial organization of the genome
.
Hum Genom
 
2018
;
12
(
1
):
8
.

7.

Kadauke
S
,
Blobel
GA
.
Chromatin loops in gene regulation
.
Biochim Biophys Acta
 
2009
;
1789
(
1
):
17
25
.

8.

Jerković
I
,
Szabo
Q
,
Bantignies
F
,
Cavalli
G
.
Higher-order chromosomal structures mediate genome function
.
J Mol Biol
 
2020
;
432
(
3
):
676
81
.

9.

Sengupta
K
,
Denkiewicz
M
,
Chiliński
M
, et al. .  
Multi-scale phase separation by explosive percolation with single-chromatin loop resolution
.
Comput Struct Biotechnol J
 
2022
;
20
:
3591
603
.

10.

Bonev
B
,
Cavalli
G
.
Organization and function of the 3D genome
.
Nat Rev Genet
 
2016
;
17
(
11
):
661
78
.

11.

Zheng
H
,
Xie
W
.
The role of 3D genome organization in development and cell differentiation
.
Nat Rev Mol Cell Biol
 
2019
;
20
(
9
):
535
50
.

12.

Marchal
C
,
Sima
J
,
Gilbert
DM
.
Control of DNA replication timing in the 3D genome
.
Nat Rev Mol Cell Biol
 
2019
;
20
(
12
):
721
37
.

13.

Lieberman-Aiden
E
,
Van Berkum
NL
,
Williams
L
, et al. .  
Comprehensive mapping of long-range interactions reveals folding principles of the human genome
.
Science
 
2009
;
326
(
5950
):
289
93
.

14.

Rao
SSP
,
Huntley
MH
,
Durand
NC
, et al. .  
A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping
.
Cell
 
2014
;
159
(
7
):
1665
80
.

15.

Li
G
,
Ruan
X
,
Auerbach
RK
, et al. .  
Extensive promoter-centered chromatin interactions provide a topological basis for transcription regulation
.
Cell
 
2012
;
148
(
1
):
84
98
.

16.

Tang
Z
,
Luo
OJ
,
Li
X
, et al. .  
CTCF-mediated human 3D genome architecture reveals chromatin topology for transcription
.
Cell
 
2015
;
163
(
7
):
1611
27
.

17.

Giambartolomei
C
,
Seo
J-H
,
Schwarz
T
, et al. .  
H3k27ac HiChIP in prostate cell lines identifies risk genes for prostate cancer susceptibility
.
Am J Hum Genet
 
2021
;
108
(
12
):
2284
300
.

18.

Okuyama
K
,
Strid
T
,
Kuruvilla
J
, et al. .  
PAX5 is part of a functional transcription factor network targeted in lymphoid leukemia
.
PLoS Genet
 
2019
;
15
(
8
):
e1008280
.

19.

Lee
B
,
Wang
J
,
Cai
L
, et al. .  
ChIA-PIPE: a fully automated pipeline for comprehensive ChIA-PET data analysis and visualization
.
Sci Adv
 
2020
;
6
(
28
):
eaay2078
.

20.

Li
G
,
Sun
T
,
Chang
H
, et al. .  
Chromatin interaction analysis with updated ChIA-PET tool (v3)
.
Genes
 
2019
;
10
(
7
):
554
.

21.

Bhattacharyya
S
,
Chandra
V
,
Vijayanand
P
,
Ay
F
.
Identification of significant chromatin contacts from HiChIP data by FitHiChIP
.
Nat Commun
 
2019
;
10
(
1
):
4221
.

22.

Juric
I
,
Miao
Y
,
Abnousi
A
, et al. .  
MAPS: model-based analysis of long-range chromatin interactions from PLAC-seq and HichIP experiments
.
PLoS Comput Biol
 
2019
;
15
(
4
):
e1006982
.

23.

Durand
NC
,
Shamim
MS
,
Machol
I
, et al. .  
Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments
.
Cell Syst
 
2016
;
3
(
1
):
95
8
.

24.

Servant
N
,
Varoquaux
N
,
Lajoie
BR
, et al. .  
HiC-Pro: an optimized and flexible pipeline for Hi-C data processing
.
Genome Biol
 
2015
;
16
(
1
):
1
11
.

25.

Dekker
J
,
Rippe
K
,
Dekker
M
,
Kleckner
N
.
Capturing chromosome conformation
.
Science
 
2002
;
295
(
5558
):
1306
11
.

26.

Van De Werken
HJG
,
Landan
G
,
Holwerda
SJB
, et al. .  
Robust 4C-seq data analysis to screen for regulatory dna interactions
.
Nat Methods
 
2012
;
9
(
10
):
969
72
.

27.

Dostie
J
,
Richmond
TA
,
Arnaout
RA
, et al. .  
Chromosome conformation capture carbon copy (5C): a massively parallel solution for mapping interactions between genomic elements
.
Genome Res
 
2006
;
16
(
10
):
1299
309
.

28.

Belton
J-M
,
McCord
RP
,
Gibcus
JH
, et al. .  
Hi-C: a comprehensive technique to capture the conformation of genomes
.
Methods
 
2012
;
58
(
3
):
268
76
.

29.

Fullwood
MJ
,
Liu
MH
,
Pan
YF
, et al. .  
An oestrogen-receptor-α-bound human chromatin interactome
.
Nature
 
2009
;
462
(
7269
):
58
64
.

30.

Phanstiel
DH
,
Boyle
AP
,
Heidari
N
,
Snyder
MP
.
Mango: a bias-correcting ChIA-PET analysis pipeline
.
Bioinformatics
 
2015
;
31
(
19
):
3092
8
.

31.

Li
G
,
Chen
Y
,
Snyder
MP
,
Zhang
MQ
.
ChIA-PET2: a versatile and flexible pipeline for ChIA-PET data analysis
.
Nucleic Acids Res
 
2017
;
45
(
1
):
e4
4
.

32.

Cao
Y
,
Chen
Z
,
Chen
X
, et al. .  
Accurate loop calling for 3D genomic data with cLoops
.
Bioinformatics
 
2020
;
36
(
3
):
666
75
.

33.

Cao
Y
,
Liu
S
,
Ren
G
, et al. .  
cLoops2: a full-stack comprehensive analytical tool for chromatin interactions
.
Nucleic Acids Res
 
2022
;
50
(
1
):
57
71
.

34.

Huang
W
,
Medvedovic
M
,
Zhang
J
,
Niu
L
.
ChIAPoP: a new tool for ChIA-PET data analysis
.
Nucleic Acids Res
 
2019
;
47
(
7
):
e37
7
.

35.

Ardakany
AR
,
Gezer
HT
,
Lonardi
S
,
Ay
F
.
Mustache: multi-scale detection of chromatin loops from Hi-C and Micro-C maps using scale-space representation
.
Genome Biol
 
2020
;
21
:
1
17
.

36.

Imakaev
M
,
Fudenberg
G
,
McCord
RP
, et al. .  
Iterative correction of Hi-C data reveals hallmarks of chromosome organization
.
Nat Methods
 
2012
;
9
(
10
):
999
1003
.

37.

Servant
N
,
Lajoie
BR
,
Nora
EP
, et al. .  
HiTC: exploration of high-throughput ‘c’ experiments
.
Bioinformatics
 
2012
;
28
(
21
):
2843
4
.

38.

Schmid
MW
,
Grob
S
,
Grossniklaus
U
.
HiCdat: a fast and easy-to-use Hi-C data analysis tool
.
BMC Bioinform
 
2015
;
16
:
1
6
.

39.

Lazaris
C
,
Kelly
S
,
Ntziachristos
P
, et al. .  
HiC-bench: comprehensive and reproducible Hi-C data analysis designed for parameter exploration and benchmarking
.
BMC Genom
 
2017
;
18
:
1
16
.

40.

Serra
F
,
Baù
D
,
Goodstadt
M
, et al. .  
Automatic analysis and 3D-modelling of hi-c data using TADbit reveals structural features of the fly chromatin colors
.
PLoS Comput Biol
 
2017
;
13
(
7
):
e1005665
.

41.

Sauria
MEG
,
Phillips-Cremins
JE
,
Corces
VG
,
Taylor
J
.
HiFive: a tool suite for easy and efficient HiC and 5C data analysis
.
Genome Biol
 
2015
;
16
:
1
10
.

42.

Castellano
G
,
Le Dily
F
,
Pulido
AH
,
Beato
M
,
Roma
G
.
HiC-inspector: a toolkit for high-throughput chromosome capture data
.
bioRxiv
.
2015
;
020636
.

43.

Yaffe
E
,
Tanay
A
.
Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture
.
Nat Genet
 
2011
;
43
(
11
):
1059
65
.

44.

Heinz
S
,
Benner
C
,
Spann
N
, et al. .  
Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities
.
Mol Cell
 
2010
;
38
(
4
):
576
89
.

45.

Wingett
S
,
Ewels
P
,
Furlan-Magaril
M
, et al. .  
HiCUP: pipeline for mapping and processing Hi-C data
.
F1000Research
 
2015
;
4
:
1310
.

46.

Lareau
CA
,
Aryee
MJ
.
Hichipper: a preprocessing pipeline for calling dna loops from HiChIP data
.
Nat Methods
 
2018
;
15
(
3
):
155
6
.

47.

Li
G
,
Fullwood
MJ
,
Han
X
, et al. .  
ChIA-PET tool for comprehensive Chromatin Interaction Analysis with Paired-End Tag sequencing
.
Genome Biol
 
2010
;
11
:
R22
13
.

48.

Kerpedjiev
P
,
Abdennur
N
,
Lekschas
F
, et al. .  
HiGlass: web-based visual exploration and analysis of genome interaction maps
.
Genome Biol
 
2018
;
19
(
1
):
1
12
.

49.

Robinson
JT
,
Turner
D
,
Durand
NC
, et al. .  
Juicebox.js provides a cloud-based visualization system for Hi-C data
.
Cell Syst
 
2018
;
6
(
2
):
256
258.e1
.

50.

Zhang
Y
,
Liu
T
,
Meyer
CA
, et al. .  
Model-based analysis of ChIP-seq (MACS)
.
Genome Biol
 
2008
;
9
(
9
):
1
9
.

51.

Knight
PA
,
Ruiz
D
.
A fast algorithm for matrix balancing
.
IMA J Numer Anal
 
2013
;
33
(
3
):
1029
47
.

52.

Durand
NC
,
Robinson
JT
,
Shamim
MS
, et al. .  
Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom
.
Cell Syst
 
2016
;
3
(
1
):
99
101
.

53.

Hwang
W-H
,
Stoklosa
J
,
Wang
C-Y
.
Population size estimation using zero-truncated poisson regression with measurement error
.
J Agric Biol Environ Stat
 
2022
;
27
:
303
20
.

54.

Li
H
.
Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM
.
arXiv preprint arXiv:1303.3997
.
2013
. https://doi.org/10.48550/arXiv.1303.3997.

55.

Paulsen
J
,
Rødland
EA
,
Holden
L
, et al. .  
A statistical model of ChIA-PET data for accurate detection of chromatin 3D interactions
.
Nucleic Acids Res
 
2014
;
42
(
18
):
e143
3
.

56.

Langmead
B
,
Salzberg
SL
.
Fast gapped-read alignment with Bowtie 2
.
Nat Methods
 
2012
;
9
(
4
):
357
9
.

57.

Li
H
,
Durbin
R
.
Fast and accurate short read alignment with burrows–wheeler transform
.
Bioinformatics
 
2009
;
25
(
14
):
1754
60
.

58.

Marco-Sola
S
,
Sammeth
M
,
Guigó
R
,
Ribeca
P
.
The gem mapper: fast, accurate and versatile alignment by filtration
.
Nat Methods
 
2012
;
9
(
12
):
1185
8
.

59.

Mumbach
MR
,
Rubin
AJ
,
Flynn
RA
, et al. .  
HiChIP: efficient and sensitive analysis of protein-directed genome architecture
.
Nat Methods
 
2016
;
13
(
11
):
919
22
.

60.

Dekker
J
,
Belmont
AS
,
Guttman
M
, et al. .  
The 4D nucleome project
.
Nature
 
2017
;
549
(
7671
):
219
26
.

61.

Snyder
MP
,
Gingeras
TR
,
Moore
JE
, et al. .  
Perspectives on encode
.
Nature
 
2020
;
583
(
7818
):
693
8
.

62.

Quinlan
AR
,
Hall
IM
.
BEDTools: a flexible suite of utilities for comparing genomic features
.
Bioinformatics
 
2010
;
26
(
6
):
841
2
.

63.

Pal
K
,
Forcato
M
,
Ferrari
F
.
Hi-C analysis: from data generation to integration
.
Biophys Rev
 
2019
;
11
:
67
78
.

64.

Eagen
KP
.
Principles of chromosome architecture revealed by Hi-C
.
Trends Biochem Sci
 
2018
;
43
(
6
):
469
78
.

65.

Ghandi
M
,
Lee
D
,
Mohammad-Noori
M
,
Beer
MA
.
Enhanced regulatory sequence prediction using gapped k-mer features
.
PLoS Comput Biol
 
2014
;
10
(
7
):
e1003711
.

66.

Li
Y
,
Shi
W
,
Wasserman
WW
.
Genome-wide prediction of cis-regulatory regions using supervised deep learning methods
.
BMC Bioinformatics
 
2018
;
19
(
1
):
1
14
.

67.

Naville
M
,
Ishibashi
M
,
Ferg
M
, et al. .  
Long-range evolutionary constraints reveal cis-regulatory interactions on the human x chromosome
.
Nat Commun
 
2015
;
6
(
1
):
6904
.

68.

Zhang
H
,
Li
F
,
Jia
Y
, et al. .  
Characteristic arrangement of nucleosomes is predictive of chromatin interactions at kilobase resolution
.
Nucleic Acids Res
 
2017
;
45
(
22
):
12739
51
.

69.

Cheng
RR
,
Contessoto
VG
,
Aiden
EL
, et al. .  
Exploring chromosomal structural heterogeneity across multiple cell lines
.
Elife
 
2020
;
9
:e60312, 1–21.

70.

Zhang
P
,
Yingfu
W
,
Zhou
H
, et al. .  
CLNN-loop: a deep learning model to predict CTCF-mediated chromatin loops in the different cell lines and CTCF-binding sites (CBS) pair types
.
Bioinformatics
 
2022
;
38
(
19
):
4497
504
.

71.

Yang
D
,
Chung
T
,
Kim
D
.
DeepLUCIA: predicting tissue-specific chromatin loops using deep learning-based universal chromatin interaction annotator
.
Bioinformatics
 
2022
;
38
(
14
):
3501
12
.

72.

Jost
D
,
Carrivain
P
,
Cavalli
G
,
Vaillant
C
.
Modeling epigenome folding: formation and dynamics of topologically associated chromatin domains
.
Nucleic Acids Res
 
2014
;
42
(
15
):
9553
61
.

73.

Salameh
TJ
,
Wang
X
,
Song
F
, et al. .  
A supervised learning framework for chromatin loop detection in genome-wide contact maps
.
Nat Commun
 
2020
;
11
(
1
):
3428
.

74.

Di Pierro
M
,
Cheng
RR
,
Aiden
EL
, et al. .  
De novo prediction of human chromosome structures: epigenetic marking patterns encode genome architecture
.
Proc Natl Acad Sci
 
2017
;
114
(
46
):
12126
31
.

75.

Fudenberg
G
,
Kelley
DR
,
Pollard
KS
.
Predicting 3D genome folding from dna sequence with Akita
.
Nat Methods
 
2020
;
17
(
11
):
1111
7
.

76.

Schwessinger
R
,
Gosden
M
,
Downes
D
, et al. .  
DeepC: predicting 3D genome folding using megabase-scale transfer learning
.
Nat Methods
 
2020
;
17
(
11
):
1118
24
.

77. Sefer E, Kingsford C. Semi-nonparametric modeling of topological domain formation from epigenetic data. Algorithms Mol Biol 2019;14:1–11.

78. Zhang R, Wang Y, Yang Y, et al. Predicting CTCF-mediated chromatin loops using CTCF-MP. Bioinformatics 2018;34(13):i133–41.

79. Kai Y, Andricovich J, Zeng Z, et al. Predicting CTCF-mediated chromatin interactions by integrating genomic and epigenomic features. Nat Commun 2018;9(1):4221.

80. Trieu T, Martinez-Fundichely A, Khurana E. DeepMILO: a deep learning approach to predict the impact of non-coding sequence variants on 3D chromatin structure. Genome Biol 2020;21:1–11.

81. Al Bkhetan Z, Plewczynski D. Three-dimensional epigenome statistical model: genome-wide chromatin looping prediction. Sci Rep 2018;8(1):5217.

82. Li W, Wong WH, Jiang R. DeepTACT: predicting 3D chromatin contacts via bootstrapping deep learning. Nucleic Acids Res 2019;47(10):e60.

83. Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics 2021;37(15):2112–20.

84. Chiliński M, Halder AK, Plewczynski D. Prediction of chromatin looping using deep hybrid learning (DHL). Quant Biol 2023;11(2):155–62.

85. Halder AK, Agarwal A, Korsak S, et al. ccLoopER: deep prediction of CTCF and Cohesin mediated chromatin looping using DNA transformer model. In: International Conference on Pattern Recognition and Machine Intelligence. Cham: Springer, 2023, 871–8.

86. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1, p. 4171–86. Minneapolis, Minnesota: Association for Computational Linguistics.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/pages/standard-publication-reuse-rights)