A systematic analysis of different bioinformatics pipelines for genomic data and their impact on deep learning models for chromatin loop prediction

Table 1

Pipeline details and their corresponding protocol with available scripts

Experimental bias	Pipelines	Repository	References
ChIA-PET	ChIA-PIPE	https://github.com/TheJacksonLaboratory/ChIA-PIPE	[19]
	Mango	https://github.com/dphansti/mango	[30]
	ChIA-PET2	https://github.com/GuipengLi/ChIA-PET2	[31]
	ChIA-PET Tool V3	https://github.com/GuoliangLi-HZAU/ChIA-PET_Tool_V3	[20]
	cLoops	https://github.com/YaqiangCao/cLoops	[32]
	cLoops2	https://github.com/KejiZhaoLab/cLoops2	[33]
	ChIAPoP	https://github.com/wh90999/ChIAPoP	[34]
HiC	HiC-Pro	https://github.com/nservant/HiC-Pro	[24]
	Juicer	https://github.com/theaidenlab/juicer	[14, 23]
	MUSTACHE	https://github.com/ay-lab/mustache	[35]
	hiclib	https://github.com/hiclib/iced	[36]
	HiTC	http://bioconductor.org/packages/release/bioc/html/HiTC.html	[37]
	HiCdat	https://github.com/MWSchmid/HiCdat	[38]
	HiC-bench	https://github.com/NYU-BFX/hic-bench	[39]
	TADbit	https://github.com/3DGenomes/tadbit	[40]
	HiFive	https://github.com/bxlab/hifive	[41]
	HiC-inspector	https://github.com/HiC-inspector/HiC-inspector	[42]
	hicpipe	https://compgenomics.weizmann.ac.il/tanay/?page_id=283	[43]
	HOMER	http://homer.ucsd.edu/homer/interactions/index.html	[44]
	HiCUP	https://www.bioinformatics.babraham.ac.uk/projects/hicup/	[45]
HiChIP	FitHiChIP	https://github.com/ay-lab/FitHiChIP	[21]
	MAPS	https://github.com/ijuric/MAPS	[22]
	hichipper	https://github.com/aryeelab/hichipper	[46]

ChIA-PIPE [19] is a Python-based pipeline for the analysis of ChIA-PET data. Mango [30] is a ChIA-PET analysis pipeline that corrects for known biases when calling interactions. ChIA-PET2 [31], implemented in Perl, analyzes Chromatin Interaction Analysis by Paired-End Tag Sequencing (ChIA-PET) data. ChIA-PET Tool V3 [20], implemented in Java, Perl and R, is a suite for analyzing ChIA-PET data. cLoops [32] is a versatile tool for calling loops in various chromatin interaction datasets, including ChIA-PET, HiChIP and high-resolution Hi-C data. cLoops2 [33] provides essential features such as peak-calling and loop-calling, accompanied by supplementary modules for interaction resolution, data similarity, feature quantification, aggregation analysis and visualization in 3D chromatin interaction datasets. ChIAPoP [34] employs a zero-truncated Poisson distribution and corrects for genomic distance and sequence bias; by incorporating distinct background distributions, it can discern both interchromosomal and intrachromosomal loops.

HiC-Pro [24], implemented in Python, is a comprehensive pipeline covering mapping, normalization and quality control. Juicer is a versatile Hi-C suite implemented in Java and Python, offering preprocessing, normalization and interaction calling [14, 23]. hicpipe [43] is a Python-based Hi-C data processing pipeline, while hiclib [36] is a Python library for mapping, filtering and chromatin interaction analysis. HOMER [44], implemented in Perl and C++, offers motif discovery, functional annotation and Hi-C data analysis. HiFive, implemented in Python and Cython, features functionalities for contact matrix normalization, TAD calling and differential analysis [41]. HiCdat [38], an R package, aids in visualizing and analyzing Hi-C data, exploring chromatin interactions and structure. HiTC, an R package, provides statistical tools and interactive visualization for Hi-C data analysis [37]. HiC-inspector is an interactive tool for exploring and visualizing Hi-C data, implemented in JavaScript [42]. HiC-bench [39], a benchmarking framework, is implemented in Perl and R. HiCUP [45], implemented in Perl, handles Hi-C data processing with a focus on mapping and quality control. TADbit [40], a Python library, is designed for analyzing TADs and chromatin interactions. Finally, MUSTACHE [35] detects loops as local enrichments relative to the surrounding pixels of the contact map; its fundamental concept is to treat the contact map as an image and represent it at multiple scales using Gaussian kernels.

FitHiChIP [21], implemented in R, calls peaks and performs statistical analysis to identify significant chromatin interactions from HiChIP data. MAPS [22] is a computational method implemented in R and Perl for identifying significant chromatin interactions from PLAC-seq and HiChIP data. Lastly, hichipper [46] is a Python-based HiChIP data processing pipeline with tools for mapping, filtering and interaction calling.

In this comprehensive review, we utilized only six distinct pipelines for loop-calling methodologies on 3C-based experimental data, selecting two from each group. This approach aims to elucidate the biases mitigated by these pipelines and provides insights into their functionality within the performance evaluation of deep learning models. The pipelines employed for dataset processing include the ChIA-PET biased pipelines (ChIA-PIPE [19] and ChIA-PET Tool V3 [20]), the Hi-C biased pipelines (Juicer [14, 23] and HiC-Pro [24]) and the HiChIP biased pipelines (FitHiChIP [21] and MAPS [22]). In the following section, we describe these six pipelines in detail and explore the data processing strategies employed. Figure 1 shows the hierarchical categorization of these pipelines considering the CTCF and Cohesin protein factors.

Figure 1

Hierarchical categorization of bioinformatics pipelines with two protein factors, CTCF and Cohesin, considering different experiment types and sources.

ChIA-PET biased pipeline

ChIA-PIPE is a fully automated pipeline for comprehensive ChIA-PET data analysis and visualization [19]. It streamlines the entire workflow, from data processing to analysis and visualization. ChIA-PIPE sets itself apart by integrating functionalities not found in earlier tools such as ChIA-PET Tool [47], Mango [30] and ChIA-PET2 [31]. It introduces a binding peak-calling module with heightened sensitivity and specificity, support for 2D contact map visualization tools such as HiGlass [48] and Juicebox [49], and a novel genome browser called BASIC Browser, optimized for high-resolution exploration of peaks, loops and strand-specific RNA sequencing. Moreover, ChIA-PIPE facilitates downstream structural interpretation, including the calling and visualization of chromatin contact domains (CCDs), annotation of enhancer–promoter (E-P) interactions, and the identification of haplotype-specific loops and peaks. With its fully automated approach and these features, ChIA-PIPE is a valuable tool for researchers engaged in ChIA-PET data analysis.

In this assessment, after merging all the sample replicates, the data were processed with the ChIA-PIPE pipeline using the default parameters, including the linker sequence (GTTGGATAAG) and the peak-calling algorithm (MACS2) [50]. This pipeline produced a high-resolution 2D contact matrix in '.hic' file format, along with annotated chromatin loops and their corresponding binding peak overlaps.

ChIA-PET Tool V3

The ChIA-PET Tool V3 [20] is a sophisticated tool for processing both short-read and long-read ChIA-PET data from paired-end reads. It excels in the generation of enriched binding peaks and the precise identification of chromatin interactions linked to proteins of interest. The tool’s strength lies in its robust error handling and quality assessment mechanisms, which encompass the creation of multiple log files and comprehensive statistics. This ensures meticulous traceability and confidence in the integrity of the processed data.

In the ChIA-PET Tool pipeline (version 3), flanking sequences after linker trimming were aligned to the human reference genome (hg38) using BWA-MEM (version 0.7.7) and only paired-end tags (PETs) that uniquely mapped (with mapping qualities $\geq$ 30) were retained. Self-ligation PETs were employed for binding site identification, while inter-ligation PETs were utilized for detecting long-range interactions.
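
The filtering and classification steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ChIA-PET Tool V3 code: the `Pet` record, the `classify_pets` helper and the 8 kb span cutoff separating self-ligation from inter-ligation PETs are hypothetical and introduced here only for illustration; only the mapping-quality threshold of 30 comes from the text.

```python
from dataclasses import dataclass

MIN_MAPQ = 30                   # threshold stated in the text
SELF_LIGATION_CUTOFF = 8_000    # hypothetical span cutoff in bp (assumption)


@dataclass(frozen=True)
class Pet:
    """One paired-end tag: mapped position and MAPQ of each end."""
    chrom1: str
    pos1: int
    mapq1: int
    chrom2: str
    pos2: int
    mapq2: int


def classify_pets(pets, min_mapq=MIN_MAPQ, cutoff=SELF_LIGATION_CUTOFF):
    """Drop low-quality PETs, then split the rest by genomic span."""
    self_ligation, inter_ligation = [], []
    for pet in pets:
        # Keep a PET only if both ends pass the mapping-quality filter.
        if min(pet.mapq1, pet.mapq2) < min_mapq:
            continue
        same_chrom = pet.chrom1 == pet.chrom2
        if same_chrom and abs(pet.pos2 - pet.pos1) < cutoff:
            self_ligation.append(pet)    # used for binding-site identification
        else:
            inter_ligation.append(pet)   # used for long-range interactions
    return self_ligation, inter_ligation


pets = [
    Pet("chr1", 1_000, 60, "chr1", 1_400, 60),      # short span
    Pet("chr1", 1_000, 60, "chr1", 250_000, 42),    # long span
    Pet("chr2", 5_000, 12, "chr2", 90_000, 60),     # one end below MAPQ 30
]
self_pets, inter_pets = classify_pets(pets)
```

In practice the span cutoff is derived from the data rather than fixed, so the constant above should be read as a placeholder.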

Juicer

Juicer is a software tool commonly used for analyzing the 3D structure of chromatin in a cell’s nucleus [14, 23]. This pipeline plays a key role in studying the genome organization, particularly how different regions of DNA are folded and interact with each other. Juicer works by processing high-throughput sequencing data, often generated through techniques like Hi-C. One of the key analyses performed by Juicer is loop calling. It identifies loops or specific interactions between different genomic regions. Loops can be indicative of functional elements in the genome, such as E-P interactions.

Juicer implements a normalization method known as Knight-Ruiz (KR) balancing to correct biases in Hi-C data [51]. This includes correction for GC content, mappability and other factors. The normalization step aims to remove systematic biases, enhancing the accuracy of interaction matrices and subsequent loop calling. In this study, all sample data were processed with Juicer [14, 52] and aligned against the hg38 reference genome. For subsequent analyses, all contact matrices were KR-normalized using Juicer. Loops were identified by HiCCUPS [14, 52] within the .hic files; loop calling was performed at a resolution of 5 kb, with subsequent merging in accordance with the procedure outlined in [14]. Loops with robust signals were defined as those with FDR $\leq$ 0.01 and observed counts $\geq$ 5.
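
The final filtering criterion can be written down directly. The sketch below is illustrative only: the `Loop` record and its single `fdr` field are simplifying assumptions made here (a loop caller may report several FDR values per loop), and only the two thresholds are taken from the text.

```python
from typing import NamedTuple


class Loop(NamedTuple):
    """A called loop; `fdr` stands in for the caller's reported FDR value."""
    chrom: str
    anchor1_start: int
    anchor2_start: int
    observed: int
    fdr: float


def robust_loops(loops, max_fdr=0.01, min_observed=5):
    """Keep loops with FDR <= max_fdr and observed counts >= min_observed."""
    return [lp for lp in loops if lp.fdr <= max_fdr and lp.observed >= min_observed]


loops = [
    Loop("chr1", 100_000, 250_000, observed=12, fdr=0.001),  # passes both
    Loop("chr1", 300_000, 900_000, observed=4, fdr=0.001),   # too few counts
    Loop("chr2", 50_000, 120_000, observed=30, fdr=0.05),    # FDR too high
    Loop("chr3", 10_000, 60_000, observed=5, fdr=0.01),      # exactly on both thresholds
]
kept = robust_loops(loops)
```

Both comparisons are inclusive, so a loop sitting exactly on the thresholds is retained.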

HiC-Pro

HiC-Pro is a versatile and efficient pipeline for processing Hi-C data [24]. It handles high-resolution datasets well and provides an optimized format for sharing contact maps. With an easy-to-use interface, HiC-Pro applies rigorous quality controls and covers the full processing of Hi-C data, from raw sequencing reads to normalized, ready-to-use genome-wide contact maps. It accommodates data from both restriction enzyme and nuclease digestion protocols. The intra- and inter-chromosomal contact maps generated by HiC-Pro closely resemble those generated by the hiclib package. The pipeline also incorporates an optimized iterative correction algorithm that significantly accelerates the normalization of Hi-C data.

HiC-Pro [24] normalizes interaction matrices through iterative correction, mitigating biases from library preparation and sequencing to enhance loop calling accuracy. It incorporates loop calling via the Directionality Index method and clustering, identifying loops through interaction directionality. Reads are aligned to GRCh38, with duplicates removed and reads allocated to MboI fragments, filtering for valid interactions and generating binned matrices. Loops are called using HiCCUPS with default settings, refining loop detection.
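
To make the idea of iterative correction concrete, the following is a minimal dense, pure-Python sketch of matrix balancing. It is not HiC-Pro's optimized sparse implementation; the function name `iterative_correction`, the fixed iteration count and the toy matrix are assumptions made for illustration, and the input is assumed to be a symmetric matrix with at least one non-empty row.

```python
def iterative_correction(matrix, n_iter=100):
    """Balance a symmetric contact matrix so all non-empty rows sum alike.

    Returns the corrected matrix and the accumulated per-bin bias vector.
    """
    n = len(matrix)
    m = [list(row) for row in matrix]   # work on a copy
    bias = [1.0] * n
    for _ in range(n_iter):
        sums = [sum(row) for row in m]
        nonzero = [s for s in sums if s > 0]
        mean = sum(nonzero) / len(nonzero)
        # Coverage of each bin relative to the mean; empty bins are left untouched.
        scale = [s / mean if s > 0 else 1.0 for s in sums]
        for i in range(n):
            for j in range(n):
                m[i][j] /= scale[i] * scale[j]
        bias = [b * s for b, s in zip(bias, scale)]
    return m, bias


# Toy example: a balanced matrix distorted by a known multiplicative bias.
balanced = [[2, 1, 1], [1, 2, 1], [1, 1, 2]]
true_bias = [1, 2, 4]
raw = [[balanced[i][j] * true_bias[i] * true_bias[j] for j in range(3)]
       for i in range(3)]
corrected, bias = iterative_correction(raw)
row_sums = [sum(row) for row in corrected]
```

After correction the row sums are equal and the recovered bias vector is proportional to the one used to distort the toy matrix, which is the property the normalization relies on.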

FitHiChIP

FitHiChIP [21] efficiently analyzes HiChIP data to detect chromatin interactions at protein binding sites, modelling read dependencies for significant statistical insights. Its data-driven methodology adeptly manages variable genomic coverage, identifying differential interactions across conditions to reveal chromatin architecture dynamics. With its user-friendly interface and efficient algorithms, FitHiChIP contributes to advancing our understanding of the intricate relationship between chromatin structure and transcriptional regulation.

HiC-Pro [24] processes raw FASTQ files to generate contact maps and validates ligation products, mapping them to the hg38 reference genome. Loop calling is then conducted in FitHiChIP [21] at a 5-kb resolution, ensuring uniform loop sizes. To align with FitHiChIP’s 1D peak prerequisites, MACS2 (version 2.2.9.1) [50] is utilized for peak calling in matching tissue conditions.

MAPS

MAPS [22], a computational approach named Model-based Analysis of PLAC-seq and HiChIP, has been devised to process experimental data and discern long-range chromatin interactions. Employing a zero-truncated Poisson regression framework [53], MAPS effectively removes systematic biases inherent in PLAC-seq and HiChIP datasets. Following this correction, the method utilizes normalized chromatin contact frequencies to pinpoint significant chromatin interactions anchored at genomic regions bound by the protein of interest. Significantly, MAPS exhibits superior performance when compared to existing software tools in the analysis of chromatin interactions across diverse PLAC-seq and HiChIP datasets associated with various transcription factors and histone marks.
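
For readers unfamiliar with the zero-truncated Poisson distribution, the sketch below shows its probability mass function and an upper-tail probability. It illustrates the distribution only, not the MAPS regression itself: the function names are hypothetical, and the rate `lam` is taken as given here, whereas in a regression framework it would be predicted from covariates.

```python
import math


def zero_truncated_poisson_pmf(k, lam):
    """P(X = k | X > 0) for X ~ Poisson(lam); non-zero only for k = 1, 2, ..."""
    if k < 1:
        return 0.0
    log_poisson = k * math.log(lam) - lam - math.lgamma(k + 1)
    # -expm1(-lam) equals 1 - exp(-lam), the probability that X > 0.
    return math.exp(log_poisson) / -math.expm1(-lam)


def upper_tail(k, lam):
    """P(X >= k | X > 0): chance of seeing at least k contacts by expectation alone."""
    return 1.0 - sum(zero_truncated_poisson_pmf(i, lam) for i in range(1, k))


total = sum(zero_truncated_poisson_pmf(k, 2.5) for k in range(1, 100))
p_value = upper_tail(10, 2.5)
```

The truncation reflects that bin pairs with zero observed contacts never enter the model, so probabilities are renormalized over counts of one or more.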

The model-based analysis of long-range chromatin interactions is applied to call significant chromatin interactions at a resolution of 5 kb. In this pipeline, 'bwa-mem' [54] was used to map the two ends of each paired-end read to the hg38 reference genome. Subsequently, the validly mapped reads were retained, and PCR duplicates were eliminated using 'samtools rmdup'. Following this, intra-chromosomal reads were segregated into two categories: short-range ($\leq$ 1 kb) and long-range ($>$ 1 kb) reads, where the long-range reads were employed to study protein-mediated chromatin interactions.
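
The split of intra-chromosomal read pairs can be sketched as below. This is an illustration rather than the MAPS code: the tuple layout and the `split_by_range` helper are hypothetical, and the long-range category is assumed to be the complement of the short-range one (distance above 1 kb).

```python
SHORT_RANGE_MAX = 1_000   # bp; short-range cutoff stated in the text


def split_by_range(read_pairs, cutoff=SHORT_RANGE_MAX):
    """Split intra-chromosomal read pairs into short- and long-range sets.

    Each read pair is a tuple (chrom1, pos1, chrom2, pos2).
    """
    short_range, long_range = [], []
    for chrom1, pos1, chrom2, pos2 in read_pairs:
        if chrom1 != chrom2:
            continue  # inter-chromosomal pairs are not part of this split
        if abs(pos2 - pos1) <= cutoff:
            short_range.append((chrom1, pos1, chrom2, pos2))
        else:
            long_range.append((chrom1, pos1, chrom2, pos2))
    return short_range, long_range


read_pairs = [
    ("chr1", 10_000, "chr1", 10_600),    # 600 bp apart
    ("chr1", 10_000, "chr1", 11_000),    # exactly 1 kb apart
    ("chr1", 10_000, "chr1", 480_000),   # 470 kb apart
    ("chr1", 10_000, "chr5", 10_500),    # different chromosomes
]
short_pairs, long_pairs = split_by_range(read_pairs)
```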

In this study, we investigate the variability in the chromatin loops produced by the loop-calling pipelines listed above. Each pipeline is biased towards a specific type of experimental dataset: Hi-C biased (HiCCUPS, HiC-Pro), HiChIP biased (FitHiChIP, MAPS) or ChIA-PET biased (ChIA-PIPE, ChIA-PET Tool V3). The goal is to understand and quantify this variability within the sets of significant interactions that serve as input to the deep learning model.

Pipelines within the same category differ significantly. For example, ChIA-PIPE utilizes a statistical model based on the non-central hypergeometric distribution [55], incorporating genomic distance, whereas ChIA-PET Tool V3 employs a hypergeometric distribution for interaction $P$-value calculations. For HiChIP analyses, FitHiChIP applies a regression model to correct biases related to coverage and genomic distance, followed by a binomial distribution for loop identification, contrasting with MAPS, which uses a zero-truncated Poisson regression model for bias adjustment. For Hi-C analyses, different statistical models are also employed: HiCCUPS leverages Poisson regression to eliminate known biases for loop calling, while HiC-Pro utilizes a fast, sparse iterative correction technique. These distinctions are further amplified by other factors such as alignment tools (e.g. Bowtie2 [56], BWA [57], GEM [58]) and the programming languages used.
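
As an illustration of a hypergeometric interaction $P$-value, the sketch below computes the upper tail of a central hypergeometric distribution. The parameterization is a simplifying assumption made here (total PETs, PETs touching anchor A, PETs touching anchor B and PETs linking both) and may differ from the exact counts each tool uses; the function name is hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n_total, c_a, c_b):
    """P(X >= k) for X ~ Hypergeometric(population n_total, c_a marked, c_b drawn).

    k       : PETs observed linking anchors A and B
    n_total : total number of PETs
    c_a     : PETs with an end in anchor A
    c_b     : PETs with an end in anchor B
    """
    upper = min(c_a, c_b)
    favourable = sum(
        comb(c_a, i) * comb(n_total - c_a, c_b - i) for i in range(k, upper + 1)
    )
    return favourable / comb(n_total, c_b)


# Small case that can be checked by hand: (C(4,2)*C(6,1) + C(4,3)*C(6,0)) / C(10,3)
p_small = hypergeom_upper_tail(2, 10, 4, 3)

# A larger, made-up example: 3 linking PETs out of 1000 in total.
p_example = hypergeom_upper_tail(3, 1000, 20, 30)
```

A small $P$-value indicates that the two anchors are linked by more PETs than expected from their individual coverage alone; the non-central variant additionally weights pairs, for instance by genomic distance.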

Dataset variability across different pipelines and sources

In this study, we performed a comprehensive analysis of the dataset, classifying chromatin interaction data according to protein factor, experiment source and the specific pipeline employed. The loop data were collected and evaluated using these pipelines, focusing on two protein factors, CTCF and Cohesin. We considered two publicly available cell lines, GM12878 and HG00731, both representing B-lymphocyte cell types. The GM12878 datasets were acquired through two experimental techniques, ChIA-PET [29] and HiChIP [59], whereas the HG00731 datasets were obtained solely via HiChIP. For each protein, the experimental data were collected from three different sources to exemplify the variability of the datasets: CTCF data are sourced from Mumbach [59], LGFS (in-house) and 4DN [60], while Cohesin data are obtained from Mumbach, LGFS (in-house) and ENCODE [61]. Consequently, this assessment comprised 18 (6 pipelines $\times$ 3 sources) datasets for CTCF and 18 (6 pipelines $\times$ 3 sources) datasets for Cohesin protein-specific chromatin interactions, yielding a total of 36 distinct datasets. The detailed statistics of the data, along with their corresponding categorizations, are outlined in Table 2. Of the six pipelines, ChIA-PET Tool V3 extracts the greatest number of interactions for most sources, whereas MAPS generally exhibits a lower coverage of interactions. Notably, the in-house LGFS data deliver the highest coverage of chromatin interactions, followed by Mumbach and then 4DN (for CTCF) or ENCODE (for Cohesin). To illustrate the relationship in terms of interaction coverage for both protein factors, a Pearson correlation heatmap is shown in Figure 2. The heatmaps show high correlations among pipelines for both CTCF and Cohesin protein-specific interactions. Specifically, Juicer and MAPS represent a subset of ChIA-PET Tool V3 in terms of chromatin interaction coverage, with a correlation of 1. The intersection between each pair of datasets is computed using BEDTools [62] to obtain a unique count of overlapping interacting regions.
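
The notion of a unique count of overlapping regions can be illustrated with a small pure-Python stand-in. This is not BEDTools, which is far more efficient and offers many more options; the `count_overlapping` helper, the half-open interval convention and the example coordinates are assumptions made here.

```python
def count_overlapping(regions_a, regions_b):
    """Count regions in A that overlap at least one region in B.

    Regions are (chrom, start, end) tuples with half-open coordinates,
    and each region in A is counted at most once.
    """
    by_chrom = {}
    for chrom, start, end in regions_b:
        by_chrom.setdefault(chrom, []).append((start, end))
    count = 0
    for chrom, start, end in regions_a:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(s < end and start < e for s, e in by_chrom.get(chrom, [])):
            count += 1
    return count


anchors_a = [("chr1", 0, 100), ("chr1", 200, 300), ("chr2", 0, 50)]
anchors_b = [("chr1", 50, 150), ("chr1", 60, 90), ("chr2", 50, 60)]
n_shared = count_overlapping(anchors_a, anchors_b)
```

Counting each region once, even when it overlaps several regions in the other set, is what makes the count unique; regions that merely touch at a boundary are not treated as overlapping under the half-open convention.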

Figure 2

Correlation (Pearson) between the datasets of chromatin interaction loops among the 6 (pipelines) $\times$ 3 (sources) datasets for CTCF and Cohesin protein factors.

Table 2

In-depth statistical information and dataset variability presented with a hierarchical arrangement of categorization factors

Protein Factor	Experiment	DataSource	Cell lines	Applied pipeline	# Chromatin region	Interactions
CTCF	HiChIP	Mumbach	GM12878	P1_ChIA-PIPE	459 341	236 996
				P2_ChIA-PET-V3	519 079	5 015 273
				P3_FitHiChIP	8662	6928
				P4_MAPS	110 461	178 535
				P5_Juicer	64 696	59 751
				P6_HiC-Pro	26 050	92 948
		LGFS	HG00731	P1_ChIA-PIPE	90 178	45 881
				P2_ChIA-PET-V3	547 229	11 860 210
				P3_FitHiChIP	54 215	82 132
				P4_MAPS	26 212	28 013
				P5_Juicer	35 813	26 590
				P6_HiC-Pro	32 690	210 336
	ChIA-PET	4DN	GM12878	P1_ChIA-PIPE	203 475	102 680
				P2_ChIA-PET-V3	70 973	328 562
				P3_FitHiChIP	29 348	30 528
				P4_MAPS	96 670	135 921
				P5_Juicer	37 421	23 639
				P6_HiC-Pro	33 894	183 934
Cohesin	HiChIP	Mumbach	GM12878	P1_ChIA-PIPE	184 111	92 248
				P2_ChIA-PET-V3	249 566	5 339 103
				P3_FitHiChIP	66 223	89 784
				P4_MAPS	32 519	36 536
				P5_Juicer	29 732	19 478
				P6_HiC-Pro	40 524	1 048 575
		LGFS	HG00731	P1_ChIA-PIPE	299 871	151 404
				P2_ChIA-PET-V3	528 015	10 825 471
				P3_FitHiChIP	185 931	827 288
				P4_MAPS	61 352	80 488
				P5_Juicer	51 253	43 148
				P6_HiC-Pro	236 854	11 053 952
	ChIA-PET	ENCODE	GM12878	P1_ChIA-PIPE	48 487	69 955
				P2_ChIA-PET-V3	80 995	272 053
				P3_FitHiChIP	133 000	276 277
				P4_MAPS	8803	7205
				P5_Juicer	15 415	8767
				P6_HiC-Pro	121 776	3 553 840

For an identical experiment, data source and cell line, we observe differences in the number of chromatin loops. For example, for GM12878 CTCF data from Mumbach et al., the interaction coverage differs significantly across the six pipelines. These differences arise because each pipeline uses a different statistical model for predicting chromatin loops, may use ChIP-seq data from other experimental samples to locate loop anchors, applies a different distance cutoff to distinguish intra-ligation from inter-ligation PETs, and uses a different clustering algorithm to identify loop anchors and PETs.

Regarding differences in loop numbers across datasets for a given pipeline, contact map resolution is an important factor for loop-calling methods and may have a significant impact on loop numbers. In our study, FitHiChIP, MAPS, HiCCUPS and HiC-Pro used fixed-size genomic bins (5 kb), whereas the ChIA-PET based pipelines generated contact matrix files at multiple resolutions. Also, Hi-C anchors typically range from 5 to 100 kb in length, with rare cases involving extremely deep sequencing resulting in anchors as small as 1 kb [63, 64]. On the other hand, HiChIP and ChIA-PET anchors typically span several kilobase pairs [47]. The difference in resolution may introduce considerable noise into training datasets; consequently, we opted to directly utilize the chromatin interaction anchors.
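
The effect of bin size on the number of distinct interactions can be seen in a small sketch. The helper name, the input layout and the coordinates below are hypothetical; only the 5 kb bin size is taken from the text.

```python
from collections import Counter

BIN_SIZE = 5_000   # fixed bin size used by the bin-based pipelines in this study


def bin_interactions(pairs, bin_size=BIN_SIZE):
    """Aggregate (chrom, pos1, pos2) contacts into fixed-size bin pairs.

    Returns a Counter keyed by (chrom, bin1_start, bin2_start), bin1 <= bin2.
    """
    counts = Counter()
    for chrom, pos1, pos2 in pairs:
        b1, b2 = sorted((pos1 // bin_size, pos2 // bin_size))
        counts[(chrom, b1 * bin_size, b2 * bin_size)] += 1
    return counts


pairs = [
    ("chr1", 1_200, 52_300),
    ("chr1", 54_999, 4_999),   # same bin pair as above, ends given in reverse order
    ("chr1", 5_000, 50_000),   # neighbouring bin at 5 kb resolution
]
at_5kb = bin_interactions(pairs)
at_10kb = bin_interactions(pairs, bin_size=10_000)
```

The same three contacts form two distinct bin pairs at 5 kb but collapse into a single one at 10 kb, which is one way resolution alone can change the reported loop numbers.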

In silico chromatin loop prediction approaches

A thorough investigation into the DNA sequence and epigenomic features, encompassing elements such as transcription factor binding sites, chromatin accessibility and histone modification, greatly aids in pinpointing active cis-regulatory elements (CREs) [65, 66]. Numerous computational approaches have surfaced for forecasting chromatin interactions within CREs, taking into consideration factors such as evolutionary traits [67] and the unique arrangement of nucleosomes [68]. Recent investigations propose that valuable information for predicting 3D chromatin architecture can be derived from both DNA sequence and 1D epigenomic modifications. In response to these insights, researchers have applied machine learning (ML) and polymer physics simulation methods to anticipate 3D chromatin organizations [69]. We have presented a brief overview of these approaches in Table 3.

Table 3

Computational methods for predicting 3D chromatin organizations

Method	Applied models/architecture	Description
CLNN-loop [70]	Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM)	Predicts chromatin loops across various cell lines and CTCF-binding site (CBS) pair types by integrating multiple sequence-based features.
DeepLUCIA [71]	CNN	Predicts chromatin loops in tissues utilizing epigenomic data.
Block Copolymer Model [72]	Polymer Physics Simulation	Explains TADs formation and dynamics using the epigenomic landscape.
Peakachu [73]	Random Forest	Predicts chromatin loops from contact maps, identifying unique interactions across species, supported by 56 Hi-C datasets.
Pierro et al. [74]	Histone Modification Analysis	Shows that histone modification data can predict chromatin arrangement.
Akita [75]	CNN	Predicts genome folding from DNA sequence, emphasizing CTCF binding sites orientation for in silico analysis.
DeepC [76]	Transfer Learning	Predicts genome folding from DNA sequences, identifying domain boundaries and impacts of genetic variations.
Sefer and Kingsford [77]	Histone Marker Analysis	Reveals that incorporating sequence features significantly enhances chromatin organization prediction.
CTCF-MP [78]	Word2vec and Boosted Trees	Leverages CTCF ChIP-seq and DNase-seq data to predict chromatin loops.
Lollipop [79]	Random Forest	Distinguishes CTCF-mediated loops from noninteracting ones using genomic and epigenomic data.
DeepMILO [80]	CNN and RNN	Models CTCF-mediated insulator loops and predicts effects of noncoding variants using DNA sequences.
3DEpiLoop [81]	Random Forest	Predicts 3D chromatin interaction in TADs at 1 kb resolution using epigenomic profiles.
DeepTACT [82]	Deep Learning	Integrates genome sequence and chromatin data to predict enhancer-promoter interactions.
DNABERT [83]	Transformer-based NLP	Classifies, annotates and analyses DNA sequences, capturing complex patterns for genomics research.
DHL algorithm [84]	Transformer and ML Algorithms	Combines DNABERT with RF, SVM and KNN for sophisticated chromatin loop prediction.
ccLoopER [85]	DNA Language Model	Utilizes a DNA-based transformer model to predict chromatin loops mediated by CTCF and Cohesin.

Deep learning model specific assessment

Deep learning models have emerged as a promising approach to predicting chromatin interactions and loops. These models use advanced neural network architectures trained on large volumes of genomic data to predict the spatial relationships between distant genomic elements. A major strength is their ability to capture intricate, non-linear associations among genomic features, which traditional statistical methods often struggle to model.

Deep transformer models such as BERT [86] have significantly improved genomic sequence analysis, particularly in chromatin interaction prediction. Their ability to learn from large datasets and identify complex DNA patterns results in highly accurate models. Moreover, BERT’s adaptability to specific research tasks, from identifying regulatory elements to predicting chromatin interactions, showcases its versatility in genomics. In this case study, we used DNABERT [83] as the pre-trained model and fine-tuned it on the dataset described above. Fine-tuning was performed for the final classification task, using adaptable sequence fragments from two distinct chromatin regions.

Genomic interactions are represented as connections between distant points within the genome, where these distant regions serve as anchor points. For each chromatin interaction, we identified the highest-scoring CTCF motifs and their respective orientations within the anchor regions. For Cohesin, we used the cohesin peaks with the highest strength to select significant sub-regions from the anchors; peak sets from three sources (Mumbach, LGFS and ENCODE) were used for this selection. Interactions lacking motif regions in both anchors were excluded from the learning process. The resulting interactions are presented in Table 2.
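The anchor annotation step can be illustrated with a minimal sketch. The `Motif` and `Anchor` records, the function names and the rule that a motif must lie fully inside the anchor are assumptions made for illustration; the text does not give the exact data layout. The exclusion rule is implemented in its literal reading (drop an interaction only when neither anchor contains a motif); a stricter reading would drop it when either anchor is empty.

```python
from typing import List, NamedTuple, Optional, Tuple


class Motif(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str   # '+' or '-', the orientation of the CTCF motif
    score: float  # motif match score (higher is better)


class Anchor(NamedTuple):
    chrom: str
    start: int
    end: int


def best_motif(anchor: Anchor, motifs: List[Motif]) -> Optional[Motif]:
    """Return the highest-scoring motif lying fully inside the anchor, or None."""
    inside = [
        m for m in motifs
        if m.chrom == anchor.chrom and m.start >= anchor.start and m.end <= anchor.end
    ]
    return max(inside, key=lambda m: m.score, default=None)


def annotate_interactions(
    interactions: List[Tuple[Anchor, Anchor]], motifs: List[Motif]
) -> List[Tuple[Anchor, Anchor, Optional[Motif], Optional[Motif]]]:
    """Attach the best motif to each anchor; drop interactions with no motif in both anchors."""
    kept = []
    for left, right in interactions:
        m_left, m_right = best_motif(left, motifs), best_motif(right, motifs)
        if m_left is None and m_right is None:
            continue  # literal reading of the exclusion rule (assumption)
        kept.append((left, right, m_left, m_right))
    return kept
```

The same selection could be applied to Cohesin peaks by replacing the motif score with the peak strength.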

From each anchor region, a 250-bp genomic subsequence is extracted around the CTCF motif or Cohesin peak. A chromatin loop is then represented as a pair of these subsequences separated by a [SEP] token, for a total of 500 bp. For the negative dataset, genomic regions are chosen at random, ensuring no overlap with positive or transitive interactions. Chromosome 9 is held out for validation and the remaining 22 chromosomes are used for training, with a 1:1 ratio of positive to negative examples to keep the training data balanced.
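The dataset construction described above can be sketched as follows. This is an illustrative outline, not the authors' code: the `(chromosome, left centre, right centre)` loop layout, the in-memory `genome` dictionary and the rejection-sampling scheme are assumptions, and the exclusion of transitive interactions is not modelled. DNABERT tokenizes sequences into overlapping k-mers; the value k = 6 used here is an assumption, since the text does not state it.

```python
import random
from typing import Dict, List, Tuple

HALF = 250  # bases taken from each anchor, as stated in the text
Loop = Tuple[str, int, int]  # (chromosome, left centre, right centre); hypothetical layout


def window(genome: Dict[str, str], chrom: str, centre: int, size: int = HALF) -> str:
    """Cut a `size`-bp subsequence centred on `centre`, clamped to the chromosome ends."""
    seq = genome[chrom]
    start = max(0, min(centre - size // 2, len(seq) - size))
    return seq[start:start + size]


def to_kmers(seq: str, k: int = 6) -> List[str]:
    """Split a sequence into overlapping k-mers (k = 6 is an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def make_pair(genome: Dict[str, str], loop: Loop) -> str:
    """Join the two 250-bp anchor windows with a [SEP] token (500 bp in total)."""
    chrom, left, right = loop
    return window(genome, chrom, left) + " [SEP] " + window(genome, chrom, right)


def sample_negatives(genome: Dict[str, str], positives: List[Loop],
                     n: int, rng: random.Random) -> List[Loop]:
    """Draw n random same-chromosome pairs whose windows overlap no positive window."""
    taken: Dict[str, List[int]] = {}
    for chrom, left, right in positives:
        taken.setdefault(chrom, []).extend([left, right])
    chroms = sorted(genome)
    negatives: List[Loop] = []
    while len(negatives) < n:
        chrom = rng.choice(chroms)
        length = len(genome[chrom])
        left, right = sorted(rng.randrange(HALF, length - HALF) for _ in range(2))
        if right - left < 2 * HALF:
            continue  # the two anchors would be too close to each other
        if any(abs(p - q) <= HALF for p in (left, right) for q in taken.get(chrom, [])):
            continue  # window would overlap a positive anchor window
        negatives.append((chrom, left, right))
    return negatives


def split_by_chromosome(loops: List[Loop], held_out: str = "chr9"):
    """Hold out one chromosome for validation; train on all others."""
    train = [loop for loop in loops if loop[0] != held_out]
    valid = [loop for loop in loops if loop[0] == held_out]
    return train, valid
```

Calling `sample_negatives` with `n = len(positives)` gives the 1:1 class ratio, and splitting by chromosome rather than at random keeps sequence windows from the validation chromosome out of training.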

In this assessment, a separate model was trained on each of the 36 individual datasets, generated from six pipelines for two protein factors and three sources each. The performances are plotted in Figure 3. In the CTCF-based datasets, ChIA-PET Tool V3 and FitHiChIP show higher accuracy and area under the receiver operating characteristic curve (AUC), followed by Juicer. Among the data sources, LGFS shows a markedly better AUC than Mumbach and 4DN, with the exception of MAPS. In the Cohesin-based datasets, Mumbach performs better in almost all pipelines, with only two exceptions: Juicer and HiC-Pro.
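For concreteness, the evaluation grid and the two reported metrics can be sketched in a few lines. The pipeline and source names are taken from the text; the function names, the 0.5 decision threshold and the pairwise AUC formulation are illustrative assumptions rather than the authors' evaluation code.

```python
from itertools import product
from typing import List, Sequence, Tuple

PIPELINES = ["ChIA-PIPE", "ChIA-PET Tool V3", "FitHiChIP", "MAPS", "Juicer", "HiC-Pro"]
SOURCES = {
    "CTCF": ["Mumbach", "LGFS", "4DN"],
    "Cohesin": ["Mumbach", "LGFS", "ENCODE"],
}


def dataset_grid() -> List[Tuple[str, str, str]]:
    """Enumerate every (factor, source, pipeline) combination: 2 x 3 x 6 = 36 datasets."""
    return [
        (factor, source, pipeline)
        for factor, sources in SOURCES.items()
        for source, pipeline in product(sources, PIPELINES)
    ]


def accuracy(labels: Sequence[int], scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of examples whose thresholded score matches the label."""
    hits = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return hits / len(labels)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as one half. The double loop is quadratic, which is fine for a sketch.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because the positive and negative classes are balanced 1:1, accuracy is a meaningful summary here, while AUC additionally shows how well the scores rank loops independently of the chosen threshold.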

Figure 3

Chromatin Interaction Prediction performance across six pipelines with two protein factors, CTCF and Cohesin, considering different experiment types and sources.

To assess the overall performance of all CTCF- and Cohesin-based datasets, a subset of 1030 chromatin interactions common to all pipelines and sources was extracted for validation. The performance of the individual pipelines is presented in Figure 4. A three-level consensus strategy is used to enhance the coverage of these interactions.

Figure 4

Chromatin Interaction Prediction on the set of shared interactions and the predicted performance across all pipelines with three-level consensus-based performance.

In the first level, the consensus strategy focuses on harmonizing the data from three different sources for both CTCF and Cohesin—namely, Mumbach, LGFS and 4DN for CTCF; and Mumbach, LGFS and ENCODE for Cohesin—across all six pipelines. This approach allows for the integration of diverse datasets, mitigating source-specific biases and enhancing the reliability of the interaction data. The outcomes of this initial consensus are visually represented in Figure 4 through Venn diagrams, which delineate the coverage of interactions specific to each source as well as those identified through the consensus approach.

The second consensus level addresses experimental biases arising from the different technologies: ChIA-PET (via ChIA-PIPE and ChIA-PET Tool V3), HiChIP (via FitHiChIP and MAPS) and Hi-C (via Juicer and HiC-Pro). A consensus is formed within each technology group, using the pre-computed models to balance their individual biases and strengths. This step is crucial for creating a consensus model that accurately represents a wide range of experimental interactions.

The final consensus level merges the results of the previous steps, markedly improving chromatin interaction coverage to 93.11% for CTCF and 88.55% for Cohesin, as shown in Figure 4. This is a substantial improvement over the individual pipelines and demonstrates the effectiveness of the consensus approach in providing a fuller view of chromatin interactions.
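The three consensus levels can be illustrated with a small sketch. The text does not state how predictions are combined at each level; the version below assumes a simple union of the shared interactions that each model predicts correctly, which is one way coverage can grow from level to level. A majority vote would be an equally plausible rule. The function names and the input layout are hypothetical.

```python
from typing import Dict, Set, Tuple

# Pipeline-to-technology grouping as listed in the text.
TECHNOLOGY = {
    "ChIA-PIPE": "ChIA-PET",
    "ChIA-PET Tool V3": "ChIA-PET",
    "FitHiChIP": "HiChIP",
    "MAPS": "HiChIP",
    "Juicer": "Hi-C",
    "HiC-Pro": "Hi-C",
}


def three_level_consensus(hits: Dict[Tuple[str, str], Set[int]]):
    """Combine per-model hits by union (an assumed rule, not the authors' stated one).

    `hits` maps (pipeline, source) to the ids of shared interactions that the
    model trained on that dataset predicted correctly.
    """
    # Level 1: merge the three sources within each pipeline.
    per_pipeline: Dict[str, Set[int]] = {}
    for (pipeline, _source), ids in hits.items():
        per_pipeline.setdefault(pipeline, set()).update(ids)

    # Level 2: merge the pipelines that share an experimental technology.
    per_technology: Dict[str, Set[int]] = {}
    for pipeline, ids in per_pipeline.items():
        per_technology.setdefault(TECHNOLOGY[pipeline], set()).update(ids)

    # Level 3: merge across technologies into the final consensus.
    final: Set[int] = set().union(*per_technology.values())
    return per_pipeline, per_technology, final


def coverage(ids: Set[int], total: int) -> float:
    """Percentage of the shared interaction set that is covered."""
    return 100.0 * len(ids) / total
```

With the 1030 shared interactions as `total`, `coverage(final, 1030)` would correspond to the kind of percentage reported for the final level, under the union assumption.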

The deployment of a multi-level consensus strategy in assessing the performance of different datasets on CTCF and Cohesin interactions not only facilitates a more accurate and inclusive detection of chromatin interactions but also showcases the potential of collaborative approaches in overcoming the limitations inherent to individual experimental pipelines. This holistic assessment methodology provides critical insights into the chromatin interaction landscape, paving the way for further advancements in the field of genomics and epigenomics.

CONCLUSION

Our study conducted a thorough analysis of chromatin interaction datasets, systematically categorizing the data by protein factor, experiment source and pipeline. We focused on the key proteins CTCF and Cohesin and considered two publicly available cell lines, GM12878 and HG00731, to represent B-lymphocyte cell types. Experimental data were obtained from different sources, emphasizing the variability in datasets. The assessment included a total of 36 distinct datasets (18 each for CTCF and Cohesin) generated from six pipelines across three sources. ChIA-PET Tool V3 consistently extracted a greater number of interactions, while the in-house LGFS strategy demonstrated superior coverage of chromatin interactions, especially for CTCF. A BERT-based deep learning model was applied to the resulting 36 individual datasets, and its performance was assessed in terms of accuracy and AUC to gain insights into the potential of these datasets for in silico loop prediction. ChIA-PET Tool V3 and FitHiChIP demonstrated higher accuracy in CTCF-based datasets, with LGFS exhibiting superior AUC among the data sources. Mumbach consistently performed well in Cohesin-based datasets across most pipelines. The overall assessment, based on a subset of 1030 chromatin interactions common to all pipelines and sources, revealed robust interaction coverage through a three-level consensus strategy. This strategy, involving source-specific consensus and subsequent experimental bias-based consensus, resulted in a substantial improvement in interaction coverage for both CTCF and Cohesin, surpassing the coverage of individual pipelines. These findings underscore the significance of integrating multiple sources and employing consensus strategies to enhance the reliability and coverage of chromatin interaction predictions in genomics research.
In future work, we plan to examine learning strategies for both homogeneous and heterogeneous datasets and to perform further validations of the robustness of our methodology, including cross-learning and self-learning evaluations of the model.

Key Points

We have outlined bioinformatics pipelines tailored for chromatin interaction and loop calling, focusing on experimental bias sources related to CTCF and Cohesin proteins.
Chromatin interaction coverage was assessed for two proteins, CTCF and Cohesin, across 18 combinations of pipelines and sources.
We explored and discussed predictive approaches for in silico chromatin loop identification.
Utilizing a deep transformer-based model, we evaluated prediction performance on the datasets and conducted performance analyses through a three-level consensus strategy across 18 combinations.

AUTHOR CONTRIBUTIONS

Anup Kumar Halder (Conceptualization-Equal, Data curation-Supporting, Formal analysis-Lead, Investigation-Lead, Methodology-Lead, Validation-Equal, Visualization-Lead, Writing – original draft-Lead, Writing – review & editing-Equal), Abhishek Agarwal (Conceptualization-Supporting, Data curation-Lead, Formal analysis-Supporting, Investigation-Supporting, Writing – original draft-Equal), Karolina Jodkowska (Data production-Lead, Investigation-Supporting, Validation-Supporting, Writing – review & editing-Supporting), Dariusz Plewczynski (Conceptualization-Equal, Project administration-Lead, Supervision-Lead, Writing – review & editing-Equal).

FUNDING

The research was co-funded by Warsaw University of Technology within the Excellence Initiative: Research University (IDUB) programme. The work has been co-supported by the EU-funded Marie Sklodowska-Curie Action (MSCA) Innovative Training Network Enhpathy (www.enhpathy.eu), “Molecular Basis of Human enhanceropathies”, and by the National Institutes of Health (USA) 4DNucleome grant 1U54DK107967-01, “Nucleome Positioning System for Spatiotemporal Genome Organization and Regulation”. This work has also been co-supported by the Polish National Science Centre (2019/35/O/ST6/02484, 2020/37/B/NZ2/03757). Computations were performed thanks to the Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, using the Artificial Intelligence HPC platform financed by the Polish Ministry of Science and Higher Education (decision no. 7054/IA/SP/2020 of 2020-08-28).

Author Biographies

Anup Kumar Halder is a postdoctoral fellow at the Faculty of Mathematics and Information Sciences, Warsaw University of Technology, Warsaw, Poland. He is also associated with the Laboratory of Functional and Structural Genomics at the Center of New Technologies (CeNT), University of Warsaw, Poland.

Abhishek Agarwal is a PhD student in the Laboratory of Functional and Structural Genomics at the Center of New Technologies (CeNT), University of Warsaw, Poland.

Karolina Jodkowska is a postdoctoral fellow in the Laboratory of Functional and Structural Genomics at the Center of New Technologies (CeNT), University of Warsaw, Warsaw, Poland.

Dariusz Plewczynski is a Professor of Exact and Natural Sciences at Warsaw University of Technology as Principal Investigator and Head. He is also associated with the University of Warsaw in the Center of New Technologies CeNT, Warsaw, Poland, as the head of the Laboratory of Functional and Structural Genomics.

References

1.

Pederson

T

.

Chromatin structure and the cell cycle

.

Proc Natl Acad Sci

1972

;

69

(

8

):

2224

–

8

.

2.

Dixon

JR

,

Jie

X

,

Dileep

V

, et al. .

Integrative detection and analysis of structural variation in cancer genomes

.

Nat Genet

2018

;

50

(

10

):

1388

–

98

.

3.

Dileep

V

,

Ay

F

,

Sima

J

, et al. .

Topologically associating domains and their long-range contacts are established during early G1 coincident with the establishment of the replication-timing program

.

Genome Res

2015

;

25

(

8

):

1104

–

13

.

4.

Beagrie

RA

,

Pombo

A

.

Continuous chromatin changes

.

Nature

2017

;

547

(

7661

):

34

–

5

.

5.

Chiliński

M

,

Sengupta

K

,

Plewczynski

D

.

From DNA human sequence to the chromatin higher order organisation and its biological meaning: using biomolecular interaction networks to understand the influence of structural variation on spatial genome organisation and its functional effect

. In:

Seminars in Cell & Developmental Biology

, Vol.

121

.

Academic: Elsevier

,

2022

,

171

–

85

.

6.

Roy

SS

,

Mukherjee

AK

,

Chowdhury

S

.

Insights about genome function from spatial organization of the genome

.

Hum Genom

2018

;

12

(

1

):

8

.

7.

Kadauke

S

,

Blobel

GA

.

Chromatin loops in gene regulation

.

Biochim Biophys Acta

2009

;

1789

(

1

):

17

–

25

.

8.

Jerković

I

,

Szabo

Q

,

Bantignies

F

,

Cavalli

G

.

Higher-order chromosomal structures mediate genome function

.

J Mol Biol

2020

;

432

(

3

):

676

–

81

.

9.

Sengupta

K

,

Denkiewicz

M

,

Chiliński

M

, et al. .

Multi-scale phase separation by explosive percolation with single-chromatin loop resolution

.

Comput Struct Biotechnol J

2022

;

20

:

3591

–

603

.

10.

Bonev

B

,

Cavalli

G

.

Organization and function of the 3D genome

.

Nat Rev Genet

2016

;

17

(

11

):

661

–

78

.

11.

Zheng

H

,

Xie

W

.

The role of 3D genome organization in development and cell differentiation

.

Nat Rev Mol Cell Biol

2019

;

20

(

9

):

535

–

50

.

12.

Marchal

C

,

Sima

J

,

Gilbert

DM

.

Control of DNA replication timing in the 3D genome

.

Nat Rev Mol Cell Biol

2019

;

20

(

12

):

721

–

37

.

13.

Lieberman-Aiden

E

,

Van Berkum

NL

,

Williams

L

, et al. .

Comprehensive mapping of long-range interactions reveals folding principles of the human genome

.

Science

2009

;

326

(

5950

):

289

–

93

.

14.

Rao

SSP

,

Huntley

MH

,

Durand

NC

, et al. .

A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping

.

Cell

2014

;

159

(

7

):

1665

–

80

.

15.

Li

G

,

Ruan

X

,

Auerbach

RK

, et al. .

Extensive promoter-centered chromatin interactions provide a topological basis for transcription regulation

.

Cell

2012

;

148

(

1

):

84

–

98

.

PubMed

16.

Tang

Z

,

Luo

OJ

,

Li

X

, et al. .

CTCF-mediated human 3D genome architecture reveals chromatin topology for transcription

.

Cell

2015

;

163

(

7

):

1611

–

27

.

17.

Giambartolomei

C

,

Seo

J-H

,

Schwarz

T

, et al. .

H3k27ac HiChIP in prostate cell lines identifies risk genes for prostate cancer susceptibility

.

Am J Hum Genet

2021

;

108

(

12

):

2284

–

300

.

18.

Okuyama

K

,

Strid

T

,

Kuruvilla

J

, et al. .

PAX5 is part of a functional transcription factor network targeted in lymphoid leukemia

.

PLoS Genet

2019

;

15

(

8

):

e1008280

.

19.

Lee

B

,

Wang

J

,

Cai

L

, et al. .

ChIA-PIPE: a fully automated pipeline for comprehensive ChIA-PET data analysis and visualization

.

Sci Adv

2020

;

6

(

28

):

eaay2078

.

20.

Li

G

,

Sun

T

,

Chang

H

, et al. .

Chromatin interaction analysis with updated ChIA-PET tool (v3)

.

Genes

2019

;

10

(

7

):

554

.

21.

Bhattacharyya

S

,

Chandra

V

,

Vijayanand

P

,

Ay

F

.

Identification of significant chromatin contacts from HiChIP data by FitHiChIP

.

Nat Commun

2019

;

10

(

1

):

4221

.

22.

Juric

I

,

Miao

Y

,

Abnousi

A

, et al. .

MAPS: model-based analysis of long-range chromatin interactions from PLAC-seq and HichIP experiments

.

PLoS Comput Biol

2019

;

15

(

4

):

e1006982

.

23.

Durand

NC

,

Shamim

MS

,

Machol

I

, et al. .

Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments

.

Cell Syst

2016

;

3

(

1

):

95

–

8

.

24.

Servant

N

,

Varoquaux

N

,

Lajoie

BR

, et al. .

HiC-Pro: an optimized and flexible pipeline for Hi-C data processing

.

Genome Biol

2015

;

16

(

1

):

1

–

11

.

25.

Dekker

J

,

Rippe

K

,

Dekker

M

,

Kleckner

N

.

Capturing chromosome conformation

.

Science

2002

;

295

(

5558

):

1306

–

11

.

26.

Van De Werken

HJG

,

Landan

G

,

Holwerda

SJB

, et al. .

Robust 4C-seq data analysis to screen for regulatory dna interactions

.

Nat Methods

2012

;

9

(

10

):

969

–

72

.

27.

Dostie

J

,

Richmond

TA

,

Arnaout

RA

, et al. .

Chromosome conformation capture carbon copy (5C): a massively parallel solution for mapping interactions between genomic elements

.

Genome Res

2006

;

16

(

10

):

1299

–

309

.

28.

Belton

J-M

,

McCord

RP

,

Gibcus

JH

, et al. .

Hi-C: a comprehensive technique to capture the conformation of genomes

.

Methods

2012

;

58

(

3

):

268

–

76

.

29.

Fullwood

MJ

,

Liu

MH

,

Pan

YF

, et al. .

An oestrogen-receptor-

α

-bound human chromatin interactome

.

Nature

2009

;

462

(

7269

):

58

–

64

.

30.

Phanstiel

DH

,

Boyle

AP

,

Heidari

N

,

Snyder

MP

.

Mango: a bias-correcting ChIA-PET analysis pipeline

.

Bioinformatics

2015

;

31

(

19

):

3092

–

8

.

31.

Li

G

,

Chen

Y

,

Snyder

MP

,

Zhang

MQ

.

ChIA-PET2: a versatile and flexible pipeline for ChIA-PET data analysis

.

Nucleic Acids Res

2017

;

45

(

1

):

e4

–

4

.

32.

Cao

Y

,

Chen

Z

,

Chen

X

, et al. .

Accurate loop calling for 3D genomic data with cLoops

.

Bioinformatics

2020

;

36

(

3

):

666

–

75

.

33.

Cao

Y

,

Liu

S

,

Ren

G

, et al. .

cLoops2: a full-stack comprehensive analytical tool for chromatin interactions

.

Nucleic Acids Res

2022

;

50

(

1

):

57

–

71

.

34.

Huang

W

,

Medvedovic

M

,

Zhang

J

,

Niu

L

.

ChIAPoP: a new tool for ChIA-PET data analysis

.

Nucleic Acids Res

2019

;

47

(

7

):

e37

–

7

.

35.

Ardakany

AR

,

Gezer

HT

,

Lonardi

S

,

Ay

F

.

Mustache: multi-scale detection of chromatin loops from Hi-C and Micro-C maps using scale-space representation

.

Genome Biol

2020

;

21

:

1

–

17

.

36.

Imakaev

M

,

Fudenberg

G

,

McCord

RP

, et al. .

Iterative correction of Hi-C data reveals hallmarks of chromosome organization

.

Nat Methods

2012

;

9

(

10

):

999

–

1003

.

37.

Servant

N

,

Lajoie

BR

,

Nora

EP

, et al. .

HiTC: exploration of high-throughput ‘c’ experiments

.

Bioinformatics

2012

;

28

(

21

):

2843

–

4

.

38.

Schmid

MW

,

Grob

S

,

Grossniklaus

U

.

HiCdat: a fast and easy-to-use Hi-C data analysis tool

.

BMC Bioinform

2015

;

16

:

1

–

6

.

39.

Lazaris

C

,

Kelly

S

,

Ntziachristos

P

, et al. .

HiC-bench: comprehensive and reproducible Hi-C data analysis designed for parameter exploration and benchmarking

.

BMC Genom

2017

;

18

:

1

–

16

.

40.

Serra

F

,

Baù

D

,

Goodstadt

M

, et al. .

Automatic analysis and 3D-modelling of hi-c data using TADbit reveals structural features of the fly chromatin colors

.

PLoS Comput Biol

2017

;

13

(

7

):

e1005665

.

41.

Sauria

MEG

,

Phillips-Cremins

JE

,

Corces

VG

,

Taylor

J

.

HiFive: a tool suite for easy and efficient HiC and 5C data analysis

.

Genome Biol

2015

;

16

:

1

–

10

.

42.

Castellano

G

,

Le Dily

F

,

Pulido

AH

,

Beato

M

,

Roma

G

.

HiC-inspector: a toolkit for high-throughput chromosome capture data

.

bioRxiv

.

2015

;

020636

.

43.

Yaffe

E

,

Tanay

A

.

Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture

.

Nat Genet

2011

;

43

(

11

):

1059

–

65

.

44.

Heinz

S

,

Benner

C

,

Spann

N

, et al. .

Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities

.

Mol Cell

2010

;

38

(

4

):

576

–

89

.

45.

Wingett

S

,

Ewels

P

,

Furlan-Magaril

M

, et al. .

HiCUP: pipeline for mapping and processing Hi-C data

.

F1000Research

2015

;

4

:

1310

.

46.

Lareau

CA

,

Aryee

MJ

.

Hichipper: a preprocessing pipeline for calling dna loops from HiChIP data

.

Nat Methods

2018

;

15

(

3

):

155

–

6

.

47.

Li

G

,

Fullwood

MJ

,

Han

X

, et al. .

ChIA-PET tool for comprehensive Chromatin Interaction Analysis with Paired-End Tag sequencing

.

Genome Biol

2010

;

11

:

R22

–

13

.

48.

Kerpedjiev

P

,

Abdennur

N

,

Lekschas

F

, et al. .

HiGlass: web-based visual exploration and analysis of genome interaction maps

.

Genome Biol

2018

;

19

(

1

):

1

–

12

.

49.

Robinson

JT

,

Turner

D

,

Durand

NC

, et al. .

Juicebox.js provides a cloud-based visualization system for Hi-C data

.

Cell Syst

2018

;

6

(

2

):

256

–

258.e1

.

50.

Zhang

Y

,

Liu

T

,

Meyer

CA

, et al. .

Model-based analysis of ChIP-seq (MACS)

.

Genome Biol

2008

;

9

(

9

):

1

–

9

.

51.

Knight

PA

,

Ruiz

D

.

A fast algorithm for matrix balancing

.

IMA J Numer Anal

2013

;

33

(

3

):

1029

–

47

.

. https://doi.org/10.48550/arXiv.1303.3997.

52.

Durand

NC

,

Robinson

JT

,

Shamim

MS

, et al. .

Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom

.

Cell Syst

2016

;

3

(

1

):

99

–

101

.

53.

Hwang

W-H

,

Stoklosa

J

,

Wang

C-Y

.

Population size estimation using zero-truncated poisson regression with measurement error

.

J Agric Biol Environ Stat

2022

;

27

:

303

–

20

.

54.

Li

H

.

Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM

.

arXiv preprint arXiv:1303.3997

.

2013

55.

Paulsen

J

,

Rødland

EA

,

Holden

L

, et al. .

A statistical model of ChIA-PET data for accurate detection of chromatin 3D interactions

.

Nucleic Acids Res

2014

;

42

(

18

):

e143

–

3

.

56.

Langmead

B

,

Salzberg

SL

.

Fast gapped-read alignment with Bowtie 2

.

Nat Methods

2012

;

9

(

4

):

357

–

9

.

57.

Li

H

,

Durbin

R

.

Fast and accurate short read alignment with burrows–wheeler transform

.

Bioinformatics

2009

;

25

(

14

):

1754

–

60

.

58.

Marco-Sola

S

,

Sammeth

M

,

Guigó

R

,

Ribeca

P

.

The gem mapper: fast, accurate and versatile alignment by filtration

.

Nat Methods

2012

;

9

(

12

):

1185

–

8

.

59.

Mumbach

MR

,

Rubin

AJ

,

Flynn

RA

, et al. .

HiChIP: efficient and sensitive analysis of protein-directed genome architecture

.

Nat Methods

2016

;

13

(

11

):

919

–

22

.

60.

Dekker

J

,

Belmont

AS

,

Guttman

M

, et al. .

The 4D nucleome project

.

Nature

2017

;

549

(

7671

):

219

–

26

.

61.

Snyder

MP

,

Gingeras

TR

,

Moore

JE

, et al. .

Perspectives on encode

.

Nature

2020

;

583

(

7818

):

693

–

8

.

PubMed

62.

Quinlan

AR

,

Hall

IM

.

BEDTools: a flexible suite of utilities for comparing genomic features

.

Bioinformatics

2010

;

26

(

6

):

841

–

2

.

63.

Pal

K

,

Forcato

M

,

Ferrari

F

.

Hi-C analysis: from data generation to integration

.

Biophys Rev

2019

;

11

:

67

–

78

.

64.

Eagen

KP

.

Principles of chromosome architecture revealed by Hi-C

.

Trends Biochem Sci

2018

;

43

(

6

):

469

–

78

.

65.

Ghandi

M

,

Lee

D

,

Mohammad-Noori

M

,

Beer

MA

.

Enhanced regulatory sequence prediction using gapped k-mer features

.

PLoS Comput Biol

2014

;

10

(

7

):

e1003711

.

66.

Li

Y

,

Shi

W

,

Wasserman

WW

.

Genome-wide prediction of cis-regulatory regions using supervised deep learning methods

.

BMC Bioinformatics

2018

;

19

(

1

):

1

–

14

.

67.

Naville

M

,

Ishibashi

M

,

Ferg

M

, et al. .

Long-range evolutionary constraints reveal cis-regulatory interactions on the human x chromosome

.

Nat Commun

2015

;

6

(

1

):

6904

.

68.

Zhang

H

,

Li

F

,

Jia

Y

, et al. .

Characteristic arrangement of nucleosomes is predictive of chromatin interactions at kilobase resolution

.

Nucleic Acids Res

2017

;

45

(

22

):

12739

–

51

.

69.

Cheng

RR

,

Contessoto

VG

,

Aiden

EL

, et al. .

Exploring chromosomal structural heterogeneity across multiple cell lines

.

Elife

2020

;

9

:e60312, 1–21.

70.

Zhang

P

,

Yingfu

W

,

Zhou

H

, et al. .

CLNN-loop: a deep learning model to predict CTCF-mediated chromatin loops in the different cell lines and CTCF-binding sites (CBS) pair types

.

Bioinformatics

2022

;

38

(

19

):

4497

–

504

.

71.

Yang

D

,

Chung

T

,

Kim

D

.

DeepLUCIA: predicting tissue-specific chromatin loops using deep learning-based universal chromatin interaction annotator

.

Bioinformatics

2022

;

38

(

14

):

3501

–

12

.

72.

Jost

D

,

Carrivain

P

,

Cavalli

G

,

Vaillant

C

.

Modeling epigenome folding: formation and dynamics of topologically associated chromatin domains

.

Nucleic Acids Res

2014

;

42

(

15

):

9553

–

61

.

73.

Salameh

TJ

,

Wang

X

,

Song

F

, et al. .

A supervised learning framework for chromatin loop detection in genome-wide contact maps

.

Nat Commun

2020

;

11

(

1

):

3428

.

74. Di Pierro M, Cheng RR, Aiden EL, et al. De novo prediction of human chromosome structures: epigenetic marking patterns encode genome architecture. Proc Natl Acad Sci 2017;114(46):12126–31.

75. Fudenberg G, Kelley DR, Pollard KS. Predicting 3D genome folding from DNA sequence with Akita. Nat Methods 2020;17(11):1111–7.

76. Schwessinger R, Gosden M, Downes D, et al. DeepC: predicting 3D genome folding using megabase-scale transfer learning. Nat Methods 2020;17(11):1118–24.

77. Sefer E, Kingsford C. Semi-nonparametric modeling of topological domain formation from epigenetic data. Algorithms Mol Biol 2019;14:1–11.

78. Zhang R, Wang Y, Yang Y, et al. Predicting CTCF-mediated chromatin loops using CTCF-MP. Bioinformatics 2018;34(13):i133–41.

79. Kai Y, Andricovich J, Zeng Z, et al. Predicting CTCF-mediated chromatin interactions by integrating genomic and epigenomic features. Nat Commun 2018;9(1):4221.

80. Trieu T, Martinez-Fundichely A, Khurana E. DeepMILO: a deep learning approach to predict the impact of non-coding sequence variants on 3D chromatin structure. Genome Biol 2020;21:1–11.

81. Al Bkhetan Z, Plewczynski D. Three-dimensional epigenome statistical model: genome-wide chromatin looping prediction. Sci Rep 2018;8(1):5217.

82. Li W, Wong WH, Jiang R. DeepTACT: predicting 3D chromatin contacts via bootstrapping deep learning. Nucleic Acids Res 2019;47(10):e60.

83. Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics 2021;37(15):2112–20.

84. Chiliński M, Halder AK, Plewczynski D. Prediction of chromatin looping using deep hybrid learning (DHL). Quant Biol 2023;11(2):155–62.