Damsel: analysis and visualisation of DamID sequencing in R

Abstract

Summary

DamID sequencing is a technique to map the genome-wide interaction of a protein with DNA. Damsel is the first Bioconductor package to provide an end to end analysis for DamID sequencing data within R. Damsel performs quantification and testing of significant binding sites along with exploratory and visual analysis. Damsel produces results consistent with previous analysis approaches.

Availability and implementation

The R package Damsel is available for install through the Bioconductor project https://bioconductor.org/packages/release/bioc/html/Damsel.html and the code is available on GitHub https://github.com/Oshlack/Damsel/.

1 Introduction

Defining the binding landscape of chromatin interacting proteins is vital for understanding and manipulating gene expression. DamID sequencing (DamID-seq) was developed as an alternative to ChIP-seq that facilitates this without relying on antibodies (van Steensel and Henikoff 2000). In the DamID-seq protocol, the Escherichia coli DNA adenine methylase (Dam) protein is fused to a transcription regulatory protein of interest, and upon interaction with DNA, Dam methylates the adenine in adjacent GATC motifs. As Dam can also randomly methylate GATC motifs, DamID-seq analysis is always conducted with the expression of a Dam-only control. Dam-methylated genomic regions are enriched using a methylation sensitive endonuclease to cut DNA and amplified with PCR. The ends of the resultant DNA fragments are sequenced and mapped back to regions between known GATC motifs within the genome to provide the genome-wide binding profile of the protein of interest. Following the identification of enriched genomic regions relative to the Dam-only control, potential target genes of the protein of interest can be identified. These target genes are identified based on their proximity to an enriched binding peak. A limitation of DamID-seq is the reliance on GATC motifs for defining the potential enriched regions, which limits the utility of DamID-seq to genetically tractable species, and makes it ill suited for identifying histone modifications, as well as modifications in GATC motif sparse regions in the genome. Targeted DamID-seq is an optimized version of DamID-seq that was developed in Drosophila melanogaster which allows in vivo expression and precise spatiotemporal control (Southall et al. 2013, Marshall et al. 2016, Aughey et al. 2019).

Although conceptually DamID-seq is similar to ChIP-seq, the region based nature of DNA fragmentation means that the data produced is different, requiring the development of new analytical tools. Bespoke analyses are commonly conducted, as in Vissers et al. (2018) and the R package Daim (Ashmore 2019) (available on GitHub) which utilize edgeR’s (Robinson et al. 2010) robust statistical testing. A 2019 review of DamID-seq mentions only two published methods (Li et al. 2015, Marshall and Brand 2015), both focussed on algorithms to facilitate peak calling (identifying regions of binding compared with the control) (Aughey et al. 2019). Neither of these tools conduct the whole DamID-seq bioinformatics workflow, thereby forcing researchers to rely on combining multiple tools and platforms with bespoke analysis. The widely used method, damidseq_pipeline (Marshall and Brand 2015), is the most comprehensive tool and runs on the command line analyzing samples individually, but has no visualization capabilities. Here, we present Damsel, the first dedicated R Bioconductor package for DamID sequencing, providing an end-to-end analysis, including exploratory and visualization capabilities.

2 Methods

The Damsel workflow, as described in Fig. 1A, starts after read alignment using standard tools (Supplementary Methods) and requires BAM files and GATC regions for a genome as input. Damsel conducts the following steps in the analysis workflow: region quantification, identification of bound genomic regions, peak calling, candidate gene detection, and gene ontology testing, described in more detail below.

Figure 1.

Overview of Damsel. (A) The main steps in the Damsel workflow alongside their Bioconductor functions, including region quantification, identification of bound regions, peak calling, candidate gene detection, and gene ontology testing. (B) Quality control plot illustrating the total counts (sequencing depth) of the samples, coloured by if the count from the region was part of the statistical testing, and if it was included in a peak. The replicate 1 samples (Dam_1 & Fusion_1) have a much lower sequencing library than replicate 2. (C) Venn diagram of peak overlap from the analysis of Vissers et al. (2018) data by Damsel, Vissers pipeline, and damidseq_pipeline (Marshall and Brand 2015). As expected, Damsel has a greater overlap with Vissers as they rely on the same testing procedure (edgeR) than it does with damidseq_pipeline. (D) Example visualization available in Damsel of the most significant peak found within the Vissers data. The layers of the plot show the counts across the regions for the Fusion and Dam-only samples, the differential methylation test results, the presence of significant peaks, the positions of the GATC sites, and the overlap to gene annotations.

Open in new tab Download slide

2.1 Region quantification

The genome is intrinsically partitioned into genomic regions demarcated by GATC motifs (note that GATC is its own reverse complement and therefore the motif demarcates both strands of the DNA simultaneously). Damsel then utilizes the featureCount() function from the Rsubread package (Liao et al. 2019) to summarize the read counts between two adjacent GATC sites (GATC region) from the provided BAM files and motif input file. For paired-end BAM files, Damsel instead summarizes the fragments (read pairs). Due to the background methylation that occurs in the Dam-only sample, there is high correlation of the read counts of the regions between samples (Supplementary Fig. S1).

2.2 Identification of bound genomic regions

Previously, enrichment of a bound genomic region was identified from the log2 ratio of the region counts of the fusion sample compared to the control. However, analysis by Marshall and Brand (2015) established that this can result in a negative bias, highlighting the importance of normalization. Damsel first filters out excessively large regions (>10 000 bp), along with low count regions, before normalization and statistical testing is conducted with edgeR (McCarthy et al. 2012, Chen et al. 2016, Chen et al. 2024) allowing statistical testing incorporating the replicates and with arbitrary experimental designs (Robinson et al. 2010). edgeR’s TMM normalization assumes the majority of regions are not significant (Robinson and Oshlack 2010) and differential testing utilizes a quasi-likelihood negative binomial model (Robinson et al. 2010), utilising experimental replicates to model the biological variance in the data.

2.3 Peak calling

Peak calling allows for aggregation of adjacent significant regions that are likely to be the result of the same binding event. Damsel identifies peaks by aggregating the region level differential expression results, with P-value threshold set to 0.01 (by default), and log2 fold change threshold set to one. If the gap to the next significant region is less than 150 bp, it is included in the peak. Regions smaller than this size have consistently lower counts, and thus represent a greater proportion of regions that are excluded from differential testing. Peaks are ranked based on the statistics for ChIP-seq analysis presented in csaw (Lun and Smyth 2014, Lun and Smyth 2016) using the region with the lowest P-value to represent the peak and ordering peak significance based on this P-value. It should be noted that this approach has been tuned on data arising from transcription factor Dam-fusions rather than larger chromatin binding regions which might require further investigations.

2.4 Candidate gene identification

Methods for associating peaks and genes match the peak to its closest gene, with distance taken from the middle of the peak to the transcription start site. This is straightforward if the analysis is conducted on a species with little overlap between genes, like humans. However in species like Drosophila melanogaster, there is a large amount of overlap between genes and the closest gene may not be the most likely candidate. Therefore, Damsel outputs each peak with a list of candidate target genes, along with their genomic positions and relevant statistics. Placing the genes within the context of the peaks provides a more informative output, and sets Damsel apart from other tools.

At this point, Damsel outputs a file with all identified peaks and their associated statistics, matched to their associated candidate genes. Comparatively, damidseq_pipeline (Marshall and Brand 2015) outputs a list of genes and a score, which is the end of the analysis. However, Damsel also includes the following utilities.

2.5 Gene ontology testing

Gene ontology testing can be used to interpret the set of candidate target genes identified from the peaks. Damsel repurposes goseq, an R package developed to correct for detection biases in RNA-seq (Young et al. 2010) This is important because the GATC motifs are not spread evenly throughout the genome and different genes have different lengths and therefore a different number of GATC regions associated with them. This results in a bias where genes with more GATC regions are more likely to be associated with a peak (binding event) (Supplementary Fig. S2). Without bias correction, the gene ontology terms identified are not truly representative of the data. goseq takes as input a list of genes, if they are significant, and the number of GATC regions within 2 kb of the gene as the bias parameter, and outputs a list of overrepresented gene ontology terms.

2.6 Visualizations

A unique feature of Damsel is its visualization capability. Other DamID sequencing analysis methods, including damidseq_pipeline (Marshall and Brand 2015), output files that they recommend viewing in IGV, requiring researchers to switch software platforms. Damsel provides the opportunity to use its output files to create plots within R. Building on ggbio (Yin et al. 2012) and ggcoverage (Song and Wang 2023), Damsel contains ggplot2 (Wickham 2016) style plots that present results from different stages of the analysis. These plots can be layered, allowing the visualization of counts per region, log2 fold change and differential expression testing P-values, peak location, GATC motif, and gene positions for a provided region (Fig. 1D).

3 Results

To evaluate Damsel, we applied it to a subset of the data presented in Vissers et al. (2018). This dataset contained two Dam only samples (Dam_1, Dam_2) and two samples with Dam fused to Scalloped (Fusion_1, Fusion_2) in Drosophila wing tissue (Vissers et al. 2018). Initial QC plots from Damsel revealed differences in the sequencing depth between replicates (Fig. 1B). Damsel’s results were compared to the Marshall and Brand (2015) pipeline, and the Vissers et al. (2018) bespoke pipeline. Daim (Ashmore 2019) was excluded as it was not able to run. Due to not being actively maintained, the code for the Vissers’ analysis and peak calling had to be updated manually, and Marshall and Brand’s results had to be manually compiled to obtain one peak set. It should also be noted that the output of Vissers’ peak script did not contain individual P-values or FDRs for each reported peak, limiting some downstream analyses.

While all three approaches conduct the main steps of DamID-Seq analysis (identifying regions of enrichment, combining regions into peaks, and associating peaks with genes), Damsel provides increased functionality and seamless data flow through the pipeline (Supplementary Table S1). Like Vissers et al. (2018) analysis, Damsel incorporates replicates into the analysis to model variability and significantly reduce the burden of compiling results. However, Damsel can further incorporate complex designs within the edgeR framework, including pairing of replicates. Additional features unique to Damsel include region quantification directly from bam files, the ranking of significant results (peaks and genes), providing relevant statistics associating the peaks with annotated genes, and capacity for visualization of the results (Supplementary Table S1).

To first evaluate the false positive rate, the samples from Vissers et al. (2018) experiment were reassigned labels as Fusion and Dam-only so that instead of comparing the Dam-only to the fusion we compared one fusion and one Dam-only sample to the other fusion and Dam-only sample. There are two possible configurations of this. In the Dam_1 + Fusion_1 versus Dam_2 + Fusion_2 case, Damsel detected no significant peaks but Vissers falsely identified 1254 significant regions and 186 peaks (FDR < 0.01). When switching the samples to compare Dam_1 + Fusion_2 versus Dam_2 + Fusion_1 case, neither Damsel or Vissers identified significant regions or peaks, demonstrating good Type 1 error control (Supplementary Methods).

We next analysed the data in the way it was intended with two replicates of the fusion versus Dam_only. In Damsel, we included a factor for the replicate in the design. Damsel was found to be more sensitive than the original reported results finding 3120 peaks, which is 834 more than identified from running Vissers et al. (2018) pipeline. Damsel successfully identified 60% of Vissers’ and Marshall and Brand’s peaks (Fig. 1C). Both Marshall and Vissers’ pipelines do not include a factor for the replicates during the testing, which may explain why their results have a higher proportion of overlap. However, we argue that Damsel’s incorporation of the replicates is necessary to account for the clear difference in the coverage between the replicates (Fig. 1B). As not all experiments will have this relationship between the replicates, this is an optional element of Damsel’s testing procedure. Damsel’s gene ontology (GO) testing accounts for the bias in the number of regions per gene (Supplementary Fig. S2). We performed GO analysis on the peaks reported in the original paper and those found by Damsel and reassuringly found that Damsel reported 6 of the same top 10 GO terms as the regions from Vissers (Supplementary Fig. S3). Finally, as an example of data visualization we used Damsel to show the data and analysis of the most significant peak identified by all three methods (Fig. 1D).

Identifying which method is the most sensitive and accurate is beyond the scope of this study, and would be difficult without a large amount of experimental validation. However, Damsel performs consistently compared with established methods, and has produced results on multiple published datasets.

4 Conclusions

Damsel offers an end-to-end DamID-seq analysis pipeline including exploratory analysis and visualizations. On published data we find similar results to previous bespoke analysis. We find that Damsel can control false discovery rate and is more sensitive. Damsel includes new features such as the ranking of peak significance and performing gene ontology analysis. Future work will include the utility to perform motif enrichment analysis on sequences found in enriched peaks.

Acknowledgements

We thank Avni Avand for testing the software and providing feedback.

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest: None declared.

Funding

This work was supported by an Investigator grant [APP1196256 to A.O.] from the National Health and Medical Research Council of Australia. This work was supported by an Investigator grant [APP1194467 to K.F.H.] from the National Health and Medical Research Council of Australia.

Data availability

The data used in this paper is available at the European Nucleotide Archive (PRJNA494322) https://www.ebi.ac.uk/ena/browser/view/PRJNA494322. Analysis presented in this paper can be found at https://caitlinpage.github.io/damsel_paper/.

References

Ashmore

Analysis, interpretation, and visualisation of DamID-seq experiments. Doctoral thesis, The University of Edinburgh,

2019

. The Edinburgh Research Archive. not peer reviewed.

Aughey

Cheetham

Southall

TD.

DamID as a versatile tool for understanding gene regulation

Development

2019

;

146

–

Google Scholar

Crossref

WorldCat

Chen

Lun

ATL

Smyth

GK.

From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline

F1000Res

2016

;

1438

10.12688/f1000research.8987.2

Chen

Lun

ATL

et al. edgeR 4.0: powerful differential analysis of sequencing data with expanded functionality and improved support for small counts and larger datasets. bioRxiv,

10.1101/2024.01.21.576131

2024

, preprint: not peer reviewed.

Hempel

Jiang

A non-parametric peak calling algorithm for DamID-Seq

PLoS One

2015

;

e0117415

10.1371/journal.pone.0117415

Liao

Smyth

Shi

The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads

Nucleic Acids Res

2019

;

e47

Lun

ATL

Smyth

GK.

De novo detection of differentially bound regions for ChIP-seq data using peaks and windows: controlling error rates correctly

Nucleic Acids Res

2014

;

e95

Lun

ATL

Smyth

GK.

csaw: a Bioconductor package for differential binding analysis of ChIP-seq data using sliding windows

Nucleic Acids Res

2016

;

e45

Marshall

Brand

AH.

damidseq_pipeline: an automated pipeline for processing DamID sequencing datasets

Bioinformatics

2015

;

3371

–

10.1093/bioinformatics/btv386

Marshall

Southall

Cheetham

et al.

Cell-type-specific profiling of protein–DNA interactions without cell isolation using targeted DamID with next-generation sequencing

Nat Protoc

2016

;

1586

–

10.1038/nprot.2016.084

McCarthy

Chen

Smyth

GK.

Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation

Nucleic Acids Res

2012

;

4288

–

Robinson

McCarthy

Smyth

GK.

edgeR: a Bioconductor package for differential expression analysis of digital gene expression data

Bioinformatics

2010

;

139

–

10.1093/bioinformatics/btp616

Robinson

Oshlack

A scaling normalization method for differential expression analysis of RNA-seq data

Genome Biol

2010

;

R25

10.1186/gb-2010-11-3-r25

Song

Wang

ggcoverage: an R package to visualize and annotate genome coverage for various NGS data

BMC Bioinformatics

2023

;

309

10.1186/s12859-023-05438-2

Southall

Gold

Egger

et al.

Cell-type-specific profiling of gene expression and chromatin binding without cell isolation: assaying RNA pol II occupancy in neural stem cells

Dev Cell

2013

;

101

–

10.1016/j.devcel.2013.05.020

van Steensel

Henikoff

Identification of in vivo DNA targets of chromatin proteins using tethered Dam methyltransferase

Nat Biotechnol

2000

;

424

–

Vissers

JHA

Froldi

Schröder

et al.

The scalloped and nerfin-1 transcription factors cooperate to maintain neuronal cell fate

Cell Rep

2018

;

1561

–

76.e7

10.1016/j.celrep.2018.10.038

Wickham

ggplot2: Elegant Graphics for Data Analysis

Cham

Springer

2016

Yin

Cook

Lawrence

ggbio: an R package for extending the grammar of graphics for genomic data

Genome Biol

2012

;

R77

10.1186/gb-2012-13-8-r77

Young

Wakefield

Smyth

et al.

Gene ontology analysis for RNA-seq: accounting for selection bias

Genome Biol

2010

;

R14

10.1186/gb-2010-11-2-r14

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Associate Editor:

Download all slides

Month:	Total Views:
November 2024	160
December 2024	454
January 2025	151
February 2025	73
March 2025	130
April 2025	86
May 2025	20

Article Contents

Damsel: analysis and visualisation of DamID sequencing in R

Abstract

1 Introduction

2 Methods

2.1 Region quantification

2.2 Identification of bound genomic regions

2.3 Peak calling

2.4 Candidate gene identification

2.5 Gene ontology testing

2.6 Visualizations

3 Results

4 Conclusions

Acknowledgements

Supplementary data

Funding

Data availability

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

Article Contents

Damsel: analysis and visualisation of DamID sequencing in R

Abstract

1 Introduction

2 Methods

2.1 Region quantification

2.2 Identification of bound genomic regions

2.3 Peak calling

2.4 Candidate gene identification

2.5 Gene ontology testing

2.6 Visualizations

3 Results

4 Conclusions

Acknowledgements

Supplementary data

Funding

Data availability

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

This Feature Is Available To Subscribers Only