Abstract

Motivation

Optical genome mapping (OGM) is a technique that extracts partial genomic information from optically imaged and linearized DNA fragments containing fluorescently labeled short sequence patterns. This information can be used for various genomic analyses and applications, such as the detection of structural variations and copy-number variations, epigenomic profiling, and microbial species identification. Currently, the choice of labeled patterns is based on the available biochemical methods and is not necessarily optimized for the application.

Results

In this work, we develop a model of OGM based on information theory, which enables the design of optimal labeling patterns for specific applications and target organism genomes. We validated the model through experimental OGM on human DNA and simulations on bacterial DNA. Our model predicts up to 10-fold improved accuracy by optimal choice of labeling patterns, which may guide future development of OGM biochemical labeling methods and significantly improve its accuracy and yield for applications such as epigenomic profiling and cultivation-free pathogen identification in clinical samples.

Availability and implementation

https://github.com/yevgenin/PatternCode

1 Introduction

Optical genome mapping (OGM) (Neely et al. 2010, Levy-Sakin and Ebenstein 2013, Deen et al. 2017, Jeffet et al. 2021) is a technique for mapping optically imaged linearized DNA fragments to reference genome sequences. It has been demonstrated for the detection of structural variations and copy-number variations (Ebert et al. 2021), DNA damage (Torchinsky et al. 2019), epigenomic profiling (Gabrieli et al. 2018, 2022), and microbial species identification (Grunwald et al. 2015, Wand et al. 2019, Bouwens et al. 2020, Müller et al. 2020). The current common process by which OGM is implemented involves fluorescent labeling of a specific short genome sequence pattern on the DNA (Neely et al. 2010), linearizing (or combing, stretching) the labeled DNA fragments on some surface (Levy-Sakin and Ebenstein 2013, Deen et al. 2015, Wu et al. 2019), followed by optical measurement of the labeled pattern density along the DNA fragments, and the analysis of the acquired images. The mapping of the images to reference genome sequences is done using alignment algorithms (Valouev et al. 2006, Mendelowitz and Pop 2014, Bouwens et al. 2020, Dehkordi et al. 2021).

Although OGM extracts only partial information from the genome, it possesses several advantages compared to DNA sequencing. These include the ability to generate extremely long genome maps of up to megabase size, which is necessary for detecting genomic large-scale structural and copy-number variations, and the potential for extremely high sensitivity, i.e. detection of low quantities of target DNA (Margalit et al. 2021), which is necessary in applications such as cultivation-free pathogen identification (Müller et al. 2020, Nyblom et al. 2023).

Multiple labeling methods are used in OGM. Originally, restriction enzymes were used and the length of fragments between cut sites was measured and used for the mapping (Schwartz et al. 1993). More recently, fluorescent labeling is done to label short sequence patterns (Neely et al. 2010). One labeling approach is competitive binding based (Müller et al. 2019, 2020, Nyblom et al. 2023), where a fluorophore competes with a DNA intercolating molecule for binding to the DNA, allowing the visualization of G/C base-pair density along a DNA fragment. Another approach uses CRISPR-Cas9 DNA labeling with a multitude of sgRNAs (Abid et al. 2021). One more labeling approach is achieved by enzymes from bacterial restriction-modification (RM) systems that have short recognition sequences of 4–6 bp length, such as nicking endonucleases, with incorporation of fluorescently labeled nucleotides (Neely et al. 2011), and DNA methyltransferases that methylate a specific sequence pattern, and together with modified cofactors, their action is modified to fluorescent labeling (Pljevaljčić et al. 2004, Dalhoff et al. 2006, Grunwald et al. 2015, Deen et al. 2017). The REBASE database (Roberts et al. 2015) lists thousands of RM systems, with different recognition sequences, many of which are available commercially. This suggests a wide playground for optimization of the labeling patterns for specific applications and target genomes, given a way to quantify and predict OGM accuracy for each pattern.

Previous attempts to computationally estimate the accuracy of OGM were done, e.g. in Wand et al. (2019) and Bouwens et al. (2020), where simulated estimations of OGM accuracy were done, mainly tailored to the specific labeled patterns and algorithms used in those studies. Before the advent of fluorescent labeling based OGM, in Anantharaman and Mishra (2001), P value estimation of restriction map alignment was done using combinatorial methods. In this article, we propose an information theory (Cover and Thomas 2012) based method for the labeled pattern design. Information-theoretic analysis has been applied for various DNA processing problems. For example, Motahari et al. (2013) analyzed the minimal read coverage needed for genome reconstruction by DNA shotgun sequencing, mostly in a noiseless setting. Information-theoretic analysis of reference-based DNA shotgun sequencing has been derived in Mohajer et al. (2013), and recently refined in Weinberger and Shomorony (2023). Information-theoretic methods were also applied to the analysis of nanopore DNA sequencing (Mao et al. 2018), where the nanopore channel was modeled as an insertion-deletion channel and general bounds on the number of possible decodable sequences were developed. However, tight bounds on the error probability were not developed.

In this work, we develop a model of OGM based on information theory and compute its error probability, thus enabling the optimization of labeling patterns. To achieve this goal, we model OGM as a random code operating over a noisy communication channel (Shannon 1948) and use tight bounds on its error probability. We present experimental and simulated validations of the model, and demonstrate the optimization of labeling patterns for different target genome sets, such as the human genome and bacterial genomes. This article is organized as follows. Section 2 develops the information-theoretic based model of OGM and describes the simulation and experimental methods we used for its validation. Section 3 presents the results of the validation for error probability versus DNA fragment length, and the results of the application of the model to the design of optimal labeling patterns.

2 Materials and methods

In OGM, a specific short genome sequence pattern is labeled in observed DNA fragments (Fig. 1). Given a set of target genome sequences from which the observed DNA fragments originate, and a specific labeled sequence pattern, we are interested in the accuracy (or error probability) for mapping each fragment to its true position from one of the target genome sequences. The target genome sequences may consist, e.g. of the human genome, a set of bacterial genomes, etc.

PatternCode concept. Subplot (a) presents an illustration of a DNA molecule (not to scale), with the pattern CTTAAG labeled. Subplot (b) shows an example DNA fragment experimental image, which is used for estimation of theoretical model parameters. Bottom: the pattern occurrences in the human genome sequence, matching the DNA fragment, along with their bin counts (x). Top: localized labels, together with their bin counts (y).
Figure 1.

PatternCode concept. Subplot (a) presents an illustration of a DNA molecule (not to scale), with the pattern CTTAAG labeled. Subplot (b) shows an example DNA fragment experimental image, which is used for estimation of theoretical model parameters. Bottom: the pattern occurrences in the human genome sequence, matching the DNA fragment, along with their bin counts (x). Top: localized labels, together with their bin counts (y).

A full model of the OGM process would include biochemical labeling errors of DNA fragments, nonuniform stretching of DNA fragments, optical imaging artifacts, and localization errors of labels in the images. Here, the model is simplified to enable theoretical analysis; nevertheless, our experimental validations, detailed below, show that it constitutes an excellent approximation of the actual measurement process.

2.1 Information theory of OGM accuracy

We define a successful OGM mapping of a DNA fragment, as its correct mapping to its sequence of origin, and its accurate positioning (up to some margin of error) in terms of start and stop coordinates in this sequence. The mapping error probability will represent the chance that the mapping will be unsuccessful for a randomly sampled DNA fragment. To simplify the model and estimate the mapping error probability, both the genome sequence and DNA fragments (Fig. 1) are segmented into nonoverlapping bins of size B, typically on the order of 1 kb. Thus, for simplification, we will define the mapping allowed margin of error as one bin size. The chosen value of B is explained in Supplementary Fig. S1, and the sensitivity of the model to this parameter in Supplementary Fig. S6. When the bin size is too small, label localization errors cause statistical dependence between bins; and if it is too large, significant information regarding the label position is lost. Here, statistical independence of the information contained in neighboring bins is assumed (see Supplementary Section S5 and Fig. S3 for elaboration on that), allowing for the modelling of the OGM process as a discrete memoryless channel (DMC) decoding problem in the random coding regime (Shannon 1948, Gallager 1968).

We model the OGM process as a noisy communication channel, transmitting a message (DNA fragment position in the genome) which is encoded by a codeword (the bin counts corresponding to pattern positions in the sequence), resulting in an information-theoretic code we term the pattern code. At the output of the communication channel, a noisy codeword (image of the DNA fragment) is received, and is decoded to produce the decoded message (estimated genome position), out of the possible messages, namely the codebook (all possible genome fragments). The process is summarized in Fig. 2. The codebook of size M is composed of codewords which are vectors of bin counts of length n. Each codeword is associated to a message of position offset, and corresponds to a segment of a genome sequence of length L=nB, where B is the bin size. Each value x in the codeword is the number of pattern occurrences in a bin of the genome sequence segment. Defining G as the total genomes length (sum of lengths of all target genome sequences), the size of the codebook is M=2G/B (approximately, considering valid position offsets), where the factor of 2 is due to the unknown orientation (35 or 53) of the DNA fragment, requiring consideration of both the forward and reverse direction of the fragment. The measured noisy codewords are the imaged DNA fragments, and are vectors of bin counts of length n. Each value y in the vector is the number of detected labels in a bin of the DNA fragment. The channel model is described by the likelihood py|x, which captures the various noise factors in the OGM process. The mapping algorithm of OGM, in this case, is a maximum-likelihood (ML) decoder which chooses the codeword (and message) that maximizes the likelihood of the observed noisy codeword.

Information-theoretic model of OGM as a random codeword transmitted over a noisy communication channel. (a) The OGM communication channel flowchart. (b) Example model parameters, explained in Section 2.1. B is the bin size, L is the DNA fragment length, n is the number of bins in the DNA fragment, and G is the genome length (here it’s very short, for illustration). The number of possible position offsets of a codeword in the binned genome is the number of messages (or codebook size) M. Also shown are the bin counts of pattern occurrences in the genome sequence (x) and of the labels in the DNA fragment image (y).
Figure 2.

Information-theoretic model of OGM as a random codeword transmitted over a noisy communication channel. (a) The OGM communication channel flowchart. (b) Example model parameters, explained in Section 2.1. B is the bin size, L is the DNA fragment length, n is the number of bins in the DNA fragment, and G is the genome length (here it’s very short, for illustration). The number of possible position offsets of a codeword in the binned genome is the number of messages (or codebook size) M. Also shown are the bin counts of pattern occurrences in the genome sequence (x) and of the labels in the DNA fragment image (y).

Modeling the problem in this way allows to use sharp and accurate bounds from the information-theoretic literature to bound the error probability in OGM. The celebrated noisy channel coding theorem (Shannon 1948) assures that optimal ML decoder achieve a vanishing probability when the coding rate is less than the “capacity” of the channel. Importantly, this is proved using a randomly chosen codebook, for which the codeword symbols are randomly chosen. This vanishing error probability is obtained for asymptotically large codeword size n. However, in the case of OGM, we are interested in the error probability for the shortest possible DNA fragments, and relatively high error probabilities can often be tolerated, as final clinically relevant decisions can be based on multiple mapped DNA fragments, allowing for the correction of errors. In recent years, tight bounds and approximations on the coding rate for nonvanishing error probabilities in DMCs has been obtained (Hayashi 2009, Polyanskiy 2010, Polyanskiy et al. 2010, Tan 2014), and these are the bounds we will henceforth utilize.

2.1.1 Error probability computation

Having formulated the OGM problem as the transmission of a codeword from a random codebook over a DMC, we can compute the error probability of the ML decoder following the achievability and converse probability bounds for a DMC in Tan (2014, Equations 4.33, 4.57). The error probability, denoted by ε, is approximated as follows:

(1)

where Φ is the standard normal cumulative distribution function, M=2G/B with G the total genomes length (sum of lengths of all target genome sequences), n=L/B with L the DNA fragment length, and B the bin size as above. The mutual information I (of x and y) and the variance V are computed from the joint and marginal distributions of the number of pattern occurrences in a bin of a genome sequence (x) and the number of detected labels in a bin of a DNA fragment (y), as follows. We define r as the following log ratio of joint and marginal distributions:

(2)

where pxy=py|xpx is the joint distribution of x and y, and py=Ex[py|x] is the marginal distribution of y. Then, I=Exy[r] and V=Varxy[r]. The only free parameters in this model that need to be estimated from the data are the pattern density distribution px and the label detection likelihood py|x. The label detection likelihood py|x implicitly includes effects such as imperfect labeling efficiency (incomplete digestion), off-target sequence labeling (e.g. due to nonspecific reaction of the labeling enzyme), and staining artifacts. This is due to the fact that it represents the probability of detecting a label in a bin of a DNA fragment, given the number of pattern occurrences in a bin of a genome sequence. The effect of nonuniform stretching of DNA fragments is not modelled explicitly here, and is left for future work. However, in the current work, we primarily focus on shorter DNA fragments for which the effect of nonuniform stretching is less significant, as we show in Supplementary Section S6.

2.1.2 Estimation of parameters

The pattern density distribution px was estimated from the target genome sequences by computing the histogram of x, the number of pattern occurrences in all the bins of size B. The label detection likelihood py|x was estimated from experimental OGM data by computing the 2D histogram of (x, y) pairs of aligned bins over all DNA fragments (see Table 1). Here, x is the number of pattern occurrences in a bin of the target genome sequence, and y is the number of detected labels in a bin of a DNA fragment.

Table 1.

Estimated parameters of information-theoretic noisy DMC model used in this work.a

py|xy=0y=1y=2px
x=00.959610.038470.001920.81555
x=10.187160.674920.137920.16278
x=20.131090.359780.509130.02168
py|xy=0y=1y=2px
x=00.959610.038470.001920.81555
x=10.187160.674920.137920.16278
x=20.131090.359780.509130.02168
a

The pattern density distribution px for the pattern CTTAAG is shown on the human genome sequences. The label detection likelihood py|x was estimated from experimental human DNA fragments, and the human genome hg38 with the same pattern labeled, as described in Section 2.1.2. The number of pattern occurrences in a genome bin, x, was capped at 2, and the number of labels detected in a DNA fragment bin, y, was capped at 2. The bin size used was 1 kb.

Table 1.

Estimated parameters of information-theoretic noisy DMC model used in this work.a

py|xy=0y=1y=2px
x=00.959610.038470.001920.81555
x=10.187160.674920.137920.16278
x=20.131090.359780.509130.02168
py|xy=0y=1y=2px
x=00.959610.038470.001920.81555
x=10.187160.674920.137920.16278
x=20.131090.359780.509130.02168
a

The pattern density distribution px for the pattern CTTAAG is shown on the human genome sequences. The label detection likelihood py|x was estimated from experimental human DNA fragments, and the human genome hg38 with the same pattern labeled, as described in Section 2.1.2. The number of pattern occurrences in a genome bin, x, was capped at 2, and the number of labels detected in a DNA fragment bin, y, was capped at 2. The bin size used was 1 kb.

The likelihood used in this work was estimated for experimental human DNA fragments, and the human genome hg38 (Fig. 1), where the pattern CTTAAG was labeled. This data consisted of images acquired as described in Section 2.4, where the DNA fragments were labeled by the DLE-1 enzyme and imaged using the Saphyr system (Bionano Genomics). The DNA fragments, all longer than 400 kb, were aligned to the human genome, using the DeepOM algorithm (Nogin et al. 2023). Such long DNA fragments were taken in order to have high confidence alignments (as error probability is very low for such fragments, as will be shown in Fig. 3), to ensure the accurate estimation of the parameters. Overall, 445 DNA fragments were used consisting of a total of 180 Mb.

Validation of the theoretical model versus DNA fragment length. Validation results are shown against three sets of genome sequences: a random genome sequence (i.i.d. uniformly distributed nucleotides) of length 108 bp (subplot a), the human genome (subplot b), and selected bacterial genomes (subplot c; see Supplementary Table S1). The plots in the top row show the error probability versus fragment length, while the bottom row plots show the same data but with the theoretical error probability on the x-axis and the simulated or experimental error probabilities on the y-axis (which reflect the actual error rate in simulation or experiment). The theoretical error probability was computed using Equation (1), and the simulated error rate was computed as described in Section 2.2, based on 512 randomly generated DNA fragments for each length shown. For the human genome, in addition to the simulated error rates, we show the experimental error rates for comparison. These experimental error rates (for the DeepOM algorithm) were adapted from Figure 4b in Nogin et al. (2023) and described in Section 2.3. Overlay of experimental results for the Bionano software, is in Supplementary Fig. S5. For both, the simulated and experimental error rates, 95% confidence bounds (Clopper and Pearson, 1934) for the rate are shown, based on the error counts and the number of trials.
Figure 3.

Validation of the theoretical model versus DNA fragment length. Validation results are shown against three sets of genome sequences: a random genome sequence (i.i.d. uniformly distributed nucleotides) of length 108 bp (subplot a), the human genome (subplot b), and selected bacterial genomes (subplot c; see Supplementary Table S1). The plots in the top row show the error probability versus fragment length, while the bottom row plots show the same data but with the theoretical error probability on the x-axis and the simulated or experimental error probabilities on the y-axis (which reflect the actual error rate in simulation or experiment). The theoretical error probability was computed using Equation (1), and the simulated error rate was computed as described in Section 2.2, based on 512 randomly generated DNA fragments for each length shown. For the human genome, in addition to the simulated error rates, we show the experimental error rates for comparison. These experimental error rates (for the DeepOM algorithm) were adapted from Figure 4b in Nogin et al. (2023) and described in Section 2.3. Overlay of experimental results for the Bionano software, is in Supplementary Fig. S5. For both, the simulated and experimental error rates, 95% confidence bounds (Clopper and Pearson, 1934) for the rate are shown, based on the error counts and the number of trials.

In both estimated probabilities, and throughout this work, the x and y counts were capped at the number 2 (i.e. if more than two pattern occurrences or more than two labels were detected in a bin, the count was set to 2). This can be justified by the low density of length-6 patterns in the genome [see px(x=2) in the table] and the inability of practical localization methods to separate closely spaced labels.

2.2 Simulated validation of the theory

The simulation-based validation of the theory in Equation (1) was done by simulating DNA fragments of length L=nB bp, each at a random start position in target genome sequences. The counts of labels in each bin (y) were simulated from the py|x distribution, where x is the number of pattern occurrences in a bin of a genome sequence and y is the number of detected labels in a bin of a DNA fragment (see Section 2.1). The ML decoder estimates the position offset i^ of a DNA fragment in a genome sequence as follows:

(3)

where xi+j(k) is the bin count of pattern occurrences in the jth bin of the genome sequence k starting at the bin from position i. For each genome sequence, its reverse sequence was also considered, since the orientation of the fragment is unknown. The error rate was computed by running the decoder on all generated DNA fragments (512 per parameter set: over a multitude of fragment lengths, labeled patterns, genome sets, see Figs 3 and 4), with a decoder error defined when the estimated position is different from the true position.

Design of optimal labeling patterns. This figure shows the evaluation of error probabilities for all length-6 labeling patterns, and some special patterns, against the following sets of genome sequences for a fixed DNA fragment length: subplot (a) is for the human genome, subplot (b) is for selected bacterial genomes (see Supplementary Table S1). We varied the labeling pattern, with all 46=4096 possible patterns evaluated and shown on the subplots. Additionally, highlighted are nicking enzyme recognition patterns from the REBASE database and some commonly used enzyme recognition patterns in OGM are shown with arrows (DLE-1, Nt.BspQI, Nb.BsmI with recognition patterns CTTAAG, GCTCTTC, and GCATTC, respectively). For each pattern, its reverse complement is also labeled at the same time. The data plotted for these special patterns for the human genome is shown in Supplementary Table S2. For each genome set, we computed the theoretical error probability using the estimated px and py|x (see Section 2.1), and validated the theory through simulations (see Section 2.2). The top row of subplots shows the error probability versus the density of the labeling pattern, while the bottom row plots compare the theory (x-axis) with simulated evaluation (y-axis) of the error probability for a random subset of patterns, with 95% confidence intervals computed as in Fig. 3.
Figure 4.

Design of optimal labeling patterns. This figure shows the evaluation of error probabilities for all length-6 labeling patterns, and some special patterns, against the following sets of genome sequences for a fixed DNA fragment length: subplot (a) is for the human genome, subplot (b) is for selected bacterial genomes (see Supplementary Table S1). We varied the labeling pattern, with all 46=4096 possible patterns evaluated and shown on the subplots. Additionally, highlighted are nicking enzyme recognition patterns from the REBASE database and some commonly used enzyme recognition patterns in OGM are shown with arrows (DLE-1, Nt.BspQI, Nb.BsmI with recognition patterns CTTAAG, GCTCTTC, and GCATTC, respectively). For each pattern, its reverse complement is also labeled at the same time. The data plotted for these special patterns for the human genome is shown in Supplementary Table S2. For each genome set, we computed the theoretical error probability using the estimated px and py|x (see Section 2.1), and validated the theory through simulations (see Section 2.2). The top row of subplots shows the error probability versus the density of the labeling pattern, while the bottom row plots compare the theory (x-axis) with simulated evaluation (y-axis) of the error probability for a random subset of patterns, with 95% confidence intervals computed as in Fig. 3.

2.3 Experimental validation of the theory

The experimental validation of the theory in Equation (1) was done by comparing it to results from the DeepOM work (Nogin et al. 2023), which evaluated the error rates of the DeepOM OGM algorithm in mapping DNA fragments to the genome, using experimental human DNA fragments labeled with the pattern CTTAAG (Fig. 1). The ground truth for the mapping in was based on longer than 400 kb high-confidence aligned DNA molecules, which were cropped to produce short fragments with known positions in the genome. The error rates were computed by counting the rate of successful mapping of the cropped fragments to their known ground truth position as described in the DeepOM paper. The algorithm is based on a convolutional neural network for localizing labels in a DNA fragment image, and a dynamic programming algorithm for aligning the DNA fragment to the genome. For each fragment length tested, over the range of 10–400 kb, 512 DNA fragments were used. The experimental images are supplementary to this article, as described in the Data Availability section.

2.4 Sample preparation and data collection

Images of human DNA fragments (a total of 445 images, see one example in Fig. 1), were used to estimate the label detection likelihood parameters (Section 2.1.2). These images were captured using the Saphyr instrument and Saphyr chips (G1.2) (Bionano Genomics). The protocol for extraction and labeling of DNA is described in Nogin et al. (2023). U2OS human cell line was cultured in Dulbecco’s modified Eagle medium with 10% fetal bovine serum (Gibco, Amarillo, TX), 2 mM l-glutamine, and 1% penicillin-streptomycin (10 000 U/ml; Gibco). The cells were incubated at 37°C with 5% CO2. DNA was extracted from 106 cells using the Bionano Prep Cell Culture DNA Isolation Protocol (Bionano Genomics). One microgram of DNA was directly labeled using the Bionano Genomics DLS labeling kit, composed of a single enzymatic labeling reaction with DLE-1 enzyme. The reaction mixture contained 6 µl of 5× DLE-buffer, 2 µl of 20× DL-Green, 2 µl of DLE-1 enzyme (Bionano Genomics), and a total reaction volume of 30 µl. The reaction was incubated for 2 h at 37°C.

2.5 Genome sequence data

Human and bacterial genome sequence data was downloaded from the NCBI genome datasets https://www.ncbi.nlm.nih.gov/datasets/. Human genome release GRCh38.p14 was used. The bacterial genomes used are listed in Supplementary Table S1.

3 Results

3.1 Accuracy versus fragment length

We validate our theoretical model by comparing it to experimental and simulated data for varying DNA fragment lengths. We first estimate the label detection likelihood py|x from the experimental data, as described in Section 2.1.2, resulting in the values shown in Table 1. We then assume that the label detection noise is the same for all genome sets, as it is a function of the imaging instrument and the labeling protocol. For each genome set, we estimate the pattern density distribution px as described in Section 2.1.2. We then compute the theoretical error probability as described in Section 2.1, using the estimated px and py|x. We validate the theory through simulations as described in Section 2.2. For the human genome, we also compare to the experimental data from DeepOM (Nogin et al. 2023).

The results are shown in Fig. 3, which shows the error probability versus DNA fragment length. For the random genome sequence, there is an almost perfect match between the theoretical and simulated error probabilities, validating the information-theoretic model in Section 2.1. For both the human and bacterial genomes, there is remarkable match between the theoretical and simulated error probabilities for shorter DNA fragment lengths, meaning that, to a good approximation, real genomes behave as random sequences, with respect to the distribution of labeled patterns. For longer fragments, the simulated probability is higher than the theoretical, which can be explained by the fact that the theoretical model assumes that the labeling pattern density has the same distribution along the genome, while in reality, genome sequences contain a multitude of anomalously distributed regions in terms of pattern density, such as repeated sequences, unmapped regions in the centromeres and telomeres in the human genome, highly similar regions in closely related bacterial species, etc. For the human genome, the experimental error rate (see Section 2.3) is slightly lower than the theoretical prediction. This can be explained by the fact that the DeepOM alignment algorithm uses the exact localized positions of the labels for the alignment to the genome sequence, which is added information that is not accounted for in the theoretical model, and aids in the accuracy of the alignment. Also, the experimental error rate vanishes for long enough fragments, which doesn’t happen in the simulation, due to the fact that the experimental error rate in the respective study was computed only for fragments from relatively conventional and nonanomalous regions of the genome. With all that said, the theoretical model still provides a good upper bound on the achievable error rate.

3.2 Design of optimal labeling patterns

Now that we have validated the theory for the accuracy of OGM for a specific labeled pattern (CTTAAG) on the human genome using experimental and simulated data, we can go beyond analysis to synthesis: the design of optimal labeling patterns, which is a direct practical implication of this work.

The results are shown in Fig. 4, where we show the error probability for different labeling patterns. The DNA fragment length was fixed, and the labeling pattern was varied. For a subset of patterns, we validated the theory by simulation as described in Section 2.2.

The examined patterns were generated as follows: all 46=4096 possible sequences over A,C,G,T of length 6 were generated, with the pattern positions chosen as the union of its positions in the genome and its reverse complement (Section 2). Note that for palindromic patterns (there are 43=64 such patterns), the reverse complement is identical to the pattern itself. Additionally, some special patterns of interest were examined (see Supplementary Table S2): nicking enzyme recognition patterns (with IUPAC codes expanded to matching sequences) from the REBASE database (Roberts et al. 2015), and some commonly used enzymes in OGM. For each pattern, we computed its average density in the genome by dividing the number of pattern occurrences by the genome length. We assumed for all patterns the same label detection likelihood py|x as used in Section 3.1. This is valid under the assumption that other biochemical labeling methods that have six-letter recognition sequences would have similar labeling errors.

It can be seen in the figure that the error probability has strong dependence on the pattern density, achieving a minimum for specific patterns with a density on the order of 1 kb. While for a random genome sequence, one would expect the length-6 pattern density to be on the order of 46 (Supplementary Fig. S2), real genome sequences behave differently, because the labeled sequences are not uniformly distributed (Fig. 4). For example, some six-letter sequences are much more frequent in the genome than others. The figure shows that for both the human genome and the bacterial genomes, the best patterns have a predicted error probability which is more than 10-fold lower than the commercially available pattern we examined experimentally, suggesting there is significant room for improvement. For example, as a direct prediction from our model, the best six-letter sequence for OGM of the human genome, would be GGAGGC (see Supplementary Data table file human_genome_p_err_vs_pattern.csv for the results for all patterns), which would lead to 10× improved accuracy, directly translatable to experimental parameters such as acquisition speed, low sample DNA quantity, simplified sample preparation, etc. In general, patterns with low density are less informative because there are too few labels on the DNA fragment to map it accurately to the reference genome, whereas patterns with high density are less informative because the label detection error per bin is too high. This leads to the conclusion that for a given target genome set with pattern density distribution px and label detection likelihood py|x of the labeling and imaging system, the theory can predict specific optimal labeling patterns which can be more than an order of magnitude better in terms of accuracy.

4 Discussion

We showed that our information-theoretic model provides a good approximation for the error probability of OGM. It depends on only four parameters: the target genome length, the DNA fragment length, and two easily estimated parameters: the label detection likelihood, estimated from experimental genome-aligned DNA fragment images, and the labeling pattern distribution, estimated from the target genome sequences. This enables the design of better OGM experiments without the assumption of a specific OGM algorithm, and allows for the intuitive understanding of the importance of different parameters on the accuracy, such as the logarithmic dependence on the target genome length versus the polynomial dependence on the fragment length (Equation 1). Additionally, the model enables fast computation due to its simple analytical form, allowing for the design of protocols where multiple patterns are labeled with multiple labeling reagents through combinatorial optimization of pattern combination selection.

In the model, we neglected several factors that might affect the actual error probability in real OGM experiments. The model assumes that an OGM algorithm has access only to bin counts of labels (Section 2.1), while there is more information that can be utilized from localized positions of labels within each bin. Modeling the process as a DMC, statistical independence of the information in neighboring bins was assumed, which was mitigated by choosing a bin size that is larger than the label localization error margin (Supplementary Fig. S1). In terms of the codebook model, due to overlaps and unknown orientation of the codewords, this implies that the codewords are not exactly independent and identically distributed (as required for the validity of the noisy channel coding theorem). The random codebook model is still rather accurate since a codeword is independent from almost all the other codewords in the codebook. We also assumed that the bin start positions are aligned with the reference genome, while in practice, the bin start positions are not known, so the reference genome sequence can be shifted by a random amount from the true position, up to the bin size. We also neglected the effect of nonuniform stretching of the DNA fragments, before they are imaged, thus the assumption of one-to-one mapping between consecutive bins in the reference genome and the DNA fragment image is not exact and will increase the size of the effective codebook. However, the dependence on the codebook size is logarithmic in Equation (1), so this effect is secondary in comparison to pattern density and fragment length.

Additionally, we applied our model to the design of optimal OGM labeling patterns for the human genome and selected bacterial genomes. However, we made some underlying assumptions that might impact the validity of the results. The label detection likelihood py|x (Table 1) was estimated (Section 2.1.2) for a specific labeling protocol and reagent, and might be different for other labeling reagents used to label other patterns. We limited our analysis to length-6 patterns, as this is the length of the pattern we examined experimentally. However, the theory can be applied to other pattern lengths, given the label detection likelihood py|x is estimated from relevant experimental data. It is important to note that some reagents, which can be used for the labeling processes, have recognition sequence patterns with characters matching more than one base type. This means that the labeling pattern is a set of recognition sequences, which we did not consider in this analysis, but it is straightforward to do so. It should also be noted here that fluorescence signal strength might vary across labeling schemes. However, as previous works have shown (Gabrieli et al. 2018), even individual fluorophores are easily detected by OGM systems.

To conclude, we developed an information-theoretic model of OGM, which enables the prediction of its accuracy and the design of optimal labeling patterns for specific applications and target organism genomes. The accuracy is predicted using parameters easily estimated from target genome sequences and experimental data. The model was validated experimentally on human DNA with a specific pattern labeled, and through simulations with varying DNA fragment lengths, various labeling patterns, and various genomes, including the human genome, bacterial genomes, and randomly generated genome sequences. The model predicts up to 10-fold better accuracy by optimal choice of labeling pattern for the human genome and bacterial genomes. Future development of biochemical reagents and protocols for labeling the patterns suggested by the model may significantly improve the accuracy and yield of OGM for applications such as epigenomic profiling and cultivation-free pathogen identification in clinical samples.

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest

None declared.

Funding

This work was supported by the Gellman-Lasser Fund [11846]; the H2020 European Research Council Horizon 2020 [802567]; Israel Science Foundation grant [1782/22]; The European Research Council consolidator [817811 to Y.E.]; and Israel Science Foundation [771/21 to Y.E.].

Data availability

All code and data used to generate the results of this work is deposited on Zenodo and is publicly available at DOI: https://doi.org/10.5281/zenodo.8285430. The numerical results of error probabilities for each pattern in Fig. 4 and Supplementary Fig. S2 are attached as supplementary csv files:

  • Human Genome:

     human_genome_p_err_vs_pattern.csv

  • Bacterial Genomes:

     bacterial_genomes_p_err_vs_pattern.csv

  • Random Genome:

     random_genomes_p_err_vs_pattern.csv

References

Abid
HZ
,
Young
E
,
McCaffrey
J
 et al.  
Customized optical mapping by CRISPR–Cas9 mediated DNA labeling with multiple sgRNAs
.
Nucleic Acids Res
 
2021
;
49
:
e8
.

Anantharaman
T
,
Mishra
B.
 False positives in genomic map assembly and sequence validation. In:
Algorithms in Bioinformatics: First International Workshop, WABI 2001, Århus, Denmark, August 28–31, 2001 Proceedings
, p.
27
40
.
Berlin, Heidelberg: Springer
,
2001
.

Bouwens
A
,
Deen
J
,
Vitale
R
 et al.  
Identifying microbial species by single-molecule DNA optical mapping and resampling statistics
.
NAR Genom Bioinform
 
2020
;
2
:
lqz007
. https://doi.org/10.1093/nargab/lqz007

Clopper
CJ
,
Pearson
ES.
 
The use of confidence or fiducial limits illustrated in the case of the binomial
.
Biometrika
 
1934
;
26
:
404
13
.

Cover
TM
,
Thomas
JA.
 
Elements of Information Theory
.
New York, NY: John Wiley & Sons, Inc., 2006

Dalhoff
C
,
Lukinavicius
G
,
Klimasăuskas
S
 et al.  
Direct transfer of extended groups from synthetic cofactors by DNA methyltransferases
.
Nat Chem Biol
 
2006
;
2
:
31
2
.

Deen
J
,
Sempels
W
,
De Dier
R
 et al.  
Combing of genomic DNA from droplets containing picograms of material
.
ACS Nano
 
2015
;
9
:
809
16
.

Deen
J
,
Vranken
C
,
Leen
V
 et al.  
Methyltransferase-directed labeling of biomolecules and its applications
.
Angew Chem Int Ed Engl
 
2017
;
56
:
5182
200
.

Dehkordi
SR
,
Luebeck
J
,
Bafna
V.
 
Fandom: fast nested distance-based seeding of optical maps
.
Patterns (N Y)
 
2021
;
2
:
100248
.

Ebert
P
,
Audano
PA
,
Zhu
Q
 et al.  
Haplotype-resolved diverse human genomes and integrated analysis of structural variation
.
Science
 
2021
;
372
:
eabf7117
.

Gabrieli
T
,
Michaeli
Y
,
Avraham
S
 et al.  
Chemoenzymatic labeling of DNA methylation patterns for single-molecule epigenetic mapping
.
Nucleic Acids Res
 
2022
;
50
:
e92
.

Gabrieli
T
,
Sharim
H
,
Nifker
G
 et al.  
Epigenetic optical mapping of 5-hydroxymethylcytosine in nanochannel arrays
.
ACS Nano
 
2018
;
12
:
7148
58
. https://doi.org/10.1021/acsnano.8b03023

Gallager
RG.
 
Information Theory and Reliable Communication
.
New York, N
: John Wiley & Sons, Inc.,
1968
.

Grunwald
A
,
Dahan
M
,
Giesbertz
A
 et al.  
Bacteriophage strain typing by rapid single molecule analysis
.
Nucleic Acids Res
 
2015
;
43
:
e117
. https://doi.org/10.1093/nar/gkv563

Hayashi
M.
 
Information spectrum approach to second-order coding rate in channel coding
.
IEEE Trans Inform Theory
 
2009
;
55
:
4947
66
.

Jeffet
J
,
Margalit
S
,
Michaeli
Y
 et al.  
Single-molecule optical genome mapping in nanochannels: multidisciplinarity at the nanoscale
.
Essays Biochem
 
2021
;
65
:
51
66
.

Levy-Sakin
M
,
Ebenstein
Y.
 
Beyond sequencing: optical mapping of DNA in the age of nanotechnology and nanoscopy
.
Curr Opin Biotechnol
 
2013
;
24
:
690
8
.

Mao
W
,
Diggavi
SN
,
Kannan
S.
 
Models and information-theoretic bounds for nanopore sequencing
.
IEEE Trans Inform Theory
 
2018
;
64
:
3216
36
.

Margalit
S
,
Abramson
Y
,
Sharim
H
 et al.  
Long reads capture simultaneous enhancer–promoter methylation status for cell-type deconvolution
.
Bioinformatics
 
2021
;
37
:
i327
3
.

Mendelowitz
L
,
Pop
M.
 
Computational methods for optical mapping
.
Gigascience
 
2014
;
3
:
33
217X
.

Mohajer
S
,
Motahari
A
,
Tse
D.
Reference-based DNA shotgun sequencing: information theoretic limits. In: 2013 IEEE International Symposium on Information Theory, Istanbul, Turkey, p.
1635
1639
, New York, NY: IEEE,
2013
.

Motahari
AS
,
Bresler
G
,
Tse
D.
 
Information theory of DNA shotgun sequencing
.
IEEE Trans Inform Theory
 
2013
;
59
:
6273
89
.

Müller
V
,
Dvirnas
A
,
Andersson
J
 et al.  
Enzyme-free optical DNA mapping of the human genome using competitive binding
.
Nucleic Acids Res
 
2019
;
47
:
e89
.

Müller
V
,
Nyblom
M
,
Johnning
A
 et al.  
Cultivation-free typing of bacteria using optical DNA mapping
.
ACS Infect Dis
 
2020
;
6
:
1076
84
. https://doi.org/10.1021/acsinfecdis.9b00464

Neely
RK
,
Dedecker
P
,
Hotta
J-I
 et al.  
DNA fluorocode: a single molecule, optical map of DNA with nanometre resolution
.
Chem Sci
 
2010
;
1
:
453
60
.

Neely
RK
,
Deen
J
,
Hofkens
J.
 
Optical mapping of DNA: single-molecule-based methods for mapping genomes
.
Biopolymers
 
2011
;
95
:
298
311
.

Nogin
Y
,
Detinis Zur
T
,
Margalit
S
 et al.  
DeepOM: single-molecule optical genome mapping via deep learning
.
Bioinformatics
 
2023
;
39
:
btad137
. https://doi.org/10.1093/bioinformatics/btad137

Nyblom
M
,
Johnning
A
,
Frykholm
K
 et al.  
Strain-level bacterial typing directly from patient samples using optical DNA mapping
.
Commun Med (Lond)
 
2023
;
3
:
31
.

Pljevaljčić
G
,
Schmidt
F
,
Weinhold
E.
 
Sequence-specific methyltransferase-induced labeling of DNA (smiling Dna)
.
Chembiochem
 
2004
;
5
:
265
9
.

Polyanskiy
Y.
Channel coding: non-asymptotic fundamental limits. Ph.D. Thesis, Princeton University,
2010
.

Polyanskiy
Y
,
Poor
HV
,
Verdú
S.
 
Channel coding rate in the finite blocklength regime
.
IEEE Trans Inform Theory
 
2010
;
56
:
2307
59
.

Roberts
RJ
,
Vincze
T
,
Posfai
J
 et al.  
Rebase—a database for DNA restriction and modification: enzymes, genes and genomes
.
Nucleic Acids Res
 
2015
;
43
:
D298
9
.

Schwartz
DC
,
Li
X
,
Hernandez
LI
 et al.  
Ordered restriction maps of Saccharomyces cerevisiae chromosomes constructed by optical mapping
.
Science
 
1993
;
262
:
110
4
.

Shannon
CE.
 
A mathematical theory of communication
.
Bell Syste Tech J
 
1948
;
27
:
379
423
.

Tan
VY.
 
Asymptotic estimates in information theory with non-vanishing error probabilities
.
FNT Commun Inform Theory
 
2014
;
10
:
1
184
.

Torchinsky
D
,
Michaeli
Y
,
Gassman
NR
 et al.  
Simultaneous detection of multiple DNA damage types by multi-colour fluorescent labelling
.
Chem Commun (Camb)
 
2019
;
55
:
11414
7
.

Valouev
A
,
Li
L
,
Liu
Y-C
 et al.  
Alignment of optical maps
.
J Comput Biol
 
2006
;
13
:
442
62
.

Wand
NO
,
Smith
DA
,
Wilkinson
AA
 et al.  
DNA barcodes for rapid, whole genome, single-molecule analyses
.
Nucleic Acids Res
 
2019
;
47
:
e68
. https://doi.org/10.1093/nar/gkz212

Weinberger
N
,
Shomorony
I.
Fundamental limits of reference-based sequence reordering. In: 2023 IEEE International Symposium on Information Theory, Taipei, Taiwan. New York, NY: IEEE,
2023
.

Wu
S
,
Jeffet
J
,
Grunwald
A
 et al.  
Microfluidic DNA combing for parallel single-molecule analysis
.
Nanotechnology
 
2019
;
30
:
045101
.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Associate Editor: Tobias Marschall
Tobias Marschall
Associate Editor
Search for other works by this author on:

Supplementary data