Dongmei Ai, Lulu Chen, Jiemin Xie, Longwei Cheng, Fang Zhang, Yihui Luan, Yang Li, Shengwei Hou, Fengzhu Sun, Li Charlie Xia, Identifying local associations in biological time series: algorithms, statistical significance, and applications, Briefings in Bioinformatics, Volume 24, Issue 6, November 2023, bbad390, https://doi.org/10.1093/bib/bbad390
Abstract
Local associations refer to spatial–temporal correlations that emerge from the biological realm, such as time-dependent gene co-expression or seasonal interactions between microbes. One can reveal the intricate dynamics and inherent interactions of biological systems by examining the biological time series data for these associations. To accomplish this goal, local similarity analysis algorithms and statistical methods that facilitate the local alignment of time series and assess the significance of the resulting alignments have been developed. Although these algorithms were initially devised for gene expression analysis from microarrays, they have been adapted and accelerated for multi-omics next generation sequencing datasets, achieving high scientific impact. In this review, we present an overview of the historical developments and recent advances for local similarity analysis algorithms, their statistical properties, and real applications in analyzing biological time series data. The benchmark data and analysis scripts used in this review are freely available at http://github.com/labxscut/lsareview.
INTRODUCTION
Identification of ecological and biochemical relationships in time-series data of a microbiome or a transcriptome is commonly achieved by correlation-based co-occurrence and co-expression analyses. However, global correlation measures such as Pearson's or Spearman's correlation are only effective when two time series are globally synchronized and associated. Yet, more complex relationships with dynamic changes are prevalent in real biological data, including localized and time-delayed associations, which have been observed in various fields such as microbial ecology [1–4], molecular biology [5, 6], and functional neuroscience [7, 8].
For instance, the regulatory relationship between ARG2 (acetyl glutamate synthase) and CAR2 (ornithine aminotransferase) depends on the expression level of CPA2 (arginine specific carbamoyl-phosphate synthase). Nonetheless, the global Pearson's correlation coefficient (PCC) between the expression levels of ARG2 and CAR2 is nearly zero [9], which illustrates how insensitive global correlation measures are to local associations. Although global correlations remain a valuable tool, more flexible alternative hypotheses are needed to detect the intricate relationships that exist in biological time series.
To address the limitations of global correlations, researchers have introduced numerous methodologies, a prominent one of which is local similarity analysis (LSA) [10–13]. LSA is a local alignment-based method that identifies the alignment configuration maximizing the local similarity score between two time series, which enables the detection of local and potentially time-delayed pairwise associations. A related method and variant of LSA, local trend analysis (LTA) [14–16], identifies alignment configurations and correlations between pairs of trend-of-change series, representing up, down, and no-change statuses. Both LSA and LTA also evaluate the statistical significance of the identified associations. Notably, the term local similarity has been used not only in biology [17] but also in chemistry [18], public health [19], and power engineering [20], with different meanings. We refer readers to the respective papers and reviews.
This paper deals exclusively with local similarity in the context of biological time series. In this context, local similarity analysis plays a pivotal role in uncovering potential relationships between microbial species richness and environmental factors. Local similarity becomes particularly relevant when actual correlations are limited to specific time subintervals or are subject to time delays [10]. Local similarity analysis has been applied extensively to study the spatiotemporal evolution of microbial communities across diverse environments. Importantly, the latent relationships between these microbial species cannot be effectively discerned through conventional correlation-based analysis methods.
Other methods exist for identifying diverse types of complex associations [9, 21–23]. For example, the liquid association (LA) technique [9, 22, 24] detects triplet rather than pairwise associations, while the maximal information coefficient (MIC) method identifies unspecified generic pairwise associations [25]. Granger causality analysis allows for identification of potential causal relationships in molecular ecological time-series data [26]. Furthermore, LA combined with LSA can be utilized to detect mediated covariant dynamics to discover the third-party mediator variable [27]. In this review, we focus on the advancement of LSA and LTA, including algorithms, statistical significance evaluation methods, software tools availability and their applications. Papers [28, 29] provide further information regarding related correlation measures and their applications.
LSA is founded on the concept of local sequence alignment: it uses a dynamic programming algorithm, similar to the Smith-Waterman algorithm [17], to detect the optimal time series alignment configuration that maximizes the local similarity score. Qian et al. first introduced the method for gene expression profile analysis [10]. Building on this method, LSA has been extended to microbiome data, including operational taxonomic unit (OTU) analysis by Ruan et al. [13] and metagenomic data analysis by Xia et al. [12]. Methodological advances have also been made to LSA, including support for replicate series data by Xia et al. [12], theoretical approximations of statistical significance (i.e., p-values) reported by Xia et al. [11] and Durno et al. [30], and more advanced methods for calculating p-values, such as moving block bootstrap LSA (MBBLSA) [31] and data-driven LSA (DDLSA) by Zhang et al. [32].
Among the many recently developed correlation measures for extracting meaningful association patterns from time series data, LSA by Ruan et al. [13] and the extended local similarity analysis tool eLSA, a significantly accelerated implementation by Xia et al. [11, 12], remain advantageous. For instance, Wang et al. [33] developed a count statistic and compared its performance with LSA; the results revealed that, as noise increased for pairs of correlated subsequences, LSA remained consistently robust and became more effective. Similarly, Tackmann et al. [34] introduced the Flashweave algorithm, designed to infer high-resolution interaction networks, but it was outperformed by eLSA when benchmarked on realistic data simulated by Weiss et al. [28].
Since 2011, the eLSA tool has been widely used in microbial data analyses, leading to numerous discoveries. For instance, Liu et al. [35] analyzed with eLSA the correlations between bacterial OTUs and phytoplankton in a reservoir and obtained a dynamical interaction network with significant local ecological associations. Similarly, Thiriet-Rupert et al. [36] constructed a gene co-expression network for a strain of microalgae to determine the relationships between differentially expressed transcription factors and mutant phenotypes by eLSA. Lee et al. [37] utilized eLSA to evaluate the time-dependent relationships among rhizome repair, environmental conditions, and bacterial genera in diesel-contaminated soil and obtained the correlation network.
Moreover, eLSA has been employed in studies of diverse subjects: to investigate the correlation between marine paleontology and microbial communities by Parada et al. [38], to infer the diverse ecological relationships of microorganisms by Jones et al. [39], to analyze the association network of microbial taxa of digested sludge by Liang et al. [40], and, by many other researchers, to identify the complex relationships among marine phytoplankton, bacteria and viruses [41, 42], between viral bacteriophages and their hosts [43], and between the abundance of antibiotic resistance genes and bacterial communities [44].
Additionally, eLSA's capability of identifying correlations with time shifts has also been widely used. For instance, Posch et al. [45] constructed an interaction network between ciliates and phytoplankton with eLSA to study the correlations with a time lag. They detected such associations among ciliates as well as between ciliates and algae. Džunková et al. [46] investigated the correlation among bacterial OTUs in the oral cavity of different individuals, considering the temporal delay. In another example, eLSA was utilized to evaluate the time-delayed correlation between microbial members of an activated sludge and environmental variables [47].
LSA-related studies encouraged the development of methods sharing its rationale. One such method is local trend analysis (LTA), introduced by He et al. [14] and Ji et al. [15], which analyzes the trend of change in gene expression profile series. In LTA, a local alignment algorithm is applied to the trend-transformed series of the original data. Trend transformation involves taking the value differences of adjacent time points and discretizing them into {1, 0, −1}, which represent the {up, no-change, down} trending statuses, respectively. Xia et al. [16] developed a theory to approximate the statistical significance of LTA, which Shan et al. further improved in a method known as stationary theoretical local trend analysis (STLTA) [48].
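The trend transformation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the eLSA implementation: the function name `trend_transform` and the `threshold` parameter (the minimum absolute change that counts as a trend) are assumptions introduced here.

```python
import numpy as np

def trend_transform(x, threshold=0.0):
    """Map a series of length n to a trend series of length n - 1 over {1, 0, -1}."""
    x = np.asarray(x, dtype=float)
    diff = np.diff(x)                      # value differences of adjacent time points
    trend = np.zeros(diff.shape, dtype=int)
    trend[diff > threshold] = 1            # up
    trend[diff < -threshold] = -1          # down
    return trend                           # entries left at 0 mean no-change

series = [1.0, 2.0, 2.0, 1.5, 3.0]
print(trend_transform(series))
```

With `threshold=0.0`, any nonzero difference counts as up or down; a larger threshold would assign small fluctuations to the no-change state.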
Numerous studies have demonstrated the efficacy of LSA and LTA in inferring time-dependent associations in contrast to traditional global correlation measures. The added time dimension facilitates mining causality behind the delay. Additionally, the highly efficient theoretical approximation of statistical significance, i.e., the p-value, has played a critical role in the widespread use of LSA and LTA analyses. These methods have been successfully applied to large-scale next-generation sequencing (NGS)-based datasets that would otherwise be impractical to analyze using the much slower permutation-based p-value estimation. In this paper, we provide an in-depth review of the relevant algorithms and theories for LSA and LTA, as well as their recent improvements, including eLSA, fastLSA, MBBLSA, DDLSA, and STLTA.
LOCAL SIMILARITY ANALYSIS OF BIOLOGICAL SERIES DATA
Biological time series data
Time-series data are critical resources for studying the dynamics of biological systems. These data may be the gene expression levels in gene regulation studies or environmental and organismal factor levels in ecological studies. The nature of time-series data can be highly variable with values representing different measurements depending on the dataset.
Before conducting LSA on the original dataset, one needs to normalize the data. This step transforms raw values into standardized values with a mean of zero and a variance of one. However, one obstacle in standardizing count-based time series data is the high frequency of zeros. Excluding zero values from normalization procedures may be preferable, since such values may be due to limited detection sensitivity or random dropouts rather than true absence. Various normalization schemes are available, such as Z-score normalization and percentile normalization. Xia et al. used percentile normalization with sparsity adjustment, which generally provides the best and most robust results across datasets, even for datasets with non-linear associations (see [12] for the detailed procedure).
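To make the two normalization schemes concrete, the sketch below implements a plain Z-score and one common rank-based form of percentile normalization (ranks mapped to normal quantiles, then standardized). It is an assumption-laden illustration: the function names are hypothetical, and the sparsity adjustment of Xia et al. [12] is not reproduced here.

```python
import numpy as np
from statistics import NormalDist

def zscore_normalize(x):
    """Standardize a series to mean 0 and variance 1 (constant series map to zeros)."""
    x = np.asarray(x, dtype=float)
    sd = x.std()
    if sd == 0:
        return np.zeros_like(x)
    return (x - x.mean()) / sd

def percentile_normalize(x):
    """Rank-based normal-score transform followed by Z-scoring (assumed form)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(n)
    ranks[order] = np.arange(1, n + 1)
    for value in np.unique(x):             # tied values share their average rank
        mask = x == value
        ranks[mask] = ranks[mask].mean()
    normal = NormalDist()
    scores = np.array([normal.inv_cdf(r / (n + 1)) for r in ranks])
    return zscore_normalize(scores)

counts = [10.0, 0.0, 5.0, 1000.0]
print(zscore_normalize(counts))
print(percentile_normalize(counts))
```

The rank-based variant depends only on the ordering of the values, so a single extreme count such as 1000 does not dominate the normalized series the way it does under the Z-score.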
Local alignment and similarity score of biological series data
To explain the LSA process, we consider two factors with normalized time-series data at levels
|
|
|
|
Note: sgn(.) is the sign function.
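As an illustration of the dynamic programming idea behind LSA, the following Python sketch tracks two score matrices, one accumulating positive and one accumulating negative products, restricted to alignments within a maximum delay. It is a sketch under stated assumptions and not the eLSA source code: the function name `local_similarity`, the parameter `max_delay`, and the scaling of the score by the series length are choices made here for illustration.

```python
import numpy as np

def local_similarity(x, y, max_delay=0):
    """Signed local similarity score of two normalized series of equal length."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    if len(y) != n:
        raise ValueError("series must have equal length")
    pos = np.zeros((n + 1, n + 1))   # best positively associated run ending at (i, j)
    neg = np.zeros((n + 1, n + 1))   # best negatively associated run ending at (i, j)
    for i in range(1, n + 1):
        # Only cells with |i - j| <= max_delay are filled; the rest stay zero.
        for j in range(max(1, i - max_delay), min(n, i + max_delay) + 1):
            prod = x[i - 1] * y[j - 1]
            pos[i, j] = max(0.0, pos[i - 1, j - 1] + prod)
            neg[i, j] = max(0.0, neg[i - 1, j - 1] - prod)
    best_pos, best_neg = pos.max(), neg.max()
    sign = 1.0 if best_pos >= best_neg else -1.0
    return sign * max(best_pos, best_neg) / n

x = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
y = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]   # x shifted by one time point
print(local_similarity(x, y, max_delay=0))
print(local_similarity(x, y, max_delay=1))
```

In this toy example the second series is a one-step delayed copy of the first, so allowing a delay of one time point yields a higher score than forcing a synchronous alignment.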
This local alignment algorithm has unique features compared to Smith-Waterman's algorithm. First, the maximum time delay,
Local similarity score was initially scaled by
Local similarity score of replicated series data
Assessing variability through replication is important for statistical inference [49–51]. The original LSA method only considered series data without replications. Xia et al. first proposed the extended local similarity analysis (eLSA) method to incorporate series data with replications [12]. The eLSA assumes that each sample has
|
|
|
|
Note: sgn(.) is the sign function.
In this modified algorithm, the scalars
Transformation functions to summarize replicates
To model the repeated measure and account for variability in eLSA, Xia et al. introduced the summarizing function
Summarizing Method . | |
---|---|
Mean | |
SD | |
Med | |
MAD |
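The four summarizing methods named in the table can be sketched as follows. The formulas used here are assumptions, since the table cells did not survive: "Mean" and "Med" are taken as the replicate mean and median, while "SD" and "MAD" are taken as the mean divided by the standard deviation and the median divided by the median absolute deviation, respectively. The function name and the fallback for zero spread are also hypothetical.

```python
import numpy as np

def summarize_replicates(reps, method="Mean"):
    """Collapse an (m replicates x n time points) array into one series of length n."""
    reps = np.asarray(reps, dtype=float)
    if method == "Mean":
        return reps.mean(axis=0)
    if method == "Med":
        return np.median(reps, axis=0)
    if method == "SD":
        centre = reps.mean(axis=0)
        spread = reps.std(axis=0, ddof=1)
    elif method == "MAD":
        centre = np.median(reps, axis=0)
        spread = np.median(np.abs(reps - centre), axis=0)
    else:
        raise ValueError(f"unknown method: {method}")
    # Where the spread is zero, fall back to the unweighted centre (assumption).
    return np.divide(centre, spread, out=centre.copy(), where=spread > 0)

reps = np.array([[1.0, 2.0],
                 [3.0, 2.0]])
for name in ("Mean", "SD", "Med", "MAD"):
    print(name, summarize_replicates(reps, name))
```

Dividing by a spread estimate down-weights time points at which the replicates disagree, which is the motivation for the SD and MAD variants under the reading assumed here.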
Bootstrap confidence interval for local similarity score
The bootstrap method provides an empirical confidence interval (CI) for local similarity score
A statistical theory for the significance of local similarity score
Permutation statistical significance
Non-replicated data
In the original non-replicated data LSA analysis, the statistical significance (p-value) of an LS score was computed by permutation. This process involved the following steps:
(i) For each pair of time series
(ii) Compute a permuted LS score for the permuted pair with the LSA algorithm (Algorithm 1).
(iii) Repeat steps (i) and (ii)
(iv) The p-value is determined as the proportion of permuted LS scores that were at least as great as
where
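The permutation procedure in steps (i) to (iv) can be sketched as below. The sketch makes several assumptions that the steps above do not fix: only the first series is shuffled, scores are compared in absolute value, and the helper `ls_score`, the parameter names, and the default of 200 permutations are all illustrative.

```python
import numpy as np

def ls_score(x, y, max_delay=0):
    """Signed local similarity score via the two-matrix dynamic programme."""
    n = len(x)
    pos = np.zeros((n + 1, n + 1))
    neg = np.zeros((n + 1, n + 1))
    for i in range(1, n + 1):
        for j in range(max(1, i - max_delay), min(n, i + max_delay) + 1):
            prod = x[i - 1] * y[j - 1]
            pos[i, j] = max(0.0, pos[i - 1, j - 1] + prod)
            neg[i, j] = max(0.0, neg[i - 1, j - 1] - prod)
    best_pos, best_neg = pos.max(), neg.max()
    return (best_pos if best_pos >= best_neg else -best_neg) / n

def permutation_pvalue(x, y, max_delay=0, n_perm=200, seed=0):
    """Proportion of permuted scores at least as extreme as the observed score."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = abs(ls_score(x, y, max_delay))
    hits = 0
    for _ in range(n_perm):
        permuted = rng.permutation(x)          # step (i): shuffle one series
        if abs(ls_score(permuted, y, max_delay)) >= observed:   # step (ii)
            hits += 1
    return hits / n_perm                       # step (iv)

t = np.linspace(0.0, 4.0 * np.pi, 30)
x = np.sin(t)
x = (x - x.mean()) / x.std()
y = x.copy()                                   # a perfectly associated pair
p_value = permutation_pvalue(x, y)
print(p_value)
```

Because the resolution of the estimate is 1/`n_perm`, reporting small p-values reliably requires many permutations, each of which reruns the dynamic programme.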
Replicated data
In the case of replicated data eLSA analysis, the permutation test is modified such that the series data
Moving block bootstrap
Zhang et al. [31] developed a moving block bootstrap method (MBBLSA) to replace point-wise permutation and compute the statistical p-value of LSA. The MBBLSA calculation process is as follows:
(i) Calculate the LS scores
(ii) Divide the time series into
(iii) Randomly select a block from all of the blocks and use its values as the first
(iv) Calculate the LS scores of
(v) Repeat steps (iii) and (iv) to obtain
(vi) Calculate the bootstrap p-value as in Eq. (2).
Regarding the block length, please refer to the block length selector proposed by Sherman et al. [57]:
where
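The resampling step at the core of a moving block bootstrap can be sketched as follows. This is a generic illustration, not the MBBLSA code: overlapping blocks, sampling with replacement, truncation to the original length, and the names `moving_block_resample` and `block_length` are assumptions, and the block length is passed in directly rather than computed by a selector.

```python
import numpy as np

def moving_block_resample(x, block_length, rng):
    """Build a resampled series by concatenating randomly chosen contiguous blocks."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if not 1 <= block_length <= n:
        raise ValueError("block_length must be between 1 and len(x)")
    n_blocks = n - block_length + 1            # number of overlapping blocks
    n_draws = -(-n // block_length)            # ceil(n / block_length)
    starts = rng.integers(0, n_blocks, size=n_draws)
    pieces = [x[s:s + block_length] for s in starts]
    return np.concatenate(pieces)[:n]          # truncate to the original length

rng = np.random.default_rng(0)
series = np.arange(10.0)
resampled = moving_block_resample(series, block_length=3, rng=rng)
print(resampled)
```

Unlike a point-wise permutation, each block keeps the original ordering of its time points, so short-range autocorrelation within the series is preserved in the resampled data.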
Bottlenecks of permutation and bootstrap procedures
Although the p-value can be computed with any desired degree of accuracy using a large number of permutations, this approach can be computationally costly. The time complexity for calculating the LS score is
Theoretical statistical significance for local similarity scores
Feller’s theory for the range of random walk excursion
Due to computational complexity, permutation-based procedures are only practical for pairwise LSA analysis with a few hundred factors. Therefore, it is necessary to develop effective theoretical p-value approximations to alleviate this problem. One direction involves adapting Feller's theory for the range of random walk excursions to the settings of LSA. Feller [58] investigated the approximate distribution of the range of the sum of
A theory for local similarity score
Xia et al. [11] proposed the LS score,
Then, the tail probability
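For readers who want to experiment with a theoretical approximation of this kind, the sketch below evaluates a series of the Feller type for the tail probability of the LS score. The specific expression coded here is an assumption that should be checked against Xia et al. [11]: with x = |s|√n/σ and a maximum delay D, it computes 1 − 8^(2D+1) [Σ_k (1/x² + 1/((2k−1)²π²)) exp(−(2k−1)²π²/(2x²))]^(2D+1). The function and parameter names are hypothetical.

```python
import math

def theoretical_pvalue(ls_score, n, max_delay=0, sigma=1.0,
                       tol=1e-15, max_terms=100000):
    """Approximate tail probability of an LS score (assumed series form)."""
    x = abs(ls_score) * math.sqrt(n) / sigma   # standardized score
    if x == 0.0:
        return 1.0
    width = 2 * max_delay + 1                  # number of delays considered
    total = 0.0
    for k in range(1, max_terms + 1):
        c = (2 * k - 1) ** 2
        term = (1.0 / x**2 + 1.0 / (c * math.pi**2)) * math.exp(
            -c * math.pi**2 / (2.0 * x**2))
        total += term
        if term < tol:                         # terms decrease with k
            break
    p = 1.0 - (8.0 * total) ** width
    return min(1.0, max(0.0, p))               # guard against rounding error

for score in (0.1, 0.2, 0.3):
    print(score, theoretical_pvalue(score, n=100))
```

Each evaluation costs only a handful of series terms, which is why an approximation of this type is so much cheaper than rerunning the alignment for every permutation.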
A generalization of the theory of local similarity scoring
The approach proposed by Xia et al. [11] is effective only for independent and identically distributed (i.i.d.) sequences. However, in practice, many sequences are not i.i.d. To solve this issue, Zhang et al. [32] introduced the long-term variance
Accurately estimating
where
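A long-run variance of a dependent series is commonly estimated by a kernel-weighted sum of sample autocovariances. The sketch below uses a Bartlett (triangular) kernel with a user-supplied bandwidth; this is a generic, swapped-in estimator chosen for illustration and is not claimed to be the data-driven estimator of Zhang et al. [32]. The function name and the `bandwidth` parameter are assumptions.

```python
import numpy as np

def long_run_variance(z, bandwidth):
    """Bartlett-kernel estimate of the long-run variance of a series."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    if not 0 <= bandwidth < n:
        raise ValueError("bandwidth must satisfy 0 <= bandwidth < len(z)")
    centred = z - z.mean()
    lrv = np.dot(centred, centred) / n                 # lag-0 autocovariance
    for h in range(1, bandwidth + 1):
        gamma_h = np.dot(centred[:-h], centred[h:]) / n
        weight = 1.0 - h / (bandwidth + 1.0)           # Bartlett weight
        lrv += 2.0 * weight * gamma_h
    return lrv

z = np.array([1.0, -1.0] * 50)                         # strongly anti-correlated series
print(long_run_variance(z, bandwidth=0))
print(long_run_variance(z, bandwidth=1))
```

With a bandwidth of zero the estimate reduces to the ordinary variance, which is the i.i.d. case; nonzero bandwidths fold in the autocovariances that an i.i.d. theory ignores.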
An upper bound approach for statistical significance
Durno et al. [30] proposed an alternative upper bound for the LS score p-value with the fastLSA software. They slightly modified the presentation of the LSA algorithm (Algorithm 1) to enable derivation of their bound for the tail distribution of the LS score statistics. They defined
For aligned sequences starting from
where
Durno et al. acknowledged the limitations of their approach. Note that Eq. (8) is asymptotic, and
A comparison of statistical significance approximation approaches
We compared five available open-source LSA software tools that implement either the permutation or theoretical approaches for estimating the statistical significance of LS scores, as presented in Table 2. The tools include eLSA [11, 12], fastLSA [30], LocSim [13], MBBLSA [31] and DDLSA [32]. Both eLSA and fastLSA are implemented in efficient C++ code, while LocSim, MBBLSA, and DDLSA are written in R. eLSA offers both theoretical (Xia et al. [11]) and permutation-based p-value calculations. fastLSA and DDLSA provide only the theoretical p-value calculation, whereas LocSim and MBBLSA implement only the permutation approach. eLSA also includes additional features, such as trend series analysis, replicated data analysis, and bootstrap confidence intervals. Furthermore, Figure 1 provides a decision tree for selecting appropriate software tools for various tasks or datasets.
. | LocSim . | eLSA . | fastLSA . | MBBLSA . | DDLSA . |
---|---|---|---|---|---|
Reference | Ruan et al. 2006 [13] | Xia et al. 2011,2013 [11, 12] | Durno et al. 2013 [30] | Zhang et al 2018 [31] | Zhang et al 2019 [32] |
Homepage | https://www-rcf.usc.edu/~fsun/Programs/local_similarity/lsaMain.html | https://github.com/labxscut/elsa | http://www.cmde.science.ubc.ca/hallam/fastLSA | https://github.com/BlueStamford/MBBLSA | https://github.com/BlueStamford/DDLSA |
Language | R | Python and C++ | C++ | R | R |
Replicated Data | No | Yes | No | Yes | Yes |
Confidence Interval | No | Yes | No | No | No |
p-value | Permutation | Theory and Permutation | Theory | Permutation | Theory |

Furthermore, we compared the performance of the two families of p-value approximation approaches: theory and permutation. The theoretical methods were implemented in eLSA (Xia et al. [11]), fastLSA (Durno et al. [30]) and DDLSA (Zhang et al. [32]). The permutation approaches were implemented in LocSim (Ruan et al. [13]) and MBBLSA (Zhang et al. [31]). The LocSim R tool was excluded from this comparison because its permutation approach is the same as that in eLSA but slower.
We obtained the benchmark data prepared by Weiss et al. from GitHub (https://github.com/wdwvt1/correlations) [28]. For the benchmark in this paper, we used a portion of their data, which is shown in Table 3. Tables S1–S4 (originally Tables 2.6–2.9 in Weiss et al.) simulate the Lotka-Volterra relationship of two species under different constraints. The Lotka-Volterra relationships describe
Data File . | Data Name . | Simulating Model . | Data Source . |
---|---|---|---|
Tables S1 | TLV-even-relative | Two-species Lotka-Volterra(TLV), even indices, relative abundance | Weiss 2016 [28] Table Set 2.6 |
Tables S2 | TLV-even-counts | Two-species Lotka-Volterra(TLV), even indices, counts | Weiss 2016 [28] Table Set 2.7 |
Tables S3 | TLV-random-relative | Two-species Lotka-Volterra(TLV), random indices, relative abundance | Weiss 2016 [28] Table Set 2.8 |
Tables S4 | TLV-random-counts | Two-species Lotka-Volterra(TLV), random indices, counts | Weiss 2016 [28] Table Set 2.9 |
Note: Tables S1-S10 can be obtained at http://github.com/labxscut/lsareview.
Total pairs correctly identified by the LSA software tools in benchmark (p-value <0.05)
Tool | TLV-even-relative (n = 4770): #accurate | rank | TLV-even-counts (n = 4366): #accurate | rank | TLV-random-relative (n = 4770): #accurate | rank | TLV-random-counts (n = 4285): #accurate | rank |
---|---|---|---|---|---|---|---|---|
eLSA(perm) | 4486 | 5 | 4148 | 4 | 4466 | 4 | 4055 | 5 |
eLSA(theo) | 4745 | 1 | 4356 | 1 | 4731 | 1 | 4232 | 1 |
fastLSA | 4615 | 3 | 4201 | 3 | 4660 | 2 | 4145 | 3 |
DDLSA | 4696 | 2 | 4285 | 2 | 4654 | 3 | 4183 | 2 |
MBBLSA | 4507 | 4 | 4131 | 5 | 4465 | 5 | 4056 | 4 |

Figure 2. ROC curves comparing four LSA software tools and their implemented p-value approximation approaches. Panels A–D show the Area Under the Curve (AUC) scores of eLSA, fastLSA, DDLSA and MBBLSA under different data simulation models. The approaches compared are eLSA with permutation (eLSA-perm), eLSA with theoretical approximation (eLSA-theo), fastLSA and DDLSA (both theoretical), and MBBLSA (permutation-based). Further details about these approaches can be found in the main text.
After pre-processing the raw tables, we filtered out sequences with more than 80% time points being zero values. In the final analysis, Tables S1 and S3 contained 4770 pairs of sequences, including 10 pairs of time series with true correlation. Table S2 contained 4366 pairs of sequences, including 6 pairs of time series with true correlation, and Table S4 contained 4285 pairs of total sequences, including 5 pairs of time series with true correlation. We then input these series data into each LSA analytical software tool to calculate the pairwise LS scores between covariates.
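The zero-filtering step can be expressed compactly. The sketch below is illustrative only: it assumes a table with one series per row and one time point per column, and the function name and the `max_zero_fraction` parameter are hypothetical.

```python
import numpy as np

def filter_sparse_series(table, max_zero_fraction=0.8):
    """Drop rows (series) in which more than max_zero_fraction of values are zero."""
    table = np.asarray(table, dtype=float)
    zero_fraction = (table == 0).mean(axis=1)   # share of zero time points per row
    keep = zero_fraction <= max_zero_fraction
    return table[keep], keep

table = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 3],   # 90% zeros: removed
    [1, 0, 2, 0, 3, 0, 4, 0, 5, 0],   # 50% zeros: kept
    [4, 5, 6, 7, 8, 9, 1, 2, 3, 4],   # no zeros: kept
])
filtered, keep = filter_sparse_series(table)
print(keep)
print(filtered.shape)
```

Filtering before the pairwise analysis also reduces the number of pairs to score, since the pair count grows quadratically with the number of retained series.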
We conducted benchmark tests on the two-species Lotka-Volterra (TLV) datasets with the eLSA, fastLSA, MBBLSA, and DDLSA tools. The performance of each tool was illustrated in Figure 2 by the Receiver Operating Characteristic (ROC) curve (the pROC R package). The curves showed that the eLSA theory (theo), eLSA permutation (perm), and DDLSA methods generally outperformed others in the tests. Specifically, the eLSA (theo) method demonstrated the highest overall accuracy compared to all other methods. Furthermore, we evaluated the tools with data simulating many-species Lotka-Volterra relationship and compared the resultant ROC curves, as shown in Supplementary Table S1 and Figure S1. We observed that the DDLSA methods yielded the best performance in those tests.
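The benchmark above used the pROC R package; as a language-neutral illustration of what an AUC measures, the sketch below computes it directly as the probability that a truly associated pair receives a better score than an unassociated pair. The labels and p-values are made-up toy numbers, not benchmark results, and the function name is hypothetical.

```python
import numpy as np

def roc_auc(labels, scores):
    """AUC as the rank statistic P(score_pos > score_neg) + 0.5 * P(tie)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one positive and one negative pair")
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Toy example: 1 marks a truly associated pair; smaller p-values should rank higher,
# so the negated p-value is used as the score.
labels = [1, 1, 0, 0, 0]
p_values = np.array([0.001, 0.2, 0.04, 0.5, 0.9])
auc = roc_auc(labels, -p_values)
print(auc)
```

With very few true pairs among thousands of candidates, as in these datasets, the AUC is driven by how highly the handful of true pairs rank, so small ranking differences between tools can change it noticeably.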
Furthermore, Table 4 ranks the tested software tools by the number of correctly identified sequence pairs. In the table, n represents the total number of OTU pairs calculated by the tool, and the #accurate column records the number of pairs correctly called by the tool. Among these tools, the eLSA (theo) method exhibited the best performance and ranked first in overall accuracy.
We also compared the computational efficiency of the four LSA software tools in terms of total running time, measured in minutes, in the benchmark tests (see Table 5). The results indicated that all theoretical approaches, namely eLSA (theo), fastLSA and DDLSA, were computationally efficient, running hundreds of times faster than the permutation approaches. They finished the analysis within seconds when handling hundreds of covariates (OTUs). In our experience, the theory-based approaches can scale up to thousands of covariates on a current single-processor personal computer, while the permutation approaches become too slow to finish. Overall, the benchmark tests suggest that eLSA (theo) is the recommended choice for LSA analysis, considering overall accuracy and efficiency.
. | TLV-even-relative (n = 4950) . | TLV-even-counts (n = 4465) . | TLV-random-relative (n = 4950) . | TLV-random-counts (n = 4371) . |
---|---|---|---|---|
eLSA(perm) | 513.8 | 468.7 | 613.7 | 466 |
eLSA(theo) | 1.8 | 1.3 | 1.7 | 1.3 |
fastLSA | <0.1 | <0.1 | <0.1 | <0.1 |
DDLSA | 0.2 | 0.2 | 0.2 | 0.2 |
MBBLSA | 439.7 | 302 | 410 | 296.6 |
Local trend analysis and statistical significance
Local trend series and local trend analysis
A related analysis technique, local trend analysis (LTA), was also implemented in eLSA for time-dependent trend association mining. Recent studies have suggested that the similarity of increasing, stabilizing or decreasing trends along the timeline can be a strong indicator of associations. Trend series analysis was therefore developed to analyze the transformed series. Ji et al. [15] explored this approach by coding the trends of
In LTA, the initial step is to discretize the series profile using the changing trend alphabet
For the given original time series
where
Statistical significance of local trend score
Similar to LSA, a theory for the statistical significance of the LT score in LTA is necessary. After transformation, the resulting trend series is no longer independent and identically distributed (i.i.d.). As illustrated in Figure 3, it has a dependence structure wherein the dependence relationship is diminishing over time. Xia et al. [16] demonstrated that a first-order Markov chain provides a good approximation of the resulting dependent trend series in practice:
Therefore, the LT score
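One way to inspect the first-order Markov approximation empirically is to tabulate the transitions between consecutive trend states and normalize each row. The sketch below does this for a trend series over {−1, 0, 1}; the state ordering, the function name, and the convention of leaving unvisited states as all-zero rows are assumptions made for illustration.

```python
import numpy as np

STATES = (-1, 0, 1)   # down, no-change, up

def transition_matrix(trend):
    """Row-normalized counts of transitions between consecutive trend states."""
    index = {state: k for k, state in enumerate(STATES)}
    counts = np.zeros((len(STATES), len(STATES)))
    for current, following in zip(trend[:-1], trend[1:]):
        counts[index[int(current)], index[int(following)]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows for states that never occur are left as zeros.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

trend = [1, -1, 1, -1, 1]
P = transition_matrix(trend)
print(P)
```

Rows of the estimated matrix that differ from one another indicate that the next trend state depends on the current one, which is the dependence the Markov model is meant to capture.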
The theory by Xia et al. for LS score can be extended to Markov random variables to approximate the p-value of LT score. Daudin et al. [87] explored the distribution of the maximum cumulative sum of first-order Markov chain with values taken from a finite subset of R. Let
Therefore, following Eq. (6), the p-value approximation formula for LT score is:
Eq. (13) has the same form as Eq. (6), the only difference being that the standard deviation
The analysis by Xia et al. [16] assumed that the original sequence is i.i.d. Shan et al. [48] relaxed this assumption by applying the spectral decomposition theory of matrices, which enables the solution of
where
Under the condition that
was proposed by Shan et al. [48] as plug-in estimates to Eq. (13) for computing the LTA p-value. When
where
If
was proposed by Shan et al. [48] as plug-in estimates to Eq. (13) for computing the LTA p-value.
Shan et al. [48] also studied the mixed-state Markov chain model. When
the proposed plug-in estimates to Eq. (13) for computing the LTA p-value.
Shan et al. [48] referred to this method as STLTA. Through simulation, they demonstrated that the Type I error rate of the STLTA method closely aligns with the specified significance level as the number of time points increases, which indicates that the STLTA method is effective asymptotically.
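The first-order Markov chain view of a trend series used in this subsection can be made concrete with a short Python sketch. It estimates an empirical transition matrix from an observed trend series and extracts a stationary distribution from the left eigenvector for the eigenvalue closest to 1. This is only an illustration of the modelling idea: the function names and the uniform fallback for unobserved states are assumptions, and the sketch is not the plug-in estimator of Xia et al. [16] or the STLTA method of Shan et al. [48].

```python
import numpy as np

def transition_matrix(trend, states=(-1, 0, 1)):
    """Empirical first-order transition probabilities of a trend series."""
    index = {s: k for k, s in enumerate(states)}
    counts = np.zeros((len(states), len(states)))
    for a, b in zip(trend[:-1], trend[1:]):
        counts[index[int(a)], index[int(b)]] += 1

    row_sums = counts.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    # Assumption: states never observed as a source get a uniform row,
    # so that every row still sums to one.
    return np.where(row_sums > 0, counts / safe_sums, 1.0 / len(states))

def stationary_distribution(P):
    """Left eigenvector of P for the eigenvalue closest to 1, normalized."""
    vals, vecs = np.linalg.eig(P.T)
    k = np.argmin(np.abs(vals - 1.0))
    pi = np.real(vecs[:, k])
    return pi / pi.sum()

if __name__ == "__main__":
    trend = [1, 1, -1, -1, 1, 1, -1, -1, 1]
    P = transition_matrix(trend)
    print(P)
    print(stationary_distribution(P))
```

For a chain with more than one closed class the stationary distribution is not unique, so the eigenvector returned here should be treated with care in that case.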
Finally, Table 6 summarizes LSA and LTA software tools and their real applications across data domains and specialties, giving practitioners a convenient overview of how widely these tools have been adopted.
Method (Software) | Domain | Data Source | Example Applications |
---|---|---|---|
LSA (eLSA) | Health Science | 16S rRNA Seq | Refs. [46, 60–63] |
LSA (eLSA) | Health Science | Metagenomics | Ref. [64] |
LSA (eLSA) | Food / Agriculture | 16S rRNA Seq | Refs. [4, 44, 65–67] |
LSA (eLSA) | Ecology | Metagenomics | Refs. [43, 47, 68] |
LSA (eLSA) | Ecology | RNA-Seq | Ref. [36] |
LSA (eLSA) | Ecology | 16S rRNA Seq | Refs. [41, 69–79] |
LSA (eLSA) | Bioenergy | 16S rRNA Seq | Refs. [80, 81] |
LSA (fastLSA) | Ecology | 16S rRNA Seq | Refs. [82–86] |
LTA | Molecular systems biology | RNA-Seq | Refs. [88, 89] |
LTA | Biomedicine | RNA-Seq | Ref. [90] |

An illustration of how a local trend time series is generated. Adjacent values of the original continuous time series are compared and thresholded to create a new trend series, which is a categorical series suitable for local trend analysis. This enables the identification of aligned directions of change.
DISCUSSION AND CONCLUSIONS
Local association analysis methods, such as local similarity analysis (LSA) and local trend analysis (LTA), are powerful tools for identifying time-dependent associations in biological time series data. These methods are widely applied in hypothesis generation to determine the most relevant interactions in biological systems. Standard dynamic programming algorithms are often employed to find the optimal time series alignment and compute local similarity and trend scores. Advances in the theoretical approximation of statistical significance have addressed the computational bottleneck commonly faced by researchers, notably when analysing large-scale NGS datasets. In addition, LSA and LTA methods have been brought under a common conceptual framework of random walk excursions for theoretical study. Nevertheless, computational and theoretical challenges remain when local associations contain gaps, when tens of thousands or more variables are compared all-to-all, or when a local association involves three or more co-factors. Further development of novel algorithms and statistical theories will be necessary to enable these analyses in the future.
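The dynamic programming step mentioned above can be sketched in Python as follows. The sketch assumes two equal-length series that have already been normalized, a maximum allowed delay `max_delay`, and a score scaled by the series length; the function name and its parameters are hypothetical, and the code is a simplified illustration rather than the eLSA implementation.

```python
import numpy as np

def local_similarity_score(x, y, max_delay=3):
    """Signed local similarity score of two normalized, equal-length series.

    pos[i, j] / neg[i, j] hold the best positive / negative cumulative
    product sums of aligned segments ending at x[i-1] and y[j-1],
    restricted to |i - j| <= max_delay.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    n = len(x)

    pos = np.zeros((n + 1, n + 1))
    neg = np.zeros((n + 1, n + 1))
    best_pos = best_neg = 0.0

    for i in range(1, n + 1):
        lo = max(1, i - max_delay)
        hi = min(n, i + max_delay)
        for j in range(lo, hi + 1):
            prod = x[i - 1] * y[j - 1]
            # Extend the previous aligned segment or restart from zero.
            pos[i, j] = max(0.0, pos[i - 1, j - 1] + prod)
            neg[i, j] = max(0.0, neg[i - 1, j - 1] - prod)
            best_pos = max(best_pos, pos[i, j])
            best_neg = max(best_neg, neg[i, j])

    sign = 1.0 if best_pos >= best_neg else -1.0
    return sign * max(best_pos, best_neg) / n

if __name__ == "__main__":
    x = [1.0, -1.0, 1.0, -1.0]
    print(local_similarity_score(x, x, max_delay=0))
```

The same recursion can be applied to a pair of trend series to obtain a local trend score. The nested loops visit on the order of n × (2 × `max_delay` + 1) cells, which is why the score itself is cheap and p-value estimation tends to dominate the running time.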
Currently, multiple software tools are available for performing LSA. Table 2 compares the features of five such tools: eLSA, fastLSA, LocSim, MBBLSA, and DDLSA. The older R tool LocSim is limited to the permutation procedure for p-values, whereas the more recent Python and C++ implementation, eLSA, supports both local similarity and trend analysis and provides both permutation and theoretical approximation for p-values. eLSA is particularly suitable for large NGS-based datasets. It also supports replicated series with multiple summarizing functions and bootstrap confidence intervals. The recently developed MBBLSA and DDLSA methods further improved the permutation and theoretical approximation of p-values. Additionally, another C++ implementation, fastLSA, trades p-value accuracy for computing speed.
This benchmark study has shown that eLSA is the preferred software tool for the analysis of time series data in both sparse and dense data scenarios. Weiss et al. [28] also recommended eLSA, along with MIC [25], as the best analysis methods for dense cross-sectional data when the delay
ABBREVIATIONS
DDLSA: Data-driven local similarity analysis.
eLSA: Extended local similarity analysis.
i.i.d.: Independent and identically distributed.
LA: Liquid association.
LSA: Local similarity analysis.
LTA: Local trend analysis.
MBBLSA: Moving block bootstrap local similarity analysis.
MIC: Maximal information coefficient.
NGS: Next-generation sequencing.
OTU: Operational taxonomic unit.
PCC: Pearson's correlation coefficient.
STLTA: Stationary theoretical local trend analysis.
This paper provides a comprehensive review and analysis of local similarity analysis (LSA), including algorithms, statistical theories, and software tools.
Various software tools' accuracy and efficiency are compared to aid readers in choosing the most appropriate tool.
Alongside LSA, this paper also reviews local trend analysis (LTA) and its statistical significance theories.
ACKNOWLEDGEMENTS
We thank Drs. Jed Fuhrman, Jacob Cram and Joshua Steele for helpful discussions. We thank Drs. R. Knight, S. Weiss, and Mr. Van Treuren for developing and sharing the benchmark data tables.
COMPETING INTERESTS
The authors declare no competing interests.
FUNDING
This work was supported by grants from the National Natural Science Foundation of China (61873027 to D.A. and L.C.X.), Open Project of the National Engineering Laboratory for Agri-product Quality Traceability (AQT-2020-YB6 to D.A.), and Guangdong Basic and Applied Basic Research Foundation (2022A1515011426 to L.C.X.).
DATA AVAILABILITY
The datasets analyzed in this study were made publicly available by Weiss et al. on GitHub (https://github.com/wdwvt1/correlations).
Author Biographies
Dongmei Ai is a professor in the School of Mathematics and Physics at University of Science and Technology Beijing. Her research is focused on bioinformatics and computational biology.
Lulu Chen is a master’s student in the School of Mathematics and Physics at University of Science and Technology Beijing. Her research interests include bioinformatics and computational biology.
Jiemin Xie is a master’s student in the School of Mathematics, South China University of Technology. His research interest is bioinformatics.
Longwei Cheng is a master’s student in the School of Mathematics and Physics at University of Science and Technology Beijing. His research interests include computational biology and machine learning.
Fang Zhang works as a machine learning specialist at the Shenwan Hongyuan Securities Co. Ltd. He received his PhD in Mathematics at the School of Mathematics, Shandong University. His research interests are statistics and machine learning.
Yihui Luan is a professor at the School of Mathematics at Shandong University. He works in the fields of biostatistics and time series analysis.
Yang Li is a master’s student in the School of Mathematics, South China University of Technology. His research interest is deep learning and its application to biomedicine.
Shengwei Hou is an assistant professor in the Department of Ocean Science and Engineering, Southern University of Science and Technology. His research interests are bioinformatics, microbial ecology and evolution.
Fengzhu Sun is a professor of Quantitative and Computational Biology and Mathematics at the University of Southern California. He works in computational biology, bioinformatics, statistical genetics, and mathematical modeling.
Li Charlie Xia is a professor in the Department of Statistics and Financial Mathematics, South China University of Technology. He specializes in statistical modeling and algorithm development for biomedical data.