Comparative Analysis of Regulatory Motif Discovery Tools for Transcription Factor Binding Sites

Abstract

In the post-genomic era, identifying specific regulatory motifs, or transcription factor binding sites (TFBSs), in non-coding DNA sequences is essential for elucidating transcriptional regulatory networks, yet it remains an obstacle that frustrates many researchers. Consequently, numerous motif discovery tools and associated databases have been applied to this problem. However, these methods are based on different computational algorithms and differ widely in how well they predict motifs in non-coding DNA sequences. Understanding the similarities and differences among the underlying algorithms, and enriching the motif discovery literature, therefore helps users choose the most appropriate of the tools available online. Moreover, there is still no credible criterion for assessing motif discovery tools, nor any guidance to help researchers choose the best tool for their own projects. Integrating the related resources may thus be a good way to improve prediction accuracy. Recent studies integrate regulatory motif discovery tools with experimental methods, offering researchers a complementary approach and providing a much-needed model for current research on transcriptional regulatory networks. Here we present a comparative analysis of regulatory motif discovery tools for TFBSs.

Key words: motif, TFBS, non-coding DNA sequence, computational algorithm, motif discovery tool

Introduction

Biological processes in prokaryotic and eukaryotic organisms are guided by genomic information in coding and non-coding DNA sequences. Both kinds of sequences contribute to the transcriptional regulatory networks that control gene expression in time and space. Compared with the pre-genomic era, which concentrated on deciphering coding DNA sequences and completed the blueprint of the human genome, the post-genomic era places more emphasis on digging into the gold mine hidden in non-coding DNA sequences. The identification of specific motifs or transcription factor binding sites (TFBSs) has become one of the key steps in this task.

Interaction between transcription factors (TFs) and non-coding DNA sequences is a prerequisite for the transcription initiation of genes. TFs recognize short conserved regions in non-coding DNA sequences, which are called motifs or TFBSs (1). However, experimental methods alone are not sufficient for finding motifs or TFBSs in non-coding DNA sequences. For example, systematic evolution of ligands by exponential enrichment (SELEX), serial analysis of gene expression (SAGE), and DNA microarray are only for transcript profiling in vitro (1, 2). Chromatin immunoprecipitation (ChIP) can be combined with DNA microarray, namely ChIP-on-chip, to identify protein-DNA interactions in vivo (3), but it is limited by antibody performance and availability (4). For this reason, a wide range of motif discovery tools and databases have been applied to motif or TFBS prediction in biological studies. Unfortunately, the great majority of their predictions have no demonstrable function in vivo, an observation known as the futility theorem (5).

In databases, motifs or TFBSs are usually represented as consensus IUPAC strings, position frequency matrices (PFMs), position weight matrices (PWMs), or position specific scoring matrices (PSSMs). Motifs or TFBSs in non-coding DNA sequences are commonly conserved but still tend to be degenerate, which can influence the interaction between TFs and their binding sites. Therefore, after motif or TFBS data are collected and aligned from experimental or computational results, a consensus IUPAC string can be constructed by selecting a degenerate nucleotide symbol for each position in the alignment (5). The motif or TFBS data can also be modeled as a PFM by aligning the identified sites and counting the frequency of each nucleotide at each position of the alignment (6). Usually, the PFM is then converted into a PWM or PSSM according to published formulas (5, 7). A candidate site in a non-coding DNA sequence is scored by summing the PWM or PSSM values of the nucleotide observed at each position (5). Moreover, a PWM can be displayed as a sequence logo, in which the height of each letter is proportional to the nucleotide frequency and the height of each column reflects the information content at that position (8).
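The conversion from aligned sites to a PFM, then to a PWM, and finally to a site score can be sketched as follows. This is a minimal illustration rather than the exact formulas of references (5, 7): the uniform background of 0.25, the pseudocount of 1.0, the example sites, and all function names are assumptions made for the example.

```python
import math

BASES = "ACGT"

def build_pfm(sites):
    """Count each base at each position of equal-length aligned sites."""
    width = len(sites[0])
    pfm = {b: [0] * width for b in BASES}
    for site in sites:
        for i, base in enumerate(site):
            pfm[base][i] += 1
    return pfm

def pfm_to_pwm(pfm, background=0.25, pseudocount=1.0):
    """Turn counts into log2-odds weights against an assumed uniform background."""
    width = len(pfm["A"])
    n_sites = sum(pfm[b][0] for b in BASES)
    pwm = {b: [0.0] * width for b in BASES}
    for b in BASES:
        for i in range(width):
            # The pseudocount avoids log(0) for bases never seen at a position.
            freq = (pfm[b][i] + pseudocount * background) / (n_sites + pseudocount)
            pwm[b][i] = math.log2(freq / background)
    return pwm

def score_site(pwm, site):
    """Sum the weight of the observed base at every position of the site."""
    return sum(pwm[base][i] for i, base in enumerate(site))

def best_hit(pwm, sequence):
    """Slide the matrix along a sequence; return (score, start) of the best window."""
    width = len(pwm["A"])
    return max(
        (score_site(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )

# Hypothetical aligned binding sites, used only to exercise the functions.
example_sites = ["TGACTC", "TGACTA", "TGAGTC", "TGACTC"]
example_pwm = pfm_to_pwm(build_pfm(example_sites))
```

Calling `best_hit(example_pwm, "CCCTGACTCAAA")` returns the score and start position of the highest-scoring window. In practice a score threshold, which this sketch does not choose, decides which windows are reported as putative sites.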

In the 1970s, scientists predicted that the pivotal difference between human and chimpanzee was located in non-coding DNA sequences rather than in coding DNA sequences (9). Since then, many essential elements of transcriptional regulatory networks have been identified in non-coding DNA sequences, including promoters, enhancers, insulators, silencers, and locus control regions (6). At present, motif discovery is mainly limited to the canonical 5’ termini of known genes, where TFs are generally thought to bind. Nevertheless, recent studies have shown that only a small proportion of motifs or TFBSs lie in the immediate upstream sequences of well-characterized protein-coding genes, while the rest are located in either introns or 3’ regions (6, 10, 11).

A number of motif discovery algorithms have been applied previously, for example, BE95 (12), KYD96 (13), DB97 (14), vHRCV00 (15), BJVU98 (16), EP02 (17), and KFQW99 (18). However, many of these algorithms were designed for finding longer or more general motifs rather than for identifying TFBSs (19). The price paid for this generality is that many of the cited algorithms are not guaranteed to find globally optimal solutions, since they employ some form of local search, such as Gibbs sampling, expectation maximization (EM), or greedy algorithms. In this study, we give a brief introduction to algorithm design and analysis for TFBSs, with a focus on problems in comparative motif discovery.

Results and Discussion

Combinatorial approaches

Among the possible algorithmic approaches, combinatorial approaches try to exhaustively explore all the ways in which a molecular process could happen. This leads to hard combinatorial problems for which efficient algorithms are required. Algorithms of this kind must therefore make use of complex data representations and techniques.

Sequence-driven or Sample-driven (SD) algorithms

SD algorithms try to find comparative patterns by comparing strings of a given length and looking for local similarities between them. They construct a local multiple alignment of the given non-coding DNA sequences and then extract the comparative patterns from the alignment by combining the segments that are common to most of the sequences (20).

Pattern-driven (PD) algorithms

PD algorithms are based on enumerating candidate patterns of a given length and outputting those with high fitness. The advantage of PD algorithms is that they can find the best comparative patterns within a limited size range (20). Compared with SD algorithms, PD algorithms can be performed intelligently so that patterns that are not present in the data are never generated. For example, if a pattern α is not frequent in the data, then no refinement that makes α more specific (and thus matches in even fewer places) can be frequent in the data either (20).
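The pruning argument can be sketched as a level-wise enumeration in which a pattern is extended only if it is already frequent. This is a simplified illustration that assumes exact substring matches and defines support as the number of sequences containing the pattern; the function names and example sequences are hypothetical and do not come from reference (20).

```python
def support(pattern, sequences):
    """Number of sequences that contain the pattern as an exact substring."""
    return sum(1 for seq in sequences if pattern in seq)

def frequent_patterns(sequences, min_support, max_length):
    """Enumerate all frequent patterns up to max_length, pruning infrequent ones."""
    frequent = {}
    current = [b for b in "ACGT" if support(b, sequences) >= min_support]
    length = 1
    while current and length <= max_length:
        for pattern in current:
            frequent[pattern] = support(pattern, sequences)
        # Only frequent patterns are extended: a refinement of an infrequent
        # pattern can never be frequent, so it is never generated.
        candidates = [p + b for p in current for b in "ACGT"]
        current = [c for c in candidates if support(c, sequences) >= min_support]
        length += 1
    return frequent

# Hypothetical input: three short sequences sharing the word TGACT.
example_sequences = ["ACGTGACT", "TTGACTAA", "GGTGACTC"]
example_frequent = frequent_patterns(example_sequences, min_support=3, max_length=5)
```

Because every prefix of a frequent substring is itself frequent, extending patterns one base to the right is enough to reach every frequent pattern in this exact-match setting. Real PD tools additionally allow mismatches or degenerate symbols, which this sketch leaves out.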

Multiprofiler

This algorithm uses multiprofiles, which generalize the notion of a profile, to detect subtle patterns that might escape detection by standard profiles (21). It is designed to find particularly subtle motifs, even when real motifs are blurred by random ones. An advantage of Multiprofiler is its short running time (21). Kravchenko et al. used Multiprofiler to search for and statistically assess putative motifs in the promoter regions of co-regulated genes, and the over-represented sites they discovered were verified by cell transfection experiments (22).

Consensus

This approach first aligns each word with every remaining word to create all possible two-sequence alignments. It scores these alignments by information content and saves the highest-scoring ones (23). Each saved two-sequence matrix is then paired with each word that is not already in the matrix, the resulting three-sequence matrices are scored for information content, and the highest-scoring ones are again kept. This process continues until each sequence has contributed exactly once to each saved alignment (24). In practice, Lenz et al. scanned the upstream regions of the known Vibrio cholerae σ⁵⁴-regulated genes and obtained a 16 bp motif, which perfectly matches the known σ⁵⁴ binding sites in V. cholerae with the consensus sequence “TGGCAC-N₅-TTGCA/T” (25). In another study, to test the hypothesis that IL-2-regulated genes in T1 cells may be influenced by STAT5, Fung et al. searched for motifs in 5,000 bp upstream regions by using the Consensus approach, and the classic motif “TTCNNNGAA” that they obtained was verified by ChIP experiments (26).
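The greedy growth of an alignment under an information-content score can be sketched as follows. This is a deliberately reduced version of the procedure described above: it keeps only the single best matrix at each step instead of a list of saved matrices, processes the sequences in their given order, and assumes a uniform background of 0.25. The function names and example sequences are hypothetical.

```python
import math

def information_content(words):
    """Sum over columns of f * log2(f / 0.25) for each observed base frequency f."""
    n = len(words)
    total = 0.0
    for column in zip(*words):
        for base in set(column):
            f = column.count(base) / n
            total += f * math.log2(f / 0.25)
    return total

def words_of(sequence, width):
    """All overlapping words of the given width in a sequence."""
    return [sequence[i:i + width] for i in range(len(sequence) - width + 1)]

def greedy_consensus(sequences, width):
    """Seed with each word of the first sequence, then add one word per sequence."""
    best = None
    for seed in words_of(sequences[0], width):
        alignment = [seed]
        for seq in sequences[1:]:
            # Keep the word that maximizes the information content so far.
            alignment.append(max(
                words_of(seq, width),
                key=lambda w: information_content(alignment + [w]),
            ))
        if best is None or information_content(alignment) > information_content(best):
            best = alignment
    return best

# Hypothetical sequences that share the word TGACTC.
example_alignment = greedy_consensus(["AATGACTCGG", "CCTGACTCAT", "GTTGACTCTA"], 6)
```

A fully conserved column contributes 2 bits under the uniform background, so an alignment of identical 6-mers reaches the maximum of 12 bits in this example.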

Teiresias

Teiresias is a two-phase combinatorial approach for general pattern discovery. In the setting where every motif is assumed to be present in every sequence, it finds all the maximal patterns that meet a minimum support. Its running time scales quasi-linearly with the size of the output (27). One property that differentiates Teiresias from other algorithms is the type of structural restriction that users are allowed to impose on the patterns to be searched for. For example, only the parameter W needs to be set, which makes it possible to discover patterns of arbitrary length as long as preserved positions are no more than W residues apart (28). In 2005, Kiesler et al. scanned 23 Hrp59 target exons by using Teiresias and found the known “GGAGG” core motif. This result was confirmed by ChIP, IP, and RT-PCR experiments (29).

Winnower, SP-STAR, and cWinnower

Winnower first represents motif instances as vertices, then it tries to delete spurious edges and recover motifs with the remaining vertices (30). SP-STAR is a local sum of pairwise score improvement algorithm, which considers only the subsequences present in dataset and iteratively updates scores of the motifs (30). cWinnower improves its running time by a stronger constraint function (31).

MobyDick

In some cases, motifs can be defined as strings whose probability of occurrence greatly exceeds the background expectation. One problem is to decide what constitutes the background and where the natural boundaries of a motif lie, since large pieces of a motif will also show up in a list of improbable strings. MobyDick is designed to address this issue. It is suitable for discovering motifs in a large collection of sequences, for example, all of the upstream regions in the yeast genome or all of the genes regulated during sporulation (32). In 2003, based on two clusters of genes obtained from microarray experiments, Murphy et al. scanned the 1,000 bp promoter region of each gene in each cluster and found a motif “T(G/A)TTTAC”, which had previously been validated as the binding site of a known TF. Moreover, they found a new motif “CTTATCA” that may control gene transcription (33).

Smile, Verbumculus, and Weeder

The Smile algorithm takes into consideration the fact that TFBSs may occur in multiples and follow a constrained spatial arrangement in genomes. It is therefore able to identify what are called “structured motifs”, and it uses a suffix tree to find them (34). The inner core of Verbumculus rests on subtly interwoven properties of statistics, pattern matching, and combinatorics on words, which makes it feasible to both detect and visualize over-represented words in a fast and practically useful way (35). Weeder extends exhaustive enumeration to longer signals and does not require the exact length of the pattern to be specified as input (36).

Mitra

Mitra can be extended to handle insertions and deletions in addition to mismatches in the selected sequences. It takes advantage of a new insight for pruning the pattern space that makes more efficient use of pairwise similarity than Winnower does. For example, unlike previous PD or SD algorithms, Mitra is able to discover composite motifs with a combined length of over 30 bp (37).

Projection

This algorithm overcomes some limitations of existing algorithms by using random projections of the input. It extends previous projection-based search techniques to solve a multiple alignment problem that is not effectively addressed by pairwise alignments. It is designed to efficiently solve instances of the planted (l, d)-motif model, and it solves substantially more difficult instances more reliably than previous algorithms (38). For t = 20 sequences, each of length n = 600, this algorithm achieves performance close to the best possible, being limited primarily by statistical considerations (38).
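The core idea of hashing every l-mer by a random subset of its positions can be sketched as follows. This is only the bucketing step under simplifying assumptions: the refinement of promising buckets that the published method performs is omitted, and the function names, parameters (`k`, `trials`, `threshold`, `seed`), and example sequences are hypothetical.

```python
import random
from collections import defaultdict

def bucket_by_projection(sequences, l, positions):
    """Group every l-mer by the letters it carries at the projected positions."""
    buckets = defaultdict(list)
    for seq_index, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            lmer = seq[start:start + l]
            key = "".join(lmer[p] for p in positions)
            buckets[key].append((seq_index, start))
    return buckets

def random_projection_candidates(sequences, l, k, trials, threshold, seed=0):
    """Repeat the projection with random position sets; keep well-supported buckets."""
    rng = random.Random(seed)
    candidates = set()
    for _ in range(trials):
        positions = sorted(rng.sample(range(l), k))
        for members in bucket_by_projection(sequences, l, positions).values():
            # A bucket is interesting if it draws l-mers from many sequences.
            if len({seq_index for seq_index, _ in members}) >= threshold:
                candidates.add(tuple(members))
    return candidates

# Hypothetical planted instances of ACGTACGT with at most one mismatch each.
example_sequences = ["TTACGTACGTGG", "CCACTTACGTAA", "GAACGTAGGTTC"]
example_buckets = bucket_by_projection(example_sequences, 8, [0, 1, 3, 4, 6, 7])
```

In the example, the projection skips positions 2 and 5, which are exactly where the planted instances differ, so all three instances fall into the same bucket. A random projection succeeds whenever it happens to avoid the mutated positions, which is why the procedure is repeated over several trials.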

EC and MoDEL

The evolutionary computation (EC) approach allows for variation in motifs by measuring a similarity score. Whereas SD algorithms are not always easy to define and depend on the accuracy of the PSSM, the EC approach does not rely on any predefined or estimated weight matrices (39, 40). MoDEL uses a hybrid strategy consisting of an evolutionary algorithm (global search) and hill-climbing optimizations (local search) according to Brazma’s classification (41). It addresses a well-known problem: given a set of functionally related sequences, how to choose exactly one occurrence per sequence such that all chosen occurrences are maximally similar. Such a set of occurrences is referred to as an ungapped local multiple alignment (41).

Probabilistic approaches

Probabilistic or randomized approaches make certain decisions randomly. This concept extends the classical model of deterministic algorithms and has generated many useful and provably efficient algorithms over the last twenty years. Probabilistic approaches are often faster, simpler, or more elegant than their combinatorial counterparts. Probabilistic algorithms that identify gene modules based on motif discovery are highly appropriate for analyzing synthetic lethal genetic interaction datasets, and have great potential in the integrative analysis of heterogeneous datasets (42).

EM

The EM algorithm is used to estimate the probability density of a given dataset by employing the Gaussian mixture model, in which the probability density of the dataset is modeled as the weighted sum of a number of Gaussian distributions. The main advantage of EM is its speed, while its disadvantages are that it requires “appropriate” starting values and has difficulty handling constrained parameters (43).
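The alternation between an expectation step and a maximization step can be sketched for a one-dimensional Gaussian mixture. This is a minimal illustration with assumed starting values, a fixed number of iterations, and a small variance floor; it is not the formulation of reference (43), and the function names and data are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gaussian_mixture(data, initial_means, iterations=50):
    """Fit mixture weights, means, and variances by expectation maximization."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        responsibilities = []
        for x in data:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            responsibilities.append([d / total for d in dens])
        # M-step: re-estimate the parameters from the weighted data.
        for j in range(k):
            nj = sum(r[j] for r in responsibilities)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(responsibilities, data)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(responsibilities, data)) / nj
            variances[j] = max(spread, 1e-6)  # floor keeps the variance positive
    return weights, means, variances

# Hypothetical data drawn around 1.0 and 5.0; starting means are a guess.
example_data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
example_fit = em_gaussian_mixture(example_data, initial_means=[0.8, 5.2])
```

The dependence on starting values mentioned above shows up here in `initial_means`: with well-separated guesses the two clusters are recovered, whereas poor guesses can leave the procedure at a weaker local maximum.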

Gibbs Sampler

The Gibbs sampling algorithm is one of the simplest Markov chain Monte Carlo algorithms. Under Gibbs sampling, the sampled distribution of the parameters converges to their joint probability given the dataset. Gibbs sampling strategies are reported to be fast and sensitive. The sampler generally finds an optimized local alignment model for N sequences in N-linear time, and it is less prone to the local-maximum problem that affects the EM algorithm. However, it requires a relatively large dataset (15 or more sequences) for weakly conserved patterns to reach statistical significance (44). In 2000, Petersen et al. used Gibbs Sampler to look for motifs that are not necessarily 100% conserved in 17 putative promoter regions obtained from microarray experiments (45). The search was performed for motif widths ranging from 6 to 16 bp, and Gibbs Sampler repeatedly found the motifs “TTGACT” and “GACTWWHC”, both of which had been identified by previous experiments.
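A site sampler of this kind can be sketched as follows. This is a minimal version that assumes exactly one site per sequence, a fixed motif width, a uniform background of 0.25, and a fixed number of sweeps; it is not the implementation of reference (44), and the function name, parameters, and example sequences are hypothetical.

```python
import random

def gibbs_site_sampler(sequences, width, iterations=200, seed=0, pseudocount=0.5):
    """Sample one motif start per sequence by repeatedly re-drawing held-out sites."""
    rng = random.Random(seed)
    positions = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(iterations):
        for held_out, seq in enumerate(sequences):
            # Build a profile from the current sites of all other sequences.
            counts = [{b: pseudocount for b in "ACGT"} for _ in range(width)]
            for j, other in enumerate(sequences):
                if j != held_out:
                    site = other[positions[j]:positions[j] + width]
                    for i, base in enumerate(site):
                        counts[i][base] += 1
            total = len(sequences) - 1 + 4 * pseudocount
            # Weight each start in the held-out sequence by its likelihood
            # ratio of profile versus uniform background.
            weights = []
            for start in range(len(seq) - width + 1):
                ratio = 1.0
                for i, base in enumerate(seq[start:start + width]):
                    ratio *= (counts[i][base] / total) / 0.25
                weights.append(ratio)
            # Draw the new start at random in proportion to the weights.
            positions[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions

# Hypothetical sequences that share the word TGACTC.
example_sequences = ["AATGACTCGG", "CCTGACTCAT", "GTTGACTCTA", "TGACTCAAGC"]
example_positions = gibbs_site_sampler(example_sequences, width=6)
```

Because each new position is drawn at random rather than chosen greedily, the sampler can leave a poor alignment, which is the property that makes it less likely than EM to stay in a local maximum. The result still depends on the random seed and on the number of iterations.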

MEME

The MEME algorithm extends the EM algorithm for identifying motifs in unaligned sequences. A drawback of EM is that the maximum it finds is only local (46). MEME can favor motifs that appear exactly once in each sequence of the training set (one-per model) or that appear zero or one time (zero-or-one-per model), or it can give no preference to the number of occurrences (zero-or-more-per model). In 2005, Hall et al. acquired a set of correlated genes from genomic, transcriptomic, and proteomic analyses. They applied MEME to scan the 1,000 bp downstream of the stop codon, where a 47 bp motif was found in six of the analyzed sequences. The motif was then used to search the entire genome, and 20 additional genes were found to have the same motif. This motif was known to be bound by Puf protein, implying that Puf protein may control the expression of the analyzed genes (47).

LOGOS and MotifPrototyper

LOGOS consists of two interacting submodels: HMDM, a model for aligned selected sequences, and HMM, a model for the global distribution of motif instances. HMDM is a hidden Markov-Dirichlet multinomial model that captures rich biological prior knowledge and positional dependence in motif local structure in a principled way. HMM is a standard hidden Markov model, which allows formal and efficient inference of motif locations, and is potentially capable of capturing their dependencies. Model parameters can be fit on training motifs by using a variational EM algorithm within an empirical Bayesian framework (48). MotifPrototyper is later used to train the model’s parameters and to scan for known regulatory motifs and discover unknown ones (49).

Motif Sampler

Motif Sampler uses higher-order Markov models to represent the intergenic background of non-coding DNA sequences. It incorporates these higher-order background models when updating the probability of finding a motif at a certain position (50). To identify the consensus binding site of the known TF Yrr1p in yeast, Le Crom et al. used Motif Sampler to search for motifs in the genes regulated by Yrr1p, and the resulting motif “(T/A)CCG(C/T)(G/T)(G/T)(A/T)(A/T)” was confirmed by EMSA experiments (51).

AlignACE

AlignACE is based on the Gibbs sampling algorithm, but it differs from the original Gibbs sampler in the following ways. Firstly, the motif model is changed so that the base frequencies for non-site sequences are fixed according to the source genome. Secondly, both strands of the input sequences are considered simultaneously at each step of the algorithm, and overlapping sites are not allowed even if they are on opposite strands. Thirdly, simultaneous multiple-motif searching is replaced by an approach in which a single motif is found and then iteratively masked (52–54).

ANN-Spec

The objective function for ANN-Spec is designed to find patterns that distinguish the positive dataset from background. It succeeds in identifying the desired patterns specific for the positive dataset. For example, Gibbs sampling and ANN-Spec both work very well when the background is assumed to be random, while ANN-Spec finds patterns with higher specificity and higher correlation coefficients when it is provided with background sequences (55, 56).

BioProspector

BioProspector uses a Markov background to model the base dependencies of non-motif bases, which greatly improves the specificity of the reported motifs. The parameters of the Markov background model are either estimated from user-specified sequences or precomputed from the whole genome. A new motif scoring function is adopted to allow each input sequence to contain zero to multiple copies of the motif. In addition, BioProspector can model gapped motifs with palindromic patterns, which are prevalent motif patterns in prokaryotes (57, 58).

MDscan and Motif Regressor

MDscan mainly examines sequences selected by ChIP-on-chip. It combines the advantages of two widely adopted motif search strategies, word enumeration and PSSM, and incorporates ChIP enrichment information to accelerate the search and enhance its success rate. Motif Regressor uses linear regression analysis to select motifs whose sequence matching scores are significantly correlated with ChIP-on-chip enrichment or downstream gene expression values. Ranking motifs by linear regression p-value, Motif Regressor automatically picks the best one with optimal width (59–61).

Improbizer

Improbizer searches for motifs that occur with improbable frequency by using a variation of the EM algorithm. It works by finding the patterns that occur more frequently than they should by chance. The simplest way to estimate how frequently a particular nucleotide word should occur by chance is to raise one quarter to the power of the number of nucleotides in the word. Optionally, Improbizer constructs a Gaussian model of motif placement, so that motifs occurring in similar positions in the input sequences are more likely to be found (62).
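The chance estimate can be written out as a short calculation. This sketch assumes equiprobable, independent nucleotides, which is the simple model described above and not necessarily the background model Improbizer itself uses; the function names and example sequences are hypothetical.

```python
def chance_probability(word):
    """Probability of a specific word at one position under uniform bases."""
    return 0.25 ** len(word)

def expected_occurrences(word, sequences):
    """Expected count of the word by chance over all possible start positions."""
    k = len(word)
    positions = sum(max(len(s) - k + 1, 0) for s in sequences)
    return positions * chance_probability(word)

def observed_occurrences(word, sequences):
    """Count overlapping exact occurrences of the word in all sequences."""
    k = len(word)
    return sum(
        1
        for s in sequences
        for i in range(len(s) - k + 1)
        if s[i:i + k] == word
    )

def enrichment(word, sequences):
    """Ratio of observed to expected occurrences; values above 1 exceed chance."""
    return observed_occurrences(word, sequences) / expected_occurrences(word, sequences)

# Hypothetical toy input used only to exercise the functions.
example_ratio = enrichment("AC", ["ACAC", "AC"])
```

For a 6 bp word the chance probability is 0.25 to the sixth power, about 0.00024 per position, so even a handful of occurrences in a few kilobases of sequence is more than expected under this simple model.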

SeSiMCMC

SeSiMCMC is a tool for multiple local alignment of a set of non-coding DNA sequences, which is based on a modification of the Gibbs sampling algorithm. Its primary objective is to provide a computationally efficient tool that supports user-defined motif symmetry and estimates the motif length from the dataset. Sequence fragments in a training set can have arbitrary orientation, and a sequence is allowed, with some probability, to contain no sites (63).

GMS-MP

GMS-MP performs significantly better than standard PWM-based Gibbs sampling methods. Compared with the Bayesian network approach, GMS-MP has a simpler model, a prior that is easier to prescribe, and much faster computation: the step of sampling pairwise correlations takes up only about 3% of the total computing time. The method is also unlikely to suffer from over-fitting, owing to its rather conservative prior distribution on the model pattern (64).

Phylogenetic footprinting approaches

Phylogenetic footprinting approaches discover regulatory elements in a set of orthologous regulatory regions from multiple species by identifying the best conserved motifs in those orthologous regions (65).

PhyloCon

Phylogenetic-Consensus (PhyloCon) takes into account both conserved orthologous genes and co-regulated genes within a species. The key idea of PhyloCon is to compare aligned sequence profiles from orthologous genes or co-regulated genes rather than unaligned sequences. PhyloCon integrates the knowledge of co-regulated genes in single species with sequence conservation across multiple species to improve the performance of motif discovery. An advantage of PhyloCon is that it reports motifs of varying lengths, instead of requiring the motif length to be input (66, 67).

EMnEM and OrthoMeme

Expectation-maximization on evolutionary mixtures (EMnEM) considers motifs that are generated from ancestral sequences. The ancestral sequences are modeled as a two-component mixture of motif and background, each with its own evolutionary model. The value of varying evolutionary models has been realized in other contexts as well, and such models have been successfully trained by using EM. MEME often scores better than EMnEM with a substitution model, except at higher evolutionary distances, where EMnEM takes the lead (68). OrthoMeme is the first algorithm to deal with heterogeneous data sources in a truly integrated manner by using all the data from the onset of the analysis (69).

PhyME

PhyME integrates two different axes of information content in evaluating the significance of candidate motifs. One axis is the overrepresentation that depends on the number of occurrences of motifs in each species. The other axis is the level of conservation of each motif instance across species. An important feature of PhyME is that it allows motifs to occur in evolutionarily conserved as well as unconserved regions in orthologous sequences. PhyME treats the two kinds of occurrences differently when it scores a motif (70).

FootPrinter

The distinguishing feature of FootPrinter is that it takes as input a set of unaligned homologous sequences from various species together with a phylogenetic tree relating these species. It then searches for short regions of the sequences that are highly conserved according to a parsimony criterion. The regions identified are good candidates for regulatory elements (71).

CompareProspector

CompareProspector identifies regulatory elements by using information content from both intraspecies pattern enrichment and interspecies sequence conservation. This distinguishes it from other phylogenetic footprinting programs that use orthologous sequences of a single gene from multiple species to identify regulatory elements (44).

Conclusion

In the last decade, computational identification of motifs or TFBSs by analyzing non-coding DNA sequences has emerged as a major new technology for elucidating transcriptional regulatory networks. Combinatorial algorithms assume a discrete model and search for motifs that occur at a high rate in non-coding DNA sequences. One major drawback of combinatorial algorithms is that they are sometimes difficult to understand, and many “hidden” details make them hard to implement. Probabilistic algorithms often run faster than the corresponding combinatorial algorithms. Moreover, many probabilistic algorithms are easier to implement and describe than combinatorial algorithms of comparable performance. However, these algorithms may miss much useful information when searching non-coding DNA sequences. Phylogenetic footprinting assumes that functional sequences tend to be conserved through evolution. Motifs or TFBSs can thus be identified by looking for conservation of small regions within multiple alignments of non-coding DNA sequences.

To date, more than 120 motif discovery tools have been applied in biological research. The main challenge for these tools has always been to apply effective algorithms that can handle all the intrinsic complexities of motifs or TFBSs. However, some considerations should be borne in mind when using computational approaches to tackle biological problems. One is the futility theorem: we still have no good method other than traditional molecular biology for finding out whether a predicted motif or TFBS has a clear function in vivo. Another is that pattern discovery methods are severely restricted by the signal-to-noise problem, because the information content of a motif is strictly limited by its intrinsic nature. Additionally, some algorithms that work well for yeast might not work for human because of the complexity of DNA structure. Therefore, all observed patterns must be considered carefully.

Materials and Methods

Web-based resources for non-coding DNA sequence datasets

Web-based resources for non-coding DNA sequence datasets give biologists tools for relating their work to experimental research. These dataset tools include views, wizards, editors, and other features that make it easy for users to predict and test the experimental elements of their applications (a selection is listed in Table 1).

Table 1

Selected web-based resources for promoter databases

Database	Explanation	URL
EPD	Eukaryotic promoter database	http://www.epd.isb-sib.ch/
DBTSS	Database of transcriptional start sites (human)	http://dbtss_old.hgc.jp/hg17/
SCPD	Saccharomyces cerevisiae promoter database	http://rulai.cshl.edu/SCPD/
DCPD	Drosophila core promoter database	http://www-biology.ucsd.edu/labs/Kadonaga/DCPD.html
PlantProm DB	Plant promoter database	http://mendel.cs.rhul.ac.uk/mendel.php?topic=plantprom
CSHLmpd	Cold Spring Harbor Laboratory mammalian promoter database	http://rulai.cshl.edu/CSHLmpd2/
TRED	Transcriptional regulatory element database	http://rulai.cshl.edu/cgi-bin/TRED/tred.cgi?process=home

Web-based resources for regulatory motif or TFBS datasets

The relational motif or TFBS datasets help biologists create and manipulate the data definitions for their own projects in terms of relational dataset schemas. Users can access these datasets under the analysis perspective, which allows them to browse or import dataset schemas in the servers view, create and work with dataset schemas in the data definition view, and change dataset schemas in the table editor. Users can also export data definitions to another dataset installed either locally or remotely (a selection is listed in Table 2).

Table 2

Selected web-based resources for regulatory motifs or TFBSs

Database	Explanation	URL
JASPAR	A collection of transcription factor DNA-binding preferences	http://mordor.cgb.ki.se/cgi-bin/jaspar2005/jaspar_db.pl
TRANSFAC	Database on eukaryotic transcription factors, their genomic binding sites and DNA-binding profiles	http://www.gene-regulation.com/pub/databases.html#transfac
TRRD	Transcription regulatory regions database	http://wwwmgs.bionet.nsc.ru/mgs/gnw/
RegulonDB	A computational model of mechanisms of transcriptional regulation	http://regulondb.ccg.unam.mx/html/What_is_RegulonDB.jsp
TFD	Transcription factor databases	http://www.ifti.org/

Web-based resources for motif or TFBS discovery algorithms

Emphasis is placed on the development of general algorithm designs and data structures that are particularly suited to biological problems. These have been applied in a variety of areas such as genetic information systems, computer graphics, alignment, and computer-aided design (selected tools are listed in Table 3).

Table 3

Selected web-based resources for motif discovery tools

Algorithm	Motif model	Match model	Ref.
AlignACE	matrix	PWM	52
ANN-Spec	matrix	PWM	55
BioOptimizer	–	PWM	72
BioProspector	matrix, dyad	PWM	57
CAGER	–	–	73
Cis-analyst	–	PWM	74
CisModule	–	PWM	75
Cister	–	PWM	76
Clover	–	PWM	77
ClusterScan	–	PWM	78
CoBind	matrix, dyad	PWM	79
COMET	–	–	80
CompareProspector	–	–	57
ConsecID	–	PWM	81
Consensus	matrix	PWM	24
ConSite	–	PWM	82
COOP	–	reg.exp	83
cWinnower	string	mismatch	31
DMotifs	string	reg.exp	84
DMS	–	PWM	85
Dyad analysis	string, dyad	oligos	15
EC	string	fitness	39
EMnEM	–	–	68
FastM	–	PWM	18
FootPrinter	–	mismatch	71
FrameWorker	–	PWM	86
GANN	–	flexible	87
Gibbs sampler	matrix	PWM	44
Gibbs recursive	matrix	PWM	88
GLAM	string	–	89
GMS-MP	GWM	HMM	64
HMDM	–	DM	90
Improbizer	–	PWM	62
LOGOS	HMDM	DM	48
MAPPER	–	HMM	91
MCAST	–	PWM	92
MDScan	matrix	PWM	59
MEME	matrix	PWM	46
MERMAID	string	PWM	93
MISAE	–	mismatch	94
Mitra	string, dyad	mismatch	48
Mitra-dyad	–	mismatch	17
MITRA-PSSM	matrix	PWM	95
MM	–	PWM	96
MobyDick	string	mismatch	32
MoDEL	string	PWM	41
ModelGenerator	–	PWM	97
ModelInspector	–	PWM	97
Modulescanner	–	PWM	98
ModuleSearcher	–	PWM	98
MotifLocator	–	PWM	98
MotifPrototyper	–	DM	49
Motif regressor	–	PWM	41
Motif sampler	–	PWM	50
MSCAN	–	PWM	99
MultiProfiler	string	mismatch	21
NestedMICA	–	PWM	100
NONPAR	–	mixture	101
Oligo–analysis	string	oligos	102
OrthoMEME	–	PWM	69
Pattern–assembly	–	–	103
PhyloCon	–	PWM	66
PhyME	–	–	70
Pratt2	–	reg.exp	104
Projection	string	PWM	38
ProMapper	–	DM	105
PromoterInsp	–	oligos	106
QuickScore	string	IUPAC	107
REDUCE	–	PWM	108
SCORE	–	–	109
SeSiMCMC	–	PWM	63
SMILE	string, mult	mismatch	34
SOMBERO	–	PWM	110
Splash	–	reg.exp	111
Stubb	–	PWM	42
Teiresias	string	reg.exp	27
TFBScluster	–	PWM	112
Verbumculus	string	mismatch	35
Weeder	string	mismatch	36
Winnower	string	mismatch	30
YMF	string	reg.exp	113
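
Table 3 labels each tool with a match model such as PWM, mismatch, or IUPAC. As a rough illustration of what three of these labels mean in practice, the following Python sketch scores sequence windows with a position weight matrix (PWM), counts mismatches against a consensus string, and matches a degenerate IUPAC pattern. The count matrix, pseudocount, uniform background, and all function names are assumptions made for illustration; they are not taken from any tool listed in the table.

```python
import math

# Hypothetical counts for a 4-bp motif: one dict per position.
COUNTS = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 9},
]
BACKGROUND = 0.25   # assumed uniform base frequency
PSEUDOCOUNT = 1.0   # assumed smoothing value, avoids log(0)

# Standard IUPAC nucleotide codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def build_pwm(counts):
    """Convert per-position counts into log2-odds weights."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * PSEUDOCOUNT
        pwm.append({
            base: math.log2((column[base] + PSEUDOCOUNT) / total / BACKGROUND)
            for base in "ACGT"
        })
    return pwm

def pwm_scan(seq, pwm):
    """Return (score, offset) of the best-scoring window.

    Ties on score are broken in favour of the larger offset,
    because tuples are compared element by element.
    """
    width = len(pwm)
    return max(
        (sum(pwm[i][seq[o + i]] for i in range(width)), o)
        for o in range(len(seq) - width + 1)
    )

def mismatch_hits(seq, pattern, max_mismatches):
    """Offsets where a window differs from pattern in at most max_mismatches positions."""
    width = len(pattern)
    return [
        o for o in range(len(seq) - width + 1)
        if sum(a != b for a, b in zip(seq[o:o + width], pattern)) <= max_mismatches
    ]

def iupac_hits(seq, pattern):
    """Offsets where every base is allowed by the IUPAC code at that position."""
    width = len(pattern)
    return [
        o for o in range(len(seq) - width + 1)
        if all(seq[o + i] in IUPAC[pattern[i]] for i in range(width))
    ]

if __name__ == "__main__":
    sequence = "TTACGTAAGGTC"  # made-up test sequence
    print(pwm_scan(sequence, build_pwm(COUNTS)))
    print(mismatch_hits(sequence, "ACGT", 1))
    print(iupac_hits(sequence, "ASGT"))
```

The PWM assigns a graded score to every window, whereas the mismatch and IUPAC models give a yes-or-no answer; this is one reason the tools in Table 3 can report different sites on the same input.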

Authors’ contributions

WW carried out the study, and YX supervised the research. Both authors read and approved the final manuscript.

Competing interests

The authors have declared that no competing interests exist.

Acknowledgements

We thank Mr. Maximilian Häußler and Dr. Saurabh Sinha for providing their theses, Prof. Finn Drabløs for his valuable documents, Dr. Zhiping Weng and Prof. Michael Q. Zhang for their helpful websites, and Dr. Jinkuk Choi for his useful critical review. We especially thank Dr. Aiguo Li of the National Cancer Institute at the NIH for her helpful advice on the manuscript.

References

Roulet, et al. High-throughput SELEX-SAGE method for quantitative modeling of transcription-factor binding sites. Nat. Biotechnol. 2002; 831–835.
