Abstract

Protein–DNA bindings between transcription factors (TFs) and transcription factor binding sites (TFBSs) play an essential role in transcriptional regulation. Over the past decades, significant efforts have been made to study the principles of protein–DNA binding. However, it is generally considered that there are no simple one-to-one rules between amino acids and nucleotides, and many methods rely on complicated features beyond sequence patterns. Protein–DNA bindings are formed from associated amino acid and nucleotide sequence pairs, which determine many functional characteristics. Therefore, it is desirable to investigate associated sequence patterns between TFs and TFBSs. With increasing computational power, the availability of massive experimental databases on DNA and proteins, and mature data mining techniques, we propose a framework to discover associated TF–TFBS binding sequence patterns in the most explicit and interpretable form from TRANSFAC. The framework is based on association rule mining with the Apriori algorithm. The patterns found are evaluated by quantitative measurements at several levels on TRANSFAC. With further independent verification from the literature, the Protein Data Bank and homology modeling, there is strong evidence that the patterns discovered reveal real TF–TFBS bindings across different TFs and TFBSs, which can provide further knowledge for better understanding TF–TFBS bindings.

INTRODUCTION

We first introduce protein–DNA bindings in this section. Existing bioinformatics methods are briefly described, followed by the layout of this article.

Protein–DNA binding

Protein–DNA binding plays a central role in genetic activities such as transcription, packaging, rearrangement, and replication ( 1 , 2 ). Therefore, it is very important to identify and understand protein–DNA bindings as the basis for further deciphering biological systems. We focus on protein–DNA bindings between transcription factors (TFs) and transcription factor binding sites (TFBSs), which are the primary regulatory activities with abundant data support. TFs bind in a sequence-specific manner to TFBSs to regulate gene transcription (gene expression). The DNA binding domain(s) of a TF can recognize and bind to a collection of similar TFBSs, from which a conserved pattern called a motif can be obtained. TFBSs, the nucleotide fragments bound by TFs, are usually short (about 5–20 bp), lie in the cis -regulatory/intergenic regions, and can be located at very different distances from the transcription start site.

It is expensive and laborious to experimentally identify TF–TFBS binding sequence pairs, for example, using DNA footprinting ( 3 ) or gel electrophoresis ( 4 ). The technology of chromatin immunoprecipitation (ChIP) ( 5 , 6 ) measures the binding of a particular TF to DNA of co-regulated genes on a genome-wide scale in vivo , but at low resolution. Further processing is needed to extract precise TFBSs ( 7 ). TRANSFAC ( 8 ) is one of the largest and most representative databases for regulatory elements including TFs, TFBSs, nucleotide distribution matrices of the TFBSs and regulated genes. The data are expertly annotated and manually curated from peer-reviewed and experimentally proven publications. Other annotation databases of TF families and binding domains are also available [e.g. PROSITE ( 9 ), Pfam ( 10 )]. It is even more difficult and time-consuming to obtain high-resolution 3D TF–TFBS complex structures with X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopic analysis. Nevertheless, high-quality TF–TFBS binding structures provide valuable insights for verifying putative principles of binding. The Protein Data Bank (PDB) ( 11 ) serves as a representative repository of such experimentally determined protein–DNA (in particular TF–TFBS) complexes with high resolution at the atomic level. However, the available 3D structures are far from complete. As a result, there is strong motivation to develop automatic methods, in particular computational approaches based on existing abundant data, to provide testable candidates of novel TF domains and/or TFBS motifs with high confidence to guide and accelerate wet-lab experiments.

Existing methods

The earliest computational methods related to TF–TFBS bindings aimed to discover the motifs of TF domains and TFBSs separately. In addition, researchers have tried hard to generalize one-to-one binding codes from existing 3D structures. Data mining methods have also been proposed, using feature transformations and machine learning to decipher complicated binding rules. These approaches are briefly described as follows:

Motif discovery

TF domain and TFBS sequences are somewhat conserved due to their functional similarity and importance. By exploiting this conservation, bioinformatics methods called motif discovery save some of the expensive and laborious laboratory experiments. Motif discovery ( 6 ) can be categorized into two types: (i) motif matching and (ii) de novo motif discovery. (i) Motif matching identifies putative TF domains ( 9 , 10 ) or TFBSs ( 12 ) based on motif knowledge obtained from annotated data. (ii) De novo motif discovery predicts conserved patterns without prior knowledge of their appearance, based on certain motif models and scoring functions ( 13 , 14 ), from a set of protein/DNA promoter sequences with similar regulatory functions. While de novo motif discovery is successful for well-conserved TF functional domain motifs, the counterpart for TFBSs remains very challenging, with poor performance on real benchmarks ( 6 , 15 , 16 ). A significant limitation of motif discovery is the lack of linkage between the binding counterparts for revealing TF–TFBS relationships.

One-to-one binding codes

Numerous studies have been carried out to analyze existing protein–DNA binding 3D structures comprehensively ( 2 , 17 , 18 ) or with a focus on specific families ( 1 ) [e.g. zinc fingers ( 19 )]. Various properties have been discovered concerning, e.g. bonding and force types, TF conservation and mutation ( 1 ), and bending of the DNA ( 17 ). Some are already applicable to predicting binding amino acids on the TF side ( 20 ). However, annotated data are far from complete. Alternatively, researchers have searched hard for general binding ‘codes’ between proteins and DNA, in particular a one-to-one mapping between amino acids from TFs and nucleotides from TFBSs. Despite many proposed one-to-one binding propensity mappings ( 1 , 21 , 22 ), the consensus is that there are no simple binding ‘codes’ ( 23 ).

Data mining

In the hope of better understanding protein–DNA bindings, many data mining approaches have also been proposed ( 24 ). Researchers transform additional detailed information, such as base compositions, structures and thermodynamic properties ( 25 , 26 ) as well as expression data ( 27 ), into sophisticated features to fit certain data mining techniques. Although some approaches may provide interpretable rules, most of them have stringent data requirements that cannot be met trivially. Existing data beyond sequences are also insufficient and limited for practitioners. These methods usually extract complicated features rather than working on interpretable data directly. Many data mining techniques, such as neural networks, support vector machines (SVM) ( 28 ) and regressions ( 24 ), may generate rules that are not trivial to interpret. Furthermore, many data mining approaches are based on specific families or particular data sets, so the generality of the results is limited. On the other hand, sequences are the most readily available primary data and carry important information for protein–DNA bindings ( 23 ). It is desirable to make use of the large-scale and comprehensive sequence data to mine explicit and interpretable TF–TFBS binding rules.

Article layout

In this article, we propose a framework based on association rule mining to discover protein–DNA binding sequence patterns from TRANSFAC. The article is organized as follows: the proposed methods are presented in the ‘Materials and Methods’ section; experimental results and verifications are reported in the ‘Results and Analysis’ and ‘Verifications’ sections, respectively; and the approach is discussed in the ‘Discussion’ section.

MATERIALS AND METHODS

In this section, we propose a framework for mining, discovering and verifying TF–TFBS bindings on large-scale databases. The framework starts from data cleansing and transformation on TRANSFAC, and then applies association rule mining to discover TF-TFBS binding sequence patterns. Comprehensive 3D verifications and evaluations are carried out on PDB. Detailed bonding analysis is performed to provide strong support to the discovered rules.

In the following subsections, Apriori algorithm for association rule mining is first introduced. We then elaborate how the algorithm is applied to protein–DNA binding pattern discovery. Finally, we present how the data are preprocessed for the task with a running example.

Association rule mining and Apriori Algorithm

Association rule mining ( 29 ) aims at discovering frequently co-occurring items, called frequent itemsets, from a large number of data samples above a certain count threshold (minimum support) ( 30 ). The support of an itemset is defined as the number of data samples where all the items in the itemset co-occur. In the case of protein–DNA binding, the binding domains of TFs can recognize and form strong bondings with certain sequence-specific patterns of the TFBSs. Therefore, they are likely to co-occur frequently among the combinations between all possible TF and TFBS subsequences, and can be thus identified by association rule mining. In this study, we use the notation of k -mer (a subsequence with k amino acid or nucleotide residues) to represent a candidate item. A frequent TF–TFBS itemset is a TF k -mer and TFBS k -mer (the two k 's can be different) pair, or simply a pair, co-occurring with a frequency no less than the minimum support in the TF–TFBS sequence records (TRANSFAC database).

The Apriori algorithm proposed by Agrawal et al. ( 29 ) is a classical approach for finding frequent itemsets. It is outlined in Algorithm 1 in Appendix 1 . It is a branch-and-bound algorithm for discovering association rules in a database. Its downward closure property (every subset of a frequent itemset is itself frequent) allows infrequent candidates to be pruned without losing any frequent itemset. The algorithm first obtains the frequent 1-itemsets. Iteratively, it uses the frequent n -itemsets (itemsets with n items) to generate all possible candidate ( n +1)-itemsets, whose supports are then evaluated ( 30 ). If the support of a candidate ( n +1)-itemset is lower than the minimum support, the candidate is removed; the remaining candidates are the frequent ( n +1)-itemsets. The procedure is repeated until no frequent itemsets are found.
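The level-wise procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation in Algorithm 1 of the Appendix: the function name `apriori`, the representation of data samples as Python sets, and the toy transactions in the usage note are all hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset.

    transactions: list of sets of hashable items (one set per data sample).
    min_support:  minimum number of samples in which an itemset must co-occur.
    """
    # Frequent 1-itemsets: count each item once per transaction.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    n = 1
    while current:
        # Join step: unions of two frequent n-itemsets that contain n + 1 items.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != n + 1:
                continue
            # Prune step (downward closure): every n-subset must be frequent.
            if all(frozenset(sub) in current for sub in combinations(union, n)):
                candidates.add(union)
        # Evaluate supports and keep candidates that reach the threshold.
        current = {}
        for cand in candidates:
            support = sum(1 for t in transactions if cand <= t)
            if support >= min_support:
                current[cand] = support
        frequent.update(current)
        n += 1
    return frequent
```

For example, calling `apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}], 3)` on these made-up transactions keeps the three single items and the three pairs, and discards `{"a", "b", "c"}` because it co-occurs in only two samples.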

Discovering associated TF–TFBS sequence patterns

To formulate TF–TFBS sequence pattern discovery as an association rule mining problem, we have to transform the protein–DNA binding records into itemsets of k -mers. An illustrative example of the TF–TFBS binding records from TRANSFAC 2008.3 is shown in Figure 1 . The TF (e.g. T01333 RXR-γ) can bind to several TFBS DNA sequences. The DNA sequences may differ in length due to experimental methods and noise. Both the TF and TFBS sequences are chopped into overlapping short k -mers, as illustrated in Figure 2 (first part). These k -mers, together with the reverse complements of the DNA k -mers (e.g. GACCT and its reverse complement AGGTC), form one data sample. To generate the itemsets, all the k -mers are recorded in a binary array in which k -mers that appear are marked 1 and all others 0. Thus, the length of the array depends on the number of all possible TF k -mers and TFBS k -mers ( Figure 2 , second part). Since k is small (4–6), all the 4^k possible TFBS DNA k -mers can be adopted. However, it is computationally infeasible to enumerate all the 20^k possible TF k -mers. Thus a data-driven approach is employed: the whole TRANSFAC database is scanned to obtain the frequent TF amino acid k -mers.
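The transformation of a sequence into k-mer items can be illustrated with a short Python sketch. The helper names (`reverse_complement`, `kmers`, `dna_kmers_with_revcomp`, `to_binary_array`) are hypothetical and chosen only for this illustration; the sketch assumes upper-case DNA sequences over the alphabet A, C, G, T.

```python
def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string over A, C, G, T."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(dna))


def kmers(sequence, k):
    """Set of all overlapping k-mers of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def dna_kmers_with_revcomp(dna, k):
    """k-mers of a TFBS sequence together with their reverse complements."""
    forward = kmers(dna, k)
    return forward | {reverse_complement(m) for m in forward}


def to_binary_array(present, vocabulary):
    """Mark each k-mer of the vocabulary 1 if it appears, 0 otherwise."""
    return [1 if m in present else 0 for m in vocabulary]
```

With these helpers, `reverse_complement("GACCT")` gives `"AGGTC"`, matching the example in the text, and a toy sequence such as `"AGGTCA"` contributes the 5-mers AGGTC and GGTCA plus their reverse complements GACCT and TGACC.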

Figure 1. TFBS sequences of a TF (TRANSFAC 2008.3 ID: T01333).

Figure 2. Flowchart of the proposed framework to discover association rules from TRANSFAC.

Since there are multiple TFBSs for each TF (e.g. Figure 1 ), a question arises: how should the ‘commonly found’ TFBS k -mers of a TF be defined? Without loss of generality, the majority rule ( 31 ) is applied. If the majority of a TF's TFBS sequences contain a certain DNA k -mer, then the k -mer is considered ‘commonly found’. We set the majority to be 50% for TFBS k -mers. We count only the number of TFBS sequences in which a certain k -mer appears, so as not to be biased by multiple occurrences of a k -mer in only a few TFBS sequences. Figure 1 illustrates an example with five TFBS sequences. The TFBS DNA k -mer AGGTC (or its reverse complement, GACCT) can be found in three of the TFBS sequences. The k -mer thus appears in 60% (3/5) of the TFBS sequences of the TF and is considered ‘commonly found’. On the other hand, GTTCA is not considered ‘commonly found’ because it appears in only 2 (40%) of the 5 TFBS sequences of the TF.
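The majority rule can be sketched as follows. The function name `commonly_found_kmers` is hypothetical, and the five sequences in the usage note are invented so that they reproduce the 3/5 and 2/5 proportions of the example; they are not the actual sequences of Figure 1. Whether a k-mer present in exactly 50% of the sequences qualifies is not stated in the text, so the sketch assumes that it does (`>=`).

```python
def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string over A, C, G, T."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(dna))


def commonly_found_kmers(tfbs_sequences, k, majority=0.5):
    """DNA k-mers found in at least `majority` of a TF's TFBS sequences.

    Each sequence is counted at most once per k-mer, and a k-mer is treated
    as present when either it or its reverse complement occurs.
    Assumption: the threshold is inclusive (>=).
    """
    counts = {}
    for seq in tfbs_sequences:
        present = set()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            present.add(kmer)
            present.add(reverse_complement(kmer))
        for kmer in present:
            counts[kmer] = counts.get(kmer, 0) + 1
    total = len(tfbs_sequences)
    return {kmer for kmer, c in counts.items() if c / total >= majority}
```

On the made-up sequences `["AGGTCAAAG", "TTGACCTGG", "CAGGTCGTTCA", "GGTTCACC", "ACGTACGTA"]` with `k=5`, AGGTC (and GACCT) occurs in three of five sequences and is retained, whereas GTTCA occurs in only two and is dropped.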

After all valid TF data samples are transformed into itemsets, Apriori algorithm is applied to generate frequent TF–TFBS k -mer sequence patterns (the links in Figure 2 , second part). The special feature in this study is that the co-occurring pairs should contain both TF and TFBS k -mer items, as illustrated in the third part of Figure 2 . In the current study, we only consider one TF k -mer with one TFBS k -mer in the frequent itemsets, but it is straightforward to generalize it to be multiple TF and TFBS k -mers in principle. The huge computational intensity for the generalization, when applied on the large TRANSFAC database, prevents us from doing so at this time. Finally, the association rules are computed based on the confidence measurements for the frequent itemsets, which are defined as follows:
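\[
\mathrm{conf}(k\text{-mer}_{\mathrm{DNA}} \Rightarrow k\text{-mer}_{\mathrm{AA}}) = \frac{\mathrm{support}(\{k\text{-mer}_{\mathrm{DNA}},\ k\text{-mer}_{\mathrm{AA}}\})}{\mathrm{support}(\{k\text{-mer}_{\mathrm{DNA}}\})}
\]
\[
\mathrm{conf}(k\text{-mer}_{\mathrm{AA}} \Rightarrow k\text{-mer}_{\mathrm{DNA}}) = \frac{\mathrm{support}(\{k\text{-mer}_{\mathrm{DNA}},\ k\text{-mer}_{\mathrm{AA}}\})}{\mathrm{support}(\{k\text{-mer}_{\mathrm{AA}}\})}
\]
where $\mathrm{conf}(k\text{-mer}_{\mathrm{DNA}} \Rightarrow k\text{-mer}_{\mathrm{AA}})$ is called the forward confidence, $\mathrm{conf}(k\text{-mer}_{\mathrm{AA}} \Rightarrow k\text{-mer}_{\mathrm{DNA}})$ is called the backward confidence and support( X ) is the support of itemset X . For each association rule, its forward confidence measures the posterior probability that the corresponding amino acid k -mer can be found in a TF's sequence if the DNA k -mer is commonly found in the TF's TFBS sequences. Its backward confidence measures the posterior probability that the corresponding DNA k -mer can be commonly found in a TF's TFBS sequences if the amino acid k -mer is found in the TF's sequence. The minimum of the two is taken as the confidence in this article. The higher the confidence, the better the association rule ( Figure 2 , fourth part). The whole proposed approach is summarized in Figure 2 .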
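The forward and backward confidences described above can be computed directly from supports. The following is a minimal sketch; the function name `rule_confidence` and the representation of each TF data set as a pair of sets (commonly found DNA k-mers, amino acid k-mers) are assumptions made for illustration, and the records in the usage note are invented.

```python
def rule_confidence(records, dna_kmer, aa_kmer):
    """Forward, backward and overall confidence of a DNA/amino-acid k-mer pair.

    records: list of (dna_kmers, aa_kmers) tuples, one per TF data set, where
             dna_kmers is the set of commonly found TFBS k-mers and aa_kmers
             is the set of k-mers in the TF sequence.
    """
    s_dna = sum(1 for dna, aa in records if dna_kmer in dna)
    s_aa = sum(1 for dna, aa in records if aa_kmer in aa)
    s_both = sum(1 for dna, aa in records if dna_kmer in dna and aa_kmer in aa)
    # Forward: DNA k-mer => amino acid k-mer; backward: the reverse direction.
    forward = s_both / s_dna if s_dna else 0.0
    backward = s_both / s_aa if s_aa else 0.0
    # The article takes the minimum of the two as the rule's confidence.
    return forward, backward, min(forward, backward)
```

For instance, if AGGTC is commonly found for three TFs, CEGCK occurs in four TFs and both hold for two TFs, the forward confidence is 2/3, the backward confidence is 2/4 and the confidence of the rule is 0.5.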

Data preparation

To apply the methodology to TRANSFAC, TF and TFBS data were downloaded and extracted from the flat files of TRANSFAC 2008.3 [a free public (older) version is also available ( http://www.gene-regulation.com/pub/database.html )]. Entries without sequence data were discarded. Since a TF can bind to one or more TFBSs, TFBS data were grouped by TF. TFBS sequences were extracted for each TF to form a TF data set (a TF sequence and the corresponding TFBS sequences), which was finally transformed into itemsets. To avoid sampling error, TF data sets with fewer than five TFBS sequences were discarded. Furthermore, redundancy among TF sequences was removed by BLASTClust using 90% TF sequence identity ( 32 ), and only one TF data set was selected for each cluster. Note that we used only sequence data in TRANSFAC; no prior information other than sequences (e.g. the binding domains of TFs) was used. Importantly, it turns out that the results of the proposed approach can be verified by annotations, 3D structures from PDB and even homology modeling, as described in the subsequent sections.

After data preparation, 631 TF data sets (listed in Table 5 in Appendix 1 ) were selected. The minimum support ( 30 ) was set to seven TF data sets to avoid sampling error. For the values of k , we tried 4–6 for both TF k -mers and TFBS k -mers, resulting in 9 (3 × 3) different combinations. In particular, 256 DNA 4-mers, 1024 DNA 5-mers and 4096 DNA 6-mers were adopted for TFBSs, whereas 99 621 amino acid 4-mers, 82 561 amino acid 5-mers and 39 320 amino acid 6-mers were adopted for TFs, as the frequent 1-itemsets.
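The two ways of building the 1-itemset vocabulary can be sketched as follows: exhaustive enumeration for DNA k-mers (4^k items) and a data-driven scan for amino acid k-mers. The function names are hypothetical, and filtering amino acid k-mers by the number of TF sequences containing them is an assumption of this sketch; the exact criterion used to obtain the amino acid counts quoted above is not spelled out in the text.

```python
from collections import Counter
from itertools import product


def all_dna_kmers(k):
    """Enumerate all 4**k DNA k-mers in alphabetical order."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def frequent_aa_kmers(tf_sequences, k, min_support):
    """Amino acid k-mers occurring in at least `min_support` TF sequences.

    Each TF sequence contributes at most one count per k-mer.
    """
    counts = Counter()
    for seq in tf_sequences:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return {kmer for kmer, c in counts.items() if c >= min_support}
```

Here `len(all_dna_kmers(k))` is 256, 1024 and 4096 for k = 4, 5 and 6, in agreement with the numbers of DNA k-mers quoted above.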

Apriori algorithm was then applied to discover frequently co-occurring TF–TFBS k -mer pairs (2-itemsets). Finally, the resultant pairs were rescanned in TRANSFAC to measure their forward and backward confidences ( 33 ).
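Because only pairs of one TFBS k-mer and one TF k-mer are considered, the 2-itemset step can be sketched compactly. The function name `frequent_tf_tfbs_pairs` and the record layout (a pair of sets per TF data set) are assumptions for illustration; the toy records in the usage note are invented.

```python
from collections import Counter
from itertools import product


def frequent_tf_tfbs_pairs(records, min_support):
    """Frequent (DNA k-mer, amino acid k-mer) pairs and their supports.

    records: list of (dna_kmers, aa_kmers) tuples of sets, one per TF data set.
    """
    # Level 1: keep only k-mers that are frequent on their own.
    dna_counts = Counter(m for dna, _ in records for m in dna)
    aa_counts = Counter(m for _, aa in records for m in aa)
    freq_dna = {m for m, c in dna_counts.items() if c >= min_support}
    freq_aa = {m for m, c in aa_counts.items() if c >= min_support}
    # Level 2: count mixed pairs built from frequent items only
    # (downward closure guarantees no frequent pair is missed).
    pair_counts = Counter()
    for dna, aa in records:
        for pair in product(dna & freq_dna, aa & freq_aa):
            pair_counts[pair] += 1
    return {pair: c for pair, c in pair_counts.items() if c >= min_support}
```

In the study the minimum support is seven TF data sets; a smaller value is convenient for trying the sketch on a handful of toy records.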

RESULTS AND ANALYSIS

In this section, the discovered rules are reported, followed by analysis with different measurements.

Rules discovered

Varying k from 4 to 6 for both TF k -mers and TFBS k -mers, we have obtained nine sets of associated pairs. For each set of pairs, the forward and backward confidences of each pair were calculated. Then, the pairs in the same set were sorted by the minima of their forward and backward confidences in descending order. The nine sets of rules (pairs) exhibit a similar trend that the number of rules decreases as the association criterion becomes more stringent (with higher confidence levels). The TFBS 5-mers settings in general show the most available rules when the confidence level is high (≥0.5), indicating more conserved and significant results. Therefore, we focus on them and use TFBS 5-mer–TF 5-mer as the representative example throughout the article. The results for all other settings are available in the Supplementary Data .

The number of rules (pairs) discovered is summarized in Table 1 . For instance, there are 70 TF 5-mer–TFBS 5-mer pairs without any further removal (in the N column) with both forward and backward confidences ≥0.5. Considering direct and reverse complement TFBS DNA k -mers as equivalent, we further removed the duplicated pairs (e.g. leaving AGGTC–CEGCK and removing GACCT–CEGCK because AGGTC and GACCT are reverse complements). The results are shown in the N ′ column in Table 1 . For instance, the 70 TF 5-mer–TFBS 5-mer pairs were reduced to 35 at a confidence level of 0.5. Furthermore, we found that most pairs could be merged together to form a longer pair. For instance, GGTCA–SGYHY and GGTCA–GYHYG could be merged to form a pair GGTCA–SGYHYG. Thus the pairs have been merged and the rule numbers are shown in the Nm column in Table 1 . For instance, 35 TF 5-mer–TFBS 5-mer pairs are merged to form 11 merged pairs when the confidence level is equal to 0.5.
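The two post-processing steps, removal of reverse-complement duplicates and merging of overlapping pairs, can be sketched as follows. The function names are hypothetical. Keeping the alphabetically smaller of a DNA k-mer and its reverse complement is an assumption of this sketch (it happens to retain AGGTC rather than GACCT, as in the example), and the merge shown covers only the simple case of two amino acid k-mers overlapping by all but one residue.

```python
def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string over A, C, G, T."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(dna))


def canonical_dna(kmer):
    """Representative of a k-mer and its reverse complement (assumed: the smaller)."""
    return min(kmer, reverse_complement(kmer))


def remove_reverse_complement_duplicates(pairs):
    """Collapse (DNA k-mer, amino acid k-mer) pairs whose DNA parts are reverse complements."""
    return sorted({(canonical_dna(dna), aa) for dna, aa in pairs})


def merge_overlapping(left, right):
    """Merge two k-mers when `right` extends `left` by one residue, else None."""
    overlap = len(right) - 1
    if overlap > 0 and left.endswith(right[:overlap]):
        return left + right[-1]
    return None
```

With these helpers, `merge_overlapping("SGYHY", "GYHYG")` returns `"SGYHYG"`, so the pairs GGTCA–SGYHY and GGTCA–GYHYG collapse into GGTCA–SGYHYG as in the example above.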

Table 1. Number of the TFBS 5-mer–TF 5-mer pairs across different confidence levels

| Confidence | N | N′ | Nm | S |
|---|---|---|---|---|
| 0.0 | 262 | 131 | 29 | 9.88 ± 3.68 |
| 0.1 | 262 | 131 | 29 | 9.88 ± 3.68 |
| 0.2 | 240 | 120 | 24 | 10.14 ± 3.73 |
| 0.3 | 180 | 90 | 23 | 10.63 ± 4.11 |
| 0.4 | 126 | 63 | 21 | 11.40 ± 4.59 |
| 0.5 | 70 | 35 | 11 | 13.63 ± 5.05 |
| 0.6 | 24 | 12 | 8 | 15.08 ± 5.28 |
| 0.7 | 6 | 3 | 2 | 10.33 ± 2.36 |
| 0.8 | 0 | 0 | 0 | N/A |
| 0.9 | 0 | 0 | 0 | N/A |
| 1.0 | 0 | 0 | 0 | N/A |

N, number of pairs; N′, number of pairs (duplicated pairs removed); Nm, number of merged pairs; S, mean and SD of the support of the pairs in N′.


Quantitative analysis

To evaluate the number of TF data sets supporting each pair, the support of each pair was counted. In general, the average support increases with the confidence level. For instance, the average support of the TFBS 5-mer–TF 5-mer pairs in the S column of Table 1 rises from 9.88 at a confidence level of 0.0 to 15.08 at 0.6, although it drops to 10.33 at 0.7, where only three pairs remain. The overall results are summarized in Supplementary Table S4 .

Support measures the degree of co-occurrence between a TF amino acid k -mer and a TFBS DNA k -mer, and forward and backward confidences account for the cases in which one of them is present and the other is absent. Neither measure accounts for the remaining case, in which both are absent. To take this case into account, ϕ-coefficients ( 35 ) were measured for each pair, as shown in the ϕ column in Table 2 . The overall results are summarized in Supplementary Table S5 . Most values are >0.4, indicating that positive correlations exist within the pairs.
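As a sketch of how the ϕ-coefficient uses all four cells of the 2 × 2 contingency table (both present, only one present, both absent), it can be written in terms of supports. The function name `phi_coefficient` and the example counts are hypothetical; returning 0 when the coefficient is undefined is a convention adopted only for this sketch.

```python
from math import sqrt


def phi_coefficient(n_total, s_dna, s_aa, s_both):
    """Phi coefficient of a DNA k-mer / amino acid k-mer pair.

    n_total: number of TF data sets; s_dna, s_aa: individual supports;
    s_both: support of the pair. Equivalent to
    (n11*n00 - n10*n01) / sqrt(n1. * n0. * n.1 * n.0) on the 2x2 table.
    """
    denominator = sqrt(s_dna * (n_total - s_dna) * s_aa * (n_total - s_aa))
    if denominator == 0:
        # Undefined when an item occurs in none or all data sets.
        return 0.0
    return (n_total * s_both - s_dna * s_aa) / denominator
```

With 631 TF data sets and made-up supports of 20, 15 and 10 for the DNA k-mer, the amino acid k-mer and the pair, the coefficient is about 0.57; it equals 1 for two items that always occur together and 0 under independence.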

Table 2. Quantitative measurements for the TFBS 5-mer–TF 5-mer pairs across different confidence levels

| Confidence | ϕ | L | FC | BC |
|---|---|---|---|---|
| 0.0 | 0.49 ± 0.11 | 17.92 ± 7.34 | 1.89 ± 0.67 | 3.50 ± 2.29 |
| 0.1 | 0.49 ± 0.11 | 17.92 ± 7.34 | 1.89 ± 0.67 | 3.50 ± 2.29 |
| 0.2 | 0.51 ± 0.11 | 18.32 ± 7.46 | 1.94 ± 0.68 | 3.51 ± 2.30 |
| 0.3 | 0.54 ± 0.10 | 19.81 ± 7.79 | 2.02 ± 0.64 | 3.46 ± 2.31 |
| 0.4 | 0.58 ± 0.09 | 21.41 ± 8.53 | 2.23 ± 0.66 | 3.61 ± 2.40 |
| 0.5 | 0.64 ± 0.07 | 22.57 ± 10.46 | 2.49 ± 0.70 | 4.35 ± 2.65 |
| 0.6 | 0.71 ± 0.06 | 25.80 ± 13.76 | 3.33 ± 0.57 | 4.21 ± 2.55 |
| 0.7 | 0.79 ± 0.03 | 42.07 ± 14.87 | 3.70 ± 0.29 | 4.87 ± 0.00 |
| 0.8 | N/A | N/A | N/A | N/A |
| 0.9 | N/A | N/A | N/A | N/A |
| 1.0 | N/A | N/A | N/A | N/A |

ϕ, mean and SD of ϕ-coefficient; L, mean and SD of lift; FC, mean and SD of forward conviction; BC, mean and SD of backward conviction.


Consider the following scenario: if a TFBS DNA k -mer and a TF amino acid k -mer are both frequently found in the data sets, they are very likely to co-occur frequently merely by chance. Forward and backward confidences already play an important role in pruning such pairs. For clarity, however, lift ( 36 ), the ratio of the actual support to the expected support, was also measured for each pair; the expected support was calculated from a random model in which the TFBS DNA k -mer is independent of the TF amino acid k -mer. For instance, the average lift for the TFBS 5-mer–TF 5-mer pairs is shown in the L column in Table 2 . The overall results are summarized in Supplementary Table S6 . Most lift values are >5. Thus the DNA k -mer and the amino acid k -mer of most pairs co-occur at least five times more frequently than expected under the independence assumption.
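The lift described above, the ratio of observed to expected support under independence, can be sketched in a few lines. The function name `lift` and the example counts are hypothetical, and the sketch assumes that both individual supports are non-zero.

```python
def lift(n_total, s_dna, s_aa, s_both):
    """Ratio of the observed pair support to the support expected by chance.

    Under independence the expected support is s_dna * s_aa / n_total.
    Assumes s_dna > 0 and s_aa > 0.
    """
    expected = s_dna * s_aa / n_total
    return s_both / expected
```

For example, with 631 TF data sets and made-up supports of 20 (DNA k-mer), 15 (amino acid k-mer) and 10 (pair), the expected support is about 0.48 and the lift is about 21, i.e. the pair co-occurs roughly 21 times more often than independence would predict.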

To estimate the validity of the pairs, both forward and backward convictions (in the same directions as the forward and backward confidences, respectively) ( 36 ) were measured for each pair. The measurements were averaged for each set of pairs. For instance, the average forward and backward convictions for the TFBS 5-mer–TF 5-mer pairs are shown in the FC and BC columns in Table 2 . The overall results are summarized in Supplementary Tables S7 and S8 . Most values are >1, meaning that the pairs commit fewer errors than would be expected if the TFBS k -mer and the TF k -mer were statistically independent. In other words, the pairs would have committed more errors if the association between the TFBS k -mer and the TF k -mer had arisen purely by chance.
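Conviction compares how often a rule would be wrong under independence with how often it is actually wrong. A minimal sketch using the standard definition, conviction(X ⇒ Y) = (1 − P(Y)) / (1 − conf(X ⇒ Y)), is given below; the function names and example counts are hypothetical, and the antecedent support is assumed to be non-zero.

```python
def conviction(n_total, s_antecedent, s_consequent, s_both):
    """Conviction of the rule antecedent => consequent.

    Values > 1 mean the rule fails less often than expected by chance;
    a rule that never fails has infinite conviction.
    """
    confidence = s_both / s_antecedent
    if confidence == 1.0:
        return float("inf")
    return (1 - s_consequent / n_total) / (1 - confidence)


def forward_backward_conviction(n_total, s_dna, s_aa, s_both):
    """Convictions in the same directions as forward and backward confidence."""
    forward = conviction(n_total, s_dna, s_aa, s_both)   # DNA k-mer => AA k-mer
    backward = conviction(n_total, s_aa, s_dna, s_both)  # AA k-mer => DNA k-mer
    return forward, backward
```

With 631 TF data sets and made-up supports of 20, 15 and 10, the forward and backward convictions are about 1.95 and 2.90, both above 1; under exact independence the conviction is 1.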

Annotation analysis

If the pairs in our results are the actual binding cores between TFs and TFBSs, most of their TF amino acid k -mers should be inside DNA binding domains. Thus, the TF amino acid k -mers were scanned in TRANSFAC to check whether they were within the annotated DNA binding domains. As stated in the previous section, the set of TFBS 4-mer–TF 4-mer pairs constitutes all the pairs in the other sets by the downward closure property. Thus only the TF amino acid 4-mers of the set of TFBS 4-mer–TF 4-mer pairs were needed for the checking: of the 792 TF amino acid 4-mers, formula of them were found within the DNA binding domains listed in the ‘PFAM 18’ list downloaded from DBD ( 37 ) on 25 January 2010.

Empirical analysis

Since the number of results is quite large, they were summarized statistically in the previous sections. This section provides readers with empirical insights into the results obtained. Compared with the other sets, the set of TFBS 5-mer–TF 5-mer pairs is relatively robust to confidence-level pruning, which motivates an in-depth empirical analysis of these pairs. They are listed in Table 3 .

Table 3. The set of TFBS 5-mer–TF 5-mer pairs (duplicated pairs removed and sorted in alphabetical order)

| Confidence | Forward confidence | Backward confidence | Pairs | Confidence | Forward confidence | Backward confidence | Pairs | Confidence | Forward confidence | Backward confidence | Pairs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.7 | 0.7 | 0.8 | AAACA–HNLSL | 0.4 | 0.4 | 0.7 | AGGTC–CQYCR | 0.3 | 0.5 | 0.3 | GCCAC–ARRSR |
| 0.5 | 0.5 | 0.7 | AAACA–IRHNL | 0.2 | 0.2 | 0.6 | AGGTC–CVVCG | 0.4 | 0.5 | 0.4 | GCCAC–ESARR |
| 0.5 | 0.5 | 0.6 | AAACA–KPPYS | 0.6 | 0.6 | 0.7 | AGGTC–EGCKG | 0.4 | 0.4 | 0.6 | GCCAC–KQSNR |
| 0.4 | 0.4 | 0.7 | AAACA–NLSLN | 0.2 | 0.2 | 0.7 | AGGTC–FFRRT | 0.4 | 0.6 | 0.4 | GCCAC–NRESA |
| 0.6 | 0.6 | 0.6 | AAACA–NSIRH | 0.2 | 0.2 | 0.8 | AGGTC–FRRTI | 0.4 | 0.4 | 0.6 | GCCAC–QSNRE |
| 0.5 | 0.5 | 0.6 | AAACA–PPYSY | 0.6 | 0.6 | 0.6 | AGGTC–GCKGF | 0.4 | 0.5 | 0.4 | GCCAC–RESAR |
| 0.4 | 0.4 | 0.6 | AAACA–PYSYI | 0.3 | 0.3 | 0.5 | AGGTC–GFFKR | 0.4 | 0.4 | 0.6 | GCCAC–RKQSN |
| 0.4 | 0.4 | 0.6 | AAACA–QNSIR | 0.4 | 0.4 | 0.5 | AGGTC–GFFRR | 0.4 | 0.4 | 0.5 | GCCAC–RLRKQ |
| 0.7 | 0.7 | 0.8 | AAACA–RHNLS | 0.3 | 0.3 | 0.6 | AGGTC–KGFFK | 0.4 | 0.4 | 0.5 | GCCAC–RRSRL |
| 0.5 | 0.5 | 0.8 | AAACA–SIRHN | 0.4 | 0.4 | 0.5 | AGGTC–KGFFR | 0.4 | 0.4 | 0.6 | GCCAC–RSRLR |
| 0.4 | 0.4 | 0.6 | AAACA–WQNSI | 0.4 | 0.4 | 0.9 | AGGTC–RNRCQ | 0.4 | 0.5 | 0.4 | GCCAC–SARRS |
| 0.4 | 0.4 | 0.6 | AACAA–HNLSL | 0.3 | 0.3 | 0.5 | AGGTC–TCEGC | 0.4 | 0.5 | 0.4 | GCCAC–SNRES |
| 0.3 | 0.3 | 0.6 | AACAA–IRHNL | 0.4 | 0.4 | 0.5 | AGGTC–VCGDK | 0.4 | 0.5 | 0.4 | GCCAC–SRLRK |
| 0.3 | 0.3 | 0.5 | AACAA–NSIRH | 0.2 | 0.2 | 0.5 | AGGTC–VVCGD | 0.6 | 0.6 | 0.8 | GGTCA–CEGCK |
| 0.3 | 0.3 | 0.7 | AACAA–PMNAF | 0.2 | 0.7 | 0.2 | ATTAA–FQNRR | 0.2 | 0.2 | 0.9 | GGTCA–CGDKA |
| 0.4 | 0.4 | 0.6 | AACAA–RHNLS | 0.2 | 0.6 | 0.2 | ATTAA–IWFQN | 0.5 | 0.5 | 0.6 | GGTCA–CKGFF |
| 0.3 | 0.3 | 0.7 | AACAA–RPMNA | 0.2 | 0.6 | 0.2 | ATTAA–KIWFQ | 0.3 | 0.3 | 0.9 | GGTCA–CQYCR |
| 0.3 | 0.3 | 0.7 | AACAA–SIRHN | 0.3 | 0.5 | 0.3 | ATTAA–NRRMK | 0.2 | 0.2 | 0.8 | GGTCA–CVVCG |
| 0.2 | 0.6 | 0.2 | AAGGT–CKGFF | 0.3 | 0.5 | 0.3 | ATTAA–QNRRM | 0.1 | 0.1 | 1.0 | GGTCA–DLVLD |
| 0.2 | 0.5 | 0.2 | AATTA–FQNRR | 0.2 | 0.7 | 0.2 | ATTAA–WFQNR | 0.5 | 0.5 | 0.8 | GGTCA–EGCKG |
| 0.3 | 0.3 | 0.3 | AATTA–NRRAK | 0.2 | 0.5 | 0.2 | CACCC–GEKPY | 0.2 | 0.2 | 0.8 | GGTCA–FFKRS |
| 0.4 | 0.4 | 0.5 | AATTA–QNRRA | 0.1 | 0.5 | 0.1 | CACCC–HTGEK | 0.2 | 0.2 | 0.8 | GGTCA–FFRRT |
| 0.3 | 0.3 | 0.7 | AATTA–QVWFQ | 0.1 | 0.5 | 0.1 | CACCC–TGEKP | 0.2 | 0.2 | 1.0 | GGTCA–FRRTI |
| 0.5 | 0.5 | 0.5 | AATTA–VWFQN | 0.5 | 0.5 | 0.5 | CCACG–ARRSR | 0.5 | 0.5 | 0.7 | GGTCA–GCKGF |
| 0.2 | 0.5 | 0.2 | AATTA–WFQNR | 0.5 | 0.5 | 0.6 | CCACG–ESARR | 0.2 | 0.2 | 0.5 | GGTCA–GFFKR |
| 0.5 | 0.5 | 0.7 | ACGTG–ARRSR | 0.3 | 0.3 | 0.7 | CCACG–KQSNR | 0.3 | 0.3 | 0.6 | GGTCA–GFFRR |
| 0.1 | 0.1 | 0.7 | ACGTG–ERELK | 0.2 | 0.2 | 0.6 | CCACG–LRKQA | 0.1 | 0.1 | 0.6 | GGTCA–GYHYG |
| 0.5 | 0.5 | 0.9 | ACGTG–ESARR | 0.6 | 0.6 | 0.6 | CCACG–NRESA | 0.1 | 0.1 | 1.0 | GGTCA–ITCEG |
| 0.2 | 0.2 | 0.8 | ACGTG–KQSNR | 0.3 | 0.3 | 0.6 | CCACG–QSNRE | 0.2 | 0.2 | 0.6 | GGTCA–KGFFK |
| 0.2 | 0.2 | 0.7 | ACGTG–LRKQA | 0.5 | 0.5 | 0.6 | CCACG–RESAR | 0.3 | 0.3 | 0.6 | GGTCA–KGFFR |
| 0.6 | 0.6 | 0.9 | ACGTG–NRESA | 0.2 | 0.2 | 0.7 | CCACG–RKQAE | 0.1 | 0.1 | 1.0 | GGTCA–NRCQY |
| 0.2 | 0.2 | 0.7 | ACGTG–QSNRE | 0.3 | 0.3 | 0.7 | CCACG–RKQSN | 0.1 | 0.1 | 1.0 | GGTCA–RCQYC |
| 0.5 | 0.5 | 0.9 | ACGTG–RESAR | 0.3 | 0.3 | 0.5 | CCACG–RLRKQ | 0.1 | 0.1 | 0.8 | GGTCA–RNQCQ |
| 0.1 | 0.1 | 0.7 | ACGTG–RKQAE | 0.3 | 0.3 | 0.6 | CCACG–RRSRL | 0.3 | 0.3 | 1.0 | GGTCA–RNRCQ |
| 0.2 | 0.2 | 0.8 | ACGTG–RKQSN | 0.3 | 0.3 | 0.6 | CCACG–RSRLR | 0.2 | 0.2 | 1.0 | GGTCA–SCEGC |
| 0.2 | 0.2 | 0.6 | ACGTG–RLRKQ | 0.5 | 0.5 | 0.6 | CCACG–SARRS | 0.1 | 0.1 | 0.5 | GGTCA–SGYHY |
| 0.2 | 0.2 | 0.7 | ACGTG–RRSRL | 0.5 | 0.5 | 0.6 | CCACG–SNRES | 0.3 | 0.3 | 0.6 | GGTCA–TCEGC |
| 0.2 | 0.2 | 0.8 | ACGTG–RSRLR | 0.4 | 0.4 | 0.4 | CCACG–SRLRK | 0.3 | 0.3 | 0.6 | GGTCA–VCGDK |
| 0.5 | 0.5 | 0.9 | ACGTG–SARRS | 0.5 | 0.5 | 0.5 | CGGAA–LRYYY | 0.2 | 0.2 | 0.7 | GGTCA–VVCGD |
| 0.5 | 0.5 | 0.9 | ACGTG–SNRES | 0.5 | 0.5 | 0.8 | CTTCC–LRYYY | 0.5 | 0.5 | 0.7 | GTCAA–KYGQK |
| 0.3 | 0.3 | 0.5 | ACGTG–SRLRK | 0.4 | 0.4 | 0.7 | CTTCC–LWQFL | 0.5 | 0.5 | 0.7 | GTCAA–RKYGQ |
| 0.6 | 0.7 | 0.6 | AGGTC–CEGCK | 0.4 | 0.7 | 0.4 | GATAA–CNACG | 0.5 | 0.5 | 0.7 | GTCAA–WRKYG |
| 0.3 | 0.3 | 0.8 | AGGTC–CGDKA | 0.4 | 0.7 | 0.4 | GATAA–LCNAC | 0.7 | 0.7 | 1.0 | TGACA–NWFIN |
| 0.6 | 0.7 | 0.6 | AGGTC–CKGFF | 0.6 | 0.7 | 0.6 | GATAA–NACGL | | | | |
Table 3.

The set of TFBS 5-mer–TF 5-mer pairs (duplicated pairs removed and sorted in alphabetical order)

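| Confidence | Forward confidence | Backward confidence | Pair | Confidence | Forward confidence | Backward confidence | Pair | Confidence | Forward confidence | Backward confidence | Pair |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.7 | 0.7 | 0.8 | AAACA–HNLSL | 0.4 | 0.4 | 0.7 | AGGTC–CQYCR | 0.3 | 0.5 | 0.3 | GCCAC–ARRSR |
| 0.5 | 0.5 | 0.7 | AAACA–IRHNL | 0.2 | 0.2 | 0.6 | AGGTC–CVVCG | 0.4 | 0.5 | 0.4 | GCCAC–ESARR |
| 0.5 | 0.5 | 0.6 | AAACA–KPPYS | 0.6 | 0.6 | 0.7 | AGGTC–EGCKG | 0.4 | 0.4 | 0.6 | GCCAC–KQSNR |
| 0.4 | 0.4 | 0.7 | AAACA–NLSLN | 0.2 | 0.2 | 0.7 | AGGTC–FFRRT | 0.4 | 0.6 | 0.4 | GCCAC–NRESA |
| 0.6 | 0.6 | 0.6 | AAACA–NSIRH | 0.2 | 0.2 | 0.8 | AGGTC–FRRTI | 0.4 | 0.4 | 0.6 | GCCAC–QSNRE |
| 0.5 | 0.5 | 0.6 | AAACA–PPYSY | 0.6 | 0.6 | 0.6 | AGGTC–GCKGF | 0.4 | 0.5 | 0.4 | GCCAC–RESAR |
| 0.4 | 0.4 | 0.6 | AAACA–PYSYI | 0.3 | 0.3 | 0.5 | AGGTC–GFFKR | 0.4 | 0.4 | 0.6 | GCCAC–RKQSN |
| 0.4 | 0.4 | 0.6 | AAACA–QNSIR | 0.4 | 0.4 | 0.5 | AGGTC–GFFRR | 0.4 | 0.4 | 0.5 | GCCAC–RLRKQ |
| 0.7 | 0.7 | 0.8 | AAACA–RHNLS | 0.3 | 0.3 | 0.6 | AGGTC–KGFFK | 0.4 | 0.4 | 0.5 | GCCAC–RRSRL |
| 0.5 | 0.5 | 0.8 | AAACA–SIRHN | 0.4 | 0.4 | 0.5 | AGGTC–KGFFR | 0.4 | 0.4 | 0.6 | GCCAC–RSRLR |
| 0.4 | 0.4 | 0.6 | AAACA–WQNSI | 0.4 | 0.4 | 0.9 | AGGTC–RNRCQ | 0.4 | 0.5 | 0.4 | GCCAC–SARRS |
| 0.4 | 0.4 | 0.6 | AACAA–HNLSL | 0.3 | 0.3 | 0.5 | AGGTC–TCEGC | 0.4 | 0.5 | 0.4 | GCCAC–SNRES |
| 0.3 | 0.3 | 0.6 | AACAA–IRHNL | 0.4 | 0.4 | 0.5 | AGGTC–VCGDK | 0.4 | 0.5 | 0.4 | GCCAC–SRLRK |
| 0.3 | 0.3 | 0.5 | AACAA–NSIRH | 0.2 | 0.2 | 0.5 | AGGTC–VVCGD | 0.6 | 0.6 | 0.8 | GGTCA–CEGCK |
| 0.3 | 0.3 | 0.7 | AACAA–PMNAF | 0.2 | 0.7 | 0.2 | ATTAA–FQNRR | 0.2 | 0.2 | 0.9 | GGTCA–CGDKA |
| 0.4 | 0.4 | 0.6 | AACAA–RHNLS | 0.2 | 0.6 | 0.2 | ATTAA–IWFQN | 0.5 | 0.5 | 0.6 | GGTCA–CKGFF |
| 0.3 | 0.3 | 0.7 | AACAA–RPMNA | 0.2 | 0.6 | 0.2 | ATTAA–KIWFQ | 0.3 | 0.3 | 0.9 | GGTCA–CQYCR |
| 0.3 | 0.3 | 0.7 | AACAA–SIRHN | 0.3 | 0.5 | 0.3 | ATTAA–NRRMK | 0.2 | 0.2 | 0.8 | GGTCA–CVVCG |
| 0.2 | 0.6 | 0.2 | AAGGT–CKGFF | 0.3 | 0.5 | 0.3 | ATTAA–QNRRM | 0.1 | 0.1 | 1.0 | GGTCA–DLVLD |
| 0.2 | 0.5 | 0.2 | AATTA–FQNRR | 0.2 | 0.7 | 0.2 | ATTAA–WFQNR | 0.5 | 0.5 | 0.8 | GGTCA–EGCKG |
| 0.3 | 0.3 | 0.3 | AATTA–NRRAK | 0.2 | 0.5 | 0.2 | CACCC–GEKPY | 0.2 | 0.2 | 0.8 | GGTCA–FFKRS |
| 0.4 | 0.4 | 0.5 | AATTA–QNRRA | 0.1 | 0.5 | 0.1 | CACCC–HTGEK | 0.2 | 0.2 | 0.8 | GGTCA–FFRRT |
| 0.3 | 0.3 | 0.7 | AATTA–QVWFQ | 0.1 | 0.5 | 0.1 | CACCC–TGEKP | 0.2 | 0.2 | 1.0 | GGTCA–FRRTI |
| 0.5 | 0.5 | 0.5 | AATTA–VWFQN | 0.5 | 0.5 | 0.5 | CCACG–ARRSR | 0.5 | 0.5 | 0.7 | GGTCA–GCKGF |
| 0.2 | 0.5 | 0.2 | AATTA–WFQNR | 0.5 | 0.5 | 0.6 | CCACG–ESARR | 0.2 | 0.2 | 0.5 | GGTCA–GFFKR |
| 0.5 | 0.5 | 0.7 | ACGTG–ARRSR | 0.3 | 0.3 | 0.7 | CCACG–KQSNR | 0.3 | 0.3 | 0.6 | GGTCA–GFFRR |
| 0.1 | 0.1 | 0.7 | ACGTG–ERELK | 0.2 | 0.2 | 0.6 | CCACG–LRKQA | 0.1 | 0.1 | 0.6 | GGTCA–GYHYG |
| 0.5 | 0.5 | 0.9 | ACGTG–ESARR | 0.6 | 0.6 | 0.6 | CCACG–NRESA | 0.1 | 0.1 | 1.0 | GGTCA–ITCEG |
| 0.2 | 0.2 | 0.8 | ACGTG–KQSNR | 0.3 | 0.3 | 0.6 | CCACG–QSNRE | 0.2 | 0.2 | 0.6 | GGTCA–KGFFK |
| 0.2 | 0.2 | 0.7 | ACGTG–LRKQA | 0.5 | 0.5 | 0.6 | CCACG–RESAR | 0.3 | 0.3 | 0.6 | GGTCA–KGFFR |
| 0.6 | 0.6 | 0.9 | ACGTG–NRESA | 0.2 | 0.2 | 0.7 | CCACG–RKQAE | 0.1 | 0.1 | 1.0 | GGTCA–NRCQY |
| 0.2 | 0.2 | 0.7 | ACGTG–QSNRE | 0.3 | 0.3 | 0.7 | CCACG–RKQSN | 0.1 | 0.1 | 1.0 | GGTCA–RCQYC |
| 0.5 | 0.5 | 0.9 | ACGTG–RESAR | 0.3 | 0.3 | 0.5 | CCACG–RLRKQ | 0.1 | 0.1 | 0.8 | GGTCA–RNQCQ |
| 0.1 | 0.1 | 0.7 | ACGTG–RKQAE | 0.3 | 0.3 | 0.6 | CCACG–RRSRL | 0.3 | 0.3 | 1.0 | GGTCA–RNRCQ |
| 0.2 | 0.2 | 0.8 | ACGTG–RKQSN | 0.3 | 0.3 | 0.6 | CCACG–RSRLR | 0.2 | 0.2 | 1.0 | GGTCA–SCEGC |
| 0.2 | 0.2 | 0.6 | ACGTG–RLRKQ | 0.5 | 0.5 | 0.6 | CCACG–SARRS | 0.1 | 0.1 | 0.5 | GGTCA–SGYHY |
| 0.2 | 0.2 | 0.7 | ACGTG–RRSRL | 0.5 | 0.5 | 0.6 | CCACG–SNRES | 0.3 | 0.3 | 0.6 | GGTCA–TCEGC |
| 0.2 | 0.2 | 0.8 | ACGTG–RSRLR | 0.4 | 0.4 | 0.4 | CCACG–SRLRK | 0.3 | 0.3 | 0.6 | GGTCA–VCGDK |
| 0.5 | 0.5 | 0.9 | ACGTG–SARRS | 0.5 | 0.5 | 0.5 | CGGAA–LRYYY | 0.2 | 0.2 | 0.7 | GGTCA–VVCGD |
| 0.5 | 0.5 | 0.9 | ACGTG–SNRES | 0.5 | 0.5 | 0.8 | CTTCC–LRYYY | 0.5 | 0.5 | 0.7 | GTCAA–KYGQK |
| 0.3 | 0.3 | 0.5 | ACGTG–SRLRK | 0.4 | 0.4 | 0.7 | CTTCC–LWQFL | 0.5 | 0.5 | 0.7 | GTCAA–RKYGQ |
| 0.6 | 0.7 | 0.6 | AGGTC–CEGCK | 0.4 | 0.7 | 0.4 | GATAA–CNACG | 0.5 | 0.5 | 0.7 | GTCAA–WRKYG |
| 0.3 | 0.3 | 0.8 | AGGTC–CGDKA | 0.4 | 0.7 | 0.4 | GATAA–LCNAC | 0.7 | 0.7 | 1.0 | TGACA–NWFIN |
| 0.6 | 0.7 | 0.6 | AGGTC–CKGFF | 0.6 | 0.7 | 0.6 | GATAA–NACGL | | | | |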

Among the 131 pairs in Table 3, the TFBS DNA k-mers are quite conserved: there are only 15 distinct TFBS DNA k-mers, so each TFBS DNA k-mer forms pairs with 8.73 TF amino acid k-mers on average (131/15). One possible reason is that the specificity of a DNA residue is lower than that of an amino acid residue, given the DNA alphabet size (4) compared with the amino acid alphabet size (20).

To act as a DNA-binding protein, a TF needs to provide a basic interacting surface for recognizing the major/minor grooves as well as the phosphate backbone of DNA. We therefore searched the set of pairs in Table 3 to count the frequency of each amino acid residue. Interestingly, the basic residues lysine (50 times) and arginine (131 times) occur at the highest frequencies among the 131 TFBS–TF pairs, whereas the hydrophobic residues (38), such as isoleucine (15 times) and valine (13 times), occur at the lowest frequencies. These results support the potential of the TF sequences to be the binding sequences between TFs and TFBSs. Moreover, because the phosphate backbone makes the DNA of TFBSs negatively charged, the TF residues that bind it are expected to be positively charged. The frequencies of the charged residues were therefore examined further. Among the 131 pairs, the positively charged residues arginine (R) and lysine (K) occur 131 and 50 times, respectively, whereas the negatively charged residues aspartic acid (D) and glutamic acid (E) occur only 8 and 30 times, respectively. This discrepancy further supports their potential to be the binding sequences between TFs and TFBSs.
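The residue tally described above can be illustrated with a minimal Python sketch. It assumes that every occurrence of a one-letter residue code on the TF side of a pair is counted; the five pairs are an illustrative subset copied from Table 3, and the helper name `residue_counts` is hypothetical rather than part of the published framework.

```python
from collections import Counter

# Illustrative subset of TFBS 5-mer-TF 5-mer pairs copied from Table 3.
pairs = [
    ("AAACA", "HNLSL"),
    ("ACGTG", "ARRSR"),
    ("GGTCA", "CEGCK"),
    ("GATAA", "NACGL"),
    ("CTTCC", "LRYYY"),
]


def residue_counts(pairs):
    """Count every occurrence of each amino acid letter on the TF side."""
    counts = Counter()
    for _tfbs_kmer, tf_kmer in pairs:
        counts.update(tf_kmer)  # a string updates the counter letter by letter
    return counts


counts = residue_counts(pairs)
basic = counts["R"] + counts["K"]   # positively charged residues
acidic = counts["D"] + counts["E"]  # negatively charged residues
print(counts.most_common(), basic, acidic)
```

The same function can be applied unchanged to the full list of 131 pairs.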

Experimental analysis

This section follows the same approach as in the empirical analysis. The set of TFBS 5-mer–TF 5-mer pairs in Table 3 is selected for experimental analysis. Out of the 131 pairs, five were selected and analyzed. The first pair is GGTCA–CEGCK, which has been experimentally shown to be a pair of binding sequences in Ref. (39). The TF amino acid k-mer (CEGCK) is part of the P-box (CEGCKG) within the DNA binding domain of Bp-nhr-2, which is believed to bind the DNA k-mer (GGTCA). The second pair is AAACA–IRHNL, mentioned in Ref. (40). Based on the corresponding PDB entry 3CO6, the pair is believed to be a binding pair between a TF and a TFBS, as shown in Figure 3a. Similarly, the remaining pairs are GATAA–NACGL, GGTCA–GFFRR and CTTCC–LRYYY. They are found as binding pairs in PDB entries 3DFV (41), 3DZY (42) and 2NNY (43), as shown in Figure 3b, c and d, respectively. These five pairs show that pairs generated by the proposed approach are supported by biological evidence in the literature. Two of the structures (3CO6 and 2NNY) were further analyzed in terms of hydrogen bonding, which reflects the specificity of the interaction between the amino acids and the bases, as shown in Figure 4a and b. The hydrogen bonds are highlighted as black lines, together with the predicted residues that make contact with the bases; these contacts support the significance and accuracy of the predicted TF–TFBS pairs. Nevertheless, as the proposed approach is applied to a large-scale database, such extensive and detailed analysis of all the binding core pairs discovered is not practical. Therefore, a scalable verification approach is presented in the next section to verify the massive results generated.

Figure 3.

Four representative TF–TFBS pairs shown as ribbon diagrams: (a) the AAACA–IRHNL pair in 3CO6, (b) the GATAA–NACGL pair in 3DFV, (c) the GGTCA–GFFRR pair in 3DZY and (d) the CTTCC–LRYYY pair in 2NNY. The TF amino acids and TFBS nucleotides are highlighted in ball-and-stick format. The sequences of the TF–TFBS pairs are also labeled in the figures. The figures were generated using Protein Workshop (34).

Figure 4.

The interactions between the TF and TFBS of two representative pairs: (a) AAACA–IRHNL in 3CO6 and (b) CTTCC–LRYYY in 2NNY. The proteins are shown as ribbon diagrams with the highlighted TF amino acids in ball-and-stick format. The helices and strands are colored red and cyan, respectively. The amino acids that interact with the nucleotides are labeled. The hydrogen bonds are shown as dark lines. The figures were generated using DS Visualizer (Accelrys).

VERIFICATIONS

In this section, we verify the discovered pairs against external data sources, in particular the experimentally determined 3D protein–DNA complex structures from PDB. Homology modeling was also performed for further verification.

Verification by PDB

In this article, PDB is used as the source of 3D protein–DNA complex data for structural verification. The PDB data were downloaded from RCSB PDB (http://www.pdb.org) between 16 September 2009 and 22 September 2009, and the protein–DNA complexes were selected based on the entry-type list provided at ftp://ftp.wwpdb.org/ .

For each set of pairs in Supplementary Table S2, each pair is evaluated independently, as shown in Figure 5. For each pair, its TF k-mer is used to query PDB for the chains that contain it. Once the corresponding set of PDB chains has been returned, redundant chains are removed by BLASTClust at 90% sequence identity (32), so that redundant PDB chains are not double counted. The pair is then evaluated for binding in 3D space:

  • A TFBS k-mer–TF k-mer pair is considered binding for a PDB chain if and only if an atom of the TFBS k-mer and an atom of the TF k-mer are close to each other. Two atoms are considered close if and only if their distance is <3.5 Å (25, 28).
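As a minimal sketch of this distance criterion (not the authors' implementation), the Python function below reports a pair as binding when any atom of the TF k-mer lies within 3.5 Å of any atom of the TFBS k-mer. The coordinates are made-up toy values and the name `is_binding` is hypothetical; extracting the atoms of each k-mer from a PDB entry is assumed to be done elsewhere.

```python
import math

CUTOFF = 3.5  # distance threshold in angstroms, as stated in the criterion


def is_binding(tf_atoms, tfbs_atoms, cutoff=CUTOFF):
    """Return True if any TF atom is strictly closer than `cutoff` to any TFBS atom.

    Each atom is an (x, y, z) tuple of coordinates in angstroms.
    """
    return any(
        math.dist(tf_atom, tfbs_atom) < cutoff
        for tf_atom in tf_atoms
        for tfbs_atom in tfbs_atoms
    )


# Toy coordinates (hypothetical, for illustration only).
tf_atoms = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
near_dna_atoms = [(0.0, 0.0, 3.0)]  # 3.0 angstroms from the first TF atom
far_dna_atoms = [(0.0, 0.0, 8.0)]   # at least 8.0 angstroms from every TF atom

print(is_binding(tf_atoms, near_dna_atoms))
print(is_binding(tf_atoms, far_dna_atoms))
```

The all-pairs loop is quadratic in the number of atoms, which is acceptable for two 5-mers; a spatial index would be a natural replacement for larger selections.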

Once the pair has been evaluated in its PDB chains, the chains are classified into the following three categories:

  • PDB chains only having the TF k-mer (a)

  • PDB chains having both TF k-mer and TFBS k-mer

    • The pair binds together (b)

    • The pair does not bind together (c)
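The three categories can be sketched as a small classification step. The chain records below are hypothetical placeholders, and the two boolean flags are assumed to come from a sequence search for the TFBS k-mer and from the 3.5 Å contact test, respectively.

```python
from collections import Counter


def classify_chain(has_tfbs_kmer, binds):
    """Classify one PDB chain that is already known to contain the TF k-mer."""
    if not has_tfbs_kmer:
        return "a"  # only the TF k-mer is present
    return "b" if binds else "c"  # both k-mers present: bound (b) or not (c)


# Hypothetical records: (chain label, TFBS k-mer present?, pair in contact?)
chains = [
    ("chain_1", True, True),
    ("chain_2", True, False),
    ("chain_3", False, False),
    ("chain_4", True, True),
]

tally = Counter(classify_chain(has_tfbs, binds) for _label, has_tfbs, binds in chains)
a, b, c = tally["a"], tally["b"], tally["c"]
print(a, b, c)
```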

Thus the number of chains in each category is counted and converted into the following performance metrics:

  • TFBS prediction score = (b + c)/(a + b + c)

  • TFBS binding prediction score = b/(a + b + c)

  • Binding prediction score = b/(b + c)

Given the PDB chains returned for a TF k-mer query, the TFBS prediction score measures the proportion of PDB chains that contain the corresponding TFBS k-mer. In other words, it measures the backward confidence of a pair in PDB. The TFBS binding prediction score is a more stringent metric: it measures the proportion of PDB chains in which the corresponding TFBS k-mer binds the queried TF k-mer. Lastly, the binding prediction score is the most important metric: among the PDB chains that contain both k-mers, it measures the proportion in which the pair actually binds. To verify the cases when (b + c) = 0 (i.e. the pairs do not appear in PDB), homology modeling is also performed.
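The three scores can be computed directly from the chain counts a, b and c. The following Python sketch is illustrative only: the function name `prediction_scores` and the example counts are hypothetical, and returning zero for an undefined ratio follows the conservative convention that the article adopts for pairs without enough PDB data.

```python
def prediction_scores(a, b, c):
    """Compute the three verification scores from the chain counts a, b and c.

    a: chains with only the TF k-mer
    b: chains with both k-mers where the pair binds
    c: chains with both k-mers where the pair does not bind
    """
    total = a + b + c
    with_tfbs = b + c
    return {
        "tfbs_prediction": with_tfbs / total if total else 0.0,
        "tfbs_binding_prediction": b / total if total else 0.0,
        "binding_prediction": b / with_tfbs if with_tfbs else 0.0,
    }


scores = prediction_scores(a=2, b=3, c=1)  # hypothetical counts
no_data = prediction_scores(a=0, b=0, c=0)  # pair absent from PDB
print(scores)
print(no_data)
```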

Figure 5.

Flowchart of 3D verification for each set of pairs.

For each setting, we have a set of pairs. For each pair, the above performance metrics are calculated, and the overall results are averaged and summarized in Supplementary Tables S9–S11. For each setting, we also have a set of merged pairs. For each merged pair, the above performance metrics are also calculated, and the overall results are averaged and summarized in Supplementary Tables S12–S14. Note that the most conservative calculation has been used for each performance metric of each pair: if a performance metric of a pair does not have enough PDB data for calculation, for instance when (b + c) = 0 or (a + b + c) = 0, a value of zero is assigned to that metric. Despite this conservative setting, the performance metrics of the pairs are still reasonable. They are shown to be significantly better than the maximal performance of 50 random runs in a later section.

Nevertheless, although the above metrics capture the performance of a pair quantitatively, the most important point is to know how many generated pairs can be verified, i.e. have at least one binding evidence in PDB data (b > 0). To gain more insight, the number of pairs with at least one related PDB chain [(b + c) > 0] is tabulated in Supplementary Tables S15 and S16. Correspondingly, the percentage of verified pairs [number of pairs with b > 0 divided by the number of pairs with (b + c) > 0] is calculated and tabulated in Supplementary Tables S17 and S18. In these tables, the percentage of verified pairs is high enough to justify that the proposed approach has produced pairs proven to be binding in PDB. For instance, the statistics for the TFBS 5-mer–TF 5-mer pairs are extracted in Table 4 and Figure 6. Among the 80 TFBS 5-mer–TF 5-mer pairs with at least one related PDB chain [(b + c) > 0] at confidence level 0.0, more than 81% have at least one binding evidence (b > 0).
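The percentage of verified pairs is simple arithmetic, sketched below in Python. The counts are those read from the first two count columns of Table 4, on the assumption that the fused digits of each row split as shown; the function name `percent_verified` is hypothetical. A level with no related pairs is reported as `None` because the ratio is undefined there.

```python
# (confidence level, pairs with (b + c) > 0, pairs with b > 0), read from Table 4
rows = [
    (0.0, 80, 65),
    (0.1, 80, 65),
    (0.2, 71, 59),
    (0.3, 50, 44),
    (0.4, 32, 28),
    (0.5, 19, 17),
    (0.6, 9, 9),
    (0.7, 2, 2),
    (0.8, 0, 0),
]


def percent_verified(related, verified):
    """Percentage of related pairs with binding evidence, or None if there are none."""
    if related == 0:
        return None
    return 100.0 * verified / related


percentages = {conf: percent_verified(related, verified) for conf, related, verified in rows}
for conf, pct in percentages.items():
    print(conf, pct)
```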

Table 4.

Number of the TFBS 5-mer–TF 5-mer pairs verified across different confidence levels

| Confidence | Pairs with (b + c) > 0 | Pairs with b > 0 | Merged pairs with (b + c) > 0 | Merged pairs with b > 0 |
|---|---|---|---|---|
| 0.0 | 80 | 65 | 19 | 16 |
| 0.1 | 80 | 65 | 19 | 16 |
| 0.2 | 71 | 59 | 15 | 13 |
| 0.3 | 50 | 44 | 15 | 13 |
| 0.4 | 32 | 28 | 12 | 11 |
| 0.5 | 19 | 17 | 7 | 6 |
| 0.6 | 9 | 9 | 5 | 5 |
| 0.7 | 2 | 2 | 1 | 1 |
| 0.8 | 0 | 0 | 0 | 0 |
| 0.9 | 0 | 0 | 0 | 0 |
| 1.0 | 0 | 0 | 0 | 0 |

Columns after the confidence level, from left to right: number of the TFBS 5-mer–TF 5-mer pairs with at least one related PDB chain [(b + c) > 0]; number of the TFBS 5-mer–TF 5-mer pairs with at least one PDB chain as binding evidence [b > 0]; number of the TFBS 5-mer–TF 5-mer merged pairs with at least one related PDB chain [(b + c) > 0]; number of the TFBS 5-mer–TF 5-mer merged pairs with at least one PDB chain as binding evidence [b > 0].


Figure 6.

Percentage of the TFBS 5-mer–TF 5-mer pairs verified across different confidence levels.

Table 5.

631 TRANSFAC 2008.3 IDs and factor names used in this article

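| ID | Factor name | ID | Factor name | ID | Factor name | ID | Factor name | ID | Factor name | ID | Factor name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| T00003 | AS-CT3 | T00842 | Tra-1(long form) | T01950 | HNF-1α-B | T04378 | Mad | T08676 | STAT6 | T09986 | NF-AT4 |
| T00008 | Adf-1 | T00843 | Ttk69K | T01951 | HNF-1α-C | T04446 | Nkx5-1 | T08787 | ARF1isoform-1 | T09990 | CDP-isoform1 |
| T00011 | ADR1 | T00851 | T3R-β1 | T01973 | REST-form2 | T04539 | RPN4 | T08797 | MCB1 | T10028 | SREBP-1c |
| T00019 | AhR | T00863 | Ubx | T01992 | Abd-A | T04610 | SXR | T08805 | WRKY1 | T10030 | POU3F2 |
| T00026 | Antp | T00886 | v-ErbA | T02003 | Cdx-3 | T04651 | ER-β | T08823 | E2F | T10059 | GCMa |
| T00028 | YAP1 | T00891 | HNF-1β-A | T02008 | Ems | T04665 | Xvent-1 | T08853 | myogenin | T10068 | COUP-TF2 |
| T00033 | AP-2α | T00893 | v-Jun | T02030 | Sd | T04674 | IRF-7A | T08858 | REVERB-α | T10083 | HNF-3α |
| T00063 | Bcd | T00894 | Vmw65 | T02033 | HsfA1 | T04675 | MRF-2-isoform1 | T08863 | S8 | T10144 | Gfi1b |
| T00077 | CACCC-binding factor | T00895 | v-Myb | T02039 | HAC1 | T04679 | dri | T08868 | CTCF | T10187 | NF-E2p45 |
| T00079 | Cad | T00899 | WT1 | T02050 | Nkx6-2 | T04728 | CDC5L | T08878 | Opaque-2 | T10207 | GATA-6 |
| T00080 | CBF1 | T00910 | YB-1 | T02054 | HOX11 | T04733 | Alfin1 | T08972 | EAR2 | T10209 | Nkx2-1 |
| T00104 | C/EBPα | T00915 | YY1 | T02063 | KNOX3 | T04734 | Topors-isoform1 | T08978 | Dl-A | T10211 | Evi-1 |
| T00106 | C/EBP | T00917 | Zen-1 | T02068 | PU.1 | T04783 | mtTFA | T08985 | Pti4 | T10265 | LRH-1 |
| T00109 | C/EBPδ | T00918 | Zeste | T02099 | Zen-2 | T04784 | PF1 | T08989 | Fra-1 | T10276 | Erm |
| T00112 | c-Ets-1 | T00923 | Zta | T02100 | Zeste | T04811 | FOXP1a | T08994 | HIF-1α-isoform1 | T10282 | Otx2 |
| T00113 | c-Ets-2 | T00925 | AMT1 | T02128 | SAP-1b | T04817 | LIM1 | T09001 | BPC1 | T10317 | IA-1 |
| T00115 | c-Ets-168 | T00937 | HBP-1a | T02142 | OCA-B | T04819 | EmBP-1a | T09018 | N-Myc | T10331 | NRF-1 |
| T00117 | CF1 | T00938 | HBP-1b | T02216 | TFIIA-α/β precursor (major) | T04886 | Tel-2b | T09033 | TEF-1 | T10392 | GATA-3 |
| T00120 | CF2-II | T00969 | POU3F1 | T02217 | TFIIA-α/β precursor (minor) | T04931 | p73α | T09051 | AhR | T10393 | GATA-2 |
| T00128 | HOXA4 | T01005 | MEF2A-isoform1 | T02235 | PEBP2αB1 | T04957 | EKLF | T09059 | SEF2-1B | T10429 | PU.1 |
| T00140 | c-Myc | T01017 | CRE-BP2 | T02248 | StuAp | T04961 | GLI2α | T09071 | AG | T10459 | Alx-3 |
| T00151 | CP2a | T01019 | Elf-1 | T02256 | AML1a | T04996 | ZBP89 | T09089 | PIF3 | T10462 | Prop-1 |
| T00163 | CREB | T01027 | BAS1 | T02288 | HFH-1 | T04998 | Tel-2a | T09093 | IPF1 | T10473 | TEF-5 |
| T00167 | ATF-2-xbb4 | T01035 | Isl-1α | T02290 | FOXD3 | T04999 | Tel-2c | T09097 | SRY | T10482 | AP-2γ |
| T00176 | CTF-1 | T01051 | FOXA4a | T02291 | Croc | T05021 | NERF-1a | T09098 | SREBP-2 | T10484 | TEF-3 |
| T00177 | CTF-2 | T01053 | HNF-3β | T02294 | FOXI1a | T05051 | BTEB3 | T09102 | FOXO4 | T10543 | Sox5 |
| T00179 | CUP2 | T01059 | MNB1a | T02302 | GCM | T05137 | CIZ6-1 | T09106 | RelA-p65 | T10573 | DREB1A |
| T00183 | DBP | T01072 | TEF | T02313 | MIBP1 | T05181 | DSF | T09117 | E2F-1 | T10588 | Snai3 |
| T00193 | Dfd | T01074 | Ap | T02330 | G/HBF-1 | T05553 | MYBAS1 | T09129 | BCL-6 | T10638 | HY5 |
| T00204 | E12 | T01078 | GBF1 | T02361 | CREBβ | T05587 | BZI-1 | T09156 | TGIF-isoform2 | T10644 | MTF-1 |
| T00208 | E74A | T01083 | NF-μNR | T02378 | USF1 | T05682 | ERRα1 | T09158 | BZR1 | T10664 | Gfi1 |
| T00217 | EcR | T01085 | abaA | T02419 | Sp3 | T05705 | GATA-1 | T09159 | PITX2A | T10666 | SRY |
| T00253 | En | T01109 | TCF-1(P) | T02420 | Sox13 | T05706 | GATA-2 | T09162 | Pax-3 | T10674 | MafK |
| T00262 | ER-α | T01112 | EBF1-L | T02422 | HNF-4α2 | T05707 | GATA-3 | T09177 | MyoD | T10712 | DMRT1 |
| T00264 | ER-α | T01147 | SF-1isoform2 | T02429 | HNF-4α1 | T05708 | GATA-4 | T09178 | C/EBPα | T10720 | GCR1 |
| T00272 | Eve | T01152 | T3R-α1 | T02463 | GBF1 | T05737 | PCF3 | T09182 | Pax-5 | T10721 | DMRT2 |
| T00295 | Ftz | T01154 | c-Rel | T02469 | AP-2β | T05743 | ABI4 | T09183 | WRKY53 | T10723 | DMRT3 |
| T00296 | FTZ-F1 | T01258 | MSN4 | T02529 | PPARγ1 | T05770 | DREB1A | T09184 | Pax-8 | T10725 | DMRT7 |
| T00301 | GAGA factor | T01265 | MAC1 | T02636 | CBF1 | T05834 | CBF2 | T09190 | AGL15 | T10727 | DMRT4 |
| T00302 | GAL4 | T01274 | ABF2 | T02639 | ANT | T05835 | DRF1.1 | T09194 | NF-AT1C | T10731 | DMRT5 |
| T00303 | GAL80 | T01275 | mat1-Mc | T02654 | ERF2 | T05837 | DRF1.3 | T09195 | SPL14 | T10739 | MRP1 |
| T00315 | GBF | T01286 | ROX1 | T02669 | EmBP-1a | T05929 | SUSIBA2 | T09196 | HSF2A | T10745 | HSFA2 |
| T00329 | Glass | T01313 | ATF3 | T02672 | GBF1 | T05943 | FOXP1d | T09199 | STAT5A | T10747 | MTF-1 |
| T00330 | GLI1 | T01333 | RXR-γ | T02690 | Dof2 | T05975 | E2F1 | T09218 | Msx-1 | T10754 | ABF1 |
| T00331 | GLI3 | T01346 | Arnt | T02691 | Dof3 | T05977 | PEND | T09225 | En-1 | T10760 | HAP1 |
| T00337 | GR-α | T01350 | T3R-β2 | T02772 | GCNF | T05982 | POTH1 | T09226 | Lhx2 | T10795 | C/EBPγ |
| T00349 | HAP2 | T01352 | PPARα | T02786 | RITA-1 | T06004 | DeltaNp63α | T09230 | Prep1 | T10849 | STB5 |
| T00350 | HAP3 | T01388 | C/EBP | T02789 | bZIP910 | T06029 | Sox17 | T09243 | MafG | T10854 | GCN4 |
| T00368 | HNF-1α-A | T01400 | Ets-1deltaVII | T02790 | bZIP911 | T06043 | AGP1 | T09287 | MITF-A2 | T10881 | TRAB1 |
| T00377 | HOXA5 | T01422 | ste11 | T02807 | OSBZ8 | T06137 | p73β | T09304 | Smad4 | T10928 | TGA2 |
| T00383 | HSF | T01427 | p300 | T02809 | ROM1 | T06168 | p63α | T09319 | IRF-1 | T10958 | ATHB-2 |
| T00385 | HSF1 | T01431 | c-Maf (long form) | T02810 | ROM2 | T06341 | BEL5 | T09323 | IRF-1 | T10959 | PCF1 |
| T00386 | HSTF | T01470 | Ik-2 | T02818 | GLN3 | T06356 | Rim101p | T09343 | SRF | T10960 | PCF2 |
| T00395 | Hb | T01471 | Ik-3 | T02825 | gaf2 | T06404 | WRKY38 | T09355 | Alx-4 | T11115 | ZIC1 |
| T00401 | ICP4 | T01476 | Abd-B | T02841 | FACB | T06429 | HIC-1-isoform2 | T09356 | HOXA3 | T11136 | DEC2 |
| T00445 | KNIRPS | T01477 | BR-CZ1 | T02846 | UAY | T06532 | NAC69-1 | T09383 | GABP-α | T11158 | HELIOS-B |
| T00456 | Kr | T01478 | BR-CZ2 | T02878 | TCF-4E | T06533 | MYB80 | T09424 | WRKY2 | T11164 | FOXJ1 |
| T00458 | LAC9 | T01479 | BR-CZ3 | T02897 | Sox6-Isoform1 | T06537 | Ci | T09426 | Sp3-isoform1 | T11166 | FOXF1 |
| T00459 | C/EBPβ(LAP) | T01480 | BR-CZ4 | T02905 | LEF-1 | T08158 | ABZ1 | T09427 | RAP-1-xbb1 | T11180 | Gli1 |
| T00480 | MAL63 | T01481 | Pbx1a | T02907 | MYB305 | T08251 | FBI-1 | T09431 | Sp1 | T11200 | DEC1 |
| T00487 | MATα2 | T01482 | Exd | T02929 | MYB340 | T08252 | NF-AT3 | T09441 | RBP-Jκ | T11217 | Gzf1 |
| T00488 | MATa1 | T01484 | Cdx-1 | T02936 | FOXO1 | T08279 | USF1 | T09444 | CPRF-3 | T11246 | ZIC2 |
| T00489 | Max-isoform2 | T01492 | STAT1α | T02983 | Pax-4a | T08291 | GATA-1 | T09449 | CPRF-2 | T11250 | Brachyury |
| T00490 | MAZ | T01517 | Twi | T02999 | OCSBF-1 | T08292 | GATA-1isoform1 | T09450 | CPRF-1 | T11256 | GCMb |
| T00497 | MBP-1 (1) | T01527 | RORα1 | T03031 | Pax-2.1 | T08293 | GATA-1 | T09462 | Egr-1 | T11258 | GCMa |
| T00500 | MCM1 | T01528 | RORα2 | T03178 | SQUA | T08298 | Kaiso | T09478 | TGA1a | T11310 | MafA |
| T00509 | MIG1 | T01556 | SREBP-1a | T03227 | CAT8 | T08300 | ER-α-L | T09507 | Sox-xbb1 | T11372 | HOXB8 |
| T00529 | MZF1B-C | T01590 | P (long form) | T03256 | HNF-3β | T08313 | USF2a | T09514 | HTF4γ | T11383 | HOXD13 |
| T00535 | NF-1 | T01592 | C1 (long form) | T03258 | HNF-6β | T08318 | Elf-1 | T09531 | ATF-4 | T11390 | Cart-1 |
| T00594 | RelA-p65 | T01599 | LCR-F1 | T03388 | Meis-1a | T08319 | Zec | T09540 | c-Krox | T11394 | PR-β |
| T00625 | ZEB(1124AA) | T01615 | Su | T03389 | Meis-1b | T08321 | p53-isoform1 | T09548 | IRF-3 | T11402 | Crx |
| T00627 | NIT2 | T01649 | HES-1 | T03447 | LHX3b | T08323 | p53 | T09561 | Roaz | T11425 | Chx10 |
| T00642 | POU2F1 | T01660 | PR-α | T03481 | SKN7 | T08340 | Egr-2 | T09569 | Hlf | T11440 | FAC1-xbb1 |
| T00644 | POU2F1a | T01661 | PRA | T03491 | MED8 | T08348 | RXR-α | T09571 | MYB1 | T11453 | TAF-1 |
| T00651 | POU5F1 | T01664 | TR2-11 | T03500 | MOT3 | T08358 | GATA-4 | T09588 | E4BP4 | T13753 | HsfB1 |
| T00653 | POU5F1(Oct-5) | T01667 | RFX2 | T03524 | PDR1 | T08409 | GAMYB | T09608 | Kid3 | T13760 | ABF1 |
| T00669 | Ovo-B | T01669 | RFX2 | T03525 | PDR3 | T08410 | PBF | T09623 | ATF6 | T13794 | TGA1 |
| T00677 | Pax-1 | T01670 | RFX3 | T03538 | RCS1 | T08411 | SED | T09629 | MYBJS1 | T13809 | AGL2 |
| T00689 | PHO2 | T01671 | RFX3 | T03541 | RFX1 | T08415 | CBT | T09635 | AP1 | T13810 | Dof4 |
| T00690 | PHO4 | T01673 | RFX1 | T03556 | RGT1 | T08431 | PPARα | T09649 | cel-let-7 | T13811 | AGL3 |
| T00691 | Pit-1A | T01675 | Nkx2-5 | T03593 | Pax-9a | T08441 | Sox10 | T09701 | cel-miR-84 | T14002 | GKLF |
| T00696 | PRB | T01679 | PacC | T03594 | Pax-9b | T08445 | Elk-1-isoform1 | T09706 | hsa-let-7a | T14118 | ASR-1 |
| T00697 | PRB | T01692 | T3R-β1 | T03600 | SIP4 | T08466 | c-Jun | T09707 | hsa-let-7b | T14187 | AIRE-isoform1 |
| T00699 | Prd | T01705 | HOXA7 | T03612 | NK-4 | T08475 | GR-α | T09718 | hsa-miR-23a | T14230 | WRKY40 |
| T00709 | qa-1F | T01710 | HoxA-9 | T03707 | XBP1 | T08482 | VDR | T09727 | hsa-miR-103 | T14231 | RP58 |
| T00710 | R | T01735 | HOXB7 | T03717 | ZAP1 | T08487 | AR | T09729 | hsa-miR-107 | T14234 | WRKY18 |
| T00715 | RAP1 | T01737 | HOXB8 | T03718 | WRKY1 | T08492 | LRH-1-xbb1 | T09731 | dme-miR-2a | T14258 | Nkx3-2 |
| T00719 | RAR-α1 | T01755 | HOXD9 | T03722 | ZAP1 | T08493 | c-Fos | T09732 | dme-miR-2b | T14268 | MIZF |
| T00725 | REB1 | T01757 | HOXD10 | T03975 | SPF1 | T08505 | COUP-TF1 | T09737 | dme-miR-7 | T14302 | C1-Myb |
| T00731 | RME1 | T01784 | MEF-2A | T03994 | ID1 | T08520 | TBP | T09741 | dme-miR-13a | T14317 | Myb-15 |
| T00737 | SAP-1a | T01786 | E12 | T04001 | ATHB-9 | T08528 | AR | T09742 | dme-miR-13b | T14381 | ATHB-1 |
| T00746 | SGF-3 | T01799 | Tal-1 | T04096 | Smad3 | T08544 | MOVO-B | T09793 | dme-let-7 | T14382 | ATHB-5 |
| T00751 | Sn | T01814 | Pax-6/Pd-5a | T04146 | HLTF | T08546 | Ovo1a | T09806 | hsa-miR-1 | T14442 | STF1 |
| T00761 | SRF | T01823 | Pax-2 | T04166 | FOXD3 | T08571 | GATA-2 | T09810 | hsa-miR-124a | T14444 | TGA1 |
| T00763 | SRF | T01838 | Sox4 | T04169 | FOXJ2 (long isoform) | T08577 | ZBRK1 | T09812 | hsa-miR-130a | T14447 | PBF |
| T00767 | Sry-δ | T01841 | WT1-del2 | T04176 | FOXO4 | T08580 | STAT3 | T09819 | hsa-miR-125a | T14485 | XBP-1 |
| T00769 | Sry-β | T01851 | HMGI | T04255 | Nkx3-1 | T08583 | CCA1 | T09824 | hsa-miR-206 | T14491 | CBNAC |
| T00776 | SWI5 | T01865 | Oct-2.3 | T04280 | FOXP3 | T08584 | LHY | T09840 | hsa-miR-130b | T14517 | Zic3 |
| T00788 | T-Ag | T01866 | Oct-2.4 | T04297 | Nkx6-1 | T08613 | ZNF219 | T09880 | dre-miR-430a | T14521 | ZF5 |
| T00789 | Tll | T01867 | Oct-2.6 | T04312 | NURR1-isoform1 | T08615 | PLZFB | T09892 | c-Myb-isoform1 | T14543 | CBF1 |
| T00798 | TBP | T01882 | unc-86 | T04323 | Nkx2-5 | T08619 | WEREWOLF | T09914 | SF-1 | T14573 | FUS3 |
| T00810 | TFE3-L | T01888 | POU6F1(c2) | T04324 | DREF | T08621 | HAHB-4 | T09923 | RREB-1 | T14681 | Spz1 |
| T00812 | TFEB-isoform1 | T01897 | Cf1a | T04336 | Nkx2-8 | T08624 | Sox9 | T09942 | HNF-3β | T14827 | DEAF-1 |
| T00814 | TFE3-S | T01900 | PDM-1 | T04337 | Nkx2-2 | T08630 | CAR | T09949 | FOXC1 | T14951 | Ncx |
| T00830 | TGA1b | T01944 | NF-AT1 | T04345 | TBX5-L | T08667 | SZF1-1 | T09960 | TR4 | T14954 | OG-2 |
| T14992 | Pitx3 | | | | | | | | | | |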
Table 5.

631 TRANSFAC 2008.3 IDs and factor names used in this article


The TFBS–TF pairs for which we found binding evidence in the PDB show typical structural features of DNA–protein interactions. Such features include the 'recognition helix' of the DNA-binding protein making base contacts in the major groove, and direct hydrogen bonds between the amino acid side chains and the bases. These interactions play crucial roles in DNA recognition and site-specific binding, respectively ( 44 ). Interestingly, the nucleotides of the TFBS are located in the major groove of the DNA, where they lie close to, and make contacts with, the amino acids of the 'recognition helix' of the TF (see, for example, Figure 3 ).

The verification is considered satisfactory, since those pairs not found in the PDB [( b + c ) = 0] may be unannotated discoveries, as illustrated by the following verification by homology modeling.

Verification by homology modeling

For the pairs without any related PDB chain [( b + c ) = 0], no PDB data are available to verify them. We have therefore taken the most conservative approach and assigned zero to their performance metrics in the aforementioned evaluations. Nevertheless, we believe that most of those pairs are true and that our approach can be used as an effective protein–DNA binding discovery tool. To illustrate this, six TFBS 5-mer–TF 5-mer pairs were taken and merged. The resultant pair ACGTG–SNRESARRSR was analyzed by homology modeling as follows:

The model of the DNA–protein complex was built by homology modeling (INSIGHT II, MSI) based on the structure of the GCN4–DNA complex (1YSA) ( 45 ). Briefly, three amino acids (R234S, T236R and A238S) and two nucleotides (T29C and A31T) were mutated in the original structure. The side chains of the mutated amino acids were chosen from the rotamer database and examined using Ramachandran plots to avoid steric clashes. The interactions between the amino acids and the nucleotides were identified based on hydrogen-bond distances.
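A distance-based search of this kind can be sketched as follows. This is only an illustration: the 3.5 Å cutoff, the rule that treats any atom whose name starts with N or O as a potential donor or acceptor, the `Atom` record and the toy coordinates are all assumptions introduced here, not values or structures taken from the article or from 1YSA.

```python
import math
from typing import NamedTuple, Tuple

class Atom(NamedTuple):
    chain: str                      # "protein" or "dna" (hypothetical labels)
    residue: str                    # illustrative residue label
    name: str                       # atom name, e.g. "NH1" or "O6"
    xyz: Tuple[float, float, float] # coordinates in angstroms

HBOND_CUTOFF = 3.5   # assumed donor-acceptor distance cutoff (angstroms)
POLAR = ("N", "O")   # assumed: atom names starting with N or O are polar

def candidate_hbonds(atoms, cutoff=HBOND_CUTOFF):
    """List protein-DNA polar atom pairs whose distance is within the cutoff."""
    protein = [a for a in atoms if a.chain == "protein" and a.name.startswith(POLAR)]
    dna = [a for a in atoms if a.chain == "dna" and a.name.startswith(POLAR)]
    hits = []
    for p in protein:
        for d in dna:
            dist = math.dist(p.xyz, d.xyz)
            if dist <= cutoff:
                hits.append((p.residue, p.name, d.residue, d.name, round(dist, 2)))
    # Shortest contacts first
    return sorted(hits, key=lambda h: h[-1])

# Toy coordinates, invented for illustration only
toy_atoms = [
    Atom("protein", "ARG_a", "NH1", (0.0, 0.0, 0.0)),
    Atom("protein", "SER_b", "OG", (10.0, 0.0, 0.0)),
    Atom("protein", "ALA_c", "CB", (0.0, 1.0, 0.0)),   # carbon: ignored
    Atom("dna", "G_x", "O6", (0.0, 0.0, 2.9)),
    Atom("dna", "C_y", "N4", (10.0, 3.0, 0.0)),
    Atom("dna", "T_z", "O4", (20.0, 0.0, 0.0)),        # too far from everything
]

for hit in candidate_hbonds(toy_atoms):
    print(hit)
```

A purely distance-based filter only proposes candidate hydrogen bonds; a stricter check would also consider donor and acceptor roles and bond angles, which this sketch leaves out.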

Figure 7.

The pair ACGTG–SNRESARRSR using homology modeling.

As shown in Figure 3 , we found that the pair ACGTG–SNRESARRSR occurs in plants, in the basic leucine-zipper (bZIP) transcription factors known as G-box binding factors (GBFs) ( 46 ). Moreover, ACGTG is the consensus sequence defined as the G-box core, and it is located in the major groove of the double-stranded DNA. The G-box core is believed to be the DNA sequence that provides the specificity of binding to bZIP proteins. To further understand the interactions within this TF–TFBS pair, we built a model by homology modeling based on the structure of the GCN4–DNA complex (1YSA) ( 45 ). As shown in the model, the protein helix fits into the major groove of the DNA very well and forms extensive interactions (black lines) between the amino acids and the nucleotides. Interestingly, the mutations of the protein (R234S, T236R and A238S) and of the nucleotides (T29C and A31T) increase the number of hydrogen bonds compared with the original structure (1YSA), suggesting binding specificity between this TF–TFBS pair. In conclusion, we believe that the protein–DNA binding sequence patterns found using association rule mining on the large-scale database reveal real TF–TFBS pairs in physiologically relevant situations, and that this method could guide the discovery of new and undescribed TF–TFBS pairs in the future.

Verification by random analysis

For each set of pairs in Supplementary Table S1 , we used a random process to generate a random set with the same number of pairs. The pairs in a random set were randomly sampled from all combinations of the k-mers used in the proposed approach. Fifty random runs were performed. The maximal performance metrics over the 50 random runs are summarized in Supplementary Tables S19–S21 , and they are compared with those of the proposed approach in Figure 8 . The performance of the proposed approach is markedly better than the best of the 50 random runs. For instance, the binding prediction score of the 131 TFBS 5-mer–TF 5-mer pairs generated is 0.36±0.39 on average, whereas the maximal binding prediction score over the 50 random runs is only 0.00509±0.06492 on average. A similar observation can be made for the merged pairs in Supplementary Tables S22–S24 . We conclude that the performance of the proposed approach is very unlikely to arise purely by chance in the PDB.
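The random baseline described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: pairs are sampled uniformly and without replacement, the per-set score is a plain mean, and `score_fn`, `toy_score`, `verified` and the toy k-mers are hypothetical stand-ins for the PDB-based metrics and data, which are not reproduced here.

```python
import random
from itertools import product

def random_pair_set(tfbs_kmers, tf_kmers, n_pairs, rng):
    """Sample n_pairs distinct (TFBS k-mer, TF k-mer) pairs uniformly at random."""
    all_pairs = list(product(tfbs_kmers, tf_kmers))
    return rng.sample(all_pairs, n_pairs)

def best_random_score(tfbs_kmers, tf_kmers, n_pairs, score_fn, runs=50, seed=0):
    """Return the maximal mean score observed over `runs` random pair sets."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    scores = []
    for _ in range(runs):
        pairs = random_pair_set(tfbs_kmers, tf_kmers, n_pairs, rng)
        scores.append(sum(score_fn(p) for p in pairs) / n_pairs)
    return max(scores)

# Toy inputs, invented for illustration only
tfbs_kmers = ["ACGTG", "TGACG", "GATAA"]
tf_kmers = ["SNRES", "NRESA", "AAAAA", "CCCCC"]
verified = {("ACGTG", "SNRES"), ("ACGTG", "NRESA")}

def toy_score(pair):
    # Hypothetical score: 1 if the pair is in the toy "verified" set, else 0
    return 1.0 if pair in verified else 0.0

print(best_random_score(tfbs_kmers, tf_kmers, n_pairs=5, score_fn=toy_score))
```

Enumerating every combination with `product` is only practical for small k-mer vocabularies; for large ones, sampling the two k-mers independently would avoid building the full list.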

Figure 8.

Performance comparison for PDB verifications. ( a ) TFBS prediction score, ( b ) TFBS binding prediction score, ( c ) binding prediction score, ( d ) TFBS prediction score (merged pairs), ( e ) TFBS binding prediction score (merged pairs) and ( f ) binding prediction score (merged pairs) are shown.

DISCUSSION

In this article, we have proposed a framework based on association rule mining with the Apriori algorithm to discover associated TF–TFBS binding sequence patterns in the most explicit and interpretable form from TRANSFAC. Owing to the downward closure property, the algorithm is exact: it is guaranteed to generate all frequent TFBS k-mer–TF k-mer pairs from TRANSFAC. The approach relies merely on sequence information, without any prior knowledge of TF binding domains or protein–DNA 3D structure data. In comprehensive evaluations, the statistics of the discovered patterns are shown to reflect meaningful binding characteristics. According to the external literature, PDB data and homology modeling, a good number of the TF–TFBS binding patterns discovered have been verified by experiments and annotations. They exhibit atomic-level bindings between the respective TF binding domains and specific nucleotides of the TFBSs in experimentally determined protein–DNA 3D structures. In fact, most of the pairs discovered are actually the binding cores of the TF binding domains and TFBSs, respectively.

The proposed approach has great potential for discovering intuitive and interpretable rules of TF–TFBS binding mechanisms. Such rules can reveal TF binding domains, detailed interactions between amino acids and nucleotides, and accurate TFBS sequence motifs, and can help to better understand and decipher protein–DNA interactions. The approach also offers strategic help in reducing the labor and costs involved in wet-lab experiments. With increasing computational power and more sophisticated mining approaches, the proposed methodology can be further improved to discover more intriguing TF–TFBS binding patterns and rules.

In the future, approximate associations will be considered to handle experimental and biological noise, although the inevitable computational burden needs to be carefully handled, and much more effort is needed to distinguish real signals from the large number of false positives introduced by loosening the pattern matching and clustering. Combinatorial associations between multiple TF and TFBS k-mers will be another challenging topic. We will also seek further real applications of the approach to experimentally verifiable TF–TFBS bindings.

FUNDING

Research Grants Council of the Hong Kong SAR, China (project numbers: CUHK41407 and CUHK414708). Funding for open access charge: Block Grant Project of the Chinese University of Hong Kong, ref #2150591; Focused Investment Scheme D on Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong (project number: 1904014).

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

The authors are grateful to the anonymous reviewers for their valuable comments. They would also like to thank Shaoke Lou, Kin-Cheung Ling and Leung-Yau Lo for their valuable help in preparing the study and the manuscript.

REFERENCES

1. Luscombe NM, Thornton JM. Protein-DNA interactions: amino acid conservation and the effects of mutations on binding specificity. J. Mol. Biol., 2002, vol. 320 (pg. 991-1009)
2. Luscombe NM, Austin SE, Berman HM, Thornton JM. An overview of the structures of protein-DNA complexes. Genome Biol., 2000, vol. 1, pg. REVIEWS001
3. Galas DJ, Schmitz A. DNAse footprinting: a simple method for the detection of protein-DNA binding specificity. Nucleic Acids Res., 1987, vol. 5 (pg. 3157-3170)
4. Garner MM, Revzin A. A gel electrophoresis method for quantifying the binding of proteins to specific DNA regions: application to components of the Escherichia coli lactose operon regulatory system. Nucleic Acids Res., 1981, vol. 9 (pg. 3047-3060)
5. Smith AD, Sumazin P, Das D, Zhang MQ. Mining ChIP-chip data for transcription factor and cofactor binding sites. Bioinformatics, 2005, vol. 21 Suppl 1 (pg. i403-i412)
6. MacIsaac KD, Fraenkel E. Practical strategies for discovering regulatory DNA sequence motifs. PLoS Comput. Biol., 2006, vol. 2, pg. e36
7. Liu XS, Brutlag DL, Liu JS. An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol., 2002, vol. 20 (pg. 835-839)
8. Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, et al. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res., 2006, vol. 34 (pg. 108-110)
9. Hulo N, Bairoch A, Bulliard V, Cerutti L, Cuche BA, de Castro E, Lachaize C, Langendijk-Genevaux PS, Sigrist CJA. The 20 years of PROSITE. Nucleic Acids Res., 2008, vol. 36 Suppl.1 (pg. D245-D249)
10. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer ELL, et al. The Pfam protein families database. Nucleic Acids Res., 2004, vol. 32 (pg. D138-D141)
11. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res., 2000, vol. 28 (pg. 235-242)
12. Kel AE, Goessling E, Reuter I, Cheremushkin E, Kel-Margoulis OV, Wingender E. MATCH: a tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res., 2003, vol. 31 (pg. 3576-3579)
13. Stormo GD. Computer methods for analyzing sequence recognition of nucleic acids. Annu. Rev. Biochem., 1988, vol. 17 (pg. 241-263)
14. Jensen ST, Liu XS, Zhou Q, Liu JS. Computational discovery of gene regulatory binding motifs: a Bayesian perspective. Statistical Science, 2004, vol. 19 (pg. 188-204)
15. Tompa M, Li N, Bailey TL, Church GM, Moor BD, Eskin E, Favorov AV, Frith MC, Fu Y, Kent, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol., 2005, vol. 23 (pg. 137-144)
16. Sandve GK, Abul O, Walseng V, Drablos F. Improved benchmarks for computational motif discovery. BMC Bioinformatics, 2007, vol. 8, pg. 193
17. Jones S, van Heyningen P, Berman HM, Thornton JM. Protein-DNA interactions: a structural analysis. J. Mol. Biol., 1999, vol. 287 (pg. 877-896)
18. Luscombe NM, Laskowski RA, Thornton JM. Amino acid-base interactions: a three-dimensional analysis of protein-DNA interactions at an atomic level. Nucleic Acids Res., 2001, vol. 29 (pg. 2860-2874)
19. Krishna SS, Majumdar I, Grishin NV. Structural classification of zinc fingers: survey and summary. Nucleic Acids Res., 2003, vol. 31 (pg. 532-550)
20. Jones S, Shanahan HP, Berman HM, Thornton JM. Using electrostatic potentials to predict DNA-binding sites on DNA-binding proteins. Nucleic Acids Res., 2003, vol. 31 (pg. 7189-7198)
21. Mandel-Gutfreund Y, Schueler O, Margalit H. Comprehensive analysis of hydrogen bonds in regulatory protein DNA-complexes: in search of common principles. J. Mol. Biol., 1995, vol. 253 (pg. 370-382)
22. Mandel-Gutfreund Y, Margalit H. Quantitative parameters for amino acid-base interaction: implications for prediction of protein-DNA binding sites. Nucleic Acids Res., 1998, vol. 26 (pg. 2306-2312)
23. Sarai A, Kono H. Protein-DNA recognition patterns and predictions. Annu. Rev. Biophys. Biomol. Struct., 2005, vol. 34 (pg. 379-398)
24. Zhou Q, Liu JS. Extracting sequence features to predict protein-DNA interactions: a comparative study. Nucleic Acids Res., 2008, vol. 36 (pg. 4137-4148)
25. Ahmad S, Gromiha MM, Sarai A. Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information. Bioinformatics, 2004, vol. 20 (pg. 477-486)
26. Ahmad S, Keskin O, Sarai A, Nussinov R. Protein-DNA interactions: structural, thermodynamic and clustering patterns of conserved residues in DNA-binding proteins. Nucleic Acids Res., 2008, vol. 36 (pg. 5922-5932)
27. Pham TH, Clemente JC, Satou K, Ho TB. Computational discovery of transcriptional regulatory rules. Bioinformatics, 2005, vol. 21 (pg. 101-107)
28. Ofran Y, Mysore V, Rost B. Prediction of DNA-binding residues from sequence. Bioinformatics, 2007, vol. 23 (pg. i347-i353)
29. Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases. SIGMOD '93: Proceedings of the 1993 ACM SIGMOD international conference on Management of data, 1993 (pg. 207-216) http://portal.acm.org/citation.cfm?id=170072
30. Hipp J, Güntzer U, Nakhaeizadeh G. Algorithms for association rule mining - a general survey and comparison. SIGKDD Explor. Newsl., 2000, vol. 2 (pg. 58-64)
31. May KO. A set of independent necessary and sufficient conditions for simple majority decision. Econometrica, 1952, vol. 20 (pg. 680-684)
32. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol., 1990, vol. 215 (pg. 403-410)
33. Geng L, Hamilton HJ. Interestingness measures for data mining: a survey. ACM Comput. Surv., 2006, vol. 38, pg. 9
34. Moreland JL, Gramada A, Buzko OV, Zhang Q, Bourne PE. The Molecular Biology Toolkit (MBT): a modular platform for developing molecular visualization applications. BMC Bioinformatics, 2005, vol. 6, pg. 21
35. Guilford JP. Psychometric Methods, 1936, New York, McGraw-Hill
36. Brin S, Motwani R, Ullman JD, Tsur S. Dynamic itemset counting and implication rules for market basket data. SIGMOD Rec., 1997, vol. 26 (pg. 255-264)
37. Wilson D, Charoensawan V, Kummerfeld SK, Teichmann SA. DBD taxonomically broad transcription factor predictions: new content and functionality. Nucleic Acids Res., 2008, vol. 36 (pg. D88-D92)
38. Privalov PL, Gill SJ. Stability of protein structure and hydrophobic interaction. Adv. Protein Chem., 1988, vol. 39 (pg. 191-234)
39. Moore J, Devaney E. Cloning and characterization of two nuclear receptors from the filarial nematode Brugia pahangi. Biochem. J., 1999, vol. 344 Pt 1 (pg. 245-252)
40. Brent MM, Anand R, Marmorstein R. Structural basis for DNA recognition by FoxO1 and its regulation by posttranslational modification. Structure, 2008, vol. 16 (pg. 1407-1416)
41. Bates DL, Chen Y, Kim G, Guo L, Chen L. Crystal structures of multiple GATA zinc fingers bound to DNA reveal new insights into DNA recognition and self-association by GATA. J. Mol. Biol., 2008, vol. 381 (pg. 1292-1306)
42. Chandra V, Huang P, Hamuro Y, Raghuram S, Wang Y, Burris TP, Rastinejad F. Structure of the intact PPAR-gamma-RXR-alpha nuclear receptor complex on DNA. Nature, 2008, vol. 456 (pg. 350-356)
43. Lamber EP, Vanhille L, Textor LC, Kachalova GS, Sieweke MH, Wilmanns M. Regulation of the transcription factor Ets-1 by DNA-mediated homo-dimerization. EMBO J., 2008, vol. 27 (pg. 2006-2017)
44. Pabo CO, Sauer RT. Transcription Factors: structural families and Principles of DNA recognition. Annu. Rev. Biochem., 1992, vol. 61 (pg. 1053-1095)
45. Ellenberger TE, Brandl CJ, Struhl K, Harrison SC. The GCN4 basic region leucine zipper binds DNA as a dimer of uninterrupted alpha helices: crystal structure of the protein-DNA complex. Cell, 1992, vol. 71 (pg. 1223-1237)
46. Sibéril Y, Doireau P, Gantet P. Plant bZIP G-box binding factors. Modular structure and activation mechanisms. Eur. J. Biochem., 2001, vol. 268 (pg. 5655-5666)

Appendix 1

Algorithm 1 Pseudocode of Apriori algorithm (29)
data: A dataset of itemsets
L_n: Frequent n-itemsets
C_n: Candidate n-itemsets
x: An itemset
minsupport: Minimum support
i ← 1;
Scan data to get L_i;
while L_i ≠ ∅ do
     C_{i+1} ← Extend(L_i);
     L_{i+1} ← ∅;
     for x ∈ C_{i+1} do
         if Support(x) ≥ minsupport then
             L_{i+1} ← L_{i+1} ∪ {x};
         end if
     end for
     i ← i + 1;
end while
Notes:
Extend(L_i) is the ‘Candidate itemset generation procedure’ stated in (29). Support(x) returns the support (30) of the itemset x. A frequent n-itemset is an n-itemset whose support is at least minsupport.
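The level-wise loop of Algorithm 1 can be sketched in Python as follows. This is only an illustrative sketch, not the implementation used in this work: the function names `support`, `extend` and `apriori` are hypothetical, support is assumed to be the fraction of itemsets in the dataset that contain x, and the candidate generation step is a simplified join-and-prune stand-in for the procedure described in (29).

```python
from itertools import combinations


def support(itemset, data):
    """Fraction of itemsets in `data` containing every item of `itemset`
    (relative support is an assumption of this sketch)."""
    return sum(1 for t in data if itemset <= t) / len(data)


def extend(frequent, size):
    """Join frequent (size-1)-itemsets into candidate size-itemsets and prune
    candidates that have an infrequent (size-1)-subset."""
    candidates = set()
    for a in frequent:
        for b in frequent:
            union = a | b
            if len(union) == size and all(
                frozenset(sub) in frequent
                for sub in combinations(union, size - 1)
            ):
                candidates.add(union)
    return candidates


def apriori(data, minsupport):
    """Return a dict mapping n to the set of frequent n-itemsets (L_n)."""
    data = [frozenset(t) for t in data]
    # Scan data to get L_1.
    singles = {frozenset([item]) for t in data for item in t}
    level = {x for x in singles if support(x, data) >= minsupport}
    frequent = {}
    i = 1
    while level:  # while L_i is not empty
        frequent[i] = level
        candidates = extend(level, i + 1)  # C_{i+1}
        level = {x for x in candidates if support(x, data) >= minsupport}
        i += 1
    return frequent


# Toy dataset with hypothetical items A, B and C.
toy = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
result = apriori(toy, minsupport=0.5)
```

On this toy dataset every single item and every pair is frequent, whereas the triple {A, B, C} appears in only two of the five itemsets and is therefore dropped, which ends the loop at the third level.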
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Supplementary data
