Figure 2.
Overview of the creation of our 12 datasets (Section 2.5). We create two sets of proteins: allAF (denoted by the letter code a) and reliableAF (denoted by rl). Within each of the allAF and reliableAF protein sets, we create a sequence redundant (letter code r) and a sequence non-redundant (letter code ) set of proteins. From each of the sequence redundant and sequence non-redundant protein sets, we create three datasets in which the ratio of the number of non-TFs to the number of TFs is 3, 5, or 10, giving 2 × 2 × 3 = 12 datasets in total. We name a dataset using the convention , where , and . In the figure, the shaded boxes highlight the parts of the data creation logic that lead to the sequence non-redundant datasets.
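
The branching described in this caption can be sketched as a simple enumeration. This is only an illustrative sketch, not the authors' pipeline: the dictionary keys and the function name `enumerate_datasets` are hypothetical, and the datasets are labeled by their full descriptions rather than by the letter-code naming convention.

```python
from itertools import product

# Design axes taken from the caption; the descriptive labels are used
# instead of letter codes, which are an assumption-free choice here.
PROTEIN_SETS = ("allAF", "reliableAF")
REDUNDANCY = ("sequence redundant", "sequence non-redundant")
NON_TF_TO_TF_RATIOS = (3, 5, 10)


def enumerate_datasets():
    """Return one descriptor per dataset (hypothetical helper)."""
    return [
        {"protein_set": p, "redundancy": r, "non_tf_to_tf_ratio": k}
        for p, r, k in product(PROTEIN_SETS, REDUNDANCY, NON_TF_TO_TF_RATIOS)
    ]


datasets = enumerate_datasets()
for d in datasets:
    print(d["protein_set"], "|", d["redundancy"], "|", d["non_tf_to_tf_ratio"])
```

Enumerating the Cartesian product of the three axes makes it easy to check that every combination of protein set, redundancy level, and class ratio is covered exactly once.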