## Results: 7

1.

2.

3.

4.

**Time complexity of Steps 2 and 3 of the amcBPPS program.** (**A**) Plot of run times versus the number of aligned residues in the input multiple alignment. Shown are data points from Table 1 and the corresponding linear regression trend line (*r* = 0.95). Because both axes of this plot use a logarithmic scale, the exponent *k* of the observed time complexity O($n^{k}$) can be estimated from the slope of the trend line: since time $t = c\,n^{k}$, it follows that $\log t = \log c + k \log n$, which is a straight line with slope *k* on a log-log plot. The slope of the trend line is *k* = 1.2, indicating an observed time complexity somewhat worse than linear. (**B**) Plot of run times versus the number of aligned residues times the number of nodes in the hierarchy created in Step 2. This plot results in a slightly better fit (*r* = 0.98). The slope of the trend line is *k* = 0.9, indicating an observed time complexity that is essentially linear.
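The slope estimate described in this caption can be reproduced with an ordinary least-squares fit on log-transformed data. The following is a minimal sketch, not the authors' analysis code: the function name `fit_power_law` and the data values are hypothetical (they are not the Table 1 measurements), and NumPy is assumed to be available.

```python
import numpy as np

def fit_power_law(n, t):
    """Fit t = c * n**k by least squares on log10-transformed data.

    Returns (k, c, r): the exponent, the constant factor, and the
    Pearson correlation coefficient of the log-log points.
    """
    log_n = np.log10(np.asarray(n, dtype=float))
    log_t = np.log10(np.asarray(t, dtype=float))
    # A degree-1 polynomial fit returns [slope, intercept].
    k, log_c = np.polyfit(log_n, log_t, 1)
    r = np.corrcoef(log_n, log_t)[0, 1]
    return k, 10 ** log_c, r

# Hypothetical, noise-free data (NOT taken from Table 1):
# run time grows as 0.002 * n**1.2.
n_residues = np.array([1e4, 1e5, 1e6, 1e7])
run_times = 0.002 * n_residues ** 1.2

k, c, r = fit_power_law(n_residues, run_times)
print(f"slope k = {k:.2f}, c = {c:.4f}, r = {r:.3f}")
```

With real, noisy measurements the recovered slope is only an empirical estimate of the exponent over the sampled range of input sizes.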

5.

**A multiple category model optimized by the mcBPPS sampler.**

**(top)** A tree representing the hierarchical relationships between functionally divergent protein subgroups. Color code: internal nodes, blue; leaf nodes, red. Each subtree within the tree (i.e., each node and its descendants) corresponds to a set of sequences that generally conserve a pattern that sequences in the rest of the tree generally lack. For example, node 5 could represent a subfamily whose family, superfamily and class are represented by the subtrees rooted at nodes 4, 2 and 1, respectively.

**(middle)** The corresponding functional divergence (FD-)table. A tree is converted into an FD-table as follows: the subtree rooted at each node of the tree corresponds to the foreground (‘+’ rows) for that column in the table, whereas the rest of the subtree rooted at the parent of that node corresponds to the background (‘-’ rows). (A set of randomly generated sequences serves as the background for the root node.) Each internal node in the tree corresponds to a miscellaneous category, that is, to sequences sharing a common pattern with, but lacking patterns specific to, each of its descendant subtrees.

**(bottom)** Contrast alignment corresponding to column 4 of the table. Each subgroup corresponding to a row with a ‘+’ or a ‘-’ symbol in that column is assigned to the foreground or background, respectively; subgroups with an ‘o’ symbol are omitted from that contrast alignment.
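The tree-to-table conversion described above can be sketched in a few lines. This is an illustrative sketch only, not the mcBPPS implementation: the function names, the `"random"` row label, and the example tree are hypothetical. The example tree is merely consistent with the path 1 → 2 → 4 → 5 mentioned in the caption; the rest of its shape is an assumption.

```python
def subtree(children, node):
    """Return the set containing `node` and all of its descendants."""
    nodes = {node}
    for child in children.get(node, []):
        nodes |= subtree(children, child)
    return nodes

def tree_to_fd_table(parent):
    """Convert a tree (node -> parent, root -> None) into an FD-table.

    The result maps each column (one per node) to a dict of
    row -> '+', '-' or 'o'. Rows are the tree nodes plus a
    hypothetical 'random' row for the root's background.
    """
    children = {}
    for node, par in parent.items():
        if par is not None:
            children.setdefault(par, []).append(node)

    rows = sorted(parent)
    table = {}
    for col in rows:
        foreground = subtree(children, col)
        par = parent[col]
        # Background: the rest of the subtree rooted at the parent.
        background = subtree(children, par) - foreground if par is not None else set()
        column = {}
        for row in rows:
            if row in foreground:
                column[row] = "+"
            elif row in background:
                column[row] = "-"
            else:
                column[row] = "o"
        # Random sequences serve as the background for the root only.
        column["random"] = "-" if par is None else "o"
        table[col] = column
    return table

# Hypothetical tree: 1 is the root; 2 and 3 are its children;
# 4 is a child of 2; 5 and 6 are children of 4.
parent = {1: None, 2: 1, 3: 1, 4: 2, 5: 4, 6: 4}
fd_table = tree_to_fd_table(parent)

for row in sorted(parent) + ["random"]:
    print(row, " ".join(fd_table[col][row] for col in sorted(parent)))
```

In this sketch, reading column 4 gives the contrast alignment for node 4: rows 4, 5 and 6 form the foreground, row 2 (the remaining sequences under the parent) forms the background, and all other rows are omitted.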

6.

**Schematic drawing of a contrast alignment and the corresponding probability model.** Aligned sequences are assigned to either a ‘foreground’ or a ‘background’ partition (orange and gray horizontal bars, respectively). Partitioning is based on the conservation of foreground residues (blue vertical bars) that diverge from (or contrast with) the background residues at those positions (white vertical bars). Red vertical bar heights quantify the selective pressure imposed on divergent residue positions. Below this is given the logarithm of the corresponding probability distribution for the possible sequence partitions and corresponding discriminating patterns, which together serve as the random variables over which sampling occurs.

**X** is an *n × k* matrix representing a multiple alignment of *n* sequences and *k* columns; $x_{ij}$ is a 20-dimensional vector of all 0’s except for a lone ‘1’ indicating the observed residue type; **R** is a vector indicating which rows (i.e., sequences) belong to the foreground ($R_{i} = 1$) or background ($R_{i} = 0$) partitions; **C** is a vector indicating which columns do ($C_{j} = 1$) or do not ($C_{j} = 0$) differentiate the foreground from the background; **Θ** is an array of vectors representing the amino acid compositions at each column position for each partition; […] denotes the inner product of two vectors; and […] models the foreground composition at pattern positions, where […] is the background amino acid frequency vector for column *j*, the parameter α specifies the expected background ‘contamination’ at pattern positions in the foreground, and $\delta_{Aj}$ is a vector that specifies the pattern residues at position *j*. At non-pattern positions, the vector $\theta_{j}$ corresponds to the overall (foreground and background) composition. The third through sixth terms in the equation correspond to the logarithm of the product of the prior probabilities, with *p*(α) and *p*(**Θ**) defined by the beta and product Dirichlet distributions, respectively, and with *p*(**R**) and *p*(**C**) defined by independent Bernoulli distributions; prior definitions are as shown (in parentheses). The log-likelihood ratio (LLR) is computed by subtracting from the log-probability for the observed contrast alignment the log-probability for a ‘null’ contrast alignment, in which all of the sequences are assigned to the background partition.
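The data structures named in this caption (**X**, **R** and **C**) can be illustrated with a small sketch. It covers only the encoding and the per-partition residue counts, not the probability model itself. The function names, the toy alignment, and the residue ordering are hypothetical, and the sketch assumes an ungapped alignment over the 20 standard amino acids.

```python
import numpy as np

# Assumed residue ordering; any fixed ordering of the 20 types works.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot_alignment(seqs):
    """Encode n aligned sequences of length k as an (n, k, 20) array X.

    X[i, j] is all zeros except for a single 1 marking the residue
    observed in sequence i at column j. Gaps are not handled.
    """
    n, k = len(seqs), len(seqs[0])
    X = np.zeros((n, k, 20), dtype=int)
    for i, seq in enumerate(seqs):
        for j, residue in enumerate(seq):
            X[i, j, AMINO_ACIDS.index(residue)] = 1
    return X

def partition_counts(X, R):
    """Return per-column residue counts for foreground and background.

    R[i] = 1 assigns sequence i to the foreground, R[i] = 0 to the
    background. Each result has shape (k, 20).
    """
    R = np.asarray(R, dtype=bool)
    return X[R].sum(axis=0), X[~R].sum(axis=0)

# Toy alignment (hypothetical): 3 sequences, 3 columns.
seqs = ["ACD", "ACE", "GCD"]
R = [1, 1, 0]            # first two sequences are foreground
C = np.array([1, 0, 1])  # columns 0 and 2 are pattern positions

X = one_hot_alignment(seqs)
fg_counts, bg_counts = partition_counts(X, R)

# Restrict attention to the columns flagged by C.
pattern_columns = np.flatnonzero(C)
print(fg_counts[pattern_columns].sum(axis=1))  # foreground sequences per pattern column
```

Counts like these are the sufficient statistics that a column-wise multinomial model would consume; how they enter the log-probability is defined by the equation in the figure, which is not reproduced here.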

7.

**The amcBPPS procedural substeps used to obtain a hierarchy from a multiple alignment.** Starting from a multiple sequence alignment for a particular protein domain, the amcBPPS program applies the following substeps (‘a’ to ‘e’) to create a domain hierarchy. Note that substep (a) corresponds to Step 1 of the amcBPPS algorithm whereas the other substeps correspond to Step 2.

(**a**) Use heuristic procedures to create distinct FD-tables, corresponding to a forest of simple (rooted, branchless) trees; each leaf of a given tree corresponds to a distinct subgroup within the protein class. (The mcBPPS sampler is used to optimally assign sequences to each leaf node; different prior probability settings can be used to favor convergence on subfamilies, families or superfamilies.)

(**b**) Select leaf nodes from the forest corresponding to more or less distinct, functionally divergent subgroups; this is done by combining each set of nearly identical nodes into a single set. Define a root node (labeled R in the figure) corresponding to the universal sequence set. Larger superfamily nodes (labeled with red integers) are also created from related leaf nodes. The haze around nodes indicates the partially overlapping nature (i.e., fuzziness) of the corresponding sequence sets.

(**c**) Generate a directed acyclic graph (DAG) representing superset-to-subset relationships between nodes, with arcs weighted by (the negative of) the corresponding log-likelihood ratios (LLRs) associated with the BPPS statistical model. For clarity, nodes and arcs directly connected to the root are shown in orange whereas other (non-root) nodes are uniquely colored.

(**d**) Obtain from the DAG a shortest path spanning tree using a breadth-first scanning algorithm [45]. Because the arcs are weighted using LLRs, this procedure returns a maximum likelihood tree associated with the DAG.

(**e**) Prune nodes that both are directly attached to the root and significantly overlap with other nodes and thus correspond to ill-defined sequence sets. For the remaining nodes, remove the overlap between their corresponding sequence sets (see text for details) and prune from the tree those nodes that lack a minimum number of sequences (30 by default). This typically yields a reduced hierarchy (as shown), which is converted into an FD-table (as illustrated in Figure 2) for optimization by the mcBPPS sampler.
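Substeps (c) and (d) amount to finding a shortest-path tree in a DAG whose arcs carry negated LLRs. The sketch below illustrates that idea with a generic topological-order relaxation, which handles negative weights on acyclic graphs. It is not the breadth-first scanning algorithm of reference [45] and not the amcBPPS code; the function name, node labels and LLR values are all hypothetical.

```python
from collections import defaultdict, deque

def shortest_path_tree(arcs, root):
    """Shortest-path tree of a DAG via topological-order relaxation.

    arcs maps (u, v) -> weight, where weights may be negative.
    Returns (dist, parent); parent[v] is v's predecessor in the tree.
    """
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = {root}
    for (u, v), weight in arcs.items():
        successors[u].append((v, weight))
        indegree[v] += 1
        nodes.update((u, v))

    dist = {node: float("inf") for node in nodes}
    parent = {node: None for node in nodes}
    dist[root] = 0.0

    # Kahn's algorithm: a node is processed only after all of its
    # predecessors, so its distance is final when it is dequeued.
    queue = deque(node for node in nodes if indegree[node] == 0)
    while queue:
        u = queue.popleft()
        for v, weight in successors[u]:
            if dist[u] + weight < dist[v]:
                dist[v] = dist[u] + weight
                parent[v] = u
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return dist, parent

# Hypothetical LLRs for superset-to-subset arcs (root labeled "R").
llr = {("R", "A"): 50, ("R", "B"): 30, ("A", "B"): 40,
       ("R", "C"): 10, ("B", "C"): 25}
# Arcs are weighted by the negative of the LLR, so the shortest
# path corresponds to the largest summed LLR.
arcs = {arc: -value for arc, value in llr.items()}

dist, parent = shortest_path_tree(arcs, "R")
print(parent)
```

In this toy example node B attaches under A rather than directly under the root, because the path R → A → B has a larger total LLR than the direct arc R → B.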