The cluster-size curve for 560 sequences of HA1. This curve shows the relationship between the threshold distance *d* (at which two sequences are connected into the same cluster) and the mean cluster size *C*(*d*), defined as the normalized first moment of the resulting distribution of cluster sizes. Equivalently, *C*(*d*) is the probability that two randomly chosen sequences lie in the same cluster. Plateaus in the cluster-size curve correspond to stable length scales at which the sequences form nonrandom clusters. Random data would not exhibit any plateaus except at *C* = 0 and *C* = 1. The smooth cluster-size curve results from averaging over 100 probabilistic Gaussian draws for each mean distance parameter *d*, with a 5% coefficient of variation. The HA1 data exhibit two significant plateaus, corresponding to clusterings at *d* = 2–3 and *d* = 4–5. The long tail for *d* ≥ 6 corresponds to the gradual accumulation of outlier sequences. When *d* = 2, there are 174 resulting clusters with *C*(2) = 0.0614; at this scale, the expected size of the cluster containing a randomly chosen sequence is 560 × 0.0614 = 34.4 sequences. (The clustering for *d* = 3 is extremely similar to that for *d* = 2, as the first plateau indicates.)
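
As a rough illustration of how a curve like this could be computed, the sketch below links sequences whose distance is at most *d*, takes connected components as clusters, and evaluates *C*(*d*) as the sum of squared cluster sizes divided by *N*². Several details are assumptions rather than the authors' stated method: the Hamming distance, the "at most *d*" linking rule, sampling the two sequences with replacement, and drawing one Gaussian threshold per replicate. The function names and the toy sequences are hypothetical.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_sizes(seqs, d):
    """Cluster sizes (largest first) when every pair at distance <= d is linked."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if hamming(seqs[i], seqs[j]) <= d:  # assumed linking rule
                parent[find(i)] = find(j)

    roots = Counter(find(i) for i in range(len(seqs)))
    return sorted(roots.values(), reverse=True)


def mean_cluster_size(seqs, d):
    """C(d): chance that two sequences drawn with replacement share a cluster."""
    n = len(seqs)
    return sum(s * s for s in cluster_sizes(seqs, d)) / (n * n)


def smoothed_mean_cluster_size(seqs, d, draws=100, cv=0.05, seed=0):
    """Average C over Gaussian thresholds with mean d and std cv * d (assumed scheme)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        total += mean_cluster_size(seqs, rng.gauss(d, cv * d))
    return total / draws


# Toy data, not HA1: two tight groups and one outlier.
toy = ["AAAA", "AAAT", "AATT", "GGGG", "GGGC", "CCCC"]
for d in range(0, 5):
    print(d, cluster_sizes(toy, d), round(mean_cluster_size(toy, d), 4))
```

With this definition, multiplying *C*(*d*) by the number of sequences gives the expected size of the cluster containing a randomly chosen sequence, which is the same arithmetic as 560 × 0.0614 = 34.4 in the caption. The pairwise loop is quadratic in the number of sequences, which is workable for a few hundred sequences but would need a faster neighbor search for much larger sets.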
