## Results: 9

1.

**Information profile for *P. aeruginosa* Fur and *H. influenzae* CRP motifs**. (A) *R*_{sequence} and *RE* profiles for Fur on the *P. aeruginosa* genome. (B) *R*_{sequence} and *RE* profiles for CRP on the *H. influenzae* genome, and for the mean *R*_{sequence} profile obtained from 10,000 45-site subsamples of the 210 *E. coli* binding sites. Vertical bars show the standard deviation.

2.

**Search efficiency for CRP sites in *E. coli* and *H. influenzae***. ROC curves for search methods trying to locate *H. influenzae* and *E. coli* CRP binding sites on, respectively, *H. influenzae* and *E. coli* genomes. Abbreviations: Eco – *E. coli*, Hin – *H. influenzae*. The plot is scaled to encompass a 1/10 true to false positive ratio (450 false positives) in *H. influenzae*.

3.

**Search efficiency for Fur sites in *E. coli* and *P. aeruginosa***. ROC curves for search methods trying to locate *P. aeruginosa* and *E. coli* Fur binding sites on, respectively, *P. aeruginosa* and *E. coli* genomes. Abbreviations: Eco – *E. coli*, Hin – *H. influenzae*. The plot is scaled to encompass a 1/10 true to false positive ratio (320 false positives) in *P. aeruginosa*.

4.

**Search efficiency in *E. coli* with "weakened" CRP sites**. Mean ROC curves for the *R*_{i} search method trying to locate CRP binding sites on the *E. coli* genome, using the original, asymmetric and mirrored collections of CRP. The plot is scaled to encompass a 1/10 true to false positive ratio for CRP (2100 false positives). The *R*_{sequence} profile of the original, asymmetrical and mirrored CRP motifs is shown in the inset.

5.

**Search efficiency in the *E. coli* genome**. ROC curves for different IT-based binding site search methods attempting to locate known LexA, Fur, CRP and Fis sites on the *E. coli* genome. The plot is scaled to encompass a 1/10 true to false positive ratio for the transcription factor with the largest number of known sites (CRP; 210 sites). Vertical arrows indicate this same ratio for all transcription factors.

6.

**Standard and effective affinity range for different transcription factors**. (a) Estimation of the affinity range for the different transcription factors analyzed in this work. For each transcription factor, the affinity range is represented as the distribution of affinities for all its experimentally determined binding sites. The affinity of each binding site is estimated using the *R*_{sequence} · BvH ranking index. (b) Estimation of the effective affinity range. For each transcription factor, the effective affinity range is represented as the distribution of normalized affinities for all its experimentally determined binding sites. Normalized affinities are estimated by normalizing the *R*_{sequence} · BvH ranking index for each site with the number of false positives required to find that site. For comparison purposes, in both affinity range plots *R*_{sequence} · BvH values (Y-axis) are normalized to the length of the binding motif for each transcription factor and ranges (X-axis) are shown as the percentage of experimentally determined sites (collection).

7.

**Search efficiency for *E. coli* CRP sites in a skewed random background**. ROC curves for search methods trying to locate 210 CRP binding sites on randomly generated backgrounds. The ROC curve depicts the mean and standard deviation of three independent experiments (searches against three independently generated backgrounds). The plot is scaled to encompass a 1/10 true to false positive ratio (2100 false positives) in the equiprobable background. *RE'* results, which completely overlap *RE · BvH* ones, are not shown for clarity. The *RE* profiles for CRP against the different backgrounds are shown in the bottom-right inset.

8.

**Search efficiency for *E. coli* Fur sites in a skewed random background**. ROC curves for search methods trying to locate 45 Fur binding sites on randomly generated backgrounds. The ROC curve depicts the mean and standard deviation of three independent experiments (searches against three independently generated backgrounds). The plot is scaled to encompass a 1/10 true to false positive ratio (450 false positives) in the equiprobable background. *RE'* results, which completely overlap *RE · BvH* ones, are not shown for clarity. The *RE* profiles for Fur against the different backgrounds are shown in the bottom-right inset.

9.

**Observed vs. expected frequency of 20-mers in genomes**. Mean ratio between observed and expected 20-mers in real genomes versus randomly generated sequences. Ratios were computed independently for 3 different genomes and 3 random sequences of similar %GC composition. Vertical bars show the standard deviation of these ratios. Genomes used for calculations: *E. coli* str. K-12 substr. MG1655 [50.8% GC], *P. aeruginosa* PAO1 [66.6% GC], *H. influenzae* Rd KW20 [38.1% GC], *Colwellia psychrerythraea* 34H [38.0% GC], *Salinibacter ruber* DSM 13855 [66.2% GC], *Thiobacillus denitrificans* ATCC 25259 [66.1% GC], *Enterococcus faecalis* V583 [37.5% GC], *Anaplasma marginale* str. St. Maries [49.8% GC] and *Nitrosococcus oceani* ATCC 19707 [50.3% GC].
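
Several captions above refer to per-position *R*_{sequence} and *RE* profiles of a binding motif. As a rough illustration only, the sketch below computes both from a set of aligned sites. It assumes the commonly used definitions (2 bits minus the Shannon entropy of each column for *R*_{sequence}, and the relative entropy of each column against background base frequencies for *RE*) and omits any small-sample correction; these captions do not state the exact formulas, so this may differ from what the authors used. The function names and the toy sites are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"


def column_frequencies(sites):
    """Per-position base frequencies for equal-length aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    profile = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        profile.append({base: counts[base] / len(sites) for base in BASES})
    return profile


def r_sequence_profile(sites):
    """Assumed definition: 2 - H(l) bits per position (uniform background)."""
    profile = []
    for freqs in column_frequencies(sites):
        entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
        profile.append(2.0 - entropy)
    return profile


def re_profile(sites, background):
    """Assumed definition: per-position relative entropy against `background`."""
    profile = []
    for freqs in column_frequencies(sites):
        profile.append(
            sum(p * math.log2(p / background[base])
                for base, p in freqs.items() if p > 0)
        )
    return profile


if __name__ == "__main__":
    # Toy aligned sites, invented for illustration (not real CRP or Fur sites).
    toy_sites = ["ACGT", "ACGA", "ACCT", "AGGT"]
    # Hypothetical GC-rich background frequencies.
    gc_rich = {"A": 0.17, "C": 0.33, "G": 0.33, "T": 0.17}
    print(r_sequence_profile(toy_sites))
    print(re_profile(toy_sites, gc_rich))
```

Under these definitions the two profiles coincide when the background is equiprobable and diverge as the background composition becomes skewed.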