(A) The ratio of thymine-lacking to thymine-containing codons for all amino acids that have both types (A, T, P, H, N, D, R, S, and G) was plotted for the first 69 nucleotides and for the rest of the open reading frame of SSCR, MSCR or other genes. As in Figure 2A, all bars represent averages over open reading frames from the human genome that either lacked (“–”) or had (“+”) 5UIs. Error bars represent the standard error of the mean. (B) The position-specific scoring matrix corresponding to the 6-nt motif was visualized using WebLogo [40]. (C) For each occurrence of the motif, the frame of translation was determined. The fraction of motif occurrences in each of the three possible frames was plotted for both 5UI− and 5UI+ SSCR-containing genes. (D) The distribution of the number of motifs in the set of SSCR-containing genes with 5UIs (negative set) and without 5UIs (positive set) was plotted. (E) For a given number of motif occurrences, the fraction of sequences in the positive versus negative set was plotted. Even though there were ∼2.5 times more sequences in the negative set, the fraction of sequences in the positive set with one or more occurrences of the motif was much higher than the fraction in the negative set. (F) The cumulative distribution of motif occurrences was plotted for both sets (blue line for 5UI− genes and red line for 5UI+ genes) and for the uniform distribution (grey line). While the negative set did not differ from the uniform distribution, the positive set displayed a left shift towards the 5′ end of the transcript. (G) An ROC curve was generated to evaluate the discovered motif's predictive power in identifying the absence of 5UIs among MSCR-containing genes (see Materials and Methods). The performance of the CGSSGC motif is shown with the solid blue line, while the pale pink lines depict the performance of 50 randomly generated motifs.
The boxplots represent the interquartile range of TPRs at a specified FPR for all 100,000 random motifs, and whiskers are drawn to 1.5 times the interquartile range. Outliers are not shown, and the black horizontal line in each boxplot corresponds to the median TPR at the given FPR. The solid red line is the median performance of all 100,000 random motifs.
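
As an illustration of the frame analysis described in panel C, the following minimal Python sketch locates occurrences of the CGSSGC consensus in an open reading frame and tallies the reading frame of each one. It rests on assumptions that are not taken from the paper: a plain IUPAC consensus match (S = C or G) stands in for scoring with the position-specific scoring matrix, frames are numbered 0 to 2 from the first nucleotide of the ORF, and the function names `motif_frames` and `frame_fractions` are hypothetical.

```python
import re
from collections import Counter

# IUPAC consensus named in the legend: S stands for C or G.
# Assumption: an exact consensus match replaces PSSM scoring.
# The lookahead lets overlapping occurrences be reported separately.
MOTIF = re.compile(r"(?=CG[CG][CG]GC)")


def motif_frames(orf):
    """Return the frame (0, 1 or 2) of every motif occurrence in one ORF.

    The frame is the start position of the match modulo 3, counted
    from the first nucleotide of the ORF.
    """
    return [m.start() % 3 for m in MOTIF.finditer(orf.upper())]


def frame_fractions(orfs):
    """Return the fraction of motif occurrences in each frame across ORFs."""
    counts = Counter(f for orf in orfs for f in motif_frames(orf))
    total = sum(counts.values())
    return {f: (counts[f] / total if total else 0.0) for f in range(3)}
```

Calling `frame_fractions` separately on the 5UI− and 5UI+ sequence sets would give the per-frame fractions that a plot like panel C compares.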