# Recognition of protein coding genes in the yeast genome at better than 95% accuracy based on the Z curve

^{a}To whom correspondence should be addressed. Tel: +86 22 2740 1008; Fax: +86 22 2335 8329; Email: ctzhang@tju.edu.cn

## Abstract

The Z curve is a three-dimensional space curve constituting the unique representation of a given DNA sequence in the sense that each can be uniquely reconstructed from the other. Based on the Z curve, a new protein coding gene-finding algorithm specific for the yeast genome, with better than 95% accuracy, is proposed. Six cross-validation tests were performed to confirm this accuracy. Using the new algorithm, the number of protein coding genes in the yeast genome is re-estimated. The estimate is based on the assumption that the unknown genes have similar statistical properties to the known genes. It is found that the number of protein coding genes in the 16 yeast chromosomes is ≤5645, significantly smaller than the widely accepted 5800–6000, and much larger than the 4800 estimated recently by another group. The mitochondrial genes were not included in the above estimate. A codingness index called the YZ score (YZ ∈ [0,1]) is proposed to recognize protein coding genes in the yeast genome. Among the ORFs annotated in the MIPS (Munich Information Centre for Protein Sequences) database, those recognized as non-coding by the present algorithm are listed in this paper in detail. An ORF is classified as coding if YZ > 0.5 and as non-coding if YZ < 0.5. The YZ scores for all the ORFs annotated in the MIPS database have been calculated and are available on request by sending email to the corresponding author.

## INTRODUCTION

An important problem in the study of the yeast genome is whether an ORF longer than a threshold is a true protein coding gene or not. Traditionally, the codingness of an ORF or a fragment of DNA sequence was described using the Codon Bias Index (CBI) (1) or the Codon Adaptation Index (CAI) (2). Although these indices were used widely (3), the coding properties of a coding sequence are not sufficiently reflected by them. For example, some ORFs shorter than 150 codons with CAI < 0.11 have identified phenotypes (4). The analysis of the entire yeast genome created the need for a more accurate codingness index. It is the aim of this paper to propose a new gene-finding algorithm at better than 95% accuracy. Based on the algorithm, a new index called the YZ score is proposed, which is used to reflect the codingness of an ORF or a fragment of DNA sequence. The YZ score is not meant to replace CBI or CAI, rather, to act as a complement to these already widely used indices.

The methodology adopted here is based on the Z curve theory of DNA sequences (5–7). Although most computational biologists are not familiar with the term Z curve, it is a powerful tool for visualizing and analyzing DNA sequences. The Z curve method has been applied with some success to areas such as distinguishing between genes with and without introns (8), and recognizing coding sequences in the human genome (9). It is hoped that the Z curve method will become a convenient tool for genome analysis.

Using the new gene-finding algorithm, we re-estimate the number of protein coding genes in 16 yeast chromosomes. To our surprise, the number of genes estimated here is ≤5645, significantly less than the 5800–6000 widely accepted (10–12), and significantly greater than the 4800 estimated recently by another group (4).

## DATABASES AND METHODS

### The database

The *Saccharomyces cerevisiae* genome DNA sequences were obtained from a CD-ROM distributed from MIPS, the Munich Information Centre for Protein Sequences, Release 1997. The newest data for classification of ORFs in the yeast genome were downloaded from http://speedy.mips.biochem.mpg.de Release September 27, 1999.

### The Z curve

The Z curve is a three-dimensional space curve constituting the unique representation of a given DNA sequence, in the sense that the curve and the sequence can each be uniquely reconstructed from the other. We briefly present the method of the Z curve as follows. Consider a DNA sequence of *N* bases read from the 5′- to the 3′-end. Inspect the sequence one base at a time, beginning from the first base. Let the number of inspecting steps be denoted by *n*, i.e., *n* = 1, 2, …, *N*. In the *n*th step, count the cumulative numbers of the bases A, C, G and T occurring in the subsequence from the first to the *n*th base of the DNA sequence inspected. Denoting these cumulative numbers by A_{n}, C_{n}, G_{n} and T_{n}, respectively, we define the Z curve as follows. The Z curve consists of a series of nodes P_{n}, where *n* = 1, 2, …, *N*, whose coordinates are denoted by *x*_{n}, *y*_{n} and *z*_{n}. It was shown (6,7) that

$$
\begin{aligned}
x_n &= 2(A_n + G_n) - n,\\
y_n &= 2(A_n + C_n) - n,\\
z_n &= 2(A_n + T_n) - n,
\end{aligned}
\qquad x_n, y_n, z_n \in [-N, N],\quad n = 0, 1, 2, \ldots, N, \tag{1}
$$

where A_{0} = C_{0} = G_{0} = T_{0} = 0 and hence *x*_{0} = *y*_{0} = *z*_{0} = 0. The connection of the nodes P_{0} (P_{0} = 0), P_{1}, P_{2}, …, P_{N} one by one sequentially by straight lines is called the Z curve for the DNA sequence inspected. To clarify the biological implication of the Z curve, using the normalization equation A_{n} + C_{n} + G_{n} + T_{n} = *n* we rewrite equation **1** as

$$
\begin{aligned}
x_n &= (A_n + G_n) - (C_n + T_n) \equiv R_n - Y_n,\\
y_n &= (A_n + C_n) - (G_n + T_n) \equiv M_n - K_n,\\
z_n &= (A_n + T_n) - (C_n + G_n) \equiv W_n - S_n,
\end{aligned} \tag{2}
$$

where R, Y, M, K, W and S represent the bases of purine, pyrimidine, amino, keto, weak hydrogen bonds and strong hydrogen bonds, respectively, according to the Recommendation 1984 by the NC-IUB (13). The Z curve defined above is a three-dimensional space curve having three independent components, i.e., *x*_{n}, *y*_{n} and *z*_{n}. Each has a clear biological meaning. The component *x*_{n} displays the distribution of bases of the purine/pyrimidine (A or G/C or T) types along the sequence. When the number of purine bases in the subsequence from the first to the *n*th base is greater than that of the pyrimidine bases, *x*_{n} > 0, otherwise *x*_{n} < 0. Similarly, the component *y*_{n} displays the distribution of bases of the amino/keto (A or C/G or T) types along the sequence. When the number of amino bases in the subsequence from the first to the *n*th base is greater than that of the keto bases, *y*_{n} > 0, otherwise *y*_{n} < 0. Finally, the component *z*_{n} displays the distribution of bases of the weak H-bond/strong H-bond (A or T/G or C) types along the sequence. When the number of weak H-bond bases in the subsequence from the first to the *n*th base is greater than that of the strong H-bond bases, *z*_{n} > 0, otherwise *z*_{n} < 0.

In summary, the Z curve is the unique representation of a given DNA sequence in a three-dimensional space, and each can be uniquely reconstructed from the other (6,7). Therefore, any DNA sequence is uniquely and completely described by the three distributions, i.e., those of the bases of purine/pyrimidine, amino/keto and weak/strong H-bonds, respectively. The Z curve offers an intuitive and convenient approach to the study of DNA sequences. By viewing the Z curve, some overall and local features of the sequence can be detected in a perceivable way. Furthermore, a new methodology has been derived from the Z curve by which DNA sequences can be studied geometrically.
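The construction described above can be sketched in a few lines of Python. This is only an illustration, not the authors' program: the function names `z_curve` and `sequence_from_z_curve` are hypothetical, and the coordinates are computed as the purine/pyrimidine, amino/keto and weak/strong H-bond differences of the cumulative base counts.

```python
def z_curve(seq):
    """Return the Z-curve nodes P_0..P_N as (x, y, z) tuples for a DNA string."""
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    nodes = [(0, 0, 0)]  # P_0 is the origin
    for base in seq.upper():
        if base not in counts:
            raise ValueError(f"unexpected base: {base!r}")
        counts[base] += 1
        a, c, g, t = counts["A"], counts["C"], counts["G"], counts["T"]
        x = (a + g) - (c + t)  # purine minus pyrimidine
        y = (a + c) - (g + t)  # amino minus keto
        z = (a + t) - (c + g)  # weak minus strong H-bond
        nodes.append((x, y, z))
    return nodes


def sequence_from_z_curve(nodes):
    """Rebuild the sequence from the differences between successive nodes."""
    step_to_base = {
        (1, 1, 1): "A",
        (-1, 1, -1): "C",
        (1, -1, -1): "G",
        (-1, -1, 1): "T",
    }
    return "".join(
        step_to_base[(x1 - x0, y1 - y0, z1 - z0)]
        for (x0, y0, z0), (x1, y1, z1) in zip(nodes, nodes[1:])
    )
```

Each base moves the curve by one of four distinct steps, which is why the round trip `sequence_from_z_curve(z_curve(seq))` returns the input sequence and the representation is unique.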

### The phase-specific Z curve

Most gene-finding algorithms are based on the differences in statistical properties between DNA sequences in coding and non-coding regions. The distributions of bases among the three phases in one strand of a DNA double helix are heterogeneous in coding regions, but uniform in non-coding regions (see, e.g., 5). This fact constitutes the basis of the present gene-finding algorithm. The Z curve for the subsequence in an ORF with bases at positions 1, 4, 7, …, forms a phase-specific curve. We call this curve the phase-1 Z curve. Similarly, the Z curves with bases at positions 2, 5, 8, …, and 3, 6, 9, …, are called the phase-2 and phase-3 Z curves, respectively. For an ORF sequence, the phase-1, -2 and -3 Z curves describe the distributions of bases at the first, second and third codon positions, respectively. For each phase-specific Z curve there are three components, as for the ordinary Z curve. The three components of the phase-1 Z curve are denoted by *x*_{n}(1), *y*_{n}(1) and *z*_{n}(1), respectively, and *x*_{n}(2), *y*_{n}(2), *z*_{n}(2), *x*_{n}(3), *y*_{n}(3) and *z*_{n}(3) are defined similarly.

To simplify the later calculation, each component curve of a phase-specific Z curve listed above (e.g., *x*_{n}(1) ~ *n*) is approximately described by a straight line. Consequently, we have

$$
x_n(i) \approx k_x(i)\,n,\qquad y_n(i) \approx k_y(i)\,n,\qquad z_n(i) \approx k_z(i)\,n,\qquad i = 1, 2, 3, \tag{3}
$$

where *k*_{x}(1), *k*_{y}(1), *k*_{z}(1), *k*_{x}(2), *k*_{y}(2), *k*_{z}(2), *k*_{x}(3), *k*_{y}(3) and *k*_{z}(3) are the slopes of the straight lines. For simplicity, they are calculated as follows

$$
k_x(i) = \frac{x_M(i)}{M},\qquad k_y(i) = \frac{y_M(i)}{M},\qquad k_z(i) = \frac{z_M(i)}{M},\qquad i = 1, 2, 3, \tag{4}
$$

where *M* = *N*/3, and *N* is the length of the ORF. According to the property of the Z curve, the slopes of the straight lines defined in equation **4** are determined by the average base composition of the corresponding sequences associated with the curve. For example, given *k*_{x}(1), *k*_{y}(1) and *k*_{z}(1), the base composition of the subsequence in an ORF with bases at positions 1, 4, 7, …, can be calculated (6,7). Therefore, slopes are statistical quantities describing the basic features of the sequence concerned. The approximation expressed in equations **3** and **4** is simple and effective. Of course, it is possible to fit Z curves by using more complicated functions, rather than straight lines.
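As a rough illustration of the slope calculation, the sketch below takes each slope to be the end-point coordinate of a phase-specific Z curve divided by the number of bases in that phase (*M* = *N*/3 for a whole ORF). The function names are hypothetical. The second function shows how the base composition of a phase follows from its three slopes, using only the fact that the four base frequencies sum to 1.

```python
def phase_slopes(orf):
    """Return [(k_x, k_y, k_z)] for phases 1, 2 and 3 of an ORF string."""
    orf = orf.upper()
    slopes = []
    for phase in range(3):
        sub = orf[phase::3]  # bases at positions phase+1, phase+4, ...
        m = len(sub)
        if m == 0:
            raise ValueError("sequence too short for phase-specific slopes")
        a, c, g, t = (sub.count(b) for b in "ACGT")
        slopes.append((
            ((a + g) - (c + t)) / m,  # purine minus pyrimidine, per base
            ((a + c) - (g + t)) / m,  # amino minus keto, per base
            ((a + t) - (c + g)) / m,  # weak minus strong H-bond, per base
        ))
    return slopes


def composition_from_slopes(kx, ky, kz):
    """Recover the base frequencies (a, c, g, t) of one phase from its slopes."""
    a = (1 + kx + ky + kz) / 4
    c = (1 - kx + ky - kz) / 4
    g = (1 + kx - ky - kz) / 4
    t = (1 - kx - ky + kz) / 4
    return a, c, g, t
```

For the toy sequence `"ATGATGATG"` the first codon position is always A, so its slopes are (1, 1, 1) and `composition_from_slopes(1, 1, 1)` gives a frequency of 1 for A and 0 for the other bases.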

### The Fisher discriminant algorithm in a 10-dimensional space

Each ORF (or intergenic DNA sequence) is described by a point, or vector, in a 10-dimensional (10-D) space spanned by *u*_{1}, *u*_{2}, …, *u*_{10}. They are defined by

$$
\begin{aligned}
u_1 &= k_x(1), & u_2 &= k_y(1), & u_3 &= k_z(1),\\
u_4 &= k_x(2), & u_5 &= k_y(2), & u_6 &= k_z(2),\\
u_7 &= k_x(3), & u_8 &= k_y(3), & u_9 &= k_z(3),\\
u_{10} &= a^2 + c^2 + g^2 + t^2,
\end{aligned} \tag{5}
$$

where *a*, *c*, *g* and *t* are the average occurrence frequencies of bases A, C, G and T in the DNA sequence studied. That is, *a* = A_{N}/*N*, *c* = C_{N}/*N*, *g* = G_{N}/*N* and *t* = T_{N}/*N*, where A_{N}, C_{N}, G_{N} and T_{N} are the occurrence numbers of bases A, C, G and T, respectively, in the sequence, and *N* is the total length of the sequence. The variable *u*_{10} was found to be a useful statistical quantity for the analysis of DNA sequences (5). Obviously, the minimum of *u*_{10} is equal to 1/4, if and only if *a* = *c* = *g* = *t* = 1/4. Usually the value of *u*_{10} in the coding region is smaller than that in the non-coding region.
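A minimal sketch of the 10-D feature vector follows. It assumes that the first nine components are the nine phase-specific slopes, ordered phase by phase (the ordering is an assumption and does not affect a linear discriminant), and that *u*_{10} is the sum of the squared overall base frequencies, which is the form consistent with a minimum of 1/4 at *a* = *c* = *g* = *t* = 1/4. The function name `feature_vector` is hypothetical.

```python
def feature_vector(seq):
    """Return [u_1, ..., u_10] for a DNA string (ORF or intergenic sequence)."""
    seq = seq.upper()
    n = len(seq)
    if n < 3:
        raise ValueError("need at least one base in each codon phase")
    u = []
    for phase in range(3):
        sub = seq[phase::3]  # bases at one codon position
        m = len(sub)
        a, c, g, t = (sub.count(b) for b in "ACGT")
        u += [
            ((a + g) - (c + t)) / m,  # k_x(phase + 1)
            ((a + c) - (g + t)) / m,  # k_y(phase + 1)
            ((a + t) - (c + g)) / m,  # k_z(phase + 1)
        ]
    # u_10: sum of squared overall base frequencies (assumed form)
    u.append(sum((seq.count(b) / n) ** 2 for b in "ACGT"))
    return u
```

A sequence with perfectly uniform composition, such as `"ACGT" * 3`, reaches the minimum *u*_{10} = 0.25, whereas a strongly biased sequence gives a larger value.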

To complete the protein coding gene-finding algorithm, we need two groups of samples. One is a set of positive samples corresponding to the true protein coding genes; the other is a set of negative samples corresponding to the intergenic sequences. The number of samples in each group should be identical. The two groups of samples form the training set used in the Fisher discriminant algorithm. The Fisher linear discriminant equation in this case represents a super-plane in the 10-D space, described by a vector **c** which has 10 components *c*_{1}, *c*_{2}, … and *c*_{10}. The determination of **c** is extremely simple in the case of two groups of samples, such as the case studied here. Group 1 (denoted by *g* = 1) corresponds to coding samples, whereas group 2 (denoted by *g* = 2) corresponds to non-coding samples. Denoting by *u*_{jk}^{g} the *j*th component of the 10-D vector defined in equation **5** for the *k*th sample in group *g*, where *g* = 1, 2; *j* = 1, 2, …, 10; and *k* = 1, 2, …, *n*_{g} (*n*_{1} = *n*_{2}, i.e., the numbers of samples in both groups are identical), we calculate the geometrical center vector **U**_{g} for each group

$$
\mathbf{U}_g = \left(\bar{u}_1^{\,g}, \bar{u}_2^{\,g}, \ldots, \bar{u}_{10}^{\,g}\right)^{\mathrm{T}},\qquad g = 1, 2, \tag{6}
$$

where ‘T’ indicates the transpose of a matrix, and

$$
\bar{u}_j^{\,g} = \frac{1}{n_g}\sum_{k=1}^{n_g} u_{jk}^{\,g},\qquad j = 1, 2, \ldots, 10. \tag{7}
$$

Denoting by **S** = (*s*_{ij}) the sum of the covariance matrices of the two groups, we have

$$
s_{ij} = \sum_{g=1}^{2}\sum_{k=1}^{n_g}\left(u_{ik}^{\,g} - \bar{u}_i^{\,g}\right)\left(u_{jk}^{\,g} - \bar{u}_j^{\,g}\right),\qquad i, j = 1, 2, \ldots, 10, \tag{8}
$$

where any common normalization factor is immaterial because *n*_{1} = *n*_{2} and **c** is rescaled below. The vector **c** is simply determined by the following equation

$$
\mathbf{c} = \mathbf{S}^{-1}\left(\mathbf{U}_1 - \mathbf{U}_2\right), \tag{9}
$$

where **S**^{–1} is the inverse of the matrix **S**. See the detailed explanation of these equations in Mardia *et al*. (14). The vector **c** is not unique in the sense that **c** multiplied by a constant is still acceptable. Without loss of generality we choose the constant such that │**c**│^{2} = 1. Based on the data in the training set, an appropriate threshold *c*_{0} is determined to make the coding/non-coding decision. The threshold *c*_{0} is uniquely determined by letting the false negative rate and the false positive rate be identical. Once the vector **c** and the threshold *c*_{0} are obtained, each ORF in the test set is classified as coding if **c·u** > *c*_{0} and as non-coding if **c·u** < *c*_{0}, where **c** = (*c*_{1}, *c*_{2}, …, *c*_{10})^{T} and **u** = (*u*_{1}, *u*_{2}, …, *u*_{10})^{T}.
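The two-group Fisher procedure can be sketched in pure Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: **S** is taken as the sum of the unnormalized scatter matrices of the two groups, the direction is **S**^{–1}(**U**_{1} – **U**_{2}) rescaled to unit length, and the threshold is found by scanning the projected training values for the point where the false negative and false positive rates are closest (the text does not say how the balance point is located). All function names are hypothetical.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular scatter matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= f * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fisher_direction(coding, noncoding):
    """Return the unit vector c proportional to S^-1 (U_1 - U_2)."""
    dim = len(coding[0])

    def mean(rows):
        return [sum(r[j] for r in rows) / len(rows) for j in range(dim)]

    def scatter(rows, mu):
        return [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows)
                 for j in range(dim)] for i in range(dim)]

    m1, m2 = mean(coding), mean(noncoding)
    s1, s2 = scatter(coding, m1), scatter(noncoding, m2)
    s = [[s1[i][j] + s2[i][j] for j in range(dim)] for i in range(dim)]
    c = solve(s, [m1[j] - m2[j] for j in range(dim)])
    norm = sum(v * v for v in c) ** 0.5
    return [v / norm for v in c]


def balanced_threshold(c, coding, noncoding):
    """Pick c_0 where false negative and false positive rates are closest."""
    def project(u):
        return sum(ci * ui for ci, ui in zip(c, u))

    pos = [project(u) for u in coding]
    neg = [project(u) for u in noncoding]
    best_gap, best_c0 = None, None
    for c0 in sorted(pos + neg):
        fn_rate = sum(p <= c0 for p in pos) / len(pos)
        fp_rate = sum(q > c0 for q in neg) / len(neg)
        gap = abs(fn_rate - fp_rate)
        if best_gap is None or gap < best_gap:
            best_gap, best_c0 = gap, c0
    return best_c0
```

The code is written for vectors of any dimension, so it can be checked on small two-dimensional toy data before being applied to 10-D vectors. A numerical library would normally replace the hand-written solver; it is included here only to keep the sketch self-contained.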

### The YZ score for an ORF or a fragment of DNA sequence

The criterion of **c·u** > *c*_{0} / **c·u** < *c*_{0} for making the decision of coding/non-coding can be rewritten as F(**u**) > 0 / F(**u**) < 0, where F(**u**) = **c·u** – *c*_{0}. Let the maximum and minimum of F(**u**), calculated based on the data in the training set, be denoted by F_{max} and F_{min}, respectively. Furthermore, let F_{max}^{+} and F_{min}^{–} be quantities slightly larger than F_{max} and slightly smaller than F_{min}, respectively. Define the YZ score (**Y**east, **Z** curve)

$$
\mathrm{YZ} = \frac{F(\mathbf{u}) - F_{\min}^{-}}{F_{\max}^{+} - F_{\min}^{-}}. \tag{10}
$$

Then the criterion to make the decision of coding/non-coding simply becomes YZ > F_{0} / YZ < F_{0}, where

$$
F_0 = \frac{-F_{\min}^{-}}{F_{\max}^{+} - F_{\min}^{-}}. \tag{11}
$$

Choose F_{max}^{+} = 0.30 and F_{min}^{–} = –0.30 such that F_{0} = 0.50. The criterion to make the decision of coding/non-coding then becomes YZ > 0.5 / YZ < 0.5. In some rare cases, the YZ scores calculated for some practical samples may be <0 or >1. In the former case, let the YZ score be equal to 0, whereas in the latter case, let the YZ score be equal to 1. Consequently, for any **u**, YZ ∈ [0,1].
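The rescaling and clamping can be written compactly. The sketch below assumes the YZ score is the linear map that sends F_{min}^{–} to 0 and F_{max}^{+} to 1, which is the form consistent with the stated threshold of 0.5 at F(**u**) = 0 when the two bounds are 0.30 and –0.30. The function and parameter names are hypothetical.

```python
def yz_score(f_value, f_max_plus=0.30, f_min_minus=-0.30):
    """Map F(u) = c.u - c_0 linearly onto [0, 1], clamping out-of-range values."""
    yz = (f_value - f_min_minus) / (f_max_plus - f_min_minus)
    return min(1.0, max(0.0, yz))


def is_coding(f_value):
    """Coding if the YZ score exceeds 0.5 (equivalently, F(u) > 0)."""
    return yz_score(f_value) > 0.5
```

With the default bounds, `yz_score(0.0)` is 0.5, values of F(**u**) above 0.30 are reported as 1, and values below –0.30 are reported as 0.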

## RESULTS AND DISCUSSION

### Six-fold cross-validation tests

To test the new algorithm, six-fold cross-validation tests are performed. In the version of MIPS database, Release September 27, 1999, the ORFs were classified into six classes, in which the first class consists of 3199 entries corresponding to the known proteins. Excluding the protein coding genes from the mitochondria and those containing introns, 2958 protein coding genes of the first class residing at the 16 yeast chromosomes remain. The number of the mitochondrial genes available at present is too limited to perform a statistical study. They are thus excluded from the present study. Randomly divide the 2958 genes into two unequal parts, in which the larger part consists of 1958 genes, and the smaller consists of 1000 genes. The former serves as a training set used to find the Fisher coefficients; whereas the latter serves as a test set used to test the accuracy of the algorithm.

As mentioned above, both the training and test sets should be accompanied by the counterparts of negative samples. We have randomly selected about 6000 intergenic sequences with length longer than 300 bp from the 16 yeast chromosomes, and each of them starts with ATG and ends with one of the stop codons. The detailed procedure to select the intergenic sequences is described as follows. For each of the 16 yeast chromosomes:

(i) Find the number and locations of the ORFs annotated in the MIPS database and denote the number of ORFs by K.

(ii) Calculate the length for each of the (K–1) DNA sequences between any two adjoining ORFs. Ignore sequences where the length is <300 bp.

(iii) For all sequences ≥300 bp, starting from the first base, search for the first ‘ATG’ codon encountered along the sequence. In the downstream direction, starting from the 101^{st} codon counted from ATG, search for the first stop codon encountered. Then the DNA sequence starting from ATG and ending with one of the stop codons is regarded as one candidate for the intergenic sequences. Note that this is not an ORF because there may often be several stop codons within it. Continue to search for more intergenic sequences in the downstream direction until no more can be found in the remaining sequence.

(iv) Repeat step (iii) for each of the six phases in the sequence. The possible numbers of such sequences are quite large. Randomly select about 6000 such sequences from the 16 yeast chromosomes as the intergenic sequences used for complementing the Fisher algorithm. A computer program has been written to do this job. We should point out that the lengths of the intergenic sequences thus obtained are roughly equal to the ORF lengths, but not identical. Because the present algorithm is based on the difference of the base composition between coding and non-coding sequences, the non-identity of the lengths between the two kinds of sequences does not seem to be a major problem. When the lengths of both kinds of sequences are >300 bp, the calculated results of base composition are not usually sensitive to small variations in sequence length.
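Steps (iii) and (iv) above can be sketched as follows for a single region between two adjoining ORFs. This is one reading of the procedure, not the authors' program: the ATG and the stop codon are searched for in the same reading frame, the stop search begins at the 101st codon counted from the ATG, and the six phases are taken to be three frames on each strand. The final random selection of about 6000 sequences is not shown, and all names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def candidates_in_frame(region, frame, min_codons=100):
    """Yield ATG...stop candidates from one reading frame (0, 1 or 2)."""
    region = region.upper()
    pos = frame
    while pos + 3 <= len(region):
        if region[pos:pos + 3] != "ATG":
            pos += 3
            continue
        # Skip the first `min_codons` codons (ATG included); stop codons
        # inside this stretch are ignored, so the candidate is not an ORF.
        scan = pos + 3 * min_codons
        while scan + 3 <= len(region) and region[scan:scan + 3] not in STOP_CODONS:
            scan += 3
        if scan + 3 > len(region):
            return  # no stop codon left in this frame
        yield region[pos:scan + 3]
        pos = scan + 3  # continue downstream of the candidate


def intergenic_candidates(region, min_length=300):
    """Collect candidates from all six phases of one inter-ORF region."""
    if len(region) < min_length:
        return []  # step (ii): ignore regions shorter than 300 bp
    found = []
    for strand in (region, reverse_complement(region)):
        for frame in range(3):
            found.extend(candidates_in_frame(strand, frame))
    return found
```

Every candidate produced this way is at least 303 bp long (101 codons), which matches the requirement that the negative samples be longer than 300 bp.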

Randomly select 1958 and 1000 intergenic sequences from the 6000 sequences, which form the training and test sets of negative samples, respectively. In summary, the training set consists of 1958 positive samples (true genes) and 1958 negative samples (intergenic sequences). The test set consists of 1000 positive samples (true genes) and 1000 negative samples (intergenic sequences). Using the sequences in the training sets, the Fisher coefficients *c*_{0}, *c*_{1}, *c*_{2}, … and *c*_{10} are determined. Using the Fisher coefficients just obtained, the accuracy of the gene-finding algorithm is calculated based on the test set.

Repeating the above procedure three times, we have performed 3-fold cross-validation tests. The sensitivity, specificity and accuracy of each test are listed in Table 1. As can be seen, all three quantities obtained are >95%.

There are 223 intron-containing genes of the 1^{st} class in the MIPS database. These ORFs are used as an independent test set to perform another 3-fold cross-validation test. Consequently, the accuracy (defined as the sensitivity) is always >95% for each of the above three tests.

We now discuss the definitions of accuracy, sensitivity and specificity, which are used to evaluate the performance of the algorithm. The notations used here are the same as those used by Burset and Guigo (15). Using TP and FN to denote the number of coding ORFs that have been predicted as coding and non-coding, respectively, we define the sensitivity *s*_{n} as

$$
s_n = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}. \tag{12}
$$

That is, *s*_{n} is the proportion of coding ORFs that have been correctly predicted as coding. Similarly, using TN and FP to denote the number of intergenic sequences that have been predicted as non-coding and coding, respectively, we define the specificity *s*_{p} as

$$
s_p = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}. \tag{13}
$$

That is, *s*_{p} is the proportion of intergenic sequences that have been correctly predicted as non-coding. The accuracy is defined as the average of *s*_{n} and *s*_{p}.

The definition of *s*_{p} in equation **13** may cause problems in recognizing genes along the genomic DNA sequence. Because the frequency of non-coding nucleotides is generally much larger than that of coding ones, TN >> FP, and therefore *s*_{p} tends towards 1. To solve this problem, instead of the definition of *s*_{p} in equation **13**, the following refined definition has been used (15,16):

$$
s_p = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}. \tag{14}
$$

However, in the present study, the test set consists of 1000 coding ORFs and 1000 intergenic sequences, and it is therefore appropriate to use *s*_{p} as defined in equation **13**, rather than in equation **14**.
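The three performance measures follow directly from the four counts. The short sketch below also reports the refined specificity TP/(TP + FP) of Burset and Guigo for comparison. The function name and the counts used in the usage note are hypothetical and are not results from this study.

```python
def evaluate(tp, fn, tn, fp):
    """Return sensitivity, specificity, accuracy and the refined specificity."""
    sensitivity = tp / (tp + fn)  # coding ORFs correctly called coding
    specificity = tn / (tn + fp)  # intergenic sequences correctly called non-coding
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (sensitivity + specificity) / 2,
        "refined_specificity": tp / (tp + fp),
    }
```

For a balanced test set of 1000 positive and 1000 negative samples with, say, 40 false negatives and 50 false positives, `evaluate(960, 40, 950, 50)` gives a sensitivity of 0.96, a specificity of 0.95 and an accuracy of 0.955.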

### The final Fisher coefficients

The 2958 positive samples (true genes) are merged together as a new training set. The 2958 negative samples are selected randomly from the 6000 intergenic sequences mentioned above. The random selection is repeated three times. Consequently, we have three experiments. For each experiment the positive samples are identical, whereas the negative samples are different each time. The Fisher coefficients calculated for each experiment are listed in Table 2. The final Fisher coefficients are obtained by simply averaging the corresponding values for the three experiments, and are listed in the last column of Table 2. The Fisher coefficients *c*_{0} ~ *c*_{10} make an internally consistent set. Averaging coefficients from several experiments may break the internal consistency. However, since the variations of the coefficients between experiments are small, as shown in Table 2, the problem is not severe. On the other hand, the Fisher super-plane in the 10-D space is described by the equation **c·u** – *c*_{0} = 0. Averaging the coefficients takes advantage of each experiment by allowing the position and orientation of the super-plane to be adjusted slightly.

### Apply the algorithm to recognize yeast genes

As mentioned above, in the version of the MIPS database, Release September 27, 1999, the ORFs were classified into six classes, which consist of 3199, 248, 869, 789, 805 and 447 entries, respectively. They correspond to known proteins (1st class), strong similarity to known proteins (2nd class), similarity or weak similarity to known proteins (3rd class), similarity to unknown proteins (4th class), no similarity (5th class) and questionable ORFs (6th class), respectively. Using the final Fisher coefficients and the criterion of **c·u** > c_{0} / **c·u** < c_{0} for making the decision of coding/non-coding, we re-recognize the nuclear genes from the ORFs in the 2nd ~ 6th classes in the MIPS database. The detailed results are listed in Tables 3 and 4, for the non-coding ORFs in the 2nd ~ 5th classes and the 6th class, respectively, in which the names of non-coding ORFs are clearly indicated. As shown in Table 3, 434 ORFs of the 2nd ~ 5th classes in the MIPS database are recognized as non-coding. Similarly, as shown in Table 4, 340 ORFs of the 6th class are recognized as non-coding. However, due to the limited sensitivity (95%) and specificity (95%) achieved, statistically, 119 of the 434 ORFs listed in Table 3 and four of the 340 ORFs listed in Table 4 (see calculations below) are actually coding genes. We cannot identify which 119 of the 434 or which four of the 340 ORFs are coding genes at present, unless the sensitivity and specificity are further increased.

The four quantities TP, TN, FP and FN mentioned above can be calculated, based on the sensitivity, specificity and the gene-recognition result obtained. The calculation for recognizing genes of the 2^{nd} ~ 5^{th} class ORFs in the MIPS database should be performed first. The total number of ORFs to be recognized is 2710, of which 2276 and 434 are recognized as coding and non-coding, respectively. We have a set of four equations as follows: TP/(TP + FN) = 0.95; TN/(TN + FP) = 0.95; TP + FP = 2276 and TN + FN = 434. Solving the above set of equations, we find TP ≈ 2259; TN ≈ 315; FP ≈ 17 and FN ≈ 119. The number of real coding ORFs should be equal to TP + FN ≈ 2378. Of the 434 ORFs recognized as non-coding, statistically, 119 (FN) are actually coding. Next, the calculation for the 6^{th} class ORFs in the MIPS database should be performed. The total number of ORFs to be recognized is 439, of which 99 and 340 are recognized as coding and non-coding, respectively. In this case, the set of four equations consists of: TP/(TP + FN) = 0.95; TN/(TN + FP) = 0.95; TP + FP = 99 and TN + FN = 340. Solving this set of equations, we find TP ≈ 81; TN ≈ 336; FP ≈ 18 and FN ≈ 4. The number of real coding ORFs should be equal to TP + FN ≈ 86. Of the 340 ORFs recognized as non-coding, statistically, four (FN) are actually coding.
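The arithmetic above amounts to solving two linear equations in the unknown numbers of truly coding (TP + FN) and truly non-coding (TN + FP) ORFs. The sketch below reproduces it. The function name is hypothetical, and both rates default to the 0.95 assumed in the text.

```python
def solve_confusion(predicted_coding, predicted_noncoding, sn=0.95, sp=0.95):
    """Recover TP, FN, TN and FP from prediction totals and assumed rates.

    With P = TP + FN and Q = TN + FP, the four equations reduce to
        sn * P + (1 - sp) * Q = predicted_coding
        (1 - sn) * P + sp * Q = predicted_noncoding
    which is solved here by Cramer's rule.
    """
    det = sn * sp - (1 - sn) * (1 - sp)
    p = (sp * predicted_coding - (1 - sp) * predicted_noncoding) / det
    q = (sn * predicted_noncoding - (1 - sn) * predicted_coding) / det
    return {"TP": sn * p, "FN": (1 - sn) * p, "TN": sp * q, "FP": (1 - sp) * q}


classes_2_to_5 = solve_confusion(2276, 434)
class_6 = solve_confusion(99, 340)
```

Rounded to whole ORFs, `classes_2_to_5` gives TP ≈ 2259, TN ≈ 315, FP ≈ 17 and FN ≈ 119, and `class_6` gives TP ≈ 81, TN ≈ 336, FP ≈ 18 and FN ≈ 4, in agreement with the values quoted above.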

Based on the above results, we re-estimate the number of protein coding genes in the 16 yeast chromosomes. The total number should be equal to the number of intronless genes in the 1st class (2958) + the number of intron-containing genes in the 1st class (223) + the number of coding ORFs in the 2nd ~ 5th classes (including intronless and intron-containing genes) recognized by the present algorithm (2378) + the number of coding ORFs in the 6th class (including intronless and intron-containing genes) recognized by the present algorithm (86). The final result is 5645. Considering the fact that the actual sensitivity and specificity are >95% (see Table 1), the above estimate should be considered as an upper limit. Note that the above number (5645) does not include the mitochondrial genes. The estimate that the total number of the nuclear protein coding genes in the yeast genome is ≤5645 conflicts with the previous estimate of 5800–6000 genes (10–12).

The YZ score for each ORF annotated in the MIPS database is calculated. The distribution of the YZ scores for the 2958 genes classified as ORFs of the 1st class in the MIPS database is shown in Figure 1. Here the *y*-axis indicates the YZ scores, whereas the *x*-axis indicates the rank number of ORFs, arranged according to the increasing order of the YZ scores. For comparison, the YZ scores for 2958 negative samples (intergenic sequences) are also calculated. The corresponding plot is also shown in Figure 1. As can be seen, for most genes the points are situated above the threshold 0.5, denoted by a horizontal line, whereas for most intergenic sequences the points are situated below the threshold 0.5. This fact demonstrates the accuracy of the new algorithm in distinguishing between the two kinds of DNA sequences. Furthermore, the curves clearly displaying the above two distributions are shown in Figure 2. Both distribution curves are well fitted by normal distributions with a small overlapping area between them. For comparison, the curve displaying the distribution of YZ scores calculated for the 2669 ORFs of the 2nd ~ 5th classes in the MIPS database is also shown. This curve is also well fitted by a normal distribution. As can be seen, the third normal distribution curve is in between the former two, indicating that a fraction of the ORFs of the 2nd ~ 5th classes are actually non-coding. This observation is in agreement with the data listed in Table 3.

**Figure 1.** The *y*-axis indicates the YZ scores, whereas the *x*-axis indicates the rank number of ORFs, arranged according to the increasing order of the YZ ...

### On the mystery of orphan ORFs

There are more than 7000 ORFs longer than 300 bp in the yeast genome (4). For some of them, known as orphan ORFs (17,18), neither their function nor homology is known. With the increase in known genes, more orphans should be found to have homologous relationships with the known genes and, as a result, the number of orphans should decrease. In fact, this is not the case. This paradox was deemed as a mystery of orphans (17,18). However, the results presented in this paper give some insight into the problem. According to the classification of ORFs in the MIPS database, orphans are mainly assigned to the 5th class (no similarity) and the 6th class (questionable, including no similarity to other ORFs). As can be seen from Table 5, of the 805 ORFs in the 5th class, 193 (24%) are non-coding. Furthermore, of the 439 ORFs in the 6th class, 340 (77%) are non-coding. In other words, more than 500 orphans or partially overlapping ORFs are actually not protein-coding genes. After removing these ORFs from the list of orphans in the MIPS database, there remain some real orphans which may be true protein-coding genes whose functions and homology need to be explored.

**The percentages of non-coding ORFs of the 2nd ~ 6th classes recognized by the present algorithm, over the total numbers of ORFs in the classes**

We should emphasise that the present work is based on an assumption that has not been clearly elucidated previously. The claim of <5% error is valid only when all of the unknown genes have the same statistical properties as the known genes. As pointed out by an anonymous referee, this obviously will not be the case, especially for the ORFs in the 6th class. Some genes tend to have low expression levels and many of them will only express in extreme conditions. Since they are under-represented in the training sample (i.e., in the ORFs of the 1st class), they would mostly be predicted to be non-coding. One would not be surprised if many of the ‘non-coding’ ORFs predicted in Table 4 later turn out to be coding. Therefore, based on this consideration, the predictive error for the ORFs in the 6th class would be >5%. We remind readers that the results listed in Table 4 should be referred to with caution.

It will be very interesting to see if most or many ORFs listed in Table 4 will be experimentally verified to be functional genes in the future. If the answer is yes, we have to say that the DNA sequences coding for these genes have different statistical properties from those coding for genes of the 1st class in the MIPS database. Alternatively, if the answer is no, the statistical properties for both the 1st and 6th class ORFs should be similar. To avoid the inherently circular argument, we have compared the distributions of bases at the first and second codon positions for the 1st and 6th class ORFs in the MIPS database with those of other species, specifically human, *Escherichia coli*, etc. One cannot simply compare the base distributions at the third codon position between different species, because the distributions are species-dependent (19). We have found that the distributions of bases at the first and second codon positions for the 1st class ORFs in the MIPS database of the yeast genome show considerable similarity to those of genes from other species. In contrast, the distributions of bases at the first and second codon positions for the 6th class ORFs are not only remarkably different from those of the 1st class ORFs, but are also remarkably different from those of genes from other species. It is thought that the distributions of DNA bases at the first and second codon positions reflect the need for native folding of proteins (19). Based on this consideration, it is unlikely that most or many ORFs listed in Table 4 code for proteins.

## ACKNOWLEDGEMENTS

Stimulating discussions with Ren Zhang are acknowledged. We are grateful to both referees for their comments, which were very useful in improving the paper. The present study was supported in part by the 973 Project grant G1999075606 of China.

## REFERENCES

*et al*. (1994) Nature, 369, 371–378. [PubMed]

*Multivariate Analysis*. Academic Press, London, UK.

**Oxford University Press**

*Nucleic Acids Research*. 2000 Jul 15; 28(14): 2804.
