Nucleic Acids Res. May 2008; 36(8): 2756–2763.
Published online Mar 26, 2008. doi:  10.1093/nar/gkn086
PMCID: PMC2377439

Spontaneous symmetry breaking in genome evolution


The quest for evolutionary mechanisms providing separation between the coding (exons) and noncoding (introns) parts of genomic DNA remains an important focus of genetics. This work combines an analysis of the most recent achievements of genomics and fundamental concepts of random processes to provide a novel point of view on genome evolution. Exon sizes in sequenced genomes show a lognormal distribution typical of a random Kolmogoroff fractioning process. This implies that the process of intron insertion may be independent of exon size, and therefore could be dependent on intron–exon boundaries. All genomes examined have two distinctive classes of exons, each with different evolutionary histories. In the framework proposed in this article, these two classes of exons can be derived from a hypothetical ancestral genome by (spontaneous) symmetry breaking. We note that one of these exon classes comprises mostly alternatively spliced exons.


A substantial fraction of the genomic DNA sequence does not directly encode the primary structure of any cellular protein, or any other cellular product (1). This is largely due, in eukaryotes, to the division of genes into introns (noncoding parts of DNA) and exons (coding parts of DNA), each of which has a pronounced size distribution (2). The actual mechanism (or mechanisms) of intron insertion is the subject of intense discussion, and a currently popular working hypothesis suggests that introns may be largely produced by insertion of transposons, DNA elements that can move around to different positions within the genome (3). Very frequently, this point of view implicitly assumes that there is a higher probability of splitting longer exons since they are larger targets for transposons. Here, we show that the distributions of exon sizes for different organisms have a general property: the presence of two distinguishable classes of exons with different size distributions. We present an idealized scheme that explains how the observed distribution of exon sizes can be derived from a common ancestral genome by a random (quasi)-evolutionary process. This formal model makes it possible to investigate the evolution of the genomes of particular organisms, and to estimate the number of evolutionary steps that separate them from the hypothetical ancestral genome. Conceptually, our results support the opinion that at the initial stages of evolution, simple genomes had a lower fraction of introns (introns late hypothesis). The model we propose explains the observed lognormal distribution of exon sizes, and suggests that introns may be inserted by a process that is independent of exon size. Our findings can also be rationalized in relation to the phenomenon of alternative splicing (4).


In our study, we used genome data for 12 animal species provided by Ensembl (5) (http://www.ensembl.org/), the joint project of the European Molecular Biology Laboratory—European Bioinformatics Institute and the Sanger Institute, and a plant genome sequence from The Arabidopsis Information Resource (TAIR, http://www.arabidopsis.org/) (6). All the entries annotated as ‘exon’ were retrieved for each complete genome. All duplicates of exons (which originate primarily from multiple accessions) starting and ending at the same points were excluded from the analysis. The natural logarithms of exon sizes were divided into bins of width Δ ln(exon size) = 0.2 to obtain the exon size distributions presented in Figure 1 and in Supplementary data. The distributions were fit using the unweighted χ2-measure normalized over the number of degrees of freedom (7) as the criterion of fitting quality. The values of fitted parameters are presented in Table 1. Table 2 shows the corresponding values of χ2 and the Pearson correlation coefficient, r, which characterize the agreement between original data and the fitted model.
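The de-duplication and binning steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout `(chromosome, start, end)`, the 1-based inclusive coordinates and the function names are assumptions made for the example; only the bin width of 0.2 in ln(exon size) comes from the text.

```python
import math
from collections import Counter

BIN_WIDTH = 0.2  # bin width in ln(exon size), as stated in the text


def unique_exons(records):
    """Drop duplicate exons that share chromosome, start and end (assumed key)."""
    return sorted(set(records))


def log_size_histogram(records, bin_width=BIN_WIDTH):
    """Count unique exons per bin of ln(size); keys are left bin edges."""
    counts = Counter()
    for _chrom, start, end in unique_exons(records):
        size = end - start + 1  # assumes 1-based, inclusive coordinates
        counts[math.floor(math.log(size) / bin_width)] += 1
    return {round(i * bin_width, 10): n for i, n in sorted(counts.items())}


# Hypothetical toy records; the second entry duplicates the first.
exons = [("1", 100, 219), ("1", 100, 219), ("1", 500, 649), ("2", 10, 1009)]
hist = log_size_histogram(exons)
print(hist)
```

The resulting dictionary maps each occupied bin to its exon count and is the kind of input the fitting step would consume.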

Figure 1.
Distributions of the natural logarithms of exon sizes for the Homo sapiens and Drosophila melanogaster genomes. Both data sets can be approximated by a sum of two Gaussian peaks with very high correlation between the data and the best-fit distributions: ...
Table 1.
Parameters of fitting for different models of exon size distribution
Table 2.
Fitting quality, χ2 and r, together with the number of data points and number of degrees of freedom, df

Three classes of fitting models were used:

  1. A model based on the assumption that an exon can be split at any position with equal and constant probability; this model leads to an exponential distribution of exon sizes:
    equation image
    where E is exon size, dN is the number of exons in a bin, N is the amplitude of the peak and λ is the probability of splitting an exon at a particular place.
  2. Two models that produce lognormal distributions of exon sizes. These models are based on a Kolmogoroff process, which does not assume any relationship between exon size and the probability of splitting an exon. In particular, to fit the data we used a model with a single lognormal peak
    equation image
    and a mixture of two lognormal distributions,
    equation image
    where N, Na and Nb are the amplitudes, M, Ma and Mb the mean positions, and σ, σa and σb the widths (standard deviations in ln E) of the lognormal peaks observed in the data.
  3. A combination of a Weibull distribution and exponential distribution
    equation image
    where Nw, λw and c are the amplitude, frequency parameter and shape parameter of the Weibull distribution, respectively.
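For orientation, the three model classes listed above can be written per unit of ln E, since the data are binned in ln(exon size). The forms below are one parameterisation consistent with the symbol definitions in the list. They are assumptions for illustration, not necessarily the authors' exact expressions. In particular, the normalisation of the amplitudes, the exact Weibull parameterisation, and the names N_e and λ_e for the exponential term of the third model are hypothetical.

```latex
% 1. Exponential model: sizes with density \lambda e^{-\lambda E},
%    re-expressed per unit of \ln E (Jacobian factor E).
\[
  \frac{dN}{d\ln E} = N \,\lambda E \, e^{-\lambda E}
\]

% 2a. Single lognormal peak: a Gaussian in \ln E.
\[
  \frac{dN}{d\ln E} = N \exp\!\left[-\frac{(\ln E - M)^2}{2\sigma^2}\right]
\]

% 2b. Mixture of two lognormal peaks.
\[
  \frac{dN}{d\ln E} =
      N_a \exp\!\left[-\frac{(\ln E - M_a)^2}{2\sigma_a^2}\right]
    + N_b \exp\!\left[-\frac{(\ln E - M_b)^2}{2\sigma_b^2}\right]
\]

% 3. Weibull plus exponential (N_e and \lambda_e are assumed names).
\[
  \frac{dN}{d\ln E} =
      N_w \, c \,(\lambda_w E)^{c} \, e^{-(\lambda_w E)^{c}}
    + N_e \,\lambda_e E \, e^{-\lambda_e E}
\]
```

With these forms, the exponential model gives a single asymmetric peak on the ln E axis, whereas each lognormal component is a symmetric Gaussian in ln E.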

Sequences of pseudorandom numbers were obtained using the Mersenne Twister algorithm (8) implemented in the standard MATLAB 7.1 installation. The sequences of pseudorandom numbers were seeded with the values 14 (sequence A) and 16 (sequence B), and were used to generate the multiplicative processes presented in Figure 2. Each pseudorandom sequence consists of 10^7 random numbers, i.e., 10^7 exon splitting events. The ratio between the initial lengths of the two ancestral exons used to generate the exon size distributions presented in Figure 2 was 1:1000.

Figure 2.
Distributions of the natural logarithm of exon sizes for two different realizations of a random multiplicative process starting from an ancestral genome comprising two exons. The diagrams at the top of the figure illustrate two different splitting patterns ...

The confidence intervals shown for the fitted parameters in Figure 1, error bars in Figure 3, Tables 1 and 3, and in the Supplementary data are 95% confidence intervals estimated from the covariance matrix (7).

Figure 3.
Widths, σ, of exon size distributions for several complete genomes (see Table 1 and in Supplementary data). Open symbols correspond to the narrow peaks; full symbols to the wide peaks. All the species were ordered according to the width of wider ...
Table 3.
Parameters of fitting for different models approximating size distributions of alternatively spliced exons (see Methods Sections)

For the Homo sapiens and Mus musculus genomes, we also analyzed distributions of alternatively spliced exons. These data were obtained from the third release of the Alternative Splicing Database (ASD, http://www.ebi.ac.uk/asd/altsplice/). This resource provides manually curated data on alternative splicing with all exons confirmed by EST/mRNA alignments (9,10). These data were also divided into bins of width Δ ln(exon size) = 0.2 and fitted using the unweighted χ2-measure to models of a single exponential peak, a single Weibull peak, and a single lognormal distribution (Figure 4 and Table 3).

Figure 4.
The top row presents a comparison between the statistical distributions of all exons (gray bars) and the distributions of alternatively spliced exons (black bars) from ASD. The data for Homo sapiens are in the left panels; for Drosophila melanogaster in the right panels. ...

In the case of fitting the model with two lognormal peaks, the fractions of exons contributing to each peak (Table 4) were estimated by formal integration of the peak areas, which, for a standard Gaussian distribution, are proportional to the peak amplitudes.
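As a worked illustration of the peak-area estimate, the sketch below assumes that each peak is an unnormalised Gaussian in ln E of height N and width σ, so that its area is N·σ·√(2π). Under that assumption the fraction depends on both amplitude and width. If the amplitudes are instead defined as weights of normalised Gaussians, the areas reduce to the amplitudes themselves, as stated above. The function names and numerical values are hypothetical.

```python
import math


def gaussian_area(amplitude, sigma):
    """Area under amplitude * exp(-(x - M)^2 / (2 sigma^2)), independent of M."""
    return amplitude * sigma * math.sqrt(2.0 * math.pi)


def peak_fractions(n_a, sigma_a, n_b, sigma_b):
    """Fractions of exons assigned to peaks a and b from their areas."""
    area_a = gaussian_area(n_a, sigma_a)
    area_b = gaussian_area(n_b, sigma_b)
    total = area_a + area_b
    return area_a / total, area_b / total


# Hypothetical fitted values: a tall narrow peak and a lower wide peak.
frac_narrow, frac_wide = peak_fractions(1000.0, 0.3, 500.0, 0.9)
print(frac_narrow, frac_wide)
```

In this toy case the narrow peak is twice as tall but one third as wide, so it accounts for 40% of the exons.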

Table 4.
Distributions of exons between narrow and wide peaks together with peak positions, exp M, and peak widths, exp σ

F-test critical values, F0.05, presented in Table S1 in Supplementary data, were calculated for a significance level of 5% and the number of degrees of freedom, df, given in Table S1 and Table 3. For each pair of models under consideration, we compared the ratio, $\chi^2_{\mathrm{I}}/\chi^2_{\mathrm{II}}$, to the calculated F0.05. When $\chi^2_{\mathrm{I}}/\chi^2_{\mathrm{II}} > F_{0.05}$, the hypothesis that the variance σI associated with $\chi^2_{\mathrm{I}}$ is greater than the variance σII associated with $\chi^2_{\mathrm{II}}$, i.e. the assumption that Model II is a better description of the data than Model I (11), is verified at the P < 0.05 level.


The Homo sapiens (12) and Drosophila melanogaster (13) genomes show (Figure 1) a striking similarity in the distributions of exon sizes; in both, the distribution of the logarithm of exon size forms two distinctive peaks. Other genomes [see Supplementary data in which we analyze all complete genome data provided by Ensembl (5) and TAIR (6)] show similar patterns. A simplistic model of intron insertion assumes that the probability of inserting an intron is equal at all positions of an exon (making it more likely that a longer exon will be split). This type of process would lead to an exponential distribution of exon sizes showing a single peak on a logarithmic scale (14,15). Recently, the simplistic exponential model of exon size distribution was reconsidered by Gudlaugsdottir and co-authors (15), who suggested treating distributions of exon sizes as a combination of Weibull (16) and exponential distributions. However, they did not provide a model for an evolutionary process that could lead to the Weibull distribution, but rather considered this distribution to be only an empirical approximation of the observed exon size data.

Our analysis of thirteen genomes (Table 2) shows that the exponential distribution model always produces the largest χ2-values, which signifies that it is the least adequate fit to the observed exon size distributions. The model of a single lognormal peak is a closer fit to the data. The mixture of Weibull and exponential distributions suggested by Gudlaugsdottir and co-authors is a better approximation, but for most genomes (except Bos taurus, Canis familiaris and Gallus gallus) a combination of two lognormal distributions produces the best agreement between the real data and a fitted model. The length distributions of alternatively spliced exons in Homo sapiens and Mus musculus, drawn from ASD (9,10), reveal a single peak pattern (Figure 4). In this case, analysis of χ2-values also shows that an exponential distribution is the poorest fit to the data, and the lognormal distribution is the best (Table 3). A more sophisticated comparison between the different models used in this work, based on the F-test criterion (11), suggests a similar hierarchy of the models (see Methods section and Table S1 in Supplementary data).


The lognormal distribution is a normal (Gaussian) distribution of the logarithm of some quantity. This kind of distribution commonly originates from a Kolmogoroff random multiplicative process (17), which was originally introduced to describe the distribution of ore particle sizes observed in geological samples (18,19), and later was found to be useful as a paradigm for a whole universe of different breakage and splitting processes. A modified version of this process, in the context of our problem, can be described as follows. Let us consider a single exon, which is split by a random mutational process into two parts (equal in size, for the sake of illustration). Then, in the next step of the process, let us assume that one randomly selected part of the ancestral exon undergoes the same splitting. Subsequently, this process repeats for a large number of splitting events. The key assumption for this process, which is the same as the assumption of Kolmogoroff, is independence of the probability of undergoing a splitting event from exon size. This version of a random multiplicative process is slightly different from that discussed by Kolmogoroff, which assumes a constant frequency of splitting events for all parts of the fractionating set (exons in a genome in our case) at every time, while the process considered here assumes a single exon splitting event at each step of the process. Thus, the probability of breaking a particular exon at the next step of the process is not constant, but constantly decreases with the increase in the number of exons. When started from a single ancestral exon, this model of the multiplicative process, similarly to Kolmogoroff's version, produces a lognormal distribution of exon sizes. The resulting distribution of exon sizes obtained in this manner is independent of the actual sequence of random splitting events. 
After a sufficiently large number of steps, the mean position of the peak in the distribution of ln (exon size), M, shows an asymptotic linear shift with the logarithm of the number of splitting events, Nspl (for the process introduced here), or with time, t (for the Kolmogoroff process): ln E0 − M ~ ln Nspl ~ t, where E0 is the length of the ancestral exon. Similarly, the peak width, σ, is related to the same quantities: σ² ~ ln Nspl ~ t. The particular values of the coefficients for these proportionalities depend on the details of the process and can be ignored for our current purposes. However, we assume that the details of the splitting processes, i.e. the coefficients of proportionality, are the same for all species.
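The single-ancestor process described above can be explored with a short simulation. This is a minimal sketch under stated assumptions: every split halves the selected exon exactly (the equal-size case used in the text for illustration), the exon to split is drawn uniformly at random, and sizes are tracked as logarithms. Python's `random` module also uses the Mersenne Twister, but its streams are not expected to match MATLAB's, so the seed value here is only a label. The function names are hypothetical.

```python
import math
import random


def split_process(n_splits, e0=1.0, seed=0):
    """Return ln(exon size) for all exons after n_splits halving events."""
    rng = random.Random(seed)
    log_sizes = [math.log(e0)]
    for _ in range(n_splits):
        i = rng.randrange(len(log_sizes))  # choice is independent of exon size
        half = log_sizes[i] - math.log(2.0)
        log_sizes[i] = half      # first daughter replaces the parent
        log_sizes.append(half)   # second daughter is appended
    return log_sizes


def mean_and_width(values):
    """Mean M and standard deviation sigma of a list of ln(size) values."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, math.sqrt(var)


# Print ln(E0) - M and sigma^2 for increasing numbers of splitting events.
for n in (100, 1000, 10000):
    m, s = mean_and_width(split_process(n, e0=1.0, seed=14))
    print(n, round(math.log(1.0) - m, 2), round(s * s, 2))
```

Comparing the printed values against ln(n) gives a quick check of the asymptotic proportionalities quoted above for this simplified variant.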

The assumption that the splitting probability is independent of exon size is essential for a lognormal distribution. A similar hypothesis, that selection of exons from open reading frames is independent of exon size for large exons (larger than a certain threshold of 105–110 bp), was discussed earlier (20). Our model assumes that the probability of exon splitting is independent of exon size for all exons, irrespective of any threshold. We note, in this regard, that the currently most common point of view is that the evolutionary process splits exons by intron insertions. As mentioned in the Introduction section, this generally acknowledged hypothesis of exon splitting by transposon insertion implicitly assumes a larger probability of splitting longer exons because they are larger targets for transposons.

In contrast to this, the lognormal distribution of exon sizes suggests that the mechanism of intron insertion should be independent of exon size. Obvious candidates for such a process would be mechanisms involving exon–intron boundaries, of which there are always two for each exon.

A Kolmogoroff process provides a conceptual background for understanding the lognormal nature of exon size distribution. However, the distribution of exon length in real genomes is somewhat more complicated and generally reveals two lognormal peaks (Figure 1 and Supplementary data). To demonstrate how this two-peak pattern can be obtained for a random exon splitting process, let us consider a random process, similar to the one described before, initiated for a model ancestral genome that comprises two exons of unequal lengths. In this case, the two parts of the ancestral genome generate two distinct peaks in the exon size distribution. However, unlike previously, the resulting distribution of exon sizes is heavily dependent on the actual splitting pathway. Figure 2 demonstrates this fact for two splitting patterns generated from two sequences of pseudorandom numbers (see Methods section). This figure reveals a (spontaneous) symmetry breaking between the two peaks in the exon size distribution. In a certain sense, this phenomenon is similar to bifurcation, but, in contrast to an instant bifurcation event, this type of behavior can be regarded as a kind of ‘soft’ breaking of symmetry. The initial steps of the splitting process have a major impact on this phenomenon. Every subsequent splitting event has progressively less effect, and after ~5000 steps, the ratio between numbers of exons contributing to the two different peaks remains nearly constant. We believe that this phenomenon of symmetry breaking plays a major role in the diversity of patterns of exon size distributions.
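The dependence on the splitting pathway can be illustrated with a two-ancestor version of the process. The sketch below is an assumption-laden toy, not a reproduction of Figure 2: it starts from two exons with a 1:1000 length ratio, halves one uniformly chosen exon per step, and records the fraction of exons descended from the longer ancestor. The seeds 14 and 16 echo the labels in the Methods, but Python's generator will not reproduce the MATLAB sequences. The function name and the checkpoint argument are hypothetical.

```python
import math
import random


def two_ancestor_process(n_splits, seed, length_ratio=1000.0, checkpoints=(5000,)):
    """Simulate splitting from two ancestral exons; return exons and fractions."""
    rng = random.Random(seed)
    # Each exon is (ln size, ancestor label); the short ancestor has size 1.
    exons = [(0.0, "short"), (math.log(length_ratio), "long")]
    n_long = 1
    fractions = {}
    for step in range(1, n_splits + 1):
        i = rng.randrange(len(exons))  # selection ignores exon size
        log_size, origin = exons[i]
        half = (log_size - math.log(2.0), origin)
        exons[i] = half
        exons.append(half)
        if origin == "long":
            n_long += 1
        if step in checkpoints:
            fractions[step] = n_long / len(exons)
    fractions[n_splits] = n_long / len(exons)
    return exons, fractions


# Two realizations: the share of each ancestor's descendants differs by seed.
for seed in (14, 16):
    _, frac = two_ancestor_process(20000, seed)
    print(seed, frac)
```

In this toy version the descendant counts follow a Pólya-urn-like rule (an observation about the sketch, not a claim from the source), which is why the early steps dominate and the fraction barely moves between the checkpoint and the end of the run.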

The idealized model presented above can be reformulated in a different way where, instead of splitting exons on every step of the process, one randomly selected exon produces a descendant of larger size (a doublet, for example). From a formal point of view, these two variants of the process correspond to two possible directions for the time variable. From a biological perspective, they can be viewed as processes producing orthologous and paralogous gene families: the exons produced by duplications could be considered as paralogs, while exons derived from splitting as orthologs. Both variants of the model assume that the process starts from an ancestral genome with few exons, which are later split into smaller parts (or duplicated to produce longer parts). Thus, the proposed model, based on Kolmogoroff's ideas, supports the opinion that, at the initial stages of evolution, simple genomes had a lower fraction of introns (introns late hypothesis).

Figure 3 shows the parameters fitted to the exon size distributions for 13 different genomes, using a two-component lognormal model. All genomes show the presence of a narrow and a wide peak. Since peak width, σ, is a direct measure of the number of exon splitting events, our model suggests that exons in all of these genomes are distributed between two groups having different evolutionary histories. The width of the narrow peak is approximately equal for all species. The width of the wider peak varies among species. For simple organisms, like Caenorhabditis elegans, its width is close to the width of the narrow peak, while for higher organisms, such as Homo sapiens and Pan troglodytes, it is about three times the width of the narrow peak. This may suggest either that exons in the wider peak are older or that they are evolving more rapidly.

We argue that the biological origins of the exons contributing to the narrow and wide peaks are related to the appearance of new splicing mechanisms, such as alternative splicing (4). There has recently been great progress in the development of probabilistic methods for determination of exon boundaries (21–25). However, computational prediction of alternative splicing boundaries is still imperfect. Thus, to test the hypothesis that the observed distribution of exons between the two classes correlates with alternative splicing, we used the sets of alternatively spliced exons, which were manually curated and confirmed by EST/mRNA alignment (9,10). Unfortunately, curated datasets are available only for human and mouse, but there is a remarkable correlation between these data and the narrow peaks in Homo sapiens and Mus musculus (see data in Tables 1 and 3 and Figure 4). This strongly suggests that alternatively spliced exons are major contributors to the narrow peaks in all of the species examined here. This observation favors the hypothesis that the narrow peak is younger, since alternative splicing is a comparatively advanced evolutionary mechanism. If one assumes that the evolutionary process modifies all exons in all the species at the same rate, the observation of narrow peaks of nearly equal width in all species leads one to conclude that those peaks appeared at approximately the same, quite recent, time. At least hypothetically, one may conclude that the existence of the narrow peak in all discussed genomes is a manifestation of a spontaneous symmetry break in genome evolution, which is correlated with the evolution of alternative splicing.

The assumption that the narrow peak is more recent does not completely eliminate the hypothesis that this peak is more conserved. Indeed, one may assume that exons contributing to the narrow peaks are simultaneously more recent and more conserved (with respect to the exon size distributions) than those contributing to the wide peak. In support of this point of view, one may note that the fraction of exons contributing to the narrow peak, in general, correlates with the width of the wide peak (see data in Table 4). With two exceptions, the genomes of Anopheles gambiae and Drosophila melanogaster, the increases in the fraction of exons in the narrow peak parallel the increases in the width, σ, of the wider peak. In other words, it looks as if the exons in the narrow peak are less frequently split by evolutionary processes (narrow peak width), but more frequently duplicated (larger fraction of exons) than are exons in the wide peak. This point of view is in agreement with the general pattern of alternative splicing in which a single copy of an exon is replaced by two (or several) isoforms of similar lengths.


The model presented here does not directly implicate a specific biological mechanism. However, we have shown that a very simple process, length independent splitting of exons, can produce the observed lognormal distribution of exon sizes. This suggests that it would be profitable to focus more attention on biological processes that are length independent, or on processes that could constrain the length of exons independently of exon length. Processes involved in mRNA splicing or associated with intron–exon boundaries are obvious candidates.

Furthermore, the observation that all eukaryotic organisms possess two exon length classes, one in common among all eukaryotes and one variable, suggests that exon splitting has played a key role in the evolution of eukaryotes. The difference in the peak widths of these two length classes suggests that the wider peak is older, or undergoes more rapid splitting, and mostly comprises nonalternatively spliced exons. The narrow peaks mostly comprise alternatively spliced exons, which are rather conserved in length but multiplied in number in more rapidly evolving genomes. The presence of a narrow peak in all genomes examined could be explained as a manifestation of a ‘spontaneous’ symmetry break in genome evolution associated with the appearance of the mRNA splicing mechanism.

More detailed investigations, including characterization of conservation of exons in narrow and wide peaks between different species and between different subclasses within a particular genome, are extremely intriguing but beyond the scope of this article. The main goals of this work are to draw attention to the statistical properties of exon size distributions and to highlight the utility of the Kolmogoroff process model for understanding the background of genome evolution.


Supplementary Data are available at NAR Online.



Y.R. acknowledges stimulating discussions with Profs Yu. Feldman, D. Fushman, A. Sokolov and S. Mount, and expresses deep gratitude for constant support from Mrs. Natalia Grishina. Y.R. was supported by NSF award DBI-0515986 (M.G. PI) and, when working on the revision of this article, by the National Research Council Associateship Program (Award # 0710430). Funding to pay the Open Access publication charges for this article was provided by NSF award DBI-0515986.

Conflict of interest statement. None declared.


1. Gilbert W. Why genes in pieces. Nature. 1978;271:501–501. [PubMed]
2. Zhang MQ. Statistical features of human exons and their flanking regions. Hum. Mol. Genet. 1998;7:919–932. [PubMed]
3. Roy SW. The origin of recent introns: transposons? Genome Biol. 2004;5:251. [PMC free article] [PubMed]
4. Brett D, Pospisil H, Valcarcel J, Reich J, Bork P. Alternative splicing and genome complexity. Nat. Genet. 2002;30:29–30. [PubMed]
5. Curwen V, Eyras E, Andrews TD, Clarke L, Mongin E, Searle SMJ, Clamp M. The Ensembl automatic gene annotation system. Genome Res. 2004;14:942–950. [PMC free article] [PubMed]
6. Rhee SY, Beavis W, Berardini TZ, Chen GH, Dixon D, Doyle A, Garcia-Hernandez M, Huala E, Lander G, Montoya M, et al. The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res. 2003;31:224–228. [PMC free article] [PubMed]
7. Press WH. Numerical Recipes in C: the Art of Scientific Computing. 2nd. Cambridge, New York: Cambridge University Press; 1992.
8. Matsumoto M, Nishimura T. Mersenne Twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Trans. Model Comp. Simul. 1998;8:3–30.
9. Thanaraj TA, Stamm S, Clark F, Riethoven JJ, Le Texier V, Muilu J. ASD: the Alternative Splicing Database. Nucleic Acids Res. 2004;32:D64–D69. [PMC free article] [PubMed]
10. Stamm S, Riethoven JJ, Le Texier V, Gopalakrishnan C, Kumanduri V, Tang YS, Barbosa-Morais NL, Thanaraj TA. ASD: a bioinformatics resource on alternative splicing. Nucleic Acids Res. 2006;34:D46–D55. [PMC free article] [PubMed]
11. Snedecor GW, Cochran WG. Statistical Methods. 8th. Ames: Iowa State University Press; 1989.
12. Collins FS, Lander ES, Rogers J, Waterston RH, Conso IHGS. Finishing the euchromatic sequence of the human genome. Nature. 2004;431:931–945. [PubMed]
13. Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, Scherer SE, Li PW, Hoskins RA, Galle RF, et al. The genome sequence of Drosophila melanogaster. Science. 2000;287:2185–2195. [PubMed]
14. Balakrishnan N, Basu AP. The Exponential Distribution: Theory, Methods, and Applications. Amsterdam, United States: Gordon and Breach; 1995.
15. Gudlaugsdottir S, Boswell DR, Wood GR, Ma J. Exon size distribution and the origin of introns. Genetica. 2007;131:299–306. [PubMed]
16. Weibull W. A statistical distribution function of wide applicability. J. Appl. Mech-T. Asme. 1951;18:293–297.
17. Kolmogoroff AN. Concerning the logarithmic normal distribution principle of dimensions of particles during dispersal. Cr. Acad. Sci. Urss. 1941;31:99–101.
18. Razumovsky NK. Distribution of metal values in ore deposits. Cr. Acad. Sci. Urss. 1940;28:814–816.
19. Razumovsky NK. On the role of the logarithmically normal law of frequency distribution in petrology and geochemistry. Cr. Acad. Sci. Urss. 1941;33:48–49.
20. Hoglund M, Sall T, Rohme D. On the origin of coding sequences from random open reading frames. J. Mol. Evol. 1990;30:104–108.
21. Burge CB, Karlin S. Finding the genes in genomic DNA. Curr. Opin. Struc. Biol. 1998;8:346–354. [PubMed]
22. Pertea M, Lin XY, Salzberg SL. GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Res. 2001;29:1185–1190. [PMC free article] [PubMed]
23. Mathe C, Sagot MF, Schiex T, Rouze P. Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res. 2002;30:4103–4117. [PMC free article] [PubMed]
24. Zhang XHF, Heller KA, Hefter L, Leslie CS, Chasin LA. Sequence information for the splicing of human Pre-mRNA identified by support vector machine classification. Genome Res. 2003;13:2637–2650. [PMC free article] [PubMed]
25. Yeo G, Burge CB. Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J. Comput. Biol. 2004;11:377–394. [PubMed]

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press

