# Spontaneous symmetry breaking in genome evolution

^{1}Department of Chemistry, Purdue University, 560 Oval Drive, Box 202 and

^{2}Department of Biological Sciences, Lilly Hall of Life Sciences 915 W. State Street, Purdue University, West Lafayette, IN, 47907, USA

## Abstract

The quest for evolutionary mechanisms providing separation between the coding (exons) and noncoding (introns) parts of genomic DNA remains an important focus of genetics. This work combines an analysis of the most recent achievements of genomics with fundamental concepts of random processes to provide a novel point of view on genome evolution. Exon sizes in sequenced genomes show a lognormal distribution typical of a random Kolmogoroff fractioning process. This implies that the process of intron insertion may be independent of exon size, and therefore could depend on intron–exon boundaries. All genomes examined have two distinctive classes of exons, each with a different evolutionary history. In the framework proposed in this article, these two classes of exons can be derived from a hypothetical ancestral genome by (spontaneous) symmetry breaking. We note that one of these exon classes comprises mostly alternatively spliced exons.

## INTRODUCTION

A substantial fraction of the genomic DNA sequence does not directly encode the primary structure of any cellular protein, or any other cellular product (1). This is largely due, in eukaryotes, to the division of genes into introns (noncoding parts of DNA) and exons (coding parts of DNA), each of which has a pronounced size distribution (2). The actual mechanism (or mechanisms) of intron insertion is the subject of intense discussion, and a currently popular working hypothesis suggests that introns may be largely produced by insertion of transposons, DNA elements that can move around to different positions within the genome (3). Very frequently, this point of view implicitly assumes that there is a higher probability of splitting longer exons since they are larger targets for transposons. Here, we show that the distributions of exon sizes for different organisms have a general property: the presence of two distinguishable classes of exons with different size distributions. We present an idealized scheme that explains how the observed distribution of exon sizes can be derived from a common ancestral genome by a random (quasi)-evolutionary process. This formal model makes it possible to investigate the evolution of the genomes of particular organisms, and to estimate the number of evolutionary steps that separate them from the hypothetical ancestral genome. Conceptually, our results support the opinion that at the initial stages of evolution, simple genomes had a lower fraction of introns (introns late hypothesis). The model we propose explains the observed lognormal distribution of exon sizes, and suggests that introns may be inserted by a process that is independent of exon size. Our findings can also be rationalized in relation to the phenomenon of alternative splicing (4).

## MATERIALS AND METHODS

In our study, we used genome data for 12 animal species provided by Ensembl (5) (http://www.ensembl.org/), the joint project of the European Molecular Biology Laboratory—European Bioinformatics Institute and the Sanger Institute, and a plant genome sequence from The Arabidopsis Information Resource (TAIR, http://www.arabidopsis.org/) (6). All the entries annotated as ‘exon’ were retrieved for each complete genome. All duplicates of exons (which originate primarily from multiple accessions) starting and ending at the same points were excluded from the analysis. The natural logarithms of exon sizes were divided into bins of width Δ ln(*exon size*) = 0.2 to obtain the exon size distributions presented in Figure 1 and in Supplementary data. The distributions were fit using the unweighted χ^{2}-measure normalized over the number of degrees of freedom (7) as the criterion of fitting quality. The values of fitted parameters are presented in Table 1. Table 2 shows the corresponding values of χ^{2} and the Pearson correlation coefficient, *r*, which characterize the agreement between original data and the fitted model.
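The preprocessing described above (dropping exons with identical start and end points, then binning ln(*exon size*) with a bin width of 0.2) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `(chromosome, start, end)` record layout, the function names and the inclusive-coordinate convention are assumptions made for the example.

```python
import math
from collections import Counter


def unique_exons(records):
    """Drop exons that start and end at the same coordinates.

    `records` is an iterable of (chromosome, start, end) tuples. This layout
    is a hypothetical one chosen for the example, not a database schema.
    """
    return sorted(set(records))


def log_size_histogram(records, width=0.2):
    """Count exons in bins of ln(exon size) with the given bin width."""
    counts = Counter()
    for _chrom, start, end in unique_exons(records):
        size = end - start + 1  # inclusive coordinates are assumed here
        counts[math.floor(math.log(size) / width)] += 1
    # Label each bin by its lower edge on the ln(size) axis.
    return {round(k * width, 10): n for k, n in sorted(counts.items())}


if __name__ == "__main__":
    demo = [("1", 1, 100), ("1", 1, 100), ("1", 201, 300), ("2", 1, 1000)]
    print(log_size_histogram(demo))
```

The duplicate record in the demo list is counted once, mirroring the exclusion of exons reported under multiple accessions.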

Caption fragment: … *Homo sapiens* and *Drosophila melanogaster* genomes. Both data sets can be approximated by a sum of two Gaussian peaks with very high correlation between the data and the best-fit distributions.

Caption fragment: … χ^{2} and *r*, together with the number of data points and number of degrees of freedom, *df*.

Three classes of fitting models were used:

- A model based on the assumption that an exon can be split at any position with equal and constant probability; this model leads to an exponential distribution of exon sizes, where *E* is exon size, *dN* is the number of exons in a bin, *N* is the amplitude of the peak and λ is the probability of splitting an exon at a particular place.
- Two models that produce lognormal distributions of exon sizes. These models are based on a Kolmogoroff process, which does not assume any relationship between exon size and probability of splitting an exon. Particularly, to fit the data we used a model of single lognormal peak and a mixture of two lognormal distributions, where *N*, *N*_{a} and *N*_{b} are the amplitudes, *M*, *M*_{a} and *M*_{b} the mean positions, and σ, σ_{a} and σ_{b} the variances of lognormal peaks observed in the data.
- A combination of a Weibull distribution and exponential distribution, where *N*_{w}, λ_{w} and *c* are the amplitude, frequency parameter and shape parameter of the Weibull distribution, respectively.
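The three model classes listed above can be written down as functions of *x* = ln(*exon size*), the variable in which the data are binned. The sketch below uses standard textbook forms of the exponential, Gaussian and Weibull densities with the symbols named in the list; the exact parameterization (normalized Gaussian peaks, the Jacobian factor *E* for densities defined per unit *E*) and all function names are assumptions, and may differ from the forms used in the original fits.

```python
import math


def exponential_model(x, N, lam):
    """Exponential sizes, dN = N exp(-lam*E) dE, expressed per unit x = ln E."""
    E = math.exp(x)
    return N * E * math.exp(-lam * E)  # factor E is the Jacobian dE/dx


def lognormal_model(x, N, M, sigma):
    """Single Gaussian peak in x = ln E with area N, centre M and width sigma."""
    z = (x - M) / sigma
    return N / (sigma * math.sqrt(2.0 * math.pi)) * math.exp(-0.5 * z * z)


def two_lognormal_model(x, Na, Ma, sa, Nb, Mb, sb):
    """Mixture of two lognormal peaks (labelled a and b)."""
    return lognormal_model(x, Na, Ma, sa) + lognormal_model(x, Nb, Mb, sb)


def weibull_exponential_model(x, Nw, lam_w, c, N, lam):
    """Weibull peak (amplitude Nw, frequency lam_w, shape c) plus exponential."""
    E = math.exp(x)
    u = (lam_w * E) ** c
    weibull = Nw * c * u * math.exp(-u)  # Weibull density in E, times Jacobian E
    return weibull + exponential_model(x, N, lam)


def reduced_chi2(xs, ys, model, params):
    """Unweighted chi-square divided by the number of degrees of freedom."""
    residuals = [y - model(x, *params) for x, y in zip(xs, ys)]
    df = len(xs) - len(params)
    return sum(r * r for r in residuals) / df
```

With these definitions, `reduced_chi2(xs, ys, two_lognormal_model, params)` gives a fitting-quality measure of the kind described in this section for any candidate parameter set; the minimization itself is left to whatever optimizer is at hand.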

Sequences of pseudorandom numbers were obtained using the *Mersenne Twister* algorithm (8) implemented in the standard MATLAB 7.1 installation. The sequences of pseudorandom numbers were seeded with the values 14 (sequence **A**) and 16 (sequence **B**), and were used to generate the multiplicative processes presented in Figure 2. Each pseudorandom sequence consists of 10^{7} random numbers, i.e., 10^{7} exon splitting events. The ratio between the initial lengths of the two ancestral exons used to generate the exon size distributions presented in Figure 2 was 1:1000.


The confidence intervals shown for the fitted parameters in Figure 1, error bars in Figure 3, Tables 1 and 3, and in the Supplementary data are 95% confidence intervals estimated from the covariance matrix (7).


For the *Homo sapiens* and *Mus musculus* genomes, we also analyzed distributions of alternatively spliced exons. These data were obtained from the third release of the Alternative Splicing Database (ASD, http://www.ebi.ac.uk/asd/altsplice/). This resource provides manually curated data on alternative splicing with all exons confirmed by EST/mRNA alignments (9,10). These data were also divided into bins of width Δ ln(*exon size*) = 0.2 and fitted using the unweighted χ^{2}-measure to models of a single exponential peak, a single Weibull peak, and a single lognormal distribution (Figure 4 and Table 3).

Caption fragment: … *Homo sapiens* are in left panels; for *Drosophila melanogaster* in right panels.

In the case of fitting the model with two lognormal peaks, the fractions of exons contributing to each peak (Table 4) were estimated by formal integration of the peak areas, which, for a standard Gaussian distribution, are proportional to the peak amplitudes.
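The statement about peak areas can be made explicit with a short worked form. It assumes that each lognormal peak is written as a normalized Gaussian in ln *E*, which is one common convention and not necessarily the one used for the fits; the fractions *f*_{a} and *f*_{b} are names introduced here for illustration.

```latex
\frac{dN}{d\ln E}
  = \frac{N_a}{\sigma_a\sqrt{2\pi}}
    \exp\!\left[-\frac{(\ln E - M_a)^2}{2\sigma_a^2}\right]
  + \frac{N_b}{\sigma_b\sqrt{2\pi}}
    \exp\!\left[-\frac{(\ln E - M_b)^2}{2\sigma_b^2}\right],
\qquad
\int_{-\infty}^{\infty}
    \frac{N_k}{\sigma_k\sqrt{2\pi}}
    \exp\!\left[-\frac{(x - M_k)^2}{2\sigma_k^2}\right] dx = N_k ,
\qquad
f_a = \frac{N_a}{N_a + N_b}, \quad f_b = 1 - f_a .
```

If the amplitudes were instead defined as peak heights, each area would be N_k σ_k √(2π) and the fraction would become N_a σ_a / (N_a σ_a + N_b σ_b).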

Caption fragment: … *M*, and peak widths, exp *σ*.

*F*-test critical values, *F*_{0.05}, presented in Table S1 in Supplementary data, were calculated for a significance level of 5% and the numbers of degrees of freedom, *df*, given in Table S1 and Table 3. For each pair of models under consideration, we compared the ratio, χ^{2}_{I}/χ^{2}_{II}, to the calculated *F*_{0.05}. When χ^{2}_{I}/χ^{2}_{II} > *F*_{0.05}, the hypothesis that the variance *σ*_{I} associated with χ^{2}_{I} is greater than the variance *σ*_{II} associated with χ^{2}_{II}, i.e. the assumption that Model II is a better description of the data than Model I (11), is verified at the *P* < 0.05 level.

## RESULTS

The *Homo sapiens* (12) and *Drosophila melanogaster* (13) genomes show (Figure 1) a striking similarity in the distributions of exon sizes; in both, the distribution of the logarithm of exon size forms two distinctive peaks. Other genomes [see Supplementary data in which we analyze all complete genome data provided by Ensembl (5) and TAIR (6)] show similar patterns. A simplistic model of intron insertion assumes that the probability of inserting an intron is equal at all positions of an exon (making it more likely that a longer exon will be split). This type of process would lead to an exponential distribution of exon sizes showing a single peak on a logarithmic scale (14,15). Recently, the simplistic exponential model of exon size distribution was reconsidered by Gudlaugsdottir and co-authors (15), who suggested treating distributions of exon sizes as a combination of Weibull (16) and exponential distributions. However, they did not provide a model for an evolutionary process that could lead to the Weibull distribution, but rather considered this distribution to be only an empirical approximation of the observed exon size data.

Our analysis of thirteen genomes (Table 2) shows that the exponential distribution model always produces the largest χ^{2}-values, which signifies that it is the least adequate fit to the observed exon size distributions. The model of a single lognormal peak is a closer fit to the data. The mixture of Weibull and exponential distributions suggested by Gudlaugsdottir and co-authors is a better approximation, but for most genomes (except *Bos taurus*, *Canis familiaris* and *Gallus gallus*) a combination of two lognormal distributions produces the best agreement between the real data and a fitted model. The length distributions of alternatively spliced exons in *Homo sapiens* and *Mus musculus*, drawn from ASD (9,10), reveal a single-peak pattern (Figure 4). In this case, analysis of χ^{2}-values also shows that an exponential distribution is the poorest fit to the data, and the lognormal distribution is the best (Table 3). A more sophisticated comparison between the models used in this work, based on the *F*-test criterion (11), suggests a similar hierarchy of the models (see Methods section and Table S1 in Supplementary data).

## DISCUSSION

The lognormal distribution is a normal (Gaussian) distribution of the logarithm of some quantity. This kind of distribution commonly originates from a Kolmogoroff random multiplicative process (17), which was originally introduced to describe the distribution of ore particle sizes observed in geological samples (18,19), and later was found to be useful as a paradigm for a whole universe of different breakage and splitting processes. A modified version of this process, in the context of our problem, can be described as follows. Let us consider a single exon, which is split by a random mutational process into two parts (equal in size, for the sake of illustration). Then, in the next step of the process, let us assume that one randomly selected part of the ancestral exon undergoes the same splitting. Subsequently, this process repeats for a large number of splitting events. The key assumption for this process, which is the same as the assumption of Kolmogoroff, is independence of the probability of undergoing a splitting event from exon size. This version of a random multiplicative process is slightly different from that discussed by Kolmogoroff, which assumes a constant frequency of splitting events for all parts of the fractionating set (exons in a genome in our case) at every time, while the process considered here assumes a single exon splitting event at each step of the process. Thus, the probability of breaking a particular exon at the next step of the process is not constant, but constantly decreases with the increase in the number of exons. When started from a single ancestral exon, this model of the multiplicative process, similarly to Kolmogoroff's version, produces a lognormal distribution of exon sizes. The resulting distribution of exon sizes obtained in this manner is independent of the actual sequence of random splitting events. 
After a sufficiently large number of steps, the mean position of the peak in the distribution of ln(*exon size*), *M*, shows an asymptotic linear shift with the logarithm of the number of splitting events, *N*_{spl} (for the process introduced here), or with time, *t* (for the Kolmogoroff process): ln *E*_{0} − *M* ~ ln *N*_{spl} ~ *t*, where *E*_{0} is the length of the ancestral exon. Similarly, the peak width, σ, is related to the same quantities, σ^{2} ~ ln *N*_{spl} ~ *t*. The particular values of the coefficients for these proportionalities depend on the details of the process and can be ignored for our current purposes. However, we assume that the details of the splitting processes, i.e. the coefficients of proportionality, are the same for all species.
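The single-exon process described in this paragraph is easy to simulate. The sketch below follows the illustrative rule from the text (one exon chosen uniformly at random, regardless of size, and split into two equal halves at each step); the function names, the default seed and the choice to track halving depth instead of absolute length are assumptions made for the example.

```python
import math
import random


def split_depths(n_splits, seed=0):
    """Return the halving depth of every exon after n_splits random splits.

    Each step picks one existing exon uniformly at random, whatever its size,
    and replaces it by two halves; an exon of depth d has size E0 / 2**d.
    """
    rng = random.Random(seed)
    depths = [0]  # a single ancestral exon
    for _ in range(n_splits):
        i = rng.randrange(len(depths))
        depths[i] += 1            # the chosen exon becomes one half ...
        depths.append(depths[i])  # ... and the other half is a new exon
    return depths


def log_size_moments(depths, E0=1.0):
    """Mean and variance of ln(exon size) for exons of the given depths."""
    xs = [math.log(E0) - d * math.log(2.0) for d in depths]
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, var


if __name__ == "__main__":
    for n in (10**3, 10**4, 10**5):
        mean, var = log_size_moments(split_depths(n, seed=1))
        print(n, round(-mean, 2), round(var, 2))
```

For this halving rule, the exon depths have the same statistics as the external nodes of a random binary search tree, whose expected mean depth after *n* splits is 2(*H*_{n+1} − 1), i.e. roughly 2 ln *n*. Printing −*M* and σ^{2} for increasing *n* therefore gives a direct check of the logarithmic growth of both quantities with the number of splitting events.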

The assumption that the splitting probability is independent of exon size is essential for a lognormal distribution. A similar hypothesis, that selection of exons from open reading frames is independent of exon size for large exons (larger than a certain threshold of 105–110 bp), was discussed earlier (20). Our model assumes that the probability of exon splitting is independent of exon size for all exons, irrespective of any threshold. We note, in this regard, that the currently most common point of view is that the evolutionary process splits exons by intron insertions. As mentioned in the Introduction section, this generally acknowledged hypothesis of exon splitting by transposon insertion implicitly assumes a larger probability of splitting longer exons because they are larger targets for transposons.

In contrast to this, the lognormal distribution of exon sizes suggests that the mechanism of intron insertion should be independent of exon size. Obvious candidates for such a process would be mechanisms involving exon–intron boundaries, of which there are always two for each exon.

A Kolmogoroff process provides a conceptual background for understanding the lognormal nature of exon size distribution. However, the distribution of exon length in real genomes is somewhat more complicated and generally reveals two lognormal peaks (Figure 1 and Supplementary data). To demonstrate how this two-peak pattern can be obtained for a random exon splitting process, let us consider a random process, similar to the one described before, initiated for a model ancestral genome that comprises two exons of unequal lengths. In this case, the two parts of the ancestral genome generate two distinct peaks in the exon size distribution. However, unlike previously, the resulting distribution of exon sizes is heavily dependent on the actual splitting pathway. Figure 2 demonstrates this fact for two splitting patterns generated from two sequences of pseudorandom numbers (see Methods section). This figure reveals a (spontaneous) symmetry breaking between the two peaks in the exon size distribution. In a certain sense, this phenomenon is similar to bifurcation, but, in contrast to an instant bifurcation event, this type of behavior can be regarded as a kind of ‘soft’ breaking of symmetry. The initial steps of the splitting process have a major impact on this phenomenon. Every subsequent splitting event has progressively less effect, and after ~5000 steps, the ratio between numbers of exons contributing to the two different peaks remains nearly constant. We believe that this phenomenon of symmetry breaking plays a major role in the diversity of patterns of exon size distributions.
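The dependence on the splitting pathway can be illustrated with a small simulation of the two-exon process. The sketch below only tracks how many exons descend from each ancestor, since the exon to split is chosen independently of size; the ancestral length ratio sets where the two peaks sit, not how many exons each collects. The function name and checkpoints are hypothetical. Python's `random` module also uses the Mersenne Twister, but seeds 14 and 16 will not reproduce the MATLAB sequences **A** and **B**.

```python
import random


def lineage_fraction(n_splits, seed, checkpoints):
    """Track the share of exons descended from ancestor A during splitting.

    The process starts from two ancestral exons, A and B. At every step one
    existing exon is chosen uniformly at random, independently of its size,
    and split in two, so its lineage gains one exon.
    """
    rng = random.Random(seed)
    count_a, count_b = 1, 1
    history = {}
    for step in range(1, n_splits + 1):
        # Probability of hitting lineage A equals its current share of exons.
        if rng.random() < count_a / (count_a + count_b):
            count_a += 1
        else:
            count_b += 1
        if step in checkpoints:
            history[step] = count_a / (count_a + count_b)
    return history


if __name__ == "__main__":
    marks = {10, 100, 1000, 5000, 100000}
    for seed in (14, 16):
        print(seed, lineage_fraction(100000, seed, marks))
```

Counting lineages in this way is equivalent to a two-colour Pólya urn: early draws move the fraction substantially, while later draws change it less and less, so each seed settles on its own ratio between the two peaks.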

The idealized model presented above can be reformulated in a different way where, instead of splitting exons on every step of the process, one randomly selected exon produces a descendant of larger size (a doublet, for example). From a formal point of view, these two variants of the process correspond to two possible directions for the time variable. From a biological perspective, they can be viewed as processes producing orthologous and paralogous gene families: the exons produced by duplications could be considered as paralogs, while exons derived from splitting could be considered as orthologs. Both variants of the model assume that the process starts from an ancestral genome with few exons, which are later split into smaller parts (or duplicated to produce longer parts). Thus, the proposed model, based on Kolmogoroff's ideas, supports the opinion that, at the initial stages of evolution, simple genomes had a lower fraction of introns (introns late hypothesis).

Figure 3 shows the parameters fitted to the exon size distributions for 13 different genomes, using a two-component lognormal model. All genomes show the presence of a narrow and a wide peak. Since peak width, *σ*, is a direct measure of the number of exon splitting events, our model suggests that exons in all of these genomes are distributed between two groups having different evolutionary histories. The width of the narrow peak is approximately equal for all species. The width of the wider peak varies among species. For simple organisms, like *Caenorhabditis elegans*, its width is close to the width of the narrow peak, while for higher organisms, such as *Homo sapiens* and *Pan troglodytes*, it is about three times the width of the narrow peak. This may suggest either that exons in this wider peak are older or that they are evolving more rapidly.

We argue that the biological origins of the exons contributing to the narrow and wide peaks are related to the appearance of new splicing mechanisms, such as alternative splicing (4). There has recently been great progress in the development of probabilistic methods for determination of exon boundaries (21–25). However, computational prediction of alternative splicing boundaries is still imperfect. Thus, to test the hypothesis that the observed distribution of exons between the two classes correlates with alternative splicing, we used the sets of alternatively spliced exons, which were manually curated and confirmed by EST/mRNA alignment (9,10). Unfortunately, curated datasets are available only for human and mouse, but there is a remarkable correlation between these data and the narrow peaks in *Homo sapiens* and *Mus musculus* (see data in Tables 1 and 3 and Figure 4). This strongly suggests that alternatively spliced exons are major contributors to the narrow peaks in all of the species examined here. This observation favors the hypothesis that the narrow peak is younger, since alternative splicing is a comparatively advanced evolutionary mechanism. If one assumes that the evolutionary process modifies all exons in all the species at the same rate, the observation of narrow peaks of nearly equal width in all species leads one to conclude that those peaks appeared at approximately the same, quite recent, time. At least hypothetically, one may conclude that the existence of the narrow peak in all discussed genomes is a manifestation of a spontaneous symmetry break in genome evolution, which is correlated with the evolution of alternative splicing.

The assumption that the narrow peak is more recent does not completely eliminate the hypothesis that this peak is more conserved. Indeed, one may assume that exons contributing to the narrow peaks are simultaneously more recent and more conserved (with respect to the exon size distributions) than those contributing to the wide peak. In support of this point of view, one may note that the fraction of exons contributing to the narrow peak, in general, correlates with the width of the wide peak (see data in Table 4). With two exceptions, the genomes of *Anopheles gambiae* and *Drosophila melanogaster*, the increases in the fraction of exons in the narrow peak parallel the increases in the width, *σ*, of the wider peak. In other words, it looks as if the exons in the narrow peak are less frequently split by evolutionary processes (narrow peak width), but more frequently duplicated (larger fraction of exons), than are exons in the wide peak. This point of view is in agreement with the general pattern of alternative splicing in which a single copy of an exon is replaced by two (or several) isoforms of similar lengths.

## CONCLUSIONS

The model presented here does not directly implicate a specific biological mechanism. However, we have shown that a very simple process, length independent splitting of exons, can produce the observed lognormal distribution of exon sizes. This suggests that it would be profitable to focus more attention on biological processes that are length independent, or on processes that could constrain the length of exons independently of exon length. Processes involved in mRNA splicing or associated with intron–exon boundaries are obvious candidates.

Furthermore, the observation that all eukaryotic organisms possess two exon length classes, one in common among all eukaryotes and one variable, suggests that exon splitting has played a key role in the evolution of eukaryotes. The difference in the peak widths of these two length classes suggests that the wider peak is older, or undergoes more rapid splitting, and mostly comprises nonalternatively spliced exons. The narrow peaks mostly comprise alternatively spliced exons, which are rather conserved in length but multiplied in number in more rapidly evolving genomes. The presence of a narrow peak in all genomes examined could be explained as a manifestation of a ‘spontaneous’ symmetry break in genome evolution associated with the appearance of the mRNA splicing mechanism.

More detailed investigations, including characterization of the conservation of exons in narrow and wide peaks between different species and between different subclasses within a particular genome, are extremely intriguing but beyond the scope of this article. The main goals of this work are to draw attention to the statistical properties of exon size distributions and to highlight the utility of the Kolmogoroff process model for understanding the background of genome evolution.

## ACKNOWLEDGEMENTS

Y.R. acknowledges stimulating discussions with Profs Yu. Feldman, D. Fushman, A. Sokolov and S. Mount, and expresses deep gratitude for constant support from Mrs. Natalia Grishina. Y.R. was supported by NSF award DBI-0515986 (M.G. PI) and, when working on the revision of this article, by the National Research Council Associateship Program (Award # 0710430). Funding to pay the Open Access publication charges for this article was provided by NSF award DBI-0515986.

*Conflict of interest statement*. None declared.

## REFERENCES

**Oxford University Press**


Spontaneous symmetry breaking in genome evolution. Nucleic Acids Research. May 2008; 36(8): 2756.
