EURASIP J Bioinform Syst Biol. 2009; 2009(1): 924601.
Published online Mar 2, 2009. doi:  10.1155/2009/924601
PMCID: PMC3171443

A Hybrid Technique for the Periodicity Characterization of Genomic Sequence Data

Abstract

Many studies of biological sequence data have examined sequence structure in terms of periodicity, and various methods for measuring periodicity have been suggested for this purpose. This paper compares two such methods, autocorrelation and the Fourier transform, using synthetic periodic sequences, and explains the differences in periodicity estimates produced by each. A hybrid autocorrelation—integer period discrete Fourier transform is proposed that combines the advantages of both techniques. Collectively, this representation and a recently proposed variant on the discrete Fourier transform offer alternatives to the widely used autocorrelation for the periodicity characterization of sequence data. Finally, these methods are compared for various tetramers of interest in C. elegans chromosome I.

1. Introduction

The detection of structure within the DNA sequence has long captivated the interest of the research community. Among the various statistical characterizations of sequence data, one measure of structure within sequences is the degree of correlation or periodicity at various displacements along the sequence. Periodicity characterization of sequence data provides a compact and informative representation that has been used in many studies of structure within genomic sequences, including DNA sequence analysis [1], gene and exon detection [2], tandem repeat detection [3], and DNA sequence search and retrieval [4].

To measure such periodicity, autocorrelation has been widely employed [1, 5–11]. Similarly, Fourier analysis and its variants have been used for periodicity characterization of sequences [4, 9, 12–24]. In some cases [25, 26], the Fourier transform of the autocorrelation sequence has also been computed; however, using existing symbolic-numeric mappings such as binary indicator sequences [27], this transform can also be calculated without first determining the autocorrelation. Other recent promising approaches to periodicity characterization for biological sequences include the periodicity transform [28], the exactly periodic subspace decomposition [3], and maximum-likelihood statistical periodicity [29]; however, these techniques have yet to be adopted by biologists for the purposes of sequence structure characterization.

Studies of structure within sequences, such as those referenced above, have tended to use either the autocorrelation or the Fourier transform, and to the author's knowledge, the limitations of each have not been compared in this context. In this paper, the limitations of both approaches are investigated using synthetic symbolic sequences, and caveats to their characterization of sequence data are discussed. A hybrid approach to periodicity characterization of symbolic sequence data is introduced, and its use is illustrated in a comparative manner on a study of tetramers in C. elegans.

2. Periodicity Measures for Symbolic Sequence Characterization

2.1. Definition of Periodicity

Perhaps the most common definition of exact periodicity in a general sequence $s[n]$ is

$$s[n] = s[n + p]$$
(1)

for some positive integer $p$. Assuming $s[n]$ can be represented numerically as $x[n]$, this definition admits the following decomposition:

$$x[n] = x_p[n] * t_p[n]$$
(2)

where $*$ denotes convolution,

$$x_p[n] = \begin{cases} x[n], & 0 \le n \le p - 1 \\ 0, & \text{otherwise} \end{cases}$$
(3)

is the numerical representation of a repeated symbol or pattern, and $t_p[n]$ is a periodic binary impulse train:

$$t_p[n] = \begin{cases} 1, & n \bmod p = 0 \\ 0, & \text{otherwise} \end{cases}$$
(4)

While this expression of $x[n]$ in terms of a binary impulse train is perhaps not so common in signal processing of numerical sequences, the reverse is true for DNA sequences, which have been represented numerically using binary indicator sequences [27] in many studies (e.g., [13, 19, 23, 24, 30]).
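As a minimal sketch of the decomposition in (2), the following Python fragment builds an exactly periodic numerical sequence by convolving a single repeat of a pattern with a binary impulse train. The pattern values, the segment length, and the helper names `impulse_train` and `convolve` are hypothetical choices made for illustration only.

```python
def impulse_train(p, n_samples):
    """Binary impulse train: 1 where n is a multiple of p, 0 elsewhere."""
    return [1 if n % p == 0 else 0 for n in range(n_samples)]


def convolve(a, b, n_samples):
    """Linear convolution of a and b, truncated to the first n_samples outputs."""
    out = [0] * n_samples
    for i, a_val in enumerate(a):
        for j, b_val in enumerate(b):
            if i + j < n_samples:
                out[i + j] += a_val * b_val
    return out


pattern = [3, 1, 2]   # hypothetical numerical values for one repeat of the pattern
p = len(pattern)      # the period equals the pattern length
N = 12                # hypothetical segment length

x = convolve(pattern, impulse_train(p, N), N)
print(x)  # the pattern repeated every p samples
```

The resulting sequence satisfies the exact periodicity condition in (1) for every index at which both sides are defined.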

2.2. Autocorrelation

The autocorrelation of a finite length numerical sequence $x[n]$ is defined as

$$r_{xx}[\rho] = \sum_{n=0}^{N-\rho-1} x[n]\, x[n+\rho]$$
(5)

where n is the sequence index, ρ is the lag, and N is the length of the sequence. The application of the autocorrelation as defined in (5) to a symbolic sequence $s[n]$ requires a numerical representation $x[n]$. The binary indicator sequences [27], which are sufficiently general as to form the basis for many different representations of DNA sequences, are employed in this analysis to represent $s[n]$ in terms of M binary signals:

$$b_m[n] = \begin{cases} 1, & s[n] = S_m \\ 0, & \text{otherwise} \end{cases} \qquad m = 1, \ldots, M$$
(6)

where M is the number of symbols (or patterns of symbols, such as a polynucleotide) $S_1, \ldots, S_M$, to which the numerical values $a_1, \ldots, a_M$ are assigned, respectively, resulting in M components $a_m b_m[n]$. Assuming […], the numerical representation can thus be unambiguously expressed as

$$x[n] = \sum_{m=1}^{M} a_m\, b_m[n]$$
(7)

Note that applying the decomposition in (2) to an exactly periodic sequence results in $x_p[n]$ comprising a sequence of the numerical values $a_m$ that correspond to the repeated pattern of symbols.

Alternatively, the autocorrelation can be defined directly on a symbolic sequence $s[n]$, as used in [20]:

$$r_{S}[\rho] = \sum_{n=0}^{N-\rho-1} \mathbb{1}(s[n] = S)\, \mathbb{1}(s[n+\rho] = S)$$
(8)

where $\mathbb{1}(\cdot)$ equals 1 when its argument is true and 0 otherwise, so that the autocorrelation at a lag, or period, $\rho$ for a symbol (or pattern of symbols) $S$ is simply the count of the number of instances of that symbol at a spacing of ρ.
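The counting interpretation of the symbolic autocorrelation can be sketched in a few lines of Python. The toy sequence, the helper names `indicator` and `autocorrelation`, and the unnormalised form of the sum are assumptions made for illustration, not the implementation used in the studies cited.

```python
def indicator(seq, symbol):
    """Binary indicator sequence: 1 where `symbol` starts at position n, else 0."""
    k = len(symbol)
    return [1 if seq[n:n + k] == symbol else 0 for n in range(len(seq))]


def autocorrelation(x, max_lag):
    """Unnormalised autocorrelation r[rho] for rho = 0..max_lag."""
    n_samples = len(x)
    return [
        sum(x[n] * x[n + rho] for n in range(n_samples - rho))
        for rho in range(max_lag + 1)
    ]


seq = "TAGCTAGGTACC"            # hypothetical toy sequence
b_ta = indicator(seq, "TA")     # indicator sequence for the dinucleotide TA
r = autocorrelation(b_ta, 8)

print(b_ta)
print(r)  # r[rho] counts the pairs of TA occurrences spaced rho apart
```

In this toy sequence TA occurs at positions 0, 4, and 8, so the only nonzero lags are 4 (two pairs) and 8 (one pair), in addition to the zero-lag count of three.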

Consider now a sequence containing a symbol (or pattern of symbols) $S_m$ that repeats with exactly period p, so that the numerical representation of the sequence has a corresponding p-periodic component. The autocorrelation of this component, for a segment of finite length N, has the following expression:

equation image
(9)

where $E_N$ is the energy of the periodic component over a segment of finite length N. Thus a shortcoming of the autocorrelation for sequence characterization is that an exactly p-periodic sequence will show not only a peak at $\rho = p$, but also peaks at values of $\rho$ that are integer multiples of p (an example is given in Figure 1(a)). Note that similar artifacts can be found in other periodicity detection methods (e.g., [29]).
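The multiple-of-p ambiguity can be reproduced numerically. The sketch below assumes an unnormalised autocorrelation and a period-12 binary impulse train of hypothetical length 240; the function names are illustrative.

```python
def impulse_train(p, n_samples):
    """Binary impulse train: 1 where n is a multiple of p, 0 elsewhere."""
    return [1 if n % p == 0 else 0 for n in range(n_samples)]


def autocorrelation(x, max_lag):
    """Unnormalised autocorrelation r[rho] for rho = 0..max_lag."""
    n_samples = len(x)
    return [
        sum(x[n] * x[n + rho] for n in range(n_samples - rho))
        for rho in range(max_lag + 1)
    ]


p, N = 12, 240                      # hypothetical period and segment length
r = autocorrelation(impulse_train(p, N), 50)

# Lags (excluding zero) at which the autocorrelation is nonzero
peaks = [rho for rho in range(1, 51) if r[rho] > 0]
print(peaks)                        # only integer multiples of p appear
print([r[rho] for rho in peaks])    # peak height decays slowly with lag
```

Because the segment is finite, each successive multiple of p loses only one coincident pair of impulses, so the peaks at 2p, 3p, and so on are almost as large as the peak at p itself.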

Figure 1
Periodicity characterization of the period-12 synthetic signal using (a) autocorrelation, (b) integer period DFT, and (c) hybrid autocorrelation-IPDFT.

2.3. Fourier Interpretation of Periodicity

In many applications, including sequence analysis, the discrete Fourier transform has been used to determine the periodic component(s) of a numerical sequence $x[n]$. The discrete Fourier transform (DFT) of a numerical sequence $x[n]$ is defined as

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2 \pi k n / N}$$
(10)

where k is the discrete frequency index. Since the DFT has sinusoidal basis functions, the notion of periodicity in the Fourier sense is described in terms of the frequencies of those basis functions onto which the projections of $x[n]$ are the largest in magnitude. That is, the magnitude of the DFT at a frequency k, $|X[k]|$, is often taken as an estimate of the relative amount of that frequency component occurring in $x[n]$ [13, 19, 23, 24], from which the relative contribution of a particular period $p = N/k$ can be estimated.

Assuming a numerical representation $x[n]$ of the kind shown in (7), the linearity property of the DFT means that the DFT of a symbolic sequence $s[n]$ can be determined as

$$X[k] = \sum_{m=1}^{M} a_m\, B_m[k]$$
(11)

where $a_m$ is the numerical value assigned to the m-th symbol and the $B_m[k]$, the DFTs of the corresponding binary indicator sequences, are determined according to (10).

For the purposes of characterizing sequence data using periodicity, it can be noted that positive integer periods are generally of most interest. This means firstly that N and k need to be carefully chosen to allow fast Fourier transform-based calculation of $X[k]$ for periods $\rho = 1, 2, \ldots, P$, where P is the longest period to be estimated. Secondly, calculating the DFT at other frequencies, which do not correspond to integer periods, is unnecessary. For these reasons, the integer period DFT (IPDFT) was proposed as an alternative to the DFT [19]:

$$X[\rho] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2 \pi n / \rho}, \qquad \rho = 1, 2, \ldots, P$$
(12)

Using a similar process to that described above in (10) and (11), the numerical representation of a symbolic sequence $s[n]$ can also be transformed using the IPDFT to produce a spectrum $X[\rho]$ that is linear in period (ρ) rather than in frequency (k). For the periodicity characterization of sequences, usually the magnitude $|X[\rho]|$ is of greatest interest. Some care is needed in the interpretation of the IPDFT, since for a binary periodic impulse train with period p and fixed length N, $|X[p]|$ will decrease for longer periods due to the fact that the energy of the impulse train is approximately $N/p$.
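A direct, unoptimised evaluation of the IPDFT can be sketched as follows, assuming the transform takes the form X[ρ] = Σ x[n] exp(−j2πn/ρ) for integer ρ from 1 to P. The function name `ipdft_magnitude`, the period-5 test signal, and its length are hypothetical.

```python
import cmath


def ipdft_magnitude(x, max_period):
    """Return {rho: |X[rho]|} for integer periods rho = 1..max_period."""
    mags = {}
    for rho in range(1, max_period + 1):
        acc = sum(
            v * cmath.exp(-2j * cmath.pi * n / rho)
            for n, v in enumerate(x)
            if v  # skip zero samples; they contribute nothing to the sum
        )
        mags[rho] = abs(acc)
    return mags


# Hypothetical test signal: binary impulse train with period 5 and length 100
x = [1 if n % 5 == 0 else 0 for n in range(100)]
mags = ipdft_magnitude(x, 10)

for rho in sorted(mags):
    print(rho, round(mags[rho], 3))
```

Two details are worth noting in the output. The value at ρ = 1 is simply the sum of the samples, so it is usually excluded when searching for a dominant period. The peak at ρ = 5 equals the number of impulses (here 20, that is N/p), which illustrates why the peak height shrinks for longer periods at fixed N.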

Consider now the effect of representing an exactly periodic sequence component $x[n] = x_p[n] * t_p[n]$ using the IPDFT. From (2) and the convolution theorem, $X[\rho] = X_p[\rho]\, T_p[\rho]$, where $T_p[\rho]$ is the IPDFT of the impulse train $t_p[n]$. In particular, if $x_p[n]$ is assumed to be aperiodic, consider the IPDFT of $t_p[n]$:

equation image
(13)

where […]. That is, $|T_p[\rho]|$ is relatively large for values of $\rho$ that divide $p$, and relatively small otherwise. From this, we see that a shortcoming of Fourier transform approaches such as the IPDFT for sequence characterization by periodicity is that they produce not only a peak at $\rho = p$, but also peaks at values of $\rho$ that are integer divisors of the period p (see example in Figure 1(b)). For the DFT, this effect is also seen, but instead for indices whose value is an integer multiple of $N/p$ (i.e., harmonics of the frequency $N/p$ with integer frequency indices).
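The divisor ambiguity can be checked numerically in the same way. The sketch below assumes the IPDFT form X[ρ] = Σ x[n] exp(−j2πn/ρ) and a period-12 impulse train of hypothetical length 240; the half-maximum threshold used to pick peaks is an arbitrary illustrative choice.

```python
import cmath


def impulse_train(p, n_samples):
    """Binary impulse train: 1 where n is a multiple of p, 0 elsewhere."""
    return [1 if n % p == 0 else 0 for n in range(n_samples)]


def ipdft_magnitude(x, rho):
    """|X[rho]| for a single integer period rho."""
    return abs(sum(
        v * cmath.exp(-2j * cmath.pi * n / rho)
        for n, v in enumerate(x)
        if v
    ))


x = impulse_train(12, 240)          # hypothetical length N = 240
periods = range(2, 31)              # rho = 1 (the sample sum) is excluded
mags = {rho: ipdft_magnitude(x, rho) for rho in periods}

top = max(mags.values())
peaks = [rho for rho in periods if mags[rho] > 0.5 * top]
print(peaks)  # the true period and its integer divisors
```

All divisors of 12 greater than 1 reach the same magnitude as the true period, whereas the multiple 24, which is a strong peak in the autocorrelation, is essentially zero here.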

2.4. Periodicity of a Synthetic Sequence Using Autocorrelation and DFT

To illustrate the shortcomings of the autocorrelation and DFT discussed in Sections 2.2 and 2.3, consider the periodicity characterization of an example signal consisting of a period-12 impulse train (i.e., exact monomer periodicity $p = 12$), where […] and […]. The autocorrelation and IPDFT are shown in Figures 1(a) and 1(b), respectively, from which the ambiguities in period estimate discussed in Sections 2.2 and 2.3 can be clearly seen.

3. Hybrid Autocorrelation-IPDFT Periodicity Estimation

3.1. Hybrid Autocorrelation-IPDFT

From Figure 1, it is apparent that the autocorrelation and IPDFT are complementary, and that their combination can improve periodicity estimation. This is the motivation for the hybrid autocorrelation-IPDFT period estimate:

$$H[\rho] = r_{xx}[\rho]\, |X[\rho]|$$
(14)

where $r_{xx}[\rho]$ is the autocorrelation and $|X[\rho]|$ is the IPDFT magnitude. For the simple example signal from Section 2.4, the calculation of $H[\rho]$ results in a single, unambiguous periodicity estimate, as seen in Figure 1(c).
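A minimal sketch of this combination, assuming the hybrid estimate is the pointwise product of the unnormalised autocorrelation and the IPDFT magnitude, is given below for a period-12 impulse train. The signal length of 240, the function names, and the half-maximum peak threshold are hypothetical.

```python
import cmath


def impulse_train(p, n_samples):
    """Binary impulse train: 1 where n is a multiple of p, 0 elsewhere."""
    return [1 if n % p == 0 else 0 for n in range(n_samples)]


def autocorrelation(x, rho):
    """Unnormalised autocorrelation at a single lag rho."""
    return sum(x[n] * x[n + rho] for n in range(len(x) - rho))


def ipdft_magnitude(x, rho):
    """|X[rho]| for a single integer period rho."""
    return abs(sum(
        v * cmath.exp(-2j * cmath.pi * n / rho)
        for n, v in enumerate(x)
        if v
    ))


x = impulse_train(12, 240)
periods = range(2, 31)

# Assumed hybrid: product of autocorrelation and IPDFT magnitude at each rho
hybrid = {rho: autocorrelation(x, rho) * ipdft_magnitude(x, rho) for rho in periods}

top = max(hybrid.values())
peaks = [rho for rho in periods if hybrid[rho] > 0.5 * top]
print(peaks)
```

The product is near zero wherever either factor is near zero: the autocorrelation vanishes at the divisors 2, 3, 4, and 6, and the IPDFT magnitude vanishes at the multiple 24, so only the true period survives.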

An alternative, more flexible formulation is

equation image
(15)

where a weighting parameter $\alpha$ sets the relative contribution of each component, which may be helpful for biologists who have conventionally used either the autocorrelation or the Fourier transform alone (each recovered at one extreme value of $\alpha$). For the purpose of sequence periodicity visualization, for example, $\alpha$ could be represented as a parameter available for real-time control, so that a biologist viewing a periodicity characterization of a sequence might subjectively assign a relative weight to each of the autocorrelation and Fourier transform components. Care is needed, however, with the application of (15), since it is only well defined for $r_{xx}[\rho] \ge 0$ for all $\rho$. Note that this is satisfied by the autocorrelation defined in (8), in addition to a number of DNA numerical representations (several example representations are discussed in [30]). It is further noted that (14) and (15) do not have a straightforward physical interpretation, in contrast to the autocorrelation $r_{xx}[\rho]$ and the IPDFT magnitude $|X[\rho]|$.

Applying the hybrid autocorrelation-IPDFT period estimate to another example, a synthetic signal with multiple exact periodic components ($p = 7$, 10, and 12), further illustrates the shortcomings of the autocorrelation and IPDFT, and suggests the hybrid approach as suitable for periodicity analyses, as seen in Figure 2.

Figure 2
Periodicity characterization of a period-7, 10 and 12 synthetic signal using (a) autocorrelation, (b) integer period DFT, and (c) hybrid autocorrelation-IPDFT.

3.2. Evaluation of Periodicity Estimation in Noise

In the absence of an obvious objective evaluation metric for periodicity characterization approaches, one limited approach is to compare their accuracies for the problem of estimating a single periodic component that has been obscured by noise. Specifically, suppose a periodic binary impulse train $t_p[n]$ is degraded by random binary noise, simulating the effect of the DNA substitution process, to produce a binary pseudo-periodic signal $x[n]$. Then estimates of the signal periodicity using each of the autocorrelation, integer period DFT and hybrid autocorrelation-IPDFT can be calculated, respectively, as

$$\hat{p}_r = \arg\max_{\rho}\, r_{xx}[\rho], \qquad \hat{p}_X = \arg\max_{\rho}\, |X[\rho]|, \qquad \hat{p}_H = \arg\max_{\rho}\, H[\rho]$$
(16)

where the hybrid estimate $H[\rho]$ is calculated using (14) throughout both this section and Section 4.

A comparison of the periodicity estimates was conducted by generating synthetic periodic signals of length $N = 10000$, introducing various amounts of substitution (noise) and estimating the period using the autocorrelation ($\hat{p}_r$), the IPDFT ($\hat{p}_X$), and the hybrid autocorrelation-IPDFT ($\hat{p}_H$). This process was repeated 100 times for each combination of period and substitution rate tested. The resulting average period error rates are shown as a function of substitution rate for three example values of period p in Figure 3 (p small, p larger and prime, and p larger and highly composite), and as a function of the period in Figure 4. These results confirm earlier observations that the IPDFT provides more robust period estimates for prime periods than the autocorrelation, while the reverse is true for highly composite periods. The results also show that the hybrid technique is often able to provide a lower period error rate than either the autocorrelation or the IPDFT. Exceptions to this occur for some prime periods (see Figure 4), where the poorer performance of the autocorrelation seems to slightly adversely affect the hybrid estimate $\hat{p}_H$ relative to the IPDFT-only estimate $\hat{p}_X$.
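The structure of such an experiment can be sketched as below. This is not the code used to produce the reported results: the substitution model (an impulse is removed with probability `rate`, and a spurious impulse is inserted with probability `rate / 3`, mimicking substitution among four bases), the search range starting at ρ = 2, the product form of the hybrid, and the much smaller signal length and trial count are all assumptions chosen to keep the example short.

```python
import cmath
import random


def noisy_impulse_train(p, n_samples, rate, rng):
    """Period-p impulse train degraded by an assumed substitution model."""
    x = []
    for n in range(n_samples):
        bit = 1 if n % p == 0 else 0
        u = rng.random()
        if bit == 1 and u < rate:
            bit = 0                 # the periodic symbol was substituted away
        elif bit == 0 and u < rate / 3:
            bit = 1                 # another symbol was substituted into it
        x.append(bit)
    return x


def estimates(x, max_period):
    """Return (autocorrelation, IPDFT, hybrid) period estimates via argmax."""
    ones = [n for n, v in enumerate(x) if v]
    one_set = set(ones)
    r, mag, hyb = {}, {}, {}
    for rho in range(2, max_period + 1):
        # For a binary signal the autocorrelation is a count of coincident pairs
        r[rho] = sum(1 for n in ones if n + rho in one_set)
        mag[rho] = abs(sum(cmath.exp(-2j * cmath.pi * n / rho) for n in ones))
        hyb[rho] = r[rho] * mag[rho]
    return tuple(max(d, key=d.get) for d in (r, mag, hyb))


def error_rates(p, n_samples, rate, trials, max_period, seed=0):
    """Fraction of trials in which each method misses the true period p."""
    rng = random.Random(seed)
    errors = [0, 0, 0]
    for _ in range(trials):
        est = estimates(noisy_impulse_train(p, n_samples, rate, rng), max_period)
        for i, e in enumerate(est):
            errors[i] += (e != p)
    return [e / trials for e in errors]


# Noise-free sanity check, then a small noisy run (hypothetical settings)
clean = estimates(noisy_impulse_train(12, 1200, 0.0, random.Random(0)), 40)
rates = error_rates(p=12, n_samples=1200, rate=0.3, trials=5, max_period=40)
print(clean, rates)
```

In the noise-free case the IPDFT estimate is a tie between the true period and its divisors, so which of them the argmax returns depends on floating-point rounding; the autocorrelation and hybrid estimates are unambiguous.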

Figure 3
Error rate versus substitutions averaged over 100 instances of sequences of length 10000 with (a) small p, (b) larger, prime p, and (c) larger, highly composite p, for period estimates using autocorrelation, integer period DFT, and hybrid autocorrelation-IPDFT (solid line).

Figure 4
Error rate versus period averaged over 100 instances of sequences of length 10000 with a substitution rate of 30%, for period estimates using autocorrelation, integer period DFT, and hybrid autocorrelation-IPDFT (solid line).

3.3. Evaluation of Multiple Periodicity Estimation

For periodicity characterization, a more relevant evaluation criterion is the extent to which all periodicities present can be detected correctly. Since an exhaustive evaluation is impractical, in this work, synthetic sequences comprising three randomly chosen integer periodic components ($p_1$, $p_2$, and $p_3$) were constructed, and the frequency with which all three periods were correctly detected was measured. When multiple perfectly periodic components are present in a binary signal, the shorter periods will be favoured during estimation, as a result of their greater occurrence in a fixed-length signal. Hence, when combining three periodic components, the shorter period components were randomly eroded to give an equal occurrence between all periods. In the general case of multiple periodicities, some periodic components will be stronger than others. To simulate this, the $p_2$-periodic component was further randomly eroded by […] % and the $p_3$-periodic component was further randomly eroded by […] %, that is, larger values of the erosion parameter correspond to a more dominant $p_1$ component. Erosions of greater than about 20% were experimentally found to degrade the accuracy of all three period estimates, using all methods. Finally, the percentage of instances for which the periods $p_1$, $p_2$, and $p_3$ were correctly estimated in correct order of strength according to the 3-best period estimates, calculated similarly to equations (16), was determined. The results, shown in Figure 5, strongly support the validity of the proposed hybrid autocorrelation-IPDFT technique relative to the autocorrelation and IPDFT.

Figure 5
Percentage of sequence instances for which all three periods were correctly estimated in order of strength versus erosion, over 500 instances of sequences of length 10000 with three randomly chosen integer periodic components, estimated using autocorrelation ...

It is noted that the signal processing literature includes examples of methods for detecting multiple periodic signal components, such as the MUSIC algorithm [31]. For comparative purposes, the above experiment was repeated employing MUSIC to estimate the strengths of the periodic components. Results indicated that MUSIC was unable to consistently estimate either the periods or the relative strengths of the three components, returning no instances of all three periods correct and in the correct order. The dominant period estimate often contained the common factors of two or more of the true periodic components, an artifact attributable to the superposition of harmonic spectra reinforcing multiples of the individual component fundamentals that coincide in frequency. Two assumptions of MUSIC are not valid for this application: (i) the periodic components are not sinusoidal (although they can be represented as a harmonic series of sinusoids), (ii) the periodic components and noise may not be uncorrelated.

4. Application to DNA Sequence Data

Having discussed the differences between the autocorrelation and DFT for synthetic sequences, we now investigate the effect of using the IPDFT and hybrid autocorrelation-IPDFT in place of the autocorrelation on real sequence data. Numerous researchers have used autocorrelation [1, 5–10, 32]; here we compare with examples from the study of tetramer periodicity in the C. elegans genome using autocorrelation by Kumar et al. [1].

In the investigation of TATA tetramers, particular mention was made of the strong period-2 component [1], which features prominently in estimates by all three techniques, as seen in Figure 6. In the autocorrelation estimate (Figure 6(a)), the period-10 component appears to have been virtually completely masked by the period-2 component. In contrast, the period-10 component features strongly in the IPDFT (Figure 6(b)) and hybrid (Figure 6(c)) estimates. Although this period-10 component was not mentioned in the analysis of TATA tetramers specifically, it was found to be characteristic of all other C. elegans tetramers analyzed in [1].

Figure 6
(a) Autocorrelation from [1], (b) integer period DFT magnitude, and (c) hybrid autocorrelation-IPDFT of TATA tetramers from C. elegans chromosome I.

Note also that the IPDFT reveals a strong period-25 component, not at all evident in the autocorrelation. This surprising result was verified by constructing a synthetic sequence with perfect periodic components at $p = 2$ and $p = 25$, and examining its autocorrelation and IPDFT. The autocorrelation of the sequence did not display visually any significant peak at $\rho = 25$ until the period-2 component had been eroded by at least 80%. In contrast, the IPDFT showed a clear peak at $\rho = 25$ with no period-2 erosion at all. The period-25 component has rarely been noted in previous literature; however, in [11], a filtered distribution of distances between TA dinucleotides shows a strong peak at 25 bp, which Salih et al. attribute to a 5-base periodicity associated with the period-10 consensus sequence structure for C. elegans.

In the investigation of TGCC tetramers (see Figure 7), the periodic components at 8 and 35 bp were noted in [1]. The proposed hybrid technique also produces peaks at these periods (mainly due to the autocorrelation in this instance); however, it additionally finds period-12 and period-39 components. Note that the IPDFT produces a strong peak at a 6 bp period (presumably because 6 is an integer divisor of 12); however, in the hybrid result, this is effectively suppressed by the autocorrelation.

Figure 7
(a) Autocorrelation from [1], (b) integer period DFT magnitude, and (c) hybrid autocorrelation-IPDFT of TGCC tetramers from C. elegans chromosome I.

In [1], mention is made of the period-10 and period-11 behaviour of AGAA tetramers. As seen in Figure 8, the autocorrelation finds a dominant peak at 9 bp, while the hybrid technique is more convincing in revealing period-10 behaviour. Note that, as previously, the period-5 IPDFT component (presumably due to the 10 bp periodicity) is effectively attenuated in the hybrid result.

Figure 8
(a) Autocorrelation from [1], (b) integer period DFT magnitude, and (c) hybrid autocorrelation-IPDFT of AGAA tetramers from C. elegans chromosome I.

In the investigation of WWWW tetramers (where W represents either A or T), the autocorrelation (Figure 9(a)), as in [1], is dominated by the period-10 component. A very similar characteristic is observed in the distribution of distances between TT to TT dinucleotides in [11], and in the distribution of AAAA to AAAA tetramer distances in [33], suggesting a strong influence by these motifs. While the dominance of the period-10 component is similar for the IPDFT, it also detects a relatively strong period-25 component, perhaps due to TA dinucleotide periodicity, as discussed above for TATA tetramers. In this example, the hybrid autocorrelation-IPDFT result is biased towards the IPDFT, as a result of the IPDFT having a larger dynamic range than the autocorrelation. Here, the bias is not detrimental, since it suppresses the spurious peaks at periods 20, 30, and 40; however, in other applications it may be desirable to offset the autocorrelation and/or IPDFT to produce a minimum value of zero prior to calculating the hybrid autocorrelation-IPDFT period estimate.

Figure 9
(a) Autocorrelation from [1], (b) integer period DFT magnitude, and (c) hybrid autocorrelation-IPDFT of WWWW tetramers from C. elegans chromosome I.

5. Conclusion

This paper has made two contributions to the periodicity characterization of sequence data. Firstly, the origins of ambiguities in period estimates for symbolic sequences, due to multiples or submultiples of the true period in the autocorrelation and Fourier transform methods, respectively, were explained. This is significant because these two methods account for perhaps the majority of the periodicity analysis seen in the biology literature, and yet, to the author's knowledge, their limitations have not been discussed in this context. Secondly, a hybrid autocorrelation-IPDFT technique for periodicity characterization of sequences has been proposed. On synthetic sequence data, this technique has been shown to provide improved accuracy relative to the autocorrelation and IPDFT, both for period estimation in noise and for multiple-periodicity estimation. Comparative results from a preliminary investigation of tetramers in C. elegans chromosome I suggest that the proposed approach yields estimates that are consistently less prone to attributing significance to integer multiples or divisors of the true period(s). Thus, the hybrid autocorrelation-IPDFT is putatively advanced as a useful tool for biologists in their quest to reveal and explain structure within biological sequences. Future work will include studies of different types of periodicity in sequence data from other organisms, using IPDFT-based and hybrid techniques.

Acknowledgments

The author would like to thank two anonymous reviewers for a number of helpful suggestions, which have certainly improved the quality of this paper. Thanks are also due to Professor Eliathamby Ambikairajah for helpful discussions. This research was supported by a University of New South Wales Faculty of Engineering Early Career Research Grant for genomic signal processing, 2009.

References

  • Kumar L, Futschik M, Herzel H. DNA motifs and sequence periodicities. In Silico Biology. 2006;6(1-2):71–78.
  • Trifonov EN. 3-, 10.5-, 200- and 400-base periodicities in genome sequences. Physica A. 1998;249(1–4):511–516.
  • Muresan DD, Parks TW. Orthogonal, exactly periodic subspace decomposition. IEEE Transactions on Signal Processing. 2003;51(9):2270–2279. doi: 10.1109/TSP.2003.815381.
  • Santo E, Dimitrova N. Improvement of spectral analysis as a genomic analysis tool. Proceedings of the 5th IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '07), Tuusula, Finland, June 2007.
  • Bernaola-Galván P, Carpena P, Román-Roldán R, Oliver JL. Study of statistical correlations in DNA sequences. Gene. 2002;300(1-2):105–115. doi: 10.1016/S0378-1119(02)01037-5.
  • Chakravarthy N, Spanias A, Iasemidis LD, Tsakalis K. Autoregressive modeling and feature analysis of DNA sequences. EURASIP Journal on Applied Signal Processing. 2004;2004(1):13–28. doi: 10.1155/S111086570430925X.
  • Herzel H, Trifonov EN, Weiss O, Große I. Interpreting correlations in biosequences. Physica A. 1998;249(1–4):449–459.
  • Li W. The study of correlation structures of DNA sequences: a critical review. Computers and Chemistry. 1997;21(4):257–271. doi: 10.1016/S0097-8485(97)00022-3.
  • McLachlan AD. Multichannel Fourier analysis of patterns in protein sequences. The Journal of Physical Chemistry. 1993;97(12):3000–3006. doi: 10.1021/j100114a028.
  • Peng C-K, Buldyrev SV, Goldberger AL, et al. Long-range correlations in nucleotide sequences. Nature. 1992;356(6365):168–170. doi: 10.1038/356168a0.
  • Salih F, Salih B, Trifonov EN. Sequence structure of hidden 10.4-base repeat in the nucleosomes of C. elegans. Journal of Biomolecular Structure and Dynamics. 2008;26(3):273–281.
  • Afreixo V, Ferreira PJSG, Santos D. Fourier analysis of symbolic data: a brief review. Digital Signal Processing. 2004;14(6):523–530. doi: 10.1016/j.dsp.2004.08.001.
  • Anastassiou D. Genomic signal processing. IEEE Signal Processing Magazine. 2001;18(4):8–20. doi: 10.1109/79.939833.
  • Berger JA, Mitra SK, Astola J. Power spectrum analysis for DNA sequences. Proceedings of the 7th International Symposium on Signal Processing and Its Applications (ISSPA '03), Paris, France, July 2003. pp. 29–32.
  • Coward E. Equivalence of two Fourier methods for biological sequences. Journal of Mathematical Biology. 1997;36(1):64–70. doi: 10.1007/s002850050090.
  • Datta S, Asif A. A fast DFT based gene prediction algorithm for identification of protein coding regions. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '05), Philadelphia, Pa, USA, March 2005. pp. 653–656.
  • Dodin G, Vandergheynst P, Levoir P, Cordier C, Marcourt L. Fourier and wavelet transform analysis, a tool for visualizing regular patterns in DNA sequences. Journal of Theoretical Biology. 2000;206(3):323–326. doi: 10.1006/jtbi.2000.2127.
  • Emanuele VA, II, Tran TT, Zhou GT. A Fourier product method for detecting approximate tandem repeats in DNA. Proceedings of the 13th IEEE/SP Workshop on Statistical Signal Processing (SSP '05), Bordeaux, France, July 2005. pp. 1390–1395.
  • Epps J, Ambikairajah E, Akhtar M. An integer period DFT for biological sequence processing. Proceedings of the 6th IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '08), Phoenix, Ariz, USA, June 2008. pp. 1–4.
  • Issac B, Singh H, Kaur H, Raghava GPS. Locating probable genes using Fourier transform approach. Bioinformatics. 2002;18(1):196–197. doi: 10.1093/bioinformatics/18.1.196.
  • Makeev VJu, Tumanyan VG. Search of periodicities in primary structure of biopolymers: a general Fourier approach. Computer Applications in the Biosciences. 1996;12(1):49–54.
  • Silverman BD, Linsker R. A measure of DNA periodicity. Journal of Theoretical Biology. 1986;118(3):295–300. doi: 10.1016/S0022-5193(86)80060-1.
  • Tiwari S, Ramachandran S, Bhattacharya A, Bhattacharya S, Ramaswamy R. Prediction of probable genes by Fourier analysis of genomic sequences. Computer Applications in the Biosciences. 1997;13(3):263–270.
  • Wang W, Johnson DH. Computing linear transforms of symbolic signals. IEEE Transactions on Signal Processing. 2002;50(3):628–634. doi: 10.1109/78.984752.
  • Hosid S, Trifonov EN, Bolshoy A. Sequence periodicity of Escherichia coli is concentrated in intergenic regions. BMC Molecular Biology. 2004;5, article 14:1–7.
  • Worning P, Jensen LJ, Nelson KE, Brunak S, Ussery DW. Structural analysis of DNA sequence: evidence for lateral gene transfer in Thermotoga maritima. Nucleic Acids Research. 2000;28(3):706–709. doi: 10.1093/nar/28.3.706.
  • Voss RF. Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Physical Review Letters. 1992;68(25):3805–3808. doi: 10.1103/PhysRevLett.68.3805.
  • Sethares WA, Staley TW. Periodicity transforms. IEEE Transactions on Signal Processing. 1999;47(11):2953–2964. doi: 10.1109/78.796431.
  • Arora R, Sethares WA. Detection of periodicities in gene sequences: a maximum likelihood approach. Proceedings of the 5th IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '07), Tuusula, Finland, June 2007.
  • Akhtar M, Epps J, Ambikairajah E. Signal processing in sequence analysis: advances in eukaryotic gene prediction. IEEE Journal on Selected Topics in Signal Processing. 2008;2(3):310–321.
  • Schmidt RO. Multiple emitter location and signal parameter estimation. IEEE Transactions on Antennas and Propagation. 1986;34(3):276–280. doi: 10.1109/TAP.1986.1143830.
  • Li W, Marr TG, Kaneko K. Understanding long-range correlations in DNA sequences. Physica D. 1994;75(1–3):392–416.
  • Fire A, Alcazar R, Tan F. Unusual DNA structures associated with germline genetic activity in Caenorhabditis elegans. Genetics. 2006;173(3):1259–1273. doi: 10.1534/genetics.106.057364.

Articles from EURASIP Journal on Bioinformatics and Systems Biology are provided here courtesy of BioMed Central