
# Transcriptional landscape estimation from tiling array data using a model of signal shift and drift

Pierre Nicolas^{1,*}, Aurélie Leduc^{1}, Stéphane Robin^{2}, Simon Rasmussen^{3}, Hanne Jarmer^{3} and Philippe Bessières^{1}

^{1}INRA, Mathématique Informatique et Génome UR1077, 78350 Jouy-en-Josas,

^{2}AgroParisTech/INRA, Mathématiques et Informatique Appliquées UMR518, 16 rue Claude Bernard, 75005 Paris, France and

^{3}Technical University of Denmark, Center for Biological Sequence analysis, Building 208, 2800 Lyngby, Denmark

## Abstract

**Motivation:** High-density oligonucleotide tiling array technology holds the promise of a better description of the complexity and the dynamics of transcriptional landscapes. In organisms such as bacteria and yeasts, transcription can be measured on a genome-wide scale with a resolution better than 25 bp. The statistical models currently used to handle these data remain, however, very simple, the most popular being the piecewise constant Gaussian model with a fixed number of breakpoints.

**Results:** This article describes a new methodology based on a hidden Markov model that embeds the segmentation of a continuous-valued signal in a probabilistic setting. For a computationally affordable cost, this framework (i) alleviates the difficulty of choosing a fixed number of breakpoints, and (ii) permits retrieving more information than a unique segmentation by giving access to the whole probability distribution of the transcription profile. Importantly, the model is also enriched and accounts for subtle effects such as signal ‘drift’ and covariates. Relevance of this framework is demonstrated on a *Bacillus subtilis* dataset.

**Availability:** The software is distributed under the GPL.

**Contact:** pierre.nicolas@jouy.inra.fr

**Supplementary information:** Supplementary data is available at *Bioinformatics* online.

## 1 INTRODUCTION

High-density oligonucleotide tiling arrays carry tightly spaced probes that provide uniform coverage of the genomic sequence. By hybridization with RNA samples (cDNA), they have been used to query the transcriptional activity of the whole genome in an array of model organisms (Bertone *et al*., 2004; Biemar *et al*., 2006; He *et al*., 2007; Stolc *et al*., 2005). The approach is particularly attractive for organisms with small genomes such as bacteria and yeasts, where a resolution better than 25 bp is more easily achieved (David *et al*., 2006; S.Rasmussen *et al*., submitted for publication). The generalization of the use of such arrays should provide unbiased and high-quality pictures of the complexity and the dynamics of transcriptional landscapes (Xu *et al*., 2009). The great promise of these data justifies the improvement of the currently available statistical methods dedicated to their analysis.

From the methodological standpoint, the problem is naturally stated in terms of finding segments where the hybridization signal is relatively constant, delimited by breakpoints that are expected to correspond to biological features such as transcript start and stop sites or splicing sites. A variety of tools including local non-parametric smoothing (Royce *et al*., 2007; Wang *et al*., 2009) and iterative hypothesis testing (Olshen *et al*., 2004) have been proposed to answer this question. Probably the most popular and best mathematically grounded methodology consists of seeking the piecewise constant model with Gaussian noise that best fits the signal (Huber *et al*., 2006; Picard *et al*., 2005). Namely, for a fixed number of segments *S*, fitting the model consists of finding the combination of breakpoints 1 < *t*_{1} ≤ ··· ≤ *t*_{S−1} ≤ *n* that minimizes the sum of squared residuals:

$$\sum_{s=1}^{S} \sum_{k=t_{s-1}}^{t_{s}-1} \left(x_{k} - \bar{x}_{s}\right)^{2} \qquad (1)$$

where *x*_{k} is the signal at position *k*, $\bar{x}_{s}$ is the average signal level in segment *s* (i.e. between *t*_{s−1} and *t*_{s} − 1), *t*_{0} = 1 and *t*_{S} = *n* + 1. In full generality, minimizing the sum of squared residuals in Equation (1) can be achieved by dynamic programming and requires time *O*(*n*^{2}*S*). Huber *et al*. (2006) fixed an upper bound *l* on the maximum length of each segment to reduce the time complexity to *O*(*nlS*) with *l* < *n*. The problem of choosing the correct number of segments *S* was more specifically examined by Picard *et al*. (2005), but visual assessment and use of prior belief have also been advocated (Huber *et al*., 2006) and have been useful in practice (David *et al*., 2006; S.Rasmussen *et al*., submitted for publication).
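To make the recursion concrete, the following Python sketch solves the minimization by dynamic programming in *O*(*n*^{2}*S*) time. It is an illustration only and not the implementation of Huber *et al*. (2006) or Picard *et al*. (2005); the function name `segment` and the 0-based indexing are assumptions of this sketch (the text uses 1-based positions).

```python
def segment(x, S):
    """Least-squares piecewise constant fit of x with exactly S segments.

    Returns (residual sum of squares, list of 0-based start indices of
    segments 2..S). Illustrative sketch, O(n^2 S) time.
    """
    n = len(x)
    # Prefix sums give the residual sum of squares of any interval in O(1).
    c1, c2 = [0.0], [0.0]
    for v in x:
        c1.append(c1[-1] + v)
        c2.append(c2[-1] + v * v)

    def rss(i, j):
        # Squared residuals of x[i:j] around its own mean (j > i).
        s = c1[j] - c1[i]
        return (c2[j] - c2[i]) - s * s / (j - i)

    inf = float("inf")
    # cost[s][j]: best cost for the first j points split into s segments.
    cost = [[inf] * (n + 1) for _ in range(S + 1)]
    back = [[0] * (n + 1) for _ in range(S + 1)]
    cost[0][0] = 0.0
    for s in range(1, S + 1):
        for j in range(s, n + 1):
            for i in range(s - 1, j):
                c = cost[s - 1][i] + rss(i, j)
                if c < cost[s][j]:
                    cost[s][j] = c
                    back[s][j] = i

    # Trace back the start index of each segment, last segment first.
    starts = []
    j = n
    for s in range(S, 0, -1):
        j = back[s][j]
        starts.append(j)
    starts.reverse()
    return cost[S][n], starts[1:]
```

For example, `segment([0, 0, 0, 5, 5, 5, 1, 1], 3)` places the two breakpoints at 0-based positions 3 and 6 with a zero residual sum of squares. Restricting the inner loop to segments of length at most *l* would give the *O*(*nlS*) variant.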

The simplicity of this approach is appealing but hides a number of difficulties, the most important being the choice of the number of segments. In principle, this issue can be tackled by embedding the segmentation model in a probabilistic setting that includes not only the noise but also the evolution of the signal. This idea stimulated the development of hidden Markov models (HMMs) (Fridlyand *et al*., 2004; Marioni *et al*., 2006; Stjernqvist *et al*., 2007) for the analysis of comparative genomic hybridization data. For transcriptomic data, a different approach consists of training HMMs to distinguish between transcribed and non-transcribed regions (Du *et al*., 2006; Munch *et al*., 2006). When the quality of the data is good enough, it is both more natural and more ambitious to try to recover the ‘denoised’ transcription signal instead of directly summarizing the data via a classification algorithm. Transcript level is, however, a continuous quantity and none of the available models is satisfactory for a continuous-valued underlying signal. An HMM that achieves this aim at a computationally affordable cost is described in the present article. The proposed model also extends the piecewise constant model in two directions. First, it integrates the influence of covariates that serve to account for differential affinity between probes. This makes it possible to achieve segmentation and within-array normalization in one step. Second, the proposed model relaxes the assumption of strictly constant transcript levels between abrupt ‘shifts’ by also allowing progressive ‘drift’ of the signal. Inference based on this model is examined and discussed.

## 2 METHODS

### 2.1 Experimental data

The main example dataset used here comes from pilot experiments conducted on *Bacillus subtilis* within the European Consortium BaSysBio (S.Rasmussen *et al*., submitted for publication). This array consists of 383 149 probes starting every 22 nt on each strand of the *B.subtilis* genome (GenBank: AL009126). Probe lengths range between 45 nt and 65 nt and were adjusted to reduce melting temperature (TM) variations (isothermal design). Production of the tiling arrays, synthesis of labeled cDNA from the RNA samples with random priming, hybridization and signal acquisition were carried out by Nimblegen. Antisense artifacts were controlled by using actinomycin D during reverse transcription (Perocchi *et al*., 2007). RNA was extracted from *B.subtilis* culture during exponential growth on rich medium. One out of four biological replicates gave a high-quality signal and is analyzed here (S.Rasmussen *et al*., submitted for publication). For comparison with the algorithm of Huber *et al*. (2006), we also analyzed a dataset corresponding to chromosome 1 of the yeast *Saccharomyces cerevisiae* (David *et al*., 2006). This second array was produced by Affymetrix and uses shorter oligonucleotides (25 nt) tiled at intervals of 8 nt on each strand. The data from the three biological replicates were averaged after quantile normalization.

Both experimental settings included hybridization of genomic DNA (gDNA) preparations to assess variation of affinity between probes (four replicates for *B.subtilis* and three replicates for *S.cerevisiae*). Data were averaged across replicates after quantile normalization. *Bacillus subtilis* gDNA data varied smoothly between the replication origin and the replication terminus, presumably reflecting the chromosome dosage. Taking the residuals after median smoothing (window size 110 011 bp) removed this trend. For *S.cerevisiae* data, we preferred to compute the residuals as the distance to the mode rather than to the median to account for the highly skewed distribution of probe affinities. The formatted datasets are distributed with the software.
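The detrending step can be sketched in a few lines of Python. This is a naive illustration, not the preprocessing code distributed with the software: the function name is hypothetical, the window is expressed as a number of probes (the text gives it in bp), the window is simply truncated at both ends of the series rather than wrapped around a circular chromosome, and a faster running-median algorithm would be preferable for a full array.

```python
from statistics import median


def median_residuals(values, window):
    """Residuals of `values` after running-median smoothing.

    `window` is an odd number of consecutive probes (assumption of this
    sketch); it is truncated at both ends of the series.
    """
    half = window // 2
    residuals = []
    for i, v in enumerate(values):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        # Subtract the local median to remove the smooth trend.
        residuals.append(v - median(values[lo:hi]))
    return residuals
```

On a purely linear trend such as `[1, 2, 3, 4, 5]` with `window=3`, the interior residuals are zero and only the truncated edge windows leave a small offset.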

### 2.2 Shift and drift in an HMM framework

As in previous approaches (Huber *et al*., 2006; Olshen *et al*., 2004; Picard *et al*., 2005), the log_{2} of the observed intensity *x*_{t} is modeled as the sum of an unobservable signal *u*_{t} that is the focus of interest plus a Gaussian noise with SD σ. This general model can be written as:

$$x_{t} \mid u_{t} \sim \mathcal{N}\left(u_{t}, \sigma^{2}\right) \qquad (2)$$

However, *u*_{t} is not seen in our model as a parameter but is itself a random variable. Correlation between probes that are adjacent on the chromosome is accounted for by a Markov transition kernel π(*u*_{t}, *u*_{t+1}) and (*x*_{t}, *u*_{t})_{1≤t≤n} is thus said to be an HMM (Durbin *et al*., 1998; Rabiner, 1989). Compared with traditional use of HMMs, the complication comes from the continuous nature of *u*_{t}, whereas the efficient algorithmic machinery of HMMs (Viterbi algorithm, forward–backward algorithm, expectation-maximization (EM) algorithm) works well for a discrete and typically small number of hidden states (Rabiner, 1989). In general, with *K* hidden states, the time complexity of the algorithms is *O*(*nK*^{2}).

Here, we propose a structure of the transition matrix π(*u*_{t}, *u*_{t+1}) that accounts for abrupt shifts and progressive drifts in the unobservable signal *u*_{t} and makes it possible to discretize the continuous range *U*_{min} ≤ *u*_{t} ≤ *U*_{max} into *K* points spaced by a regular interval, *h* = (*U*_{max} − *U*_{min})/(*K* − 1). This particular structure ensures a time complexity of *O*(*nK*) for the classical HMM algorithms and thus permits an appropriately high resolution of discretization.

For values of *u*_{t} and *u*_{t+1} taken in the discretized hidden state space, the transition probability is written as

where the parameters satisfy 0 ≤ α_{n}, α_{s}, α_{u}, α_{d} ≤ 1, α_{n}+α_{s}+α_{u}+α_{d} = 1 and 0 ≤ λ_{u}, λ_{d} < 1, with 𝕀_{{X}} standing for 1 if *X* is true, 0 otherwise.

This transition kernel is best understood as a mixture of four types of moves with weights α_{n}, α_{s}, α_{u} and α_{d}. The parameter α_{n} accounts for unchanged *u* between successive probes. Shift moves have probability α_{s} and the distribution of the signal after the move is independent of the value of the signal before the move. This distribution is given by η_{h} and it approximates the marginal distribution of the signal. Namely, η_{h}(*u*_{t+1}) = ∫_{*u*_{t+1}−*h*/2}^{*u*_{t+1}+*h*/2} η(*u*)d*u*, where η is the kernel density estimate computed on *x* with a Gaussian kernel and Scott's bandwidth (Scott, 1992). The possibility of small drift, either upward or downward, is accounted for by α_{u} and α_{d}. Drift amplitudes are modeled by two geometric distributions of parameters λ_{u} and λ_{d}, and the average amplitudes are *h*+*h*/(1 − λ).
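The four-component mixture can be sketched as follows. This is an illustrative reading of the verbal description and not the authors' exact kernel: a drift of *k* ≥ 1 grid steps is given probability (1 − λ)λ^{k−1}, and the drift mass that would leave the grid is folded onto the 'no move' component so that each row sums to one. Both conventions, the function name and the argument names are assumptions of this sketch; `eta_h` is taken as an already computed probability vector over the *K* grid points.

```python
def transition_matrix(eta_h, a_n, a_s, a_u, a_d, lam_u, lam_d):
    """Dense K x K transition matrix for a shift/drift mixture kernel.

    Assumed conventions (not from the article): geometric drift on
    k >= 1 grid steps, out-of-grid drift mass folded onto 'no move'.
    """
    K = len(eta_h)
    P = []
    for i in range(K):
        # Shift: the destination is drawn from eta_h, whatever the origin.
        row = [a_s * eta_h[j] for j in range(K)]
        # Upward drift of j - i grid steps.
        for j in range(i + 1, K):
            row[j] += a_u * (1 - lam_u) * lam_u ** (j - i - 1)
        # Downward drift of i - j grid steps.
        for j in range(i):
            row[j] += a_d * (1 - lam_d) * lam_d ** (i - j - 1)
        # No move, plus the drift mass that would fall outside the grid.
        row[i] += a_n + a_u * lam_u ** (K - 1 - i) + a_d * lam_d ** i
        P.append(row)
    return P
```

The dense matrix is only useful for inspection on small *K*; the point of the structure is that it never needs to be built explicitly.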

It can be verified that as *h* → 0 and *h*/(1 − λ) → γ the transition kernel of the discrete-valued Markov chain of Equation (3) converges in distribution toward the transition kernel of a continuous-valued Markov chain. In its continuous version, the kernel is a mixture of a point mass at *u*_{t} of weight α_{n}, a continuous-valued distribution of density η and weight α_{s}, and two shifted exponential distributions of rates γ_{u} and γ_{d} and weights α_{u} and α_{d}. With an appropriately high *K* it should thus be possible to approach, using the discrete-valued model of Equation (3), the results that one would obtain with the continuous-valued model.

The Supplementary Material available online gives a detailed presentation of the equations that allow *O*(*nK*) implementations of the HMM classical algorithms, namely:

- likelihood computation (ℙ(*x*_{1…n})),
- forward–backward algorithm (computation of ℙ(*u*_{t}|*x*_{1…n}) for each *t*),
- Viterbi algorithm (finding the trajectory *u*_{1…n} that maximizes ℙ(*u*_{1…n}|*x*_{1…n})).
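The following Python sketch is not taken from the Supplementary Material; it only illustrates why a mixture of 'no move', shift and geometric drift components allows one forward step in *O*(*K*) rather than *O*(*K*^{2}). The drift sums obey a one-pass recursion, and the shift term depends on the previous state only through the total mass. The function names, the folding of out-of-grid drift mass onto 'no move', the use of η_{h} as initial distribution and the constant-σ Gaussian emission are all assumptions of this sketch.

```python
import math


def forward_step(f, emis, eta_h, a_n, a_s, a_u, a_d, lam_u, lam_d):
    """One forward-recursion step in O(K).

    f: forward probabilities over the K states at position t.
    emis: emission densities of observation t+1 under each state.
    """
    K = len(f)
    total = sum(f)
    # up[j] = sum_{i<j} f[i] (1 - lam_u) lam_u^(j-i-1), by recursion on j.
    up = [0.0] * K
    for j in range(1, K):
        up[j] = lam_u * up[j - 1] + (1 - lam_u) * f[j - 1]
    # down[j] = sum_{i>j} f[i] (1 - lam_d) lam_d^(i-j-1), by recursion on j.
    down = [0.0] * K
    for j in range(K - 2, -1, -1):
        down[j] = lam_d * down[j + 1] + (1 - lam_d) * f[j + 1]
    out = []
    for j in range(K):
        # 'No move' weight, including drift mass folded back from the edges.
        stay = a_n + a_u * lam_u ** (K - 1 - j) + a_d * lam_d ** j
        prior = (stay * f[j] + a_s * eta_h[j] * total
                 + a_u * up[j] + a_d * down[j])
        out.append(prior * emis[j])
    return out


def log_likelihood(x, grid, eta_h, sigma, a_n, a_s, a_u, a_d, lam_u, lam_d):
    """Log-likelihood of x in O(nK), with rescaling at each position."""
    def emis(obs):
        norm = sigma * math.sqrt(2.0 * math.pi)
        return [math.exp(-0.5 * ((obs - u) / sigma) ** 2) / norm for u in grid]

    # Assumed initial distribution: eta_h.
    f = [p * e for p, e in zip(eta_h, emis(x[0]))]
    ll = 0.0
    for obs in x[1:]:
        c = sum(f)
        ll += math.log(c)
        f = forward_step([v / c for v in f], emis(obs), eta_h,
                         a_n, a_s, a_u, a_d, lam_u, lam_d)
    return ll + math.log(sum(f))
```

The backward and Viterbi recursions can be organized along the same lines, replacing sums by maxima in the Viterbi case.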

These algorithms are implemented in our software. All the parameters are estimated in the maximum likelihood (ML) framework with the EM algorithm, an iterative algorithm that alternates an E-step (forward–backward algorithm) and an M-step (parameter update). The output provides a detailed report on the ‘denoised’ signal based on the results of the Viterbi and forward–backward algorithms.

### 2.3 gDNA signal as a covariate

gDNA hybridization data were used in a preprocessing step by Huber *et al*. (2006) for the purpose of between-probe signal normalization and outlier trimming. The model proposed here accounts for these effects by modeling the gDNA hybridization intensities as a covariate.

The probability distribution for the observed variable *x*_{t} given the underlying signal *u*_{t} and the gDNA residuals *r*_{t} is written as a mixture model

$$x_{t} \mid u_{t}, r_{t} \sim \epsilon(r_{t})\,\mathcal{U}\left(U_{\min}, U_{\max}\right) + \left(1 - \epsilon(r_{t})\right)\,\mathcal{N}\left(u_{t} + \rho(u_{t})\,r_{t},\ \sigma(u_{t})^{2}\right) \qquad (4)$$

where ϵ(*r*_{t}) corresponds to the probability of outliers, 𝒰(*U*_{min}, *U*_{max}) is the uniform distribution that models outlier data and 𝒩(*u*_{t} + ρ(*u*_{t})*r*_{t}, σ(*u*_{t})^{2}) is the Gaussian distribution modeling non-outlier data. This model is markedly richer than Equation (2). Notice (i) the non-constant proportionality factor ρ(*u*_{t}) applied to *r*_{t}; (ii) the non-constant standard error σ(*u*_{t}) of the Gaussian distribution; and (iii) the probability of outliers ϵ that depends on *r*_{t}. More precisely, ρ and σ are modeled as piecewise constant functions of *u*_{t} with eight intervals, and ϵ is a two-parameter logistic function of the absolute value of *r*_{t}, ϵ(*r*_{t}) = 1/(1 + *e*^{−(*a*+*b*|*r*_{t}|)}). All the parameters are simultaneously estimated with the EM algorithm (see Supplementary Material).
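As a minimal sketch of this emission model, the Python functions below evaluate the outlier probability and the mixture density for one probe. The function and argument names are hypothetical, the cut points of the piecewise constant functions are left to the caller, and censoring is deliberately ignored here.

```python
import math
from bisect import bisect_right


def piecewise(u, edges, values):
    """Value at u of a piecewise constant function.

    `edges` are the sorted interior cut points and `values` holds one
    value per interval, so len(values) == len(edges) + 1.
    """
    return values[bisect_right(edges, u)]


def outlier_prob(r, a, b):
    """Two-parameter logistic function of the absolute gDNA residual."""
    return 1.0 / (1.0 + math.exp(-(a + b * abs(r))))


def emission_density(x, u, r, rho, sigma, a, b, u_min, u_max):
    """Density of x given signal u and gDNA residual r.

    `rho` and `sigma` are the values of rho(u) and sigma(u) for this u,
    for instance obtained with `piecewise`.
    """
    eps = outlier_prob(r, a, b)
    # Outlier component: uniform over the range of the signal.
    uniform = 1.0 / (u_max - u_min) if u_min <= x <= u_max else 0.0
    # Regular component: Gaussian centered on the affinity-shifted signal.
    mu = u + rho * r
    gauss = (math.exp(-0.5 * ((x - mu) / sigma) ** 2)
             / (sigma * math.sqrt(2.0 * math.pi)))
    return eps * uniform + (1.0 - eps) * gauss
```

With a strongly negative `a` and `b = 0` the outlier component vanishes and the density reduces to the Gaussian term alone.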

Finally, left and right censoring are incorporated in the model to account for the experimental limitations that preclude exact measurements of extremely high and extremely low intensities. In practice, the lower and upper 5% of the original range of variation of the intensity *x* are considered as censored.

## 3 RESULTS AND DISCUSSION

### 3.1 Selecting the appropriate level of discretization

The model was designed with the explicit aim of modeling a continuous-valued underlying signal. In other words, discretization of the hidden state space is seen only as a necessary technicality and the step *h* ∝ 1/*K* should ideally be sufficiently small to have no impact on the results. Intuitively, the smaller the SD of the noise σ, the smaller the step *h* should be. The results obtained on the *B.subtilis* dataset and presented in Figure 1 confirm this intuition and thereby provide some form of validation for the model.

Fig. 1. (**A**) Log-likelihood (in natural log) as a function of the number of hidden states, *K*. (**B**) Estimated average variance of the noise as a function of *K* (plain line). The discretization step *h* ∝ 1/(*K* − **...**

Figure 1 shows that increasing *K* (and thus decreasing *h*) does improve the fit of the model to the data, as measured by the log-likelihood after ML estimation. Beyond a certain value of *K* the impact of this change becomes, however, almost unnoticeable. Figure 1 also reports the parallel evolution of *h* and σ. According to this plot, having *h* around 0.5σ seems more than sufficient. Indeed, with such a value of *h*, the 95% confidence interval (CI) of the distribution of the noise is about eight times as large as the discretization interval *h*. *K* was set to 100 for this particular dataset.
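The factor of eight follows from the width of the central 95% interval of a Gaussian distribution, which is 2 × 1.96σ, taken with *h* = 0.5σ:

$$\frac{2 \times 1.96\,\sigma}{h} = \frac{3.92\,\sigma}{0.5\,\sigma} = 7.84 \approx 8$$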

This choice of *K* = 100 corresponds to an acceptable running time for the algorithm. Our setting throughout this study consisted of exploring 10 random starting points for the EM algorithm. Here, it resulted in a total of 885 iterations taking 5 h 6 min on an Intel(R) Xeon(TM) 3.40 GHz CPU, less than the 5 h 36 min needed for the segmentation algorithm of Huber *et al*. (2006) with maximum segment length *l* = 1000 (22 000 bp) and segment number on each strand *S* = 1500.

### 3.2 Importance of modeling drift and covariates

Parameter estimates in model-based analyses are an invaluable source of information to understand both the behavior of the model and the data. The model contains a total of 23 parameters. Figure 2 is intended to provide an overview of their ML estimates on the *B.subtilis* data. The first row of Table 1 gives numerical values for a selection of parameters.

Fig. 2. (**A**) Transition matrix π(*u*_{t}, *u*_{t+1}). One row is represented. (**B**) SD of the noise σ as a function of the underlying signal level *u*_{t}. (**C**) Outlier probability ϵ as a function of the magnitude of the gDNA residuals **...**

The shape of the transition matrix that describes the trajectory of the underlying signal is defined by the parameters in Equation (3); one row of this matrix is shown in Figure 2A. The sharp peak reflects the high value of α_{n}: it is estimated that the underlying signal remains unchanged between adjacent probes in >85% of cases (α_{n} in Table 1). The narrow shoulders on both sides of the peak correspond to the upward and downward drift moves and reflect the value of the parameters (α_{u}, λ_{u}) and (α_{d}, λ_{d}), respectively. Close inspection reveals a small asymmetry, with upward moves being less frequent than downward moves (5.0% versus 7.8%). The small estimated proportion of abrupt shift moves between adjacent probes is almost invisible at this scale (1.2%).

As expected, the probability of outliers is estimated to increase with the magnitude of the residuals of the gDNA signal. The two-parameter logistic curve that models this relationship is shown in Figure 2C. Remarkably, the probability of outliers is found to be overall very small.

The parameters σ and ρ that model the observed intensity *x*_{t} are modeled as eight-parameter piecewise constant functions of the underlying signal level *u*_{t}. Figures 2B and D show these two functions. Whereas the SD of the noise σ is a relatively flat function of *u*_{t}, the parameter ρ that serves to account for the gDNA covariate varies by more than a factor of eight. An obvious characteristic of the latter is its sharp decrease for low values of the signal. This behavior probably reflects a higher level of non-specific signal in the lower end of the intensity spectrum. It is also reassuring to observe that the value of ρ in the middle of the spectrum is just slightly below unity, the value that we expect in an idealized situation [see the rationale behind the preprocessing step in Huber *et al*. (2006)].

As a whole, these results emphasize the importance of two specificities of our model: the modeling of drift moves as a complement to shift moves and the non-constant ρ that provides a simple adaptive method to account for the variation of affinity between probes.

To better understand the behavior of the model and the characteristics of the data, we carried out a comparative analysis of eight models. For the purpose of robust assessment of model fitness with respect to the *B.subtilis* dataset each model was fitted two times, once on each strand of the chromosome, and the likelihood was each time computed on the other strand. The sum of both log-likelihood terms is reported as the cross-validated log-likelihood in Table 1. Parameter values in Table 1 were estimated on the full dataset.

Sorted by decreasing value of adequacy with the data, the models ranged from ℳ1, the full 23-parameter model, to ℳ8, a nine-parameter model that does not account for drifts, outliers or covariates. Not accounting for outliers has only a small impact on the overall model fitness (ℳ2 versus ℳ1), but the probability of shift moves is increased by >30% in this simpler model. This can have a non-negligible impact in practice given that these particular shift moves are indeed likely to be spurious. Not modeling drifts has a much more pronounced impact (ℳ5 versus ℳ1). Fitness is 6.5% better for ℳ1 than for ℳ5 and the estimated proportion of shift moves is about four times lower in ℳ1 (1.2% versus 4.6%), suggesting that a substantial fraction of the drift moves in ℳ1 are interpreted as shift moves in ℳ5. A closer examination underscores the importance of downward drift as compared with upward drift. Not accounting for downward drift has 74% more effect on the overall fitness than not accounting for upward drift (ℳ3 and ℳ4 versus ℳ1). More spectacularly, if a single drift direction is allowed, modeling downward drift improves the model ∼4.5 times more than modeling only upward drift (ℳ3 versus ℳ5 and ℳ4 versus ℳ5). Setting ρ to either 1 or 0 was found in both cases to result in a dramatic drop in fitness, but with different specific effects. Setting ρ to 1 in ℳ6 results in estimation of high drift compared with the original model, whereas setting ρ to 0 in ℳ7 results in estimation of high noise.

### 3.3 Estimation of transcriptional landscape: illustration on *B.subtilis* data

The ultimate goal of the use of the model is to infer the underlying signal supposed to reflect the actual transcriptional landscape.

The adoption of a probabilistic setting for the trajectory of the underlying signal allows for a considerably richer signal reconstruction than just ‘optimal’ trajectory reconstruction. Figure 3 gives an illustration of these possibilities by superimposing a number of results obtained with the model on a 10 000 bp region of the *B.subtilis* chromosome. Results include: (i) the prediction interval for the value of the signal *u*_{t} at each chromosome position; (ii) a point prediction for the signal value by the conditional mean of *u*_{t} (the best predictor in terms of quadratic error); (iii) the inferred position of the experimental point after correction for differential probe affinity [computed as ]; (iv) the exact position of each type of move in the best trajectory given by the Viterbi path (abrupt shift, upward drift and downward drift); and (v) the probability of having each type of move at each position. All these values can be read directly from the output of our software.
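As an illustration of items (i) and (ii), the sketch below summarizes the posterior distribution of *u*_{t} at one position, assuming that the forward–backward algorithm has produced a probability vector over the *K* grid values. The function name, the equal-tailed definition of the interval and the 95% default are assumptions of this sketch and are not taken from the software.

```python
def summarize_posterior(grid, post, level=0.95):
    """Posterior mean and equal-tailed interval of a discrete posterior.

    grid: increasing signal values; post: their posterior probabilities.
    Returns (mean, lower bound, upper bound).
    """
    mean = sum(u * p for u, p in zip(grid, post))
    lo_q = (1.0 - level) / 2.0
    hi_q = 1.0 - lo_q
    lower = upper = None
    cum = 0.0
    for u, p in zip(grid, post):
        cum += p
        # First grid value at which the cumulative mass reaches each quantile.
        if lower is None and cum >= lo_q:
            lower = u
        if upper is None and cum >= hi_q:
            upper = u
    return mean, lower, upper
```

Applying this function at every position gives the band and the point prediction of the kind superimposed on the data in Figure 3.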

Fig. 3. … *B.subtilis* chromosome. Upper part: open circles show the original signal. Closed gray circles represent the signal after ‘correction’ **...**

The biological pertinence of the distinction between shifts and drifts seems remarkable in Figure 3. Inferred shifts are found mostly in intergenic regions that a priori correspond to possible positions for transcriptional promoters and terminators.

The position of each move (2893 shifts and 13 460 drifts) was compared with sequence predictions for two biological features: Rho-independent (intrinsic) terminators predicted with the algorithm of d'Aubenton-Carafa *et al*. (1990); promoters dependent on Sigma-A predicted using an HMM whose structure was chosen according to the results of Nicolas *et al*. (2006). To fulfill the needs of an unbiased analysis, both categories of predictions were made without prior on the position of the genes and confidence cutoffs were set relatively low to increase sensitivity (a total of 4164 Sigma-A predictions and 3492 terminator predictions are considered).

The results presented in Figure 4 confirm the practical relevance of the distinction between shift and drift moves. For upward moves, it shows the difference between shift and drift with respect to the distance between the breakpoint and the nearest promoter prediction. Similar results for downward moves and terminator predictions are presented in the Supplementary Material (Fig. S1). Although shifts represent only 18% of all moves, a clear majority of the moves lying within 22 bp of a predicted biological feature are shifts. The proportion of shifts is 59% among the 977 upward moves near a predicted promoter, and 71% among the 1157 moves near a predicted terminator.
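This kind of proximity count can be sketched as follows. The code is a simplified illustration with hypothetical function names: positions are plain integers on a single strand, and the direction of the move and the strand of the prediction, which matter in the actual comparison, are ignored. It assumes a non-empty list of features and at least one move within the cutoff.

```python
from bisect import bisect_left


def distance_to_nearest(position, sorted_features):
    """Distance from `position` to the closest entry of a sorted list."""
    i = bisect_left(sorted_features, position)
    # Only the neighbors on each side of the insertion point can be closest.
    candidates = sorted_features[max(0, i - 1):i + 1]
    return min(abs(position - f) for f in candidates)


def shift_fraction_near_features(shifts, drifts, features, cutoff=22):
    """Fraction of shifts among the moves lying within `cutoff` bp of a feature."""
    feats = sorted(features)
    near_shifts = sum(distance_to_nearest(p, feats) < cutoff for p in shifts)
    near_drifts = sum(distance_to_nearest(p, feats) < cutoff for p in drifts)
    return near_shifts / (near_shifts + near_drifts)
```

The overall proportion quoted above is obtained directly from the counts: 2893/(2893 + 13 460) ≈ 18%.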

Drift might partly reflect local variations of labeled cDNA that result from technical artifacts such as random priming bias. Drift could also reflect biological differences in the amount of mRNA. In particular, Figures 4 and S1 leave no doubt that a fraction of the drifts correspond to promoters or terminators whose activity is too weak to be detected as shifts in this biological condition. A preliminary exploration of the patterns of drift is reported in the Supplementary Material. Figure S2 shows that downward drift is most pronounced after upward shifts and before downward shifts, near the 5′ and 3′ ends of transcriptionally active regions. An excess of upward drift is found before upward shifts, at the 3′ end of regions with low transcriptional activity. Random priming artifacts could most easily be invoked to explain downward drift at the 3′ end of transcriptionally active regions (Xu *et al*., 2009). Downward drift may also, for instance, be partly caused by molecules whose synthesis is still incomplete. Here, no single explanation could apparently account for the patterns of upward and downward drift. Instead, drift is observed in a variety of chromosomal and transcriptional contexts that the landscape snapshots presented in Figures S3, S4 and S5 intend to illustrate. As an example, some spectacular cases of downward drift are found for transcription units apparently lacking a clear terminator. The intensity of the resulting downstream antisense transcription drifts downward progressively. In Figure S3, a pattern reminiscent of the bidirectional transcriptional activity recently described in *S.cerevisiae* (Xu *et al*., 2009) can also be observed.

### 3.4 Benchmark comparisons

In addition to allowing insightful reconstructions of the transcriptional landscape, good algorithms should identify breakpoints that match, as closely as possible, the position of the promoters and terminators. To compare different sets of breakpoints, promoter and terminator predictions were used as a proxy for the true (unknown) reference. Results are shown in Figure 5.

Fig. 5. … (**A**) and terminator (**B**) predictions (using a 22 bp distance cutoff) is reported as the number of breakpoints considered increases. Plain, dashed and dotted lines show the results obtained **...**

The results obtained with the HMMs, ℳ1, ℳ5 and ℳ8, give another confirmation of the biological pertinence of the distinction between shift moves and drift moves in ℳ1. They also reveal the deep impact of the correction for variation of affinity between probes using covariates, which is not implemented in ℳ8. The misbehavior of ℳ8 translates paradoxically into an apparent success at detecting terminators. This most likely does not reflect the transcription signal itself, but rather the low probe affinity due to the stem–loop secondary structure distinctive of the rho-independent terminators.

For the comparison of the new HMM segmentation method and the piecewise constant regression implemented in the algorithm of Huber *et al*. (2006), the latter was run on the data after correction for difference of affinity between probes (as shown in Fig. 3) with maximum segment length *l* = 22 000 bp and number of segments on each strand *S* between 1000 and 3000. Results clearly demonstrate the benefit of the new HMM framework. For *S* = 1500, the numbers of breakpoints matching promoter and terminator predictions were, respectively, 8.9% and 25% higher for the HMM.

### 3.5 Results on *S.cerevisiae* data

Examination of the segmentation produced by piecewise constant regression on the Watson (+)-strand of *S.cerevisiae* chromosome 1 leads to the choice of 152 (average segment size 1500 bp) as a sensible number of breakpoints (Huber *et al*., 2006). A question was thus whether the automatic procedure presented here would identify a similar number of shift moves. The model was fitted on the mRNA and gDNA data of the 57 616 probes representing both strands of chromosome 1.

The Viterbi path of our HMM on the (+)-strand contained 125 shift moves and 373 drift moves with a median distance of 60 bp between each of the 152 breakpoints of Huber *et al*. (2006) and the closest of the 125 shift moves. On this dataset, modeling drift can thus be useful to single out the most abrupt changes in the signal intensity.

Interestingly, further comparisons of models with and without drift indicated that drift improves the model fitness by only 1.4% on the *S.cerevisiae* data, much less than the 6.5% found on *B.subtilis* data. Biology and array technology are two sources of possible differences between the *S.cerevisiae* and *B.subtilis* datasets. Our model of drift seems more relevant for prokaryotic data obtained using long isothermal probes.

## 4 CONCLUSIONS

This article describes a new methodology based on an HMM that embeds the segmentation of a continuous-valued signal in a probabilistic setting. For a computationally affordable cost, this framework alleviates the difficulty of choosing a fixed number of breakpoints and permits retrieving more information than a unique segmentation. Probabilistic modeling makes it straightforward to compute confidence measures on the estimated transcriptional landscape. This information should prove particularly useful to pinpoint the differences in large collections of arrays. Extension of the model could also be imagined to tackle the problem of the joint segmentation of datasets where transcript boundaries and expression level differ.

By accounting for gDNA hybridization data as a covariate, the model automatically corrects the data for the variation of affinity between probes. David *et al*. (2006) proposed for this purpose a preprocessing step to be carried out on the raw data, before log-transformation, and producing a significant fraction of negative values. The data could thus no longer be simply log-transformed and a more complicated variance stabilization transformation, requiring multiple arrays, was used (Huber *et al*., 2002). In comparison, the normalization carried out by the model needs only one array and it alters only minimally the overall distribution of the log of the original data.

The model is also enriched and accounts for subtle effects such as signal ‘drift’ and covariates. Interestingly, our results unambiguously document the existence of drift in the *B.subtilis* dataset. The interest of this observation is 2-fold. First, drift has not been accounted for in the previous models and this may partially explain why selecting the number of breakpoints on real datasets proved so difficult (Huber *et al*., 2006; Picard *et al*., 2005). Second, the causes and the patterns of drift deserve to be investigated if we want to make the best use of tiling array expression data.

The software is distributed under the GNU General Public License at http://genome.jouy.inra.fr/~pnicolas/hmmtiling/.

## ACKNOWLEDGEMENTS

We thank Etienne Dervyn, Philippe Noirot and Franck Picard for constructive comments on the content of the manuscript.

*Funding*: BaSysBio project, European Commission research grant (LSHG-CT2006-037469).

*Conflict of Interest*: none declared.

## REFERENCES

- d'Aubenton Carafa Y, et al. Prediction of rho-independent *Escherichia coli* transcription terminators. A statistical analysis of their RNA stem-loop structures. J. Mol. Biol. 1990;216:835–858.
- Bertone P, et al. Global identification of human transcribed sequences with genome tiling arrays. Science. 2004;306:2242–2246.
- Biemar F, et al. Comprehensive identification of *Drosophila* dorsal-ventral patterning genes using a whole-genome tiling array. Proc. Natl Acad. Sci. USA. 2006;103:12763–12768.
- David L, et al. A high-resolution map of transcription in the yeast genome. Proc. Natl Acad. Sci. USA. 2006;103:5320–5325.
- Du J, et al. A supervised hidden Markov model framework for efficiently segmenting tiling array data in transcriptional and ChIP-chip experiments: systematically incorporating validated biological knowledge. Bioinformatics. 2006;22:3016–3024.
- Durbin R, et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press; 1998.
- Fridlyand J, et al. Hidden Markov model analysis of array CGH data. J. Multivar. Anal. 2004;90:132–153.
- He H, et al. Mapping the *C. elegans* noncoding transcriptome with a whole-genome tiling microarray. Genome Res. 2007;17:1471–1477.
- Huber W, et al. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics. 2002;18(Suppl. 1):96–104.
- Huber W, et al. Transcript mapping with high-density oligonucleotide tiling arrays. Bioinformatics. 2006;e22:1963–1970.
- Marioni JC, et al. BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics. 2006;22:1144–1146.
- Munch K, et al. A hidden Markov model approach for determining expression from genomic tiling micro arrays. BMC Bioinformatics. 2006;7:e239.
- Nicolas P, et al. A reversible jump Markov chain Monte Carlo algorithm for bacterial promoter motifs discovery. J. Comput. Biol. 2006;13:651–667.
- Olshen AB, et al. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics. 2004;5:557–572.
- Perocchi F, et al. Antisense artifacts in transcriptome microarray experiments are resolved by actinomycin D. Nucleic Acids Res. 2007;35:e128.
- Picard F, et al. A statistical approach for array CGH data analysis. BMC Bioinformatics. 2005;6:e27.
- Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE. 1989;77:257–286.
- Royce TE, et al. An efficient pseudomedian filter for tiling microrrays. BMC Bioinformatics. 2007;8:e186.
- Scott DW. Multivariate Density Estimation. Theory, Practice and Visualization. New York: Wiley; 1992.
- Stolc V, et al. Identification of transcribed sequences in *Arabidopsis thaliana* by using high-resolution genome tiling arrays. Proc. Natl Acad. Sci. USA. 2005;102:4453–4458.
- Stjernqvist S, et al. Continuous-index hidden Markov modelling of array CGH copy number data. Bioinformatics. 2007;23:1006–1014.
- Wang L-Y, et al. MSB: a mean-shift-based approach for the analysis of structural variation in the genome. Genome Res. 2009;19:106–117.
- Xu Z, et al. Bidirectional promoters generate pervasive transcription in yeast. Nature. 2009;457:1033–1037.

**Oxford University Press**

Transcriptional landscape estimation from tiling array data using a model of signal shift and drift. Bioinformatics. 2009 Sep 15; 25(18):2341.