# Bayesian Analysis of iTRAQ Data with Nonrandom Missingness: Identification of Differentially Expressed Proteins

## Abstract

iTRAQ (isobaric Tags for Relative and Absolute Quantitation) is a technique that allows simultaneous quantitation of proteins in multiple samples. In this paper, we describe a Bayesian hierarchical model-based method to infer the relative protein expression levels and hence to identify differentially expressed proteins from iTRAQ data. Our model assumes that the measured peptide intensities are affected by both protein expression levels and peptide specific effects. The values of these two effects across experiments are modeled as random effects. The nonrandom missingness of peptide data is modeled with a logistic regression which relates the missingness probability for a peptide with the expression level of the protein that produces this peptide. We propose a Markov chain Monte Carlo method for the inference of model parameters, including the relative expression levels across samples. Our simulation results suggest that the estimates of relative protein expression levels based on the MCMC samples have smaller bias than those estimated from ANOVA models or fold changes. We apply our method to an iTRAQ dataset studying the roles of Caveolae for postnatal cardiovascular function.

**Keywords:** Bayesian hierarchical model, iTRAQ, Mixed-effects model, Nonignorable missing, Protein quantitation

## 1 Introduction

One main objective of proteomic research is to detect and quantify all proteins present in a biological sample. iTRAQ, a shotgun technique using Isobaric Tags for Relative and Absolute Quantitation, has become commonly used because of its improved quantitative reproducibility and higher quantification sensitivity [16] compared to other methods such as 2DE [9], ICAT [3], and DIGE [4, 10]. Using four or eight isobaric tags, iTRAQ can simultaneously analyze up to eight biological samples [2, 12]. Peptides digested from different samples of protein mixtures are labeled with different tags independently, mixed together, separated, and studied by MS (mass spectrometry) and MS/MS (tandem mass spectrometry). The resulting collection of mass spectra provides information on peptide identification and quantification, which can be utilized to identify and quantify relative protein expression levels.

We use the data from an iTRAQ experiment with four isobaric tags (114, 115, 116, and 117) as an example to illustrate the iTRAQ data format in Table 1. Each row represents a specific peptide identified by software such as MASCOT [11], which searches a protein sequence database to identify the peptide corresponding to a specific peak in the mass spectrum. The peptides thus identified are given in the second column. The peak areas for different samples labeled with different tags are shown in the last four columns, and their values can be used to calculate the relative abundance of a given peptide across samples. Each peptide may arise from different spectra and hence have multiple observations in an experiment. For example, the first three rows in Table 1 correspond to the same peptide across all the spectra. Missing peptides are a common phenomenon in iTRAQ data. That is, a peptide may be observed in only some of the samples, some of the spectra, or some of the experiments. For example, the seventh row in Table 1 shows that peptide “DVDEIEAWISEK” is only observed in the samples labeled with 114 and 117. The fifth row indicates that in one spectrum, the intensities of the peptide “DLASVQALLR” are missing in all the samples. When multiple experiments are conducted, a peptide may be found to be missing in one experiment but observed in some other experiments (not shown in Table 1).

**...**

As seen above, the basic unit of iTRAQ data is the peptide. Each peptide has an associated intensity level. Several factors can affect the observed peptide intensities, i.e., the area columns in Table 1. The most obvious factor is the level of the protein in the sample that generates the peptide. Peptide specific features, such as ionization and fragmentation efficiency, affect the intensity levels for different peptides derived from the same protein. This is easily seen in Table 1, where all peptides are derived from the same protein. In addition, other factors such as sample preparation and experimental variation also contribute to the variabilities in the observed iTRAQ data. Hill et al. [5] illustrate in detail the possible sources of variations in iTRAQ data.

Another commonly encountered issue in iTRAQ data analysis is data missingness. Due to the nature of the technology, overlap in protein and peptide identifications between replicate experiments is less than ideal, and certain peptides are only observed for some samples in some spectra, leading to a large amount of missing data. Table 2 gives the number of proteins and peptides that are identified in only one, only two, or all three experiments when iTRAQ is performed three times on the same biological sample. It can be seen that only about 1/3 of the proteins were identified in all three experiments, whereas only about 1/4 of the peptides produced by these proteins were observed in all experiments. Liu et al. [6] and Wang et al. [15] suggested that the probability that a protein is missing is not random, but rather related to its abundance. Less abundant peptides are harder to detect due to the data-dependent acquisition of the analysis process, and hence are more likely to be missing. This is a nonignorable missing data problem. Ignoring the nonrandom missing pattern in statistical analysis may introduce significant bias in statistical inference and scientific conclusions.

To identify differentially expressed proteins, one common approach is to calculate the ratio of the observed peptide intensities (the area columns in Table 1) between two samples and to compare the calculated ratios against prespecified upper and lower bounds. However, the criterion for threshold selection is subjective. For example, Seshi [14] considered iTRAQ ratios >5/4 or <4/5 as significant, whereas Salim et al. [13] used thresholds 1.20 and 0.83. These thresholds fail to consider the variability in data and are not statistically based. Oberg et al. [8] and Hill et al. [5] applied ANOVA models to incorporate the variability sources in inferring differentially expressed proteins. But they do not consider the nonrandom missingness, potentially biasing their results. In this paper, we introduce a novel approach to inferring the relative protein expression levels and hence to identify differentially expressed proteins. We model the measured peptide intensities as the results of both protein expression levels and peptide specific effects. For iTRAQ data from multiple experiments, we utilize a Bayesian hierarchical model in the sense that the model has an observation component that models the observed peptide intensities as random effects whose conditional distribution depends on the expected protein expression levels and peptide effects, and a second (hierarchical) component that defines the distributions of these expected values. If a sample is labeled with multiple tags in a single experiment, the variations across different tags are modeled as random effects. In this paper, we also describe a model for iTRAQ data from a single experiment. As for the nonrandom missingness, we use a logistic regression to model the missingness probability as a function of the protein expression level. Based on this model set-up, we infer differentially expressed proteins through posterior inferences.
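To make the subjectivity of fixed ratio thresholds concrete, the following minimal Python sketch classifies one ratio under the two cut-off pairs quoted above. The function name `call_by_ratio` and the example ratio of 1.22 are illustrative assumptions and do not come from the cited studies.

```python
def call_by_ratio(ratio, upper, lower):
    """Classify a two-sample intensity ratio against fixed fold-change bounds."""
    if ratio > upper:
        return "up"
    if ratio < lower:
        return "down"
    return "unchanged"

# Threshold pairs quoted in the text: Seshi [14] and Salim et al. [13].
thresholds = {"Seshi": (5 / 4, 4 / 5), "Salim": (1.20, 0.83)}

ratio = 1.22  # hypothetical observed ratio, for illustration only
calls = {name: call_by_ratio(ratio, up, lo) for name, (up, lo) in thresholds.items()}
print(calls)
```

The same ratio is called "unchanged" under one pair of bounds and "up" under the other, and neither rule uses any information about the variability of the measurements.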

The paper is organized as follows. Section 2 develops the hierarchical model and details the inferential procedure. Section 3 reports a simulation study comparing our method with ANOVA methods and ratio estimates, and studies the robustness of our method. Section 4 reports the analysis of a mouse caveolin-1 experiment, and discussion follows in Sect. 5. We describe the detailed MCMC scheme in Appendix A, and a model for iTRAQ data from a single experiment in Appendix B.

## 2 Model

We first describe the model for iTRAQ data from multiple experiments and estimate the relative expressions of proteins that are present in all experiments. We assume that the labeling effects have been removed by normalization methods such as quantile normalization [1]. Throughout the paper, we consider log-transformed peptide intensities and protein expressions. We assume that there are $S$ (≥2) biological samples studied in $K$ (≥2) experiments. Multiple isobaric tags may label the same sample in one experiment. We use $L_s \ge 1$ to denote the number of tags labeling the $s$th sample. Then $\sum_s L_s = M$ is the number of isobaric tags used in one experiment, which is 4 when we use 4-plex isobaric reagents and $M = 8$ in the 8-plex version. Assume that there are $I$ proteins in the sample and there are $J_i$ peptides for the $i$th protein. For the $l$th label of the $s$th sample in the $k$th experiment, let $y_{kijsln}$ denote the observed intensity for the $j$th peptide of the $i$th protein from the $n$th spectrum. Note that $j$ should be more appropriately denoted as $j(i)$ to explicitly indicate that peptides are nested within proteins, and $l$ should be denoted as $l(s)$ to indicate the $l$th labeled tag of the $s$th sample. For notational simplification, we omit the parentheses. The measured intensity of a peptide depends on the protein expression level and the peptide effect. Let $x_{kisl}$ denote the expression level of the $i$th protein of the $s$th sample with the $l$th labeling tag in the $k$th experiment. Let $z_{kij}$ denote the peptide effect for the $j$th peptide of the $i$th protein in the $k$th experiment. We consider an additive model for $y_{kijsln}$ ($k = 1, \ldots, K$; $i = 1, \ldots, I$; $j = 1, \ldots, J_i$; $s = 1, \ldots, S$; $l = 1, \ldots, L_s$; $n = 1, \ldots, N_{kijsl}$):

$$y_{kijsln} = x_{kisl} + z_{kij} + \epsilon_{kijsln}, \tag{1}$$

which corresponds to a multiplicative model in the original scale. In (1), we assume $\epsilon_{kijsln} \sim N(0, \sigma_\epsilon^2)$ independently, where $N(0, \sigma_\epsilon^2)$ denotes a Normal distribution with mean 0 and variance $\sigma_\epsilon^2$.

In addition to the additive model in (1), we also consider the multiplicative model $y_{kijsln} = x_{kisl} \times z_{kij} + \epsilon_{kijsln}$ on a small dataset with one protein and 11 peptides observed in three caveolin-1 experiments. The inferences from both models are quite close in terms of the residual standard deviation (0.58 for the additive model vs. 0.60 for the multiplicative model) and $R^2$, the ratio of the sum of squares of the predicted values to the sum of squares of the original data (0.73 for the additive model vs. 0.69 for the multiplicative model). The residuals vs. fitted values plots are also similar (not shown). This is also true when we apply both models to the data in the original scale. Since the multiplicative model in the logarithm scale and the additive model in the original scale do not greatly improve the inference (or even do worse), we use model (1), which is also easy to interpret.

### Missing Data Mechanism

Peptide missingness presents a challenge even when we focus on proteins that are detected in all experiments. It is known that the probability of peptide missingness depends on the intensity of the peptide: lower intensity peptides are harder to detect. So there is a nonignorable missing data problem. To motivate a statistical model for missing peptide probability, we study the proportion of peptides observed in one experiment but missing in another experiment. As shown in Fig. 1, there is a negative correlation between missingness probability and peptide intensity. Furthermore, there is an approximate linear relationship between the peptide missingness probability and the observed intensity at the logit scale. Therefore, we model the missingness probability through a simple logistic regression,

$$\operatorname{logit} P(I_{kijsln} = 0) = \log \frac{P(I_{kijsln} = 0)}{1 - P(I_{kijsln} = 0)} = a + b\, y_{kijsln}, \tag{2}$$

where $I_{kijsln} = 0$ indicates that the $j$th peptide of the $i$th protein is missing in the $k$th experiment, the $l$th replicate of the $s$th sample, and the $n$th spectrum. Formula (2) implies that the logit of the probability of peptide missingness is linearly dependent on its intensity. We expect $b < 0$, because peptides with lower intensities are more likely to be missing.
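As an illustration of how such a logistic missingness mechanism behaves, the short Python sketch below evaluates the missingness probability at a few log intensities. The intercept and slope are the values $a = -0.16$ and $b = -1.03$ used in the simulation study of Sect. 3; the function name `missing_probability` and the three intensity values are hypothetical.

```python
import math

def missing_probability(y, a, b):
    """P(I = 0 | y) when logit P(I = 0) = a + b * y, with y a log-scale intensity."""
    return 1.0 / (1.0 + math.exp(-(a + b * y)))

# Intercept and slope as used in the simulation study (Sect. 3).
a, b = -0.16, -1.03

for y in (-2.0, 0.0, 2.0):  # hypothetical log intensities
    print(f"y = {y:+.1f}  P(missing) = {missing_probability(y, a, b):.3f}")
```

With a negative slope, the printed probabilities decrease as the intensity increases, which is the qualitative pattern the model is meant to capture.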

### Priors

Noting the hierarchical structure of the iTRAQ data and taking into account the variability across experiments and samples, we utilize a Bayesian hierarchical framework to model the data. We assume that $x_{kisl}$ and $z_{kij}$ are independently normally distributed across different experiments, i.e.,

$$x_{kisl} \sim N(x_{isl}, \sigma_x^2), \tag{3}$$

$$z_{kij} \sim N(z_{ij}, \sigma_z^2), \tag{4}$$

where $x_{isl}$ and $z_{ij}$ denote the protein and peptide effects averaged over multiple experiments, respectively. The protein expression levels in different replicates (labeled with different tags) of the same sample are also assumed to be normally distributed:

$$x_{isl} \sim N(x_{is}, \sigma_\delta^2), \tag{5}$$

where $x_{is}$ denotes the expression level of the $i$th protein in the $s$th sample. Assumptions (3)–(5) lead to an equivalent form of (1):

$$y_{kijsln} = x_{is} + z_{ij} + e^{t}_{isl} + e^{x}_{kisl} + e^{z}_{kij} + \epsilon_{kijsln}, \tag{6}$$

where $e^{x}_{kisl} \sim N(0, \sigma_x^2)$ and $e^{z}_{kij} \sim N(0, \sigma_z^2)$ denote the random effects across experiments, and $e^{t}_{isl} \sim N(0, \sigma_\delta^2)$ denotes the variation among multiple replicates of the same sample. Formula (6) is a mixed-effects model. To ensure the identifiability of the model, we restrict $x_{i1} = 0$. Then $x_{is}$ denotes the expression level of the $i$th protein in the $s$th sample relative to the first sample.
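As a reading aid, each normal prior can be written as "mean plus noise" and substituted into the additive model one level at a time. The sketch below assumes, as stated above, that the random effects and the residual are mutually independent; the variance decomposition in the last line is a direct consequence of that assumption rather than a formula quoted from the source.

$$
\begin{aligned}
x_{isl} &= x_{is} + e^{t}_{isl}, \qquad x_{kisl} = x_{isl} + e^{x}_{kisl}, \qquad z_{kij} = z_{ij} + e^{z}_{kij},\\
y_{kijsln} &= x_{kisl} + z_{kij} + \epsilon_{kijsln}
 = x_{is} + z_{ij} + e^{t}_{isl} + e^{x}_{kisl} + e^{z}_{kij} + \epsilon_{kijsln},\\
\operatorname{Var}(y_{kijsln} \mid x_{is}, z_{ij}) &= \sigma_\delta^2 + \sigma_x^2 + \sigma_z^2 + \sigma_\epsilon^2 .
\end{aligned}
$$

Read this way, the four variance components correspond to replicate, experiment-level protein, experiment-level peptide, and spectrum-level variation, respectively.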

The second level of priors consists of normal distributions for $x_{is}$ and $z_{ij}$.

Assigning hyperpriors to the hyperparameters completes the hierarchical model (Fig. 2), and we can infer the posterior distributions of the relevant parameters, $x_{is}$, by MCMC simulations. Appendix A describes the other hyperpriors and the MCMC updates in detail. We can then summarize the simulated posterior distributions with statistics such as posterior means, standard deviations, and quantiles, and identify differentially expressed proteins.

**Fig. 2** … *circles*, and the observations are in *rectangles*

When a sample is labeled with a unique isobaric tag in an experiment, there is no replicate variation component within a sample. We note that it is easy to modify the model and the MCMC updates for statistical inference in this scenario. We will not discuss it further in this paper.

### Single Experiment

When the iTRAQ data are from one experiment, we can similarly model the observed peptide intensities as the result of both protein expression levels and peptide effects, and model the nonrandom missingness through a logistic regression. We can further apply normal distributions as priors for protein expressions and peptide effects. The difference from the case of multiple experiments is that the experimental variability cannot be modeled. Appendix B describes this model and the MCMC updates in more detail.

### Comparison to ANOVA Model

The most important difference between our Bayesian model and the ANOVA model proposed by Hill et al. [5] and Oberg et al. [8] is that we explicitly model the nonignorable missingness in iTRAQ data. Oberg et al. [8] remarked at the end of their paper that using a censoring mechanism to fit the model would be a natural next step. Instead of censoring the data at an unknown threshold value, we model a higher probability of peptide missingness for lower peptide intensities. Our Bayesian model also differs from the ANOVA model in the sources of variation included in the model. In addition to the terms in our model, the ANOVA analysis also considers the labeling effect, the interaction between the labeling and experimental effects, and variable peptide effects under different conditions (see Sect. 5). The experimental effect and the replicate effect (when multiple tags label a sample) are considered constants for all proteins in the ANOVA model. In contrast, we model them as random effects that are specific to peptides and/or proteins.

## 3 Simulation Study

We simulate data from a 4-plex version of iTRAQ on one protein containing ten peptides across three replicate experiments. Each sample is labeled with a distinct isobaric tag. In this case, there is no need to model the replicate effects specified by prior (5). We assume $x = (0, -0.04, -0.48, -0.66)$ to be the true relative protein expression levels in log scale compared to the first sample. Under different parameter values for $\sigma_x$, $\sigma_z$, and $\sigma_\epsilon$, we simulate data as follows: (1) sample $x_{ks} \sim N(x_s, \sigma_x^2)$ and $z_{kj} \sim N(z_j, \sigma_z^2)$, where $z_j \sim N(0, 1)$; here we drop the subscripts $i$ and $l$ since there is only one protein and only one isobaric tag for a sample in an experiment; (2) sample $y_{ksjn} \sim N(x_{ks} + z_{kj}, \sigma_\epsilon^2)$, calculate the missing data probability $P(I_{ksjn} = 0)$, and determine the missing pattern $I_{ksjn}$. We take $a = -0.16$ and $b = -1.03$ in the simulation, based on the posterior inference from a small subset of a real dataset.
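The two simulation steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `simulate_itraq`, the number of spectra per peptide (`n_spec = 2`), and the random seed are assumptions, and the variances shown are just one of the parameter settings mentioned in the text ($\sigma_x^2 = 0.01$, $\sigma_z^2 = 1$, $\sigma_\epsilon^2 = 1.5$).

```python
import math
import random

def simulate_itraq(x_true, n_pep=10, n_exp=3, n_spec=2,
                   var_x=0.01, var_z=1.0, var_eps=1.5,
                   a=-0.16, b=-1.03, seed=0):
    """Simulate one protein; returns rows (k, s, j, n, y, observed)."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n_pep)]  # z_j ~ N(0, 1)
    rows = []
    for k in range(n_exp):
        # Step 1: experiment-level protein and peptide effects.
        x_k = [rng.gauss(xs, math.sqrt(var_x)) for xs in x_true]
        z_k = [rng.gauss(zj, math.sqrt(var_z)) for zj in z]
        for s, x_ks in enumerate(x_k):
            for j, z_kj in enumerate(z_k):
                for n in range(n_spec):  # n_spec is a hypothetical choice
                    # Step 2: intensity, then logistic missingness indicator.
                    y = rng.gauss(x_ks + z_kj, math.sqrt(var_eps))
                    p_miss = 1.0 / (1.0 + math.exp(-(a + b * y)))
                    observed = rng.random() >= p_miss
                    rows.append((k, s, j, n, y, observed))
    return rows

rows = simulate_itraq(x_true=(0.0, -0.04, -0.48, -0.66))
obs = [r for r in rows if r[5]]
print(len(rows), "simulated intensities,", len(obs), "observed")
```

Because the missingness probability depends on the simulated intensity itself, the observed intensities are systematically larger than the missing ones, which is the source of bias that an analysis ignoring missingness would inherit.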

We analyze the simulated data with our Bayesian method and infer the relative protein expression levels through the MCMC samples. For comparison, we also analyze the data with the ANOVA model proposed by Hill et al. [5] and Oberg et al. [8], and calculate the means of the log ratios of peptide intensities. For each parameter setting, we simulate ten data sets and summarize the results from one data set in Table 3. The Bayesian method and the ANOVA analysis provide measures of the uncertainties of estimates. We obtain either the 95% credible intervals of the posterior distributions or the 95% confidence intervals for the estimates from the ANOVA analysis. When performing the ANOVA analysis, we consider two models. “ANOVA 1” includes the sample effect, peptide effect, experimental effect, and the interaction of sample effect and peptide effect. “ANOVA 2” removes the interaction term from “ANOVA 1.” From Table 3 we observe that all but one credible interval cover the true values when using our Bayesian method to analyze the data. However, about 1/3 of the confidence intervals from the ANOVA analysis fail to cover the true values, including the case where the Bayesian analysis fails (the estimate of $x_3$ for the simulated data when $\sigma_x^2 = 0.01$, $\sigma_z^2 = 1$, and $\sigma_\epsilon^2 = 1.5$). Comparing the estimates to the true values, we find that our Bayesian estimates have smaller bias than those from the ANOVA analysis. Figure 3 draws the boxplots of the biases of the estimates using different methods for all six parameter settings. It is clear that the Bayesian method leads to the smallest bias. The better coverage and smaller bias of the Bayesian method are consistently observed in the analyses of the other nine simulated data sets. In the 60 analyses (10 data sets for each parameter setting), the 95% credible intervals from our Bayesian method fail to cover the true values 3% of the time, but the 95% confidence intervals from the ANOVA method fail in 1/3 of the cases. The means of the biases for estimates of $x$ from the Bayesian analysis are at least 50% smaller than those from the ANOVA method. The lengths of the credible intervals and confidence intervals are specific to a data set or the parameter setting. Neither is consistently smaller than the other.

… $x_2 = -0.04$, $x_3 = -0.48$, and $x_4 = -0.66$. “Bayesian” refers to our method.

**...**

In the above results, we simulated data according to our model, which may favor our approach. To study the robustness of our approach, we also consider a different missing mechanism. For each experiment, we first simulate whether each peptide is present from a Bernoulli distribution with probability $p$, which determines the potential frequency $r_j$ of the presence of peptide $j$ in $K = 3$ experiments ($r_j = 0, 1, 2,$ or $3$). Given $r_j$, we sample the peptide effect $z_j \mid r_j \sim \text{logGamma}(l_{r_j}, sh_{r_j}, sc_{r_j})$ for $r_j > 0$, where the log-gamma distribution is parameterized by a shape $a > 0$, a scale $b > 0$, and a location $c$. Peptides with $r_j = 0$ are missing. Then we simulate the variabilities across experiments: $x_{ks} \sim N(x_s, \sigma_x^2)$ and $z_{kj} \sim N(z_j, \sigma_z^2)$. Finally, we follow the second step in the previous study to simulate $y_{ksjn}$ and $I_{ksjn}$. This mechanism differs from our model in two ways: (1) the distribution for $z_j$ differs; and (2) the missing data mechanism differs, since simulating the possible presence of peptides from the Bernoulli distribution also causes peptides to be missing. The resulting peptide frequency may be less than $r_j$. We study how our method performs for the data simulated under this missing mechanism. We consider different values for the success probability in the Binomial distribution (0.9 and 0.2). For each case, we simulate ten data sets and analyze them with our method. Table 4 gives the means and standard deviations of the ten estimates. We find that the means are close to the true values and the inference is not sensitive to the new mechanism of peptide missingness. Compared to the results obtained from the ANOVA analysis, which contains the main effects of protein, peptide, and experiment, the estimates from our Bayesian analysis are closer to the true values and have less variability (except $x_4$ for Bi(3, 0.2)).
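The presence-count step of this alternative mechanism can be sketched as follows. Several pieces are assumptions made only for illustration: the function name `draw_peptide`, the table `LOGGAMMA_PARAMS` of (location, shape, scale) values, which the text does not list, and the parameterization of the log-gamma draw as a location plus a scale times the logarithm of a unit-scale gamma variate.

```python
import math
import random

# Hypothetical (location, shape, scale) per presence count r; not from the source.
LOGGAMMA_PARAMS = {1: (-1.0, 2.0, 0.5), 2: (0.0, 2.0, 0.5), 3: (1.0, 2.0, 0.5)}

def draw_peptide(p, rng, n_exp=3):
    """Return (present_flags, r, z) for one peptide; z is None when r == 0."""
    present = [rng.random() < p for _ in range(n_exp)]  # Bernoulli(p) per experiment
    r = sum(present)                                    # so r ~ Binomial(n_exp, p)
    if r == 0:
        return present, r, None                         # never present: always missing
    loc, shape, scale = LOGGAMMA_PARAMS[r]
    # Assumed parameterization: z = loc + scale * log(G), with G ~ Gamma(shape, 1).
    z = loc + scale * math.log(rng.gammavariate(shape, 1.0))
    return present, r, z

rng = random.Random(1)
peptides = [draw_peptide(0.9, rng) for _ in range(10)]
print([(r, None if z is None else round(z, 2)) for _, r, z in peptides])
```

Tying the peptide effect to the presence count is what makes this mechanism different from the logistic model: a peptide can be absent from an experiment regardless of its simulated intensity.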

… $x_2 = -0.04$, $x_3 = -0.48$, and $x_4 = -0.66$. Values in parentheses are standard deviations of the estimates. The ANOVA …

**...**

In the previous simulations, we fix the same number of observations for every peptide. When a peptide is not observed in an experiment, we assume that only one spectrum is missing and impute the values for all samples in only one observation. To study the effect of a varying number of observations for different peptides on our inference, we randomly sample these numbers from a Poisson distribution. The rate of the distribution is randomly picked from a set of values. We also apply the missing mechanism described in the previous paragraph with $p = 0.5$. Under this scheme, we simulate ten data sets and analyze them with our method and the ANOVA model. From the calculated means and standard deviations in Table 4 we see that the distribution of the number of observations does not have a great effect on the inference, and the estimates from our method have less variability than those obtained from the ANOVA analysis.

## 4 Case Study

We apply our method to an iTRAQ dataset which aims to identify proteins affected by caveolin-1. Caveolin-1 is essential to the formation of caveolae, while the functional perturbations in the caveolae and the caveolae coat proteins may cause a wide range of diseases from cancer to a rare form of muscular dystrophy. Recent studies from mice suggest that they may be important for postnatal cardiovascular function [7]. Comparing the protein profiles from wild-type (WT) mice and knock-out Cav-1 (KO) mice using iTRAQ, we can explore the physiological and pathophysiological roles of caveolins for postnatal cardiovascular function. Samples from three KO mice and three WT mice were labeled with iTRAQ reagents as shown in Table 5. Among the 424 proteins identified in the study, a total of 138 common proteins were identified in all three comparisons of the WT/KO mice from iTRAQ analysis (Table 2). Focusing on the 4765 peptides of these 138 common proteins, we found that 2124 of them were observed in all three experiments.

We first perform quantile normalization with each protein in the two replicates of each sample. Then we do two iterations of quantile normalization on each pair of samples to remove the systematic bias in the data. Applying our method to the log transformed value of the normalized data, we conduct 101000 iterations of MCMC updates and take the first 6000 as burn-in. The simulation takes 138 hours on the caveolin data with 4765 peptides and 200684 observations. Sampling every tenth iteration, we get 9500 samples, based on which we infer the posterior distributions of protein expression levels.
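The retained sample size follows directly from these settings. A quick check in Python, assuming that thinning starts immediately after the burn-in (the exact offset is not stated in the text):

```python
n_iter, burn_in, thin = 101_000, 6_000, 10  # values reported in the text

# Indices of the iterations kept after discarding burn-in and thinning.
kept = range(burn_in, n_iter, thin)
print(len(kept))  # 9500
```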

We illustrate the inferred posterior means of the relative protein expression levels in Fig. 4. We also depict the upper and lower 2.5% posterior quantiles in the figure. From these posterior inferences, we can further identify differentially expressed proteins. For example, if we require the 2.5% quantile to be above zero or the 97.5% quantile to be below zero, there are 19 up-regulated and 7 down-regulated proteins. We summarize the posterior inferences of other parameters in Table 6. For these normalized data, the randomness of peptide effects across experiments is the largest source of variation (313.141). The replicate variation within a sample is almost negligible (0.002). We infer the slope parameter in (2) to be negative (−0.217), implying that peptides with lower intensities are more likely to be missing.
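The quantile rule described above is easy to apply to a set of posterior draws. The following Python sketch is illustrative only: the helper names `quantile` and `call_protein` are hypothetical, and the draws are synthetic Gaussian stand-ins rather than MCMC output from the caveolin data.

```python
import random

def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def call_protein(samples):
    """Apply the 95% credible-interval rule to posterior draws of x_is."""
    s = sorted(samples)
    q_lo, q_hi = quantile(s, 0.025), quantile(s, 0.975)
    if q_lo > 0:
        return "up"
    if q_hi < 0:
        return "down"
    return "not called"

rng = random.Random(0)
# Synthetic stand-ins for MCMC output, not real posterior draws.
draws = {
    "protein_A": [rng.gauss(0.8, 0.2) for _ in range(9500)],
    "protein_B": [rng.gauss(-0.6, 0.2) for _ in range(9500)],
    "protein_C": [rng.gauss(0.05, 0.3) for _ in range(9500)],
}
calls = {name: call_protein(v) for name, v in draws.items()}
print(calls)
```

Here "up" and "down" are relative to the reference sample, whose expression level is fixed at zero for identifiability.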

To make a comparison with other methods, we also apply the ANOVA method to the data. Since there are 138 identified proteins and 4765 identified peptides, it is difficult to estimate all of the parameters in the ANOVA model simultaneously using current software and computers. Applying the stagewise regression idea in Oberg et al. [8], we first estimate the effects of experiments, proteins, and peptides (the first two groups of model (1) in Oberg et al. [8]), and then we take the residuals as responses for estimating the effects of samples and the interactions of samples with proteins and peptides. The sample-related parameters are estimated for each protein individually, assuming that each protein has a different variance parameter, rather than a global variance parameter. Regarding proteins as differentially expressed when the 95% confidence intervals do not cover zero, we find 60 up-regulated and 26 down-regulated proteins. They contain all the differentially expressed proteins inferred from our Bayesian model. Focusing on the proteins that are only found by ANOVA, we study their missing patterns and compare the estimates from both methods. We find that for 35 of the 41 (= 60 − 19) up-regulated and 15 of the 19 (= 26 − 7) down-regulated proteins, the differences between the estimates of expression levels from the two models may be due to missingness. Another likely reason that ANOVA identifies more proteins is that protein-by-protein estimation leads to smaller variances than the global variance under our Bayesian approach. So the credible intervals from the Bayesian analysis have wider, and more appropriate, ranges than the confidence intervals from the ANOVA model.

## 5 Discussion

We have developed a novel Bayesian model to analyze iTRAQ data from multiple experiments or a single experiment. In our model, the observed peptide intensities are influenced by both the protein expression levels and the peptide effects. For data from multiple experiments, these two effects across experiments are modeled as random effects. If a sample is labeled with multiple isobaric tags, our model also allows random effects across replicates. We explicitly model the nonignorable missingness for peptides, which is a common phenomenon in iTRAQ data. The logit of the probability of peptide missingness is assumed to be linearly dependent on its intensity. We implement an MCMC approach to simulate the posterior distributions of relative protein expression levels. The MCMC samples provide both estimates of the expressions and measures of uncertainty for the estimates. Compared to the estimates from the ANOVA analysis and the simple log ratio calculation, we find that the estimates from the MCMC samples greatly reduce the bias due to missing data.

In our model, we assume that the logit of the missingness probability is linearly dependent on the peptide intensity $y_{kijsln}$ (2), and the latter depends on the protein expression levels, peptide effects, and several variation terms (6). For a particular peptide $j$, in addition to the variation ($\epsilon_{kijsln}$) across multiple spectra in an experiment, experimental variations are modeled at both the protein ($e^{x}_{kisl}$) and peptide ($e^{z}_{kij}$) levels. A small peptide effect specific to a particular experiment ($z_{kij}$) may cause the missingness of the peptide in that experiment ($k$). When both the protein effect and the peptide effect are large in an experiment $k$, this peptide will be observed in experiment $k$, but an extremely small value of $\epsilon_{kijsln}$ can lead to the missingness of this peptide in spectrum $n$ of experiment $k$. So this model explains the peptide missingness both at the experiment level and at the spectrum level.

We have performed simulation studies to check whether our analysis is sensitive to this assumption of missingness. We simulate data sets from different missing mechanisms and analyze them with our Bayesian method. The estimated values are close to the true values and have smaller bias than the results from the ANOVA analysis. Furthermore, we also check how a variable number of spectra for peptides affects our analysis, since when a peptide is missing in an experiment, we impute the values in only one MS spectrum. We find that when we sample the number of MS spectra for a peptide from a Poisson distribution, our analysis leads to estimates close to the true values. This implies that our method is robust to these model violations.

The labeling effect is an issue that is not directly addressed in our model. The ability of peptides to link to isobaric reagents may vary, implying a peptide-tag specific labeling effect. Modeling all such labeling effects increases the number of model parameters dramatically. If we treat the labeling effects as constants for all peptides, this amounts to adding a constant specific to each tag in model (1). Due to the limitation of the data in the caveolin study, the labeling effect is confounded with signals. In this paper, we first perform normalization to remove the labeling effect and systematic bias, and then apply our method to infer the relative protein expressions. In practical studies, we suggest randomizing the isobaric tags applied to samples when multiple experiments are conducted.

The fast convergence requirement is a challenge to our Bayesian approach. For a larger scale study, more MCMC iterations and hence a longer time are needed to ensure convergence. Although the Bayesian method is slower than the ANOVA method, the latter cannot fit all the involved parameters simultaneously using current software and computers. Oberg et al. [8] suggest using stagewise regression and then inferring the sample effects based on protein-by-protein estimation. But to get correct answers from the stagewise approach, it is necessary that the portions of the linear model design matrix corresponding to the multiple stages be orthogonal, which is not necessarily true.

In this study, we assume that all of the peptide-based observations accurately reflect the intact proteins. As a result, we ignore the possibility that homologous genes give rise to two or more proteins sharing both identical and nonidentical peptides, as well as the possibility of post-transcriptional modifications. In addition to ignoring labeling effects, we do not include interactions between peptide effects and sample conditions, in contrast to the ANOVA model. This corresponds to the assumption that certain proteins are differentially expressed under different conditions, but that any change in protein expression affects all peptides of that protein equally. We expect this to be the common case, except under certain biological conditions, for example, a post-translational modification that involves a peptide substitution [8]. Despite these limitations, our method explicitly models the nonrandom missingness of iTRAQ data and provides a substantial improvement in estimating the relative expression levels of proteins.

## Acknowledgements

The work was supported in part by NIH grants HV28286, DA018343, GM59507 and NSF grant DMS 0714817. The work was also supported in part by “Yale University Biomedical High Performance Computing Center” and NIH grant RR19895, which funded the instrumentation.

## Appendix A: MCMC Updates

We assume inverse gamma distributions as priors for the variance hyperparameters, i.e., gamma priors on the precisions: $\sigma_x^{-2} \sim \text{Gamma}(\gamma_1, \gamma_2)$, $\sigma_z^{-2} \sim \text{Gamma}(\gamma_3, \gamma_4)$, $\sigma_\delta^{-2} \sim \text{Gamma}(\gamma_5, \gamma_6)$, and $\sigma_\epsilon^{-2} \sim \text{Gamma}(\gamma_7, \gamma_8)$, where γ_{1} and γ_{2} denote the shape and scale parameters of a gamma distribution, respectively. We assume *a* ~ *N*(0, *ν*^{2}) and *b* ~ *N*(0, *ν*^{2}). The joint distribution of the model is

where MVN(․ | μ, Σ) denotes a multivariate normal distribution with mean vector μ and covariance matrix Σ, invGamma(․) denotes an inverse gamma distribution, and *p*(**I**_{kijsl} | **y**_{kijsl}, *a, b*) can be determined by formula (2). The full conditional distributions for involved parameters are given below.

- Protein and peptide effects: *x*_{kisl}, *z*_{kij}, *x*_{isl}, *x*_{is}, and *z*_{ij}.

  $$x_{kisl} \sim N\left(\frac{\sum_j \frac{\bar{y}_{kijsl} - z_{kij}}{\sigma_\epsilon^2 / N_{kijsl}} + \frac{x_{isl}}{\sigma_x^2}}{\sum_j \frac{N_{kijsl}}{\sigma_\epsilon^2} + \frac{1}{\sigma_x^2}},\ \frac{1}{\sum_j \frac{N_{kijsl}}{\sigma_\epsilon^2} + \frac{1}{\sigma_x^2}}\right), \tag{11}$$

  $$z_{kij} \sim N\left(\frac{\sum_m \frac{\bar{y}_{kijsl} - x_{kisl}}{\sigma_\epsilon^2 / N_{kijsl}} + \frac{z_{ij}}{\sigma_z^2}}{\sum_m \frac{N_{kijsl}}{\sigma_\epsilon^2} + \frac{1}{\sigma_z^2}},\ \frac{1}{\sum_m \frac{N_{kijsl}}{\sigma_\epsilon^2} + \frac{1}{\sigma_z^2}}\right), \tag{12}$$

  $$x_{isl} \sim N\left(\frac{\sum_k x_{kisl} / \sigma_x^2 + x_{is} / \sigma_\delta^2}{K / \sigma_x^2 + 1 / \sigma_\delta^2},\ \frac{1}{K / \sigma_x^2 + 1 / \sigma_\delta^2}\right), \tag{13}$$

  $$x_{is} \sim N\left(\frac{\sum_l x_{isl} / \sigma_\delta^2}{L_s / \sigma_\delta^2 + 1 / \tau_x^2},\ \frac{1}{L_s / \sigma_\delta^2 + 1 / \tau_x^2}\right) \quad \text{for } s \neq 1, \tag{14}$$

  $$z_{ij} \sim N\left(\frac{\sum_k z_{kij} / \sigma_z^2}{K / \sigma_z^2 + 1 / \tau_z^2},\ \frac{1}{K / \sigma_z^2 + 1 / \tau_z^2}\right). \tag{15}$$

  When we take τ_{x} = τ_{z} = ∞, i.e., noninformative priors for *x*_{is} and *z*_{ij}, we have $x_{is} \sim N\left(\sum_l x_{isl} / L_s,\ \sigma_\delta^2 / L_s\right)$ and $z_{ij} \sim N\left(\sum_k z_{kij} / K,\ \sigma_z^2 / K\right)$.

- Missing value *y*_{kijsln}. Let μ_{kijsl} = *x*_{kisl} + *z*_{kij}; then

  $$f(y_{kijsln} \mid \dots) \propto \exp\left\{-\frac{1}{2\sigma_\epsilon^2}\left(y_{kijsln} - \mu_{kijsl}\right)^2\right\} \times \frac{1}{1 + \exp(a + b\,y_{kijsln})}. \tag{16}$$

  Note that *f*(*y*_{kijsln} | …) is log-concave, so we can use the Adaptive Rejection Sampling (ARS) method.

- Parameters in the logistic model for the missing mechanism: *a* and *b*. Since

  $$f(a, b \mid \dots) \propto \frac{\prod_{kijsln:\, I_{kijsln} = 1} \exp(a + b\,y_{kijsln})}{\prod_{kijsln} \left(1 + \exp(a + b\,y_{kijsln})\right)} \times N(a \mid 0, \nu^2) \times N(b \mid 0, \nu^2) \tag{17}$$

  is log-concave, we can use ARS.

- Variances σ_{x}, σ_{z}, σ_{δ}, and σ_{ε}.

  $$\sigma_x^{-2} \sim \text{Gamma}\left(\gamma_1 + \frac{K I \sum_s L_s}{2},\ \left[\frac{1}{\gamma_2} + \frac{1}{2} \sum_{kisl} \left(x_{kisl} - x_{isl}\right)^2\right]^{-1}\right), \tag{18}$$

  $$\sigma_z^{-2} \sim \text{Gamma}\left(\gamma_3 + \frac{K \sum_i J_i}{2},\ \left[\frac{1}{\gamma_4} + \frac{1}{2} \sum_{kij} \left(z_{kij} - z_{ij}\right)^2\right]^{-1}\right), \tag{19}$$

  $$\sigma_\delta^{-2} \sim \text{Gamma}\left(\gamma_5 + \frac{I \sum_s L_s}{2},\ \left[\frac{1}{\gamma_6} + \frac{1}{2} \sum_{isl} \left(x_{isl} - x_{is}\right)^2\right]^{-1}\right), \tag{20}$$

  $$\sigma_\epsilon^{-2} \sim \text{Gamma}\left(\gamma_7 + \frac{\sum_{kijsl} N_{kijsl}}{2},\ \left[\frac{1}{\gamma_8} + \frac{1}{2} \sum_{kijsln} \left(y_{kijsln} - x_{kisl} - z_{kij}\right)^2\right]^{-1}\right). \tag{21}$$
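
The updates in (11), (16), and (21) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the analyses in this paper: all function names are hypothetical, the inputs are toy numbers, and the missing value in (16) is drawn by plain rejection sampling from the normal factor rather than by ARS, which is valid because the logistic factor is bounded by 1.

```python
import math
import random

def x_full_conditional(ybar, n, z, x_parent, sigma_eps2, sigma_x2):
    """Mean and variance of the normal full conditional in (11).

    ybar[j], n[j], z[j]: mean log intensity, number of spectra, and peptide
    effect for peptide j of one protein in one experiment and sample.
    """
    precision = sum(n) / sigma_eps2 + 1.0 / sigma_x2
    weighted = sum(nj * (yb - zj) / sigma_eps2 for yb, nj, zj in zip(ybar, n, z))
    weighted += x_parent / sigma_x2
    return weighted / precision, 1.0 / precision

def draw_precision(shape0, scale0, residuals, rng):
    """Gamma draw for a precision such as sigma_eps^{-2} in (21)."""
    shape = shape0 + len(residuals) / 2.0
    scale = 1.0 / (1.0 / scale0 + 0.5 * sum(r * r for r in residuals))
    return rng.gammavariate(shape, scale)  # second argument is the scale

def draw_missing_y(mu, sigma_eps, a, b, rng, max_tries=100000):
    """Draw from (16) by rejection sampling (a simple stand-in for ARS)."""
    for _ in range(max_tries):
        y = rng.gauss(mu, sigma_eps)       # proposal: the normal factor
        t = a + b * y
        accept = 0.0 if t > 700 else 1.0 / (1.0 + math.exp(t))
        if rng.random() < accept:          # accept with the logistic factor
            return y
    raise RuntimeError("acceptance rate too low; ARS would be preferable")

rng = random.Random(1)
mean, var = x_full_conditional([1.0, 2.0], [2, 1], [0.0, 1.0], 0.0, 1.0, 1.0)
tau = draw_precision(1.0, 1.0, [0.5, -0.5, 1.0], rng)
draws = [draw_missing_y(0.0, 1.0, 0.0, 2.0, rng) for _ in range(2000)]
```

In this toy setting with *b* > 0, the imputed missing values are on average lower than μ_{kijsl}, reflecting that low-intensity measurements are more likely to be missing. Rejection sampling becomes inefficient when the acceptance probability is small, which is one reason to prefer ARS in practice.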

## Appendix B: iTRAQ Data from One Experiment

We illustrate the model when each sample is labeled differently, or when we treat samples with distinct isobaric tags as different samples. It is easy to modify this model to take replicates of samples into account. For the *m*th marker (or sample) and the *i*th protein, let *y*_{ijmn} denote the log value of the *n*th measured intensity for the *j*th peptide, and let *x*_{mi} denote the (log) protein expression level in the experiment. Let *z*_{ij} be the peptide effect for the *j*th peptide of the *i*th protein. We consider the additive model for *y*_{ijmn} (*m* = 1, …, *M*; *i* = 1, …, *I*; *j* = 1, …, *J*_{i}; *n* = 1, …, *N*_{ijm}) and missing mechanism:

and restrict *x*_{i1} = 0. We take normal distributions as priors for *x*_{mi} and *z*_{ij}:

Priors for other parameters are the same as those in Sect. 2. Then the joint distribution of the model is

The full conditional distributions for missing *y*_{ijmn}, *a*, and *b* are the same as those for multiple experiments. For *x*_{mi}, *z*_{ij}, and σ_{ε}, their full conditional distributions are given below:

## Contributor Information

Ruiyan Luo, Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, CT 06520, USA.

Christopher M. Colangelo, W.M. Keck Foundation, Biotechnology Resource Laboratory, Yale University School of Medicine, New Haven, CT 06511, USA.

William C. Sessa, Department of Pharmacology, Yale University School of Medicine, New Haven, CT 06510, USA.

Hongyu Zhao, Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, CT 06520, USA.

## References
