![]() | ![]() |
Formats:
|
||||||||||||||||||||||
Copyright © 2007 Yu et al; licensee BioMed Central Ltd. Transcript-level annotation of Affymetrix probesets improves the interpretation of gene expression data 1Shanghai Center for Bioinformation Technology, Shanghai 200235, P. R. China 2School of Life Science and Technology, Shanghai Jiaotong University, Shanghai 200240, P. R. China 3Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, P. R. China 4Bioinformatics Center, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, P. R. China Corresponding author.#Contributed equally. Hui Yu: yuhui/at/scbit.org; Feng Wang: wangfengbme/at/sjtu.edu.cn; Kang Tu: ktu/at/sibs.ac.cn; Lu Xie: xielu/at/scbit.org; Yuan-Yuan Li: yyli/at/scbit.org; Yi-Xue Li: yxli/at/scbit.org Received November 21, 2006; Accepted June 11, 2007. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. This article has been cited by other articles in PMC.Abstract Background The wide use of Affymetrix microarray in broadened fields of biological research has made the probeset annotation an important issue. Standard Affymetrix probeset annotation is at gene level, i.e. a probeset is precisely linked to a gene, and probeset intensity is interpreted as gene expression. The increased knowledge that one gene may have multiple transcript variants clearly brings up the necessity of updating this gene-level annotation to a refined transcript-level. Results Through performing rigorous alignments of the Affymetrix probe sequences against a comprehensive pool of currently available transcript sequences, and further linking the probesets to the International Protein Index, we generated transcript-level or protein-level annotation tables for two popular Affymetrix expression arrays, Mouse Genome 430A 2.0 Array and Human Genome U133A Array. Application of our new annotations in re-examining existing expression data sets shows increased expression consistency among synonymous probesets and strengthened expression correlation between interacting proteins. Conclusion By refining the standard Affymetrix annotation of microarray probesets from the gene level to the transcript level and protein level, one can achieve a more reliable interpretation of their experimental data, which may lead to discovery of more profound regulatory mechanism. Background Microarray technology was invented to rapidly profile the quantities of mRNA transcripts in a particular cellular context [1-3]. Its application has become universal in biomedical researches. Although it is mRNA that is actually detected by microarray experiments, and it is mRNA that has the direct relationship with protein, the methodology and algorithms for data analysis are commonly gene based. As evident in the probe annotation file provided by Affymetrix, gene-level annotation is widely accepted even though it fails to discriminate multiple mRNAs transcribed from the same gene. As a result, the analysis results are usually summarized at the gene level, such as differentially expressed genes [4-6] or gene-sets [7,8]. Even in the recent works that integrate protein-protein interaction data and microarray data [9-11], the analysis unit is reduced to the gene instead of mRNA. This practice could be attributed to the fact that most of the functional knowledge is at gene level instead of transcript level, and the functional differences between mRNA variants transcribed from the same gene are seldom clear. Although the gene-level analysis ignores the difference among mRNAs variants, this strategy is still biologically meaningful considering that the diversity of genes is much higher than that of transcripts encoded by the same gene. It has been well established that alternative splicing increases mRNA diversity, and over 60% of human genes are involved in this mechanism [12]. In addition, other RNA processing events, such as RNA editing, also account for the increased diversity at the mRNA level [13]. Since these events enable one gene to encode multiple proteins which might be functionally heterogeneous, we feel it necessary to separate transcript-level synonymous probesets from gene-level synonymous ones. The probesets that hybridize to more than one transcript variant of the same gene are referred to as gene-level synonymous probesets; while the ones that hybridize to a single variant are named as transcript-level synonymous probesets. It has been noticed that transcript-level synonymous probesets tend to have similar expression profiles, while gene-level synonymous probesets may have distinct expression profiles [14,15]. In fact, individual reports demonstrated that the expression of transcript variants could be totally different [16,17]. These phenomena indicate that the gene-level strategy of microarray data analysis is imprecise enough that one may overlook the expressional inconsistencies among gene-level synonymous probesets. As a matter of fact, Affymetrix suffixes their probeset ID according to the probeset's specificity. For example, probesets that recognize unique transcript variants are suffixed with _at, and probesets that recognize multiple alternative transcripts from a single gene are suffixed with _a_at or _s_at, and so on. This suffix system gives a hint on the varied specificities of the probesets, and could be considered as an endeavor trying to do away with customers' worry about the gene-level data analysis strategy. However, the correctness of the suffix system has been in doubt [14,18]. Therefore it is not reliable to perform transcript-level analysis on the basis of this imperfect suffix system. As the standard annotation files and most of the analysis algorithms are gene-oriented, analysts often average out the expression heterogeneity of the same gene when dealing with probeset level data [19]. In this paper, we linked the probesets of two widely used Affymetrix arrays with the International Protein Indexes (IPIs) [20,21] through proper association and rigorous alignment procedures, and demonstrated the statistically significant advantage of interpreting microarray data at the transcript-level or protein-level. Our results can be viewed as a more precise annotation of Affymetrix array's probesets, with which one may achieve a more reliable interpretation of their experimental data. Moreover, the application of this new annotation substantially increased the expression correlation between interacting proteins. Results Transcript-level or protein-level annotation of Affymetrix probesets Two Affymetrix arrays, MOE430A_2 and HG-U133A, were chosen for this study, in which 21,097 and 16,213 putative protein-coding probesets for the two arrays were subject to the alignment investigation (Table 1). Candidate probeset-mRNA relationships were compiled based on the probeset-gene mapping information specified in the standard Affymetrix annotation files and gene-mRNA mapping relationships provided by separate mRNA transcript sources. Rigorous blast procedure similar to that described in [22] were performed to filter these candidate probeset-mRNA relationships. Finally, the mRNA targets passing the filtering criteria were linked to protein IDs in the IPI database.
Through the rigorous association and alignment, we obtained precise annotations for 18,894 and 15,288 probesets in Affymetrix arrays MOE430A_2 and HG-U133A respectively (see Additional file 1). These annotations discriminate alternative mRNA variants transcribed from a same gene, thus are at transcript level as opposed to the standard gene-level annotation files provided by Affymetrix. It is worth noting that since the transcript data we used were quite redundant, a conceptual transcript variant may be represented by multiple redundant transcript accessions in the transcript database. In our transcript-level annotation file, each conceptual transcript is identified with one IPI ID, as we only investigated the probesets associated with protein-coding transcripts. Statistics on the non-control, investigated and annotated probesets, together with the number of involved genes and proteins, are shown in Table 1. It is evident that the proportion of genes covered by our annotated probesets to those covered by all non-control probesets ('gene retaining percentage' in Table 1), 95.2% for MOE430A_2 and 85.4% for HG-U133A, are higher than the corresponding probeset retaining percentages, 83.5% and 68.8%, indicating that the gene coverage has only been slightly reduced by our filtering procedures. This observation is in support of our primary goal, that is to refine gene-level probeset annotations to transcript-level, but not to simply remove the poor-quality gene-level annotations. Furthermore, we classified the probesets based on the way they linked to proteins. By checking the alignment results, the probesets in our annotation tables were divided into two groups, namely one-to-one and one-to-many. In our annotation table, a one-to-one probeset was linked to only a single protein, while a one-to-many probeset was linked to multiple proteins due to alternative splicing of one gene. The rest of the investigated probesets that were not linked to any protein were categorized into the third group of “one-to-null”. The statistics of these three groups are shown in Table 2. As we know, Affymetrix suffixes their probeset ID according to the probeset's specificity. The over ten types of probeset suffixes can be categorized into three groups: transcript-specific, with '_at', gene-specific, with '_a_at' or '_s_at', and other suboptimal probesets that may cross-hybridize with unrelated sequences. The first two groups are comparable to our one-to-one and one-to-many probesets respectively. However, we found that, out of a total of 21,097 and 16,213 investigated probesets in MOE430A_2 and HG-U133A respectively (Table 1), our one-to-one type accounts for only 53.7% and 43.3%, which were significantly lower than that of _at probesets, 74.0% and 72.6%, in the respective arrays. This suggests that some of _at probesets are not really specific for a transcript.
To further clarify the issue of probeset specificity, we grouped the investigated probesets with the varied Affymetrix suffixes as well as our own categorizing system (Table 3). It is evident that overall one-to-one and one-to-many take up the majority of the _at group and the _a_at/_s_at group respectively, but there are some disagreements between the two classifications. For example, the Affymetrix _at group has 30.1% one-to-many probesets and 11.7% one-to-null probesets, suggesting that these so-called 'transcript-specific' probesets do not have the expected high specificity, and that they should be treated with caution in data analysis. On the other hand, the Affymetrix _a_at/_s_at group contains 42.0% one-to-one probesets. We attributed this mainly to the trimming of the originally false transcripts during the update of sequence information, as they may lead to the absence of the transcript targets of some probesets. Similarly, there are a large number of one-to-one probesets in the so-called Affy others group, 1,206 out of 2,619 for MOE430A_2 and 476 out of 1,555 for HG-U133A. These probesets, however, might not be really one-to-one mapping to the identified transcript targets, as our strategy was based on the premise that most probesets were specific for certain genes, and our blast was limited to the candidate transcripts associated with the probesets, but not the entire transcript pool.
These subgroups of the investigated probesets with different Affymetrix suffixes indicate the imperfection of the Affymetrix suffix system, thus affirming the necessity of our transcript-level or protein-level probeset annotations. In fact, several other research groups have addressed the misleading nature of the Affymetrix suffix system and the imperfection of its standard annotation file, including some re-annotation works for array HG-U133 [14,18,19,23]. We will discuss these related works in detail in next section. Verification of the protein-level annotation and comparison with related annotations Since the probesets were linked to transcripts and proteins through rigorous association and alignment procedures, the expression profiles of transcript-level synonymous probesets were supposed to be more consistent than those of gene-level synonymous probesets [14,15]. This was taken as the basis for the evaluation of our annotations. We downloaded 30 expression datasets assayed with MOE430A_2 from the Gene Expression Omnibus database (GEO) [24,25], and calculated the Pearson correlation coefficients (PCCs) of expression profiles of synonymous probesets at gene level and transcript level. Since transcript-level synonymous probesets in our work recognized a single protein-coding transcript variant identified with a unique IPI ID, transcript-level synonymous probesets were also named as protein-level synonymous ones. Practically, we grouped gene-level synonymous probesets according to gene ID, and protein-level ones according to IPI ID. The probeset-protein mapping tables provided at NetAffx, the official protein-level annotation of Affymetrix probesets [26], was used as comparison. Probesets corresponding to a single protein according to NetAffx were designated as 'Affy-protein' level synonymous probesets, which was a third level of synonymy. Within each synonymous group, all pair-wise PCCs were calculated and then summarized to one value indicating the expression correlation of this group. The expression correlations of all synonymous groups at one level were then averaged into an overall value for a dataset (see Method section for details). The correlations of synonymous groups for 30 datasets at three different levels were depicted in parallel in Figure Figure1A.1A
These results support the argument that it is more reliable to interpret microarray data at transcript level than at gene level. The inferiority of the Affy-protein level annotation to our annotation could be attributed to the technical details in their alignment and association procedures [26]. First, they used the representative mRNA sequence ('consensus' or 'exemplar' sequence of each probeset), instead of the probe sequences themselves, as the query sequence in the alignment. Second, they aligned against the GenBank non-redundant protein database, rather than a comprehensive pool of mRNA sequences. In microarray experiment hybridization takes place between the probe sequences immobilized on the array and the cDNA sequences from the sample, so one can deduce that the alignment between the representative mRNA sequences and the protein sequences cannot precisely simulate the hybridization between probes and mRNAs. Finally, NetAffx filtered the blast results according to a cutoff of E-value, which indicates the likelihood of the observed alignment by chance [27]. Although frequently adopted for sequence homology analysis in closely related species, E-value is not sensitive enough to grade the many well aligned targets from the same species. In our practice, we took the probe sequences as the query and the mRNA sequences as the alignment targets, and adopted the matching nucleotide proportion as the filtering criterion (see Methods). A similar comparison was conducted for HG-U133A array, involving another transcript-level annotation by Harbig et al. [14]. Across the whole 28 datasets, the annotation by Harbig et al. showed advantage over Affy-protein-level annotation, while our transcript-level annotation performed best (p < 0.05 for 18 datasets under Student's t-test, see Figure Figure1B1B According to our data, the puzzling fact that probesets for one gene may show variable expression profiles can be clarified when viewing the data at the protein level. Such genes are likely to involve alternative transcript variants which are constitutively different in expression levels. We looked at two genes as examples. Shown in Figure Figure2A,2A
Shown in Figure Figure2B,2B Application of our new annotations the evaluation of the expression correlation between interacting proteins In recent years, considerable efforts have been devoted to identifying and characterizing protein-protein interaction (PPI). Besides investigations on the molecular events involved in PPI, functional annotation of an unclassified protein according to its interacting partners is also an important topic [28]. Since it is too bold to infer protein functions according to the "majority rule" that utilizes only the PPI network structure [29,30], many studies integrate other data sources into the functional characterization of PPI, among which the gene expression data is the favorite [9,31,32]. All these works assumed that interacting protein pairs were characterized with higher expression correlation than random ones. However, previous investigations indicated that the relationship between expression correlation and PPI was weak on a genomic scale [33-35] although a recent work strengthened the association by integrating cross-species conservation information [10]. We noticed that in these genome-scale studies PPI information was always first converted to gene pairs, after which the Pearson correlations of the probeset pairs corresponding to the gene pairs were evaluated. That is, the analysis targets were expanded from real interacting protein pairs to all possible cross-gene protein pairs for which interaction may not always exist. As illustrated in Figure Figure3,3
In order to validate our proposition, we investigated the relationship between protein interactions and expression correlation at both gene-level and protein-level perspectives, using 1,037 interacting protein pairs from the HPRD [36] and 28 microarray datasets assayed with HG-U133A from the GEO. As depicted in Figure Figure3,3 For all datasets, we observed strengthened expression correlations between interacting proteins under the PPI schema relative to the GGI schema. Taking the GDS987 dataset including 41 arrays as an example (Figure (Figure4A4A
In Figure Figure4A,4A Similar observations were made over all datasets (see Additional file 4 and Additional file 5). Given the results from all 28 datasets, we were also able to compare the PPI PCC calculation and GGI PCC calculation at a higher level. Under a FDR threshold of 0.1, the list of significant PPI pairs or GGI pairs from each dataset was determined. Based on these lists, we counted the significantly correlated PPIs or GGIs (termed 'significant PPIs' or 'significant GGIs') over each dataset, and the datasets on which a PPI or GGI demonstrates significant correlation (termed 'PPI-significant datasets' or 'GGI-significant datasets'). The former is a 'twenty-eight by two' table, shown in Table 4; the latter is a '1,037 by two' table, shown fully in Additional file 4 and partly in Table 5. In both summary tables, we find that the statistics of PPI PCC calculation are mostly larger than the counterparts of GGI PCC calculation, indicating that PPI PCC calculation can detect more correlated PPI pairs in a certain experiment setting (Table 4), and that it can detect the correlation tendency across more experiment settings (Table 5).
Discussion In this work, we re-annotated the probesets of two widely used Affymetrix arrays, MOE430A_2 and HG-U133A, via proper association and rigorous alignment procedures in a transcript perspective, and demonstrated the necessity and advantage of exploring microarray data at the transcript or protein level, instead of the conventional gene level. Although Affymetrix utilized the most complete information available at the time of array design, tremendous progress in genome sequencing and annotation in recent years renders existing probeset designs and target identifications suboptimal. In recent years, there have been continuous reports on systematic false expression signals of Affymetrix probesets [39], spurious expression correlation caused by cross hybridization [18], and expressional inconsistency among different microarray platforms or even different generations of one platform [40-43]. A few research groups performed probe-against-mRNA blast similar to ours [22,42,44], but mostly they centered around UniGene [45] and therefore improved the accuracy of annotation only at gene level. A major trend among these efforts was to redefine probesets so that probes matching the same molecular target were placed into custom probesets, as proposed by [19,23,39,42], but as the authors of [19] pointed out, 'these transcript-targeted probesets are not transcript-specific, as probesets targeting transcripts from the same gene may share many or even all probes'. Thus the probe re-organization strategy may be used to make distinction at the level of genes, but not at the level of transcripts or splice variants [18]. Besides, this strategy takes the probe-level intensity file (the CEL file) as a prerequisite, however only around half of the expression datasets deposited in public databases like GEO were found with CEL files. In order to make distinction precisely at the transcript level, we preserved the classical Affymetrix probesets, but distinguished them among their alternatively spliced transcript targets according to the consistent alignments of probes against up-to-date mRNA sequences. Our annotation table clearly divides the Affymetrix probesets into three groups with increased transcript-level specificity (reliability): one-to-null probesets that do not recognize any transcript, one-to-many probesets that hybridize to multiple alternative transcript variants of the intended gene, and one-to-one probesets that hybridize to unique alternative transcript variants of the intended gene. We discriminate the intended alternative transcript variants of Affymetrix probesets based on the NetAffx's gene-level annotation for the first time. Given the fact that existing solutions are accompanied with imperfections and no consensus has been reached on an overwhelming strategy, our alternative solution to the problematic standard annotation points out a new way to improve the interpretation and exploitation of Affymetrix microarray data. Although the transcript collections were not identical and the reannotation strategies differed more or less, we made out some similar discoveries to previous reports. For example, Harbig et al. found that a number of probesets did not detect any transcript and attributed this phenomenon to the elimination of the target sequence in the process of sequence update [14]. In our study, altogether 16.5% and 31.2% of non-control probesets in the MOE430A_2 and HG-U133A arrays were not found with any transcript targets in the pool of GenBank, RefSeq and Ensembl. Using newer and larger collection of transcript sequences, we even obtained a quite similar statistics of the percentage of 'multiple-targeting' probesets to that estimated in a foregoing study [18], specifically 54.6% for MOE430A_2 and 54.1% HG-U133A (see Table 2). The significant mutual agreement among the related researches justifies the necessity to set up an improved annotation mechanism of the Affymetrix probes in the face of the continual growth of genomic and transcriptomic knowledge, ideally at transcript or protein level. Over the past few years, the analysis of alternative splicing has emerged as an important new field in bioinformatics, and several recent large-scale studies have shown that alternative splicing can be analyzed in a high-throughput manner using DNA-microarray methods [46,47]. Most of these studies used arrays particularly manufactured for analyzing alternative splicing, such as genomic tiling array and exon-exon junction array. Constructed without any priori knowledge of the possible exon content of a genomic sequence, the genomic tiling array [48,49] is in principle capable of detecting novel alternative splicing events of diverse types, but it is in doubt whether their data will be readily interpretable as successful experiences remain insufficient [46]. On the other hand, although designed particularly to address the alternative splicing issue, exon and exon-exon junction arrays [49] were reported to be plagued by problematic probe specificity and unsatisfying hybridization efficiency because of the necessity of probe coverage across the full length of the gene (including 5' end) [5]. Many questions about the reproducibility of the amplification protocol, the quantitative accuracy, and the data analysis need to be addressed as a prerequisite to reliable quantitative analysis using these splicing arrays [50]. Given the current imperfection of splicing array techniques and inconvenience in deciphering their generated data, it is an economic way to do large-scale investigations of alternative transcribing events with standard gene expression arrays, provided that the recognizing targets of the probes can be rigorously defined at the transcript level. Hu et al. proposed a primitive analysis method to explore alternative splicing with Affymetrix 3' gene expression arrays, though they regretted that only alternative splicing biased toward the 3'end of the gene can be detected in their way [51]. In the present paper, we conducted a large-scale alignment of the probe sequences in traditional gene expression arrays against the currently most comprehensive collection of transcript sequences, highlighting the probesets mapping to unique alternative transcripts unambiguously. For each of the two Affymetrix expression arrays tested in this study, we found over 40% of all probesets could be mapped to transcripts in a one-to-one manner, so our work strongly validate that it is feasible to analyze alternative splicing using traditional gene expression arrays. While the foregoing work contributed by Hu et al. remains as a qualitative analysis method aiming at detecting novel alternative splicing events, our work gives explicitly the relationship between the probesets and the currently known alternative transcript variants, which can be immediately exploited to facilitate quantitative analysis of alternative variants. As our mapping relationships are defined for the standard probesets of the traditional gene expression arrays, they can be conveniently exploited as the standard NetAffx annotation information, without any ad hoc influence on the widely applied experiment protocols or the routine data processing algorithms. In the demonstrative implementations of the novel annotation tables, we actually observed several examples of negatively correlated alternative variants (see Figure Figure2B2B Conclusion To sum up, we re-annotated two popular Affymetrix gene expression arrays, MOE430_2 and HG-U133A, in a transcript-level perspective, aiming at identifying probesets' detecting targets precisely at the transcript level. Although previous works addressed similar issues [14,15,18,19,22,23], we are the first to rigorously link existing Affymetrix probesets to their specific transcript targets and their corresponding proteins. Armed with this new annotation, we re-examined a number of previous studies, 30 datasets for MOE430_2 and 28 datasets for HG-U133A from GEO, and revealed increased expression consistency among synonymous probesets and closer expression correlation among interacting proteins. This transcript-level annotation of Affymetrix probesets allows for a more reliable gene expression data analysis and a more accessible protein-level correlation study. Methods Sequences and related information of Affymetrix probesets The Affymetrix 3' eukaryotic gene expression analysis arrays MOE430A_2 and HG-U133A were selected for this study. Probe sequence files and corresponding annotation files, 'Mouse430A_2_annot_csv.zip' (annotated on 2005-12-19) and 'HG-U133A_annot_csv.zip' (annotated on 2006-04-11), were downloaded from Affymetrix website [52]. Also downloaded there were the NetAffx probeset-protein mapping files for MOE430A_2 (file 'Mouse430A_2_blast_csv', updated on 2005-12-18) and HG-U133A ('HG-U133A.na21.blast.csv.zip', updated 2006-04-11), which were the blast results of the representative mRNA sequence of probes against protein sequence databases [26]. Sources of mRNA transcripts GenBank: mRNA sequences from CoreNucleotide division of NCBI Nucleotide database were obtained via the Entrez Nucleotide search [53] on April 10th, 2006. For mouse, this dataset comprises 1,582,211 sequences, with 1,521,234 from DDBJ, 6,506 from EMBL and 54,471 from GenBank. For human, there are totally 201,206 sequences, with 41,128 from DDBJ, 62,701 from EMBL and 97,377 from GenBank. File 'gene2accession' (updated on 2006-03-28), downloaded from Entrez Gene [54], provides the mapping relationship between the CoreNucleotide sequences, Entrez Gene IDs, and protein sequence accessions. RefSeq: 55,832 mouse mRNA sequences and 40,530 human mRNA sequences were obtained from the RefSeq database [55]. Mapping relationships between RefSeq mRNA accessions, RefSeq protein accessions, and Entrez GeneIDs were extracted from related flat files 'mouse.rna.gbff.gz' and 'human.rna.gbff.gz', which were downloaded from RefSeq in April 2006. IPI entries and their mappings to external sequence accession numbers IPI entries and their mappings to external protein accession numbers were acquired from the International Protein Index (IPI) database [21] (release 3.17). Also obtained there were the mapping relations between IPI numbers and transcript IDs (GenBank, RefSeq, and Ensembl). The counterpart file for human was downloaded there too (release 3.16). Expression datasets from Gene Expression Omibus Microarray datasets were downloaded from the Gene Expression Ominibus on April 15, 2006. Array MOE430A_2, indexed as GPL339, was associated with 2,276 samples in GEO, ranking the second among all registered Affymetrix mouse arrays. All 31 GDS datasets profiled with MOE430A_2 were used in the analyses except for GDS1057, which contains only two samples. Array HG-U133A, indexed as GPL96, was associated with 8,698 samples in GEO, ranking the first among all registered Affymetrix human arrays. For our analysis, we downloaded 31 GDS datasets with largest sample sizes, and used 28 of them in our analyses, excluding GDS534, GDS1329, and GDS1324 as they are in a data format inconsistent with the others. Details about the used datasets can be found in our Additional file 8. PPI datasets from IntAct and HPRD An alternative annotation of the Affymetrix U133 Plus 2.0 array A recently proposed transcript-level annotation of the Affymetrix U133 plus 2.0 array was obtained from the H. Lee Moffitt cancer center and research institute [60], which was used for comparison with our transcript-level annotation of HG-U133A array. Generation of a new transcript-level annotation table for Affymetrix array Out of the total 22,690 and 22,283 probesets in arrays MOE430A_2 and HG-U133A, respectively, 64 and 68 control probesets were firstly removed. The remaining probesets were associated with genes according to the probeset-gene mapping information provided in Affymetrix's standard annotation file. The probeset-transcript mapping relationships were obtained based on the gene-mRNA mapping tables from GeneBank, RefSeq and Ensembl. In the process, we only included probesets that were identified with one Entrez Gene ID or one Ensembl gene ID. We ignored the probesets that were associated with multiple entities or no entity in the two gene-centric databases, since their gene-level specificity appears doubtful in the standard annotation file. This filtered out 3.2% and 5.2% of non-control probesets in MOE430A_2 and HG-U133A, respectively. For the rest of the probesets, we linked the candidate transcript targets to their corresponding protein entries in IPI database. IPI is currently the least redundant yet most complete protein database for featured species, with one protein sequence matching each transcript variant. Those probesets of which transcript targets do not have any protein counterparts were also excluded from the following blast validation in order to focus our attention to the transcripts with well-characterized functions at protein level. The remaining probesets, 21,097 for MOE430A_2 and 16,213 for HG-U133A, were selected for the BLAST procedure. We then filtered the candidate probeset-mRNA mapping relationships by aligning probe sequences in these probesets against their corresponding transcripts. Probes were blasted against their candidate mRNA targets using the bl2seq program [61], and the probe to transcript matches were accepted if no more than one mismatch was found. At the level of probesets, the probeset to transcript matches were accepted only if more than 90% of all probes within a probeset (over 10 probes for the typical 11-probeset) were mapped to the transcript in the same orientation. The probeset-transcript-protein links related to the above probesets passing BLAST filter were retrieved. After reducing the redundancy information of multiple transcripts corresponding to the same IPI, we finally obtained rigorous probeset annotation files for Affymetrix arrays MOE430A_2 and HG-U133A. There are two types of probesets in the new annotation file: one-to-one probesets, where one probeset maps to only one IPI ID; and one-to-many probesets, where one probeset maps to two or more IPI IDs. Only the one-to-one probesets were used in the subsequent analyses. Evaluating expression consistencies within synonymous probesets We grouped gene-level synonymous probesets according to gene ID (gene-level), and protein-level ones according to IPI ID (protein-level). Additionally, probesets corresponding to a single protein according to NetAffx probeset-protein mapping tables (see Materials) are grouped as 'Affy-protein' level synonymous probesets. In the case of the HG-U133A array, we included a fourth level, the 'Harbig-protein level', for comparison. The Harbig-protein level reflects the probeset-protein association transformed from a recent alternative annotation of the Affymetrix U133 plus 2.0 array [14], also proposed in a transcript-level perspective. The downloaded annotation file mapped 33,579 probesets to 287,791 GenBank mRNA accessions, among which 21,669 were found on HG-U133A array, mapped to 186,085 GenBank mRNA accessions. The probeset-mRNA associations related to HG-U133A array were further linked to IPI IDs, finally giving rise to 26,960 probeset-IPI mapping relationships among which 12,146 were one-to-one. The following treatments were the same as those implemented to the standard gene-level annotation, the NetAffx Protein annotations, and our novel annotations. 30 and 28 expression datasets were selected from GEO respectively for MOE430A_2 and HG-U133A, and the original intensity data within each GEO dataset (GDS) were transformed to log 2 base and normalized to a constant median across all samples. For a synonymous group, if the expression values of all probesets in all samples were no larger than the constant median value, the probesets in this group were regarded not moderately expressed, and their expression profiles not informative enough. Therefore, we only kept the synonymous groups with at least one expression value above the constant median value, similar to the filtering procedure used by Tian et al. [62]. For the remaining synonymous groups, Pearson correlation coefficients (PCCs) were calculated for the expression profiles of each probeset pair. The minimum value of these PCCs was taken as a measurement of the expression coherence of this group. We used the minimum aggregation because the gene level synonymous probesets gave rise to within-protein PCCs (which are theoretically higher) and across-protein PCCs (which are theoretically lower), and the former was identical to the result of the protein-level synonymous group. In such a setting, the maximum did not result in any difference, and the average aggregation was not as sensitive as the minimum in terms of differentiating the two groups, so we adopted the minimum aggregation. Finally, the mean of the expression coherences of all synonymous groups over a dataset was calculated. In this way we obtained an evaluation of expression consistencies within synonymous probesets for a microarray dataset, and may compare the expression consistencies at the three levels over different microarray datasets. Investigating expression correlations between interacting protein pairs Given protein-protein interaction data from HPRD or IntAct, we first transformed the binary relations of protein accessions to IPI-IPI pairs, and also got the corresponding Gene-Gene pairs. For each PPI, we assembled the PPI probeset pairs and the GGI probeset pairs as illustrated in Figure Figure3,3 The distributions of the PPI PCCs and the GGI PCCs were plotted in a same figure to show the contrast (Figure (Figure4A).4A Abbreviations BLAST: The Basic Local Alignment Search Tool GEO: Gene Expression Omnibus GGI: gene-gene interaction HG-U133A: Human Genome U133A Array HPRD: Human Protein Reference Database IPI: International Protein Index MOE430A_2: Mouse Genome 430A 2.0 Array PCC: Pearson correlation coefficient PPI: Protein-protein interaction Authors' contributions YYL, KT, and YXL conceived and designed the experiments. HY, FW, KT performed the experiments. HY, YYL, FW, and LX analyzed data. YYL, HY, and LX wrote the manuscript. All authors read and approved the final manuscript. Additional file 1 Novel annotation tables. Novel annotation files at transcript level, for MOE430A_2 and HG-U133A arrays. Click here for file(1.3M, rar) Additional file 2 Expression correlations of synonymous probesets. Expression consistency within synonymous probeset groups for 30 (assayed with MOE430A_2) and 28 (assayed with HG-U133A array) GEO datasets, with synonymity defined at different levels: gene level, Affy-protein level, and our protein level. For HG-U133A array, a recent related work (Harbig et al. 2005 NAR) was also adopted for comparison. Click here for file(40K, xls) Additional file 3 Pearson correlation coefficients (PCCs) calculated on dataset GDS987. Pearson correlation coefficients (PCCs) of PPI pairs and corresponding GGI pairs. Calculated for 1037 HPRD protein-protein interactions using the GEO dataset GDS987. Click here for file(379K, xls) Additional file 4 Expression correlation of PPI and GGI pairs from HPRD. PPI and GGI Pearson correlation coefficients for 1037 HPRD PPIs. Click here for file(6.9M, xls) Additional file 5 Summary of expression correlations of 1037 HPRD PPI and GGI pairs. 1) Distribution plots for PPI and GGI Pearson correlation coefficients (PCCs), one for a GDS dataset. PCCs of random pairs form the background distribution. 2) Scatter plots of PPI and GGI PCCs. Click here for file(693K, rar) Additional file 6 Expression correlation of PPI and GGI pairs from IntAct. PPI and GGI Pearson correlation coefficients for 274 IntAct PPIs. Click here for file(1.1M, xls) Additional file 7 Summary of expression correlations of 274 IntAct PPI and GGI pairs. 1) Distribution plots for PPI and GGI Pearson correlation coefficients (PCCs), one for a GDS dataset. PCCs of random pairs form the background distribution. 2) Scatter plots of PPI and GGI PCCs. Click here for file(580K, rar) Additional file 8 Information on used GDS datasets. Accession numbers of GEO datasets used in the analysis. Click here for file(20K, xls) Acknowledgements We would like to thank Prof. James Scott for the insightful discussion when conceiving the project and Dr. Wei-Zhong He for editorial helps. We also wish to thank the reviewers for their helpful criticisms and suggestions for improving the manuscript. This work was supported in part by grants from the Shanghai Morning-Star Program (Type A) (06QA14037), the Shanghai Fundamental Research Program (05DJ14009), the Pujiang Promising Scientist Program (Class A, 06PJ14073), the Shanghai Human Brain Disease Proteome Research (04DZ14005), and the National "973" Basic Research Program (2006CB0D1203, 2006CB0D1205). References
|
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
|||||||||||||||||||||
Nat Biotechnol. 1998 Jan; 16(1):40-4.
[Nat Biotechnol. 1998]Annu Rev Biochem. 2005; 74():53-82.
[Annu Rev Biochem. 2005]J Comput Biol. 2001; 8(1):37-52.
[J Comput Biol. 2001]Bioinformatics. 2005 Apr 1; 21(7):1084-93.
[Bioinformatics. 2005]Proc Natl Acad Sci U S A. 2005 Oct 25; 102(43):15545-50.
[Proc Natl Acad Sci U S A. 2005]Genome Biol. 2002 Oct 23; 3(11):reviews0008.
[Genome Biol. 2002]EMBO Rep. 2006 Mar; 7(3):303-7.
[EMBO Rep. 2006]Nucleic Acids Res. 2005 Feb 18; 33(3):e31.
[Nucleic Acids Res. 2005]Bioinformatics. 2005 May 15; 21(10):2552-3.
[Bioinformatics. 2005]Endocrinology. 1992 Mar; 130(3):1108-14.
[Endocrinology. 1992]Nucleic Acids Res. 2005 Feb 18; 33(3):e31.
[Nucleic Acids Res. 2005]BMC Bioinformatics. 2006 Jun 2; 7():276.
[BMC Bioinformatics. 2006]Nucleic Acids Res. 2005 Nov 10; 33(20):e175.
[Nucleic Acids Res. 2005]Proteomics. 2004 Jul; 4(7):1985-8.
[Proteomics. 2004]Bioinformatics. 2004 Jun 12; 20(9):1457-8.
[Bioinformatics. 2004]Nucleic Acids Res. 2005 Feb 18; 33(3):e31.
[Nucleic Acids Res. 2005]BMC Bioinformatics. 2006 Jun 2; 7():276.
[BMC Bioinformatics. 2006]Nucleic Acids Res. 2005 Nov 10; 33(20):e175.
[Nucleic Acids Res. 2005]BMC Bioinformatics. 2004 Aug 14; 5():111.
[BMC Bioinformatics. 2004]Nucleic Acids Res. 2005 Feb 18; 33(3):e31.
[Nucleic Acids Res. 2005]Bioinformatics. 2005 May 15; 21(10):2552-3.
[Bioinformatics. 2005]Nucleic Acids Res. 2002 Jan 1; 30(1):207-10.
[Nucleic Acids Res. 2002]Nucleic Acids Res. 2003 Jan 1; 31(1):82-6.
[Nucleic Acids Res. 2003]Nucleic Acids Res. 2003 Jan 1; 31(1):82-6.
[Nucleic Acids Res. 2003]Science. 2004 Mar 12; 303(5664):1666-9.
[Science. 2004]Nucleic Acids Res. 2005 Feb 18; 33(3):e31.
[Nucleic Acids Res. 2005]Endocrinology. 1992 Mar; 130(3):1108-14.
[Endocrinology. 1992]Nat Biotechnol. 2000 Dec; 18(12):1257-61.
[Nat Biotechnol. 2000]Nat Biotechnol. 2003 Jun; 21(6):697-700.
[Nat Biotechnol. 2003]Nucleic Acids Res. 2004; 32(21):6414-24.
[Nucleic Acids Res. 2004]J Biotechnol. 2006 Jul 25; 124(3):475-85.
[J Biotechnol. 2006]Bioinformatics. 2003; 19 Suppl 1():i197-204.
[Bioinformatics. 2003]Genome Res. 2003 Oct; 13(10):2363-71.
[Genome Res. 2003]Bioinformatics. 2004 Jul 22; 20(11):1737-45.
[Bioinformatics. 2004]Bioinformatics. 2005 Jun 1; 21(11):2730-8.
[Bioinformatics. 2005]Nucleic Acids Res. 2004 Jan 1; 32(Database issue):D452-5.
[Nucleic Acids Res. 2004]Genomics. 2005 Mar; 85(3):297-308.
[Genomics. 2005]BMC Bioinformatics. 2006 Jun 2; 7():276.
[BMC Bioinformatics. 2006]Bioinformatics. 2002 Mar; 18(3):405-12.
[Bioinformatics. 2002]BMC Bioinformatics. 2002 Aug 23; 3():22.
[BMC Bioinformatics. 2002]Bioinformatics. 2004 Jun 12; 20(9):1457-8.
[Bioinformatics. 2004]Nucleic Acids Res. 2005 Feb 18; 33(3):e31.
[Nucleic Acids Res. 2005]BMC Bioinformatics. 2006 Jun 2; 7():276.
[BMC Bioinformatics. 2006]Nat Genet. 2004 Dec; 36(12):1306-11.
[Nat Genet. 2004]Brief Bioinform. 2005 Mar; 6(1):23-33.
[Brief Bioinform. 2005]Nature. 2001 Feb 15; 409(6822):922-7.
[Nature. 2001]Genome Res. 2004 Mar; 14(3):331-42.
[Genome Res. 2004]Nucleic Acids Res. 2004 Dec 14; 32(22):e180.
[Nucleic Acids Res. 2004]Nucleic Acids Res. 2005 Feb 18; 33(3):e31.
[Nucleic Acids Res. 2005]Bioinformatics. 2005 May 15; 21(10):2552-3.
[Bioinformatics. 2005]BMC Bioinformatics. 2006 Jun 2; 7():276.
[BMC Bioinformatics. 2006]Nucleic Acids Res. 2005 Nov 10; 33(20):e175.
[Nucleic Acids Res. 2005]Bioinformatics. 2004 Jun 12; 20(9):1457-8.
[Bioinformatics. 2004]Nucleic Acids Res. 2003 Jan 1; 31(1):82-6.
[Nucleic Acids Res. 2003]Genome Res. 2004 May; 14(5):925-8.
[Genome Res. 2004]Nucleic Acids Res. 1997 Sep 1; 25(17):3389-402.
[Nucleic Acids Res. 1997]Nucleic Acids Res. 2005 Feb 18; 33(3):e31.
[Nucleic Acids Res. 2005]Proc Natl Acad Sci U S A. 2005 Sep 20; 102(38):13544-9.
[Proc Natl Acad Sci U S A. 2005]