Plant Physiol. Jun 2010; 153(2): 642–651.
Published online Apr 13, 2010. doi:  10.1104/pp.109.152553
PMCID: PMC2879776

Robin: An Intuitive Wizard Application for R-Based Expression Microarray Quality Assessment and Analysis

Abstract

The wide application of high-throughput transcriptomics using microarrays has generated a plethora of technical platforms, data repositories, and sophisticated statistical analysis methods, leaving the individual scientist with the problem of choosing the appropriate approach to address a biological question. Several software applications that provide a rich environment for microarray analysis and data storage are available (e.g. GeneSpring, EMMA2), but these are mostly commercial or require an advanced informatics infrastructure. There is a need for a noncommercial, easy-to-use graphical application that aids the lab researcher to find the proper method to analyze microarray data, without this requiring expert understanding of the complex underlying statistics, or programming skills. We have developed Robin, a Java-based graphical wizard application that harnesses the advanced statistical analysis functions of the R/BioConductor project. Robin implements streamlined workflows that guide the user through all steps of two-color, single-color, or Affymetrix microarray analysis. It provides functions for thorough quality assessment of the data and automatically generates warnings to notify the user of potential outliers, low-quality chips, or low statistical power. The results are generated in a standard format that allows ready use with both specialized analysis tools like MapMan and PageMan and generic spreadsheet applications. To further improve user friendliness, Robin includes both integrated help and comprehensive external documentation. To demonstrate the statistical power and ease of use of the workflows in Robin, we present a case study in which we apply Robin to analyze a two-color microarray experiment comparing gene expression in tomato (Solanum lycopersicum) leaves, flowers, and roots.

Since the first microarray experiments were performed in the 1990s (Schena et al., 1995) a lot of effort has been put into the development of this technique as well as into approaches for the correct analysis of the resulting data. Widespread use of the various array technologies has been accompanied by the development of many sophisticated statistical methods to process the raw data, and to analyze the results to infer new biological insights (Sreenivasulu et al., 2006; Usadel et al., 2008; Winfield et al., 2009; Zanor et al., 2009; and see below). The wealth of data and methods leaves the individual researcher with the problem of choosing the correct strategy since it is not directly obvious to the inexperienced user which approach is suitable for a given experimental design. Furthermore, the wide application and technical improvement of microarrays has also resulted in the establishment of large publicly accessible expression data repositories such as Gene Expression Omnibus, AtGenExpress, or Genevestigator (Schmid et al., 2005; Barrett et al., 2007). Data mining of these and other public collections is facilitated by descriptive meta data that is attached to the expression data (MIAME and MIAME/Plant [Brazma et al., 2001; Zimmermann et al., 2006]; XEML [Hannemann et al., 2009]). However, choosing the correct approach to statistically (re)analyze such data also inevitably requires expertise in statistics.

One of the most advanced tools for the analysis of high-throughput experimental data is the statistics environment R. This open source project is constantly being developed and refined by leading statisticians (R Development Core Team, 2009). Together with the R packages provided by the BioConductor project (Gentleman et al., 2004), R provides a powerful, yet flexible, platform for microarray data analysis and quality assessment. The major disadvantage of R/BioConductor-based data analysis, however, is its general lack of an intuitive graphical user interface (GUI). Most of the functionality of R can only be accessed via a text console. This represents a considerable obstacle for many biologists, who are inexperienced in the use of such interfaces. Furthermore, full use of the power of R/BioConductor-based data analysis requires programming skills.

Although several GUI applications have been developed that allow analysis of microarray data generated by different technical platforms, these are often commercial (GeneSpring, GeneMaths XT, GeneSifter, etc.), not very intuitive (limmaGUI, affylmGUI; Wettenhall and Smyth, 2004; Wettenhall et al., 2006), not available on all computing platforms (PreP+07; Martin-Requena et al., 2009), or are Web-based solutions that would either require uploading of potentially sensitive, unpublished data or laborious local installation such as CARMAWEB, EMMA 2, and RACE (Psarros et al., 2005; Rainer et al., 2006; Dondrup et al., 2009). Although packages like the TM4 suite (Saeed et al., 2003) or MayDay (Dietzsch et al., 2006) provide a collection of excellent tools for microarray analysis, they do not offer a consistent, workflow-oriented interface to the user due to their multiprogram (TM4) or plugin-based (MayDay) structure. Additionally, the TM4 suite does not provide support for single-color chip platforms like Affymetrix GeneChips without further adaptation.

To address the need for a free, user-friendly, and instructive open source tool for microarray analysis, we have developed Robin. Robin provides a Java-based GUI to up-to-date R/BioConductor functions for the analysis of both two-color and single-channel (Affymetrix GeneChip) microarrays and implements wizard-like workflows that guide the user through all steps of the analysis including quality assessment, evaluation, and experiment design. Robin assists the user in the interpretation of the results by automatically issuing warnings if quality-check parameters exceed or undercut conservatively chosen threshold values, or statistical analysis indicates problems like insufficient input data. During the whole workflow the major attention is placed on simplicity and intuitiveness of the GUI. Advanced options to modify the parameters of the analysis functions are, by default, hidden from the user. Naturally, more experienced users have the possibility to activate an expert mode, which allows them to adjust the settings to meet their individual needs, and even review and modify the R scripts before they are executed by the embedded R engine. The generated output includes informative plots visualizing the quality-check and statistical results, the R scripts that have been automatically generated from the users’ input, and a complete statistical analysis of the response of gene expression in a form that can directly be imported into common spreadsheet applications, and meta-analysis tools like MapMan for visualization. A detailed user's manual including step-by-step walk-throughs for the different analysis workflows implemented in Robin, examples for all types of quality checks, and comprehensive explanations of the statistical settings are available online (http://mapman.gabipd.org/web/guest/tutorials-manuals-etc; Supplemental Material S2). 
To support users beyond the manual and to provide a platform for discussion on improvements and special-use cases, we set up a discussion forum for Robin (please visit http://mapman.gabipd.org/web/guest/forum).

RESULTS AND DISCUSSION

Robin implements standardized workflows for the analysis of common microarray experiment designs, including common reference and direct design two-color experiments and simple multifactorial designs in which more than one experimental condition is being varied. Robin is not restricted to plant microarrays but can be used to analyze data generated on most two-color and non-Affymetrix single-channel microarray platforms. It also supports all Affymetrix GeneChip arrays that are included in the BioConductor project (for an up-to-date list of supported Affymetrix chips, please see http://www.bioconductor.org/packages/release/data/annotation/).

Installation and Scope

Robin is available as a stand-alone installer package including an embedded minimal R engine (plus the required packages) for Microsoft Windows (XP or higher) and Mac OS X (version 10.5 or higher) from http://mapman.gabipd.org/web/guest/robin-download. Installing these packages will leave an existing installation of R on the target system untouched. For all other systems that support Java and R, such as Linux, a lightweight package that can incorporate and configure an existing R installation for usage with Robin is available. Currently, Robin is released under the terms of the GNU Lesser General Public License version 3.0 and hence is free open source software. It will stay freely available for academic users in the future. The source code is distributed as part of the installation package and can optionally be installed alongside the program. Interested developers are free to inspect and reuse the source code, if desired.

Importing Raw Data

The user can choose between three separate workflows, specialized for Affymetrix GeneChip, for generic single-channel (e.g. Agilent), and for two-color microarray data normalization and analysis. Importing Affymetrix GeneChip data is very simple and just requires the user to pick the raw data files that will be included in the analysis. Since the Affymetrix CEL data format is uniform and does not require further processing or configuration, the user can directly proceed to the quality assessment step. Due to the various file formats in use for non-Affymetrix microarray data, special care has been taken to provide a versatile import wizard that assists the user in the import of arbitrary tabular single- and two-color data. The only restriction imposed is that the data has to be in tabular text format.

The user chooses the chip grid layout from a list of predefined layouts, or enters a custom layout. For convenience, the layouts of several common plant microarrays such as TOM1, TOM2, Medicago16K, and Pisum6k (Alba et al., 2004; Hohnjec et al., 2005; Thompson et al., 2005; CGEP [Cornell University, Ithaca, NY]) are bundled with Robin as layout presets. All settings of the import wizard interface can be saved as an input data preset to speed up loading of similar data. During the import, Robin tries to automatically separate header information from the tabular data section in the input file and asks the user to specify which columns contain the fields required for analysis (i.e. red channel foreground and background, green channel foreground and background intensities, and a unique identifier for each measured signal). When importing single- and two-color data, Robin tries to determine whether the chip layout comprises probes spotted in duplicates. After importing the data, the user is asked to define the targets table by entering the different RNA samples and specifying which sample has been labeled with which dye on each chip. For subsequent analysis, a reference sample must be specified. In very simple experiments that only comprise replicate chips of two different treatments (possibly including dye swaps), Robin uses the first entered sample as reference by default. If data conforming to a common reference design was entered, Robin automatically detects the common reference sample and prompts the user in case this sample was not set as reference. During this step, Robin also analyzes the input and tries to make sure that the data is consistent, for example by verifying that the samples are not disconnected. Import of Affymetrix single-channel data does not cause such problems, since the data format is uniform and it is not necessary to define a targets table.
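The consistency check mentioned above (verifying that the samples are not disconnected) can be pictured as a graph connectivity test: each RNA sample is a node, and each two-color chip is an edge joining the two samples hybridized on it. The following is a minimal Python sketch of that idea, not Robin's actual Java/R implementation; the function name and the data layout are assumptions made for illustration.

```python
from collections import defaultdict


def design_is_connected(chips):
    """Return True if every sample can be reached from every other sample.

    `chips` is a list of (green_sample, red_sample) pairs, one per chip.
    This layout is hypothetical and chosen only for the example.
    """
    graph = defaultdict(set)
    for green, red in chips:
        graph[green].add(red)
        graph[red].add(green)
    if not graph:
        return True
    # Depth-first traversal starting from an arbitrary sample.
    start = next(iter(graph))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == len(graph)


# A direct design linking three tissues, and a design with two unlinked pairs.
direct_design = [("leaf", "root"), ("root", "flower"), ("flower", "leaf")]
broken_design = [("leaf", "root"), ("flower", "seed")]
```

In the second design no chain of chips links leaf or root to flower or seed, so a comparison between, say, leaf and flower could not be estimated from the data.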

Quality Assessment

After importing the chip data, a variety of quality assessment methods (Fig. 1) can be run to allow the user to get an overview of the quality of input data and subsequently exclude chips that show strong technical artifacts individually. The various quality assessment methods can be freely chosen and combined as required. For ease of use, robust standards are preselected for the normalization, P-value correction, and statistical analysis that yield reliable results in most cases. However, the expert user can choose which normalization, P-value correction, and statistical analysis approach (linear model or rank product based) to use. These more advanced settings are not displayed by default, but advanced users can take control of analysis parameters and modify them according to their needs.

Figure 1.
A, Screenshot of the quality assessment functions available for Affymetrix (R) chips. All methods can be freely combined to obtain an overview of the input data quality. Short in-line explanations for each method are displayed in the info field on the ...

To support the user in the evaluation of quality assessment results, warnings are issued automatically if quality measures of individual chips exceed conservatively chosen threshold values (see “Materials and Methods” section for details). Specifically, methods available for quality assessment of single-channel data are (1) RNA degradation analysis, (2) box plots, (3) density plots of raw probe signal intensities, (4) pseudo images of probe level model (PLM) residuals, (5) scatter plots of the average probe intensity (A) against the logarithmic fold change in expression (M; MA plots), (6) scatter plots comparing all possible combinations of two individual chips, (7) visualization of principal component analysis and hierarchical clustering of the normalized expression values, (8) box plots showing the normalized unscaled standard errors and relative logarithmic expression of the PLMs, and (9) false color images of the background signal intensity for non-Affymetrix arrays (Supplemental Fig. S1).

PLM-based methods are available for Affymetrix arrays only, while the other functions can also be run on generic single-channel chips. Methods available for two-color chip quality assessment are (1) image plots visualizing the chip background signal intensities, (2) density plots of the probe intensity distribution before and after normalization, (3) MA plots of raw and normalized data for each chip, and (4) image plots showing the M value for each probe color coded on a pseudo chip (Supplemental Figs. S1 and S6).

All of the above-mentioned quality checks have been implemented in R using functions provided by the Bioconductor packages affy, affyPLM, affycoretools, simpleaffy, gcrma, plier, limma, marray, and RankProd (Wang et al., 2002; Bolstad, 2004; Gautier et al., 2004; Smyth, 2004; Wu et al., 2004; Affymetrix, 2005; Wilson and Miller, 2005; Hong et al., 2006; J.W.MacDonald, unpublished data). Some functions were modified to enhance the visual output. Depending on the type of input data the user can choose between different analysis approaches: In the case of single-channel data, linear model-based (limma) or rank product-based (RankProd) analysis is available. Two-color data will always be analyzed using limma functions. Quality analysis results will be summarized in a scrollable list showing clickable thumbnail images of the quality analysis plots. Individual chips showing warnings may be manually excluded from the analysis to prevent them from introducing technical bias in the subsequent assessment of differential gene expression.

Experiment Design

When working with Affymetrix data, depending on the statistical analysis strategy chosen, the user can define two (when using rank product) to any number (using limma) of groups of replicates, and assign the imported data files accordingly. Unique labels identifying the groups have to be chosen; these labels will be used later on when defining the contrasts of interest. Robin will generate a warning if groups contain fewer than three replicates, because too few data points reduce the reliability of the analysis of differential expression. It should be noted that in this build of Robin, all replicate experiments are treated as true biological replicates. Entering data that is only technically replicated as an independent replicate will lead to an overestimation of significance when analyzing differential gene expression. However, given the reliability of modern microarrays, technical replicates are in most cases no longer necessary.
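The replicate warning described above amounts to a simple count per group. A minimal Python sketch follows, assuming a hypothetical mapping from group labels to assigned data files; the function name and file names are illustrative and are not taken from Robin.

```python
def small_group_warnings(assignments, min_replicates=3):
    """Return the labels of groups with fewer than `min_replicates` files.

    `assignments` maps a group label to the list of data files assigned to it.
    """
    return sorted(
        label
        for label, files in assignments.items()
        if len(files) < min_replicates
    )


# Hypothetical file names, for illustration only.
assignments = {
    "wildtype": ["wt_1.CEL", "wt_2.CEL", "wt_3.CEL"],
    "mutant": ["mut_1.CEL", "mut_2.CEL"],
}
flagged = small_group_warnings(assignments)
```

Here only the mutant group, with two files, would trigger a warning.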

Subsequently, the replicate groups are depicted as draggable boxes on the graphical designer panel. This allows the user to visually lay out comparisons of interest between the groups. To achieve this, one simply has to draw an arrow by control-click-dragging from one box to a second box, e.g. from wildtype to mutant as shown in Figure 1. Robin interprets this operation as the comparison wildtype minus mutant. If more than one experimental condition is being varied, the difference of differences can be extracted using so-called interaction terms. These can be defined by creating meta groups and drawing arrows between them (Fig. 1). Specifically, the operation performed on the meta groups shown in Figure 1 will be interpreted as the interaction term (wildtype minus wildtype stressed) versus (mutant minus mutant stressed) and will extract those genes that respond to stress differently in mutant and wild type.
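The comparisons described above can be written as coefficient vectors over the group means. The Python sketch below illustrates the arithmetic only; it is not the code Robin generates (which is R, using limma), and the helper names, group labels, and expression values are assumptions chosen for the example.

```python
def contrast(groups, plus, minus):
    """Coefficient vector for the comparison `plus` minus `minus`."""
    return [(g == plus) - (g == minus) for g in groups]


def interaction(groups, first, second):
    """Difference of differences: (first[0] - first[1]) - (second[0] - second[1])."""
    a = contrast(groups, *first)
    b = contrast(groups, *second)
    return [x - y for x, y in zip(a, b)]


def apply_contrast(coefficients, means):
    """Evaluate a contrast on a list of group means (same order as the groups)."""
    return sum(c * m for c, m in zip(coefficients, means))


groups = ["wt", "wt_stressed", "mut", "mut_stressed"]

# Arrow drawn from wildtype to mutant: wildtype minus mutant.
wt_vs_mut = contrast(groups, "wt", "mut")

# Arrow between meta groups: (wt - wt_stressed) versus (mut - mut_stressed).
stress_by_genotype = interaction(
    groups, ("wt", "wt_stressed"), ("mut", "mut_stressed")
)

# Made-up log2 expression means of one gene in the four groups.
means = [5.0, 7.0, 5.0, 5.5]
effect = apply_contrast(stress_by_genotype, means)
```

With these made-up means the gene rises by 2 log2 units under stress in the wild type but only by 0.5 in the mutant, so the interaction term is (5.0 − 7.0) − (5.0 − 5.5) = −1.5, whereas the plain wildtype-minus-mutant contrast is 0.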

The expert settings box included on the experiment designer panel again allows advanced users to change all relevant parameters of the statistical analysis, like P value and minimal log2 fold-change cutoff, correction method for multiple testing, normalization (although it is not recommended to use different normalization methods for quality control and main analysis), and the statistical strategy for multiple testing across contrasts. Additionally, expert users can choose to review the R script that is generated from the inputs before it is sent to the R engine and include custom code or use Robin to quickly and comfortably generate skeletons of analysis scripts that can then be used as starting points for more sophisticated customized analyses.

ANALYSIS AND RESULTS

The statistical methods Robin employs to identify differentially expressed genes are based on two different approaches: linear modeling (limma; Smyth, 2004) and rank product-based analysis (RankProd; Breitling et al., 2004; Hong et al., 2006). When analyzing Affymetrix data, the user can choose between these two options, with the restriction that rank product-based inference of differential expression is only available when two groups are to be compared. The two methods differ in the approach they take to the detection of differentially expressed genes. While the linear model-based method relies on advanced statistical modeling and Bayesian inference, the rank product approach has a closer resemblance to biological reasoning on the data. For further details on the statistical methods, please refer to Smyth (2004), Breitling et al. (2004), Hong et al. (2006), and the Robin Users’ Guide available online (http://mapman.gabipd.org/web/guest/tutorials-manuals-etc). Since rank product-based analysis is limited to comparing two experimental conditions, the linear model-based analysis offers far more options and flexibility with respect to the available settings and design of the experiment (e.g. if two factors, like genotype and treatment, are being varied in an experiment and the user is interested in the interaction effect).
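To give a feel for the rank product idea, the following Python sketch ranks genes by fold change within each replicate comparison and takes the geometric mean of those ranks per gene, which is the core statistic described by Breitling et al. (2004). It is a simplified illustration under stated assumptions: ties are not handled, the permutation-based significance estimate of the RankProd package is omitted, and the gene names and values are made up.

```python
from math import prod


def rank_products(fold_changes):
    """Geometric mean of per-replicate ranks for each gene.

    `fold_changes` maps a gene identifier to a list of log2 fold changes,
    one per replicate comparison. Rank 1 is the strongest up-regulation.
    """
    genes = list(fold_changes)
    n_reps = len(next(iter(fold_changes.values())))
    ranks = {gene: [] for gene in genes}
    for i in range(n_reps):
        ordered = sorted(genes, key=lambda g: fold_changes[g][i], reverse=True)
        for rank, gene in enumerate(ordered, start=1):
            ranks[gene].append(rank)
    return {gene: prod(r) ** (1.0 / n_reps) for gene, r in ranks.items()}


# Made-up log2 fold changes for three genes in two replicate comparisons.
example = {
    "gene_A": [2.0, 1.8],
    "gene_B": [0.1, 0.3],
    "gene_C": [-1.0, 0.5],
}
rp = rank_products(example)
```

A gene that is consistently at the top of every replicate list, such as gene_A here, obtains the smallest rank product, which matches the intuition of looking for genes that repeatedly appear among the most strongly changed.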

After collecting all necessary information from the user, Robin generates an R script that is subsequently executed by the embedded R engine. The script produces a comprehensive set of output files that are organized in a folder structure. The results include several informative plots summarizing the statistical analysis: MA plots are created for each comparison, in which the genes that are called as significantly differentially expressed are highlighted in red (Supplemental Fig. S2). If fewer than five comparisons are defined, Robin generates Venn diagrams visualizing the number of genes responding differentially and the overlap of response between contrasts (Fig. 2). Dendrograms showing the hierarchical clustering of the data based on Pearson correlation of expression, and scatter plots of principal component analysis provide an overview of the internal structure of the data. Robin automatically saves several tables containing the complete statistical analysis for all the genes, and for the top 100 differentially expressed genes for each comparison made. Summary tables that are formatted for direct import and visualization in the meta-analysis tools MapMan and PageMan (Usadel et al., 2005, 2006) allow Robin to be easily integrated with downstream analyses. These files list the log2 fold change in expression for each gene in each comparison, plus a flag denoting the results of the statistical testing (0 = not significantly regulated, 1 = significantly up-regulated, −1 = significantly down-regulated). These flags can be used for convenient filtering in MapMan (see Usadel et al., 2009 for further details). Of course, thanks to the simple tabular data format, the result files can also be easily imported into network analysis tools like Cytoscape (Shannon et al., 2003). For Affymetrix data, present and absent calls are calculated using the mas5calls implementation provided by the affy BioConductor package (Gautier et al., 2004).
All plots generated in the quality analyses, processed input files, the generated R source code, and a short text file summarizing the analysis are written to the output folder to completely document the analysis workflow and ensure reproducibility of the results.
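The flag column of the summary tables can be illustrated with a short Python sketch. The cutoff values, column names, and gene identifiers below are assumptions for the example; the text only states that a P-value cutoff and a minimal log2 fold-change cutoff exist and can be adjusted, and the exact layout of the files Robin writes may differ.

```python
def regulation_flag(log2_fc, adj_p, p_cutoff=0.05, min_log2_fc=1.0):
    """Return 1 (up), -1 (down), or 0 (not significantly regulated).

    Both cutoffs are illustrative defaults, not values taken from Robin.
    """
    significant = adj_p < p_cutoff and abs(log2_fc) >= min_log2_fc
    if not significant:
        return 0
    return 1 if log2_fc > 0 else -1


def summary_rows(results):
    """Format (identifier, log2 fold change, adjusted P) tuples as tab-separated lines."""
    lines = ["identifier\tlog2FC\tflag"]
    for gene_id, log2_fc, adj_p in results:
        flag = regulation_flag(log2_fc, adj_p)
        lines.append(f"{gene_id}\t{log2_fc:.3f}\t{flag}")
    return lines


# Made-up results for four genes in one comparison.
results = [
    ("gene_A", 2.4, 0.001),
    ("gene_B", -1.7, 0.01),
    ("gene_C", 0.3, 0.2),
    ("gene_D", 3.0, 0.3),
]
table = summary_rows(results)
```

Note that gene_D receives a flag of 0 despite its large fold change, because its adjusted P value does not pass the cutoff.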

Figure 2.
Venn diagram showing the numbers of genes called significantly differentially expressed when comparing tomato leaf, flower, and root tissue. The numbers include both up- and down-regulated genes. Genes that are differentially regulated in more than one ...

Case Study—Comparison of Tomato Tissues

Robin was used to analyze a data set generated by analyzing gene expression in tomato flowers, roots, and leaves, using TOM2 microarrays in a two-color microarray experiment setup (see the “Materials and Methods” section for details). Quality assessment showed that there were no obvious or severe technical artifacts visible on the chips when investigating the background intensity images and the signal intensity distribution plots (Supplemental Fig. S6). Warnings were generated for all MA plots of the individual chips because of a slightly elevated percentage (between 10.141% and 13.43%) of genes that showed a greater than 2-fold change in expression.

These warnings are based on the assumption that most of the genes will not show differential expression in any given experiment, and are automatically issued if the percentage exceeds 5%. However, when comparing very different tissue types, as is the case in the experiment described in this study, larger differences in gene expression may be expected. Nevertheless, having high percentages of differentially expressed genes runs counter to the initial assumption that most of the genes are not responding, and since the normalization procedure is based on this assumption, normalization might fail. Another reason might be an overestimation of expression values due to an elevated signal-to-noise ratio. As often observed in two-color microarray experiments, the raw signal intensities differ in the red and green channel (Supplemental Fig. S6). This technical bias can largely be eliminated by using the standard background subtraction and scaling normalization approach in Robin, as shown in Supplemental Figure S6. Since none of the chips showed strongly outlying behavior in the quality assessment step, all were included in the statistical analysis of differential gene expression.

The three tomato tissues were compared against each other using a direct design with three biological replicates and dye swaps. In total, 418 genes were found to be significantly differentially regulated between leaves and roots, 200 when comparing leaves to flowers and 234 in the comparison of flowers to roots. As indicated on the Venn diagram (Fig. 2), a substantial number of genes showed differential expression levels in more than one comparison.

The results obtained in Robin were then analyzed using MapMan (Usadel et al., 2009) to gain insights into the biological context of relevant differences in gene expression. Using the biological pathway visualization capabilities of MapMan, general differences could be observed when comparing the aboveground organs with roots. The most prominent changes were, as could be expected, for genes related to photosynthesis. The MapMan BINs (1.1 PS.light reaction, 1.2 PS.photorespiration, 1.3 PS.calvin cycle, and 19 tetrapyrrole synthesis) were strongly and very consistently up-regulated in leaf and flower tissue (Fig. 3; Supplemental Table S2; Supplemental Fig. S3) compared to roots. The difference between leaves and flowers was much less pronounced, although still significant. This result can clearly be attributed to the fact that leaves as the primary sites of photosynthesis supply sink organs like roots and flowers with assimilates and hence need to maintain the photosynthetic machinery in a functional state. These results indicate that the major biological differences were readily identified by Robin and MapMan and prompted us to investigate more subtle differences.

Figure 3.
PageMan analysis of the tomato case study. A Wilcoxon test was performed, analogous to the test implemented in MapMan, to identify significantly differentially regulated MapMan bins. Individual bins that show distinct responses are highlighted. The plot ...

In addition to the visual inspection of pathways provided by MapMan, the built-in Wilcoxon rank sum test function was used on all three comparisons to identify significantly changed MapMan BINs (Supplemental Table S2). Other general processes that were found to be significantly up-regulated in leaves compared to both flowers and roots included starch synthesis and degradation. In line with the expectations, Suc-breakdown-related genes like Suc synthase showed increased expression in roots. Suc synthase is presumably involved in Suc breakdown to provide for carbon supply in sink organs (Sun et al., 1992; Zrenner et al., 1995). Surprisingly, invertases, which are required for normal root growth in Arabidopsis (Arabidopsis thaliana; Barratt et al., 2009), showed slightly stronger expression in leaves.

YABBY transcription factors have previously been shown to be involved in the regulation of lateral organ development (Street et al., 2008; Stahle et al., 2009). They were found to be significantly up-regulated in leaf (SGN-U603003) and flower tissue (SGN-U591723, SGN-U577176, SGN-U603003; Supplemental Fig. S3). The expression of YABBY proteins was strongest in flowers, supporting their well-described prominent role in flower development (Fourquin et al., 2007; Ishikawa et al., 2009; Orashakova et al., 2009). Investigation of the development-specific expression pattern of Arabidopsis YABBY proteins using the Genevestigator tool (Zimmermann et al., 2004) revealed a similar expression pattern for the CRABS CLAW protein, which showed highest expression in mature flowers (Supplemental Fig. S4). Similarly, the MADS-box transcription factors showing high similarity to SEPALLATA (SEP1/2) and AGAMOUS-like (AGL8/12) from Arabidopsis that are known to regulate flower and seed development (Mizukami et al., 1996; Pelaz et al., 2000; for review, see Robles and Pelaz, 2005), show strongest expression in flower tissues (Supplemental Fig. S3), confirming the fidelity of the results generated using Robin.

MapMan BINs that were primarily up-regulated in root tissue included lignin biosynthesis (16.2.1), plasma membrane intrinsic proteins like aquaporins (34.19), and genes related to flavonoid synthesis and metabolism of phenolic compounds. Although the latter two were not significantly responding according to the Wilcoxon rank sum test, individual genes showed significant responses. Since expression of flavonoid biosynthesis genes in root tissue is induced in the light (Hemm et al., 2004), the up-regulation of SGN-U565166, SGN-U565164 (similar to flavonol synthase), and SGN-U563058 (similar to flavanone-3-hydroxylase) might indicate an artifact due to exposure of the root to light during sample harvesting.

Flower tissue displayed a strong expression of cell wall-degrading enzymes like pectin methyl esterase (PME), pectate lyases, and polygalacturonases in comparison to both leaves and roots. PMEs catalyze the demethylation of pectin, changing the gelating properties of pectin and making it amenable to cleavage by pectate lyases and polygalacturonases. Apart from their role in simple pectin degradation, recent studies have also shown a prominent role of PMEs in controlling cell adhesion, organ development, and phyllotactic patterning (for review, see Wolf et al., 2009). Previous screens of cDNA libraries derived from maize (Zea mays) pollen have shown high expression levels of pectin degradation related genes in flower tissues (Wakeley et al., 1998) that are believed to play a role in pollen tube elongation. Interestingly, two putative PMEs (SGN-U585819 and SGN-U585823) exhibited deviating behavior with low expression in flowers. Further investigations using the tomato genome browser provided by the Sol Genomics Network (http://solgenomics.net/gbrowse/gbrowse/ITAG_devel_genomic/) revealed that both genes are located on the same chromosome in direct vicinity of each other, possibly indicating that they originate from a tandem duplication event (Supplemental Fig. S5). The observations reported above were highly significant both on the pathway level, as tested by the Wilcoxon rank sum test, and on the level of individual genes as confirmed by the statistical analysis of differential gene expression (see Supplemental Table S1 for full details). The raw data files and the complete Robin analysis project are available as supplemental material (Supplemental Materials S1 and S3).

MATERIALS AND METHODS

Implementation of Robin

Robin was implemented in Java and R using free extension libraries developed by several software projects. Specifically, the NetBeans visual application programming interface (http://graph.netbeans.org/) was used to develop the visual experiment designer, and the AffxFusion (http://www.affymetrix.com/partners_programs/programs/developer/index.affx) library was employed for the extraction of detailed information from Affymetrix chips. Apache commons (http://commons.apache.org/) was used to facilitate generic string operations. To achieve an improved user experience and better integration into the Mac OS X platform, we used the AppleJavaExtensions provided by Apple, Inc., and the QuaQua (http://www.randelshofer.ch/quaqua/) look and feel.

A stand-alone slim-line R engine is embedded in the Robin package, and is independent of user-installed versions of R. All required BioConductor packages have been included to provide an all-in-one package that works directly after installation. Installer packages for different operating systems were created using the free IzPack installer generator (http://izpack.org/). We also provide a lightweight package without R that can be deployed on any Java-enabled platform. On first use, this version of Robin will ask the user for a path to a working R installation, check this installation, and automatically download all required packages (if not already present), provided the computer has a working internet connection.

Automatic Input Assessment and Generation of Warnings

Robin tries to aid the user in assessing the quality of the microarray data by automatically generating warnings when diagnostic measures exceed preset threshold values. The assessment of global RNA degradation effects as implemented by the AffyRNAdeg function (Gautier et al., 2004) yields slopes for each of the degradation curves. If the slopes of individual RNA degradation curves exceed a value of 3 or deviate by more than 10% from the median slope of all curves, a warning message indicating the affected chips is displayed in the quality-check result list. MA plots visualizing the log2 fold change in expression of gene G under condition C versus condition D (M = log2(G_C) − log2(G_D)) plotted against the average log2 probe or probeset intensity (A = ½ × [log2(G_C) + log2(G_D)]) are generated for each individual chip. In the case of two-color microarrays the red channel signal intensity is compared against the green channel signal intensity. To display MA plots for Affymetrix arrays, the normalized expression values of each chip are compared against a synthetic chip created using the median expression values of all probesets across all chips in the experiment. Based on the assumption that most genes will not respond differentially to a given treatment, Robin automatically warns the user if more than 5% of the probesets on an individual chip are more than 2-fold up- or down-regulated. This threshold might be too restrictive in certain experiments, e.g. where very different developmental stages of an organism are compared or a drastic treatment is applied. Nevertheless, on data sets that violate the assumption that most genes are not responding, the normalization might fail and introduce artificial effects distorting the original data. 
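
The threshold checks described above can be illustrated with a minimal Python sketch. This is not Robin's own code (Robin delegates these steps to R/BioConductor); the function names are hypothetical, and the 10% rule is read here as a relative deviation from the median slope, which is an assumption.

```python
import math
from statistics import median


def rna_degradation_warnings(slopes, max_slope=3.0, max_rel_dev=0.10):
    """Return chips whose slope exceeds max_slope or deviates from the
    median slope by more than max_rel_dev (assumed to be a relative deviation)."""
    med = median(slopes.values())
    flagged = []
    for chip, slope in slopes.items():
        too_steep = slope > max_slope
        deviates = med != 0 and abs(slope - med) / abs(med) > max_rel_dev
        if too_steep or deviates:
            flagged.append(chip)
    return flagged


def ma_values(intensity_c, intensity_d):
    """Compute M and A on the log2 scale for paired, positive intensities."""
    m = [math.log2(c) - math.log2(d) for c, d in zip(intensity_c, intensity_d)]
    a = [0.5 * (math.log2(c) + math.log2(d)) for c, d in zip(intensity_c, intensity_d)]
    return m, a


def too_many_changed(m_values, fold=2.0, max_fraction=0.05):
    """True if more than max_fraction of the M values exceed the fold threshold."""
    cutoff = math.log2(fold)  # a 2-fold change corresponds to |M| > 1
    changed = sum(1 for m in m_values if abs(m) > cutoff)
    return changed / len(m_values) > max_fraction
```

For Affymetrix data, the second argument of ma_values would hold the per-probeset medians across all chips, i.e. the synthetic reference chip described in the text.
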
Generally, though, a high percentage of differentially responding probesets might indicate artifacts caused, for example, by a low signal-to-noise ratio or large differences in probe signal intensity that could not be eliminated by normalization, or even by pathogen attack. Again based on the aforementioned assumption, the M values plotted on MA plots should be centered around M = 0. A lowess fit (Cleveland, 1979) is calculated for the MA plots. In the ideal case the lowess fit curve would be identical to the M = 0 line. As an estimate for a strong deviation of the lowess fit from the M = 0 line, the area between the lowess curve and the M = 0 line is calculated. If the area exceeds a value of 1, a warning will be issued to notify the user of possible artifacts that might be caused by, for example, a bimodal probe signal intensity distribution. Probe signal intensity oversaturation is estimated by calculating the percentage of probes whose raw signal intensity is equal to the highest intensity value measured within that chip. Usually only one or a few probes display maximal intensity (in the case of Affymetrix GeneChips the theoretically possible maximal dynamic range of probe signal intensity is 0 to 2^16 due to the 16-bit data precision of Affymetrix GeneChip scanning devices). If more than 0.25% of the probes have maximal intensity, the chip is considered oversaturated and a warning is generated, informing the user of the possible information loss.
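
The two remaining checks, the area between the fitted curve and the M = 0 line and the share of probes at maximal intensity, could be approximated as in the following Python sketch. It assumes the lowess-fitted M values have already been computed elsewhere; the trapezoidal integration and all function names are illustrative choices, not a description of Robin's internals.

```python
def area_from_zero(a_values, fitted_m):
    """Trapezoidal estimate of the area between a fitted curve and M = 0.

    a_values are the A coordinates, fitted_m the fitted M values at those
    coordinates (for example from a lowess fit computed elsewhere).
    """
    points = sorted(zip(a_values, fitted_m))
    area = 0.0
    for (a0, m0), (a1, m1) in zip(points, points[1:]):
        width = a1 - a0
        if m0 * m1 < 0:
            # The segment crosses M = 0: sum the two triangles on either side.
            area += 0.5 * width * (m0 * m0 + m1 * m1) / (abs(m0) + abs(m1))
        else:
            area += 0.5 * width * (abs(m0) + abs(m1))
    return area


def saturated_percentage(raw_intensities):
    """Percentage of probes whose raw intensity equals the chip maximum."""
    top = max(raw_intensities)
    at_top = sum(1 for v in raw_intensities if v == top)
    return 100.0 * at_top / len(raw_intensities)


def is_oversaturated(raw_intensities, max_percent=0.25):
    """Apply the 0.25% threshold quoted in the text."""
    return saturated_percentage(raw_intensities) > max_percent
```

A warning would then correspond to area_from_zero(...) > 1 or is_oversaturated(...) returning True.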

Detection of spot replication relies on the spot identifiers and is based on the assumption that if the gene spots are not duplicated but the controls are duplicated, the number of unique identifiers will be greater than 50% of the total number of spots. This should be true for all array types that have more gene spots than control spots, but might not be the case for boutique arrays that contain only a few probes (e.g. custom arrays designed for small organellar genomes). If replicate spots are detected, Robin sorts the input data by identifier to make sure that replicates are consecutive, sets the number of duplicates to two, and sets the spacing between duplicates to one. Obviously, this is incorrect in cases where more than two replicates are spotted on the array. When analyzing arrays on which the spacing of replicate spots is not uniform, this approach might lead to overestimation of significance and underestimation of correlation for replicate spots that are close together on the array. To account for this possible bias, Robin generates a warning when replicates are detected and informs the user of the assumptions made.
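
The replicate-detection heuristic can be sketched in a few lines of Python. The reading that replication is assumed whenever unique identifiers make up at most 50% of all spots is inferred from the description above, and the function names and return layout are hypothetical.

```python
def has_replicated_spots(identifiers):
    """Assume replication when unique identifiers are at most 50% of all spots."""
    return len(set(identifiers)) <= 0.5 * len(identifiers)


def prepare_for_duplicates(rows):
    """rows is a list of (identifier, value) pairs for one array.

    Returns (rows, number_of_duplicates, spacing). When replication is
    detected, rows are sorted by identifier so that replicates become
    consecutive, and two duplicates with a spacing of one are assumed.
    """
    identifiers = [ident for ident, _ in rows]
    if not has_replicated_spots(identifiers):
        return rows, 1, 1
    return sorted(rows, key=lambda row: row[0]), 2, 1
```

Because Python's sort is stable, replicate values keep their original relative order within each identifier.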

Since the rank product-based analysis does not accept duplicated spots on one array, Robin checks the input data and collapses replicated values identified by the same identifier to the median value within each array. If replication is detected, a file containing the replicated spot identifiers and values will be written to disk. In addition to the warnings issued during the quality assessment, Robin will also inform the user of problems that occurred during the statistical analysis of differential expression, like low or imbalanced numbers of biological replicates and low significance of the results (e.g. none of the probes tested is called significantly differentially expressed given the chosen thresholds). At the end of the analysis workflow, Robin will present a summary list of all generated warnings to ensure that the user is made aware of possible shortcomings of the data.
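
Collapsing replicated spots to their within-array median might look like the following Python sketch. It is an illustration under the assumption that one array is available as (identifier, value) pairs; the function name is hypothetical.

```python
from collections import defaultdict
from statistics import median


def collapse_replicates(rows):
    """Collapse values sharing an identifier to their median within one array.

    Returns the collapsed values and, separately, the raw values of every
    identifier that occurred more than once (the content that would be
    written to a report file).
    """
    grouped = defaultdict(list)
    for identifier, value in rows:
        grouped[identifier].append(value)
    collapsed = {ident: median(values) for ident, values in grouped.items()}
    replicated = {ident: values for ident, values in grouped.items() if len(values) > 1}
    return collapsed, replicated
```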

Plant Material

Tomato (Solanum lycopersicum ‘M82’) seeds were allowed to germinate directly on soil and were then transferred to a vermiculite-based growth substrate and further cultivated as described in van der Merwe et al. (2009). Plant materials for microarray analysis were harvested from 6-week-old plants. Specifically, leaf samples were taken from the third to fourth node from the top, roots were washed in tap water to remove growth substrate, and all fully expanded flowers were collected. To minimize circadian effects, samples were taken on two consecutive days at the same time of day within 1.5 h. Tissue samples were immediately shock frozen in liquid nitrogen and stored at −80°C.

Sample Preparation

Tomato RNA extraction was performed using a modification of the standard TRIzol (Invitrogen GmbH) extraction protocol. Briefly, 500 mg of frozen material was finely ground in a mortar and subsequently mixed with 5 mL of TRIzol solution by vortexing. After addition of 3 to 5 mL chloroform and centrifugation for 20 min at 4,000g, the aqueous phase containing the RNA was transferred to a fresh tube. RNA was precipitated overnight following addition of 0.5 volumes of precipitation solution (0.8 M sodium citrate, 1.2 M sodium chloride) and 0.5 volumes of 2-propanol. Precipitated RNA was recovered by centrifugation for 20 min at 4,000g and subsequently washed twice by adding 5 mL of 70% ethanol and centrifuging for 5 min at 4,000g. After complete removal of 70% ethanol, the RNA pellets were air dried and finally dissolved in 40 μL of sterile water. cDNA synthesis and labeling were carried out as described in Degenkolbe et al. (2005) using Dynabeads Oligo(dT)25 (Dynal) to extract mRNA from the whole RNA samples.

Chip Hybridization and Data Processing

The TOM2 microarrays were obtained from the Boyce Thompson Institute. Each microarray contains 11,890 oligonucleotide probes designed based on gene transcript sequences from the Lycopersicon Combined Build # 3 unigene database (http://www.sgn.cornell.edu). Following RNA extraction, chip hybridization was performed as described in Degenkolbe et al. (2005) with the following modifications: The slides were rehydrated over a 65°C waterbath for 10 s and UV cross-linked at 65 mJ. The prehybridization was performed for 45 min at 43°C in 5× SSC, 0.1% SDS, 1% bovine serum albumin, washed twice for 10 s in milliQ water (Millipore) and in isopropanol for 5 s and drained by centrifugation at 1,500 rpm for 1 min. After hybridization the slides were washed in 1× SSC, 0.2% SDS for 3 min at 42°C, and 3 min at room temperature; after that the slides were washed again in 0.1× SSC, 0.2% SDS for 3 min at room temperature, three times in 0.1× SSC for 3 min at room temperature. The arrays were then drained by centrifugation at 1,500 rpm for 2 min. All three possible comparisons between the three tissues were performed in three biological replicates, resulting in nine microarray hybridizations. Raw signal intensity values were computed from the scanned array images using the image analysis software GeneSpotter version 2.3 (MicroDiscovery). The raw intensity values were normalized using Robin's default settings for two-color microarray analysis. Specifically, background intensities estimated by GeneSpotter were subtracted from the foreground values and subsequently a printtip-wise loess normalization (Yang et al., 2002) was performed within each array. To reduce technical variation between chips, the logarithmized red and green channel intensity ratios on each chip were subsequently scaled across all arrays (Yang et al., 2002; Smyth and Speed, 2003) to have the same median absolute deviation. 
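
The between-array scaling step can be illustrated with a short Python sketch. Choosing the geometric mean of the per-array values as the common target is an assumption made here for illustration, as are the function names; the actual analysis was carried out by Robin through R/BioConductor.

```python
import math
from statistics import median


def mad(values):
    """Median absolute deviation from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])


def scale_to_common_mad(arrays):
    """Rescale each array's log ratios so that all arrays share one MAD.

    The common target is the geometric mean of the per-array MADs
    (an illustrative choice); every array must have a non-zero MAD.
    """
    mads = [mad(values) for values in arrays]
    target = math.exp(sum(math.log(m) for m in mads) / len(mads))
    return [[v * target / m for v in values] for values, m in zip(arrays, mads)]
```
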
Statistical analysis of differential gene expression was carried out using the linear model-based approach developed by Smyth (2004). The obtained P values were corrected for multiple testing using the strategy described by Benjamini and Hochberg (1995) separately for each of the comparisons made. Genes that showed an absolute log2 fold-change value of at least 1 and a P value lower than 0.05 were considered significantly differentially expressed. The log2 fold-change cutoff value was imposed to account for noise in the experiment and make sure that only genes that show a marked reaction are recorded. The TOM2 chip oligonucleotide annotation was updated based on BLAST (Altschul et al., 1990) searches against the newest version of the SGN tomato unigene set (Tomato 200607 build2, http://solgenomics.net/) and MapMan BINs were assigned to each oligonucleotide on the chip based on the SGN tomato unigene mapping. Wilcoxon rank sum tests were performed using the built-in function in MapMan to test whether any bins behaved significantly and consistently differently from the other bins in the MapMan ontology.
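
The multiple-testing correction and the significance filter can be sketched in Python as follows. The sketch assumes that the 0.05 cutoff is applied to the corrected P values of a single comparison; the function names are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def significant_genes(genes, log2_fold_changes, p_values, min_abs_lfc=1.0, alpha=0.05):
    """Genes with |log2 fold change| >= min_abs_lfc and adjusted P < alpha."""
    adjusted = benjamini_hochberg(p_values)
    return [
        gene
        for gene, lfc, p_adj in zip(genes, log2_fold_changes, adjusted)
        if abs(lfc) >= min_abs_lfc and p_adj < alpha
    ]
```

Because the correction was applied separately for each comparison, the function would be called once per comparison rather than on the pooled P values.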

Supplemental Data

The following materials are available in the online version of this article.

  • Supplemental Figure S1. Exemplary overview of the quality assessment plots generated by Robin.
  • Supplemental Figure S2. MA plots of the three comparisons made in the tomato case study experiment.
  • Supplemental Figure S3. Exemplary visualization of the most strongly reacting bins using MapMan.
  • Supplemental Figure S4. Expression patterns of three YABBY transcription factor homologs from Arabidopsis created using the Genevestigator Web application.
  • Supplemental Figure S5. Genomic locations of two putative PMEs from tomato (SGN-U585819 and SGN-U585823) as shown by the Gbrowse genome browser (http://solgenomics.net/gbrowse/gbrowse/ITAG_devel_genomic/).
  • Supplemental Figure S6. Summary of all quality-check plots generated for the tomato case study experiment.
  • Supplemental Table S1. Detailed statistical results tables as produced by Robin.
  • Supplemental Table S2. Wilcoxon rank sum test results generated by MapMan.
  • Supplemental Material S1. Complete analysis results of the case study as described in the text, including the processed raw microarray data.
  • Supplemental Material S2. Robin Users’ Guide.
  • Supplemental Material S3. Raw microarray data files of the case study experiment.

Acknowledgments

We are grateful to Diana Pese for excellent assistance in the lab. We also wish to acknowledge Paulina Troc, Steffen Kulawik, and Florian Hetsch for helping in harvesting the tomato samples and Anthony Bolger for helpful comments on the manuscript. We want to acknowledge James J. Giovannoni for kindly providing tomato microarrays. Finally, we also wish to thank all colleagues who tested the Robin application and gave useful comments and suggestions helping us to improve the user experience and stability.

References

  • Affymetrix (2005) Guide to probe logarithmic intensity error (plier) estimation. Technical Report, Affymetrix, Inc. www.affymetrix.com/support/technical/technotesmain.a.x (December 15, 2009)
  • Alba R, Fei Z, Payton P, Liu Y, Moore SL, Debbie P, Cohn J, D'Ascenzo M, Gordon JS, Rose JK, et al. (2004) ESTs, cDNA microarrays, and gene expression profiling: tools for dissecting plant physiology and development. Plant J 39: 697–714.
  • Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. (1990) Basic local alignment search tool. J Mol Biol 215: 403–410.
  • Barratt DH, Derbyshire P, Findlay K, Pike M, Wellner N, Lunn J, Feil R, Simpson C, Maule AJ, Smith AM. (2009) Normal growth of Arabidopsis requires cytosolic invertase but not sucrose synthase. Proc Natl Acad Sci USA 106: 13124–13129.
  • Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Edgar R. (2007) NCBI GEO: mining tens of millions of expression profiles—database and tools update. Nucleic Acids Res 35: D760–765.
  • Benjamini Y, Hochberg Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc, B 57: 289–300.
  • Bolstad BM. (2004) Low level analysis of high-density oligonucleotide array data: background, normalization and summarization. PhD thesis. University of California, Berkeley, CA.
  • Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, et al. (2001) Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet 29: 365–371.
  • Breitling R, Armengaud P, Amtmann A, Herzyk P. (2004) Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett 573: 83–92.
  • Cleveland WS. (1979) Robust locally weighted regression and smoothing scatterplots. J Am Stat Assoc 74: 829–836.
  • Degenkolbe T, Hannah MA, Freund S, Hincha DK, Heyer AG, Kohl KI. (2005) A quality-controlled microarray method for gene expression profiling. Anal Biochem 346: 217–224.
  • Dietzsch J, Gehlenborg N, Nieselt K. (2006) Mayday—a microarray data analysis workbench. Bioinformatics 22: 1010–1012.
  • Dondrup M, Albaum S, Griebel T, Henckel K, Junemann S, Kahlke T, Kleindt C, Kuster H, Linke B, Mertens D, et al. (2009) EMMA 2—a MAGE-compliant system for the collaborative analysis and integration of microarray data. BMC Bioinformatics 10: 50.
  • Fourquin C, Vinauger-Douard M, Chambrier P, Berne-Dedieu A, Scutt CP. (2007) Functional conservation between CRABS CLAW orthologues from widely diverged angiosperms. Ann Bot (Lond) 100: 651–657.
  • Gautier L, Cope L, Bolstad BM, Irizarry RA. (2004) affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 20: 307–315.
  • Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5: R80.
  • Hannemann J, Poorter H, Usadel B, Blasing OE, Finck A, Tardieu F, Atkin OK, Pons T, Stitt M, Gibon Y. (2009) Xeml Lab: a tool that supports the design of experiments at a graphical interface and generates computer-readable metadata files, which capture information about genotypes, growth conditions, environmental perturbations and sampling strategy. Plant Cell Environ 32: 1185–1200.
  • Hemm MR, Rider SD, Ogas J, Murry DJ, Chapple C. (2004) Light induces phenylpropanoid metabolism in Arabidopsis roots. Plant J 38: 765–778.
  • Hohnjec N, Vieweg MF, Puhler A, Becker A, Kuster H. (2005) Overlaps in the transcriptional profiles of Medicago truncatula roots inoculated with two different Glomus fungi provide insights into the genetic program activated during arbuscular mycorrhiza. Plant Physiol 137: 1283–1301.
  • Hong F, Breitling R, McEntee CW, Wittner BS, Nemhauser JL, Chory J. (2006) RankProd: a bioconductor package for detecting differentially expressed genes in meta-analysis. Bioinformatics 22: 2825–2827.
  • Ishikawa M, Ohmori Y, Tanaka W, Hirabayashi C, Murai K, Ogihara Y, Yamaguchi T, Hirano HY. (2009) The spatial expression patterns of DROOPING LEAF orthologs suggest a conserved function in grasses. Genes Genet Syst 84: 137–146.
  • Martin-Requena V, Munoz-Merida A, Claros MG, Trelles O. (2009) PreP+07: improvements of a user friendly tool to pre-process and analyse microarray data. BMC Bioinformatics 10: 16.
  • Mizukami Y, Huang H, Tudor M, Hu Y, Ma H. (1996) Functional domains of the floral regulator AGAMOUS: characterization of the DNA binding domain and analysis of dominant negative mutations. Plant Cell 8: 831–845.
  • Orashakova S, Lange M, Lange S, Wege S, Becker A. (2009) The CRABS CLAW ortholog from California poppy (Eschscholzia californica, Papaveraceae), EcCRC, is involved in floral meristem termination, gynoecium differentiation and ovule initiation. Plant J 58: 682–693.
  • Pelaz S, Ditta GS, Baumann E, Wisman E, Yanofsky MF. (2000) B and C floral organ identity functions require SEPALLATA MADS-box genes. Nature 405: 200–203.
  • Psarros M, Heber S, Sick M, Thoppae G, Harshman K, Sick B. (2005) RACE: remote analysis computation for gene expression data. Nucleic Acids Res 33: W638–643.
  • R Development Core Team (2009) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna.
  • Rainer J, Sanchez-Cabo F, Stocker G, Sturn A, Trajanoski Z. (2006) CARMAweb: comprehensive R- and bioconductor-based web service for microarray data analysis. Nucleic Acids Res 34: W498–503.
  • Robles P, Pelaz S. (2005) Flower and fruit development in Arabidopsis thaliana. Int J Dev Biol 49: 633–643.
  • Saeed AI, Sharov V, White J, Li J, Liang W, Bhagabati N, Braisted J, Klapa M, Currier T, Thiagarajan M, et al. (2003) TM4: a free, open-source system for microarray data management and analysis. Biotechniques 34: 374–378.
  • Schena M, Shalon D, Davis RW, Brown PO. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270: 467–470.
  • Schmid M, Davison TS, Henz SR, Pape UJ, Demar M, Vingron M, Schölkopf B, Weigel D, Lohmann JU. (2005) A gene expression map of Arabidopsis thaliana development. Nat Genet 37: 501–506.
  • Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13: 2498–2504.
  • Smyth GK. (2004) Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 3: Article 3.
  • Smyth GK, Speed T. (2003) Normalization of cDNA microarray data. Methods 31: 265–273.
  • Sreenivasulu N, Radchuk V, Strickert M, Miersch O, Weschke W, Wobus U. (2006) Gene expression patterns reveal tissue-specific signaling networks controlling programmed cell death and ABA-regulated maturation in developing barley seeds. Plant J 47: 310–327.
  • Stahle MI, Kuehlich J, Staron L, von Arnim AG, Golz JF. (2009) YABBYs and the transcriptional corepressors LEUNIG and LEUNIG_HOMOLOG maintain leaf polarity and meristem activity in Arabidopsis. Plant Cell 21: 3105–3118.
  • Street NR, Sjodin A, Bylesjo M, Gustafsson P, Trygg J, Jansson S. (2008) A cross-species transcriptomics approach to identify genes involved in leaf development. BMC Genomics 9: 589.
  • Sun J, Loboda T, Sung SJ, Black CC. (1992) Sucrose synthase in wild tomato, Lycopersicon chmielewskii, and tomato fruit sink strength. Plant Physiol 98: 1163–1169.
  • Thompson R, Ratet P, Küster H. (2005) Identification of gene functions by applying TILLING and insertional mutagenesis strategies on microarray-based expression data. Grain Legumes 41: 20–22.
  • Usadel B, Bläsing OE, Gibon Y, Retzlaff K, Höhne M, Günther M, Stitt M. (2008) Global transcript levels respond to small changes of the carbon status during progressive exhaustion of carbohydrates in Arabidopsis rosettes. Plant Physiol 146: 1834–1861.
  • Usadel B, Nagel A, Steinhauser D, Gibon Y, Bläsing OE, Redestig H, Sreenivasulu N, Krall L, Hannah MA, Poree F, et al. (2006) PageMan: an interactive ontology tool to generate, display, and annotate overview graphs for profiling experiments. BMC Bioinformatics 7: 535.
  • Usadel B, Nagel A, Thimm O, Redestig H, Blaesing OE, Palacios-Rojas N, Selbig J, Hannemann J, Piques MC, Steinhauser D, et al. (2005) Extension of the visualization tool MapMan to allow statistical analysis of arrays, display of corresponding genes, and comparison with known responses. Plant Physiol 138: 1195–1204.
  • Usadel B, Poree F, Nagel A, Lohse M, Czedik-Eysenberg A, Stitt M. (2009) A guide to using MapMan to visualize and compare Omics data in plants: a case study in the crop species, maize. Plant Cell Environ 32: 1211–1229.
  • van der Merwe MJ, Osorio S, Moritz T, Nunes-Nesi A, Fernie AR. (2009) Decreased mitochondrial activities of malate dehydrogenase and fumarase in tomato lead to altered root growth and architecture via diverse mechanisms. Plant Physiol 149: 653–669.
  • Wakeley PR, Rogers HJ, Rozycka M, Greenland AJ, Hussey PJ. (1998) A maize pectin methylesterase-like gene, ZmC5, specifically expressed in pollen. Plant Mol Biol 37: 187–192.
  • Wang J, Nygaard V, Smith-Sørensen B, Hovig E, Myklebost O. (2002) MArray: analysing single, replicated or reversed microarray experiments. Bioinformatics 18: 1139–1140.
  • Wettenhall JM, Simpson KM, Satterley K, Smyth GK. (2006) affylmGUI: a graphical user interface for linear modeling of single channel microarray data. Bioinformatics 22: 897–899.
  • Wettenhall JM, Smyth GK. (2004) limmaGUI: a graphical user interface for linear modeling of microarray data. Bioinformatics 20: 3705–3706.
  • Wilson CL, Miller CJ. (2005) Simpleaffy: a BioConductor package for Affymetrix quality control and data analysis. Bioinformatics 21: 3683–3685.
  • Winfield MO, Lu C, Wilson ID, Coghill JA, Edwards KJ. (2009) Cold- and light-induced changes in the transcriptome of wheat leading to phase transition from vegetative to reproductive growth. BMC Plant Biol 9: 55.
  • Wolf S, Mouille G, Pelloux J. (2009) Homogalacturonan methyl-esterification and plant development. Mol Plant 2: 851–860.
  • Wu Z, Irizarry RA, Gentleman R, Martinez-Murillo F, Spencer F. (2004) A model-based background adjustment for oligonucleotide expression arrays. J Am Stat Assoc 99: 909–917.
  • Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP. (2002) Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res 30: e15.
  • Zanor MI, Osorio S, Nunes-Nesi A, Carrari F, Lohse M, Usadel B, Kuhn C, Bleiss W, Giavalisco P, Willmitzer L, et al. (2009) RNA interference of LIN5 in tomato confirms its role in controlling Brix content, uncovers the influence of sugars on the levels of fruit hormones, and demonstrates the importance of sucrose cleavage for normal fruit development and fertility. Plant Physiol 150: 1204–1218.
  • Zimmermann P, Hirsch-Hoffmann M, Hennig L, Gruissem W. (2004) GENEVESTIGATOR: Arabidopsis microarray database and analysis toolbox. Plant Physiol 136: 2621–2632.
  • Zimmermann P, Schildknecht B, Craigon D, Garcia-Hernandez M, Gruissem W, May S, Mukherjee G, Parkinson H, Rhee S, Wagner U, et al. (2006) MIAME/Plant—adding value to plant microarrray experiments. Plant Methods 2: 1.
  • Zrenner R, Salanoubat M, Willmitzer L, Sonnewald U. (1995) Evidence of the crucial role of sucrose synthase for sink strength using transgenic potato plants (Solanum tuberosum L.). Plant J 7: 97–107.

Articles from Plant Physiology are provided here courtesy of American Society of Plant Biologists
