Plant Physiol. Dec 2000; 124(4): 1582–1594.

A New Set of Arabidopsis Expressed Sequence Tags from Developing Seeds. The Metabolic Pathway from Carbohydrates to Seed Oil


Large-scale single-pass sequencing of cDNAs from different plants has provided an extensive reservoir for the cloning of genes, the evaluation of tissue-specific gene expression, markers for map-based cloning, and the annotation of genomic sequences. Although as of January 2000 GenBank contained over 220,000 entries of expressed sequence tags (ESTs) from plants, most publicly available plant ESTs are derived from vegetative tissues and relatively few ESTs are specifically derived from developing seeds. However, important morphogenetic processes are exclusively associated with seed and embryo development and the metabolism of seeds is tailored toward the accumulation of economically valuable storage compounds such as oil. Here we describe a new set of ESTs from Arabidopsis, which has been derived from 5- to 13-d-old immature seeds. Close to 28,000 cDNAs have been screened by DNA/DNA hybridization and approximately 10,500 new Arabidopsis ESTs have been generated and analyzed using different bioinformatics tools. Approximately 40% of the ESTs currently have no match in dbEST, suggesting many represent mRNAs derived from genes that are specifically expressed in seeds. Although these data can be mined with many different biological questions in mind, this study emphasizes the import of photosynthate into developing embryos, its conversion into seed oil, and the regulation of this pathway.

To understand the regulatory networks governing metabolism in developing oil seeds, we initiated a genome-wide analysis of gene expression in seeds of Arabidopsis, taking advantage of recently developed genomic tools (Hieter and Boguski, 1997; Bouchez and Hofte, 1998). Although the Arabidopsis genomic sequence is now fully available (www.arabidopsis.org; Meinke et al., 1998), expressed sequence tags (ESTs) derived from single-pass sequencing of cDNAs in Arabidopsis provide an invaluable resource for the annotation of genomic sequences and the analysis of gene expression associated with specific plant tissues or growth conditions (Newman et al., 1994; Cooke et al., 1996; Rounsley et al., 1996). Cloning of genes encoding enzymes of specific biochemical pathways by single-pass sequencing of cDNAs has been a very successful strategy, particularly when the cDNA libraries have been prepared from tissues with high activity for the respective enzymes. For example, sequencing of cDNAs derived from endosperm of developing castor bean seeds led to the identification of the enzyme involved in ricinoleic acid biosynthesis (Van de Loo et al., 1995a, 1995b). In a similar manner, genes essential for the biosynthesis of conjugated double bond-containing fatty acids were recently identified among ESTs from oleogenic tissues of Momordica charantia and Impatiens balsamina (Cahoon et al., 1999) and ESTs from wood-forming tissues of trees have proven to be an ideal source for the isolation of cDNAs encoding enzymes of cell wall biosynthesis (Allona et al., 1998; Sterky et al., 1998). ESTs and their accompanying cDNAs also provide the means to construct inexpensive microarrays on glass slides, which can be used to study the expression of genes on a genome-wide scale (DeRisi et al., 1997; Ruan et al., 1998). 
A careful bioinformatic analysis to identify tissue-specific ESTs is a prerequisite to obtain a comprehensive and representative set of cDNAs for gene expression studies by microarrays (Loftus et al., 1999). Thus, given that only a small number of plant ESTs in the public databases have been derived from seeds, it was essential in the context of the genome-wide analysis of seed metabolism to obtain and analyze a large number of these ESTs first.

Even without subsequent microarray analysis, a sufficiently large number of ESTs derived from a specific tissue can provide a clue toward the expression of specific genes in the tissue (Rafalski et al., 1998; Ewing et al., 1999; Mekhedov et al., 2000). In most cases and within statistical limitations (Audic and Claverie, 1997), the abundance of a specific cDNA in the EST collection is a measure of gene expression. Here we apply this technique, also referred to as an “electronic” or “digital” northern, to identify the primary metabolic route for the conversion of photosynthate into oil in developing seeds of Arabidopsis. The described analysis of 10,500 cDNAs by single-pass sequencing provides a rich data set, which we can only begin to explore here. For this reason the data set will be available at our web page for further studies.
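The statistical limitation mentioned above can be illustrated with a short sketch. It evaluates the probability given by Audic and Claverie (1997) of observing y ESTs for a gene in a library of N2 clones, given x ESTs for the same gene in a library of N1 clones, under the assumption of equal expression. The function names and the example counts are hypothetical and serve only as illustration; the study does not state that this test was applied to the seed data.

```python
from math import exp, lgamma, log


def audic_claverie_prob(x: int, y: int, n1: int, n2: int) -> float:
    """P(y | x) for EST counts x and y from libraries of n1 and n2 clones.

    Audic and Claverie (1997):
    p(y|x) = (n2/n1)^y * (x+y)! / (x! * y! * (1 + n2/n1)^(x+y+1))
    Evaluated in log space so that large counts do not overflow.
    """
    ratio = n2 / n1
    log_p = (
        y * log(ratio)
        + lgamma(x + y + 1)
        - lgamma(x + 1)
        - lgamma(y + 1)
        - (x + y + 1) * log(1 + ratio)
    )
    return exp(log_p)


def upper_tail(x: int, y: int, n1: int, n2: int, tol: float = 1e-15) -> float:
    """Probability of y or more ESTs in library 2, given x ESTs in library 1."""
    total = 0.0
    k = y
    while True:
        term = audic_claverie_prob(x, k, n1, n2)
        total += term
        # Far in the tail the terms shrink geometrically; stop once negligible.
        if term == 0.0 or term < tol * total:
            break
        k += 1
    return total


# Hypothetical counts: 2 ESTs among 45,000 clones versus 20 among 10,500.
p = upper_tail(x=2, y=20, n1=45_000, n2=10_500)
print(f"P(y >= 20 | x = 2) = {p:.3g}")
```

A small tail probability suggests that the difference in EST counts is unlikely to arise from sampling alone; low counts in either library leave the comparison inconclusive.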


Single-Pass Sequencing of 10,522 cDNAs from Developing Seeds

Despite the fact that over 45,000 Arabidopsis ESTs have already been deposited in dbEST (release 030300; Boguski et al., 1993), these are not necessarily representative with regard to genes specifically expressed in developing seeds, because siliques, rather than isolated developing seeds, were used as the source of seed cDNAs. To initiate a “functional genomic” analysis of seed metabolism, we sequenced cDNAs derived exclusively from developing Arabidopsis seeds in a single pass from the 5′ end. Because seeds contain highly abundant mRNAs, e.g. those derived from genes encoding storage proteins, we probed nylon filters with 9,136 (data set I) and 18,432 arrayed clones (data set II), respectively, employing cDNA probes as summarized in Table I. From data set I, 4,641 clones (51%) were sequenced and analyzed with BLASTX. Additional clones were selected from data set I (Table I) for probing of the second filter set to further reduce the redundancy in data set II. In this case, 5,881 clones (32%) were sequenced and analyzed. The average read lengths after trimming were 350 bp for clones from data set I and 259 bp for clones from data set II. Taken together, 10,522 clones were analyzed at the level of BLASTX searches, equivalent to 38% of the clones on the filters. A total of 11,873 sequences were generated and kept in a FASTA file (complete raw data set), which includes 1,141 sequence runs from the 3′ ends of selected clones, a small number of repeats, and clones for which only poor sequence is available. The sequences have been deposited at GenBank and will be available along with annotations at our web site. The longest clones from each contig as well as singletons (see below) have been deposited at the Arabidopsis Biological Resource Center.

Table I
Clones corresponding to highly abundant messages used for prescreening

Classification of ESTs According to Predicted Function

To obtain qualitative information about the ESTs, each sequence was searched (BLASTX) against the non-redundant protein database of GenBank. The top scoring hits were automatically extracted and manually annotated according to the description of the sequence(s) returned by BLASTX. The number of clones falling into each class is shown in Table II. It must be emphasized that this procedure provides only tentative clues toward the function of the encoded proteins, due to the fact that relatively few of the descriptions associated with GenBank entries have been verified by wet-lab experiments (Boguski, 1999).

Table II
Distribution of cDNAs in classes of putative function

Two classes, “non-significant homology” (NSH) and “unidentified function” (UF) represent approximately 40% of the clones and warrant further explanation. Sequences that returned BLASTX scores (high scoring segment pairs) of less than 100 were grouped under NSH (24.3%), indicating that no protein similar to the translation product was present in the public databases at the time of the analysis. This group of sequences was repeatedly resubmitted for analysis. To rule out that the NSH class is enriched in low quality sequences as the primary cause for low BLASTX scores in this class, we compared the average quality values assigned by PHRED to each chromatogram and found similar average quality values for the NSH class and the total EST set. Based on this analysis one can assume that approximately 24% of the clones in the seed database encode novel proteins. The UF class (13.5%) contains ESTs that show significant similarity (BLASTX scores >100) at the level of predicted amino acid sequence to proteins from different organisms for which no function is known.

Despite the prescreening there is still a considerable number of storage protein entries (14.4%) present in the database (Table II), representing the largest class of clones for which a putative function can be assigned. A similar observation was made for ESTs from castor bean and was explained by the presence of short incomplete cDNAs encoding storage proteins that would not hybridize efficiently to the probe during prescreening (Van de Loo et al., 1995b). Considering the number of storage protein clones and other abundant clones identified by hybridization (62%), a minimum of 75% of mRNAs are derived from fewer than 50 genes in developing seeds. Three classes of particular importance to the analysis of carbon flow in developing oil seeds include 701 entries classified as carbohydrate metabolism, 490 lipid metabolism entries, and 216 entries for putative membrane transporters.

How Many Novel ESTs and How Many Genes Are Represented in the Seed EST Set?

To evaluate whether novel, seed-specific ESTs were present we compared our entire 5′-sequence data set against the Arabidopsis set in “The Arabidopsis Information Resource” available at http://www.Arabidopsis.org/seqtools.html. Of the 10,552 BLASTN results returned, 4,173 (39.5%) showed BLASTN scores (high scoring segment pairs) of less than or equal to 50. Based on these scores it can be estimated that approximately 40% of the ESTs described here are not represented in the public Arabidopsis EST set and many of these therefore may correspond to genes specifically expressed in developing seeds of Arabidopsis.

Because multiple ESTs can be derived from a single gene, sequences were assembled into contigs to estimate the number of genes giving rise to the ESTs. Of the 11,850 sequences used for contig analysis, 7,567 (64%) assembled into 1,569 contigs and 4,283 (36%) remained as singletons. Thus the maximal number of unique cDNAs represented in the entire data set is 5,852. To estimate how many genes are represented in our data set that may be specifically expressed in developing seeds, we determined the number of contigs and singletons represented by the 4,173 ESTs not represented in the public data set. These were 743 contigs and 2,306 singletons representing a maximal number of 3,049 genes. Thus based on this analysis up to one-half of all genes represented by our data set may be specifically expressed in seeds. However, there are three caveats concerning this estimation. First, although in most cases each contig represents one gene, sometimes more than one contig of nonoverlapping sequences exist per gene resulting in an overestimation. Second, in some cases due to the limited quality of single-pass sequences, closely related gene families cannot be resolved into individual contigs resulting in an underestimation. Third, because silique-derived cDNA sequences are present in the public database, some of the ESTs in dbEST already represent genes specifically expressed in seeds, e.g. storage protein genes. These have not been taken into account above and will lead to an underestimation of seed-specific genes represented by the seed EST data set.
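The counting argument in this paragraph can be retraced in a few lines. The sketch below only reproduces the arithmetic from the figures quoted in the text; the function name is a hypothetical choice made for illustration.

```python
def max_unique_cdnas(contigs: int, singletons: int) -> int:
    """Upper bound on distinct cDNAs: every contig and every singleton counts once."""
    return contigs + singletons


# Entire data set: 7,567 assembled sequences plus 4,283 singletons.
total_sequences = 7_567 + 4_283
all_genes = max_unique_cdnas(contigs=1_569, singletons=4_283)

# Subset formed by the 4,173 ESTs without a match in the public EST set.
novel_genes = max_unique_cdnas(contigs=743, singletons=2_306)

print(total_sequences)                   # 11850
print(all_genes)                         # 5852
print(novel_genes)                       # 3049
print(f"{novel_genes / all_genes:.0%}")  # 52%
```

Both totals are upper bounds, so the ratio of about one-half is subject to the three caveats listed in the text.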

Mapping ESTs onto the Arabidopsis Genome

One step toward the determination of the exact number of genes represented by ESTs would be to map all ESTs and contig consensus sequences onto the Arabidopsis genome. For this purpose we searched (BLASTN) all sequences in the raw sequence file, as well as all contig consensus sequences, against an Arabidopsis genomic sequence subset of all sequences longer than 10 kb. This set should primarily contain sequenced bacterial artificial chromosomes (BACs), phage artificial chromosomes (PACs), and P1 clones from the Arabidopsis Genome Initiative. The individual results of this analysis can be found in the database and provide a location for most ESTs on the physical map of Arabidopsis by linking these results to the map locations of sequenced clones available at http://www.Arabidopsis.org/seqtools.html. In the past this information could only be obtained by direct PCR mapping approaches (Agyare et al., 1997) due to the absence of large scale genomic sequence information. Because BACs contain on average 20 to 30 genes each, further analysis on an individual basis is required to ultimately determine whether two contigs are derived from one or several genes on a particular BAC.

Abundance of ESTs Derived from Specific Genes

The number of sequences assembled in the contigs gives an indication of the degree of expression of the respective gene in developing seeds. Table III lists contigs containing more than eight ESTs. The accession numbers provide direct access to the sequence in GenBank (whenever possible, a cDNA sequence) that shows the best match to the contig consensus sequence. As predicted by the initial classification of individual ESTs (Table II), the most abundant ESTs form contigs that encode seed storage proteins. In agreement with the high demands for protein synthesis in developing seeds, ESTs for translational elongation factors were abundant in contigs (Table III, RB). ESTs for proteins possibly involved in storage protein body formation such as vacuolar processing enzyme (Kinoshita et al., 1995; Table III, TON) or proteases in general (Table III, PA) are highly abundant. In a similar manner, genes encoding enzymes involved in protein folding (Table III, CHP) such as protein disulfide isomerase genes are highly expressed in seeds (Boston et al., 1996). Developing embryos of Arabidopsis are green. Thus it is not surprising that ESTs encoding chlorophyll-binding proteins are present in high numbers (Table III, PS). The most highly abundant enzyme-encoding ESTs are those for S-adenosyl-Met decarboxylase (Table III, AA). This is a key enzyme of polyamine biosynthesis (Walden et al., 1997). However, ESTs encoding other enzymes of this pathway are not very abundant or are absent. Thus S-adenosyl-Met decarboxylase may additionally be involved in a pathway unrelated to polyamine biosynthesis. Among the contigs of abundant ESTs are 20 for which the consensus sequence did not have a match in GenBank or which are similar to proteins of unknown function (Table III, NSH and UF).
These provide an interesting pool of novel proteins with a function that may be of special relevance for developing seeds and further functional analysis may lead to the discovery of molecular processes crucial to developing seeds. An obvious class missing in the contig list of most abundant ESTs (Table III) is that containing ESTs with similarity to transcription factor genes, even though the entire data set contains a considerable number of such ESTs (169, 1.6%; Table II, T). It is clear that regulatory genes are not as highly expressed as storage protein genes or genes essential for the biosynthesis of other storage compounds. Although this notion may be trivial, it nevertheless confirms that the observed abundance of ESTs in each contig or class is in agreement with common knowledge about the biology of plant cells and of developing seeds in particular.

Table III
Most abundant contigs in the Seed EST database

Different Representation of Genes in the Seed EST Set and the Public Arabidopsis EST Set

The public EST data set for Arabidopsis available March 2000 consists of over 45,000 sequences derived from cDNA libraries produced from a range of tissues. The largest group of sequences (approximately 31,000) originated from sequencing a mixed population of cDNAs from etiolated seedlings, tissue culture-grown roots, and aerial tissue from flowering plants (Newman et al., 1994). The 10,522 sequences from a developing seed cDNA library described in this study represent the largest set of public Arabidopsis ESTs currently available from a narrowly defined developmental stage of the plant. How different is this new set from those sequences already deposited? To answer this question we compared the percentage of ESTs in the seed database for several genes with their abundance among the non-seed Arabidopsis ESTs previously deposited in dbEST. For example, for glyceraldehyde-3-P dehydrogenase, a gene that might be considered constitutive, or “housekeeping,” the relative abundance in the two data sets is identical (0.3%). In contrast and as expected, genes that are known to be highly expressed in seeds were found to be abundant in the seed EST data set. For example, storage proteins represent at least 50% of the clones in the seed library, which is at least 500-fold more abundant than in the non-seed set. Likewise, oleosins are approximately 100-fold more prevalent in the seed library than in the non-seed data.

In mature Arabidopsis seeds, lipid in the form of triacylglycerol is the major form of carbon storage, representing 30% to 40% of the seed dry weight. It might be expected that higher flux of carbon into lipid synthesis in seeds would be reflected in a higher proportion of clones for fatty acid synthesis within the seed data set than in dbEST. This is in fact the case: approximately 0.5% of the seed ESTs encode proteins of the plastidic fatty acid synthase compared with approximately 0.15% of Arabidopsis ESTs found in dbEST for the same reactions. Furthermore, we detected ESTs for seed-specific genes that are completely missing from the public data set. For example, clones corresponding to FAE1 encoding a protein that controls seed-specific fatty acid elongation occurred 20 times in our database, but not at all in dbEST. In general, the vast majority of these comparisons validate that this new EST set provides the expected tissue-specific representation of gene expression in seeds and contains a very different population of ESTs than previously available.
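The comparisons of relative abundance in this section reduce to simple ratios, sketched below. The function names are hypothetical, and the library size used for FAE1 (the 10,522 analyzed clones) is an assumption, because the text states only that FAE1 occurred 20 times in the seed database.

```python
from typing import Optional


def relative_abundance(est_count: int, library_size: int) -> float:
    """Percentage of a library's ESTs assigned to one gene or pathway."""
    return 100.0 * est_count / library_size


def fold_enrichment(seed_pct: float, reference_pct: float) -> Optional[float]:
    """Ratio of relative abundances; None when the reference set has no EST."""
    if reference_pct == 0:
        return None
    return seed_pct / reference_pct


# Percentages quoted in the text for plastidic fatty acid synthase ESTs.
fas_fold = fold_enrichment(0.5, 0.15)
print(f"{fas_fold:.1f}-fold")  # 3.3-fold

# FAE1: 20 ESTs in the seed set (assumed size 10,522), none in dbEST.
fae1_pct = relative_abundance(20, 10_522)
print(f"{fae1_pct:.2f}%")               # 0.19%
print(fold_enrichment(fae1_pct, 0.0))   # None
```

Returning None for an absent reference count avoids reporting an infinite enrichment for genes such as FAE1 that are simply missing from the comparison set.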

The Conversion of Photosynthate into Fatty Acids

Figure 1 depicts the major pathways involved in the conversion of Suc into fatty acids. These include the conversion of imported Suc by a cytosolic glycolytic pathway (reactions 1–16), transfer of intermediates across the plastid envelopes (reactions 17–20), intermittent starch biosynthesis and degradation in the plastid (reactions 21–26), a plastidic glycolytic pathway (reactions 27–36), the oxidative pentose phosphate cycle (reactions 37–42), the plastidic pyruvate dehydrogenase complex (reaction 44), as well as reactions involved in fatty acid biosynthesis and modification (reactions 45–52). In Figure 1 the thickness of arrows represents the number of ESTs in data sets I and II, which encode the respective enzyme. Because different enzymes have different turnover numbers and other kinetic factors, this number cannot be used to compare the magnitude of flux through the different reactions. However, EST numbers in many cases can provide useful comparisons between the same reaction in different compartments, or between similar biochemical reactions. The assignment of the plastidic and cytosolic isoforms was generally based on BLASTX results showing sequence similarity of the respective ESTs or contigs to genes encoding proteins of known function and subcellular location. In ambiguous cases, e.g. for Glc-6-P dehydrogenase (Fig. 1, reaction 37), we used multiple alignment of the respective ESTs from the seed database with all known Glc-6-P dehydrogenase-encoding plant genes in conjunction with cluster analysis. Further refinement could be achieved by predicting the presence of chloroplast transit peptides from genomic DNA sequences that correspond to the ESTs. However, in the absence of biochemical data these assignments must be considered preliminary. A list of each enzyme, the number of ESTs, and the clone and contig identifiers are given in Table IV.
Reactions for which no corresponding EST is present are drawn with a dashed line in Figure 1.

Figure 1
Schematic representation of metabolic pathways in a typical oil storing cell of a developing Arabidopsis embryo. The selective focus presented here is on carbohydrate metabolism and fatty acid biosynthesis. Only cytosolic and plastidic isoforms are ...
Table IV
Enzymes involved in carbohydrate and lipid metabolism

It is interesting that the reactions for which no corresponding EST is present are often found in clusters, e.g. reactions 25 through 29 (plastidic glycolysis) or 38 through 41 (oxidative pentose phosphate cycle). It is tempting to speculate that the observed clustering reflects the coordinated regulation of gene expression according to metabolic pathways and may provide a first glimpse at the regulatory network governing seed metabolism. However, it must be emphasized that even though this new data set is large, it is still incomplete and the resolution for differential expression is lost for reactions that are not represented by ESTs.

Membrane Transporters

Suc is the transport form of CO2 fixed by photosynthesis and must be imported into the developing embryo. Studies with developing bean seeds suggest that Suc and hexose transporters located in the epidermis of the embryo are involved (Weber et al., 1997). Two Suc transporter genes are known for Arabidopsis, SUC1 and SUC2 (Sauer and Stolz, 1994), and corresponding ESTs are present in the seed database (Table IV; Fig. 1, reaction 1). Most ESTs correspond to SUC2, but there is also a contig of ESTs that are more similar to the Suc transporter from bean (Table IV). Whether this class of ESTs represents a third Suc transporter gene from Arabidopsis specific for developing seeds needs to be further investigated. Furthermore, several ESTs with similarity to hexose transporters are present, which may be involved in the import of hexoses derived from Suc cleavage by apoplastic invertase.

Hexose metabolites enter the plastid to provide precursors for starch and fatty acid biosynthesis. Using isolated plastids of developing embryos of oilseed rape, it has been shown that labeled Glc-6-P and pyruvate are the most efficient of all the different possible substrates tested in labeling starch and triacylglycerols, respectively (Kang and Rawsthorne, 1994). Furthermore, fatty acid biosynthesis was stimulated if Glc-6-P and pyruvate were present (Kang and Rawsthorne, 1996). ESTs with similarity to a plastid Glc-6-P/phosphate (or triosephosphate) antiporter (Kammerer et al., 1998) are abundant in the seed EST database (Table IV; Fig. 1, reaction 17). However, we were unable to identify a set of ESTs with similarities to any known pyruvate or monocarboxylic acid transporter (Table IV; Fig. 1, reaction 20). Either pyruvate does not require a specific translocator, the respective protein cannot be identified without further biochemical or molecular information, or pyruvate is not the metabolite imported into plastids in vivo. It has been previously suggested that a plastid phosphoenolpyruvate/phosphate antiporter may be providing the plastid with pyruvate following metabolism of the imported phosphoenolpyruvate (Fischer et al., 1997). There are several ESTs present encoding proteins with similarity to a phosphoenolpyruvate translocator (Table IV; Fig. 1, reaction 19). A high expression of this antiporter in non-green plant tissues has also been observed using conventional methods (Kammerer et al., 1998). In the same study it was also shown that the gene for the triosephosphate/phosphate translocator is much more highly expressed in green tissues as compared with non-green tissues. Thus, the presence of only one EST for the respective gene in the seed database (Table IV; Fig. 1, reaction 18) is in agreement with the conventional northern analysis.

Glycolysis, Oxidative Pentose Phosphate Cycle, and Starch Metabolism

In general, plants have a complete glycolytic pathway in the cytosol (Plaxton, 1996) and it has been shown that a complete pathway also exists in the plastids of oil seeds (Dennis and Miernyk, 1982; Kang and Rawsthorne, 1994). The question remains to what extent both pathways are utilized in the conversion of carbohydrates into precursors of fatty acid biosynthesis. All genes encoding glycolytic enzymes of the cytosol are expressed, whereas ESTs encoding plastidic isoforms are absent in many cases (Fig. 1; Table IV). Exceptions are the central reactions 30 through 33 of the plastidic glycolytic pathway, as well as the plastidic isoform of pyruvate kinase (reaction 36). It seems certain that there is differential transcriptional regulation of the two pathways. Assuming that there is no general difference between the specific activities of the cytosolic and plastid enzymes, the data would be consistent with a more active cytosolic pathway. The peculiarly high expression of plastidic pyruvate kinase genes (reaction 36) in conjunction with the relatively high abundance of phosphoenolpyruvate transporter ESTs (reaction 19) is consistent with a major route of carbon from Suc into precursors of fatty acid biosynthesis involving the cytosolic glycolytic pathway up to phosphoenolpyruvate, import of this compound into the plastid, and subsequent conversion to pyruvate. It is interesting that ESTs for plastid isoforms of pyruvate dehydrogenase (27 ESTs) are approximately 2-fold more abundant than for mitochondrial isoforms (13 ESTs). This contrasts with the non-seed Arabidopsis EST set in dbEST where ESTs are approximately equal for the two subcellular localizations. These comparisons are clearly consistent with our expectations of the relative flux through fatty acid synthesis and the tricarboxylic acid cycle in seed and non-seed tissues.

Biosynthesis of fatty acids requires not only carbon units, but also more than twice as many moles of reduced nicotinamide nucleotides per fatty acid (Ohlrogge et al., 1993). Reductants for fatty acid biosynthesis can be generated in the heterotrophic plastid by the pyruvate dehydrogenase reaction (reaction 44), by the initial reactions (reactions 37 and 39) of the oxidative pentose phosphate cycle, and in green seeds by photosystem I. Although the different subunits of the plastidic pyruvate dehydrogenase complex are highly expressed (Table IV, reaction 44), only one out of seven Glc-6-P dehydrogenase- (reaction 37) encoding ESTs could be clearly identified as plastidic. No ESTs were found for reactions 38 through 41 of the plastidic oxidative pentose phosphate cycle, but ESTs encoding enzymes involved in recycling the carbon moieties were plentiful (reactions 42 and 43). It is known that plastidic Glc-6-P dehydrogenase is allosterically regulated in sophisticated ways in photosynthetic tissues (Wenderoth et al., 1997). Thus it seems possible that this tight regulation of the oxidative pentose phosphate pathway begins already at the level of transcription and is visible in the low abundance of the respective ESTs. Plastids of developing Arabidopsis seeds are transiently green and some of the most abundant ESTs encode proteins of the photosynthetic membrane (Table III), supporting the conclusion (Browse and Slack, 1985; Eastmond et al., 1996; Asokanthan et al., 1997; Bao et al., 1998) that some of the reducing equivalents required for fatty acid biosynthesis are derived from photosynthesis.

Developing seeds of Arabidopsis transiently accumulate starch (Focks and Benning, 1998). In accordance with this, ESTs encoding enzymes involved in starch biosynthesis and degradation are quite abundant (Fig. 1; Table IV, reactions 21–24), similar to those encoding enzymes that catalyze the initial reactions of fatty acid biosynthesis (reactions 45–48). The ESTs of starch metabolism represent an example of the apparent coordinate expression of genes encoding enzymes of the same metabolic pathway and may reveal a regulon.

Fatty Acid Biosynthesis

Given that the major carbon storage in developing oil seeds is associated with triacylglycerol, but not starch, one would expect that ESTs encoding enzymes directly involved in fatty acid biosynthesis are at least as abundant as those encoding starch metabolic enzymes. This seems to be true for the ketoacyl-acyl carrier protein synthases (reactions 46, 47, and 51) as well as for acetyl-coenzyme A (CoA) carboxylase (reaction 45), which provides the malonyl-CoA substrate for fatty acid biosynthesis. In general the relative abundance of the cDNAs encoding different enzymes of fatty acid synthesis is similar in the seed and non-seed EST sets, suggesting that seeds do not alter to a substantial degree the relative expression of genes encoding pathway components to accomplish the increased flux through the pathway in seeds. Rather, the entire pathway is apparently up-regulated, as suggested by the overall higher relative abundance of ESTs noted for fatty acid synthesis ESTs in the seed compared with the non-seed sets (Mekhedov et al., 2000). These data, therefore, confirm tissue mRNA expression data from several studies of genes encoding individual enzymes of fatty acid synthesis (e.g. Fawcett et al., 1994), but furthermore suggest that at least nine genes encoding enzymes or subunits involved in this pathway are coordinately regulated. The broader scale in silico expression analysis presented here has thus uncovered phenomena that were not apparent from the previous studies focusing on single genes.


We have provided a large data set of ESTs from developing Arabidopsis seeds and have begun to analyze this rich resource. The analysis of this data set is not complete and some of the conclusions may have to be revised as better bioinformatics tools become available. However, based on our preliminary analysis it is clear that this data set is substantially different from the currently available public Arabidopsis EST data set. With few exceptions, there is considerable congruence between conventional biochemical wisdom regarding seed metabolism and the number of ESTs encoding seed metabolic enzymes. Even by examining only 52 reactions (Fig. 1), patterns of expression became obvious. These observed patterns may reflect the existence of metabolic regulons, groups of genes that are coordinately expressed. In many cases the current EST data set provides the first experimental access to these genes and the basis for their in-depth molecular analysis and for the biochemical studies of the encoded proteins.


Library Preparation and Screening

To construct the Arabidopsis developing seed cDNA library, immature seeds of Arabidopsis ecotype Columbia-2 were collected 5 to 13 d after flowering. RNA was extracted according to Hall et al. (1978) from 1 g of seed tissue and a directional Uni-ZAP XR cDNA library was commercially prepared from poly(A)+ mRNA (Stratagene, La Jolla, CA). The initial titer of the amplified library was 1.9 × 10¹⁰ plaque-forming units/mL. Based on 48 randomly selected clones, the average insert size was estimated to be 1.9 kb. Following the excision of phagemids, bacterial colonies were arrayed onto nylon membranes at a density of 36 clones cm⁻² by Genome Systems (St. Louis). Data were generated in two stages corresponding to a membrane set with 9,136 cDNA clones and a second set containing 18,432 clones. The first set of membranes was hybridized with 12S and 2S seed storage protein cDNA clones. Non-hybridizing clones were selected for sequencing. The second set of membranes was hybridized with six pools of five different probes derived from cDNAs (Table I) that were highly abundant among the EST sequences from the first set. Non-hybridizing clones were sequenced following re-racking.

Sequence Analysis

The first set of cDNAs (data set I) was sequenced at Michigan State University from the 5′ ends using the SK primer for pBluescript II, or from the 3′ ends using the M13 −21 primer. The second set of cDNAs (data set II) was sequenced by Incyte Pharmaceuticals (Palo Alto, CA) from the 5′ ends using the Bluescript T3 primer. Chromatograms from data set I were processed in batches using Sequencher v.3.0 (Gene Codes, Ann Arbor, MI). The 5′- and 3′-ambiguous sequences were trimmed, and vector sequences were removed as part of this process. Sequences that were less than 150 bp long or had >4% ambiguity were not processed. Chromatograms from data set II were processed in bulk using PHRED (Phil Green and Brent Ewing, University of Washington, Seattle). Sequences that were less than 225 bp long or had >4% ambiguity were not further processed. At this time 95% of the sequences have been deposited at GenBank. The remaining 5% (exclusively derived from data set II) will be available in GenBank by March 2001.
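The length and ambiguity cutoffs described above can be illustrated with a short sketch. This is not the processing performed by Sequencher or PHRED; it is a minimal Python example that assumes ambiguity is measured as the fraction of base calls other than A, C, G, or T in the trimmed read. The function names and the dictionary of thresholds are hypothetical and were chosen for illustration only.

```python
# Minimal sketch of the read-acceptance rule described in the text.
# Assumption: "ambiguity" is the fraction of calls that are not A, C, G, or T.

# Minimum read lengths (bp) stated in the text for each data set.
MIN_LENGTH = {"I": 150, "II": 225}


def passes_quality_filter(seq, min_length, max_ambiguity=0.04):
    """Return True if a trimmed read is long enough and at most 4% ambiguous."""
    seq = seq.upper()
    if len(seq) == 0 or len(seq) < min_length:
        return False
    ambiguous = sum(1 for base in seq if base not in "ACGT")
    return ambiguous / len(seq) <= max_ambiguity


def filter_reads(reads, data_set):
    """Keep only the reads (name -> sequence) that pass the data set's cutoffs."""
    cutoff = MIN_LENGTH[data_set]
    return {name: seq for name, seq in reads.items()
            if passes_quality_filter(seq, cutoff)}
```

Under these assumptions, a clean 200-bp read would be retained by the data set I cutoff but rejected by the stricter 225-bp cutoff used for data set II.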

Database Searches

For data set I, sequences were processed with the Genetics Computer Group programs (Wisconsin Package Version 9.1, Madison, WI), and used for similarity searches against GenBank by using shell or PERL scripts that call Genetics Computer Group NETBLAST (BLASTX version 1.4.11; Altschul et al., 1990) for each sequence. Searches were done in batches. For data set II, the FASTA file produced by PHRED/PHD2FASTA was processed by PERL scripts to do BLASTX searches with default parameters. The BLASTX searches were done over a period of approximately 12 months, from September 2, 1998 to September 21, 1999, using the most recent releases of GenBank. A subset was periodically retested (see below). The output from BLASTX was processed with PERL scripts to extract the top scoring hit from each result file. The following information for the top scoring entry in each result file was retained: gene identifier, description, BLAST score, probability, percent identity, alignment length, and reading frame. These results were compiled in text files. Each result was manually interpreted and categorized according to predicted biochemical function. BLASTN searches were done against a subset of dbEST (available at http://www.Arabidopsis.org/seqtools.html) containing only Arabidopsis sequences using a FASTA file with all raw sequences. Stand-alone BLASTN version 2.0.9 running under Linux 5.2 was used for this analysis.
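The step of retaining only the top scoring hit can be sketched as follows. The PERL scripts used for this study are not reproduced here, and the sketch does not parse BLASTX output. It assumes the hits for one clone have already been read into records, and the `Hit` record, the function names, and the tab-delimited row layout are hypothetical. The fields mirror those listed in the text.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Hit:
    """One database match for a query EST (hypothetical record layout)."""
    gene_id: str
    description: str
    score: float
    probability: float
    percent_identity: float
    alignment_length: int
    reading_frame: int


def top_hit(hits: Iterable[Hit]) -> Optional[Hit]:
    """Return the highest-scoring hit, or None if the search found nothing."""
    hits = list(hits)
    if not hits:
        return None
    # On ties, max() keeps the first hit encountered.
    return max(hits, key=lambda h: h.score)


def summary_row(clone_id: str, hit: Optional[Hit]) -> str:
    """Format one tab-delimited line for a compiled text file."""
    if hit is None:
        return f"{clone_id}\tno hit"
    fields = [clone_id, hit.gene_id, hit.description, hit.score,
              hit.probability, hit.percent_identity,
              hit.alignment_length, hit.reading_frame]
    return "\t".join(str(field) for field in fields)
```

Keeping one row per clone in a plain text file is one simple way to support the manual interpretation and functional categorization described above.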

Contig Analysis

Contig analysis was performed with PHRAP (Phil Green, University of Washington, Seattle). Chromatograms from both data sets were processed with PHRED/PHD2FASTA, CROSS_MATCH (to mask vector sequence), and PHRAP. The first 30 bp from each sequence were trimmed during assembly by PHRAP. The .ace output file from PHRAP was processed with a PERL script to obtain the list of ESTs in each contig. Contigs were manually screened and corrected in cases where obviously unrelated sequences were clustered together.
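Listing the ESTs in each contig from a .ace file can be sketched as below. This is not the PERL script used in this study. The Python example assumes the standard ACE layout, in which each contig begins with a `CO <contig name> ...` line and each member read is declared on an `AF <read name> ...` line. The function name is hypothetical.

```python
def reads_per_contig(ace_lines):
    """Map each contig name to the list of read names assigned to it.

    Assumes the standard ACE layout: 'CO <contig> ...' opens a contig and
    'AF <read> ...' lines name its member reads. All other lines are ignored.
    """
    contigs = {}
    current = None
    for line in ace_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank lines and sequence lines carry no record tag
        if fields[0] == "CO":
            current = fields[1]
            contigs[current] = []
        elif fields[0] == "AF" and current is not None:
            contigs[current].append(fields[1])
    return contigs
```

For a file on disk, the function could be called as `reads_per_contig(open(path))`, where `path` is the location of the .ace file. Free-text tag blocks in an .ace file are not handled specially in this sketch, and the resulting list would still need the manual screening described above.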


All data were imported into a Microsoft Access 97 relational database. The database was built around unique clone identifiers that refer to clone locations in microtiter plates. In some cases entries for 3′ sequences are available. These can be recognized by the last letter X added to the clone identifier. In a few cases the same clone has been sequenced twice. This has been marked by adding the last letters A and B to the clone identifier. The database and the PERL scripts are available for viewing at our web page at http://benningnt.bch.msu.edu/index.htm.
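The suffix convention for clone identifiers can be illustrated with a small sketch. It is not taken from the database itself. The example identifier format (such as `P012C07`) is hypothetical, and the sketch assumes that an identifier without a suffix never ends in the letters X, A, or B.

```python
from typing import NamedTuple


class CloneEntry(NamedTuple):
    clone: str  # identifier of the clone location in the microtiter plate
    kind: str   # what this database entry represents


# Suffix letters described in the text.
SUFFIX_MEANING = {
    "X": "3' sequence",
    "A": "repeat sequence A",
    "B": "repeat sequence B",
}


def classify_identifier(identifier: str) -> CloneEntry:
    """Split a database identifier into the clone name and the entry type.

    Assumption: plain clone identifiers never end in X, A, or B.
    """
    suffix = identifier[-1:].upper()
    if len(identifier) > 1 and suffix in SUFFIX_MEANING:
        return CloneEntry(identifier[:-1], SUFFIX_MEANING[suffix])
    return CloneEntry(identifier, "single entry")
```

Grouping entries by the `clone` field would then bring together all sequences obtained from the same physical clone.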

Supplementary Material

[Supplemental Data]


We thank Sergei Mekhedov for advice and Jay Thelen for analysis of pyruvate dehydrogenase sequences. We would also like to thank Chris Eakin and Chris Beasley for their help with the annotation of data and construction of the web site.


1This work was supported in part by the National Science Foundation (grant nos. MCB–94–06466 and IBN–97–23778), the Midwestern Consortium for Plant Biotechnology Research, Dow Agroscience, and the Michigan Agricultural Experiment Station.

[w]The online version of this article contains Web-only data. The supplemental material is available at www.plantphysiol.org.


  • Agyare FD, Lashkari DA, Lagos A, Namath AF, Lagos G, Davis RW, Lemieux B. Mapping expressed sequence tag sites on yeast artificial chromosome clones of Arabidopsis thaliana DNA. Genome Res. 1997;7:1–9. [PubMed]
  • Allona I, Quinn M, Shoop E, Swope K, Cyr SS, Carlis J, Riedl J, Retzel E, Campbell MM, Sederoff R, Whetten RW. Analysis of xylem formation in pine by cDNA sequencing. Proc Natl Acad Sci USA. 1998;95:9693–9698. [PMC free article] [PubMed]
  • Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–410. [PubMed]
  • Asokanthan PS, Johnson RW, Griffith M, Krol M. The photosynthetic potential of canola embryos. Physiol Plant. 1997;101:353–360.
  • Audic S, Claverie JM. The significance of digital gene expression profiles. Genome Res. 1997;7:986–995. [PubMed]
  • Bao XM, Pollard M, Ohlrogge J. The biosynthesis of erucic acid in developing embryos of Brassica rapa. Plant Physiol. 1998;118:183–190. [PMC free article] [PubMed]
  • Boguski MS. Biosequence exegesis. Science. 1999;286:453–455. [PubMed]
  • Boguski MS, Lowe TM, Tolstoshev CM. dbEST: database for “expressed sequence tags.” Nat Genet. 1993;4:332–333. [PubMed]
  • Boston RS, Viitanen PV, Vierling E. Molecular chaperones and protein folding in plants. Plant Mol Biol. 1996;32:191–222. [PubMed]
  • Bouchez D, Hofte H. Functional genomics in plants. Plant Physiol. 1998;118:725–732. [PMC free article] [PubMed]
  • Browse J, Slack CR. Fatty acid synthesis in plastids from maturing safflower and linseed cotyledons. Planta. 1985;166:74–80. [PubMed]
  • Cahoon EB, Carlson TJ, Ripp KG, Schweiger BJ, Cook GA, Hall SE, Kinney AJ. Biosynthetic origin of conjugated double bonds: production of fatty acid components of high-value drying oils in transgenic soybean embryos. Proc Natl Acad Sci USA. 1999;96:12935–12940. [PMC free article] [PubMed]
  • Cooke R, Raynal M, Laudie M, Grellet F, Delseny M, Morris PC, Guerrier D, Giraudat J, Quigley F, Clabault G, Li YF, Mache R, Krivitzky M, Gy IJ, Kreis M, Lecharny A, Parmentier Y, Marbach J, Fleck J, Clement B, Philipps G, Herve C, Bardet C, Tremousaygue D, Hofte H. Further progress towards a catalogue of all Arabidopsis genes: analysis of a set of 5,000 non-redundant ESTs. Plant J. 1996;9:101–124. [PubMed]
  • Dennis D, Miernyk J. Compartmentation of non-photosynthetic carbohydrate metabolism. Annu Rev Plant Physiol. 1982;33:27–50.
  • DeRisi JL, Iyer VR, Brown PO. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science. 1997;278:680–686. [PubMed]
  • Eastmond P, Kolacna L, Rawsthorne S. Photosynthesis by developing embryos of oil seed rape (Brassica napus L.). J Exp Bot. 1996;47:1763–1769.
  • Ewing RM, Kahla AB, Poirot O, Lopez F, Audic S, Claverie JM. Large-scale statistical analyses of rice ESTs reveal correlated patterns of gene expression. Genome Res. 1999;9:950–959. [PMC free article] [PubMed]
  • Fawcett T, Simon WJ, Swinhoe R, Shanklin J, Nishida I, Christie WW, Slabas AR. Expression of mRNA and steady-state levels of protein isoforms of enoyl-ACP reductase from Brassica napus. Plant Mol Biol. 1994;26:155–163. [PubMed]
  • Fischer K, Kammerer B, Gutensohn M, Arbinger B, Weber A, Hausler RE, Flugge UI. A new class of plastidic phosphate translocators: a putative link between primary and secondary metabolism by the phosphoenolpyruvate/phosphate antiporter. Plant Cell. 1997;9:453–462. [PMC free article] [PubMed]
  • Focks N, Benning C. wrinkled1: a novel, low-seed-oil mutant of Arabidopsis with a deficiency in the seed-specific regulation of carbohydrate metabolism. Plant Physiol. 1998;118:91–101. [PMC free article] [PubMed]
  • Hall TC, Ma Y, Buchbinder BU, Pyne JW, Sun SM, Bliss FA. Messenger RNA for G1 protein of French bean seeds: cell free translation and product characterization. Proc Natl Acad Sci USA. 1978;75:3196–3200. [PMC free article] [PubMed]
  • Hieter P, Boguski M. Functional genomics: it's all how you read it. Science. 1997;278:601–602. [PubMed]
  • Kammerer B, Fischer K, Hilpert B, Schubert S, Gutensohn M, Weber A, Flugge UI. Molecular characterization of a carbon transporter in plastids from heterotrophic tissues: the glucose 6-phosphate/phosphate antiporter. Plant Cell. 1998;10:105–117. [PMC free article] [PubMed]
  • Kang F, Rawsthorne S. Starch and fatty acid biosynthesis in plastids from developing embryos of oil seed rape. Plant J. 1994;6:795–805.
  • Kang F, Rawsthorne S. Metabolism of glucose-6-phosphate and utilization of multiple metabolites for fatty acid synthesis by plastids from developing oilseed rape embryos. Planta. 1996;199:321–327.
  • Kinoshita T, Nishimura M, Hara-Nishimura I. Homologues of a vacuolar processing enzyme that are expressed in different organs in Arabidopsis thaliana. Plant Mol Biol. 1995;29:81–89. [PubMed]
  • Loftus SK, Chen Y, Gooden G, Ryan JF, Birznieks G, Hilliard M, Baxevanis AD, Bittner M, Meltzer P, Trent J, Pavan W. Informatic selection of a neural crest-melanocyte cDNA set for microarray analysis. Proc Natl Acad Sci USA. 1999;96:9277–9280. [PMC free article] [PubMed]
  • Meinke DW, Cherry JM, Dean C, Rounsley SD, Koornneef M. Arabidopsis thaliana: a model plant for genome analysis. Science. 1998;282:679–682. [PubMed]
  • Mekhedov S, Martinez de Ilarduya O, Ohlrogge J. Towards a functional catalog of the plant genome: a survey of genes for lipid biosynthesis. Plant Physiol. 2000;122:389–401. [PMC free article] [PubMed]
  • Newman T, de Bruijn FJ, Green P, Keegstra K, Kende H, McIntosh L, Ohlrogge J, Raikhel N, Somerville S, Thomashow M. Genes galore: a summary of methods for accessing results from large-scale partial sequencing of anonymous Arabidopsis cDNA clones. Plant Physiol. 1994;106:1241–1255. [PMC free article] [PubMed]
  • Ohlrogge JB, Jaworski JG, Post-Beittenmiller D. De novo fatty acid biosynthesis. In: Moore TS, editor. Lipid Metabolism in Plants. Boca Raton, FL: CRC Press; 1993. pp. 3–32.
  • Plaxton WC. Organization and regulation of plant glycolysis. Annu Rev Plant Physiol Plant Mol Biol. 1996;47:185–214. [PubMed]
  • Rafalski JA, Hanafey M, Miao GH, Ching A, Lee JM, Dolan M, Tingey S. New experimental and computational approaches to the analysis of gene expression. Acta Biochim Pol. 1998;45:929–934. [PubMed]
  • Rounsley SD, Glodek A, Sutton G, Adams MD, Somerville CR, Venter JC, Kerlavage AR. The construction of Arabidopsis expressed sequence tag assemblies: a new resource to facilitate gene identification. Plant Physiol. 1996;112:1177–1183. [PMC free article] [PubMed]
  • Ruan Y, Gilmore J, Conner T. Towards Arabidopsis genome analysis: monitoring expression profiles of 1,400 genes using cDNA microarrays. Plant J. 1998;15:821–833. [PubMed]
  • Sauer N, Stolz J. SUC1 and SUC2: two sucrose transporters from Arabidopsis thaliana: expression and characterization in baker's yeast and identification of the histidine-tagged protein. Plant J. 1994;6:67–77. [PubMed]
  • Sterky F, Regan S, Karlsson J, Hertzberg M, Rohde A, Holmberg A, Amini B, Bhalerao R, Larsson M, Villarroel R, Van Montagu M, Sandberg G, Olsson O, Teeri TT, Boerjan W, Gustafsson P, Uhlen M, Sundberg B, Lundeberg J. Gene discovery in the wood-forming tissues of poplar: analysis of 5,692 expressed sequence tags. Proc Natl Acad Sci USA. 1998;95:13330–13335. [PMC free article] [PubMed]
  • Van de Loo FJ, Broun P, Turner S, Somerville C. An oleate 12-hydroxylase from Ricinus communis L. is a fatty acyl desaturase homolog. Proc Natl Acad Sci USA. 1995a;92:6743–6747. [PMC free article] [PubMed]
  • Van de Loo FJ, Turner S, Somerville C. Expressed sequence tags from developing castor seeds. Plant Physiol. 1995b;108:1141–1150. [PMC free article] [PubMed]
  • Walden R, Cordeiro A, Tiburcio AF. Polyamines: small molecules triggering pathways in plant growth and development. Plant Physiol. 1997;113:1009–1013. [PMC free article] [PubMed]
  • Weber H, Borisjuk L, Heim U, Sauer N, Wobus U. A role for sugar transporters during seed development: molecular characterization of a hexose and a sucrose carrier in fava bean seeds. Plant Cell. 1997;9:895–908. [PMC free article] [PubMed]
  • Wenderoth I, Scheibe R, von Schaewen A. Identification of the cysteine residues involved in redox modification of plant plastidic glucose-6-phosphate dehydrogenase. J Biol Chem. 1997;272:26985–26990. [PubMed]

Articles from Plant Physiology are provided here courtesy of American Society of Plant Biologists