Plant Physiol. 2011 Aug; 156(4): 1661–1678.
Published online 2011 Jun 8. doi: 10.1104/pp.111.178616
PMCID: PMC3149962

Gene Discovery and Tissue-Specific Transcriptome Analysis in Chickpea with Massively Parallel Pyrosequencing and Web Resource Development


Chickpea (Cicer arietinum) is an important food legume crop but lags in the availability of genomic resources. In this study, we have generated about 2 million high-quality sequences with an average length of 372 bp using pyrosequencing technology. The optimization of de novo assembly clearly indicated that hybrid assembly of long-read and short-read primary assemblies gave better results. The hybrid assembly generated a set of 34,760 transcripts with an average length of 1,020 bp representing about 4.8% (35.5 Mb) of the total chickpea genome. We identified more than 4,000 simple sequence repeats, which can be developed as functional molecular markers in chickpea. Putative function and Gene Ontology terms were assigned to at least 73.2% and 71.0% of chickpea transcripts, respectively. We have also identified several chickpea transcripts that showed tissue-specific expression and validated the results using real-time polymerase chain reaction analysis. Based on sequence comparison with other species within the plant kingdom, we identified two sets of lineage-specific genes, including those conserved in the Fabaceae family (legume specific) and those lacking significant similarity with any non-chickpea species (chickpea specific). Finally, we have developed a Web resource, Chickpea Transcriptome Database, which provides public access to the data and results reported in this study. The strategy for optimization of de novo assembly presented here may further facilitate transcriptome sequencing and characterization in other organisms. Most importantly, the data and results reported in this study will help to accelerate research in various areas of genomics and the implementation of breeding programs in chickpea.

Chickpea (Cicer arietinum) is an annual and self-pollinated plant that ranks third in legume production worldwide. The genome of chickpea is diploid (2n = 2x = 16) and of moderate size (approximately 740 Mb). Chickpea provides a protein-rich supplement to the cereal-based diets and is a very important food crop for nutrition in the developing world. In addition, like other legumes, chickpea has the ability to fix atmospheric nitrogen biologically, which is very important for agricultural sustainability (Graham and Vance, 2003). Despite its economic importance, chickpea production has been very low due to significant yield losses caused by several biotic and abiotic factors. A few breeding programs have been implemented successfully to improve the yield of chickpea under various stress conditions (Millan et al., 2006). However, biotechnological approaches have not been applied due to the recalcitrant nature of the chickpea and the availability of very limited genomic resources.

The amount of genomic information available for chickpea has been increasing recently. Efforts have been made to generate linkage genetic maps for chickpea based on microsatellites (Hüttel et al., 1999; Radhika et al., 2007; Upadhyaya et al., 2008; Millan et al., 2010; Gaur et al., 2011). Chickpea genome organization and composition have also been analyzed based on the 500-kb sequence from 11 bacterial artificial chromosome (BAC) clones (Rajesh et al., 2008). The construction of BAC libraries has been reported for the identification of clones associated with disease resistance (Rajesh et al., 2004; Lichtenzveig et al., 2005). Recently, a BAC/binary BAC-based physical map of cultivated chickpea (cv Hadas) has also been developed (Zhang et al., 2010). In addition, a few EST projects have provided a few thousand single-pass sequences (Buhariwalla et al., 2005; Gao et al., 2008; Ashraf et al., 2009; Varshney et al., 2009; Jain and Chattopadhyay, 2010). Although several genes/ESTs involved in various stress responses have been identified based on transcriptomic and proteomic studies (Pandey et al., 2006, 2008; Mantri et al., 2007; Molina et al., 2008, 2011; Varshney et al., 2009), the gene discovery has been very limited in chickpea. So far, only a few candidate genes have been cloned and functionally validated (Kaur et al., 2008; Shukla et al., 2009; Tripathi et al., 2009; Peng et al., 2010).

The EST generation provides a very useful and quick means of gene discovery, which can be used for functional genomic studies and understanding the molecular basis of agriculturally important traits in crop plants. With the advent of next generation sequencing (NGS) technologies, the deep sequencing of transcriptomes has become very rapid and economical (Morozova et al., 2009). Recently, we reported the sequencing and de novo assembly of the chickpea transcriptome using short-read data, which provided first insights into the gene content (Garg et al., 2011). However, for a better assembly and coverage of the transcriptome and to complement our ongoing whole genome sequencing and annotation efforts for chickpea, we have generated a huge collection of longer reads also using pyrosequencing technology. A total of 1,931,224 high-quality reads from various tissue samples were obtained and assembled into 34,760 transcripts after optimization of de novo assembly. Furthermore, we have performed the functional annotation of the chickpea transcripts and analyzed the tissue-specific expression. This huge collection of new sequences represents a highly diverse set of chickpea genes and constitutes an important resource for genomic studies in chickpea and other related legume species. We have also identified several simple sequence repeats (SSRs) for the development of functional markers. The identification of lineage-specific chickpea transcripts has also been reported. In addition, we have developed a user-friendly Web resource for public access of the data and results generated in this study.


Pyrosequencing of the Chickpea Transcriptome

In our previous study, we presented the first snapshot of the chickpea transcriptome based on short-read sequence data (Garg et al., 2011). To present a better picture of the chickpea transcriptome, we performed long-read transcriptome sequencing from five different tissue samples and one mixed tissue sample in four flow cells of the GS FLX Titanium sequencer. After separation of sequence reads for each tissue type, more than 2.5 million reads (0.15–0.76 million reads from individual tissue samples) of Q20 quality were obtained. The reads thus obtained were passed through several quality-control filters. After removing low-quality reads, trimming adapter/primer sequences, trimming sequences containing homopolymers of more than 7 bp, and removing sequences of less than 100 bp, a total of 2,347,832 reads were obtained (Table I). Furthermore, 416,608 reads corresponding to rRNA were removed as per the criteria described in “Materials and Methods.” Finally, a total of 1,931,224 high-quality reads corresponding to mRNA with an average length of 372 bp were obtained. The number of high-quality reads for different tissue samples ranged from 103,524 for root sample to 551,932 for shoot sample. Overall, these reads covered a total of 718,245,395 bases. The length distribution of these high-quality reads shows that most of them are more than 300 bp in length (Supplemental Fig. S1A). The average Phred quality score of 90% of the high-quality reads was found to be more than 30 (Supplemental Fig. S1B). The summary of sequencing data generated from different tissue samples and their quality filtering is given in Table I.
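The length-based filters described above can be illustrated with a minimal sketch. It assumes that a read is truncated at the start of its first homopolymer run longer than 7 bases and is then discarded if fewer than 100 bases remain; the exact trimming rule, the function names, and the toy reads are assumptions made for illustration and are not taken from the pipeline used in this study.

```python
import re


def trim_at_homopolymer(seq, max_run=7):
    """Truncate a read at the first homopolymer run longer than max_run bases.

    Assumption: truncation at the start of the run; the study's exact rule
    is not specified in this section.
    """
    # One base followed by at least max_run copies of itself = run > max_run.
    pattern = re.compile(r"([ACGTN])\1{%d,}" % max_run)
    match = pattern.search(seq.upper())
    return seq if match is None else seq[:match.start()]


def filter_read(seq, min_length=100, max_run=7):
    """Return the trimmed read, or None if it is shorter than min_length."""
    trimmed = trim_at_homopolymer(seq, max_run)
    return trimmed if len(trimmed) >= min_length else None


# Hypothetical reads, not data from this study.
clean_read = "ACGT" * 30                      # 120 bp, no long homopolymer
tailed_read = "ACGT" * 30 + "A" * 8 + "ACGT"  # long A run after 120 bp
short_read = "ACGT" * 10 + "A" * 10           # only 40 bp before the run

for read in (clean_read, tailed_read, short_read):
    kept = filter_read(read)
    print(len(read), None if kept is None else len(kept))
```

Applying the length cutoff after trimming, rather than before, ensures that reads shortened by trimming are not retained below the 100-bp threshold.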

Table I.
Summary of 454 sequencing data generated for chickpea transcriptome and quality filtering

De Novo Assembly Optimization of 454 Sequencing Data

With a few exceptions (Weber et al., 2007; Papanicolaou et al., 2009; Kumar and Blaxter, 2010), previous studies of transcriptome sequencing and analysis based on Roche 454 pyrosequencing have used only a single program for de novo assembly without any optimization. We used all 1,931,224 high-quality sequence reads from the different tissue samples to optimize the de novo assembly using eight assembly programs. Among these, the MIRA, Newbler (v2.3 and v2.5p1), CAP3, and TGICL assemblers are based on the overlap-layout-consensus strategy, whereas CLC, Velvet, and ABySS are based on the de Bruijn graph algorithm. The parameters used for various assemblers are described in “Materials and Methods.” We took several criteria into consideration to select the best de novo assembly, including assembly statistics, reads utilized for the assembly, and similarity/coverage of the reference sequences.

Statistics of Assembly Output Generated by Various Assemblers

Among the assemblers used, Velvet and ABySS are primarily short-read assemblers and are not considered optimal for de novo assembly of 454 pyrosequencing reads. However, we still utilized these programs to check their performance on our data. The assembly using Velvet and ABySS was performed at different k-mer lengths, and the best assembly was found at k = 99 for Velvet and k = 49 for ABySS, taking into consideration the N50 length (the length of the smallest contig in the set of largest contigs whose combined length represents at least 50% of the total assembly size) and the average length of the contigs of 100 bp or more in length (Supplemental Table S1). For the other assemblers also, the contigs of 100 bp or more in length were filtered and used for comparison. The singlets/nonassembled sequences were not taken into consideration for the comparison of output among different assemblers. The largest number of contigs was obtained using Velvet followed by MIRA, and the smallest number was obtained using Newbler v2.5p1. However, the largest assembly in terms of number of bases was obtained by MIRA (approximately 52 Mb), and the smallest was obtained by ABySS (approximately 22 Mb). The size of assemblies generated by Newbler v2.3, CLC, and TGICL was comparable (36.5–40.1 Mb). Surprisingly, Newbler v2.5p1 produced an assembly of 28.5 Mb, which is about 38% smaller than Newbler v2.3. This is in sharp contrast to the study by Kumar and Blaxter (2010), which reported that the assembly size obtained by Newbler v2.5p1 was 39% larger than that of Newbler v2.3 on the transcriptome data set of a nematode and outperformed other assemblers. This indicates that the choice of the best assembler depends on the data set and needs to be optimized. Newbler v2.3 generated the assembly with the largest N50 (1,617 bp) and the largest average contig length (1,200 bp). MIRA generated the largest number of contigs 1 kb or more in length, followed by Newbler v2.3 and TGICL. 
The average consensus quality of all the contigs generated by MIRA was 55, with 115 strong unresolved repeat positions and none of the weak unresolved repeat and mismatch positions. Velvet and ABySS did not perform well, as expected, and generated smaller contigs, as indicated by smaller N50 and average lengths. Our results indicate that Velvet and ABySS programs are not optimal for the de novo assembly of 454 sequencing data. The output of CLC assembly was moderate, with N50 length of 1,325 bp and average length of 961 bp. CAP3 produced a lesser number of contigs and smaller assembly size (32 Mb) as compared with TGICL. The assembly results of the different assemblers are summarized in Table II.
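The N50 and average contig length used to compare the assemblies can be computed as in the following minimal sketch. It applies the 100-bp contig cutoff mentioned above; the function names and the toy contig lengths are hypothetical and do not come from this study.

```python
def n50(contig_lengths, min_length=100):
    """N50 of the contigs that are at least min_length bp long.

    Contigs are taken from longest to shortest until their combined length
    reaches half of the total; the length of the last contig taken is the N50.
    """
    lengths = sorted((n for n in contig_lengths if n >= min_length), reverse=True)
    if not lengths:
        raise ValueError("no contigs at or above the length cutoff")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


def mean_length(contig_lengths, min_length=100):
    """Average length of the contigs that are at least min_length bp long."""
    kept = [n for n in contig_lengths if n >= min_length]
    return sum(kept) / len(kept)


# Toy contig lengths (hypothetical); the 90-bp contig is below the cutoff.
toy = [2000, 1500, 1000, 500, 300, 200, 90]
print(n50(toy), round(mean_length(toy), 1))
```

For the toy set, the retained contigs total 5,500 bp, so the two longest contigs (3,500 bp combined) are the first to reach half of the total, which gives an N50 of 1,500 bp.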

Table II.
Summary of de novo assembly results of 454 sequence data using various programs

Numbers of Reads Used for Assembly by Various Assemblers

Because different assemblers report number of reads utilized/not utilized for assembly in different ways or do not report it at all, we used a consensus method to identify the number of reads utilized for the assembly. We mapped all the high-quality reads on assembled contigs for each assembly using the CLC Genomics Workbench. The mapping results revealed that the largest number of reads was utilized by TGICL (95%) for the assembly, followed by MIRA (93%; Table II). The number of reads used for assembly using Velvet (16%) and ABySS (23%) was much less. Among other assemblers, Newbler v2.5p1 used the least number of reads for assembly (73%), but most of them showed unique hits, indicating the least redundancy. The redundancy was highest for Newbler v2.3 followed by MIRA, with about 20% and 14% of the total reads showing multiple hits, respectively. CLC and CAP3 were comparable in terms of the number of reads used and redundancy (Table II).
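The read-usage and redundancy percentages compared above can be derived from per-read mapping results as in the following sketch. It assumes that, for every read, the number of contigs it aligned to is available from the mapping output; the function name, the input format, and the toy values are hypothetical and do not reflect the CLC Genomics Workbench output format.

```python
def mapping_summary(contig_hits_per_read, total_reads):
    """Percentages of reads mapped, mapped uniquely, and mapped to >1 contig.

    contig_hits_per_read: one integer per read giving the number of contigs
    it aligned to (unmapped reads may be omitted or given as 0).
    """
    hits = list(contig_hits_per_read)
    unique = sum(1 for n in hits if n == 1)
    multiple = sum(1 for n in hits if n > 1)
    return {
        "mapped_pct": 100.0 * (unique + multiple) / total_reads,
        "unique_pct": 100.0 * unique / total_reads,
        "multiple_pct": 100.0 * multiple / total_reads,
    }


# Hypothetical hit counts for 8 of 10 reads; the other 2 reads are unmapped.
toy_hits = [1, 1, 1, 2, 3, 0, 1, 1]
print(mapping_summary(toy_hits, total_reads=10))
```

A high percentage of reads with multiple hits points to redundant contigs, which is how the redundancy of the Newbler v2.3 and MIRA assemblies is interpreted in the text.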

Coverage of a Reference Proteome

The validation of assembly output was also done by BLASTX search of contigs generated by various programs with reference sequences. We used the proteome of soybean (Glycine max) predicted from the genome sequence as a reference. The coverage of this proteome should reflect the quality of assembly. Although a variable number of contigs exhibited significant similarity with soybean proteins, the percentage was comparable (Table II). Therefore, we identified the number of unique soybean proteins to which the contigs showed significant similarity. The largest number of soybean proteins was represented in the Velvet (49.4%) assembly followed by MIRA (46.6%). However, the coverage of soybean proteins was much less in the Velvet assembly, which may be because of smaller contigs. The coverage of soybean proteins was highest for MIRA followed by TGICL, indicating better assemblies.
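The coverage of a reference protein by contig alignments can be estimated by merging the aligned intervals on the protein, as in the sketch below. It assumes that each significant BLASTX alignment is available as a pair of 1-based, inclusive subject coordinates; the function name and the example intervals are hypothetical, and the exact coverage calculation used in this study is not given in this section.

```python
def covered_fraction(intervals, protein_length):
    """Fraction of a protein covered by the union of aligned intervals.

    intervals: (start, end) pairs in 1-based inclusive protein coordinates.
    Overlapping or nested intervals are counted only once.
    """
    if protein_length <= 0:
        raise ValueError("protein_length must be positive")
    normalized = sorted((min(s, e), max(s, e)) for s, e in intervals)
    covered = 0
    last_end = 0  # last residue already counted
    for start, end in normalized:
        start = max(start, last_end + 1)  # skip residues counted earlier
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / protein_length


# Hypothetical alignments on a 100-residue protein.
print(covered_fraction([(1, 50), (40, 80), (90, 100)], 100))
```

Merging the intervals before summing prevents several overlapping contigs from inflating the coverage of the same protein region.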

Overall, based on the statistics of assembly output, number of reads used for the assembly, and coverage of a reference sequence, the assemblies generated by TGICL, MIRA, and Newbler v2.3 were considered better. It was very difficult to decide the best assembly program among these, as one or the other outperformed the rest on one of several criteria.

Merged Super Assembly

Recently, in a comparison of de novo assemblers for 454 transcriptome data, it has been reported that the super assembly of two primary assemblies generated using different programs gives better results in terms of consistency and size of contigs and alignment to the reference sequences (Kumar and Blaxter, 2010). Therefore, we also attempted super assemblies using TGICL by merging the primary assemblies of two assemblers at a time among TGICL, MIRA, and Newbler v2.3 (hereafter Newbler). We analyzed the output of super assemblies for super contigs and singletons generated together after the assembly. The super contigs represent consensus sequences of two or more of the contigs, and singletons represent individual contig sequences from primary assemblies. The merging of primary assemblies resulted in a lesser number of contigs than individual assemblies (Table III). The largest number of super contigs was formed in the super assembly of MIRA+TGICL. Although N50 and average lengths were higher in the super assembly of TGICL+Newbler, the total assembly size was largest for MIRA+Newbler and MIRA+TGICL super assemblies. The largest number of soybean proteins showed significant similarity with the super contigs of MIRA+TGICL super assembly. Furthermore, we attempted the super assembly of primary assemblies of all three best assemblers, TGICL, MIRA, and Newbler, using TGICL. This further increased the assembly size and number of soybean proteins showing significant similarity with super contigs (Table III). The N50 and average lengths of the super contigs were also further improved over that of the MIRA+TGICL super assembly. The coverage of the soybean proteome was better in the super assembly of three primary assemblies as compared with the super assembly of any two of the primary assemblies. 
Overall, based on the assembly statistics and the coverage of reference sequences, the super assembly of the three primary assemblies appeared to be the best for our 454 sequencing data. Interestingly, however, the mapping of 454 reads on the merged super assemblies revealed a high percentage of nonuniquely mapped reads, and the number was largest for the merged super assembly of the three primary assemblies (Table III). This indicates the redundancy of the contigs or the formation of chimeric contigs in the super assemblies, which do not represent a good assembly and were thus rejected. The study by Kumar and Blaxter (2010), which recommended the merging of assemblies from different programs to get more credible results, did not report the statistics of read mapping on the merged super assemblies. The results of our study indicate that the output of merged assemblies should be considered very carefully.

Table III.
Summary of TGICL merged super assembly validation using 454 sequence data

Hybrid Assembly

It has been suggested that hybrid assembly of 454 long-read and Illumina short-read sequencing data helps improve de novo assemblies of genomes and transcriptomes (Schatz et al., 2010). Earlier, we reported the de novo assembly of the chickpea transcriptome using about 107 million high-quality short reads generated on the Illumina platform (Garg et al., 2011). We attempted hybrid assembly using both the 454 (this study) and Illumina short-read (Garg et al., 2011) data sets via the Velvet and ABySS programs but, as expected, the results were unsatisfactory with respect to assembly statistics, coverage of reference soybean proteins, and mapping of reads (data not shown). Therefore, to further improve our above assembly of 454 data, we used the TGICL program to perform hybrid assemblies of the Newbler, MIRA, and TGICL primary assemblies with the 53,409 transcripts reported earlier based on the short-read data. The Newbler+short-read and TGICL+short-read hybrid assemblies generated more contigs as compared with Newbler and TGICL alone, respectively (Table IV). However, for the MIRA+short-read hybrid assembly, the number of contigs was reduced as compared with MIRA alone. The hybrid assemblies increased the N50 and average lengths of the contigs significantly. The number of larger contigs also increased in the hybrid assemblies as compared with primary assemblies. Among the three hybrid assemblies, the number of contigs and assembly size were smallest for Newbler+short-read, whereas its N50 and average contig lengths were largest. Notably, the coverage of soybean proteins also increased in the hybrid assemblies as compared with primary assemblies. 
Although a significantly larger number (46.5%) of soybean proteins showed significant hits with the contigs of MIRA+short-read hybrid assembly as compared with the other two hybrid assemblies (40.8% for TGICL+short-read and 39.9% for Newbler+short-read), the soybean proteins showing 80% or greater coverage were marginally higher for the MIRA+short-read hybrid assembly (Table IV). Furthermore, mapping of 454 reads on the contigs generated after hybrid assemblies revealed that although the smallest number of reads mapped on the contigs of the Newbler+short-read hybrid assembly, most of them were mapped uniquely (Table IV). Only 7.5% of the total 454 reads were nonuniquely mapped for the Newbler+short-read hybrid assembly as compared with 21.2% and 15.3% for the MIRA+short-read and TGICL+short-read hybrid assemblies, respectively. Based on the above observations, we concluded that the hybrid assembly of the 454 and Illumina data improved the assembly quality and considered that the hybrid assembly of Newbler+short-read is the best.

Table IV.
Summary of TGICL hybrid assembly validation

Features of the Chickpea Transcriptome

After a step-by-step optimization of the de novo assembly of the transcriptome, we report a total of 34,760 tentative consensus (TC) transcripts in chickpea representing 35,468,895 bp (35.5 Mb) of the sequence. A unique transcript identifier (ID) number has been assigned to all the 34,760 transcripts from TC00001 to TC34760. Although these transcripts represent a nonredundant set of sequences, some of these may represent alternatively spliced forms of the same gene locus. However, the exact picture will be clear only once the complete genome sequence of chickpea becomes available. Furthermore, the number of chickpea transcripts assembled here is lower than the total number of annotated protein-encoding loci in the completely sequenced genomes of the legumes soybean (55,787) and Medicago (53,423; Phytozome version 7.0; http://www.phytozome.net/). This difference might be due to the fact that chickpea transcripts generated here represent only the expressed sequences, and many of the annotated protein-encoding loci in soybean and Medicago genomes may not have the expression evidence, as has been reported for soybean (Libault et al., 2010; Severin et al., 2010). The average length of chickpea transcripts is 1,020 bp and the N50 size is 1,671 bp. The largest transcript length is 15,661 bp, and about 40% (13,803) of the transcripts were at least 1,000 bp in length (Supplemental Fig. S2). Only a very small fraction (220; 0.6%) of the transcripts were larger than the expected maximum length of 5,000 bp (Supplemental Fig. S2). Such large sequences were found to be present in the predicted coding sequences of Arabidopsis (Arabidopsis thaliana), rice (Oryza sativa), and soybean genes also (data not shown). A total of 1,720,477 (89.1%) 454 sequencing reads could be mapped (81.6% mapped uniquely) to these chickpea transcripts. 
In addition, a total of 88.3% of the Illumina reads generated earlier (Garg et al., 2011) could also be mapped (74.7% mapped uniquely) to these chickpea transcripts. In comparison with the chickpea transcriptome based on the de novo assembly of short reads (Garg et al., 2011), we found the present transcriptome much better in all respects, including size (1.3 times larger), average transcript size (1.95 times larger), and N50 size (1.86 times larger). In addition, the extent of coverage of the reference soybean protein sequences is more than twice that of the short-read transcriptome. About 46% of the chickpea transcripts in this study showed 50% or greater coverage of the coding sequence of the soybean proteins, as compared with about 20% of the short-read transcriptome. Furthermore, more than 34% of the contigs showed 80% or greater coverage of soybean proteins, which might represent full-length transcripts. In addition, we found that more than 77% of mRNA sequences and 66% of the ESTs available for chickpea at the National Center for Biotechnology Information (NCBI) mapped on the chickpea transcripts generated in this study. The average coverage of each chickpea transcript was 18.4 reads per kilobase per million (RPKM) for 454 reads and 22.8 RPKM for short reads. An average of 49.5 reads from 454 data and 2,712.3 reads from short-read data were mapped to each chickpea transcript. The average GC content of chickpea transcripts was 39.5%, which is similar to that of other legumes but a little lower than that of Arabidopsis (42.5%), as reported earlier (Garg et al., 2011).
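The RPKM values quoted above normalize the read count of a transcript by its length and by the sequencing depth. A minimal sketch of the usual calculation follows; it assumes that the depth term is the total number of mapped reads, which is the common convention but is not stated explicitly in this section, and the function name and example numbers are hypothetical.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = read_count / (length in kb) / (mapped reads in millions)
         = read_count * 1e9 / (length in bp * total mapped reads)
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Hypothetical transcript: 50 reads on 1,000 bp with 2 million mapped reads.
print(rpkm(50, 1000, 2_000_000))
```

Because both the length and the depth appear in the denominator, RPKM values from the 454 and short-read data sets can be compared even though the average number of reads per transcript differs by more than 50-fold.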

Identification of SSRs

SSRs consist of direct tandem repeats of short (two to six) nucleotides and are considered the most suitable choice for a wide range of genetic and population studies based on cost, labor, and genetic diversity. Based on the EST sequencing, only a few hundred SSRs (two to six nucleotides) have been identified in chickpea so far (Buhariwalla et al., 2005; Choudhary et al., 2009; Varshney et al., 2009). The availability of a limited number of SSR markers has been a constraint on many areas of chickpea research. To identify SSRs (dinucleotide to hexanucleotide repeats) in chickpea, we mined the transcripts generated in this study using the Perl script MISA. We did not consider mononucleotide repeats in this study, as we trimmed sequence reads with homopolymers of more than seven bases as a part of quality control. A total of 4,111 SSRs with a minimum length of 12 bp could be identified in 3,386 (9.7%) transcripts of chickpea (Table V). The average frequency of SSRs was found to be one SSR per 8.63 kb of the chickpea transcriptome sequence. Furthermore, 545 transcripts contained more than one SSR, and 334 SSRs were present in compound formation. Although the number of reiterations of an SSR type varied from five to 48, only a small fraction (7.8%) showed more than 10 reiterations (Supplemental Table S2). However, the SSRs with five reiterations were most abundant (35.3%). The largest fraction of SSRs identified were trinucleotides (57.9%) followed by dinucleotides (37.2%), as also reported by several studies in other plants (La Rota et al., 2005; Hisano et al., 2007; Cloutier et al., 2009). The largest fraction of trinucleotide repeats in ESTs/mRNAs has been attributed to the suppression of other types of repeats whose expansion and contraction may lead to frameshift errors in protein-coding regions (Metzgar et al., 2000; Morgante et al., 2002). 
However, in contrast, dinucleotide SSRs were found to be most abundant in the transcriptome of pigeonpea (Cajanus cajan), which has been attributed to the greater representation of untranslated regions in the sequence data set (Dutta et al., 2011). Among the SSRs identified, AG/CT (27.7%) accounted for 74.6% of the dinucleotide repeats and AAG/CTT (17.2%), AAC/GTT (9.6%), and AAT/ATT (8.7%) accounted for 61.2% of the trinucleotide repeats (Supplemental Table S2). A significant number of tetranucleotide (88), pentanucleotide (36), and hexanucleotide (79) repeats were also identified. The total number of SSRs identified in this study is about 47% higher than in our previous study excluding mononucleotide SSRs (Garg et al., 2011). The mapping of available EST sequences for chickpea at NCBI reveals that most of the EST-based SSRs identified in previous studies (Buhariwalla et al., 2005; Choudhary et al., 2009; Varshney et al., 2009) are well represented in the transcriptome presented in this study. The SSRs identified from the transcribed regions have better potential for linkage to loci that contribute to agronomic traits as compared with those identified from nontranscribed regions (Varshney et al., 2005). Being a cost-effective option for the development of functional markers, the identified SSRs will have considerable utility for marker-assisted selection and implementing breeding programs in chickpea.
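The kind of search performed for dinucleotide to hexanucleotide repeats can be illustrated with a short regular-expression sketch. This is not MISA and does not reproduce its settings; the minimum repeat counts (six for dinucleotides, five for longer motifs, all giving at least 12 bp), the function names, and the example sequence are assumptions made for illustration. Compound SSRs are not handled.

```python
import re

# Assumed minimum number of reiterations per motif length; the MISA
# parameters used in the study are not given in this section.
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. ATAT)."""
    k = len(motif)
    return not any(k % d == 0 and motif == motif[:d] * (k // d)
                   for d in range(1, k))


def find_ssrs(seq):
    """Return (start, motif, repeat_count) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    found = []
    for k, min_rep in MIN_REPEATS.items():
        # A k-base motif followed by at least (min_rep - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):  # drops homopolymers and nested motifs
                found.append((match.start(), motif, len(match.group(0)) // k))
    return sorted(found)


# Hypothetical sequence containing (AG)7 and (AAG)5.
example = "CCGT" + "AG" * 7 + "TTCA" + "AAG" * 5 + "GC"
print(find_ssrs(example))
```

The primitive-motif check mirrors the exclusion of mononucleotide repeats described in the text and prevents a dinucleotide repeat from being reported a second time as a tetranucleotide or hexanucleotide repeat.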

Table V.
Statistics of SSRs identified in chickpea transcripts

Sequence Conservation with Other Plants and Functional Annotation

To study the sequence conservation of chickpea transcripts in other plant species, a BLASTX search was performed against the 15 available proteomes of completely sequenced plant genomes. The largest number (78.6%) of chickpea transcripts exhibited significant similarity with soybean proteins, and the smallest number (61.1%) exhibited significant similarity with proteins of the distantly related Physcomitrella, as expected (Supplemental Fig. S3). Overall, 28,478 (82%) chickpea transcripts showed significant similarity with at least one protein sequence from the 15 plants. These transcripts, being conserved in other plant species, might perform similar functions in chickpea. Furthermore, we have been able to assign putative function to 23,491 chickpea transcripts based on their significant sequence similarity with Arabidopsis proteins. In addition, putative function was assigned to another 1,950 chickpea transcripts based on their significant hits in the UniRef90, UniRef100, NR, PFAM, SMART, KEGG, and COG databases. Overall, putative function could be assigned to a total of 25,441 chickpea transcripts, and the remaining 9,319 transcripts were designated as expressed proteins. The orthologs of genes/gene families involved in a wide variety of cellular processes, such as protein turnover, signal transduction, DNA binding, metabolism, stress response, and plant growth and development, were well represented in the chickpea transcripts. Furthermore, a diverse set of 3,117 PFAM domains was represented in our transcript data set, as revealed by AutoFACT analysis. Among these, the protein kinase domain was most highly represented, followed by the pentatricopeptide repeat, protein Tyr kinase, zinc finger C3HC4, RNA recognition motif, and cytochrome P450. The top 20 PFAM domains represented in the chickpea transcripts are shown in Supplemental Figure S4.

We assigned GOSlim terms to the chickpea transcripts based on their similarity to Arabidopsis proteins. The GOSlim terms for biological process, molecular function, and cellular component could be assigned to 65.3%, 65.7%, and 62.9% of the chickpea transcripts, respectively. In total, 71.9% of the chickpea transcripts were assigned at least one GOSlim term. Overall, the distribution of chickpea transcripts in various GOSlim categories appears to be very similar to that of Arabidopsis (Fig. 1A). The GOSlim terms protein metabolism, transferase activity, and chloroplast were most represented among the biological process, molecular function, and cellular component categories, respectively. Furthermore, the KEGG pathway analysis revealed that a diversity of pathways were represented in our chickpea transcriptome data set (Supplemental Table S3). The biosynthesis of secondary metabolites, metabolic pathways, ribosome, spliceosome, and ubiquitin-mediated proteolysis were the five most represented pathways among the chickpea transcripts. The number and assortment of allocated GO categories and pathways provide a good indication of the large diversity of expressed genes sampled from the chickpea transcriptome.

Figure 1.
GOSlim functional categorization of Arabidopsis and chickpea genes/transcripts, and distribution of TF-encoding genes/transcripts of soybean, Arabidopsis, and chickpea in different families. A, The percentage of Arabidopsis and chickpea genes/transcripts ...

Transcription Factor-Encoding Genes in Chickpea

Transcription factors (TFs) represent key proteins that bind to specific DNA sequences and regulate gene expression. TFs are represented by various multigene families and are highly conserved in eukaryotic organisms, especially plants. However, the number of genes encoding a particular TF family may vary in different plant species owing to evolutionary expansion and/or the need to perform species-specific function(s). Several studies have reported the evolutionary expansion/contraction and specific functions of TF families in various plant species, including legumes, at the genome level or the individual family level (Jain et al., 2008; Libault et al., 2009; Schmutz et al., 2010). We also analyzed the TF repertoire in the chickpea transcripts. We identified a total of 1,851 (5.33%) transcripts encoding TFs belonging to all 84 families searched for. The lower fraction of TF-encoding transcripts identified in this study as compared with our previous study (Garg et al., 2011) may be attributed to the better quality of the transcriptome presented here and the altogether different strategy used for the identification of TFs (stringent HMMER search in this study as compared with BLAST homology search in the previous study). Although the fraction of TF-encoding chickpea transcripts is similar to that of Arabidopsis and other plant species, it is much less than that of soybean (Libault et al., 2009; Schmutz et al., 2010). Interestingly, a much lower number of TF genes has been reported in legumes such as Medicago and Lotus as well (Libault et al., 2009) as compared with that of soybean (Schmutz et al., 2010). The difference in the TF content of these legumes may reflect their evolutionary relationship. Chickpea, Medicago, and Lotus, which contain a lower number of TFs, belong to the galegoid clade, and soybean, containing more TFs, belongs to the millettioid clade of the Papilionoideae subfamily of the legumes (Doyle and Luckow, 2003; Cannon et al., 2009). 
Although we anticipate that the lower number of TFs identified in this study does not represent the underestimation of the TF content in chickpea, the possibility of lower representation of tissue-/cell type-specific TFs expressed at a very low level in our transcriptome data set cannot be ruled out. Furthermore, it may be due to species-specific evolutionary processes. However, it will only be clear once the complete genome sequence of chickpea becomes available. The number of transcripts belonging to various families varied from one to 131 in chickpea. The largest TF family was the MYB/MYB-related domain containing 131 transcripts (7.1%). A comparative analysis of the number of chickpea transcripts identified in various major TF families in this study compared with those of soybean and Arabidopsis reported in a previous study (Schmutz et al., 2010) showed a larger number of genes included in all the TF families in soybean as compared with Arabidopsis and chickpea (Fig. 1B). However, several events of expansion and contraction of TF families were found in chickpea vis-à-vis Arabidopsis. For example, ABI3VP1, AP2/EREBP, bHLH, BTB/POZ, MADS box, and MYB/MYB-related family TFs were significantly greater in number in Arabidopsis as compared with chickpea, and SNF2, TCP, and CCAAT domain TFs were significantly greater in number in chickpea as compared with Arabidopsis. These differences in abundance of TFs in Arabidopsis and chickpea might play an important role in regulating species-specific biological processes. Although legumes do not contain any specific TF family or preferential expansion of a particular TF family, the legume-specific functions have been attributed to the evolution of promoter sequences leading to alteration in gene expression patterns (Libault et al., 2009). However, the possibility of evolution of a few TF-encoding genes performing species-specific function(s) due to neofunctionalization/subfunctionalization is not ruled out.

Legume- and Species-Specific Chickpea Transcripts

The identification of lineage-specific genes provides insights into species-specific functions and evolutionary processes such as speciation and adaptation (Domazet-Loso and Tautz, 2003). A few studies have been performed to identify the lineage-specific genes in plants (Campbell et al., 2007; Lin et al., 2010). The legume-specific genes have also been identified based on the unigene sets from soybean, Medicago, and Lotus (Graham et al., 2004). We performed a series of BLAST searches to identify a core set of chickpea transcripts with conserved sequences within the Fabaceae that lack significant similarity to sequences outside the Fabaceae and transcripts that lack similarity to sequences from any other plant species. A summary of the approach followed and the results obtained is presented in Figure 2A. In the first step, the chickpea transcripts showing any significant BLASTX hit with protein sequences from 12 annotated plant genomes were removed. In the second step, the remaining 7,703 chickpea transcripts were searched via TBLASTX against non-Fabaceae plant transcript assemblies from 236 species available at The Institute for Genomic Research (TIGR) and unigenes from 48 species available at NCBI. This resulted in the removal of 328 transcripts that showed significant hits with at least one of these sequences, thus eliminating a total of 27,385 transcripts that were considered to be conserved in plant species other than the Fabaceae. Subsequently, in the third step, a TBLASTX search was performed against transcript assemblies of 13 Fabaceae species available at TIGR and EST/unigenes of five Fabaceae species available at NCBI, and BLASTX searches against the proteomes and TBLASTX/BLASTN searches against the genome sequences of soybean, Medicago, and Lotus were performed. A total of 3,743 transcripts showed significant similarity with at least one of the above sequences and were considered as candidate legume-specific genes. 
Another 3,632 transcripts did not show significant hits with any of the above sequences and might represent chickpea-specific (CS) genes. This fraction (10.5%) is less than that of rice genes (17.4%), which did not show significant similarity to sequences from the plant kingdom (Campbell et al., 2007), but much higher than that predicted in Arabidopsis (4.9%; Lin et al., 2010). This set of genes might represent novel genes that perform species-specific functions. As the last step, after a comparative analysis of transcripts, we identified 741 transcripts that have significant similarity to protein/unigene/genome sequence of at least three of the five non-chickpea Fabaceae species simultaneously and were identified as a set of core legume-specific (CLS) genes. Although the emergence and evolution of CLS transcripts is not clear, it may be speculated that these genes might have appeared in the Fabaceae after their divergence from non-Fabaceae plants or they may have been lost in non-Fabaceae plants after their divergence from the Fabaceae plants. In addition, several hypotheses, including lateral gene transfer, gene duplication followed by sequence divergence, enhanced evolutionary rate, and de novo emergence of genes from noncoding DNA, have been proposed for the evolution of lineage-specific genes (Daubin et al., 2003; Cai et al., 2006; Levine et al., 2006).
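The transcript counts reported for the successive filtering steps can be reconciled with simple arithmetic. The short Python sketch below uses only numbers stated in the text (the total of 34,760 transcripts is the size of the hybrid assembly); the variable names are illustrative and are not part of the original analysis.

```python
# Counts as reported in the text; variable names are illustrative only.
total_transcripts = 34_760   # transcripts in the hybrid assembly
after_step1 = 7_703          # left after removing BLASTX hits to the 12 plant proteomes
removed_step2 = 328          # TBLASTX hits to non-Fabaceae transcript assemblies/unigenes
legume_specific = 3_743      # significant hits to Fabaceae sequences only
chickpea_specific = 3_632    # no significant hit to any sequence searched

# Transcripts removed in the first step
removed_step1 = total_transcripts - after_step1

# Transcripts considered conserved outside the Fabaceae (steps 1 and 2 combined)
conserved_outside_fabaceae = removed_step1 + removed_step2

# Transcripts carried into the third step
remaining_after_step2 = after_step1 - removed_step2

print(removed_step1, conserved_outside_fabaceae, remaining_after_step2)
```

The combined removals from the first two steps equal the 27,385 conserved transcripts reported, and the transcripts carried into the third step equal the sum of the legume-specific and CS sets.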

Figure 2.
Identification and GC content analysis of legume- and species-specific transcripts in chickpea. A, Strategy for the identification of legume- and species-specific transcripts in chickpea. The transcripts that showed significant hits with non-Fabaceae ...

The GC content distribution of all the chickpea transcripts is bimodal, with a broader range of 35% to 45% for most (79.4%) chickpea transcripts (Fig. 2B), as compared with that of Arabidopsis (40%–45%). The analysis of CLS and CS transcripts revealed that the GC distribution of CS transcripts is also bimodal but shifted toward 30% to 40% (lower) as compared with all transcripts (Fig. 2B). However, the GC content distribution of CLS transcripts was distinctly different from all chickpea transcripts and CS transcripts, showing unimodal distribution with a peak in the GC content range of 35% to 40%. Similar results have been reported for core Poaceae-specific genes in rice (Campbell et al., 2007). The GC content of genes has been related to gene conversion, recombination rate, DNA thermostability, transcriptional activity, and gene/genome contraction/expansion phenomena (Carels and Bernardi, 2000; Galtier et al., 2001; Vinogradov, 2003). Although CS transcripts did not show significant similarity to any sequence in the plant kingdom, a conserved PFAM/SMART domain could be identified in at least 89 transcripts, leading to their putative annotation (Supplemental Table S4). Some of these domains were overrepresented in the CS transcripts, for example, Atrophin-1 (PF03154), Extensin_2 (PF04554), Amelogenin (PF07174), and Serpentine-type 7TM GPCR chemoreceptor Srz (PF10325). The putative function has been assigned to 133 of the CLS transcripts (Supplemental Table S5). Among these, the largest number (12) of annotated CLS transcripts represent cyclin-like F-box domain-containing proteins. The F-box domain-containing proteins have been found to be enriched in lineage-specific genes previously as well (Graham et al., 2004; Campbell et al., 2007), indicating their major role in evolutionary processes leading to lineage specificity. 
The events of extensive duplication followed by sequence divergence in plant F-box genes have already been reported (Jain et al., 2007; Xu et al., 2009). In addition, Atrophin-1, Extensin_2, NB-LRR-type disease resistance proteins, and WRKY TFs were also represented in the CLS transcript set. At least two and 12 TF-encoding transcripts were identified in the CS and CLS sets, respectively. It has been suggested that the lineage-specific TFs along with their binding DNA sequences may lead to novel gene regulatory networks and might be responsible for specific phenotypes/traits (Nowick and Stubbs, 2010). The study of species-specific and lineage-specific genes will help in understanding the molecular mechanisms underlying characteristic biological processes of chickpea and other legumes.

Tissue-Specific Expression Analysis of the Chickpea Transcriptome

The use of EST abundance as an indicator of transcript abundance is a well-accepted method for differential gene expression analysis and has been implemented in numerous studies (Weber et al., 2007; Hale et al., 2009; Kristiansson et al., 2009). The results of differential gene expression analysis using NGS technologies have been found to be accurate and highly correlated with other methods such as real-time PCR analysis and microarray analysis (Cloonan et al., 2008; Wilhelm and Landry, 2009; Zenoni et al., 2010). The gene expression studies in chickpea have been limited to the single gene level so far. In this study, we utilized RNA-Seq data to quantify the gene expression of all chickpea transcripts. For an analysis of the abundance of chickpea transcripts within a tissue sample, the high-quality reads from individual tissue samples were mapped to the transcripts and the number of reads mapped was normalized by the RPKM method. The RPKM method corrects for biases in total gene exon size and normalizes for the total read sequences obtained in each tissue library. We classified the gene expression in five categories (very low, low, moderate, high, and very high), arbitrarily based on the RPKM value of each transcript in different tissue samples (Fig. 3A). The largest fraction of transcripts showed moderate expression (RPKM > 10–50) followed by low expression (RPKM > 3–10) in all tissue samples. A small fraction (1.5%–3.2%) of transcripts were expressed at very high levels (RPKM > 100) in different tissue samples (Fig. 3A). Furthermore, we identified transcripts expressed preferentially in each tissue sample analyzed via tissue-by-tissue comparison. For this, we normalized the number of reads uniquely mapped to each transcript per million total reads (RPM) uniquely mapped for each tissue sample. Figure 3B shows a table giving the number of genes that have significant preferential expression (3-fold or greater) between two tissues. 
The fraction of transcripts preferentially expressed in pairwise comparisons of the tissues analyzed in this study varied from 1.2% to 9.2%. The largest number of transcripts showed preferential expression in root as compared with mature leaf, followed by flower bud in comparison with mature leaf. The identification of preferentially expressed genes will help gain insight into gene functions and thereby the biological processes. Furthermore, we identified transcripts expressed specifically in a single tissue sample with at least 3 RPM in the tissue of interest and zero in others. The largest number of transcripts exhibited specific expression in flower bud (1,132) followed by young pod (695) with variable abundance (Fig. 3C). The heat map shown in Figure 3D clearly shows the tissue-specific expression of these chickpea transcripts. These transcripts may play specific roles in the biology of various tissue samples in chickpea. The results have been validated by quantitative real-time PCR analysis of representative transcripts (Supplemental Fig. S5A). All the primer pairs used in the study amplified a specific PCR product. A very good correlation was obtained between the results of RNA-Seq and real-time PCR analysis. Among the 18 representative transcripts analyzed by real-time PCR analysis, which are specifically expressed in different tissue samples, the expression pattern of only three transcripts predicted to be expressed in mature leaf did not correlate. These results further confirm the potential of NGS technologies to quantify gene expression. In addition, agarose gel electrophoresis analysis revealed the expected size of PCR products in real-time PCR for all the transcripts analyzed (Supplemental Fig. S5B; Supplemental Table S6), which further validates the accuracy of de novo assembly presented in this study.
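The two selection criteria described above (a 3-fold or greater difference between two tissues, and at least 3 RPM in one tissue with zero in all others) can be illustrated with a minimal Python sketch. The function names, the dictionary layout, and the handling of a zero denominator are assumptions made for illustration; the text does not specify how the statistical significance of the fold change was assessed, so that step is not reproduced here.

```python
from typing import Dict, Optional


def tissue_specific(rpm: Dict[str, float], min_rpm: float = 3.0) -> Optional[str]:
    """Return the tissue in which a transcript is specifically expressed, else None.

    Criterion from the text: at least `min_rpm` RPM in the tissue of
    interest and zero RPM in every other tissue.
    """
    expressed = [tissue for tissue, value in rpm.items() if value > 0]
    if len(expressed) == 1 and rpm[expressed[0]] >= min_rpm:
        return expressed[0]
    return None


def is_preferential(rpm_a: float, rpm_b: float, fold: float = 3.0) -> bool:
    """True if expression in tissue A is at least `fold` times that in tissue B.

    Assumption: a transcript detected in A but absent from B counts as
    preferential; the source does not say how zeros were treated.
    """
    if rpm_a <= 0:
        return False
    if rpm_b == 0:
        return True
    return rpm_a / rpm_b >= fold


# Hypothetical RPM values for one transcript across the five tissues
example = {"shoot": 0.0, "root": 5.2, "mature_leaf": 0.0,
           "flower_bud": 0.0, "young_pod": 0.0}
print(tissue_specific(example))
```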

Figure 3.
Differential and tissue-specific expression analysis of the chickpea transcriptome. A, Number of transcripts with different expression abundances in various tissue samples based on the RPKM method. The transcripts showing RPKM values of 1 to 3, greater ...

Furthermore, we analyzed various GOSlim categories represented in tissue-specific chickpea transcripts. Although all the GOSlim terms were represented in the tissue-specific transcripts, a few terms were significantly overrepresented in some tissue samples (Supplemental Fig. S6). For example, among the biological process GOSlim terms, protein metabolism was most overrepresented in all the tissue samples except root. However, the GOSlim term response to stress was significantly overrepresented in the root-specific chickpea transcripts. Furthermore, developmental process-related transcripts were also overrepresented in flower bud and young pod tissues. Likewise, various molecular function GOSlim terms were significantly overrepresented in various tissue-specific transcripts. For example, protein binding was most overrepresented in the shoot-, root-, and mature leaf-specific transcripts, followed by transferase activity in shoot- and mature leaf-specific transcripts and hydrolase activity in root-specific transcripts. However, transferase activity followed by nucleotide binding was most represented in flower bud-specific transcripts, and hydrolase activity followed by transferase activity was most represented in young pod-specific transcripts. The cellular component term chloroplast was most overrepresented in the green tissues, shoot, and mature leaf. The term nucleus was significantly overrepresented in the flower bud- and root-specific transcripts. A large fraction of root-, young pod-, and flower bud-specific transcripts showed overrepresentation of the cellular component term plasma membrane. We analyzed the statistically significant enrichment of specific Gene Ontology (GO) terms in the tissue-specific chickpea transcripts using the BiNGO tool (Fig. 4). 
In the root-specific set of transcripts, the biological process response to stimulus and its children terms (response to oxidative stress and defense response) and the molecular function oxidoreductase activity were significantly enriched, which are well related to the major role of roots in various stress responses (Fig. 4A). Notably, the biological process term nitrogen fixation and the molecular function terms electron carrier activity and heme binding/iron ion binding were also found to be significantly enriched in root-specific transcripts, which are related to the ability for atmospheric nitrogen fixation by legumes. In the flower bud-specific chickpea transcripts, the biological process developmental processes and its children terms (gametophyte development, pollen development, and microsporogenesis, etc.) were significantly enriched, which are related to flower development (Fig. 4B). Likewise, the molecular function term nutrient reservoir activity was significantly enriched in the young pod-specific transcripts, which is related to seed development (Fig. 4C). These observations further authenticate our gene expression analysis results and provide a clue toward specific functions of the chickpea transcripts in a particular tissue sample. Although we identified the tissue-specific transcripts based on sequence data from five different tissue samples using stringent criteria, their expression in other chickpea tissues not analyzed in this study cannot be ruled out. In addition, the possibility of the identification of novel transcripts not sequenced/assembled in this study also cannot be ruled out. The tissue-specific transcripts identified in this study may be analyzed further in functional genomics studies.

Figure 4.
Significantly enriched GO categories in tissue-specific chickpea transcripts. A, Root specific. B, Flower bud specific. C, Young pod specific. The chickpea transcripts showing specific expression in each tissue type were analyzed using BiNGO, and the ...

Chickpea Transcriptome Database

We developed a public data resource, the Chickpea Transcriptome Database (CTDB), which provides a searchable interface to the chickpea transcriptome data. CTDB is publicly available at http://www.nipgr.res.in/ctdb.html. The current release (release 1.0) of the database provides the transcriptome sequence of the chickpea (genotype ICC4958) reported in this study. The Web pages provide information about the project, data, research group, and other resources of chickpea transcriptome data. Figure 5 provides snapshots of the various tools/features of CTDB. The database can be queried based on transcript ID and keyword search for all the functional annotations. CTDB output can be displayed to the screen or saved as a tab-delimited file. The data available in the database can be mined/queried using a variety of search options provided. The sequence-based search is implemented using NCBI BLAST (version 2.2.24+), which provides the option of different BLAST algorithms to search nucleotide or protein sequence(s) against chickpea transcriptome sequence data reported in this study and previously (Garg et al., 2011) at user-defined parameters. In addition, we have provided the option to search unigenes generated by assembling (using TGICL) all the chickpea ESTs available at NCBI. Various annotation search facilities are provided, namely, putative function, PFAM domain, transcript ID, and GO search. The expression data for transcripts can be accessed in two ways: first, the expression level (RPM values) for individual transcripts can be retrieved using unique transcript IDs for all tissue types used in this study; second, highly expressed transcripts (on the basis of RPKM expression values) can be retrieved for a specific tissue type. In addition, a batch download utility (for multiple transcript IDs) for sequence and annotation data in a tab-delimited file has also been provided, which gives the user flexibility for downstream data analysis/interpretation. 
Links have also been provided to download the whole assembled sequence data as a single file. We anticipate that the database will be very useful for accelerating research in the area of chickpea functional genomics. Furthermore, we aim to update the database annually or as new sequence/annotation data set(s) become available for chickpea.

Figure 5.
Snapshots of the public access resource CTDB showing its various utilities.


The generation and availability of genomic resources for chickpea, a very important food legume plant, lag significantly. The applications of NGS technologies in the characterization of the transcriptome of nonmodel species are well documented. In this study, we present the high-quality transcriptome sequence of the crop legume plant chickpea as a genomic resource using sequence data generated from Roche 454 and Illumina NGS platforms and de novo assembly optimization. The functional annotation of the transcriptome provides a greater insight into the gene content, biological processes, and pathways conserved in chickpea. The identification of lineage- and species-specific genes in chickpea will be very helpful in establishing evolutionary relationships among legumes. The SSRs identified in chickpea should provide breeders a very good resource for the future development of improved chickpea cultivars. The identification of tissue-specific transcripts should help accelerate the functional analysis of genes of interest in chickpea. Our data and results provide essential information for future genetic studies in chickpea and present a de novo transcriptome assembly workflow that should be applicable to other plants as well.


Plant Material and RNA Isolation

Chickpea (Cicer arietinum genotype ICC4958) seeds were grown as described (Garg et al., 2010). Root and shoot tissue samples were collected from 15-d-old seedlings. The mature leaves, flower buds, and young pods were collected from plants grown in the field. At least three independent biological replicates of each tissue sample were harvested and immediately frozen in liquid nitrogen. Total RNA was extracted from all tissue samples using TRI Reagent (Sigma Life Science) according to the manufacturer’s instructions. The quality and quantity of each RNA sample were assessed using NanoVue (GE Healthcare) and the Agilent 2100 Bioanalyzer (Agilent Technologies) as described previously (Garg et al., 2010). Equal quantities of total RNA samples from the three biological replicates were pooled for mRNA purification followed by library preparation for each tissue sample.

Library Preparation and 454 Sequencing

The mRNA was purified from total RNA samples using the Dynabead mRNA purification kit according to the manufacturer’s instructions (Invitrogen, Dynal). We found significant rRNA contamination in the mRNA samples purified from the five individual tissue samples. Therefore, we did two additional stringent washes before elution of the mRNA from the mixed tissue sample to avoid rRNA contamination. Six cDNA libraries were generated using the cDNA rapid library preparation kit for the GS FLX Titanium series essentially following the manufacturer’s instructions (Roche Diagnostics). The double-stranded cDNA was synthesized using the SuperScript double-stranded cDNA synthesis kit (Invitrogen). Approximately 600 ng of double-stranded cDNA was nebulized and selected for the 300- to 800-bp fragment length range. The specific adapters were ligated to the fragmented cDNA and denatured to generate single-stranded cDNA followed by emulsion PCR amplification for sequencing. Five cDNA libraries were generated, one each from the mRNA isolated from shoot, root, mature leaf, flower bud, and young pod tissue samples, and a sixth cDNA library was generated from the mRNA purified from total RNA samples pooled in equal amounts from all the above tissue samples. The quality of libraries was assessed using the High Sensitivity DNA kit on the Agilent 2100 Bioanalyzer. All six cDNA libraries were sequenced in two runs (one complete and two half runs) using GS FLX Titanium series sequencing reagents and sequencer. Each of the cDNA libraries from shoot and mixed samples was sequenced in one complete flow cell, whereas two each of the other four cDNA libraries were tagged with unique RL multiplex identifier adaptors and mixed before emulsion PCR and sequencing in one complete flow cell.

Sequence Quality Controls and Preprocessing

First, Q20 data were extracted for all the flow cells using the gsRunProcessor command with a value of 0.01 for errorQscoreWindowTrim in the filterTemplate.xml file. The sequence data for different tissue samples sequenced in a single flow cell were separated using the sfffile command. The reads not belonging to the multiplex identifiers used were discarded. The sequence data generated in this study have been deposited in Standard Flowgram Format (SFF) at NCBI in the Short Read Archive database under the accession number SRA030696 (experiment accession nos. SRX048831–SRX048836). Various quality controls, including filtering of high-quality reads based on the score values given in .qual files, trimming of reads containing primer/adaptor sequences, trimming of reads containing homopolymers of more than seven bases, and removal of reads with length of less than 100 bp, were done using the in-house NGS QC tool kit (R.K. Patel and M. Jain, unpublished data). This was followed by filtering of rRNA sequences with an optimized highly stringent criterion. The sequences showing an E value of 1e-10 or less and a 100-bp alignment length with at least 80% identity in a BLASTN search against a set of 147 rRNA sequences from various plant species downloaded from NCBI were considered as rRNA sequences and discarded.
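Two of the preprocessing rules described above (homopolymers of more than seven bases and a minimum read length of 100 bp) can be sketched in Python as follows. This is not the in-house NGS QC tool kit, whose implementation is not described here; in particular, truncating a read at the start of the first long homopolymer is an assumption about what "trimming" means, and the function names are hypothetical.

```python
import re
from typing import Optional

# A run of 8 or more identical characters, i.e. a homopolymer of more than seven bases
_LONG_HOMOPOLYMER = re.compile(r"(.)\1{7,}")


def trim_at_homopolymer(seq: str) -> str:
    """Truncate the read at the start of the first long homopolymer (assumed behavior)."""
    match = _LONG_HOMOPOLYMER.search(seq)
    return seq if match is None else seq[:match.start()]


def qc_read(seq: str, min_len: int = 100) -> Optional[str]:
    """Return the trimmed read, or None if it is shorter than `min_len` after trimming."""
    trimmed = trim_at_homopolymer(seq.upper())
    return trimmed if len(trimmed) >= min_len else None


# Toy reads: a clean 120-bp read, and a short read ending in a 10-base homopolymer
clean_read = "ACGT" * 30
short_read = "ACGT" * 10 + "A" * 10
print(qc_read(clean_read) is not None, qc_read(short_read))
```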

De Novo Assembly

We used various de novo assembly programs to obtain the best assembly results with the chickpea 454 sequence data set and generated a nonredundant set of transcripts. Among the various programs available, we used publicly available programs Velvet (version 1.0.14; http://www.ebi.ac.uk/~zerbino/velvet/), ABySS (version 1.1.2; http://www.bcgsc.ca/platform/bioinfo/software/abyss), and MIRA (version 3.2.0; http://sourceforge.net/projects/mira-assembler/). In addition, we used gsdenovo assembler (Newbler v2.3 and v2.5p1; http://www.454.com/products-solutions/analysis-tools/gs-de-novo-assembler.asp) supplied with the GS FLX Titanium sequencer and the commercially available CLC Genomics Workbench (version 3.7.1; http://www.clcbio.com/index.php?id=1240). These programs have been developed for the de novo assembly of long and/or short reads. We also used CAP3 (http://seq.cs.iastate.edu/cap3.html) and TGICL (version 2.0; http://sourceforge.net/projects/tgicl/) programs for the assembly of 454 sequence data.

All the assemblies were performed either on the server with 48 cores and 128 GB of random access memory or on the server with eight cores and 48 GB of random access memory. The high-quality reads and their quality values in fasta format were used as input in all the assemblers for 454 data. Velvet and ABySS assemblies were performed at various k-mer lengths. In MIRA, we used the -job=denovo,est,accurate,454 quick switch using 12 cores (-GE:not=12) with four passes (-AS:nop=4) for the assembly of 454 data. The two latest versions of Roche 454’s Newbler, v2.3 and v2.5p1, were used with the cdna option using multiple CPUs. The assembly on the CLC Genomics Workbench was performed using default settings. CAP3 assembly of the 454 data was performed using default parameters. The TGICL assembly was performed on 16 CPUs (-c 16) with minimum overlap length of 40 (-l 40), minimum percentage identity of 90 for overlaps (-p 90), and maximum length of unmatched overhangs of 20 (-v 20).

GC Content Analysis and SSR Identification

The GC content analysis was done using an in-house Perl script. We used MISA (MIcroSAtellite; http://pgrc.ipk-gatersleben.de/misa/) for the identification of SSRs. Dinucleotide repeats of more than six times and trinucleotide, tetranucleotide, pentanucleotide, and hexanucleotide repeats of more than five times were considered as search criteria for SSRs in the MISA script.
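The GC content calculation and the SSR search criteria can be illustrated with a short regular-expression sketch in Python. This is not MISA or the in-house Perl script; the thresholds below read "more than six times" and "more than five times" literally (at least seven and at least six repeat units, respectively), which is an assumption, and all names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Minimum number of repeat units per motif length (literal reading of the text; an assumption)
MIN_REPEATS: Dict[int, int] = {2: 7, 3: 6, 4: 6, 5: 6, 6: 6}


def gc_content(seq: str) -> float:
    """Percentage of G and C bases in a sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def _is_primitive(unit: str) -> bool:
    """False if the motif is itself a repeat of a shorter motif (e.g. ATAT or AAA)."""
    n = len(unit)
    return not any(n % k == 0 and unit == unit[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq: str, min_repeats: Dict[int, int] = MIN_REPEATS) -> List[Tuple[int, str, int]]:
    """Return (start, motif, repeat_count) for each perfect SSR found."""
    seq = seq.upper()
    hits = []
    for unit_len, n in sorted(min_repeats.items()):
        # One captured motif followed by at least n - 1 further copies
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, n - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if _is_primitive(unit):
                hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return hits


print(find_ssrs("GGC" + "AT" * 8 + "CCG"))
print(gc_content("ATGC"))
```

Skipping motifs that are repeats of a shorter motif avoids reporting the same locus twice (for example, as both a dinucleotide and a tetranucleotide SSR).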

Sequence Conservation and Functional Annotation

The available proteome data sets for all the completely sequenced plant genomes were downloaded from their respective genome project Web sites. The criterion of an expect (E) value of 1e-5 or less was used for the identification of significant hits in a BLASTX search. To deduce the putative function of chickpea transcripts, they were subjected to BLASTX search against annotated protein sequences of Arabidopsis (Arabidopsis thaliana; available at The Arabidopsis Information Resource). The results of only the best hit were extracted, and the hits with E ≤ 1e-5 were considered to be significant. Each of the chickpea transcripts showing significant hits was assigned the same putative function as that of the corresponding Arabidopsis protein. Furthermore, the chickpea transcripts that did not show any significant hits were searched against UniRef90, UniRef100, NR, PFAM, SMART, KEGG, and COG databases to assign putative functions to them using the AutoFACT pipeline (http://megasun.bch.umontreal.ca/Software/AutoFACT.htm; Koski et al., 2005). The GOSlim terms for molecular function, biological process, and cellular component categories associated with the best BLASTX hit of the Arabidopsis protein were assigned to the corresponding chickpea transcript. GO enrichment analysis was done using the BiNGO plugin of Cytoscape (Maere et al., 2005).
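The best-hit annotation transfer described above can be sketched as follows. The sketch assumes that BLASTX results are available in the standard 12-column tabular format (query ID, subject ID, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E value, bit score); the output format actually used in this study is not stated, and the function name and toy records are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def best_significant_hits(lines: Iterable[str],
                          max_evalue: float = 1e-5) -> Dict[str, Tuple[str, float, float]]:
    """Keep, for each query, the hit with the lowest E value (ties: highest bit score).

    Only hits with E <= `max_evalue` are considered significant.
    Returns {query_id: (subject_id, evalue, bitscore)}.
    """
    best: Dict[str, Tuple[str, float, float]] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        current = best.get(query)
        if current is None or (evalue, -bitscore) < (current[1], -current[2]):
            best[query] = (subject, evalue, bitscore)
    return best


# Toy tabular records (hypothetical transcript and protein identifiers)
records = [
    "tr1\tAT1G01010.1\t85.0\t200\t30\t0\t1\t600\t1\t200\t1e-50\t180",
    "tr1\tAT2G02020.1\t60.0\t150\t60\t0\t1\t450\t1\t150\t1e-20\t90",
    "tr2\tAT3G03030.1\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.01\t30",
]
hits = best_significant_hits(records)
print(hits)
```

Each transcript retained in the result would then be assigned the putative function of its best-hit Arabidopsis protein, while transcripts without a significant hit (tr2 in this toy example) would be passed on to the AutoFACT step.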

Identification of TF-Encoding Genes

The chickpea transcripts encoding TFs were identified using the hidden Markov model profiles already available in the PFAM database or those generated from the domain alignments available at the Plant Transcription Factor Database (http://plntfdb.bio.uni-potsdam.de/v3.0/; Pérez-Rodríguez et al., 2010) for 84 families using HMMER search. We followed criteria similar to those of the Plant Transcription Factor Database to identify the TFs belonging to various families.

Identification of Legume-Specific and Species-Specific Genes

The plant transcript assemblies and EST/unigene data for non-Fabaceae/Fabaceae plant species were downloaded from the TIGR Plant Transcript Assemblies database (http://plantta.jcvi.org/; Childs et al., 2007) and NCBI (ftp://ftp.ncbi.nih.gov/repository/UniGene/), respectively. The data set of transcript assemblies was composed of 3,618,017 sequences from 236 non-Fabaceae species and 275,760 sequences from 13 Fabaceae species. The NCBI EST/unigene data set was composed of 11,974,373 sequences from 48 non-Fabaceae species and 2,004,299 sequences from five Fabaceae species. The genome sequences for all the completely sequenced plants were downloaded from their respective genome project Web sites. The final set of chickpea transcripts was subjected to various BLAST searches as per the strategy described in “Results and Discussion.” The criteria of E ≤ 1e-5 for BLASTX and TBLASTX searches and E ≤ 1e-10 for BLASTN search were used for filtering significant hits. In-house Perl scripts were used for filtering the BLAST results of significant and nonsignificant hits and their sequences.
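Once the significant hits from each BLAST search have been collected, the assignment of transcripts to the conserved, legume-specific, and chickpea-specific sets reduces to set operations. The Python sketch below illustrates that logic; it is not the in-house Perl scripts, and the function names, the input layout (sets of transcript IDs with significant hits), and the toy identifiers are assumptions.

```python
from typing import Dict, Set


def classify_transcripts(all_ids: Set[str],
                         non_fabaceae_hits: Set[str],
                         fabaceae_hits: Set[str]) -> Dict[str, Set[str]]:
    """Split transcripts by where they have significant BLAST hits.

    A transcript with any non-Fabaceae hit is treated as conserved;
    of the rest, those with a Fabaceae hit are legume specific and
    those with no hit at all are chickpea specific.
    """
    conserved = all_ids & non_fabaceae_hits
    remaining = all_ids - conserved
    legume_specific = remaining & fabaceae_hits
    chickpea_specific = remaining - legume_specific
    return {"conserved": conserved,
            "legume_specific": legume_specific,
            "chickpea_specific": chickpea_specific}


def core_legume_specific(legume_specific: Set[str],
                         hits_by_species: Dict[str, Set[str]],
                         min_species: int = 3) -> Set[str]:
    """Legume-specific transcripts with hits in at least `min_species` Fabaceae species."""
    return {t for t in legume_specific
            if sum(t in hits for hits in hits_by_species.values()) >= min_species}


# Toy example with hypothetical transcript and species identifiers
ids = {"t1", "t2", "t3", "t4"}
groups = classify_transcripts(ids, non_fabaceae_hits={"t1"}, fabaceae_hits={"t1", "t2", "t3"})
core = core_legume_specific(groups["legume_specific"],
                            {"sp1": {"t2", "t3"}, "sp2": {"t2"}, "sp3": {"t2"},
                             "sp4": set(), "sp5": set()})
print(groups, core)
```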

Mapping of Sequence Reads onto Chickpea Transcripts

To quantify the expression of each transcript in individual tissue samples and identify differentially expressed genes, all the reads from six samples were mapped onto the nonredundant set of transcripts using CLC Genomics Workbench software. For mapping the 454 reads, the criterion of a minimum of 90% coverage of the total length was used, and for short reads, a maximum of two mismatches was allowed for mapping. The total number of reads, number of unique reads, and RPM corresponding to each transcript were determined. In addition, the coverage of each transcript was determined in terms of RPKM. The heat map showing tissue-specific expression was generated based on the RPM for each transcript in all the tissue samples using TIGR MultiExperiment Viewer.
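For reference, the two normalized measures can be written as RPM = (reads mapped to the transcript × 10^6) / (total mapped reads in the library) and RPKM = (reads mapped to the transcript × 10^9) / (transcript length in bp × total mapped reads in the library). The Python sketch below implements these standard definitions; the values were computed with CLC Genomics Workbench in this study, so the functions and the example numbers are purely illustrative.

```python
def rpm(mapped_reads: int, total_mapped_reads: int) -> float:
    """Reads per million mapped reads for one transcript in one library."""
    return mapped_reads * 1e6 / total_mapped_reads


def rpkm(mapped_reads: int, transcript_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    return mapped_reads * 1e9 / (transcript_length_bp * total_mapped_reads)


# Hypothetical transcript: 200 reads, 2,000 bp long, in a library of 4 million mapped reads
example_rpm = rpm(200, 4_000_000)
example_rpkm = rpkm(200, 2_000, 4_000_000)
print(example_rpm, example_rpkm)
```

Because RPKM additionally divides by transcript length, it is suited to comparing abundance among transcripts within a tissue, whereas RPM suffices for comparing the same transcript across tissue libraries.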

Real-Time PCR Validation

The gene-specific primers for real-time PCR analysis were designed using Primer Express (version 3.0) software (Applied Biosystems), and the specificity of primer pairs was confirmed by BLASTN with all the nucleotide sequences of the nonredundant set of chickpea transcripts generated. The primer sequences of all the genes used in this study are listed in Supplemental Table S6. The real-time PCR analysis using gene-specific primers was performed as described previously (Garg et al., 2010). In brief, cDNA was synthesized from 6 μg of total RNA in a final reaction volume of 100 μL for each sample. The real-time PCRs were performed using diluted cDNA, 200 nM of gene-specific primers, and SYBR Green PCR mix on 96-well optical PCR plates using the 7500 Sequence Detection System (Applied Biosystems). All the reactions were performed under default parameters, and the specificity of reactions was verified by dissociation curve analysis. Two independent biological replicates for each sample and three technical replicates of each biological replicate were analyzed for real-time PCR analysis. For a biological replicate of a tissue sample, the same cDNA pool was used for real-time PCR analysis of all the genes analyzed. The transcript level of each gene in different tissue samples was normalized with the transcript level of the most suitable internal control gene, EF1α (Garg et al., 2010).

Construction of CTDB

CTDB is a public resource for chickpea transcriptome data. Web pages have been prepared using Perl-CGI on the Apache Tomcat (version 5.5.29) Web server application. The data regarding expression and annotation for each transcript are stored in the MySQL server (version 5.0.77). The database is currently hosted on a Sun Workstation running the CentOS (version 5.4) Linux operating system with two Intel Xeon quad core processors and 12 GB of random access memory. The sequence data are stored in flat files.

Sequence data from this article can be found in the Short Read Archive database at NCBI under accession number SRA030696 (experiment accession nos. SRX048831–SRX048836).

Supplemental Data

The following materials are available in the online version of this article.

  • Supplemental Figure S1. Length (A) and average quality score (B) distribution of the total number of high-quality reads generated.
  • Supplemental Figure S2. Length distribution of chickpea transcripts generated from optimized hybrid assembly.
  • Supplemental Figure S3. Sequence conservation of chickpea transcripts with proteomes of completely sequenced plants.
  • Supplemental Figure S4. Top 20 PFAM domains represented in the chickpea transcripts.
  • Supplemental Figure S5. Relative transcript levels of representative tissue-specific chickpea transcripts validated by real-time PCR analysis (A), and agarose gel electrophoresis showing amplification of specific PCR products of desired size (B).
  • Supplemental Figure S6. GOSlim term assignment in different categories of biological process (A), molecular function (B), and cellular component (C) to the chickpea transcripts showing tissue-specific expression in the five tissue samples analyzed in this study.
  • Supplemental Table S1. Statistics of de novo assemblies of 454 sequence data using Velvet and ABySS at various k-mer lengths.
  • Supplemental Table S2. Frequency of SSRs identified in chickpea transcripts.
  • Supplemental Table S3. KEGG pathways represented in chickpea transcripts.
  • Supplemental Table S4. Transcript ID and annotations of CS transcripts.
  • Supplemental Table S5. Transcript ID and annotations of CLS transcripts.
  • Supplemental Table S6. Primer sequences used for real-time PCR analysis of tissue-specific expression of chickpea transcripts in the study.


  • Ashraf N, Ghai D, Barman P, Basu S, Gangisetty N, Mandal MK, Chakraborty N, Datta A, Chakraborty S. (2009) Comparative analyses of genotype dependent expressed sequence tags and stress-responsive transcriptome of chickpea wilt illustrate predicted and unexpected genes and novel regulators of plant immunity. BMC Genomics 10: 415. [PMC free article] [PubMed]
  • Buhariwalla HK, Jayashree B, Eshwar K, Crouch JH. (2005) Development of ESTs from chickpea roots and their use in diversity analysis of the Cicer genus. BMC Plant Biol 5: 16. [PMC free article] [PubMed]
  • Cai JJ, Woo PC, Lau SK, Smith DK, Yuen KY. (2006) Accelerated evolutionary rate may be responsible for the emergence of lineage-specific genes in ascomycota. J Mol Evol 63: 1–11 [PubMed]
  • Campbell MA, Zhu W, Jiang N, Lin H, Ouyang S, Childs KL, Haas BJ, Hamilton JP, Buell CR. (2007) Identification and characterization of lineage-specific genes within the Poaceae. Plant Physiol 145: 1311–1322 [PMC free article] [PubMed]
  • Cannon SB, May GD, Jackson SA. (2009) Three sequenced legume genomes and many crop species: rich opportunities for translational genomics. Plant Physiol 151: 970–977 [PMC free article] [PubMed]
  • Carels N, Bernardi G. (2000) Two classes of genes in plants. Genetics 154: 1819–1825 [PMC free article] [PubMed]
  • Childs KL, Hamilton JP, Zhu W, Ly E, Cheung F, Wu H, Rabinowicz PD, Town CD, Buell CR, Chan AP. (2007) The TIGR Plant Transcript Assemblies database. Nucleic Acids Res (Database issue) 35: D846–D851 [PMC free article] [PubMed]
  • Choudhary S, Sethy NK, Shokeen B, Bhatia S. (2009) Development of chickpea EST-SSR markers and analysis of allelic variation across related species. Theor Appl Genet 118: 591–608 [PubMed]
  • Cloonan N, Forrest AR, Kolle G, Gardiner BB, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G, et al. (2008) Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods 5: 613–619 [PubMed]
  • Cloutier S, Niu Z, Datla R, Duguid S. (2009) Development and analysis of EST-SSRs for flax (Linum usitatissimum L.). Theor Appl Genet 119: 53–63 [PubMed]
  • Daubin V, Lerat E, Perrière G. (2003) The source of laterally transferred genes in bacterial genomes. Genome Biol 4: R57. [PMC free article] [PubMed]
  • Domazet-Loso T, Tautz D. (2003) An evolutionary analysis of orphan genes in Drosophila. Genome Res 13: 2213–2219 [PMC free article] [PubMed]
  • Doyle JJ, Luckow MA. (2003) The rest of the iceberg: legume diversity and evolution in a phylogenetic context. Plant Physiol 131: 900–910 [PMC free article] [PubMed]
  • Dutta S, Kumawat G, Singh BP, Gupta DK, Singh S, Dogra V, Gaikwad K, Sharma TR, Raje RS, Bandhopadhya TK, et al. (2011) Development of genic-SSR markers by deep transcriptome sequencing in pigeonpea [Cajanus cajan (L.) Millspaugh]. BMC Plant Biol 11: 17. [PMC free article] [PubMed]
  • Galtier N, Piganeau G, Mouchiroud D, Duret L. (2001) GC-content evolution in mammalian genomes: the biased gene conversion hypothesis. Genetics 159: 907–911 [PMC free article] [PubMed]
  • Gao WR, Wang XS, Liu QY, Peng H, Chen C, Li JG, Zhang JS, Hu SN, Ma H. (2008) Comparative analysis of ESTs in response to drought stress in chickpea (C. arietinum L.). Biochem Biophys Res Commun 376: 578–583 [PubMed]
  • Garg R, Patel RK, Tyagi AK, Jain M. (2011) De novo assembly of chickpea transcriptome using short reads for gene discovery and marker identification. DNA Res 18: 53–63 [PMC free article] [PubMed]
  • Garg R, Sahoo A, Tyagi AK, Jain M. (2010) Validation of internal control genes for quantitative gene expression studies in chickpea (Cicer arietinum L.). Biochem Biophys Res Commun 396: 283–288 [PubMed]
  • Gaur R, Sethy NK, Choudhary S, Shokeen B, Gupta V, Bhatia S. (2011) Advancing the STMS genomic resources for defining new locations on the intraspecific genetic linkage map of chickpea (Cicer arietinum L.). BMC Genomics 12: 117. [PMC free article] [PubMed]
  • Graham MA, Silverstein KA, Cannon SB, VandenBosch KA. (2004) Computational identification and characterization of novel genes from legumes. Plant Physiol 135: 1179–1197 [PMC free article] [PubMed]
  • Graham PH, Vance CP. (2003) Legumes: importance and constraints to greater use. Plant Physiol 131: 872–877 [PMC free article] [PubMed]
  • Hale MC, McCormick CR, Jackson JR, Dewoody JA. (2009) Next-generation pyrosequencing of gonad transcriptomes in the polyploid lake sturgeon (Acipenser fulvescens): the relative merits of normalization and rarefaction in gene discovery. BMC Genomics 10: 203. [PMC free article] [PubMed]
  • Hisano H, Sato S, Isobe S, Sasamoto S, Wada T, Matsuno A, Fujishiro T, Yamada M, Nakayama S, Nakamura Y, et al. (2007) Characterization of the soybean genome using EST-derived microsatellite markers. DNA Res 14: 271–281 [PMC free article] [PubMed]
  • Hüttel B, Winter P, Weising K, Choumane W, Weigand F, Kahl G. (1999) Sequence-tagged microsatellite site markers for chickpea (Cicer arietinum L.). Genome 42: 210–217 [PubMed]
  • Jain D, Chattopadhyay D. (2010) Analysis of gene expression in response to water deficit of chickpea (Cicer arietinum L.) varieties differing in drought tolerance. BMC Plant Biol 10: 24. [PMC free article] [PubMed]
  • Jain M, Nijhawan A, Arora R, Agarwal P, Ray S, Sharma P, Kapoor S, Tyagi AK, Khurana JP. (2007) F-box proteins in rice: genome-wide analysis, classification, temporal and spatial gene expression during panicle and seed development, and regulation by light and abiotic stress. Plant Physiol 143: 1467–1483 [PMC free article] [PubMed]
  • Jain M, Tyagi AK, Khurana JP. (2008) Genome-wide identification, classification, evolutionary expansion and expression analyses of homeobox genes in rice. FEBS J 275: 2845–2861 [PubMed]
  • Kaur H, Shukla RK, Yadav G, Chattopadhyay D, Majee M. (2008) Two divergent genes encoding L-myo-inositol 1-phosphate synthase1 (CaMIPS1) and 2 (CaMIPS2) are differentially expressed in chickpea. Plant Cell Environ 31: 1701–1716 [PubMed]
  • Koski LB, Gray MW, Lang BF, Burger G. (2005) AutoFACT: an automatic functional annotation and classification tool. BMC Bioinformatics 6: 151. [PMC free article] [PubMed]
  • Kristiansson E, Asker N, Förlin L, Larsson DG. (2009) Characterization of the Zoarces viviparus liver transcriptome using massively parallel pyrosequencing. BMC Genomics 10: 345. [PMC free article] [PubMed]
  • Kumar S, Blaxter ML. (2010) Comparing de novo assemblers for 454 transcriptome data. BMC Genomics 11: 571. [PMC free article] [PubMed]
  • La Rota M, Kantety RV, Yu JK, Sorrells ME. (2005) Nonrandom distribution and frequencies of genomic and EST-derived microsatellite markers in rice, wheat, and barley. BMC Genomics 6: 23. [PMC free article] [PubMed]
  • Levine MT, Jones CD, Kern AD, Lindfors HA, Begun DJ. (2006) Novel genes derived from noncoding DNA in Drosophila melanogaster are frequently X-linked and exhibit testis-biased expression. Proc Natl Acad Sci USA 103: 9935–9939 [PMC free article] [PubMed]
  • Libault M, Farmer A, Joshi T, Takahashi K, Langley RJ, Franklin LD, He J, Xu D, May G, Stacey G. (2010) An integrated transcriptome atlas of the crop model Glycine max, and its use in comparative analyses in plants. Plant J 63: 86–99 [PubMed]
  • Libault M, Joshi T, Benedito VA, Xu D, Udvardi MK, Stacey G. (2009) Legume transcription factor genes: what makes legumes so special? Plant Physiol 151: 991–1001 [PMC free article] [PubMed]
  • Lichtenzveig J, Scheuring C, Dodge J, Abbo S, Zhang HB. (2005) Construction of BAC and BIBAC libraries and their applications for generation of SSR markers for genome analysis of chickpea, Cicer arietinum L. Theor Appl Genet 110: 492–510 [PubMed]
  • Lin H, Moghe G, Ouyang S, Iezzoni A, Shiu SH, Gu X, Buell CR. (2010) Comparative analyses reveal distinct sets of lineage-specific genes within Arabidopsis thaliana. BMC Evol Biol 10: 41. [PMC free article] [PubMed]
  • Maere S, Heymans K, Kuiper M. (2005) BiNGO: a Cytoscape plugin to assess overrepresentation of Gene Ontology categories in biological networks. Bioinformatics 21: 3448–3449 [PubMed]
  • Mantri NL, Ford R, Coram TE, Pang EC. (2007) Transcriptional profiling of chickpea genes differentially regulated in response to high-salinity, cold and drought. BMC Genomics 8: 303. [PMC free article] [PubMed]
  • Metzgar D, Bytof J, Wills C. (2000) Selection against frameshift mutations limits microsatellite expansion in coding DNA. Genome Res 10: 72–80 [PMC free article] [PubMed]
  • Millan T, Clarke HJ, Siddique KHM, Buhariwalla HK, Gaur PM, Kumar J, Gil J, Kahl G, Winter P. (2006) Chickpea molecular breeding: new tools and concepts. Euphytica 147: 81–103
  • Millan T, Winter P, Jüngling R, Gil J, Rubio J, Cho S, Cobos MJ, Iruela M, Rajesh PN, Tekeoglu M, et al. (2010) A consensus genetic map of chickpea (Cicer arietinum L.) based on 10 mapping populations. Euphytica 175: 175–189
  • Molina C, Rotter B, Horres R, Udupa SM, Besser B, Bellarmino L, Baum M, Matsumura H, Terauchi R, Kahl G, et al. (2008) SuperSAGE: the drought stress-responsive transcriptome of chickpea roots. BMC Genomics 9: 553. [PMC free article] [PubMed]
  • Molina C, Zaman-Allah M, Khan F, Fatnassi N, Horres R, Rotter B, Steinhauer D, Amenc L, Drevon JJ, Winter P, et al. (2011) The salt-responsive transcriptome of chickpea roots and nodules via deepSuperSAGE. BMC Plant Biol 11: 31. [PMC free article] [PubMed]
  • Morgante M, Hanafey M, Powell W. (2002) Microsatellites are preferentially associated with nonrepetitive DNA in plant genomes. Nat Genet 30: 194–200 [PubMed]
  • Morozova O, Hirst M, Marra MA. (2009) Applications of new sequencing technologies for transcriptome analysis. Annu Rev Genomics Hum Genet 10: 135–151 [PubMed]
  • Nowick K, Stubbs L. (2010) Lineage-specific transcription factors and the evolution of gene regulatory networks. Brief Funct Genomics 9: 65–78 [PMC free article] [PubMed]
  • Pandey A, Chakraborty S, Datta A, Chakraborty N. (2008) Proteomics approach to identify dehydration responsive nuclear proteins from chickpea (Cicer arietinum L.). Mol Cell Proteomics 7: 88–107 [PubMed]
  • Pandey A, Choudhary MK, Bhushan D, Chattopadhyay A, Chakraborty S, Datta A, Chakraborty N. (2006) The nuclear proteome of chickpea (Cicer arietinum L.) reveals predicted and unexpected proteins. J Proteome Res 5: 3301–3311 [PubMed]
  • Papanicolaou A, Stierli R, Ffrench-Constant RH, Heckel DG. (2009) Next generation transcriptomes for next generation genomes using est2assembly. BMC Bioinformatics 10: 447. [PMC free article] [PubMed]
  • Peng H, Yu X, Cheng H, Shi Q, Zhang H, Li J, Ma H. (2010) Cloning and characterization of a novel NAC family gene CarNAC1 from chickpea (Cicer arietinum L.). Mol Biotechnol 44: 30–40 [PubMed]
  • Pérez-Rodríguez P, Riaño-Pachón DM, Corrêa LG, Rensing SA, Kersten B, Mueller-Roeber B. (2010) PlnTFDB: updated content and new features of the plant transcription factor database. Nucleic Acids Res (Database issue) 38: D822–D827 [PMC free article] [PubMed]
  • Radhika P, Gowda SJ, Kadoo NY, Mhase LB, Jamadagni BM, Sainani MN, Chandra S, Gupta VS. (2007) Development of an integrated intraspecific map of chickpea (Cicer arietinum L.) using two recombinant inbred line populations. Theor Appl Genet 115: 209–216 [PubMed]
  • Rajesh PN, Coyne C, Meksem K, Sharma KD, Gupta V, Muehlbauer FJ. (2004) Construction of a HindIII bacterial artificial chromosome library and its use in identification of clones associated with disease resistance in chickpea. Theor Appl Genet 108: 663–669 [PubMed]
  • Rajesh PN, O’Bleness M, Roe BA, Muehlbauer FJ. (2008) Analysis of genome organization, composition and microsynteny using 500 kb BAC sequences in chickpea. Theor Appl Genet 117: 449–458 [PubMed]
  • Schatz MC, Delcher AL, Salzberg SL. (2010) Assembly of large genomes using second-generation sequencing. Genome Res 20: 1165–1173 [PMC free article] [PubMed]
  • Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W, Hyten DL, Song Q, Thelen JJ, Cheng J, et al. (2010) Genome sequence of the palaeopolyploid soybean. Nature 463: 178–183 [PubMed]
  • Severin AJ, Woody JL, Bolon YT, Joseph B, Diers BW, Farmer AD, Muehlbauer GJ, Nelson RT, Grant D, Specht JE, et al. (2010) RNA-Seq atlas of Glycine max: a guide to the soybean transcriptome. BMC Plant Biol 10: 160. [PMC free article] [PubMed]
  • Shukla RK, Tripathi V, Jain D, Yadav RK, Chattopadhyay D. (2009) CAP2 enhances germination of transgenic tobacco seeds at high temperature and promotes heat stress tolerance in yeast. FEBS J 276: 5252–5262 [PubMed]
  • Tripathi V, Parasuraman B, Laxmi A, Chattopadhyay D. (2009) CIPK6, a CBL-interacting protein kinase is required for development and salt tolerance in plants. Plant J 58: 778–790 [PubMed]
  • Upadhyaya HD, Dwivedi SL, Baum M, Varshney RK, Udupa SM, Gowda CL, Hoisington D, Singh S. (2008) Genetic structure, diversity, and allelic richness in composite collection and reference set in chickpea (Cicer arietinum L.). BMC Plant Biol 8: 106. [PMC free article] [PubMed]
  • Varshney RK, Graner A, Sorrells ME. (2005) Genic microsatellite markers in plants: features and applications. Trends Biotechnol 23: 48–55 [PubMed]
  • Varshney RK, Hiremath PJ, Lekha P, Kashiwagi J, Balaji J, Deokar AA, Vadez V, Xiao Y, Srinivasan R, Gaur PM, et al. (2009) A comprehensive resource of drought- and salinity-responsive ESTs for gene discovery and marker development in chickpea (Cicer arietinum L.). BMC Genomics 10: 523. [PMC free article] [PubMed]
  • Vinogradov AE. (2003) DNA helix: the importance of being GC-rich. Nucleic Acids Res 31: 1838–1844 [PMC free article] [PubMed]
  • Weber AP, Weber KL, Carr K, Wilkerson C, Ohlrogge JB. (2007) Sampling the Arabidopsis transcriptome with massively parallel pyrosequencing. Plant Physiol 144: 32–42 [PMC free article] [PubMed]
  • Wilhelm BT, Landry JR. (2009) RNA-Seq-quantitative measurement of expression through massively parallel RNA-sequencing. Methods 48: 249–257 [PubMed]
  • Xu G, Ma H, Nei M, Kong H. (2009) Evolution of F-box genes in plants: different modes of sequence divergence and their relationships with functional diversification. Proc Natl Acad Sci USA 106: 835–840 [PMC free article] [PubMed]
  • Zenoni S, Ferrarini A, Giacomelli E, Xumerle L, Fasoli M, Malerba G, Bellin D, Pezzotti M, Delledonne M. (2010) Characterization of transcriptional complexity during berry development in Vitis vinifera using RNA-Seq. Plant Physiol 152: 1787–1795 [PMC free article] [PubMed]
  • Zhang X, Scheuring CF, Zhang M, Dong JJ, Zhang Y, Huang JJ, Lee MK, Abbo S, Sherman A, Shtienberg D, et al. (2010) A BAC/BIBAC-based physical map of chickpea, Cicer arietinum L. BMC Genomics 11: 501. [PMC free article] [PubMed]

Articles from Plant Physiology are provided here courtesy of American Society of Plant Biologists