Bioinformation. 2012; 8(1): 43–47.
Published online 2012 Jan 6.
PMCID: PMC3282275

The capsicum transcriptome DB: a “hot” tool for genomic research


Chili pepper (Capsicum annuum) is an economically important crop with no available public genome sequence. We describe a genomic resource to facilitate Capsicum annuum research. A collection of Expressed Sequence Tags (ESTs) derived from five C. annuum organs (root, stem, leaf, flower and fruit) was sequenced using the Sanger method, and multiple leaf transcriptomes were deeply sampled using GS-pyrosequencing. A hybrid assembly of 1,324,516 raw reads yielded 32,314 high-quality contigs, as validated by coverage and identity analysis against existing pepper sequences. Overall, 75.5% of the contigs had significant sequence similarity to entries in nucleic acid and protein databases; 23% of the sequences have not been previously reported for C. annuum and expand the sequence resources for this species. A MySQL database and a user-friendly Web interface were constructed with search tools that permit queries of the ESTs, including sequence, functional annotation, Gene Ontology classification, metabolic pathways, and assembly information. The Capsicum Transcriptome DB is freely available from http://www.bioingenios.ira.cinvestav.mx:81/Joomla/


Chili pepper (Capsicum annuum, L.) is one of the most important crops in Mexico. Mexico is the second largest chili pepper producer in the world and has been suggested as a center of domestication of this species [1], which is reflected in the large number of pepper types found in the country. Sequencing and analysis of Expressed Sequence Tags (ESTs) are primary tools for the discovery of novel genes in plants and other organisms. As the chili pepper genome is not currently available, transcriptome data can provide major insights into the genes and gene families involved in important biological processes. In this report, we present the Capsicum Transcriptome DB (Database), a web-based EST database. We constructed cDNA libraries derived from different organs of chili pepper plants, including root, stem, leaf, flower and fruit. In some tissues, samples were collected after exposure to a variety of stress agents or during several developmental stages. The leaf transcriptome was deeply sequenced with both pyrosequencing and Sanger technologies, while cDNAs from the remaining tissues were sequenced solely with the Sanger platform. Sequences were assembled using a hybrid approach, resulting in a reference transcriptome (Figure 1) of 32,314 contigs and 59,991 singletons. The Capsicum Transcriptome DB integrates comprehensive information including functional annotation, Gene Ontology (GO) [2], and metabolic pathway assignments [3]. To provide public access to the sequence and annotation data, we developed a user-friendly SQL query-builder tool and integrated it into the Capsicum Transcriptome DB web site, publicly available at http://www.bioingenios.ira.cinvestav.mx:81/Joomla/.

Figure 1
Schematic of the hybrid assembly process and database construction. The diagram represents the process followed to obtain the hybrid assembly and to construct the Capsicum Transcriptome database. The process included a sequential comparison with several ...

Methodology and Results

Data Source:

The Capsicum Transcriptome DB was established using sequence data from two C. annuum varieties, Serrano Tampiqueño 74 and Sonora Anaheim. For the Sanger-based data, we generated 83,116 ESTs from 30 cDNA libraries derived from root, stem, leaf, flower and fruit tissues (Table 1; see supplementary material). mRNA was isolated using poly(A)+ capture and cloned using Gateway technology. After sequencing, Phred and custom Perl scripts were used to extract high-quality regions from the raw sequences, trim vector sequences, adaptors and polyA/T regions, and eliminate short ESTs (< 90 bp) (Table 1; see supplementary material). A total of 70,743 high-quality Sanger sequences with an average length of 678.35 nucleotides (nt) were obtained. Three cDNA libraries constructed from DNA virus-infected and non-infected pepper leaf tissue (C. annuum cv. Sonora Anaheim) were sequenced using 454 pyrosequencing [4] (Table 2; see supplementary material). A total of 1,838,567 pyrosequencing reads were obtained, with an average length of 99.89 nt and an average quality of 27.90 (Table 3; see supplementary material).
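The clean-up step described above (removing polyA/T regions and discarding ESTs shorter than 90 bp) can be illustrated with a minimal Python sketch. This is not the authors' Phred/Perl pipeline: the function names are hypothetical, the 10-nt minimum run length used to recognise a polyA/T tail is an assumption, and quality and vector trimming are left out.

```python
import re

MIN_LENGTH = 90  # ESTs shorter than 90 bp are discarded, as stated in the text

# Assumption: a run of 10 or more A (or T) at either end counts as a polyA/T tail.
POLY_TAIL = re.compile(r"^(?:A{10,}|T{10,})|(?:A{10,}|T{10,})$")


def trim_poly_tails(seq: str) -> str:
    """Strip leading and trailing polyA/polyT runs from a sequence."""
    seq = seq.upper()
    previous = None
    # Repeat until nothing changes, so stacked tails (e.g. AAAA...TTTT...) are removed.
    while previous != seq:
        previous = seq
        seq = POLY_TAIL.sub("", seq)
    return seq


def filter_ests(records: dict) -> dict:
    """Trim polyA/T tails and keep only sequences of at least MIN_LENGTH nt."""
    kept = {}
    for name, seq in records.items():
        trimmed = trim_poly_tails(seq)
        if len(trimmed) >= MIN_LENGTH:
            kept[name] = trimmed
    return kept
```

Calling `filter_ests` on a dictionary that maps EST names to raw sequences returns only the trimmed sequences that pass the length threshold.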

Capsicum reference transcriptome

Three rounds of assembly were performed using Newbler (v1.3) with default parameters. First, each library was assembled independently to identify sequences classified as “assembled reads” and “singletons” by Newbler (Figure 1; Table 2, see supplementary material). Sequences classified as “repeats” were discarded because their over-representation is problematic in the assembly process. A total of 1,253,773 sequences from all 9 runs, derived from the “assembled reads” and “singletons”, were used for a second assembly (Figure 1), in which 887,718 reads were assembled into 33,652 contigs with an average length of 251.7 nt (Table 3; see supplementary material). A third assembly was performed using a hybrid approach in which a total of 1,324,516 sequences (1,253,773 pyrosequencing reads plus 70,743 ESTs) were assembled (Figure 1, Table 3). A total of 1,144,574 sequences were assembled into 32,538 contigs, with 92,211 remaining as singletons (Table 3; see supplementary material). Custom Perl scripts were used to filter out polyA/T regions, low-quality contigs and short contigs (< 90 nt), yielding 32,314 high-quality contigs and 51,118 singletons (Table 3; see supplementary material). Compared with the 454-only strategy, the hybrid assembly increased the average contig length from 251.71 nt to 388.5 nt (Table 3), and the number of large contigs (≥ 500 nt) increased from 3,438 to 8,792, with their average size increasing from 777 to 871 nt (Figure 2).
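The comparison between the 454-only and hybrid assemblies rests on a few summary statistics: number of contigs, average contig length, and the number and average size of large contigs (≥ 500 nt). A minimal Python sketch of that summary, assuming the contig lengths are already available as a list of integers, could look as follows; the function name and the returned keys are hypothetical.

```python
from statistics import mean

LARGE_CONTIG_MIN = 500  # contigs of at least 500 nt are counted as "large" in the text


def summarize_contigs(lengths):
    """Return the contig count, mean length, and the same figures for large contigs."""
    large = [n for n in lengths if n >= LARGE_CONTIG_MIN]
    return {
        "contigs": len(lengths),
        "mean_length": mean(lengths) if lengths else 0.0,
        "large_contigs": len(large),
        "mean_large_length": mean(large) if large else 0.0,
    }
```

Running the function once on the 454-only contig lengths and once on the hybrid contig lengths gives the two sets of figures that are compared above.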

Figure 2
Comparison of contig sizes obtained with 454-only and hybrid assembly process. When the reads from 454 pyrosequencing are assembled separately, the majority of the contigs are in the 100-499 nt size range. When Sanger-derived sequences are included in ...

Assembly quality measurement:

Using BLASTN [5], an analysis of alignment coverage was carried out against the pepper EST database [6]. The results revealed that 60% (19,388) of the contigs had coverage greater than 90% with an average identity of 99%; 21% (6,575) of the contigs had 60-89% coverage and 99% identity; the remaining contigs (6,351) had coverage of less than 60% with 98% identity. The assembled contigs were then examined for similarity to pepper and other plant sequence databases. We found that 52.5% (17,090) of our contigs had high identity (E ≤ 1e-06, % identity ≥ 90) with sequences in the Pepper ESTs dataset (Figure 3). The remaining contigs (47.5%) were compared against the Tomato Unique Genes database [7], and 14.3% (4,654) of contigs had high similarity (E ≤ 1e-06) to sequences in this dataset (Figure 3). For the contigs still unmatched, a sequential BLASTX-based search [5] was extended to three protein sequence datasets: Arabidopsis thaliana peptides (TAIR) [8], Oryza sativa peptides (RefSeq) [9] and NCBI nr. Of these contigs, ~7.5% (2,452) had high identity to Arabidopsis and rice proteins (E ≤ 1e-05) and 1% (375) had matches to NCBI nr sequences (E ≤ 1e-04). In summary, 75.5% of our contigs share sequence similarity with transcripts, genes or proteins in public databases (Figure 3), and 7,481 of these sequences (23%) have not been previously reported for the species Capsicum annuum. A total of 7,743 (23.8%) contigs are novel transcripts that are specific to C. annuum.
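The two bookkeeping steps in this analysis, binning contigs by alignment coverage and passing only unmatched contigs on to the next database, can be sketched in Python. The sketch assumes the BLAST searches have already been run and filtered by the E-value thresholds given above, so that each database is represented simply by the set of contig identifiers with a significant hit. The function names, the bin labels and the exact bin boundaries (≥ 90, 60 to < 90, < 60) are illustrative assumptions.

```python
def coverage_bin(aligned_length: int, contig_length: int) -> str:
    """Classify a contig by the percentage of its length covered by the alignment."""
    coverage = 100.0 * aligned_length / contig_length
    if coverage >= 90:
        return "high"
    if coverage >= 60:
        return "medium"
    return "low"


def assign_sequentially(contig_ids, searches):
    """Assign each contig to the first database in which it has a hit.

    `searches` is an ordered list of (database_name, set_of_contig_ids_with_hits).
    Contigs matched in one database are not searched against later ones.
    Returns (assigned, remaining), where `remaining` lists contigs with no hit.
    """
    remaining = list(contig_ids)
    assigned = {}
    for db_name, hits in searches:
        still_unmatched = []
        for cid in remaining:
            if cid in hits:
                assigned[cid] = db_name
            else:
                still_unmatched.append(cid)
        remaining = still_unmatched
    return assigned, remaining
```

As a consistency check on the reported figures, the three coverage classes (19,388 + 6,575 + 6,351) add up to the 32,314 contigs of the hybrid assembly.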

Figure 3
Sequential comparison of hybrid contigs with different plant databases. BLASTN searches of the contigs were performed against two Solanaceae databases, Pepper ESTs hosted at KRIBB [6], and Tomato Unique genes from the Sol Genomic Network [7]. We use an ...

Features of the Web database

A website and database were constructed using open source technologies on the Linux operating system (Ubuntu v9.1). The MySQL Database Management System (v5.1) was used to store and manage the data. An Apache HTTP server (v2.2.4), PHP Hypertext Pre-processor (v5.3.1), JavaScript (v3.1.0) and HTML (v4) were used to create the query-builder module for connecting to and querying the database. Custom Perl (v5.10) scripts were used to automatically parse the data into the database, and Joomla (v1.5) was used as the content management system (CMS) for building the web site. Information stored in the database is divided into two main sections: assembled (contigs) and singleton sequences. Each section is then sub-divided into two categories, Functional Annotation and Sequences. Figure 4A shows the tables that store the data derived from assembled sequences (contigs) and their relationships. A total of eight tables store information related to functional annotation. The genomics databases used for functional annotation were: S. lycopersicum unigenes [7], A. thaliana peptides [8], O. sativa peptides [9], P. patens [10], NCBI nr and UniRef100 (UniProt).

Figure 4
Database tables. Several tables were created for the Capsicum database. A) Tables for the assembled sequences. The functional annotation category has eight tables in total. Six tables have the same structure; the “blast” which stores tabular outputs ...

The GO terms [2] and KEGG metabolic pathway [3] annotations were derived from the highest-scoring BLASTX results against TAIR [8]. The assembly and sequence data are stored in tables named “assembly information” and “contigs”, respectively. The “assembly information” table contains the number of sequences (either Sanger or 454 reads) that were assembled into each contig, and the sequence origin (root, stem, leaf, flower or fruit). The “contigs” table stores the assembled sequences. Two additional tables were created to store information for singletons. The majority of the singletons (97.4%) are 454 pyrosequencing reads with an average length of ~100 nt. However, 1,349 are Sanger-derived ESTs with an average length of ~650 nt (Table 3; see supplementary material). We annotated all singletons using BLASTX against the NCBI nr database. Sequence and annotation are stored in separate tables (Figure 4A), using “seq_id” as a primary key to relate the two tables. A query-builder module, adapted as a user-friendly Web interface, was developed. The module allows the user to explore the database in three different ways:

  1. Simple search,
  2. Advanced query, and
  3. Query builder (Figure 5).

Figure 5
Examples of resources available in Capsicum Transcriptome DB. Access to different tools available in the module is demonstrated. A) The Simple Query module, where a user can perform table-specific searches, is shown. Each button represents one table and ...

Using the “Simple search” option, the user is able to access and download the full content of a specific table (Figure 5A). The searches can be refined using “search options”. The “Advanced query” section was designed for users with SQL knowledge, who can type SQL statements directly (Figure 5C). The “Query builder” permits the user to collect functional annotation and sequence information from different tables (Figure 5D). This module uses checkboxes to select multiple attributes from a number of tables and columns, and also provides search options to define a query (Figure 5D). For every result generated by the module, the user is able to download a file in CSV (comma-separated values) format (Figure 5B).
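The kind of query the module supports (joining a sequence table to its annotation table through “seq_id” and exporting the result as CSV) can be sketched as follows. This is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for the MySQL server, and every table name, column name other than “seq_id”, and data row is hypothetical rather than taken from the actual schema.

```python
import csv
import io
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with two toy tables related by seq_id."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE singleton_sequences (
            seq_id   TEXT PRIMARY KEY,
            sequence TEXT NOT NULL
        );
        CREATE TABLE singleton_annotation (
            seq_id      TEXT PRIMARY KEY REFERENCES singleton_sequences(seq_id),
            description TEXT,
            evalue      REAL
        );
        """
    )
    conn.executemany(
        "INSERT INTO singleton_sequences VALUES (?, ?)",
        [("S1", "ACGTACGT"), ("S2", "GGCCTTAA")],
    )
    conn.executemany(
        "INSERT INTO singleton_annotation VALUES (?, ?, ?)",
        [("S1", "hypothetical protein", 1e-20)],
    )
    conn.commit()
    return conn


def query_to_csv(conn: sqlite3.Connection, keyword: str) -> str:
    """Join sequences with annotations, filter by keyword, and return CSV text."""
    cursor = conn.execute(
        """
        SELECT s.seq_id AS seq_id, a.description AS description,
               a.evalue AS evalue, s.sequence AS sequence
        FROM singleton_sequences AS s
        JOIN singleton_annotation AS a ON a.seq_id = s.seq_id
        WHERE a.description LIKE ?
        ORDER BY s.seq_id
        """,
        (f"%{keyword}%",),  # bound parameter, so the keyword is never spliced into SQL
    )
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([col[0] for col in cursor.description])  # header row
    writer.writerows(cursor)
    return buffer.getvalue()
```

For example, `query_to_csv(build_demo_db(), "protein")` returns a header line followed by one line per annotated sequence whose description contains the keyword.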


Our user-friendly web interface is a straightforward tool that provides access to molecular data in a simple and dynamic manner, without requiring training in bioinformatics. The user can perform queries to analyze a group of sequences or an individual sequence. The database contains 32,314 high-quality assembled contigs and 51,118 high-quality singletons. Functional annotation was assigned to 75% of the contigs, including 21,744 sequences common to Solanaceae (Pepper ESTs and Tomato unique genes) and 7,481 novel sequences not previously reported for Capsicum annuum.

Future Developments

We will continue sequencing several chili pepper tissues and updating the database with new sequences and functional annotation. Comments and requests regarding the database should be sent to Dr. Rafael Rivera-Bustamante at capsicum@ira.cinvestav.mx


The data presented in this study show the advantages of using multiple sequencing technologies for de novo assembly of a transcriptome in the absence of a reference genome. With the hybrid assembly approach, we were able to improve multiple contig quality measures. A detailed coverage analysis showed the high quality of the assembly, suggesting that the rate of contig artifacts is low. Using an in-depth annotation pipeline, we annotated 75% of the contigs, including 7,481 novel sequences not previously reported for Capsicum annuum. These data expand our knowledge of gene expression across diverse pepper tissues and complement the data in existing databases [6, 7]. In summary, the bioinformatics methods applied to the reported data demonstrate that our Capsicum reference transcriptome is a reliable resource and an important “hot” tool for downstream functional studies.

Supplementary material

Data 1:


This work was supported by Secretaría de Agricultura, Ganadería, Desarrollo Rural, Pesca y Alimentación (SAGARPA) and Consejo Nacional de Ciencia y Tecnologia (Conacyt) [11806]. We acknowledge support from Conacyt-Mexico to Elsa Góngora-Castillo (Ph.D. fellowship).

We acknowledge and thank Lisset Herrera-Isidrón, Rosa M. Adame-Alvarez, Cesar Aza-González, Martha G. Betancourt-Jiménez, Fernando Hernández-Godinez, Enrique Ibarra-Laclette, Hector G. Nuñez-Palenius and Harumi Shimada-Beltran for participating in the construction of cDNA libraries. We would especially like to thank Eduardo Jaramillo-Olvera for his support during the database and website development, and Drs. Robin Buell and Rebecca Davidson for critical reading of the manuscript and helpful comments.


Citation: Góngora-Castillo et al, Bioinformation 8(1): 043-047 (2012)


1. A Aguilar-Melendez, et al. American Journal of Botany. 2010;96:1190.
2. M Ashburner, et al. Nat Genet. 2000;25:25.
3. M Kanehisa, S Goto. Nucleic Acids Res. 2000;28:27.
4. M Margulies, et al. Nature. 2005;437:376.
5. SF Altschul, et al. J Mol Biol. 1990;215:403.
6. HJ Kim, et al. BMC Plant Biol. 2008;8:101.
7. LA Mueller, et al. Plant Physiol. 2005;138:1310.
8. SY Rhee, et al. Nucleic Acids Res. 2003;31:224.
9. T Tanaka, et al. Nucleic Acids Res. 2008;36:D1028.
10. SA Rensing, et al. Science. 2008;319:64.

Articles from Bioinformation are provided here courtesy of Biomedical Informatics Publishing Group