Bioinformatics. May 1, 2009; 25(9): 1195–1196.
Published online Mar 24, 2009. doi:  10.1093/bioinformatics/btp129
PMCID: PMC2732306

dCAS: a desktop application for cDNA sequence annotation

Abstract

Motivation: Understanding gene regulation and expression is the key to the advancement of biology. EST sequence assembly and analysis provide unique benefits in this regard. We have developed a standalone application, dCAS (Desktop cDNA Annotation System), which performs automated EST cleaning, clustering, assembly and annotation on a desktop computer. Compared with other available tools, dCAS provides a more convenient and user-friendly solution to biologists for extracting biological meaning from sequence data.

Availability: The dCAS package is distributed freely. A cross-platform installer and associated sequence databases can be downloaded at: http://exon.niaid.nih.gov/applications.html

Contact: guoyo@mail.nih.gov

1 INTRODUCTION

Two major strategies are commonly used in evaluating gene expression under different biological conditions: hybridization-based and sequencing-based. While hybridization-based methods, such as microarrays, use a pre-identified probe set, sequencing-based technologies generate more exhaustive samplings, which can lead to identifying new transcripts (Liu et al., 2007). In the latter method, ESTs derived from biological samples are sequenced and assembled into contigs, which are then used to identify the expressed genes. With the assumption that the number of ESTs of a particular gene is correlated with the copy number of the transcribed gene, comparing the pools of ESTs obtained under different experimental conditions reveals differences in gene expression levels (Ribeiro et al., 2006).

Multiple analytic steps are required to translate ESTs into biological meaning. Many tools have been developed for this purpose, including ESTExplorer (Nagaraj et al., 2007), ESTpass (Lee et al., 2007) and EST2uni (Forment et al., 2008). These tools are either command-line applications or require sophisticated web server and database setup. Although web-based applications provide distinct advantages, we do not think they currently fit the needs of general biological laboratories, which are short of computer-savvy technicians but still require specialized and comprehensive bioinformatics tools to analyze large numbers of EST sequences on site.

To satisfy those needs, we have developed a software package: Desktop cDNA Annotation System (dCAS), which is used to perform large-scale EST sequence cleaning, clustering, assembly and annotation on a desktop computer. dCAS originated from a set of independent EST analysis and annotation software modules, which were developed by Dr Jose Ribeiro (unpublished data) at the National Institutes of Health using Visual Basic and have been widely used by many laboratories around the world (Andersen et al., 2007; Arca et al., 2007; Assumpcao et al., 2008).

dCAS uses a workflow concept to integrate multiple steps of sequence processing and analysis. There are two workflows currently available in dCAS: the Core workflow, designed for processing a single cDNA library, and the Compare Library workflow, for analysis of multiple cDNA libraries derived from the same biological sample under different experimental conditions.

For a single cDNA library, the Core workflow first strips each EST sequence of any cloning vector sequence flanking the EST. This is achieved by searching for exact hits in an extensible UniVec database (http://www.ncbi.nlm.nih.gov/VecScreen/UniVec.html). Stripped sequences are then clustered using Blast (Altschul et al., 1990) and assembled into contigs using Cap3 (Huang and Madan, 1999). Contigs are then annotated by searching against multiple sequence databases specified by the user and by predicting signal peptides using the SignalP server (Emanuelsson et al., 2007). Finally, the analyzed data are aggregated and compiled into an Excel report with hyperlinks to the data files generated in each analysis step.
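The vector-stripping step can be illustrated with a minimal sketch. It is not the dCAS implementation: the function name strip_vector, the min_len threshold and the choice to test only exact prefix and suffix overlaps are assumptions made for illustration, and the vector sequence in the usage note is a made-up example rather than a UniVec entry.

```python
def strip_vector(est: str, vectors: list[str], min_len: int = 12) -> str:
    """Trim flanking vector sequence from an EST using exact matches only.

    min_len is a hypothetical threshold: shorter overlaps are ignored
    because they are likely to occur by chance.
    """
    seq = est.upper()
    for vec in vectors:
        vec = vec.upper()
        # 5' side: the EST begins with the trailing k bases of the vector.
        for k in range(min(len(vec), len(seq)), min_len - 1, -1):
            if seq.startswith(vec[-k:]):
                seq = seq[k:]
                break
        # 3' side: the EST ends with the leading k bases of the vector.
        for k in range(min(len(vec), len(seq)), min_len - 1, -1):
            if seq.endswith(vec[:k]):
                seq = seq[:-k]
                break
    return seq
```

For example, with the hypothetical vector "GGATCCACTAGTTCTAGAGCGGCCGC", an EST built as the last 15 vector bases, an insert, and the first 14 vector bases is reduced to the insert alone, while an EST with no flanking match is returned unchanged.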

For multiple cDNA libraries, the Compare Library workflow builds sequence clusters from all stripped EST sequences across the selected libraries, assembles them into contigs and then annotates each contig using Blast and SignalP. The results are then compiled into an Excel report, which contains not only the sequence annotation but also statistics on gene expression differences across the selected cDNA libraries.
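The text does not say which statistic dCAS reports for expression differences. As one plausible illustration, the sketch below computes a Pearson chi-square statistic on a 2 x 2 table of EST counts for a single contig in two libraries. The function name and the choice of test are assumptions, not a description of dCAS.

```python
def chi_square_2x2(count_a: int, total_a: int, count_b: int, total_b: int) -> float:
    """Pearson chi-square statistic (1 degree of freedom) for one contig.

    count_a / count_b: ESTs of the contig found in library A / B.
    total_a / total_b: total ESTs sequenced in library A / B.
    """
    observed = [
        [count_a, total_a - count_a],
        [count_b, total_b - count_b],
    ]
    row_sums = [total_a, total_b]
    grand = total_a + total_b
    col_sums = [count_a + count_b, grand - count_a - count_b]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / grand
            if expected > 0:  # skip empty cells to avoid division by zero
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat
```

A contig with 30 of 100 ESTs in one library and 10 of 100 in the other gives a statistic of 12.5, which exceeds the 3.84 threshold for significance at the 0.05 level with one degree of freedom. Equal proportions give 0.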

In addition to the workflows, dCAS provides several utility functions to help users perform independent tasks. For example, formatting Blast databases and searching databases using a standalone Blast package can be done through a dedicated user interface. Another utility function supports sequence curation, such as sequence editing, translation frame determination and sequence name assignment.
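How dCAS determines the translation frame is not described here. A common heuristic, sketched below as an assumption, is to translate the three forward frames with the standard genetic code and keep the frame that yields the longest peptide without a stop codon. The names translate and best_frame are illustrative, and the reverse strand is omitted for brevity.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq: str, frame: int) -> str:
    """Translate seq starting at offset frame (0, 1 or 2); '*' marks a stop."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    # Codons with ambiguous bases (e.g. N) are rendered as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def best_frame(seq: str) -> tuple[int, str]:
    """Return (frame, peptide) for the longest stop-free forward translation."""
    best = (0, "")
    for frame in range(3):
        longest = max(translate(seq, frame).split("*"), key=len)
        if len(longest) > len(best[1]):
            best = (frame, longest)
    return best
```

For the made-up sequence "GATGGCTGAACTAAAAGGTTAA", frames 0 and 2 are interrupted by stop codons, so best_frame returns frame 1 with the peptide "MAELKG".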

Because of the limitations of a typical desktop computer, the processing time for analyzing large cDNA libraries can be very long. To mitigate the problem, a parallel processing feature has been implemented to allow multiple computers to work together during the sequence annotation step. Specifically, dCAS can send the sequences to a companion program called dCAS Mate running on other computers. After data analysis is done, the results are sent back to dCAS for further processing.

2 IMPLEMENTATION

dCAS is implemented in Java 5.0. The current version of dCAS has been tested on Windows XP, Mac OS X and Linux. A screenshot of dCAS running the Core workflow is shown in Figure 1. Each working node is configured by clicking the node and filling in data in the corresponding panel shown in the lower left region. Internally, dCAS wraps several command-line applications in its data processing workflow: Blast, Cap3 and Phred.

Fig. 1.
dCAS interface. An animated rainbow border signifies the current processing stage. Network information in the lower section is used for establishing data communication with dCAS Mate.

An important step in sequence annotation is to search available databases to identify similar sequences with known biological meaning. To simplify system installation, configuration and processing, dCAS uses only the Blast package in this step to search for protein and nucleotide sequences that have already been annotated with functional information, conserved domains and Gene Ontology terms. dCAS also provides a simple user interface that allows users to format new Blast databases and add them to the dCAS pipeline.

An embedded database implemented using HSQLDB 1.8.0 (http://hsqldb.org/) is incorporated in dCAS. The database stores system configurations, processed cDNA library information and the running status of the workflow, such as the parameters and progress of each analysis step. The workflow, including its running status and all of the related data files, can be exported into an archive file, which can then be imported into another dCAS system. This data-sharing feature allows users to rerun dCAS workflows processed by others.

The parallel processing feature of dCAS is implemented using xfire (http://xfire.codehaus.org/), a Java Simple Object Access Protocol (SOAP) framework, and Jetty (http://www.mortbay.org/), an open source embeddable web server package. Specifically, communication between dCAS running on the master machine and dCAS Mate running on the slave machines is done through web services. One dCAS instance can control multiple dCAS Mates. By partitioning input sequences into data blocks and sending them to the slave machines, dCAS can work more efficiently on large cDNA libraries.
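The partition-and-dispatch idea can be sketched in a few lines. This is only an illustration of the pattern: a local thread pool stands in for the remote dCAS Mate machines, annotate_block is a placeholder for the real annotation work, and the block size and function names are assumptions. dCAS itself communicates over SOAP web services, which this sketch does not model.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items: list, block_size: int) -> list[list]:
    """Split items into consecutive blocks of at most block_size elements."""
    return [items[i:i + block_size] for i in range(0, len(items), block_size)]


def annotate_block(block: list[str]) -> list[tuple[str, int]]:
    """Placeholder for the work done by one worker on one block.

    Here each sequence is simply paired with its length.
    """
    return [(seq, len(seq)) for seq in block]


def annotate_in_parallel(seqs: list[str], block_size: int = 2,
                         workers: int = 3) -> list[tuple[str, int]]:
    """Dispatch blocks to workers and merge the results in input order."""
    blocks = partition(seqs, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the order the blocks were submitted.
        results = list(pool.map(annotate_block, blocks))
    return [hit for block_result in results for hit in block_result]
```

Because the merged results keep the input order, the master can continue with report generation without re-sorting. The block size trades per-request overhead against load balance across workers.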

The Microsoft Excel file format is convenient for biologists who need to analyze and manipulate large datasets. dCAS uses the Apache POI package (http://poi.apache.org/) to generate the analysis report in Excel format. Generating and viewing the report do not require Microsoft Excel. The generated Excel file can be viewed using the freely available Excel Viewer (search ‘Excel viewer’ at http://www.microsoft.com/downloads/) on Windows or OpenOffice (http://www.openoffice.org/) on other platforms. The structure of the report file is described in the user manual.

3 RESULTS

We profiled dCAS using JProfiler 5.1.2 (http://www.ej-technologies.com/) to check its memory usage. The results show that dCAS uses at most 150 MB of memory when analyzing a cDNA library with over 10 000 EST sequences.

ACKNOWLEDGEMENTS

We thank John Lumpkin, Sheetal Shah and Jason Barnett (Bioinformatics and Computational Biosciences Branch, NIH/NIAID/OCICB) for their valuable help and suggestions.

Conflict of Interest: none declared.

REFERENCES

  • Altschul SF, et al. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410.
  • Andersen JF, et al. An insight into the sialome of the oriental rat flea, Xenopsylla cheopis (Rots). BMC Genomics. 2007;8:102.
  • Arca B, et al. An insight into the sialome of the adult female mosquito Aedes albopictus. Insect Biochem. Mol. Biol. 2007;37:107–127.
  • Assumpcao TC, et al. An insight into the sialome of the blood-sucking bug Triatoma infestans, a vector of Chagas' disease. Insect Biochem. Mol. Biol. 2008;38:213–232.
  • Emanuelsson O, et al. Locating proteins in the cell using TargetP, SignalP and related tools. Nat. Protoc. 2007;2:953–971.
  • Forment J, et al. EST2uni: an open, parallel tool for automated EST analysis and database creation, with a data mining web interface and microarray expression data integration. BMC Bioinformatics. 2008;9:5.
  • Huang X, Madan A. CAP3: a DNA sequence assembly program. Genome Res. 1999;9:868–877.
  • Lee B, et al. ESTpass: a web-based server for processing and annotating expressed sequence tag (EST) sequences. Nucleic Acids Res. 2007;35:W159–W162.
  • Liu F, et al. Comparison of hybridization-based and sequencing-based gene expression technologies on biological replicates. BMC Genomics. 2007;8:153.
  • Nagaraj SH, et al. ESTExplorer: an expressed sequence tag (EST) assembly and annotation platform. Nucleic Acids Res. 2007;35:W143–W147.
  • Ribeiro JM, et al. An annotated catalog of salivary gland transcripts from Ixodes scapularis ticks. Insect Biochem. Mol. Biol. 2006;36:111–129.

Articles from Bioinformatics are provided here courtesy of Oxford University Press