Nucleic Acids Res. Jul 2013; 41(Web Server issue): W123–W128.
Published online May 21, 2013. doi:  10.1093/nar/gkt418
PMCID: PMC3692069

WebAUGUSTUS—a web service for training AUGUSTUS and predicting genes in eukaryotes

Abstract

The prediction of protein coding genes is an important step in the annotation of newly sequenced and assembled genomes. AUGUSTUS is one of the most accurate tools for eukaryotic gene prediction. Here, we present WebAUGUSTUS, a web interface for training AUGUSTUS and predicting genes with AUGUSTUS. Depending on the needs of the user, WebAUGUSTUS generates training gene structures automatically. Besides a genome file, either a file with expressed sequence tags or a file with protein sequences is required for this step. Alternatively, it is possible to submit an externally generated training gene structure file and a genome file. The web service optimizes AUGUSTUS parameters and predicts genes with those parameters. WebAUGUSTUS is available at http://bioinf.uni-greifswald.de/webaugustus.

INTRODUCTION

The structural annotation of protein coding genes serves as a basis for many further steps in the analysis of sequenced and assembled eukaryotic genomes. Many of the currently available gene prediction tools rely on stochastic models that typically incorporate several thousand parameters. These parameters need to be adapted to species-specific traits to achieve the most accurate gene prediction results. With the exception of self-training methods [e.g. the command line tool GeneMark-ES (1)], parameters are usually adapted using a set of already annotated genes in the target genome; here, we refer to those gene structures as training genes.

Training genes can for instance be built from alignments of expressed sequence tags (ESTs) or protein sequences against the target genome. For example, the initial training genes for annotating the genome of Amphimedon queenslandica were generated from EST alignments (2), and protein sequences were used for generating training genes for Coprinus cinerea (3).

Scripts or tutorials for training are often available for non-commercial gene finders. A skilled bioinformatician should, for example, currently be able to train and execute SNAP (4), AUGUSTUS (5,6), mGene (7) and GeneID (8). However, installing the required software and training a gene prediction tool can be difficult for inexperienced users, as basic programming skills are often required (9).

This problem was already recognized by Schweikert et al. (10), who developed mGene.web. Their web service was intended to enable software users without programming skills to train and execute the gene prediction tool mGene. However, at the time of submission of this manuscript, mGene.web was temporarily not fully functional because it was being migrated to a new server.

Independent accuracy assessments have shown that AUGUSTUS is among the most accurate gene finding tools (11,12). Training AUGUSTUS for a novel species has so far been challenging for many users. Therefore, we here describe WebAUGUSTUS, a web service for automated training gene structure generation, training AUGUSTUS and predicting genes with AUGUSTUS.

WEB SERVICE

WebAUGUSTUS provides two web interfaces:

  1. AUGUSTUS Training generates training gene structures, trains AUGUSTUS and predicts genes with AUGUSTUS in a fully automated way.
  2. AUGUSTUS Prediction predicts genes with AUGUSTUS in genomic sequences using already trained parameters.

AUGUSTUS Training

To date, AUGUSTUS has been trained by experts for >50 species. The parameter files are publicly available and can be used across program versions of AUGUSTUS for predicting genes in genomic sequences of those species and of species that are closely related to at least one of them. The AUGUSTUS Training web server application can be used to optimize AUGUSTUS parameters for novel species.

Input

The AUGUSTUS Training web interface offers three different data set options for training AUGUSTUS (further referred to as optA, optB and optC), which require different input file combinations:

  • optA
    Fully automated training gene structure generation on the basis of a genome and a cDNA file requires both files in fasta format.
  • optB
    Fully automated training gene structure generation on the basis of a genome and protein sequence file requires both files in fasta format.
  • optC
    Already existing training gene structures can be submitted in gene transfer format (gtf) or genbank format in addition to a genome file in fasta format.

Genome files should contain high-quality genomic sequences (long scaffolds or contigs that can encode complete genes are required). cDNA files may contain ESTs or assembled full-length cDNA sequences. Protein files should contain full-length protein sequences. We recommend that cDNA files and protein files originate from the same species as the target genome file. In some cases, data from close relatives (roughly 95% protein sequence identity or higher) might also lead to good results, but more frequently, it is not possible to generate a sufficient number of training genes from another species' proteins. Externally prepared training gene structures should contain only complete gene structures (i.e. no parts of the coding sequence should be missing).

Software

The AUGUSTUS Training web service provides an interface to a Perl pipeline called autoAug.pl (available at http://bioinf.uni-greifswald.de/augustus/binaries/scripts).

If invoked with data from optA, autoAug.pl assembles cDNA sequences into gene structures using PASA (13). AUGUSTUS parameters are optimized using those gene structures. After successful training, ab initio gene prediction in the genome file is performed. Subsequently, the cDNA sequences are used to create hints for AUGUSTUS using BLAT (14), and genes are predicted using the extrinsic cDNA information as described previously (6). If possible, training examples for untranslated regions (UTRs) are assembled from cDNA information and predicted protein coding regions, and UTR-parameters for AUGUSTUS are trained. After this second training step, genes are predicted with the beforehand created hints and UTRs.

In case of data from optB, Scipio (15) is used to generate training gene structures from alignments of protein sequences to the genome. AUGUSTUS parameters are optimized using those gene structures. After successful training, ab initio gene prediction in the genome file is performed.

If data from optC is submitted, AUGUSTUS parameters are trained, and meta parameters, such as splice site window sizes, are optimized using the provided training gene structures. Afterwards, genes are predicted ab initio in the genome sequences.

It is also possible to submit a genome file, a cDNA file and a protein file. In that case, the same steps as in case optB are performed, but in addition, hints are created from the cDNA sequences as described in optA, and genes are predicted with this extrinsic evidence.
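The mapping from submitted files to pipeline steps described in this section can be summarized in a short Python sketch. The function name `plan_workflow` and the step labels are illustrative assumptions; they are not options or output of autoAug.pl, and file combinations that the text does not describe are rejected rather than guessed.

```python
def plan_workflow(genome, cdna=None, protein=None, structures=None):
    """Map submitted files to the pipeline steps described in the text.

    Hypothetical helper: the function name and step labels are illustrative
    and do not correspond to options of autoAug.pl.
    """
    if not genome:
        raise ValueError("a genome file in fasta format is always required")

    train = "optimize AUGUSTUS parameters"
    ab_initio = "predict genes ab initio"
    hints = ["create hints from cDNA with BLAT", "predict genes with hints"]
    utr = ["train UTR parameters if possible",
           "predict genes with hints and UTRs"]

    if structures and not cdna and not protein:      # optC
        return ["use submitted training gene structures", train, ab_initio]
    if cdna and not protein and not structures:      # optA
        return (["build training genes from cDNA with PASA", train, ab_initio]
                + hints + utr)
    if protein and not structures:                   # optB, optionally with cDNA
        steps = ["build training genes from proteins with Scipio",
                 train, ab_initio]
        return steps + hints if cdna else steps
    raise ValueError("file combination not covered by the description above")


if __name__ == "__main__":
    # Genome plus cDNA plus protein: optB steps followed by hint-based prediction.
    for step in plan_workflow("genome.fa", cdna="est.fa", protein="prot.fa"):
        print(step)
```

Returning a plain list of step labels keeps the sketch independent of how the actual pipeline invokes PASA, Scipio, BLAT and AUGUSTUS.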

Reasonable training of AUGUSTUS parameters will require at least several hundreds of training gene structures. WebAUGUSTUS will not start training with ≤100 training gene structures.
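The start condition can be expressed as a small check. The sketch below assumes that each training gene structure occupies one GenBank record, so that counting `LOCUS` lines gives the number of structures; both function names are hypothetical, and how WebAUGUSTUS counts structures internally is not described here.

```python
def count_genbank_records(text):
    """Count GenBank records by their LOCUS header lines.

    Assumption: one training gene structure per record.
    """
    return sum(1 for line in text.splitlines() if line.startswith("LOCUS"))


def training_can_start(n_structures, refuse_at_or_below=100):
    """Training is refused with 100 or fewer training gene structures."""
    return n_structures > refuse_at_or_below


if __name__ == "__main__":
    # Two made-up, truncated records used only to exercise the counter.
    example = "LOCUS       seq1   500 bp  DNA\n//\nLOCUS       seq2   700 bp  DNA\n//\n"
    n = count_genbank_records(example)
    print(n, training_can_start(n))
```

Passing the threshold only means that training will start; the text recommends several hundred structures for reasonable results.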

Output

One major goal of the Training web interface is to return parameters that are optimized for predicting genes in a genomic sequence of a species of interest with AUGUSTUS. These can be used without retraining to predict genes when new assemblies or new transcript sequences are available. In addition, gene predictions according to the different workflows described for optA, optB and optC will be returned if possible.

All jobs will return a log-file AutoAug.log and an error-file AutoAug.err. It is generally recommended that users inspect these files before they continue to work with other results files.

If AUGUSTUS training was possible, an archive parameters.tar.gz with AUGUSTUS parameters is returned. After download and extraction, this archive can be used within a local AUGUSTUS installation. In addition, parameter sets that were trained via the AUGUSTUS Training web interface are immediately available to the AUGUSTUS Prediction web service for future predictions. The web service also returns compressed training gene structures in genbank format that were used for optimizing the parameter set (training.gb.gz).

In case of successful gene prediction, compressed gene prediction archives are returned. Possible gene prediction archives are listed in Table 1. Ab initio gene prediction will always be performed after successful training. Predictions with hints will only be performed if a cDNA file was provided, and if it was possible to generate hints from aligning the contained sequences against the genome. Predictions with UTRs will only be provided if it was previously possible to train UTR parameters for AUGUSTUS.

Table 1.
Compressed gene prediction archives of AUGUSTUS Training

All gene prediction archives contain at least one file in general feature format (gff). If no genes were predicted, this will be the only file. If it was possible to predict genes, gene structures are also contained in gtf- and gbrowse-format. Furthermore, predicted amino acid sequences, coding sequences and exons of coding sequences in fasta format are then contained in a prediction archive. If UTR parameter optimization was possible, the predicted mRNA sequences are additionally contained in fasta format. Table 2 summarizes files that can be contained in gene prediction archives.

Table 2.
Compressed gene prediction output archives generated by WebAUGUSTUS may contain files with the following file endings

AUGUSTUS Prediction

The new AUGUSTUS Prediction web service is directly connected to a database that stores species-specific parameters that were trained by using the Training web service, i.e. if a user has trained AUGUSTUS parameters via WebAUGUSTUS, those parameters are instantly available for predicting genes in further genomic sequences by using the AUGUSTUS Prediction web interface.

Input

Gene predictions are performed in a provided genome file in fasta format. Additionally, a parameter set must be specified (either via training job ID or by selection from a drop-down menu with expert-trained parameters, or by uploading an externally trained parameter archive). Parameter sets that were trained using WebAUGUSTUS are identified via an ID (trainxxxxxxxx) that is only available to the user who performed the training (the chances of guessing the parameter ID of a training job are roughly 1:10^14).
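As a plausibility check on the quoted odds, the following Python sketch counts the possible IDs under one assumption that is not stated in the text: that the eight characters after `train` are drawn uniformly from the 62 ASCII letters and digits. Under that assumption there are about 2 × 10^14 possible IDs, which matches the stated order of magnitude.

```python
import math
import string

# Assumed alphabet: upper- and lower-case ASCII letters plus digits (62 symbols).
alphabet = string.ascii_letters + string.digits
id_length = 8  # the eight x's in "trainxxxxxxxx"

n_ids = len(alphabet) ** id_length
print(n_ids)
print(f"about 10^{math.log10(n_ids):.1f} possible IDs")
```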

Optionally, users may upload a file with cDNA sequences that will be used to automatically generate hints or an externally created hints file in AUGUSTUS-specific gff-format.

Checkboxes allow users to enable UTR prediction (only possible if UTR parameters for the species in question exist), to restrict reporting to genes on certain strands, to enable alternative transcripts and to select the allowed gene structures (e.g. only complete genes or complete and partial genes).

Software

If a cDNA file was provided, WebAUGUSTUS will use BLAT to align the cDNA sequences to the genome. The alignments are converted to hints. In any case, AUGUSTUS will be executed with arguments that match the user-specified requirements. In contrast to the Training web service, which automatically tries to run several subsequent prediction steps, the AUGUSTUS Prediction web service runs exactly one gene prediction job per submission.

Output

After WebAUGUSTUS has finished a gene prediction job, prediction results will be available for download in a compressed archive that contains at least a gff-file, but may optionally also contain the other files listed in Table 2.
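Because every result archive contains at least a gff-file, a first sanity check on a download is to count the predicted features by type. The Python sketch below relies only on the general gff layout of nine tab-separated columns with the feature type in the third column; the function name is hypothetical, and the two sample records are made up for illustration rather than taken from real WebAUGUSTUS output.

```python
from collections import Counter


def count_features(gff_lines):
    """Count gff records per feature type (third tab-separated column)."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a complete nine-column gff record
        counts[fields[2]] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "# made-up example records\n",
        "chr1\tAUGUSTUS\tgene\t100\t900\t.\t+\t.\tg1\n",
        "chr1\tAUGUSTUS\tCDS\t100\t300\t.\t+\t0\ttranscript_id \"g1.t1\"; gene_id \"g1\";\n",
    ]
    print(count_features(sample))
```

A count of zero for all feature types corresponds to the case described above in which no genes were predicted.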

Implementation

AUGUSTUS and parts of the training routine are implemented in C++; the wrapping pipeline is implemented in Perl. The web service is implemented in Grails. Submitted jobs are scheduled via a Sun Grid Engine. Currently, eight jobs can be executed in parallel.

According to its license, BLAT is freely available for academic, non-profit and personal use. Commercial users are, therefore, not allowed to use WebAUGUSTUS for processes that involve the usage of BLAT. That means, commercial users are only allowed to run AUGUSTUS Training with externally generated training gene structures, and the submission of cDNA files is not allowed for commercial users in AUGUSTUS Training and Prediction.

MATERIALS AND METHODS

Prediction accuracy with parameters trained by WebAUGUSTUS and by human experts was measured using three different data sets. For optA, the genome of the insect Drosophila melanogaster (assembly BDGP R5/dm3) and 818 005 ESTs from the same species that were obtained from the National Center for Biotechnology Information (NCBI) were used. OptB was evaluated using the genome of the plant Arabidopsis thaliana (assembly TAIR 10) and 35 375 protein sequences of the same species that were obtained from NCBI. OptC was evaluated using the genome of the worm Caenorhabditis elegans and 18 555 training gene structures retrieved from Wormbase (16).

To avoid an overly optimistic performance estimate for the new genes, the chromosomes of all genomes [for fly and plant downloaded from the UCSC Genome Browser database (17)] were split into two parts in such a way that ~50% of the genes were located on the first half, and the remaining genes were located on the second half. The second part of all chromosomes was used as a genomic input sequence for training AUGUSTUS, whereas the first part served for accuracy assessment of gene predictions.
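One way to choose such a split point is sketched below in Python. This is an illustration under stated assumptions, not the procedure actually used for the experiments: gene coordinates are assumed to be 1-based inclusive `(start, end)` pairs on one chromosome, the cut is placed directly after the gene that completes the first half, and it is moved further downstream if genes overlap so that no gene is cut in two. The function name is hypothetical.

```python
def split_coordinate(gene_spans):
    """Return the last coordinate of the first chromosome part.

    About half of the genes end at or before the returned position and the
    remaining genes start after it.
    """
    spans = sorted(gene_spans)
    if len(spans) < 2:
        raise ValueError("need at least two genes to split a chromosome")

    half = len(spans) // 2
    cut = max(end for _, end in spans[:half])

    # Move the cut downstream while the next gene overlaps the first part.
    i = half
    while i < len(spans) and spans[i][0] <= cut:
        cut = max(cut, spans[i][1])
        i += 1
    if i == len(spans):
        raise ValueError("overlapping genes leave no gene for the second part")
    return cut


if __name__ == "__main__":
    genes = [(100, 200), (300, 400), (500, 600), (700, 800)]
    print(split_coordinate(genes))
```

With the returned coordinate, the sequence up to and including the cut would serve for accuracy assessment and the remainder as training input, following the assignment described above.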

Protein coding genes from FlyBase (18) for D. melanogaster, from TAIR 10 (19) for A. thaliana and from Wormbase for C. elegans were used as reference annotations for measuring accuracy.

The exact source of all data sets and the files used for the actual experiments are described in detail in Supplementary Materials, section Supplementary Methods: Data Sets.

Commonly used measures of accuracy (measured in percent) in gene prediction are

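\[
\text{Sensitivity} = \frac{TP}{TP + FN} \times 100, \qquad \text{Specificity} = \frac{TP}{TP + FP} \times 100,
\]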

where TP stands for true positives, i.e. the number of predicted features that agree with the gold-standard reference, FN stands for false negatives, i.e. the number of reference features that were missed by the predictor, and FP stands for false positives, i.e. the number of features that were predicted but do not agree with the reference annotation.

Sensitivity and specificity were measured for the features gene (i.e. only a gene structure that was predicted correctly including the exact positions of all CDS exons was counted as TP), exon (i.e. only exons that were predicted correctly were counted as TP) and nucleotide (i.e. every correctly predicted nucleotide was counted as TP).
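At the exon level, these measures can be computed by comparing sets of exon coordinates. The Python sketch below uses the definitions that are conventional in gene prediction, sensitivity = TP/(TP + FN) and specificity = TP/(TP + FP), both in percent; the function name and the tuple layout `(sequence, start, end, strand)` are assumptions made for illustration, and the example exons are made up.

```python
def sensitivity_specificity(predicted, reference):
    """Return (sensitivity, specificity) in percent for two feature sets.

    A predicted feature counts as a true positive only if an identical
    feature is present in the reference.
    """
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # predicted and in the reference
    fn = len(reference - predicted)   # in the reference but missed
    fp = len(predicted - reference)   # predicted but not in the reference
    sensitivity = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    specificity = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    reference = [("chr1", 100, 200, "+"), ("chr1", 300, 400, "+"),
                 ("chr1", 500, 600, "+"), ("chr1", 700, 800, "+")]
    predicted = [("chr1", 100, 200, "+"), ("chr1", 300, 400, "+"),
                 ("chr1", 510, 600, "+")]  # last exon has a wrong start
    sn, sp = sensitivity_specificity(predicted, reference)
    print(f"sensitivity {sn:.1f}%, specificity {sp:.1f}%")
```

The same function applies at the gene level if each gene is represented by a hashable description of all its CDS exons, e.g. a tuple of exon tuples.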

RESULTS

Ab initio gene prediction accuracy results from training and gene prediction via WebAUGUSTUS are shown in Table 3. Additionally, we show gene prediction accuracy obtained with parameter sets that were trained by experts. Here, performance depends on the amount and quality of input data. The differences reported here between expert and automated training are small. For optA, it should be noted that the expert-trained parameters contained customized modifications for the particular case of D. melanogaster that are not possible via WebAUGUSTUS (e.g. the length of donor and acceptor splice sites was altered). In case of optC, accuracy obtained by using WebAUGUSTUS seems to be slightly higher than accuracy obtained with the expert-trained parameters. This may be explained by the fact that the web parameters were trained and tested on genes from the current Wormbase release, whereas the expert parameters were trained using an earlier annotation. In general, higher accuracy values can be expected when using the same parameter sets in combination with extrinsic evidence.

Table 3.
Ab initio gene prediction accuracy results of WebAUGUSTUS with web-trained and expert-trained parameters

The runtimes of training and prediction jobs that were executed for preparing the results are shown in Table 4 (jobs are executed sequentially on the server, i.e. only one CPU is allocated to each job). Although the training jobs of optB and optC required only a couple of hours, the training job of optA required several days. Runtime depends on the size of data sets, on the executed pipeline, on the resulting number of training gene structures and on the obtained parameter set. Given the same number of resulting training gene structures, optC will always be faster than optA and optB because the training gene structure file does not need to be generated by the web service. In turn, given the same number of resulting training gene structures, optA will always be slower than optB because optB does not attempt to perform the assembly of UTR training examples and UTR parameter training. Certain properties of the sequences, in particular the number of unknown nucleotides, and of the parameter sets influence the execution time of gene prediction jobs, e.g. the maximal length of UTR exons can lead to different execution times because the longer a UTR exon can potentially be, the more candidates must be scored.

Table 4.
Computational time of WebAUGUSTUS for optA, optB and optC

DISCUSSION AND CONCLUSION

WebAUGUSTUS is currently the only functional web service for generating training gene structures and training a eukaryotic gene prediction tool. In comparison, mGene.web does not offer automated training gene generation, and it requires a lot of interaction with the user (building customized workflows and so forth). In contrast to this, WebAUGUSTUS is fairly easy to use: after filling in a web form, the entire job is executed automatically.

Although not available as a web service, the usage of GeneMark-ES for one-step training and prediction is as easy to operate as a web service. However, GeneMark-ES was designed for small and not complex genomes. We would, therefore, like to mention that the gene prediction accuracy of WebAUGUSTUS in relatively complex eukaryotic genomes (e.g. D. melanogaster) is higher in comparison with the accuracy of GeneMark-ES (see Supplementary Materials, section Supplementary Results: Accuracy of GeneMark-ES).

Concerning the accuracy results reported here, users should be aware that the accuracy of gene prediction with parameters that were optimized using WebAUGUSTUS strongly depends on the input data quality. A low number of training gene structures, or low-quality training gene structures, may lead to poor accuracy.

In comparison with the freely available binaries and scripts for execution of AUGUSTUS on a local computer, the functionality of WebAUGUSTUS is limited. For example, Conditional Random Field (CRF) training is not accessible via WebAUGUSTUS, as this method is less robust to errors in the training gene set than Generalized Hidden Markov Model training. Also the integration of RNA-Seq data and several other frequently used sources of extrinsic evidence is currently not supported by WebAUGUSTUS unless the user first prepares the hints locally. In some cases, the CRF will yield better accuracy results, and the inclusion of as much high-quality extrinsic evidence as possible will most definitely improve gene prediction accuracy.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Methods and Supplementary Results.

FUNDING

Funding for open access charge: Deutsche Forschungsgemeinschaft (DFG) [HO4545/1-1, STA1009/6-1], institutional funding (Institute for Mathematics and Computer Science, Ernst Moritz Arndt University of Greifswald).

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

The authors thank all users of WebAUGUSTUS who reported numerous bugs and thereby actively helped to improve the web service. They are grateful to J. Behr and G. Rätsch from the Memorial-Sloan-Kettering Cancer Center for spending time on teaching them how to use mGene.web. They thank L. Gerischer for proofreading the manuscript.

REFERENCES

1. Lomsadze A, Ter-Hovhannisyan V, Chernoff YO, Borodovsky M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res. 2005;33:6494–6506.
2. Srivastava M, Simakov O, Chapman J, Fahey B, Gauthier MEA, Mitros T, Richards GS, Conaco C, Dacre M, Hellsten U, et al. The Amphimedon queenslandica genome and the evolution of animal complexity. Nature. 2010;466:720–726.
3. Stajich JE, Wilke SK, Ahrén D, Au CH, Birren BW, Borodovsky M, Burns C, Canbäck B, Casselton LA, Cheng CK, et al. Insights into evolution of multicellular fungi from the assembled chromosomes of the mushroom Coprinopsis cinerea (Coprinus cinereus). Proc. Natl Acad. Sci. USA. 2010;107:11889–11894.
4. Korf I. Gene finding in novel genomes. BMC Bioinformatics. 2004;5:59.
5. Stanke M, Waack S. Gene prediction with a hidden markov model and new intron submodel. Bioinformatics. 2003;19(Suppl. 2):ii215–ii225.
6. Stanke M, Diekhans M, Baertsch R, Haussler D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics. 2008;24:637–644.
7. Schweikert G, Zien A, Zeller G, Behr J, Dieterich C, Ong CS, Philips P, De Bona F, Hartmann L, Bohlen A, et al. mGene: accurate SVM-based gene finding with an application to nematode genomes. Genome Res. 2009;19:2133–2143.
8. Parra G, Blanco E, Guigó R. GeneID in Drosophila. Genome Res. 2000;10:511–515.
9. Yandell M, Ence D. A beginner’s guide to eukaryotic genome annotation. Nat. Rev. Genet. 2012;13:329–342.
10. Schweikert G, Behr J, Zien A, Zeller G, Ong CS, Sonnenburg S, Rätsch G. mGene.web: a web service for accurate computational gene finding. Nucleic Acids Res. 2009;37:W312–W316.
11. Guigó R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, et al. EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biology. 2006;7(Suppl. 1):S2.
12. Coghlan A, Fiedler TJ, McKay SJ, Flicek P, Harris TW, Blasiar D, nGASP Consortium, Stein LD. nGASP – the nematode genome annotation assessment project. BMC Bioinformatics. 2008;9:549.
13. Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK, Hannick LI, Maiti R, Ronning CM, Rusch DB, et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 2003;31:5654–5666.
14. Kent WJ. BLAT – the BLAST-like alignment tool. Genome Res. 2002;12:656–664.
15. Keller O, Odronitz F, Stanke M, Kollmar M, Waack S. Scipio: using protein sequences to determine the precise exon/intron structures of genes and their orthologs in closely related species. BMC Bioinformatics. 2008;9:278.
16. Yook K, Harris TW, Bieri T, Cabunoc A, Chan J, Chen WJ, Davis P, de la Cruz N, Duong A, Fang R, et al. Wormbase 2012: more genomes, more data, new website. Nucleic Acids Res. 2012;40:D735–D741.
17. Fujita PA, Rhead B, Zweig AS, Hinrichs AS, Karolchik D, Cline MS, Goldman M, Barber GP, Clawson H, Coelho A, et al. The UCSC Genome Browser database: update 2011. Nucleic Acids Res. 2010;39:D876–D882.
18. Drysdale R, FlyBase Consortium. FlyBase: a database for the Drosophila research community. Methods Mol. Biol. 2008;420:45–59.
19. Lamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, Sasidharan R, Muller R, Dreher K, Alexander DL, Garcia-Hernandez M, et al. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 2012;40:D1202–D1210.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press