Nucleic Acids Res. Jan 15, 1996; 24(2): 316–320.
PMCID: PMC145627

Cleaning the GenBank Arabidopsis thaliana data set.


Data-driven computational biology relies on the large quantities of genomic data stored in international sequence data banks. However, its potential is drastically impaired if the stored data are unreliable. During a project aiming to predict splice sites in the dicot Arabidopsis thaliana, we extracted a data set from the A. thaliana entries in GenBank. A number of simple 'sanity' checks, based on the nature of the data, revealed an alarmingly high error rate. More than 15% of the most important entries extracted contained erroneous information. In addition, a number of entries had directly conflicting assignments of exons and introns that did not stem from alternative splicing. In a few cases the errors are mere typographical misprints, which may be corrected by comparison with the original papers, but errors caused by wrong assignments of splice sites from experimental data are the most common. We propose that the level of error correction be increased and that gene structure sanity checks be incorporated, also at the submitter level, to avoid or reduce the problem in the future. A non-redundant and error-corrected subset of the data for A. thaliana is made available through anonymous FTP.
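The abstract does not list the individual sanity checks, so the following is only a minimal sketch of what a gene structure check of this kind might look like, not the authors' procedure. It assumes exons are given as 1-based inclusive coordinates on the genomic sequence, that introns are expected to begin with GT (or, more rarely, GC) and end with AG, and that a minimum intron length applies. The function name `check_gene_structure`, its parameters, and the default threshold of 50 nt are hypothetical placeholders chosen for illustration.

```python
from typing import List, Sequence, Tuple


def check_gene_structure(
    seq: str,
    exons: Sequence[Tuple[int, int]],
    min_intron_len: int = 50,                # arbitrary placeholder, not a value from the paper
    donors: Tuple[str, ...] = ("GT", "GC"),  # assumed acceptable 5' intron dinucleotides
    acceptor: str = "AG",                    # assumed 3' intron dinucleotide
) -> List[str]:
    """Return a list of human-readable problems found in an exon annotation.

    `exons` holds (start, end) pairs, 1-based and inclusive, in 5'->3' order.
    An empty list means that none of the checks below fired.
    """
    seq = seq.upper()
    problems: List[str] = []

    # Check 1: every exon must lie inside the sequence and have start <= end.
    for i, (start, end) in enumerate(exons, start=1):
        if not (1 <= start <= end <= len(seq)):
            problems.append(f"exon {i}: coordinates {start}..{end} are out of range")
    if problems:
        return problems  # the intron checks would be meaningless

    # Checks 2-4: inspect the intron implied by each pair of adjacent exons.
    for i in range(len(exons) - 1):
        intron_start = exons[i][1] + 1      # first intron base, 1-based
        intron_end = exons[i + 1][0] - 1    # last intron base, 1-based
        label = f"intron {i + 1} ({intron_start}..{intron_end})"

        if intron_end < intron_start:
            problems.append(f"{label}: adjacent exons overlap or touch")
            continue

        intron = seq[intron_start - 1:intron_end]
        if len(intron) < min_intron_len:
            problems.append(f"{label}: length {len(intron)} is below {min_intron_len}")
        if intron[:2] not in donors:
            problems.append(f"{label}: unexpected donor site '{intron[:2]}'")
        if intron[-2:] != acceptor:
            problems.append(f"{label}: unexpected acceptor site '{intron[-2:]}'")

    return problems


if __name__ == "__main__":
    # Toy sequence: exon (6 nt) + intron (14 nt, GT...AG) + exon (6 nt).
    toy = "ATGGCC" + "GT" + "A" * 10 + "AG" + "TTTTAA"
    print(check_gene_structure(toy, [(1, 6), (21, 26)], min_intron_len=10))
    # Shifting the first exon boundary by one base misplaces the donor site.
    print(check_gene_structure(toy, [(1, 7), (21, 26)], min_intron_len=10))
```

A check like this flags only annotations that are internally inconsistent with the sequence. It cannot tell a mis-assigned splice site from a genuine non-canonical one, so flagged entries would still need comparison with the original publication.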



Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press

