Introduction
The Discrepancy Report evaluates one or more ASN.1 files, looking for suspicious annotation or annotation discrepancies that NCBI staff have noticed commonly occur in genome submissions, both complete and incomplete (WGS). The problems that this function was written to find include inconsistent locus_tag prefixes, missing protein_ids, missing gene features, and suspect product names. The function is available in a specially configured Sequin, which evaluates a single file at a time; multiple files can be evaluated at once with the command-line program asndisc.
If you have questions about the Discrepancy Report, please contact us by email at genomes@ncbi.nlm.nih.gov prior to creating your submission.
Using Sequin
Sequin can be configured to have the Discrepancy Report function available by following the "Configuring Sequin for HTG Submissions" directions in the HTG section. "Discrepancy Report" will then be present as an option in the Special menu.
When you run the Discrepancy Report within Sequin, a box will pop up with the results. The left-hand frame lists the problems, and the right-hand frame lists the features with the selected problem. Selecting a feature in the right-hand frame will cause Sequin to jump to that feature in the standard display. Double-clicking on a feature in the right-hand frame will open that feature's editor, so that changes can be made to it. After making all the changes, click the "Recheck" button to have the Discrepancy Report run again. Be sure to save your changes with File-Save or File-Save as.
Using asndisc
The command-line program asndisc is available by anonymous FTP. Copy the right version for your platform, then uncompress the file, rename it to "asndisc", and set the permissions as necessary for the platform.
asndisc examines all the files with a common suffix in a directory and collates all the discrepancies into an output file. Each problem in the output file is prefaced with DiscRep, so that the types of problems can be easily found. The standard usage runs all of the tests, but specific tests can be enabled or disabled. In addition, expanded reports of particular tests can be generated. Running "asndisc -" provides the list of arguments.
This is the recommended usage:
- Run asndisc on the files created by tbl2asn.
The recommended command line for files whose names end with .sqn in the directory DIR is:
- asndisc -p DIR/ -x .sqn -o discrep
- Review the types of problems in the output file. Each type of problem is sorted by category and begins with the word "DiscRep", so retrieving the lines that contain "DiscRep" produces a summary. For example, if the output file is named discrep, you can list all of the problem types in the report with
- grep DiscRep discrep
- Look at the problematic features in the output file and examine those features in the .sqn files to determine whether the problems are real and need to be corrected, or can be ignored because the situation reflects the biology.
- If there are SUSPECT_PRODUCT_NAMES, run asndisc again specifically to check the product names (with -e) and to generate an expanded list (with -X):
- asndisc -p DIR/ -x .sqn -o disc.prod -X -e SUSPECT_PRODUCT_NAMES
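The steps above could also be scripted. The following is a minimal Python sketch, not part of asndisc itself. The helper names build_asndisc_args and discrep_lines are hypothetical, and the only flags used (-p, -x, -o, -X, -e) are the ones that appear in the commands shown above, in the same order.

```python
from typing import List, Optional


def build_asndisc_args(directory: str, suffix: str, output: str,
                       expand_test: Optional[str] = None) -> List[str]:
    """Build an asndisc argument list from the flags shown in this document."""
    args = ["asndisc", "-p", directory, "-x", suffix, "-o", output]
    if expand_test is not None:
        # Mirrors the second command above: expanded report for one test.
        args += ["-X", "-e", expand_test]
    return args


def discrep_lines(report_text: str) -> List[str]:
    """Return the lines that contain "DiscRep", like `grep DiscRep discrep`."""
    return [line for line in report_text.splitlines() if "DiscRep" in line]
```

The argument list could be passed to subprocess.run, assuming asndisc is installed and on the PATH, and discrep_lines could then be applied to the contents of the resulting output file.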
Examples
The Discrepancy Report is something of a blunt instrument that reports everything that fails its tests; it does not consider whether those failures are real problems or just a reflection of the biology. For example, here is a summary of the analysis of a submission, performed with the default settings of asndisc:
- DiscRep:58507 features have joined locations.
- DiscRep:8491 coding regions or mRNAs have inconsistent gene locations.
- DiscRep:2 coding regions have the same gene name as another coding region but a different product.
- DiscRep:1 features have EC numbers in notes or products.
- DiscRep:10 coding regions are completely contained in another coding region.
- DiscRep:76 product_names contain suspect phrase or characters
- DiscRep:8 cds comments or protein descriptions contain fragment or frameshift
- DiscRep:3 RNA features have no product
Since this was a eukaryotic organism with introns, the "features have joined locations" report is expected. Similarly, since the submitters have UTR information for some mRNAs, those mRNAs will extend beyond their CDS, generating "coding regions or mRNAs have inconsistent gene locations" reports. However, the other reports need to be investigated to determine whether they indicate a real problem with the annotation. For example, EC numbers need to be fielded in the EC_number qualifier. Similarly, RNA features (mRNA, tRNA, etc.) need to have products.
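This triage can be sketched in Python. The example below assumes that the summary lines have the "DiscRep:<count> <description>" shape shown in the list above; the function names and the EXPECTED_FOR_THIS_GENOME list are hypothetical, and which reports count as expected depends on the biology of the genome being submitted.

```python
import re
from typing import List, Tuple

# Assumed line shape, taken from the summary above: "DiscRep:<count> <description>"
SUMMARY_LINE = re.compile(r"DiscRep:(\d+)\s+(.*)")

# Hypothetical choice for the eukaryotic example discussed above.
EXPECTED_FOR_THIS_GENOME = [
    "features have joined locations",
    "inconsistent gene locations",
]

Report = Tuple[int, str]


def parse_summary(lines: List[str]) -> List[Report]:
    """Turn summary lines into (count, description) pairs; skip other lines."""
    parsed = []
    for line in lines:
        match = SUMMARY_LINE.search(line)
        if match:
            description = match.group(2).strip().rstrip(".")
            parsed.append((int(match.group(1)), description))
    return parsed


def split_expected(parsed: List[Report],
                   expected_phrases: List[str]) -> Tuple[List[Report], List[Report]]:
    """Separate reports matching an expected phrase from those to investigate."""
    expected, to_review = [], []
    for count, description in parsed:
        if any(phrase in description for phrase in expected_phrases):
            expected.append((count, description))
        else:
            to_review.append((count, description))
    return expected, to_review
```

Reports that land in the second list are the ones to examine in the .sqn files; the split is only a convenience for reading the summary and does not replace that review.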
Here is the summary of the expanded report that examined only SUSPECT_PRODUCT_NAMES:
- DiscRep:3 product names contain Fragment
- DiscRep:9 product names contain Chloroplast
- DiscRep:1 product names contain Mitochondrial
- DiscRep:6 product names contain may contain a plural
- DiscRep:56 product names contain Brackets or parenthesis [] ()
- DiscRep:1 product names contain partial
Again, review the names and fix those that are incorrect. Since this is a eukaryote, it is possible that some of these are nuclear genes encoding organellar proteins, so perhaps those reports should be ignored. In contrast, no product name should contain the word 'partial'. See the product name guidelines in the Prokaryotic and Eukaryotic annotation guidelines for recommended and inappropriate product name formats.
After you've run the Discrepancy Report and fixed the problematic annotation, tell us when you submit your genome which reports you think can be ignored, and why. If you are not certain whether a particular test is important for your genome, please ask us.
Discrepancy Report Tests
The available tests are:
- MISSING_GENES
- EXTRA_GENES
- MISSING_LOCUS_TAGS
- DUPLICATE_LOCUS_TAGS
- BAD_LOCUS_TAG_FORMAT
- INCONSISTENT_LOCUS_TAG_PREFIX
- NON_GENE_LOCUS_TAG
- MISSING_PROTEIN_ID
- INCONSISTENT_PROTEIN_ID
- FEATURE_LOCATION_CONFLICT
- GENE_PRODUCT_CONFLICT
- DUPLICATE_GENE_LOCUS
- EC_NUMBER_NOTE
- PSEUDO_MISMATCH
- JOINED_FEATURES
- OVERLAPPING_GENES
- OVERLAPPING_CDS
- CONTAINED_CDS
- RNA_CDS_OVERLAP
- SHORT_CONTIG
- INCONSISTENT_BIOSOURCE
- SUSPECT_PRODUCT_NAMES
- INCONSISTENT_SOURCE_DEFLINE
- PARTIAL_CDS_COMPLETE_SEQUENCE
- EC_NUMBER_ON_UNKNOWN_PROTEIN
- TAX_LOOKUP_MISSING
- TAX_LOOKUP_MISMATCH
- SHORT_SEQUENCES
- SUSPECT_PHRASES
- COUNT_TRNAS
- FIND_DUP_TRNAS
- FIND_BADLEN_TRNAS
- FIND_STRAND_TRNAS
- COUNT_RRNAS
- FIND_DUP_RRNAS
- RNA_NO_PRODUCT
- TRANSL_NO_NOTE
- NOTE_NO_TRANSL
- TRANSL_TOO_LONG
- CDS_TRNA_OVERLAP
- COUNT_PROTEINS
- DISC_FEAT_OVERLAP_SRCFEAT
The standard configuration of the Discrepancy Report within the genome-center-configured Sequin turns off these mitochondrial-related tests:
- COUNT_TRNAS
- FIND_DUP_TRNAS
- FIND_BADLEN_TRNAS
- COUNT_RRNAS
- FIND_DUP_RRNAS
- TRANSL_NO_NOTE
- NOTE_NO_TRANSL
- TRANSL_TOO_LONG
- CDS_TRNA_OVERLAP
- COUNT_PROTEINS
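To see which tests remain active under such a configuration, the two lists can be compared. The sketch below is only an illustration and does not describe how Sequin stores its configuration; the function name enabled_tests is hypothetical, and the sample lists are a subset of the RNA-related test names given above.

```python
from typing import List


def enabled_tests(all_tests: List[str], disabled: List[str]) -> List[str]:
    """Return the tests that are not disabled, keeping their original order."""
    disabled_set = set(disabled)
    return [name for name in all_tests if name not in disabled_set]


# Subset of the available tests, in the order listed above.
SAMPLE_TESTS = [
    "COUNT_TRNAS",
    "FIND_DUP_TRNAS",
    "FIND_BADLEN_TRNAS",
    "FIND_STRAND_TRNAS",
    "COUNT_RRNAS",
    "FIND_DUP_RRNAS",
    "RNA_NO_PRODUCT",
]

# Subset of the tests turned off in the standard configuration.
SAMPLE_DISABLED = [
    "COUNT_TRNAS",
    "FIND_DUP_TRNAS",
    "FIND_BADLEN_TRNAS",
    "COUNT_RRNAS",
    "FIND_DUP_RRNAS",
]

REMAINING = enabled_tests(SAMPLE_TESTS, SAMPLE_DISABLED)
```

For this subset, the remaining tests are FIND_STRAND_TRNAS and RNA_NO_PRODUCT, since neither appears in the turned-off list.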