Format of Sequence Record: User Question and Answer
Problem Summary:

Compare the Primary (Archival) vs. Reference (Curated) Sequence Records for human MLH1

  Sample User Question
Analysis/Comments
Highlights
Additional Notes
 

Sample User Question

 
I found two mRNA records in Entrez Nucleotide for the human MLH1 gene. They have similar sequence data and have at least one reference in common. The corresponding protein translations are also similar to each other. Why is there apparent duplication in the database, and what are the similarities and differences between the records?
 
                Primary (Archival)    Reference (Curated)
                Sequence Record       Sequence Record
Nucleotide      U07343                NM_000249
Protein         AAC50285              NP_000240

  • Each accession number in the table above links to a slide in the module on sequence record format and opens in a separate window.
  • The module slides contain static images of the sequence records. Those images were captured deliberately to preserve a snapshot of the records, and of the relationships among them, at one point in time. The comments below pertain to the static images; current versions of the records are available live in Entrez. Where relevant, comments about the current versions are also noted below.
 

Analysis/Comments

The Entrez Nucleotide data domain includes sequence records from a number of different source databases. Therefore, some of the sequence records you retrieve for a gene/protein might be from an archival database (e.g., GenBank) and others might be from a curated database (e.g., RefSeq). Because a curated record is generally based on a representative archival sequence, there will be some apparent redundancy between the two versions.

An archival record reflects the work of a single laboratory and can only be edited/updated by members of that lab. It is primary research data. A curated record contains information that is drawn together by a third party (a curator) from a variety of sources and encapsulates the knowledge available for a single gene. It is similar to a review article.

There will be a number of close similarities between a RefSeq record and the corresponding source GenBank record that served as the seed for the curated version. The curated record, however, will contain value-added information not present in the archival record. (If a RefSeq record is still in "provisional" status, it will very closely resemble the source GenBank record. More differences will be apparent when the RefSeq record changes to "reviewed" status.)

This exercise compares an archival and curated mRNA record for MLH1, and the corresponding archival vs. curated protein records.
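
The two kinds of records in this exercise can be told apart by accession format alone: the RefSeq accessions (NM_000249, NP_000240) carry a two-letter prefix followed by an underscore, while the GenBank/GenPept accessions (U07343, AAC50285) have no underscore. The Python sketch below illustrates that check. The function name `is_refseq` is made up for this example, and the pattern is an assumption that only separates RefSeq from non-RefSeq accessions; it does not validate accessions in general and says nothing about other curated databases such as Swiss-Prot.

```python
import re

# Assumed pattern: two uppercase letters, an underscore, then digits,
# with an optional ".version" suffix (e.g. NM_000249 or NM_000249.2).
REFSEQ_PATTERN = re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$")

def is_refseq(accession: str) -> bool:
    """Return True if the accession looks like a RefSeq (curated) accession."""
    return REFSEQ_PATTERN.match(accession) is not None

# The four accessions compared in this exercise.
for acc in ["U07343", "AAC50285", "NM_000249", "NP_000240"]:
    kind = "curated (RefSeq)" if is_refseq(acc) else "archival (GenBank/GenPept)"
    print(f"{acc}: {kind}")
```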

The links in the table above lead to static images of the records in the "Format of Sequence Record" module of the course. Each slide also contains a link to the live record in Entrez, which might be different (if it was updated in any way) from the static image.

For the purposes of this exercise, we will compare the static images. Begin by following the link for U07343, which will open in a separate window. You can simultaneously view the corresponding comments in the highlights section of this window. Then progress through the other sequence records one at a time, simultaneously viewing the corresponding comments on this page.
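
If you prefer to pull the live records as plain text instead of browsing to them, one option (not part of the original exercise) is the NCBI E-utilities EFetch service. The sketch below only builds the request URLs and does not contact the network. The helper name `efetch_url` is hypothetical; `nuccore` and `protein` are the E-utilities database names, and `gb` / `gp` request the GenBank and GenPept flat-file formats. Keep in mind that the live records may differ from the static images discussed here.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(db: str, accession: str, rettype: str) -> str:
    """Build an EFetch URL that returns one record as a text flat file."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"

# (database, accession, flat-file format) for the four records in the table.
RECORDS = [
    ("nuccore", "U07343", "gb"),
    ("protein", "AAC50285", "gp"),
    ("nuccore", "NM_000249", "gb"),
    ("protein", "NP_000240", "gp"),
]

for db, acc, rettype in RECORDS:
    print(efetch_url(db, acc, rettype))
```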

Highlights

Archival (GenBank) mRNA: U07343
  • mRNA, 2484 bp
  • gene symbol:
    • shows an older gene symbol, hMLH1, in the Definition field, instead of the current official gene symbol of MLH1.
    • Gene symbols shown in archival records are the ones placed in the records by the submitters, who may or may not check with a nomenclature committee for the organism in question to find out the current official gene symbol. Also, official gene symbols might change over time, and records submitted in earlier years but not subsequently updated might contain older versions of gene symbols.
  • References:
    • includes a single published reference, reflecting the work of the lab that submitted this particular sequence record
  • Comment field
    • does not include a Comment field
    • some archival records do include a Comment field, but it tends not to contain the types of value-added information described in the Comment field of the RefSeq record, below
  • Features:
    • shows a limited number/variety of biological features -- only those which the submitters annotated on their sequence at time of submission or last update
    • CDS feature includes a cross-reference/link to the Protein record (AAC50285) that contains the amino acid translation
  • live version of record in Entrez has been updated since the static image was captured for the module slides.
    • A more recent modification date shows the record was touched in some way on the date indicated. In this case, the gene symbol in the Definition line has changed from the older gene symbol (hMLH1) to the current official gene symbol (MLH1). The gene symbol in the title of the cited article, however, did not change (since the article title is static). Also, if the GI number (463988) in the record has not changed, that means the sequence data have not changed in any way.
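
The reasoning in the last point (an unchanged GI number means unchanged sequence data, even when annotation such as the Definition line has been edited) can be sketched as a comparison of two snapshots of the same record. Everything here is illustrative: the `Snapshot` class and `describe_update` function are hypothetical, the definition strings are placeholders rather than the real Definition lines, and the "live" snapshot simply assumes the GI is still 463988.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Snapshot:
    accession: str
    gi: int
    definition: str  # placeholder text, not the real Definition line

def describe_update(old: Snapshot, new: Snapshot) -> List[str]:
    """Summarize what changed between two snapshots of one record."""
    notes = []
    if old.gi == new.gi:
        notes.append("GI unchanged: sequence data have not changed")
    else:
        notes.append("GI changed: sequence data were revised")
    if old.definition != new.definition:
        notes.append("Definition line was edited (annotation-only change)")
    return notes

static_image = Snapshot("U07343", 463988, "<definition using hMLH1>")
live_record = Snapshot("U07343", 463988, "<definition using MLH1>")  # assumed GI

for note in describe_update(static_image, live_record):
    print(note)
```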

Archival (GenPept) Protein: AAC50285
  • 756 aa
  • DBSOURCE field
    • shows the accession of the GenBank record (U07343) from which the protein translation was obtained
  • References:
    • includes a single published reference, reflecting the work of the lab that submitted this particular sequence record
  • Comment field
    • only shows how the protein sequence data was obtained (conceptual translation)
  • Features:
    • shows a limited number/variety of biological features -- only those which the submitters annotated on their sequence at time of submission or last update
  • live version of record in Entrez has been updated since the static image was captured for the module slides.

Curated (RefSeq) mRNA: NM_000249
  • The 05-NOV-2002 version of the mRNA sequence record NM_000249, shown in the module slides, has 2484 bp (same as the source GenBank record, U07343, from which the sequence data was drawn)
  • gene symbol:
    • shows official gene symbol, MLH1, in the Definition field and Features/gene field
    • shows alternate gene symbols, or synonyms, in the Features/gene/note field
  • References:
    • many references, reflecting the work of many different labs, so RefSeq record is similar to a review article
  • Comment field is always present:
    • shows "REVIEWED" status
    • contains cross-reference to the source GenBank record that served as the seed (i.e., from which the sequence data was drawn) for the RefSeq record
    • includes a summary of gene function (value-added information added by RefSeq curators)
    • note: records that are in a "PROVISIONAL" status more closely resemble their source GenBank records until they go through the review process
  • Features:
    • shows a larger number/variety of biological features than those annotated by the submitters of the source GenBank record; these annotations have been added by the RefSeq curators
    • CDS feature includes a cross-reference/link to the Protein record (NP_000240) that contains the amino acid translation
  • live version of record in Entrez has been updated since the static image was captured for the module slides
    • now has 2524 bp, and the Comment field of the updated record shows the accessions of the GenBank records from which the additional sequence data was taken
    • location (base span) of CDS feature reflects the position of the CDS on the updated, longer sequence
    • other features have been added/deleted/changed
    • new references have been added to the record since the 05-NOV-2002 version that is shown in the module slides, and a more complete bibliography is available in the corresponding Entrez Gene record (accessible from the "Links" menu of the live Entrez Nucleotide record)
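
A little arithmetic shows how the lengths quoted above fit together. The sketch assumes the CDS is complete and ends with a stop codon, so a 756 aa protein corresponds to 756 × 3 + 3 nucleotides of coding sequence; the remainder of each mRNA is untranslated sequence. It does not assume anything about where on the mRNA the additional bases of the live record were added.

```python
PROTEIN_LENGTH_AA = 756   # AAC50285 and NP_000240
MRNA_2002_BP = 2484       # NM_000249, 05-NOV-2002 version (same as U07343)
MRNA_LIVE_BP = 2524       # NM_000249, live version described above

# Assumption: complete CDS including one stop codon.
cds_bp = PROTEIN_LENGTH_AA * 3 + 3

untranslated_2002 = MRNA_2002_BP - cds_bp   # 5' + 3' UTR combined
untranslated_live = MRNA_LIVE_BP - cds_bp
added_bp = MRNA_LIVE_BP - MRNA_2002_BP

print("CDS length (bp):", cds_bp)                              # 2271
print("Untranslated bases, 2002 version:", untranslated_2002)  # 213
print("Untranslated bases, live version:", untranslated_live)  # 253
print("Bases added in the update:", added_bp)                  # 40
```

Because the protein length is the same in both versions, the extra 40 bases fall outside the coding region, which is consistent with the CDS feature changing its base span while the translation stays at 756 aa.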

Curated (RefSeq) Protein: NP_000240
  • 756 aa (same as the archival protein, above)
  • DBSOURCE field
    • shows the accession of the RefSeq record (NM_000249) from which the protein translation was obtained
  • References:
    • many references, reflecting the work of many different labs (same set of references as the corresponding RefSeq mRNA)
  • Comment field
    • shows same type of value-added information as the corresponding RefSeq mRNA
  • Features:
    • shows a larger number/variety of biological features than those annotated in the archival GenPept record
  • live version of record in Entrez has been updated since the static image was captured for the module slides.
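
The cross-references described in the highlights form two reciprocal pairs: each mRNA record's CDS feature points to its protein record, and each protein record's DBSOURCE field points back to that mRNA. The sketch below records those links in two dictionaries and checks that they agree and that each pair stays within one database family. The function names are hypothetical, and the underscore test is the same simplifying assumption about RefSeq accession format noted for this exercise's four accessions only.

```python
# mRNA accession -> protein accession (from the CDS feature cross-reference)
CDS_LINK = {"U07343": "AAC50285", "NM_000249": "NP_000240"}

# protein accession -> mRNA accession (from the DBSOURCE field)
DBSOURCE = {"AAC50285": "U07343", "NP_000240": "NM_000249"}

def links_are_reciprocal(cds_link: dict, dbsource: dict) -> bool:
    """True if every CDS link is mirrored by the protein's DBSOURCE entry."""
    return all(dbsource.get(prot) == nuc for nuc, prot in cds_link.items())

def same_family(nuc: str, prot: str) -> bool:
    """True if both accessions are RefSeq-style (underscore) or both are not."""
    return ("_" in nuc) == ("_" in prot)

print("Reciprocal links:", links_are_reciprocal(CDS_LINK, DBSOURCE))
for nuc, prot in CDS_LINK.items():
    print(f"{nuc} <-> {prot}: same database family = {same_family(nuc, prot)}")
```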

A curated Swiss-Prot record, P40692, also exists for MLH1, in addition to the records above:
  • 756 aa (same as the archival protein, above)
  • DBSOURCE field
    • cross-references (xrefs) include gi: 463988 (the sequence identifier for U07343) and gi: 463989 (the sequence identifier for AAC50285). So the Swiss-Prot curators also used U07343 as part of the foundation for their version of a curated protein record.
    • Also includes xrefs to many other gi numbers
  • References:
    • many references, reflecting the work of many different labs; not necessarily the same set of references that you find in the RefSeq record, because Swiss-Prot and RefSeq are separate databases curated by different groups
  • Comment field
    • usually written in upper-case letters, characteristic of Swiss-Prot records
    • shows a structured set of value-added information, including function, subunit, subcellular location, tissue specificity, disease, and more
  • Features:
    • extensive set of biological annotations that might differ from those in the RefSeq protein record, so the two databases can be used in a complementary way
  • live version of record in Entrez has been updated since the static image was captured for the module slides.

Additional Notes

The Types of Databases module of the course provides additional information about the differences between various types of databases, including archival and curated.


Revised 11/01/2007