Entrez: User Question and Answer
Problem Summary:

Find a concise summary of genes pertaining to human colon cancer.

  Sample User Question
Analysis/Comments
Step By Step Guide
Additional Notes
 

Sample User Question

 
I would like to retrieve a concise, non-redundant list of human mRNA sequences associated with colon cancer.
 

NOTE: In the Entrez module, this demonstration is spread across a number of slides that show how to search for nucleotide sequences related to human colon cancer via basic, advanced#1, advanced#2, and complex Boolean methods, and how to use Entrez features such as Limits. Below, the demonstration is consolidated into a single page.

Analysis/Comments

Challenge: separate the wheat from the chaff by retrieving a representative, non-redundant, well-annotated set of mRNA sequence records.

An Entrez data domain usually encompasses data from several different source databases. Our goal is to identify a representative, well-annotated mRNA sequence record among the many available in the Entrez Nucleotide data domain.

The Entrez Nucleotide domain includes sequence records from the archival GenBank database, the curated RefSeq database, nucleotide sequences extracted from Protein Data Bank (PDB) records, and a new Third-Party Annotation (TPA) database. As a result, an unrefined search can retrieve records of varying quality (in both sequence and annotation), and there can be a high degree of redundancy in search results, depending upon how many labs have submitted sequence data for a gene or its fragments.

For example, an unqualified search of Entrez Nucleotide for human colon cancer currently retrieves >10,000 hits. The results include archival and curated records, characterized sequences, and lower-quality sequences such as expressed sequence tags (ESTs), contigs from the genome project, sequences from patents, and more.

Step By Step Guide

RESOURCES USED: Entrez Nucleotide, RefSeq
FEATURES HIGHLIGHTED: Limits - source database, molecule type

As noted above, an unqualified search of Entrez Nucleotide for human colon cancer currently retrieves >10,000 hits of widely varying quality and considerable redundancy.

A "Limits" option allows you to restrict your search, if desired, to a specific data subset, such as the curated, non-redundant RefSeq database. It also allows you to limit searches to specific data fields, to retrieve only records with certain attributes, such as molecule type, and to exclude records that are often unwanted, such as ESTs, which are typically numerous and of lower sequence and annotation quality than characterized genes.
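The Limits choices described above can also be written as field-qualified terms in the query itself. The following is a minimal Python sketch of that idea, not part of the original tutorial. The function name `apply_limits` is hypothetical, and the property terms `srcdb_refseq`, `biomol_mrna`, and `gbdiv_est` are assumed to be the Entrez Nucleotide filters corresponding to the RefSeq, mRNA, and EST limits; check them against the current Entrez help before relying on them.

```python
def apply_limits(terms, field=None, refseq_only=False,
                 mrna_only=False, exclude_est=False):
    """Build an Entrez query string that mimics a few Limits choices.

    The [Properties] terms below are assumptions about the filter
    names used by Entrez Nucleotide, not values taken from this page.
    """
    # Restrict the search terms to one data field, e.g. Title.
    base = f"{terms}[{field}]" if field else terms
    parts = [base]
    if refseq_only:
        parts.append("srcdb_refseq[Properties]")  # curated RefSeq subset
    if mrna_only:
        parts.append("biomol_mrna[Properties]")   # molecule type: mRNA
    query = " AND ".join(parts)
    if exclude_est:
        # Exclude expressed sequence tags from the result set.
        query = f"({query}) NOT gbdiv_est[Properties]"
    return query


print(apply_limits("colon cancer", field="Title", refseq_only=True))
```

The resulting string could be pasted into the Entrez search box in place of setting the same options on the Limits page.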

In this case, if we use the Limits page to restrict our colon cancer search to the Title field and then only to records from RefSeq, our retrieval narrows to 31 hits. If we then do a new search for human in the Organism field and use the History option to combine the two searches with a Boolean AND, we retrieve 13 hits -- far fewer and far more specific results than our original >10,000.
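The two-step search and the History combination can be imitated in a short script. This is a sketch under stated assumptions: the `SearchHistory` class is a hypothetical stand-in for the web History page, the `srcdb_refseq[Properties]` term is an assumed equivalent of the RefSeq limit, and the URL targets the NCBI E-utilities `esearch` service, which the tutorial itself does not use. The script only builds the query and URL; it does not contact NCBI, and current hit counts will differ from the 31 and 13 quoted above.

```python
import re
from urllib.parse import urlencode

# E-utilities search endpoint (an assumption; the tutorial uses the web form).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


class SearchHistory:
    """Hypothetical stand-in for the numbered searches on the History page."""

    def __init__(self):
        self._queries = []

    def add(self, query):
        """Record a search and return its 1-based history number."""
        self._queries.append(query)
        return len(self._queries)

    def combine(self, expression):
        """Expand '#n' references into the parenthesized stored queries."""
        def expand(match):
            return f"({self._queries[int(match.group(1)) - 1]})"
        return re.sub(r"#(\d+)", expand, expression)


history = SearchHistory()
first = history.add("colon cancer[Title] AND srcdb_refseq[Properties]")
second = history.add("human[Organism]")

# Combine the two searches with a Boolean AND, as on the History page.
combined = history.combine(f"#{first} AND #{second}")
url = f"{ESEARCH_URL}?{urlencode({'db': 'nucleotide', 'term': combined})}"

print(combined)
print(url)
```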

In addition, because each RefSeq record presents an encapsulation of the knowledge about a single gene or splice variant, rather than the work of an individual laboratory, each hit is similar to a review article.

For this example, we will more closely examine NM_000249: Homo sapiens mutL homolog 1 (MLH1) and the additional information we can retrieve for that gene in Entrez.
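For readers who prefer a script to the web interface, the NM_000249 record could also be requested through the NCBI E-utilities `efetch` service. The sketch below only builds the request URL and is not part of the original tutorial; the helper name `efetch_url` is hypothetical, and the choice of `rettype=gb` with `retmode=text` (a GenBank flat file as plain text) is an assumption about the most useful format here.

```python
from urllib.parse import urlencode

# E-utilities fetch endpoint (an assumption; the tutorial uses the web pages).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="gb", retmode="text"):
    """Return an efetch URL for one nucleotide record by accession."""
    params = {
        "db": "nucleotide",   # Entrez Nucleotide domain
        "id": accession,      # e.g. the RefSeq accession NM_000249
        "rettype": rettype,   # record format
        "retmode": retmode,   # plain text rather than XML
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


print(efetch_url("NM_000249"))
```

Opening the printed URL in a browser, or passing it to any HTTP client, should return the flat-file view of the record.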

Additional Notes

Of course, a search for MLH1, rather than colon cancer, would have worked as well, and the same techniques could have been used to narrow the search results. Gene symbol searching, however, can sometimes be less reliable if a gene has been known by numerous aliases. Although curated RefSeq records include the official gene symbol as well as the aliases, archival records, such as those in GenBank, include only the gene symbol that the authors used at the time of submission or last update.

This exercise is also narrated as part of the Entrez tutorial article:
  • Geer RC, Sayers EW. 2003. Entrez: making use of its power. Brief Bioinform., 4(2):179-84 (June). PMID: 12846398
    The Entrez Tutorial page provides a brief summary of the article and a link to the full text *.pdf file.
Please note that the search results (number of hits) noted in the article reflect the data that were available as of March 2003. The number of search hits will change as the databases grow, but the general search concepts will continue to apply.


Revised 10/31/2007