Problem Summary:
Find a concise summary of genes pertaining to human colon cancer |
| Sample User Question |
 |
I would like to retrieve a concise, non-redundant list of human mRNA
sequences associated with colon cancer.
|
NOTE: In the Entrez module, this demonstration is spread across a number of
slides that show how to search for nucleotide sequences related to human colon
cancer via basic, advanced#1, advanced#2, and complex Boolean methods, and how
to use Entrez features such as Limits. Below, the demonstration is consolidated
into a single page.
|
| Analysis/Comments |
 |
Challenge: separate the wheat from the chaff by retrieving a representative,
non-redundant, well-annotated set of mRNA sequence records.
An Entrez data domain usually encompasses data from several different source
databases. Our goal is to identify a representative, well-annotated mRNA sequence
record among the many available in the Entrez Nucleotide data domain.
The Entrez Nucleotide domain includes sequence records from the archival GenBank
database, the curated RefSeq database, nucleotide sequences extracted from
Protein Data Bank (PDB) records, and a new Third-Party Annotation (TPA)
database. As a result, an unrefined search can retrieve records of varying quality
(in both sequence and annotation), and there can be a high degree of redundancy in
search results, depending upon how many labs have submitted sequence data for a
gene or its fragments.
For example, an unqualified search of Entrez Nucleotide for human colon cancer
currently retrieves >10,000 hits. The results include archival and curated
records, characterized sequences and lower quality sequences such as expressed
sequence tags (ESTs), contigs from the genome project, sequences from patents, and
more.
|
| Step By Step Guide |
 |
RESOURCES USED: Entrez Nucleotide, RefSeq
FEATURES HIGHLIGHTED: Limits - source database, molecule type
As noted above, an unqualified search of Entrez Nucleotide for human colon cancer
currently retrieves >10,000 hits, a mix of archival and curated records whose
sequence and annotation quality varies widely.
The "Limits" option allows you to restrict your search to a specific data
subset, such as the curated, non-redundant RefSeq database. It also allows you
to limit searches to specific data fields, to retrieve only records with
certain attributes, such as molecule type, and to exclude records that are
often unwanted, such as ESTs, which are typically numerous and of lower sequence
and annotation quality than characterized genes.
In this case, if we use the Limits page to restrict our colon cancer search to
the Title field and then only to records from RefSeq, our retrieval narrows to
31 hits. If we then do a new search for human in the Organism field and use the
History option to combine the two searches with a Boolean AND, we retrieve 13
hits -- far fewer and far more specific results than our original >10,000.
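The same two-step search can be approximated as a single query string with field tags and a Boolean AND. The following is a minimal sketch, not the tutorial's own method: it assumes the NCBI E-utilities `esearch` endpoint and the `srcdb_refseq[Properties]` filter as a stand-in for the RefSeq choice on the Limits page, and the helper names `field` and `combine_and` are hypothetical. It only builds the query and URL; it does not contact NCBI, and the hit counts returned today will differ from those quoted above.

```python
from urllib.parse import urlencode

# Assumed endpoint: the NCBI E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field(term, name):
    """Tag a term with an Entrez search field, e.g. colon cancer[Title]."""
    return f"{term}[{name}]"


def combine_and(*parts):
    """Join sub-queries with Boolean AND, parenthesising each one."""
    return " AND ".join(f"({p})" for p in parts)


# Search 1: "colon cancer" in the Title field, restricted to RefSeq records.
search_1 = combine_and(
    field("colon cancer", "Title"),
    field("srcdb_refseq", "Properties"),  # assumed equivalent of the RefSeq limit
)

# Search 2: "human" in the Organism field.
search_2 = field("human", "Organism")

# Combine the two searches, as the History option does with AND.
query = combine_and(search_1, search_2)
url = ESEARCH + "?" + urlencode({"db": "nucleotide", "term": query})

print(query)
print(url)
```

A molecule-type restriction could be added in the same way, for example by passing `field("biomol_mrna", "Properties")` as a further argument to `combine_and`.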
In addition, because each RefSeq record presents an encapsulation of the knowledge
about a single gene or splice variant, rather than the work of an individual
laboratory, each hit is similar to a review article.
For this example, we will more closely examine NM_000249: Homo sapiens mutL
homolog 1 (MLH1) and the additional information we can retrieve for that gene in
Entrez.
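For readers who prefer to pull the record programmatically, the sketch below builds a retrieval URL for NM_000249. It assumes the NCBI E-utilities `efetch` endpoint with `rettype=gb` and `retmode=text` to request a GenBank-style flat file; the function name `efetch_url` is hypothetical. The code only constructs the URL and leaves the actual download to the reader.

```python
from urllib.parse import urlencode

# Assumed endpoint: the NCBI E-utilities record retrieval service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="gb", retmode="text"):
    """Build a URL that requests one nucleotide record by accession."""
    params = {
        "db": "nucleotide",
        "id": accession,
        "rettype": rettype,   # "gb" asks for the GenBank flat-file view
        "retmode": retmode,   # plain text rather than XML
    }
    return EFETCH + "?" + urlencode(params)


url = efetch_url("NM_000249")
print(url)
```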
|
| Additional Notes |
 |
Of course, a search for MLH1, rather than colon cancer, would have worked as well,
and the same techniques could have been used to narrow the search results. Gene
symbol searching, however, can sometimes be less reliable if a gene has been known
by numerous aliases. Although curated RefSeq records include the official gene
symbol as well as the aliases, archival records, such as those in GenBank, include
only the gene symbol that the authors used at the time of submission or last
update.
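One way to reduce the alias problem is to search on the official symbol and its known aliases together, joined with OR. The sketch below is illustrative only: the function name `symbol_query` is hypothetical, the alias values are placeholders rather than real MLH1 aliases, and it assumes the `[Gene Name]` and `[Organism]` field tags of Entrez Nucleotide.

```python
def symbol_query(official, aliases=(), organism="human"):
    """Build an Entrez query matching a gene symbol or any of its aliases."""
    # Keep the official symbol first and drop duplicates, preserving order.
    names = list(dict.fromkeys([official, *aliases]))
    any_name = " OR ".join(f"{name}[Gene Name]" for name in names)
    return f"({any_name}) AND {organism}[Organism]"


# ALIAS_A and ALIAS_B are placeholders; substitute the aliases listed in the
# RefSeq record for the gene of interest.
print(symbol_query("MLH1", ["ALIAS_A", "ALIAS_B"]))
```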
This exercise is also narrated as part of the Entrez tutorial article:
- Geer RC, Sayers EW. 2003. Entrez: making use of its power.
Brief Bioinform. 4(2):179-84 (June). PMID: 12846398
The Entrez Tutorial page provides a brief summary of the article and a link to
the full text *.pdf file.
Please note that the search results (number of hits) noted
in the article reflect the data that were available as of March 2003. The number
of search hits will change as the databases grow, but the general search concepts
will continue to apply.
|
|