
Retrieve only sequences from patents


Sample User Question

 
I'd like to retrieve nucleotide sequences from patents associated with cystic fibrosis. I am interested in sequences from all molecule types, not just mRNAs.
 

Comments / Analysis

The Limits page of the Entrez CoreNucleotide database allows you to easily exclude certain categories of sequence data, such as STSs, working draft high throughput genomic sequences, TPAs, and patents. However, the Limits page does not allow you to include, or retrieve only, those types of sequences. This exercise shows how to do that.

While some users prefer to exclude patent sequences because they are often short with little or no biological annotation, other users might specifically be interested in them. Several search methods are shown below to demonstrate a few Entrez tips and tricks. If you are short on time, just do the first. The first three methods retrieve the same results using different techniques. The fourth method adds a synonym to the query, so it retrieves more records.

This exercise also deliberately focuses on patent sequences to bring up a discussion, under additional tips, about the limited scope of patent data in GenBank/EMBL/DDBJ and the recommendation to also consult external databases specializing in patent sequences as needed. There is also a tip on how to search by patent number.

Step By Step Guide

There are four different ways you can search an individual Entrez database, as described in the module slides. The examples below demonstrate the advanced search pages (Limits, Details, and Preview/Index) as well as a complex Boolean query.

Sample method 1:
  1. open Entrez CoreNucleotide - use the Limits page checkbox to exclude patents, then use the "Details" page to change the Boolean operator from NOT to AND, thereby retrieving only the patents


    • enter cystic fibrosis in the text box
    • select the Limits option beneath the search box
    • check the box to exclude patents
    • press Go
    • select the Details option beneath the search box to see exactly how Entrez parsed the query
    • in the Query Translation text box, change the Boolean NOT to AND (just by typing over the "NOT", so the last part of the query becomes AND gbdiv_pat[PROP] )
    • press the Search button under the Query Translation box
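The edit made in the Query Translation box can also be sketched as a small string operation. This is only an illustration: the function name is hypothetical, and the sample translation string is an assumed, simplified form of what the Details page might show, not a copy of real Entrez output.

```python
import re

def not_to_and_patent_filter(translation: str) -> str:
    """Change a trailing 'NOT gbdiv_pat[PROP]' into 'AND gbdiv_pat[PROP]'.

    Only the final patent clause is touched, mirroring the manual edit
    of typing over "NOT" in the Query Translation box.
    """
    pattern = re.compile(r"\bNOT(\s+gbdiv_pat\[PROP\])\s*$", re.IGNORECASE)
    return pattern.sub(r"AND\1", translation)

# Assumed, simplified translation string (hypothetical example input).
example = "cystic fibrosis[All Fields] NOT gbdiv_pat[PROP]"
print(not_to_and_patent_filter(example))
```

A query without the trailing patent clause is returned unchanged.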

Sample method 2:
  1. open Entrez CoreNucleotide - use the Preview/Index page and Properties field to add the criterion to your query that all retrieved sequences must belong to the Patent division of GenBank


    • enter cystic fibrosis in the text box
    • select the Preview/Index option beneath the search box
    • select the Properties field from the search field pop-up menu near the bottom of the page
    • press the Index button to browse the index of that field
    • scroll down the index until you see terms that begin with "gbdiv", for GenBank division
    • select gbdiv pat, which is the Patent division
    • press the AND button to add that search term to the active query at the top of the page
    • press Go

Sample method 3:
  1. open Entrez CoreNucleotide - do your search in a single step with a complex Boolean query


    • enter the following query into the text box: cystic fibrosis[all] AND gbdiv_pat[prop]
    • press Go
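If you prefer to script this search, the same one-step query can be assembled as a string and placed in a URL. The sketch below assumes the NCBI E-utilities ESearch service and the database name "nucleotide", neither of which is part of this exercise; the function names are hypothetical. It only builds the URL and sends no request.

```python
from urllib.parse import urlencode

# Assumption: the E-utilities ESearch endpoint, not the web page used above.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def patent_only_query(topic: str) -> str:
    """Build the method 3 query: topic in all fields, Patent division only."""
    return f"{topic}[all] AND gbdiv_pat[prop]"

def esearch_url(term: str, db: str = "nucleotide", retmax: int = 20) -> str:
    """Return an ESearch URL for the term (db and retmax values are assumptions)."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"

query = patent_only_query("cystic fibrosis")
print(query)
print(esearch_url(query))
```

Building the query in one helper keeps the patent restriction, gbdiv_pat[prop], in a single place if you run the search for several topics.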

Sample method 4:
  1. open Entrez CoreNucleotide - do a complex Boolean query, as above, but also include a synonym to broaden retrieval


    • enter the following query into the text box: cystic fibrosis[all] OR CFTR[all] AND gbdiv_pat[prop]
    • press Go
      Tips: Boolean operators are processed from left to right unless parentheses are used for nesting. There is no need to add parentheses in this case because the left to right processing follows the logic necessary for our query. Other synonyms and search fields can be used in the query as well, if desired.
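To see the grouping that left-to-right processing implies, the short sketch below adds the implicit parentheses to a flat query. It is an illustration only, not how Entrez itself parses queries: the function name is hypothetical, and it assumes uppercase operators and a query that contains no parentheses of its own.

```python
import re

def group_left_to_right(query: str) -> str:
    """Add the parentheses implied by left-to-right Boolean processing."""
    # Split on uppercase operators, keeping them:
    # [term, operator, term, operator, term, ...]
    parts = re.split(r"\s+(AND|OR|NOT)\s+", query.strip())
    grouped = parts[0]
    for i in range(1, len(parts) - 1, 2):
        operator, term = parts[i], parts[i + 1]
        if i > 1:
            # Everything combined so far acts as a single unit.
            grouped = f"({grouped})"
        grouped = f"{grouped} {operator} {term}"
    return grouped

print(group_left_to_right("cystic fibrosis[all] OR CFTR[all] AND gbdiv_pat[prop]"))
```

For the method 4 query, the two synonyms are combined with OR first, and the patent restriction then applies to both of them.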

Additional Tips

Scope of Patents in GenBank/EMBL/DDBJ

While some users prefer to exclude patent sequences because they are often short with little or no biological annotation, other users might specifically be interested in them, for example because they are considering a patent application and want to see which sequences are already associated with patents. If that is the case, please note that the International Nucleotide Sequence Database Collaboration (GenBank/EMBL/DDBJ) contains only the sequences from patents that were provided by the U.S.P.T.O., the European patent office, and the Japanese patent office. Therefore, the user should also consult other databases that obtain data from a larger number of patent authorities and so provide more exhaustive coverage of sequences from patents. An example of such a database is Derwent's GENESEQ. Another resource that might be of interest is the U.S.P.T.O. "Publication Site for Issued and Published Sequences (PSIPS)."

Searching for a specific patent number

If a user wants to retrieve all the sequences associated with a specific patent number, they can search the Accession field with a query formatted as:

two-letter prefix (US, DE, WO, or JP), SPACE, the digits of the patent number, followed by the [accn] field tag
For example:
US 5472872[accn]
Note that some of the sequences from the patent might be in the Entrez Nucleotide database while others might be in the Entrez Protein database (as is the case with the sample patent number given above), so it is sometimes helpful to search both.
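The query format above is easy to generate in a script. The following is a minimal sketch with a hypothetical function name; it checks only that the prefix is two letters and the number is all digits, and does not verify that the patent exists or that the prefix is one of the four listed above.

```python
def patent_accession_query(prefix: str, number: str) -> str:
    """Format a patent number as an Accession-field query, e.g. 'US 5472872[accn]'."""
    prefix = prefix.strip().upper()
    number = number.strip()
    if len(prefix) != 2 or not prefix.isalpha():
        raise ValueError(f"expected a two-letter prefix, got {prefix!r}")
    if not number.isdigit():
        raise ValueError(f"expected digits only, got {number!r}")
    # Prefix, one space, digits, then the Accession field tag.
    return f"{prefix} {number}[accn]"

print(patent_accession_query("US", "5472872"))
```

The same string can be entered in both the Nucleotide and Protein databases, since sequences from one patent may be split between them.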

Just for your reference, in BLAST search results, the SeqID string for patent records is shown in the following format:

pat|US|5472872|1
where the last number indicates the serial number of the sequence within the patent. For example, there are 13 sequences associated with US 5472872, so the last number ranges from 1 to 13 (with numbers 4 and 13 in the Protein database while the rest are in the Nucleotide database).
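If you post-process BLAST results, a SeqID in this format can be split into its parts. The sketch below assumes exactly the four-field, pipe-delimited layout shown above; the class and function names are hypothetical.

```python
from typing import NamedTuple

class PatentSeqId(NamedTuple):
    office: str           # two-letter prefix, e.g. "US"
    patent_number: str    # kept as text, as it appears in the SeqID
    sequence_number: int  # serial number of the sequence within the patent

def parse_patent_seqid(seqid: str) -> PatentSeqId:
    """Parse a SeqID such as 'pat|US|5472872|1' into its components."""
    fields = seqid.strip().split("|")
    if len(fields) != 4 or fields[0] != "pat":
        raise ValueError(f"not a patent SeqID: {seqid!r}")
    _, office, number, serial = fields
    return PatentSeqId(office, number, int(serial))

parsed = parse_patent_seqid("pat|US|5472872|1")
print(parsed.office, parsed.patent_number, parsed.sequence_number)
```

The office and patent number fields can then be combined into an Accession-field query of the form described earlier, such as US 5472872[accn].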


Revised 08/03/2007