NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Entrez Programming Utilities Help [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2010-.

Entrez Direct: E-utilities on the Unix Command Line

, PhD.

Last Update: February 8, 2021.

Getting Started

Introduction

Entrez Direct (EDirect) provides access to the NCBI's suite of interconnected databases (publication, sequence, structure, gene, variation, expression, etc.) from a Unix terminal window. Search terms are entered as command-line arguments. Individual operations are connected with Unix pipes to allow construction of multi-step queries. Selected records can then be retrieved in a variety of formats.

Installation

EDirect will run on Unix and Macintosh computers, and under the Cygwin Unix-emulation environment on Windows PCs. To install the EDirect software, click on the download EDirect installer link to obtain the install-edirect.sh script, then execute it by running:

  source ./install-edirect.sh

Alternatively, open a terminal window and execute one of the following two commands:

  sh -c "$(curl -fsSL ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"

  sh -c "$(wget -q ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh -O -)"

or copy the following commands and paste them into a terminal window:

  
cd ~
/bin/bash
perl -MNet::FTP -e \
'$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1);
$ftp->login; $ftp->binary;
$ftp->get("/entrez/entrezdirect/edirect.tar.gz");'
gunzip -c edirect.tar.gz | tar xf -
rm edirect.tar.gz
builtin exit
export PATH=${PATH}:$HOME/edirect >& /dev/null || setenv PATH "${PATH}:$HOME/edirect"
./edirect/setup.sh

Any of these methods will download several programs into an "edirect" folder in the user's home directory. The setup.sh script then downloads any missing legacy modules, and may print an additional command for updating the PATH environment variable in the user's configuration file. The editing instructions will look something like:

  
echo "export PATH=\$PATH:\$HOME/edirect" >> $HOME/.bash_profile

As a convenience, the installation process now ends by offering to run the PATH update command for you. Answer "y" and press the Return key if you want it run. If the PATH is already set correctly, or if you prefer to make any editing changes manually, just press Return.
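
Whether the PATH update took effect can be checked from a new terminal window. The following is a minimal sketch, assuming the default install location of $HOME/edirect described above. The function name path_contains is hypothetical and is not part of EDirect:

```shell
# Report whether a directory appears as an entry in a PATH-style list.
# path_contains is an illustrative helper, not an EDirect command.
path_contains() {
  case ":$2:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

if path_contains "$HOME/edirect" "$PATH"; then
  echo "edirect is on the PATH"
else
  echo "edirect is not on the PATH; add it in your shell configuration file"
fi
```

Wrapping both strings in colons means that only a complete PATH entry matches, so a folder such as "edirect2" is not mistaken for "edirect".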

Programmatic Access

EDirect connects to Entrez through the Entrez Programming Utilities interface. It supports searching by indexed terms, looking up precomputed neighbors or links, filtering results by date or category, and downloading record summaries or reports.

Navigation programs (esearch, elink, efilter, and efetch) communicate by means of a small structured message, which can be passed invisibly between operations with a Unix pipe. The message includes the current database, so it does not need to be given as an argument after the first step.

Accessory programs (nquire, transmute, and xtract) can help eliminate the need for writing custom software to answer ad hoc questions. Queries can move seamlessly between EDirect programs and Unix utilities or scripts to perform actions that cannot be accomplished entirely within Entrez.

All EDirect programs are designed to work on large sets of data. They handle many technical details behind the scenes (avoiding the learning curve normally required for E-utilities programming), and save intermediate results on the Entrez history server. For best performance, obtain an API Key from NCBI, and place the following line in your .bash_profile configuration file:

  export NCBI_API_KEY=unique_api_key_goes_here

Each program also has a ‑help command that prints detailed information about available arguments.

Unix programs are run by typing the name of the program and then supplying any required or optional arguments on the command line. Argument names are letters or words that start with a dash ("-") character.

Navigation Functions

Esearch performs a new Entrez search using terms in indexed fields. It requires a ‑db argument for the database name and uses ‑query for the search terms. For PubMed, without field qualifiers, the server uses automatic term mapping to compose a search strategy by translating the supplied query:

  esearch -db pubmed -query "selective serotonin reuptake inhibitor"

Search terms can also be qualified with a bracketed field name to match within the specified index:

  esearch -db nuccore -query "insulin [PROT] AND rodents [ORGN]"

Elink looks up precomputed neighbors within a database, or finds associated records in other databases:

  elink -related

elink -target gene

Elink also connects to the NIH Open Citation Collection dataset to follow the reference lists of PubMed records, or to find publications that cite the selected PubMed articles:

  elink -cites

elink -cited

Efilter limits the results of a previous query, with shortcuts that can also be used in esearch:

  efilter -molecule genomic -location chloroplast -country sweden -days 365

Efetch downloads selected records or reports in a style designated by ‑format:

  efetch -format abstract

Constructing Multi-Step Queries

EDirect allows individual operations to be described separately, combining them into a multi-step query by using the vertical bar ("|") Unix pipe symbol:

  esearch -db pubmed -query "tn3 transposition immunity" | efetch -format medline

Writing Commands on Multiple Lines

A query can be continued on the next line by typing the backslash ("\") Unix escape character immediately before pressing the Return key.

  esearch -db pubmed -query "opsin gene conversion" | \

Continuing the query looks up precomputed neighbors of the original papers, next links to all protein sequences published in the related articles, then limits those to the rodent division of GenBank, and finally retrieves the records in FASTA format:

  elink -related | \
  elink -target protein | \
  efilter -division rod | \
  efetch -format fasta

In most modern versions of Unix the vertical bar pipe symbol also allows the query to continue on the next line, without the need for an additional backslash.
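
The two continuation styles can be compared without contacting Entrez. In this sketch, ordinary Unix utilities stand in for EDirect commands, and the sample words and function names are made up for illustration:

```shell
# Style 1: an explicit backslash continues the command on the next line.
count_with_backslash() {
  printf 'opsin\nrhodopsin\nopsin\n' | \
    sort | \
    uniq -c
}

# Style 2: a trailing vertical bar is enough, because the shell knows
# that a pipe must be followed by another command and keeps reading.
count_with_bar() {
  printf 'opsin\nrhodopsin\nopsin\n' |
    sort |
    uniq -c
}

count_with_backslash
count_with_bar
```

Both functions print the same counts. The second style is less fragile, since a stray space after a backslash silently breaks the continuation.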

Accessory Programs

Nquire retrieves data from remote servers with URLs constructed from command line arguments:

  nquire -get http://www.wikidata.org/entity Q22679758 |

Transmute converts a concatenated stream of JSON objects or other structured formats into XML:

  transmute -j2x |

Xtract can use waypoints to navigate a complex XML hierarchy and obtain data values by field name:

  xtract -pattern entities -group P527 -block datavalue -element id |

The resulting output can be post-processed by Unix utilities or scripts:

  fmt -w 1 | sort -V | uniq
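
The effect of this final step can be seen on a small sample. The identifiers below are made up to resemble a whitespace-separated line of item IDs; they are not real output from the query above, and unique_words is a hypothetical name:

```shell
# fmt -w 1 puts each word on its own line, sort -V orders the embedded
# numbers naturally (Q9 before Q10), and uniq drops adjacent duplicates.
unique_words() {
  fmt -w 1 | sort -V | uniq
}

printf 'Q10 Q9 Q200 Q9\n' | unique_words
```

A plain alphabetical sort would place Q10 and Q200 ahead of Q9, which is why the version-sort flag is used here.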

Discovery by Navigation

PubMed related articles are calculated by a statistical text retrieval algorithm using the title, abstract, and medical subject headings (MeSH terms). The connections between papers can be used for making discoveries. An example of this is finding the last enzymatic step in the vitamin A biosynthetic pathway.

Lycopene cyclase in plants converts lycopene into β-carotene, the immediate biochemical precursor of vitamin A. An initial search on the enzyme:

  esearch -db pubmed -query "lycopene cyclase" |

finds 258 articles. Looking up precomputed neighbors:

  elink -related |

returns 15,827 PubMed papers, some of which might be expected to discuss other enzymes in the pathway. Since β-carotene (or vitamin A itself) is an essential nutrient, enzymes catalyzing earlier steps are not present in animals (with a few exceptions caused by horizontal gene transfer from fungi).

This knowledge can be used to help locate the desired enzyme. Linking from publications to proteins:

  elink -target protein |

finds 644,215 protein sequences, all annotated and indexed with organism information from the NCBI taxonomy. Limiting to mice excludes plants, fungi, and bacteria, which eliminates the earlier enzymes:

  efilter -organism mouse -source refseq |

This matches only 34 sequences, which is small enough to examine by retrieving the individual records:

  efetch -format fasta

As anticipated, the results include the enzyme that splits β-carotene into two molecules of retinal:

  ...
>NP_067461.2 beta,beta-carotene 15,15'-dioxygenase isoform 1 [Mus musculus]
MEIIFGQNKKEQLEPVQAKVTGSIPAWLQGTLLRNGPGMHTVGESKYNHWFDGLALLHSFSIRDGEVFYR
SKYLQSDTYIANIEANRIVVSEFGTMAYPDPCKNIFSKAFSYLSHTIPDFTDNCLINIMKCGEDFYATTE
TNYIRKIDPQTLETLEKVDYRKYVAVNLATSHPHYDEAGNVLNMGTSVVDKGRTKYVIFKIPATVPDSKK
...

The entire set of commands runs in 50 seconds. There is no need to use a script to loop over records one at a time, or write code to retry after a transient network failure, or add a time delay between requests. All of these features are already built into the EDirect commands.

Retrieving PubMed Reports

Piping PubMed query results to efetch and specifying the "abstract" format:

  esearch -db pubmed -query "lycopene cyclase" |
efetch -format abstract

returns a set of reports that can be read by a person:

  ...
85. PLoS One. 2013;8(3):e58144. doi: 10.1371/journal.pone.0058144. Epub ...

Levels of lycopene β-cyclase 1 modulate carotenoid gene expression and
accumulation in Daucus carota.

Moreno JC(1), Pizarro L, Fuentes P, Handford M, Cifuentes V, Stange C.

Author information:
(1)Departamento de Biología, Facultad de Ciencias, Universidad de Chile,
Santiago, Chile.

Plant carotenoids are synthesized and accumulated in plastids through a
highly regulated pathway. Lycopene β-cyclase (LCYB) is a key enzyme
involved directly in the synthesis of α-carotene and β-carotene through
...

If "medline" format is used instead:

  esearch -db pubmed -query "lycopene cyclase" |
efetch -format medline

the output can be entered into common bibliographic management software packages:

  ...
PMID- 23555569
OWN - NLM
STAT- MEDLINE
DA - 20130404
DCOM- 20130930
LR - 20131121
IS - 1932-6203 (Electronic)
IS - 1932-6203 (Linking)
VI - 8
IP - 3
DP - 2013
TI - Levels of lycopene beta-cyclase 1 modulate carotenoid gene expression
and accumulation in Daucus carota.
PG - e58144
LID - 10.1371/journal.pone.0058144 [doi]
AB - Plant carotenoids are synthesized and accumulated in plastids
through a highly regulated pathway. Lycopene beta-cyclase (LCYB) is a
key enzyme involved directly in the synthesis of alpha-carotene and
...

Retrieving Sequence Reports

Nucleotide and protein records can be downloaded in FASTA format:

  esearch -db protein -query "lycopene cyclase" |
efetch -format fasta

which consists of a definition line followed by the sequence:

  ...
>gi|735882|gb|AAA81880.1| lycopene cyclase [Arabidopsis thaliana]
MDTLLKTPNKLDFFIPQFHGFERLCSNNPYPSRVRLGVKKRAIKIVSSVVSGSAALLDLVPETKKENLDF
ELPLYDTSKSQVVDLAIVGGGPAGLAVAQQVSEAGLSVCSIDPSPKLIWPNNYGVWVDEFEAMDLLDCLD
TTWSGAVVYVDEGVKKDLSRPYGRVNRKQLKSKMLQKCITNGVKFHQSKVTNVVHEEANSTVVCSDGVKI
QASVVLDATGFSRCLVQYDKPYNPGYQVAYGIIAEVDGHPFDVDKMVFMDWRDKHLDSYPELKERNSKIP
TFLYAMPFSSNRIFLEETSLVARPGLRMEDIQERMAARLKHLGINVKRIEEDERCVIPMGGPLPVLPQRV
VGIGGTAGMVHPSTGYMVARTLAAAPIVANAIVRYLGSPSSNSLRGDQLSAEVWRDLWPIERRRQREFFC
FGMDILLKLDLDATRRFFDAFFDLQPHYWHGFLSSRLFLPELLVFGLSLFSHASNTSRLEIMTKGTVPLA
KMINNLVQDRD
...

Sequence records can also be obtained as GenBank or GenPept flatfiles:

  esearch -db protein -query "lycopene cyclase" |
efetch -format gp

which have features annotating particular regions of the sequence:

  ...
LOCUS AAA81880 501 aa linear PLN ...
DEFINITION lycopene cyclase [Arabidopsis thaliana].
ACCESSION AAA81880
VERSION AAA81880.1 GI:735882
DBSOURCE locus ATHLYC accession L40176.1
KEYWORDS .
SOURCE Arabidopsis thaliana (thale cress)
ORGANISM Arabidopsis thaliana
Eukaryota; Viridiplantae; Streptophyta; Embryophyta;
Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons;
Brassicales; Brassicaceae; Camelineae; Arabidopsis.
REFERENCE 1 (residues 1 to 501)
AUTHORS Scolnik,P.A. and Bartley,G.E.
TITLE Nucleotide sequence of lycopene cyclase (GenBank L40176) from
Arabidopsis (PGR95-019)
JOURNAL Plant Physiol. 108 (3), 1343 (1995)
...
FEATURES Location/Qualifiers
source 1..501
/organism="Arabidopsis thaliana"
/db_xref="taxon:3702"
Protein 1..501
/product="lycopene cyclase"
transit_peptide 1..80
mat_peptide 81..501
/product="lycopene cyclase"
CDS 1..501
/gene="LYC"
/coded_by="L40176.1:2..1507"
ORIGIN
1 mdtllktpnk ldffipqfhg ferlcsnnpy psrvrlgvkk raikivssvv sgsaalldlv
61 petkkenldf elplydtsks qvvdlaivgg gpaglavaqq vseaglsvcs idpspkliwp
121 nnygvwvdef eamdlldcld ttwsgavvyv degvkkdlsr pygrvnrkql kskmlqkcit
181 ngvkfhqskv tnvvheeans tvvcsdgvki qasvvldatg fsrclvqydk pynpgyqvay
241 giiaevdghp fdvdkmvfmd wrdkhldsyp elkernskip tflyampfss nrifleetsl
301 varpglrmed iqermaarlk hlginvkrie edercvipmg gplpvlpqrv vgiggtagmv
361 hpstgymvar tlaaapivan aivrylgsps snslrgdqls aevwrdlwpi errrqreffc
421 fgmdillkld ldatrrffda ffdlqphywh gflssrlflp ellvfglslf shasntsrle
481 imtkgtvpla kminnlvqdr d
//
...

Searching and Filtering

Restricting Query Results

The current results can be refined by further term searching in Entrez (useful in the protein database for limiting BLAST neighbors to a taxonomic subset):

  esearch -db pubmed -query "opsin gene conversion" |
elink -related |
efilter -query "tetrachromacy"

Limiting by Date

Results can also be filtered by date. For example, the following statements:

  efilter -days 60 -datetype PDAT

efilter -mindate 2000

efilter -maxdate 1985

efilter -mindate 1990 -maxdate 1999

restrict results to articles published in the previous two months, since the beginning of 2000, through the end of 1985, or in the 1990s, respectively. YYYY/MM and YYYY/MM/DD date formats are also accepted.
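
When dates come from user input or from another program, it can help to check their shape before passing them to efilter. This is a minimal sketch, assuming only the three layouts named above. The helper is_entrez_date is hypothetical, and it checks the digit layout only, not whether the month or day is a real calendar value:

```shell
# Accept YYYY, YYYY/MM, or YYYY/MM/DD; reject everything else.
is_entrez_date() {
  case "$1" in
    [0-9][0-9][0-9][0-9]) return 0 ;;
    [0-9][0-9][0-9][0-9]/[0-9][0-9]) return 0 ;;
    [0-9][0-9][0-9][0-9]/[0-9][0-9]/[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

for d in 1990 1999/12 1999/12/31 12-31-1999; do
  if is_entrez_date "$d"; then
    echo "$d: accepted layout"
  else
    echo "$d: unexpected layout"
  fi
done
```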

Fetch by Identifier

Efetch and elink can take a list of numeric identifiers or accessions in an ‑id argument:

  efetch -db pubmed -id 7252148,1937004 -format xml

efetch -db nuccore -id 1121073309 -format acc

efetch -db protein -id 3OQZ_a -format fasta

efetch -db bioproject -id PRJNA257197 -format docsum

elink -db pubmed -id 2539356 -cites

without the need for a preceding esearch command.
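
Identifiers are often kept one per line in a file, while the ‑id examples above separate them with commas. The following sketch performs that conversion with the standard paste utility. The name join_ids is hypothetical, and the efetch call appears only as a comment because it needs EDirect and a network connection:

```shell
# Join identifiers read from standard input into one comma-separated string.
join_ids() {
  paste -sd, -
}

ids=$(printf '7252148\n1937004\n' | join_ids)
echo "$ids"

# The joined list could then be supplied as:
#   efetch -db pubmed -id "$ids" -format xml
```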

Non-integer accessions will be looked up with an internal search, using the appropriate field for the database:

  esearch -db bioproject -query "PRJNA257197 [PRJA]" |
efetch -format uid |
...

For backward compatibility, esummary is a shortcut for efetch ‑format docsum:

  esummary -db bioproject -id PRJNA257197

esummary -db sra -id SRR5437876

Indexed Fields

The einfo command can report the fields and links that are indexed for each database:

  einfo -db protein -fields

This will return a table of field abbreviations and names indexed for proteins:

  ACCN    Accession
  ALL     All Fields
  ASSM    Assembly
  AUTH    Author
  BRD     Breed
  CULT    Cultivar
  DIV     Division
  ECNO    EC/RN Number
  FILT    Filter
  FKEY    Feature key
  ...

Qualifying Queries by Indexed Field

Query terms in esearch or efilter can be qualified by entering an indexed field abbreviation in brackets. Boolean operators and parentheses can also be used in the query expression for more complex searches.

Commonly-used fields for PubMed queries include:

  [AFFL]    Affiliation           [MAJR]    MeSH Major Topic
  [ALL]     All Fields            [SUBH]    MeSH Subheading
  [AUTH]    Author                [MESH]    MeSH Terms
  [FAUT]    Author - First        [PTYP]    Publication Type
  [LAUT]    Author - Last         [WORD]    Text Word
  [PDAT]    Date - Publication    [TITL]    Title
  [FILT]    Filter                [TIAB]    Title/Abstract
  [JOUR]    Journal               [UID]     UID
  [LANG]    Language

and a qualified query looks like:

  "Tager HS [AUTH] AND glucagon [TIAB]"

Filters that limit search results to subsets of PubMed include:

  humans [MESH]
  pharmacokinetics [MESH]
  chemically induced [SUBH]
  all child [FILT]
  english [FILT]
  freetext [FILT]
  has abstract [FILT]
  historical article [FILT]
  randomized controlled trial [FILT]
  clinical trial, phase ii [PTYP]
  review [PTYP]

Sequence databases are indexed with a different set of search fields, including:

  [ACCN]    Accession       [MLWT]    Molecular Weight
  [ALL]     All Fields      [ORGN]    Organism
  [AUTH]    Author          [PACC]    Primary Accession
  [GPRJ]    BioProject      [PROP]    Properties
  [BIOS]    BioSample       [PROT]    Protein Name
  [ECNO]    EC/RN Number    [SQID]    SeqID String
  [FKEY]    Feature key     [SLEN]    Sequence Length
  [FILT]    Filter          [SUBS]    Substance Name
  [GENE]    Gene Name       [WORD]    Text Word
  [JOUR]    Journal         [TITL]    Title
  [KYWD]    Keyword         [UID]     UID

and a sample query in the protein database is:

  "alcohol dehydrogenase [PROT] NOT (bacteria [ORGN] OR fungi [ORGN])"

Additional examples of subset filters in sequence databases are:

  mammalia [ORGN]
  mammalia [ORGN:noexp]
  txid40674 [ORGN]
  cds [FKEY]
  lacz [GENE]
  beta galactosidase [PROT]
  protein snp [FILT]
  reviewed [FILT]
  country united kingdom glasgow [TEXT]
  biomol genomic [PROP]
  dbxref flybase [PROP]
  gbdiv phg [PROP]
  phylogenetic study [PROP]
  sequence from mitochondrion [PROP]
  src cultivar [PROP]
  srcdb refseq validated [PROP]
  150:200 [SLEN]

(The calculated molecular weight field, MLWT, is indexed only for proteins and structures, not for nucleotides.)

See efilter ‑help for a list of filter shortcuts available for several Entrez databases.
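
Longer qualified queries are easier to maintain when each term is built separately. This is a minimal sketch in plain shell. The helpers qualify and all_of are hypothetical names, and the resulting string is simply what would be passed to ‑query:

```shell
# Attach a bracketed field abbreviation to a search term.
qualify() {
  printf '%s [%s]' "$1" "$2"
}

# Join any number of already-qualified terms with AND.
all_of() {
  out=$1
  shift
  for term in "$@"; do
    out="$out AND $term"
  done
  printf '%s\n' "$out"
}

query=$(all_of "$(qualify 'Tager HS' AUTH)" "$(qualify glucagon TIAB)")
echo "$query"

# Intended use (requires EDirect):
#   esearch -db pubmed -query "$query"
```

The echoed string matches the qualified PubMed query shown earlier in this section.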

Examining Intermediate Results

EDirect stores intermediate results on the Entrez history server. EDirect navigation functions produce a custom XML message with the relevant fields (database, web environment, query key, and record count) that can be read by the next command in the pipeline.

The results of each step in a query can be examined to confirm expected behavior before adding the next step. The Count field in the ENTREZ_DIRECT object contains the number of records returned by the previous step. A good measure of query success is a reasonable (non-zero) count value. For example:

  esearch -db protein -query "tryptophan synthase alpha chain [PROT]" |
efilter -query "28000:30000 [MLWT]" |
elink -target structure |
efilter -query "0:2 [RESO]"

produces:

  <ENTREZ_DIRECT>
    <Db>structure</Db>
    <WebEnv>MCID_5fac27e119f45d4eca20b0e6</WebEnv>
    <QueryKey>32</QueryKey>
    <Count>58</Count>
    <Step>4</Step>
  </ENTREZ_DIRECT>

with 58 protein structures being within the specified molecular weight range and having the desired (X-ray crystallographic) atomic position resolution.

(The QueryKey value differs from Step because the elink command splits its query into smaller chunks to avoid server truncation limits and timeout errors.)
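
A script can make the same check automatically. The sketch below reads an ENTREZ_DIRECT message and pulls out the Count value with awk. The name message_count is hypothetical, the sample is an abbreviated copy of the message shown above, and the approach assumes each field sits on its own line:

```shell
# Split each line on angle brackets; on the Count line the second piece
# is the element name and the third is its value.
message_count() {
  awk -F '[<>]' '$2 == "Count" { print $3 }'
}

sample='<ENTREZ_DIRECT>
  <Db>structure</Db>
  <QueryKey>32</QueryKey>
  <Count>58</Count>
  <Step>4</Step>
</ENTREZ_DIRECT>'

count=$(printf '%s\n' "$sample" | message_count)
if [ "${count:-0}" -gt 0 ]; then
  echo "query returned $count records"
else
  echo "query returned nothing; check the previous step" >&2
fi
```

Reading the message this way consumes it, so a real script would first save the message in a variable, as done with sample here, and then pipe that saved copy into the next EDirect command.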

Combining Independent Queries

Independent esearch, elink, and efilter operations can be performed and then combined at the end by using the history server's "#" convention to indicate query key numbers. (The steps to be combined must be in the same database.) Subsequent esearch commands can take a ‑db argument to override the database piped in from the previous step. (Piping the queries together is necessary for sharing the same history thread.)

Because elink splits a large query into multiple smaller link requests, the new QueryKey value cannot be predicted in advance. The ‑label argument is used to get around this artifact. The label value is prefixed by a "#" symbol and placed in parentheses in the final search. For example, the query:

  esearch -db protein -query "amyloid* [PROT]" |
elink -target pubmed -label prot_cit |
esearch -db gene -query "apo* [GENE]" |
elink -target pubmed -label gene_cit |
esearch -query "(#prot_cit) AND (#gene_cit)" |
efetch -format docsum |
xtract -pattern DocumentSummary -element Id Title

uses truncation searching (entering the beginning of a word followed by an asterisk) to return titles of papers with links to amyloid protein sequence and apolipoprotein gene records:

  23962925    Genome analysis reveals insights into physiology and ...
23959870 Low levels of copper disrupt brain amyloid-β homeostasis ...
23371554 Genomic diversity and evolution of the head crest in the ...
23251661 Novel genetic loci identified for the pathophysiology of ...
...

Structured Data

Advantages of XML Format

The ability to obtain Entrez records in structured eXtensible Markup Language (XML) format, and to easily extract the underlying data, allows the user to ask novel questions that are not addressed by existing analysis software.

The advantage of XML is that information is in specific locations in a well-defined data hierarchy. Accessing individual units of data that are fielded by name, such as:

  <PubDate>2013</PubDate>
<Source>PLoS One</Source>
<Volume>8</Volume>
<Issue>3</Issue>
<Pages>e58144</Pages>

requires matching the same general pattern, differing only by the element name. This is much simpler than parsing the units from a long, complex string:

  1. PLoS One. 2013;8(3):e58144 ...

The disadvantage of XML is that data extraction usually requires programming. But EDirect relies on the common pattern of XML value representation to provide a simplified approach to interpreting XML data.
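
The "common pattern" idea can be illustrated without EDirect. In the sketch below, a single sed expression, parameterized only by element name, retrieves any of the fields from the sample shown earlier. This is not how xtract works internally. The helper xml_value is hypothetical, and it assumes each element sits on one line with no attributes:

```shell
# Print the text between <NAME> and </NAME> for the given element name.
xml_value() {
  sed -n "s:.*<$1>\(.*\)</$1>.*:\1:p"
}

record='<PubDate>2013</PubDate>
<Source>PLoS One</Source>
<Volume>8</Volume>
<Issue>3</Issue>
<Pages>e58144</Pages>'

for name in Source PubDate Volume Issue Pages; do
  printf '%s\t%s\n' "$name" "$(printf '%s\n' "$record" | xml_value "$name")"
done
```

Only the element name changes from one lookup to the next, which is the property that a general-purpose extraction tool can exploit.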

Conversion of XML into Tables

The xtract program uses command-line arguments to direct the selective conversion of data in XML format. It allows path exploration, element selection, conditional processing, and report formatting to be controlled independently.

The ‑pattern command partitions an XML stream by object name into individual records that are processed separately. Within each record, the ‑element command does an exhaustive, depth-first search to find data content by field name. Explicit paths to objects are not needed.

By default, the ‑pattern argument divides the results into rows, while placement of data into columns is controlled by ‑element, to create a tab-delimited table.

Format Customization

Formatting commands allow extensive customization of the output. The line break between ‑pattern rows is changed with ‑ret, while the tab character between ‑element columns is modified by ‑tab. Multiple instances of the same element are distinguished using ‑sep, which controls their separation independently of the ‑tab command. The following query:

  efetch -db pubmed -id 6271474,6092233,16589597 -format docsum |
xtract -pattern DocumentSummary -sep "|" -element Id PubDate Name

returns a table with individual author names separated by vertical bars:

  6271474     1981            Casadaban MJ|Chou J|Lemaux P|Tu CP|Cohen SN
6092233 1984 Jul-Aug Calderon IL|Contopoulou CR|Mortimer RK
16589597 1954 Dec Garber ED
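
Because the separators are predictable, the table is easy to process further with standard tools. This sketch counts the authors in each row with awk. The three rows are copied from the output above, with single tabs between columns, and count_authors is a hypothetical name:

```shell
# split() returns the number of pieces, which is the author count.
count_authors() {
  awk -F '\t' '{ n = split($3, names, "|"); print $1 "\t" n }'
}

printf '%s\t%s\t%s\n' \
  6271474 1981 'Casadaban MJ|Chou J|Lemaux P|Tu CP|Cohen SN' \
  6092233 '1984 Jul-Aug' 'Calderon IL|Contopoulou CR|Mortimer RK' \
  16589597 '1954 Dec' 'Garber ED' |
  count_authors
```

Using one character for columns and a different one within a column is what lets the two levels be taken apart independently.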

The ‑sep value also applies to distinct -element arguments that are grouped with commas. This can be used to keep data from multiple related fields in the same column:

  -sep " " -element Initials,LastName

Groups of fields are preceded by the ‑pfx value and followed by the ‑sfx value, both of which are initially empty.

The ‑def command sets a default placeholder to be printed when none of the comma-separated fields in an ‑element clause are present:

  -def "-" -sep " " -element Year,Month,MedlineDate

Repackaging commands (‑wrp, ‑enc, and ‑pkg) wrap extracted data values with bracketed XML tags given only the object name. For example, "-wrp Word" issues the following formatting instructions:

  -pfx "<Word>" -sep "</Word><Word>" -sfx "</Word>"

and also ensures that data values containing encoded angle brackets, ampersands, quotation marks, or apostrophes remain properly encoded inside the new XML.

Element Variants

Derivatives of ‑element were created to eliminate the inconvenience of having to write post-processing scripts to perform otherwise trivial modifications or analyses on extracted data. They are subdivided into several categories. Substitute for ‑element as needed. A representative selection is shown below:

  Positional:       -first, -last
  Numeric:          -num, -len, -inc, -dec, -sum, -min, -max, -avg, -dev, -med
  Text:             -encode, -plain, -upper, -lower, -title, -words, -reverse
  Expression:       -replace
  Sequence:         -revcomp, -fasta, -ncbi2na, -molwt
  Coordinate:       -0-based, -1-based, -ucsc-based
  Variation:        -hgvs
  Miscellaneous:    -year, -doi, -histogram

The original -element prefix shortcuts, "#" and "%", are redirected to ‑num and ‑len, respectively.

Exploration Control

Exploration commands provide fine control over the order in which XML record contents are examined, by separately presenting each instance of the chosen subregion. This limits what subsequent commands "see" at any one time, and allows related fields in an object to be kept together.

In contrast to the simpler DocumentSummary format, records retrieved as PubmedArticle XML:

  efetch -db pubmed -id 1413997 -format xml |

have authors with separate fields for last name and initials:

  <Author>
    <LastName>Mortimer</LastName>
    <Initials>RK</Initials>
  </Author>

Without being given any guidance about context, an ‑element command on initials and last names:

  efetch -db pubmed -id 1413997 -format xml |
xtract -pattern PubmedArticle -element Initials LastName

will explore the current record for each argument in turn, and thus print all author initials followed by all author last names:

  RK    CR    JS    Mortimer    Contopoulou    King

Inserting a ‑block command redirects data exploration to present the authors one at a time:

  efetch -db pubmed -id 1413997 -format xml |
xtract -pattern PubmedArticle -block Author -element Initials LastName

Each time through the loop, the ‑element command only sees the current author's values. This restores the correct association of initials and last names in the output:

  RK    Mortimer    CR    Contopoulou    JS    King

Grouping the two author subfields with a comma, and adjusting the ‑sep and ‑tab values:

  efetch -db pubmed -id 1413997 -format xml |
xtract -pattern PubmedArticle -block Author \
-sep " " -tab ", " -element Initials,LastName

produces a more desirable formatting of author names:

  RK Mortimer, CR Contopoulou, JS King

Sequential Exploration

Multiple ‑block statements can be used in a single xtract to explore different areas of the XML. This limits element extraction to the desired subregions, and allows disambiguation of fields with identical names. For example:

  efetch -db pubmed -id 6092233,4640931,4296474 -format xml |
  xtract -pattern PubmedArticle \
    -element MedlineCitation/PMID \
    -block PubDate -sep " " -element Year,Month,MedlineDate \
    -block AuthorList -num Author -sep "/" -element LastName |
  sort -t $'\t' -k 3,3n -k 4,4f

generates a table that allows easy parsing of author last names, and sorts the results by author count:

  4296474    1968 Apr        1    Friedmann
4640931 1972 Dec 2 Tager/Steiner
6092233 1984 Jul-Aug 3 Calderon/Contopoulou/Mortimer

The individual ‑block statements are executed sequentially, in order of appearance.

Note that the PubDate object can exist either in a structured form:

  <PubDate>
    <Year>1968</Year>
    <Month>Apr</Month>
    <Day>25</Day>
  </PubDate>

(with the Day field frequently absent), or in a string form:

  <PubDate>
    <MedlineDate>1984 Jul-Aug</MedlineDate>
  </PubDate>

but would not contain a mixture of both types, so the directive:

  -element Year,Month,MedlineDate

will only contribute a single column to the output.

Nested Exploration

Exploration command names (‑group, ‑block, and ‑subset) are assigned to a precedence hierarchy:

  -pattern > -group > -block > -subset > -element

and are combined in ranked order to control object iteration at progressively deeper levels in the XML data structure. Each command argument acts as a "nested for-loop" control variable, retaining information about the context, or state of exploration, at its level.

(Hypothetical) census data would need several nested loops to visit each unique address in context:

  -pattern State -group City -block Street -subset Number -element Resident

MeSH terms can have their own unique set of qualifiers, with a major topic attribute on each object:

  ...
  <MeshHeading>
    <DescriptorName MajorTopicYN="N">beta-Galactosidase</DescriptorName>
    <QualifierName MajorTopicYN="Y">genetics</QualifierName>
    <QualifierName MajorTopicYN="N">metabolism</QualifierName>
  </MeshHeading>
...

Since ‑element does its own exploration for objects within its current scope, a ‑block command:

  -block MeshHeading -sep " / " -element DescriptorName,QualifierName

is sufficient for grouping each MeSH name with its qualifiers:

  beta-Galactosidase / genetics / metabolism

Adding ‑subset commands within the ‑block separately visits the descriptor and each qualifier in the current MeSH term, and is needed to keep major topic attributes associated with their parent objects:

  efetch -db pubmed -id 6162838 -format xml |
  xtract -transform <( echo -e "Y\t*\n" ) \
    -pattern PubmedArticle -element MedlineCitation/PMID \
    -block MeshHeading -clr \
      -subset DescriptorName -plg "\n" -tab "" \
        -translate "@MajorTopicYN" -element DescriptorName \
      -subset QualifierName -plg " / " -tab "" \
        -translate "@MajorTopicYN" -element QualifierName

The ‑transform and ‑translate commands convert the "Y" attribute value to an asterisk for indicating major topics. This avoids the need for post-processing in Unix to get the desired output:

  6162838
Base Sequence
*DNA, Recombinant
Escherichia coli / genetics
...
RNA, Messenger / *genetics
Transcription, Genetic
beta-Galactosidase / *genetics / metabolism

Note that "‑element MedlineCitation/PMID" uses the parent / child construct to prevent the display of additional PMID items that may be present later in CommentsCorrections objects.

Conditional Execution

Conditional processing commands (‑if, ‑unless, ‑and, ‑or, and ‑else) restrict object exploration by data content. They check to see if the named field is within the scope, and may be used in conjunction with string, numeric, or object constraints to require an additional match by value. For example:

  esearch -db pubmed -query "Havran W [AUTH]" |
efetch -format xml |
xtract -pattern PubmedArticle -if "#Author" -lt 13 \
-block Author -if LastName -is-not Havran \
-sep ", " -tab "\n" -element LastName,Initials |
sort-uniq-count-rank

selects papers with fewer than 13 authors and prints a table of the most frequent collaborators:

  27    Witherden, DA
15 Boismenu, R
10 Allison, JP
9 Fitch, FW
8 Jameson, JM
...

Numeric constraints can also compare the integer values of two fields. This can be used to find genes that are encoded on the minus strand of a nucleotide sequence:

  -if ChrStart -gt ChrStop

Object constraints will compare the string values of two named fields, and can look for internal inconsistencies between fields whose contents should (in most cases) be identical:

  -if Chromosome -differs-from ChrLoc

The ‑position command restricts presentation of objects by relative location or index number:

  -block Author -position last -sep ", " -element LastName,Initials

Multiple conditions are specified with ‑and and ‑or commands:

  -if @score -equals 1 -or @score -starts-with 0.9

The ‑else command can supply alternative ‑element or ‑lbl instructions to be run if the condition is not satisfied:

  -if MapLocation -element MapLocation -else -lbl "\-"

but setting a default value with ‑def may be more convenient in simple cases.

Parallel ‑if and ‑unless statements can be used to provide a more complex response to alternative conditions that include nested explorations.

Saving Data in Variables

A value can be recorded in a variable and used wherever needed. Variables are created by a hyphen followed by a name consisting of a string of capital letters or digits (e.g., ‑PMID):

  efetch -db pubmed -id 3201829,6301692,781293 -format xml |
xtract -pattern PubmedArticle -PMID MedlineCitation/PMID \

Variable values are retrieved by placing an ampersand before the variable name (e.g., "&PMID") in an ‑element statement:

    -block Author -element "&PMID" \
-sep " " -tab "\n" -element Initials,LastName

This produces a list of authors, with the PubMed identifier in the first column of each row:

  3201829    JR Johnston
3201829 CR Contopoulou
3201829 RK Mortimer
6301692 MA Krasnow
6301692 NR Cozzarelli
781293 MJ Casadaban

The variable can be used even though the original object is no longer visible inside the ‑block section.
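
Having the identifier repeated on every row also makes the list easy to regroup later. The following sketch collapses the rows back into one line per PMID using awk. The input rows are copied from the output above, and group_by_pmid is a hypothetical name:

```shell
# Start a new output line whenever the first column changes; otherwise
# append the author to the line being built.
group_by_pmid() {
  awk -F '\t' '
    $1 != prev { if (NR > 1) print line; line = $1 "\t" $2; prev = $1; next }
    { line = line ", " $2 }
    END { if (NR > 0) print line }
  '
}

printf '%s\t%s\n' \
  3201829 'JR Johnston' \
  3201829 'CR Contopoulou' \
  3201829 'RK Mortimer' \
  6301692 'MA Krasnow' \
  6301692 'NR Cozzarelli' \
  781293 'MJ Casadaban' |
  group_by_pmid
```

This relies on rows for the same PMID being adjacent, which holds here because xtract processes one record at a time.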

Variables can be (re)initialized with an explicit literal value inside parentheses:

  -block Author -sep " " -tab "" -element "&COM" Initials,LastName -COM "(, )"

They can also be used as the first argument in a conditional statement:

  -CHR Chromosome -block GenomicInfoType -if "&CHR" -differs-from ChrLoc

Using a double-hyphen (e.g., ‑‑STATS) appends a value to the variable.

All variables are reset when the next record is processed.

Post-processing Functions

Elink ‑cited can perform a reverse citation lookup, thanks to a data service provided by the NIH Open Citation Collection. The extracted author names can be processed by piping to a chain of Unix utilities:

  esearch -db pubmed -query "Beadle GW [AUTH]" |
elink -cited |
efetch -format docsum |
xtract -pattern Author -element Name |
sort -f | uniq -i -c

which produces an alphabetized count of authors who cited the original papers:

  1 Abellan-Schneyder I
1 Abramowitz M
1 ABREU LA
1 ABREU RR
1 Abril JF
1 Abächerli E
1 Achetib N
1 Adams CM
2 ADELBERG EA
1 Adrian AB
...

Rather than retyping a series of common post-processing instructions each time, frequently used combinations of Unix commands can be placed in a function, stored in an alias file (e.g., the user's .bash_profile), and executed by name. For example:

  SortUniqCountRank() {
    sort -f |
    uniq -i -c |
    awk '{ n=$1; sub(/[ \t]*[0-9]+[ \t]/, ""); print n "\t" $0 }' |
    sort -t "$(printf '\t')" -k 1,1nr -k 2f
  }
  alias sort-uniq-count-rank='SortUniqCountRank'

(An enhanced version of sort-uniq-count-rank that accepts customization arguments is now included with EDirect as a stand-alone script.)
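
As a quick check of what the function does, it can be run on a few lines of made-up input without contacting Entrez. The names and counts below are a hypothetical sample chosen only to show case-insensitive grouping and ranking. The function is called directly because aliases are not expanded in non-interactive shells:

```shell
# Same function body as in the text, repeated so the sketch is self-contained.
SortUniqCountRank() {
  sort -f |
  uniq -i -c |
  awk '{ n=$1; sub(/[ \t]*[0-9]+[ \t]/, ""); print n "\t" $0 }' |
  sort -t "$(printf '\t')" -k 1,1nr -k 2f
}

# Hypothetical input: one name per line, with mixed capitalization.
ranked=$(printf '%s\n' \
  "Hawley RS" "BEADLE GW" "Glass NL" "Hawley RS" "Beadle GW" "Hawley RS" |
  SortUniqCountRank)

# Prints count, a tab, then the name, highest count first.
printf '%s\n' "$ranked"
```

The two capitalizations of the second name are merged into a single row with a count of 2. Which spelling is displayed depends on the locale's sort order.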

The raw author names can be passed directly to the sort-uniq-count-rank script:

  esearch -db pubmed -query "Beadle GW [AUTH]" |
elink -cited |
efetch -format docsum |
xtract -pattern Author -element Name |
sort-uniq-count-rank

to produce a tab-delimited ranked list of authors who most often cited the original papers:

  17    Hawley RS
13 Beadle GW
13 PERKINS DD
11 Glass NL
11 Vécsei L
10 Toldi J
9 TATUM EL
8 Ephrussi B
8 LEDERBERG J
8 Schaeffer SW
...

Similarly, elink ‑cites uses NIH OCC data to return an article's reference list.

Note that EDirect commands can also be used inside Unix functions or scripts.

Viewing an XML Hierarchy

Piping a PubmedArticle XML object to xtract ‑outline will give an indented overview of the XML hierarchy:

  PubmedArticle
    MedlineCitation
      PMID
      DateCompleted
        Year
        Month
        Day
      ...
      Article
        Journal
          ...
          Title
          ISOAbbreviation
        ArticleTitle
        ...
        Abstract
          AbstractText
        AuthorList
          Author
            LastName
            ForeName
            Initials
            AffiliationInfo
              Affiliation
          Author
          ...

Using xtract ‑synopsis or ‑contour will show the full paths to all nodes or just the terminal (leaf) nodes, respectively. Piping those results to "sort-uniq-count" will produce a table of unique paths.

Code Nesting Comparison

Sketching with indented pseudo code can clarify relative nesting levels. The extraction command:

  xtract -pattern PubmedArticle \
    -block Author -element Initials,LastName \
    -block MeshHeading \
      -if QualifierName \
        -element DescriptorName \
        -subset QualifierName -element QualifierName

where the rank of the argument name controls the nesting depth, could be represented as a computer program in pseudo code by:

  for pat = each PubmedArticle {
    for blk = each pat.Author {
      print blk.Initials blk.LastName
    }
    for blk = each pat.MeshHeading {
      if blk.QualifierName is present {
        print blk.DescriptorName
        for sbs = each blk.QualifierName {
          print sbs.QualifierName
        }
      }
    }
  }

where the brace indentation count controls the nesting depth.

Extra arguments are held in reserve to provide additional levels of organization, should the need arise in the future for processing complex, deeply-nested XML data. The exploration commands below ‑pattern, in order of rank, are:

  -path
-division
-group
-branch
-block
-section
-subset
-unit

Starting xtract exploration with ‑block, and expanding with ‑group and ‑subset, leaves additional level names that can be used wherever needed without having to redesign the entire command.

Complex Objects

Author Exploration

What's in a name? That which we call an author by any other name may be a consortium, investigator, or editor:

  <PubmedArticle>
<MedlineCitation>
<PMID>99999999</PMID>
<Article>
<AuthorList>
<Author>
<LastName>Tinker</LastName>
</Author>
<Author>
<LastName>Evers</LastName>
</Author>
<Author>
<LastName>Chance</LastName>
</Author>
<Author>
<CollectiveName>FlyBase Consortium</CollectiveName>
</Author>
</AuthorList>
</Article>
<InvestigatorList>
<Investigator>
<LastName>Alpher</LastName>
</Investigator>
<Investigator>
<LastName>Bethe</LastName>
</Investigator>
<Investigator>
<LastName>Gamow</LastName>
</Investigator>
</InvestigatorList>
</MedlineCitation>
</PubmedArticle>

Within the record, ‑element exploration on last name:

  xtract -pattern PubmedArticle -element LastName

prints each last name, but does not match the consortium:

  Tinker    Evers    Chance    Alpher    Bethe    Gamow

Limiting to the author list:

  xtract -pattern PubmedArticle -block AuthorList -element LastName

excludes the investigators:

  Tinker    Evers    Chance

Using ‑num on each type of object:

  xtract -pattern PubmedArticle -num Author Investigator LastName CollectiveName

displays the various object counts:

  4    3    6    1

Date Selection

Dates come in all shapes and sizes:

  <PubmedArticle>
<MedlineCitation>
<PMID>99999999</PMID>
<DateCompleted>
<Year>2011</Year>
</DateCompleted>
<DateRevised>
<Year>2012</Year>
</DateRevised>
<Article>
<Journal>
<JournalIssue>
<PubDate>
<Year>2013</Year>
</PubDate>
</JournalIssue>
</Journal>
<ArticleDate>
<Year>2014</Year>
</ArticleDate>
</Article>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="received">
<Year>2015</Year>
</PubMedPubDate>
<PubMedPubDate PubStatus="accepted">
<Year>2016</Year>
</PubMedPubDate>
<PubMedPubDate PubStatus="entrez">
<Year>2017</Year>
</PubMedPubDate>
<PubMedPubDate PubStatus="pubmed">
<Year>2018</Year>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2019</Year>
</PubMedPubDate>
</History>
</PubmedData>
</PubmedArticle>

Within the record, ‑element exploration on the year:

  xtract -pattern PubmedArticle -element Year

finds and prints all nine instances:

  2011    2012    2013    2014    2015    2016    2017    2018    2019

Using ‑block to limit the scope:

  xtract -pattern PubmedArticle -block History -element Year

prints only the five years within the History object:

  2015    2016    2017    2018    2019

Inserting a conditional statement to limit element selection to a date with a specific attribute:

  xtract -pattern PubmedArticle -block History \
-if @PubStatus -equals "pubmed" -element Year

surprisingly still prints all five years within History:

  2015    2016    2017    2018    2019

This is because the ‑if command uses the same exploration logic as ‑element, but is designed to declare success if it finds a match anywhere within the current scope. There is indeed a "pubmed" attribute within History, in one of the five PubMedPubDate child objects, so the test succeeds. Thus, ‑element is given free rein to do its own exploration in History, and prints all five years.

The solution is to explore the individual PubMedPubDate objects:

  xtract -pattern PubmedArticle -block PubMedPubDate \
-if @PubStatus -equals "pubmed" -element Year

This visits each PubMedPubDate separately, with the ‑if test matching only the indicated date type, thus returning only the desired year:

  2018

PMID Extraction

Because of the presence of a CommentsCorrections object:

  <PubmedArticle>
<MedlineCitation>
<PMID>99999999</PMID>
<CommentsCorrectionsList>
<CommentsCorrections RefType="ErratumFor">
<PMID>88888888</PMID>
</CommentsCorrections>
</CommentsCorrectionsList>
</MedlineCitation>
</PubmedArticle>

attempting to print the record's PubMed Identifier:

  xtract -pattern PubmedArticle -element PMID

also returns the PMID of the comment:

  99999999    88888888

An exploration command cannot exclude the second instance, because that would require a parent node unique to the first element, and the chain of parents to the first PMID:

  PubmedArticle/MedlineCitation

is a subset of the chain of parents to the second PMID:

  PubmedArticle/MedlineCitation/CommentsCorrectionsList/CommentsCorrections

Although ‑first PMID will work in this particular case, the more general solution is to limit by subpath with the parent / child construct:

  xtract -pattern PubmedArticle -element MedlineCitation/PMID

That would work even if the order of objects were reversed.

Heterogeneous Data

XML objects can contain a heterogeneous mix of components. For example:

  efetch -db pubmed -id 21433338,17247418 -format xml

returns a mixture of book and journal records:

  <PubmedArticleSet>
<PubmedBookArticle>
<BookDocument>
...
</PubmedBookData>
</PubmedBookArticle>
<PubmedArticle>
<MedlineCitation>
...
</PubmedData>
</PubmedArticle>
</PubmedArticleSet>

The parent / star construct is used to visit the individual components, even though they may have different names. Piping the output to:

  xtract -pattern "PubmedArticleSet/*" -element "*"

separately prints the entirety of each XML component:

  <PubmedBookArticle><BookDocument> ... </PubmedBookData></PubmedBookArticle>
<PubmedArticle><MedlineCitation> ... </PubmedData></PubmedArticle>

Use of the parent / child construct can isolate objects of the same name that differ by their location in the XML hierarchy. For example:

  efetch -db pubmed -id 21433338,17247418 -format xml |
xtract -pattern "PubmedArticleSet/*" \
-group "BookDocument/AuthorList" -tab "\n" -element LastName \
-group "Book/AuthorList" -tab "\n" -element LastName \
-group "Article/AuthorList" -tab "\n" -element LastName

writes separate lines for book/chapter authors, book editors, and article authors:

  Fauci        Desrosiers
Coffin Hughes Varmus
Lederberg Cavalli Lederberg

Simply exploring with individual arguments:

  -group BookDocument -block AuthorList -element LastName

would visit the editors (at BookDocument/Book/AuthorList) as well as the authors (at BookDocument/AuthorList), and print names in order of appearance in the XML:

  Coffin    Hughes    Varmus    Fauci    Desrosiers

(In this particular example the book author lists could be distinguished by using ‑if "@Type" ‑equals authors or ‑if "@Type" ‑equals editors, but exploring by parent / child is a general position-based approach.)

Recursive Definitions

Certain XML objects returned by efetch are recursively defined, including Taxon in ‑db taxonomy and Gene-commentary in ‑db gene. Thus, they can contain nested objects with the same XML tag.

Retrieving a set of taxonomy records:

  efetch -db taxonomy -id 9606,7227 -format xml

produces XML with nested Taxon objects (marked below with line references) for each rank in the taxonomic lineage:

    <TaxaSet>
1     <Taxon>
        <TaxId>9606</TaxId>
        <ScientificName>Homo sapiens</ScientificName>
        ...
        <LineageEx>
2         <Taxon>
            <TaxId>131567</TaxId>
            <ScientificName>cellular organisms</ScientificName>
            <Rank>no rank</Rank>
3         </Taxon>
4         <Taxon>
            <TaxId>2759</TaxId>
            <ScientificName>Eukaryota</ScientificName>
            <Rank>superkingdom</Rank>
5         </Taxon>
          ...
        </LineageEx>
        ...
6     </Taxon>
7     <Taxon>
        <TaxId>7227</TaxId>
        <ScientificName>Drosophila melanogaster</ScientificName>
        ...
8     </Taxon>
    </TaxaSet>

Xtract tracks XML object nesting to determine that the <Taxon> start tag on line 1 is actually closed by the </Taxon> stop tag on line 6, and not by the first </Taxon> encountered on line 3.
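
The idea of matching a start tag to its proper stop tag can be sketched with a simple depth counter. This is only an illustration of the principle on a trimmed-down, one-tag-per-line sample. It is not how xtract itself is implemented, and the line numbers it reports refer to the sample below rather than to the listing above:

```shell
# Trimmed-down taxonomy XML, one tag per line (an assumption of this sketch).
xml='<TaxaSet>
<Taxon>
<TaxId>9606</TaxId>
<LineageEx>
<Taxon>
<TaxId>131567</TaxId>
</Taxon>
<Taxon>
<TaxId>2759</TaxId>
</Taxon>
</LineageEx>
</Taxon>
<Taxon>
<TaxId>7227</TaxId>
</Taxon>
</TaxaSet>'

# Raise the depth on each start tag and lower it on each stop tag.
# An outer Taxon is closed only when the depth returns to zero.
spans=$(printf '%s\n' "$xml" | awk '
  /^<Taxon>$/   { if (depth == 0) start = NR; depth++ }
  /^<\/Taxon>$/ { depth--; if (depth == 0) print start "-" NR }
')

# Prints the first and last line number of each outermost Taxon object.
printf '%s\n' "$spans"
```

The first outer object spans sample lines 2 through 12, skipping over the two nested stop tags at lines 7 and 10.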

When a recursive object is given to an exploration command, selection of data using the ‑element command:

  efetch -db taxonomy -id 9606,7227,10090 -format xml |
xtract -pattern Taxon \
-element TaxId ScientificName GenbankCommonName Division

does not examine fields in the internal objects, and returns information only for the main entries:

  9606     Homo sapiens               human          Primates
  7227     Drosophila melanogaster    fruit fly      Invertebrates
  10090    Mus musculus               house mouse    Rodents

The star / child construct will skip past the outer start tag:

  efetch -db taxonomy -id 9606,7227,10090 -format xml |
xtract -pattern Taxon -block "*/Taxon" \
-tab "\n" -element TaxId,ScientificName

to visit the next level of nested objects individually:

  131567    cellular organisms
2759 Eukaryota
33154 Opisthokonta
...

Recursive objects can be fully explored with a double star / child construct:

  esearch -db gene -query "DMD [GENE] AND human [ORGN]" |
efetch -format xml |
xtract -pattern Entrezgene -block "**/Gene-commentary" \
-tab "\n" -element Gene-commentary_type@value,Gene-commentary_accession

which visits every child object regardless of nesting depth:

  genomic    NC_000023
mRNA XM_006724469
peptide XP_006724532
mRNA XM_011545467
peptide XP_011543769
...

Repackaging XML Results

Splitting abstract paragraphs into individual words, while using XML reformatting commands:

  efetch -db pubmed -id 2539356 -format xml |
xtract -stops -rec Rec -pattern PubmedArticle \
-enc Paragraph -wrp Word -words AbstractText

generates:

  ...
<Paragraph>
<Word>the</Word>
<Word>tn3</Word>
<Word>transposon</Word>
<Word>inserts</Word>
...
<Word>was</Word>
<Word>necessary</Word>
<Word>for</Word>
<Word>immunity</Word>
</Paragraph>
...

with the words from each abstract instance encased in a separate parent object. Word counts for each paragraph could then be calculated by piping to:

  xtract -pattern Rec -block Paragraph -num Word

Divide and Conquer

Although xtract provides ‑element variants to do simple data manipulation, more complex tasks are sometimes best handled by being broken up into a series of simpler transformations.

Document summaries for two bacterial chromosomes:

  efetch -db nuccore -id U00096,CP002956 -format docsum |

contain several individual fields and a complex series of self-closing Stat objects:

  <DocumentSummary>
<Id>545778205</Id>
<Caption>U00096</Caption>
<Title>Escherichia coli str. K-12 substr. MG1655, complete genome</Title>
<CreateDate>1998/10/13</CreateDate>
<UpdateDate>2020/09/23</UpdateDate>
<TaxId>511145</TaxId>
<Slen>4641652</Slen>
<Biomol>genomic</Biomol>
<MolType>dna</MolType>
<Topology>circular</Topology>
<Genome>chromosome</Genome>
<Completeness>complete</Completeness>
<GeneticCode>11</GeneticCode>
<Organism>Escherichia coli str. K-12 substr. MG1655</Organism>
<Strain>K-12</Strain>
<BioSample>SAMN02604091</BioSample>
<Statistics>
<Stat type="Length" count="4641652"/>
<Stat type="all" count="9198"/>
<Stat type="cdregion" count="4302"/>
<Stat type="cdregion" subtype="CDS" count="4285"/>
<Stat type="cdregion" subtype="CDS/pseudo" count="17"/>
<Stat type="gene" count="4609"/>
<Stat type="gene" subtype="Gene" count="4464"/>
<Stat type="gene" subtype="Gene/pseudo" count="145"/>
<Stat type="rna" count="187"/>
<Stat type="rna" subtype="ncRNA" count="79"/>
<Stat type="rna" subtype="rRNA" count="22"/>
<Stat type="rna" subtype="tRNA" count="86"/>
<Stat source="all" type="Length" count="4641652"/>
<Stat source="all" type="all" count="13500"/>
<Stat source="all" type="cdregion" count="4302"/>
<Stat source="all" type="gene" count="4609"/>
<Stat source="all" type="prot" count="4302"/>
<Stat source="all" type="rna" count="187"/>
</Statistics>
<AccessionVersion>U00096.3</AccessionVersion>
</DocumentSummary>
<DocumentSummary>
<Id>342852136</Id>
<Caption>CP002956</Caption>
<Title>Yersinia pestis A1122, complete genome</Title>
...

which make extracting the single "best" value for gene count a non-trivial exercise.

In addition to repackaging commands that surround extracted values with XML tags, the ‑element "*" construct prints the entirety of the current scope, including its XML wrapper. Piping the document summaries to:

  xtract -set Set -rec Rec -pattern DocumentSummary \
-block DocumentSummary -pkg Common \
-wrp Accession -element AccessionVersion \
-wrp Organism -element Organism \
-wrp Length -element Slen \
-wrp Title -element Title \
-wrp Date -element CreateDate \
-wrp Biomol -element Biomol \
-wrp MolType -element MolType \
-block Stat -if @type -equals gene -pkg Gene -element "*" \
-block Stat -if @type -equals rna -pkg RNA -element "*" \
-block Stat -if @type -equals cdregion -pkg CDS -element "*" |

encloses several fields in a Common block, and packages statistics on gene, RNA, and coding region features into separate sections of a new XML object:

  ...
<Rec>
<Common>
<Accession>U00096.3</Accession>
<Organism>Escherichia coli str. K-12 substr. MG1655</Organism>
<Length>4641652</Length>
<Title>Escherichia coli str. K-12 substr. MG1655, complete genome</Title>
<Date>1998/10/13</Date>
<Biomol>genomic</Biomol>
<MolType>dna</MolType>
</Common>
<Gene>
<Stat type="gene" count="4609"/>
<Stat type="gene" subtype="Gene" count="4464"/>
<Stat type="gene" subtype="Gene/pseudo" count="145"/>
<Stat source="all" type="gene" count="4609"/>
</Gene>
<RNA>
<Stat type="rna" count="187"/>
<Stat type="rna" subtype="ncRNA" count="79"/>
<Stat type="rna" subtype="rRNA" count="22"/>
<Stat type="rna" subtype="tRNA" count="86"/>
<Stat source="all" type="rna" count="187"/>
</RNA>
<CDS>
<Stat type="cdregion" count="4302"/>
<Stat type="cdregion" subtype="CDS" count="4285"/>
<Stat type="cdregion" subtype="CDS/pseudo" count="17"/>
<Stat source="all" type="cdregion" count="4302"/>
</CDS>
</Rec>
...

With statistics from different types of feature now segregated in their own substructures, total counts for each can be extracted with the ‑first command:

  xtract -set Set -rec Rec -pattern Rec \
-block Common -element "*" \
-block Gene -wrp GeneCount -first Stat@count \
-block RNA -wrp RnaCount -first Stat@count \
-block CDS -wrp CDSCount -first Stat@count |

This rewraps the data into a third XML form containing specific feature counts:

  ...
<Rec>
<Common>
<Accession>U00096.3</Accession>
<Organism>Escherichia coli str. K-12 substr. MG1655</Organism>
<Length>4641652</Length>
<Title>Escherichia coli str. K-12 substr. MG1655, complete genome</Title>
<Date>1998/10/13</Date>
<Biomol>genomic</Biomol>
<MolType>dna</MolType>
</Common>
<GeneCount>4609</GeneCount>
<RnaCount>187</RnaCount>
<CDSCount>4302</CDSCount>
</Rec>
...

without requiring extraction commands for the individual elements in the Common block to be repeated at each step.

Assuming the contents are satisfactory, passing the last structured form to:

  xtract \
-head accession organism length gene_count rna_count \
-pattern Rec -def "-" \
-element Accession Organism Length GeneCount RnaCount

produces a tab-delimited table with the desired values:

  accession     organism                 length     gene_count    rna_count
  U00096.3      Escherichia coli ...     4641652    4609          187
  CP002956.1    Yersinia pestis A1122    4553770    4217          86

If a different order of fields is desired after the final xtract has been run, piping to:

  reorder-columns 1 3 5 4

will rearrange the output, including the column headings:

  accession     length     rna_count    gene_count
  U00096.3      4641652    187          4609
  CP002956.1    4553770    86           4217
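
The rearrangement itself is ordinary column selection on tab-delimited text. The function below, reorder_cols, is a hypothetical stand-in written for this sketch (it is not the EDirect script), applied to the table values shown above:

```shell
# Hypothetical helper: print the tab-delimited columns named by the
# arguments, in the order given. Not the actual EDirect implementation.
reorder_cols() {
  awk -F '\t' -v 'OFS=\t' -v order="$*" '
    BEGIN { n = split(order, col, " ") }
    {
      line = $(col[1])
      for (i = 2; i <= n; i++) line = line OFS $(col[i])
      print line
    }
  '
}

# Header plus the two rows from the table above.
table=$(printf '%s\t%s\t%s\t%s\t%s\n' \
  accession organism length gene_count rna_count \
  U00096.3 "Escherichia coli str. K-12 substr. MG1655" 4641652 4609 187 \
  CP002956.1 "Yersinia pestis A1122" 4553770 4217 86)

reordered=$(printf '%s\n' "$table" | reorder_cols 1 3 5 4)

printf '%s\n' "$reordered"
```

Because the heading line is just another tab-delimited row, it is rearranged along with the data.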

Sequence Records

NCBI Data Model for Sequence Records

The NCBI represents sequence records in a data model that is based on the central dogma of molecular biology. Sequences, including genomic DNA, messenger RNAs, and protein products, are "instantiated" with the actual sequence letters, and are assigned identifiers (e.g., accession numbers) for reference.

Each sequence can have multiple features, which contain information about the biology of a given region, including the transformations involved in gene expression. Each feature can have multiple qualifiers, which store specific details about that feature (e.g., name of the gene, genetic code used for translation, accession of the product sequence, cross-references to external databases).

A gene feature indicates the location of a heritable region of nucleic acid that confers a measurable phenotype. An mRNA feature on genomic DNA represents the exonic and untranslated regions of the message that remain after transcription and splicing. A coding region (CDS) feature has a product reference to the translated protein.

Since messenger RNA sequences are not always submitted with a genomic region, CDS features (which model the travel of ribosomes on transcript molecules) are traditionally annotated on the genomic sequence, with locations that encode the exonic intervals.

A qualifier can be dynamically generated from underlying data for the convenience of the user. Thus, the sequence of a mature peptide may be extracted from the mat_peptide feature's location on the precursor protein and displayed in a /peptide qualifier, even if a mature peptide is not instantiated.

Sequence Records in INSDSeq XML

Sequence records can be retrieved in an XML version of the GenBank or GenPept flatfile. The query:

  efetch -db protein -id 26418308,26418074 -format gpc

returns a set of INSDSeq objects:

  <INSDSet>
<INSDSeq>
<INSDSeq_locus>AAN78128</INSDSeq_locus>
<INSDSeq_length>17</INSDSeq_length>
<INSDSeq_moltype>AA</INSDSeq_moltype>
<INSDSeq_topology>linear</INSDSeq_topology>
<INSDSeq_division>INV</INSDSeq_division>
<INSDSeq_update-date>03-JAN-2003</INSDSeq_update-date>
<INSDSeq_create-date>10-DEC-2002</INSDSeq_create-date>
<INSDSeq_definition>alpha-conotoxin ImI precursor, partial [Conus
imperialis]</INSDSeq_definition>
<INSDSeq_primary-accession>AAN78128</INSDSeq_primary-accession>
<INSDSeq_accession-version>AAN78128.1</INSDSeq_accession-version>
<INSDSeq_other-seqids>
<INSDSeqid>gb|AAN78128.1|</INSDSeqid>
<INSDSeqid>gi|26418308</INSDSeqid>
</INSDSeq_other-seqids>
<INSDSeq_source>Conus imperialis</INSDSeq_source>
<INSDSeq_organism>Conus imperialis</INSDSeq_organism>
<INSDSeq_taxonomy>Eukaryota; Metazoa; Lophotrochozoa; Mollusca;
Gastropoda; Caenogastropoda; Hypsogastropoda; Neogastropoda;
Conoidea; Conidae; Conus</INSDSeq_taxonomy>
<INSDSeq_references>
<INSDReference>
...

Biological features and qualifiers (shown here in GenPept format):

  FEATURES             Location/Qualifiers
       source          1..17
                       /organism="Conus imperialis"
                       /db_xref="taxon:35631"
                       /country="Philippines"
       Protein         <1..17
                       /product="alpha-conotoxin ImI precursor"
       mat_peptide     5..16
                       /product="alpha-conotoxin ImI"
                       /note="the C-terminal glycine of the precursor is post
                       translationally removed"
                       /calculated_mol_wt=1357
                       /peptide="GCCSDPRCAWRC"
       CDS             1..17
                       /coded_by="AY159318.1:<1..54"
                       /note="nAChR antagonist"
are presented in INSDSeq XML as structured objects:

  ...
<INSDFeature>
<INSDFeature_key>mat_peptide</INSDFeature_key>
<INSDFeature_location>5..16</INSDFeature_location>
<INSDFeature_intervals>
<INSDInterval>
<INSDInterval_from>5</INSDInterval_from>
<INSDInterval_to>16</INSDInterval_to>
<INSDInterval_accession>AAN78128.1</INSDInterval_accession>
</INSDInterval>
</INSDFeature_intervals>
<INSDFeature_quals>
<INSDQualifier>
<INSDQualifier_name>product</INSDQualifier_name>
<INSDQualifier_value>alpha-conotoxin ImI</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>note</INSDQualifier_name>
<INSDQualifier_value>the C-terminal glycine of the precursor is
post translationally removed</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>calculated_mol_wt</INSDQualifier_name>
<INSDQualifier_value>1357</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>peptide</INSDQualifier_name>
<INSDQualifier_value>GCCSDPRCAWRC</INSDQualifier_value>
</INSDQualifier>
</INSDFeature_quals>
</INSDFeature>
...

The data hierarchy is explored using a ‑pattern {sequence} ‑group {feature} ‑block {qualifier} construct. However, feature and qualifier names are indicated in data values, not XML element tags, and require ‑if and ‑equals to select the desired object and content.

Generating Qualifier Extraction Commands

As a convenience, the xtract ‑insd helper function generates the appropriate nested extraction commands from feature and qualifier names on the command line.

Running xtract ‑insd in an isolated command prints a new xtract statement that can then be copied, edited if necessary, and pasted into other queries. Running the ‑insd command within a multi-step pipe dynamically executes the automatically-constructed query.

Providing an optional (complete/partial) location indication, a feature key, and then one or more qualifier names:

  xtract -insd complete mat_peptide product peptide

creates a new xtract statement that will produce a table of qualifier values from mature peptide features with complete locations. The statement starts with instructions to record the accession and find features of the indicated type:

  xtract -pattern INSDSeq -ACCN INSDSeq_accession-version -SEQ INSDSeq_sequence \
-group INSDFeature -if INSDFeature_key -equals mat_peptide \
-unless INSDFeature_partial5 -or INSDFeature_partial3 \
-clr -pfx "\n" -element "&ACCN" \

Each qualifier then generates custom extraction code that is appended to the growing query. For example:

  -block INSDQualifier \
-if INSDQualifier_name -equals product \
-element INSDQualifier_value

Incorporating the xtract ‑insd command in a search on cone snail venom:

  esearch -db pubmed -query "conotoxin" |
elink -target protein |
efilter -query "mat_peptide [FKEY]" |
efetch -format gpc |
xtract -insd complete mat_peptide "%peptide" product mol_wt peptide

prints the accession number, peptide length, product name, calculated molecular weight, and sequence for a sample of neurotoxic peptides:

  AAN78128.1    12    alpha-conotoxin ImI    1357    GCCSDPRCAWRC
ADB65789.1 20 conotoxin Cal 16 2134 LEMQGCVCNANAKFCCGEGR
ADB65788.1 20 conotoxin Cal 16 2134 LEMQGCVCNANAKFCCGEGR
AGO59814.1 32 del13b conotoxin 3462 DCPTSCPTTCANGWECCKGYPCVRQHCSGCNH
AAO33169.1 16 alpha-conotoxin GIC 1615 GCCSHPACAGNNQHIC
AAN78279.1 21 conotoxin Vx-II 2252 WIDPSHYCCCGGGCTDDCVNC
AAF23167.1 31 BeTX toxin 3433 CRAEGTYCENDSQCCLNECCWGGCGHPCRHP
ABW16858.1 15 marmophin 1915 DWEYHAHPKPNSFWT
...

Piping the results to a series of Unix commands:

  grep -i conotoxin |
awk -F '\t' -v 'OFS=\t' '{if ( 10 <= $2 && $2 <= 30 ) print}' |
sort -t $'\t' -u -k 3,5 |
sort -t $'\t' -k 2,2n -k 3,3f |
cut -f 1,3- |
column -s $'\t' -t

filters by product name, limits the results to a specified range of peptide lengths, removes redundant accessions, sorts the table by peptide length, deletes the length column, and aligns the columns for cleaner printing:

  AAN78128.1  alpha-conotoxin ImI            1357  GCCSDPRCAWRC
AAN78127.1 alpha-conotoxin ImII 1515 ACCSDRRCRWRC
ADB43130.1 conotoxin Cal 1a 1750 KCCKRHHGCHPCGRK
ADB43131.1 conotoxin Cal 1b 1708 LCCKRHHGCHPCGRT
AAO33169.1 alpha-conotoxin GIC 1615 GCCSHPACAGNNQHIC
ADB43128.1 conotoxin Cal 5.1 1829 DPAPCCQHPIETCCRR
AAD31913.1 alpha A conotoxin Tx2 2010 PECCSHPACNVDHPEICR
ADB43129.1 conotoxin Cal 5.2 2008 MIQRSQCCAVKKNCCHVG
ADD97803.1 conotoxin Cal 1.2 2206 AGCCPTIMYKTGACRTNRCR
ADB65789.1 conotoxin Cal 16 2134 LEMQGCVCNANAKFCCGEGR
AAD31912.1 alpha A conotoxin Tx1 2304 PECCSDPRCNSSHPELCGGRR
AAN78279.1 conotoxin Vx-II 2252 WIDPSHYCCCGGGCTDDCVNC
ADB43125.1 conotoxin Cal 14.2 2157 GCPADCPNTCDSSNKCSPGFPG
ADD97802.1 conotoxin Cal 6.4 2514 GCWLCLGPNACCRGSVCHDYCPR
CAH64846.1 four-loop conotoxin 2419 CRPSGSPCGVTSICCGRCSRGKCT
AAD31915.1 O-superfamily conotoxin TxO2 2565 CYDSGTSCNTGNQCCSGWCIFVCL
AAD31916.1 O-superfamily conotoxin TxO3 2555 CYDGGTSCDSGIQCCSGWCIFVCF
AAD31920.1 omega conotoxin SVIA mutant 1 2495 CRPSGSPCGVTSICCGRCYRGKCT
AAD31921.1 omega conotoxin SVIA mutant 2 2419 CRPSGSPCGVTSICCGRCSRGKCT
ABE27010.1 conotoxin fe14.1 2732 SPGSTICKMACRTGNGHKYPFCNCR
ABE27011.1 conotoxin fe14.2 2697 SSGSTVCKMMCRLGYGHLYPSCGCR
ABE27007.1 conotoxin p114.1 2645 GPGSAICNMACRLGQGHMYPFCNCN
ABE27008.1 conotoxin p114.2 2773 GPGSAICNMACRLEHGHLYPFCHCR
ABE27009.1 conotoxin p114.3 2709 GPGSAICNMACRLEHGHLYPFCNCD
...
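
The post-processing chain can be tried offline on the eight sample rows printed earlier, which helps when adjusting the filters. This sketch uses the portable "$(printf '\t')" form for the tab delimiter and leaves out the final column alignment step, so its output stays tab-delimited:

```shell
TAB=$(printf '\t')

# Eight rows copied from the sample xtract -insd output:
# accession, peptide length, product, molecular weight, sequence.
rows=$(printf '%s\t%s\t%s\t%s\t%s\n' \
  AAN78128.1 12 "alpha-conotoxin ImI" 1357 GCCSDPRCAWRC \
  ADB65789.1 20 "conotoxin Cal 16" 2134 LEMQGCVCNANAKFCCGEGR \
  ADB65788.1 20 "conotoxin Cal 16" 2134 LEMQGCVCNANAKFCCGEGR \
  AGO59814.1 32 "del13b conotoxin" 3462 DCPTSCPTTCANGWECCKGYPCVRQHCSGCNH \
  AAO33169.1 16 "alpha-conotoxin GIC" 1615 GCCSHPACAGNNQHIC \
  AAN78279.1 21 "conotoxin Vx-II" 2252 WIDPSHYCCCGGGCTDDCVNC \
  AAF23167.1 31 "BeTX toxin" 3433 CRAEGTYCENDSQCCLNECCWGGCGHPCRHP \
  ABW16858.1 15 marmophin 1915 DWEYHAHPKPNSFWT)

filtered=$(printf '%s\n' "$rows" |
  grep -i conotoxin |
  awk -F '\t' -v 'OFS=\t' '{ if (10 <= $2 && $2 <= 30) print }' |
  sort -t "$TAB" -u -k 3,5 |
  sort -t "$TAB" -k 2,2n -k 3,3f |
  cut -f 1,3-)

printf '%s\n' "$filtered"
```

Of the eight rows, two are dropped by the name filter, one by the length range, and one as a duplicate of an identical product, weight, and sequence, leaving four rows ordered by peptide length.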

For records where a particular qualifier is missing:

  esearch -db protein -query "RAG1 [GENE] AND Mus musculus [ORGN]" |
efetch -format gpc |
xtract -insd source organism strain |
sort -t $'\t' -u -k 2,3

a dash is inserted as a placeholder:

  P15919.2          Mus musculus               -
  AAO61776.1        Mus musculus               129/Sv
  NP_033045.2       Mus musculus               C57BL/6
  XP_006499075.1    Mus musculus               C57BL/6J
  EDL27655.1        Mus musculus               mixed
  BAD69530.1        Mus musculus castaneus     -
  BAD69531.1        Mus musculus domesticus    BALB/c
  BAD69532.1        Mus musculus molossinus    MOA

Sequence Coordinates

Gene Positions

An understanding of sequence coordinate conventions is necessary in order to use gene positions to retrieve the corresponding chromosome subregion with efetch or with the UCSC browser.

Sequence records displayed in GenBank or GenPept formats use a "one-based" coordinate system, with sequence position numbers starting at "1":

      1 catgccattc gttgagttgg aaacaaactt gccggctagc cgcatacccg cggggctgga
     61 gaaccggctg tgtgcggcca cagccaccat cctggacaaa cccgaagacg tgagtgaggg
    121 tcggcgagaa cttgtgggct agggtcggac ctcccaatga cccgttccca tccccaggga
    181 ccccactccc ctggtaacct ctgaccttcc gtgtcctatc ctcccttcct agatcccttc
  ...

Under this convention, positions refer to the sequence letters themselves:

  C   A   T   G   C   C   A   T   T   C
  1   2   3   4   5   6   7   8   9   10

and the position of the last base or residue is equal to the length of the sequence. The ATG initiation codon above is at positions 2 through 4, inclusive.

For computer programs, however, using "zero-based" coordinates can simplify the arithmetic used for calculations on sequence positions. The ATG codon in the 0-based representation is at positions 1 through 3. (The UCSC browser uses a hybrid, half-open representation, where the start position is 0-based and the stop position is 1-based.)
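
The three conventions can be compared on the ten bases shown above. In this sketch, cut and awk's substr() both count from 1, so the 0-based and half-open forms are converted by simple arithmetic. The variable names are illustrative only:

```shell
bases="CATGCCATTC"   # the ten bases from the example above

# 1-based, inclusive (2 through 4): cut -c counts characters from 1.
one_based=$(printf '%s\n' "$bases" | cut -c 2-4)

# 0-based, inclusive (1 through 3): add 1 to the start for substr(),
# and the length is to - from + 1.
zero_based=$(printf '%s\n' "$bases" |
  awk -v from=1 -v to=3 '{ print substr($0, from + 1, to - from + 1) }')

# Half-open (0-based start 1, 1-based stop 4): the length is to - from.
half_open=$(printf '%s\n' "$bases" |
  awk -v from=1 -v to=4 '{ print substr($0, from + 1, to - from) }')

# All three select the same codon.
printf '%s %s %s\n' "$one_based" "$zero_based" "$half_open"
```

In the half-open form the length of the interval is just stop minus start, which is one reason that representation is convenient for range arithmetic.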

Software at NCBI will typically convert positions to 0-based coordinates upon input, perform whatever calculations are desired, and then convert the results to a 1-based representation for display. These transformations are done by simply subtracting 1 from the 1-based value or adding 1 to the 0-based value.

Coordinate Conversions

Retrieving the docsum for a particular gene:

  esearch -db gene -query "BRCA2 [GENE] AND human [ORGN]" |
efetch -format docsum

returns the chromosomal position of that gene in "zero-based" coordinates:

  ...
<GenomicInfoType>
<ChrLoc>13</ChrLoc>
<ChrAccVer>NC_000013.11</ChrAccVer>
<ChrStart>32315479</ChrStart>
<ChrStop>32399671</ChrStop>
<ExonCount>27</ExonCount>
</GenomicInfoType>
...

Piping the document summary to an xtract command using ‑element:

  xtract -pattern GenomicInfoType -element ChrAccVer ChrStart ChrStop

obtains the accession and 0-based coordinate values:

  NC_000013.11    32315479    32399671

Efetch has ‑seq_start and ‑seq_stop arguments to retrieve a gene segment, but these expect the sequence subrange to be in 1-based coordinates.

To address this problem, two additional efetch arguments, ‑chr_start and ‑chr_stop, were created to allow direct use of the 0-based coordinates:

  efetch -db nuccore -format gb -id NC_000013.11 \
-chr_start 32315479 -chr_stop 32399671

Xtract now has numeric extraction commands to assist with coordinate conversion. Selecting fields with an ‑inc argument:

  xtract -pattern GenomicInfoType -element ChrAccVer -inc ChrStart ChrStop

obtains the accession and 0-based coordinates, then increments the positions to produce 1-based values:

  NC_000013.11    32315480    32399672

EDirect knows the policies for sequence positions in all relevant Entrez databases (e.g., gene, snp, dbvar), and provides additional shortcuts for converting these to other conventions. For example:

  xtract -pattern GenomicInfoType -element ChrAccVer -1-based ChrStart ChrStop

understands that gene docsum ChrStart and ChrStop fields are 0-based, sees that the desired output is 1-based, and translates the command to convert coordinates internally using the ‑inc logic. Similarly:

  -element ChrAccVer -ucsc-based ChrStart ChrStop

leaves the 0-based start value unchanged but increments the original stop value to produce the half-open form that can be passed to the UCSC browser:

  NC_000013.11    32315479    32399672
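The arithmetic behind these conversions can be reproduced with standard tools. The following is a minimal sketch, not part of EDirect: the function names to_one_based and to_half_open are hypothetical, and the half-open conversion assumes a plus-strand interval (start no greater than stop).

```shell
# Convert tab-delimited "accession start stop" rows with 0-based coordinates.
# Hypothetical helpers for illustration; EDirect does this inside xtract.

# 0-based closed -> 1-based closed: increment both positions (like -inc)
to_one_based() {
  awk -F '\t' -v OFS='\t' '{ print $1, $2 + 1, $3 + 1 }'
}

# 0-based closed -> 0-based half-open: increment only the stop position
to_half_open() {
  awk -F '\t' -v OFS='\t' '{ print $1, $2, $3 + 1 }'
}

printf 'NC_000013.11\t32315479\t32399671\n' | to_one_based
printf 'NC_000013.11\t32315479\t32399671\n' | to_half_open
```

The two calls print the same values as the ‑inc and ‑ucsc-based examples above.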

Gene Records

Genes in a Region

To list all genes between two markers flanking the human X chromosome centromere, first retrieve the protein-coding gene records on that chromosome:

  esearch -db gene -query "Homo sapiens [ORGN] AND X [CHR]" |
efilter -status alive -type coding | efetch -format docsum |

Gene names and chromosomal positions are extracted by piping the records to:

  xtract -pattern DocumentSummary -NAME Name -DESC Description \
-block GenomicInfoType -if ChrLoc -equals X \
-min ChrStart,ChrStop -element "&NAME" "&DESC" |

Exploring each GenomicInfoType is needed because of pseudoautosomal regions at the ends of the X and Y chromosomes:

  ...
<GenomicInfo>
<GenomicInfoType>
<ChrLoc>X</ChrLoc>
<ChrAccVer>NC_000023.11</ChrAccVer>
<ChrStart>155997630</ChrStart>
<ChrStop>156013016</ChrStop>
<ExonCount>14</ExonCount>
</GenomicInfoType>
<GenomicInfoType>
<ChrLoc>Y</ChrLoc>
<ChrAccVer>NC_000024.10</ChrAccVer>
<ChrStart>57184150</ChrStart>
<ChrStop>57199536</ChrStop>
<ExonCount>14</ExonCount>
</GenomicInfoType>
</GenomicInfo>
...

Without limiting to chromosome X, the copy of IL9R near the "q" telomere of chromosome Y would be erroneously placed with genes that are near the X chromosome centromere, shown here in between SPIN2A and ZXDB:

  ...
57121860 FAAH2 fatty acid amide hydrolase 2
57133042 SPIN2A spindlin family member 2A
57184150 IL9R interleukin 9 receptor
57592010 ZXDB zinc finger X-linked duplicated B
...

With genes restricted to the X chromosome, results can be sorted by position, and then filtered and partitioned:

  sort -k 1,1n | cut -f 2- |
grep -v pseudogene | grep -v uncharacterized |
between-two-genes AMER1 FAAH2

to produce a table of known genes located between the two markers:

  FAAH2      fatty acid amide hydrolase 2
SPIN2A spindlin family member 2A
ZXDB zinc finger X-linked duplicated B
NLRP2B NLR family pyrin domain containing 2B
ZXDA zinc finger X-linked duplicated A
SPIN4 spindlin family member 4
ARHGEF9 Cdc42 guanine nucleotide exchange factor 9
AMER1 APC membrane recruitment protein 1
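The partitioning step can be pictured with a few lines of awk. This is only a sketch of the behavior seen in the output above (rows are kept from whichever marker appears first through the other marker, inclusive); the function name between_two_markers is hypothetical, it is not the between-two-genes script itself, and the first and last input rows are placeholders.

```shell
# Hypothetical stand-in for between-two-genes: print rows from whichever
# marker gene appears first through the other marker, inclusive.
between_two_markers() {
  awk -F '\t' -v a="$1" -v b="$2" '
    $1 == a || $1 == b { seen++ }   # count marker rows as they go by
    seen > 0 { print }              # inside the interval: emit the row
    seen > 1 { exit }               # second marker printed: stop reading
  '
}

printf '%s\t%s\n' \
  GENE_BEFORE "placeholder row before the first marker" \
  FAAH2       "fatty acid amide hydrolase 2" \
  SPIN2A      "spindlin family member 2A" \
  AMER1       "APC membrane recruitment protein 1" \
  GENE_AFTER  "placeholder row after the second marker" |
between_two_markers AMER1 FAAH2
```

Because the input is already sorted by position, the order in which the two marker names are given does not matter.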

Genes in a Pathway

A gene can be linked to the biochemical pathways in which it participates:

  esearch -db gene -query "PAH [GENE]" -organism human |
elink -target biosystems |
efilter -pathway wikipathways |

Linking from a pathway record back to the gene database:

  elink -target gene |
efetch -format docsum |
xtract -pattern DocumentSummary -element Name Description |
grep -v pseudogene | grep -v uncharacterized | sort -f

returns the set of all genes known to be involved in the pathway:

  AANAT    aralkylamine N-acetyltransferase
ACADM acyl-CoA dehydrogenase medium chain
ACHE acetylcholinesterase (Cartwright blood group)
...

Gene Sequence

Genes encoded on the minus strand of a sequence:

  esearch -db gene -query "DDT [GENE] AND mouse [ORGN]" |
efetch -format docsum |
xtract -pattern GenomicInfoType -element ChrAccVer ChrStart ChrStop |

have coordinates ("zero-based" in docsums) where the start position is greater than the stop:

  NC_000076.6    75773373    75771232

These values can be read into Unix variables by a "while" loop:

  while IFS=$'\t' read acn str stp
do
efetch -db nuccore -format gb \
-id "$acn" -chr_start "$str" -chr_stop "$stp"
done

The variables can then be used to obtain the reverse-complemented subregion in GenBank format:

  LOCUS       NC_000076               2142 bp    DNA     linear   CON 08-AUG-2019
DEFINITION Mus musculus strain C57BL/6J chromosome 10, GRCm38.p6 C57BL/6J.
ACCESSION NC_000076 REGION: complement(75771233..75773374)
...
gene 1..2142
/gene="Ddt"
mRNA join(1..159,462..637,1869..2142)
/gene="Ddt"
/product="D-dopachrome tautomerase"
/transcript_id="NM_010027.1"
CDS join(52..159,462..637,1869..1941)
/gene="Ddt"
/codon_start=1
/product="D-dopachrome decarboxylase"
/protein_id="NP_034157.1"
/translation="MPFVELETNLPASRIPAGLENRLCAATATILDKPEDRVSVTIRP
GMTLLMNKSTEPCAHLLVSSIGVVGTAEQNRTHSASFFKFLTEELSLDQDRIVIRFFP
...

The reverse complement of a plus-strand sequence range can be selected with efetch ‑revcomp.
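The relationship between the docsum coordinates and the REGION line in the GenBank record can be checked by hand. The sketch below uses the same "while" loop pattern to report the strand, the 1-based range, and the length of a region; the function name describe_region is hypothetical and the code is an illustration, not something EDirect provides.

```shell
# Read tab-delimited "accession start stop" rows (0-based, as in gene docsums)
# and report the strand, the 1-based range, and the length of each region.
describe_region() {
  while IFS="$(printf '\t')" read -r acn str stp
  do
    if [ "$str" -gt "$stp" ]
    then
      # start greater than stop: the gene is on the minus strand
      strand="minus"; lo=$((stp + 1)); hi=$((str + 1))
    else
      strand="plus"; lo=$((str + 1)); hi=$((stp + 1))
    fi
    printf '%s\t%s\t%s..%s\t%s\n' "$acn" "$strand" "$lo" "$hi" "$((hi - lo + 1))"
  done
}

printf 'NC_000076.6\t75773373\t75771232\n' | describe_region
```

For the mouse Ddt example this reports the minus strand, the range 75771233..75773374, and a length of 2142, matching the LOCUS and ACCESSION lines shown above.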

External Data

Querying External Services

The nquire program obtains data from CGI or FTP servers, with queries built up from command-line arguments. Paths can be separated into components, which are combined with slashes. Remaining arguments are tag/value pairs, with multiple values between tags combined with commas.

For example, a POST request:

  nquire -url http://w1.weather.gov/xml/current_obs/KSFO.xml |
xtract -pattern current_observation -tab "\n" \
-element weather temp_f wind_dir wind_mph

returns the current weather report at the San Francisco airport:

  A Few Clouds
54.0
Southeast
5.8

and a GET query:

  nquire -get http://collections.mnh.si.edu/services/resolver/resolver.php \
-voucher "Birds:321082" |
xtract -pattern Result -tab "\n" -element ScientificName StateProvince Country

returns information on a ruby-throated hummingbird specimen:

  Archilochus colubris
Maryland
United States

while an FTP request:

  nquire -ftp ftp.ncbi.nlm.nih.gov pub/gdp ideogram_9606_GCF_000001305.14_850_V1 |
grep acen | cut -f 1,2,6,7 | awk '/^X\t/'

returns data with the (estimated) sequence coordinates of the human X chromosome centromere (here showing where the p and q arms meet):

  X    p    58100001    61000000
X q 61000001 63800000

Nquire can also produce a list of files in an ftp server directory:

  nquire -lst ftp://nlmpubs.nlm.nih.gov online/mesh/MESH_FILES/xmlmesh

and download ftp files to the local disk:

  nquire -dwn ftp.nlm.nih.gov online/mesh/MESH_FILES/xmlmesh desc2021.zip

XML Namespaces

Namespace prefixes are followed by a colon, while a leading colon matches any prefix:

  nquire -url http://webservice.wikipathways.org getPathway -pwId WP455 |
xtract -pattern "ns1:getPathwayResponse" -element ":gpml" |
transmute -decode64 |

The embedded Graphical Pathway Markup Language object can then be processed:

  xtract -pattern Pathway -block Xref \
-if @Database -equals "Entrez Gene" \
-tab "\n" -element @ID

JSON Arrays

Consolidated gene information for human β-globin retrieved from a curated biological database service developed at the Scripps Research Institute:

  nquire -get http://mygene.info/v3 gene 3043 |

contains a multi-dimensional array of exon coordinates in JavaScript Object Notation (JSON) format:

  "position": [
[
5225463,
5225726
],
[
5226576,
5226799
],
[
5226929,
5227071
]
],

This can be converted to XML with transmute ‑j2x:

  transmute -j2x -set - -rec GeneRec -nest plural |

using "‑nest plural" to derive a parent name that keeps the original structure intact in the XML:

  <positions>
<position>5225463</position>
<position>5225726</position>
</positions>
...

Individual exons can then be visited by piping the record through:

  xtract -pattern GeneRec -group exons \
-block positions -pfc "\n" -element position

to print a tab-delimited table of start and stop positions:

  5225463    5225726
5226576 5226799
5226929 5227071

JSON Mixtures

A query for the human green-sensitive opsin gene:

  nquire -get http://mygene.info/v3/gene/2652 |
transmute -j2x -set - -rec GeneRec |

returns data containing a heterogeneous mixture of objects in the pathway section:

  <pathway>
<reactome>
<id>R-HSA-162582</id>
<name>Signal Transduction</name>
</reactome>
...
<wikipathways>
<id>WP455</id>
<name>GPCRs, Class A Rhodopsin-like</name>
</wikipathways>
</pathway>

The parent / star construct is used to visit the individual components of a parent object without needing to explicitly specify their names. For printing, the name of a child object is indicated by a question mark:

  xtract -pattern GeneRec -group "pathway/*" \
-pfc "\n" -element "?,name,id"

This displays a table of pathway database references:

  reactome        Signal Transduction                R-HSA-162582
reactome Disease R-HSA-1643685
...
reactome Diseases of the neuronal system R-HSA-9675143
wikipathways GPCRs, Class A Rhodopsin-like WP455

Xtract ‑path can explore using multi-level object addresses, delimited by periods or slashes:

  xtract -pattern GeneRec -path pathway.wikipathways.id -tab "\n" -element id

Conversion of ASN.1

Similarly to ‑j2x, transmute ‑a2x will convert Abstract Syntax Notation 1 (ASN.1) text files to XML.

Tables to XML

Tab-delimited files are easily converted to XML with transmute ‑t2x:

  nquire -ftp ftp.ncbi.nlm.nih.gov gene/DATA gene_info.gz |
gunzip -c | grep -v NEWENTRY | cut -f 2,3 |
transmute -t2x -set Set -rec Rec -skip 1 Code Name

This takes a series of command-line arguments with tag names for wrapping the individual columns, and skips the first line of input, which contains header information, to generate a new XML file:

  ...
<Rec>
<Code>1246500</Code>
<Name>repA1</Name>
</Rec>
<Rec>
<Code>1246501</Code>
<Name>repA2</Name>
</Rec>
...
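To make the column-wrapping idea concrete, here is a rough awk sketch of the same kind of transformation. It is an illustration only: the function name table_to_xml is hypothetical, the Set and Rec wrapper names are hard-coded, and the actual transmute implementation may differ in details such as escaping and indentation.

```shell
# Hypothetical helper: wrap tab-delimited columns in XML tags.
# Usage: table_to_xml SKIP TAG1 TAG2 ...   (SKIP = number of header lines)
table_to_xml() {
  skip="$1"; shift
  awk -F '\t' -v skip="$skip" -v tags="$*" '
    BEGIN { n = split(tags, tag, " "); print "<Set>" }
    NR <= skip { next }               # ignore header lines
    {
      print "  <Rec>"
      for (i = 1; i <= n && i <= NF; i++) {
        val = $i
        # escape characters that are special in XML content
        gsub(/&/, "\\&amp;", val); gsub(/</, "\\&lt;", val); gsub(/>/, "\\&gt;", val)
        printf "    <%s>%s</%s>\n", tag[i], val, tag[i]
      }
      print "  </Rec>"
    }
    END { print "</Set>" }
  '
}

printf 'GeneID\tSymbol\n1246500\trepA1\n1246501\trepA2\n' |
table_to_xml 1 Code Name
```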

The transmute ‑t2x ‑header argument will obtain tag names from the first line of the file:

  nquire -ftp ftp.ncbi.nlm.nih.gov gene/DATA gene_info.gz |
gunzip -c | grep -v NEWENTRY | cut -f 2,3 |
transmute -t2x -set Set -rec Rec -header

CSV to XML

Similarly to ‑t2x, transmute ‑c2x will convert comma-separated values (CSV) files to XML.

GenBank Download

The entire set of GenBank format release files can be downloaded with:

  fls=$( nquire -lst ftp.ncbi.nlm.nih.gov genbank )
for div in \
bct con env est gss htc htg inv mam pat \
phg pln pri rod sts syn tsa una vrl vrt
do
echo "$fls" |
grep ".seq.gz" | grep "gb${div}" |
sort -V | skip-if-file-exists |
nquire -asp ftp.ncbi.nlm.nih.gov genbank
done

Unwanted divisions can be removed from the "for" loop to limit retrieval to specific sequencing classes or taxonomic regions.

For systems with Aspera Connect installed, the nquire ‑asp command will provide faster retrieval from NCBI. Otherwise it defaults to the ‑dwn logic.
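The filtering step in the middle of the loop is what makes the download restartable. A minimal sketch of that idea, assuming the filter simply drops names that already exist in the current directory, is shown below; skip_existing is a hypothetical stand-in, not the skip-if-file-exists script shipped with EDirect.

```shell
# Hypothetical stand-in for skip-if-file-exists: pass through only those
# file names that are not already present in the current directory.
skip_existing() {
  while IFS= read -r fl
  do
    [ -f "$fl" ] || printf '%s\n' "$fl"
  done
}

# Demonstration in a scratch directory with one file already "downloaded"
demo_dir=$(mktemp -d)
(
  cd "$demo_dir" || exit 1
  : > gbbct1.seq.gz
  printf '%s\n' gbbct1.seq.gz gbbct2.seq.gz gbbct3.seq.gz | skip_existing
)
rm -rf "$demo_dir"
```

Only gbbct2.seq.gz and gbbct3.seq.gz are passed on, so an interrupted retrieval can be rerun without fetching completed files again.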

GenPept to XML

The latest GenPept incremental update file can be parsed into INSDSeq XML with transmute ‑g2x:

  nquire -ftp ftp.ncbi.nlm.nih.gov genbank daily-nc Last.File |
sed "s/flat/gnp/g" |
nquire -ftp ftp.ncbi.nlm.nih.gov genbank daily-nc |
gunzip -c | transmute -g2x |

Records can then be filtered by organism name or taxon identifier with xtract ‑select:

  xtract -pattern INSDSeq -select INSDQualifier_value -equals "taxon:2697049" |

and used to generate the sequences of individual mature peptides derived from polyproteins:

  xtract -insd mat_peptide product sub_sequence

Local PubMed Cache

Fetching data from Entrez works well when a few thousand records are needed, but it does not scale for much larger sets of data, where the time it takes to download becomes a limiting factor.

Random Access Archive

EDirect can now preload over 30 million live PubMed records onto an inexpensive external 500 GB solid state drive for rapid retrieval.

For example, PMID 12345678 would be stored (as a compressed XML file) at:

  /Archive/12/34/56/12345678.xml.gz

using a hierarchy of folders to organize the data for random access to any record.
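The mapping from identifier to folder path can be sketched in a few lines. This is an illustration under a stated assumption: identifiers shorter than eight digits are zero-padded before being split into two-digit folder names, which the example above does not confirm. The function name archive_path is hypothetical.

```shell
# Hypothetical helper: map a PMID to its location in the archive hierarchy.
# Assumption: identifiers are zero-padded to eight digits before splitting.
archive_path() {
  printf '%s\n' "$1" | awk '{
    id = sprintf("%08d", $1)
    printf "/Archive/%s/%s/%s/%s.xml.gz\n",
      substr(id, 1, 2), substr(id, 3, 2), substr(id, 5, 2), id
  }'
}

archive_path 12345678
```

Since the path is computed directly from the identifier, no lookup table or database is needed to locate a record.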

Set an environment variable in your configuration file to reference your external drive:

  export EDIRECT_PUBMED_MASTER=/Volumes/external_disk_name_goes_here

and run archive-pubmed to download the PubMed release files and distribute each record on the drive. This process will take several hours to complete, but subsequent updates are incremental, and should finish in minutes.

The local archive is a completely self-contained turnkey system, with no need for the user to download and configure complicated third-party database software.

Retrieving over 120,000 compressed PubMed records from the local archive:

  esearch -db pubmed -query "PNAS [JOUR]" -pub abstract |
efetch -format uid | stream-pubmed | gunzip -c |

takes about 15 seconds. Retrieving those records from NCBI's network service, with efetch -format xml, would take around 40 minutes.

Even modest sets of PubMed query results can benefit from using the local cache. A reverse citation lookup on 191 papers:

  esearch -db pubmed -query "Cozzarelli NR [AUTH]" | elink -cited |

requires 7 seconds to match 7485 subsequent articles. Fetching them from the local archive:

  efetch -format uid | fetch-pubmed |

is practically instantaneous. Printing the names of all authors in those records:

  xtract -pattern PubmedArticle -block Author \
-sep " " -tab "\n" -element LastName,Initials |

allows creation of a frequency table:

  sort-uniq-count-rank

that lists the authors who most often cited the original papers:

  113    Cozzarelli NR
76 Maxwell A
58 Wang JC
52 Osheroff N
52 Stasiak A
...

Fetching from the network service would extend the 7 second running time to over 2 minutes.
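A frequency table of this kind can be approximated with standard Unix filters. The sketch below is not the sort-uniq-count-rank script itself; count_and_rank is a hypothetical name, and details such as case handling and tie-breaking are assumptions.

```shell
# Rough equivalent of a frequency table built from standard utilities.
# Output is "count<TAB>value", most frequent first.
count_and_rank() {
  sort |
  uniq -c |
  awk '{ n = $1; sub(/^[ \t]*[0-9]+[ \t]/, ""); print n "\t" $0 }' |
  sort -t "$(printf '\t')" -k 1,1nr -k 2,2
}

printf '%s\n' "Wang JC" "Maxwell A" "Wang JC" "Cozzarelli NR" \
  "Maxwell A" "Wang JC" |
count_and_rank
```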

Local Search Index

A similar divide-and-conquer strategy is used to create an experimental local information retrieval system suitable for large data mining queries. Run index-pubmed to populate retrieval index files from records stored in the local archive. This will also take a few hours. Since PubMed updates are released once per day, it may be most convenient to schedule reindexing to start in the late evening and run during the night.

For PubMed titles and primary abstracts, the indexing process deletes hyphens after specific prefixes, removes accents and diacritical marks, splits words at punctuation characters, corrects encoding artifacts, and spells out Greek letters for easier searching on scientific terms. It then prepares inverted indices with term positions, and uses them to build distributed term lists and postings files.

For example, the term list that includes "cancer" would be located at:

  /Postings/NORM/c/a/n/c/canc.trm

A query on cancer thus only needs to load a very small subset of the total index. The underlying software supports efficient expression evaluation, unrestricted wildcard truncation, phrase queries, and proximity searches.

The phrase-search script provides access to the local search system.

Currently indexed field names, and the full set of indexed terms for a given field, are shown by:

  phrase-search -terms

phrase-search -terms NORM

Terms are truncated with trailing asterisks, and can be expanded to show individual postings counts:

  phrase-search -count "catabolite repress*"

phrase-search -counts "catabolite repress*"

Query evaluation includes Boolean operations and parenthetical expressions:

  phrase-search -query "(literacy AND numeracy) NOT (adolescent OR child)"

Adjacent words in the query are treated as a contiguous phrase:

  phrase-search -query "selective serotonin reuptake inhibit*"

More inclusive searches can use the Porter2 stemming algorithm:

  phrase-search -query "monoamine oxidase inhibitor [STEM]"

Each plus sign will replace a single word inside a phrase, and runs of tildes indicate the maximum distance between sequential phrases:

  phrase-search -query "vitamin c + + common cold"

phrase-search -query "vitamin c ~ ~ common cold"

MeSH identifier code, MeSH hierarchy key, and year of publication are also indexed:

  phrase-search -query "C14.907.617.812* [TREE] AND 2015:2019 [YEAR]"

An exact match can search for all or part of a title or abstract:

  phrase-search -exact "Genetic Control of Biochemical Reactions in Neurospora."

All query commands return a list of PMIDs, which can be piped directly to fetch-pubmed to retrieve the uncompressed records. For example:

  phrase-search -query "selective serotonin ~ ~ ~ reuptake inhibitor*" |
fetch-pubmed |
xtract -pattern PubmedArticle -num AuthorList/Author |
sort-uniq-count -n |
reorder-columns 2 1 |
head -n 25 |
tee /dev/tty |
xy-plot auth.png

performs a proximity search with dynamic wildcard expansion (matching phrases like "selective serotonin and norepinephrine reuptake inhibitors") and fetches 12,620 PubMed records from the local archive. It then counts the number of authors for each paper (a consortium is treated as a single author), printing a frequency table of the number of papers per number of authors:

  0    50
1 1372
2 1863
3 1873
4 1721
5 1516
...

and creating a visual graph of the data. The entire set of commands runs in under 4 seconds.

The phrase-search and fetch-pubmed scripts are front-ends to the rchive program, which is used to build and search the inverted retrieval system. Rchive is multi-threaded for speed, retrieving records from the local archive in parallel, and fetching the positional indices for all terms in parallel before evaluating the title words as a contiguous phrase.

Rapidly Scanning PubMed

If the expand-current script is run after archive-pubmed or index-pubmed, an ad hoc scan can be performed on the entire set of live PubMed records:

  cat $EDIRECT_PUBMED_MASTER/Current/*.xml |
xtract -timer -pattern PubmedArticle -PMID MedlineCitation/PMID \
-group AuthorList -if "#LastName" -eq 7 -element "&PMID" LastName

in this case finding 1,613,792 articles with seven authors. (Author count is not indexed by Entrez or EDirect. This query excludes consortia and additional named investigators.)

Xtract uses the Boyer-Moore-Horspool algorithm to partition an XML stream into individual records, distributing them among multiple instances of the data exploration and extraction function for concurrent execution. On a modern six-core computer with a fast solid state drive, it can process the full set of PubMed records in under 4 minutes.

Processing by XML Subset

A query on articles with abstracts published in a chosen journal, retrieved from the local cache, and followed by a multi-step transformation:

  esearch -db pubmed -query "PNAS [JOUR]" -pub abstract |
efetch -format uid | fetch-pubmed |
xtract -stops -rec Rec -pattern PubmedArticle \
-wrp Year -year "PubDate/*" -wrp Abst -words Abstract/AbstractText |
xtract -rec Pub -pattern Rec \
-wrp Year -element Year -wrp Num -num Abst > countsByYear.xml

returns structured data with the year of publication and number of words in the abstract for each record:

  <Pub><Year>2018</Year><Num>198</Num></Pub>
<Pub><Year>2018</Year><Num>167</Num></Pub>
<Pub><Year>2018</Year><Num>242</Num></Pub>

The ">" redirect saves the results to a file.

The following "for" loop limits the processed query results to one year at a time with xtract ‑select, passing the relevant subset to a second xtract command:

  for yr in {1960..2021}
do
cat countsByYear.xml |
xtract -set Raw -pattern Pub -select Year -eq "$yr" |
xtract -pattern Raw -lbl "$yr" -avg Num
done |

that applies ‑avg to the word counts in order to compute the average number of abstract words per article for the current year:

  1969    122
1970 120
1971 127
...
2018 207
2019 207
2020 208

This result can be saved by redirecting to a file, or it can be piped to:

  tee /dev/tty |
xy-plot pnas.png

to print the data to the terminal and then display the results in graphical format. The last step should be:

  rm countsByYear.xml

to remove the intermediate file.
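For comparison, the per-year averaging can also be done in a single pass once the year and word count are in tab-delimited form (for instance, as printed by an xtract ‑element command on the Pub records). The sketch below is an alternative to the loop, not a description of how ‑avg works internally; average_by_year is a hypothetical name, and rounding to the nearest integer is an assumption.

```shell
# Average the second column (word count) for each value of the first (year).
# Assumption: averages are rounded to the nearest integer.
average_by_year() {
  awk -F '\t' '
    { sum[$1] += $2; cnt[$1]++ }
    END { for (y in sum) printf "%s\t%d\n", y, sum[y] / cnt[y] + 0.5 }
  ' | sort -n
}

printf '2018\t198\n2018\t167\n2018\t242\n2019\t200\n2019\t210\n' |
average_by_year
```

This avoids rereading the intermediate file once per year, at the cost of holding one running total per year in memory.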

Identifier Conversion

The index-pubmed script also downloads MeSH descriptor information from the NLM ftp server and generates a conversion file:

  ...
<Rec>
<Code>D064007</Code>
<Name>Ataxia Telangiectasia Mutated Proteins</Name>
...
<Tree>D12.776.157.687.125</Tree>
<Tree>D12.776.660.720.125</Tree>
</Rec>
...

that can be used for mapping MeSH codes to and from chemical or disease names. For example:

  cat $EDIRECT_PUBMED_MASTER/Data/meshconv.xml |
xtract -pattern Rec \
-if Name -starts-with "ataxia telangiectasia" \
-element Code

will return:

  C565779
C576887
D001260
D064007

More information on a MeSH term could be obtained by running:

  efetch -db mesh -id D064007 -format docsum

Natural Language Processing

Additional NLP annotation on PubMed can be downloaded and indexed by running index-extras.

NCBI's Biomedical Text Mining Group performs computational analysis of PubMed and PMC papers, and extracts chemical, disease, and gene references from article contents. Along with NLM Gene Reference Into Function mappings, these terms are indexed in CHEM, DISZ, and GENE fields.

Recent research at Stanford University defined biological themes, supported by dependency paths, which are indexed in THME, PATH, and CONV fields. Theme keys in the Global Network of Biomedical Relationships are taken from a table in the paper:

  A+    Agonism, activation                      N     Inhibits
  A-    Antagonism, blocking                     O     Transport, channels
  B     Binding, ligand                          Pa    Alleviates, reduces
  C     Inhibits cell growth                     Pr    Prevents, suppresses
  D     Drug targets                             Q     Production by cell population
  E     Affects expression/production            Rg    Regulation
  E+    Increases expression/production          Sa    Side effect/adverse event
  E-    Decreases expression/production          T     Treatment/therapy
  G     Promotes progression                     Te    Possible therapeutic effect
  H     Same protein or complex                  U     Causal mutations
  I     Signaling pathway                        Ud    Mutations affecting disease course
  J     Role in disease pathogenesis             V+    Activates, stimulates
  K     Metabolism, pharmacokinetics             W     Enhances response
  L     Improper regulation linked to disease    X     Overexpression in disease
  Md    Biomarkers (diagnostic)                  Y     Polymorphisms alter risk
  Mp    Biomarkers (progression)                 Z     Enzyme activity

Themes common to multiple chemical-disease-gene relationships are disambiguated so they can be queried individually. The expanded list, along with MeSH category codes, can be seen with:

  phrase-search -help

Integration with Entrez

The phrase-search ‑filter command allows PMIDs to be generated by an EDirect search and then incorporated as a component in a local query:

  esearch -db pubmed -query "complement system proteins [MESH]" |
efetch -format uid |
phrase-search -filter "L [THME] AND D10* [TREE]"

This finds PubMed papers about complement proteins and limits them by the "improper regulation linked to disease" theme and the lipids MeSH chemical category:

  448084
1292783
1379443
...

Intermediate lists of PMIDs can be saved to a file and piped (with "cat") into a subsequent phrase-search ‑filter query. They can also be uploaded to the Entrez history server by piping to epost:

  epost -db pubmed

Automation

Entrez Direct Commands Within Scripts

For those taking an adventurous plunge into the world of programming, a shell script can repeat the same sequence of operations on a number of individual input lines or arguments.

Variables can be set to the results of a command by enclosing the statements between "$(" and ")" symbols. (Placing statements between backtick ("`") characters is an earlier, obsolete convention, but it still works.) The variable name is prefixed by a dollar sign ("$") to use its value as an argument in another command. Comments start with a pound sign ("#") and are ignored. Quotation marks within quoted strings are entered by "escaping" with a backslash ("\"). Subroutines can be used to collect common code or simplify the organization of the script.

Saving the following text:

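  #!/bin/bash

# Print the header row: "Years" plus a four-letter abbreviation per disease
printf "Years"
for disease in "$@"
do
  # Capitalize the first letter (character classes are quoted to avoid globbing)
  frst=$( echo "${disease:0:1}" | tr '[:lower:]' '[:upper:]' )
  printf "\t%s%s" "${frst}" "${disease:1:3}"
done
printf "\n"

# Count PubMed title matches for each disease, one decade per row
for (( yr = 2020; yr >= 1900; yr -= 10 ))
do
  printf "%ss" "${yr}"
  for disease in "$@"
  do
    val=$(
      esearch -db pubmed -query "$disease [TITL]" |
      efilter -mindate "${yr}" -maxdate "$((yr+9))" |
      xtract -pattern ENTREZ_DIRECT -element Count
    )
    printf "\t%s" "${val}"
  done
  printf "\n"
done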

to a file named "scan_for_diseases.sh" and executing:

  chmod +x scan_for_diseases.sh

allows the script to be called by name. Passing several disease names in command-line arguments:

  scan_for_diseases.sh diphtheria pertussis tetanus |

returns the counts of papers on each disease, by decade, for over a century:

  Years    Diph    Pert    Teta
2020s 104 281 154
2010s 860 2558 1296
2000s 892 1968 1345
1990s 1150 2662 1617
1980s 780 1747 1488
...

A graph of papers per decade for each disease is generated by piping the table to:

  xy-plot diseases.png

Passing the data instead to:

  align-columns -h 2 -g 4 -a ln

right-justifies numeric data columns for easier reading or for publication:

  Years    Diph    Pert    Teta
2020s 104 281 154
2010s 860 2558 1296
2000s 892 1968 1345
1990s 1150 2662 1617
1980s 780 1747 1488
...

while piping to:

  transmute -t2x -set Set -rec Rec -header

produces a custom XML structure for further comparative analysis by xtract.

Time Delay

The shell script command:

  sleep 1

adds a one second delay between steps, and can be used to help prevent overuse of servers by advanced scripts.
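One way to apply the delay consistently is to wrap it in a small helper that runs a command once per input line. This is a sketch only; each_with_delay is a hypothetical name, and echo stands in for whatever EDirect command would be run at each step.

```shell
# Run a command once per input line, pausing between steps.
# Usage: each_with_delay SECONDS COMMAND [ARGS...]
each_with_delay() {
  delay="$1"; shift
  while IFS= read -r item
  do
    "$@" "$item" < /dev/null   # keep the command from consuming the input list
    sleep "$delay"
  done
}

printf '%s\n' diphtheria pertussis tetanus |
each_with_delay 1 echo "Querying:"
```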

Xargs/Sh Loop

Writing a script to loop through data can sometimes be avoided by creative use of the Unix xargs and sh commands. Within the "sh ‑c" command string, the last name and initials arguments (passed in pairs by "xargs ‑n 2") are substituted at the "$0" and "$1" variables. All of the commands in the sh string are run separately on each name:

  echo "Garber ED Casadaban MJ Mortimer RK" |
xargs -n 2 sh -c 'esearch -db pubmed -query "$0 $1 [AUTH]" |
xtract -pattern ENTREZ_DIRECT -lbl "$1 $0" -element Count'

This produces PubMed article counts for each author:

  ED Garber       35
MJ Casadaban 46
RK Mortimer 85
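The argument substitution can be tried without contacting any server by replacing the EDirect commands with echo. In this small sketch (swap_pairs is a hypothetical name), each pair of words arrives in the sh command string as "$0" and "$1", and is printed in reverse order:

```shell
# Show how xargs -n 2 hands each pair of words to sh -c as $0 and $1.
swap_pairs() {
  xargs -n 2 sh -c 'echo "$1 $0"'
}

echo "Garber ED Casadaban MJ Mortimer RK" | swap_pairs
```

This prints the same "initials, last name" labels that appear in the first column of the table above.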

While Loop

A "while" loop can also be used to independently process lines of data. Given a file "organisms.txt" containing genus-species names, the Unix "cat" command:

  cat organisms.txt |

writes the contents of the file:

  Arabidopsis thaliana
Caenorhabditis elegans
Danio rerio
Drosophila melanogaster
Escherichia coli
Homo sapiens
Mus musculus
Saccharomyces cerevisiae

This can be piped to a loop that reads one line at a time:

  while read org
do
esearch -db taxonomy -query "$org [LNGE] AND family [RANK]" < /dev/null |
efetch -format docsum |
xtract -pattern DocumentSummary -lbl "$org" \
-element ScientificName Division
done

looking up the taxonomic family name and BLAST division for each organism:

  Arabidopsis thaliana        Brassicaceae          eudicots
Caenorhabditis elegans Rhabditidae nematodes
Danio rerio Cyprinidae bony fishes
Drosophila melanogaster Drosophilidae flies
Escherichia coli Enterobacteriaceae enterobacteria
Homo sapiens Hominidae primates
Mus musculus Muridae rodents
Saccharomyces cerevisiae Saccharomycetaceae ascomycetes

(The "< /dev/null" input redirection construct prevents esearch from "draining" the remaining lines from stdin.)
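The effect of the redirection can be demonstrated with a stand-in command that reads everything available on stdin. In this sketch the function names are hypothetical, and reads_stdin merely imitates the input-consuming behavior described above; it makes no claim about how esearch is implemented.

```shell
# Stand-in for a command that reads from stdin when it can.
reads_stdin() { cat > /dev/null; }

# Without the redirect, the first call swallows the rest of the input list.
count_unprotected() {
  n=0
  while IFS= read -r line
  do
    reads_stdin
    n=$((n + 1))
  done
  echo "$n"
}

# With "< /dev/null", every line is seen by the loop.
count_protected() {
  n=0
  while IFS= read -r line
  do
    reads_stdin < /dev/null
    n=$((n + 1))
  done
  echo "$n"
}

printf '%s\n' "Danio rerio" "Homo sapiens" "Mus musculus" | count_unprotected
printf '%s\n' "Danio rerio" "Homo sapiens" "Mus musculus" | count_protected
```

The first loop runs once and the second runs three times, which is why an unprotected loop appears to process only the first organism.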

For Loop

The same results can be obtained with organism names embedded in a "for" loop:

  for org in \
"Arabidopsis thaliana" \
"Caenorhabditis elegans" \
"Danio rerio" \
"Drosophila melanogaster" \
"Escherichia coli" \
"Homo sapiens" \
"Mus musculus" \
"Saccharomyces cerevisiae"
do
esearch -db taxonomy -query "$org [LNGE] AND family [RANK]" |
efetch -format docsum |
xtract -pattern DocumentSummary -lbl "$org" \
-element ScientificName Division
done

File Exploration

A for loop can also be used to explore the computer's file system:

  for i in *
do
if [ -f "$i" ]
then
echo $(basename "$i")
fi
done

visiting each file within the current directory. The asterisk ("*") character indicates all files, and can be replaced by any pattern (e.g., "*.txt") to limit the file search. The if statement "‑f" operator can be changed to "‑d" to find directories instead of files, and "‑s" selects files with size greater than zero.
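The test operators can be combined. The following sketch lists only regular files that are not empty; list_nonempty_files is a hypothetical name, and the demonstration runs in a temporary scratch directory so that it does not depend on the contents of the current directory.

```shell
# List regular files in a directory that are not empty, combining -f and -s.
list_nonempty_files() {
  for i in "$1"/*
  do
    if [ -f "$i" ] && [ -s "$i" ]
    then
      basename "$i"
    fi
  done
}

demo_dir=$(mktemp -d)
printf 'data\n' > "$demo_dir/full.txt"   # regular file with content
: > "$demo_dir/empty.txt"                # regular file of size zero
mkdir "$demo_dir/subdir"                 # directory, skipped by -f
list_nonempty_files "$demo_dir"
rm -rf "$demo_dir"
```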

Processing in Groups

EDirect supplies a function that combines lines of unique identifiers or sequence accession numbers into comma-separated groups:

  JoinIntoGroupsOf() {
xargs -n "$@" echo |
sed 's/ /,/g'
}
alias join-into-groups-of='JoinIntoGroupsOf'

The following example demonstrates processing sequence records in groups of 200 accessions at a time:

  ...
efetch -format acc |
join-into-groups-of 200 |
xargs -n 1 sh -c 'epost -db nuccore -format acc -id "$0" |
elink -target pubmed |
efetch -format abstract'
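The grouping step can be tried on its own with a handful of placeholder identifiers. The sketch below repeats the function body shown above under an underscore name (hyphens are not valid in function names in every shell); the ACC values are made up for illustration.

```shell
# Same logic as the function above, under a name valid in any POSIX shell.
join_into_groups_of() {
  xargs -n "$@" echo |
  sed 's/ /,/g'
}

printf '%s\n' ACC1 ACC2 ACC3 ACC4 ACC5 |
join_into_groups_of 2
```

Each output line holds up to two identifiers joined by commas, with the final line carrying the remainder.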

Additional Examples

EDirect examples demonstrate how to answer ad hoc questions in several Entrez databases. The detailed examples have been moved to a separate document, which can be viewed by clicking on the ADDITIONAL EXAMPLES link.

Appendices

Command-Line Arguments

Each EDirect program has a ‑help command that prints detailed information about available arguments.

EFetch Formats

EFetch ‑format and ‑mode values for each database are shown below:

  -db            -format            -mode    Report Type
  ___            _______            _____    ___________

  (all)
                 docsum                      DocumentSummarySet XML
                 docsum             json     DocumentSummarySet JSON
                 full                        Same as native except for mesh
                 uid                         Unique Identifier List
                 url                         Entrez URL
                 xml                         Same as -format full -mode xml

  bioproject
                 native                      BioProject Report
                 native             xml      RecordSet XML

  biosample
                 native                      BioSample Report
                 native             xml      BioSampleSet XML

  biosystems
                 native             xml      Sys-set XML

  clinvar
                 variation                   Older Format
                 variationid                 Transition Format
                 vcv                         VCV Report
                 clinvarset                  RCV Report

  gds
                 native             xml      RecordSet XML
                 summary                     Summary

  gene
                 full_report                 Detailed Report
                 gene_table                  Gene Table
                 native                      Gene Report
                 native             asn.1    Entrezgene ASN.1
                 native             xml      Entrezgene-Set XML
                 tabular                     Tabular Report

  homologene
                 alignmentscores             Alignment Scores
                 fasta                       FASTA
                 homologene                  Homologene Report
                 native                      Homologene List
                 native             asn.1    HG-Entry ASN.1
                 native             xml      Entrez-Homologene-Set XML

  mesh
                 full                        Full Record
                 native                      MeSH Report
                 native             xml      RecordSet XML

  nlmcatalog
                 native                      Full Record
                 native             xml      NLMCatalogRecordSet XML

  pmc
                 bioc                        PubTator Central BioC XML
                 medline                     MEDLINE
                 native             xml      pmc-articleset XML

  pubmed
                 abstract                    Abstract
                 bioc                        PubTator Central BioC XML
                 medline                     MEDLINE
                 native             asn.1    Pubmed-entry ASN.1
                 native             xml      PubmedArticleSet XML

  (sequences)
                 acc                         Accession Number
                 est                         EST Report
                 fasta                       FASTA
                 fasta              xml      TinySeq XML
                 fasta_cds_aa                FASTA of CDS Products
                 fasta_cds_na                FASTA of Coding Regions
                 ft                          Feature Table
                 gb                          GenBank Flatfile
                 gb                 xml      GBSet XML
                 gbc                xml      INSDSet XML
                 gene_fasta                  FASTA of Gene
                 gp                          GenPept Flatfile
                 gp                 xml      GBSet XML
                 gpc                xml      INSDSet XML
                 gss                         GSS Report
                 ipg                         Identical Protein Report
                 ipg                xml      IPGReportSet XML
                 native             text     Seq-entry ASN.1
                 native             xml      Bioseq-set XML
                 seqid                       Seq-id ASN.1

  snp
                 json                        Reference SNP Report

  sra
                 native             xml      EXPERIMENT_PACKAGE_SET XML
                 runinfo            xml      SraRunInfo XML

  structure
                 mmdb                        Ncbi-mime-asn1 strucseq ASN.1
                 native                      MMDB Report
                 native             xml      RecordSet XML

  taxonomy
                 native                      Taxonomy List
                 native             xml      TaxaSet XML

ESearch Sort Order

ESearch ‑sort values for several databases are listed below:

  -db            -sort
  ___            _____

  gene
                 Chromosome
                 Gene Weight
                 Name
                 Relevance

  geoprofiles
                 Default Order
                 Deviation
                 Mean Value
                 Outliers
                 Subgroup Effect

  pubmed
                 First Author
                 Journal
                 Last Author
                 Pub Date
                 Recently Added
                 Relevance
                 Title

  (sequences)
                 Accession
                 Date Modified
                 Date Released
                 Default Order
                 Organism Name
                 Taxonomy ID

  snp
                 Chromosome Base Position
                 Default Order
                 Heterozygosity
                 Organism
                 SNP_ID
                 Success Rate

EInfo Data

EInfo field data contains status flags for several term list index properties:

  <Field>
<Name>ALL</Name>
<FullName>All Fields</FullName>
<Description>All terms from all searchable fields</Description>
<TermCount>138982028</TermCount>
<IsDate>N</IsDate>
<IsNumerical>N</IsNumerical>
<SingleToken>N</SingleToken>
<Hierarchy>N</Hierarchy>
<IsHidden>N</IsHidden>
<IsTruncatable>Y</IsTruncatable>
<IsRangable>N</IsRangable>
</Field>

Unix Utilities

Several useful classes of Unix text processing filters, with selected arguments, are presented below:

Process by Contents:

  sort    Sorts lines of text

-f Ignore case
-n Numeric comparison
-r Reverse result order

-k Field key (start,stop or first)
-u Unique lines with identical keys

-b Ignore leading blanks
-s Stable sort
-t Specify field separator

uniq Removes repeated lines

-c Count occurrences
-i Ignore case

-f Ignore first n fields
-s Ignore first n characters

-d Only output repeated lines
-u Only output non-repeated lines

grep Matches patterns using regular expressions

-i Ignore case
-v Invert search
-w Search expression as a word
-x Search expression as whole line

-e Specify individual pattern

-c Only count number of matches
-n Print line numbers

Regular Expressions:

  Characters

    .     Any single character (except newline)
    \w    Alphabetic [A-Za-z], numeric [0-9], or underscore (_)
    \s    Whitespace (space or tab)
    \     Escapes special characters
    []    Matches any enclosed characters

  Positions

    ^     Beginning of line
    $     End of line
    \b    Word boundary

  Repeat Matches

    ?     0 or 1
    *     0 or more
    +     1 or more
    {n}   Exactly n
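
The pieces above can be combined into a single pattern. This sketch filters a made-up list for lines that look like two-letter-prefixed accessions with an optional version suffix. It assumes grep -E (extended regular expressions), which is not listed in the grep options above but is needed for ?, +, and {n} to act as repeat operators without backslashes.

```shell
ids='NM_000518.5
NP_000509
X_12
NM_abc
see NM_000518'

# ^ and $ anchor the whole line; {2} and + set repeat counts;
# \. escapes the dot; ( ... )? makes the version suffix optional
matches=$(printf '%s\n' "$ids" | grep -E -e '^[A-Z]{2}_[0-9]+(\.[0-9]+)?$')
echo "$matches"
```

Only the first two lines pass: the others have a one-letter prefix, non-numeric digits, or leading text that the ^ anchor rejects.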

Modify Contents:

  sed     Replaces text strings

    -e    Specify individual expression

  tr      Translates characters

    -d    Delete character
    -s    Squeeze runs of characters

  rev     Reverses characters on line
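
As a small illustration, tr and rev together can produce the reverse complement of a nucleotide string, and tr -s with sed can normalize spacing. The input strings here are made up for the example.

```shell
seq="ATGCCG"

# Complement each base with tr, then reverse the line with rev
revcomp=$(printf '%s\n' "$seq" | tr 'ACGT' 'TGCA' | rev)
echo "$revcomp"

# Squeeze repeated spaces with tr -s, then replace each space using sed
label=$(printf '%s\n' "heart   attack" | tr -s ' ' | sed -e 's/ /_/g')
echo "$label"
```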

Format Contents:

  column  Aligns columns by content width

    -s    Specify field separator
    -t    Create table

  expand  Aligns columns to specified positions

    -t    Tab positions

  fold    Wraps lines at a specific width

    -w    Line width
    -s    Fold at spaces
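
A brief sketch of two of these formatters on made-up input: fold wraps a long sequence line, and expand converts a tab into spaces up to the next tab stop.

```shell
# Wrap a sequence at five characters per line
wrapped=$(printf '%s\n' "ATGCATGCATGC" | fold -w 5)
echo "$wrapped"

# Replace the tab with spaces up to the next multiple of 8 columns
padded=$(printf 'a\tb\n' | expand -t 8)
echo "$padded"
```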

Filter by Position:

  cut     Removes parts of lines

    -c    Characters to keep
    -f    Fields to keep
    -d    Specify field separator

    -s    Suppress lines with no delimiters

  head    Prints first lines

    -n    Number of lines

  tail    Prints last lines

    -n    Number of lines
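
The following sketch selects one column from a small tab-delimited table and trims rows from both ends. The table contents are sample data, and the first column is only a row index. Note that tail -n +2 uses the "+" form, which starts output at the second line instead of counting from the end.

```shell
# Tab-separated sample table with a header row
table=$(printf 'row\tsymbol\tchr\n1\tBRCA1\t17\n2\tTP53\t17\n3\tCFTR\t7\n')

# Keep column 2, drop the header with tail, then take the first two rows
symbols=$(printf '%s\n' "$table" | cut -f 2 | tail -n +2 | head -n 2)
echo "$symbols"

# A different delimiter is named with -d; here the third comma-separated field
third=$(printf '%s\n' "a,b,c" | cut -d ',' -f 3)
echo "$third"
```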

Miscellaneous:

  wc      Counts words, lines, or characters

    -c    Characters
    -l    Lines
    -w    Words

  xargs   Constructs arguments

    -n    Number of words per batch
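
To show how xargs batches its input, this sketch groups five made-up identifiers two at a time and counts the resulting lines with wc. Using echo as the command is only for demonstration; any program that accepts arguments could take its place.

```shell
ids=$(printf '%s\n' 101 102 103 104 105)

# Group the identifiers two per echo invocation
batches=$(printf '%s\n' "$ids" | xargs -n 2 echo)
echo "$batches"

# Count the resulting batches; tr strips padding that some wc versions add
batch_count=$(printf '%s\n' "$batches" | wc -l | tr -d ' ')
echo "$batch_count"
```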

File Compression:

  tar     Archive files

    -c    Create archive
    -f    Name of output file
    -z    Compress archive with gzip

  gzip    Compress file

    -k    Keep original file
    -9    Best compression

  unzip   Decompress .zip archive

    -p    Pipe to stdout

  gzcat   Decompress .gz archive and pipe to stdout
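
A minimal round trip through these tools might look like the sketch below, which works in a temporary directory. It uses gzip -dc as a stand-in for gzcat, since gzcat is not installed under that name on every system, and it relies on the tar options -C (change directory) and -t (list contents), which are not in the table above.

```shell
workdir=$(mktemp -d)
printf 'ATGC\n' > "$workdir/seq.txt"

# -k keeps seq.txt alongside the new seq.txt.gz; -9 asks for best compression
gzip -k -9 "$workdir/seq.txt"

# gzip -dc writes the decompressed data to stdout, as gzcat does
restored=$(gzip -dc "$workdir/seq.txt.gz")
echo "$restored"

# Bundle the original file into a gzip-compressed tar archive, then list it
tar -czf "$workdir/seqs.tar.gz" -C "$workdir" seq.txt
listing=$(tar -tzf "$workdir/seqs.tar.gz")
echo "$listing"
```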

Directory and File Navigation:

  cd      Changes directory

    /     Root
    ~     Home
    .     Current
    ..    Parent
    -     Previous

  ls      Lists file names

    -1    One entry per line
    -a    Show files beginning with dot (.)
    -l    List in long format
    -R    Recursively explore subdirectories
    -S    Sort files by size
    -t    Sort by most recently modified

  pwd     Prints working directory path
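
The navigation commands can be tried safely in a scratch directory. This sketch creates one with mktemp, compares ls with and without -a, and returns to the starting directory with "cd -". The file names are arbitrary.

```shell
start=$(pwd)
workdir=$(mktemp -d)

cd "$workdir"
touch visible.txt .hidden

plain=$(ls -1)                                # dot files are omitted
with_dot=$(ls -1 -a | grep -c -e 'hidden')    # -a includes them

cd - > /dev/null                              # return to the previous directory
back=$(pwd)
echo "$plain" "$with_dot"
```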

Additional documentation with detailed explanations and examples can be obtained by typing "man" followed by a command name.

Release Notes

EDirect release notes describe the history of incremental development and refactoring, from the original implementation in Perl to the redesign in Go and shell script. The detailed notes have been moved to a separate document, which can be viewed by clicking on the RELEASE NOTES link.

For More Information

Announcement Mailing List

NCBI posts general announcements regarding the E-utilities to the utilities-announce mailing list. This is an announcement-only list: individual subscribers may not send mail to it. The list of subscribers is private and is used only to deliver announcements to list members. The list receives about one posting per month. Please subscribe at the above link.

Documentation

EDirect navigation functions call the URL-based Entrez Programming Utilities:

  https://www.ncbi.nlm.nih.gov/books/NBK25501

NCBI database resources are described by:

  https://www.ncbi.nlm.nih.gov/pubmed/31602479

Instructions for obtaining an API key are given in this NCBI blog post:

  https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities

References

The Smithsonian Online Collections Databases are provided by the National Museum of Natural History, Smithsonian Institution, 10th and Constitution Ave. N.W., Washington, DC 20560-0193. https://collections.nmnh.si.edu/.

den Dunnen JT, Dalgleish R, Maglott DR, Hart RK, Greenblatt MS, McGowan-Jordan J, Roux AF, Smith T, Antonarakis SE, Taschner PE. HGVS Recommendations for the Description of Sequence Variants: 2016 Update. Hum Mutat. 2016. https://doi.org/10.1002/humu.22981. (PMID 26931183.)

Hutchins BI, Baker KL, Davis MT, Diwersy MA, Haque E, Harriman RM, Hoppe TA, Leicht SA, Meyer P, Santangelo GM. The NIH Open Citation Collection: A public access, broad coverage resource. PLoS Biol. 2019. https://doi.org/10.1371/journal.pbio.3000385. (PMID 31600197.)

Mitchell JA, Aronson AR, Mork JG, Folk LC, Humphrey SM, Ward JM. Gene indexing: characterization and analysis of NLM's GeneRIFs. AMIA Annu Symp Proc. 2003:460-4. (PMID 14728215.)

Percha B, Altman RB. A global network of biomedical relationships derived from text. Bioinformatics. 2018. https://doi.org/10.1093/bioinformatics/bty114. (PMID 29490008.)

Wei C-H, Allot A, Leaman R, Lu Z. PubTator central: automated concept annotation for biomedical full text articles. Nucleic Acids Res. 2019. https://doi.org/10.1093/nar/gkz389. (PMID 31114887.)

Wu C, Macleod I, Su AI. BioGPS and MyGene.info: organizing online, gene-centric information. Nucleic Acids Res. 2013. https://doi.org/10.1093/nar/gks1114. (PMID 23175613.)

Getting Help

Please refer to the PubMed and Entrez help documents for more information about search queries, database indexing, field limitations and database content.

Suggestions, comments, and questions specifically relating to the EUtility programs may be sent to eutilities@ncbi.nlm.nih.gov.
