NCBI News









Eight ASN.1 Utility Programs for Five Computer Platforms

The NCBI ToolKit provides source code and configuration scripts that make it easy to compile NCBI software to run on a variety of computing platforms. Some of the most familiar applications that can be built using the ToolKit are blastall, blastcl3, blastpgp, the BLAST web server, Sequin, Entrez2, and Spidey. These programs are well known to NCBI users because they are provided as executables for many common computing platforms. However, the ToolKit also contains many useful but lesser-known ASN.1 utility programs that were previously offered only as source code. NCBI now distributes eight of these command-line utilities, which convert, validate, index, and create NCBI ASN.1 records, in executable form for five major computing platforms: Alpha, Linux, Macintosh, Solaris, and MS Windows. They are run in Terminal or Command Prompt windows.

Each utility program accepts a number of command-line arguments, each specified using a dash and a single-letter option code followed by an option value. Some values are Boolean and are given as either “T” (true) or “F” (false). Others are specified using one-letter codes, such as format specifiers, or strings, such as file names or GenBank accession numbers. To see a complete list of command-line parameters for any of the programs, run the program with a trailing dash and no parameter. A list of the eight programs with brief descriptions is given in Box 1, while a detailed description of one of the most versatile programs, “asn2all”, follows. In many situations, the multifunctional program asn2all can be run instead of asn2fsa, asn2gb, or asn2xml.
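The option convention described above (a dash, a single-letter code, then a value, with Booleans written as T or F) can be sketched as a small Python helper. This is only an illustration: the function name build_argv is hypothetical and not part of the ToolKit, and the helper assembles an argument list without running any program.

```python
from typing import Dict, List, Union

OptionValue = Union[bool, str, int]


def build_argv(program: str, options: Dict[str, OptionValue]) -> List[str]:
    """Assemble an argument list in the dash-plus-single-letter style.

    Hypothetical helper for illustration; it does not execute anything.
    """
    argv = [program]
    for letter, value in options.items():
        if len(letter) != 1:
            raise ValueError(f"option codes are single letters, got {letter!r}")
        argv.append(f"-{letter}")
        if isinstance(value, bool):
            # Boolean options are written as T or F on the command line.
            argv.append("T" if value else "F")
        else:
            argv.append(str(value))
    return argv


if __name__ == "__main__":
    # Option letters here are the ones used for asn2all in this article.
    print(" ".join(build_argv("asn2all", {"i": "gbpri1.aso", "a": "t", "b": True, "f": "g"})))
```

Keeping the arguments as a list rather than a single string avoids quoting problems if the list is later handed to a process launcher.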

The program “asn2all” is primarily intended to generate reports from the binary ASN.1 Bioseq-set GenBank release files that are available at:

Depending on the “-f” argument, the program can produce GenBank and GenPept flatfiles, FASTA sequence files, INSDSet structured XML, TinySeq XML, and 5-column feature tables. Before running asn2all, uncompress the GenBank release files, which have an “.aso.gz” suffix, using a program such as “gunzip”; the resulting files have the suffix “.aso”. For example, gbpri1.aso is the first file in the primate division, and the command:

gunzip gbpri1.aso.gz

will produce “gbpri1.aso”.
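Where gunzip is not available (for example, on some MS Windows systems), the same decompression step can be sketched with Python's standard gzip module. The function name gunzip_file is hypothetical. Unlike gunzip, this sketch leaves the original “.gz” file in place.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip_file(gz_path: str) -> Path:
    """Decompress 'name.aso.gz' to 'name.aso' in the same directory."""
    source = Path(gz_path)
    if source.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {source.name}")
    target = source.with_suffix("")  # drops only the trailing ".gz"
    with gzip.open(source, "rb") as compressed, open(target, "wb") as plain:
        # Stream in chunks so large release files are not loaded into memory.
        shutil.copyfileobj(compressed, plain)
    return target


if __name__ == "__main__":
    # Demonstrate on a throwaway file rather than a real release file.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.aso.gz"
        with gzip.open(sample, "wb") as handle:
            handle.write(b"placeholder bytes, not real ASN.1")
        print(gunzip_file(str(sample)).name)
```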

Using asn2all, the name of the file to process is specified with the “-i” command line argument. Use “-a t” to indicate batch processing of a GenBank release file and “-b T” to indicate that it is binary ASN.1. A text ASN.1 record, such as one obtained on the web from Entrez, can be processed by using “-a a -b F” instead of “-a t -b T”.

Nucleotide and protein records within ASN.1 records can be processed simultaneously. Use the “-o” argument to indicate the nucleotide output file and the “-v” argument for the protein output file.

The “-f” argument determines the format to be generated. Legal values of “-f” and the resulting formats are:

g GenBank (nucleotide) or GenPept (protein)
f FASTA
t 5-column feature table
y TinySeq XML
s INSDSet XML
a ASN.1 of entire record
x XML version of entire record

The command:

asn2all -i gbpri1.aso -a t -b T -f g -o gbpri1.nuc -v gbpri1.prt
will generate both GenBank reports for nucleotide sequences and GenPept reports for protein sequences from gbpri1.aso in the files “gbpri1.nuc” and “gbpri1.prt”, respectively.
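When many release files need converting, the command above can be generated rather than typed. The following Python sketch checks the “-f” code against the list of legal values and derives the “.nuc” and “.prt” output names from the input name, following the naming used in the example. The names FORMATS and release_command are hypothetical, and the function returns the argument list without running asn2all.

```python
from pathlib import Path
from typing import List

# Legal values of -f, copied from the list in this article.
FORMATS = {
    "g": "GenBank (nucleotide) or GenPept (protein)",
    "f": "FASTA",
    "t": "5-column feature table",
    "y": "TinySeq XML",
    "s": "INSDSet XML",
    "a": "ASN.1 of entire record",
    "x": "XML version of entire record",
}


def release_command(aso_file: str, fmt: str = "g") -> List[str]:
    """Build the asn2all argument list for one binary release file."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown -f value {fmt!r}; expected one of {sorted(FORMATS)}")
    stem = Path(aso_file).with_suffix("")  # gbpri1.aso -> gbpri1
    return [
        "asn2all",
        "-i", aso_file,
        "-a", "t",  # batch processing of a release file
        "-b", "T",  # input is binary ASN.1
        "-f", fmt,
        "-o", f"{stem}.nuc",  # nucleotide output
        "-v", f"{stem}.prt",  # protein output
    ]


if __name__ == "__main__":
    print(" ".join(release_command("gbpri1.aso")))
```

Assuming asn2all is installed and on the search path, the returned list could be passed to subprocess.run(cmd, check=True) to perform the conversion.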

A remote fetching option, “-r T”, allows the download of an ASN.1 record from NCBI over a network connection, using an accession number or NCBI gi number as the identifier. For instance, to download the feature table from the Reference Sequence (RefSeq) record for the Escherichia coli genome via remote fetch, use:

asn2all -r T -A NC_000913 -f t

The output of this command for the first NC_000913 feature is given below. The 5-column feature table format used is identical to that required as input to generate an ASN.1 sequence file using tbl2asn, described in Box 1.

>Feature ref|NC_000913.2|
190	255	gene
			gene	thrL
			gene_syn	EG11277
			locus_tag	b0001
			db_xref	GeneID:944742
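To show how the five columns fit together, here is a minimal Python sketch that reads feature table text into simple records. It assumes the usual tab-delimited layout: start, stop, and feature key in columns 1 to 3, with qualifier name and value in columns 4 and 5 on lines that begin with three tabs. The names Feature and parse_feature_table are hypothetical. Positions are kept as strings because partial locations may carry “<” or “>” markers.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Feature:
    key: str
    intervals: List[Tuple[str, str]] = field(default_factory=list)
    qualifiers: List[Tuple[str, str]] = field(default_factory=list)


def parse_feature_table(text: str) -> List[Feature]:
    """Parse tab-delimited 5-column feature table text (layout assumed above)."""
    features: List[Feature] = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(">"):
            continue  # skip blank lines and the ">Feature" header line
        cols = line.split("\t")
        if cols[0] and len(cols) >= 2:
            interval = (cols[0], cols[1])
            if len(cols) >= 3 and cols[2]:
                # Columns 1-3 present: a new feature begins here.
                features.append(Feature(key=cols[2], intervals=[interval]))
            elif features:
                # Only start/stop present: an extra interval of the current feature.
                features[-1].intervals.append(interval)
        elif not cols[0] and len(cols) >= 4 and features:
            # Columns 4-5: a qualifier attached to the current feature.
            value = cols[4] if len(cols) > 4 else ""
            features[-1].qualifiers.append((cols[3], value))
    return features


SAMPLE = (
    ">Feature ref|NC_000913.2|\n"
    "190\t255\tgene\n"
    "\t\t\tgene\tthrL\n"
    "\t\t\tgene_syn\tEG11277\n"
    "\t\t\tlocus_tag\tb0001\n"
    "\t\t\tdb_xref\tGeneID:944742\n"
)

if __name__ == "__main__":
    for feature in parse_feature_table(SAMPLE):
        print(feature.key, feature.intervals, dict(feature.qualifiers))
```

Qualifiers are stored as a list of pairs rather than a dictionary because a feature can carry the same qualifier more than once.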

Box 1. Eight NCBI ToolKit utility programs now available in executable form for five computer platforms

asn2all: converts GenBank release files in ASN.1 format to a variety of other formats
asn2fsa: converts binary or text ASN.1 sequence files to FASTA format
asn2gb: converts binary or text ASN.1 sequence files to GenBank or GenPept flatfile formats
asn2idx: generates accession/file offset indices for Bioseq-set release files
asn2xml: converts binary or text ASN.1 sequence files to XML format
asnval: validates ASN.1 release files
tbl2asn: automates the creation of sequence records for submission to GenBank
gene2xml: converts text or binary ASN.1 files of Entrez Gene records into XML

The eight ASN.1 utility programs may be downloaded at:

 

Box 2. Converting the Entrez Gene FTP files with gene2xml

The gene2xml program is the most recent to join the group of NCBI conversion tools. It reads the binary ASN.1 Entrezgene-Set files offered on the Entrez Gene FTP site and converts them into an easily parsed XML format. The program can accept the name of a single file as input or the path to a group of files to be converted. An option to filter the output by NCBI taxonomy ID allows organism-specific XML files to be created from a single multi-species ASN.1 file. The Entrez Gene FTP ASN.1 files are found at:

ftp.ncbi.nih.gov/gene/DATA/


NCBI News | Fall/Winter 2002 NCBI News: Spring 2003