AsnLib: ASN.1 Processing


Introduction to ASN.1
AsnLib: Overview
Principles of Operation
Specification for AsnLib
AsnTool
AsnTool Tutorial
Using AsnLib
AsnLib: A Tutorial
Data-links
AsnLib Generated Header Files
Returns From AsnLib Parsing
Finding AsnTypePtrs at Run-time
Custom Read and Write Functions
Customizing an AsnIo Stream
ASN.1 Object Loaders
AsnLib and Object Loaders As a Generalized Iterator
AsnLib and Object Loaders Provide a Generalized Copy and Compare
AsnLib Interface: asn.h


 Introduction to ASN.1

Why ASN.1

Abstract Syntax Notation 1 (ASN.1) is used to describe the structure of data to be transferred between the Application Layer and the Presentation Layer of the Open Systems Interconnection (OSI).  It is meant to provide a mechanism whereby the Presentation Layer can use a single standard encoding to reliably exchange any arbitrary data structure with other computer systems, while the Application layer can map the standard encoding into any type of representation or language that is appropriate for the end user.  ASN.1 does not describe the content, meaning, or structure of the data, only the way in which it is specified and encoded.

These properties make it an excellent choice for a standard way of encoding scientific data.  Since ASN.1 does not specify content, specifications can be created as new concepts need to be represented.  Yet since it is an International Organization for Standardization (ISO) standard, the new specification can take advantage of various tools built to work with ASN.1 in general.  It removes from scientists the role of specifying ad hoc file formats, and focuses them instead on specifying the content and structure of data necessary to convey scientific meaning.

There are two aspects to ASN.1: the specification of the data and the encoded data itself.  The specification describes the abstract structure of the data and the allowed values various fields may take.  Frequently today scientific data is presented with no formal specification.  There may be some documentation describing the data file, but very often it is incomplete or not entirely accurate, since it is usually written about the file, rather than as an integral step toward building the file.  The ASN.1 specification is written in a formal language, which means it can be automatically and thoroughly checked by machine for errors and inconsistencies in form before any data are collected at all.  Further, it can be used by a computer to validate that any data presented correctly reflect that specification.  This is essential in eliminating the random errors and oversights in generating data files that plague scientific data now.  A utility program, asntool, was built with the AsnLib libraries to do this sort of checking and validation while developing ASN.1 specifications.

The requirement for a separate specification also means that interested parties can examine and evaluate the structure of the data independent of any particular database or data file.  One can understand the limits and strengths of a specification separately from the quality or amount of the data itself. Data structures that prove to be useful can be re-used in a variety of ways; by large public databases, by small private databases, in various software tools, and in assorted data files.

Finally, a separate specification means software to construct, decode, and validate any ASN.1 specified object can be built semi- or fully automatically from the specification.  Data encoded according to that specification can then be processed with relatively little manual programming for those aspects of the application dealing directly with ASN.1.  This is what the AsnLib routines are for.

Structure of ASN.1

ASN.1 has Type References, identifiers, and values.  A Type Reference is the name of an object defined in an ASN.1 specification.  An identifier is a field within an object.  A value is generally not included in the specification, but rather is the value of a Type Reference or an identifier in data encoded in ASN.1.  Values can be encoded in either a text or a binary form.  The examples here will obviously be in the text form.

Type References ALWAYS start with an upper case letter.  Identifiers ALWAYS start with a lower case letter.  The form of a value depends on its type (integer, string, etc.); examples are given below.  "-" (hyphen) is the ONLY separator character allowed in Type References and identifiers.
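The naming rules above are simple enough to check mechanically.  The following is a minimal sketch in plain C, not part of AsnLib; the function names are hypothetical.  It assumes names are made of ASCII letters, digits, and hyphens, and it does not model any further restrictions the ISO standard places on hyphens.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not an AsnLib function: returns 1 if every
   character after the first is an ASCII letter, digit, or hyphen. */
static int name_body_ok(const char *name)
{
    size_t i;

    for (i = 1; name[i] != '\0'; i++) {
        unsigned char c = (unsigned char)name[i];
        if (!isalnum(c) && c != '-')
            return 0;
    }
    return 1;
}

/* Type References start with an upper case letter, e.g. "My-type". */
int is_type_reference(const char *name)
{
    assert(name != NULL);
    return isupper((unsigned char)name[0]) && name_body_ok(name);
}

/* Identifiers start with a lower case letter, e.g. "badge-id". */
int is_identifier(const char *name)
{
    assert(name != NULL);
    return islower((unsigned char)name[0]) && name_body_ok(name);
}
```

With these helpers, "My-type" is accepted as a Type Reference, "badge-id" as an identifier, and "badge_id" as neither, because "_" is not a valid separator in ASN.1.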

ASN.1 allows elements of SET, SEQUENCE, and CHOICE to not have identifiers if they can be distinguished from each other by their type (e.g. one is an integer and one is a string).  However, this can make the text value notation ambiguous and it may also lead to errors in the hands of the novice.  So we REQUIRE that every element of a SET, SEQUENCE, and CHOICE have an identifier.

ASN.1 also allows the specification of numerical tags (used for the binary encoding) in [] in addition to or in lieu of identifiers.  Again, this can be a problem for the novice.  Since we require identifiers, our software generates the numerical tags itself and we can ignore this.  AsnLib still supports explicitly defined APPLICATION and PRIVATE tags, but that is beyond the scope of this document.  Comments begin with   --   and end with   --    or end of line.

A simple ASN.1 specification module example is shown below:

Demo-module DEFINITIONS ::=       -- Module-name DEFINITIONS ::= BEGIN
BEGIN

EXPORTS My-type;                         -- My-type can be used by other modules

IMPORTS Foreign-type FROM Other-module; -- can import types

                                         -- we define an object called My-type
My-type ::= SEQUENCE {                   -- My-type is a Type Reference
   first     INTEGER ,                  -- first is an identifier
   second    INTEGER DEFAULT 2 ,        -- second defaults to 2
   third     VisibleString OPTIONAL     -- third is an optional string
   }                                     -- end of object definition

Another ::= Foreign-type                 -- can reference other defined types

END                                      -- end of module, END required

Value notation (or data encoded in the text form of ASN.1) looks like this:

My-type ::= {
   first 42
   }

This means this My-type will have first = 42, second = 2, and third not present.  To present more than one My-type you must have defined:

My-type-set ::= SET OF My-type           -- in Demo-module

Then you could have:

My-type-set ::= {                        -- start SET OF
   {                                     -- a My-type
       first 42
   } ,
   {                                     -- another My-type
       first 27 ,
       second 22 ,
       third "Everything set here"
   }
}                                        -- end of SET OF
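To make the link between a program's data and value notation concrete, here is a minimal sketch in plain C that prints one My-type in the text form shown above.  It does not use AsnLib; the MyType struct and my_type_to_text() are hypothetical names.  The sketch omits second when it equals the default of 2, and assumes the string contains no double quote characters.

```c
#include <assert.h>
#include <stdio.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory form of My-type from Demo-module. */
typedef struct {
    long        first;   /* required */
    long        second;  /* DEFAULT 2 */
    const char *third;   /* OPTIONAL: NULL means not present */
} MyType;

/* Writes one My-type as value notation into buf and returns the
   length reported by snprintf().  Assumes third has no '"' in it. */
int my_type_to_text(const MyType *m, char *buf, size_t size)
{
    char second[32] = "";

    assert(m != NULL && buf != NULL);
    assert(m->third == NULL || strchr(m->third, '"') == NULL);

    /* Leave out "second" when it holds the default value. */
    if (m->second != 2)
        snprintf(second, sizeof second, " , second %ld", m->second);

    return snprintf(buf, size, "My-type ::= { first %ld%s%s%s%s }",
                    m->first, second,
                    m->third ? " , third \"" : "",
                    m->third ? m->third : "",
                    m->third ? "\"" : "");
}
```

Leaving out a defaulted element on output is a choice made for this sketch; an application may prefer to write every value explicitly.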

ASN.1 Primitive Types Supported by AsnLib

Each entry below gives the type, its description, its specification, and its value notation.

BOOLEAN
   Description:     Any TRUE or FALSE value.  May have a DEFAULT.
   Specification:   Truth ::= BOOLEAN
   Value notation:  Truth ::= FALSE

INTEGER
   Description:     Any integer value.  May be given named values, but the range is not limited to the names.  May have a DEFAULT.
   Specification:   Number ::= INTEGER
                    or
                    Number ::= INTEGER {
                         red (1) ,
                         blue (2) }
   Value notation:  Number ::= 42
                    or
                    Number ::= red

OCTET STRING
   Description:     Any string of bytes.  Returned as or read from ByteStorePtr.  May not have a DEFAULT.
   Specification:   Hstring ::= OCTET STRING
   Value notation:  Hstring ::= '0A01F'H

NULL
   Description:     null is the only allowed value.
   Specification:   Nothing ::= NULL
   Value notation:  Nothing ::= null

REAL
   Description:     Floating point number in base 2 or 10.  REAL value notation is 3 integers for { mantissa, base, exponent }.  May have a DEFAULT.
   Specification:   Pi ::= REAL
   Value notation:  Pi ::= { 314159, 10, -5 }

ENUMERATED
   Description:     A named set of integer values.  Only named values are allowed.  May have a DEFAULT.
   Specification:   Sex ::= ENUMERATED {
                         male (1) ,
                         female (2) }
   Value notation:  Sex ::= male

SEQUENCE
   Description:     A series of other named types, in order.  Not related to a biological sequence.  All elements must be present unless OPTIONAL or DEFAULT.
   Specification:   Yuppie ::= SEQUENCE {
                         income   INTEGER ,
                         name     VisibleString }
   Value notation:  Yuppie ::= {
                         income 100000 ,
                         name "John Doe" }

SEQUENCE OF
   Description:     A repeating series of a single type, in order.
   Specification:   Stooges ::= SEQUENCE OF VisibleString
   Value notation:  Stooges ::= {
                         "Larry" ,
                         "Curly" ,
                         "Moe" }

SET
   Description:     A series of other named types.  Order does not matter.  All elements must be present unless OPTIONAL or DEFAULT.
   Specification:   Yuppie ::= SET {
                         income   INTEGER ,
                         name     VisibleString }
   Value notation:  Yuppie ::= {
                         income 100000 ,
                         name "John Doe" }

SET OF
   Description:     A repeating series of a single type.  Order does not matter.
   Specification:   Stooges ::= SET OF VisibleString
   Value notation:  Stooges ::= {
                         "Larry" ,
                         "Curly" ,
                         "Moe" }

CHOICE
   Description:     A way to select one from a set of alternate types.
                    NOTE:  in the value notation you are indicating one choice, so {} are not necessary (or allowed), but the identifier for the selected CHOICE must be given before the value.
   Specification:   Person ::= CHOICE {
                         social-security INTEGER ,
                         name VisibleString ,
                         badge-id INTEGER }
   Value notation:  Person ::= name "Joe"

VisibleString
   Description:     A string of printable ASCII characters.
                    NOTE: The double quote character (") may be included in a VisibleString by doubling it.
                    "He said ""Hi Mom!"" to her"
                    NOTE: AsnLib can accept wrapped long VisibleStrings.  That is, a string may contain internal newlines, which are stripped on input from the value notation.
                    Text ::= "He said ""Hi Mom!"" to her"
                    would be read as:
                    "He said ""Hi Mom!"" to her"
   Specification:   Text ::= VisibleString
   Value notation:  Text ::= "Hi Mom!"

StringStore
   Description:     ONLY in AsnLib.  Defines a VisibleString which is read into a ByteStore instead of a CharPtr.  Used for long strings like DNA sequences.
   Specification:   Dna ::= StringStore
   Value notation:  Dna ::= "AGGAGG"
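The quote-doubling rule for VisibleString in the table above can be applied with a few lines of code when generating value notation from other data.  This is a minimal sketch in plain C; quote_visible_string() is a hypothetical helper, not an AsnLib function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not an AsnLib function.  Writes src to dst as a
   quoted VisibleString value, doubling each embedded double quote.
   Returns the number of characters written (not counting the
   terminator), or 0 if dst might be too small. */
size_t quote_visible_string(const char *src, char *dst, size_t size)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);

    /* Worst case: every character doubled, two delimiters, terminator. */
    if (size < 2 * strlen(src) + 3)
        return 0;

    dst[n++] = '"';
    for (; *src != '\0'; src++) {
        if (*src == '"')
            dst[n++] = '"';      /* emit the extra quote first */
        dst[n++] = *src;
    }
    dst[n++] = '"';
    dst[n] = '\0';
    return n;
}
```

The size check is deliberately conservative: it rejects any buffer that could not hold the worst case, rather than measuring the exact output length first.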

Further information about ASN.1

 

The Open Book
A Practical Perspective on OSI
by Marshall T. Rose
Prentice Hall, Englewood Cliffs, New Jersey  07632
(c) 1990

 

ISO Development Environment  (public software)
University of Pennsylvania
Dept. of Computer Science and Information Science
Moore School
Attn: David J. Farber (ISODE Distribution)
200 South 33rd Street
Philadelphia, PA  19104-6314
1-215-898-8560

 

OSIkit Tools from NIST  (1989) (public software)
US Dept. of Commerce
National Institute of Standards and Technology
Gaithersburg, MD

 

Information Processing - Open Systems Interconnection - Specification of Abstract Syntax Notation One (ASN.1).  International Organization for Standardization and International Electrotechnical Commission, 1987.  International Standard 8824.

Information Processing - Open Systems Interconnection - Specification of Basic Encoding Rules for Abstract Syntax Notation One (ASN.1).  International Organization for Standardization and International Electrotechnical Commission, 1987.  International Standard 8825.

Information Processing - Open Systems Interconnection - Abstract Syntax Notation One (ASN.1) - Draft Addendum 1:  Extensions to ASN.1.  International Organization for Standardization and International Electrotechnical Commission, 1987.  Draft Addendum 8824/DAD 1.

Information Processing - Open Systems Interconnection - Abstract Syntax Notation One (ASN.1) - Draft Addendum 1:  Extensions to ASN.1 Basic Encoding Rules.  International Organization for Standardization and International Electrotechnical Commission, 1987.  Draft Addendum 8825/DAD 1.

AsnLib: Overview

AsnLib is a library of functions developed by NCBI for manipulating and exchanging ASN.1 specifications and encoded data for scientific purposes.

A number of commercial and public domain tools are available for working with ASN.1 and for automatically building data handlers of various sorts. They are focused on the use for which ASN.1 was originally intended, the exchange of data between layers of the OSI.  As such they tend to automate the process more than AsnLib does, because the domain of use is much more limited.  The fact that they determine the internal data structures to use and write all the code to handle them themselves is not a big problem in this case.

When ASN.1 is used for scientific data description though, other uses will be made of the encoded data than may have originally been envisaged by the designers of these products.  For example, a scientist will often want an application which scans through a large complicated data structure, and just extracts certain fields for use, or even just counts occurrences of certain values.  A tool which automatically generates large elaborate data structures and lots of code to parse the stream, generate the structures, and store them in memory is inappropriate for such an application.  Further, a scientific application may well wish to manipulate that data in a different language than the tool is written in, such as FORTRAN, PROLOG, or LISP.  These applications may well wish to store the whole data structure from the stream, but they will not wish to use the data structures provided by the tool.

ASN.1 can be used to encode data in two ways, an ASCII human readable form called "value notation" or "print form", and a binary encoding.  ASN.1 has separate standards documents for the syntax (specification rules) and the binary encoding rules (BER, or "Basic Encoding Rules").  This was done on purpose to allow various encoding rules for the same abstract syntax.  The BER is, at this writing, the only official ISO encoding for ASN.1, but several other encodings which are faster or take less space, are under consideration by ISO.  Currently the only binary encoding AsnLib supports is BER.

The value notation or ASCII form of the data is not really an official ISO standard.  It was meant to provide a human readable form of ASN.1 data for development or explication, but not as a standard for data exchange. Nonetheless, value notation rules are given in the ISO documents for all the data types they describe.  With only a few additional rules, value notation is quite robust for data exchange.  These rules are listed in Appendix 1. While we do not recommend the ASCII form of ASN.1 encoded data for large amounts of data, it is very useful for developing and testing data representations or for generating ASN.1 values easily from other data files or local databases without specialized tools.  Since the value notation and binary encoded forms of data are completely and reliably interconvertible using AsnLib, there is no problem doing this.

Principles of Operation

AsnLib operates on atomic elements of ASN.1 specified data.  It is built using the NCBI core software tools and this document assumes you have some acquaintance with them.  AsnLib reads or writes strings, integers, etc. with single function calls.  Composite objects such as a SEQUENCE or a SET are read or written with a series of calls to read or write its component parts.  The process is designed to be relatively intuitive even in this case.  One calls a function to start encoding a SEQUENCE, then calls the routines to encode its parts, then calls a function to end encoding the SEQUENCE. NCBI has built functions to read and write such higher level objects in single function calls (described in the chapters on data), which use the low level AsnLib functions described here.

One can read and write any type using only three functions.  They take as arguments the identifier of an ASN.1 encoded stream (binary or ASCII), a pointer to a node in a parse tree (generated from the ASN.1 specification), and a pointer to a union which can hold a value of any type.  AsnLib takes care of encoding each value properly, of checking that all appropriate nodes in the tree are visited in the proper order, and of checking that values are valid for their type; none of this is the concern of the application programmer.  Application programmers must read and understand the ASN.1 specification to make proper use of it, but the other details of using ASN.1 correctly are handled for them.

The parse tree contains information about the type of every node, its name, its binary tag, allowed values, default values, and the next valid element. The C header file which asntool generates from the specification holds this parse tree, and also contains a series of #defines which associate names derived from the ASN.1 specification with pointers to nodes in the parse tree. Thus one's code would refer to JOURNAL_title, not a pointer to a specific node.  Using these defines means that if an ASN.1 specification is changed, but the names and types of nodes an application cares about have not changed, the application can be updated by just compiling with the new header file.

There are also functions which allow more interpreter-like code to be written.  One function will load an ASN.1 specification from a file, validate it, and build the appropriate parse tree on the fly, rather than at compile time by including a header file.  One can still identify nodes in the tree by name with a function that searches the tree for nodes with names matching a string.  As with all interpreter/compiler trade-offs, such an application is slower, but more flexible.
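The two ways of naming a node described above, a #define fixed at compile time or a search by name at run time, can be pictured with a toy parse tree.  The sketch below is plain C and purely illustrative: ToyNode, toy_tree, the TOY_ macros, toy_find(), and the dotted node names are all hypothetical, and the real AsnLib structures differ and should not be programmed against directly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy parse-tree node; NOT the layout AsnLib uses. */
typedef struct {
    const char *name;       /* name taken from the specification */
    int         is_string;  /* 1 if the node holds a VisibleString */
} ToyNode;

static ToyNode toy_tree[] = {
    { "Journal",       0 },
    { "Journal.title", 1 },
    { "Journal.year",  0 }
};

/* Compile-time style: macros name fixed nodes, much as the #defines
   in a generated header do. */
#define TOY_JOURNAL        (&toy_tree[0])
#define TOY_JOURNAL_title  (&toy_tree[1])

/* Run-time style: search the tree for a node whose name matches a
   string.  Slower than a macro, but needs no generated header. */
ToyNode *toy_find(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof toy_tree / sizeof toy_tree[0]; i++)
        if (strcmp(toy_tree[i].name, name) == 0)
            return &toy_tree[i];
    return NULL;
}
```

Both routes end at the same node: toy_find("Journal.title") and TOY_JOURNAL_title yield the same pointer, so code written against the macros keeps working when the tree is rebuilt, as long as the names are unchanged.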

AsnLib assumes that specifications will be written as a collection of smaller modules.  Data types may be declared as IMPORTS or EXPORTS by any module.  Multiple modules which reference each other may be loaded at once into AsnLib or through the interpreter function described above.  It will then link the modules before outputting the header file, thus effectively building a single parse tree containing all the modules.

In another approach, one might build a series of functions which handle the datatypes in a particular module.  Then when one writes code which uses a module which IMPORTS another module type, it is left unlinked in that parse tree and one just calls the appropriate function to read it.  AsnLib contains two functions for temporarily linking, then unlinking local parse subtrees to a parent object parse tree for this purpose.  We have begun to build a library of such modular object functions, so one need not link the whole world of possible datatypes into a single routine or module, or write the basic routines to create, destroy, and exchange such sub-objects.

Specification for AsnLib

AsnLib supports the following types from ISO 8824 and the ASN.1 enhancements.  The internal representation used by AsnLib (from the NCBI core tools) for routines dealing with these types is also shown.

Supported ASN.1 primitive types

type              internal representation
----------------  -----------------------
BOOLEAN           Boolean
INTEGER           Int4
OCTET STRING      ByteStorePtr
NULL              no value
REAL              FloatHi
ENUMERATED        Int4
SEQUENCE          no value
SEQUENCE OF       no value
SET               no value
SET OF            no value
CHOICE            no value
VisibleString     CharPtr
StringStore       ByteStorePtr

Other ASN.1 string types are supported as VisibleString.  No checks are made to ensure restrictions of character usage by the various string types. Types not supported by AsnLib at this point (although they will be accepted in a module specification as valid ASN.1) are:

Unsupported ASN.1 primitive types

BIT STRING

OBJECT IDENTIFIER

ObjectDescriptor

EXTERNAL

ANY

GeneralizedTime

UTCTime

The following keywords are currently supported by AsnLib:

Supported ASN.1 keywords

DEFINITIONS

BEGIN

END

EXPORTS

IMPORTS

FROM

APPLICATION

PRIVATE

UNIVERSAL

DEFAULT

OPTIONAL

FALSE

TRUE

The following ASN.1 keywords are not supported by AsnLib (although they are passed in a module specification as valid ASN.1):

Unsupported ASN.1 keywords

IMPLICIT

ABSENT

BY

COMPONENT

DEFINED

INCLUDES

MIN

MINUS-INFINITY

MAX

PRESENT

PLUS-INFINITY

SIZE

TAGS

WITH

AsnLib uses indefinite encoding for output of all binary encoded non-primitive types.  It can decode either definite or indefinite binary encoded data for all types.  This conforms to the BER.
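The difference between the definite and indefinite forms is easiest to see in the bytes.  The following is a minimal sketch in plain C that encodes a SEQUENCE holding one small INTEGER both ways under the BER.  It uses UNIVERSAL tags only; AsnLib generates its own tags for identified elements, so this is not necessarily the byte stream AsnLib would write.  The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* UNIVERSAL tags; 0x30 is SEQUENCE with the constructed bit set. */
enum { TAG_INTEGER = 0x02, TAG_SEQUENCE = 0x30 };

/* Writes a one-byte INTEGER (0..127): tag, length 1, value.
   Primitive types always use a definite length. */
static size_t put_small_int(unsigned char *out, unsigned value)
{
    out[0] = TAG_INTEGER;
    out[1] = 0x01;
    out[2] = (unsigned char)value;
    return 3;
}

/* Definite form: the length of the contents follows the tag. */
size_t encode_definite(unsigned char *out, unsigned value)
{
    size_t n;

    assert(out != NULL && value <= 127);
    out[0] = TAG_SEQUENCE;
    n = put_small_int(out + 2, value);
    out[1] = (unsigned char)n;      /* short-form length, below 128 */
    return 2 + n;
}

/* Indefinite form: 0x80 opens the contents, two zero bytes end them. */
size_t encode_indefinite(unsigned char *out, unsigned value)
{
    static const unsigned char end_of_contents[2] = { 0x00, 0x00 };
    size_t n = 0;

    assert(out != NULL && value <= 127);
    out[n++] = TAG_SEQUENCE;
    out[n++] = 0x80;
    n += put_small_int(out + n, value);
    memcpy(out + n, end_of_contents, sizeof end_of_contents);
    return n + sizeof end_of_contents;
}
```

For the value 42 the definite form is 30 03 02 01 2A and the indefinite form is 30 80 02 01 2A 00 00.  The indefinite form suits a writer that emits elements one at a time, since the length of a structure need not be known before its first element is written.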

DEFAULT values may be given in an ASN.1 specification.  AsnLib accepts and records them in the parse tree.  However, it does not supply the value if it is missing from the input stream on the assumption that the application would want to distinguish a value actually supplied from a value defaulted locally. DEFAULT is only supported for simple types like INTEGER or VisibleString, but not for structure types like SEQUENCE because it is too difficult to code.
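Because the default is not supplied on input, the application decides what to do when a value is absent.  The following minimal sketch in plain C shows one way to keep "supplied" and "defaulted" apart, using the second INTEGER DEFAULT 2 element of My-type as the example.  OptInt, SECOND_DEFAULT, and effective_second() are hypothetical names, not part of AsnLib.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical application-side slot for "second INTEGER DEFAULT 2". */
typedef struct {
    int  present;  /* 1 if the value appeared in the input stream */
    long value;    /* meaningful only when present is 1 */
} OptInt;

#define SECOND_DEFAULT 2L

/* Returns the value to use: the one supplied in the stream, or the
   default from the specification when none was supplied. */
long effective_second(const OptInt *f)
{
    assert(f != NULL);
    return f->present ? f->value : SECOND_DEFAULT;
}
```

A record that explicitly carried second 2 and one that omitted it both yield 2 here, but the present flag still tells them apart.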

Values may not be assigned in a specification module to types defined in a different module.  Each module is self contained and does not "know" anything about types defined in other modules except their names if they were IMPORTS. So suppose one module defines:

 

Dna-strand ::= ENUMERATED { plus(1), minus(2) }

A different module may not use the DEFAULT in the following case:

Dna-sequence ::= SEQUENCE {
   length INTEGER ,
   strand Dna-strand DEFAULT plus }

because it does not know that Dna-strand is ENUMERATED or what its allowed values are.  Such a construct is acceptable if the definitions of Dna-strand and Dna-sequence are in the same module and the Dna-strand definition comes first.

Elements of a SEQUENCE are checked that they are all received or sent in the correct order and that no non-OPTIONAL or non-DEFAULT elements are missing.  However, because AsnLib does not store whole structures, it can only check that the types of elements in a SET are correct, but cannot check if more than one element of a type is used or if a required element is missing.  For this reason it is safer to use SEQUENCE rather than SET as a rule when using AsnLib.  While there is a semantic difference, there is no representational limitation in doing this.
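Checking a SEQUENCE needs only one piece of state, the position of the next element that may appear, which is why it works without storing the structure.  The following is a minimal sketch of that idea in plain C; it is not the AsnLib implementation, and ElemSpec, SeqChecker, and the seq_ functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *name;
    int         required;  /* 0 for OPTIONAL or DEFAULT elements */
} ElemSpec;

typedef struct {
    const ElemSpec *spec;
    size_t          count;
    size_t          next;  /* index of the next element that may appear */
} SeqChecker;

void seq_start(SeqChecker *c, const ElemSpec *spec, size_t count)
{
    assert(c != NULL && spec != NULL);
    c->spec = spec;
    c->count = count;
    c->next = 0;
}

/* Accepts one incoming identifier.  Returns 1 if it is valid here,
   0 if it is unknown, repeated, out of order, or if a required
   element was skipped to reach it. */
int seq_accept(SeqChecker *c, const char *name)
{
    size_t i;

    for (i = c->next; i < c->count; i++) {
        if (strcmp(c->spec[i].name, name) == 0) {
            c->next = i + 1;
            return 1;
        }
        if (c->spec[i].required)
            return 0;
    }
    return 0;
}

/* At the end of the SEQUENCE, no required element may remain. */
int seq_end(const SeqChecker *c)
{
    size_t i;

    for (i = c->next; i < c->count; i++)
        if (c->spec[i].required)
            return 0;
    return 1;
}
```

A SET cannot be checked this way, because its elements may arrive in any order: the checker would have to remember every element already seen, which is the information the paragraph above says AsnLib does not keep.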

AsnTool

An application program called "asntool" is built by the NCBI Software Toolkit using the AsnLib function libraries, which in turn are based on the NCBI portable core software tools. This application is a utility program which can:

1. Read, write, and error check an ASN.1 specification.
2. Read, write, and check ASCII values conforming to the specification in 1.
3. Read, write, and check binary values conforming to the specification in 1.
4. Combine 2 and 3 to translate between binary and ASCII.
5. Output a C language header file containing a parse tree for the specification in 1, which can be used in an application program.

AsnTool Tutorial

It may be quickest to demonstrate the use of AsnLib through example.  In the distribution directory of the NCBI Software Toolkit, \ncbi, there are two subdirectories. \demo contains demonstration source code to be used in the section below and 2 samples of MEDLINE entries as ASN.1 value notation (ASCII).  medline.ent is a single Medline-entry and medline.prt is a Pub-set containing many MEDLINE entries. \asn contains the ASN.1 specifications for the modules used to describe the MEDLINE entries.  They are:

File          Module         Description
-----------   ------------   ------------------------------------------
general.asn   NCBI-General   general purpose data types
pub.asn       NCBI-Pub       branch point for various publication types
biblio.asn    NCBI-Biblio    standard bibliographic citations for journals,
                             books, manuscripts, patents based on ANSI
                             standard
medline.asn   NCBI-Medline   MEDLINE entry (based on NCBI-Biblio)
asnpub.all    all            all above modules in one file

asntool should have been built as part of installing the system.  It is in \ncbi\bin.  Set your path, or move asntool to a place it can be executed.

From within the \demo directory, run asntool with no arguments.  It presents its argument usage to you.  Note that you must always give a module file name. asntool takes only one module file, so if you wish to use more than one you must concatenate them into a single file, such as asnpub.all.

Try the following exercises -- type:

 

asntool -m ..\asn\asnpub.all

 

This will read the publication modules and validate that they are correctly built.  asntool will notify you of various syntax errors and typos, usually giving the line number where the error occurred.  It makes sure that everything EXPORTS from a module is defined in that module and that everything IMPORTS is used by that module.  Everything not IMPORTS must be defined within the module.  In the case of multiple modules, it will try to link EXPORTS from one module with IMPORTS from others.  It is not an error to be unable to link an IMPORTS, but it does imply you expect it to be handled by an outside function.  There are no errors in asnpub.all, so asntool is silent.  The path may have a different form on various machines.

 

asntool -m ..\asn\asnpub.all -v medline.ent

 

This does everything above, and then reads the file medline.ent which it expects to be of a type defined in asnpub.all.  It checks for errors, reporting any it finds.  There are none, so asntool is silent.

 

asntool -m ..\asn\asnpub.all -v medline.ent -p stdout

 

On command line systems, everything above will happen, except that medline.ent will be encoded from asntool's internal structures to ASN.1 value notation on stdout, your terminal.  On Macintosh or Microsoft Windows, the output will go to a disk file named "stdout".

 

asntool -m ..\asn\asnpub.all -v medline.prt -e medline.val

 

This reads the set of MEDLINE records from medline.prt and encodes them in binary ASN.1 in the file medline.val

 

asntool -m ..\asn\asnpub.all -d medline.val -t Pub-set -p stdout

 

This reads (decodes) the set of MEDLINE records from the binary ASN.1 file we just made and outputs them as value notation on stdout.  Note that we MUST specify the type (Pub-set) of the binary file or message.  That is because the binary form does not have that information.  The value notation form does, so asntool can figure it out, but the binary, which is the real ISO standard, does not.

 

asntool -m ..\asn\asnpub.all -o allpub.h

 

This outputs a header file for an application which will use the AsnLib routines to encode and decode objects defined in asnpub.all.

Using AsnLib

If you take a look at the allpub.h you generated above, you will see that it includes <asn.h> which defines the interface to the AsnLib library and which includes <ncbi.h> which defines the interface to the NCBI core software tools.

Then the arrays of structures defining the parse tree come.  You should never program directly for these structures as they may change without notice. You should always use the functions described below.

Last come the #defines for pointers to specific nodes in the parse tree. They are built from the names of objects specified in the ASN.1 modules.  The name of the type itself is upper case, and component parts are in lower case. An example of the mapping between the ASN.1 specification medline.asn and the #defines in allpub.h is shown in Appendix 2. 

One less intuitive aspect of this system applies only to SET OF or SEQUENCE OF, which are repeating series of the same type.  Since any one element of such a repeating series does not have a name, one must be invented.  This is done by appending a _E (for Element of) to the parent name (e.g., if Name-list ::= SEQUENCE OF VisibleString, then one element (a VisibleString) of that SEQUENCE OF would have the #defined node name NAME_LIST_E).  Names defined this way are limited to a maximum of 31 characters.  If they grow longer than that, the leftmost characters are truncated.  The suggestion is: keep names as short as you can while keeping them meaningful.  Also, since "-" is the only valid separator character in ASN.1 but "_" is the only valid separator character in C, the Name-list (mentioned above) node in the parse tree would be defined as NAME_LIST.
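The mapping from an ASN.1 type name to its #defined node name can be sketched in a few lines.  This is plain C that mimics the rule as described above for type names only; type_to_define() is a hypothetical helper and not the code asntool uses, and the 31-character truncation and lower case component names are not modeled.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the asntool implementation.  Upper-cases
   the type name, maps '-' to '_', and appends "_E" when the name is
   wanted for an element of a SET OF or SEQUENCE OF.  Returns 1 on
   success, 0 if dst is too small. */
int type_to_define(const char *type_name, int element,
                   char *dst, size_t size)
{
    size_t i, len;

    assert(type_name != NULL && dst != NULL);
    len = strlen(type_name);
    if (size < len + (element ? 2 : 0) + 1)
        return 0;

    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)type_name[i];
        dst[i] = (c == '-') ? '_' : (char)toupper(c);
    }
    if (element) {
        dst[i++] = '_';
        dst[i++] = 'E';
    }
    dst[i] = '\0';
    return 1;
}
```

Given "Name-list" this produces NAME_LIST, and with the element flag set it produces NAME_LIST_E, matching the example in the text.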

ASN.1 encoded values are represented basically as identifier/value pairs. AsnLib has two parsing functions that correspond to the members of the pair:

atp = AsnReadId(aip, amp, atp);

    Reads an identifier from an input stream (aip) and returns a pointer to the appropriate node in the parse tree for it (atp as the return value).  atp will be one of the nodes #defined in the header generated by AsnLib.

 

success = AsnReadVal(aip, atp, avp);

      Reads the value of atp from the stream (aip) into an AsnValue (a union of Pointer, Int4, Boolean, FloatHi).  If AsnReadVal() is called with avp = NULL, it skips over that value.  This is useful for scanning through a file extracting only a few fields.

To parse then, one basically just alternates AsnReadId() and AsnReadVal(). The most common error to make in writing a parser that uses these functions is to get out of synchronization alternating between these two routines.
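The alternation, and what it means to fall out of synchronization, can be illustrated with a toy stream that does not use AsnLib at all.  Everything in this sketch is hypothetical: ToyPair, ToyStream, toy_read_id(), toy_read_val(), and toy_collect() stand in for the pattern only, and identifiers are plain strings here rather than parse tree nodes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { const char *id; const char *val; } ToyPair;

typedef struct {
    const ToyPair *pairs;
    size_t         count;
    size_t         pos;
    int            have_id;  /* 1 while a value is waiting to be read */
} ToyStream;

/* Returns the next identifier, or NULL at end of stream or when the
   previous value has not been read yet (out of synchronization). */
const char *toy_read_id(ToyStream *s)
{
    if (s->have_id || s->pos >= s->count)
        return NULL;
    s->have_id = 1;
    return s->pairs[s->pos].id;
}

/* Reads the value for the identifier just read.  With out == NULL the
   value is skipped.  Returns 0 if no identifier was read first. */
int toy_read_val(ToyStream *s, const char **out)
{
    if (!s->have_id)
        return 0;
    if (out != NULL)
        *out = s->pairs[s->pos].val;
    s->pos++;
    s->have_id = 0;
    return 1;
}

/* Scans the stream, keeping the values of one identifier and skipping
   all the others.  Returns the number of values kept. */
size_t toy_collect(ToyStream *s, const char *wanted,
                   const char **found, size_t max)
{
    const char *id;
    size_t n = 0;

    while ((id = toy_read_id(s)) != NULL) {
        if (strcmp(id, wanted) == 0 && n < max)
            toy_read_val(s, &found[n++]);
        else
            toy_read_val(s, NULL);   /* skip this value */
    }
    return n;
}
```

Every pass through the loop reads exactly one identifier and then exactly one value, kept or skipped.  Breaking that pairing, for example by reading a value before any identifier, is the kind of synchronization error the paragraph above warns about.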

There is only one function to write an identifier/value pair at once:

success = AsnWrite(aip, atp, avp);

      Writes the identifier pointed to by atp, and the value in avp, to the stream aip.

AsnLib: A Tutorial

In \ncbi\demo are three small demo applications that process medline entries and require the allpub.h header and the binary form of medline.prt we built in the sections above.  The make files for Microsoft C (makedemo.msc) and for all UNIX systems (makedemo.unx) are in \make.  Copy the makedemo file appropriate for your system into \ncbi\build and make it.

getmesh.c

Function:  Reads a Medline-entry, extracts the MeSH terms, and prints them.

Type "getmesh -" to see its arguments.

Type "getmesh -i medline.ent -o terms.out".  getmesh reads medline.ent, which contains a single Medline-entry in value notation (ASCII).  This file is presented at the end of this chapter, somewhat abbreviated, with the #defined names for the nodes in the allpub.h parse tree that will be encountered in the course of reading this file. getmesh parses it, extracts the MeSH terms and prints them in "terms.out".

Look at the source code in getmesh.c.

/*****************************************************************************
*
*   getmesh.c
*      gets mesh terms from a Medline-entry
*
*****************************************************************************/
#include <allpub.h>

#define NUMARGS 3
Args myargs[NUMARGS] = {
   { "Input data", NULL, "Medline-entry", NULL, FALSE, 'i', ARG_DATA_IN, 0.0,0,NULL},
   { "Input data is binary", "F", NULL, NULL, TRUE , 'b', ARG_BOOLEAN, 0.0,0,NULL},
   { "Output list", NULL, NULL, NULL, FALSE, 'o', ARG_FILE_OUT, 0.0,0,NULL}};

Int2 Main()
{
   AsnIoPtr aip;
   AsnTypePtr atp;
   DataVal value;
   static CharPtr intypes[2] = { "r", "rb" };
   Int2 intype;
   FILE *fp;

   if (! AsnLoad())
      Message(MSG_FATAL, "Unable to load allpub parse tree.");

   if (! GetArgs("GetMesh 1.0", NUMARGS, myargs))
      return 1;

   if (myargs[1].intvalue)        /* binary input is TRUE */
      intype = 1;
   else
      intype = 0;

   if ((aip = AsnIoOpen(myargs[0].strvalue, intypes[intype])) == NULL)
   {
      Message(MSG_ERROR, "Couldn't open %s", myargs[0].strvalue);
      return 1;
   }

   if ((fp = FileOpen(myargs[2].strvalue, "w")) == NULL)
   {
      Message(MSG_ERROR, "Couldn't open %s", myargs[2].strvalue);
      return 1;
   }

   atp = MEDLINE_ENTRY;

   fprintf(fp, "MeSH terms =\n\n");
   while ((atp = AsnReadId(aip, amp, atp)) != NULL)
   {
      if (atp == MEDLINE_MESH_term)
      {
         AsnReadVal(aip, atp, &value);
         FilePuts(value.ptrvalue, fp);
         FilePuts("\n", fp);
         AsnKillValue(atp, &value);
      }
      else
         AsnReadVal(aip, atp, NULL);
   }

   aip = AsnIoClose(aip);

   FileClose(fp);

   return 0;
}

Pretty short for doing all this, isn't it?  Walking through the code:

0.             AsnLoad() is called to load the ASN.1 parse tree for "allpub" into memory.

1.             GetArgs() is called to display or get the command line arguments.

2.             The appropriate string is selected for opening a value notation ("r") or a binary ("rb") input stream.

3.             The input stream is opened with AsnIoOpen().

4.             The file for printed output is opened.

5.             atp is initialized to MEDLINE_ENTRY, the defined node we expect the input stream to start with.  If the input stream were ALWAYS value notation, atp could be set to NULL, and Medline-entry ::= would be read from the input file and atp set correctly.  Since getmesh takes binary and value notation, atp must be properly initialized.

6.             The main while loop just reads identifiers with AsnReadId() until it returns NULL, which is EOF.  The argument, amp, is the AsnModulePtr declared in allpub.h.  It is used to locate the appropriate AsnTypePtr (atp) if it was set to NULL on the first call.  After that, atp provides the link to the parse tree.

7.             In the while loop, atp is compared with MEDLINE_MESH_term, the VisibleString containing a single MeSH term.  If they match, we read the value with AsnReadVal(), print it, then call AsnKillValue(), which deallocates any storage used when a value is read.  Since a VisibleString requires storage, this is necessary.  There is no harm in calling AsnKillValue() even on types that do not allocate storage (e.g., INTEGER).

8.             If it's not a MeSH term, we call AsnReadVal() with a NULL argument for the AsnValuePtr, which just skips over the value to the next identifier.

9.             We close the streams.

10.           c'est tout.

indexpub.c

Function: Builds an index to medline.ent based on Medline Unique Identifier.

Type "indexpub -" to see the arguments.

Type "indexpub -imedline.val".  indexpub will read the binary value file, medline.val, note the seek offset of the start of each Medline-entry it contains, identify the Medline uid for it, and build an index file, "medline.idx".

Take a look at the source code, indexpub.c.

/*****************************************************************************

*

*   indexpub.c

*      indexes a Pub-set by Medline UID

*

*****************************************************************************/

#include <allpub.h>

 

#define NUMARGS 3

Args myargs[NUMARGS] = {

   { "Input data", "medline.val", "Pub-set", NULL, FALSE, 'i', ARG_DATA_IN, 0.0,0,NULL},

   { "Input data is binary", "T", NULL, NULL, TRUE , 'b', ARG_BOOLEAN, 0.0,0,NULL},

   { "Output index table", "medline.idx", NULL, NULL, FALSE, 't', ARG_FILE_OUT, 0.0,0,NULL}};

 

 

Int2 Main()

{

   AsnIoPtr aip;

   AsnTypePtr atp;

   DataVal value;

   Int4 seekptr, tempseek, uid;

   static CharPtr intypes[2] = { "r", "rb" };

   Int2 intype;

   FILE *fp;

 

    if (! AsnLoad())

        Message(MSG_FATAL, "Unable to load allpub parse tree.");

 

   if (! GetArgs("IndexPub 1.0", NUMARGS, myargs))

       return 1;

 

   if (myargs[1].intvalue)        /* binary input is TRUE */

       intype = 1;

   else

       intype = 0;

 

   if ((aip = AsnIoOpen(myargs[0].strvalue, intypes[intype])) == NULL)

   {

       Message(MSG_ERROR, "Couldn't open %s", myargs[0].strvalue);

       return 1;

   }

 

   if ((fp = FileOpen(myargs[2].strvalue, "w")) == NULL)

   {

       Message(MSG_ERROR, "Couldn't open %s", myargs[2].strvalue);

       return 1;

   }

 

   atp = PUB_SET;

   tempseek = 0L;

 

   while ((atp = AsnReadId(aip, amp, atp)) != NULL)

   {

       if (atp == PUB_SET_medline_E)

          seekptr = tempseek;

       if (atp == MEDLINE_ENTRY_uid)

       {

          AsnReadVal(aip, atp, &value);

          uid = value.intvalue;

          fprintf(fp, "%ld %ld\n", uid, seekptr);

       }

       else

          AsnReadVal(aip, atp, NULL);

       tempseek = AsnIoTell(aip);

   }

 

   aip = AsnIoClose(aip);

   FileClose(fp);

   return 0;

 

}

It is the same basic structure as getmesh.c.  However, the use of the while loop is a little different.  Since we are building an index, we want to record the offset in the file of the identifier which starts each Medline-entry in the Pub-set (PUB_SET_medline_E; a PUB_SET of type medline is a SET OF Medline-entry).  So tempseek is set (to 0 to begin with, then with AsnIoTell()) BEFORE each read of an identifier with AsnReadId().  When the return value is PUB_SET_medline_E we know that tempseek contains the seek offset just before the first identifier for the Medline-entry, so we save it in seekptr.  Then we read through the entry looking for MEDLINE_ENTRY_uid, since we want to index on the Medline Unique Identifier.  When we find it, we write the uid and the saved seek offset to the index file.  All other values are skipped.
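The offset bookkeeping can be tried without AsnLib.  The sketch below is only a stand-in built on plain stdio: ftell() plays the role of AsnIoTell(), each text line plays the role of one identifier/value pair, a line beginning with "entry" marks the start of a record (the analogue of PUB_SET_medline_E), and a line of the form "uid N" carries the key (the analogue of MEDLINE_ENTRY_uid).  The record format and the function build_index() are hypothetical and are not part of AsnLib.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the indexpub loop, one text line per item.
 * Writes "<uid> <offset>" lines to idx and returns the number of
 * records indexed.  Lines longer than the buffer are not handled. */
int build_index(FILE *in, FILE *idx)
{
    char line[256];
    long seekptr = 0L;           /* offset of the current record's first item */
    long tempseek = ftell(in);   /* offset BEFORE the next read */
    long uid;
    int count = 0;

    while (fgets(line, sizeof line, in) != NULL)
    {
        if (strncmp(line, "entry", 5) == 0)
            seekptr = tempseek;  /* remember where this record began */
        else if (sscanf(line, "uid %ld", &uid) == 1)
        {
            fprintf(idx, "%ld %ld\n", uid, seekptr);
            count++;
        }
        tempseek = ftell(in);    /* position before the next item */
    }
    return count;
}
```

As in indexpub.c, the offset is taken before each read, so by the time the start-of-record marker is recognized the position just in front of it is already in hand.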

getpub.c

Function: Uses the index created by indexpub.c to retrieve a Medline-entry from medline.val by Medline uid.

/*****************************************************************************

*

*   getpub.c

*      does an indexed lookup for medline entries by medline uid

*

*****************************************************************************/

#include "allpub.h"

 

#define NUMARGS 5

Args myargs[NUMARGS] = {

   { "Input binary data", "medline.val", "Pub-set", NULL, FALSE, 'i', ARG_DATA_IN, 0.0,0,NULL},

   { "Medline UID to find", "88055872", NULL,NULL,FALSE,'u', ARG_INT, 0.0, 0, NULL },

   { "Input index table", "medline.idx", NULL,NULL,FALSE,'t', ARG_FILE_IN, 0.0,0,NULL },

   { "Output data", "stdout", "Medline-entry",NULL,FALSE,'o',ARG_DATA_OUT, 0.0,0,NULL},

   { "Output data is binary", "F", NULL, NULL, FALSE , 'b', ARG_BOOLEAN, 0.0,0,NULL}};

 

 

Int2 Main()

{

   AsnIoPtr aip, aipout;

   AsnTypePtr atp;

   DataVal value;

   Int4 seekptr, uid, uid_to_find;

   static CharPtr outtypes[2] = { "w", "wb" };

   Int2 outtype;

   FILE *fp;

   Boolean done, first;

   int retval;

 

    if (! AsnLoad())

        Message(MSG_FATAL, "Unable to load allpub parse tree.");

 

   if (! GetArgs("GetPub 1.0", NUMARGS, myargs))

       return 1;

 

   if (myargs[4].intvalue)        /* binary output is TRUE */

       outtype = 1;

   else

       outtype = 0;

 

   if ((aip = AsnIoOpen(myargs[0].strvalue, "rb")) == NULL)

   {

       Message(MSG_ERROR, "Couldn't open %s", myargs[0].strvalue);

       return 1;

   }

 

   if ((aipout = AsnIoOpen(myargs[3].strvalue, outtypes[outtype])) == NULL)

   {

       Message(MSG_ERROR, "Couldn't open %s", myargs[3].strvalue);

       return 1;

   }

 

   if ((fp = FileOpen(myargs[2].strvalue, "r")) == NULL)

   {

       Message(MSG_ERROR, "Couldn't open %s", myargs[2].strvalue);

       return 1;

   }

 

   uid_to_find = myargs[1].intvalue;

   done = FALSE;

   first = TRUE;

   while (! done)

   {

       retval = fscanf(fp, "%ld %ld", &uid, &seekptr);

       if (retval != 2)           /* EOF or a malformed index line */

       {

          Message(MSG_ERROR, "UID %ld not found", uid_to_find);

          return 1;

       }

       if (uid == uid_to_find)

          done = TRUE;

   }

   FileClose(fp);

 

   atp = MEDLINE_ENTRY;

   AsnIoSeek(aip, seekptr);

   done = FALSE;

   while (! done)

   {

       atp = AsnReadId(aip, amp, atp);

       if (atp == NULL)           /* unexpected end of input */

          break;

       AsnReadVal(aip, atp, &value);

       AsnWrite(aipout, atp, &value);

       AsnKillValue(atp, &value);

 

       if (! first)

       {

          if (atp == MEDLINE_ENTRY)

              done = TRUE;

       }

       else

          first = FALSE;

   }

 

   AsnIoClose(aip);

   AsnIoClose(aipout);

 

   return 0;

}

This is a very simple program.  It looks up the seek offset into the file by uid, and seeks to that point with AsnIoSeek().  It then just cycles through the process of reading an identifier then reading a value from medline.val using AsnReadId() and AsnReadVal().  It then writes them both to the output file with AsnWrite().  Any storage used is freed with AsnKillValue(). Depending on the way the output AsnIo stream is opened, ASCII or binary, the program can deliver a binary Medline-entry or an ASCII conversion of it.
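The look-up-then-seek step can be sketched with plain stdio, assuming the "uid offset" line format that indexpub writes.  Here fseek() stands in for AsnIoSeek(); the function names lookup_offset() and seek_to_uid() are hypothetical.  The sketch stops as soon as fscanf() converts fewer than two fields, so a damaged index line ends the search rather than being read again and again.

```c
#include <stdio.h>

/* Scan an index of "<uid> <offset>" lines for uid_to_find.
 * Returns the offset, or -1L if the uid is absent or a line
 * cannot be parsed. */
long lookup_offset(FILE *idx, long uid_to_find)
{
    long uid, offset;

    /* fscanf returns the number of fields converted; anything other
     * than 2 means end of file or a malformed line, so stop. */
    while (fscanf(idx, "%ld %ld", &uid, &offset) == 2)
    {
        if (uid == uid_to_find)
            return offset;
    }
    return -1L;
}

/* Position the data file at the indexed record.
 * Returns 0 on success, -1 on failure. */
int seek_to_uid(FILE *data, FILE *idx, long uid_to_find)
{
    long offset = lookup_offset(idx, uid_to_find);

    if (offset < 0L)
        return -1;
    return fseek(data, offset, SEEK_SET) == 0 ? 0 : -1;
}
```

A linear scan is enough for a demonstration index; a large index would more likely be sorted and searched by bisection.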

One important point to note is how the while loop knows when it has finished reading a MEDLINE_ENTRY.  Since it is a SEQUENCE, which is basically a structure with component parts, AsnReadId() returns atp == MEDLINE_ENTRY twice: once when it reads the start of the structure, and once when it reads the end.  If you imagine the MEDLINE_ENTRY being bounded by braces {} as in the value notation, the process is this:

MEDLINE_ENTRY ::= { AsnReadId() gets MEDLINE_ENTRY, AsnReadVal() gets {

    one ,                         ( read the internal components )

    two

   }                AsnReadId() gets MEDLINE_ENTRY, AsnReadVal() gets }

To produce the same effect on output, there are two extra output functions for AsnLib, in addition to AsnWrite().

AsnOpenStruct(aip, atp, ptr)

                Writes the first instance of atp on the output stream aip at the beginning of a structure (SEQUENCE, SET, SEQUENCE OF, SET OF).

 

AsnCloseStruct(aip, atp, ptr)

                Writes the second, closing instance.

The "ptr" argument is a pointer to the internal C structure representing the ASN.1 structure. It is used by functions that piggyback on the AsnWrite functions to explore the internal objects (discussed below).

For this reason a similar function is provided to write a CHOICE.

AsnWriteChoice(aip, atp, choice, value)

                Writes a choice of types. The choice argument is an integer indicating which type will be written at the next AsnWrite(), and value is a DataVal that can carry the internal C structure used to represent the choice.

 In the case of getpub.c, it is not necessary to call these functions because getpub is simply reading the data from an ASN.1 stream then writing it again in order, which includes the two instances of MEDLINE_ENTRY.
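The first/done logic that lets getpub stop at the closing MEDLINE_ENTRY can be modelled on a plain array of identifiers.  This is a toy model, not AsnLib code: the enum values stand in for AsnTypePtr values and copy_one_struct() is a hypothetical name.

```c
#include <stddef.h>

/* Hypothetical identifiers standing in for AsnTypePtr values. */
enum { ID_ENTRY = 1, ID_UID, ID_TITLE };

/* Copy ids from in[] to out[], starting at a structure's opening id
 * and stopping after its closing id, which is the next appearance of
 * struct_id.  Returns the number of ids copied.  A nested structure
 * with the same id is not handled in this sketch. */
size_t copy_one_struct(const int *in, size_t n_in, int struct_id, int *out)
{
    size_t i, n_out = 0;
    int first = 1;               /* nonzero only for the opening id */

    for (i = 0; i < n_in; i++)
    {
        out[n_out++] = in[i];    /* "read" and "write" one item */
        if (!first)
        {
            if (in[i] == struct_id)
                break;           /* second appearance: end of structure */
        }
        else
            first = 0;           /* the opening id has been consumed */
    }
    return n_out;
}
```

Given the stream ENTRY, UID, TITLE, ENTRY, ENTRY, UID, the function copies the first four ids and stops, leaving the next entry unread.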

Another point about this program is that we recognized the Medline entries in the Pub-set in indexpub.c by looking for PUB_SET_medline_E, but we are reading and writing the same entry in getpub.c using MEDLINE_ENTRY.  That is because a Pub-set of CHOICE medline is defined as a SET OF Medline-entry.  So when reading the whole Pub-set, each Medline-entry is a PUB_SET_medline_E. But when reading one entry it is a MEDLINE_ENTRY.

Data-links

Data-links are described in the NCBI Core Tools document.  They are meant to be "ports" in and out of software applications which perform exchange of structured data (in ASN.1).  The inputs and outputs for getpub.c and getmesh.c are actually Data-links.  If you simply type the command:

 

getpub -u 88055872 -b -o stdout | getmesh -i stdin -b -o terms.out

 

you have executed a pair of programs which communicate over a Data-link with structured, binary encoded ASN.1.  getpub extracts the Medline-entry with uid = 88055872 from a binary encoded Pub-set by indexed look-up and sends it to stdout as a binary Medline-entry.  getmesh reads that "message" from stdin, parses it, locates the MeSH terms, and prints them to terms.out.

This example is just a pipe between two programs, with the enhancement that the stream is binary coded ASN.1, which permits a very much richer "vocabulary" for the exchange than is usual for traditional pipes.  Further, since binary coded ASN.1 is a machine independent coding, the exchange could just as easily have been between two completely different machines over a network.  Finally, this pipe is a single channel of exchange.  The principles hold if one expands the system to many channels, by a variety of means.

AsnLib Generated Header Files

Correspondence between ASN.1 and header #defines

Medline-entry ::= SEQUENCE {                    MEDLINE_ENTRY

   uid INTEGER ,                                      MEDLINE_ENTRY_uid

   em Date ,                                          MEDLINE_ENTRY_em

   cit Cit-art ,                                      MEDLINE_ENTRY_cit

   abstract VisibleString OPTIONAL ,                  MEDLINE_ENTRY_abstract

   mesh SET OF Medline-mesh OPTIONAL ,                MEDLINE_ENTRY_mesh

   substance SET OF Medline-rn OPTIONAL ,             MEDLINE_ENTRY_substance

   xref SET OF Medline-si OPTIONAL ,                  MEDLINE_ENTRY_xref

   idnum SET OF VisibleString OPTIONAL }        MEDLINE_ENTRY_idnum

 

Medline-mesh ::= SEQUENCE {              MEDLINE_MESH

   mp BOOLEAN DEFAULT FALSE ,                         MEDLINE_MESH_mp

   term VisibleString ,                               MEDLINE_MESH_term

   qual SET OF Medline-qual OPTIONAL }                MEDLINE_MESH_qual
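The naming pattern in the table can be written down as a small function.  The rule used here is inferred from the table only (type name upper-cased, field name appended with its case kept, hyphens turned into underscores); it is not taken from AsnTool's source, and asn_define_name() is a hypothetical helper.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Build a #define-style name from an ASN.1 type name and an optional
 * field name.  Pass NULL (or "") as field for the type itself.
 * Returns 0 on success, -1 if out is too small. */
int asn_define_name(const char *type, const char *field,
                    char *out, size_t outlen)
{
    int has_field = (field != NULL && *field != '\0');
    size_t need = strlen(type) + (has_field ? 1 + strlen(field) : 0) + 1;
    size_t n = 0;
    const char *p;

    if (need > outlen)
        return -1;

    /* Type name: upper case, '-' becomes '_'. */
    for (p = type; *p != '\0'; p++)
        out[n++] = (*p == '-') ? '_' : (char) toupper((unsigned char) *p);

    /* Field name: case kept, '-' becomes '_'. */
    if (has_field)
    {
        out[n++] = '_';
        for (p = field; *p != '\0'; p++)
            out[n++] = (*p == '-') ? '_' : *p;
    }
    out[n] = '\0';
    return 0;
}
```

For example, asn_define_name("Medline-mesh", "term", buf, sizeof buf) leaves "MEDLINE_MESH_term" in buf.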

Returns From AsnLib Parsing

Medline-entry with header #defines as returned when parsing with AsnLib

Medline-entry ::= {                    /MEDLINE_ENTRY

  uid 88055872 ,                      |   MEDLINE_ENTRY_uid

  em                                  |   MEDLINE_ENTRY_em

    std {                             |    /DATE_std

      year 1988 ,                     |   |   DATE_STD_year

      month 3                         |   |   DATE_STD_month

    } ,                               |    \DATE_std

  cit {                               |  /MEDLINE_ENTRY_cit

    title {                           | |  /CIT_ART_title

      name "Developmental .. protein."| | |   TITLE_name

    } ,                               | |  \CIT_ART_title

    authors {                         | |  /CIT_ART_authors

      names                           | | |  AUTH_LIST_names

        ml {                          | | |   /AUTH_LIST_names_ml

          "Giebel LB" ,               | | |  |   AUTH_LIST_names_ml_E

          "Dworniczak BP" ,           | | |  |   AUTH_LIST_names_ml_E

          "Bautz EK"                  | | |  |   AUTH_LIST_names_ml_E

        } ,                           | | |   \AUTH_LIST_names_ml

      affil                           | | |    AUTH_LIST_affil

        str "Zentrum ... Germany"     | | |      AFFIL_str

    } ,                               | |  \CIT_ART_authors

    from                              | |   CIT_ART_from

      journal {                       | |    /CIT_ART_from_journal

        title {                       | |   |  /CIT_JOUR_title

          ml-jta "Dev Biol"           | |   | |   TITLE_ml_jta

        } ,                           | |   |  \CIT_JOUR_title

        imp {                         | |   |  /CIT_JOUR_imp

          date                        | |   | |   IMPRINT_date

            std {                     | |   | |    /DATE_std

              year 1988 ,             | |   | |   |   DATE_STD_year

              month 1                 | |   | |   |   DATE_STD_month