  1                  NCBI SOFTWARE DEVELOPMENT TOOLKIT
  2              National Center for Biotechnology Information
  3                          Bldg 38A, NIH
  4                        8600 Rockville Pike
  5                         Bethesda, MD 20894
  6 
  7 The NCBI Software Development Toolkit was developed for the production and
  8 distribution of GenBank, Entrez, BLAST, and related services by NCBI. We make
  9 it freely available to the public without restriction to facilitate the
 10 use of NCBI by the scientific community. However, please understand that
 11 while we feel we have done a high quality job, this is not commercial software.
 12 The documentation lags considerably behind the software and we must make any
 13 changes required by our data production needs. Nonetheless, many people have
 14 found it a useful and stable basis for a number of tools and applications.
 15 
 16 The toolkit is available by anonymous ftp from ftp.ncbi.nih.gov
 17 
 18 cd toolbox
 19 cd ncbi_tools
 20 bin
 21 get ncbi.tar.Z  (compressed UNIX tar file)
 22 quit
 23 
 24 In this same directory are also ncbiz.exe (DOS self extracting archive) and
 25 ncbi.hqx (Mac self extracting archive). All three files contain the same
 26 source code and will make the toolkit for all platforms.
 27 
 28 
 29 Please feel free to email questions/suggestions to:
 30   toolbox@ncbi.nlm.nih.gov
 31 
 32 If you would like hardcopy of the current documentation, send your mailing
 33 address with your request to the email address above.
 34 
 35 If you are considering a serious development project using this toolkit, please
 36 contact us. We are happy to discuss compatible strategies and inform you of
 37 our longer term plans. There is no limitation on the use of this code, or on
 38 contacting us about its use, for commercial, academic, or government groups.
 39 
 40 ===========================================================================
 41 
 42                            Version 6.1
 43       the date of release may be obtained from the file ncbi/VERSION
 44 
 45 ===========================================================================
 46 
 47                              Summary
 48 
 49 The procedure for building the toolkit on UNIX has changed slightly.
 50 There is no longer any need to download a binary NCBI product for your
 51 platform to obtain the platform-specific ncbi.mk file.
 52 
 53 To build the NCBI toolkit you need to look for platform-dependent instructions:
 54 For UNIX (including Linux and Mac OS X):
 55     look at the file make/readme.unx
 56 For alternative Mac instructions (using CodeWarrior):
 57     look at the file make/readme.mac
 58 For Microsoft Windows95/98/NT:
 59     look at the file make/readme.dos
 60 There is some information which may be useful for NCBI toolkit building
 61 in the file doc/FAQ.txt
 62 
 63 This release includes source code for the new (2.0.9) version of BLAST.
 64 Look at the file doc/README.bls for more detailed documentation on
 65 stand-alone BLAST.
 66 
 67 The file doc/README.pbl has the information about PowerBLAST.
 68 
 69 A description of Integrating Matrix Profiles And Local Alignments
 70 (IMPALA) is located in the file doc/README.imp
 71 
 72 The file doc/sequin.htm describes Sequin and its configuration.
 73 
 74 If you have problems configuring Entrez with a firewall, look at the
 75 file doc/firewall.txt
 76 
 77 This file has a section called CONFIGURATION OR SETTINGS FILES,
 78 which explains in detail how our configuration system works.  The ncbi
 79 config file (.ncbirc on UNIX, ncbi.ini on PC/Windows, and ncbi.cnf on
 80 Macintosh) is needed in order to find data files, such as
 81 gc.val (the genetic code table), provided in the toolkit or with programs
 82 like Sequin.  (The asnload files containing dynamic versions of the ASN.1
 83 parse tables are no longer needed, since all platforms can now have large
 84 static data.)
 85 
 86 It has recently become possible to eliminate the need for the ncbi config
 87 file by calling UseLocalAsnloadDataAndErrMsg () at the beginning of your
 88 program.  This looks for the data directory in the same directory as the
 89 running program.  If it doesn't find it, it looks up one level, in case you
 90 are compiling programs in the build directory of the toolkit.  If it finds
 91 the data directory in either of these places, it transiently sets the
 92 location, so code that loads these files is given the correct path.
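
The lookup order described above (beside the program first, then one level
up) can be sketched in plain C. This is only an illustration of the search
order, not the toolkit's implementation: find_data_dir and dir_exists are
hypothetical names, the directory name "data" is an assumption taken from
the phrase "data directory", and the sketch assumes a POSIX stat().

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Returns 1 if path names an existing directory, 0 otherwise. */
static int dir_exists(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/*
 * Hypothetical helper (not a toolkit function): given the directory that
 * holds the running program, look for "data" beside it, then one level up.
 * On success the path is written to out and 1 is returned; otherwise out
 * is set to the empty string and 0 is returned.
 */
static int find_data_dir(const char *program_dir, char *out, size_t out_size)
{
    static const char *const suffixes[] = { "data", "../data" };
    size_t i;

    for (i = 0; i < sizeof suffixes / sizeof suffixes[0]; i++) {
        int n = snprintf(out, out_size, "%s/%s", program_dir, suffixes[i]);
        if (n < 0 || (size_t)n >= out_size)
            continue;               /* candidate does not fit in the buffer */
        if (dir_exists(out))
            return 1;               /* first match wins */
    }
    if (out_size > 0)
        out[0] = '\0';
    return 0;
}
```

A caller would pass the directory of its own executable and, on success,
use the returned path wherever a data file location is needed.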
 93 
 94 An even more recent change is that copies of several of our data files (gc,
 95 seqcode, and featdef) are now built into the source code, so if the data
 96 directory is not found, programs that require only these can still run.
 97 
 98 One final improvement is that access to our network services is now much
 99 simpler than before, so if you are not behind a firewall and have a domain
100 name server (DNS) available you can connect to our network without needing
101 any configuration information in the ncbi config file.  Operation behind a
102 firewall, or with a proxy, requires very little in the ncbi config file, and
103 this is easily created by asking Sequin to configure for network access.
104 
105 =============================================================================
106                  Notes from Previous Releases
107 =============================================================================
108 
109 =============================================================================
110                            Version 6.0
111       the date of release may be obtained from the file ncbi/VERSION
112 =============================================================================
113 
114 This release includes source code for the new (2.0) version of BLAST.
115 Also included are a small number of incremental changes in the ASN.1
116 specification. 
117 
118 BLAST 2.0 - BLAST 2.0 can produce gapped alignments and is capable of 
119 position-specific-iterated BLASTp (PSI-BLAST).  Compared to the 1.4 release of
120 BLAST, there are also significant performance enhancements as well as extensive
121 changes to the text report and the format of the databases.  BLAST 2.0
122 uses threads for multi-processing, using the NCBI threads library.
123 Three BLAST programs may be compiled in the demo directory.   They are:
124 
125 formatdb: formats FASTA files as BLAST databases for BLAST 2.0.
126 
127 blastall: performs all five flavors of blast comparison.
128 blastn and blastp offer fully gapped alignments.
129 blastx and tblastn have 'in-frame' gapped alignments and use sum
130     statistics to link alignments from different frames.
131 tblastx provides only ungapped alignments.
132 
133 blastpgp: performs gapped blastp searches and can be used to perform
134 iterative searches in psi-blast mode.
135 
136 Additional information may be obtained from the README in the BLAST
137 directory of the FTP site and from the NCBI BLAST pages.
138 
139 ASN.1 Spec Changes for 1997
140 
141 biblio.asn
142   Cit-pat - some fields made optional to allow patent applications to be legal
143             Cit-pat.number OPTIONAL
144             Cit-pat.date-issue OPTIONAL
145     -- Patent number and date-issue were made optional in 1997 to
146     --   support patent applications being issued from the USPTO
147     --   Semantically a Cit-pat must have either a patent number or
148     --   an application number (or both) to be valid
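
Because both fields are now optional in the specification, the "number or
application number" rule has to be enforced by the program. A minimal
sketch of such a check follows. CitPatSketch is an illustrative stand-in
with made-up member names, not the C structure generated for Cit-pat, and
treating an empty string as "absent" is an assumption.

```c
#include <stddef.h>

/* Illustrative stand-in for the two Cit-pat fields discussed above.
 * NULL (or an empty string) means the optional field is absent. */
typedef struct {
    const char *number;      /* patent number, now OPTIONAL   */
    const char *app_number;  /* application number            */
} CitPatSketch;

static int has_text(const char *s)
{
    return s != NULL && s[0] != '\0';
}

/* Semantic rule from the comment in the spec: a Cit-pat needs a patent
 * number, an application number, or both. Returns 1 if valid, else 0. */
static int cit_pat_is_valid(const CitPatSketch *cp)
{
    return cp != NULL && (has_text(cp->number) || has_text(cp->app_number));
}
```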
149 
150 medline.asn
151   added ML-field to support other MEDLINE line types
152 
153 Medline-entry ::= SEQUENCE {
154     uid INTEGER OPTIONAL ,      -- MEDLINE UID, sometimes not yet available if from PubMed
155     em Date ,                   -- Entry Month
156     ... (not shown)
157     pmid PubMedId OPTIONAL ,               -- MEDLINE records may include the PubMedId
158     pub-type SET OF VisibleString OPTIONAL, -- may show publication types (review, etc)
159     mlfield SET OF Medline-field OPTIONAL }  -- additional Medline field types
160 
161 Medline-field ::= SEQUENCE {
162     type INTEGER {              -- Keyed type
163         other (0) ,             -- look in line code
164         comment (1) ,           -- comment line
165         erratum (2) } ,         -- retracted, corrected, etc
166     str VisibleString ,         -- the text
167     ids SEQUENCE OF DocRef OPTIONAL }  -- pointers relevant to this text
168 
169 DocRef ::= SEQUENCE {           -- reference to a document
170     type INTEGER {
171         medline (1) ,
172         pubmed (2) ,
173         ncbigi (3) } ,
174     uid INTEGER }
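
As an illustration of how the two new types fit together, the sketch below
mirrors Medline-field and DocRef as C structures and looks up the first
reference of a given type. The enumerated values are copied from the ASN.1
above. The structure and function names are hypothetical and are not those
of the toolkit's object loaders, and a plain array stands in for the
SEQUENCE OF.

```c
#include <stddef.h>

/* Values copied from the ASN.1 definitions above. */
enum { FIELD_OTHER = 0, FIELD_COMMENT = 1, FIELD_ERRATUM = 2 };
enum { DOCREF_MEDLINE = 1, DOCREF_PUBMED = 2, DOCREF_NCBIGI = 3 };

/* Hypothetical C mirror of DocRef. */
typedef struct {
    int  type;   /* one of the DOCREF_ values */
    long uid;
} DocRefSketch;

/* Hypothetical C mirror of Medline-field. */
typedef struct {
    int                 type;   /* one of the FIELD_ values          */
    const char         *str;    /* the text                          */
    const DocRefSketch *ids;    /* NULL when the OPTIONAL ids absent */
    size_t              n_ids;
} MedlineFieldSketch;

/* Returns the uid of the first reference of wanted_type, or -1 if the
 * field carries no such reference. */
static long first_ref_uid(const MedlineFieldSketch *field, int wanted_type)
{
    size_t i;

    if (field == NULL || field->ids == NULL)
        return -1;
    for (i = 0; i < field->n_ids; i++) {
        if (field->ids[i].type == wanted_type)
            return field->ids[i].uid;
    }
    return -1;
}
```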
175 
176 
177 seq.asn
178   MolInfo.tech - added names for HTG classes already implemented
179   Annotdesc.region - added seqloc. If present, all annots in this SeqAnnot
180                       are within this region. Optimization on big seqs.
181 
182 seqfeat.asn
183   added OrgMod.specimen-voucher - new organism qualifier
184   added OrgMod.old-name - used internally at NCBI
185   added BioSource.is-focus - for distinguishing biological focus of
186       multiple source features.
187   added Seq-feat.pseudo so any feature can be flagged explicitly as
188       belonging to a pseudogene
189   added Seq-feat.except-text for an explanation of the exception when
190       Seq-feat.except is TRUE. Currently this text is in Seq-feat.comment
191       in backbone records and GBQuals in some other genbank records.
192 
193 
194 
195 =============================================================================
196                  Notes from Previous Releases
197 =============================================================================
198 
199                            Version 5.0
200 
201                              Summary
202 
203 This release includes a small number of incremental changes in the ASN.1
204 specification. Most significant is the addition of the PubMedID, a
205 bibliographic citation identifier similar to a MEDLINE UID. PubMed is a new
206 citation database being developed at NCBI which is a superset of MEDLINE. It
207 will be an avenue by which publishers can deposit electronic versions of their
208 citations and abstracts, allowing timely linking to network Entrez from
209 the publishers' on-line services. PubMed will route these citations to MEDLINE
210 and they will appear in MEDLINE (and Entrez) after the usual MEDLINE indexing.
211 However, for some period of time, such articles will have only a PubMedID.
212 We would like to switch Entrez over to supporting PubMedIDs as early as
213 possible. WE STRONGLY ENCOURAGE DEVELOPERS TO RECOMPILE AND RELINK WITH THIS
214 VERSION OF THE TOOLKIT AS SOON AS POSSIBLE. The changes in this specification
215 should not cause problems with existing software, so a simple compile and
216 link should be enough to make you compatible. Details of ASN.1 specification
217 changes are listed below.
218 
219 There has been considerable development of the toolkit in other aspects as
220 well, many of which are embodied in Sequin, the new NCBI direct submission
221 tool, which is included in the toolkit as well. In the interest of getting the
222 PubMed changes into the specification and developers' hands promptly, we have
223 not included much on that aspect of this toolkit at this time.
224 
225 
226        Changes in the 1996 NCBI ASN.1 (version 5.0) specification
227 
228 Once again, there are very few changes to the NCBI ASN.1 specification this
229 year.  The biggest change is the addition of the PubMed ID to support the new
230 NCBI PubMed database.  There are also small additions to the medline and
231 organism specifications, detailed below.  As usual, these changes are also
232 backward compatible with old data.  However, you should recompile and relink
233 your applications as soon as possible, since the old applications will not be
234 compatible with the new datatypes.
235 
236 1) PubMed - NCBI is building a new citation database that is a superset of
237 MEDLINE and which will be linked to online journals from publishers.  The
238 bibliographic components of the specification have had support for PubMed IDs
239 added.  These include biblio.asn (objbibli.[ch]), pub.asn (objpub.[ch]),
240 medline.asn (objmedli.[ch]).
241 
242 2) pub-type - MEDLINE includes strings indicating the type of a publication.
243 The medline definition has had the attribute pub-type added to support these
244 strings.
245 
246 The publication types, from the 1996 MeSH, are:
247 
248 Abstract
249 Bibliography 
250 Classical Article 
251 Clinical Conference 
252 Clinical Trial 
253 Clinical Trial, Phase I 
254 Clinical Trial, Phase II 
255 Clinical Trial, Phase III 
256 Clinical Trial, Phase IV 
257 Comment 
258 Consensus Development Conference 
259 Consensus Development Conference, NIH 
260 Controlled Clinical Trial 
261 Corrected and Republished Article
262 Current Biog-Obit
263 Dictionary 
264 Directory 
265 Duplicate Publication
266 Editorial
267 Festschrift 
268 Guideline 
269 Historical Article
270 Historical Biography
271 Interview
272 Journal Article
273 Legal Brief
274 Letter
275 Meeting Report
276 Meta-Analysis
277 Monograph
278 Multicenter Study
279 News 
280 Newspaper Article
281 Overall
282 Periodical Index 
283 Practice Guideline 
284 Published Erratum 
285 Randomized Controlled Trial
286 Retracted Publication
287 Retraction of Publication
288 Review
289 Review Literature
290 Review of Reported Cases
291 Review, Academic
292 Review, Multicase
293 Review, Tutorial
294 Scientific Integrity Review
295 Technical Report
296 Twin Study
297 
298 3) virion - the attribute virion has been added to BioSource.genome.  It just
299 complements proviral which was already there.  This will map to a /virion
300 qualifier in the new GenBank feature table definition.
301 
302 4) division - OrgName.div now (optionally) can contain the GenBank division code
303 (e.g., PRI).
304 
305 5) signal-peptide, transit-peptide - were added to Prot-ref, to support
306 annotation of protein features on the protein sequence in a way that could be
307 mapped to a GenBank feature table.
308 
309 That's all. Relevant sections of the ASN.1 specification are shown below.
310 
311 ================================================================================
312 
313 biblio.asn
314 
315 
316 PubMedId ::= INTEGER                    -- Id from the PubMed database at NCBI
317 
318 and..
319 
320     
321 Cit-gen ::= SEQUENCE {      -- NOT from ANSI, this is a catchall
322     cit VisibleString OPTIONAL ,     -- anything, not parsable
323     authors Auth-list OPTIONAL ,
324     muid INTEGER OPTIONAL ,      -- medline uid
325     journal Title OPTIONAL ,
326     volume VisibleString OPTIONAL ,
327     issue VisibleString OPTIONAL ,
328     pages VisibleString OPTIONAL ,
329     date Date OPTIONAL ,
330     serial-number INTEGER OPTIONAL ,   -- for GenBank style references
331     title VisibleString OPTIONAL ,     -- eg. cit="unpublished",title="title"
332     pmid PubMedId OPTIONAL }           -- PubMed Id
333 
334 pub.asn
335 
336 
337 Pub ::= CHOICE {
338     gen Cit-gen ,        -- general or generic unparsed
339     sub Cit-sub ,        -- submission
340     medline Medline-entry ,
341     muid INTEGER ,       -- medline uid
342     article Cit-art ,
343     journal Cit-jour ,
344     book Cit-book ,
345     proc Cit-proc ,      -- proceedings of a meeting
346     patent Cit-pat ,
347     pat-id Id-pat ,      -- identify a patent
348     man Cit-let ,        -- manuscript, thesis, or letter
349     equiv Pub-equiv,     -- to cite a variety of ways
350     pmid PubMedId }      -- PubMedId
351 
352 medline.asn
353 
354                                 -- a MEDLINE or PubMed entry
355 Medline-entry ::= SEQUENCE {
356     uid INTEGER OPTIONAL ,      -- MEDLINE UID, sometimes not yet available if
357                                 -- from PubMed
358     em Date ,                   -- Entry Month
359     cit Cit-art ,               -- article citation
360     abstract VisibleString OPTIONAL ,
361     mesh SET OF Medline-mesh OPTIONAL ,
362     substance SET OF Medline-rn OPTIONAL ,
363     xref SET OF Medline-si OPTIONAL ,
364     idnum SET OF VisibleString OPTIONAL ,  -- ID Number (grants, contracts)
365     gene SET OF VisibleString OPTIONAL ,
366     pmid PubMedId OPTIONAL ,               -- MEDLINE records may include
367                                            -- the PubMedId
368     pub-type SET OF VisibleString OPTIONAL } -- may show publication types
369                                              -- (review, etc)
370 
371 seqfeat.asn
372 
373 
374 OrgName ::= SEQUENCE {
375     name CHOICE {
376         binomial BinomialOrgName ,         -- genus/species type name
377         virus VisibleString ,              -- virus names are different
378         hybrid MultiOrgName ,              -- hybrid between organisms
379         namedhybrid BinomialOrgName ,      -- some hybrids have genus x species
380                                            -- name
381         partial PartialOrgName } OPTIONAL , -- when genus not known
382     attrib VisibleString OPTIONAL ,        -- attribution of name
383     mod SEQUENCE OF OrgMod OPTIONAL ,
384     lineage VisibleString OPTIONAL ,       -- lineage with semicolon separators
385     gcode INTEGER OPTIONAL ,               -- genetic code (see CdRegion)
386     mgcode INTEGER OPTIONAL ,              -- mitochondrial genetic code
387     div VisibleString OPTIONAL }           -- GenBank division code
388 
389 BioSource ::= SEQUENCE {
390     genome INTEGER {         -- biological context
391         unknown (0) ,
392         genomic (1) ,
393         chloroplast (2) ,
394         chromoplast (3) ,
395         kinetoplast (4) ,
396         mitochondrion (5) ,
397         plastid (6) ,
398         macronuclear (7) ,
399         extrachrom (8) ,
400         plasmid (9) ,
401         transposon (10) ,
402         insertion-seq (11) ,
403         cyanelle (12) ,
404         proviral (13) ,
405         virion (14) } DEFAULT unknown ,
406     origin INTEGER {
407       unknown (0) ,
408       natural (1) ,                    -- normal biological entity
409       natmut (2) ,                     -- naturally occurring mutant
410       mut (3) ,                        -- artificially mutagenized
411       artificial (4) ,                 -- artificially engineered
412       synthetic (5) ,                  -- purely synthetic
413       other (255) } DEFAULT unknown , 
414     org Org-ref ,
415     subtype SEQUENCE OF SubSource OPTIONAL }
416 
417 Prot-ref ::= SEQUENCE {
418     name SET OF VisibleString OPTIONAL ,      -- protein name
419     desc VisibleString OPTIONAL ,      -- description (instead of name)
420     ec SET OF VisibleString OPTIONAL , -- E.C. number(s)
421     activity SET OF VisibleString OPTIONAL ,  -- activities
422     db SET OF Dbtag OPTIONAL ,         -- ids in other dbases
423     processed ENUMERATED {             -- processing status
424        not-set (0) ,
425        preprotein (1) ,
426        mature (2) ,
427        signal-peptide (3) ,
428        transit-peptide (4) } DEFAULT not-set }
429 
430 
431 =============================================================================
432                  Notes from Previous Releases
433 =============================================================================
434 
435         New Functions in Version 4.0
436 
437 There are a host of new functions in this release, but as usual we have not
438 managed to make time to document them all. Large parts of Sequin are present
439 which will be announced and described more fully in the fall. However,
440 specific tools of immediate interest are:
441 
442 blast2 - this is the long awaited BLAST client/server which permits structured
443    interaction with BLAST over the internet. We have provided a basic client
444    that produces the traditional blast output. In addition, the function call
445    interface can be used in more elaborate clients. For more information
446    contact Tom Madden, madden@ncbi.nlm.nih.gov
447 
448    WARNING!!! blast2 is the client we plan to support on the longer term.
449    The blast1 client we included for those of you who wanted a head start
450    will NOT be supported in future. Please shift any blast1 clients to the
451    (very similar) blast2 interface as soon as possible.
452 
453 sim, sim2 - protein and DNA sequence alignments in linear space. This is
454    the function call interface to these valuable tools. Applications have
455    been written which are available by ftp as are published papers. For more
456    information contact Jinghui Zhang, zjing@ncbi.nlm.nih.gov
457 
458  
459 
460 
461                 Changes in ASN.1 spec 4.0 from 3.0
462 
463 
464 Affil - biblio.asn
465   added the field "postal-code" for Zip code finally.
466 
467 Contact-info - submit.asn
468   added the field "contact" which is type "Author". The contact info has
469   evolved into a fully structured form, so I just took Author which has
470   structured names and structured address (Affil). We will eventually
471   phase out all the less structured ones in Contact-info.
472 
473 OrgName - seqfeat.asn
474   added "lineage", "gcode", "mgcode" for the lineage, genetic code, and
475     mitochondrial genetic code. This is part of Org-ref, and consolidates
476     all the organism info (except original SOURCE line) out of the
477     GenBank block... and enables us to deliver it nicely from Taxon.
478 
479 Seq-descr - seq.asn
480   removed the Seq-descr "neighbors" and replaced it with "dbxref", since
481     neighbors has never been used. This is used to add cross-references to
482     the whole entry.
483 
484 Pubdesc - seq.asn
485   has an added slot, "reftype" which is an integer and is used to
486    indicate the GenBank usage of a reference.
487 
488    0 - seq - applies to the sequence. This is the default and the way it is
489              used now.
490    1 - sites - applies to (unspecified) features. Equivalent to a GenBank
491              SITES feature. We could switch to this from using the
492              Imp-feat we do now.
493    2 - feats - applies to specific features. The idea here is to provide a
494              place for the full citation, so features need only reference
495              it. If no features reference it, it should be removed. This
496              would work for checking content when only a part of a sequence
497              is copied or pasted. A "sites" ref could not have this check
498              since we do not know which features it goes to.
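
The three reftype values and the removal rule for "feats" references can
be written down as a small C sketch. The integer values come from the list
above. The enum, the function names, and the idea of passing in a count of
citing features are illustrative assumptions, not toolkit definitions.

```c
/* reftype values as listed above; the identifiers are hypothetical. */
typedef enum {
    REFTYPE_SEQ   = 0,   /* applies to the sequence (default)     */
    REFTYPE_SITES = 1,   /* applies to unspecified features       */
    REFTYPE_FEATS = 2    /* applies to specific, citing features  */
} RefTypeSketch;

/* Returns the short name used in the list above, or "unknown". */
static const char *reftype_name(int reftype)
{
    switch (reftype) {
    case REFTYPE_SEQ:   return "seq";
    case REFTYPE_SITES: return "sites";
    case REFTYPE_FEATS: return "feats";
    default:            return "unknown";
    }
}

/* A "feats" citation that no remaining feature references should be
 * dropped, for example after part of a sequence is copied. The check is
 * not possible for "sites", since the target features are unknown.
 * Returns 1 if the citation should be removed, else 0. */
static int pubdesc_should_remove(int reftype, int citing_feature_count)
{
    return reftype == REFTYPE_FEATS && citing_feature_count == 0;
}
```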
499 
500 Seq-feat - seqfeat.asn
501   added a slot called "dbxref" to Seq-feat. This is a SET OF Dbtag. It will
502   be for adding the new db_xref qualifiers to features. We already have some
503   of these in the xref slots of Gene-ref, Prot-ref, Org-ref. It means we have
504   to check two places in these cases. I do not want to retire the slots
505   since these were meant to be used in other contexts besides features.. and
506   Org-ref already is.
507 
508 
509   added a slot called "anticodon" to the tRNA extension of the RNA feature.
510   This is a Seq-loc that points to the location of the anticodon in a tRNA.
511   We have been populating this data in a User-object, and will have to do
512   a retro to convert it.
513 
514   EXPORTED Genetic-code
515 
516 
517 Seq-align - seqalign.asn
518 
519    added "bounds" to Seq-align so you can record the regions over which
520    an alignment was computed.. not always included in the resulting alignment
521    itself.
522 
523    added two new types:
524    A) Packed-seg -- a denser representation from Colombe and Jinghui
525    B) disc - discontinuous alignments as a SEQUENCE OF Seq-align
526 
527 
528 Seq-annot - seq.asn
529 
530    added a field to Seq-annot, Align-def, to discriminate types of
531    alignment sets. This has the advantage of minimal changes as well as
532    separating sets of alignments from conceptually single alignments. I am
533    not sure it is necessary to distinguish "alt" from "blocks" though. Also
534    it means you can attach more info, with other Seq-annot fields and/or by
535    expanding the Align-def. I put in "ids" in Align-def specifically to put
536    the one Seq-id that is the "master" for type "ref". I made it a SET OF
537    so we could use it for other collections where we might want to list
538    more than one.
539 
540    added "ids" and "locs" as allowed types within Seq-annot. This would
541    enable us to pass lists like this around between tools with all the
542    additional descriptive information in Annotdesc. I know this will be
543    useful.
544 
545    added "general" to Annot-id for tracking 3rd party annotations.
546 
547 
548 
549 
550 
551 
552 
553                          Introduction
554 
555     This distribution is release 5.0 of the NCBI core library for building
556 portable software, and AsnLib, a collection of routines for handling ASN.1
557 data and developing ASN.1 software applications.  AsnLib and the asntool
558 application are built using the CoreLib routines. In the \doc directory is an
559 MS Word file which details the information given below. It is also available
560 as hardcopy. See the README in \doc.
561 
562 The lowest layer of code is the CoreLib.  These are multi-
563 platform functions for memory allocation (including byte stores), string 
564 manipulation, file input and output, error and general messages, and 
565 time and date notification.  These functions have been written only 
566 where we found that the existing ANSI functions were not sufficiently 
567 multi-platform or well-behaved among all of the platforms that we 
568 support.  For each platform (a combination of processor, operating 
569 system, compiler, and windowing system), we supply a specific ncbilcl.h 
570 file, which contains typedefs and defines for multi-platform symbols, 
571 and includes a number of standard header files.  (For example, 
572 ncbilcl.msw is used for the Microsoft C compiler under Microsoft Windows 
573 on the PC.)  Use of these symbols, and of the functions in the CoreLib, 
574 allows us to write multi-platform source code for a variety of disparate 
575 platforms.
576 
577 The next layer of code is the AsnLib stream reader.  This is 
578 used in conjunction with a header file and a parse table loader file, 
579 both of which are produced by processing the formal ASN.1 specification 
580 with the AsnTool application. The symbolic defines in the 
581 header file are pointers into the parse table, in which the ASN.1 
582 specification is represented.  To read at the stream reader level, a 
583 program alternates between calls to AsnReadId and AsnReadVal.  AsnReadId 
584 returns a pointer into the parse table, which can be compared against 
585 the defines in the AsnTool-generated header.  For example, in the 
586 specification for MEDLINE records, the Medline-entry section has an item 
587 called "uid", for the unique ID of the record.  This is symbolized in 
588 the header file as MEDLINE_ENTRY_uid.  When AsnReadId returns this 
589 symbol, the program calls AsnReadVal to obtain the uid for that record. 
590 AsnKillValue is also needed to free any memory allocated by AsnReadVal, 
591 which occurs when the value is a string and not an integer.  The entire 
592 set of records on the Entrez CD-ROM can be read as a single stream with 
593 the AsnLib functions.
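
The alternating read-id / read-value pattern can be illustrated without
the toolkit by a small self-contained mock. Everything below is a stand-in
written for this note: mock_read_id, mock_read_val and mock_kill_value
only imitate the roles of AsnReadId, AsnReadVal and AsnKillValue, the
TypeSketch descriptors imitate parse-table entries, and an in-memory token
array imitates the ASN.1 stream. It is a sketch of the control flow, not
of the AsnLib API.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-ins for parse-table entries; in real code the symbols come from
 * the AsnTool-generated header. */
typedef struct { const char *name; int is_string; } TypeSketch;
static const TypeSketch T_UID      = { "Medline-entry.uid", 0 };
static const TypeSketch T_ABSTRACT = { "Medline-entry.abstract", 1 };

typedef struct { long intvalue; char *ptrvalue; } ValueSketch;
typedef struct { const TypeSketch *type; long i; const char *s; } Token;
typedef struct { const Token *tokens; size_t count; size_t pos; } StreamSketch;

/* Mock "read id": reports what comes next, or NULL at end of stream. */
static const TypeSketch *mock_read_id(const StreamSketch *st)
{
    return st->pos < st->count ? st->tokens[st->pos].type : NULL;
}

/* Mock "read value": consumes the value; strings are copied to the heap. */
static int mock_read_val(StreamSketch *st, const TypeSketch *tp, ValueSketch *v)
{
    const Token *t = &st->tokens[st->pos++];

    v->intvalue = 0;
    v->ptrvalue = NULL;
    if (tp->is_string) {
        size_t n = strlen(t->s) + 1;
        v->ptrvalue = malloc(n);
        if (v->ptrvalue == NULL)
            return 0;
        memcpy(v->ptrvalue, t->s, n);
    } else {
        v->intvalue = t->i;
    }
    return 1;
}

/* Mock "kill value": frees memory only when the value was a string. */
static void mock_kill_value(const TypeSketch *tp, ValueSketch *v)
{
    if (tp->is_string) {
        free(v->ptrvalue);
        v->ptrvalue = NULL;
    }
}

/* Alternates id and value reads, keeping only the uid values.
 * Returns the number of uids stored in out (at most max_out). */
static size_t collect_uids(StreamSketch *st, long *out, size_t max_out)
{
    const TypeSketch *tp;
    ValueSketch v;
    size_t n = 0;

    while ((tp = mock_read_id(st)) != NULL) {
        if (!mock_read_val(st, tp, &v))
            break;
        if (tp == &T_UID && n < max_out)   /* compare against the symbol */
            out[n++] = v.intvalue;
        mock_kill_value(tp, &v);           /* release any string memory  */
    }
    return n;
}
```

The pointer comparison against T_UID plays the part of comparing the
returned parse-table pointer with MEDLINE_ENTRY_uid in the description
above.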
594 
595 The ASN.1 records may be accessed at a higher level through the object 
596 loaders, which utilize the stream processing functions to 
597 load C memory structures with the contents of the ASN.1 objects. For 
598 each ASN.1 object we specify, we also define an equivalent C memory 
599 structure.  The object loader level of code contains functions to read 
600 and write each ASN.1 object.  These are hierarchical, as are the ASN.1 
601 specifications.  Calling the top level loader, SeqEntryAsnRead, will 
602 load an entire SeqEntry from an open AsnIo channel, and will return a 
603 pointer to the loaded memory structure.  The read function for an AsnIo 
604 channel can be swapped to refer to a normal disk file, a network socket, 
605 or to compressed data, which it automatically decompresses.  The object 
606 loader code can interconvert between the highly-branched memory object 
607 and a linear ASN.1 message with complete fidelity.  The object loaders 
608 have additional functions, including the ability to explore the 
609 structure and notify the program when particular data elements are 
610 encountered.  The entire contents of the Entrez CD-ROM can also be 
611 streamed through the object loaders.  However, most calls to the object 
612 loaders for simply reading a particular record are done via the data 
613 access functions (see below).
614 
615 The data access functions allow a program to call the object loaders on 
616 a sequence or MEDLINE record given the uid of the record.  
617 This will get the data into memory regardless of whether the data are 
618 compressed on the Entrez CD-ROM or are obtained through a service over 
619 the Internet. This means that a detailed understanding of the files and 
620 formats on the Entrez disc is not needed by application programmers. The 
621 function to load a sequence record, SeqEntryGet, needs the uid to 
622 retrieve and a complexity code parameter. A sequence record is in the 
623 form of a NucProt set.  This contains a nucleotide (which may itself be 
624 composed of segments) and all of the proteins it is known to encode.  
625 The set of segments is called a SegSet, and the individual sequences are 
626 called BioSeqs.  We have taken the liberty of producing this integrated 
627 view, but the complexity code parameter allows the record to be easily 
628 loaded in a simpler, more traditional form, if desired.  The accession 
629 number term list is built to supply the proper uids to support this 
630 facility.  This access library is compatible with Entrez release 1.0 or
631 later only.
632 
633 The sequence utilities and application programmer interface layer 
634 allows exploration of the loaded memory structures and 
635 generation of standard literature or sequence reports from those 
636 objects.  For example, a BioSeq can be converted to FASTA or GenBank 
637 flat file formats and saved to a file, and a MEDLINE record can be saved 
638 in MEDLARS format, which is suitable for entry into personal 
639 bibliographic database programs.  A sequence port can be opened that 
640 gives a simple, linear view of a segmented sequence, converting 
641 alphabets, merging exon segments, and dealing with information on both 
642 strands of the DNA.  This layer also includes some functions to explore 
643 the NucProt set.  The explore functions visit each individual BioSeq in 
644 the set, calling a callback function for each sequence node so that a 
645 program can examine feature tables and other information that are 
646 associated with the NucProt or SegSets or with the individual sequences.
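
The callback style of the explore functions can be sketched as a
depth-first walk over a nested set. The types and names here
(NodeSketch, explore_bioseqs, count_cb) are hypothetical and much simpler
than the toolkit's own structures. The sketch only shows how a callback
ends up being called once per individual sequence, however deeply the
sets are nested.

```c
#include <stddef.h>

typedef enum { NODE_BIOSEQ, NODE_SET } NodeKind;

/* Hypothetical node: either one sequence or a set of further nodes. */
typedef struct NodeSketch {
    NodeKind                 kind;
    const char              *label;
    const struct NodeSketch *children;    /* used only for NODE_SET */
    size_t                   n_children;
} NodeSketch;

typedef void (*BioseqCallback)(const NodeSketch *bioseq, void *user_data);

/* Visits every individual sequence under node, depth first, calling cb
 * for each one. Returns the number of sequences visited. */
static size_t explore_bioseqs(const NodeSketch *node, BioseqCallback cb,
                              void *user_data)
{
    size_t i, visited = 0;

    if (node == NULL)
        return 0;
    if (node->kind == NODE_BIOSEQ) {
        if (cb != NULL)
            cb(node, user_data);
        return 1;
    }
    for (i = 0; i < node->n_children; i++)
        visited += explore_bioseqs(&node->children[i], cb, user_data);
    return visited;
}

/* Example callback: counts sequences in the size_t that user_data
 * points to. A real callback would examine the sequence instead. */
static void count_cb(const NodeSketch *bioseq, void *user_data)
{
    (void)bioseq;
    ++*(size_t *)user_data;
}
```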
647 
648 Vibrant is a multi-platform user interface development library that runs 
649 on the Macintosh, Microsoft Windows on the PC, or X11 and OSF/Motif on 
650 UNIX and VAX computers [separate documentation].  It is used to build 
651 the graphical interface for the Entrez application (whose source code is 
652 in the browser directory). The philosophy behind Vibrant is that 
653 everything in the published user interface guidelines (the generic 
654 behavior of windows, menus, buttons, etc.), as well as positioning and 
655 sizing of graphical control objects, is taken care of automatically.  
656 The program provides callback functions that are notified when the user 
657 has manipulated an object. Vibrant and Entrez code are not supported, 
658 but are provided on an as-is basis.
659 
660 The advantage of using AsnLib and the object loaders, as they are 
661 implemented, is that application program developers merely need to 
662 recompile their programs with the new (AsnTool-generated) header files 
663 and load the new parse tables (included with the Entrez software) in 
664 order to be able to read the new data.  This process is straightforward, 
665 and will not break existing program code.  The application is free to 
666 ignore new fields if it does not choose to take advantage of the new 
667 kinds of information.
668 
669 When developing new ASN.1 specifications, as of June 1994 it is possible to
670 automatically generate the object loaders and header files for those
671 specifications, using the AsnCode utility.  For some complex ASN.1
672 specifications, however, AsnCode may fail to generate the correct source code.
673 
674 The documentation is currently being brought up to date.  The programs 
675 in the demo directory are designed to teach the proper use of many of 
676 the functions discussed above.  Many of these programs are not yet 
677 documented.  The simplest is testcore.c, which tests various functions 
678 in the CoreLib.  The most complex is getfeat.c, which takes an accession 
679 number or locus name, determines the unique seq ID, retrieves the entry
680 from the Entrez CD-ROM using the data access library, locates all coding 
681 region features using the explore functions, and prints the DNA 
682 sequences of all exons using sequence port functions.  If you cannot 
683 extract and print the doc.tar.Z file, please send an email message with 
684 your land mailing address and phone number to toolbox@ncbi.nlm.nih.gov, 
685 and we will mail a copy to you.
686 
687 The contents of the ncbi directory (the highest level, containing the 
688 NCBI Software Development Kit source code in several subdirectories) are
689 shown below.  The readme file contains instructions on copying the
690 appropriate make files to be built in the build directory.  The makeall
691 file copies headers to the include directory and builds four libraries
692 (ncbi, ncbiobj, ncbicdr and vibrant), copying them to the lib directory.
693 The makedemo file builds the demo programs and the Entrez application:
694 
695   api           Application Programmer Interface, Sequence Utilities
696   asn           ASN.1 specifications for publications and sequences
697   asnlib        Source code for AsnLib and asntool
698   asnload       AsnLib headers and dynamic parse tables (Mac and PC)
699   asnstat       AsnLib headers that use static memory (UNIX and VMS)
700   bin           Asntool executable copied here
701   biostruc      Source code for Molecular Modelling DataBase functions
702   browser       Source code for Entrez application
703   build         Empty directory for building tools and libraries
704   cdromlib      Access routines for data on the Entrez CD-ROM
705   cn3d          Source code for Vibrant-based 3D structure viewer
706   config        Configuration files for NCBI software:
707     mac
708     unix
709     vms
710     win
711   corelib       Source code for NCBI Core Software Library
712   data          Data files used for sequence conversion
713   demo          AsnLib and sequence utility demonstration programs
714   desktop       Source code for Vibrant-based viewers and editors
715   doc           Documentation in Microsoft Word file
716   include       Include files required by applications are copied here
717   lib           Libraries copied here
718   link          Contains several subdirectories with build accessory files:
719     macmet        Macintosh Metrowerks/CodeWarrior
720     macmpw        Macintosh MPW C
721     mswin         Microsoft C and Borland C for Windows
722   make          Make files for various systems
723   network       Network version of data access
724     apple
725     blast2
726     encrypt
727     entrez
728     netmanag
729     nsclilib
730   object        Functions for reading and writing complex objects
731   sequin        Source code for Sequin application
732   tools         Source code for alignment and other contributed utilities
733   readme        File that contains important building instructions
734   vibrant       Source code for Vibrant portable interface package
735 
736 The platforms that are supported (as indicated by the suffix on the 
737 relevant ncbilcl.h file) are shown below.  Those marked with an asterisk 
738 (*) are available as-is:
739 
740   370*          IBM 370
741   acc           SUN acc compiler
742   alf           DEC Alpha under OSF/1
743   aov           DEC Alpha under AXP/OpenVMS
744   aux*          Macintosh A/UX
745   bor           Borland for DOS
746   bwn           Borland for Microsoft Windows
747   ccr           CenterLine CodeCenter
748   cpp           SUN C++
749   cra*          Cray
750   cvx*          Convex
751   gcc           Gnu gcc (under SunOS, not Solaris)
752   hp*           Hewlett Packard
753   lna*          Linux on DEC Alpha
754   lnx           Linux (RedHat Linux release 5.2 with kernel 2.0.36)
755   met           Macintosh Metrowerks compiler
756   mpw           Macintosh Programmer's Workshop
757   msc           Microsoft C for DOS
758   msw           Microsoft for Windows
759   nxt*          NeXT
760   r6k*          IBM RS 6000
761   scr           CodeCenter under Sun Solaris
762   sgi           Silicon Graphics
763   sin           Sun Solaris on Intel processors
764   sol           Sun Solaris (for cc and gcc)
765   thc           THINK C on Macintosh
766   ult           DEC ULTRIX
767   vms           DEC VAX/VMS
768 
769 Questions or comments can be directed to toolbox@ncbi.nlm.nih.gov.
770 
771 ANSI C:
772 
773     This software requires an ANSI C compiler.  This will be no problem at
774 all except to people on Sun machines, where the bundled C compiler, cc, is
775 non-ANSI.  However, you can use the Sun unbundled compiler, acc, or the GNU
776 compiler, gcc (which is free); either works just fine.  If you have written
777 applications on the Sun with non-ANSI functions, the ANSI compilers will
778 complain.  See the notes below if this is a problem.
779 
780 
781                          Installation
782 
783 To build the NCBI toolkit you need to look for platform-dependent instructions:
784 For UNIX:
785     look at the file make/readme.unx
786 For Mac:
787     look at the file make/readme.mac
788 For Microsoft Windows95/98/NT:
789     look at the file make/readme.dos
790 
791 There is some information which may be useful for NCBI toolkit building
792 in the file doc/FAQ.txt
793 
794 ALL -
795      change to the directory above the ncbi subdirectory
796 
797 Unix
798     tested on Sun Sparc (Solaris 2.6, SunOS 4.1.3),
799     Silicon Graphics IRIX 5.* and 6.*, DEC Alpha with OSF/1 V5.1,
800     Linux (Red Hat Linux release 6.2 with kernel 2.2.16) on Intel,
801     Sun Solaris for Intel (Solaris 2.7).
802 
803     Run the script ncbi/make/makedis.csh, keeping its output in a
804     separate file:
805     for sh or bash:
806         ncbi/make/makedis.csh 2>&1 | tee out.makedis.csh
807     for csh or tcsh:
808         ncbi/make/makedis.csh |& tee out.makedis.csh
809     If that script gives you an error like this:
810         Your platform is not supported.
811         To port ncbi toolkit to your platform consult
812         the files platform/*.ncbi.mk
813     then you should check the script ncbi/make/makedis.csh and
814     add a proper platform-dependent ncbi.mk file to the ncbi/platform
815     directory.
816 
817     Other UNIX: AIX, ULTRIX, NeXT, Sun acc
818           Follow the models above.  Read the headers in makeall.unx and
819     makedemo.unx for details.
820 
821     For all UNIX, edit .ncbirc as described in section "CONFIGURATION OR
822     SETTINGS FILES".
823     Optionally, edit .login to "setenv NCBI [path to .ncbirc file]"
824 
825 MS-DOS
826     look at the file make/readme.dos
827 
828 Mac
829      tested on CodeWarrior IDE 2.1, MacOS 8.0
830      All - copy config:mac:ncbi.cnf to your System Folder, or to the
831                  System Folder:Preferences subfolder
832                  edit the "ASNLOAD" line in "ncbi.cnf" to point to the
833                    ncbi:asnload directory in this release
834                  edit the "DATA" line to point to the ncbi:data directory
835      CodeWarrior - raise Preferred Size of Script Editor from 700 to 3000,
836                      and raise Preferred Size of CodeWarrior IDE 2.1 by
837                      2000 (e.g., from 8206 to 10206), using Get Info from
838                      the Finder.
839                    to compile for MC680x0 platform (default is PowerPC),
840                      change property MASTER from "PPC" to "68K".
841                    run copyhdrs.met
842                    run makeall.met
843                    run makenet.met
844                    run makedemo.met
845      Think C - no longer supported
846      MPW C -   no longer supported
847 
848 Changes to VMS make file naming conventions:
849 
850     The old .dcl suffix (last character is a lower case L) was changed
851 to .dc1 (last character is the numeral 1) to allow for different make files
852 for DecWindows 1.1 and DecWindows 1.2.  Several new .dc2 files were
853 contributed by David Mathog of CalTech.  A synopsis of his additional
854 instructions:
855 
856     VAX C  DecWindows 1.1        Use .dc1 files.
857     DEC C  DecWindows 1.1        Use .dc1 files,
858                                    but change cc to cc/standard=vaxc
859     VAX C  DecWindows 1.2        This combination has not been tested.
860     DEC C  DecWindows 1.2        Use .dc2 files.
861 
862 VMS (without Vibrant) on VAX
863      $set def [ncbi.build]
864      $copy [-.make]*.dc1 *.com
865      $@makeall
866 
867      check ncbi.cfg as described in section "CONFIGURATION OR SETTINGS FILES".
868      edit LOGIN.COM to "define NCBI [path to ncbi.cfg file]"
869 
870     To make demos:
871         $@makedemo
872 
873 VMS (with Vibrant) on VAX
874      $set def [ncbi.build]
875      $copy [-.make]*.dc1 *.com
876      $@viball
877 
878      check ncbi.cfg as described in section "CONFIGURATION OR SETTINGS FILES".
879      edit LOGIN.COM to "define NCBI [path to ncbi.cfg file]"
880 
881     To make demos:
882         $@vibdemo
883 
884                             Testing
885 
886 VMS only:  look in rundemo.dc1 in [make] to see how to give command
887     line arguments.  Not all demo programs are shown. Run at least testcore.
888 
889 All else:
890 
891     The build directory should contain a program called testcore.  Type
892 "testcore -" and it should show you some default arguments.  Type "testcore"
893 and it will run through a variety of functions in CoreLib, prompting you for
894 responses along the way.  It should run without a crash or error report.  If
895 you made Vibrant versions, all demos will have startup dialog boxes.  If not,
896 they take command line arguments.
897 
898     If testcore runs, read the documentation for CoreLib and for AsnLib.
899 The AsnLib documentation contains instructions for running asntool itself and
900 for running a few of the demo programs.  There are a large number of demo
901 programs now (including Entrez itself, if you made the Vibrant versions).
902 
903 
904 
905 CONFIGURATION OR SETTINGS FILES:
906 
907     One of the fundamental problems in writing portable software concerns
908 configuration issues.  Each individual user's computer will have its own
909 particular hardware and software environment, and each machine will have
910 its disk file hierarchy set up in a unique manner.  A program that needs
911 accessory information, such as help files, parse tables, or format
912 converters, must be given a means of finding the data regardless of where
913 the user has placed the files.  The difficulty is compounded by the different
914 conventions for naming files and specifying paths on each class of machine.  
915 For example, the name of a CD-ROM on the Macintosh is fixed, determined by
916 information on the CD itself, whereas on the PC it is addressed by a drive
917 letter, which can be assigned by the user, but which cannot be reconciled
918 with the name the Macintosh sees.
919 
920     An associated problem is that many programs will want to allow the user
921 to make persistent changes to parameters.  These parameters typically involve
922 numbers or font specifications, but may also include paths to data files.  
923 Some platforms supply such configuration information in preferences files,
924 others in environment variables.  Manipulating these settings is platform
925 dependent, as is the format in which the preference is specified.
926 
927     The NCBI Software Toolkit core library addresses these problems by
928 providing configuration or settings files.  These are modeled after the .INI
929 files used by Microsoft Windows.  Settings files are plain ASCII text files
930 that may be edited by the user or modified by the program.  They are divided 
931 into sections, each of which is headed by the section name enclosed in square
932 brackets.  Below each section heading is a series of key=value strings, somewhat
933 analogous to the environment variables that are used on many platforms.
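
The section and key=value layout described above is simple enough to read
with a few lines of C.  The following is a minimal sketch that assumes the
settings text is already in memory; find_setting is a hypothetical helper
written for this illustration, not the toolkit's own settings code, and it
matches names exactly (the toolkit's matching rules may differ).

```c
#include <assert.h>
#include <string.h>

/* Look up `key` in `[section]` of settings text held in memory.
 * On success copy the value into `out` (at most outlen-1 characters)
 * and return 1; return 0 if the section or key is absent. */
int find_setting(const char *text, const char *section, const char *key,
                 char *out, size_t outlen)
{
    int in_section = 0;
    size_t keylen;
    const char *line = text;

    assert(text && section && key && out && outlen > 0);
    keylen = strlen(key);

    while (*line != '\0') {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);

        if (len > 0 && line[len - 1] == '\r')
            len--;                              /* tolerate CR-LF files */

        if (len >= 2 && line[0] == '[' && line[len - 1] == ']') {
            /* Section heading: compare the name between the brackets. */
            in_section = (len - 2 == strlen(section) &&
                          strncmp(line + 1, section, len - 2) == 0);
        } else if (in_section && len > keylen &&
                   strncmp(line, key, keylen) == 0 && line[keylen] == '=') {
            size_t vlen = len - keylen - 1;
            if (vlen >= outlen)
                vlen = outlen - 1;              /* truncate to fit */
            memcpy(out, line + keylen + 1, vlen);
            out[vlen] = '\0';
            return 1;
        }
        line = eol ? eol + 1 : line + strlen(line);
    }
    return 0;
}
```

Given text containing a [NCBI] section with the line ROOT=D:, a call such as
find_setting(text, "NCBI", "ROOT", buf, sizeof buf) fills buf with "D:".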
934 
935     The ncbi configuration file supplies general purpose configuration
936 information on paths for commonly used data files.  The typical file set up for
937 the Entrez application running on the PC under Microsoft Windows is shown below:
938 
939 [NCBI]
940 ROOT=D:
941 ASNLOAD=C:\ENTREZ\ASNLOAD\
942 DATA=C:\ENTREZ\DATA
943 
944     The only section is entitled NCBI.  The ROOT entry refers to the path to
945 the Entrez CD-ROM.  In this example, the user has configured the machine to
946 use drive letter D.  (On the Macintosh, the name of the disc is SEQDATA, which
947 cannot be changed by the user.)  The ASNLOAD specifies the path to the ASN.1
948 parse tables.  These files are required by the AsnLib functions, and all
949 higher-level procedures that call them,  including the Object Loader, Sequence
950 Utility, and Data Access functions.  Files pointed to by the DATA entry contain 
951 information necessary to convert biomolecule sequence data into different
952 alphabets (e.g., unpacking the 2-bit nucleotide code stored on the Entrez CD
953 into standard IUPAC letters).
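
To make the alphabet conversion concrete, here is a minimal sketch of
unpacking a 2-bit nucleotide code into letters.  The toolkit drives its
conversions from the files in the DATA directory; this sketch hard-codes a
table instead.  The bit layout used (four bases per byte, first base in the
two highest-order bits, codes 0 to 3 meaning A, C, G, T) and the name
unpack_2bit are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Unpack `nbases` bases from 2-bit packed data into letters.
 * Assumed layout: four bases per byte, first base in the two
 * highest-order bits, codes 0..3 = A, C, G, T.
 * `out` must have room for nbases + 1 characters. */
void unpack_2bit(const unsigned char *packed, size_t nbases, char *out)
{
    static const char letters[4] = { 'A', 'C', 'G', 'T' };
    size_t i;

    assert(packed != NULL && out != NULL);
    for (i = 0; i < nbases; i++) {
        unsigned shift = 6u - 2u * (unsigned)(i % 4);   /* 6, 4, 2, 0 */
        out[i] = letters[(packed[i / 4] >> shift) & 3u];
    }
    out[nbases] = '\0';
}
```

Under these assumptions the byte 0x1B (binary 00 01 10 11) unpacks to ACGT.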
954 
955     Although the contents of a configuration file are similar regardless of
956 platform, the name of the file and its location are platform dependent.  If the
957 base name of the configuration file is xxx, then the actual file name is shown
958 below for each platform:
959 
960 Macintosh                   xxx.cnf
961 Microsoft Windows           xxx.INI
962 MS-DOS (without Windows)    xxx.CFG
963 UNIX                        .xxxrc
964 VMS                         xxx.cfg
965 
966     Samples of such files are in subdirectories of config.  The UNIX version
967 does not have the leading '.' in its filename, so that it is visible in listings.
968 
969     The location in which these files must reside is also platform dependent,
970 and the functions that manipulate the contents may look in several places to
971 find these files.
972 
973 On the Macintosh, the function first looks in the System Folder, then in the
974 Preferences folder within the System Folder.  (See the Mac OS X addendum in the
975 next paragraph).  Under Microsoft Windows, the file must be in the Windows
976 directory, along with all of the other .INI files.  Under DOS without Windows,
977 the function first looks in the current working directory, then in the directory
978 whose path is specified in the NCBI environment variable. Under UNIX and VMS,
979 the current working directory is first checked, then the user's home directory,
980 and finally the directory specified by the NCBI environment variable.  (Under
981 UNIX, when it uses the environment variable, it will check for configuration
982 files first without and then with the initial dot.)  On the multi-user
983 platforms (UNIX and VMS), the use of the NCBI environment variable allows a
984 common settings file to be used as the default by multiple users.  If such a
985 settings file is changed under program control, it is copied over into the
986 user's home directory, and the new copy is modified.  The order of searching
987 for settings files ensures that this new copy is used in all subsequent
988 operations.
989 
990     On Mac OS X, it first looks for xxx.cnf in username/Library/Preferences,
991 then in package/Contents/Resources, where username is the user's home directory
992 and package is the application package.  If it does not find the configuration
993 file, it then switches to UNIX style, looking for .xxxrc in the home directory
994 and then in the current directory.  This way Mac OS X applications retain the
995 traditional Mac behavior but can also use UNIX style configuration files.
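
The UNIX search order described above (current working directory, then home
directory, then the NCBI directory) can be sketched in plain C.  This is a
minimal sketch, not the toolkit's own lookup code: the function names and
the fixed-size path table are hypothetical, and the name tried "without the
initial dot" is assumed here to be xxxrc.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_CANDIDATES 4
#define MAX_PATH_LEN   512

/* Fill `paths` with the places to look for the settings file with base
 * name `base`, in UNIX search order, and return how many were written.
 * `home` and `ncbi_dir` may be NULL when the corresponding environment
 * variable is not set. */
int unix_settings_candidates(const char *base, const char *home,
                             const char *ncbi_dir,
                             char paths[MAX_CANDIDATES][MAX_PATH_LEN])
{
    int n = 0;

    assert(base != NULL && paths != NULL);
    /* 1. current working directory */
    snprintf(paths[n++], MAX_PATH_LEN, "./.%src", base);
    /* 2. user's home directory */
    if (home != NULL)
        snprintf(paths[n++], MAX_PATH_LEN, "%s/.%src", home, base);
    /* 3. NCBI directory: first without, then with, the leading dot */
    if (ncbi_dir != NULL) {
        snprintf(paths[n++], MAX_PATH_LEN, "%s/%src", ncbi_dir, base);
        snprintf(paths[n++], MAX_PATH_LEN, "%s/.%src", ncbi_dir, base);
    }
    return n;
}

/* Convenience wrapper that reads HOME and NCBI from the environment. */
int unix_settings_candidates_env(const char *base,
                                 char paths[MAX_CANDIDATES][MAX_PATH_LEN])
{
    return unix_settings_candidates(base, getenv("HOME"), getenv("NCBI"),
                                    paths);
}
```

A program would try to open each candidate in turn and use the first one
that exists, which is why a copy saved in the home directory takes
precedence over the shared file in the NCBI directory.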
996 
997 
998 contents of ASNLOAD are in ncbi/asnload
999 contents of DATA are in ncbi/data
1000 
1001 Automatic Generation of code to read and write new ASN.1 messages.
1002 (Previously, ASNCODE USAGE)
1003 
1004 'asntool' can now generate code for use as ASN.1 readers and writers.
1005 This functionality used to be in the program called 'asncode'.  There
1006 is thus no longer any need for the *.l* files.  An example of how
1007 to generate this code follows:
1008 
1009 
1010          asntool -m YOURSPEC.asn -G -B genYOURSPEC 
1011 
1012 Both genYOURSPEC.h and genYOURSPEC.c will be generated.
1013 
1014 Within ASN.1 definitions, types can be EXPORTed and IMPORTed.
1015 If YOURSPEC.asn imports definitions from otherspec.asn, then otherspec.asn
1016 has to be added to the -m parameter as below.  Note that code is only
1017 generated for the first file.
1018 
1019         asntool -m YOURSPEC.asn,otherspec.asn -G -B  genYOURSPEC 
1020                                 ^
1021 
1022 Notice the lack of a blank at the caret (^), above.  This is important.
1023 
1024 
1025 MAJOR CHANGES FROM DOCUMENTATION:
1026 
1027     AsnNode structures have proved to be generally useful and have been moved
1028 from AsnLib to ncbimisc.  In addition, some elements of structs used in the
1029 object loaders were called "class" to match the ASN.1 names.  Class is a C++
1030 reserved word, so all instances of "class" have been changed to "_class".
1031 
1032     To conform to our naming conventions, we have changed the names appropriately:
1033 
1034 AsnValue = DataVal
1035 AsnNode = ValNode
1036 class = _class
1037 
1038     A global search and replace of your code with these strings (not restricted
1039 to words... we want to change AsnNodePtr = ValNodePtr as well) should fix
1040 any problems.  Field names within structures have not been changed.  If your
1041 code uses only the object loaders, you may not find these strings in your
1042 code at all.
1043 
1044 DATA ACCESS LIBRARIES
1045 
1046     cdromlib contains data access routines compatible with releases 1.0-6.0
1047 of the Entrez CDROM.  The documentation for these functions is out of
1048 date.  The routines in cdromlib have been split into entrez, sequence, and
1049 medline access functions.  The interface you should normally program to is
1050 defined in accentr.[ch].  The form of these calls has been changed to make
1051 them compatible with the NCBI network server, a client/server version of
1052 data access.  A program written to use these calls can access the cdrom
1053 data, the network data, a combination, or that plus a local database by just
1054 fiddling with defines.  The form of the api for these functions has also
1055 been changed to hide more of the details of storage and caching, so that the
1056 different optimizations done to support cdrom and network access are
1057 transparent to the application programmer.  The end user tool called
1058 "Entrez" now uses these libraries as its only means of data access (i.e.,
1059 you can write an application of your own with any or all of Entrez's
1060 functionality using just these routines).
1061 
1062 NETWORK LIBRARIES
1063 
1064     The toolbox now includes NCBI "Network Services".  This includes
1065 everything which you need to build your own "Network Entrez" client software.
1066 The network libraries include a generic network services library (nsclilib),
1067 which is used to contact the network services dispatcher and connect to a
1068 desired server.  Note that some development platforms require that you obtain
1069 a few source modules from external vendors.  Look at the README files
1070 contained in the network directory (network/*/README) for more details.
1071 
1072 
1073 DOCUMENTATION
1074 
1075     We are rewriting the documentation to conform with all the new features
1076 contained in this software.  We will add it to the package as soon as possible.
1077 
1078 DEMO PROGRAMS
1079 
1080     As in the tools, there are a number of undocumented programs in the demo
1081 directory as well, that use a number of the utility functions in api.  There
1082 is also a demo program called "getseq" in the cdromlib directory which
1083 retrieves a sequence from the cdrom given any valid sequence id.  These will
1084 be described in more detail in the next set of documentation. Briefly:
1085 
1086 asn2ff.c        converts ASN.1 to GenBank flatfile
1087 asn2rpt.c       converts ASN.1 to human readable report
1088 dosimple.c      converts ASN.1 to a "simple sequence"
1089 getseq.c        gets sequence from Entrez Cdrom using data access library,
1090                 writes to disk
1091 getfeat.c       ditto, but writes sequence of any CdRegion features to
1092                  "test.out"
1093 getmesh.c       documented
1094 getpub.c        documented
1095 indexpub.c      documented
1096 seqtest.c       reads ASN.1 sequence, converts to iupac, reports segmented
1097                 sequences, outputs fasta format to seqtest.out
1098 testcore.c      documented
1099 testobj.c       tests Medline object loader, demonstrates error checking using
1100                 NULL asnio stream.
1101 entrez          If Vibrant is installed, the full Entrez program is made.
1102 asndhuff        Demonstrates streaming ASN.1 data from the huffman compressed
1103                 Entrez CDROM (only works on release 1.0 or later).
1104 entrcmd         Standalone non-interactive tool for accessing Entrez data.
1105                 Entrcmd is the search engine used for NCBI's Entrez WWW server.
1106 asncode         Tool for generating object loader source code given a .l
1107                 file which is the output of AsnTool.
1108 cdscan          scans Entrez CDROM, makes GenBank, GenPept, or FASTA format
1109                 output. Also has a slot for a replaceable CustomRoutine
1110                 supplied by you. Has two examples of such routines.
1111 
1112 CALLBACK CONVENTIONS
1113 
1114     The CoreLib, AsnLib, and Object Loader routines have been converted to use
1115 the LIBCALL and LIBCALLBACK symbols (FAR PASCAL) on the PC for Windows.  This will
1116 allow us to build dynamic link libraries (DLLs) so that the code can be accessed
1117 from languages other than C.  Callback functions you write that are of types
1118 AsnOptFreeFunc, AsnExpOptFunc, IoFuncType, AsnReadFunc, AsnWriteFunc, and
1119 SeqEntryFunc, should be declared using the LIBCALLBACK macro.  For example, a
1120 callback used as an AsnOptFreeFunc should be declared as follows:
1121 
1122 static Pointer LIBCALLBACK MyOptFreeFunc (Pointer);
1123 
1124 The SeqEntryFunc callback used by SeqEntryExplore has not yet been modified to
1125 use the LIBCALLBACK type.  This will be added in the near future.
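
A slightly fuller sketch of the declaration above, showing the macro on both
the declaration and the definition and a callback being passed by address.
So that it compiles without the toolkit headers, the sketch supplies its own
stand-ins: an empty LIBCALLBACK and a Pointer typedef.  These stand-ins, the
OptFreeFunc type, release_optional and discard_buffer are all assumptions
made for this illustration; in a real program the macro and the callback
types come from the toolkit headers.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-ins so the sketch compiles without the toolkit headers.  The
 * empty macro suits platforms that need no special calling convention. */
#ifndef LIBCALLBACK
#define LIBCALLBACK
#endif
typedef void *Pointer;

/* Hypothetical pointer type for callbacks shaped like MyOptFreeFunc. */
typedef Pointer (LIBCALLBACK *OptFreeFunc)(Pointer);

/* Declare the callback with LIBCALLBACK between return type and name. */
static Pointer LIBCALLBACK MyOptFreeFunc(Pointer);

/* The definition repeats the macro so that it matches the declaration.
 * Returning NULL after freeing is a convention assumed for this sketch. */
static Pointer LIBCALLBACK MyOptFreeFunc(Pointer data)
{
    free(data);
    return NULL;
}

/* Hypothetical library-side routine that invokes a registered callback. */
Pointer release_optional(OptFreeFunc func, Pointer data)
{
    assert(func != NULL);
    return func(data);
}

/* Caller side: hand the callback to the library by address. */
Pointer discard_buffer(Pointer data)
{
    return release_optional(MyOptFreeFunc, data);
}
```

Keeping the macro in the function pointer type as well means the compiler
can flag a callback that was declared without it on platforms where
LIBCALLBACK expands to a calling convention.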
1126 
