1 NCBI SOFTWARE DEVELOPMENT TOOLKIT
2 National Center for Biotechnology Information
3 Bldg 38A, NIH
4 8600 Rockville Pike
5 Bethesda, MD 20894
6
7 The NCBI Software Development Toolkit was developed for the production and
8 distribution of GenBank, Entrez, BLAST, and related services by NCBI. We make
9 it freely available to the public without restriction to facilitate the
10 use of NCBI by the scientific community. However, please understand that
11 while we feel we have done a high quality job, this is not commercial software.
12 The documentation lags considerably behind the software and we must make any
13 changes required by our data production needs. Nonetheless, many people have
14 found it a useful and stable basis for a number of tools and applications.
15
16 The toolkit is available by anonymous ftp from ftp.ncbi.nih.gov
17
18 cd toolbox
19 cd ncbi_tools
20 bin
21 get ncbi.tar.Z (compressed UNIX tar file)
22 quit
23
24 In this same directory are also ncbiz.exe (DOS self extracting archive) and
25 ncbi.hqx (Mac self extracting archive). All three files contain the same
26 source code and will make the toolkit for all platforms.
27
28
29 Please feel free to email questions/suggestions to:
30 toolbox@ncbi.nlm.nih.gov
31
32 If you would like hardcopy of the current documentation, send your mailing
33 address with your request to the email address above.
34
35 If you are considering a serious development project using this toolkit, please
36 contact us. We are happy to discuss compatible strategies and inform you of
37 our longer term plans. There is no limitation on the use of this code, or on
38 contacting us about its use, for commercial, academic, or government groups.
39
40 ===========================================================================
41
42 Version 6.1
43 the date of release may be obtained from the file ncbi/VERSION
44
45 ===========================================================================
46
47 Summary
48
49 The procedure for building the toolkit on UNIX has changed slightly.
50 There is no longer any need to download a binary NCBI product for your
51 platform to obtain the platform-specific ncbi.mk file.
52
53 To build the NCBI toolkit you need to look for platform-dependent instructions:
54 For UNIX (including Linux and Mac OS X):
55 look at the file make/readme.unx
56 For alternative Mac instructions (using CodeWarrior):
57 look at the file make/readme.mac
58 For Microsoft Windows95/98/NT:
59 look at the file make/readme.dos
60 There is some information which may be useful for building the NCBI toolkit
61 in the file doc/FAQ.txt
62
63 This release includes source code for the new (2.0.9) version of BLAST.
64 Look at the file doc/README.bls for more detailed documentation on
65 stand-alone BLAST.
66
67 The file doc/README.pbl has the information about PowerBLAST.
68
69 A description of Integrating Matrix Profiles And Local Alignments
70 (IMPALA) is in the file doc/README.imp
71
72 The file doc/sequin.htm describes Sequin and its configuration.
73
74 If you have problems configuring Entrez with a firewall, look at the
75 file doc/firewall.txt
76
77 This file has a section called CONFIGURATION OR SETTINGS FILES,
78 which explains in detail how our configuration system works. The ncbi
79 config file (.ncbirc on UNIX, ncbi.ini on PC/Windows, and ncbi.cnf on
80 Macintosh) is needed in order to find data files, such as
81 gc.val (the genetic code table), provided in the toolkit or with programs
82 like Sequin. (The asnload files containing dynamic versions of the ASN.1
83 parse tables are no longer needed, since all platforms can now have large
84 static data.)
85
86 It has recently become possible to eliminate the need for the ncbi config
87 file by calling UseLocalAsnloadDataAndErrMsg () at the beginning of your
88 program. This looks for the data directory in the same directory as the
89 running program. If it doesn't find it, it looks up one level, in case you
90 are compiling programs in the build directory of the toolkit. If it finds
91 the data directory in either of these places, it transiently sets the
92 location, so code that loads these files is given the correct path.
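
As an illustration only, the lookup order described above can be sketched
in C as follows. This is not the toolkit's implementation of
UseLocalAsnloadDataAndErrMsg (): the function names are hypothetical, the
directory name "data" and the "/" separator are assumptions, and the
existence check relies on the POSIX stat() call.

```c
#include <stdio.h>
#include <sys/stat.h>   /* stat(): POSIX, an assumption of this sketch */

/* Build the step-th candidate path: step 0 is "<progdir>/data",
 * step 1 is "<progdir>/../data". Returns 1 on success, 0 if the step
 * is out of range or the path does not fit in `out`. */
int data_dir_candidate(const char *progdir, int step,
                       char *out, size_t outsz)
{
    static const char *const suffix[] = { "data", "../data" };
    int n;

    if (step < 0 || step > 1)
        return 0;
    n = snprintf(out, outsz, "%s/%s", progdir, suffix[step]);
    return n >= 0 && (size_t)n < outsz;
}

/* Returns nonzero if `path` names an existing directory. */
int is_directory(const char *path)
{
    struct stat st;

    return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/* Try each candidate in order. On success the path is left in `out`
 * and 1 is returned; otherwise 0 is returned. */
int find_data_dir(const char *progdir, char *out, size_t outsz)
{
    int step;

    for (step = 0; step < 2; step++) {
        if (data_dir_candidate(progdir, step, out, outsz) &&
            is_directory(out))
            return 1;
    }
    return 0;
}
```

A caller would pass the directory holding the running program and, if
find_data_dir () returns 1, use the path in `out` when opening data files.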
93
94 An even more recent change is that copies of several of our data files (gc,
95 seqcode, and featdef) are now built into the source code, so if the data
96 directory is not found, programs that require only these can still run.
97
98 One final improvement is that access to our network services is now much
99 simpler than before, so if you are not behind a firewall and have domain
100 name server (DNS) available you can connect to our network without needing
101 any configuration information in the ncbi config file. Operation behind a
102 firewall, or with a proxy, requires very little in the ncbi config file, and
103 this is easily created by asking Sequin to configure for network access.
104
105 =============================================================================
106 Notes from Previous Releases
107 =============================================================================
108
109 =============================================================================
110 Version 6.0
111 the date of release may be obtained from the file ncbi/VERSION
112 =============================================================================
113
114 This release includes source code for the new (2.0) version of BLAST.
115 Also included are a small number of incremental changes in the ASN.1
116 specification.
117
118 BLAST 2.0 - BLAST 2.0 can produce gapped alignments and is capable of
119 position-specific-iterated BLASTp (PSI-BLAST). Compared to the 1.4 release of
120 BLAST, there are also significant performance enhancements as well as extensive
121 changes to the text report and the format of the databases. BLAST 2.0
122 uses threads for multi-processing, using the NCBI threads library.
123 Three BLAST programs may be compiled in the demo directory. They are:
124
125 formatdb: formats FASTA files as BLAST databases for BLAST 2.0.
126
127 blastall: perform all five flavors of blast comparison.
128 blastn and blastp offer fully gapped alignments.
129 blastx and tblastn have 'in-frame' gapped alignments and use sum
130 statistics to link alignments from different frames.
131 tblastx provides only ungapped alignments.
132
133 blastpgp: performs gapped blastp searches and can be used to perform
134 iterative searches in psi-blast mode.
135
136 Additional information may be obtained from the README in the BLAST
137 directory of the FTP site and from the NCBI BLAST pages.
138
139 ASN.1 Spec Changes for 1997
140
141 biblio.asn
142 Cit-pat - some fields made optional to allow patent applications to be legal
143 Cit-pat.number OPTIONAL
144 Cit-pat.date-issue OPTIONAL
145 -- Patent number and date-issue were made optional in 1997 to
146 -- support patent applications being issued from the USPTO
147 -- Semantically a Cit-pat must have either a patent number or
148 -- an application number (or both) to be valid
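
The semantic rule in the comment above (a Cit-pat needs a patent number, an
application number, or both) can be checked in a few lines of C. The sketch
below is hypothetical: ToyCitPat is a simplified stand-in that holds only
the two fields the rule mentions, not the toolkit's C structure, and the
field name app_number is an assumption.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a Cit-pat record.
 * NULL or "" means the optional field is absent. */
typedef struct {
    const char *number;      /* patent number (OPTIONAL) */
    const char *app_number;  /* application number (OPTIONAL), assumed name */
} ToyCitPat;

/* Nonzero if an optional string field is actually filled in. */
int toy_present(const char *s)
{
    return s != NULL && s[0] != '\0';
}

/* A record is semantically valid when at least one of the two
 * numbers is present. */
int toy_citpat_is_valid(const ToyCitPat *cp)
{
    return cp != NULL &&
           (toy_present(cp->number) || toy_present(cp->app_number));
}
```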
149
150 medline.asn
151 added ML-field to support other MEDLINE line types
152
153 Medline-entry ::= SEQUENCE {
154 uid INTEGER OPTIONAL , -- MEDLINE UID, sometimes not yet available if from PubMed
155 em Date , -- Entry Month
156 ... (not shown)
157 pmid PubMedId OPTIONAL , -- MEDLINE records may include the PubMedId
158 pub-type SET OF VisibleString OPTIONAL, -- may show publication types (review, etc)
159 mlfield SET OF Medline-field OPTIONAL } -- additional Medline field types
160
161 Medline-field ::= SEQUENCE {
162 type INTEGER { -- Keyed type
163 other (0) , -- look in line code
164 comment (1) , -- comment line
165 erratum (2) } , -- retracted, corrected, etc
166 str VisibleString , -- the text
167 ids SEQUENCE OF DocRef OPTIONAL } -- pointers relevant to this text
168
169 DocRef ::= SEQUENCE { -- reference to a document
170 type INTEGER {
171 medline (1) ,
172 pubmed (2) ,
173 ncbigi (3) } ,
174 uid INTEGER }
175
176
177 seq.asn
178 MolInfo.tech - added names for HTG classes already implemented
179 Annotdesc.region - added seqloc. If present, all annots in this SeqAnnot
180 are within this region. Optimization on big seqs.
181
182 seqfeat.asn
183 added OrgMod.specimen-voucher - new organism qualifier
184 added OrgMod.old-name - used internally at NCBI
185 added BioSource.is-focus - for distinguishing biological focus of
186 multiple source features.
187 added Seq-feat.pseudo so any feature can be flagged explicitly as
188 belonging to a pseudogene
189 added Seq-feat.except-text for an explanation of the exception when
190 Seq-feat.except is TRUE. Currently this text is in Seq-feat.comment
191 in backbone records and GBQuals in some other genbank records.
192
193
194
195 =============================================================================
196 Notes from Previous Releases
197 =============================================================================
198
199 Version 5.0
200
201 Summary
202
203 This release includes a small number of incremental changes in the ASN.1
204 specification. Most significant is the addition of the PubMedID, a
205 bibliographic citation identifier similar to a MEDLINE UID. PubMed is a new
206 citation database being developed at NCBI which is a superset of MEDLINE. It
207 will be an avenue by which publishers can deposit electronic versions of their
208 citations and abstracts to allow them timely linking to network entrez from
209 the publishers' on-line services. PubMed will route these citations to MEDLINE
210 and they will appear in MEDLINE (and Entrez) after the usual MEDLINE indexing.
211 However, for some period of time, such articles will have only a PubMedID.
212 We would like to switch Entrez over to supporting PubMedIDs as early as
213 possible. WE STRONGLY ENCOURAGE DEVELOPERS TO RECOMPILE AND RELINK WITH THIS
214 VERSION OF THE TOOLKIT AS SOON AS POSSIBLE. The changes in this specification
215 should not cause problems with existing software, so a simple compile and
216 link should be enough to make you compatible. Details of ASN.1 specification
217 changes are listed below.
218
219 There has been considerable development of the toolkit in other aspects as
220 well, many of which are embodied in sequin, the new NCBI direct submission
221 tool, which is included in the toolkit as well. In the interest of getting the
222 PubMed changes into the specification and developers' hands promptly, we have
223 not included much on that aspect of this toolkit at this time.
224
225
226 Changes in the 1996 NCBI ASN.1 (version 5.0) specification
227
228 Once again, there are very few changes to the NCBI ASN.1 specification this
229 year. The biggest change is the addition of the PubMed ID to support the new
230 NCBI PubMed database. There are also small additions to the medline and
231 organism specifications, detailed below. As usual, these changes are also
232 backward compatible with old data. However, you should recompile and relink
233 your applications as soon as possible, since the old applications will not be
234 compatible with the new datatypes.
235
236 1) PubMed - NCBI is building a new citation database that is a superset of
237 MEDLINE and which will be linked to online journals from publishers. The
238 bibliographic components of the specification have had support for PubMed IDs
239 added. These include biblio.asn (objbibli.[ch]), pub.asn (objpub.[ch]),
240 medline.asn (objmedli.[ch]).
241
242 2) pub-type - MEDLINE includes strings indicating the type of a publication.
243 The medline definition has had the attribute pub-type added to support these
244 strings.
245
246 From the 1996 MeSH, here's the list.
247
248 Abstract
249 Bibliography
250 Classical Article
251 Clinical Conference
252 Clinical Trial
253 Clinical Trial, Phase I
254 Clinical Trial, Phase II
255 Clinical Trial, Phase III
256 Clinical Trial, Phase IV
257 Comment
258 Consensus Development Conference
259 Consensus Development Conference, NIH
260 Controlled Clinical Trial
261 Corrected and Republished Article
262 Current Biog-Obit
263 Dictionary
264 Directory
265 Duplicate Publication
266 Editorial
267 Festschrift
268 Guideline
269 Historical Article
270 Historical Biography
271 Interview
272 Journal Article
273 Legal Brief
274 Letter
275 Meeting Report
276 Meta-Analysis
277 Monograph
278 Multicenter Study
279 News
280 Newspaper Article
281 Overall
282 Periodical Index
283 Practice Guideline
284 Published Erratum
285 Randomized Controlled Trial
286 Retracted Publication
287 Retraction of Publication
288 Review
289 Review Literature
290 Review of Reported Cases
291 Review, Academic
292 Review, Multicase
293 Review, Tutorial
294 Scientific Integrity Review
295 Technical Report
296 Twin Study
297
298 3) virion - the attribute virion has been added to BioSource.genome. It just
299 complements proviral which was already there. This will map to a /virion
300 qualifier in the new GenBank feature table definition.
301
302 4) division - OrgName.div now (optionally) can contain the GenBank division code
303 (eg. PRI).
304
305 5) signal-peptide, transit-peptide - were added to Prot-ref, to support
306 annotation of protein features on the protein sequence in a way that could be
307 mapped to a GenBank feature table.
308
309 That's all. Relevant sections of the asn.1 specification are shown below.
310
311 ================================================================================
312
313 biblio.asn
314
315
316 PubMedId ::= INTEGER -- Id from the PubMed database at NCBI
317
318 and..
319
320
321 Cit-gen ::= SEQUENCE { -- NOT from ANSI, this is a catchall
322 cit VisibleString OPTIONAL , -- anything, not parsable
323 authors Auth-list OPTIONAL ,
324 muid INTEGER OPTIONAL , -- medline uid
325 journal Title OPTIONAL ,
326 volume VisibleString OPTIONAL ,
327 issue VisibleString OPTIONAL ,
328 pages VisibleString OPTIONAL ,
329 date Date OPTIONAL ,
330 serial-number INTEGER OPTIONAL , -- for GenBank style references
331 title VisibleString OPTIONAL , -- eg. cit="unpublished",title="title"
332 pmid PubMedId OPTIONAL } -- PubMed Id
333
334 pub.asn
335
336
337 Pub ::= CHOICE {
338 gen Cit-gen , -- general or generic unparsed
339 sub Cit-sub , -- submission
340 medline Medline-entry ,
341 muid INTEGER , -- medline uid
342 article Cit-art ,
343 journal Cit-jour ,
344 book Cit-book ,
345 proc Cit-proc , -- proceedings of a meeting
346 patent Cit-pat ,
347 pat-id Id-pat , -- identify a patent
348 man Cit-let , -- manuscript, thesis, or letter
349 equiv Pub-equiv, -- to cite a variety of ways
350 pmid PubMedId } -- PubMedId
351
352 medline.asn
353
354 -- a MEDLINE or PubMed entry
355 Medline-entry ::= SEQUENCE {
356 uid INTEGER OPTIONAL , -- MEDLINE UID, sometimes not yet available if
357 -- from PubMed
358 em Date , -- Entry Month
359 cit Cit-art , -- article citation
360 abstract VisibleString OPTIONAL ,
361 mesh SET OF Medline-mesh OPTIONAL ,
362 substance SET OF Medline-rn OPTIONAL ,
363 xref SET OF Medline-si OPTIONAL ,
364 idnum SET OF VisibleString OPTIONAL , -- ID Number (grants, contracts)
365 gene SET OF VisibleString OPTIONAL ,
366 pmid PubMedId OPTIONAL , -- MEDLINE records may include
367 -- the PubMedId
368 pub-type SET OF VisibleString OPTIONAL } -- may show publication types
369 -- (review, etc)
370
371 seqfeat.asn
372
373
374 OrgName ::= SEQUENCE {
375 name CHOICE {
376 binomial BinomialOrgName , -- genus/species type name
377 virus VisibleString , -- virus names are different
378 hybrid MultiOrgName , -- hybrid between organisms
379 namedhybrid BinomialOrgName , -- some hybrids have genus x species
380 -- name
381 partial PartialOrgName } OPTIONAL , -- when genus not known
382 attrib VisibleString OPTIONAL , -- attribution of name
383 mod SEQUENCE OF OrgMod OPTIONAL ,
384 lineage VisibleString OPTIONAL , -- lineage with semicolon separators
385 gcode INTEGER OPTIONAL , -- genetic code (see CdRegion)
386 mgcode INTEGER OPTIONAL , -- mitochondrial genetic code
387 div VisibleString OPTIONAL } -- GenBank division code
388
389 BioSource ::= SEQUENCE {
390 genome INTEGER { -- biological context
391 unknown (0) ,
392 genomic (1) ,
393 chloroplast (2) ,
394 chromoplast (3) ,
395 kinetoplast (4) ,
396 mitochondrion (5) ,
397 plastid (6) ,
398 macronuclear (7) ,
399 extrachrom (8) ,
400 plasmid (9) ,
401 transposon (10) ,
402 insertion-seq (11) ,
403 cyanelle (12) ,
404 proviral (13) ,
405 virion (14) } DEFAULT unknown ,
406 origin INTEGER {
407 unknown (0) ,
408 natural (1) , -- normal biological entity
409 natmut (2) , -- naturally occurring mutant
410 mut (3) , -- artificially mutagenized
411 artificial (4) , -- artificially engineered
412 synthetic (5) , -- purely synthetic
413 other (255) } DEFAULT unknown ,
414 org Org-ref ,
415 subtype SEQUENCE OF SubSource OPTIONAL }
416
417 Prot-ref ::= SEQUENCE {
418 name SET OF VisibleString OPTIONAL , -- protein name
419 desc VisibleString OPTIONAL , -- description (instead of name)
420 ec SET OF VisibleString OPTIONAL , -- E.C. number(s)
421 activity SET OF VisibleString OPTIONAL , -- activities
422 db SET OF Dbtag OPTIONAL , -- ids in other dbases
423 processed ENUMERATED { -- processing status
424 not-set (0) ,
425 preprotein (1) ,
426 mature (2) ,
427 signal-peptide (3) ,
428 transit-peptide (4) } DEFAULT not-set }
429
430
431 =============================================================================
432 Notes from Previous Releases
433 =============================================================================
434
435 New Functions in Version 4.0
436
437 There are a host of new functions in this release, but as usual we have not
438 managed to make time to document them all. Large parts of Sequin are present
439 which will be announced and described more fully in the fall. However,
440 specific tools of immediate interest are:
441
442 blast2 - this is the long awaited BLAST client/server which permits structured
443 interaction with BLAST over the internet. We have provided a basic client
444 that produces the traditional blast output. In addition, the function call
445 interface can be used in more elaborate clients. For more information
446 contact Tom Madden, madden@ncbi.nlm.nih.gov
447
448 WARNING!!! blast2 is the client we plan to support on the longer term.
449 The blast1 client we included for those of you who wanted a head start
450 will NOT be supported in future. Please shift any blast1 clients to the
451 (very similar) blast2 interface as soon as possible.
452
453 sim, sim2 - protein and DNA sequence alignments in linear space. This is
454 the function call interface to these valuable tools. Applications have
455 been written which are available by ftp as are published papers. For more
456 information contact Jinghui Zhang, zjing@ncbi.nlm.nih.gov
457
458
459
460
461 Changes in ASN.1 spec 4.0 from 3.0
462
463
464 Affil - biblio.asn
465 added the field "postal-code" for Zip code finally.
466
467 Contact-info - submit.asn
468 added the field "contact" which is type "Author". The contact info has
469 evolved into a fully structured form, so I just took Author which has
470 structured names and structured address (Affil). We will eventually
471 phase out all the less structured ones in Contact-info.
472
473 OrgName - seqfeat.asn
474 added "lineage", "gcode", "mgcode" for the lineage, genetic code, and
475 mitochondrial genetic code. This is part of Org-ref, and consolidates
476 all the organism info (except original SOURCE line) out of the
477 GenBank block... and enables us to deliver it nicely from Taxon.
478
479 Seq-descr - seq.asn
480 removed the Seq-descr "neighbors" and replaced it with "dbxref", since
481 neighbors has never been used. This is used to add cross-references to
482 the whole entry.
483
484 Pubdesc - seq.asn
485 has an added slot, "reftype" which is an integer and is used to
486 indicate the GenBank usage of a reference.
487
488 0 - seq - applies to the sequence. This is the default and the way it is
489 used now.
490 1 - sites - applies to (unspecified) features. Equivalent to a GenBank
491 SITES feature. We could switch to this from using the
492 Imp-feat we do now.
493 2 - feats - applies to specific features. The idea here is to provide a
494 place for the full citation, so features need only reference
495 it. If no features reference it, it should be removed. This
496 would work for checking content when only a part of a sequence
497 is copied or pasted. A "sites" ref could not have this check
498 since we do not know which features it goes to.
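
The three reftype values and the removal check described for "feats" can be
written down in C as below. This is a sketch only: the enum and function
names are hypothetical and are not the identifiers used in the toolkit
headers.

```c
/* Hypothetical names for the three Pubdesc reftype values listed above. */
typedef enum {
    TOY_REFTYPE_SEQ   = 0,  /* applies to the sequence (default) */
    TOY_REFTYPE_SITES = 1,  /* applies to unspecified features */
    TOY_REFTYPE_FEATS = 2   /* applies to specific features */
} ToyRefType;

/* Short name for a reftype value, or "unknown" for anything else. */
const char *toy_reftype_name(int reftype)
{
    switch (reftype) {
    case TOY_REFTYPE_SEQ:   return "seq";
    case TOY_REFTYPE_SITES: return "sites";
    case TOY_REFTYPE_FEATS: return "feats";
    default:                return "unknown";
    }
}

/* A "feats" reference that no feature cites can be removed. The check
 * is not possible for "sites" (the features it applies to are not
 * known) and does not apply to "seq". */
int toy_ref_is_orphan(int reftype, int citing_features)
{
    return reftype == TOY_REFTYPE_FEATS && citing_features == 0;
}
```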
499
500 Seq-feat - seqfeat.asn
501 added a slot called "dbxref" to Seq-feat. This is a SET OF Dbtag. It will
502 be for adding the new db_xref qualifiers to features. We already have some
503 of these in the xref slots of Gene-ref, Prot-ref, Org-ref. It means we have
504 to check two places in these cases. I do not want to retire the slots
505 since these were meant to be used in other contexts besides features.. and
506 Org-ref already is.
507
508
509 added a slot called "anticodon" to the tRNA extension of the RNA feature.
510 This is a Seq-loc that points to the location of the anticodon in a tRNA.
511 We have been populating this data in a User-object, and will have to do
512 a retro to convert it.
513
514 EXPORTED Genetic-code
515
516
517 Seq-align - seqalign.asn
518
519 added "bounds" to Seq-align so you can record the regions over which
520 an alignment was computed.. not always included in the resulting alignment
521 itself.
522
523 added two new types:
524 A) Packed-seg -- a denser representation from Colombe and Jinghui
525 B) disc - discontinuous alignments as a SEQUENCE OF Seq-align
526
527
528 Seq-annot - seq.asn
529
530 added a field to Seq-annot, Align-def, to discriminate types of
531 alignment sets. This has the advantage of minimal changes as well as
532 separating sets of alignments from conceptually single alignments. I am
533 not sure it is necessary to distinguish "alt" from "blocks" though. Also
534 it means you can attach more info, with other Seq-annot fields and/or by
535 expanding the Align-def. I put in "ids" in Align-def specifically to put
536 the one Seq-id that is the "master" for type "ref". I made it a SET OF
537 so we could use it for other collections where we might want to list
538 more than one.
539
540 added "ids" and "locs" as allowed types within Seq-annot. This would
541 enable us to pass lists like this around between tools with all the
542 additional descriptive information in Annotdesc. I know this will be
543 useful.
544
545 added "general" to Annot-id for tracking 3rd party annotations.
546
547
548
549
550
551
552
553 Introduction
554
555 This distribution is release 5.0 of the NCBI core library for building
556 portable software, and AsnLib, a collection of routines for handling ASN.1
557 data and developing ASN.1 software applications. AsnLib and the asntool
558 application are built using the CoreLib routines. In the \doc directory is an
559 MS Word file which details the information given below. It is also available
560 as hardcopy. See the README in \doc.
561
562 The lowest layer of code is the CoreLib. These are multi-
563 platform functions for memory allocation (including byte stores), string
564 manipulation, file input and output, error and general messages, and
565 time and date notification. These functions have been written only
566 where we found that the existing ANSI functions were not sufficiently
567 multi-platform or well-behaved among all of the platforms that we
568 support. For each platform (a combination of processor, operating
569 system, compiler, and windowing system), we supply a specific ncbilcl.h
570 file, which contains typedefs and defines for multi-platform symbols,
571 and includes a number of standard header files. (For example,
572 ncbilcl.msw is used for the Microsoft C compiler under Microsoft Windows
573 on the PC.) Use of these symbols, and of the functions in the CoreLib,
574 allow us to write multi-platform source code for a variety of disparate
575 platforms.
576
577 The next layer of code is the AsnLib stream reader. This is
578 used in conjunction with a header file and a parse table loader file,
579 both of which are produced by processing the formal ASN.1 specification
580 with the AsnTool application. The symbolic defines in the
581 header file are pointers into the parse table, in which the ASN.1
582 specification is represented. To read at the stream reader level, a
583 program alternates between calls to AsnReadId and AsnReadVal. AsnReadId
584 returns a pointer into the parse table, which can be compared against
585 the defines in the AsnTool-generated header. For example, in the
586 specification for MEDLINE records, the Medline-entry section has an item
587 called "uid", for the unique ID of the record. This is symbolized in
588 the header file as MEDLINE_ENTRY_uid. When AsnReadId returns this
589 symbol, the program calls AsnReadVal to obtain the uid for that record.
590 AsnKillValue is also needed to free any memory allocated by AsnReadVal,
591 which occurs when the value is a string and not an integer. The entire
592 set of records on the Entrez CD-ROM can be read as a single stream with
593 the AsnLib functions.
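
The read-id / read-value alternation described above can be modelled in a
few lines of self-contained C. The sketch below is a toy, not AsnLib: every
type and function in it (ToyStream, toy_read_id, toy_read_val,
toy_kill_value) is a hypothetical stand-in, and the "stream" is just an
in-memory array. It only shows the shape of the loop: identify the next
item, decide whether it is wanted, read or skip its value, then release any
memory the read allocated.

```c
#include <stdlib.h>
#include <string.h>

/* Toy item identifiers, standing in for pointers into a parse table. */
typedef enum { TOY_END = 0, TOY_ENTRY_uid, TOY_ENTRY_abstract } ToyId;

typedef struct {
    ToyId id;
    int is_string;          /* nonzero: value is in strvalue */
    long intvalue;
    const char *strvalue;
} ToyItem;

typedef struct {
    const ToyItem *items;
    size_t count;
    size_t pos;             /* index of the next unread item */
} ToyStream;

typedef struct {
    long intvalue;
    char *ptrvalue;         /* heap copy, set only for string items */
} ToyVal;

/* Identify the next item without consuming its value. */
ToyId toy_read_id(const ToyStream *s)
{
    return s->pos < s->count ? s->items[s->pos].id : TOY_END;
}

/* Consume the value of the item just identified. A NULL vp skips the
 * value (a convention of this toy). Returns 1 on success, 0 on failure. */
int toy_read_val(ToyStream *s, ToyVal *vp)
{
    const ToyItem *it;

    if (s->pos >= s->count)
        return 0;
    it = &s->items[s->pos++];
    if (vp == NULL)
        return 1;
    vp->intvalue = it->intvalue;
    vp->ptrvalue = NULL;
    if (it->is_string) {
        size_t n = strlen(it->strvalue) + 1;
        vp->ptrvalue = malloc(n);
        if (vp->ptrvalue == NULL)
            return 0;
        memcpy(vp->ptrvalue, it->strvalue, n);
    }
    return 1;
}

/* Free whatever toy_read_val allocated (a no-op for integers). */
void toy_kill_value(ToyVal *vp)
{
    free(vp->ptrvalue);
    vp->ptrvalue = NULL;
}

/* The alternating loop: store up to `max` uid values, skip the rest. */
size_t toy_collect_uids(ToyStream *s, long *uids, size_t max)
{
    size_t found = 0;
    ToyId id;

    while ((id = toy_read_id(s)) != TOY_END) {
        if (id == TOY_ENTRY_uid && found < max) {
            ToyVal v;
            if (!toy_read_val(s, &v))
                break;
            uids[found++] = v.intvalue;
            toy_kill_value(&v);
        } else if (!toy_read_val(s, NULL)) {
            break;
        }
    }
    return found;
}
```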
594
595 The ASN.1 records may be accessed at a higher level through the object
596 loaders, which utilize the stream processing functions to
597 load C memory structures with the contents of the ASN.1 objects. For
598 each ASN.1 object we specify, we also define an equivalent C memory
599 structure. The object loader level of code contains functions to read
600 and write each ASN.1 object. These are hierarchical, as are the ASN.1
601 specifications. Calling the top level loader, SeqEntryAsnRead, will
602 load an entire SeqEntry from an open AsnIo channel, and will return a
603 pointer to the loaded memory structure. The read function for an AsnIo
604 channel can be swapped to refer to a normal disk file, a network socket,
605 or to compressed data, which it automatically decompresses. The object
606 loader code can interconvert between the highly-branched memory object
607 and a linear ASN.1 message with complete fidelity. The object loaders
608 have additional functions, including the ability to explore the
609 structure and notify the program when particular data elements are
610 encountered. The entire contents of the Entrez CD-ROM can also be
611 streamed through the object loaders. However, most calls to the object
612 loaders for simply reading a particular record are done via the data
613 access functions (see below).
614
615 The data access functions allow a program to call the object loaders on
616 a sequence or MEDLINE record given the uid of the record.
617 This will get the data into memory regardless of whether the data are
618 compressed on the Entrez CD-ROM or are obtained through a service over
619 the Internet. This means that a detailed understanding of the files and
620 formats on the Entrez disc is not needed by application programmers. The
621 function to load a sequence record, SeqEntryGet, needs the uid to
622 retrieve and a complexity code parameter. A sequence record is in the
623 form of a NucProt set. This contains a nucleotide (which may itself be
624 composed of segments) and all of the proteins it is known to encode.
625 The set of segments is called a SegSet, and the individual sequences are
626 called BioSeqs. We have taken the liberty of producing this integrated
627 view, but the complexity code parameter allows the record to be easily
628 loaded in a simpler, more traditional form, if desired. The accession
629 number term list is built to supply the proper uids to support this
630 facility. This access library is compatible with Entrez release 1.0 or
631 later only.
632
633 The sequence utilities and application programmer interface layer
634 allows exploration of the loaded memory structures and
635 generation of standard literature or sequence reports from those
636 objects. For example, a BioSeq can be converted to FASTA or GenBank
637 flat file formats and saved to a file, and a MEDLINE record can be saved
638 in MEDLARS format, which is suitable for entry into personal
639 bibliographic database programs. A sequence port can be opened that
640 gives a simple, linear view of a segmented sequence, converting
641 alphabets, merging exon segments, and dealing with information on both
642 strands of the DNA. This layer also includes some functions to explore
643 the NucProt set. The explore functions visit each individual BioSeq in
644 the set, calling a callback function for each sequence node so that a
645 program can examine feature tables and other information that are
646 associated with the NucProt or SegSets or with the individual sequences.
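
The callback style used by the explore functions can be illustrated with a
small, self-contained C sketch. It does not use the toolkit's data
structures or its explore API: ToyEntry, toy_explore and toy_count_seq are
hypothetical names, and the tree is a simplified model in which a node is
either a set (with members) or a single sequence.

```c
#include <stddef.h>

/* Simplified model of a nested entry: a set holds members, a sequence
 * holds none. */
typedef struct ToyEntry {
    const char *label;                /* name of the set or sequence */
    int is_set;                       /* nonzero for a set */
    const struct ToyEntry *children;  /* members, when is_set */
    size_t nchildren;
} ToyEntry;

/* Callback invoked once per sequence node. */
typedef void (*ToyVisitFn)(const ToyEntry *seq, void *userdata);

/* Walk the tree depth-first and call `visit` for every sequence,
 * passing `userdata` through untouched. */
void toy_explore(const ToyEntry *e, void *userdata, ToyVisitFn visit)
{
    size_t i;

    if (e == NULL)
        return;
    if (!e->is_set) {
        visit(e, userdata);
        return;
    }
    for (i = 0; i < e->nchildren; i++)
        toy_explore(&e->children[i], userdata, visit);
}

/* Example callback: count the sequences seen. `userdata` must point
 * to a size_t counter. */
void toy_count_seq(const ToyEntry *seq, void *userdata)
{
    (void)seq;
    ++*(size_t *)userdata;
}
```

A program keeps its own state (here a counter) behind the userdata pointer,
so the walking code does not need to know what the callback does with each
sequence.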
647
648 Vibrant is a multi-platform user interface development library that runs
649 on the Macintosh, Microsoft Windows on the PC, or X11 and OSF/Motif on
650 UNIX and VAX computers [separate documentation]. It is used to build
651 the graphical interface for the Entrez application (whose source code is
652 in the browser directory). The philosophy behind Vibrant is that
653 everything in the published user interface guidelines (the generic
654 behavior of windows, menus, buttons, etc.), as well as positioning and
655 sizing of graphical control objects, is taken care of automatically.
656 The program provides callback functions that are notified when the user
657 has manipulated an object. Vibrant and Entrez code are not supported,
658 but are provided on an as-is basis.
659
660 The advantage of using AsnLib and the object loaders, as they are
661 implemented, is that application program developers merely need to
662 recompile their programs with the new (AsnTool-generated) header files
663 and load the new parse tables (included with the Entrez software) in
664 order to be able to read the new data. This process is straightforward,
665 and will not break existing program code. The application is free to
666 ignore new fields if it does not choose to take advantage of the new
667 kinds of information.
668
669 When developing new ASN.1 specifications, as of June 1994 it is possible to
670 automatically generate the object loaders and header files for those
671 specifications, using the AsnCode utility. For some complex ASN.1
672 specifications, however, AsnCode may fail to generate the correct source code.
673
674 The documentation is currently being brought up to date. The programs
675 in the demo directory are designed to teach the proper use of many of
676 the functions discussed above. Many of these programs are not yet
677 documented. The simplest is testcore.c, which tests various functions
678 in the CoreLib. The most complex is getfeat.c, which takes an accession
679 number or locus name, determines the unique seq ID, retrieves the entry
680 from the Entrez CD-ROM using the data access library, locates all coding
681 region features using the explore functions, and prints the DNA
682 sequences of all exons using sequence port functions. If you cannot
683 extract and print the doc.tar.Z file, please send an email message with
684 your land mailing address and phone number to toolbox@ncbi.nlm.nih.gov,
685 and we will mail a copy to you.
686
687 The contents of the ncbi directory (the highest level, containing the
688 NCBI Software Development Kit source code in several subdirectories) are
689 shown below. The readme file contains instructions on copying the
690 appropriate make files to be built in the build directory. The makeall
691 file copies headers to the include directory and builds four libraries
692 (ncbi, ncbiobj, ncbicdr and vibrant), copying them to the lib directory.
693 The makedemo file builds the demo programs and the Entrez application:
694
695 api Application Programmer Interface, Sequence Utilities
696 asn ASN.1 specifications for publications and sequences
697 asnlib Source code for AsnLib and asntool
698 asnload AsnLib headers and dynamic parse tables (Mac and PC)
699 asnstat AsnLib headers that use static memory (UNIX and VMS)
700 bin Asntool executable copied here
701 biostruc Source code for Molecular Modelling DataBase functions
702 browser Source code for Entrez application
703 build Empty directory for building tools and libraries
704 cdromlib Access routines for data on the Entrez CD-ROM
705 cn3d Source code for Vibrant-based 3D structure viewer
706 config Configuration files for NCBI software:
707 mac
708 unix
709 vms
710 win
711 corelib Source code for NCBI Core Software Library
712 data Data files used for sequence conversion
713 demo AsnLib and sequence utility demonstration programs
714 desktop Source code for Vibrant-based viewers and editors
715 doc Documentation in Microsoft Word file
716 include Include files required by applications are copied here
717 lib Libraries copied here
718 link Contains several subdirectories with build accessory files:
719 macmet Macintosh Metrowerks/CodeWarrior
720 macmpw Macintosh MPW C
721 mswin Microsoft C and Borland C for Windows
722 make Make files for various systems
723 network Network version of data access
724 apple
725 blast2
726 encrypt
727 entrez
728 netmanag
729 nsclilib
730 object Functions for reading and writing complex objects
731 sequin Source code for Sequin application
732 tools Source code for alignment and other contributed utilities
733 readme File that contains important building instructions
734 vibrant Source code for Vibrant portable interface package
735
736 The platforms that are supported (as indicated by the suffix on the
737 relevant ncbilcl.h file) are shown below. Those marked with an asterisk
738 (*) are available as-is:
739
740 370* IBM 370
741 acc SUN acc compiler
742 alf DEC Alpha under OSF/1
743 aov DEC Alpha under AXP/OpenVMS
744 aux* Macintosh A/UX
745 bor Borland for DOS
746 bwn Borland for Microsoft Windows
747 ccr CenterLine CodeCenter
748 cpp SUN C++
749 cra* Cray
750 cvx* Convex
751 gcc Gnu gcc (under SunOS, not Solaris)
752 hp * Hewlett Packard
753 lna* Linux on DEC Alpha
754 lnx Linux (RedHat Linux release 5.2 with kernel 2.0.36)
755 met Macintosh Metrowerks compiler
756 mpw Macintosh Programmer's Workshop
757 msc Microsoft C for DOS
758 msw Microsoft for Windows
759 nxt* NeXT
760 r6k* IBM RS 6000
761 scr CodeCenter under Sun Solaris
762 sgi Silicon Graphics
763 sin Sun Solaris on Intel processors
764 sol Sun Solaris (for cc and gcc)
765 thc THINK C on Macintosh
766 ult DEC ULTRIX
767 vms DEC VAX/VMS
768
769 Questions or comments can be directed to toolbox@ncbi.nlm.nih.gov.
770
771 ANSI C:
772
773 This software requires an ANSI C compiler. This will be no problem at
774 all except to people on Sun machines, where the bundled C compiler, cc, is
775 non-ANSI. However, you can use the Sun unbundled compiler, acc, or the Gnu
776 compiler, gcc (which is free) and that works just fine. If you have written
777 applications on the Sun with non-ANSI functions, the ANSI compilers will
778 complain. See the notes below if this is a problem.
779
780
781 Installation
782
783 To build the NCBI toolkit you need to look for platform-dependent instructions:
784 For UNIX:
785 look at the file make/readme.unx
786 For Mac:
787 look at the file make/readme.mac
788 For Microsoft Windows95/98/NT:
789 look at the file make/readme.dos
790
791 There is some information which may be useful for NCBI toolkit building
792 in the file doc/FAQ.txt
793
794 ALL -
795 change to the directory above ncbi subdirectory
796
797 Unix
798 tested on Sun Sparc (Solaris 2.6, Sunos 4.1.3),
799 Silicon Graphics IRIX 5.* and 6.*, DEC Alpha with OSF/1 V5.1,
800 Linux (Red Hat Linux release 6.2 with kernel 2.2.16) on Intel,
801 Sun Solaris for Intel (Solaris 2.7).
802
803 Run the script ncbi/make/makedis.csh, keeping its output in a
804 separate file:
805 for sh or bash:
806 ncbi/make/makedis.csh 2>&1 | tee out.makedis.csh
807 for csh or tcsh:
808 ncbi/make/makedis.csh |& tee out.makedis.csh
809 If that script gives you an error like this:
810 Your platform is not supported.
811 To port ncbi toolkit to your platform consult
812 the files platform/*.ncbi.mk
813 then you should check the script ncbi/make/makedis.csh and
814 add proper platform-dependent ncbi.mk file in ncbi/platform
815 directory.
816
817 Other UNIX: AIX, ULTRIX, NeXT, Sun acc
818 Follow the models above. Read the headers in makeall.unx and makedemo.unx
819 for details.
820
821 for all UNIX, edit .ncbirc as described in section "CONFIGURATION OR
822 SETTINGS FILES".
823 optionally, edit .login to "setenv NCBI [path to .ncbirc file]"
824
825 MS-DOS
826 look at the file make/readme.dos
827
828 Mac
829 tested on CodeWarrior IDE 2.1, MacOS 8.0
830 All - copy config:mac:ncbi.cnf to your System Folder, or to the
831 System Folder:Preferences subfolder
832 edit the "ASNLOAD" line in "ncbi.cnf" to point to the
833 ncbi:asnload directory in this release
834 edit the "DATA" line to point to the ncbi/data directory
835 CodeWarrior - raise Preferred Size of Script Editor from 700 to 3000,
836 and raise Preferred Size of CodeWarrior IDE 2.1 by
837 2000 (e.g., from 8206 to 10206), using Get Info from
838 the Finder.
839 to compile for MC680x0 platform (default is PowerPC),
840 change property MASTER from "PPC" to "68K".
841 run copyhdrs.met
842 run makeall.met
843 run makenet.met
844 run makedemo.met
845 Think C - no longer supported
846 MPW C - no longer supported
847
848 Changes to VMS make file naming conventions:
849
850 The old .dcl suffix (last character is a lower case L) was changed
851 to .dc1 (last character is the numeral 1) to allow for different make files
852 for DecWindows 1.1 and DecWindows 1.2. Several new .dc2 files were
853 contributed by David Mathog of CalTech. A synopsis of his additional
854 instructions:
855
856 VAX C DecWindows 1.1 Use .dc1 files.
857 DEC C DecWindows 1.1 Use .dc1 files,
858 but change cc to cc/standard=vaxc
859 VAX C DecWindows 1.2 This combination has not been tested.
860 DEC C DecWindows 1.2 Use .dc2 files.
861
862 VMS (without Vibrant) on VAX
863 $set def [ncbi.build]
864 $copy [-.make]*.dc1 *.com
865 $@makeall
866
867 check ncbi.cfg as described in section "CONFIGURATION OR SETTINGS FILES".
868 edit LOGIN.COM to "define NCBI [path to ncbi.cfg file]"
869
870 To make demos:
871 $@makedemo
872
873 VMS (with Vibrant) on VAX
874 $set def [ncbi.build]
875 $copy [-.make]*.dc1 *.com
876 $@viball
877
878 check ncbi.cfg as described in section "CONFIGURATION OR SETTINGS FILES".
879 edit LOGIN.COM to "define NCBI [path to ncbi.cfg file]"
880
881 To make demos:
882 $@vibdemo
883
884 Testing
885
886 VMS only: look in rundemo.dc1 in [make] to see how to give command
887 line arguments. Not all demo programs are shown. Run at least testcore.
888
889 All else:
890
891 In build should be a program called testcore. Type "testcore -" and
892 it should show you some default arguments. Type "testcore" and it will
893 run through a variety of functions in CoreLib, prompting you for responses
894 along the way. It should run without a crash or error report. If you made
895 Vibrant versions all demos will have startup dialog boxes. If not, they
896 take command line arguments.
897
898 If testcore runs, read the documentation for CoreLib and for AsnLib.
899 In the AsnLib documentation are instructions for running asntool itself
900 and for running a few of the demo programs. There are a large number of demo
901 programs now (including Entrez itself, if you made the Vibrant versions).
902
903
904
905 CONFIGURATION OR SETTINGS FILES:
906
907 One of the fundamental problems in writing portable software concerns
908 configuration issues. Each individual user's computer will have its own
909 particular hardware and software environment, and each machine will have
910 its disk file hierarchy set up in a unique manner. A program that needs
911 accessory information, such as help files, parse tables, or format
912 converters, must be given a means of finding the data regardless of where
913 the user has placed the files. The difficulty is compounded by the different
914 conventions for naming files and specifying paths on each class of machine.
915 For example, the name of a CD-ROM on the Macintosh is fixed, determined by
916 information on the CD itself, whereas on the PC it is addressed by a drive
917 letter, which can be assigned by the user, but which cannot be reconciled
918 with the name the Macintosh sees.
919
920 An associated problem is that many programs will want to allow the user
921 to make persistent changes to parameters. These parameters typically involve
922 numbers or font specifications, but may also include paths to data files.
923 Some platforms supply such configuration information in preferences files,
924 others in environment variables. Manipulating these settings is platform
925 dependent, as is the format in which the preference is specified.
926
927 The NCBI Software Toolkit core library addresses these problems by
928 providing configuration or settings files. These are modeled after the .INI
929 files used by Microsoft Windows. Settings files are plain ASCII text files
930 that may be edited by the user or modified by the program. They are divided
931 into sections, each of which is headed by the section name enclosed in square
932 brackets. Below each section heading is a series of key=value strings, somewhat
933 analogous to the environment variables that are used on many platforms.
934
935 The ncbi configuration file supplies general purpose configuration
936 information on paths for commonly used data files. The typical file set up for
937 the Entrez application running on the PC under Microsoft Windows is shown below:
938
939 [NCBI]
940 ROOT=D:
941 ASNLOAD=C:\ENTREZ\ASNLOAD\
942 DATA=C:\ENTREZ\DATA
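
The section/key layout shown above can be read with very little code. The
following is a minimal sketch in C, not the CoreLib implementation: the
function name find_setting is hypothetical, the settings text is assumed to
be already loaded into memory, and matching is assumed to be case sensitive
with no whitespace around the '=' sign.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not a toolkit function: look up `key` in `section`
   of settings text held in memory. Copies the value into `out` (at most
   outlen-1 characters) and returns 1 if found, 0 otherwise. */
int find_setting(const char *text, const char *section,
                 const char *key, char *out, size_t outlen)
{
    char line[256];
    int in_section = 0;
    size_t seclen = strlen(section);
    size_t keylen = strlen(key);

    assert(text != NULL && out != NULL && outlen > 0);

    while (*text != '\0') {
        /* Copy one line (without its terminator) into `line`. */
        size_t n = strcspn(text, "\r\n");
        size_t copy = (n < sizeof line - 1) ? n : sizeof line - 1;
        memcpy(line, text, copy);
        line[copy] = '\0';
        text += n;
        text += strspn(text, "\r\n");

        if (line[0] == '[') {
            /* Section heading: the name enclosed in square brackets. */
            in_section = (strncmp(line + 1, section, seclen) == 0 &&
                          line[1 + seclen] == ']');
        } else if (in_section && strncmp(line, key, keylen) == 0 &&
                   line[keylen] == '=') {
            /* key=value line inside the wanted section. */
            const char *value = line + keylen + 1;
            size_t vlen = strlen(value);
            if (vlen >= outlen)
                vlen = outlen - 1;
            memcpy(out, value, vlen);
            out[vlen] = '\0';
            return 1;
        }
    }
    return 0;
}
```

Calling find_setting with the sample text above, section "NCBI" and key
"ROOT" would copy "D:" into the output buffer.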
943
944 The only section is entitled NCBI. The ROOT entry refers to the path to
945 the Entrez CD-ROM. In this example, the user has configured the machine to
946 use drive letter D. (On the Macintosh, the name of the disc is SEQDATA, which
947 cannot be changed by the user.) The ASNLOAD specifies the path to the ASN.1
948 parse tables. These files are required by the AsnLib functions, and all
949 higher-level procedures that call them, including the Object Loader, Sequence
950 Utility, and Data Access functions. Files pointed to by the DATA entry contain
951 information necessary to convert biomolecule sequence data into different
952 alphabets (e.g., unpacking the 2-bit nucleotide code stored on the Entrez CD
953 into standard IUPAC letters).
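
To illustrate the kind of conversion the DATA files support, here is a
minimal sketch in C of unpacking a 2-bit nucleotide code into letters. It
does not use the toolkit's conversion tables. The function name unpack_2bit
is hypothetical, and the sketch assumes that codes 0 to 3 stand for A, C, G
and T and that the first residue occupies the two highest bits of each byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: expand `nres` residues packed four per byte into
   a NUL-terminated string of letters. Assumes codes 0..3 mean A, C, G, T
   and that the first residue sits in the two highest bits of each byte.
   `out` must have room for nres + 1 characters. */
void unpack_2bit(const unsigned char *packed, size_t nres, char *out)
{
    static const char letters[4] = { 'A', 'C', 'G', 'T' };
    size_t i;

    assert(packed != NULL && out != NULL);

    for (i = 0; i < nres; i++) {
        unsigned char byte = packed[i / 4];
        unsigned shift = (unsigned)(6 - 2 * (i % 4));
        out[i] = letters[(byte >> shift) & 0x3u];
    }
    out[nres] = '\0';
}
```

Under these assumptions the single byte 0x1B (binary 00 01 10 11) unpacks
to the four letters ACGT.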
954
955 Although the contents of a configuration file are similar regardless of
956 platform, the name of the file and its location are platform dependent. If the
957 base name of the configuration file is xxx, then the actual file name is shown
958 below for each platform:
959
960 Macintosh xxx.cnf
961 Microsoft Windows xxx.INI
962 MS-DOS (without Windows) xxx.CFG
963 UNIX .xxxrc
964 VMS xxx.cfg
965
966 Samples of such files are in subdirectories of \config. The UNIX version
967 does not have the leading '.' in filename so you can see it.
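
The naming table above can be expressed as a small function. This is a
sketch in C only: the enum and the function name settings_file_name are
hypothetical and are not part of the toolkit, which selects the name at
compile time for the platform it is built on.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical platform tags for this sketch only. */
enum platform { PLAT_MAC, PLAT_MSWIN, PLAT_DOS, PLAT_UNIX, PLAT_VMS };

/* Build the settings file name for base name `base` on platform `p`,
   following the table above. Returns the length snprintf reports,
   or -1 for an unknown platform value. */
int settings_file_name(enum platform p, const char *base,
                       char *out, size_t outlen)
{
    assert(base != NULL && strlen(base) > 0);
    assert(out != NULL && outlen > 0);

    switch (p) {
    case PLAT_MAC:   return snprintf(out, outlen, "%s.cnf", base);
    case PLAT_MSWIN: return snprintf(out, outlen, "%s.INI", base);
    case PLAT_DOS:   return snprintf(out, outlen, "%s.CFG", base);
    case PLAT_UNIX:  return snprintf(out, outlen, ".%src", base);
    case PLAT_VMS:   return snprintf(out, outlen, "%s.cfg", base);
    }
    out[0] = '\0';
    return -1;
}
```

For the base name ncbi this yields ncbi.cnf on the Macintosh and .ncbirc
on UNIX.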
968
969 The location in which these files must reside is also platform dependent,
970 and the functions that manipulate the contents may look in several places to
971 find these files.
972
973 On the Macintosh, the function first looks in the System Folder, then in the
974 Preferences folder within the System Folder. (See the Mac OS X addendum in the
975 next paragraph). Under Microsoft Windows, the file must be in the Windows
976 directory, along with all of the other .INI files. Under DOS without Windows,
977 the function first looks in the current working directory, then in the directory
978 whose path is specified in the NCBI environment variable. Under UNIX and VMS,
979 the current working directory is first checked, then the user's home directory,
980 and finally the directory specified by the NCBI environment variable. (Under
981 UNIX, when it uses the environment variable, it will check for configuration
982 files first without and then with the initial dot.) On the multi-user
983 platforms (UNIX and VMS), the use of the NCBI environment variable allows a
984 common settings file to be used as the default by multiple users. If such a
985 settings file is changed under program control, it is copied over into the
986 user's home directory, and the new copy is modified. The order of searching
987 for settings files ensures that this new copy is used in all subsequent
988 operations.
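
The UNIX search order described above can be sketched in C as a function
that lists the candidate paths in the order they would be tried. This is an
illustration only, not the CoreLib code: the function name
unix_settings_candidates is hypothetical, and the home and NCBI directories
are passed in as arguments (standing in for the HOME and NCBI environment
variables) so the logic can be exercised without touching the environment.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_CANDIDATES 4
#define MAX_PATH_LEN   256

/* Hypothetical helper: fill `paths` with the places a UNIX program would
   look for the settings file with base name `base`, in search order.
   `home` and `ncbi_dir` stand in for the HOME and NCBI environment
   variables; either may be NULL if unset. Returns the number of entries. */
int unix_settings_candidates(const char *base, const char *home,
                             const char *ncbi_dir,
                             char paths[MAX_CANDIDATES][MAX_PATH_LEN])
{
    int n = 0;

    assert(base != NULL && strlen(base) > 0);

    /* 1. Current working directory. */
    snprintf(paths[n++], MAX_PATH_LEN, "./.%src", base);

    /* 2. The user's home directory. */
    if (home != NULL)
        snprintf(paths[n++], MAX_PATH_LEN, "%s/.%src", home, base);

    /* 3. The NCBI directory: first without, then with, the leading dot. */
    if (ncbi_dir != NULL) {
        snprintf(paths[n++], MAX_PATH_LEN, "%s/%src", ncbi_dir, base);
        snprintf(paths[n++], MAX_PATH_LEN, "%s/.%src", ncbi_dir, base);
    }
    return n;
}
```

Because the home directory comes before the NCBI directory in the list, a
copy written to the home directory after a program-controlled change takes
precedence over the shared default on every later lookup.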
989
990 On Mac OS X, it first looks for xxx.cnf in username/Library/Preferences,
991 then in package/Contents/Resources, where username is the user's home directory
992 and package is the application package. If it does not find the configuration
993 file, it then switches to UNIX style, looking for .xxxrc in the home directory
994 and then in the current directory. This way Mac OS X applications retain the
995 traditional Mac behavior but can also use UNIX-style configuration files.
996
997
998 contents of ASNLOAD are in ncbi/asnload
999 contents of DATA are in ncbi/data
1000
1001 Automatic Generation of code to read and write new ASN.1 messages.
1002 (Previously, ASNCODE USAGE)
1003
1004 'asntool' can now generate code for use as ASN.1 readers and writers.
1005 This functionality used to be in the program called 'asncode'. There
1006 is thus no longer any need for the *.l* files. An example of how
1007 to generate this code follows:
1008
1009
1010 asntool -m YOURSPEC.asn -G -B genYOURSPEC
1011
1012 Both genYOURSPEC.h and genYOURSPEC.c will be generated.
1013
1014 Within ASN.1 definitions, types can be EXPORTed and IMPORTed.
1015 If YOURSPEC.asn imports definitions from otherspec.asn then it has
1016 to be added to the -m parameter as below. Note that code is only
1017 generated for the first file.
1018
1019 asntool -m YOURSPEC.asn,otherspec.asn -G -B genYOURSPEC
1020 ^
1021
1022 Notice the lack of a blank at the caret (^), above. This is important.
1023
1024
1025 MAJOR CHANGES FROM DOCUMENTATION:
1026
1027 AsnNode structures have proved to be generally useful and moved from AsnLib
1028 to ncbimisc. In addition, some elements of structs used in the object loaders
1029 were called "class" to match the ASN.1 names. Class is a C++ reserved word,
1030 so all instances of "class" have been changed to "_class".
1031
1032 To conform to our naming conventions, we have changed the names appropriately:
1033
1034 AsnValue = DataVal
1035 AsnNode = ValNode
1036 class = _class
1037
1038 A global search and replace of your code with these strings (not restricted
1039 to words... we want to change AsnNodePtr = ValNodePtr as well) should fix
1040 any problems. Field names within structures have not been changed. If your
1041 code uses only the object loaders, you may not find these strings in your
1042 code at all.
1043
1044 DATA ACCESS LIBRARIES
1045
1046 cdromlib contains data access routines compatible with release 1.0-6.0
1047 of the Entrez CDROM. The documentation for these functions is out of
1048 date. The routines in cdromlib have been split into entrez, sequence, and
1049 medline access functions. The interface you should normally program to is
1050 defined in accentr.[ch]. The form of these calls has been changed to make
1051 them compatible with the NCBI network server, a client/server version of
1052 data access. A program written to use these calls can access the cdrom
1053 data, the network data, a combination, or that plus a local database by just
1054 fiddling with defines. The form of the api for these functions has also
1055 been changed to hide the details of storage and caching more so that the
1056 different optimizations done to support cdrom and network access are
1057 transparent to the application programmer. The end user tool called
1058 "Entrez" now uses these libraries as its only means of data access (i.e.,
1059 you can write an application of your own with any or all of Entrez's
1060 functionality using just these routines).
1061
1062 NETWORK LIBRARIES
1063
1064 The toolbox now includes NCBI "Network Services". This includes
1065 everything which you need to build your own "Network Entrez" client software.
1066 The network libraries include a generic network services library (nsclilib),
1067 which is used to contact the network services dispatcher and connect to a
1068 desired server. Note that some development platforms require that you obtain
1069 a few source modules from external vendors. Look at the README files
1070 contained in the network directory (network/*/README) for more details.
1071
1072
1073 DOCUMENTATION
1074
1075 We are rewriting the documentation to conform with all the new features
1076 contained in this software. We will add it to the package as soon as possible.
1077
1078 DEMO PROGRAMS
1079
1080 As in the tools, there are a number of undocumented programs in the demo
1081 directory as well, that use a number of the utility functions in api. There
1082 is also a demo program called "getseq" in the cdromlib directory which
1083 retrieves a sequence from the cdrom given any valid sequence id. These will
1084 be described in more detail in the next set of documentation. Briefly:
1085
1086 asn2ff.c converts ASN.1 to GenBank flatfile
1087 asn2rpt.c converts ASN.1 to human readable report
1088 dosimple.c converts ASN.1 to a "simple sequence"
1089 getseq.c gets sequence from Entrez Cdrom using data access library,
1090 writes to disk
1091 getfeat.c ditto, but writes sequence of any CdRegion features to
1092 "test.out"
1093 getmesh.c documented
1094 getpub.c documented
1095 indexpub.c documented
1096 seqtest.c reads ASN.1 sequence, converts to iupac, reports segmented
1097 sequences, outputs fasta format to seqtest.out
1098 testcore.c documented
1099 testobj.c tests Medline object loader, demonstrates error checking using
1100 NULL asnio stream.
1101 entrez If Vibrant is installed, the full Entrez program is made.
1102 asndhuff Demonstrates streaming ASN.1 data from the huffman compressed
1103 Entrez CDROM (only works on release 1.0 or later).
1104 entrcmd Standalone non-interactive tool for accessing Entrez data.
1105 Entrcmd is the search engine used for NCBI's Entrez WWW server.
1106 asncode Tool for generating object loader source code given a .l
1107 file which is the output of AsnTool.
1108 cdscan scans entrez cdrom, makes GenBank, GenPept, or FASTA format
1109 output. Also has a slot for a replaceable CustomRoutine
1110 supplied by you. Has two examples of such routines.
1111
1112 CALLBACK CONVENTIONS
1113
1114 The CoreLib, AsnLib, and Object Loader routines have been converted to use
1115 the LIBCALL and LIBCALLBACK symbols (FAR PASCAL) on the PC for Windows. This will
1116 allow us to build dynamic link libraries (DLLs) so that the code can be accessed
1117 from languages other than C. Callback functions you write that are of types
1118 AsnOptFreeFunc, AsnExpOptFunc, IoFuncType, AsnReadFunc, AsnWriteFunc, and
1119 SeqEntryFunc, should be declared using the LIBCALLBACK macro. For example, a
1120 callback used as an AsnOptFreeFunc should be declared as follows:
1121
1122 static Pointer LIBCALLBACK MyOptFreeFunc (Pointer);
1123
1124 The SeqEntryFunc callback used by SeqEntryExplore has not yet been modified to
1125 use the LIBCALLBACK type. This will be added in the near future.
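
A self-contained sketch in C of the callback convention follows. It compiles
without the toolkit headers, so the definitions of LIBCALLBACK and Pointer
below are stand-ins assumed for illustration (in real code they come from
the toolkit), and the names OptFreeFunc, call_free_func and free_count are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins so the sketch compiles on its own. These definitions are
   assumptions for illustration; the toolkit supplies the real ones. */
#ifndef LIBCALLBACK
#define LIBCALLBACK   /* empty here; FAR PASCAL on the PC for Windows */
#endif
typedef void *Pointer;

/* Hypothetical function-pointer type for a callback taking and
   returning a Pointer, declared with the callback convention. */
typedef Pointer (LIBCALLBACK *OptFreeFunc)(Pointer);

/* Counts invocations so the effect of the callback is observable. */
int free_count = 0;

/* A callback declared with LIBCALLBACK, as in the declaration above. */
Pointer LIBCALLBACK MyOptFreeFunc(Pointer data)
{
    (void)data;   /* a real callback would release the data here */
    free_count++;
    return NULL;
}

/* Hypothetical library-side caller that invokes the callback it is given. */
Pointer call_free_func(OptFreeFunc func, Pointer data)
{
    assert(func != NULL);
    return func(data);
}
```

Keeping the macro on both the function and the function-pointer type means
the caller and the callback agree on the calling convention on platforms
where LIBCALLBACK is not empty.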
1126