NCBI Data in XML

Introduction

Extensible Markup Language (XML) is a tagged format similar to HTML, on which web
pages are based. The familiar text format and the availability of public domain tools
for parsing it are making XML a popular choice for the exchange of structured data
over the WWW. Roughly ten years ago, NCBI chose a language called Abstract Syntax
Notation 1 (ASN.1) for describing and exchanging information in a manner similar to the
ways XML is now used. ASN.1 came out of the telecommunications industry and has a
compact binary encoding that covers human readable text as well as integers,
floating point numbers, and so on. While this is "software friendly", it is less
accessible to users familiar with HTML and other text based languages. Tools for ASN.1
have largely stayed within the commercial telecommunications industry, while a host of
public domain tools of varying character have arisen for XML and HTML.

NCBI has recently added support for XML output to its ASN.1 toolkit. An ASN.1
specification can be automatically rendered into an XML DTD. Data encoded in ASN.1
can automatically be output as XML that validates against the DTD using standard
XML tools. We hope this will make the structured sequence, map, and structure data, as
well as the output of tools like BLAST, more accessible to those who wish to work in
XML.

We are providing XML in two basic modes. Full Data Conversion is the direct mapping
of every data field used within NCBI to XML. This is not for the faint of heart, but it
does mean that whatever we have, you have. The other mode is to provide smaller,
Targeted DTDs for end users. These are still first done as ASN.1, but with an eye to
providing smaller, standalone data outputs as XML. These two modes are described in
detail below.

Full Data Conversion

Note that the full conversion of existing ASN.1 specified data into XML has some
specific properties. NCBI is not proposing a new data model, but is simply transliterating
the data model we have used for the last decade into a different language for the
convenience of our users. ASN.1 has a number of specific data types, such as INTEGER
or REAL numbers, while XML has only strings, so our DTD automatically adds some
ENTITY definitions at the top which map these number types to strings. This mapping only
lets humans who read the DTD see where numbers are expected; an XML validator
will not care what is there. The ASN.1 validators do care, and can also check ranges of
values and so on, so those continue to be used to read and process the data within NCBI.

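As a small illustration of the point above (a sketch, not NCBI code): a generic XML
parser hands back every value as a string, so a reader that cares about numeric types
has to check them itself. The element names and the helper read_int are made up for
this example, and Python's standard xml.etree.ElementTree module is used only because
it is widely available.

```python
import xml.etree.ElementTree as ET

def read_int(parent, tag):
    """Return the text of child `tag` as an int; raise ValueError otherwise."""
    text = parent.findtext(tag)
    if text is None:
        raise ValueError("missing element: " + tag)
    return int(text.strip())  # int() raises ValueError on non-numeric text

good = ET.fromstring("<Date><month>8</month><year>1999</year></Date>")
bad = ET.fromstring("<Date><month>August</month><year>1999</year></Date>")

year = read_int(good, "year")  # an int, not the string the parser returned

try:
    read_int(bad, "month")
    month_ok = True
except ValueError:
    month_ok = False  # the parser accepted "August"; only our own check objects
```
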
Reuse and Roles

ASN.1 is also designed to allow the reuse of modules in a specification. Modules may be
in multiple files and mixed and matched as needed, similar to C or C++ header files
defining structures and classes. Most XML specifications in biology have been relatively
small thus far, and/or focused on the work of a specific group. Thus the DTDs tend to be
in a single file. It is possible to write a large modular DTD in XML, and this is done by
commercial publishing houses, but in XML the including process requires two sets of
files. One set is basically a list of DTD modules to put together to make the complete
DTD. The other is the DTD modules themselves. In the NCBI XML specs, the files with a
.dtd extension are the ones referenced by the DOCTYPE line in an XML file. The DTDs for
individual modules have the extension .mod, and these correspond to the ASN.1 modules.

XML can be "valid" or "well formed". Valid XML means that the data in a record has been
compared with a specific DTD and that all the rules and elements defined in the DTD are
correctly reflected in the data. Well formed XML just means that the file does not break
any XML syntax rules, but no check is made that it actually follows the specification of
its DTD. ASN.1 was designed on the basis that data must always be "valid". Not only is
this more "type safe", but it also means that the ASN.1 parser always knows the structure
of the data. This makes compact binary encoding possible. It also means that data
elements can be reused in different roles without lots of extra tagging, since the
context is always known. So in ASN.1 (or most computer languages) the data structure
"Person" can have a field called "name" and "Gene" can also have a field called "name",
and nothing gets confused. XML requires that every ELEMENT have a unique tag, so if
"Person" and "Gene" appear in the same DTD, you cannot have a single tag, "name", that
means two different things depending on context.

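The difference between the two checks can be seen with any non-validating parser. The
sketch below uses Python's standard xml.etree.ElementTree, which tests well-formedness
only and never consults a DTD; checking validity would need a validating parser, which
is not shown. The sample documents and the helper name is_well_formed are invented for
the illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """True if `text` obeys XML syntax; no DTD is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Syntactically fine, although <colour> is not a field of any Date in this document,
# so a validator working from a Date DTD would be expected to reject it.
odd_but_well_formed = "<Date><colour>blue</colour></Date>"

# Mismatched tags break the XML syntax rules themselves.
broken = "<Date><month>8</year></Date>"
```
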
Roles:

For example, the NCBI ASN.1 specification was designed to be used in a modular way. So a
single Date object is defined with the fields year, month, day, etc. It is then
referenced in any object that needs a date; that is, the object can be reused in a
variety of roles. Since ASN.1 assumes a modular structure, it is straightforward to reuse
data in different roles without a lot of overhead. Consider this specification:

Record ::= SEQUENCE {
        create-date Date,
        update-date Date }

Date ::= SEQUENCE {
        month INTEGER,
        year INTEGER }

Some sample data might be:

Record ::= {
        create-date {
                month 6,
                year 1999 },
        update-date {
                month 8,
                year 2000 } }

The direct mapping to XML requires that every ELEMENT be explicitly tagged and not
implied by the context. So the equivalent DTD is more verbose:

<!ELEMENT Record ( create_date, update_date )>
<!ELEMENT create_date (Date)>
<!ELEMENT update_date (Date)>

<!ELEMENT Date (month, year)>
<!ELEMENT month (#PCDATA)>
<!ELEMENT year (#PCDATA)>

as is the XML data itself:

<Record>
        <create_date>
                <Date>
                        <month>6</month>
                        <year>1999</year>
                </Date>
        </create_date>
        <update_date>
                <Date>
                        <month>8</month>
                        <year>2000</year>
                </Date>
        </update_date>
</Record>

XML DTDs tend to adjust to this expansion of tag levels due to roles by defining each
role separately as it occurs:

<!ELEMENT Record ( create_month, create_year, update_month, update_year )>

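One practical consequence of keeping Date as a shared element is that a single reader
can serve every role. The following is a minimal sketch of that idea using Python's
standard xml.etree.ElementTree and the simplified Record example from this section; the
helper name read_date is an assumption of the example, not part of any NCBI tool.

```python
import xml.etree.ElementTree as ET

RECORD_XML = """
<Record>
  <create_date><Date><month>6</month><year>1999</year></Date></create_date>
  <update_date><Date><month>8</month><year>2000</year></Date></update_date>
</Record>
"""

def read_date(date_elem):
    """Turn a <Date> element into a (year, month) tuple of ints."""
    return int(date_elem.findtext("year")), int(date_elem.findtext("month"))

record = ET.fromstring(RECORD_XML)

# The same reader handles both roles; only the enclosing role tag differs.
dates = {role.tag: read_date(role.find("Date")) for role in record}
```

With the flattened form (create_month, create_year, and so on) each role would need its
own reading code, since there is no common Date element left to hand to a shared
function.
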
Scope:

ASN.1 does not require that a name be unique except within a structure, similar to
C or C++. XML, however, requires that all names be unique across the DTD, unless they
are attributes, which must come from a limited repertoire. Many XML parsers rely on this
so that callback functions are associated with a tag, not a tag within context. As a
trivial illustration, if both people and genes have names, they are distinct in ASN.1:

Person ::= SEQUENCE {
        name VisibleString,
        room INTEGER }

Gene ::= SEQUENCE {
        name VisibleString,
        map VisibleString }

but must be made unique in XML to be distinguished:

<!ELEMENT Person ( Person_name, room )>
<!ELEMENT Person_name (#PCDATA)>
<!ELEMENT room (#PCDATA)>

<!ELEMENT Gene (Gene_name, map)>
<!ELEMENT Gene_name (#PCDATA)>
<!ELEMENT map (#PCDATA)>

In the case above, we prefixed the element (name) that was used in two contexts with the
name of the context to make it unique. But this requires an analysis of all the modules
of the specification at once. In addition, it assumes the modules will not be used in
other contexts in future, which might make other elements non-unique. So the automatic
converter guarantees that every element is unique by always prefixing all element names
with the context (and would produce both Person_room and Gene_map in the example
above).

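The "always prefix" rule is simple enough to sketch in a few lines. This is an
illustration of the naming rule only, not the converter's actual code: the function name
dtd_for and the dictionary layout are hypothetical, and every field is treated as a text
leaf (#PCDATA), which ignores nested types.

```python
def dtd_for(spec):
    """Emit DTD lines for `spec`, a dict mapping type name -> list of field names.

    Every field becomes an element named <Type>_<field>, so names stay unique
    no matter which other types end up in the same DTD.
    """
    lines = []
    for type_name, fields in spec.items():
        children = [type_name + "_" + field for field in fields]
        lines.append("<!ELEMENT %s (%s)>" % (type_name, ", ".join(children)))
        for child in children:
            lines.append("<!ELEMENT %s (#PCDATA)>" % child)
    return lines

spec = {"Person": ["name", "room"], "Gene": ["name", "map"]}
dtd_lines = dtd_for(spec)
```

Because the prefix depends only on the enclosing type, each module can be converted on
its own, without the whole-specification analysis that selective prefixing would need.
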
Alternate Representations:

In a number of cases the ASN.1 specification allows alternate forms of the same data
object. This is because our goal was to get a workable specification that would
incorporate data from all the available sources. While the overall model is designed
around a view of how the data "should be", there are lots of places where we allow for
the reality of available sources. So, for example, while we might prefer that a Date have
fields for month and year, for some sources we may only have a string. Rather than drop
the Date altogether in those cases, we allow alternate forms in ASN.1:

Date ::= CHOICE {
        str VisibleString,   -- when it is all we have
        std Date-std }       -- preferred

Date-std ::= SEQUENCE {
        month INTEGER,
        year INTEGER }

which is represented in ASN.1 data as:

Date ::= std {
        month 8,
        year 1999 }

However, in XML it requires two more layers of explicit tags:

<Date>
        <Date_std>
                <Date-std>
                        <Date-std_month>8</Date-std_month>
                        <Date-std_year>1999</Date-std_year>
                </Date-std>
        </Date_std>
</Date>

Note the use of a hyphen in the original names (e.g. Date-std) and of an underscore to
delimit a role in another object (e.g. Date_std).

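A reader of this XML has to handle whichever alternative is present. The sketch below
shows one way to do that with Python's standard xml.etree.ElementTree. The element names
for the std form are taken from the example above; the name Date_str for the string form
is an assumption made by analogy with the naming rule and is not confirmed by this
document, and read_date is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def read_date(date_elem):
    """Return ("std", (year, month)) or ("str", text) for a <Date> element."""
    std = date_elem.find("Date_std/Date-std")
    if std is not None:
        year = int(std.findtext("Date-std_year"))
        month = int(std.findtext("Date-std_month"))
        return "std", (year, month)
    text = date_elem.findtext("Date_str")  # assumed tag name for the string form
    if text is not None:
        return "str", text
    raise ValueError("Date has neither a std nor a str alternative")

structured = ET.fromstring(
    "<Date><Date_std><Date-std>"
    "<Date-std_month>8</Date-std_month><Date-std_year>1999</Date-std_year>"
    "</Date-std></Date_std></Date>"
)
free_text = ET.fromstring("<Date><Date_str>August 1999</Date_str></Date>")
```
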
Summary:

While the combined effect of Roles, Scope, and Alternate Representations is extensive
tagging in the XML, the result does accurately reflect the structure and use of the data.
It allows XML programs to capture as little or as much of the full data structure as
they wish. And once the data is converted back from XML to structures or classes in a
programming language, there is minimal overhead once again. The full NCBI DTD reflects
this structure. What is called the NCBI DTD actually specifies only the basic data
structures for publications, sequences, maps, alignments, and structures. These same
elements are reused in different roles in many services as well, such as BLAST, which
produces alignments (defined in the NCBI DTD) as well as other elements specific to
BLAST. As a practical matter, we have not copied all the referenced modules into a DTD
for every service, although we can produce XML output from any ASN.1 interface.

Targeted DTDs

Many people do not want, or will not make use of, the full data specification used
internally by NCBI. It is fairly easy for us to write specialized subsets into
standalone specifications when there is a clear community need that will be served.
FASTA files, for example, are a very limited representation of a sequence, yet they are
sufficient for a large number of users most of the time.

In the NCBI toolkit are tools which, given an ASN.1 specification, will automatically
generate the C or C++ code (the C++ version is still in development) to read and write
data conforming to that specification in ASN.1, the C structures or classes to store it
in, the XML DTD, and the code to write it in XML. Thus we can specify a simpler, special
purpose structure, automatically generate most of the necessary code, then manually
write a relatively small bit of code to fill in the fields of the new C structure from
our existing C structures of the full version.

We have created two small examples of this. The Minimal Sequence (MinSeq) example
keeps some of the modular structure of the full specification, but greatly reduces the
number and depth of elements, and does not reference any other specification. The Tiny
Sequence (TinySeq) removes all the modularity of MinSeq (and thus a lot of the
flexibility for growth and modification) but results in an extremely simple structure.
All these forms of any sequence are available in the XML demo application. We welcome
comments and suggestions after you have looked through the demo.

asn2xml

asn2xml is a utility program designed to read sequence data in ASN.1 and output it as
"full XML", for those who would prefer working with that format. The only change to
the data itself, in addition to the remapping to XML, is the conversion of binary
sequence alphabets to text. For long DNA sequences especially, NCBI normally stores the
data in ASN.1 at 2 bits per base if there are no ambiguity codes, or 4 bits per base if
there are. Compared with one character per base, this reduces the data size by a factor
of 4 or 2 respectively, and it is also a more convenient form for many computations.
Since XML is a text format, the alphabets are converted. This, and the more verbose
tagging in XML, result in considerable expansion of the data from the binary ASN.1 on
our ftp site. So, to conserve our heavily used bandwidth and disk space, we provide this
utility. You can ftp binary ASN.1 and then expand it to XML at your site.

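To make the size argument concrete, here is a rough sketch of what expanding a 2 bit
per base encoding to text involves. It is not the asn2xml implementation. The code
order (A=0, C=1, G=2, T=3) and the bit order (first base in the top two bits of each
byte) are assumptions of this example; the ASN.1 specification documentation is the
place to confirm the actual alphabets.

```python
BASES_2BIT = "ACGT"  # assumed code order: A=0, C=1, G=2, T=3

def unpack_2bit(data, length):
    """Expand bytes holding four 2-bit bases each into a text sequence.

    `length` is the number of bases, needed because the last byte may be
    only partly used. The first base is assumed to sit in the top two bits.
    """
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES_2BIT[(byte >> shift) & 0b11])
    return "".join(out[:length])

# Two bytes carry five bases here; as text the same bases need five bytes.
packed = bytes([0b00011011, 0b11000000])
sequence = unpack_2bit(packed, 5)
```
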
The arguments to asn2xml (or any NCBI application) can be seen by typing the name
followed by a hyphen, "asn2xml -", which will give you:

asn2xml 1.0   arguments:

  -i  Filename for asn.1 input [File In]
    default = stdin
  -e  Input is a Seq-entry [T/F]  Optional
    default = F
  -b  Input asnfile in binary mode [T/F]  Optional
    default = T
  -o  Filename for XML output [File Out]  Optional
    default = stdout
  -l  Log errors to file named: [File Out]  Optional

The defaults are set to read a binary update file from stdin and write XML to stdout:

gzcat update.aso | asn2xml > update.xml

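For those who prefer to drive the conversion from a script, the same pipeline can be
sketched in Python. This is only an outline under stated assumptions: it uses just the
flags listed in the usage text above, it assumes the toolkit accepts a flag and its
value as separate arguments, it assumes the input file is gzip compressed, and the
function names are invented for the example.

```python
import gzip
import subprocess

def asn2xml_command(xml_path, binary=True, seq_entry=False):
    """Build an argument list using only the flags from the usage text above."""
    return [
        "asn2xml",
        "-b", "T" if binary else "F",
        "-e", "T" if seq_entry else "F",
        "-o", xml_path,
    ]

def expand(gz_path, xml_path):
    """Decompress `gz_path` and feed it to asn2xml on stdin (the -i default)."""
    with gzip.open(gz_path, "rb") as compressed:
        data = compressed.read()  # fine for modest files; stream larger ones
    subprocess.run(asn2xml_command(xml_path), input=data, check=True)

command = asn2xml_command("update.xml")
```

Calling expand() on a compressed update file and an output name would then play the
role of the gzcat pipeline, provided asn2xml is installed and on the search path.
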
The binary ASN.1 files can be found in the NCBI ftp directory at
ftp.ncbi.nih.gov/ncbi-asn1. Be sure to transfer them in binary mode. Note that these
files include GenBank in ASN.1, as well as other sources such as RefSeq, PIR, PDB, etc.
SWISSPROT is not included since it is no longer distributable in the public domain.

Documentation on the ASN.1 specification, pointers to the DTDs, and a demo program that
shows MinSeq and TinySeq are available from the upper right hand corner of the page at
http://www.ncbi.nlm.nih.gov/IEB. This page is not really finished, but interest in XML
has prompted us to show it to you anyway. The ASN.1 spec documentation is directly
relevant to the XML version since they have the same logical structure with pretty much
the same names. Note that our DOCTYPE line is set up so that you can validate XML either
with local DTD files from us, or using the public repository at
http://www.ncbi.nlm.nih.gov/IEB/DTD
