NCBI Data in XML

Introduction

Extensible Markup Language (XML) is a tagged format similar to HTML on which web
pages are based. The familiar text format and the availability of public domain tools for
parsing this language are making it a popular choice for the exchange of structured data
over the WWW. Roughly ten years ago, NCBI chose a language called Abstract Syntax
Notation 1 (ASN.1) for describing and exchanging information in a manner similar to the
ways XML is now used. ASN.1 came out of the telecommunications industry and has a
compact binary encoding intended for human readable text as well as integers,
floating point numbers, and so on. While this is "software friendly" it is less accessible to
users familiar with HTML and other text-based languages. Tools for ASN.1 have largely
stayed within the commercial telecommunications industry while a host of public domain
tools of varying character have arisen for XML and HTML.

NCBI has recently added support for XML output to its ASN.1 toolkit. An ASN.1
specification can be automatically rendered into an XML DTD. Data encoded in ASN.1
can automatically be output as XML that will validate against the DTD using standard
XML tools. We hope this will make the structured sequence, map, and structure data, as
well as the output of tools like BLAST, more accessible to those who wish to work in
XML.

We are providing XML in two basic modes. Full Data Conversion is the direct mapping
of every data field used within NCBI to XML. This is not for the faint of heart, but it
does mean that whatever we have, you have. The other mode is to provide smaller,
Targeted DTDs for end users. These are still first done as ASN.1, but with an eye to
providing smaller, standalone data outputs as XML. These two modes are described in
detail below.

Full Data Conversion

Note that the full conversion of existing ASN.1 specified data into XML has some
specific properties. NCBI is not proposing a new data model, but is simply transliterating
the data model we have used for the last decade into a different language for the
convenience of our users. ASN.1 has a number of specific data types such as INTEGER
or REAL numbers while XML has only strings, so our DTD automatically adds some
ENTITY definitions at the top that map these numbers to strings. This mapping only
allows humans who read the DTD to see where numbers are expected; an XML validator
will not care what is there. The ASN.1 validators do care, and can also check ranges of
values and so on, so those continue to be used to read and process the data within NCBI.
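
Because a DTD validator treats every value as a string, a program that reads the XML
has to do its own numeric checking. The following is a minimal sketch in Python, not
part of the NCBI toolkit. The helper name read_integer and the range limits are
illustrative assumptions, and the month/year element names are borrowed from the small
Date example used in this document.

```python
import xml.etree.ElementTree as ET


def read_integer(element, low=None, high=None):
    """Convert an element's text to int, with an optional range check.

    Hypothetical helper: a DTD cannot enforce this, so the application must.
    """
    text = (element.text or "").strip()
    try:
        value = int(text)
    except ValueError:
        raise ValueError(f"<{element.tag}> should hold an INTEGER, got {text!r}") from None
    if low is not None and value < low:
        raise ValueError(f"<{element.tag}> value {value} is below {low}")
    if high is not None and value > high:
        raise ValueError(f"<{element.tag}> value {value} is above {high}")
    return value


# Both documents are acceptable to an XML parser; only the first holds a number.
good = ET.fromstring("<Date><month>6</month><year>1999</year></Date>")
bad = ET.fromstring("<Date><month>June</month><year>1999</year></Date>")

month = read_integer(good.find("month"), low=1, high=12)
```

Calling read_integer on the month element of the second document raises ValueError,
which is the kind of check the ASN.1 validators perform automatically.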

Reuse and Roles

ASN.1 is also designed to allow the reuse of modules in a specification. Modules may be in multiple
files and mixed and matched as needed, similar to C or C++ header files defining structures and
classes. Most XML specifications in biology have been relatively small thus far, and/or focussed on
the work of a specific group. Thus the DTDs tend to be in a single file. It is possible to write a
large modular DTD in XML, and this is done by commercial publishing houses, but in XML the including
process requires two sets of files. One file is basically a list of DTDs to put together to make the
complete DTD. The other is the DTD modules themselves. In the NCBI XML specs, the files with a .dtd
extension are the ones referenced by the DOCTYPE line in an XML file. The DTDs for individual
modules have the extension .mod, and these correspond to the ASN.1 modules.

XML can be "valid" or "well formed". Valid XML means that the data in a record is compared with a
specific DTD and all the rules and elements defined in the DTD are correctly reflected in the data.
Well formed XML just means that the file does not break any XML syntax rules, but no check is made
that it actually follows the specification of its DTD. ASN.1 was designed on the basis that data
must always be "valid". Not only is this more "type safe", but it also means that the ASN.1 parser
always knows the structure of the data. This makes compact binary encoding possible. It also means
that data elements can be reused in different roles without lots of extra tagging since the context
is always known. So in ASN.1 (or most computer languages) the data structure "Person" can have a field
called "name" and "Gene" can also have a field called "name", and nothing gets confused. XML requires
that every ELEMENT have a unique tag, so if "Person" and "Gene" appear in the same DTD, you cannot
have a single tag, "name", that means two different things depending on context.
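
The difference between "well formed" and "valid" is easy to see with a non-validating
parser. The sketch below uses Python's standard xml.etree.ElementTree module, which
checks syntax only and does not validate against a DTD. The sample documents and the
helper name is_well_formed are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text obeys XML syntax rules (no DTD validation is done)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Well formed, but NOT valid: <surprise> is not declared in the DTD.
WELL_FORMED_BUT_INVALID = """<?xml version="1.0"?>
<!DOCTYPE Record [
  <!ELEMENT Record (create_date, update_date)>
  <!ELEMENT create_date (#PCDATA)>
  <!ELEMENT update_date (#PCDATA)>
]>
<Record><surprise>not declared in the DTD</surprise></Record>"""

# Not even well formed: <create_date> is never closed.
NOT_WELL_FORMED = "<Record><create_date>1999</Record>"

first_ok = is_well_formed(WELL_FORMED_BUT_INVALID)
second_ok = is_well_formed(NOT_WELL_FORMED)
```

The first document passes the syntax check even though it ignores its own DTD. Catching
that requires a validating parser, which is the check an ASN.1 parser always performs.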

Roles:
For example, the NCBI ASN.1 specification was designed to be used in a modular way. So a single Date
object is defined with the fields year, month, day, etc. It is then referenced in any object that
needs a date, that is, this object can be reused in a variety of roles. Since ASN.1 assumes a
modular structure, it is straightforward to reuse data in different roles without a lot of overhead.
For this specification:

    Record ::= SEQUENCE {
        create-date Date,
        update-date Date }

    Date ::= SEQUENCE {
        month INTEGER,
        year INTEGER }

and this sample data:

    Record ::= {
        create-date {
            month 6,
            year 1999 },
        update-date {
            month 8,
            year 2000 } }

the direct mapping to XML requires that every ELEMENT be explicitly tagged and not
implied by the context. So the equivalent DTD is more verbose:

    <!ELEMENT Record ( create_date, update_date )>
    <!ELEMENT create_date (Date)>
    <!ELEMENT update_date (Date)>

    <!ELEMENT Date (month, year)>
    <!ELEMENT month (#PCDATA)>
    <!ELEMENT year (#PCDATA)>

as is the XML data itself:

    <Record>
        <create_date>
            <Date>
                <month>6</month>
                <year>1999</year>
            </Date>
        </create_date>
        <update_date>
            <Date>
                <month>8</month>
                <year>2000</year>
            </Date>
        </update_date>
    </Record>

There is a tendency in XML DTDs to adjust to this expansion of tag levels due to roles
by defining each role separately as it occurs:

    <!ELEMENT Record ( create_month, create_year, update_month, update_year )>
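
The benefit of keeping the extra Date level is that a program needs only one piece of
code per object, whatever role the object appears in. A small Python sketch of this
idea, using the Record example above (the function names read_date and read_record are
illustrative, not part of any NCBI tool):

```python
import xml.etree.ElementTree as ET

RECORD_XML = """
<Record>
  <create_date>
    <Date><month>6</month><year>1999</year></Date>
  </create_date>
  <update_date>
    <Date><month>8</month><year>2000</year></Date>
  </update_date>
</Record>
"""


def read_date(date_element):
    """One reader for the Date object, reused for every role it appears in."""
    return (int(date_element.findtext("year")), int(date_element.findtext("month")))


def read_record(record_element):
    """Map each role (create_date, update_date) to a (year, month) tuple."""
    roles = ("create_date", "update_date")
    return {role: read_date(record_element.find(f"{role}/Date")) for role in roles}


record = read_record(ET.fromstring(RECORD_XML))
```

With the flattened form (create_month, create_year, and so on) each role would need its
own field-by-field handling, and adding a day field to Date would mean changing every
object that contains a date.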

Scope:

ASN.1 does not require that a name be unique except within a structure, similar to
C or C++. XML, however, requires that all names be unique across the DTD, unless they
are attributes which must come from a limited repertoire. Many XML parsers rely on this
so that callback functions are associated with a tag, not a tag within context. As a trivial
illustration, if both people and genes have names, they are distinct in ASN.1:

    Person ::= SEQUENCE {
        name VisibleString,
        room-number INTEGER }

    Gene ::= SEQUENCE {
        name VisibleString,
        map VisibleString }

but must be made unique in XML to be distinguished:

    <!ELEMENT Person ( Person_name, room )>
    <!ELEMENT Person_name (#PCDATA)>
    <!ELEMENT room (#PCDATA)>

    <!ELEMENT Gene (Gene_name, map)>
    <!ELEMENT Gene_name (#PCDATA)>
    <!ELEMENT map (#PCDATA)>

In the case above, we prefixed the element (name) that was used in two contexts with the
name of the context to make it unique. But this requires an analysis of all the modules of
the specification at once. In addition, it assumes the modules will not be used in other
contexts in future, which might make other elements non-unique. So the automatic
converter guarantees that every element is unique by always prefixing all element names
with the context (and would produce both Person_room, and Gene_map, in the example
above).
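
The "always prefix" rule is simple enough to sketch in a few lines of Python. This is
only an illustration of the naming rule, not the converter in the NCBI toolkit: the
function dtd_for and the dictionary layout are assumptions, the field names follow the
DTD example above, and every field is treated as a simple #PCDATA leaf.

```python
def dtd_for(spec):
    """Generate ELEMENT declarations, prefixing each field with its context.

    spec maps a type name to the list of its field names (hypothetical layout).
    """
    lines = []
    for type_name, fields in spec.items():
        # Context and field are joined with an underscore, e.g. Person_name.
        qualified = [f"{type_name}_{field}" for field in fields]
        content = ", ".join(qualified)
        lines.append(f"<!ELEMENT {type_name} ({content})>")
        for name in qualified:
            lines.append(f"<!ELEMENT {name} (#PCDATA)>")
    return lines


SPEC = {
    "Person": ["name", "room"],
    "Gene": ["name", "map"],
}

declarations = dtd_for(SPEC)
print("\n".join(declarations))
```

Because every element name now carries its context, no cross-module analysis is needed
and the names stay unique when a module is later reused elsewhere.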

Alternate Representations:

In a number of cases the ASN.1 specification allows alternate forms of the same data object. This is
because our goal was to get a workable specification that would incorporate data from all the
available sources. While the overall model is designed around a view of how it "should be", there
are lots of places where we allow for the reality of available sources. So, for example, while we
might prefer that a Date have fields for month and year, for some sources we may only have a string.
Rather than drop the Date altogether in those cases, we allow alternate forms in ASN.1:

    Date ::= CHOICE {
        str VisibleString,   -- when it is all we have
        std Date-std }       -- preferred

    Date-std ::= SEQUENCE {
        month INTEGER,
        year INTEGER }

which is represented in ASN.1 data as:

    Date ::= std {
        month 8,
        year 1999 }

However in XML it requires two more layers of explicit tags:

    <Date>
        <Date_std>
            <Date-std>
                <Date-std_month>8</Date-std_month>
                <Date-std_year>1999</Date-std_year>
            </Date-std>
        </Date_std>
    </Date>

Note the use of hyphen in the original names (e.g. Date-std) and of underline to delimit a
role in another object (e.g. Date_std).
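
A program reading such data checks which alternative is present and handles each form.
A minimal Python sketch follows. The std form is taken from the example above. The
Date_str element name for the string alternative is an assumption made by applying the
same naming rule, and the sample string "Aug 1999" is invented for illustration.

```python
import xml.etree.ElementTree as ET

STD_FORM = """
<Date>
  <Date_std>
    <Date-std>
      <Date-std_month>8</Date-std_month>
      <Date-std_year>1999</Date-std_year>
    </Date-std>
  </Date_std>
</Date>
"""

# Assumed tag name for the "str" alternative of the CHOICE.
STR_FORM = "<Date><Date_str>Aug 1999</Date_str></Date>"


def read_date(date_element):
    """Return the structured form if present, otherwise the free-text form."""
    std = date_element.find("Date_std/Date-std")
    if std is not None:
        return {
            "month": int(std.findtext("Date-std_month")),
            "year": int(std.findtext("Date-std_year")),
        }
    text = date_element.findtext("Date_str")
    if text is not None:
        return {"text": text}
    raise ValueError("Date holds neither the std nor the str alternative")


structured = read_date(ET.fromstring(STD_FORM))
free_text = read_date(ET.fromstring(STR_FORM))
```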

Summary:

While the combined effect of Roles, Scope, and Alternate Representations is extensive
tagging in the XML, that tagging does accurately reflect the structure and use of the data.
It allows XML programs to capture as little or as much of the full data structure as they
wish. And once converted back from XML to structures or classes in a variety of programming
languages, there is minimal overhead once again. The full NCBI DTD reflects this
structure. What is called the NCBI DTD actually only specifies the basic data structures
for publications, sequences, maps, alignments, and structures. These same elements are
reused in different roles in many services as well, such as BLAST which produces
alignments (defined in NCBI DTD) as well as other elements specific to BLAST. We
have not copied all the referenced modules into a DTD for every service as a practical
matter, although we can produce XML output from any ASN.1 interface.

Targeted DTDs

Many people do not want, or will not make use of, the full data specification used
internally by NCBI. We can fairly easily write specialized subsets as
standalone specifications when there is a clear community need that will be served.
FASTA files, for example, are a very limited representation of a sequence, yet they are
sufficient for a large number of users most of the time.

In the NCBI toolkit are tools which, given an ASN.1 specification, will automatically
generate the C or C++ code (C++ version is still in development) to read and write data
conforming to that specification in ASN.1, the C structures or classes to store it in, the
XML DTD, and the code to write it in XML. Thus we can specify a simpler, special
purpose structure, automatically generate most of the necessary code, then manually
write a relatively small bit of code to fill in the fields in the new C structure from our
existing C structures of the full version.

We have created two small examples of this. The Minimal Sequence (MinSeq) example
keeps some of the modular structure of the full specification, but greatly reduces the
number and depths of elements, and does not reference any other specification. The Tiny
Sequence (TinySeq) removes all modularity (and thus a lot of the flexibility for growth
and modification) of MinSeq but results in an extremely simple structure. All these forms
of any sequence are available in the XML demo application. We welcome comments and
suggestions after you have looked through the demo.

asn2xml

asn2xml is a utility program designed to read sequence data in ASN.1 and output it as
"full XML", for those who would prefer working with that format. The only change to
the data itself, in addition to the remapping to XML, is to convert binary sequence
alphabets to text. Especially for long DNA sequences, NCBI normally stores the data
in ASN.1 in 2 bits per base if there are no ambiguity codes, or 4 bits per base if there
are. Compared with one text character per base, this reduces the data size by a factor
of 4 or 2, respectively, and is also a more convenient form for many computations.
Since XML is a text format, the alphabets are converted.
This, and the more verbose tagging in XML, result in considerable expansion of the
data from the binary ASN.1 on our ftp site. So, to conserve our heavily used bandwidth
and disk space, we provide this utility. You can ftp binary ASN.1 and then expand it
on your site to XML.
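
To make the expansion concrete, here is a minimal Python sketch of unpacking a 2 bits
per base encoding into text. It is not the asn2xml code. It assumes the code values
A=0, C=1, G=2, T=3 and that the first base of each byte sits in the high-order bits;
check both assumptions against the ASN.1 sequence alphabet documentation before relying
on them. The function name unpack_2bit is illustrative.

```python
def unpack_2bit(data, length):
    """Expand packed 2-bit bases into a text string of `length` bases.

    Assumes A=0, C=1, G=2, T=3, four bases per byte, first base in the high bits.
    """
    bases = "ACGT"
    out = []
    for i in range(length):
        byte = data[i // 4]
        shift = 6 - 2 * (i % 4)  # 6, 4, 2, 0 for the four bases in a byte
        out.append(bases[(byte >> shift) & 0b11])
    return "".join(out)


packed = bytes([0b00011011])       # one byte holds four bases
sequence = unpack_2bit(packed, 4)  # the same bases take four characters as text
```

One byte of packed data becomes four characters of text, before any XML tags are added
around it.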

The arguments to asn2xml (or any NCBI application) can be seen by typing the name and a
hyphen, "asn2xml -", which will give you:

asn2xml 1.0   arguments:

  -i  Filename for asn.1 input [File In]
    default = stdin
  -e  Input is a Seq-entry [T/F]  Optional
    default = F
  -b  Input asnfile in binary mode [T/F]  Optional
    default = T
  -o  Filename for XML output [File Out]  Optional
    default = stdout
  -l  Log errors to file named: [File Out]  Optional

The defaults are set to read a binary update file from stdin and write XML to stdout:

    gzcat update.aso | asn2xml > update.xml
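
The same pipeline can be driven from a script. The Python sketch below is an
illustration only: it assumes asn2xml is installed on your PATH, that the input file is
gzip-compressed as in the gzcat example, and that flag values are passed as separate
arguments (for example "-b T"). The function names asn2xml_command and convert are
hypothetical; only the flags listed above are used.

```python
import gzip
import shutil
import subprocess


def asn2xml_command(output_path=None, seq_entry=False, binary=True, log_path=None):
    """Build an asn2xml argument list; input is left at its default, stdin."""
    def flag(value):
        return "T" if value else "F"

    cmd = ["asn2xml", "-e", flag(seq_entry), "-b", flag(binary)]
    if output_path is not None:
        cmd += ["-o", output_path]
    if log_path is not None:
        cmd += ["-l", log_path]
    return cmd


def convert(gz_path, xml_path):
    """Decompress gz_path and feed it to asn2xml, writing XML to xml_path."""
    proc = subprocess.Popen(asn2xml_command(output_path=xml_path), stdin=subprocess.PIPE)
    with gzip.open(gz_path, "rb") as source:
        shutil.copyfileobj(source, proc.stdin)
    proc.stdin.close()
    if proc.wait() != 0:
        raise RuntimeError(f"asn2xml failed with exit code {proc.returncode}")


# Example (requires asn2xml and a real update file):
# convert("update.aso", "update.xml")
```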

The binary ASN.1 files can be found in the ncbi ftp directory at ftp.ncbi.nih.gov/ncbi-asn1.
Be sure to transfer them in binary format. Note that these files include GenBank in ASN.1,
as well as other sources such as RefSeq, PIR, PDB, etc. SWISSPROT is not included since it
is no longer distributable in the public domain.

Documentation on the ASN.1 specification, pointers to the DTDs, and a demo program that shows
MinSeq and TinySeq are linked from the upper right hand corner of the page at
http://www.ncbi.nlm.nih.gov/IEB. This page is not really finished, but interest in XML has
prompted us to show it to you anyway. The ASN.1 spec documentation is directly relevant to the
XML version since they are the same logical structure with pretty much the same names. Note
that our DOCTYPE line is set up so that you can validate XML either with local DTD files from
us, or using the public repository at
http://www.ncbi.nlm.nih.gov/IEB/DTD