Demands on DELTA
The database uses the DELTA system (DEscription Language for TAxonomy) (3),
developed at CSIRO Entomology by Michael Dallwitz (4) and now adopted as a
world standard for data exchange in taxonomy. A distinctive feature of DELTA
is its capacity to store an extraordinary diversity of data, and to translate
these data into natural language for traditional reports and web publication.
ICTVdB exploits the full flexibility of the DELTA subprograms.
On the input side,
the capacity of DELTA to handle very large datasets one item at a time
is ideally suited to a long list of virus properties (character list),
often accompanied by extensive text comments and images.
Although only partly populated, ICTVdB already lists more than
2000 virus descriptions (items) constructed from 2250 characters, some
with up to 2000 states (Box 2).
By the time all available data on virus isolates and strains
are entered, the number of items will be closer to a million.
Virus taxonomy is very much in flux because our understanding of relationships
between viruses is increasingly dependent on genomic data that continually
challenge earlier decisions based on morphology.
Strategies to facilitate communication across semantic boundaries are
particularly important in ICTVdB, which deals with data from diverse sources
such as bacteriology, agriculture, and the veterinary and medical sciences,
each of which has evolved a distinctive vocabulary.
Although terms have been standardised within ICTVdB, these standards
cannot be imposed on virologists in all disciplines, and they cannot be imposed
retrospectively on the literature.
Another input-side requirement of ICTVdB is user-friendly, online data entry
for peer review of new information, ranging from the molecular properties of a
virus to its geographic distribution and host range. Such diverse information,
with intrinsic dependencies between genomic data, protein composition, particle
structure and infectivity, places particular demands on the flat file system of
DELTA. These demands have been met by building a dependency network in the data
specification files. The spreadsheet display of the DELTA editor is
particularly useful for reviewing these dependencies, a critical step in
developing and working with the ICTVdB dataset.
Although DELTA was
designed for taxonomic research, its output formats transcend these specialist
interests. Its translation facilities
can be used by taxonomists to construct nearest neighbour relationships, but can
also be used to blend data from diverse sources. For example, ICTVdB does not itself contain sequence data, but
conversion of ICTVdB data from DELTA into NEXUS format was deemed essential for
comprehensive phylogenetic analyses.
Such work is also indispensable for monitoring the evolution of viruses
in relation to emerging diseases.
Although most new information in virology is generated at the molecular level
and is deposited in sequence databases, significant events in virology tend to
be associated with "host jumping", epidemics and environmental disturbances,
and information on all of these can be retrieved from ICTVdB. From the outset,
DELTA was designed not only to generate identification keys but also to
translate its data into natural language for hard copy, into HTML for the web,
and into many of the languages of the world.
These output attributes will be fully exploited by ICTVdB.
Structural Features of ICTVdB
Although ICTVdB began as a taxonomic database (5), it now has several
distinctive features not usually used in systematics, but introduced of
necessity. Chief among these is its decimal code (6). The peculiar nomenclature
used in virology defies direct and systematic interrogation in a database, and
virus taxonomy was changing rapidly; a decimal code (analogous to the code of
enzyme nomenclature) seemed to offer a simple resolution to these diverse
problems.
Table 1. Expansion of the decimal code to accommodate revisions of Poliovirus
taxonomy, and to anticipate the explosion of lower level data (serotypes,
strains and isolates).
| Level | Original Decimal Code | Extended Decimal Code |
| Order | | 00. = (not assigned) |
| Family | 52. = Picornaviridae | 00.052. = Picornaviridae |
| Subfamily | 52.0. = (no subfamilies) | 00.052.0. = (not assigned) |
| Genus | 52.0.1. = Enterovirus | 00.052.0.01. = Enterovirus |
| Subgenus (serogroup) | 52.0.1.0. = (no subgenus) | Superseded by new species concept |
| Species (type species) | 52.0.1.0.001 = Poliovirus 1 | 00.052.0.01.001. = Poliovirus |
| Species | 52.0.1.0.067 = Poliovirus 1 | 00.052.0.01.007. = Poliovirus |
| | 52.0.1.0.068 = Poliovirus 2 | |
| | 52.0.1.0.069 = Poliovirus 3 | |
| Subspecies | | 00.052.0.01.007.00. = (not assigned) |
| Serotype | | 00.052.0.01.007.00.001. = Poliovirus 1 |
| | | 00.052.0.01.007.00.002. = Poliovirus 2 |
| | | 00.052.0.01.007.00.003. = Poliovirus 3 |
| Strain or Isolate | | 00.052.0.01.007.00.001.001. = PV-1 Brunhilde |
| | | 00.052.0.01.007.00.002.002. = PV-2 Mahony |
| | | 00.052.0.01.007.00.002.001. = PV-2 Lansing |
| | | 00.052.0.01.007.00.003.001. = PV-3 Leon |
Decimal Code
Because virus names are changed frequently, contain diverse linguistic and
geographical elements, and are usually coupled to a disease or its symptoms,
virus nomenclature presents challenging semantic problems for a database. The
decimal code simultaneously affords unequivocal identification of a virus to
the level of strain or isolate, and indicates its taxonomic context.
The core infrastructure of ICTVdB is its distinctive "table of contents", the
Index of Viruses (formerly Index Virum), a list of approved virus names
sanctioned by ICTV. The decimal code is constructed in the Index of Viruses,
and serves as a filename for database outputs as well as an accession number
for external linkage to ICTVdB. The original DOS-based DELTA system used by
ICTVdB could only accommodate 8-digit filenames. The increasing focus on lower
level taxonomic information, together with taxonomic revisions, requires the
decimal code to be expanded to 19 digits. The application of the expanded code
to the recently revised taxonomy of Poliovirus is illustrated in Table 1.
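The way the expanded code carries taxonomic context can be made concrete with a short Python sketch. It is an illustration only, not part of ICTVdB or DELTA: the field widths and level names are read off the Extended Decimal Code column of Table 1, and the function names (parse_decimal_code, rank, parent_code) are hypothetical.

```python
# Level names and field widths as read from the Extended Decimal Code
# column of Table 1 (an assumption of this sketch, not an ICTVdB API).
LEVELS = [
    ("order", 2), ("family", 3), ("subfamily", 1), ("genus", 2),
    ("species", 3), ("subspecies", 2), ("serotype", 3),
    ("strain or isolate", 3),
]


def parse_decimal_code(code):
    """Split an extended decimal code into an ordered {level: digits} mapping."""
    fields = code.strip().rstrip(".").split(".")
    if len(fields) > len(LEVELS):
        raise ValueError(f"too many fields in {code!r}")
    parsed = {}
    for (level, width), field in zip(LEVELS, fields):
        if not (field.isdigit() and len(field) == width):
            raise ValueError(f"{level} field {field!r} should be {width} digits")
        parsed[level] = field
    return parsed


def rank(code):
    """Lowest taxonomic level that the code specifies."""
    return list(parse_decimal_code(code))[-1]


def parent_code(code):
    """Code of the next higher taxon, obtained by dropping the last field."""
    fields = list(parse_decimal_code(code).values())
    return ".".join(fields[:-1]) + "." if len(fields) > 1 else None


code = "00.052.0.01.007.00.001.001."   # PV-1 Brunhilde in Table 1
print(rank(code))          # strain or isolate
print(parent_code(code))   # 00.052.0.01.007.00.001.
```

Because every level has a fixed width, the code for any higher taxon is a prefix of the codes beneath it, and the field widths sum to the 19 digits mentioned above.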
With the
introduction of long filenames in Windows 95/NT, the expanded decimal code can
be used by PCs, and is no longer confined to UNIX systems. The expansion to 19 digits should cope with
even the most ambitious "splitters" in the taxonomic community. It will be necessary to differentiate
provisionally assigned taxa in the dynamic database, but this can be
accommodated without further assignments in the decimal code. Although individual virologists are finding
the decimal code useful, this invention of necessity is by no means universally
accepted among virus taxonomists.
If a database is to
accept the latest data from all branches of virology, and place these
diverse data into contemporary taxonomic context, it will most commonly
deal with information at the level of strains and isolates. Ideally, ICTVdB will serve virus taxonomy "from the bottom
up" with primary data from researchers who describe their viruses
using rich and diverse semantics, reflecting geographic and linguistic
factors. At the same time, the database must accept
revisions and consolidations "from the top down" as the consensus
in virus taxonomy reflects this new information.
For example, the relegation of such widely used species names
as Poliovirus 1, 2 and 3 to serotypes (Table 1),
although an emotive issue (8), has been justified by pairwise comparison
of genomic data.
As the database
developed, it became clear that the decimal code served as more than an unequivocal
identifier for taxonomically correct internal linkages in the database. It is used as a filename for transposing
ICTVdB to the web, and also serves as a surrogate accession number used by
sequence databases such as EMBL and SWISS-PROT to link to ICTVdB. The decimal code unequivocally identifies a
virus, and simultaneously indicates its taxonomic status from order to isolate,
and should be routinely cited in publications.
Dependencies
Unlike many other databases, which deal with relatively uniform data types and
a small number of fields, ICTVdB is not a relational database but a flat file
system. All key components of ICTVdB in DELTA format (character list,
specification file and items file) are readable text files, as are the
directive files used for data translation and conversion. The character list of
ICTVdB is distinctive in that it must accommodate data of all sorts, from the
geometry of virus particles through the chemical composition of components to
host range and geographic distribution. It also supports these data with
explanatory commentary and images. Each character is specified in terms of
ordered or unordered multistate properties, integer or real numeric properties,
text and images, the latter being handled as a special category of text.
Table 2 unfolds the specification of general genomic characteristics
(excluding sequences) of a virus.
Table 2. Components of a DELTA database. Angle brackets <> denote commentary
in the character list and in the items file. The natural language translation
of this example will read:
Genome is (usually) monopartite; contains RNA; is 9128-9738 nucleotides long
(depending on isolate) with a weight ranging between (9.0-)9.2-9.5 or 9.8 (for
strain Y). Genome organisation: 5'-gag-pro-pol-env-3'. Genome map (7) (image
not displayed).
| Specification File | Character List | | Items File |
| type | feature | attribute | code |
| 1,OM ordered multistate | #1. genome is <whether segmented>/ | 1. monopartite/ 2. bipartite/ 3. tripartite/ | 2<usually>,1 |
| 2,UM unordered multistate | #2. genome contains <nucleic acid type>/ | 1. DNA/ 2. RNA/ | 1,2 |
| 3,IN integer numeric | #3. genome <length> is/ | <number of> nucleotides long/ | 3,9128-9738<depending on isolate> |
| 4,RN real numeric | #4. genome with a weight/ | kDa/ | 4<ranging between>,(9.0-) 9.2-9.5/9.8<for strain Y> |
| 5,TE text | #5. genome organisation: <order of genes or ORFs>/ | | 5<5'-gag-pro-pol-env-3'> |
| 6,TE image | #6. Genome map <image path to diagram>/ | | 6<gm_lenti.gif> |
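How the three files of Table 2 combine into a sentence can be sketched in a few lines of Python. This is a simplified illustration, not the DELTA translation program: it covers only characters 1, 2, 3 and 5, it repeats the feature wording for every character, and it assumes that each coded attribute has the layout character<comment>,value<comment>. On that assumption the first two attributes are written here as 1<usually>,1 and 2,2. All names in the sketch (CHARACTERS, parse_attribute, render, describe) are hypothetical.

```python
import re

# Character list and types transcribed from Table 2, with the <> commentary
# stripped from the feature and unit wording.
CHARACTERS = {
    1: {"type": "OM", "feature": "genome is",
        "states": {1: "monopartite", 2: "bipartite", 3: "tripartite"}},
    2: {"type": "UM", "feature": "genome contains",
        "states": {1: "DNA", 2: "RNA"}},
    3: {"type": "IN", "feature": "genome is", "unit": "nucleotides long"},
    5: {"type": "TE", "feature": "genome organisation:"},
}

# Assumed attribute layout: character<comment>,value<comment>
ATTRIBUTE = re.compile(r"^(\d+)(?:<([^>]*)>)?(?:,([^<]*))?(?:<([^>]*)>)?$")


def parse_attribute(code):
    """Return (character number, leading comment, value, trailing comment)."""
    match = ATTRIBUTE.match(code.strip())
    if match is None:
        raise ValueError(f"unrecognised attribute: {code!r}")
    number, lead, value, trail = match.groups()
    return int(number), lead, (value or "").strip(), trail


def render(code):
    """Translate one coded attribute into a natural language phrase."""
    number, lead, value, trail = parse_attribute(code)
    char = CHARACTERS[number]
    words = [char["feature"]]
    if char["type"] in ("OM", "UM"):      # multistate: look up the state name
        if lead:
            words.append(f"({lead})")
        words.append(char["states"][int(value)])
    elif char["type"] in ("IN", "RN"):    # numeric: value followed by the unit
        words.extend([value, char["unit"]])
    else:                                 # text: the comment is the content
        words.append(lead or "")
    if trail:
        words.append(f"({trail})")
    return " ".join(words)


def describe(attributes):
    """Join the phrases of one item into a single sentence."""
    sentence = "; ".join(render(code) for code in attributes)
    return sentence[0].upper() + sentence[1:] + "."


item = ["1<usually>,1", "2,2", "3,9128-9738<depending on isolate>",
        "5<5'-gag-pro-pol-env-3'>"]
print(describe(item))
```

The point of the sketch is that the items file stores only numbers and comments, while all of the wording lives in the character list, which is what allows the same item to be translated into different output formats or languages.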
At critical points in the character list, binary statements, such as virus
particle with or without envelope, are used to establish dependencies so that
only the subsequently valid characters can be used. These dependencies provide
the internal hierarchy of linkages in the data, direct the search path during
interrogation and, among other things, reveal errors during data entry. The
dependencies are very important for the decision-making process during
identification and data comparison, and some multistate characters in key
positions (e.g. plant or animal virus) can control the validity of up to 2000
characters down the line. The dependencies are automatically indicated in the
spreadsheet display of DELTA (Table 3).
Table 3. Spreadsheet view of the DELTA editor. Red bars indicate cells that are
made inapplicable through dependencies.
At other points in the character list, "pseudo-characters" are used to
overcome semantic difficulties arising in different sub-fields of virology,
and to establish dependencies among blocks of characters.
Table 4 shows pseudo-character 61, which handles semantic equivalences such as
tegument = inner lipid protein membrane, and capsid = head of a tailed phage.
It also shows the dependencies established by the states, so that state 4, for
example, opens only the character section 512-617. These dependencies build
the internal hierarchy of the database.
Table 4. Semantic equivalencies and dependencies among the major morphological
properties of virus particles. Few viruses contain more than one component.
| #61. <Virion or phage> consists of <components of particle>/ | Dependent Character Blocks | |
| 1. an envelope <including inner and outer envelope>/ | Envelope. | #406-458 |
| 2. a surface membrane | Surface Membrane | #459-511 |
| 3. a tegument/ | Tegument | #791-814 |
| 4. a head <of phage treated as isometric capsid>/ | Capsid (Coat Protein). | #512-564 |
| 5. a capsid <including inner and outer capsid>/ | Inner Capsid. | #565-617 |
| 6. a tail <of phage treated as elongated capsid>/ | Tail | #512-564 |
| 7. a nucleocapsid/ | Nucleocapsid. | #618-667 |
| 8. a core/ | Core. | #717-765 |
| 9. a nucleoid/ | Nucleoid. | #668-716 |
| 10. lateral bodies/ | Lateral Bodies. | #815-838 |
| 11. a matrix/ | Matrix. | #766-790 |
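The effect of these dependencies on data entry can be sketched in Python. The block boundaries below are transcribed from Table 4, but the functions are hypothetical and only illustrate the principle; they are not the mechanism DELTA itself uses in its specification files.

```python
# State of pseudo-character 61 -> (dependent block, first character,
# last character), transcribed from Table 4.
BLOCKS = {
    1: ("Envelope", 406, 458),
    2: ("Surface Membrane", 459, 511),
    3: ("Tegument", 791, 814),
    4: ("Capsid (Coat Protein)", 512, 564),
    5: ("Inner Capsid", 565, 617),
    6: ("Tail", 512, 564),
    7: ("Nucleocapsid", 618, 667),
    8: ("Core", 717, 765),
    9: ("Nucleoid", 668, 716),
    10: ("Lateral Bodies", 815, 838),
    11: ("Matrix", 766, 790),
}


def applicable_characters(states):
    """Characters opened by the chosen states of pseudo-character 61."""
    opened = set()
    for state in states:
        _, first, last = BLOCKS[state]
        opened.update(range(first, last + 1))
    return opened


def entry_errors(states, coded_characters):
    """Coded characters lying in a block that no chosen state has opened."""
    controlled = applicable_characters(BLOCKS)   # every dependent block
    opened = applicable_characters(states)
    return sorted(c for c in coded_characters
                  if c in controlled and c not in opened)


# A hypothetical item coded as "envelope" and "capsid" (states 1 and 5).
# Character 800 lies in the Tegument block, which neither state opens.
print(entry_errors({1, 5}, {5, 410, 570, 800}))   # [800]
```

Characters outside every dependent block (character 5 in the example) are never flagged, which mirrors the idea that a controlling character only governs the blocks listed against its states.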
Images
It is said that a
picture is worth a thousand words.
Images of virus particles are used in several ways in ICTVdB. For example, text descriptions of key
morphological characters become much more precise when they are linked in the
character list to representative vignettes from electron microscope
photographs. Thin section EM images of
infected tissues are used to illustrate virus infection cycles and host
pathology. Descriptions of all viruses
generated from ICTVdB will be enhanced by EM images of the type species,
irrespective of the presentation format selected. Not surprisingly, images of virus particles are among the most
frequently accessed files in ICTVdB on the web. A large image file is more instructive to users, but in the
database it is functionally equivalent to numerous characters, like "virus
75-80 nm in diameter" in the case of Rotavirus. File size considerations and access paths dictate that image
files be stored outside the main dataset, in either local files or files
accessed on the Internet.
Quo Vadis
The PC-based ICTVdB is presented as a natural language translation on the web
using HTML conversion of the DELTA-formatted data. This web environment is
essential for universal access, interactive data entry and interrogation, as
well as interoperability with other databases. Currently, a plethora of
accessories is available, many of which are standard components of DELTA (e.g.
Web Intkey, an interactive identification program).
Others, like the data entry forms, Java applets and scripts used to display
directory trees, have been developed specifically for use in ICTVdB. In future,
interoperability will be vastly improved by XML tagging.
Just as it is certain that the flow of new information about viruses
will not slow, so it is certain that the new technology available to ICTVdB to
handle these data will grow.
Thus far, ICTVdB has been a single-investigator project, with a lot of goodwill
and support for software development. The principal impediments to its
usefulness and sustainability are common to most biological databases. First,
populating the database requires commitment from the virological community to
data entry and update. It is pleasing to have some researchers deposit new
virus data in ICTVdB at the same time as they deposit sequence data elsewhere.
Hopefully, this will become routine, but it remains difficult to extract
existing data and to engage the expertise of busy senior scientists. Second,
support beyond the development phase, now largely completed, requires a shift
from public funding to a commercial context. A subscription database seems a
plausible path forward, one that could see a consortium of database
professionals working to maintain ICTVdB. Whatever the path, it is hoped that
a public domain shop window can be retained, responsive to significant
developments in virology, and accessible to all parties.
References
1. M.H.V. van Regenmortel et al. (eds), Virus Taxonomy. Classification and
Nomenclature of Viruses, Seventh Report of the International Committee on
Taxonomy. Academic Press, New York, San Diego, (1999).
2. J.G. Atherton, I.R. Holmes and E.H. Jobbins, ICTV Code for the Description
of Virus Characters. Monographs in Virology 14, (1983).
3. M.J. Dallwitz, T.A. Paine and E.J. Zurcher, User's Guide to the DELTA
System: a general system for processing taxonomic descriptions, CSIRO Division
of Entomology, Canberra, (1993).
4. M.J. Dallwitz, "A general system for coding taxonomic descriptions",
Taxon, 29, 41-46, (1980).
5. C. Büchen-Osmond, L. Blaine and M.C. Horzinek, "The universal virus
database of ICTV (ICTVdB)". In: Virus Taxonomy. Classification and
Nomenclature of Viruses, Seventh Report of the International Committee on
Taxonomy. M.H.V. van Regenmortel et al. (eds). Academic Press, New York,
San Diego, (1999).
6. C. Büchen-Osmond and M.J. Dallwitz, "Towards a universal virus database -
progress in the ICTVdB", Arch. Virol., 141, 392-399, (1996).
7. C. Büchen-Osmond, "Further progress in ICTVdB, a universal virus
database", Arch. Virol., 142, 1734-1739, (1997).
8. C.R. Pringle, "Virus taxonomy at the XIth International Congress of
Virology, Sydney, Australia 1999", Arch. Virol., 144, 2065-2070, (1999).
Cornelia Büchen-Osmond is a virologist who trained in electron microscopical
identification of viruses at the Hygiene Institut, Klinikum JW
Goethe-Universität, Frankfurt a. M., Germany. She was invited to develop the
universal virus database in 1992, and commenced this work in the Bioinformatics
Group at Research School of Biological Sciences, Australian National
University, Canberra, ACT, Australia. The project continues at Columbia
University, USA. The project has been supported by NSF grants through a
consultancy with the American Type Culture Collection, Manassas, VA, USA.
6 September, 2000. Last updated 18 April 2001