Ontario Scholars Portal (SP) is a digital repository containing almost 20,000,000
articles from more than 8,600 full-text journals from 16 publishers, covering every
academic discipline. Starting in 2006, SP began adopting the NLM Journal Archiving
and Interchange Tag Set v2.3 for its XML-based e-journals system built on
MarkLogic. The publishers' native data is transformed to the NLM Tag Set in SP in
order to normalize data elements to a single standard for archiving, display and
searching. The data transformation is carried out in two steps: the crosswalk, done
by the librarian, and the coding, done by the programmer. Although not all the
elements of the Tag Set are used, customized tags have been added to capture the
loading information. Local rules and policies are applied when tagging documents,
in addition to those imposed by the NLM Tag Set. The challenges of using this Tag
Set include deciding among multiple options for transforming a publisher's tags,
developing our own schema for the journal and issue table-of-contents structures,
and evaluating the consequences of moving all our content to version 3.0.
Scholars Portal (SP) is a digital repository sponsored by the Ontario Council of
University Libraries (OCUL). It contains almost 20,000,000 articles from more than
8,600 full-text journals from 16 publishers, covering every academic discipline. The
e-journal service is available to faculty and students of Ontario's 21 universities.
The data provided by the publishers is in XML or SGML format, using different DTDs or
schemas. The publishers’ native data is transformed to the NLM Journal Archiving
and Interchange Tag Set in SP in order to normalize data elements to a single standard
for archiving, display and searching. The XML full text, when available, is also
transformed and is rendered to users as XHTML.
Starting in 2006, the SP development team began planning for a migration of the
Scholars Portal e-journal repository from ScienceServer to a new XML-based database
using MarkLogic. All data loaded into ScienceServer had been converted from the
publishers’ native data format to an internal XML format proprietary to
ScienceServer. The ScienceServer DTD supports metadata records, issue
tables-of-content, journal tables-of-content, and full-text articles. It worked
relatively well for the local loading of simple journal articles; however, it had
several shortcomings for more advanced uses: it lacked important tags for describing
conference information and reviewed-book information, it could not handle images or
equations, and it had no standard process for keeping the DTD current with emerging
practices and new needs. With these limitations in
mind, we then began a search for a comprehensive standard suitable for the new
e-journal system.
First, we examined the various data formats we received from the publishers. SP
ingests data from 16 publishers across academic disciplines. The data is submitted in
SGML or XML format. No standard DTD appears to have achieved widespread acceptance
within the journal publishing industry. Each publisher develops its own DTD
according to its publication scope and special needs. About 30% of our journals
were from Elsevier. Elsevier uses its journal article DTD version 5.0.1 (released
March 2003) for publishing and exchanging journal articles (Elsevier). The journal
article DTD defines four top-level elements: article, simple-article, book-review and
exam. A separate Serial Issue DTD is used for defining journal issues and book
series. We found this DTD structure too complex to suit the data from our other
publishers.
We then carried out an extensive literature review, compared various metadata
standards, and analyzed the best practices of other e-journal archiving systems. Back
in 2003, the Harvard University Library E-Journal Archiving Project concluded “it
would not be practical for an archive to manage heterogeneous content encoded by
multiple DTDs” (Abrams, Rosenblum, 2003). Harvard commissioned a study from Inera to
perform an in-depth analysis of ten DTDs and make recommendations regarding any
proposed archival DTD. The Journal Archiving and Interchange DTD was created by the
National Center for Biotechnology Information (NCBI) of the National Library of
Medicine (NLM) with the intent of providing a common format in which publishers and
archives can exchange journal content (NLM). Among several candidates, we chose the
Journal Archiving and Interchange Tag Set v2.3 as the standard for our new e-journal
system. This DTD was chosen for the following features:
- The DTD was created with an attempt to capture nearly anything a publisher might
  have tagged, across all disciplines.
- It is flexible and customizable.
- It is very well documented and well maintained.
The practice of adopting Journal Archiving and Interchange Tag Set v2.3
The materials loaded in the Scholars Portal e-journals archive include journal
articles and conference proceedings with ISSNs. According to the agreements with the
publishers, the metadata, cited references, full text, and supplementary materials
are transformed to NLM DTD XML when they are available.
1. Work flow
The data transformation from the publishers’ native data to Scholars Portal NLM
XML data is processed in two steps: mapping and coding. Creating XML
transformations as two separate tasks not only makes the best use of the skills of
various team members, but also reduces development time and cost and increases
the correctness of the finished code (Usdin, Piez, 2007). First, the mapping is
created by a metadata librarian who possesses strong analytical skills, the ability
to articulate complex relationships, and familiarity with both the publisher’s
data structure and the NLM XML data structure. The metadata librarian undertakes
an intensive analysis of each publisher’s source data format, working from the
source DTD, schema and sample data, and then develops the crosswalk. The crosswalk
includes the mapping of each path from source to target data and an explanation
of the decisions and compromises made.
Second, a programmer develops the loader according to the crosswalk, using
programming languages such as Perl and Java. A test environment is set up so that
the transformations are tested before the data is loaded into production. The
metadata librarian inspects the output, and the crosswalk can go through several
iterations to make sure the data is transformed completely and explicitly. After
the loader is deployed to the production system, DTD validation is enforced and
the transformation of each dataset is logged for any errors. The log files are
examined by a quality assurance person. Any dataset with errors is removed from
production and reloaded after the problems have been fixed.
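The separation of mapping from coding can be illustrated with a minimal Python sketch. It is not the Scholars Portal loader (which was written in Perl and Java); it only shows the idea that the crosswalk is plain data a librarian can maintain, while a small generic function applies it. The source element names (`doc`, `jn`, `ti`, `pg`) are hypothetical and do not come from any real publisher DTD; the target paths are NLM v2.3 elements.

```python
import xml.etree.ElementTree as ET

# Crosswalk: (source path, target path relative to <article>).
# Source names are hypothetical. Entries are listed in the order the
# target elements should be created.
CROSSWALK = [
    ("jn", "front/journal-meta/journal-title"),
    ("ti", "front/article-meta/title-group/article-title"),
    ("pg", "front/article-meta/fpage"),
]


def ensure_path(root, path):
    """Return the element at `path` under `root`, creating missing steps."""
    node = root
    for step in path.split("/"):
        child = node.find(step)
        if child is None:
            child = ET.SubElement(node, step)
        node = child
    return node


def transform(source_xml):
    """Apply the crosswalk; return (article element, unmapped source tags)."""
    src = ET.fromstring(source_xml)
    article = ET.Element("article")
    for src_path, target_path in CROSSWALK:
        found = src.find(src_path)
        if found is not None and found.text:
            ensure_path(article, target_path).text = found.text.strip()
    mapped = {src_path for src_path, _ in CROSSWALK}
    # Top-level source tags with no crosswalk entry are reported, not dropped
    # silently, so the librarian can review them in the next iteration.
    unmapped = sorted({child.tag for child in src} - mapped)
    return article, unmapped
```

Reporting unmapped source tags mirrors the iterative review described above: each reported tag is a candidate for a new crosswalk entry or a documented decision to ignore it. The sketch does not perform DTD validation, which would be a separate step.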
2. NLM DTD customization
Not all the elements of the NLM Tag Set are used in the SP e-journals archive.
Comparing the list of distinct elements in Scholars Portal with the NLM element
list shows that a number of elements, such as <sig> and <front-stub>, are not
used. Most of the unused tags are content or format tags for the <body>. The NLM
Tag Set is more comprehensive than even a large e-journal archive requires.
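A comparison of this kind amounts to a set difference. The following Python sketch is an illustration only: the function names are ours, and `TAG_SET_SUBSET` is a small illustrative subset of element names, not the full NLM element list.

```python
import xml.etree.ElementTree as ET

# Illustrative subset only; the real Tag Set defines far more elements.
TAG_SET_SUBSET = {"article", "front", "body", "back", "sig", "front-stub", "p"}


def distinct_elements(xml_documents):
    """Collect every element name that occurs in the given XML strings."""
    names = set()
    for doc in xml_documents:
        for element in ET.fromstring(doc).iter():
            names.add(element.tag)
    return names


def unused_elements(xml_documents, tag_set=TAG_SET_SUBSET):
    """Return the tag-set elements that never occur in the documents."""
    return sorted(tag_set - distinct_elements(xml_documents))
```

Run over a whole repository, the same difference taken in the other direction (elements found minus the tag set) would list local additions.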
Although not all the elements are used in the SP e-journal system,
customizations have been made to meet our special needs:
A set of <custom> elements was added for loading information
A set of <custom> elements has been added to capture the loading
information, including dataset-name, publisher, source metadata directory,
loading date, the number of times cited, the number of downloads, and
publication date. Publication date is duplicated in a custom tag to work
around an indexing shortfall in MarkLogic when more than one element with the
same tag name exists.
[Figure: example of the <custom> tag set]
Attribute “display” in <body> tag
Attribute “display” has been added to <body>. <body display="NO">
indicates that the XML full text in <body> is used only for indexing, not
for rendering the HTML display.
pdf-size in <custom-meta> tag
[Figure: the <custom-meta> added to capture the PDF size information]
“other” and “misc” in “citation-type” attribute list
Two more values have been added to the list for attribute “citation-type”.
<citation citation-type="other"> indicates all other citation types.
<citation citation-type="misc"> indicates an unparsed citation.
[Figure: example of an unparsed citation]
“non-latin” attribute was added
Attribute non_latin was added to <name>, <surname>, and
<given-names> to hold CJK and other non-Latin author names.
The use of <elocation-id> and <object-id> for citation
linking
For each article record, an <elocation-id> is generated automatically under
<article-meta> from the author’s last name, the publication year and the
pagination, so that citations within the repository can be cross-referenced to
it. Meanwhile, the <object-id> element is used within <citation> for each
citation in the article to cross-reference the journal articles within the
repository.
[Figure: example of citation linking]
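The key-based linking idea can be sketched in Python as follows. The key format shown (lower-cased ASCII surname, year and first page joined by hyphens) is an assumption made for illustration; the actual format used in Scholars Portal is not given here. The function names and the dictionary layout of `articles` are likewise hypothetical.

```python
import re
import unicodedata


def make_link_key(surname, year, first_page):
    """Build a matching key from surname, year and first page.

    The format is an assumption for this sketch, not the SP format.
    """
    # Strip accents and punctuation so "Müller" and "Muller" give one key.
    decomposed = unicodedata.normalize("NFKD", surname)
    ascii_name = decomposed.encode("ascii", "ignore").decode("ascii")
    name = re.sub(r"[^a-z]", "", ascii_name.lower())
    return f"{name}-{year}-{first_page}"


def resolve_citations(articles, citation_keys):
    """Map each citation key to a repository article id, or None if no match."""
    index = {
        make_link_key(a["surname"], a["year"], a["fpage"]): a["id"]
        for a in articles
    }
    return {key: index.get(key) for key in citation_keys}
```

In this picture the article-side key plays the role of the generated <elocation-id> and the citation-side key the role of <object-id>; a citation links to an article when the two keys are equal. A key built from so few fields can collide, so a production system would need a policy for ambiguous matches.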
3. Journal TOC structure
Although the primary intellectual value of e-journals rests at the item level,
the issue-level structure is the key to navigating around the journal issues. The
NLM DTD defines only the item level, so we created the issue-level structure
ourselves. All the issue-level textual content is encoded in a single XML document.
[Figure: example of journal TOC XML]
4. Additional tagging rules and policies
In addition to the rules imposed by the NLM Tag Set, we have local rules on how
certain tags should be used and how certain documents should be tagged. The
interpretation of the semantics of the NLM DTD, the best practices documentation
and examples are maintained by the metadata librarian for Scholars Portal. The XSD
version of the Tag Set is maintained by the lead programmer. All the
documentation is posted on a wiki to share among team members.
Challenges of adopting the Tag Set
Decisions need to be made when there are multiple options for transforming a
publisher’s tags, and these decisions need to be followed consistently to support
normalized searching. For example, the article copyright can be placed in
<copyright-statement> directly under <article-meta> or under
<article-meta>/<permissions>. Once the decision is made, the
tagging rule and an example are recorded in the best practice document.
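A tagging rule of this kind can be enforced mechanically once it is written down. The Python sketch below assumes, purely for illustration, that the chosen location is the <permissions> wrapper; which option Scholars Portal actually chose is not stated here. The function name is ours.

```python
import xml.etree.ElementTree as ET


def normalize_copyright(article_meta):
    """Move a <copyright-statement> found directly under <article-meta>
    into <permissions>, creating <permissions> if it is missing."""
    direct = article_meta.find("copyright-statement")
    if direct is None:
        return article_meta  # nothing to move
    permissions = article_meta.find("permissions")
    if permissions is None:
        # Appended at the end for simplicity; a real loader would place it
        # where the DTD content model requires.
        permissions = ET.SubElement(article_meta, "permissions")
    article_meta.remove(direct)
    permissions.insert(0, direct)
    return article_meta
```

With every record normalized to one location, a search or display layer needs only a single path to find the copyright statement.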
Normalization is not straightforward even when the source data is in the NLM DTD
family. In theory, normalizing NLM DTD content should amount to not much
more than an identity transformation (Morrissey et al., 2010). In
practice, each publisher implements the NLM DTD according to its own understanding
of the DTD and its distinctive business rules. For example, one publisher uses
<subj-group subj-group-type="heading"> for the journal heading title,
while another publisher uses <series-title> for the same purpose. One
publisher uses the OASIS table model although the document declares the use of NLM
DTD version 2.2. Most of the time, the varying usage and customization are not well
documented by the publisher. A large number of examples need to be examined
carefully in order to understand the implicit semantics embedded in the
publisher’s DTD implementation.
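The heading example can be handled by trying each known location in turn. This is a minimal Python sketch under the assumption that both elements sit inside <article-categories> under <article-meta>; the list of paths would grow as more publisher variants are discovered. The names `HEADING_PATHS` and `journal_heading` are ours.

```python
import xml.etree.ElementTree as ET

# Candidate locations, tried in order; both are relative to <article-meta>.
HEADING_PATHS = [
    "article-categories/subj-group[@subj-group-type='heading']/subject",
    "article-categories/series-title",
]


def journal_heading(article_meta):
    """Return the journal heading title from whichever location is used."""
    for path in HEADING_PATHS:
        text = article_meta.findtext(path)
        if text and text.strip():
            return text.strip()
    return None
```

Keeping the variants in a list rather than in branching code makes each newly discovered publisher convention a one-line addition.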
The customized attributes in a publisher’s NLM DTD are not as obvious as
customized elements, so they are not easy to capture in the transformation.
Typically, added elements are wrapped in <custom-meta-wrap>, while
customized attributes can be added to any element.
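One way to surface such attributes is to audit sample data against a list of expected attributes per element. The Python sketch below is an assumption-laden illustration: `EXPECTED_ATTRIBUTES` is a tiny illustrative subset rather than the real attribute lists, and the attributes `pub-status` and `checked` used in the usage note are invented to stand in for publisher customizations.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Attributes expected per element (illustrative subset only).
EXPECTED_ATTRIBUTES = {
    "article": {"article-type", "dtd-version"},
    "citation": {"citation-type", "id"},
    "contrib": {"contrib-type", "id"},
}


def unexpected_attributes(xml_text, expected=EXPECTED_ATTRIBUTES):
    """Return {element name: sorted attribute names} not in the expected list."""
    found = defaultdict(set)
    for element in ET.fromstring(xml_text).iter():
        extra = set(element.attrib) - expected.get(element.tag, set())
        if extra:
            found[element.tag].update(extra)
    return {tag: sorted(names) for tag, names in found.items()}
```

For a sample such as `<article dtd-version='2.3' pub-status='final'><citation citation-type='journal' checked='y'/></article>`, the audit reports `pub-status` on `article` and `checked` on `citation`, which can then be raised with the publisher or added to the crosswalk.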
We are still evaluating the consequences of moving all our content to version
3.0 of the Tag Set. We analyzed the changes in version 3.0, and the improvement
is obvious: the Tag Set is more logical, internally consistent, and complete.
However, since version 3.0 is not backward compatible and we already have almost
20,000,000 articles in NLM DTD version 2.3, the benefits and costs must be
carefully examined before we choose to integrate version 3.0. We have just begun
to receive data in version 3.0 from publishers, and it does not make sense to
convert version 3.0 back to 2.3. Keeping both version 2.3 and version 3.0 content
in our repository might therefore be a realistic solution.
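If both versions coexist, downstream code needs to know which version a record follows. A minimal Python sketch, assuming each record carries the `dtd-version` attribute on its root <article> element (the attribute is optional, so records without it are grouped as "unknown"); the function names are ours:

```python
import xml.etree.ElementTree as ET


def tag_set_version(article_xml):
    """Read dtd-version from the root <article>, or None if it is absent."""
    return ET.fromstring(article_xml).get("dtd-version")


def partition_by_version(documents):
    """Group XML strings by declared Tag Set version."""
    groups = {}
    for doc in documents:
        groups.setdefault(tag_set_version(doc) or "unknown", []).append(doc)
    return groups
```

Display and indexing code could then select version-specific paths per group instead of converting one version into the other.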
After almost four years of experience using the NLM Tag Set in production, we are
confident that we made the correct decision when choosing the NLM Tag Set as our
standard. The Tag Set is comprehensive “with an attempt to capture nearly anything a
publisher might have tagged” (NLM). It is well documented, flexible and
customizable. We are glad to see more and more publishers adopting the Tag Set,
including Cambridge, Oxford, and T&F, simplifying our data transformation
processes.
This paper is a reflection of the principles and practices developed by the ejournals
development team at OCUL Scholars Portal.
- Abrams Stephen L, Rosenblum Bruce. XML for e-journal archiving. OCLC Systems &
  Services. 2003; 19(4): 155-161.
- JA DTD 5.1.0 and CEP 1.1.5 complete [Internet]: Elsevier; [cited 2010 Sep 22].
  Available from:
  http://www.info.sciverse.com/sciencedirect/implementation/implementing/dtds/.
- Morrissey Sheila, Meyer John, Bhattarai Sushil, Kurdikar Sachin, Ling Jie,
  Stoeffler Matthew, Thanneeru Umadevi. Portico: A Case Study in the Use of XML for
  the Long-Term Preservation of Digital Artifacts. Proceedings of the International
  Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML.
  2010 Aug.
- Archiving and Interchange Tag Set [Internet]: National Library of Medicine;
  [cited 2010 Sep 22]. Available from: http://dtd.nlm.nih.gov/archiving/index.html.
- Usdin Tommie, Piez Wendell. Separating Mapping from Coding in Transformation
  Tasks. Presented at: XML 2007; 2007 Dec 3-5; Boston, MA.