NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Journal Article Tag Suite Conference (JATS-Con) Proceedings 2010 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2010.


Aggregating E-Journals: Adopting the Journal Archiving and Interchange Tag Set to Build a Shared E-Journal Archive for Ontario

Wei Zhao and Vidhya Arvind.


Ontario Scholars Portal (SP) is a digital repository containing almost 20,000,000 articles from over 8,600 full-text journals from 16 publishers, covering every academic discipline. Starting in 2006, SP began adopting the NLM Journal Archiving and Interchange Tag Set v2.3 for its XML-based e-journals system built on MarkLogic. The publishers' native data is transformed to the NLM Tag Set in SP in order to normalize data elements to a single standard for archiving, display and searching. The data transformation is carried out in two steps: the crosswalk, done by the librarian, and the coding, done by the programmer. Although not all the elements from the Tag Set have been used, additional customized tags have been added to capture the loading information. Local rules and policies are applied to tag the documents in addition to those imposed by the NLM Tag Set. The challenges of using this Tag Set include deciding among multiple options for transforming the publishers' tags, developing our own schema for the journal and issue table-of-contents structures, and evaluating the consequences of moving all our content to version 3.0.

Introduction

Scholars Portal (SP) is a digital repository sponsored by the Ontario Council of University Libraries (OCUL). It contains almost 20,000,000 articles from over 8,600 full-text journals from 16 publishers, covering every academic discipline. The e-journal service is available to faculty and students of Ontario's 21 universities. The data provided by the publishers is in XML or SGML format, using different DTDs or schemas. The publishers' native data is transformed to the NLM Journal Archiving and Interchange Tag Set in SP in order to normalize data elements to a single standard for archiving, display and searching. When XML full text is available, it is also transformed and rendered to users as XHTML.

Background

Starting in 2006, the SP development team began planning a migration of the Scholars Portal e-journal repository from ScienceServer to a new XML-based database using MarkLogic. All data loaded into ScienceServer had been converted from the publishers' native data format to an internal XML format proprietary to ScienceServer. The ScienceServer DTD supports metadata records, issue tables of contents, journal tables of contents, and full-text articles. It worked relatively well for the local loading of simple journal articles; however, it had several shortcomings for more advanced uses: it lacked some important tags for describing conference information and reviewed-book information, it could not handle images and equations, and it had no standard process for keeping the DTD current with emerging practices and new needs. With these limitations in mind, we began a search for a comprehensive standard suitable for the new e-journal system.

First, we examined the various data formats we received from the publishers. SP ingests data from 16 publishers across academic disciplines. The data is submitted in SGML or XML format. No standard DTD appears to have achieved widespread acceptance within the journal publishing industry; each publisher develops its own DTD according to its publication scope and special needs. About 30% of our journals were from Elsevier. Elsevier implements journal article DTD version 5.0.1 (released March 2003) for publishing and exchanging its journal articles (Elsevier). The journal article DTD defines four top-level elements: article, simple-article, book-review and exam. A separate Serial Issue DTD is used for defining journal issues and book series. We found this DTD structure too complex to suit the data from our other publishers.

We then carried out an extensive literature review, compared various metadata standards, and analyzed the best practices of other e-journal archiving systems. Back in 2003, the Harvard University Library E-Journal Archiving Project concluded that “it would not be practical for an archive to manage heterogeneous content encoded by multiple DTDs” (Abrams, Rosenblum, 2003). Harvard commissioned a study from Inera to perform an in-depth analysis of ten DTDs and to make recommendations regarding any proposed archival DTD. The Journal Archiving and Interchange DTD was created by the National Center for Biotechnology Information (NCBI) of the National Library of Medicine (NLM) with the intent of providing a common format in which publishers and archives can exchange journal content (NLM). Among several candidates, we chose the Journal Archiving and Interchange Tag Set v2.3 as the standard for our new e-journal system. This DTD was chosen for the following features:

  • It was created with an attempt to capture nearly anything a publisher might have tagged, across all disciplines
  • It is flexible and customizable
  • It is very well documented and well maintained

The practice of adopting Journal Archiving and Interchange Tag Set v2.3

The materials loaded into the Scholars Portal e-journals archive include journal articles and conference proceedings with ISSNs. According to the agreements with the publishers, the metadata, cited references, full text, and supplementary materials are transformed to NLM DTD XML when they are available.

1. Work flow

The transformation from the publishers' native data to Scholars Portal NLM XML data is carried out in two steps: mapping and coding. Creating XML transformations as two separate tasks not only makes the most of the skills of the various team members, but also reduces development time and cost and increases the correctness of the finished code (Usdin, Piez, 2007). First, the mapping is created by a metadata librarian who possesses strong analytical skills, the ability to articulate complex relationships, and familiarity with both the publisher's data structure and the NLM XML data structure. The metadata librarian undertakes an intensive analysis of each publisher's source data format, working from the source DTD, schema and sample data, and then develops the crosswalk. The crosswalk includes the mapping of each path from source to target data and an explanation of the decisions and compromises made.

Second, a programmer develops the loader according to the crosswalk, using languages such as Perl and Java. A test environment is set up so that the transformations are tested before the data is loaded into production. The metadata librarian inspects the output, and the crosswalk can go through several iterations to make sure the data is transformed completely and explicitly. After the loader is deployed to the production system, DTD validation is enforced and the transformation of each dataset is logged to capture any errors. The log files are examined by a quality assurance person. Any dataset with errors is removed from production and reloaded after the problems have been fixed.
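The separation of mapping from coding described above can be illustrated with a minimal sketch in which the crosswalk is plain data and the loader is a small generic routine that applies it. This is not the SP loader (which is written in Perl and Java); the source format, the `CROSSWALK` entries and the function names are hypothetical, and the sketch does not enforce the element order that the DTD requires.

```python
import xml.etree.ElementTree as ET

# Hypothetical crosswalk: (path in an invented publisher format,
# path under the target <article>). Not an actual SP crosswalk.
CROSSWALK = [
    ("head/journal", "front/journal-meta/journal-title"),
    ("head/title", "front/article-meta/title-group/article-title"),
    ("head/first-page", "front/article-meta/fpage"),
]

def ensure_path(root, path):
    """Return the element at `path` under `root`, creating missing steps."""
    node = root
    for tag in path.split("/"):
        child = node.find(tag)
        if child is None:
            child = ET.SubElement(node, tag)
        node = child
    return node

def transform(source_xml):
    """Apply the crosswalk; return the target tree and the source paths not found."""
    source = ET.fromstring(source_xml)
    target = ET.Element("article")
    missing = []
    for src_path, dst_path in CROSSWALK:
        found = source.find(src_path)
        if found is None or not (found.text or "").strip():
            missing.append(src_path)  # a candidate entry for the loading log
            continue
        ensure_path(target, dst_path).text = found.text.strip()
    return target, missing

sample = "<doc><head><title>On Tagging</title><first-page>12</first-page></head></doc>"
article, missing = transform(sample)
print(ET.tostring(article, encoding="unicode"))
print("missing:", missing)
```

Keeping the mapping as data means the librarian's crosswalk can be revised across iterations without touching the routine that applies it.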

2. NLM DTD customization

Not all the elements from the NLM tag set are used in the SP e-journals archive. Comparing the list of distinct elements used in Scholars Portal with the NLM element list shows that a number of elements, such as <sig> and <front-stub>, are not being used. Most of the unused tags are content or format tags for the <body>. The NLM tag set is more comprehensive than even a large e-journal archive requires.

Although not all the elements are used in the SP e-journal system, customizations have been made to meet our special needs.

A set of <custom> elements was added for loading information

A set of <custom> elements has been added to capture the loading information, including dataset name, publisher, source metadata directory, loading date, citation count, download count, and publication date. The publication date is duplicated in a custom tag to work around an indexing shortfall of MarkLogic when more than one element with the same tag name exists. Figure 1 is an example of the <custom> tag set.

Fig. 1. <custom> tag set.

Attribute “display” in <body> tag

The attribute “display” has been added to <body>. <body display="NO"> indicates that the XML full text in <body> is used only for indexing, not for rendering the HTML display.
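A minimal sketch of how a rendering layer might honor this attribute is given below. The function names are hypothetical, and the sketch assumes that a missing attribute, or any value other than "NO", means the body may be displayed; the text does not state what other values are used.

```python
import xml.etree.ElementTree as ET

def body_for_display(article_xml):
    """Return the <body> element if it may be rendered as HTML, else None."""
    body = ET.fromstring(article_xml).find("body")
    if body is None:
        return None
    # Assumption: only display="NO" suppresses rendering.
    if body.get("display", "").upper() == "NO":
        return None
    return body

def body_for_index(article_xml):
    """Return the whitespace-normalized body text, regardless of the display flag."""
    body = ET.fromstring(article_xml).find("body")
    if body is None:
        return ""
    return " ".join("".join(body.itertext()).split())

hidden = '<article><body display="NO"><p>Index only.</p></body></article>'
shown = '<article><body><p>Shown text.</p></body></article>'
print(body_for_display(hidden) is None, body_for_index(hidden))
print(body_for_display(shown) is not None)
```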

pdf-size in <custom-meta> tag

Figure 2 shows the <custom-meta> element added to capture the PDF size information.

Fig. 2. <custom-meta>.

“other” and “misc” in “citation-type” attribute list

Two more values have been added to the list for the attribute “citation-type”. <citation citation-type="other"> covers all other citation types. <citation citation-type="misc"> indicates an unparsed citation. Figure 3 is an example of an unparsed citation.

Fig. 3. Unparsed citation.
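The rule for choosing between the two added values can be sketched as follows. This is an illustration only: `KNOWN_TYPES` is an assumed subset of citation types, and `build_citation` and its arguments are hypothetical names rather than part of the SP loader.

```python
import xml.etree.ElementTree as ET

# Assumed, illustrative subset of recognized citation types.
KNOWN_TYPES = {"journal", "book", "confproc", "thesis"}

def build_citation(source_type, fields, raw_text):
    """Build a <citation>; `fields` maps child tag names to parsed values."""
    citation = ET.Element("citation")
    if not fields:
        # Nothing could be parsed: keep the raw string and mark it "misc".
        citation.set("citation-type", "misc")
        citation.text = raw_text
        return citation
    # Parsed, but of a type outside the recognized list: mark it "other".
    ctype = source_type if source_type in KNOWN_TYPES else "other"
    citation.set("citation-type", ctype)
    for tag, value in fields.items():
        ET.SubElement(citation, tag).text = value
    return citation

parsed = build_citation("journal", {"source": "J. Test", "year": "2009"}, "")
unusual = build_citation("dataset", {"source": "Survey data"}, "")
unparsed = build_citation(None, {}, "Smith J. Some paper. 2001.")
for c in (parsed, unusual, unparsed):
    print(ET.tostring(c, encoding="unicode"))
```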

“non-latin” attribute was added

The attribute non_latin was added to <name>, <surname> and <given-names> to hold CJK and other non-Latin author names.
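One way a loader could decide when to populate this attribute is to test whether the publisher-supplied name contains letters outside the Latin script. The sketch below is an assumption about usage, not the SP implementation: the helper names are hypothetical, and it assumes the element text carries a Latin form while the attribute carries the original form.

```python
import unicodedata
import xml.etree.ElementTree as ET

def has_non_latin(text):
    """True if any alphabetic character is outside the Latin script."""
    for ch in text:
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            return True
    return False

def surname_element(latin_form, original_form):
    """Build <surname>, storing the original form in non_latin when needed."""
    el = ET.Element("surname")
    el.text = latin_form
    if has_non_latin(original_form):
        el.set("non_latin", original_form)  # assumed usage of the attribute
    return el

print(ET.tostring(surname_element("Wang", "王"), encoding="unicode"))
print(ET.tostring(surname_element("Müller", "Müller"), encoding="unicode"))
```

Accented Latin letters such as "ü" are treated as Latin here, so only names in other scripts trigger the attribute.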

The use of <elocation-id> and <object-id> for citation linking

For each article record, an <elocation-id> is generated automatically under <article-meta> from the author's last name, the publication year and the pagination, to enable cross-referencing from citations within the repository. Correspondingly, the <object-id> element is used within <citation> for each citation in the article, to cross-reference the journal articles within the repository. Figure 4 is an example of citation linking.

Fig. 4. Citation linking.
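The linking idea can be sketched as a key built from the same three ingredients on both sides, so that a citation resolves to an article when the keys match. The exact key format used in Scholars Portal is not given here; the underscore-separated format, the use of the first page as the pagination component, the normalization steps and the names `link_key` and `index` are all assumptions made for illustration.

```python
import re
import unicodedata

def link_key(surname, year, first_page):
    """Build a normalized key from author surname, year and first page."""
    # Fold accents and case so that "Müller" and "MULLER" produce the same key.
    folded = unicodedata.normalize("NFKD", surname)
    folded = "".join(ch for ch in folded if not unicodedata.combining(ch))
    folded = re.sub(r"[^a-z0-9]", "", folded.lower())
    return f"{folded}_{year}_{first_page}"

# Article side: the key would be stored in <elocation-id>.
index = {link_key("Abrams", "2003", "155"): "article-001"}  # hypothetical record id

# Citation side: the same key, stored in <object-id>, resolves to the article.
print(link_key("Müller", 2009, "155"))
print(index.get(link_key("ABRAMS", 2003, 155)))
```

A key of this kind can collide when two authors with the same surname publish articles starting on the same page in the same year, so a production scheme may need further components.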

3. Journal TOC structure

Although the primary intellectual value of e-journals rests at the item level, the issue-level structure is the key to navigating the journal issues. The NLM DTD defines only the item level, so we created the issue-level structure ourselves. All the issue-level textual content is encoded in a single XML file. Figure 5 is an example of the journal TOC XML.

Fig. 5. Journal TOC XML.

4. Additional tagging rules and policies

In addition to the rules imposed by the NLM Tag Set, we have local rules on how certain tags should be used and how certain documents should be tagged. The interpretation of the semantics of the NLM DTD, the best-practices documentation and the examples are maintained by the metadata librarian for Scholars Portal. The XSD version of the Tag Set is maintained by the lead programmer. All the documentation is posted on a wiki and shared among team members.

Challenges of adopting the Tag Set

  1. Decisions need to be made when there are multiple options for transforming a publisher's tag, and these decisions need to be followed consistently to support normalized searching. For example, the article copyright can be placed in <copyright-statement> directly under <article-meta> or within <permissions> under <article-meta>. Once the decision is made, the tagging rule and an example are recorded in the best-practices document.
  2. Normalization is not straightforward even when the source data is in the NLM DTD family. In theory, normalizing NLM DTD content should amount to not much more than an identity transformation (Meyer, et al., 2010). In practice, each publisher implements the NLM DTD according to its own understanding of the DTD and its distinctive business rules. For example, one publisher uses <subj-group subj-group-type="heading"> for the journal heading title, while another publisher uses <series-title> for the same purpose. One publisher uses OASIS tables although its documents declare the use of NLM DTD version 2.2. Most of the time, the varying usage and customizations are not well documented by the publisher. A large number of examples need to be examined carefully in order to understand the implicit semantics embedded in the publisher's DTD implementation.
  3. Customized attributes in a publisher's NLM DTD are not as obvious as customized elements, so they are not easy to capture in the transformation. Typically, added elements are wrapped in <custom-meta-wrap>, while customized attributes can be added to any element.
  4. We are still evaluating the consequences of moving all our content to version 3.0 of the tag set. We analyzed the changes in version 3.0, and the improvement is obvious: the tag set is more logical, internally consistent, and complete. However, since version 3.0 is not backward compatible and we already have almost 20,000,000 articles in NLM DTD version 2.3, the benefits and costs must be carefully examined before we choose to adopt version 3.0. We have just begun to receive data in version 3.0 from publishers, and it does not make sense to convert version 3.0 back to 2.3. Keeping both version 2.3 and version 3.0 content in our repository might therefore be a realistic solution.
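The third challenge, spotting customized attributes that a publisher has added to arbitrary elements, lends itself to a simple survey of sample files. The sketch below is one possible approach rather than the method used at Scholars Portal: `EXPECTED` is a tiny placeholder that would in practice be derived from the DTD, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Placeholder: element name -> attribute names the target DTD is assumed to allow.
EXPECTED = {
    "article": {"article-type"},
    "citation": {"citation-type", "id"},
}

def unexpected_attributes(xml_documents, expected):
    """Map each element name to the attribute names not listed in `expected`."""
    found = defaultdict(set)
    for doc in xml_documents:
        for el in ET.fromstring(doc).iter():
            extras = set(el.attrib) - expected.get(el.tag, set())
            if extras:
                found[el.tag] |= extras
    return dict(found)

samples = [
    '<article article-type="research"><body><p vendor-flag="x">Text</p></body></article>',
    '<article><back><citation citation-type="journal" rank="2"/></back></article>',
]
report = unexpected_attributes(samples, EXPECTED)
print(report)
```

Running such a survey over a large set of a publisher's files gives the metadata librarian a concrete list of attributes to account for in the crosswalk.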

Conclusion

After almost four years of experience with the NLM tag set in production, we are confident that we made the correct decision in choosing the NLM Tag Set as our standard. The Tag Set is comprehensive, “with an attempt to capture nearly anything a publisher might have tagged” (NLM). It is well documented, flexible and customizable. We are glad to see more and more publishers adopting the Tag Set, including Cambridge, Oxford, and T&F, which simplifies our data transformation processes.

Acknowledgements

This paper is a reflection of the principles and practices developed by the ejournals development team at OCUL Scholars Portal.

References

  1. Abrams Stephen L, Rosenblum Bruce. XML for e-journal archiving. OCLC Systems & Services. 2003; 19(4): 155-161.
  2. JA DTD 5.1.0 and CEP 1.1.5 complete [Internet]: Elsevier; [cited 2010 Sep 22]. Available from: http://www.info.sciverse.com/sciencedirect/implementation/implementing/dtds/.
  3. Morrissey Sheila, Meyer John, Bhattarai Sushil, Kurdikar Sachin, Ling Jie, Stoeffler Matthew, Thanneeru Umadevi. Portico: A Case Study in the Use of XML for the Long-Term Preservation of Digital Artifacts. Proceedings of the International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML. 2010 Aug.
  4. Archiving and Interchange Tag Set [Internet]: National Library of Medicine; [cited 2010 Sep 22]. Available from: http://dtd.nlm.nih.gov/archiving/index.html.
  5. Usdin Tommie, Piez Wendell. Separating Mapping from Coding in Transformation Tasks. Presented at: XML 2007; 2007 Dec 3-5; Boston, MA.

Copyright 2010 by Wei Zhao, Vidhya Arvind.

The copyright holder grants the U.S. National Library of Medicine permission to archive and post a copy of this paper on the Journal Article Tag Suite Conference proceedings website.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License

Bookshelf ID: NBK47085
