NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Journal Article Tag Suite Conference (JATS-Con) Proceedings 2013/2014 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2013.

mPach: Integrated Publishing and Archiving of Journals in HathiTrust

mPach is a package of tools being developed to provide a modular platform to enable the publication of born-digital open-access journals in HathiTrust, a digital library created by a partnership of major research institutions. One of the chief technological challenges in creating such a toolkit is enabling the conversion of edited manuscripts to JATS, which was chosen as the preservation-quality format. This paper provides a technical overview of the mPach platform, with special attention paid to the design and functionality of Norm, a tool being developed to convert Microsoft Word documents to JATS.

Motivation and Design Considerations

HathiTrust, a partnership of major research institutions and libraries, aims "to ensure that the cultural record is preserved and accessible long into the future." [1] Its digital library currently archives and provides access to digitized library holdings. However, research libraries are increasingly involved not just in making their collections available but also in publishing new scholarly literature [2], so HathiTrust would be a natural place to archive and provide access to born-digital content as well, ensuring its long-term preservation and discoverability.

Michigan Publishing, the University of Michigan's scholarly publishing operation, is based in the University Library and has long used a system called DLXS as its primary publishing platform. An extensible system with a large amount of code shared between publications, DLXS has been essential in allowing the Library to publish at scale; however, as Michigan Publishing grows and attempts to achieve even greater scalability, additional needs have emerged that cannot be satisfied within the architecture of DLXS. Since there is growing interest in shared infrastructure and operations for scholarly publishing among both libraries and university presses, the Library, with its close connection to HathiTrust, has seized this opportunity to develop a new platform, called mPach, to meet its own needs and to provide shared infrastructure for use by other institutions.

A modular platform for publishing born-digital open-access journals in the repository, mPach contains a set of modifications to the existing HathiTrust code base, plus entirely new components, that facilitate ingest, display, and discoverability of journal literature in HathiTrust; all of these components are tightly coupled with the repository. mPach provides all of the tools needed to publish an open-access journal online, and it is designed to allow integration with popular journal publishing tools such as Open Journal Systems.

Of primary importance in the design of a system that both publishes and archives content is the inevitable tension between the needs of publishers, which require the flexibility to innovate, and the needs of the archive, where rapid change must yield to the demands of long-term preservation and access. To guarantee preservation of the final version of all published content, a primary design principle for mPach is that "[t]he version of the content in HathiTrust must always be the single authoritative version," and "any revisions to the content must be made to the authoritative copy in the HathiTrust repository." [3] So while mPach strives to create a workspace that allows publishers to easily manage and update their content, it also enforces a requirement that HathiTrust always contains the final version of the article.

An Overview of mPach

There are three major parts of mPach, each of which includes components in various stages of development at the time of writing:

  • the peer review and editorial system: provides tools for authors, reviewers and editors
  • Prepper: prepares the article for publication and archiving in HathiTrust
  • modified HathiTrust components: provides support for born-digital journal content

Fig. 1. Major parts of mPach.

As a modular system, mPach could be used with any peer review and editorial system that is capable of interacting with Prepper; however, the developers have chosen to provide OJS as the default option. Despite the lack of support for digital preservation, OJS is already widely used for library-based journal publishing, and mPach's integration with this software will allow for a smooth transition of journals already published using OJS into the HathiTrust repository. Integration with mPach requires that manuscripts that reach the "layout" stage in OJS be sent to Prepper, which prepares the HathiTrust Submission Information Package (SIP).

Prepper, a Ruby on Rails application, provides a user interface for the editor of a journal: a dashboard for administering the journal and putting manuscripts through a production process—akin to composition and typesetting—that prepares all content according to the preservation standard developed for mPach content in HathiTrust. Prepper invokes Norm, a Python application developed to convert manuscripts from Office Open XML ("DOCX") format [4] to JATS. DOCX is the default option because, like OJS, it is widely used in the editorial process of journals published by libraries. The Prepper interface also guides the staff member through a review of validation errors detected by Norm's conversion, uploading high-resolution figures, supplying "alt text" for figures, previewing the article as rendered using the default stylesheet (based on the Preview XSLT stylesheets [5]), uploading supplementary material [6], and submitting for ingest into HathiTrust.

Fig. 2. Early version of the mPach Prepper interface. In this image, the article metadata is displayed just after Norm completes the conversion of the submitted article from DOCX to JATS.

Fig. 3. Using Prepper to upload an article and submitting the article to HathiTrust.

The images are numbered as follows: 1) Select file to upload. 2) Verify that conversion by Norm is correct, and edit certain bibliographic details. 3) Annotate media and provide archival copies. 4) Review HathiTrust Pageturner preview. 5) Add any supplemental material. 6) Review article submission. 7) Receive article submission confirmation. 8) View submitted content from the journal home page. Note that these are early versions of the interface and are subject to change.

mPach requires a number of significant modifications to HathiTrust components and workflows originally designed to support digitized print materials. The reading interface in HathiTrust, which previously supported only the rendering of digitized page images, now renders JATS XML as HTML, allows a user to download a dynamically generated PDF or EPUB, displays metadata specific to articles (Figure 4), and links to a special "collection" for the journal in HathiTrust's Collections application [7] that allows browsing volumes and issues of the journal (Figure 5).

Fig. 4. Prototype of an article in HathiTrust's user interface.

Fig. 5. Prototype of a journal in HathiTrust's user interface.

In HathiTrust, MARC records provide bibliographic metadata like title and author for every item in the repository, which enables discovery by browsing or searching. For mPach, each article has its own analytic catalog record, tied to a monographic record for the journal as a whole. Finally, the HathiTrust Data API [8] allows for the content of each article to be retrieved for use outside of the native HathiTrust interface.

Note that HathiTrust restricts access to content only for legal reasons, not because a rights holder wishes to restrict access. Therefore, mPach supports only the publication of open-access journals.

Workflow

In the typical workflow for publishing a journal using mPach, a journal editor uses OJS to manage submissions, peer review, and the editing process. Once an article reaches the "layout" stage (where a combination of composition and typesetting allows the article to be formatted in a consistent way), the journal editor formats the article using a predefined list of styles provided by mPach in Microsoft Word; these styles are used to flag specific content within the article (e.g., the title, authors, institutions, etc.), which provides the necessary semantics for Norm to transform the document into JATS.

After the styles are applied to the Word document, the editor submits the article to Prepper, which guides the editor through conversion to JATS XML (and validation of the result), preparation of the submission information package (SIP), and submission for ingest into HathiTrust. Prepper tracks versions of articles so that revised content can be resubmitted. Currently, the ingest process overwrites any previous version of an item with the same identifier; however, HathiTrust will provide support for versioned submissions in the future.

JATS and the Publishing Tag Set

University of Michigan Library staff researched various formats for publishing and archiving born-digital article content. JATS was selected because the publishing industry is increasingly coalescing around this open, non-proprietary standard, which is suitable for representing the structure and semantics of journal articles in order to both preserve them and render the content in various output formats. Although archiving is one of the primary goals for mPach and HathiTrust, the Archiving and Interchange Tag Set ("green") is not an appropriate choice because mPach defines the structure of the content and hence does not need to represent information about the appearance of the source document. Furthermore, the Archiving and Interchange Tag Set provides flexibility in tagging that would complicate the logic in the stylesheets used to render the content in HathiTrust. And while the Authoring ("orange") tag set provides the right tags to represent the content submitted by mPach, it lacks some necessary metadata appearing in the front section of the articles. The Publishing ("blue") tag set was selected because it provides the same benefits as the Authoring tag set as well as the necessary metadata.

Norm: An Application for Converting DOCX Files to JATS XML

Norm is the component of mPach responsible for transforming an author's DOCX article into XML conforming to the Journal Publishing Tag Set. It is a command-line Python application whose input is a DOCX file and output is JATS XML, plus any embedded content such as images.

Norm parses the XML content of the Word document, mapping various Word styles to the appropriate JATS elements. Norm then generates the JATS document object model (DOM) using rules that specify how elements are nested as well as providing cardinality constraints, similar to the validation rules provided by technologies such as Document Type Definitions (DTD) and XML Schema. (These style-element mappings, as well as the rules for generating the JATS DOM, are represented in Norm's configuration files and are hence customizable.)

A key requirement of Norm is that the Word document must use specific paragraph styles specified in the configuration file. Norm comes with a default configuration file containing correspondences between JATS element names and Word styles for components of the article such as the title, author first name, author last name, abstract, etc. (Users can edit this file to define their own custom styles.) During conversion to JATS, Norm determines the appropriate JATS element for any particular content by its associated styles; hence, accurate styling is essential, and in some cases incorrectly styled documents will be rejected by Norm.
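The style lookup and rejection behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Norm's actual code: the names `STYLE_TO_ELEMENT`, `UnknownStyleError`, `element_for_style`, and `unknown_styles` are hypothetical, and only the `AuthorSurname` style name comes from the paper.

```python
# Hypothetical mapping from Word paragraph style names to JATS element names.
# Only AuthorSurname -> surname appears in the paper; the rest are illustrative.
STYLE_TO_ELEMENT = {
    "ArticleTitle": "article-title",
    "AuthorGivenName": "given-names",
    "AuthorSurname": "surname",
    "Abstract": "abstract",
}


class UnknownStyleError(ValueError):
    """Raised when a paragraph uses a style with no configured JATS element."""


def element_for_style(style_name, mapping=STYLE_TO_ELEMENT):
    """Return the JATS element configured for a Word style, or reject it."""
    try:
        return mapping[style_name]
    except KeyError:
        raise UnknownStyleError(
            f"no JATS element configured for style {style_name!r}"
        ) from None


def unknown_styles(used_styles, mapping=STYLE_TO_ELEMENT):
    """List the styles used in a document that the configuration does not cover."""
    return sorted(set(used_styles) - set(mapping))
```

A caller could run `unknown_styles` over every style used in a document before conversion and report all problems at once, rather than failing on the first unmapped paragraph.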

Conceptually, the idea of using Word's built-in styles to differentiate elements for transformation is not new. Inera's eXtyles product suite is an example of software that uses a similar strategy for documents, and also provides support for exporting content in JATS. At this point, Norm has fewer features than eXtyles; however, it will be made available as open-source software along with the rest of mPach, with plans to solicit feedback and feature suggestions from the community of users and to incorporate them.

Fig. 6. A Microsoft Word document using paragraph styles designed for use with Norm.

A Technical Overview of Norm

The following algorithm demonstrates how Norm performs the transformation:

Box 1. Overview of the algorithm used by Norm for transforming a Word document to JATS. Note that the Word Document Object Model (DOM) tree is relatively flat compared to the JATS DOM tree, and is much less structured; hence, the configuration plays an essential (more...)

The first step in the algorithm transforms the data in the Word document into an ordered list of styled content. Norm represents this content internally as a list of tuples (one tuple per styled element) with the following format:

    [ (JATS element, Content, Word style), … ]

Here, Content is the textual content of the Word element, along with any inline styles.
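As a rough illustration of this first step, the sketch below reads paragraphs from a WordprocessingML fragment and emits tuples in the format shown above. It is not Norm's implementation: `styled_content`, `STYLE_TO_ELEMENT`, `SAMPLE`, and the `ArticleTitle` style are hypothetical, inline styles are ignored, and unstyled paragraphs are simply skipped.

```python
import xml.etree.ElementTree as ET

# Namespace used by the main document part (word/document.xml) of a DOCX file.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical style map; only AuthorSurname -> surname comes from the paper.
STYLE_TO_ELEMENT = {"ArticleTitle": "article-title", "AuthorSurname": "surname"}

# A tiny stand-in for the contents of word/document.xml.
SAMPLE = f"""<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:pPr><w:pStyle w:val="ArticleTitle"/></w:pPr>
      <w:r><w:t>mPach: Integrated Publishing</w:t></w:r>
    </w:p>
    <w:p>
      <w:pPr><w:pStyle w:val="AuthorSurname"/></w:pPr>
      <w:r><w:t>Doe</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def styled_content(document_xml, mapping):
    """Return [(JATS element, content, Word style), ...] in document order."""
    root = ET.fromstring(document_xml)
    result = []
    for para in root.iterfind("w:body/w:p", NS):
        style = para.find("w:pPr/w:pStyle", NS)
        if style is None:
            continue  # unstyled paragraphs are skipped in this sketch
        # w:val holds the style ID, which may differ from the display name.
        style_id = style.get(f"{{{W}}}val")
        # Concatenate the text of all runs; inline formatting is dropped here.
        text = "".join(t.text or "" for t in para.iterfind(".//w:t", NS))
        result.append((mapping.get(style_id), text, style_id))
    return result
```

Calling `styled_content(SAMPLE, STYLE_TO_ELEMENT)` yields one tuple per styled paragraph, in document order; a style missing from the map produces `None` as the element name.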

Box 2. Representation of sample content. The internal representation of a sample content fragment for an article title, containing the JATS element name, the content and the Word style name, respectively. Note that this tuple may be embedded as content within (more...)

Although Prepper calls Norm on behalf of the user, here is a sample command-line usage:

    $ python norm.py -w article.docx -o /location/of/output/

Norm provides various command-line options, which are described in the following table:

Table 1. Norm command-line options.

    Short  Long            Purpose
    -h     --help          Show help message and exit.
    -a     --archive_name  Set a custom name for the zip archive created by this script.
    -c     --cfg           Specify the location of the config file that maps Word styles to JATS elements. A default will be used if none is provided.
    -o     --out           Set the directory for Norm's output. An XML file and a zip file containing the document assets will be created, using the name of the document.
    -v     --verbose       Enable verbose output, which is easier to debug.
    -w     --word          Specify the location of the Word document for Norm to process.
    -V     --version       Get the version number of Norm.
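For readers who want to experiment with a similar interface, the options in the table could be declared with Python's standard `argparse` module roughly as follows. This is a sketch only: the paper does not say how Norm parses its arguments, the function name `build_parser` is hypothetical, the version string is a placeholder, and whether any option is mandatory is not stated in the source, so none is marked as required here.

```python
import argparse


def build_parser():
    """Declare the options from Table 1 (argparse adds -h/--help itself)."""
    parser = argparse.ArgumentParser(
        prog="norm.py",
        description="Convert a DOCX file to JATS XML (illustrative sketch).",
    )
    parser.add_argument("-a", "--archive_name",
                        help="custom name for the zip archive")
    parser.add_argument("-c", "--cfg",
                        help="config file mapping Word styles to JATS elements")
    parser.add_argument("-o", "--out",
                        help="directory for the XML and zip output")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="enable verbose output")
    parser.add_argument("-w", "--word",
                        help="location of the Word document to process")
    # Placeholder version string; the real version number is not in the paper.
    parser.add_argument("-V", "--version", action="version",
                        version="%(prog)s 0.0 (placeholder)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With this parser, the sample invocation shown earlier (`-w article.docx -o /location/of/output/`) populates `args.word` and `args.out` and leaves `args.cfg` as `None`, at which point a default configuration would be chosen.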

As previously discussed, Norm's configuration files drive the transformation logic, making Norm highly extensible and customizable. Norm's configuration files include:

  • Mappings from Word styles to JATS elements
  • The major section (article-meta, body, or back) in which a particular JATS element appears
  • Which and how many children a particular element can have
  • Attributes permitted by a particular element

The following is an excerpt of lines from a configuration relevant to the surname element, corresponding to the Word style 'AuthorSurname':

    [ FRONT ]
    AuthorSurname = surname

    [ FRONT-PARENTS ]
    surname = name
    name = contrib
    contrib = contrib-group
    contrib-group = article-meta
    article-meta = front

    [ CHILDRENLIMITS ]
    surname = 1
    name = 1
    contrib = Yes

    [ ATTRIBUTES ]
    AuthorSurname = contrib-type,author

The JATS XML tree is recursively defined by the configuration. In the above example, note that each element in the FRONT-PARENTS section has a defined parent. Norm uses this information about the parents to generate the front matter, resulting in the following XML hierarchy:

    <front>
      <article-meta>
        <contrib-group>
          <contrib contrib-type="author">
            <name>
              <surname></surname>
            </name>
          </contrib>
        </contrib-group>
      </article-meta>
    </front>
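The way a parent table can drive construction of this hierarchy is sketched below using Python's standard `configparser` and `xml.etree.ElementTree` modules. This is an illustration under assumptions, not Norm's code: the paper does not say which parser Norm uses, the function names are hypothetical, the CHILDRENLIMITS section is omitted, and the `ATTRIBUTE_OWNER` lookup (which decides that `contrib-type` belongs on `contrib`) is an invented stand-in for whatever rule Norm applies.

```python
import configparser
import xml.etree.ElementTree as ET

# Excerpt from the paper's configuration example (CHILDRENLIMITS omitted).
CONFIG_TEXT = """
[ FRONT ]
AuthorSurname = surname

[ FRONT-PARENTS ]
surname = name
name = contrib
contrib = contrib-group
contrib-group = article-meta
article-meta = front

[ ATTRIBUTES ]
AuthorSurname = contrib-type,author
"""

# Assumption: which element carries each attribute is not given in the excerpt.
ATTRIBUTE_OWNER = {"contrib-type": "contrib"}


def load_sections(text):
    """Parse the config into {section name: {key: value}}, trimming names."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep keys such as "AuthorSurname" case-sensitive
    parser.read_string(text)
    return {name.strip(): dict(parser[name]) for name in parser.sections()}


def ancestor_chain(element, parents):
    """Follow parent links upward and return [root, ..., element]."""
    chain = [element]
    while chain[-1] in parents:
        parent = parents[chain[-1]]
        if parent in chain:
            raise ValueError(f"cycle in parent configuration at {parent!r}")
        chain.append(parent)
    chain.reverse()
    return chain


def build_front(style, sections):
    """Build the nested elements for one Word style, outermost first."""
    chain = ancestor_chain(sections["FRONT"][style], sections["FRONT-PARENTS"])
    attr_name, attr_value = sections["ATTRIBUTES"][style].split(",", 1)
    root = node = None
    for tag in chain:
        node = ET.Element(tag) if node is None else ET.SubElement(node, tag)
        if root is None:
            root = node
        if tag == ATTRIBUTE_OWNER.get(attr_name):
            node.set(attr_name, attr_value)
    return root
```

Calling `build_front("AuthorSurname", load_sections(CONFIG_TEXT))` returns a `front` element whose descendants follow the chain `article-meta`, `contrib-group`, `contrib`, `name`, `surname`. A full converter would also need to merge chains that share ancestors instead of creating a new `front` for every style.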

While XSLT is a natural choice for many XML transformations, we developed Norm using a scripting language (Python) because we needed the ability to extract content embedded in Word documents, such as images and video, which XSLT alone cannot do. Because of this requirement, and the fact that we had in-house developer expertise in Python, the choice was made to keep the entire code base confined to a single language for simplicity's sake.
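To make the extraction requirement concrete: a DOCX file is a zip archive, and Word conventionally stores embedded images and other media under `word/media/`. The sketch below reads those parts with Python's standard `zipfile` module. It is an assumption-laden illustration, not Norm's code: the function names are hypothetical, and `make_sample_docx` builds a tiny stand-in archive rather than a valid Word document.

```python
import io
import zipfile

# Conventional location of embedded media inside a DOCX package.
MEDIA_PREFIX = "word/media/"


def read_media(docx):
    """Return {part name: bytes} for every embedded media part in a DOCX."""
    with zipfile.ZipFile(docx) as archive:
        return {
            name: archive.read(name)
            for name in sorted(archive.namelist())
            if name.startswith(MEDIA_PREFIX) and not name.endswith("/")
        }


def make_sample_docx():
    """Build an in-memory zip that mimics the layout of a DOCX (not a valid one)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<placeholder/>")
        archive.writestr("word/media/image1.png", b"not a real PNG")
    buffer.seek(0)
    return buffer
```

`read_media` accepts either a file path or a file-like object, so `read_media("article.docx")` would work the same way on a real document.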

Future Plans for Norm and other mPach Components

While Norm can already transform Word documents to JATS XML, we have begun developing support for the transformation of the OpenDocument ("ODF") format (used primarily by Apache OpenOffice and LibreOffice). Norm is designed to be extensible enough to allow the transformation of other structured document types, such as LaTeX, to JATS. At this time there are no plans to transform content from PDF, though there is promising work in this area (such as Merops, pdf2xml, pdfx, LA-PDFText, and GROBID) that might provide the foundation for future support.

Norm, as well as Prepper and other mPach components, will be released as open-source software in the near future. Tentatively, the source will be available as a GitHub repository under an Apache 2.0 license, though details have not been finalized.

References

1. Welcome to the Shared Digital Future. HathiTrust Digital Library. http://www.hathitrust.org/about.
2. See resources aggregated by the Library Publishing Coalition.
3. Design Principles and Requirements. mPach. http://www.lib.umich.edu/mpach/design-principles-and-requirements.
4. Office Open XML. Wikipedia. http://en.wikipedia.org/wiki/Office_Open_XML.
5. NISO Journal Article Tag Set (JATS) Version 1.0: Preview XSLT Stylesheets. https://github.com/NCBITools/JATSPreviewStylesheets.
6. Recommended Practices for Online Supplemental Journal Article Materials: A Recommended Practice of the National Information Standards Organization and the National Federation of Advanced Information Services. January 2013. http://www.niso.org/publications/rp/rp-15-2013.
7. Collections. HathiTrust Digital Library. http://babel.hathitrust.org/cgi/mb.
8. HathiTrust Data API. HathiTrust Digital Library. http://www.hathitrust.org/data_api.

Copyright 2013.

The copyright holder grants the U.S. National Library of Medicine permission to archive and post a copy of this paper on the Journal Article Tag Suite Conference proceedings website.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License

Bookshelf ID: NBK159727
