NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.
Journal Article Tag Suite Conference (JATS-Con) Proceedings 2015 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2015.
JATS for Reuse (JATS4R) is a working group that was set up in June 2014 to address the issue of standardizing JATS XML for the ease of reuse. This paper will describe the current state of the group’s recommendations, our workflow, and the tools that we have built so far. The JATS4R home page is at http://jats4r.org.
NISO Z39.96-2012 JATS is a NISO standard that defines a set of XML elements and attributes for tagging journal articles and describes article models suitable for archiving, publishing, and authoring articles in XML. Because it needs to be able to accommodate nearly any published journal content, there is a lot of flexibility in the models. But as JATS does not define “best practices”, it is not used consistently across publishers; this inhibits harvesting and reuse of JATS-tagged materials.
JATS4R has been formed to explore how the specific JATS elements for which such inconsistencies have been found are used, and to attempt to harmonize their use by issuing recommendations on how reuse-relevant elements should be tagged. Besides establishing communication channels and working out procedures, the group has already tackled some specific tagging issues, most notably the machine readability of license statements and of mathematics in journal articles.
Background
In mid-2012, an automated software tool – the Open Access Media Importer (OAMI)[1] – started to crawl the Open Access Subset of PubMed Central (PMC) in order to find openly licensed articles that contained audio or video materials. Since then, it has uploaded over 19,000 of these media files to Wikimedia Commons to facilitate reuse on Wikipedia and elsewhere. In doing so, it made use of several JATS elements – e.g. those for licensing, keywords and media types – and revealed a number of inconsistencies in the XML available from PMC. The inconsistencies were manifested as differences in the XML at PMC among different publishers, and as differences between that XML and:
- the XML at the publishers' sites
- the JATS standard
- the JATS documentation (Tag Libraries)
- the PMC Tagging guidelines
- the PMC Style Checker.
Although JATS is an industry standard XML schema built specifically for publishers and the electronic storage of journal content, and PMC performs some standardisation of the XML during ingest, JATS has had to allow for the fine nuances of publishing and the varying requirements of different types of content and different publishers. Also, a growing amount of JATS content – particularly from fields outside biomedicine – does not reach PMC and is thus not subject to their standardisation techniques. As a result, and despite best intentions, publishers use JATS inconsistently, which has led to reusability problems.
Some of these inconsistencies hindered the ability of the OAMI software to reuse the material. Especially problematic was the inconsistent and sometimes ambiguous way that licenses were tagged, which required the implementation of complex, text-mining-like algorithms in order to determine accurately whether or not the content was compatible with reuse on Wikimedia Commons.
For example, in some instances, the license URI specified a “CC-BY” license, but the human-readable text contradicted it, adding an extra non-commercial clause:
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.5/"> <p>Re-use of this article is permitted in accordance with the Creative Commons Deed, Attribution 2.5, which does not permit commercial exploitation.</p> </license>
Such contradictions in licensing terms have since also been observed by the Wellcome Trust, who commissioned a tool to monitor compliance with their Open Access Policy.
In other examples, articles had license information outside of any <license> element:
<permissions> <copyright-statement> Uosaki et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. </copyright-statement> <copyright-year>2011</copyright-year> </permissions>
Other articles were found to have different, contradictory license URIs within the <permissions> element.
Needless to say, this made the task of the OAMI software much more difficult than it should have been. Also, because of Wikimedia Commons' strict enforcement of proper licensing, it was important that the OAMI module be conservative: false positives being much less acceptable than false negatives. That meant that a lot of eligible content was, unfortunately, discarded.
These problems led to a paper[2] that was presented at JATS-Con 2014[3] and triggered a call to action for the development of best practices for tagging in JATS in a way that improves reusability.[4]
Two months later, a group of open access publishers and other interested parties came together in Cambridge, UK, to discuss how to improve the reusability of articles, and to move this effort forward. The meeting resulted in a prioritization list and a proposed basic workflow for elaborating recommendations, along with a website,[5] a mailing list[6] and a name for the group: “JATS for Reuse”, abbreviated “JATS4R”.
The prioritization was informed by the issues that affected reuse through the OAMI,[1] whose functionality had since begun to be expanded from the initial focus on supplementary materials to full text.[7] This revealed that many other JATS tags are used inconsistently, for example those concerned with mathematical formulas, affiliations and contributor roles.
In regular online meetings that have converged on a biweekly pattern, the group has since implemented the proposed workflow, refined it, used it to draft recommendations for a number of elements, and undertaken steps towards developing a validation system.
In January 2015, the group was expanded to include more publishers, representation from online hosts, and also other interested parties such as content processing vendors. The group is open to anyone interested in the creation of content in XML format using the JATS DTD. Current JATS4R endorsing parties are listed on the JATS4R website.
Workflow
At the meeting in Cambridge, UK, in June 2014, a workflow was proposed, which has since largely been implemented (see Fig. 1).
However, in practice, the process has been less linear, and certain aspects have taken more focus than others. What was unanticipated at the start was the power of GitHub issues as a way for all members of the group as well as non-members to add comments and queries asynchronously, as and when they were encountered. So, the use of GitHub issues has become a larger part of the workflow than uploading samples directly to the GitHub repositories (see Fig. 2).
Fig. 2. Actual, current workflow
Discussions based on GitHub examples are refined both on GitHub and in a biweekly conference call until consensus is reached. Agreeing on best practices has also become a more protracted part of the workflow, and it is the key element.
In less than a year, the group has tackled two major tagging issues, written recommendations for them, and incorporated them into the validation tools.
Once formal recommendations are published, and the validation tools are released and deployed, we expect that we should be able to compile compliance statistics and to make these available.
Prioritization and upcoming
A prioritisation list was produced at the first meeting. The first two recommendations JATS4R has tackled were uppermost on the initial recommendation list. In January 2015, the list was refined by the expanded group, and group members voted on the next priorities. A copy of the list[8] is kept in the JATS4R Google folder.
Besides requirements for the schematron tools, the next issues for consideration are references (citation of software, data, versions and law cases) as well as versioning and corrections. In some instances, we anticipate that JATS4R will become a forum for publishers and other interested parties to discuss their individual issues with JATS and general publishing requirements and standards, and to suggest refinements or recommendations of best practice.
Recommendations
Recommendations are drafted on the GitHub wiki.[9] Once they have been finalized and approved by the members of the group, they will be tagged with a version, and moved to the website.
Permissions
An important piece of metadata associated with a journal article is the license (if any) under which it is distributed. This is especially true when considering machine-readability, because the main use case for machines “reading” articles is so that they might reuse the content for various purposes. It is therefore important for such a machine to be able to determine the license for any article accurately, and to determine reliably whether the intended reuse is permitted.
In the simplest case, where the entirety of an article is made available under one license, and that license is identified by a stable URI (such as the Creative Commons licenses), the current recommendations describe the exact placement of the license URI within the JATS markup, so that it can be read unambiguously. Specifically, it should be given in the @xlink:href attribute on the <license> element. For example:
<permissions> <copyright-statement>© 2014 Surname et al.</copyright-statement> <copyright-year>2014</copyright-year> <copyright-holder>Surname et al.</copyright-holder> <license xlink:href="http://creativecommons.org/licenses/by/4.0/"> <license-p>This is an open access article distributed under the terms of the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution License</ext-link>, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.</license-p> </license> </permissions>
Furthermore, we recommend that the contents of the <license-p> element be considered to be for human-readability only and that any license URIs that occur within it can be safely ignored by machines. In this way, the human-readable portion of the license statement can be written to reference other licenses, or contain any discussion whatsoever, without worrying that the contents will confound future attempts at reuse.
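The placement rule above can be illustrated with a minimal Python sketch. It is not part of the JATS4R tooling: the function name `article_license_uri`, the sample document, and the `example.org` link are assumptions made for illustration. The sketch reads the license URI only from the @xlink:href attribute on <license> and deliberately ignores any links inside <license-p>.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes namespaced attributes in {namespace}local form.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical sample: the link inside <license-p> points somewhere else
# on purpose, to show that it is not consulted.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front><article-meta>
    <permissions>
      <copyright-year>2014</copyright-year>
      <copyright-holder>Surname et al.</copyright-holder>
      <license xlink:href="http://creativecommons.org/licenses/by/4.0/">
        <license-p>Distributed under the
          <ext-link ext-link-type="uri"
                    xlink:href="http://example.org/other-license">terms</ext-link>
          described here.</license-p>
      </license>
    </permissions>
  </article-meta></front>
</article>"""


def article_license_uri(xml_text):
    """Return the article-level license URI, or None if absent or ambiguous."""
    root = ET.fromstring(xml_text)
    permissions = root.find("./front/article-meta/permissions")
    if permissions is None:
        return None
    # Only direct <license> children are read; <license-p> is never inspected.
    uris = {lic.get(XLINK_HREF) for lic in permissions.findall("license")}
    uris.discard(None)
    # Conservative choice: several distinct URIs give no reliable answer.
    return uris.pop() if len(uris) == 1 else None


print(article_license_uri(SAMPLE))
```

Returning None when several different URIs are present is a design choice of this sketch, in the spirit of the conservative behaviour described for the OAMI, rather than something the recommendation itself prescribes.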
Some articles or other published documents that use JATS incorporate third party or other material which is released under different licenses. For example, a CC-BY article may include an image or figure for which the copyright holder maintains “all rights reserved” or has selected an alternative CC license. We have established recommendations for how these should be tagged. In particular, the “topmost” permissions element is taken to apply to the article as a whole, and by default, to any of its parts, unless that part contains its own <permissions> element. If so, then that <permissions> element must be complete, and specify copyright, license, etc., in their entirety.
For example, the recommendations need to allow for elements of an article/chapter that fall under an alternative license to the remainder of the document. We have proposed guidelines that allow for these instances in the main, but do not overly complicate the markup to the extent that machine readability would be hard to achieve. For example, if a CC BY article contains one figure that is all rights reserved, we recommend that the single figure contains a separate permissions element. This maintains a machine’s ability to eliminate it from re-use and prevents the publisher from having to list the whole article as all rights reserved and thus prevent reuse of more material than necessary. If one part of a composite figure is all rights reserved, it is reasonable for that whole figure to be listed as such, unless the figure is reproduced with each composite as its own entity. However, the issue of composite figures and deconstructing them to their separate elements is on the JATS4R agenda.
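The inheritance rule for nested <permissions> elements can be sketched in Python as follows. This is one possible reading of the rule, not the group's implementation: the helper names `effective_permissions` and `license_uri` and the sample article are hypothetical, and walking up through all ancestors (rather than checking only the element itself) is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical CC BY article with one all-rights-reserved figure.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front><article-meta>
    <permissions>
      <license xlink:href="http://creativecommons.org/licenses/by/4.0/"/>
    </permissions>
  </article-meta></front>
  <body>
    <fig id="f1"><caption><p>Covered by the article license.</p></caption></fig>
    <fig id="f2">
      <permissions>
        <copyright-statement>All rights reserved.</copyright-statement>
        <copyright-year>2010</copyright-year>
        <copyright-holder>Third Party</copyright-holder>
      </permissions>
    </fig>
  </body>
</article>"""


def effective_permissions(root, element):
    """Return the <permissions> element that governs `element`."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    node = element
    while node is not None:
        own = node.find("permissions")  # direct child only
        if own is not None:
            return own  # a part with its own <permissions> overrides the default
        node = parents.get(node)
    # No override found: fall back to the topmost, article-level element.
    return root.find("./front/article-meta/permissions")


def license_uri(permissions):
    """Return the license URI of a <permissions> element, or None."""
    lic = permissions.find("license") if permissions is not None else None
    return lic.get(XLINK_HREF) if lic is not None else None


root = ET.fromstring(SAMPLE)
for fig in root.iter("fig"):
    print(fig.get("id"), license_uri(effective_permissions(root, fig)))
```

Because an overriding <permissions> element is expected to be complete, the sketch does not merge it with the article-level one: the second figure carries no license, so a cautious consumer would treat it as not reusable.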
Despite the growing amount of open access literature, to our knowledge, issues of tagging different licenses for third party content have not previously been considered.
We have also defined some recommendations for the specification of an article’s copyright. Whenever an article is under copyright, both <copyright-year> and <copyright-holder> should be present, and <copyright-year> should be a full four-digit year with no whitespace. Note that the <copyright-statement> is intended for human-readability, and therefore, its contents are not addressed by JATS4R.
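A check for these copyright rules might look like the following Python sketch. The function name `copyright_problems` and the message strings are hypothetical, and the sketch assumes the caller has already established that the article is under copyright.

```python
import re
import xml.etree.ElementTree as ET


def copyright_problems(permissions):
    """List violations of the copyright rules in a <permissions> element."""
    problems = []
    year = permissions.find("copyright-year")
    holder = permissions.find("copyright-holder")
    if year is None:
        problems.append("missing <copyright-year>")
    elif not re.fullmatch(r"[0-9]{4}", year.text or ""):
        # fullmatch rejects surrounding whitespace as well as short years.
        problems.append("<copyright-year> is not a bare four-digit year")
    if holder is None:
        problems.append("missing <copyright-holder>")
    # <copyright-statement> is for human readers, so it is not examined.
    return problems


good = ET.fromstring(
    "<permissions><copyright-year>2014</copyright-year>"
    "<copyright-holder>Surname et al.</copyright-holder></permissions>"
)
bad = ET.fromstring("<permissions><copyright-year> 14</copyright-year></permissions>")
print(copyright_problems(good))
print(copyright_problems(bad))
```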
Math
The JATS4R recommendations for marking up mathematics are intended to ensure that formulas and equations are tagged consistently, such that they could be interpreted by automated processes, or rendering engines, without significant extra work.
In particular, mathematical formulas must be enclosed in <inline-formula> or <disp-formula> elements, to ensure that they are easily identifiable. To avoid ambiguity, there must be one and only one formula within each of those elements.
Furthermore, in order for an article to be JATS4R-compliant, every formula must be supplied as markup, either MathML (in a <mml:math> element) or TeX/LaTeX (in a <tex-math> element). Images may be supplied; but if they are, they must be enclosed in an <alternatives> element and accompanied by the same formula given as markup.
If alternative representations of the same formula are provided (within an <alternatives> element), there is no explicit or implied preferred representation. It is incumbent on the content provider to ensure that each representation is correct and logically equivalent.
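The math rules can be approximated in a short Python sketch. It is a simplification under stated assumptions, not the JATS4R Schematron: the names `formula_problems` and `check_formulas`, the message strings, and the sample section are hypothetical, and the "one and only one formula" rule is approximated by counting markup children outside <alternatives>. Whether alternative representations are logically equivalent cannot be checked this way.

```python
import xml.etree.ElementTree as ET

MML_MATH = "{http://www.w3.org/1998/Math/MathML}math"
FORMULAS = {"inline-formula", "disp-formula"}
MARKUP = {MML_MATH, "tex-math"}
IMAGES = {"graphic", "inline-graphic"}

# Hypothetical sample: e1 and e3 follow the rules, e2 is image-only.
SAMPLE = """<sec xmlns:mml="http://www.w3.org/1998/Math/MathML"
     xmlns:xlink="http://www.w3.org/1999/xlink">
  <disp-formula id="e1"><mml:math><mml:mi>x</mml:mi></mml:math></disp-formula>
  <disp-formula id="e2"><graphic xlink:href="e2.png"/></disp-formula>
  <p>Inline: <inline-formula id="e3"><alternatives>
    <tex-math>x^2</tex-math>
    <inline-graphic xlink:href="e3.png"/>
  </alternatives></inline-formula></p>
</sec>"""


def formula_problems(formula):
    """List rule violations for one <inline-formula> or <disp-formula>."""
    problems = []
    alternatives = formula.find("alternatives")
    container = alternatives if alternatives is not None else formula
    tags = [child.tag for child in container]
    markup_count = sum(tag in MARKUP for tag in tags)
    if markup_count == 0:
        problems.append("no MathML or TeX markup")
    if alternatives is None and markup_count > 1:
        problems.append("more than one formula")
    if alternatives is None and any(tag in IMAGES for tag in tags):
        problems.append("image not wrapped in <alternatives>")
    return problems


def check_formulas(xml_text):
    """Map each formula id to its list of problems."""
    root = ET.fromstring(xml_text)
    return {
        el.get("id"): formula_problems(el)
        for el in root.iter()
        if el.tag in FORMULAS
    }


for formula_id, problems in check_formulas(SAMPLE).items():
    print(formula_id, problems)
```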
How to assess and state compliance
We will version the recommendations linearly. This means a new version number for each update. No distinction will be made between the addition of a new 'subject' and new rules for an existing subject. Each version includes all of the preceding subjects. For example, if we released recommendations for "permissions", "authors", and "formula" over a period of time, the versions would work out as in Table 1. Please note that these subjects and version numbers are just given as an example.
Table 1
Date | Subject | Subject Iteration | JATS4R Version | Subjects Included |
---|---|---|---|---|
Jan 1 | Permissions | 1st iteration | 1 | permissions(1) |
Mar 1 | Authors | 1st iteration | 2 | permissions(1), authors(1) |
Jun 1 | Formula; Permissions | 1st iteration; 2nd iteration | 3 | permissions(2), authors(1), formula(1) |
Compliance with JATS4R can be tested with Schematrons. We will provide a Schematron for each version on the JATS4R website that may be downloaded and built into production publication systems. We will also provide an online tool for single-article testing. A preliminary, demonstration version of this tool is available.[10]
Articles can signal their JATS4R compliance with an <?xml-model?> processing instruction that references the appropriate schematron. At this time, this is an example URI:
<?xml-model href="http://jats4r.org/schema/0.1/jats4r.sch" schematypens="http://purl.oclc.org/dsdl/schematron" title="JATS4R 0.1"?>
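A consumer could detect this declaration with a Python sketch along the following lines. The function name `xml_model_declarations` and the one-element sample document are hypothetical. The sketch uses `xml.dom.minidom` because it keeps processing instructions that precede the root element, which ElementTree's default parser discards. As a simplification, the regular expression only handles double-quoted pseudo-attributes.

```python
import re
from xml.dom import minidom

# Matches name="value" pairs inside the processing instruction data.
PSEUDO_ATTR = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')

SAMPLE = (
    '<?xml-model href="http://jats4r.org/schema/0.1/jats4r.sch" '
    'schematypens="http://purl.oclc.org/dsdl/schematron" '
    'title="JATS4R 0.1"?>\n'
    "<article/>"
)


def xml_model_declarations(xml_text):
    """Return the pseudo-attributes of each top-level xml-model instruction."""
    doc = minidom.parseString(xml_text)
    found = []
    for node in doc.childNodes:
        is_pi = node.nodeType == node.PROCESSING_INSTRUCTION_NODE
        if is_pi and node.target == "xml-model":
            found.append(dict(PSEUDO_ATTR.findall(node.data)))
    return found


for declaration in xml_model_declarations(SAMPLE):
    print(declaration["href"], declaration.get("title"))
```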
Infrastructure/getting involved
The central place for communication for JATS4R is a dedicated GitHub organization JATS4R, which currently contains two repositories that are in the public domain:
- elements contains XML examples collected from different publishers as well as minimal examples for using thematically related groups of JATS elements (e.g. for permissions, maths, or figures) and serves to elaborate best practices around using these elements;
- jats4r.github.io contains the group’s website content, including schema files and the online validator; all of which is hosted on GitHub Pages.
Each repository has its own issue tracker, and the individual issues therein serve as public discussion threads for specific problems or suggestions, which can be grouped by way of labels. On that basis, the group’s recommendations are drafted on the wiki of the elements repository, again grouping thematically related elements together. The infrastructure is completed by a mailing list that serves for general coordination, e.g. of the biweekly meetings, as well as a Google Drive folder that hosts the minutes and other auxiliary materials, e.g. the prioritization list or the draft of this paper.
This open way of working provides the community with multiple ways to get involved with JATS4R activities, e.g. by
- contributing tagging examples,
- posting new GitHub issues in the elements repository,
- contributing to the discussion of existing issues,
- posting pull requests,
- scrutinizing the recommendations while they are still being drafted,
- testing the validation tools,
- helping to improve the website,
- sharing information about JATS4R by linking to the group’s website or individual components of its infrastructure.
Adoption and implementation
Adoption
Although the call to action was not selective, the initial meeting was held by a closed group. However, all documentation and discussion has been on open forums, as outlined in the section above. As its first public event, the group organized a webinar[12] during Open Access Week 2014, which was recorded and posted on YouTube. Similar outreach activities are envisaged for the future.
In December 2014, JATS4R was mentioned by two of the STM eProduction seminar talks in London, UK,[13] and an active recruitment process resulted in a number of further publishers, online hosts, and other members of the publishing community joining the group. We continue to welcome further members. There is a list of thirteen companies and organizations that endorse the JATS4R principles on the JATS4R homepage, and that list is growing.
Implementation
It is not intended that the validation tools will be a method of “outing” non-compliance, but rather a means of helping publishers and other interested parties to measure and report their own efforts towards accessibility and reuse of their content, especially their open access corpora. We aim to publish compliance rates of endorsing organisations and anticipate that list of organisations to grow. There is no requirement to achieve 100% compliance in order to participate in the group, but a commitment to work towards this goal within organisational frameworks and restrictions is expected. JATS4R is intended to help the community to reach the combined aim of enabling the literature produced to reach the widest possible audience.
Conclusions
JATS was conceived in order to aggregate multiple publishers' DTDs into a single necessarily-permissive DTD, with journal content flowing from publishers to its final resting place in an archive. But now, with the ongoing spread of open licensing, content is becoming more mobile in legal terms, which opens up new ways of using and reusing journal content, including reuse from archives and across publishers. In technical terms, this translates to a need for further standardization, which the JATS4R group was formed to facilitate. The group has initiated a process to review current usage of specific JATS elements, to harmonize it across publishers and to prioritize its activities based on reuse.
The first standards have been published, and tools made available to test against the standards. The group uses an open communication infrastructure and invites the wider JATS community to participate in the process of establishing standards to benefit archiving and interchange, streamline JATS tool development, and reduce production costs.
Not only do we seek to ensure maximum reuse of available content, we also propose to provide publishers and producers of content with concrete recommendations that they can use to produce their JATS XML content, in order to provide maximum portability and reusability. This, in turn, adds value to the content by virtue of the network effect. If publishers and producers of JATS XML have the same expectations of the use of the elements and attributes, and people conform to these, then transfer of content and processes between publishers and vendors will be simpler, and the costs associated with these transactions will be reduced. Ultimately, reduced production costs lower publishing costs, and these benefits can be transferred to the wider community in lower APCs or subscription costs. Standardisation reduces waste, allowing the focus to be placed on real benefits for the communities we serve.
Links
- 1. Open Access Media Importer: http://commons.wikimedia.org/wiki/User:Open_Access_Media_Importer_Bot
- 2. Mietchen D, Maloney C, Moskopp ND. Inconsistent XML as a Barrier to Reuse of Open Access Content. JATS-Con 2013/2014. Bethesda, MD. http://www.ncbi.nlm.nih.gov/books/NBK159964/
- 3. Mietchen D. User:Daniel Mietchen/Talks/JATS-Con 2014/Inconsistent XML as a Barrier to Reuse of Open Access Content. https://en.wikipedia.org/wiki/User:Daniel_Mietchen/Talks/JATS-Con_2014/Inconsistent_XML_as_a_Barrier_to_Reuse_of_Open_Access_Content
- 4. Beck J. Call to Action: http://videocast.nih.gov/summary.asp?Live=13963&start=11980&bhcp=1
- 5. JATS4R website: http://jats4r.org/
- 6. JATS4R mailing list: https://groups.google.com/forum/#!forum/jats4r
- 7.
- 8.
- 9.
- 10. Validator: http://jats4r.org/validate/
- 11. JATS4R Github Organization: https://github.com/JATS4R
- 12. JATS4R OA Week webinar: https://www.youtube.com/watch?v=S3iwVlkvWdY
- 13. STM eProduction seminar: http://www.stm-assoc.org/events/e-production-seminar-2014/