NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Journal Article Tag Suite Conference (JATS-Con) Proceedings 2010 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2010.

Superset Me—Not: Why the Journal Publishing Tag Set Is Sufficient if You Use Appropriate Layer Validation

This paper relates the experience of a publisher who chose to create a superset of the NLM Journal Publishing Tag Set in order to enforce business rules, data types, and house style and, having done just that, realized that a subset could have been sufficient to meet the publisher's needs if it were used in conjunction with the appropriate layer validation technology, such as Schematron.

Introduction

An important reason a publisher uses an XML vocabulary to mark up its publications is to obtain a necessary degree of control over the content and the processes involved in making it available. Before the Journal Publishing Tag Set (hereinafter JPTS) of the NLM Journal Archiving and Interchange Tag Suite (hereinafter JATS) emerged as a de facto standard in the STM publishing industry, customizing an existing DTD, such as ISO 12083, or writing one's own and then running a validating XML parser was the only way to check business rules, data types, and editorial style.

In addition to JPTS's availability and wide adoption, another factor that has changed the electronic publishing paradigm is the availability and maturity of tools and technologies that enable the publisher to exert control over its content and workflow. Even though JPTS has been around in some form since 2003, a publisher still needed to develop its own quality control tools, using such technologies as Java or XSLT, in order to check metadata integrity, business rules, data types, and house style. Such a solution required considerable in-house expertise in the chosen programming languages, and the quality control tools developed for that purpose were usually difficult to adapt to the inevitable frequent DTD changes. The development and adoption of Schematron, specifically designed to be a rule-based validation language for XML, as well as the availability of user-friendly environments, like the oXygen editor, have made it possible to use a generic tag set in conjunction with Schematron to achieve a degree of control surpassing that attainable with a customized DTD.

Taking into account (1) the current wide adoption of JPTS by conversion shops, composition vendors, content hosts, aggregators, and archives, (2) the maturity of Schematron technology, and (3) the availability of user-friendly tools, an electronic publisher may want to seriously consider using a generic tag set and shifting the burden of validation from the XML parser to a more appropriate layer, such as the Schematron engine, which will perform the majority of the required checks.

Why we built a JPTS superset

Why then didn't we choose that route? There were several reasons. When, as a result of collecting its requirements, AGU decided to move from a proprietary DTD to a tag set based on JPTS, not all of JPTS's elements and attributes were parameterized, nor did JPTS yet contain all the elements that we needed. For example, AGU uses its own indexing language where each subject descriptor consists of a numeric code and a textual description, and there was no easy way to express this structure in JPTS v2.3. In fact, it was AGU's request to the Secretariat, along, I assume, with similar ones coming from other publishers, that led to the introduction of the Compound Keyword element in the NLM Tag Suite v3.0. Ironically, however, its addition came too late for our benefit because we were already too invested in developing the AGU Production Tag Set.

Another reason for choosing to build a superset of JPTS was our lack of familiarity with and expertise in Schematron technology, together with the immaturity of the tools for running Schematron, e.g., oXygen. Running SVRL (Schematron Validation Report Language) batch files was not a viable option in the production environment. Some of the XML checks we run require validation against a relational metadata database, and at the time the technology did not offer a seamless way to integrate such validation with the rules-based document checking.

Yet another reason for choosing to superset JPTS was that AGU had to use its tag set to mark up not only journal articles but also books and their chapters. As a result, the AGU Production Tag Set has been built as a superset of JPTS. We use Schematron as a quality control tool that performs close to 200 checks on publications of various genres (journal articles, books, book chapters, newspaper articles) and at various stages (in press, initial, and final).

In this paper, however, I will try to demonstrate that a journal (as opposed to a book) publisher with similar requirements might be better off not creating a JPTS superset, i.e., adding specific structures to the tag set, but rather building a subset, i.e., removing unnecessary elements from JPTS, and then implementing Schematron validation on top of the subset.

Validating with an XML parser versus with a Schematron

In this section I will compare how the same requirements can be expressed using the strict DTD models where validation is done by an XML parser and using the loose JPTS models with validation performed by Schematron.

Attribute values (enumerated list)

Requirement. Article type is required and can be one of three types: a regular article (rga), a correction (cor), or an editorial (edt).

Strict DTD

<!ATTLIST  article  
            article-type  
                        (rga | cor | edt)               #REQUIRED >                
            

JPTS

<!ATTLIST  article 
            article-type   
                        CDATA                            #IMPLIED >        
    

XML instance (contains nonallowed article type)

<article article-type='xxx'/>
    

Schematron

<rule context="article">
    <assert test="@article-type=('rga','cor','edt')">
    
      @article-type '<value-of select='@article-type'/>' 
      not allowed, must be 'rga', 'cor', or 'edt'</assert></rule>
    

Schematron message

@article-type 'xxx' not allowed, must be 'rga', 'cor', or 'edt'

Number of element occurrences

Requirement. The Acknowledgments section, if present, must contain exactly one paragraph, except in two journals (journal codes ‘ja’ and ‘rg’), where Acknowledgments must contain exactly two paragraphs.

Strict DTD

<!ELEMENT  ack          (p, p?)                                   >

JPTS

<!ELEMENT  ack          (p*)                                      >

XML instance (wrong number of paragraphs)

<article> 
  ...
  <journal-id>jb</journal-id> 
  ... 
  <ack> 
    <p>Blah</p>
    <p>Blah-blah</p> 
  </ack> 
</article>
                

Schematron

<rule context="ack[ancestor::*/journal-id=('ja','rg')]">
  <assert test="count(p) eq 2">

    '<name/>' in '<value-of select="ancestor::*/journal-id"/>'   
     must contain exactly two paragraphs</assert></rule>

<rule context="ack">
  <assert test="count(p) eq 1">

    '<name/>' in '<value-of select="ancestor::*/journal-id"/>' 
     must contain only one paragraph</assert></rule>

Schematron message

'ack' in 'jb' must contain only one paragraph
                

Element position and sequence

Requirement. If a journal has a subject grouping (such as table of contents category or a disciplinary subset) and an article belongs to a special collection (such as a one-time special section or an ongoing theme), then the subject grouping metadata must precede the special collection metadata.

Strict DTD

<!ELEMENT  article-categories
                        (subject-group*,
                         special-collection?)                     >            

JPTS

<!ELEMENT  article-categories
                        (subj-group*)                             >

XML instance (wrong sequence of subject groups)

<article-categories>
    <subj-group subj-group-type="special-section">
        <subject content-type="EARLYWARN1">New Methods and 
		Applications of Earthquake Early Warning</subject>
    </subj-group>
    <subj-group subj-group-type="toc-category">
        <subject content-type="SDE">Solid Earth</subject>
    </subj-group>
</article-categories>            

Schematron

<rule context="article-categories/subj-group[@subj-group-type= 
	('special-section','theme')]"> 
 <assert test="not(following-sibling::subj-group[@subj-group-type= 
	       ('toc-category','subset')])">

   <name/>/@subj-group-type='<value-of select='@subj-group-type'/>' 
   must appear after a ToC Category or a Subset when either is 
   present</assert></rule>            

Schematron message

subj-group/@subj-group-type='special-section' must appear after 
a ToC Category or a Subset when either is present
                

References

Validating bibliographic references presents a particular challenge given, on the one hand, their variety and, on the other, the need to enforce house style. At one end of the spectrum is a strict approach where the DTD prescribes a fixed order of elements and allows for no mixed content. In this model, the punctuation, spacing, and face markup are generated on output.

Strict DTD

<!ELEMENT  book-standalone-citation
                        ((person-group | string-name),
                         year,                 
                         source,
                         edition?, 
                         (person-group | string-name)?,
                         size?, 
                         elocation-id?, 
                         publisher-name,
                         publisher-loc)                           >
<!ATTLIST  book-standalone-citation
              id                  ID                    #REQUIRED >            

At the other end of the spectrum is JPTS's mixed-citation element, which allows for any number of elements in any order mixed with character data.

JPTS

<!ELEMENT  mixed-citation     
                        (#PCDATA | person-group | string-name | 
                         year | source | edition | size | 
                         elocation-id | publisher-name | 
                         publisher-loc | ... | ...)*              >
<!ATTLIST  mixed-citation
              id                  ID                    #IMPLIED
              publication-type    CDATA                 #IMPLIED  >

Example:

Mood, A. M., and F. A. Graybill (1963), Introduction to the Theory of Statistics, 2nd ed., 295 pp., McGraw-Hill, New York.

XML instance (strict DTD)

<book-standalone-citation id="mood63">
  <person-group person-group-type="author">
    <name><surname>Mood</surname> 
          <given-names>A. M.</given-names></name> 
    <name><surname>Graybill</surname> 
          <given-names>F. A.</given-names></name>
  </person-group>
  <year>1963</year>
  <source>Introduction to the Theory of Statistics</source>
  <edition>2nd</edition>
  <size units="page">295 pp</size>
  <publisher-name>McGraw-Hill</publisher-name>
  <publisher-loc>New York</publisher-loc>
</book-standalone-citation>

XML instance (JPTS)

<mixed-citation publication-type="book-standalone">
    <string-name>
        <surname>Mood</surname>, <given-names>A. M.</given-names>, 
    </string-name>                    
        and 
    <string-name>                    
        <given-names>F. A.</given-names> <surname>Graybill</surname>
    </string-name> 
    (<year>1963</year>), 
    <source><italic>Introduction to the 
            Theory of Statistics</italic></source>, 
    <edition>2</edition>nd ed.,         
    <size units="page">295</size> pp.,
    <publisher-name>McGraw-Hill</publisher-name>, 
    <publisher-loc>New York</publisher-loc>.
</mixed-citation>

One could use Schematron to check that the required elements are present

<rule context="mixed-citation[@publication-type='book-standalone']">
    <assert test="(person-group | string-name) and year and source 
                      and publisher-name and publisher-loc">

       required element missing</assert></rule>

and that the elements are in the correct sequence.

XML instance (JPTS) (edition is in the wrong place)

<mixed-citation publication-type="book-standalone">
    <string-name>
        <surname>Mood</surname>, <given-names>A. M.</given-names>, 
    </string-name>                    
        and
    <string-name>                    
        <given-names>F. A.</given-names> <surname>Graybill</surname>
    </string-name> 
    (<year>1963</year>), 
    <edition>2</edition>nd ed.,
    <source><italic>Introduction to the 
            Theory of Statistics</italic></source>, 
    <size units="page">295</size> pp.,
    <publisher-name>McGraw-Hill</publisher-name>, 
    <publisher-loc>New York</publisher-loc>.
</mixed-citation>

The following fragment uses the positional predicate [1] to check that year is immediately followed by source.

Schematron

<rule context="mixed-citation[@publication-type=
               'book-standalone']/year">
    <assert test="following-sibling::*[1]/self::source">
    
      '<name/>' must be immediately followed by 'source', not by '<value-of 
      select='name(following-sibling::*[1])'/>'</assert></rule>
            

Schematron message

'year' must be immediately followed by 'source', not by 'edition'

But how can one check the sequence of required elements when there might be optional elements interspersed between them? The following fragment checks that the required publisher-name is preceded by the required source, regardless of any optional elements that may occur in between:

Schematron

<rule context="mixed-citation[@publication-type=
               'book-standalone']/publisher-name">
  <assert test="preceding-sibling::source">

    '<name/>' must be preceded by 'source'</assert></rule>
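
The pairwise check above extends to the whole chain of required elements. As an illustration outside Schematron, the following Python sketch tests that the required elements occur in the prescribed relative order, whatever optional elements sit between them. The names `REQUIRED_ORDER` and `required_in_order` are hypothetical, and the list of required elements is an assumption taken from the presence check shown earlier, leaving out the author slot.

```python
import xml.etree.ElementTree as ET

# Required elements of a standalone book citation, in house-style order (assumed).
REQUIRED_ORDER = ["year", "source", "publisher-name", "publisher-loc"]

def required_in_order(citation_xml, required=REQUIRED_ORDER):
    """True if `required` is a subsequence of the citation's child element names."""
    names = iter(child.tag for child in ET.fromstring(citation_xml))
    # `name in names` advances the iterator, so each search resumes
    # where the previous one stopped.
    return all(name in names for name in required)

CITATION = """<mixed-citation publication-type="book-standalone">
  <string-name>Mood, A. M.</string-name> (<year>1963</year>),
  <source>Introduction to the Theory of Statistics</source>,
  <edition>2</edition>nd ed., <size units="page">295</size> pp.,
  <publisher-name>McGraw-Hill</publisher-name>,
  <publisher-loc>New York</publisher-loc>.
</mixed-citation>"""

print(required_in_order(CITATION))
```

A citation that lacks one of the required elements fails this check as well, so in this sketch it covers both presence and relative order in a single pass.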
                

There is, however, a more elegant approach suggested by Rick Jelliffe, the inventor of Schematron, which combines the flexibility of the JPTS citation model with the benefits of the strict element order that a structured DTD offers. In this ingenious method, each citation is rewritten as a string of the names of its child elements, and the content model is represented as a regular expression. A Schematron then checks the string of element names against the regular expression.

Thus, one may have an XML file, e.g., citation-models.xml, where all allowed structured citation models are specified:

...
<model publication-type="book-standalone">
  ((string-name | person-group),
   year,                 
   source,
   edition?,
   (string-name | person-group)?,
   size?,
   elocation-id?, 
   publisher-name, 
   publisher-loc)
</model> 
...            

The Schematron generates an error or a warning message if the content does not match the model. The method offers a number of advantages:

  • XML is still DTD-valid;
  • mixed content is permitted;
  • type-sensitive handling of references is possible.

The caveat here is, however, that implementing this approach requires a clever use of XSLT 2.0.
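
The core idea can also be illustrated outside Schematron and XSLT. The following Python sketch is not Jelliffe's implementation, and none of its names (`MODELS`, `model_to_regex`, `child_names`, `conforms`) come from this paper. It assumes a content model written in DTD syntax, with `edition` treated as optional as in the strict DTD above, and citations without namespaces. It turns the model into a regular expression over space-terminated element names and matches the names of a citation's child elements against it, ignoring the interleaved text.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for citation-models.xml: publication type -> content model.
MODELS = {
    "book-standalone": """((string-name | person-group), year, source, edition?,
                           (string-name | person-group)?, size?, elocation-id?,
                           publisher-name, publisher-loc)""",
}

def model_to_regex(model):
    """Translate a DTD-style content model into a regex over 'name ' tokens."""
    parts = []
    for token in re.findall(r"[A-Za-z_][\w.-]*|[()|?*+,]", model):
        if token == ",":
            continue  # a sequence is expressed by plain concatenation
        if token in ("(", ")", "|", "?", "*", "+"):
            parts.append(token)  # same meaning in DTD and regex syntax
        else:
            parts.append("(?:" + re.escape(token) + " )")
    return "".join(parts)

def child_names(citation):
    """Return the child element names as one string, e.g. 'year source '."""
    return "".join(child.tag + " " for child in citation)

def conforms(citation_xml, models=MODELS):
    """True if the citation's element order matches the model for its type."""
    citation = ET.fromstring(citation_xml)
    model = models[citation.get("publication-type")]
    return re.fullmatch(model_to_regex(model), child_names(citation)) is not None

GOOD = """<mixed-citation publication-type="book-standalone">
  <person-group>Mood, A. M., and F. A. Graybill</person-group>
  (<year>1963</year>), <source>Introduction to the Theory of Statistics</source>,
  <edition>2</edition>nd ed., <size units="page">295</size> pp.,
  <publisher-name>McGraw-Hill</publisher-name>,
  <publisher-loc>New York</publisher-loc>.
</mixed-citation>"""

# Same citation with edition moved in front of source.
BAD = GOOD.replace("<edition>2</edition>nd ed., ", "").replace(
    "<source>", "<edition>2</edition>nd ed., <source>")

print(conforms(GOOD), conforms(BAD))
```

Because the model allows a single `string-name` or `person-group` in the author slot, a citation that tags two authors as separate `string-name` elements would not match it; the model or the tagging convention would have to be adjusted to accommodate such cases.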

Lessons Learned

The main objective of using any schema language and any validation tool is to ensure data quality, integrity of markup, and control over your processes. We have implemented the AGU Production Tag Set, which is a superset of JPTS, and an extremely sophisticated Schematron that performs close to 200 checks validating the business rules, data types, and house style. The tag set we have developed is based on JPTS and can easily be mapped to it. While this implementation meets AGU's requirements and needs, if we had to develop a tag set for marking up AGU journals now, with our present know-how and toolkit, we would choose not to build a superset of JPTS; rather, we would create a subset by eliminating the structures irrelevant to the AGU publishing model and build a Schematron to compensate for the looseness of JPTS.

Schematron can provide the same degree of control over the data and markup that any DTD does, and even the most prescriptive (“Prussian”) DTD will still require a rules-based quality control tool anyway. One may therefore as well settle on JPTS, the de facto standard widely adopted by publishers, conversion shops, composition vendors, and archives, and check compliance with the various rules by implementing an application that uses Schematron or a comparable technology. Employing a Schematron in conjunction with a permissive (“Californian”) DTD shifts the validation: instead of relying mainly on an XML parser that checks DTD-expressed grammars, which are by design not very expressive, the bulk of the business-rule, data-type, and style-related validation is escalated to the next layer and performed by the Schematron engine.

This paradigm shift, sometimes referred to as “appropriate layer validation,” is not without cost, though. One has to be fully aware that XML that is perfectly valid against JPTS may still be semantically incorrect or make no sense at all. As a content producer, you become more dependent on running the additional quality control tool than you would have been otherwise. This also imposes an extra constraint on the business partners who help you massage or deliver content: you need to share your Schematron with them, and they must be capable of using that technology. It is also important to remember that Schematron does not “fix” data and markup: it only checks the rules and generates errors and warnings. It is people who must then use their intelligence to make corrections and run the Schematron again. As a result, having well-defined procedures becomes of paramount importance, but that is nothing new for most publishers.

It is worth noting, however, that while writing simple Schematron is not too difficult, building a complex and efficient one is no easy task: one has to meticulously elicit and carefully document the requirements; make sure that Schematron validation fits well into the existing workflow or modify the workflow, if necessary; modularize Schematron structure; ensure that various rules do not conflict with one another; optimize Schematron performance; and invest a lot of time and effort in testing the application. Not only that, to take full advantage of advanced Schematron features, the designer has to employ fairly sophisticated XSLT 2.0. Since it is inevitable that the requirements will be changing and evolving, a publisher has to develop and cultivate in-house expertise in Schematron and XSLT to maintain and adapt the application.

Finally, the question of tagging and validating publications that are not journal articles, such as books and book chapters, still remains. Since the NLM Archiving and Interchange Tag Suite currently does not contain a generic book model, publishers who need to mark up documents of different genres have no other way but to build a superset of JATS/JPTS. Only if this deficiency is addressed, will the NLM Archiving and Interchange Tag Suite, if it could speak for itself, be able to state with confidence, “Superset Me—Not!”

Acknowledgments

I gratefully acknowledge Wendell Piez, without whose intellectual and technical contributions this work would not have been possible. I would like to thank Debbie Lapeyre who first brought the concept of appropriate layer validation to my attention a few years ago. My thanks go to Tshawna Byerly for her copyediting suggestions. I am indebted to anonymous reviewers for their valuable critique and suggestions.

Copyright 2010 by Alexander B. Schwarzman.

The copyright holder grants the U.S. National Library of Medicine permission to archive and post a copy of this paper on the Journal Article Tag Suite Conference proceedings website.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License

Bookshelf ID: NBK47084
