NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Journal Article Tag Suite Conference (JATS-Con) Proceedings 2015 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2015.

Superimposing Business Rules on JATS

Mulberry Technologies, Inc.

Publishers are stuck between a rock and a hard place. They want to use JATS for interchange, and they have been told that XML can help them maintain content consistency and enforce their business rules, which JATS does not do. XML is not just for archives and interchange; it can save time, money, and effort in day-to-day production. We suggest a Schematron validation layer so that publishers can have it both ways without having to use multiple data models (a notion many publishers find confusing) or needing to transform their content on export (which many content creators find terrifying). Schematron need not be a single shotgun approach to a document, but can be run many times over the lifecycle of a document with differing requirements fitting different lifecycle stages.

When you decide to bring XML into a publishing environment, the first and most fundamental decision you have to make is: ‘whose XML is it?’. By that we don’t mean:

  • What model should you use? (JATS or DITA or TEI or …),
  • At what stage in the life cycle should the XML be created?,
  • When and by whom will Quality Assurance happen for the XML?, or even
  • What tools will be used to create, manage, and display the XML?

The answers to all of those questions depend on this fundamental question, but none of them are what we mean by ‘whose XML is it?’. The question is to what purpose and for whose benefit are you creating XML? When there are tradeoffs to be made among the various participants and stakeholders in your publishing environment, how are decisions to be made?

If your only goal is to have XML because someone in management put XML on your annual goals list, or to make XML to go into a repository that no one uses, you don’t need anything we are going to be talking about here. If the goal of your XML is to produce content that is easy for one system (archive, database, formatting engine, or whatever) to ingest and nothing else matters, you have one situation. If the goal of your XML is to reduce production costs, your decisions will be quite different. If the multiple goals of your XML include producing content that can be sent to several publishing partners and systems, publishing in several formats and media, and creating a searchable archive for making new products, then you have a substantially different situation, and a more complex one. And if your goals include improving your internal processes, providing management information on your documents, and improving the quality of your publications, the story changes yet again.

Why are we starting a paper about Schematron talking about basic principles? Because if, and only if, you want to experience some of the major benefits of XML in your publishing environment as part of your workflow, you may want to use Schematron. If your XML is produced long after your web and print publication, Schematron will not buy you much. But even if all your goals for XML concern the website, your business partners, mobile devices, and standards compliance, in other words, even if your XML is solely for others to do their processing on your data, you can still benefit from the advantages of XML for your work. If you want XML to help you improve the consistency and quality of your publications; if you want XML to help with management reporting and querying your content; if you want the QA only possible with richly tagged data, then you want the XML to be YOUR XML, not a partner or vendor’s XML. Then Schematron may open a world of possibilities.

Make the JATS Model into Your Model

Of course a publisher does not have to use Schematron to achieve these goals. In fact, some of the authors of this paper have given papers at previous JATS-Con events suggesting that an oft-employed method for publishers to improve article quality and enhance editorial workflow is via a subset of a JATS tag set. Subsetting means removing from ‘your’ version of the models things that are allowed in JATS but that you decide not to use. The JATS model is fairly non-enforcing (by design): almost no element or attribute is mandatory, there is very little required sequence, and some elements contain many large groups of other elements. However, the ‘love all, serve all’ philosophy of JATS does not suit a publisher wanting to automatically process articles for quality or publication readiness. Employing a subset allows a publisher to create an enforceable set of style rules that can be checked with an XML parser and demonstrated to be correct (all figures need titles, no lists are numbered). Elements or attributes that are optional in JATS can be made mandatory, and unused elements can be removed.

Why might you make a subset? So the models you work with are as simple as possible, and so that you won’t have accidental inconsistencies. A conversion vendor will never use the elements you do not want, and a downstream processor will never need to allow for them. And so that your documents will be totally unsurprising to partners who receive the XML. In fact, they won’t know you used a subset; all they will see are valid JATS documents.

And a few of you have taken our advice. A very few.

JATS Straight Out of the Box

Most of you are using one of the JATS DTDs as provided on the NLM JATS website. Although the JATS DTDs are designed to be customized, and the Tag Libraries give instructions on how to customize them, most users are not customizing JATS. Please don’t misunderstand us; there is nothing wrong with that, and we don’t want to imply that there is. But if that is what you are doing, you are not taking full advantage of your investment in XML.

Mulberry has had discussions with many of you who do not want to modify JATS. Some of you think that even a subset is somehow non-compliant. Some of you don’t want to spend money on models when there is a perfectly good one you can simply adopt. Some of you are being dictated to by vendors and/or business partners telling you that you must use one of the JATS models exactly as published. Some of you are intimidated by the complexity of having a model for internal use, a transform to convert documents in that model into a public model for use outside your environment, and the costs and staff requirements to build and maintain such a complex system. None of you want to be in the software business.

We understand. Publishing is a complex business with limited budgets and managers who are not and do not want to be programmers. We suggest to you that you can still use XML, and out-of-the-box JATS, to help you make your publications what you want them to be. You can use XML to enforce your business rules and to direct limited editorial attention to suspicious locations. How? Schematron!

Schematron Adds More QA Options

So a well-chosen subset might be your first line of defense, and you don’t have one. That is not insurmountable. Even with an exquisitely designed subset, not all problems can or should be solved at the modeling level. Some things can’t be stated in grammar-based languages like DTDs and Schemas, and many more things should not be made into rigid rules, but brought to the attention of a person for possible human intervention. (Yes, our Style Guide says we must do it exactly this way, but the Senior Editor with 40 years on this journal wants an exception this time; so we give it to her.)

A more flexible method for improving quality and shortening workflow is a Schematron-based architecture. Schematron is a rules-based Quality Assurance and reporting language designed for XML, written in XML (commands in the language are XML elements), and typically implemented under-the-hood as XSLT. Schematron has simple ways to let a publisher ask questions of a set of XML documents and write reports concerning circumstances of interest. Schematron allows testing for presence or absence of elements, attributes, and element combinations; specific narrative content; complicated cross connections between material; values against authority files, and much more. Since Schematron is just a reporting language, it can be run any time it would be useful; it does not change the XML documents it tests in any way. Schematron programs (rulesets) are written as a series of Rules, each of which states the context in which it will be performed (usually an element) and contains various tests against the XML data in that context.
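
As a minimal sketch of that structure (not taken from any publisher's ruleset), the schema below holds one pattern with one rule whose context is the JATS <fig> element. The pattern id, the message wording, and the choice of queryBinding="xslt2" are illustrative assumptions.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <title>Illustrative house-style checks</title>
  <pattern id="figures">
    <!-- The context selects the nodes to test: here, every <fig> element. -->
    <rule context="fig">
      <!-- assert: the message is reported when the test is false. -->
      <assert test="caption/title[normalize-space()]">
        Figure "<value-of select="@id"/>" has no caption title.
      </assert>
      <!-- report: the message is reported when the test is true. -->
      <report test="not(@id)">
        A figure without an @id cannot be the target of a cross-reference.
      </report>
    </rule>
  </pattern>
</schema>
```

Running this schema changes nothing in the article; it only produces the messages for figures that fail the tests.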

Because Schematron is a programming language, tests can be written to find violations of very complex business rules. For example, consider this editorial requirement from a journal publisher for constructing an article number:

        "Characters 5-6 of the article number must be equal to the article sequence in 
        its issue"

This rule is obviously specific to this publisher, possibly to one of their journals. A requirement such as this cannot be detected or enforced with an XML parser. In fact, many overworked editorial staff will miss violations of this rule as well. This is a perfect example of a rule that can be easily coded into a Schematron rule; the Schematron will unfailingly find every offending article. Schematron was written as a simple reporting language that can tell you when your business rules have been violated and let you decide if that violation matters at this time.

Business and Style Rules

What do we mean by business and style rules? The laws (both recorded in Style Guides and unwritten in years of common practice) that every publisher lives by to make their quality and their branding unique. Such rules may deal with style, substance, look and feel, or editorial practice:

  • All figures and tables must have captions not exceeding 50 characters.
  • Lab Report articles may have no more than six authors.
  • The number of the first page of an article must be less than the number of the last page. However, Last Word articles begin on the last page of the issue and may be continued on a previous inside page.
  • These 3 unrelated (optional) elements must be present in any Editorial.
  • An abstract must be a single paragraph.
  • Book and journal titles, both in text and in citations, must always be italicized.
  • All citations must be referenced.
  • If the volume number is preceded by the word ‘vol.’ the volume number is not to be italicized, otherwise volume number should appear in italics.
  • Every figure must be referenced at least once and not more than 5 times.
  • The volume must correspond to the year inside the publication date element.
  • The issue number must be an integer between 1 and 88.
  • The IDs (@id) of the tables values must be in sequential order of the tables in the article.
  • No duplicate keywords are allowed.
  • An article of type "introduction" must contain a section entitled ‘Foreword’, ‘Preface’, or ‘Introduction’ (unless there is a contributor whose role is ‘Feature Editor’)
  • There are twelve issues a year — except when there are thirteen, so issue numbers larger than 13 are in error and an issue number of 13 should be closely examined.
  • All publication dates must be valid ISO dates, unless the season element is present.
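
Several of the rules in this list translate almost directly into Schematron. The sketch below covers five of them. It assumes JATS element names, an article-type value of 'lab-report', authors marked with contrib-type="author", and an XPath 2.0 query binding; all of these are illustrative choices rather than anything prescribed by the rules themselves.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <pattern id="house-rules">
    <!-- Lab Report articles may have no more than six authors.
         The value 'lab-report' is an assumed house convention. -->
    <rule context="article[@article-type = 'lab-report']/front/article-meta">
      <let name="authors"
           value="count(contrib-group/contrib[@contrib-type = 'author'])"/>
      <assert test="$authors le 6">
        Lab Report articles may have at most six authors; this one has
        <value-of select="$authors"/>.
      </assert>
    </rule>
    <!-- An abstract must be a single paragraph. -->
    <rule context="article-meta/abstract">
      <assert test="count(p) = 1 and not(sec)">
        The abstract must be a single paragraph; it contains
        <value-of select="count(.//p)"/> paragraph(s).
      </assert>
    </rule>
    <!-- The issue number must be an integer between 1 and 88. -->
    <rule context="article-meta/issue">
      <assert test="matches(normalize-space(.), '^[1-9][0-9]?$')
                    and number(normalize-space(.)) le 88">
        Issue "<value-of select="."/>" is not an integer between 1 and 88.
      </assert>
    </rule>
    <!-- No duplicate keywords (compared case-insensitively within a group). -->
    <rule context="kwd-group/kwd">
      <report test="lower-case(normalize-space(.)) =
                    preceding-sibling::kwd/lower-case(normalize-space(.))">
        Duplicate keyword "<value-of select="normalize-space(.)"/>".
      </report>
    </rule>
    <!-- Every figure must be referenced at least once and at most 5 times. -->
    <rule context="fig[@id]">
      <let name="refs"
           value="count(//xref[tokenize(normalize-space(@rid), ' ') = current()/@id])"/>
      <assert test="$refs ge 1 and $refs le 5">
        Figure "<value-of select="@id"/>" is referenced
        <value-of select="$refs"/> time(s); expected 1 to 5.
      </assert>
    </rule>
  </pattern>
</schema>
```

Rules with exceptions that only an editor can judge, such as the Last Word page numbering, are better written as warnings than as hard errors.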

House style and business rules are typically enforced with a combination of JATS DTD modifications (super- or subsets of the existing tag library), conversion vendor guidelines, single-process automations (run the process that checks all the graphics sizes), and with meticulous copy-editing. Fixing errors becomes progressively more expensive as the articles move toward publication. When the article is on the website is not an ideal time to find out that a style guideline has been violated or that the figure references could not be made into live links.

Schematron: The Joy of Your Own Error Messages

There is also a far from critical, but very real, infelicity of subsets and supersets that Schematron can help with: if you change the JATS model, your errors become validation errors, found by an XML parser. Validation errors have downsides of their own. Many XML processing systems will not deal with invalid documents; the publisher doesn’t have the option of breaking one of their own rules at this time, the software lays down the law. And model validation error messages are generated by the underlying parser, so they are written in XMLese and are often less than helpful, especially to workers unfamiliar with the nuances of parser messages.

The publisher (not the parser developer) writes the text of the messages reported by a Schematron program, so the errors can be reported in the language and even the jargon of the target audience. Because Schematron messages are custom, they can also be as specific and detailed as desired. We like to say that, ‘Schematron messages are the best in the world, because you write them.’

Here is a Schematron error message for the requirement ‘Characters 5-6 of the article number must be equal to the article sequence in its issue’. The error message might look something like this:

        Rule[Artid-5-6] The <article-id> chars 5 and 6 ARE "12" and
        should be "09" based on the article's sequence number assignment.

Notice that the text of the error message is built from data (such as the element name from the source document, the actual fifth and sixth characters in the article identifier, and a portion of the article’s real sequence number). The Schematron ruleset can gather up the information, store it in variables for later use, and serve it forth as needed. The message provides a precise description of the offending element and even maps the violation back to a specific rule (Artid-5-6), which can presumably be found in the Style Guide or requirements document. This provides an exact rationale, making it simple for a member of the production or QA team to clearly understand the problem and fix the source document.
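
A rule that produces a message of this shape might be sketched as follows. The paper does not say where the article number or the sequence number is stored, so both locations are assumptions made for illustration: the article number is taken from <article-id pub-id-type="publisher-id">, and the sequence from a hypothetical <custom-meta> entry named 'article-sequence'.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <pattern id="article-number">
    <rule context="article-meta/article-id[@pub-id-type = 'publisher-id']">
      <!-- Characters 5 and 6 of the article number as actually tagged. -->
      <let name="actual" value="substring(normalize-space(.), 5, 2)"/>
      <!-- ASSUMPTION: the sequence lives in a custom-meta entry. -->
      <let name="seq"
           value="normalize-space(../custom-meta-group/custom-meta
                  [meta-name = 'article-sequence'][1]/meta-value)"/>
      <!-- Zero-pad the sequence to two digits, so that 9 becomes "09". -->
      <let name="expected" value="format-number(number($seq), '00')"/>
      <assert test="$seq != ''">
        No article sequence number was found, so rule Artid-5-6 cannot be checked.
      </assert>
      <assert id="Artid-5-6" test="$actual = $expected">
        Rule[Artid-5-6] The &lt;<name/>&gt; chars 5 and 6 ARE
        "<value-of select="$actual"/>" and should be
        "<value-of select="$expected"/>" based on the article's
        sequence number assignment.
      </assert>
    </rule>
  </pattern>
</schema>
```

The <let> elements gather the data, and <name/> and <value-of> serve it forth inside the message text.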

(DTD or XSD or RNG) Plus Schematron

Schematron can be used in addition to validation from the DTD, RNG (RELAX NG Schema), or XSD (W3C XML Schema) a publisher is running, especially in a ‘layered’ architecture in which Schematron is applied in addition to a schema and at various different stages of the workflow. In a paper presented at JATS-Con 2010, Alexander Schwarzman observed that:

an electronic publisher may want to consider using a generic tag set and shifting the burden of validation from the XML parser to the more appropriate layer, such as the Schematron engine, which will perform the majority of required checks. [5]

In other words, if JATS is too loose a model to support your business rules, let Schematron do all the validation. A bit radical? Perhaps, but with useful hints for real world implementation. Use Schematron to check all those rules you would have made if you had made your own subset. Only allow affiliations inside <contrib-group>, not in any of the many other places JATS allows them. Make the copyright holder required. Make sure figure groups and floats groups are never used. Make the JATS yours through additional model checking through Schematron.
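
The three subset-like constraints just mentioned could be sketched as below. The reading of ‘inside <contrib-group>’ as ‘a direct child of <contrib-group>’ is an assumption, as are the pattern id and the message wording.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <pattern id="subset-like-checks">
    <!-- Affiliations only inside <contrib-group> (read here as direct children). -->
    <rule context="aff">
      <assert test="parent::contrib-group">
        Affiliations are allowed only inside &lt;contrib-group&gt;; this one
        is inside &lt;<value-of select="name(..)"/>&gt;.
      </assert>
    </rule>
    <!-- The copyright holder is optional in JATS but required here. -->
    <rule context="article-meta">
      <assert test="permissions/copyright-holder[normalize-space()]">
        The copyright holder is required and must not be empty.
      </assert>
    </rule>
    <!-- Elements the house style never uses: any occurrence is reported. -->
    <rule context="fig-group | floats-group">
      <report test="true()">
        House style does not use &lt;<name/>&gt;.
      </report>
    </rule>
  </pattern>
</schema>
```

The documents stay valid against unmodified JATS; only the Schematron report changes.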

Schematron can be used to check many things a DTD or XSD might have checked. You might make the presence test part of a series of tests which include requiring that an element have content (which DTDs cannot do, although other types of schemas can) and what kind of content it is allowed to contain.

In addition to its own special types of testing, Schematron can be used to add some of what XSD and RNG can do to the world of DTDs. For example, Schematron can check the presence of content in addition to the presence of an element. Schematron can also test patterns in that content, governed by regular expressions or the presence of key phrases. A number of publishers and aggregators use Schematron to add datatype checking to an element or attribute (typically but not always W3C XSD datatypes), since DTDs do not support strong data typing. For example, a URI could be checked against the XSD simple datatype anyURI, to see if making a live URI link would be possible.
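
A sketch of these three kinds of check (content presence, datatype, and pattern) follows. It assumes that publication dates carry the JATS @iso-8601-date attribute, and the URI pattern is a deliberately simple stand-in for real URI validation.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <ns prefix="xs" uri="http://www.w3.org/2001/XMLSchema"/>
  <ns prefix="xlink" uri="http://www.w3.org/1999/xlink"/>
  <pattern id="content-and-datatypes">
    <!-- Content presence: a DTD can require the element but not its content. -->
    <rule context="article-title">
      <assert test="normalize-space(.) != ''">
        The article title must not be empty.
      </assert>
    </rule>
    <!-- Datatype: the value must be castable to the XSD date type. -->
    <rule context="pub-date[not(season)]">
      <assert test="@iso-8601-date castable as xs:date">
        Publication date "<value-of select="@iso-8601-date"/>" is missing
        or is not a valid ISO date.
      </assert>
    </rule>
    <!-- Pattern: a regular expression applied to an attribute value. -->
    <rule context="ext-link[@ext-link-type = 'uri']">
      <assert test="matches(@xlink:href, '^https?://\S+$')">
        Link target "<value-of select="@xlink:href"/>" does not look like
        a web address, so a live link may not be possible.
      </assert>
    </rule>
  </pattern>
</schema>
```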

Case Study: ISO (DTD+ and Validations Beyond XML)

The ISO, the International Organization for Standardization, develops and publishes International Standards on topics from food safety to country codes, with publications covering quality (ISO 9000), energy, risk management, and many more heavy-duty subject areas.

The ISO Standards Tag Set (ISOSTS) is a subset/superset of JATS Publishing and is checked using the ISO Standards Tag Set Published Schematron, which is one very long (nearly 800 lines) Schematron program module that includes especially-written XSLT functions. The Schematron ruleset is freely and publicly available for download at: http://www.iso.org/schema/isosts/resources/schematron/ISOSTS_validation.sch

Within the Schematron ruleset, tests are grouped into Rules and given unique identifiers (META_1, META_2, XREF, FN) so they can be tied back to requirements. The ISOSTS Schematron tests are used during production to ensure that the ISO rules (much tighter than JATS) are followed. Some of these tests could have been accomplished in their subset/superset DTD, but only at the risk of making it complex, rigid, and difficult to maintain. The ISO Schematron rules:

  • Check for the presence of elements (requiring elements that are optional in ISOSTS/JATS);
  • Check for empty elements (DTDs cannot require content, but ISO can. For example, the article metadata element <copyright-holder> is required and must not be empty.);
  • Count elements in context (where only an editor knows if this is an error, for example, does the front matter contain more than one section of type ‘foreword’ or of type ‘intro’);
  • Test elements based on their ancestors (to limit where certain types of sections are allowed, for example, a <body> element (unless inside a subpart) should contain only a single ‘scope’ section);
  • Test that each section has an identifier (attribute @id) except within boxed-text or within sections that are not part of the main Standard structure (such as inside corrigenda or prolog notes);
  • Make sure that references to standards (inside the <std> element) are directly inside the reference list and not inside a mixed citation, and
  • Check many, many more business rules.

Some of the ISO rules test production issues outside the realm of XML modeling. For example, ISO has a complicated scheme for determining the identifiers (attribute @id) of sections, tables, figures, etc. Their naming scheme is checked using Regular Expressions through Schematron. Naming schemes are not part of a DTD, or even related to XML, but they are important to document production and management, and they can be checked by Schematron.
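
The sketch below illustrates this kind of identifier check. It is not copied from the ISOSTS ruleset: the rule identifiers and the ‘sec_’ naming pattern are hypothetical stand-ins for the real scheme, and only the boxed-text exemption described above is modeled.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <pattern id="section-ids">
    <!-- Sections inside boxed text are exempt from the identifier rules. -->
    <rule context="sec[not(ancestor::boxed-text)]">
      <assert id="SEC_ID_1" test="@id">
        [SEC_ID_1] Every section outside boxed text must have an @id.
      </assert>
      <!-- HYPOTHETICAL naming scheme: 'sec_' followed by letters, digits, or dots. -->
      <assert id="SEC_ID_2"
              test="not(@id) or matches(@id, '^sec_[0-9A-Za-z.]+$')">
        [SEC_ID_2] Section id "<value-of select="@id"/>" does not follow
        the naming scheme.
      </assert>
    </rule>
  </pattern>
</schema>
```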

Case Study: SAGE Publishing (Errors/Warnings/and Phases)

At this conference in 2012, Julie Blair of Sage Publications presented a paper on the SAGE Publishing Schematron System. [2] SAGE uses Schematron (and supplies a version to their typesetters) to achieve a 29% reduction in errors prior to online publication.

Blair’s case study points to several useful suggestions for implementing Schematron.

First: how did SAGE decide on which Schematron rules to write? They wrote rules based on their typesetting encoding guidelines. Then (in cooperation with HighWire Press, which hosts their journals platform) they collected more than eight months of issue error reports and wrote Schematron rules to catch those errors.

Second: SAGE used the role facility of Schematron to mark some messages as errors and others as warnings. Issue files that contain errors will not be ingested in their CMS system. Issue files with warnings will offer the user a choice to add this to the CMS, or not.

Third: Because different rules were needed for different types of content, but SAGE wanted a single Schematron program module for their CMS, they made use of a facility in Schematron called ‘Phases’. A phase is a collection of Schematron rules that can be executed together, based on some condition such as document lifecycle or type of content. Since phases merely point to existing patterns, the same rules can be activated for several phases when appropriate.

A small sampling of SAGE Schematron checks includes the following.

  • Warning: Presence of spaces within a surname (which may indicate that some of the given names have been incorrectly tagged or that there really is a multipart surname. Only an editor knows for sure.)
  • Error: Email address contains whitespace.
  • Error: Invalid types of author notes (based on the attribute @fn-type).
  • Error: All lists must have a @list-type attribute.
  • Error: A related article must contain either a DOI or a volume/first page combination.
  • Warning: A rule checks whether a section title has been ‘faked’ with a paragraph and bold emphasis. (This is only a warning because there may legitimately be bold at the beginning of a paragraph.)
  • Warning: Article titles display in all caps, but should be mixed case in the XML. (This is a warning not an error because some titles contain nothing but acronyms and other upper case words.)
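
The role and phase facilities can be sketched with three of the checks above. This is not SAGE's code: the phase names, pattern ids, and message wording are assumptions chosen for illustration.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron"
        queryBinding="xslt2" defaultPhase="#ALL">
  <!-- Phases point at existing patterns, so one pattern can serve several. -->
  <phase id="research-article">
    <active pattern="names"/>
    <active pattern="lists"/>
  </phase>
  <phase id="book-review">
    <active pattern="names"/>
  </phase>
  <pattern id="names">
    <rule context="surname">
      <!-- Warning only: multipart surnames are legitimate. -->
      <report role="warning" test="matches(normalize-space(.), '\s')">
        Surname "<value-of select="."/>" contains a space; check that no
        given names have been tagged as part of the surname.
      </report>
    </rule>
    <rule context="email">
      <assert role="error" test="not(matches(., '\s'))">
        Email address "<value-of select="."/>" contains whitespace.
      </assert>
    </rule>
  </pattern>
  <pattern id="lists">
    <rule context="list">
      <assert role="error" test="@list-type">
        All lists must have a @list-type attribute.
      </assert>
    </rule>
  </pattern>
</schema>
```

The calling system selects the phase at run time and decides what each role means, for example refusing ingest on ‘error’ and prompting the user on ‘warning’.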

Schematron at Varying Phases in the Lifecycle

The rules that apply to a manuscript or a preprint are not necessarily the same rules that apply moments before online publication. A publisher might have, for example, an initial Schematron and a final Schematron, or even three or more times in the lifecycle when rules are checked, with different rules for each.

Early in the editorial process, much of an article’s metadata is unknown. Elements that pertain to publication details, such as volume, issue, page numbers, etc. are not applied until quite late in the publishing stream, so there is no point in checking rules about this metadata at the manuscript or initial copy-edit stage. Other metadata might be being checked by the legal team and be present, but not necessarily final early in the production process. However, even at an early stage, an article should conform to many style and business rules. Hence, a first-phase ‘loose’ Schematron could check large-grain style and business rules, but could allow blank or temporary metadata values. As an article wends through the workflow, further editorial changes are made and metadata is added. At this stage, more restrictive Schematron checks can be applied, both rechecking the original rules applied in the first pass (to detect changes possibly introduced in the editorial flow) and making new tests concerning which metadata rules are applied. A final, more restrictive validation can be applied just before publication to enforce publisher-specific or journal-specific metadata rules.

An additional advantage of this approach is that the manuscript becomes progressively ‘more correct’ as it travels through the editorial flow. Many publishers make the mistake of applying Schematron rules only very late in the workflow. The cost of finding and fixing mistakes grows (almost exponentially) with each step in production. What happens if a paper type that allows only five authors is found to have seven authors two days before publication? An expensive fix and unhappy authors and editors.

A typical Schematron architecture may introduce validation at several points in the editorial flow:

  • The first point may be when the documents are first received from the conversion vendor or any point where the article is considered the ‘initial’ XML. At XML origination, many facts about the publication will not be known. So any validation at this point will check only the larger document structure and essential elements. Do all sections have titles? Is the abstract a single paragraph? Do all figures have captions? Etc. This validation serves as the starting point for copyediting and manuscript preparation. An enlightened approach (such as that described by SAGE in the Case Study mentioned above) might be to provide the conversion vendor with the Schematron and require that they not deliver any articles that did not pass these initial validations. This places the burden of detecting conversion errors on the service provider, where the responsibility rightly resides. Finding conversion or initial XML coding errors is not a productive use of a publisher’s time. The publisher’s value-add is in applying business and style rules, not in finding missing section titles.
  • A richer Schematron ruleset could be run on the same documents after copyediting. At this point all of the initial rules would be rechecked and additional rules concerning use of lists, presence of DOIs in citations, and journal-specific rules can be checked.
  • After the articles have been integrated into a specific issue of a journal, additional rules for journal titles and abbreviations, volume and issue identifiers, and even page count or starting page number could be checked.
  • Another Schematron ruleset might be run just prior to submission to PubMed Central, to account for their special requirements.
  • There might be different Schematron checks for web publishing versus PDA publishing.
  • There might be accessibility checks before producing an eBook article, not needed for print.
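
One way to arrange such staged validation in a single ruleset is sketched below, with an ‘initial’ and a ‘final’ stage. The stage names and the split of rules between them are assumptions; separate rulesets per stage would work equally well.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron"
        queryBinding="xslt2" defaultPhase="initial">
  <!-- Initial XML: only large-grain structure is checked. -->
  <phase id="initial">
    <active pattern="structure"/>
  </phase>
  <!-- Before publication: structure is rechecked and metadata is added. -->
  <phase id="final">
    <active pattern="structure"/>
    <active pattern="issue-metadata"/>
  </phase>
  <pattern id="structure">
    <rule context="sec">
      <assert test="title[normalize-space()]">
        Every section must have a title.
      </assert>
    </rule>
    <rule context="fig">
      <assert test="caption">Every figure must have a caption.</assert>
    </rule>
  </pattern>
  <pattern id="issue-metadata">
    <rule context="article-meta">
      <assert test="volume[normalize-space()]">The volume is missing.</assert>
      <assert test="issue[normalize-space()]">The issue is missing.</assert>
      <assert test="fpage or elocation-id">
        Neither a first page nor an electronic location is present.
      </assert>
    </rule>
  </pattern>
</schema>
```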

Case Study: Modular Schematron

Requirements

A large publisher of hard science journals (we will call them ‘Publisher X’) has a long litany of editorial, style, and business rules. These rules have grown over decades of publication and are set by the society as well as by editors. These rules can be very complex, even for human agents to check. For example, the business rule requirements include such tests as:

  • A <mixed-citation>’s @citation-type attribute must be one of the types specified in the references-authority table.
  • There are character limits for page number, volume number, issue number, given names, and name suffix.
  • The content model for <mixed-citation> depends on the @citation-type attribute value, so the citation must have the content elements required for that named citation type.
  • The <label> inside a <mixed-citation> must not contain a period.
  • The <conf-name> element in a citation must not contain the literal string ‘presented’ (case insensitive).
  • The <year> element inside <mixed-citation> must conform to one of the following patterns: YYYY; YYYY-YYYY; YYYY-YY; YYYY, YYYY; or YYYY/YYYY.
  • Elements appearing as children of <mixed-citation> may not be empty.
  • And (to illustrate the complexity) if the reference list contains more than 3 citations to journal articles that have been published after 1950, issue a warning when not even one of the journal citations contains a DOI.
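
Four of these requirements are sketched below; this is not Publisher X's code. The citation-type value 'journal', the pattern ids, and the messages are assumptions. Because a node is tested by at most one rule within a pattern, the general emptiness check is placed in its own pattern.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <pattern id="citation-parts">
    <rule context="mixed-citation/year">
      <!-- Accepts YYYY, YYYY-YYYY, YYYY-YY, "YYYY, YYYY", and YYYY/YYYY. -->
      <assert test="matches(normalize-space(.),
                    '^\d{4}(-\d{4}|-\d{2}|, \d{4}|/\d{4})?$')">
        Year "<value-of select="."/>" does not match an allowed pattern.
      </assert>
    </rule>
    <rule context="mixed-citation//conf-name">
      <!-- The 'i' flag makes the match case insensitive. -->
      <report test="matches(., 'presented', 'i')">
        The conference name must not contain the word "presented".
      </report>
    </rule>
  </pattern>
  <pattern id="citation-empty">
    <rule context="mixed-citation/*">
      <assert test="normalize-space(.) != ''">
        Element &lt;<name/>&gt; inside a citation must not be empty.
      </assert>
    </rule>
  </pattern>
  <pattern id="citation-dois">
    <rule context="ref-list">
      <!-- Journal citations published after 1950 in this reference list. -->
      <let name="recent"
           value=".//mixed-citation[@citation-type = 'journal']
                  [number(year[1]) gt 1950]"/>
      <report role="warning"
              test="count($recent) gt 3
                    and not($recent/pub-id[@pub-id-type = 'doi'])">
        None of the <value-of select="count($recent)"/> journal citations
        published after 1950 contains a DOI.
      </report>
    </rule>
  </pattern>
</schema>
```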

Modular Schematron Suite Layered Over a DTD

The number of rules, their complexity, and the variation among journals argued against putting the rules into a DTD or Schema. Schematron was the obvious choice to create a separate validation layer.

To enable running Schematron at several places in the lifecycle as well as to enforce different rules on a per journal basis, Publisher X developed a modular architecture with plug-in construction. In their Schematron Suite, each business or style rule lives in its own file. Their modular Schematron system is composed of a set of ‘driver’ Schematron files that use the ‘include’ mechanism to pull in the appropriate tests. With this plug-in architecture, common business and style rules are written once, centrally maintained and easily incorporated into multiple workflows. Not all publications need to use the same rules, as each can include only those tests that are relevant.

Because of the modular construction, each test suite Schematron program is merely a collection of included modules as shown below.

        <!-- names.sch - Validates personal name components and related elements -->
        <include href="lib/names.sch"/>
 
        <!-- dates.sch - Constraints over the use of constituents of dates -->
        <include href="lib/dates.sch"/>
 
        <!-- common-final.sch - Common final phase checking for all articles -->
        <include href="lib/common-final.sch"/>

For example, the check on article types is a single module, as are rules dealing with figures, and rules for tables. The Schematron repository includes one module for testing book references, a different module for testing journal references, and yet another module for other types of references (non-article and non-book). Within the modules, similar rules are collected into families that correspond to modules. For example, the three reference testing modules just described have been included into a ‘reference-list’ checking suite which then becomes part of a larger ‘back matter’ rule checking engine. Finally, the back matter rule checking engine can be run in either a loose or strict mode. Such a modular approach can be easily adapted to nearly any foreseeable workflow.
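
The driver fragment above does not show what a module contains. In ISO Schematron, an include is replaced by the document element of the referenced file, so a module included at the top level of a schema would typically have a <pattern> as its root. The following is a guess at what a file such as lib/names.sch might hold; the character limits are invented for illustration.

```xml
<!-- lib/names.sch: hypothetical contents of one included module -->
<pattern xmlns="http://purl.oclc.org/dsdl/schematron" id="names">
  <rule context="given-names">
    <let name="len" value="string-length(normalize-space(.))"/>
    <!-- The limit of 40 characters is an invented example value. -->
    <assert test="$len le 40">
      Given names are <value-of select="$len"/> characters long;
      the limit is 40.
    </assert>
  </rule>
  <rule context="suffix">
    <!-- The limit of 10 characters is an invented example value. -->
    <assert test="string-length(normalize-space(.)) le 10">
      Name suffix "<value-of select="."/>" exceeds 10 characters.
    </assert>
  </rule>
</pattern>
```

A driver then needs nothing more than a <schema> wrapper around the includes it has chosen.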

Publisher X created separate initial and final Schematron suites of tests. In total, Publisher X has just under ten top-level Schematron schemas, calling nearly 25 modules. The initial journal article validation has 18 modules; the final journal article validation Schematron includes 21 modules, and the Paper-in-Press validation contains 13 modules. Each module in the Suite consists of a number of rules, each of which contains many tests. So each complete test suite is simply a collection of included modules relevant to the desired product or workflow. In the Publisher’s opinion, this architecture provides a library for code reuse, centralized maintenance, and an easily understandable ‘one test, one file’ maintenance regime.

Schematron Rules

The Schematron rules in this Suite are more complex than those in the other case studies, largely through the addition of documentation elements, so different results can be accomplished using them. Publisher X developed a strict design philosophy in order to make the Schematron more easily manageable.

  • Each Schematron check enforces a single rule and can be directly linked to a house style or business practice.
  • Each rule has a unique identifier ("JATM12") that includes a rule family code (JATM) and a unique rule number (12).
  • Each rule includes elements in a different namespace that contain documentation. (The Schematron can be used to document itself!). One element is a statement of the requirement, which provides traceability to a requirement in the house style or business practice, and one element provides a lower-level technical description of the requirement in such a way that can translate directly into a Schematron statement.
  • One Schematron command provides a message that names the rule under consideration so that each rule can be logged.
  • The main Schematron command (the testing part of the rule) checks the requirement and provides the error message, which names the violated rule and provides a description of the exact problem.
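
A rule built to this design might look like the sketch below. The documentation namespace URI, the element names doc:requirement and doc:technical, and the pairing of the identifier JATM12 with the label rule are all hypothetical; the paper does not show Publisher X's actual markup.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron"
        xmlns:doc="urn:example:house-style-doc"
        queryBinding="xslt2">
  <pattern id="JATM">
    <rule context="mixed-citation/label">
      <!-- Documentation in a foreign namespace, ignored during validation. -->
      <doc:requirement>
        The label inside a mixed-citation must not contain a period.
      </doc:requirement>
      <doc:technical>
        The string value of mixed-citation/label does not contain ".".
      </doc:technical>
      <!-- Names the rule being applied, so that each check can be logged. -->
      <report role="info" test="true()">Checking rule JATM12</report>
      <!-- The test itself, with a message naming the violated rule. -->
      <assert id="JATM12" role="error" test="not(contains(., '.'))">
        [JATM12] The citation label "<value-of select="."/>" contains
        a period.
      </assert>
    </rule>
  </pattern>
</schema>
```

Because the documentation travels with the rule, a stylesheet could extract the doc: elements to produce a human-readable rule catalogue.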

Implementing Schematron

A major hurdle to building a Schematron architecture such as any of the ones described is the familiar mismatch between editorial staff and technology staff. How do you get your Style Guide translated into Schematron rules?

Case Study: A fine solution to this problem is to have the editorial staff write and maintain the error messages used to construct the schema modules, as part of the style guides. A large aircraft and armament manufacturer has several thousand Schematron rules that work across their maintenance documentation. The rule specifications live in large spreadsheets, maintained by editorial. Each rule is numbered; described in native language; associated with a part, procedure, or task description; the text of the Schematron message is given; and a link is provided to the remedy that must be performed (both what and by whom) when that message is viewed. The programmer merely copies the message and writes code that will deliver that kind of error report. Both parties like this system very much, although the unit testing is a bit extreme.

Another potential source of conflict is keeping rules up to date. The editorial and production people who write the Style Guide and determine the business rules are not the people who write and maintain the Schematron tests, so keeping the Style Guide and related documentation equivalent to the Schematron rule descriptions is a constant problem. Often a rule or style changes and the change is not reflected in the Schematron descriptions; conversely, a Schematron ruleset is rewritten but the change never makes it back to the Style Guide document, not even to its online version.

Case Study: One promising approach to this conundrum is described in George Bina’s paper ‘Schematron for Information Architects’ delivered at XML Prague 2015. [1] Because a Style Guide includes all of the information needed for a Schematron pattern, it is possible to use a machine-readable Style Guide source to create a Schematron ruleset with the pattern instantiations. An Open Source project called Dynamic Information Model (DIM) demonstrates such an implementation (see http://www.github.com/oxygenxml/dim). With this tool, XSLT transformation scripts extract the actual Schematron schema from a DITA source document. Because the Style Guide and the Schematron rules live in a single document, the Style Guide prose and the Schematron rules cannot get out of sync. The DIM approach uses DITA as the document model, but in theory any other model could be used, perhaps even JATS itself. A Style Guide written in JATS (or BITS) could then be used to generate Schematron rules to apply to other JATS documents. We look forward to seeing further work and other implementations of the DIM solution.

Conclusion

Of course a publisher who wants good XML QA does not need to use Schematron. There are many other methods. Once you have XML-tagged data, opportunities open up. Searching within the document need not be just a simple search for a word or phrase in text; searching can use the power of XPath. XPath (the XML language for walking the tree structure of your XML document) lets you search in context. Using XPath, you can find tables that are inside footnotes or find out how many abstracts contain more than one paragraph. XQuery is a SQL-like query language that provides the means to extract and manipulate data from XML documents. XML databases allow you to use both XPath and the more complex XQuery, but even many XML editors let you do a fast XPath or XQuery search. Yes, these techniques require some XML knowledge and may not be suitable for all copyeditors.

Even apart from searching, there are many other quick QA techniques that use XPath, XQuery, or XSLT (the XML transformation language). With only a little very simple programming, you can make a formatted list of any tagged element, or of any element with a certain attribute combination (like a table of figures, but a list of any element you wish to see). You can use a stylesheet to create a false-color proof, turning the object of your interest bright pink or dark blue in a formatted display created to enhance proofing. These techniques are good; please use them.
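The two searches mentioned above (tables inside footnotes, and abstracts with more than one paragraph) can be sketched in a few lines. The fragment below is a toy, not a valid JATS article, and the `id` values are invented. Python's standard library supports only a subset of XPath, so the full XPath expression one would type into an XML editor is shown in a comment beside each search.

```python
import xml.etree.ElementTree as ET

# Toy fragment using JATS element names; not a complete or valid article.
ARTICLE = """
<article>
  <front><article-meta>
    <abstract><p>Background.</p><p>Results.</p></abstract>
    <abstract abstract-type="short"><p>One paragraph only.</p></abstract>
  </article-meta></front>
  <body>
    <p>Text<fn id="fn1"><p>Note <table-wrap id="t1"/></p></fn></p>
    <table-wrap id="t2"/>
  </body>
</article>
"""

root = ET.fromstring(ARTICLE)

# Full XPath: //fn//table-wrap
tables_in_footnotes = [t.get("id") for t in root.findall(".//fn//table-wrap")]

# Full XPath: count(//abstract[count(p) > 1])
long_abstracts = sum(
    1 for abstract in root.findall(".//abstract") if len(abstract.findall("p")) > 1
)

if __name__ == "__main__":
    print("Tables inside footnotes:", tables_in_footnotes)
    print("Abstracts with more than one paragraph:", long_abstracts)
```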

We ask that you also consider Schematron.

For some of you, XML or JATS-flavored XML belongs entirely to a publishing vendor or partner. You get your manuscripts in Microsoft Word, you review, select, and edit them in Word. You send the Word files to a vendor who does … something … you don’t know or care what … and who then sends PDF to your printer, HTML to your database spinner, and JATS to your archive. How do you QC this material? The print looks good or you yell at them. The HTML is OK unless the article’s author calls you and fusses about it. The XML is OK if the archive doesn’t bounce it back — and if they do they can talk directly to your vendor. That XML ‘belongs’ to the archive, not to you the publisher. It costs you money, it meets your goal of getting your content into the archive, but it isn’t helping you publish the content you want to publish.

On the other hand, you could take the Word files you receive, convert them to XML, review drafts produced from that XML, edit the XML, and produce PDF for print, HTML for web display, and XML for your publishing partners from that XML. And, at each stage in the lifecycle, you can have Schematron alert you to weirdnesses in the documents and violations of your business rules. This is a far more disruptive scenario than simply having people you don’t know at some remote vendor site make a mysterious file format, but it can benefit you.

Even if you cannot go to an XML-first publishing style, at some point you will receive XML. Try to make sure that point comes before you make the print, web, PDA, access-enabled eBook, and archival forms of your article. This will give you the opportunity to identify and fix problems in your documents before final publishing, and it will put some powerful technology to work helping you do what you do. As Julie Blair of Sage expressed it: ‘Using Schematron can transform an XML workflow that is just getting by, to one that thrives and works for you.’ [2]

XML that is working for you ‘belongs’ to you. Take ownership.

References

1. Bina G. ‘Schematron for Information Architects.’ At: XML Prague 2015 [Internet]; 2015 Feb 13-15; Prague, Czech Republic. Available from: http://www.xmlprague.cz/sessions2015/#sch
2. Blair J. Developing a Schematron–Owning Your Content Markup: A Case Study. In: Journal Article Tag Suite Conference (JATS-Con) Proceedings 2012 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2012. Available from: http://www.ncbi.nlm.nih.gov/books/NBK100373/
3. International standard ISO/IEC 19757-3:2006: Information technology — Document Schema Definition Languages (DSDL) — Part 3: Rule-based validation — Schematron. First edition. 2006 Jun 1. Available from: http://www.iso.org/PubliclyAvailableStandards
4. Lapeyre DA. Why Create a Subset of a Public Tag Set. In: Journal Article Tag Suite Conference (JATS-Con) Proceedings 2010 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2010. Available from: http://www.ncbi.nlm.nih.gov/books/NBK47099/
5. Schwarzman AB. Superset Me—Not: Why the Journal Publishing Tag Set Is Sufficient if You Use Appropriate Layer Validation. In: Journal Article Tag Suite Conference (JATS-Con) Proceedings 2010 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2010. Available from: http://www.ncbi.nlm.nih.gov/books/NBK47084/
6. Usdin T. When the ‘One Size Fits Most’ tagset doesn’t fit you. In: Journal Article Tag Suite Conference (JATS-Con) Proceedings 2013/2014 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2014. Available from: http://www.ncbi.nlm.nih.gov/books/NBK195189/

Additional Reading

1. Klímek J, Benda S, Nečaský M. ‘Translation of Structural Constraints from Conceptual Model for XML to Schematron.’ Journal of Universal Computer Science, vol. 20, no. 3 (2014), 277-301. doi:10.3217/jucs-020-03-0277 (Naming and Design Rules (NDR) as a special case of business rules)
2. Lubell J. ‘Documenting and Implementing Guidelines with Schematron.’ At: Balisage: The Markup Conference 2009; 2009 Aug 11-14; Montréal, Canada. In: Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3; 2009. doi:10.4242/BalisageVol3.Lubell01

©Copyright 2015 Mulberry Technologies, Inc.

The copyright holder grants the U.S. National Library of Medicine permission to archive and post a copy of this paper on the Journal Article Tag Suite Conference proceedings website.

Bookshelf ID: NBK279902