A Quality Assurance Tool for JATS/BITS with Schematron and HTML reporting

Martin Kraetke; Franziska Bühring

NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Journal Article Tag Suite Conference (JATS-Con) Proceedings 2016 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2016.

Cover of Journal Article Tag Suite Conference (JATS-Con) Proceedings 2016

Journal Article Tag Suite Conference (JATS-Con) Proceedings 2016 [Internet].

Show details

Contents

A Quality Assurance Tool for JATS/BITS with Schematron and HTML reporting

Martin Kraetke and Franziska Bühring.

Author Information and Affiliations

De Gruyter adopted the JATS/BITS schema for journal content and established De Gruyter specific XML guidelines for creating XML metadata and full text data. Together with le-tex, De Gruyter developed a submission checker to validate the data quality of book and journal packages delivered from their service vendors. The tool is based on the Open Source software Talend and transpect.

The submission checker verifies the consistency of metadata, validates against the JATS schema and De Gruyter’s business rules, which are specified with Schematron. An HTML report provides a rendering of the source files with the error messages. The messages are displayed at the error location and are grouped by their severity. Content passing the check is forwarded for archiving and publication. It guarantees a technically correct rendering of the content on degruyter.com and facilitates the retrieval and processing for future purposes.

Brief Description

Within this paper we'll describe De Gruyter’s JATS/BITS guidelines and its data checking workflow. Furthermore, we discuss benefits and limitations of an automated data checking utilizing Schematron and HTML reporting.

Introduction

De Gruyter has been an independent academic publisher of scientific content for over 260 years. About 1,200 book titles, more than 750 subscription based or Open Access journals, and a variety of digital products are published every year. The content covers the subject areas humanities, social sciences, STM and law [1]. Besides its own brands and imprints, De Gruyter allocates the e-publications of diverse Publisher Partners, like Harvard and Princeton University Press and integrates their content into some of De Gruyters various product lines and business models.

To accomplish constantly high quality of content, fast publication times and at the same time considering important characteristics of specific publishers or subject areas, De Gruyter furthers standardization in many different areas. Therefore, within production standardizing publication formats, streamlining processes as well as ensuring the compliance of all aspects of it has become a major objective.

Background

Standards

The main focus to increase the efficiency was to have standards in place and raising the number of journals and books using them. This was considered necessary to achieve better conditions for production services and increase process stability and predictability. One step to obtain more standardization was establishing the team eProducts and Standards within the production department in 2013, which complements the books/journal production department and the purchasing team. Since three years, the team is centrally responsible for creating and maintaining new standards and guidelines. Also, the requirements for new tools and software that serves the production process are described and implemented accordingly.

Since then the numbers of standards increased rapidly. Until today around 30 documents, that specify and classify product types, services they use and production processes, have been created.

XML

To achieve a higher degree of automation and a greater flexibility for new content models, XML guidelines became an important part within production. De Gruyter uses JATS and BITS schemas for archiving and publishing their content. Within the De Gruyter XML Guidelines the specification for books and journals are consolidated and unified to a large extent. At the same time international standards such as ISO norms or Cross Ref standards are incorporated whenever feasible. As a result De Gruyter can profit from tools, software and conversion services that already implemented these standards.

Thus XML deliveries, their structure, naming and components need to be defined explicitly. For De Gruyter this includes, apart from the XML itself, if available the corresponding assets, such as figure and media files, a PDF equivalent and Electronic Supplementary Material. All files are compiled within a ZIP package and conform to the De Gruyter predefined naming convention and folder structure. An issue delivery package will be built as follows:

Box 1

Structure and Naming Convention for an Issue Delivery

{journal-code-online}_{issue-id}_{timestamp}.zip
|--{issue-id}/
|  |--{article-id}/
|  |  |--{article-id}.xml	
|  |  |--{article-id}.pdf
|  |  |--graphic/
|  |  |  |--{element-id}.jpg
|  |  |  |-- (…)
|  |  |--media/
|  |  |  |--{element-id}.m4a
|  |  |  |--{element-id}.mp3
|  |  |  |-- (…)
|  |  |--suppl/
|  |  |  |  |--{filename}.{fn-ext}
|  |--{frontmatter-id}/
|  |  |--{frontmatter-id}.xml
|  |  |--{frontmatter-id}.pdf
|  |--issue-files/
|  |  |--{issue-id}.xml

Since strict XML guidelines have not been available all along, backlist content has not been created according to the latest guidelines. If these files need to be corrected and re-uploaded for publication, De Gruyter defined more lax guidelines to avoid high costs for adapting the content to the latest version of the De Gruyter XML standards. They include the minimum requirements, which ensures a successful processing and publication.

XML deliveries from Publisher Partners need to conform to De Gruyter guidelines as well. In some cases it is inevitable to define exceptions for their content as not all specifications fit to the requirements of single Publisher Partners.

Variables and IDs

Alongside XML tagging instructions, De Gruyter also defines standards for the structure of variables and IDs. They are built using Regular Expressions or metadata information from the ERP (Enterprise Resource Planning) system. Conventions for file or folder naming are based on these definitions and part of the XML guidelines. For example, element-ids (used within the attribute @id) are build as follows:

Box 2

Building the Article ID

{article-id} = {doi-code}-{article-system-creation-date-year}-{article-counter-ID}
{doi-code} = [predefined value for each journal]
{article-system-creation-date-year} = ^\d{4}$
{article-counter-ID} = ^\d{4}$

It illustrates the use of Regular Expressions and metadata information to form a new variable, that could either be used by itself or in combination. Those IDs and variables are primarily referenced by the XML guidelines, but are also used for others.

Motivation

The first issue has been the absence of XML guidelines. With establishing the team eProducts and Standards a first version was developed and communicated. Over the years, the XML instructions have been consolidated between books and journals to identify optimization potential by unifying elements among different product types. At the same time typesetting companies professionalized and were able to deliver XML files created from their production process. These files were created according to the guidelines, delivered to the publisher and put online. Errors within these deliveries became visible at different stages:

During the upload process to the media asset management system, when file naming conventions weren't correct or meta data information was wrong.
While transforming the XML to HTML for De Gruyter Online, when XML files were not well-formed, valid or meta data information was incorrect.
Through a visible check by the production editor (PE) at De Gruyter Online, when XML elements have been used wrong or were missing.

If an error occurred, the PE had to get back to the vendor to correct the XML and instruct him to re-upload the content (Fig. 1). In most cases, this was a time consuming, inefficient process. Furthermore, some errors were not spotted as the files were valid and the error didn't become visible with the HTML rendering. Despite having valid and error-free XML files, it was not guaranteed that the XML delivered from various vendors and created using different software was consistent and valid with regard to all De Gruyter requirements.

Fig. 1

Publication Workflow before implementing a quality assurance tool in case of errors

As a manual and visual check of the XML deliveries from the production department was seen as very inefficient and expensive, the obvious solution was to develop and implement an automated checking tool to ensure the quality of the XML delivery.

Aim

The direct and narrow objective for a quality assurance tool was to have an automated check for all XML deliveries regarding De Gruyter requirements that are specified within the guidelines. The feedback in terms of errors should be returned to the vendor directly rather than having production involved to forward or screen the report.

Indirect and broadly defined aims that can be seen as a result of the successful implementation are:

Ensuring a technically correct rendering of the content on De Gruyter Online.
Facilitating the retrieval and processing of the files for future purposes.

Requirements

The basic requirements for the checking tool were to examine whether the delivered files are valid and consistent XML files. That implies:

parse XML
validate XML
check consistency of identifiers (IDs such as journal-id, ISSNs, volume number ...)
match the internal list of files to referenced files

Processing Parameters

Another requirement was to have the Checking Tool as configurable as possible. Which means, processing parameters should be integrated in a declarative manner so that De Gruyter is capable of revising them independently from the general infrastructure and programming code. These processing parameters include:

Business Rules
Configuration Files:
- list of allowed values for @article-types
- list of allowed values for {doi-prefix} in combination within the publisher name
- file listing all variables and IDS and their composition
- a list of additional journal metadata that is extracted from an internal journal database
Metadata information from the ERP system

It should be possible to adjust these parameters without affecting the feasibility of the tool.

Business Rules

Implementing the XML guidelines and specifications using Business Rules has been the main requirement for the tool. To maintain the Rules internally as well, it was explicitly defined to implement them using Schematron.

Cascading Rules

Book and journal publications are built upon the same rules for specific content (e.g. contributors, abstract ...), but at the same time have a set of unique rules (e.g. different mandatory elements and attributes). To avoid as much redundancies as possible a concept for cascading business rules was seen as essential. In other words, starting from a general set of rules for all product types, a more specific rule set for book or journal publications only is defined to overwrite or complement the global rules. Strategically De Gruyter intends to add more cascading levels. Specific rules for different Publisher Partners, imprints, business areas or even single book series or journals should be defined as part of the cascades (Fig. 2).

Fig. 2

Existing (solid lines) and planned (dashed lines) cascading rules for De Gruyter publications

It is expected that occasionally unforeseen content constellations appear within deliveries that may fail the quality assurance tool. To have this content passing regardless, De Gruyter wants to define specific journals as non-standard. These journals are supposed to be checked against a lax set of business rules. This provides time to correct or implement additional rules or cascadings without having to process the files manually.

Additionally it should be possible to distinguish between different types of messages. There should be errors, whenever an element has not been created according to the guidelines and the content cannot be declared as publishable. Warning messages are supposed to be used as an information to the vendor that this will become an error message soon, but for now can still be published.

Metadata Compliance

An essential component for the Quality Assurance Framework is the metadata verification. Information such as ISBN, ISSN, Book- or Journal title should be tested against the metadata within the ERP system. Therefore a list containing the concordance between ERP information and the respective JATS or BITS element is provided.

Report

As a result of the checking process a file listing all errors and warnings should be created and provided automatically to the vendor. Besides mentioning the error, the report is expected to point to the tag, where the error or warning occurs. Depending on the result of the report, the content is forwarded for publication or send back to the vendor.

Implementation

After gathering and documenting all requirements, requests for proposals have been sent to various vendors. Ultimately le-tex has been selected to implement the XML quality assurance tool using their Open Source framework transpect (Transpect). The following sections describe the implementation as well as all important and relevant components. De Gruyter named the tool Submission Checker (SC) and it is referenced like that subsequently within this paper.

Publication Workflow

Within the De Gruyter publication workflow the Submission Checker receives the publication files from the vendor. After the checking process has been finished and no errors have been found, the files are transferred and imported into the De Gruyter Media Asset Management (MAM) system. Therefrom the content is forwarded to aggregators (like Amazon, iBooks...) and published at De Gruyter Online. Additionally journal articles are send to Abstracting and Indexing Services such as Scopus, Web of Science and PubMed. In case the content delivery is incorrect, an error report is returned to the vendor to correct and re-upload the files (Fig. 3).

Fig. 3

Publication Workflow after implementing the Submission Checker in case of errors

FTP folder

To access the Submission Checker each vendor received an FTP login, where (s)he could access the following folder structure:

Box 3

FTP folder structure

in/
error/
warning/
ok/

To trigger the Submission Checker, the vendor copies the content file (a ZIP package) to the folder in. This folder is monitored and the file will be forward for checking. Depending on the result of it, (s)he either receives an HTML report within the folders error and warning or a success message within the folder ok.

Talend

Talend is a data integration platform, that is used at De Gruyter to process and transform files and information from various sources. Within the Submission Checker Talend is part of several processes (Fig. 4):

Monitoring the "in" folders of each vendor on the FTP.
When a file is moved to the folder, Talend forwards it to transpect.
Talend interprets the product information extracted from transpect and delivers the respective metadata information from the ERP system.
Talend analyzes the Status XML, that has been created during the checking process, and processes the file according their status:
- status=ok: the content will be send to the MAM system and a text file is created within the respective ok folder containing the "ok" message.
- status=warning: the content will be send to the MAM system and together with the report html file it is also moved to the respective warning folder.
- status=error: the content will not be send to the MAM system, instead only the report html file is moved to the respective error folder.

Fig. 4

Talend processing, interpreting files within the Submission Checker framework

Transpect

Book and journal packages are validated with a customer-specific configuration of transpect. Transpect is an Open Source framework for data conversion and checking and is developed and maintained by le-tex, a Leipzig-based publishing service provider. The framework not only involves various modules with specific features but also a methodology to combine them into workflows.

While Talend manages the processing of files and metadata, transpect validates the content files and performs important tasks within the Submission Checker framework:

Extract data and metadata from the zip package
Validate and check the XML documents.
Analyze the file structure and file naming conventions.
Check PDFs in the package and store their metadata.
Provide the final checking results to Talend.

The next section provides a brief overview of the technologies used by transpect.

Technologies

Transpect uses core XML technologies and standards. Here is a summary of the technologies used for the Submission Checker and their purpose.

XProc is a W3C standard to specify XML workflows. Transpect is based on various XProc modules which provide specific features, such as unzip an archive, generate HTML from JATS or validate XML with Schematron. A comprehensive list of all transpect modules is provided by the transpect reference [3]. The entire process logic of the Submission Checker is laid out in XProc.
XSLT 2.0 is a programming language to transform XML-based data and a W3C standard. Many Transpect modules rely on XSLT transformations. In the context of the Submission Checker, XSLT is used for various conversion tasks such as processing XML metadata, assembling Schematron files or transforming JATS to HTML.
Schematron is a rule-based validation language to declare assertions for individual XML contexts. De Gruyter’s business rules such as XML fulltext and metadata guidelines are implemented with Schematron.
RelaxNG is an XML schema language to specify patterns for the structure of an XML document. All metadata and content files are specified in a modular RelaxNG schema.

The next sections describe the key concepts of transpect in the context of De Gruyter’s Submission Checker starting with the XML representation.

XML representation

It has often proved to be useful, to include all vital information in one document. For example, a Schematron rule needs to compare if values in the XML match metadata values or parts of the filename. Besides the XML-based content files, PDF properties, metadata, file listings and parameters are stored as XML, too. Finally, the components are wrapped together in one XML set. The code sample below shows the basic XML structure.

Box 4

General structure of the XML set

<sc:set xmlns:sc="http://degruyter.com/xmlns/submissionchecker">
  <sc:content>
    <-- JATS journal article files or BITS book -->
  </sc:content>
  <sc:pdfinfo>
    <-- PDF check results -->
  </sc:pdfinfo>
  <sc:meta>
    <-- metadata from ERP system -->
  </sc:meta>
  <sc:files>
    <-- Zip file listing -->
  </sc:files>
  <sc:params>
    <-- Global Parameters -->
  </sc:params>
</sc:set>

The XML set schema consists of various components, each is described by an individual RelaxNG schema. Therefore, the existing RelaxNG schemas for JATS and BITS where used and custom schemas were specified for the other components.

Fig. 5

The content model of the XML set

Hence, the Submission Checker does not validate individual book or article files but the entire XML representation of the package. In principle, this method is also useful to bring other problems to light. For example, a failed PDF check or missing metadata raise validation errors, too.

Cascades

The Submission Checker validates both book and journal packages. Although they follow different XML schemas and XML guidelines, they are checked with the same architecture. Because of similarities between JATS and BITS, some checks can be shared bewteen books and journals.

Aside from general product differences, sometimes it seems to be necessary to exclude or soften some checks for individual products. For example, De Gruyter produce XML packages in behalf of other publishers in the context of De Gruyer’s Publisher Partner program. For some reasons, it is neither convenient nor necessary for a publishing partner to comply with specific guidelines.

Therefore, transpect not only distinguishes between book and journal packages but also allows to override specific rules based on a certain publisher, book, journal or even an article. This is made possible by transpect’s configuration cascade: At defined locations, configuration files in a hierarchic directory structure may reside. Transpect analyzes the directory tree and the accompanied parameter files to get the configuration cascade. After the configuration cascade is known, transpect derives the matching cascades which applies to the current input file. Certain transpect modules support the configuration cascade and are able to dynamically load their configuration files, e.g. XSLT stylesheets, Schematron files, CSS files etc.

Box 5

Cascade directory structure

common/
|--params.xml
|--books/
|  |--params.xml
|--journals/
|  |--params.xml

In the context of the Submission Checker, transpect provides a general configuration, which includes XProc pipelines, XSLT stylesheets and Schematron rules that are shared among book and journal workflows. More specific Schematron rules and different RelaxNG schemas are stored separately for books and journals. As shown in Fig. 2, it is planned to develop more detailed configuration levels up to book series and journals.

Phases

Besides the configuration cascade, transpect also supports phases. The noun is adopted from Schematron and describes custom checking profiles which correspond to a certain XML schema and a Schematron phase. In addition to cascade levels, phases provide an additonal layer of configuration. A phase can be statically passed as parameter or dynamically determined by analyzing the input file.

Currently, the Submission Checker includes two phases to provide different levels of checking. A strict phase is intended for the daily production and a lax phase for archive packages. In contrast to the strict phase, the lax phase includes a smaller set of Schematron patterns and allows more flavors of the JATS/BITS schema family.

The phase is evaluated by analyzing the input file against a set of product-related rules. The rules are implemented in XSLT and stored separately for books and journals.

Box 6

Configuration cascade for phases

common/
|--books/
|  |--select-phase/
|  |  |--select-phase.xsl
|--journals/
|  |--select-phase/
|  |  |--select-phase.xsl

For books and journal packages, the lax phase is selected, if the publication date of the book or article is older than the release date of the Submission Checker. For journals only, the lax phase is indicated, if a publisher or journal is tagged as non-standard in De Gruyter’s metadata. If the phase evaluation fails, the strict phase is selected as fallback.

Global Parameter Sets

De Gruyter’s XML guidelines include various naming conventions, e.g. for file names, identifiers and variables such as journal codes, chapter suffixes or counters. These definitions frequently address other variables, as mentioned in Variables and IDs.

For this purpose, transpect uses parameter sets as part of the configuration cascade. A parameter set is stored as XML in a specific cascade directory. The XML document consists of parameter elements with key-value pairs. To indicate a reference to another parameter, its name is wrapped in curly braces and tagged with a leading dollar sign.

Box 7

Snippet of a parameter set

<?xml version="1.0" encoding="UTF-8"?>
  <c:param-set xmlns:c="http://www.w3.org/ns/xproc-step">
  <c:param name="article-id" value="{$doi-code}-{$article-system-creation-date-year}-{$article-counter-id}"/>
  <c:param name="doi-code" value="[a-z]+"/>
  <c:param name="date-year" value="\d{4}"/>
  <c:param name="article-system-creation-date-year" value="{$date-year}"/>
  <c:param name="article-delivery-type" value="(ja|aop)"/>
  <c:param name="article-counter-id" value="\d{4}"/>
</c:param-set>

The parameters are not only applied in Schematron checks but used throughout the entire process chain. For example, when an issue file must be loaded, the global issue-id parameter is used to find the correct file in the package. Generally, the parameter sets provide a convenient way to set important values which affect not only checks, but the entire process.

Schematron Implementation

Assembling Multiple Schematron Documents

According to the configuration cascade, different Schematron files are stored for books, journals and both product types. Later, an XProc step [5] collects the Schematron files that apply to the current package and compiles a global Schematron file. If a Schematron pattern exists twice in multiple Schematron documents, the Schematron pattern from a more specific level overrides more general Schematron patterns.

This method allows to define commonly used Schematron rules and overwrite them if necessary on more specific cascade levels. In principle, it’s possible to override all rules for a specific journal, although this wouldn't be very cost-efficient.

Use XSLT Functions in Schematron

The standard implementation of Schematron is based on XSLT: An XSLT generates another stylesheet from the Schematron file which is applied to the XML file to be checked. The output of the generated stylesheet is an SVRL report.

In contrast, transpect uses the oXygen’s Skeleton-based Schematron [4] implementation that allows foreign elements. This method can be used to pass over XSLT code from Schematron to the generated XSLT. This allows the definition of custom XSLT functions wich can be called from XPath expressions. For example, the following Schematron snippet shows how a XSLT function is used to check whether a proper ISO language code was used.

Box 8

Reference to an XSLT function in a Schematron assertion

<rule context="trans-abstract">
  <assert test="if (@xml:lang) 
                then tr:is-valid-iso-lang-code(@xml:lang) 
                else false()" id="trans-abstract_attr">
    The element &lt;trans-abstract> must always contain the attibute @xml:lang with a language code according to ISO 639-1.
  </assert>
</rule>
              
<xsl:function name="tr:is-valid-iso-lang-code" as="xs:boolean">
  <xsl:param name="context" as="xs:string"/>
  <xsl:value-of select="some $i in document('http://this.transpect.io/xslt-util/iso-lang/iso-lang.xsl')//tr:langs/tr:lang 
                        satisfies $i/@code eq $context"/>
</xsl:function>

Additional Markup

The author may use additional markup to add more semantic information to Schematron messages. Equal to other approaches discussed at previous JATS-Cons [6][7], the role-attribute is used to indicate the severity of a Schematron message.

Box 9

The role attribute indicates the severity of a Schematron rule.

<assert test="$subtitle-meta eq string-join(.//text(), '')" id="book-subtitle-match-metadata" role="error">
  The value of book-title don't match the value found in our ERP system: '<value-of select="$subtitle-meta"/>'
</assert>

Additionally, HTML code can be used for authoring Schematron messages. This provides more semantic markup which can later be rendered in the browser. Another use case is to link the Schematron message to the corresponding rule in De Gruyter’s guidelines, as shown in the example below.

Box 10

Schematron rule with HTML markup

<assert test="matches(., concat('^', $doi-prefixes, '/.+$'))" id="doi-prefix">
  The DOI prefix '<value-of select="$doi-prefixes"/>' is expected for 
  publisher-name '<value-of select="$publisher-name"/>'. Please note that 
  this error can also be caused by an unknown publisher name or missing book-id.
  <span xmlns="http://www.w3.org/1999/xhtml">  
    (See XML-Guidelines, section 
    '<a href="http://degruyter.com/foo/bar/guidelines.xhtml#publisher">Publisher</a>')
  </span>
</assert>

HTML report

For users, the primary source of validation messages is the HTML report. The HTML report shows the validation errors at the location where they occur. The three column layout provides a navigation, the content with the error messages and a summary which are grouped and can be filtered by their type:

rule family: a group of related messages (i.e. Schematron file name, RelaxNG schema name)
rule name: ID of the Schematron assert/report
severity: role of the Schematron assert/report (e.g. warning, error, fatal-error)

Fig. 6

The HTML report shows the RelaxNG schema and Schematron validation errors where they occur

The HTML report is generated in several steps. First, an intermediate HTML document is generated from the XML representation of the package. The intermediate HTML document is injected into the report template which provides the basic layout and includes JavaScript and CSS resources. Finally, the RelaxNG and Schematron validation messages are patched into the HTML report.

Fig. 7

MathML equations are rendered with MathJax

Fig. 8

File listing of the zip archive

Fig. 9

The report shows a mismatch of page counts between PDF and XML

Workflow

Talend runs transpect two times. The first time, transpect is used to extract identifiers such as book ISBN or article ID from the archive. Then Talend takes the parameters and performs a request for the corresponding journal or book metadata from De Gruyter’s ERP system. Afterwards Talend invokes transpect for a second time with the metadata snippet for the current package. Finally transpect starts the actual validation of the package.

The workflow comprises the following stages.

Unzip the archive and provide an XML listing of the files.
Check all PDFs and provide an XML representation of the checks.
Load metadata files from De Gruyter’s ERP system.
Evaluate the configuration cascade and the checking profile for the current package.
Select the RelaxNG schema and assemble the Schematron rules which apply to the configuration cascade and checking profile.
perform RelaxNG and Schematron validation
Convert JATS and BITS to HTML.
Inject the HTML in the report template and patch the RelaxNG and Schematron validation messages in the context where the error occur.
Provide Talend with a summary of the checks as shown in the example below (Status XML).

Box 11

The status XML provides a summary of the checks.

<c:param-set xmlns:c="http://www.w3.org/ns/xproc-step"
basename="aep_aep.2014.40.issue-3_2014-12-11--23-55-31" product="journal"
pub-type="issue" upload-date="2016-03-08+01:00" status="failed"
phase="strict" schema-valid="valid" type="report"
schematron-errors="33" schematron-warnings="0">

Benefits

In 2014 De Gruyter started using the Submission Checker for journal publications and extended it in 2015 to process book files as well. Since then over 40.000 packages have been processed. During this time of live implementation, various experiences have been made. Some rules have been to strict to reflect the actual content, some had to be become stricter to guarantee a successful rendering at De Gruyter Online or meet the requirements of related processes.

The effort within the production departments to manage error reports from different sources has been minimized. There are remarkably less complaints about poor rendering of specific content at degruyter.com.

Support regarding bugs or error messages that could not be interpreted by vendors are managed by the eProducts and Standards team. That, on the one hand, allows the team to constantly improve the Submission Checker itself, on the other hand returns valuable feedback regarding the unambiguity of specifications and instructions within the XML guidelines.

Despite having higher percentage of the delivered XML to conform to the De Gruyter guidelines, the amount of error messages per month decreased from around 30-40% to about 10-20% (Fig. 10). This implies an improved XML process with the vendors.

Fig. 10

Development of ok/warning and error statuses since implementing the Submission Checker in 2014

The graphic also shows large deviations (e.g. August 2014, December 2015). They result from major Submission Checker releases and the insufficient adaptions within the deliveries. That again proves the necessity of automated quality assurance and the benefit for De Gruyter. It would be out of De Gruyter’s scope to assure required adjustments within the content otherwise.

A large benefit that evolves from an XML quality assurance tool lies more or less within the future. De Gruyter is supposed to have advantages, when content or parts of it need to be re-used for different or new types of publication, like a databases. In that case, conversion efforts are expected to be relatively low since unpredictabilities have been minimized within the checking process.

Outlook

As of now, the initial requirements have been implemented and additional features are planned to realize within the next months. It is planned to realize the following components:

IDPF epubcheck

For book publications, an EPUB file is required for each delivery. Therefore, a feature is planned to check the EPUB file against the IDPF and the De Gruyter specific standards. Fortunately, transpect already provides an module which implements the standard IDPF epubcheck.

PDF check

Currently only the PDF metadata and the page count are evaluated. Additionally. it is planned to check color and transparency settings, embedded fonts, crop marks and bookmarks.

External Checking Framework for vendors

The experiences that have been made show, that having a checking process in place shortly before publication is often inconvenient for vendors. Some errors only become visible at this stage, which is too late since getting back and forth between correcting the file and re-uploading it to the Submission Checker is very ineffective. At the one hand, De Gruyter expects the vendor to have its own quality assurance in place that tests the content continuously against the De Gruyter standards to prevent errors at the stage of the SC. On the other hand, within the Submission Checker schematron rules are used to verify the content and it is anticipated, that extracting these rules and providing them to the vendor is associated with relatively low effort. Besides it is expected, that error rates decrease even more, which could result in lower support requests internally.

At the same time it needs to be considered that some checks rely on external files (see Processing Parameters) which in some cases cannot be delivered with the schematron rules (like metadata information from the ERP system). This makes it necessary to omit these checks within this framework. Also testing the file structure and naming conventions of the zip package should be excluded as it is can be assumed that single XML files are tested during the production process rather than finalized ZIP files.

Conclusion

De Gruyter decided to archive their content in XML to leave divers options for re-use and to optimize pre-publication processes. At the same time the XML files are prepared from various vendors using different software and workflows. To make sure these files meet all requirements having the Submission Checker in place to ensure their quality automatically is inevitable.

Furthermore, De Gruyter is getting more familiar with the various structures within their content. That on the one hand creates the opportunity to improve guidelines by unifying elements or, if necessary, define specific rules and use configuration cascades. On the other hand De Gruyter gains a better understanding of options and limits along with the JATS schema family.

Finally, to opt for le-tex and its transpect framework was very beneficial as it complies with De Gruyter's strategy to use industry standards whenever feasible. It was possible to build on diverse modules and interfaces that are already implemented within Transpect. This made it an efficient and advantageous project for both parties.

References

1.: De Gruyter Website: The Publishing House. [Accessed: 2016-03-03] http://www.degruyter.com/dg/page/15/the-publisher.
2.: Talend Website [Accessed: 2016-03-08] https://www.talend.com/
3.: Transpect Reference [Accessed: 2016-03-11] https://www.transpect.io/
4.: XProc step that use oXygen’s Skeleton-based Schematron implementation [Accessed: 2016-03-11] https://github.com/transpect/schematron.
5.: XProc step to assemble Schematron documents. [Accessed: 2016-03-08] http://transpect.github.io/modules-htmlreports.html#tr-assemble-schematron.
6.: Blair J. Developing a Schematron–Owning Your Content Markup: A Case Study. In: Journal Article Tag Suite Conference (JATS-Con) Proceedings 2012 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2012. [Accessed: 2016-03-11] http://www.ncbi.nlm.nih.gov/books/NBK100373/
7.: Usdin T, Lapeyre DA, Glass CM. Superimposing Business Rules on JATS. In: Journal Article Tag Suite Conference (JATS-Con) Proceedings 2015 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2015. [Accessed: 2016-03-11] http://www.ncbi.nlm.nih.gov/books/NBK279902/
8.: GitHub page of epubcheck-transpect [Accessed: 2016-03-11] https://github.com/transpect/epubcheck-transpect.

The copyright holder grants the U.S. National Library of Medicine permission to archive and post a copy of this paper on the Journal Article Tag Suite Conference proceedings website.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Bookshelf ID: NBK350149

Contents

PubReader
Print View
Cite this Page
Kraetke M, Bühring F. A Quality Assurance Tool for JATS/BITS with Schematron and HTML reporting. In: Journal Article Tag Suite Conference (JATS-Con) Proceedings 2016 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2016.

Other titles in this collection

Journal Article Tag Suite Conference (JATS-Con) Proceedings

Conference Links

Recent Activity

Clear Turn Off Turn On

A Quality Assurance Tool for JATS/BITS with Schematron and HTML reporting - Jour...
A Quality Assurance Tool for JATS/BITS with Schematron and HTML reporting - Journal Article Tag Suite Conference (JATS-Con) Proceedings 2016
Normal bacterial flora on hands - WHO Guidelines on Hand Hygiene in Health Care
Normal bacterial flora on hands - WHO Guidelines on Hand Hygiene in Health Care
MnSOD, partial [Hevea brasiliensis]
MnSOD, partial [Hevea brasiliensis]
gi|5777414|emb|CAB53458.1|
Protein
USP3 ubiquitin specific peptidase 3 [Homo sapiens]
USP3 ubiquitin specific peptidase 3 [Homo sapiens]
Gene ID:9960
Gene
9960[uid] AND (alive[prop]) (1)
Gene

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

Bookshelf

Journal Article Tag Suite Conference (JATS-Con) Proceedings 2016 [Internet].

A Quality Assurance Tool for JATS/BITS with Schematron and HTML reporting

Authors

Affiliations

Brief Description

Introduction

Background

Standards

XML

Box 1

Variables and IDs

Box 2

Motivation

Fig. 1

Aim

Requirements

Processing Parameters

Business Rules

Cascading Rules

Fig. 2

Metadata Compliance

Report

Implementation

Publication Workflow

Fig. 3

FTP folder

Box 3

Talend

Fig. 4

Transpect

Technologies

XML representation

Box 4

Fig. 5

Cascades

Box 5

Phases

Box 6

Global Parameter Sets

Box 7

Schematron Implementation

Assembling Multiple Schematron Documents

Use XSLT Functions in Schematron

Box 8

Additional Markup

Box 9

Box 10

HTML report

Fig. 6

Fig. 7

Fig. 8

Fig. 9

Workflow

Box 11

Benefits

Fig. 10

Outlook

IDPF epubcheck

PDF check

External Checking Framework for vendors

Conclusion

References

Views

In this Page

Other titles in this collection

Conference Links

Recent Activity