Journal Article Tag Suite Conference (JATS-Con) Proceedings 2015 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2015.

All Aboard! Round-tripping JATS in an HTML-based online CMS and editing platform


In a project just starting up, we are converting JATS (or actually near-NLM-Book) data into HTML for an HTML-based CMS, where the documents (structured reference material in short articles) will be edited (in the usual sort of web-based HTML editor) – and then must be siphoned back up into the JATS-like textbase, for further processing using JATS-based tools.

Going both ways presents several interesting theoretical and technical challenges. A simple set of design principles governing the mappings both eases implementation, and implies a generalized architecture or methodology for doing this with any HTML application and JATS data set.

In December (2014) I was asked by a customer (a division of De Gruyter) to help with an interesting transformation problem. The challenge was to round-trip a data conversion between data conformant to JATS,* and HTML, so that edits to the JATS data could be performed in a web-based (and HTML-based) CMS, and then ported back into the data store. While the project participants agreed that this was hardly a viable proposal in the general case (or even a definable one), we also agreed that we did not have to solve it in the general case, but only with respect to the requirements of our particular pipeline working with our actual data – requirements that, we judged, might be addressable if only we had scope to define, not only a mapping from JATS to HTML, but also constraints over and assumptions regarding the HTML we would support coming back into the JATS.

That is, we were free to define a subset of HTML that would be capable of converting back into JATS again to fit with the tool we would build to support this conversion; and we could drive the development by scoping it to a particular data set, not a theoretical one.

This approach – not trying to do everything, but narrowing the problem in order to do what can be done, while taking note of where issues are intractable – allows us to learn by experiment, illuminating the stresses and challenges of working with data that must perform in both JATS-based, and HTML-based environments. But that wasn't the only reason I was interested in helping with the project. I also saw an interesting opportunity to “open the pipe”, as it were, and test how much or how little it would actually be necessary to constrain the data, and where, in order to achieve our goals. We like XML, we keep saying, because we can validate it. This is an opportunity to see whether we might be setting the bar for ourselves too high.

We know that to provide for editing XML in an XML-first workflow is actually much more challenging than it seems it should be – due largely to the complexity of three tasks (and the complex interrelations between them): defining and enforcing good tagging (well fitted to local needs); developing tools; and training, supporting and empowering users. The claim is that moving to the web should relieve at least the latter two of these issues, while not significantly worsening the first. While I have my doubts as to this claim,** I am certainly willing to be wrong about it – and the web architecture is admittedly ubiquitous. This effort, in other words, offers great rewards, even if ultimately it succeeds better at exposing the issues than it does at solving them.

Theory aside, the fact that we were starting with a limited data set, and could build from there, gave us a place to start. First, we implement a simple mapping into HTML (but a comprehensive one for our data set), from which we can produce actual HTML. Then we demonstrate transformation back again. In doing so, we expose and codify (to the extent practical) both mappings and constraint sets on both sides. But we would also make both transformations intentionally "noisy", producing exceptions (and sometimes validation errors) when they could not be assured to produce correct results. (I.e., no graceful fallbacks or coping strategies.) This would mitigate the chances of information loss or degradation happening unnoticed, in either direction.

Rule-based conversion: JATS to HTML

We wanted our mapping from JATS to HTML, in version 1, to be as simple and systematic as possible. This was because we knew that we could return later if any refinements were called for, while having something simple would be advantageous in early stages as we built and tested the rest of the pipeline. We also had the advantage of a data set that we agreed could be considered representative, for the purposes of v1.0. Analyzing this data set could provide guidance as to priorities for mapping. (I.e., valid JATS constructs were not high priorities for consideration in the mapping unless they actually occurred in the data.) We do expect our reference data set to grow.

So we started with what is almost the simplest possible mapping: everything becomes either a div or a span. In true HTML fashion, we would rely on @class attribute assignments to carry semantic labels (generic identifiers) from the JATS into the HTML.*** Essentially, we would represent JATS as an HTML microformat (using that term here to refer to a technique, not a movement). Add a few exceptions to this (for obvious things such as anchors, paragraphs and lists, common elements for which more exact equivalents were readily available) and we would have something basic but comprehensive enough to try. Our data set gave us precise boundaries defining the scope of this mapping.
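
In outline (and only as a sketch), the generic rule looks like this in XSLT 2.0; the element names here are drawn from Table 1 below, and attribute handling is taken up separately:

  <!-- wrapper-type JATS elements become div, carrying the JATS name in @class -->
  <xsl:template match="book-part | sec | ref-list | ref | disp-quote | fig">
    <div class="{name()}">
      <xsl:apply-templates/>
    </div>
  </xsl:template>

  <!-- element types classified as inline would become span on the same pattern,
       <span class="{name()}">...</span>; in our data set the few inline
       candidates all ended up with explicit HTML equivalents (i, b, sup, sub),
       so no span rules were needed in the end -->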

The next problem was determining how to distinguish between elements expressed as blocks (div) and inlines (span). An element becomes a div if it never contains anything but elements (it is a wrapper); it becomes a span if it never appears anywhere except in line (next to text). Any element of which both were true would become a div (such as JATS fn, which can occur in paragraph content, but as a wrapper for a structure). Elements of which neither of these were true (such as p or, in some data sets, label) could in principle become p ... but for the generic mapping we again chose div for these. This means nearly everything is a div; only a very few things become spans, by virtue of both containing text and appearing next to it.

There was a refinement to this, however; we want the div/span distinction to be made by element type, not for each element. That is, if our data presents

<bold><italic>Bold and italic!</italic></bold>

we wish to see two span elements, not a span inside a div – the fact that bold has only element content here should not make it a div – or the other way around (because the italic does not appear in mixed content). So the criterion has to be based not on the contents of this element, but the contents everywhere in the data set of elements of the type. That is, because both bold and italic are mapped to span elsewhere (as both appear next to text), they become span here as well.

There are two ways to determine which element types should be spans and which should not, according to these principles. The more robust and correct way would be to do so by reference to the appropriate schema, where analysis can determine which elements are permitted to appear alongside text. However, the design of JATS (which makes a principled distinction between elements appearing only in element content and elements appearing in mixed content) makes it possible to get the same results in reference to a data set, even when the particular schema is unknown or undefined, and that is what we opted to do in our case, for the sake of expedience. This being the case, we were able to generate a “starter” XSLT from the data set itself (using a heuristic process implemented in XSLT to do so), which declared mappings for elements on this basis. We then proceeded to refine it by hand.
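
For example (a sketch only, with illustrative variable names), the classification can be derived with XPath 2.0 expressions evaluated over the reference data set:

  <!-- element types that ever appear next to non-whitespace text
       (i.e. in mixed content): candidates for span -->
  <xsl:variable name="inline-names"
    select="distinct-values(//*[../text()[normalize-space()]]/name())"/>

  <!-- element types that ever directly contain non-whitespace text
       (i.e. are not pure wrappers): candidates for p rather than div -->
  <xsl:variable name="text-bearing-names"
    select="distinct-values(//*[text()[normalize-space()]]/name())"/>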

Further adjustments fell into four categories:

  • There were cases where we allowed that elements mapped to div could be better represented as p, not because they are “paragraphs” (something which is underspecified in HTML in any case) but because they have text content, while any contained elements were classifiable as inline (as described above).
    This set of elements includes (in our data set) p, title (except inside metadata or as a child of sec), mixed-citation, label, and caption.
  • We decided that for robust display of the HTML, and to follow the “principle of least surprise”, we would map sec/title elements in the JATS to different levels of HTML headers (h2-h6), depending on their depth; a sketch of this rule follows the list. (The mapping is still simple as in all cases they would match a CSS .sec > .title selector.)
  • Some elements had reasonable mappings in HTML itself. Specifically, we found we could represent the two kinds of lists we had (JATS list without @list-item-type and with @list-item-type='simple') as ul with li or dl with dd, respectively. We also mapped JATS graphic to HTML img; similarly, HTML has elements corresponding to JATS italic, bold, sup, and sub.****
  • Finally, we had an ad-hoc mapping for JATS ext-link (with a specific @ext-link-type), specified by the HTML/CMS team, who were implementing hypertext features (internal and external linking) in the target system.
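
A minimal sketch of the header-depth rule mentioned above (XSLT 2.0; the clamp at h6 and the class value are illustrative assumptions):

  <xsl:template match="sec/title">
    <!-- h2 for a title in a top-level sec, h3 one level down, etc., clamped at h6 -->
    <xsl:element name="h{min((count(ancestor::sec) + 1, 6))}">
      <xsl:attribute name="class" select="'title'"/>
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>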

Our provisional mapping looks like so:

Table 1

Elements | Rule | Rationale
book, body, book-meta, book-title-group, contrib-group, contrib, name, publisher, pub-date, book-part-meta, title-group, alternate-form, sec-meta, book-part, back, ref-list, ref, sec, disp-quote, fig | Become div | Contain only element content (i.e. a “wrapper” element)
book-id, book-title, subtitle, volume, surname, given-names, publisher-name, publisher-loc, isbn, year, day, month, elocation-id, book-meta//title, book-part-meta//title, sec-meta//title, book-meta//ext-link | Become div | Contain text content, but appear only inside metadata
p, title, mixed-citation, label, caption | Become p | Contain text, but never appear next to text (i.e. never appear inline, only in wrappers)
sec/title | Become h2-h5 (depending on sec element nesting level) | To make life easier in an HTML environment and provide a backstop to CSS
list, list-item | Become dl/dd or ul/li depending on @list-item-type ('simple' or anything else) | Same
graphic w/ @xlink:href | Becomes img w/ @src | The HTML equivalent
italic, bold, sup, sub | Become i, b, sup or sub based on element type | Same
ext-link | Becomes HTML a, with a @class assignment and a mapping for JATS @xlink:href (an ad hoc mapping) | Following a specification of how to tag external references in the target environment

Elements not listed here (significantly, tables and math) do not appear in our data (yet), and are not mapped. The rule for any element not mapped explicitly is to copy it into the result (where it will be detectable as invalid), while emitting a warning message at run time.
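
That fallback is simple to express; a sketch (XSLT 2.0), relying on the low default priority of match="*" so that any explicit mapping wins:

  <!-- catch-all: copy unmapped elements through unchanged (they will surface
       as invalid in the HTML results) and warn at run time -->
  <xsl:template match="*">
    <xsl:message>No mapping for element: <xsl:value-of select="name()"/></xsl:message>
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>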

This appears to be complete and comprehensive enough to serve as a rough cut – enough to get us started. It should be stressed, however, that the element-to-element mapping is not the most critical part, inasmuch as the JATS element type itself is captured in the HTML @class value – and that is what we will look for in the data coming back.*****

An abbreviated example illustrates this principle:

Box 1. The simple mapping: the source data and the results of transformation with the simple mapping.

As a consequence, we get HTML that displays reasonably well; that presents @class value hooks for CSS; and that exposes the JATS element types for use both in the HTML application and on the return journey back to JATS.
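
As a hypothetical illustration (not the data shown in Box 1), a JATS fragment such as

  <ref-list>
    <title>References</title>
    <ref id="ref1">
      <mixed-citation>Author, A. <italic>et al.</italic> 2015.</mixed-citation>
    </ref>
  </ref-list>

would come out, under these rules, approximately as

  <div class="ref-list">
    <p class="title">References</p>
    <div id="ref1" class="ref">
      <p class="mixed-citation">Author, A. <i class="italic">et al.</i> 2015.</p>
    </div>
  </div>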

Representing JATS attributes and attribute semantics

Unless we also account for attribute assignments in the JATS, however, an element-to-element mapping will be insufficient, to say the least. Much of the most important information that needs to be represented in HTML takes the form of attributes in the JATS data, to which (for the most part) HTML has nothing corresponding.

Our first solution to this was creative, but perhaps no better than that: it was to overload the semantics of HTML @class. For example, JATS (or rather, here, BITS):

<book-part id="book-part01" book-part-number="1" book-part-type="chapter"> ...
would become
<div id="book-part01" class="book-part book-part-number..1 book-part-type..chapter"> ...

We were even able to demonstrate round-trip processing of attributes stored this way, with an XSLT that split apart the @class values. However, we eventually decided on a different option. Exploiting a feature in the HTML5 specification,****** we determined we could break out the JATS vocabulary into attributes specifically provided (again by means of a rule-based mapping) for that purpose:

<div id="book-part01" class="book-part" data-book-part-number="1" data-book-part-type="chapter"> ...

While this may make it more difficult to produce a schema describing the HTML subset we are prepared to handle, we judged that the tradeoff in both transparency, and robustness, was worth it. The developers on the CMS side concurred with this design.
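
A sketch of the attribute rule (XSLT 2.0; it assumes the element templates invoke it with <xsl:apply-templates select="@*"/>, and that namespaced attributes such as @xlink:href are handled by their own, ad hoc templates):

  <!-- un-namespaced JATS attributes become data-* attributes of the same name -->
  <xsl:template match="@*[not(namespace-uri())]">
    <xsl:attribute name="data-{name()}" select="string(.)"/>
  </xsl:template>

  <!-- @id is carried across unchanged -->
  <xsl:template match="@id" priority="1">
    <xsl:copy/>
  </xsl:template>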

"JATS-sniffing" the HTML

The essence of this approach lies in the theory that if we keep the mapping rules simple enough, implementing a comprehensive transformation back again from our HTML “projection” will be feasible, at least for the actual documents we need to work with. We do not need to implement a transformation back into JATS from arbitrary HTML; instead we are free to stipulate exactly what variety (or usage profile or subset) of HTML we are bound to accept.

Yet at the same time, our mapping rules aim to be simple and predictable enough that little or no extension would be necessary to support element structures and combinations we had not seen.

The first rule of our transformation back to JATS, therefore, is that if a clean mapping is not available (following either the rules for generic mappings, or explicit exceptions to those rules), an element in the HTML will be copied, not converted. That is, the transformation is designed to produce invalid results for source data that falls outside the boundaries of a rule set that remains only implicit as it applies to the HTML. (These rules govern both how JATS semantics are represented and whether, as represented, they will be valid in JATS.) Because we guard the boundaries by validating the results against JATS, we do not need to guard them inside the XSLT.

This is a different tactic and presents a different set of functional requirements than transformations as typically designed and used: the ordinary assumption is that, as long as a transformation does not actually drop contents, it must do everything it can to render valid outputs (perhaps with warnings), even if there is some loss of information encoded in markup along the way. Things are expected to degrade gracefully rather than fail. If instead, we can expect our results to be either correct (according to simple and traceable rules) or invalid, then we don't have to worry about a middle ground, namely coding for cases where things are valid, but correctness is still in doubt, because the information we need to bring our data back into JATS is not perfectly explicit and clear in the HTML.

This being the case, we can straightforwardly implement a transformation in reverse that works as follows (a sketch follows this list):

  • If any value given on @class (but not more than one) corresponds to the name of a known JATS element, we will create an element by that name.
    So div with @class='p' becomes p, as does p with @class='p' or @class='p font.italic'. (No element in our JATS subset is named font.italic.)
    A div with @class='p italic', however, remains a div (which will be invalid in the results), in this case because two JATS elements are named, not just one.*******
  • Similarly, any attributes named data-* are converted into similarly-named JATS attributes (without the “data-” prefix) with the same value.
  • Where a pattern of elements matching an ad hoc mapping occurs (such as we have for external links), reverse it.
  • Everything else is copied through.
  • When results are valid, we know the data is clean. When they are not, we can either push the data back (the editing environment should be able to prevent it from occurring); or extend our mapping rules and XSLT to accommodate the usage; or edit the data to make it conformant.
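
A sketch of these rules (XSLT 2.0; the list of known JATS names is abbreviated here, and would in practice be generated from the schema or the reference data set, and the HTML is assumed to be in no namespace):

  <!-- the set of known JATS element names; abbreviated, for illustration -->
  <xsl:variable name="jats-names"
    select="('book-part', 'sec', 'title', 'p', 'ref-list', 'ref',
             'mixed-citation', 'fig', 'caption', 'italic', 'bold')"/>

  <!-- if exactly one @class token names a known JATS element, create that
       element; otherwise copy the HTML element through, so the result fails
       JATS validation rather than degrading silently -->
  <xsl:template match="*[@class]">
    <xsl:variable name="hits"
      select="tokenize(@class, '\s+')[. = $jats-names]"/>
    <xsl:choose>
      <xsl:when test="count($hits) eq 1">
        <xsl:element name="{$hits}">
          <!-- @class has done its work naming the element, so it is dropped -->
          <xsl:apply-templates select="@* except @class"/>
          <xsl:apply-templates/>
        </xsl:element>
      </xsl:when>
      <xsl:otherwise>
        <xsl:copy>
          <xsl:copy-of select="@*"/>
          <xsl:apply-templates/>
        </xsl:copy>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- data-* attributes become the corresponding JATS attributes -->
  <xsl:template match="@*[starts-with(name(), 'data-')]">
    <xsl:attribute name="{substring-after(name(), 'data-')}" select="string(.)"/>
  </xsl:template>

  <!-- other attributes (notably @id) pass through unchanged -->
  <xsl:template match="@*">
    <xsl:copy/>
  </xsl:template>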

Depending on how we approach the last problem, the burden can be placed on the HTML editor to maintain and manage a JATS-compatible HTML profile in the editing environment.

Formalizing a JATS-compatible HTML profile?

We have demonstrated successful round-trip conversion of data in our sample set using transformations written to these specifications. Development and testing of the editing environment is proceeding. To me, the biggest question will be whether that environment can be customized and constrained enough so that the HTML it produces follows our rules. (What proportion of the data, when edited and rendered back, will come as valid and correct JATS markup?) Presumably, if we hit a wall there, we will have to come back to our mapping rules (from HTML back to JATS) to make them more accommodating of what the editor actually produces. However, by no means has it been shown that such an editing environment is impossible or impractical. (If the irony is that, in the end, it would have to be a JATS editor in an HTML costume ... so be it: a true JATS editor might be next.)

If it turns out that this approach works well, however, this would raise another question. Shouldn't it be possible to expose and codify the HTML profile corresponding to a particular schema or subset of JATS, according to the rules stated earlier? If it is, a formal specification and perhaps even tools (DTD, RNG, Schematron) might be possible for working with such an HTML flavor or dialect. These capabilities could be tremendously useful in actually managing a workflow, for the usual reasons: they expose requirements (as a functional spec) and provide a mechanism for trapping issues and problems early, even before going to JATS.

Such a subset of HTML would have to have these characteristics:

  • Be valid to a known species of HTML, such as XHTML 1.1 or HTML 5 (in an XML serialization). This may not be entirely straightforward, inasmuch as there are a few places (for example, list elements inside paragraphs) where JATS is more permissive than HTML, which requires finessing (for example, by allowing div element wrappers for lists in the HTML subset, which would also have the virtue of supporting JATS list/title). However, I believe the stress points (such as this one) can be isolated and dealt with.
  • Capture JATS names completely and unambiguously – all elements in scope (e.g., all elements in body) would have to capture single (unambiguous) JATS values as @class assignments. (We leave mapping and managing metadata aside as a separate set of problems.)
  • Observe co-occurrence constraints reflective of implicit JATS structures

Of course, it is the last of these that is the real challenge. Such constraints might include specifications such as: “An element with @class assignment sec may have attributes @data-sec-type or @data-specific-use, but no other data-* attributes”. (Constraints on attribute values might be similarly expressed: “@data-sec-type must be ‘chapter’, ‘section’ or ‘subsection’”, etc.) It should in principle, however, be possible to derive these rules straight from a schema for the JATS version in use. An implementation could take the form of a combination of RNG (or even DTD) and Schematron.
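
For illustration only, the first of these constraints might be sketched in ISO Schematron (assuming the xslt2 query binding; the attribute and value names follow the examples just given):

  <sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
      <sch:rule context="*[tokenize(@class, '\s+') = 'sec']">
        <!-- only data-sec-type and data-specific-use are allowed on a 'sec' -->
        <sch:assert test="every $a in @*[starts-with(name(), 'data-')]
                          satisfies name($a) = ('data-sec-type', 'data-specific-use')">
          A 'sec' element may carry no data-* attributes other than
          data-sec-type or data-specific-use.
        </sch:assert>
        <!-- constrain the value of @data-sec-type when it appears -->
        <sch:assert test="not(@data-sec-type)
                          or @data-sec-type = ('chapter', 'section', 'subsection')">
          @data-sec-type must be 'chapter', 'section' or 'subsection'.
        </sch:assert>
      </sch:rule>
    </sch:pattern>
  </sch:schema>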

We have not done this to date, and do not plan to until we have established the viability of our basic strategy. We may learn we don’t actually need such a schema, or that it wouldn’t help the problems we discover we have. Alternatively we might learn such a schema would be a great help or even essential. Presumably, such a specification would provide additional transparency to the CMS developers, along with theoretically useful gateway functionality. (That is, we would have a schema against which to validate our HTML before attempting to convert it.) The disadvantage would be that, like any new layer (or formal, external specification of a layer), it would be one more point of stress between the pragmatists and the purists, and perhaps not rewarding of the effort – at least if our HTML were coming through well enough without it.

Summary and speculation

We believe we have established that two-way transformation is possible in principle if we are willing and able to constrain the tag set on both sides (that is, both JATS and HTML usage profiles); it can even be readily achieved, without surprises, at least under lab conditions.

However, we have done this by designing our tools to convert downhill (from JATS to HTML) very regularly and transparently, expressing the JATS semantics in HTML in a way that makes the same element types (and attribute values) easily recognizable coming the other way – meaning the uphill conversion can be equally simple. And rather than tolerate lossiness in either direction, we prefer to generate invalid results (especially coming back into JATS), exposing the points of failure.

In other words, we will never have a pair of XSLT stylesheets that will convert from any JATS into HTML (of whatever form), or from any HTML into (whatever form of) JATS. But we will have stylesheets that can handle anything more or less like what we have already seen in our environments and are prepared to support there – and that could be a great deal. And we will be able to see whenever they fail: we won't have silent errors or lapses.

Additionally, we begin to see the outlines (requirements) for a set of tools and a methodology by which one could readily develop and deploy transformations (even auto-generated transformations), that would convert JATS of a particular usage profile into a corresponding “reflection” in HTML – and would be able to accept data for conversion the other way, providing the HTML on the return side conforms to expectations. I call this approach “ascetic HTML”. Imagine if we could not only run these transformations, but also generate schemas capable of ensuring their success.

Such a system would probably require hand fitting around the edges to work well. (Even in the examples we have done, we performed adjustments for representing links in the HTML.) Additionally, in order to function properly it depends on good tooling on the outside (supporting operations over the HTML) – since garbage-in-garbage-out is the only way the return transformation can work.

Yet even if this doesn’t work out – the very fact that we face a requirement to get JATS-encoded data into and out of an HTML-based system may indicate progress. If we step back to view the bigger picture, it appears that editing structured text, whether it be represented as HTML, XML or something else, has always been challenging primarily because UI conventions for doing so – the widgetry of “outlining”, representing and encoding structured data (as opposed to simply “painting the screen”) – have never been clear, established and transparent to a critical mass of users, whether in the browser or on the desktop. For every user that has understood the need for and usefulness of explicit structure, and sought a way in the editor to represent and exploit the structures of the document, there has been another who has found such exposure of the organizational “bones” of the document to be an unnecessary abstraction (as it may well be, for them to do their job as they see it). To the extent this is the case, it is not a question of an HTML or an XML editor, but of how and where structure is imposed on the document and represented in its interfaces. With respect to these questions, there are also huge variations in the type and form of texts, publishing and processing systems: one size does not fit all. However, I also think real progress is happening in this area, and users are becoming more sophisticated and knowledgeable about both markup and structured text – good news for both HTML and XML.

If this is so, we are likely to see better interfaces for authoring and editing in both HTML-based and XML-based environments (to say nothing of those that present a different kind of editor altogether, such as a markdown or simple markup convention). One way they will be better – one way they will have to be better – is that they will support better and stronger validation, and be more capable of managing arbitrary sets of constraints and rule sets, more amenable to the expression and enforcement of local business rules.

And this will open the way for more general and more adaptable solutions for integrating even loosely coupled systems built for formats as different and apparently incompatible as JATS and HTML.

Footnotes

*

Actually the format was a derivative of NLM 3.0 Book aka "Purple"; but for the purposes of this project it may as well have been JATS (or any other documentary DTD of similar complexity), so I say JATS throughout.

**

One reviewer of this paper asked why we weren't simply editing XML; indeed, I did discuss with the customer whether an XML-based editor, even on the web, wouldn't be easier and better, only to be told that regrettably it wasn't an available option.

***

In this paper, I borrow an XPath convention (prepending the name with @) to distinguish attributes as such from elements.

Long-time markup practitioners may recognize this technique of aligning vocabularies as being codified as early as SGML/Hytime Architectural Forms.

****

Incidentally, these also accounted for all the elements mapped to span in our data set. So in the end we had none.

*****

In HTML, @class can be provided with several (space-delimited) values; the return XSLT will ignore any that don't correspond to its set of JATS element types.

******

See The HTML5 draft Recommendation on so-called data-* attributes in HTML 5.

*******

If the HTML/CMS team finds this insufficiently rigorous, we are prepared to change our rule, so (for example) only things with @class='jats-p' become p.

The copyright holder grants the U.S. National Library of Medicine permission to archive and post a copy of this paper on the Journal Article Tag Suite Conference proceedings website.
Bookshelf ID: NBK279900
