NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Journal Article Tag Suite Conference (JATS-Con) Proceedings 2013 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2013.

A Publisher’s InDesign to BITS and EPUB Infrastructure: Conventions, Configuration, Conversion, Checks

Deploying advanced XML technologies such as XProc, XSLT 2.0, and Schematron, an “ex-post” conversion of InDesign files may be a viable alternative to XML-first publishing production workflows.

Introduction

The Hogrefe group of publishers, a leading European publisher of books, journals, and tests in the field of psychology and related disciplines, was looking for a solution for creating electronic products and archival XML content of their books. More than 150 new releases per year should be processed in a centralized XML/EPUB workflow. The major requirements were:

  • EPUB output;
  • XML output in an APA-compatible vocabulary;
  • electronic publication simultaneous with or earlier than print publication;
  • apart from specified exceptions, content identity assurance for print and electronic versions;
  • high-quality typesetting; strong preference for InDesign;
  • keep current typesetters if possible;
  • keep current authoring, typesetting, and proofing workflows (MS Word, InDesign typesetting, PDF proofs);
  • but be ready for alternative workflows, particularly those that start with content that is already tagged;
  • be able to adapt to multiple layouts and content types (specialist books, tests, encyclopedias, …).

The requirements and the overall sentiment among production editors suggested that InDesign would keep its position as the primary typesetting system.

Several workflow alternatives were considered, among them manual conversion of the printed books (which was rejected for time-to-market and quality reasons) and carrying along the tagging in InDesign. The latter option was considered too difficult and costly, given InDesign’s well-known shortcomings in supporting XML-first workflows, in particular when the XML dialect is not “plain” enough and when there may be many author correction passes.[1]

Solution

Hogrefe finally settled on a workflow in which conventional InDesign typesetting is combined with additional conventions and a self-service checking tool for the typesetters. The checking tool, which is available as a command-line tool and as a Web service, runs an IDML→BITS→EPUB conversion pipeline based on XProc and XSLT 2.0. Besides the conversion artifacts, it produces an HTML report in which error messages and warnings are tied to the error location in the document (that is, in its HTML rendering). The messages are created by Schematron checks that may be inserted at multiple stages of the conversion pipeline – for detecting unknown style names, unanchored text frames, incorrectly split table cells, uncited bibliography entries, etc. In addition to Schematron messages, schema validation messages may also be inserted at the validation error locations.

The components of this solution will be described in the following sections.

Schema: Adding Style to BITS

BITS beta 0.2 came out when the project started, and the requirement of APA DTD vocabulary made BITS an obvious choice.

Tagged content should be at the heart of the workflow, so obviously there should be a path from XML to EPUB. But it was an open question whether the InDesign→EPUB path should be allowed to bypass XML. Because an EPUB should look the same whether it was produced from InDesign or from XML, it was decided that the path from InDesign to EPUB should always lead via XML. Another requirement was that certain ad-hoc styling, such as text alignment, table cell background colors, or borders, should be passed through from InDesign to EPUB without requiring somebody to map a document’s stylistic properties and local style overrides to named content types in XML, and from there to CSS stylistic properties and local overrides. le-tex already had an XML vocabulary for expressing global and local layout properties, CSSa[1], and we had converters from .docx and IDML’s proprietary layout vocabularies to this vocabulary in place. So expressing the IDML files’ layout properties as CSSa was a natural choice, and in order to be able to pass it to EPUB as CSS, it had to be in the intermediate XML, too. So CSSa was added to the BITS schema. The resulting schema is called Hogrefe Book Tag Set (HoBoTS).[1]

Fig. 1

InDesign’s normal paragraph style and two variations thereof (without indent, with space above), expressed as CSSa rules in a HoBoTS document

It is important to note that not every styling property will be forwarded to the EPUB. There are rules as to which css:attributes should be discarded (e.g., font-family and font-size), which should be converted (e.g., device-cmyk to RGB), and which should be converted to markup (e.g., subscript/superscript, bold, italic). It is also important to note that no matter where the bold or italic property came from, be it named character styles or manual overrides, it will always be converted to BITS bold and italic elements. The notable exception is title elements, where bold is considered the default.
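The three kinds of rules (discard, convert, turn into markup) can be illustrated with a minimal Python sketch. It is not the actual XSLT implementation: the function names, the dictionary representation of properties, and the naive, profile-free CMYK formula are assumptions made for illustration only.

```python
# Hypothetical rule set. The paper names only font-family and font-size
# as examples of discarded properties.
DISCARD = {"font-family", "font-size"}


def cmyk_to_rgb(c, m, y, k):
    """Naive CMYK (each 0..1) to 8-bit RGB conversion without color profiles."""
    return tuple(round(255 * (1 - x) * (1 - k)) for x in (c, m, y))


def forward_properties(props):
    """Split CSSa properties into surviving CSS and markup element names."""
    css, markup = {}, []
    for name, value in props.items():
        if name in DISCARD:
            continue  # never forwarded to the EPUB
        if name == "font-weight" and value == "bold":
            markup.append("bold")  # becomes an element, not CSS
        elif name == "font-style" and value == "italic":
            markup.append("italic")
        elif value.startswith("device-cmyk("):
            inner = value[len("device-cmyk("):-1]
            parts = [float(p) for p in inner.split(",")]
            css[name] = "rgb(%d, %d, %d)" % cmyk_to_rgb(*parts)
        else:
            css[name] = value  # forwarded unchanged
    return css, markup
```

A real converter would additionally need the exception for title elements and a color-managed CMYK conversion.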

Fig. 2

When the CSS is serialized for the EPUB, some of the original properties such as font-size have been discarded

An example for local overrides can be found in Fig. 3.

Fig. 3

Table background color, text justification, vertical alignment, and cell padding as local overrides

CSSa is a cost saver in that it does not require everything to be marked up perfectly using semantically named styles. Of course the publisher is free to demand perfect markup for certain works or book series. This can be achieved by Schematron checks that warn if certain verbose layout properties, be it in style definitions or as local overrides, are found in the HoBoTS XML. These checks may be selectively defined per imprint, per series, or per work (see Configuration Cascade). But time to market is also important; and given EPUB’s status as currently the only commercially relevant artifact of this workflow, it is probably good enough to postpone advanced semantic tagging to when it’s commercially or strategically relevant.

Another use of CSSa is indeed quality control. As we will see in Split Table Cells, there are conventions for naming the styles of split table cells in InDesign, so that they may be joined for the XML/EPUB versions. In most cases, a pre-split cell lacks its bottom border while a post-split cell lacks its top border. So a Schematron warning may be raised if a cell has all but the bottom borders but its style name does not contain the keyword 'SPLIT'.
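The border heuristic can be sketched in a few lines of Python. In the actual workflow this is a Schematron rule applied to XML; the dictionary representation of a cell and the function name below are hypothetical.

```python
def split_keyword_warnings(cells):
    """Return (index, message) pairs for cells that look pre-split
    (every border except the bottom one) but whose style name lacks
    the keyword 'SPLIT'.

    Each cell is a dict with 'style' (the style name) and 'borders'
    (the set of sides that carry a border).
    """
    warnings = []
    for index, cell in enumerate(cells):
        looks_pre_split = cell["borders"] == {"top", "left", "right"}
        if looks_pre_split and "SPLIT" not in cell["style"]:
            message = "no bottom border, but style %r lacks SPLIT" % cell["style"]
            warnings.append((index, message))
    return warnings
```

Because this is only a check, a false positive costs the typesetter a glance at the report and never alters the converted content.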

These checks will be applied to an intermediate XML format rather than on the HoBoTS XML, so this is in itself not an argument in favor of adding layout information to BITS. However, after the initial conversion to EPUB is finished, the HoBoTS XML will become the master file. As the checking rules repertoire is growing over time and more heuristics might be introduced to infer semantics from layout, it seems sensible to carry along the layout information on top of BITS.

CSSa blends with BITS (and other structure-oriented schemas) as an orthogonal layer that may easily be stripped away for interchange. Another orthogonal layer that we anticipated is RDFa. RDFa will be used for marking up the correct answers in tests, for example. Hogrefe will use or create a controlled semantic vocabulary for marking up tests. It seems natural to use a technology that was designed for adding meaning to content in a controlled manner. Therefore we preferred enhancing BITS with RDFa to using content-type attributes in HoBoTS or class attributes in HTML/EPUB.

After converting the BITS DTD files to Relax NG using trang[1], combining BITS, RDFa, and CSSa was straightforward. The only difficulty was that the DTD did not contain any hooks for extending the allowed attributes of most elements. So we wrote an XSLT that created an RNG schema patch like this: wherever the attributes @xml:lang, @content-type, @display-as, or @abbr are allowed, all CSSa and RDFa attributes are also allowed. So we did not have to alter BITS itself. We just import an unaltered BITS 0.2 (or 1.0, eventually), CSSa, RDFa, and our automatically created patch.

There are some other changes to BITS that are also contained in the HoBoTS RNG schema.[1] For example, Hogrefe wanted to have at least one mandatory block-level item (such as a paragraph) within table cells instead of inline content, as multi-paragraph table cells are commonplace in their books.

Typesetting Conventions

Although the aim is to only minimally change the conventional typesetting workflow, there must be some conventions in place for an automatic converter to function correctly.

Style Names

It starts with style names. As we were configuring the converter for the first couple of book series, we discovered that we introduced too many individual per-series configurations because they called a chapter heading 'h_level1' here, 'h_level2' there and 'chapter_heading' over there. We tamed the wilderness by agreeing on a list of basic style names. Everything that is semantically the same must have the same base name. Note that this does not mean that such items have to look the same in the print layout, nor is the list an exhaustive inventory of all style names that may occur. The typesetters have the liberty to add arbitrary suffixes to the base names after a tilde character (for CSS-compliant naming, the tilde will be converted to '_-_' in CSSa). These suffixes may designate a paragraph style to have no indent, an extra vertical space before, or both. In most cases, the converter simply ignores the suffixes.

A Schematron check will compare the document style names (or rather, its CSSa rule names) against the saved CSSa rules of a template file and report any deviations in the HTML report.
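The naming convention and the template comparison can be sketched as follows. The style names are made up, and comparing by base name only (ignoring the tilde suffix) is an assumption; the paper does not say how strictly the actual Schematron check compares names.

```python
def base_name(style):
    """Return the part of a style name before the tilde suffix."""
    return style.split("~", 1)[0]


def css_name(style):
    """Convert an InDesign style name to its CSS-compliant CSSa form."""
    return style.replace("~", "_-_")


def unknown_styles(document_styles, template_styles):
    """List document styles whose base name does not occur in the template."""
    known = {base_name(s) for s in template_styles}
    return sorted(s for s in document_styles if base_name(s) not in known)
```

With a template that defines 'p' and 'h1', a document style 'p~noindent' passes the check, while 'chapter_heading' is reported as a deviation.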

Anchoring

Marginal notes, floating images and the like must be anchored lest they appear at the end of the document. In cases where this anchoring may not be achieved with InDesign’s native anchoring (a two-page spread filled entirely with figures, a multi-page boxed text, …), an alternative mechanism may be used. It is an ID/RID-mechanism based on conditional text: insert the ID text into a story that should be anchored (e.g., an image caption that is grouped with an image rectangle), assign the condition 'StoryID'. Insert the same text at the position where it should be anchored, and assign the condition 'StoryRef'. Make sure to hide StoryID and StoryRef conditional text for printing. A similar mechanism is available for figures without captions. These alternative anchorings will be processed by our idml2xml converter[1] in the same way as normal InDesign anchorings.
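The ID/RID mechanism amounts to a lookup of stories by their ID text. The sketch below assumes a simplified data model (dictionaries with hypothetical 'story_id' and 'story_ref' keys); the real idml2xml converter works on IDML's XML and is not shown here.

```python
def resolve_story_anchors(stories, main_flow):
    """Place each story at the position whose StoryRef text matches its
    StoryID text. Returns the resolved flow and the stories that were
    never referenced (candidates for an 'unanchored' message)."""
    by_id = {s["story_id"]: s for s in stories if s.get("story_id")}
    resolved, placed = [], set()
    for item in main_flow:
        ref = item.get("story_ref")
        if ref in by_id:
            resolved.append(by_id[ref])  # story moves to the reference position
            placed.add(ref)
        else:
            resolved.append(item)
    unanchored = [s for sid, s in by_id.items() if sid not in placed]
    return resolved, unanchored
```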

Footnotes in Tables

InDesign does not support footnotes in tables. This has not improved in recent years despite the publishers’ need for this feature. We are not speaking of footnotes that should be placed at the bottom of the table. We are speaking of footnotes in the regular text flow whose reference marker’s fate is to be located in a table.

As a workaround, typesetters have been using footnotes that are placed before or after the table, but with an invisible marker. The reference marker has been inserted manually as text. We are supporting this workaround in that we have agreed upon naming conventions for the character styles around invisible footnotes and for pseudo reference markers. The conversion process will move the invisible footnotes to the pseudo markers’ locations within the table. In addition, there are Schematron checks that complain about invisible footnotes with incorrect style names and about a mismatch between pseudo reference marker and invisible footnote counts.
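A minimal sketch of the count check and the relocation step follows. Pairing markers and footnotes in document order is an assumption, as are the function name and the list-based data model.

```python
def place_table_footnotes(pseudo_markers, invisible_footnotes):
    """Pair pseudo reference markers with invisible footnotes in
    document order and report a count mismatch, if any."""
    messages = []
    if len(pseudo_markers) != len(invisible_footnotes):
        messages.append(
            "marker/footnote count mismatch: %d markers, %d footnotes"
            % (len(pseudo_markers), len(invisible_footnotes))
        )
    # zip stops at the shorter list, so surplus items remain unpaired.
    pairs = list(zip(pseudo_markers, invisible_footnotes))
    return pairs, messages
```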

For tables that cover a whole page or more, typesetters use another workaround: they create a manual footnote in a distinct table row, and a manual reference marker in the text. These footnotes will also be converted to footnotes proper and placed at the pseudo reference marker location, with checks in place.

Split Table Cells

Sometimes it is necessary to split cells of a table when there should be a page break in the middle of the cell. The pre- and post-split cells should be joined in the electronic formats. This is achieved by adding the keywords 'SPLIT' and 'REST' (in the sense of “remainder”) in the pre- and post-split cell style names, respectively (after the tilde, cf. Style Names). If these keywords are present in a cell, a dedicated XSLT pass will merge all cells of the pre-split row with the corresponding cells of the post-split row. This only happens when the cell counts of the two rows are identical. Failure of this merger may be detected with Schematron later, based on the continued presence of 'REST' cell styles in the converted document.
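The row merger described above can be sketched in Python. The production code is an XSLT pass; the list-of-dictionaries table model is hypothetical, and keeping the pre-split cell's style for the merged cell is an assumption.

```python
def merge_split_rows(rows):
    """Join a row containing 'SPLIT' cells with the following row
    containing 'REST' cells, but only if both rows have the same
    number of cells."""
    merged, i = [], 0
    while i < len(rows):
        row = rows[i]
        nxt = rows[i + 1] if i + 1 < len(rows) else None
        if (nxt is not None
                and any("SPLIT" in c["style"] for c in row)
                and any("REST" in c["style"] for c in nxt)
                and len(row) == len(nxt)):
            merged.append([
                {"style": a["style"], "paras": a["paras"] + b["paras"]}
                for a, b in zip(row, nxt)
            ])
            i += 2  # both rows are consumed
        else:
            merged.append(row)
            i += 1
    return merged


def leftover_rest_cells(rows):
    """Later check: any remaining 'REST' style signals a failed merger."""
    return [c for row in rows for c in row if "REST" in c["style"]]
```

Note that the conversion itself relies only on the fixed keywords, while detecting a failed merger is left to a separate check.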

An additional complication arises from the fact that cells may be split at paragraph boundaries or in the middle of a paragraph. If pre- and post-split paragraphs should be merged after the merger of their parent cells, the same 'SPLIT' and 'REST' keywords may be added to the paragraph style names. If these keywords are present, the paragraphs will be merged, with the style name of the pre-split paragraph as the resulting common style.

As discussed in Schema: Adding Style to BITS, missing bottom or top table borders indicate a cell split and may be used to heuristically check for missing 'SPLIT'/'REST' keywords. For split paragraphs, presence of @css:text-align-last="justify" in the last paragraph of a cell may be used as an indication that this is a pre-split paragraph that should be joined. Schematron may complain about missing 'SPLIT'/'REST' keywords then.

This is an example for an important concept:

Box 1

Don’t base the conversion on heuristics. Use strict rules, fixed keywords, etc. The checks, however, may indeed be based on heuristics.

Conversion Pipeline; Configuration Cascade

The conversion is based on open standards such as XProc[1] and XSLT 2.0[1]. A conversion pipeline consists of several macroscopic steps, each of which can itself comprise up to 30 individual XSLT passes. The macroscopic steps of Hogrefe’s book conversion are depicted in Fig. 4.

Fig. 4

Conversion pipeline © 2013 Maren Pufe

The first step converts IDML to a flat Hub XML[1] document. Hub is a DocBook-based XML format with CSSa and CALS tables. It deviates from DocBook 5.1 in that the document structure may be flat, i.e., consist of only paragraphs and tables below the top-level element. It also may contain tab and line break elements.*

The IDML → flat Hub converter runs without any configuration. The next step in the pipeline is what we call hierarchizing the flat Hub: creating a section hierarchy, associating tables and figures with their captions, nesting lists, etc. This is the step that accepts by far the most configuration options: Regexes for the style names of the chapter/section headings, the figure titles, etc. We call this step “hierarchization” because of its primary function of creating a section hierarchy. There is a basic hierarchization configuration for Hogrefe’s default InDesign template. But in order to be able to treat different layouts or content types differently, this configuration may be overridden. The standard XSLT import mechanism comes in handy here. Not only static configuration variables (said regexes), but also XSLT templates may be overridden, which is a powerful tool to handle special transformation requirements efficiently. The more specific XSLT (e.g., for a series) typically imports the next less-specific stylesheet (for an imprint, which in turn imports the common XSLT, which in turn imports the default XSLT library’s stylesheet).
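The core idea of hierarchization, nesting flat paragraphs into sections according to heading style names matched by regexes, can be illustrated with a short Python sketch. The real step is a series of XSLT passes; the style-name patterns and the dictionary-based tree below are hypothetical.

```python
import re

# Hypothetical configuration: regexes for heading style names by level.
# A more specific configuration could override this list.
HEADING_LEVELS = [
    (re.compile(r"^h1(~.*)?$"), 1),
    (re.compile(r"^h2(~.*)?$"), 2),
]


def heading_level(style):
    """Return the heading level for a style name, or None for body text."""
    for pattern, level in HEADING_LEVELS:
        if pattern.match(style):
            return level
    return None


def hierarchize(paragraphs):
    """Nest a flat list of (style, text) pairs into a section tree."""
    root = {"title": None, "level": 0, "children": []}
    stack = [root]
    for style, text in paragraphs:
        level = heading_level(style)
        if level is None:
            stack[-1]["children"].append(text)
            continue
        # Close all open sections of the same or a deeper level.
        while stack[-1]["level"] >= level:
            stack.pop()
        section = {"title": text, "level": level, "children": []}
        stack[-1]["children"].append(section)
        stack.append(section)
    return root
```

Overriding only the pattern list for a series corresponds to overriding a static configuration variable; overriding the nesting logic itself corresponds to overriding an XSLT template.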

Just as the macroscopic steps are orchestrated into the overall conversion pipeline using XProc, the individual XSLT passes of a macroscopic conversion step are orchestrated by means of an XProc pipeline. Books of a series that need some special preprocessing may be processed by a different hierarchization XProc pipeline that invokes additional XSLT passes. For example, if chapter numbers and chapter titles are typeset in separate paragraphs, they need to be contracted (merged) prior to hierarchization.

After hierarchization to (almost) standard DocBook 5.1, the document is converted to HoBoTS, which is straightforward for most parts of the structure. The reason why we don’t up-convert directly into the target vocabulary is that some of the up-conversion templates (not only hierarchizing, but also table nesting recognition etc.) are quite complex, and we don’t want to maintain a separate set of up-conversion templates for each target vocabulary.

Configuration Cascade

The other steps, conversion to HTML and EPUB generation, are also quite straightforward, but with a sophisticated twist regarding CSS forwarding, as mentioned in Schema: Adding Style to BITS. This is a good example of the configuration cascade: first a common CSS file for the Hogrefe Group’s EPUBs is referenced in the generated HTML. Then follows a CSS file (if available) for the imprint (e.g., for the English-language division), then (if available) one for the series, and ultimately one for the individual work. All these files, if present, reside at defined locations from which they are referenced. Then the filtered and forwarded CSS rules from the document are included in a style element. All this central styling will be parsed according to the CSS precedence rules, and a combined stylesheet will be generated. Only this combined stylesheet is referenced from the HTML, which avoids including verbose style elements or referencing non-existent CSS files in the potentially many individual HTML files that will emerge from the EPUB builder. In addition, the surviving (i.e., excluding font-family, font-size, …) CSS properties for local layout overrides will be serialized as @style attributes.
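Generating the combined stylesheet can be approximated by merging rule sets in cascade order. The sketch below is a deliberate simplification: it lets later layers win per property and ignores selector specificity, which real CSS precedence rules take into account. The function name and the dictionary representation are assumptions.

```python
def combine_stylesheets(layers):
    """Merge rule dictionaries (selector -> {property: value}).

    Layers are given from least to most specific (common, imprint,
    series, work, document); None marks a level without a CSS file.
    Later layers override earlier ones per property.
    """
    combined = {}
    for layer in layers:
        if layer is None:
            continue  # no CSS file at this cascade level
        for selector, props in layer.items():
            combined.setdefault(selector, {}).update(props)
    return combined
```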

The same specificity cascade as depicted in Fig. 5 is applied to all other kinds of configuration data: the splitting point configuration for the EPUB builder, the up-conversion XSLT stylesheets and XProc pipelines, the HoBoTS to HTML conversion templates, etc. For example, the HTML conversion stylesheet may generate numbered links from a bibliography entry to the citation xrefs in the text. A similar backlinking may be switched on between index entries and index terms. If this is considered to create too much visual clutter, these options may be switched off for a whole imprint, for a book series, or for an individual work.

Fig. 5

Configuration cascade © 2013 Maren Pufe

The benefit of a cascaded configuration cannot be overestimated, because:

Box 2

The lack of defined customization hooks in other conversion pipelines has yielded one or multiple of three unfortunate outcomes: bloated converters that were tuned to handle the most absurd fringe cases in the input; content that was shoehorned into unfit (more...)

Schematron Checks

The different sizes of the check marks in Fig. 4 symbolize the number of Schematron checks that are performed after each macroscopic step. As described above, the Schematron errors and warnings range from unanchored frames and unknown styles to all-boldface pseudo headings and uncited references.

The resulting messages (and also the HoBoTS Relax NG validation messages) will be inserted in an HTML rendering of the flat Hub document, as seen in Fig. 6.

Fig. 6

HTML report with messages linked from the error locations

Schematron rules are selected according to the configuration cascade. The difference from the XSLT, XProc, CSS, and other configuration selection rules is that for a given conversion stage (anchoring, styles, flat Hub, hierarchized Hub, HoBoTS, …), the rules are accumulated. So not only the most specific rules, but all rules of the cascade, i.e., common, per-imprint, per-series, and per-work rules, will be applied to the intermediate document at the given process stage.
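The contrast between the two selection strategies can be made concrete with a small Python sketch. The level names follow the paper; the file names, function names, and dictionary-based configuration are hypothetical.

```python
LEVELS = ("common", "imprint", "series", "work")


def most_specific(config):
    """Select the single most specific entry, as for a stylesheet or
    pipeline that overrides its less specific counterparts."""
    for level in reversed(LEVELS):
        if level in config:
            return config[level]
    return None


def accumulated(config):
    """Collect the entries of every level, as for Schematron rules."""
    return [rule for level in LEVELS for rule in config.get(level, [])]
```

For example, `most_specific({"common": "a.xsl", "series": "b.xsl"})` picks the series entry, whereas `accumulated` keeps the common rules and appends those of the more specific levels.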

It is amazing how fast Schematron rules can be added if an issue is discovered. It is often a matter of a mere 20 minutes to identify the need, write the rule, commit and deploy it.

User interface

The conversion may be run from the command line, either by invoking Java/Calabash or with a Makefile frontend.

An alternative approach is a simple WebDAV or Web browser interface where people can upload the files. There are file name conventions in place so that the system can infer from the filename which imprint, series and work ID a given IDML file belongs to.

Fig. 7

Progress messages in the Web interface

In addition, the typesetter may upload images as PDF, PNG, or JPEG, also obeying certain naming conventions. These will be converted to PNGs and JPEGs with appropriate resolutions and color spaces, using ImageMagick on the server side.

This image conversion is available as a post-upload action. Other post-upload actions are:

  • adding an input file or a conversion artifact to revision control (content repository);
  • invoking a CrossRef resolution of the references contained in the resulting HoBoTS XML.

The CrossRef resolution post-conversion option, which is also orchestrated by XProc[1], will post a request to CrossRef, poll a mail account for an answer, and generate an InDesign script that will add DOIs and links to dx.doi.org to the reference paragraphs in the InDesign file. The CrossRef query batch result and the InDesign script are stored in the content repository where the typesetter can check them out.

Fig. 8 contains a screenshot of the upload interface with the post-conversion actions available.

Fig. 8

Actions available after the conversion pipeline has run on the IDML file: “add to revision control” and “CrossRef resolution”

Hogrefe is using Subversion as a revision control system for the content. Once a HoBoTS XML file has taken over the role of master file from IDML, a flag (a Subversion property) may be set on the HoBoTS file that will save the versioned file from being overwritten with new automatic conversion results.

Getting the Software

Some adaptations have been made specifically for Hogrefe. But most modules of the software have been released under a permissive open source license (2-clause BSD).

We are currently assembling a demo project[1] that may be checked out with a Subversion client**.

Conclusion

It has been demonstrated that a high-quality ex-post conversion of InDesign books to BITS and EPUB is feasible. Key success factors are Schematron checks and sophisticated XSLT 2.0 templates. Key factors for making it manageable for the variations encountered in book publishing are cascaded configuration and encapsulation of the individual steps as XProc pipelines.

Although this demonstrated conversion methodology is absolutely feasible, the author believes that converting from paginated media does not constitute the future of publishing workflows. But given the widespread use of InDesign these days, the demonstrated approach with its only minor deviations from established typesetting practice seems to be a sensible tradeoff between the typesetting system’s XML capabilities, process robustness, time to market, and cost-efficiency.

Footnotes

*

We could have used BITS as a base format, or HTMLBook[1], if they had been available at the time when we specified Hub XML. We were looking for a flat format that may be up-converted to a book’s hierarchized XML representation within the same vocabulary, and NLM/JATS was lacking the book parts then.

**

Browsing the cited URL does not offer much insight because Subversion will fetch other modules as externals. So please use an svn client.

Copyright 2013 by le-tex publishing services GmbH.

The copyright holder grants the U.S. National Library of Medicine permission to archive and post a copy of this paper on the Journal Article Tag Suite Conference proceedings website.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

Bookshelf ID: NBK159733
