NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Journal Article Tag Suite Conference (JATS-Con) Proceedings 2010 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2010.

Bookshelf ID: NBK47314

JATS to EPUB: Unraveling the Mystery

Laura Kelly.

National Center for Biotechnology Information, National Library of Medicine

It's no great mystery that the increasing support of the EPUB format is making it more and more attractive to publishers. But what does seem to be a mystery is how best to use existing JATS data to generate EPUB content. This paper looks to unravel that mystery by explaining the basics of EPUB and examining how it relates to the Tag Suite. The paper also discusses ways that existing Tag Suite tools can be utilized to make the process of generating EPUB data easier.

What is EPUB?

EPUB is an open standard from the International Digital Publishing Forum (IDPF) that describes electronic publications. It comprises three separate standards: the Open Publication Structure (OPS), the Open Packaging Format (OPF), and the OEBPS* Container Format (OCF). These standards are used together to capture a publication's contents and structure, and to provide a machine-readable packaging system that allows eReaders to properly identify and display the content. A complete publication is made up of different files, each complying with one of the three standards, zipped into one file whose extension is "epub".

Because EPUB files are zipped content, they may be opened using a file compression program like WinZip or Archive Manager. While this method of viewing EPUB files is not the intended display method for eReading devices, it is helpful in understanding how the EPUB files are structured and created.

Content Container: OCF

The internal directory structure of all EPUB files is described by the OCF specification. At the root level of the EPUB file there is one mimetype file, one folder named META-INF, and one or more content directories. The specification does not dictate the name of the content folders, but since content may be stored at this level in multiple formats or "renditions," the specification does recommend using one folder per rendition (see Ref. 2). In Figure 1, the OPS and PDF directories are the content directories. The OPS folder contains the content compliant with the OPS specification for EPUB and the PDF folder contains a PDF file of the document.

Fig. 1. Unzipped EPUB File. JATStoEPUB.epub directory structure showing META-INF and OPS folders and a file named mimetype.

The mimetype file identifies the EPUB file as such and is required for eReaders. Its only content is the string "application/epub+zip".
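Because an EPUB file is an ordinary ZIP archive, the structure described above can also be inspected programmatically. The following is a minimal sketch in Python, which is not the tooling used in this paper; the function name inspect_epub is hypothetical. It lists the entries of an archive and checks that the mimetype entry holds the expected string.

```python
import io
import zipfile

EPUB_MIMETYPE = "application/epub+zip"


def inspect_epub(epub_file):
    """Return (entry names, True if the mimetype entry holds the EPUB string).

    epub_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(epub_file) as archive:
        names = archive.namelist()
        is_epub = (
            "mimetype" in names
            and archive.read("mimetype").decode("ascii").strip() == EPUB_MIMETYPE
        )
    return names, is_epub


if __name__ == "__main__":
    # Build a tiny in-memory archive so the sketch runs without a real EPUB file.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", EPUB_MIMETYPE)
        archive.writestr("META-INF/container.xml", "<container/>")
    print(inspect_epub(buffer))
```

Passing the path of a real file, such as JATStoEPUB.epub, in place of the in-memory buffer should list the same kind of structure shown in Figure 1.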

Within META-INF are files containing information about the EPUB document, including the container-level metadata, content, and encryption. At minimum, META-INF must contain one file called container.xml that points to the root file of the OPS rendition of the publication. If more than one rendition is included, the container.xml file should point to the root file of each rendition. In JATStoEPUB.epub, META-INF only contains container.xml, which points to the root file for the OPS rendition (OPS/epb.opf) and the root file for both the OPF and PDF renditions (for renditions that consist of only one file, that file is the root). Figure 2 shows the contents of container.xml.

Fig. 2. META-INF/container.xml.
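To make the role of container.xml concrete, the sketch below reads a container file and lists the root file of each rendition. It is an illustration in Python rather than part of this project's pipeline. The sample document is an assumption modeled on the description above, and the PDF path in it is hypothetical; the namespace and the OPF media type are those defined by the OCF and OPF specifications.

```python
import xml.etree.ElementTree as ET

CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"

# Illustrative container.xml; the PDF/article.pdf path is hypothetical.
SAMPLE_CONTAINER = """
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OPS/epb.opf" media-type="application/oebps-package+xml"/>
    <rootfile full-path="PDF/article.pdf" media-type="application/pdf"/>
  </rootfiles>
</container>
"""


def list_rootfiles(container_xml):
    """Return (full-path, media-type) for every rootfile in a container.xml string."""
    root = ET.fromstring(container_xml.strip())
    return [
        (rootfile.get("full-path"), rootfile.get("media-type"))
        for rootfile in root.iter("{%s}rootfile" % CONTAINER_NS)
    ]


if __name__ == "__main__":
    for full_path, media_type in list_rootfiles(SAMPLE_CONTAINER):
        print(full_path, media_type)
```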

Packaging Standards: OPF

The metadata requirements for EPUB publications are defined by the OPF specification. Specifically, the OPF:

  • Describes and references all components of the electronic publication (e.g. markup files, images, navigation structures).

  • Provides publication-level metadata.

  • Specifies the linear reading-order of the publication.

  • Provides fallback information for when extensions to OPS are employed.

  • Provides a mechanism to specify a declarative global navigation structure (the NCX).

(Open Packaging Format 2.0.1 v1.0.1; see Ref. 4)

These requirements are all met with two files, the OPF and the NCX. In JATStoEPUB.epub, the files are called epb.opf and epb.ncx, respectively. As illustrated in Figure 3, these two files are stored in the OPS content directory. NCX (Navigation Control file for XML applications or Navigation Center eXtended) is part of the DAISY Consortium's Digital Talking Book (DTBook) specification and is used to define the publication's navigational structure. The OPF file includes all other information required by the OPF specification.

Fig. 3. OPS content directory.

The OPF uses the Dublin Core Metadata Initiative® (DCMI) element set as the vocabulary for the publication metadata. These elements are contained in a metadata element. The OPF specification adds attributes to these elements that are specific to the OPF namespace, but the element set is otherwise unchanged. The full content of epb.opf is shown in Figure 4.

Fig. 4. epb.opf file.

The other components of the epb.opf are the manifest element, which lists all files that make up the publication (OCF container files are excluded from this list), and the spine element, which defines the reading order for the publication. As shown in Figure 4, there is only one item referenced in the spine element, and that is the XHTML of the article. Not present in this example is an optional guide element, which provides links to parts of the document that the document author wishes the reader to have quick access to (for example, a list of figures or a glossary).
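The three parts of the OPF file (metadata, manifest, and spine) can be sketched with a few lines of Python. This is an illustration only and not the XSL used in this project: the function name build_opf, the id values, and the article.xhtml file name are assumptions, while the namespaces and media types are those defined by the OPF specification and DCMI.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("", OPF_NS)   # serialize OPF elements without a prefix
ET.register_namespace("dc", DC_NS)  # serialize Dublin Core elements as dc:*


def build_opf(title, identifier, language, items, spine_ids):
    """Return a minimal OPF package document as a string.

    items     -- list of (id, href, media-type) tuples for the manifest
    spine_ids -- manifest ids in linear reading order
    Assumes one manifest item has the id "ncx" (referenced by spine/@toc).
    """
    package = ET.Element(
        f"{{{OPF_NS}}}package",
        {"version": "2.0", "unique-identifier": "pub-id"},
    )
    metadata = ET.SubElement(package, f"{{{OPF_NS}}}metadata")
    ET.SubElement(metadata, f"{{{DC_NS}}}title").text = title
    ET.SubElement(metadata, f"{{{DC_NS}}}identifier", {"id": "pub-id"}).text = identifier
    ET.SubElement(metadata, f"{{{DC_NS}}}language").text = language

    manifest = ET.SubElement(package, f"{{{OPF_NS}}}manifest")
    for item_id, href, media_type in items:
        ET.SubElement(
            manifest,
            f"{{{OPF_NS}}}item",
            {"id": item_id, "href": href, "media-type": media_type},
        )

    spine = ET.SubElement(package, f"{{{OPF_NS}}}spine", {"toc": "ncx"})
    for item_id in spine_ids:
        ET.SubElement(spine, f"{{{OPF_NS}}}itemref", {"idref": item_id})
    return ET.tostring(package, encoding="unicode")


if __name__ == "__main__":
    print(build_opf(
        "JATS to EPUB: Unraveling the Mystery",
        "urn:uuid:00000000-0000-4000-8000-000000000000",  # placeholder identifier
        "en",
        [("ncx", "epb.ncx", "application/x-dtbncx+xml"),
         ("article", "article.xhtml", "application/xhtml+xml")],
        ["article"],
    ))
```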

The NCX file defines the navigational structure for the document. This structure is accessed by the end users as a "clickable" table of contents to navigate to various parts of the document. The NCX differs from the spine element in that it is used by the end user whereas the spine is read by the eReading device to load the content in the proper order.

Content Markup Standard: OPS

The OPS defines the preferred vocabularies and supported media types for EPUB documents. The specified content markup includes the DTBook and XHTML 1.1 vocabularies. CSS2 is also identified in the OPS as the method by which the EPUB documents are styled. The included vocabularies of both XHTML and CSS are subsets of the full specifications from the W3C, however. The IDPF imposed limits on what is included from those vocabularies to reduce the burden on eReading systems, both hardware and software.

An eReading system may support more of the vocabularies than are included in the OPS, and the specification states that, in reference to styles, compliant eReaders are required to "gracefully degrade" unsupported properties (see Ref. 6) and must not suffer catastrophic failure because of their presence in the data. Despite this safety measure, following the vocabularies set forth in the OPS will ensure that data is supported across the widest range of eReaders.

DTBook, XHTML, and CSS are all included in the OPS-specified list of core media types, that is, those MIME types that must be supported by compliant eReading systems. The full list of supported types is GIF, PNG, JPEG, SVG, XML, and CSS. Files that are not of a core media type may be included in the EPUB document, but for each non-supported object, a fallback item of a supported MIME type must be included. EReaders may support more file types than those specified as core types, so including non-core media files is allowed. One object may have several formats, including more than one non-core media type, but there must be at least one object in the fallback chain that is of a core media type. For images, the required "alt" attribute that provides a short textual description of the image is defined as a sufficient fallback for that object. Figure 5 shows an example of a simple fallback chain.

Fig. 5. Fallback chain.
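The fallback rule can be expressed as a short walk along the chain of manifest items. The Python sketch below is an illustration under stated assumptions: the manifest is modeled as a plain dictionary, the item ids and the function name resolve_fallback are hypothetical, and the set of core media types is limited to the types named above (the OPS specification holds the authoritative list).

```python
# MIME types for the core media types named in the text; consult the OPS
# specification for the authoritative list.
CORE_MEDIA_TYPES = {
    "image/gif",
    "image/png",
    "image/jpeg",
    "image/svg+xml",
    "application/xhtml+xml",
    "application/x-dtbook+xml",
    "application/xml",
    "text/css",
}


def resolve_fallback(item_id, manifest):
    """Follow fallback links from item_id until a core media type is found.

    manifest maps an item id to a dict with a "media-type" key and an
    optional "fallback" key naming another item id. Returns the id of the
    first core-media-type item in the chain, or None if the chain ends,
    points at a missing item, or loops back on itself.
    """
    seen = set()
    current = item_id
    while current is not None and current not in seen:
        seen.add(current)
        item = manifest.get(current)
        if item is None:
            return None
        if item["media-type"] in CORE_MEDIA_TYPES:
            return current
        current = item.get("fallback")
    return None


if __name__ == "__main__":
    # Hypothetical chain: a video falls back to a PDF, which falls back to a PNG.
    manifest = {
        "movie": {"media-type": "video/quicktime", "fallback": "slides"},
        "slides": {"media-type": "application/pdf", "fallback": "still"},
        "still": {"media-type": "image/png"},
    }
    print(resolve_fallback("movie", manifest))
```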

The JATS to EPUB Conversion Process

When developing a conversion to create EPUB data from the Tag Suite, the first step was analyzing the requirements of the target output. Those requirements are outlined above. Next in the process was to evaluate the Tag Suite data to decide whether or not the required information exists. The requirements for the content and metadata are easily met but the packaging and container information is not explicitly included in JATS data. Most of the required information is available, however, and can be extracted.

To create the necessary EPUB components, the single XML input document must be processed multiple times to create several files in separate directories. This kind of complicated processing can be handled using the XML pipelining language, XProc. XProc describes sequences of operations to be performed on XML documents. These operations include running XSL transforms, which allows existing tools to be repurposed.

This project to convert Tag Suite data into EPUB documents began with multiple XSL transforms rather than beginning with XProc directly. As such, the pipeline for this project relies heavily on that XProc feature to produce the necessary EPUB components of the container, packaging, and content markup. Figure 6 illustrates the pipeline process described below.

Fig. 6. JATS to EPUB pipeline. Graphical representation of the pipeline converting JATS data to EPUB. (A) indicates where the Universally Unique Identifier (UUID) is created. The output from this step becomes the input for (B), (C), and (D) so the XSL transforms ...

The Container

The EPUB container requires three things:

  1. the prescribed directory structure,

  2. a container.xml file pointing to the publication's OPF file, and

  3. a mimetype file to identify the EPUB document.

To meet these requirements, the XProc pipeline creates the primary output directory with the META-INF and OPS subdirectories. It then calls an XSL transform to create the container.xml file, which references the OPF file containing the document's metadata, manifest, and reading order. Because there is only ever one OPF file per publication, this conversion always writes the file as epb.opf to simplify the container file generation.
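For readers without an XProc engine at hand, the same container skeleton can be sketched in Python. This is a stand-in for illustration, not the pipeline described in this paper; the function name create_skeleton and the output directory name are assumptions, and the fixed OPS/epb.opf path follows the convention described above.

```python
import tempfile
from pathlib import Path

# container.xml pointing at the single OPF file, always written as OPS/epb.opf.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OPS/epb.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def create_skeleton(output_dir):
    """Create the directories, mimetype file, and container.xml for an EPUB."""
    root = Path(output_dir)
    (root / "META-INF").mkdir(parents=True, exist_ok=True)
    (root / "OPS").mkdir(parents=True, exist_ok=True)
    # The mimetype file holds only the identifying string, with no trailing newline.
    (root / "mimetype").write_text("application/epub+zip", encoding="ascii")
    (root / "META-INF" / "container.xml").write_text(CONTAINER_XML, encoding="utf-8")
    return root


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        skeleton = create_skeleton(Path(tmp) / "JATStoEPUB")
        for path in sorted(skeleton.rglob("*")):
            print(path.relative_to(skeleton).as_posix())
```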

Packaging

After creating the directory structure, the pipeline invokes a pair of XSL transforms to generate the packaging components from the JATS source data. The XSL to create the epb.opf file converts JATS metadata elements to the DCMI elements. As previously mentioned, the metadata is tagged with DCMI elements, some with OPF-specific attributes added. These attributes allow preservation of some of the granularity present in JATS metadata which is absent from the standard DCMI element set. Table 1 shows several Tag Suite metadata elements and their corresponding DCMI elements, some with OPF attributes.

In the OPF file, the minimum metadata required is one title, one identifier, and one language element. The language attribute is always present in JATS data, but neither the title nor the identifier is a required JATS element. To meet the requirement for a unique identifier, this conversion invokes the XProc function to generate a Universally Unique Identifier (UUID) for the publication. The UUID is used in various places throughout the EPUB document, including both the OPF (see Figure 4) and NCX files.
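Outside of XProc, an equivalent identifier can be produced with most standard libraries. The sketch below uses Python's uuid module as a stand-in for the XProc function; the function name new_publication_id is hypothetical, and the "urn:uuid:" prefix is a common convention assumed here rather than something taken from this conversion.

```python
import uuid


def new_publication_id():
    """Return a random (version 4) UUID formatted as a URN."""
    return "urn:uuid:" + str(uuid.uuid4())


if __name__ == "__main__":
    publication_id = new_publication_id()
    # The same value would be written to both the OPF and the NCX file.
    print(publication_id)
```

Generating the identifier once and passing it to every later step mirrors step (A) of the pipeline, where the UUID is created before the other transforms run.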

The manifest element lists all files that make up the publication content. Any included objects that are not of a core media type must also have a fallback object identified. Generating the manifest requires several actions to be taken on all of the referenced object files in the source XML. These include objects referenced by JATS tags such as graphic, media, inline-supplementary-material, and supplementary-material. The conversion must check whether or not the MIME type of each referenced object is a core media type. For any object that is not, the conversion checks for any specified alternatives for the object. If none of the specified alternatives are of a core media type or there are no specified alternatives, the conversion must generate a warning message that alerts the user to the problem. All of these checks must be run on the value provided in the xlink:href attribute of the objects, which can also prove to be a challenge. Object references vary greatly across systems, and in some cases, the object reference includes neither the file extension nor the mimetype attribute, making it impossible to determine the object's MIME type.

This conversion will write out the file names of the referenced objects as they appear in the source data, however, regardless of whether or not the object's MIME information is available. If the MIME information either cannot be determined or is not among the core media types, the conversion will write a warning message, not an error. This allows the conversion process to complete but alerts users to the problem. Figure 7 shows the warning messages generated by the conversion in both the OPF file and on the command line during the conversion process.

Fig. 7. Non-supported MIME type warnings.
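The warn-but-continue behavior described above can be sketched as follows. This Python illustration is not the XSL used in this conversion: the function name check_media and the warning wording are hypothetical, MIME types are guessed from the file extension with the standard mimetypes module, and the set of core media types is limited to the types named earlier in this paper.

```python
import mimetypes
import warnings

# MIME types for the core media types named in the text; consult the OPS
# specification for the authoritative list.
CORE_MEDIA_TYPES = {
    "image/gif",
    "image/png",
    "image/jpeg",
    "image/svg+xml",
    "application/xhtml+xml",
    "application/x-dtbook+xml",
    "application/xml",
    "text/css",
}


def check_media(href, declared_type=None):
    """Return the MIME type for href, warning (not failing) on problems.

    declared_type stands in for MIME information supplied in the source
    markup; when absent, the type is guessed from the file extension.
    """
    mime_type = declared_type or mimetypes.guess_type(href)[0]
    if mime_type is None:
        warnings.warn(f"{href}: MIME type could not be determined")
    elif mime_type not in CORE_MEDIA_TYPES:
        warnings.warn(f"{href}: {mime_type} is not a core media type; a fallback is needed")
    return mime_type


if __name__ == "__main__":
    # Hypothetical object references: one core type, one with no extension.
    for reference in ["fig1.png", "fig2"]:
        print(reference, check_media(reference))
```

Issuing a warning rather than raising an exception lets processing continue for the remaining objects, which matches the design choice described above.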

Following the OPF file generation, the XProc pipeline runs an XSL transform to generate the NCX file for the document, which provides all navigation information for the document. In the case of most articles, the primary navigation point is just the article itself. Figure 8 shows the NCX file as generated by this conversion.

Fig. 8. NCX file.
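A minimal NCX document for a single-article publication can be sketched in Python as shown below. This is an illustration, not the XSL used in this conversion: the function name build_ncx, the navPoint id pattern, and the article.xhtml file name are assumptions, while the namespace, version value, and element names come from the NCX section of the DTBook specification.

```python
import xml.etree.ElementTree as ET

NCX_NS = "http://www.daisy.org/z3986/2005/ncx/"
ET.register_namespace("", NCX_NS)  # serialize NCX elements without a prefix


def build_ncx(uid, title, nav_points):
    """Return a minimal NCX document as a string.

    uid        -- the publication identifier (same value as in the OPF file)
    nav_points -- list of (label, src) tuples in reading order
    """
    def q(tag):
        return f"{{{NCX_NS}}}{tag}"

    ncx = ET.Element(q("ncx"), {"version": "2005-1"})
    head = ET.SubElement(ncx, q("head"))
    ET.SubElement(head, q("meta"), {"name": "dtb:uid", "content": uid})

    doc_title = ET.SubElement(ncx, q("docTitle"))
    ET.SubElement(doc_title, q("text")).text = title

    nav_map = ET.SubElement(ncx, q("navMap"))
    for order, (label, src) in enumerate(nav_points, start=1):
        point = ET.SubElement(
            nav_map,
            q("navPoint"),
            {"id": f"navpoint-{order}", "playOrder": str(order)},
        )
        nav_label = ET.SubElement(point, q("navLabel"))
        ET.SubElement(nav_label, q("text")).text = label
        ET.SubElement(point, q("content"), {"src": src})
    return ET.tostring(ncx, encoding="unicode")


if __name__ == "__main__":
    print(build_ncx(
        "urn:uuid:00000000-0000-4000-8000-000000000000",  # placeholder identifier
        "JATS to EPUB: Unraveling the Mystery",
        [("JATS to EPUB: Unraveling the Mystery", "article.xhtml")],
    ))
```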

Content Markup

With the container and packaging complete, the only remaining piece is the content markup. Because XHTML is a preferred vocabulary for EPUB, this conversion process began by using the article preview XSL transform available on the Tag Suite website (see Ref. 11). The XHTML output generated by that transform follows the transitional XHTML DTD rather than the strict subset required by the OPS. After making some modifications to a customized version of that stylesheet to produce the required XHTML, it became clear that the table structure employed in that page layout would not be appropriate for eReader display (see Figure 9).

Fig. 9. Screenshot of 2-column EPUB layout.

Despite the layout concerns, however, parts of the preview stylesheet can be repurposed for the conversion to EPUB. This project uses the NLM-style citation processor, which saves a great deal of time and effort. The pipeline can be customized to use the preview stylesheet's APA citation processor, or any other citation conversion process to which the user has access.

As demonstrated in Figure 9, certain limitations of the target format must be kept in mind. The use of tables in formatting pages is one example. EReaders do not handle tables well, because tables often prevent text from reflowing easily. Reflowable text is a primary concern for EPUB data preparation because it ensures that no matter the eReader screen or text size, the content will be displayed correctly.

The last component for the EPUB package is the CSS. Unlike other implementations where the style sheet is referenced by a URI, here the style sheet must be delivered as a file within the package. At the present time, the XProc pipeline calls an XSL transform whose only purpose is to generate a CSS file. There is no actual transformation in the XSL file, so it is, admittedly, an abuse of XSL. By housing the file's generation in a separate XSL, however, that XSL can easily be substituted in the pipeline for one of the user's choosing, allowing easy replacement of the included CSS.

Final Steps

Once the container, package, and content are all complete, there is still some manual work required in order to complete the EPUB document. The referenced media files must be copied into the output directory structure before creating the *.epub file. While this step could be included as part of the XProc pipeline, currently it must be done manually. This forced intervention gives the user a chance to review the included objects for possible unsupported file formats with no fallback items and to review the objects for their appropriateness for eReading systems.

The final step is to compress the contents, including the subdirectory structure, and save it with the epub extension. The JATS data is now an EPUB document, ready for delivery to eReading systems.
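The compression step can also be scripted. The Python sketch below is an illustration rather than part of this project's pipeline, and the function name package_epub is hypothetical. It adds one detail that comes from the OCF specification rather than from the discussion above: the mimetype file must be the first entry in the archive and must be stored without compression.

```python
import zipfile
from pathlib import Path


def package_epub(source_dir, epub_path):
    """Zip the contents of source_dir into an EPUB file at epub_path.

    epub_path should lie outside source_dir so the archive does not
    try to include itself.
    """
    source = Path(source_dir)
    mimetype_file = source / "mimetype"
    with zipfile.ZipFile(epub_path, "w") as archive:
        # OCF: mimetype goes first and is stored uncompressed.
        archive.write(mimetype_file, "mimetype", compress_type=zipfile.ZIP_STORED)
        for path in sorted(source.rglob("*")):
            if path.is_file() and path != mimetype_file:
                archive.write(
                    path,
                    path.relative_to(source).as_posix(),
                    compress_type=zipfile.ZIP_DEFLATED,
                )
    return epub_path
```

A call such as package_epub("JATStoEPUB", "JATStoEPUB.epub"), with a directory name assumed for illustration, would produce the finished file once the media files have been copied into place.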

Challenges, Limitations, and Known Issues

As with any conversion project, there are some issues when converting JATS data to EPUB. Some of these issues arise because of the limited nature of the target output and others arise because of the source data. The IDPF's choice to include existing specifications and vocabularies has reduced the number of these issues significantly, but they have not been completely eliminated.

The limitation on supported media types presents an issue when converting data. While the limitations are necessary to ease the burden on eReading systems, no such limitation exists in the Tag Suite and converting files to core media types requires manual intervention.

Table structures are also potential problems. Because of the need for reflowable text, large or complicated tables are not handled well in eReaders. Additionally, some eReaders do not handle nested tables at all. Since nested tables are allowed in the Tag Suite and flattening a table structure cannot be automated with much reliability, manual review is recommended before including these structures.

This conversion started out using only XSL. As the complexity of the project grew, the XProc pipeline was implemented to better handle the complicated processing. The XSL to perform certain functions was already written, so those pieces were retained even where XProc could do the work itself. Generating the container.xml and jatsepub.css files, for example, can be handled directly with XProc functions. Future development on this project will include shifting more functions to the XProc pipeline in order to streamline the process.

XHTML, with Metadata, in a ZIP file

In creating the EPUB specifications, the IDPF repurposed existing vocabularies and standards. This has made EPUB not only easy to understand but also quickly attainable. With little new vocabulary to learn and many existing tools to generate required components, EPUB isn't much of a mystery at all.

Tools Used

1. XProc engine: Calabash by Norman Walsh. Available online at http://xmlcalabash.com. The author of this paper used version 0.9.23 when developing this project.
2. XSLT processor: Saxon by Michael Kay. The author used the open source version 9.2 for this project. Project information is available online at http://saxonica.com. The product is available for download from http://saxon.sourceforge.net/.
3. eBook management: Calibre by Kovid Goyal. Available from http://calibre-ebook.com/. Calibre was used for previewing EPUB files and capturing the screenshot in Figure 9.

References

1. Open Container Format (OCF) 2.0.1 v1.0.1 [Internet]. International Digital Publishing Forum™; c2010 [cited 2010 Sep 22]. Available from http://www.idpf.org/doc_library/epub/OCF_2.0.1_draft.doc.
2. Open Container Format (OCF) 2.0.1 v1.0.1 [Internet]. International Digital Publishing Forum™; c2010. Section 2.3.2, Single-publication containers, but with alternate renditions; [cited 2010 Sep 22]. Available from http://www.idpf.org/doc_library/epub/OCF_2.0.1_draft.doc.
3. Open Packaging Format (OPF) 2.0.1 v1.0.1 [Internet]. International Digital Publishing Forum™; c2010 [cited 2010 Sep 22]. Available from http://www.idpf.org/doc_library/epub/OPF_2.0.1_draft.htm#Section1.1.
4. Open Packaging Format (OPF) 2.0.1 v1.0.1 [Internet]. International Digital Publishing Forum™; c2010. Section 1.1, Purpose and Scope; [cited 2010 Sep 22]. Available from http://www.idpf.org/doc_library/epub/OPF_2.0.1_draft.htm#Section1.1.
5. Open Publication Structure (OPS) 2.0.1 v1.0.1 [Internet]. International Digital Publishing Forum™; c2010 [cited 2010 Sep 22]. Available from http://www.idpf.org/doc_library/epub/OPS_2.0.1_draft.htm.
6. Open Publication Structure (OPS) 2.0.1 v1.0.1 [Internet]. International Digital Publishing Forum™; c2010. Section 1.3.5, Relationship to CSS; [cited 2010 Sep 22]. Available from http://www.idpf.org/doc_library/epub/OPS_2.0.1_draft.htm#Section1.3.5.
7. ANSI/NISO Z39.86 - Specification for the Digital Talking Book [PDF]. National Information Standards Organization; c2010. Section 8, Navigation Control File (NCX). ISBN 978-1-880124-63-5. Available from http://www.niso.org. pp 43-56.
8. XProc: An XML Pipeline Language [Internet]. W3C; c2010 [cited 20 Sep 2010]. Available from http://www.w3.org/TR/xproc/
9. Hazelhurst, Colin. c2009-2010. Inside epub. Blog entries 31 Dec 2009–6 Jan 2010. http://netkingcol.blogspot.com/2009/12/introduction-to-epub.html. Accessed 24 Aug 2010.
10. Walsh, Norman. XML Pipelines: A Guide to XProc [draft publication]. Richard Hamilton, editor. c2010 Norman Walsh. Chapters 1, 2, 3, 7. Draft status. Accessed 23 Sep 2010. Available from http://xprocbook.com/book/book-1.html.
11. NLM Journal Archiving and Interchange Tag Suite: Tools for Version 3.0. Preview XSLT. Available from http://dtd.nlm.nih.gov/tools/tools.html.
12. Kelly, Laura (National Center for Biotechnology Information, Bethesda, MD). Delivering PMC Content to Mobile Web and eBook Devices: A Feasibility Study [Internal report]. Located: NCBI SharePoint server IEB/ELS/beck/Shared Documents.

Footnotes

* Open eBook Publication Structure (OEBPS or OEB) is a previous version of the specifications currently represented as the OPS and OPF specifications. The IDPF separated the two specifications to enable modular adoption. The combination of OPS and OPF supersedes the OEBPS format but the abbreviation is still used in many parts of the EPUB specifications, including the MIME identifier for OPS files.

This work is in the public domain and may be freely distributed and copied. However, it is requested that in any subsequent use of this work, the author be given appropriate acknowledgment.
