NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Journal Article Tag Suite Conference (JATS-Con) Proceedings 2010 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2010.

Bookshelf ID: NBK47083

Journals and Magazines and Books, Oh My! A Look at ACS' Use of NLM Tagsets

Dan O'Brien and Jeff Fisher.


American Chemical Society, Publications Production Systems

Over the past several years, the ACS Publishing Division has implemented XML-based publishing processes that make use of the NLM Tagsets. We describe how our chemistry-related Journals, Books, and Magazine publications have all implemented various flavors of the NLM Tagsets in both the production and electronic delivery of our content. We discuss some of the customizations that we made to the tagsets, why we made them, how the tagsets are used within our environment, and some of the successes and mistakes that we have experienced along the way.

Introduction

This paper presents a case study of how the Publications Division of the American Chemical Society (“ACS Pubs”) has utilized and applied the XML Tagsets from the National Library of Medicine (“NLM”) (1) in the production and delivery of ACS Pubs’ content products.

First we will quickly introduce ACS as an organization, and then we will take a high-level look at ACS Pubs’ production process for the products that apply to this paper: our journals, books, and our magazine. Next, the bulk of the paper discusses how each ACS Pubs product has made use of NLM tagsets, what types of changes we made, and some of the decisions that were made during the course of implementing our XML program. Finally, we present some lessons learned, some specific successes, and a view of where we are heading next.

What is ACS and ACS Publications

The American Chemical Society (“ACS”) is a professional membership organization, founded in 1876 and chartered by the U.S. Congress, that represents over 160,000 professionals at all degree levels and in all fields of chemistry and sciences that involve chemistry. Primary ACS divisions include Membership and Programs; Chemical Abstracts (“CAS”), a secondary publisher of chemical-related data, information abstracts, and databases; and the Publications Division (“ACS Pubs”).

ACS Pubs Product Types

ACS Pubs publishes three types of content products:

  1. Journals: 40 peer-reviewed journals cover all types of chemistry and cross-disciplinary fields. Annually they represent 300,000 published print pages. A number of journals, representing about half of ACS' annual volume, are published on a weekly basis, while others are published on monthly or bi-weekly schedules.

  2. Books: an average of 30 peer-reviewed books are edited, compiled, and published annually through the Symposium Series. Each book averages around 25 chapters, and while each chapter is independently authored and submitted, the book itself is the primary focus and is published as a cohesive whole. (This contrasts with journals in which the focus is on article-level publication, with issues serving more as a binder of published articles.)

  3. Magazine: ACS' flagship magazine, Chemical & Engineering News (“C&EN”), is published weekly and it is likened to a "Business Week" for those in chemistry-related disciplines. In addition to specific issues published in print and online, C&EN also publishes daily news articles online.

Each of these three product types has incorporated XML into its content production processes to some degree, allowing each to distribute its content in multiple forms: print, online, mobile, etc. Each product type has unique characteristics, however, that influence the design of the XML that is produced. Before we start discussing the differences in how these product types use XML, we will first take a quick look at the basic, overall processes used by ACS Pubs to produce these product types.

How ACS Pubs Content is Produced and Published

Journals and Books

While ACS' Journals and Books differ in some key product characteristics (a couple mentioned above, with more mentioned later), they also share some similarities. Because of these similarities, they are able to share many of the same high-level production and publication processes: submission and peer review, XML production and editing, page composition, and delivery. We step through each of these processes below.

Submission and Peer Review

ACS Pubs follows the classic scholarly publishing model for accepting manuscripts: an author writes a manuscript and submits it to a publisher as a candidate for publication; the publisher arranges for a "blind" review of the paper by peers; and finally the journal's editorial office makes a decision whether to accept or reject the paper for publication.

In addition to Journal articles, book chapters are also handled this way by ACS Pubs, although the book chapter authors are invited to submit their content based on the book's specific topic.

ACS makes use of the ManuscriptCentral system, from ScholarOne (2), to handle submission, peer review, and Editorial Office operations for both Journals and Symposium Series Books. Article and chapter content is uploaded by the author to the ManuscriptCentral system in various formats, predominately using MS-Word, although a percentage reflects other formats such as LaTeX. A PDF of the submission is generated by ManuscriptCentral to facilitate peer review and production.

After acceptance for publication by the Editorial Office, ManuscriptCentral automatically transfers the manuscript's content and metadata to the ACS Pubs production workflow system, usually within minutes of acceptance (3).

XML Production

The production process for ACS' Journals and Books uses an "XML First" approach in which the content is converted to XML as early as possible in the process. This allows the XML file to be the version of record for all subsequent editing, page composition, and online delivery stages.

XML Conversion. Once in the ACS production workflow system, the first step in the XML production phase is conversion of the author's submitted manuscript content to XML. ACS Pubs utilizes external vendors to convert the author's content into XML. The process of requesting and receiving converted files is also highly automated, with the manuscript content automatically packaged and delivered to the conversion vendor within minutes of entering the XML production workflow. In parallel, while the manuscript text is converted to XML, any associated artwork or media from the author is also extracted and prepared for publication.

After the conversion vendor has completed its XML conversion work, the ACS Pubs production workflow system receives the converted XML content and automatically routes the manuscript to a queue where it awaits editing by a Technical Editor.

Technical editing. The journal article or book chapter then enters the Technical Editing phase, which consists of three distinct steps:

  1. Automated Pre-Editing

  2. Technical Editing

  3. Automated Validations and Post-Editing

The middle step, Technical Editing, is performed by an editorial staff having chemistry or related science degrees. The focus of the technical editors is to apply ACS editorial style to the content, clarify scientific intent, and enrich the content through additional XML presentation and semantic markup. For example, an author's reference to a particular protein may be tagged with a semantic link to its entry within an external online protein database.

Because the Technical Editing staff is highly skilled and focused primarily on providing scientific and semantic enrichment, ACS Pubs has automated the more routine XML updates in the automated Pre-Editing stage, before the article or chapter is presented to the Technical Editor. Thousands of automated edits are made at this stage, including actions such as normalizing British spelling to American English and adding metadata tagging. Likewise, after the Technical Editor completes his or her work on the paper, automated post-edits and validations take over to further condition the XML content for composition and eventual delivery.
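As a rough illustration of what one automated pre-edit might look like, the following minimal Python sketch normalizes British spellings inside text nodes while leaving the markup alone. The spelling map, the function names, and the sample paragraph are all hypothetical; this is not ACS' actual pre-editing tooling.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical British -> American spelling map; a real pre-edit stage
# would presumably rely on a much larger, curated list.
SPELLING = {"colour": "color", "sulphur": "sulfur", "analyse": "analyze"}

_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SPELLING)) + r")\b", re.IGNORECASE
)


def _americanize(text):
    """Replace each mapped word, keeping a leading capital if present."""
    def swap(match):
        word = match.group(0)
        replacement = SPELLING[word.lower()]
        return replacement.capitalize() if word[0].isupper() else replacement
    return _PATTERN.sub(swap, text)


def pre_edit(xml_source):
    """Normalize spelling in text nodes only; tags and attributes are untouched."""
    root = ET.fromstring(xml_source)
    for elem in root.iter():
        if elem.text:
            elem.text = _americanize(elem.text)
        if elem.tail:
            elem.tail = _americanize(elem.tail)
    return ET.tostring(root, encoding="unicode")


sample = "<p>The <i>colour</i> of sulphur. Colour matters.</p>"
print(pre_edit(sample))
```

Working on the parsed tree rather than on the raw file keeps an edit like this from accidentally rewriting tag or attribute names.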

At this point in the process, we have a complete "XML package" for the given article or chapter: a completed, edited XML file with accompanying graphics, media, and supplemental material.

Page Composition and Review. After the "XML package" is readied, an initial "proof PDF" is composed from the XML in preparation for proofing. The composed PDF is produced by a composition vendor where the requests to, and responses from, the vendor occur automatically at the direction of the ACS production workflow system.

For journal articles, a "proof package" is automatically packaged and delivered to the "proof review author" (who was identified during the manuscript submission process) via ACS' proof delivery site. Book chapters receive an ACS internal review only. Any required manuscript changes are made to the "XML package" as specified by the reviewer.

After the proof is approved, a publication-ready PDF is composed in anticipation of online distribution.

See the following figure for a diagram of the overall process flow.

Fig. 1. Process Flow for ACS Journals.


Distribution and Delivery

The journal and book processes diverge somewhat in the distribution and delivery phase.

Journals

At this point, a journal article is ready to be published online, ahead of its print issue. Different publishers have different names for this initial online-only version of the article: epub, advanced articles, in press, ahead-of-print, etc. At ACS, they are called Articles ASAP (As Soon As Publishable). Articles posted ASAP represent the first instance of publication for that article; the effective article "publication date" that is later added to the article XML reflects this ASAP publication date (4). Most articles are not assigned to an issue until after they are posted online as an ASAP article.

After the ASAP article is posted to the ACS Pubs online delivery website, it remains there until it is eventually replaced with a version from the print issue. The only effective differences between the ASAP and Issue versions of the article are the publication date (reflecting the original ASAP date) and the article citation information (containing the final issue, volume, and page numbers of the issue to which it was assigned), both added to the Issue version.

Books

In contrast to journal articles, which are individually published online as soon as they are ready, book chapters are not published independently. Instead, each chapter awaits the finalization of its parent book XML. The book XML production process includes an extra final step of linking the individual chapter XML files into the parent book XML using XInclude.
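The chapter-linking step can be sketched with Python's standard-library XInclude support. The element names, file names, and in-memory loader below are assumptions made for illustration only; they do not reflect the actual ACS book schema or workflow.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical chapter files, held in memory so the sketch is self-contained.
CHAPTERS = {
    "ch001.xml": "<chapter id='ch001'><title>First Chapter</title></chapter>",
    "ch002.xml": "<chapter id='ch002'><title>Second Chapter</title></chapter>",
}

# Hypothetical parent book file that references its chapters via XInclude.
BOOK = """<book xmlns:xi="http://www.w3.org/2001/XInclude">
  <title>Example Symposium Volume</title>
  <xi:include href="ch001.xml"/>
  <xi:include href="ch002.xml"/>
</book>"""


def load_chapter(href, parse, encoding=None):
    """Loader used in place of the file system: look the chapter up by href."""
    if parse != "xml":
        raise ValueError("only XML includes are expected in this sketch")
    return ET.fromstring(CHAPTERS[href])


def assemble_book(book_source):
    """Resolve every xi:include in place and return the merged book tree."""
    root = ET.fromstring(book_source)
    ElementInclude.include(root, loader=load_chapter)
    return root


book = assemble_book(BOOK)
chapter_ids = [chapter.get("id") for chapter in book.findall("chapter")]
print(chapter_ids)
```

Because each chapter stays in its own file until the includes are resolved, chapters can be edited independently and the merged book regenerated whenever a draft is needed.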

A draft of the entire book is then composed and used for several parallel "book finalization" activities, such as:

  • Proofing and corrections;

  • Submission for generation of the subject index;

  • Submission to the U.S. Library of Congress for receipt of CIP data;

  • Generation of MARC records needed by libraries and subscribing institutions.

After these book finalization steps are completed, the book and chapters are published online.

Technologies at a Glance

Table 1

Function                              Technology
------------------------------------  -------------------------------------
XML Editing                           Arbortext Editor from PTC
                                      MathFlow Editor from Design Science
                                      XML AutoRedact from Inera
                                      ACS custom built tools
Composition                           Arbortext APP (formerly 3B2) from PTC
Web Delivery                          Literatum from Atypon Systems
Workflow & Content Management System  Documentum from EMC

C&EN Magazine

The life of a magazine article follows a roughly parallel path to journal and book manuscripts but covers different terrain along the way. Several features of the magazine content and its production process distinguish it from journals and books:

  • Internal vs. external authoring: Magazine content is largely written by staff or contract writers, as opposed to external journal and book authors that petition for acceptance of their papers.

  • Highly designed pages: Magazine content tends to place a higher value on page design and styling as compared to journals or books. There is less of a separation of design from content with many magazine articles, where the design is often adjusted to fit the needs of a given article; for example, text may be colored to reflect the theme of the article or issue.

  • News vs. issue articles: Magazine staff produces online news articles, only some of which may be eventually bundled or further edited for inclusion within an issue. Other content is explicitly written from the start as feature articles for particular issues.

  • Edit to fit: The length of a magazine issue is usually pre-planned, so content is sometimes edited and/or padded with ads or graphics to fit within the issue's planned page range – after the initial draft pages have been laid out.

  • Looser concept of "article": Articles often contain sub-articles or sidebars that constitute a separate but related narrative. A single "master article" may actually be a composite of several smaller articles, such as news bits, event listings, letters to the editor, etc. While these composite articles may be represented as a cohesive whole within the print issue, each smaller chunk may be delivered as an independent article in its electronic form online.

These magazine-specific features not only influence the type of XML markup required but the production process as well. While an "XML-First" production flow was evaluated for C&EN magazine, the production attributes of highly designed pages, "edit to fit," and looser article definition all suggest that XML is better extracted towards the end of the production process, where it is needed to facilitate online delivery and syndication.

However, to allow meaningful XML to be produced towards the end of the content lifecycle, the content needs to be kept in a well-structured state throughout the production process. In Adobe InCopy, a set of structured but flexible content templates allows content organization and structure to be maintained during the rapid editing cycles needed for page design and "editing to fit."

This results in the following overall process for the C&EN magazine:

  1. Story/Article Authoring: Staff and contract writers provide content, either as breaking news or to fit specific topics for an upcoming issue theme. Stories are written in MS Word using a basic template, and then sent to the Production Editors.

  2. Editorial Production and Page Composition: Production Editors approve stories and perform any needed copyediting. Page Designers adapt or create a page design for the article and flow the article text into it. Production Editors review draft pages and determine what further "edit to fit" is required.

  3. XML Extraction: In preparation for online delivery and syndication to other delivery channels, the structured content is extracted into XML form.

  4. Distribution and Delivery: Using the extracted XML, content is fed through web templates to produce the online article pages. The same process occurs for both the weekly issues as well as the individual daily news stories.

Definitions and Terminology

Before a discussion of ACS' use of NLM tagsets, we want to lay out some terminology that we will use as a framework within that discussion.

Tagset vs. Schema

For the purposes of this paper, we differentiate between the terms "tag," "tag definition," "tagset," "module," and "schema" as defined below.

  • Tag: an instance of user-defined XML markup: an element, attribute, named entity, etc.

    For example, <citation> is a tag, and the "citation-type" attribute within <citation citation-type="journal"> is also a tag.

  • Tag Definition: the specific coding within a DTD or XSD that declares the name of the tag, what type of content it may contain, etc.

    For example, the following simplified DTD tag definition declares a "citation" tag, and defines that it allows two or more child tags:

         <!ELEMENT citation (author+ , source)  >

  • Module: a way of logically organizing tag definitions, primarily to allow reuse of these tag definitions for multiple schemas. Related tag definitions are often grouped together and stored within larger chunks that we will call "modules" (5). The use of modules stands in contrast to listing all tag definitions within a single DTD or XSD file. These modules may also be called "entity files" or "include files." The process of determining how tag definitions should be grouped into modules is beyond the focus of this paper.

    For example, if both a Book and a Journal schema wanted to share the same tag called "citation," they could both reference the same "citation" module that contains all of the necessary tag definitions to completely represent a citation.

    Within file citation.ent, we might find all tags related to our simple citation:

         <!ELEMENT citation (author+ , source) >
         <!ELEMENT author (#PCDATA) >
         <!ELEMENT source (#PCDATA) >

  • Tagset: a collection of related tag definitions, usually stored within a set of interrelated modules, that provides a complete XML vocabulary for a given application. A tagset includes a comprehensive set of tags, but stops short of modeling a specific type of XML document.

    For example, a "Book" tagset might comprise several modules that, taken as a collection, define a complete set of XML tags for marking up a book or things related to a book. E.g., a book tagset might comprise these modules:

    • Special module for defining book metadata tags,

    • General module for defining tags related to text formatting,

    • General module for defining tags related to references,

    • Public module for defining tags related to tables,

    • etc.

  • Schema: a specific use of a tagset to form a specific content model. This model would be coded as a DTD, an XML Schema (XSD file), etc. that in turn would reference specific modules from the tagset.

    For example, three different schemas could all be built from the single "Book" tagset:

    • A "book" schema that models an entire book from cover to cover;

    • A "chapter" schema that models just a book chapter;

    • A "book info" schema that models the information needed to create metadata feeds about a book.
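To make the relationship between tag definitions and valid XML concrete, here is a minimal Python sketch that checks a document against the simplified citation.ent definitions shown above. Expressing each content model as a regular expression over child tag names is an assumption chosen for brevity; it is not how a DTD parser works, and it ignores text content and attributes.

```python
import re
import xml.etree.ElementTree as ET

# Content models keyed by tag name, mirroring the simplified citation.ent
# example. Each model is a regular expression over a space-terminated
# sequence of child tag names; None means text only (#PCDATA).
CONTENT_MODELS = {
    "citation": r"(author )+source ",   # (author+ , source)
    "author": None,                      # (#PCDATA)
    "source": None,                      # (#PCDATA)
}


def is_valid(elem):
    """Recursively check an element and its children against CONTENT_MODELS."""
    if elem.tag not in CONTENT_MODELS:
        return False                     # undeclared tag
    model = CONTENT_MODELS[elem.tag]
    children = list(elem)
    if model is None:
        return not children              # text-only tags allow no child tags
    sequence = "".join(child.tag + " " for child in children)
    if re.fullmatch(model, sequence) is None:
        return False
    return all(is_valid(child) for child in children)


good = ET.fromstring(
    "<citation><author>A. Author</author><author>B. Author</author>"
    "<source>J. Example</source></citation>"
)
bad = ET.fromstring("<citation><source>J. Example</source></citation>")

print(is_valid(good), is_valid(bad))
```

In this picture, a schema corresponds to a particular choice of root tag and set of content models, while a module is simply the place where a group of those models is stored.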

Tagset/Schema "Customization Levels"

One could consider that the possible approaches to customizing a public tagset or schema could occur along a spectrum between two ends:

  • On one end, no customizations are applied and the public tagset/schema is used precisely as-is.

  • On the other end, only the general principles behind the public tagset are used to "inform" the design of one's own schema; the actual public tagset isn't used at all in the development of the target tagset or schema.

In reality, ACS' application of the NLM tagsets occurs somewhere between these extremes. From our own anecdotal experience and discussions with other organizations, we suspect that many other applications of the NLM tagsets also fall somewhere between these extremes.

We will give names to a few points along this spectrum; we will call these "customization levels."

List of "Customization Levels"
  • As-is: The public version is used without changes or modification, with the sole exception that the schema filename or public identifier may be different.

  • Extended: A superset of public tagset/schema is used, with additional tags defined beyond the public version. XML that is valid for the public schema would also be valid to the extended version of the schema, although the reverse may not be true.

    For example, the definition of tag <xyz> in the public schema may allow tags <a> and <b>, while the customized version defines tag <xyz> allowing tags <a>, <b>, and new tag <c>.

  • Reduced: A subset of the public schema is used, with some tags removed from the schema. XML that is valid for the public schema may not be valid to the reduced version, although the reverse would be true.

    For example, the definition of tag <xyz> in the public schema may allow tags <a>, <b>, and <c>, while the customized version defines tag <xyz> as only allowing tags <a> and <b>.

  • Customized: Modifications are made to the public tag definitions that combine both extensions and reductions. A “customized” tagset still uses the same tag names and similar tag hierarchies as the public schema. XML that is valid for the public schema is likely not valid to the customized schema, and likewise XML that is valid to the customized schema is likely not valid to the public schema.

    For example, the definition of tag <xyz> in the public schema may only allow tags <a> and <b>, while the customized version defines tag <xyz> as allowing only tags <b> and <c>.

  • Built from: Modifications made to the public tagset are more substantial and include renamed tags, non-trivial changes to the tag hierarchy, etc. Many of the modules from the public schema are used as the starting point for a tagset in this category.

    For example, the definition of tag <xyz> in the public schema may allow child tags <a> and <b>, while the customized version renames the tag as <abc> and only allows tags <a> and <b> to exist when grouped within tag <c>: <abc><c><a/><b/></c></abc>

  • Informed by: The public tagset may be referenced during the design of the custom tagset, but is not directly used in its construction. Tags could occasionally share similar names or models, but they are not defined using the same modules.
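The first four levels can be told apart by comparing which child tags a public and a customized definition allow. The Python sketch below does this for a single tag, under the simplifying assumption that every child tag is optional and that order does not matter; the function names are illustrative only. The "built from" and "informed by" levels involve renamed tags and restructured hierarchies, which this comparison does not attempt to model.

```python
def classify(public_children, custom_children):
    """Name the customization level implied by two sets of allowed child tags."""
    public, custom = set(public_children), set(custom_children)
    if custom == public:
        return "as-is"
    if custom > public:
        return "extended"      # strict superset of the public definition
    if custom < public:
        return "reduced"       # strict subset of the public definition
    return "customized"        # some tags added and some removed


def is_valid(instance_children, allowed_children):
    """An instance is valid if it uses only allowed (all-optional) child tags."""
    return set(instance_children) <= set(allowed_children)


PUBLIC = {"a", "b"}           # public definition of <xyz>
EXTENDED = {"a", "b", "c"}    # customized definition that adds <c>

level = classify(PUBLIC, EXTENDED)
public_doc_ok = is_valid(["a", "b"], EXTENDED)        # public XML, extended schema
extended_doc_ok = is_valid(["a", "b", "c"], PUBLIC)   # extended XML, public schema
print(level, public_doc_ok, extended_doc_ok)
```

The asymmetry in the last two checks is the point of the "extended" level: public XML remains valid to the extended schema, but XML that uses the new tag is not valid to the public one.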

Table 2. Comparisons and Examples of Customization Levels

Customization  Public XML valid to  Public Tagset/Schema          Customized Tagset/Schema            Customized XML valid
Level          Customized schema?   example                       example                             to Public schema?
-------------  -------------------  ----------------------------  ----------------------------------  --------------------
As-is          Yes                  <xyz> <a/> <b/> <c/> </xyz>   <xyz> <a/> <b/> <c/> </xyz>         Yes
Extended       Yes (a)              <xyz> <a/> <b/> <c/> </xyz>   <xyz> <a/> <b/> <c/> <d/> </xyz>    Likely not
Reduced        Likely not           <xyz> <a/> <b/> <c/> </xyz>   <xyz> <a/> <b/> <c/> </xyz>         Yes (b)
Customized     Likely not           <xyz> <a/> <b/> <c/> </xyz>   <xyz> <a/> <b/> <c/> <d/> </xyz>    Likely not
Built from     No                   <xyz> <a/> <b/> <c/> </xyz>   <abc> <c> <a/> <b/> </c> </abc>     No
Informed by    No                   <xyz> <a/> <b/> <c/> </xyz>   <abc> <aa/> <bb/> <cc/> </abc>      No
                                    (defined in module xyz.ent)   (defined in module custom-abc.ent)
(a) Assuming that the new tags are designated as “optional,” and not “required,” within the customized schema.

(b) Assuming that the removed tags were designated as “optional,” and not “required,” within the public schema.

Tagset/Schema "Customization Implementation Methods"

In addition to defining various levels of customization, we further recognize two basic approaches to implementing these customizations (with the exception of the "as-is" and "informed by" levels):

  • Modifications. A rather simple and straightforward approach is to merely edit the tag definitions within the actual modules provided with the public tagset. An advantage of this approach is that it can be done rapidly with little understanding of the underlying design of the public tagset.

    However, one drawback to this approach is that it may not be obvious to other future schema developers if a given module is the original version or a customized version. Another drawback is the need to carefully manage versions of a shared module. Otherwise, an application that has awareness of several different schemas, where those schemas "share" common modules (for example, a book and journal schema share a "paragraph" module), may pick up the wrong version of a module when attempting to load a given schema. This could lead to runtime errors or unintended consequences for applications like composition systems and XML editors that are often configured to handle different types of content.

  • Overrides. With this approach, the public tagset files are not directly modified but are left intact. Any customized tag definitions are located within separate, schema-specific modules that are positioned within the schema to "override" the public versions of the same tag definition.

    An advantage to this approach is a sharper divide between customizations and original tagset definitions. The original module from the public tagset is not modified in this approach, so any other schemas that use the public module will continue to pick up the correct tag definitions.

    However, this approach does require that the original tagset was specifically developed to allow "override" capabilities. The NLM tagsets are fortunately and thoughtfully designed with many override capabilities.

Example. Consider that an organization wishes to leverage a public tagset to build a schema for their publication application. Within that public tagset, module tagsoup.ent contains tag definitions for two tags: <xyz> and <abc>. However, while the organization wants to use the existing public definition for tag <xyz>, they decide that they need a different, customized definition for tag <abc>, perhaps one that allows some additional child tags.

Assuming that the public tagset was designed to allow overrides, they have two choices on how to implement their customization for tag <abc>:

  1. They edit module tagsoup.ent to change the original definition of tag <abc> to be their own definition.

  2. They create a second, enhanced definition for <abc> within a new, private module called custom-tagsoup.ent. Both modules tagsoup.ent and custom-tagsoup.ent are used within their schema, but because custom-tagsoup.ent is referenced first within the schema, its custom definition for tag <abc> takes precedence.

These two implementation methods are not mutually exclusive, and a given tagset or schema could be customized using some combination of both Modifications and Overrides.
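The precedence rule behind the Overrides approach can be modeled in a few lines of Python. This is only a sketch of the "first definition wins" behavior described in the example; in an actual DTD that rule applies to entity declarations (the first declaration encountered is binding), and the content models and module contents below are hypothetical.

```python
def build_schema(*modules):
    """Merge modules in reference order; the first definition of a tag wins."""
    schema = {}
    for module in modules:
        for tag, definition in module.items():
            schema.setdefault(tag, definition)   # later definitions are ignored
    return schema


# Public module tagsoup.ent, left untouched (hypothetical content models).
TAGSOUP = {"xyz": "(a, b)", "abc": "(a, b)"}

# Private module custom-tagsoup.ent, holding only the customized definition.
CUSTOM_TAGSOUP = {"abc": "(a, b, c?)"}

# The custom module is referenced first, so its <abc> definition takes
# precedence, while <xyz> still comes from the public module.
schema = build_schema(CUSTOM_TAGSOUP, TAGSOUP)
print(schema)
```

A second schema built with `build_schema(TAGSOUP)` alone would still see the original definitions, which is the isolation the Overrides approach is meant to provide.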

Tagset/Schema “Customization Profile”

A “customization profile” for a tagset/schema can be defined using the combination of the "customization levels" and “implementation methods” defined above. A two-dimensional chart that contains the possible combinations can be formed using the customization levels on the X-axis and customization implementation methods on the Y-axis. Within this chart, the Y-axis implementation methods would not apply to the extremes of the X-axis customization spectrum: neither the "as-is" nor the "informed by" customization levels would actually utilize the customization implementation methods on this scale. (6)

Table 3. Tagset/Schema Customization Profile

                      Customization Implementation Methods
Customization Levels  Overrides  Mixed  Modifications
--------------------  ---------  -----  -------------
As-is
Extended
Reduced
Customized
Built from
Informed by

ACS Pubs' Use of NLM Tagsets

Overview

When we set out to implement XML-based production for each of our primary product types (Journals, Books, Magazine), we were faced with answering questions such as:

  • Whether to leverage a public schema or develop one from scratch?

  • If utilizing a public schema, whether customization was needed? (I.e., where do we land on the "Customization Levels” spectrum?)

  • If customization was needed,

    • How much customization was needed?

    • What customizations are needed?

    • How to implement the customizations? (I.e., where do we land on the "Implementation Methods" spectrum?)

For each product type, we weighed four primary factors when attempting to answer the questions above.

  1. ACS-specific product requirements: what unique characteristics does ACS Pubs have for journal article, book chapter, or magazine story print/page layout? For online HTML delivery? For metadata feeds, e.g. bibliographic feeds to institutions and CrossRef? For other eventual external content consumers, such as secondary publishers or online repositories like Portico?

  2. ACS-specific process requirements: what unique process characteristics does ACS Pubs have for production and delivery of journal article, book chapter, or magazine content? Process requirements can be rooted in either technical constraints and/or ACS business practices.

  3. ACS-specific terminology: over the decades, certain vocabularies have become entrenched in ACS' products and the production processes. During the XML production implementation efforts, our philosophy was that the XML vocabulary should fit ACS' existing terminology – when appropriate – instead of trying to adapt ACS products or processes to a new vocabulary articulated within a public schema. An XML vocabulary with a radically new terminology would have, at best, dramatically increased staff training requirements, and at worst, inhibited the success and acceptance of the overall project.

  4. Availability of applicable public schemas: considering ACS' product, process, and terminology, which of the publicly available schemas come the closest to meeting ACS' needs for the given product type (journals, books, magazine)?

Following are discussions of how we answered these questions for Journals, Books, and C&EN Magazine.

Journals: ACS Pubs' Use of NLM Journals Tagset

What we use

The ACS Pubs' Journals program began the project to implement an XML-based production process in 2005. ACS Pubs initiated this effort by engaging the services of a consulting firm that specialized in markup technologies and applications with a special emphasis on STM publishing. Through their leadership, a panel of ACS participants was assembled and led through facilitated discussions to determine current and future product, process, and terminology requirements. The ACS panel represented all areas of the journals program, including production, electronic delivery, sales and marketing, editorial office operations and peer review, IT, and product development.

At the time of that effort, the NLM journal tagsets were just starting to emerge as a de facto standard in modeling journal articles within the STM publishing community. This status, combined with a relatively close match in modeling the journal article constructs already used by ACS journals, made the NLM tagsets a natural top candidate. However, as the NLM Tagsets were primarily intended to be a common interchange format between publishers and archives, and not specifically intended to support the production of any journal or publisher (1, 7), direct "as-is" use of the NLM journal tagset or one of its schemas was unlikely without at least some level of customization.

The recommendation by this group was to develop an ACS-specific schema, built on an ACS tagset that was loosely based on version 2.3 of the NLM Journal Archiving and Interchange tagset. This recommendation reflected findings that ACS had non-trivial differences in both product-specific requirements and terminology (some of which is highlighted in later sections) when compared to the public NLM tagset. ACS accepted and proceeded to implement this recommendation, resulting in the ACS Journal tagset and schema (implemented as a modular DTD). While the NLM tagset served as the starting point for development of the ACS Journal tagset, the original tagset modules themselves were modified to implement the ACS-specific requirements.

The ACS Journal tagset and DTD are represented in the Customization Profile with a Customization Level of "Built From" and Implementation Method using "Modifications."

Table 4. Tagset/Schema Customization Profile for ACS Journals Production Tagset

                      Customization Implementation Methods
Customization Levels  Overrides  Mixed  Modifications
--------------------  ---------  -----  ----------------------
As-is
Extended
Reduced
Customized
Built from                              ACS Journal Production
Informed by

Current State & Maturity Level

The ACS Journal Tagset v1.0 was finalized in late 2006, with ACS Journal DTD v1.02 going into production in early 2007 reflecting minor adjustments discovered during early testing. This version has remained in production use through late 2010 with some incremental patch releases along the way (indicated as v1.02a – v1.02e). With few minor exceptions, each version and patch release has been backwards compatible with the prior version, meaning that previous XML content could still be validated successfully with the newest DTD version.

The v1.02 ACS Journal DTD was a monolithic schema, intended to be used for all internal ACS journal article XML operations from conversion and editing to composition and delivery. It was also the same version distributed to all XML consumers external to ACS Pubs, such as our Chemical Abstracts Service and NIH's PubMed Central. As our internal applications evolved and external consumer use of our XML increased, making updates to ACS Journal DTD required ever-increasing overhead in dealing with testing, communication, and logistics of handling joint deployments across all XML applications and consumers.

With our latest 1.03 version of our ACS Journal Tagset (rolling into production in late 2010), we sought to alleviate this logistical bottleneck by packaging several different schemas that all leverage a common ACS Journal v1.03 tagset:

  • v1.03 External/Interchange DTD: is functionally equivalent to the prior version, v1.02e. With the rollout of v1.03 tagset, we opted to only introduce the changes internally to ACS Pubs journal production; all external XML content feeds would continue to receive XML that is valid to the schema that those consumers already had.

  • v1.03 Production DTD: contains the updated functionality to serve internal journal production process requirements. This DTD is coded as a superset of the v1.03 External/Interchange DTD, meaning that all new features are developed as extensions to v1.03 External, and any content valid to v1.03 External is automatically valid to v1.03 Production. XML content intended for delivery to external consumers is transformed to be compatible with the v1.03 External DTD using XSLT scripts.

  • v1.03 Layout DTD: is a specialized superset of v1.03 Production DTD which expresses an additional set of vocabulary that is focused on page layout & composition processes. For example, additional tags are defined to indicate whether a given table should be allowed to spread across the width of an entire page or constrained to fit within a single text column – information that has no relevance outside of the context of page composition processing. XSLT scripts are used to automatically inject, and later remove, the specialized Layout tags when content is moving to and from the basic Production schema.
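The round trip between the Production and Layout schemas can be pictured with a small sketch. It is written in Python rather than the XSLT that ACS actually uses, and the attribute name `layout-span` and its values are hypothetical stand-ins for the real Layout vocabulary, which this paper does not list.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout-only attribute; the real Layout DTD vocabulary is not shown here.
LAYOUT_ATTR = "layout-span"


def inject_layout(root, wide_table_ids):
    """Mark each <table> as page-wide or column-wide (Production -> Layout)."""
    for table in root.iter("table"):
        span = "page" if table.get("id") in wide_table_ids else "column"
        table.set(LAYOUT_ATTR, span)
    return root


def strip_layout(root):
    """Remove the layout-only markup again (Layout -> Production)."""
    for elem in root.iter():
        elem.attrib.pop(LAYOUT_ATTR, None)
    return root


production_xml = '<document><body><table id="t1"/><table id="t2"/></body></document>'
doc = ET.fromstring(production_xml)
inject_layout(doc, wide_table_ids={"t2"})
layout_xml = ET.tostring(doc, encoding="unicode")
strip_layout(doc)
round_trip_xml = ET.tostring(doc, encoding="unicode")
```

The point of the sketch is that the layout markup is purely additive: removing it should leave content that is identical to the Production form it started from.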

The Production and Layout schemas are implemented as customizations to the base v1.03 tagset, defined at the "Extended" customization level and implemented using the "Override" method. The Production schema extends the base tagset, and the Layout schema further extends the Production schema. Because all customizations are implemented using the "Override" method, all three DTDs live happily within the same directory and share the same core modules. See the following figure for an architectural diagram of the current ACS Journal tagset and DTDs.

Fig. 2. Architecture of current ACS Journal DTDs sharing a common tagset.

It is our belief that this finer-grained approach of using distinct schema versions from one common tagset will allow ACS Pubs more flexibility to evolve its products and processes with a much lower cost related to development and deployment logistics.

Where it is Used

Looking at our newer three-tiered schema paradigm (External, Production, Page Layout), we categorize the use of our XML content as follows.

v1.03 Layout: page composition processes, by both internal processes and external vendors

v1.03 Production: internal journal production, including XML conversion vendors and Technical Editing.

v1.03 External: all consumers external to journal production, including

  • ACS Pubs web delivery system

  • ACS Mobile product

  • CAS (secondary publisher subsidiary of ACS)

  • NIH's PubMed Central

  • Portico

Highlights of Differences from Public NLM Tagset

In this section, we glance at some of the notable differences between the current v1.03 ACS Tagset in use within journal production and the public v3.0 NLM Tagset. Along the way, we identify some of the drivers behind ACS Pubs' decision to employ customizations at the "built from" level. This information is intended to be illustrative only, and is certainly not an exhaustive list of all changes.

Altered Basic High-Level Structure

Instead of using <article> as the root element, ACS opted for the more generic term of <document>.

The concept of "metadata" was already established internally, so this name took the place of NLM's <front> tag.

In addition to journal and document metadata sections, we had significant amounts of metadata related to the production process that we wished to keep within the document. A new <processing-meta> tag serves as a container for this type of information.

Table 5

NLM                 | ACS                   | Driver(s)
<article>           | <document>            | ACS terminology, ACS process
  <front>           |   <metadata>          |
    <journal-meta>  |     <journal-meta>    |
    <article-meta>  |     <document-meta>   |
                    |     <processing-meta> |
  <body>            |   <body>              |
  <back>            |   <back>              |
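To make the mapping in Table 5 concrete, the following Python sketch renames the top-level NLM containers to their ACS names and adds an empty <processing-meta>. It illustrates the correspondence only; the paper does not describe ACS converting content in this direction, and a real conversion would involve far more than renaming tags.

```python
import xml.etree.ElementTree as ET

# Tag renames taken from Table 5 (NLM name -> ACS name).
NLM_TO_ACS = {
    "article": "document",
    "front": "metadata",
    "article-meta": "document-meta",
}


def to_acs_skeleton(nlm_root):
    """Rename the top-level NLM containers and add an empty <processing-meta>."""
    for elem in nlm_root.iter():
        elem.tag = NLM_TO_ACS.get(elem.tag, elem.tag)
    metadata = nlm_root.find("metadata")
    if metadata is not None and metadata.find("processing-meta") is None:
        ET.SubElement(metadata, "processing-meta")
    return nlm_root


nlm = ET.fromstring(
    "<article><front><journal-meta/><article-meta/></front><body/><back/></article>"
)
acs = to_acs_skeleton(nlm)
acs_children = [child.tag for child in acs.find("metadata")]
```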

More Semantics Using ACS Terminology

ACS Pubs wanted to retain its existing production and product related terminology of "figure," "chart," and "scheme," so instead of using NLM's approach of one <fig> tag that can be further distinguished by using a @fig-type attribute, separate <fig>, <scheme>, and <chart> tags were defined. The differences go beyond mere naming, however. For example, per ACS journal style, a <fig> is displayed with its label and caption underneath the image, while a <chart> is displayed with its label and title above the image.

Whereas the NLM tagset requires an <article-title> to be present if <title-group> is used, some types of ACS articles only have a need for a title for use on the web so a <web-title> tag was added. While the original ACS tagset lacked a "subtitle," later versions copied NLM's <subtitle> tag but renamed it as <document-subtitle> for consistency within ACS' tagset.

ACS also has long-standing production and product terms that differ from the NLM tagset: "SI" (supporting information) to refer to supplementary material outside of the article, and "WEO" (web enhanced object) to refer to online-only media components of an article, such as an animation.

More examples of ACS-specific terminology captured as additional tags are listed below.

Table 6

NLM | ACS | Driver(s)
<fig> | <fig>, <chart>, <scheme> | ACS terminology, ACS product
<title-group> containing <article-title>, <subtitle> | <document-title> or <web-title>; <document-subtitle> | ACS terminology, ACS product
<abstract> with @abstract-type | <abstract>, <synopsis>, <dek> | ACS terminology, ACS product
<graphic> with @content-type | <abstract-graphic>, <toc-graphic>, <title-page-graphic>, <bio-pic> | ACS terminology, ACS product
<supplementary-material> | <si> ("supporting information") | ACS terminology, ACS product
<media> | <weo>, <toc-weo> | ACS terminology, ACS product
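The relationship between the dedicated ACS tags and NLM's attribute-based approach can be sketched as follows. This Python fragment collapses <chart> and <scheme> onto <fig> with a @fig-type attribute; the attribute values used ("chart", "scheme") are assumptions chosen for illustration, not values documented by ACS or NLM.

```python
import xml.etree.ElementTree as ET

# ACS-only display-object tags that collapse onto NLM's <fig>.
ACS_FIG_VARIANTS = {"chart", "scheme"}


def to_nlm_figs(root):
    """Rewrite <chart>/<scheme> as <fig fig-type="..."> in place."""
    for elem in root.iter():
        if elem.tag in ACS_FIG_VARIANTS:
            elem.set("fig-type", elem.tag)  # assumed attribute value
            elem.tag = "fig"
    return root


acs = ET.fromstring('<body><fig id="f1"/><chart id="c1"/><scheme id="s1"/></body>')
nlm = to_nlm_figs(acs)
fig_types = [(f.get("id"), f.get("fig-type")) for f in nlm.iter("fig")]
```

Keeping the original tag name in @fig-type means a consumer can still apply the style rule described above, such as placing a chart's label and title above the image.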

ACS Citation Tags

ACS reference styles also differed from the out-of-box NLM citation models. ACS retained NLM's general-purpose <citation> tag (which was renamed to <mixed-citation> in NLM's v3.0 tagset) for use with non-journal citations, such as book and patent citations, that tended to have variability in the types of bibliographic information supplied and styling intended by the author. This model allows (and expects) that any required punctuation and spacing is included between the semantic child tags. The XML consumer, such as a page composition engine or an XML-to-HTML translator, has little to do when rendering these citations beyond perhaps applying some styling on some of the semantic tags (such as styling the contents of a <year> as bold).

For more highly-structured journal citations, NLM's v2.3 tagset defined a structured citation model, the <nlm-citation> tag, which reflected NLM's own citation style at the time. This tag allowed only certain child tags (like <source>, <year>, etc.) within a pre-defined order. (Since then, NLM's v3.0 tagset has deprecated the <nlm-citation> in favor of a somewhat different tag, the <element-citation> which allowed a larger set of child tags but no longer enforced a particular order.) For both <nlm-citation> and <element-citation> tags, any loose, untagged text was disallowed, with the expectation that punctuation and spacing between elements were to be generated by the XML consumer.

ACS has three primary journal reference styles that we felt would benefit from a highly-structured tagging design similar to NLM's <nlm-citation> and <element-citation> tags. Since ACS was not seeking to adopt NLM's actual citation style that was inherent in the <nlm-citation> tag at the time, we decided to define structured citation models for each of our three ACS primary journal citation styles:

  1. acs-titles: our standard citation style that included article titles,

  2. acs-no-titles: an abbreviated form of our standard citation style that omitted titles,

  3. acs-biochem: a separate citation style for our biochemistry titles.

Some ACS Journals required the use of a particular citation style, such as "acs-no-titles", to save real estate on the page. Having dedicated citation tags for each of these citation styles assisted production staff in ensuring that the right citation style is used for each ACS journal.

An example of a tagged structured ACS citation follows below:

<acs-titles>
     <acs-cite-author>
          <surname>Bader</surname>
          <initials>M. W.</initials>
     </acs-cite-author>
     <acs-cite-author>
          <surname>Bardwell</surname>
          <initials>J. C. A.</initials>
     </acs-cite-author>
     <article-title>Catalysis of disulfide bond formation and isomerization in
          <genus-species>Escherichia coli</genus-species></article-title>
     <source>Adv. Prot. Chem.</source>
     <year>2002</year>
     <volume>59</volume>
     <fpage>283</fpage>
     <lpage>301</lpage>
</acs-titles>

As suggested by this sample, the ACS citation models are similar to the <nlm-citation> and <element-citation> models: loose text is disallowed, and all semantically identifiable components of the citation are individually tagged within their respective tag. Any spacing and punctuation required between elements for display is generated by the XML consumer/renderer.
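To show what "generated by the XML consumer" means in practice, here is a minimal Python sketch of a renderer for the sample citation. The punctuation pattern it produces is an assumption made for illustration and should not be read as the official ACS reference style; typographic styling such as italics and bold is left out entirely.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<acs-titles>
  <acs-cite-author><surname>Bader</surname><initials>M. W.</initials></acs-cite-author>
  <acs-cite-author><surname>Bardwell</surname><initials>J. C. A.</initials></acs-cite-author>
  <article-title>Catalysis of disulfide bond formation and isomerization in
    <genus-species>Escherichia coli</genus-species></article-title>
  <source>Adv. Prot. Chem.</source>
  <year>2002</year>
  <volume>59</volume>
  <fpage>283</fpage>
  <lpage>301</lpage>
</acs-titles>
"""


def text_of(elem):
    """Flatten an element's text, collapsing whitespace left by line breaks."""
    return " ".join("".join(elem.itertext()).split())


def render_citation(cite):
    """Generate the punctuation and spacing that the XML deliberately omits."""
    authors = "; ".join(
        f"{text_of(a.find('surname'))}, {text_of(a.find('initials'))}"
        for a in cite.findall("acs-cite-author")
    )
    title = text_of(cite.find("article-title"))
    pages = f"{text_of(cite.find('fpage'))}-{text_of(cite.find('lpage'))}"
    return (
        f"{authors} {title}. {text_of(cite.find('source'))} "
        f"{text_of(cite.find('year'))}, {text_of(cite.find('volume'))}, {pages}."
    )


rendered = render_citation(ET.fromstring(SAMPLE))
```

Because the punctuation lives in the renderer rather than in the content, a different output style needs only a different `render_citation`, with no change to the tagged citation.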

We saw several benefits to applying this structured, "data-oriented" approach to tagging journal references:

  • Lower production costs because editorial staff did not have to spend time getting each punctuation mark and space edited just right (as they do for non-journal citations using the <citation> model); they just focus on the who, what, where, and when of a citation.

  • More capabilities for future re-use; the citation information could be harvested and presented in different forms for future products.

  • Higher accuracy of reference linking to CAS ChemPort, CrossRef, PubMed, etc.

Table 7

NLM | ACS | Driver(s)
<citation> | <citation> | ACS product
<nlm-citation> (v2.3), <element-citation> (v3.0) | <acs-titles>, <acs-no-titles>, <acs-biochem> | ACS terminology, ACS product, ACS process

ACS Product and Domain Specific Features

A few features that are specific to ACS Journals and their chemistry sciences content require special tags in the ACS Journal Tagset. "Tie-bars," a special chemical notation occasionally used in some chemical expressions within text and titles, resemble horizontal square brackets that hang above or below the line of text and "tie" two specific characters together. Since this notation may or may not nest cleanly within the XML document's natural hierarchical structure, it is implemented as milestone tags, with one tag indicating where the tie-bar should start and another tag indicating where it should end.
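A consumer of milestone tags has to pair them up in document order rather than rely on element nesting. The Python sketch below shows one way to do that for <tie-bar-start/> and <tie-bar-end/>; the sample paragraph and the surrounding tags (<p>, <sub>, <i>) are invented for the illustration.

```python
import xml.etree.ElementTree as ET


def tie_bar_spans(root):
    """Return the text found between each <tie-bar-start/> and <tie-bar-end/>."""
    spans, current = [], None

    def visit(elem):
        nonlocal current
        if elem.tag == "tie-bar-start":
            current = []
        elif elem.tag == "tie-bar-end":
            if current is not None:
                spans.append("".join(current))
            current = None
        elif current is not None and elem.text:
            current.append(elem.text)
        for child in elem:
            visit(child)
            # Tail text follows the child's end tag, so it is read after the child.
            if current is not None and child.tail:
                current.append(child.tail)

    visit(root)
    return spans


sample = ET.fromstring(
    "<p>The <tie-bar-start/>A<sub>2</sub>B<tie-bar-end/> unit and "
    "<i>also <tie-bar-start/>C</i>D<tie-bar-end/>.</p>"
)
spans = tie_bar_spans(sample)
```

The second tie-bar in the sample starts inside <i> and ends outside it, which is exactly the case that a conventional wrapping element could not express.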

Due to ACS Pubs' focus on material from chemistry-related disciplines, additional tags to identify chemical names and processes were defined in our tagset. Because some potentially hazardous steps within experimental procedures may need to be specially highlighted or extracted, a <caution> tag is also available.

Finally, one type of ACS product is termed as a "living review." In this product, a previously-published review article is selected to be republished (as a new article) with new or updated information added. Unique to this product is that the new information is visually distinguished from the original text, in both the paginated and HTML products, by using special styling such as the color red. In addition, a need to summarize changes between versions and to provide a special mechanism to identify the original article resulted in a few new tags added to our tagset.

Table 8

NLM | ACS | Driver(s)
n/a | <tie-bar-start/>, <tie-bar-end/> | ACS product
n/a | <chemical-name>, <chemical-process>, <caution> | ACS product
n/a | <live-change> and related tags | ACS product

ACS Table and Math Extensions

We defined a few extensions for our use of the public MathML2 and CALS Table modules. For MathML, we have a need to indicate specific alignment points. These are needed both within single equations that break across lines, and for aligning multiple labeled equations in relation to one another. The MathML2 specification had no direct support for this, so we extended our instance of the MathML2 sub-schema to allow an <ACS:marker> tag within the equation markup, allowing alignment point information to be supplied to the page composition process. Additionally, we defined tags to enable tie bars in math, similar to those added for text content.

One of the first modifications that we made to our original ACS Journal Doctype was a set of extensions to the CALS table module within our tagset. Prior to XML, the Journal production team had a long-standing approach to handling tables by assigning different row types, which caused different behaviors within page composition. Another early extension was support for handling indentation behavior for columns and individual cells.

Table 9

NLM | ACS | Driver(s)
n/a | MathML 2 extensions: <ACS:marker>; <object-group> | ACS product
n/a | CALS Table extensions: @row-type={list of types to receive special handling}; @indent-left=amount + unit; @indent-left-style={full, first-line, hanging}; @spacing-before, @spacing-after | ACS process, ACS product, ACS terminology

Other non-compatible Changes

Some tags were seemingly carried over from the NLM Tagset to the ACS Tagset, but upon closer inspection, they behave very differently.

Table 10

NLM | ACS | Driver(s)
<name>: strict model with specified order | <name>: loose model with no order, untagged PCDATA allowed | ACS product
@content-type attributes almost everywhere, allowing any value | @content-type attributes defined in only a couple of specific places, allowing only pre-defined values for semantic markup | ACS process, ACS product, ACS terminology

NLM Tagset Features Not Implemented in ACS Tagset

The ACS Journal Tagset did not implement many tags within the public NLM tagset because there was no product or process need to "clutter" our implementation with them. A few examples of these are listed below.

Table 11

NLM | ACS
<sub-article>, <response> | n/a
<supplement>, <volume-series>, <issue-part>, etc. | n/a
<counts> | n/a

ACS' Content Interface with NLM-Based Web Delivery System

In 2008, ACS Pubs deployed a new web delivery platform which is based on Atypon's Literatum product. At its core, our web delivery platform utilizes a customized version of the NLM Archiving and Interchange Tagset, containing several types of extensions documented by Atypon. It is this "Delivery NLM" XML that is used as the source for both generation of the HTML article content pages and the metadata that drives other features of the delivery site. We thought that a look at the content interface between ACS Journal production and our delivery system – both relying on different customizations of the NLM journal tagset – could be informative to others.

NLM Journal Tagset Extensions within the ACS Pubs' Delivery System

The NLM Tagset in use by ACS Pubs' delivery system was derived from the standard NLM tagset, with customizations made to provide additional functionality and capabilities as required by the core delivery system. Because these customizations were implemented without breaking compatibility with XML content that is compliant with the public "native" NLM tagset, and were implemented by a combination of both overriding and directly modifying the tag definitions within the original public modules, we characterize these extensions within the Customization Profile at the "Extended" level and using a "Mixed" implementation method.

Table 12. Tagset/Schema Customization Profile for the Journal Tagset Used by the ACS Journal Delivery System

                     | Customization Implementation Methods
Customization Levels | Overrides | Mixed                       | Modifications
As-is                |           |                             |
Extended             |           | ACS Journal Delivery System |
Reduced              |           |                             |
Customized           |           |                             |
Built from           |           |                             | ACS Journal Production
Informed by          |           |                             |

Because the ACS Pubs' production journal schema and delivery platform schema are both based on the same root NLM tagset, one might conclude that journal production XML content could be readily absorbed by the delivery platform. In reality, this is only partially true. As pointed out in the prior section, the ACS Journal Tagset and Schema have non-trivial differences from the public NLM tagset, differences that are even more apparent when comparing to the customized version of the NLM tagset used by our delivery platform. ACS production XML is not functionally equivalent to the NLM XML required by our delivery system, so a conversion step was needed.

ACS Production to Literatum Journal Content Interface

We will briefly cover the overall process by which ACS journal XML content is converted and delivered to ACS Pubs' delivery platform. Not only is a translation required to mechanically convert ACS markup to equivalent NLM markup, the converted XML must additionally conform to the tagging conventions specifically required by ACS Pubs' delivery platform. Merely ensuring that the resulting XML document is "valid" against the delivery system's NLM schema is not sufficient to ensure that all system-required tagging constraints are met.
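The gap between "valid" and "acceptable to the delivery system" can be illustrated with a small convention checker. Both rules in this Python sketch are hypothetical examples of the kind of constraint a DTD cannot easily enforce; they are not taken from Atypon's or ACS' actual requirements, and the DOI shown is a placeholder.

```python
import xml.etree.ElementTree as ET


def check_conventions(root):
    """Return a list of convention violations; an empty list means all checks passed."""
    problems = []
    # Hypothetical rule 1: every figure needs an id so the site can link to it.
    for fig in root.iter("fig"):
        if not fig.get("id"):
            problems.append("fig without @id")
    # Hypothetical rule 2: an article must carry exactly one DOI.
    dois = [e for e in root.iter("article-id") if e.get("pub-id-type") == "doi"]
    if len(dois) != 1:
        problems.append(f"expected 1 DOI, found {len(dois)}")
    return problems


good = ET.fromstring(
    '<article><front><article-meta>'
    '<article-id pub-id-type="doi">10.0000/example</article-id>'
    '</article-meta></front><body><fig id="f1"/></body></article>'
)
bad = ET.fromstring(
    "<article><front><article-meta/></front><body><fig/></body></article>"
)
good_report = check_conventions(good)
bad_report = check_conventions(bad)
```

Both sample documents could be valid against a permissive schema, yet only the first would pass checks of this kind.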

Considering that this translation needs to occur as content is passed from the production system to the delivery system, an obvious question may be "in which system should this translation take place?" Should it be an export function from the journal production process? Or, should it be an import function within the delivery system? Our answer is "both."

Because Atypon has intimate knowledge of ACS Pubs' Literatum-based delivery system, and thus knows precisely what is required of the resultant NLM XML, a part of Atypon's original Literatum implementation engagement with ACS Pubs included writing an "ACS2NLM" lexer to convert ACS production journal XML into a form of NLM XML that is compliant with ACS Pubs' delivery system. An advantage of having this "import" lexer within the delivery system is that it isolates ACS Pubs production staff from needing to have specialized knowledge or training of the precise tagging needed to drive the inner workings of the Literatum system. This approach does have its costs as well: when ACS Pubs plans product or process changes that would manifest as changes in the ACS XML, we must factor in additional overhead for specifying, developing, deploying, and carefully regression testing any required changes of the "ACS2NLM" lexer.

On the side of the journal production workflow system, we developed a "content delivery normalization" export function that is applied to an instance of the ACS XML as it is being packaged for online delivery. Because ACS journal XML often contains non-trivial tagging variances between journals (to meet journal-specific needs of the paginated print/PDF products within production), these “content delivery normalization” edits within the production workflow system reduce the tagging permutations that would otherwise need to be handled by the delivery system's "ACS2NLM" lexer, thus preventing needless complexity within that translation process. This content normalization is applied only to a copy of the XML that is being packaged for delivery, and the original ACS XML containing the full set of product-specific tagging remains unaltered and safely stored away within the production CMS.
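The "normalize a copy, keep the original" pattern described above might look like the following Python sketch. The attribute `print-placement` is a hypothetical example of journal-specific print tagging; the actual normalization edits are not listed in this paper.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical print-only attribute that the online platform does not need.
PRINT_ONLY_ATTRS = {"print-placement"}


def normalize_for_delivery(production_root):
    """Return a normalized deep copy; the production tree is left untouched."""
    delivery_root = copy.deepcopy(production_root)
    for elem in delivery_root.iter():
        for name in PRINT_ONLY_ATTRS & set(elem.attrib):
            del elem.attrib[name]
    return delivery_root


production = ET.fromstring(
    '<document><body><fig id="f1" print-placement="top"/></body></document>'
)
delivery = normalize_for_delivery(production)
```

Working on a deep copy is what keeps the full product-specific tagging intact in the stored production XML.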

Books: ACS Pubs' Use of Book NLM Tagset

What we use

When ACS Pubs set out to implement an XML-first workflow for ACS Symposium Series Books, a few factors played into our analysis and selection:

  • Delivery System: We knew at the beginning of the project that we would be delivering HTML editions that leveraged ACS Pubs' investment in the Literatum-based delivery platform.

  • Composition: A project goal was identified to implement a highly-automated book page composition that used XML as its source.

  • Like Journals: We wanted to leverage as much of our journal production processes as applicable, saving development time and staff training.

  • Unlike Journals: At the same time, we knew that books had unique product characteristics of their own; it was highly unlikely that we would be able to simply "shoehorn" our book production into the very same journal XML production processes and tools.

  • Book vs. Chapter: We knew that we needed to perform some production editing and draft pagination at the chapter level, compose and paginate at the book level, and provide a combination of both book and chapter XML deliverables to our online delivery system.

  • Learned from Experience: We wanted to improve upon our experience of implementing XML-based journal production where applicable.

Because we knew from the start that our Literatum-based online delivery system supported the NLM Book Tagset, and our journals experience suggested that we should seek ways to minimize the amount of XML translation needed, the NLM Book Tagset was an obvious first candidate. Our Literatum delivery system supported an "extended" version of the NLM Book v2.3 Tagset, with customizations defined and documented by Atypon to fit functionality gaps required by the delivery system. A gap analysis between ACS Pubs' book requirements and our delivery system's extended NLM Book tagset revealed that this tagset was indeed a very close match, with many of the extensions meeting our production needs as well. While DocBook was also briefly considered, it would have required significant staff training as well as non-trivial development of transformations to convert from production DocBook XML to the delivery system's extended NLM XML.

Our delivery system's extended NLM Book tagset (and the specific tagging conventions required to drive the delivery system) still had two primary limitations when considering the tagset for our production use. The first is how books are linked to chapters, and the other is related to having the XHTML table model as the default table model. As a result, we made further customizations to an instance of Atypon's customized NLM Book tagset. Highlights of the changes that we made are described further below.

When comparing ACS Pubs' production book tagset to NLM's public book tagset, we characterize the full set of these extensions (whether implemented directly by ACS Pubs or within the tagset instance from our delivery system) using a Customization Profile with a "Customized" level and using a "Mixed" implementation method. (While almost all of the customizations could be characterized as "extensions," one notable change, replacing the XHTML table model with the OASIS table model, excludes the tagset from the "Extended" level.)

Table 13. Tagset/Schema Customization Profile for the ACS Books Production Schema

                     | Customization Implementation Methods
Customization Levels | Overrides | Mixed                              | Modifications
As-is                |           |                                    |
Extended             |           | ACS Journal & Book Delivery System |
Reduced              |           |                                    |
Customized           |           | ACS Book Production                |
Built from           |           |                                    | ACS Journal Production
Informed by          |           |                                    |

Where it is Used

The ACS Book DTD, built on our delivery system's customized version of the NLM Book Tagset, is used throughout production, starting with the initial XML conversion (occurring shortly after chapter authors submitted their work) and continuing through all aspects of book and chapter production, including copy editing and page composition. Once a book is ready for online delivery, we apply a rather simple XSLT to create the version of book and chapter XML that Literatum expects to see.

Current State & Maturity Level

Since the original production deployment of the ACS Book DTD in early 2009, no changes have been required.

Highlights of Differences from Public NLM Tagset

We will share several notable differences between the ACS Book DTD and the public NLM version. Two of them are extensions made by ACS Pubs, while the other changes were defined within Atypon's instance of the NLM Book DTD from which the ACS Book DTD was defined.

Addition of XInclude

In Atypon's model, a book XML contains book-level information (title, editors, ISBN, publication and copyright information, etc.) as well as a list of chapters listed by DOI number. The chapter XML files share the same schema as the book XML, and also use <book> as their root element.

In contrast, our use of the automated page composition solution (built on PTC's Arbortext Advanced Publishing Engine) required a more explicit method for linking chapter XML into a book XML when composing a full book. For this purpose, we implemented support for XInclude (8) within the ACS Book tagset. The XInclude linking mechanism allows production staff to work with the entire book – including all content from individual chapter XML files – as a single composite document.

We have found the use of XInclude to be a very natural, successful, and beneficial extension to the Book tagset, and would urge the NLM Tagset owners to consider inclusion of XInclude in a future version of the public NLM Book tagset.
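As a rough illustration of the mechanism, the Python sketch below resolves xi:include references in a book file using the standard library's ElementInclude module. The file names, titles, and the simplified book structure are invented for the example, and the chapters are served from an in-memory dictionary rather than from disk; the actual ACS workflow uses Arbortext, not this code.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

XI = "http://www.w3.org/2001/XInclude"

# In-memory stand-ins for chapter files; names and content are invented.
CHAPTERS = {
    "ch001.xml": "<book-part id='ch1'><title>Chapter One</title></book-part>",
    "ch002.xml": "<book-part id='ch2'><title>Chapter Two</title></book-part>",
}

BOOK = f"""
<book xmlns:xi="{XI}">
  <book-meta><book-title>Example Symposium Volume</book-title></book-meta>
  <body>
    <xi:include href="ch001.xml"/>
    <xi:include href="ch002.xml"/>
  </body>
</book>
"""


def load_chapter(href, parse, encoding=None):
    """Loader callback: return the parsed chapter for an xi:include href."""
    return ET.fromstring(CHAPTERS[href])


book = ET.fromstring(BOOK)
ElementInclude.include(book, loader=load_chapter)  # replaces each xi:include in place
chapter_titles = [t.text for t in book.iter("title")]
```

After the include step, the book tree contains the chapter content itself, which is what lets tools treat the whole book as one composite document.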

Use of OASIS Table Model

The public NLM Book tagset uses the XHTML Table model, although modules are included to enable the OASIS table model. Within Atypon's customized version of the Book tagset, both XHTML and OASIS table models are enabled, although the OASIS table tag names all contain an "oasis:" pseudo-namespace prefix to prevent tag clashes with XHTML tag counterparts. In the ACS version of the Book tagset, we dropped the XHTML table model, and dropped the "oasis:" pseudo-namespace prefixes from the OASIS table tag names. We made this change for three reasons:

  1. Our XML editing and page composition platforms worked with OASIS tables much better without the "oasis:" pseudo-namespace prefix.

  2. The OASIS model allowed greater control and flexibility in meeting ACS book product requirements as compared to the XHTML model.

  3. Production staff already had experience and training in using the OASIS table model from our journal tagset.
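Dropping the prefix amounts to a mechanical rename of the table elements. The Python sketch below shows the idea on a tiny prefixed table; the namespace URI bound to "oasis:" is an assumption made so that the sample parses, and should be checked against the actual DTD before being relied upon.

```python
import xml.etree.ElementTree as ET

# Namespace URI assumed for the "oasis:" prefix; verify against the actual DTD.
OASIS_NS = "http://docs.oasis-open.org/ns/oasis-exchange/table"


def drop_oasis_prefix(root):
    """Rename {OASIS_NS}name to plain name for every element in the tree."""
    marker = "{" + OASIS_NS + "}"
    for elem in root.iter():
        if isinstance(elem.tag, str) and elem.tag.startswith(marker):
            elem.tag = elem.tag[len(marker):]
    return root


prefixed = ET.fromstring(
    f'<table-wrap xmlns:oasis="{OASIS_NS}">'
    '<oasis:table><oasis:tgroup cols="1"><oasis:tbody><oasis:row>'
    "<oasis:entry>x</oasis:entry>"
    "</oasis:row></oasis:tbody></oasis:tgroup></oasis:table></table-wrap>"
)
plain_tags = [e.tag for e in drop_oasis_prefix(prefixed).iter()]
```

A rename like this is only safe when the XHTML table model is absent, as in the ACS Book tagset, because otherwise unprefixed names such as "table" would clash with their XHTML counterparts.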

Addition of DocBook <index> Model (Atypon)

One notable gap in the public NLM Book tagset is the lack of dedicated tags to facilitate the creation of an index. ACS Symposium Series Books feature a subject index in the back of each book containing index entries that list the page numbers for pertinent terms, figures, and tables. The Atypon version of the NLM Book tagset contains integrated support for a simplified version of DocBook's <index> tag model, and we found this tagging met our needs for generating the subject index section for each book.

Addition of <book-series-meta> (Atypon)

Another gap from the public NLM Book tagset was support for identifying a series to which a book belongs. For ACS Pubs, the Symposium Series to which the books belong is a distinct publishing identity, much like a journal. Similar to a journal (and the <journal-meta> information that accompanies most journal articles), a book series has a set of metadata attached to it, such as an ISSN. We found that Atypon's implementation of the <book-series-meta> fit ACS Pubs' needs perfectly.

Magazine: ACS Pubs' Use of NLM Tagsets For C&EN

What we use

In 2010, the production team for our C&EN magazine began the process of implementing a schema specially tailored to their production and electronic delivery processes. This schema is based on the ACS Journal tagset with extensive customizations made to further meet magazine publishing requirements.

Unlike the Journals and Books, the driving goal was not to implement an "XML first" process in which the XML served as the common content format within C&EN production editing and page design activities. Indeed, we determined that introducing XML during these workflow stages would have forced a disruptive change in the production tools and processes while offering little tangible production benefit in return. Instead, the primary goal for the use of XML with C&EN was two-fold:

  1. Ability to store a "content of record" version of article content that is independent of any particular production application format or technology, thus allowing for future reuse of this content.

  2. To serve as technology-neutral "content interchange format" to facilitate automated content delivery, such as to a web delivery platform or external syndication.

The choice to use a customized version of the ACS Journal tagset to implement a schema occurred only after a careful evaluation of other public schemas. An emphasis was placed on the ability of the tagset to retain C&EN-specific semantics from a product perspective. The ability to use tag names consistent with C&EN-specific terminology was a plus but not a driving requirement. With this in mind, some of the schemas that we considered are listed below.

  • DITA For Publishers, a publishing-oriented application of the DITA framework

  • EPUB, a set of electronic content interchange and distribution standards that are increasingly used for eBook applications and devices.

  • PRISM, PAM (Publishing Requirements for Industry Standard Metadata, PRISM Aggregator Message) from IDEAlliance

  • NewsML 1, NewsML-G2, and NITF from IPTC (International Press Telecommunications Council)

  • DocBook from OASIS

  • A customized instance of ACS Journal tagset (built from NLM tagset)

Of these schemas, "DITA For Publishers" and a customized ACS Journal tagset were deemed to be the top contenders. While the remaining choices had wide adoption within the news and magazine communities, they were primarily intended for interchange of metadata and formatted content, with little support for capturing content semantics (without further customization) needed for reuse and archiving. While we felt that we could meet our objectives using either "DITA For Publishers" or a customized ACS Journal tagset, the latter was selected because it offered a few advantages:

  • it already offered many existing tag names that referenced terminology already familiar to C&EN staff,

  • it already had support for many C&EN product-specific content features – features that were previously defined in the ACS Journal tagset to handle "magazine-like" front-matter content published in some ACS journals;

  • it was already familiar to the team who was responsible for supporting ACS Pubs' various schemas and XML implementations, resulting in a lower learning curve when implementing the needed customizations.

When comparing the C&EN magazine schema to the NLM journal tagset on which the source ACS Journal tagset was based, we characterize the set of these extensions with a Customization Profile at a "Built From" level and using a "Mixed" implementation method.

Table 14. Tagset/Schema Customization Profile for the C&EN Magazine Schema

                     | Customization Implementation Methods
Customization Levels | Overrides | Mixed                              | Modifications
As-is                |           |                                    |
Extended             |           | ACS Journal & Book Delivery System |
Reduced              |           |                                    |
Customized           |           | ACS Book Production                |
Built from           |           | ACS C&EN Magazine                  | ACS Journal Production
Informed by          |           |                                    |

Where it is Used

As previously indicated, the schema is not targeted to be used directly within the production editorial and page design processes. Instead, finalized content will be "exported" from the production system as XML using this schema where it will be leveraged as an intermediary format for further electronic distribution. It is also at this point that the XML content, and related content components like images, media files, etc. are stored as a "content of record" package within the CMS.

Current State & Maturity Level

The C&EN magazine schema and tagset have been developed and are planned for implementation in 2011. We expect that some fine-tuning of the schema will be needed as integration and deployment testing proceed.

Highlights of Magazine-Specific Features Added to the Tagset

An entire paper could be written that explores the full set of tagset features required for capturing magazine content semantics in sufficient detail for reuse. For the purposes of this paper, we will describe a few of the notable magazine-specific features that we added to the tagset. As with the Journal and Book tagset definitions, these customizations were driven by product, process, and ACS C&EN terminology requirements, although product capabilities received the primary focus.

Digital Assets

Within C&EN magazine production, there is less of a need to semantically identify the type of non-narrative content such as tables, images, media, etc. as compared to journals and books. Some non-narrative content can fall into more than one of these content type classifications. As a result, C&EN collectively refers to these non-narrative content types as "digital assets."

Within C&EN's tagset, we removed the existing tags for tables, figures, graphics, equations, etc. In their place, we defined a single "digital asset" tag that allows multiple types of content. Some digital assets can have more than one content type, such as a table that provides table markup to drive HTML, an image for the printed page, and a spreadsheet file for further use by the online reader.

Content modularity

Because magazine content often has a looser concept of article boundaries and a more flexible content structure (see our earlier introduction to the C&EN Magazine product), we introduced XInclude to allow more modularity in how individual components are produced and assembled for publication. While the Books production schema introduced XInclude at one level (i.e., from the book to its constituent chapter and index content sections), our magazine tagset implements XInclude to allow the following modularity:

  • Articles can contain other articles. This is implemented as a recursive "<article> allows <article>" definition, allowing for any type of sub-article arrangement required by magazine production.

  • Any type of digital asset (table, image, media file, etc.) and its related metadata can be linked to an article or sub-article. This is different from merely using an @href attribute to reference a given content file: a separate XML file for the digital asset containing the object's title, caption, credits, etc. encapsulates all information about the digital asset. This digital asset XML file also includes the digital asset's specific content tagging (in the case of tables) and/or reference(s) to the external constituent media file(s).
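As a minimal sketch of this modularity, the following uses the XInclude support in Python's standard library (xml.etree.ElementInclude) to assemble a story from separately maintained component files. The element names (<article>, <digital-asset>, <title>, <caption>, <credit>, <media>), the file names, and the in-memory FILES store are hypothetical stand-ins chosen for illustration; they are not the actual C&EN tagset or storage layout.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical in-memory "file store"; a real system would read from disk or a CMS.
FILES = {
    "story.xml": (
        '<article xmlns:xi="http://www.w3.org/2001/XInclude">'
        "<title>Cover story</title>"
        '<xi:include href="sidebar.xml"/>'        # an article within an article
        '<xi:include href="asset-photo1.xml"/>'   # a digital asset with its own metadata
        "</article>"
    ),
    "sidebar.xml": "<article><title>Sidebar</title></article>",
    "asset-photo1.xml": (
        "<digital-asset>"
        "<title>Reactor photo</title>"
        "<caption>A pilot-scale reactor.</caption>"
        "<credit>Staff photographer</credit>"
        '<media href="photo1.jpg"/>'
        "</digital-asset>"
    ),
}


def load(href, parse, encoding=None):
    """Loader callback for ElementInclude: resolve href against FILES."""
    if parse != "xml":
        raise ValueError("this sketch only handles parse='xml'")
    return ET.fromstring(FILES[href])


def assemble(name):
    """Return the composite document with each xi:include expanded."""
    root = ET.fromstring(FILES[name])
    ElementInclude.include(root, loader=load)
    return root


story = assemble("story.xml")
print([child.tag for child in story])
```

Each component file remains a complete XML document in its own right, so the sidebar or the digital asset can be processed alone or as part of the assembled story.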

Pagination

C&EN's print issues, like those of many magazines, do not always paginate their articles in strict sequence, which requires a flexible approach to capturing page numbers. The page number tagging model was expanded to allow for ganged (one article starts on the same page that another article ends), continued, and sub-article issue layouts.

Ads

The financial model of most magazines usually includes at least some level of advertising content. Tagset features were added to facilitate the capture and management of ads. The model for an ad is based in part on a simplified form of a sub-article model with some additional ad-specific metadata tags defined.

Magazine-specific features

Some other specific content features are regularly used by C&EN magazine content, requiring specialized tags to capture them. A few of these are listed by way of illustration:

  1. dek: similar to a subtitle or synopsis, a dek is a short caption of the article displayed above or below the title of an article or digital asset with the intent to draw the reader's interest. See the figure of a magazine page below for examples.

  2. eyebrow: C&EN uses the term "eyebrow" to capture a word or short phrase to give a snapshot context of the article to the reader. See the figure of a magazine page below for an example.

  3. pull quotes: used widely by all types of news and magazine publications, a pull-quote is a short passage extracted and highlighted from the article's narrative with the intent to capture the reader's interest. The passage may be extracted verbatim or it may be edited or paraphrased. See the figure illustrating a pull quote below.

Fig. 3. C&EN Magazine Page Example. A page from C&EN showing various content features.

Fig. 4. Pull Quote Example. A pull-quote can occur anywhere in text.

Flexible Content Categorization Model

One requirement is to allow multiple types and multiple layers of categorization to be applied to any given article. The print issue's table of contents represents one type of categorization, e.g., grouping "business" related content under one section while "technology" related articles are listed within another section. RSS and syndication feeds are often organized by topic, such as "green chemistry" or "CO2." Some news stories may be related to a recently published ACS Journal article, leading to a desire to associate the magazine story with a particular ACS journal. The approach that we ended up with was a categorization structure using a recursive model, allowing any number of category types and values to be assigned. The tag model was defined to allow many types of metadata regarding a given category: its type, its "display names" as intended for presentation, an internal or "code" value, its source system or taxonomy, etc.
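To illustrate the recursive categorization idea, here is a small sketch that reads nested categories and groups them by type. The <category> and <display-name> elements, the @type, @code, and @source attributes, and all sample values are hypothetical names invented for this example; the actual C&EN tag names are not given in this paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: each category carries a type, an internal code,
# a source taxonomy, and a display name, and may nest further categories.
SAMPLE = """
<article>
  <category type="toc-section" code="BUS" source="print-issue">
    <display-name>Business</display-name>
  </category>
  <category type="topic" code="T100" source="topic-taxonomy">
    <display-name>Sustainability</display-name>
    <category type="topic" code="T110" source="topic-taxonomy">
      <display-name>Green Chemistry</display-name>
    </category>
  </category>
  <category type="related-journal" code="jrnl-x" source="journal-list">
    <display-name>Example Journal</display-name>
  </category>
</article>
"""


def walk(element, path=()):
    """Yield (type, code, source, display-name path) for each nested category."""
    for cat in element.findall("category"):
        here = path + (cat.findtext("display-name", default=""),)
        yield cat.get("type"), cat.get("code"), cat.get("source"), " / ".join(here)
        yield from walk(cat, here)


def categories_by_type(article):
    """Group (code, label) pairs by category type, in document order."""
    grouped = {}
    for ctype, code, _source, label in walk(article):
        grouped.setdefault(ctype, []).append((code, label))
    return grouped


article = ET.fromstring(SAMPLE)
for ctype, entries in categories_by_type(article).items():
    print(ctype, entries)
```

Because the model is recursive, a table-of-contents grouping, a topic hierarchy, and a related-journal association can all sit on the same article without any of them needing a dedicated tag.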

Summary: The ACS Tagset Inheritance and Interchange Map

Fig. 5. This map illustrates which tagsets/schemas are derived from which tagsets, and how XML content is interchanged between them at ACS Pubs.

Successes & Lessons Learned

In addition to sharing how ACS Pubs has customized and applied NLM-based tagsets, we want to share some lessons that we have taken away from experiences of implementing these XML-based processes.

DTD-Related Experiences

Busting the NLM DTD "compatibility" myth

We found a common misconception, both within our organization and within the larger STM publishing community: that XML is either compatible with the NLM DTD or it is not. Instead, as we have tried to illustrate here, reality is more flexible: there are multiple levels of customization and compatibility that are possible. Because the NLM DTD cannot be all things to all organizations and their respective products and processes, customization should be expected.

Additionally, there is no "standard" way to tag an "NLM XML" content file. Due to the inherent model flexibility within the NLM tagsets, we find that organizations or systems that leverage the public NLM tagsets will often, out of necessity, enforce additional specific product, process, or system tagging requirements that go beyond the rules encoded within NLM's schema. The "NLM XML" produced by one organization or system may fail to meet the tagging conventions required by another organization or system – without some type of translation process between them.

The view that the Journal Archiving and Interchange tagset provides the basics for encoding STM content seems most appropriate to us. It promotes effective exchange of information between organizations while still being designed to allow many different styles of tagging.

Moving from monolithic one-size-fits-all schemas to specialized schemas

As we described in the section titled "Journals: ACS Pubs' Use of NLM Journals Tagset," we originally had a single DTD to handle production, print composition, web delivery, content interchange with external parties, and content archival. Not only did we struggle with implementing support for increasingly widely varied and sometimes conflicting requirements, but the logistics of testing, coordinating, and distributing updates to all parties and systems also grew increasingly challenging.

From this experience, we offer this advice: watch for warning signs that you are over-extending your tagset/schema. One shared tagset can be designed with shared modules to support multiple application- or system-specific schemas that want to share a common vocabulary. This allows specific schemas to evolve without needlessly impacting other processes. As an example, our implementation of three "flavors" of our most recent v1.03 Journal schema ("External," "Production," and "Layout") follows this approach. It should be noted that development of translation steps (e.g., XSLT scripts) is needed with this approach to transform XML content from one schema to another.
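The translation step mentioned above is described in terms of XSLT scripts. Purely as an illustration of the idea, the following Python sketch derives a hypothetical "External" flavor from a "Production" flavor by removing production-only elements while keeping the surrounding text. The tag names <production-note> and <layout-hint> are invented for this example and are not taken from the ACS schemas.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical production-only tags that the "External" flavor does not allow.
PRODUCTION_ONLY = {"production-note", "layout-hint"}


def _drop(parent, child):
    """Remove child from parent but keep the text that follows it (its tail)."""
    index = list(parent).index(child)
    if child.tail:
        if index == 0:
            parent.text = (parent.text or "") + child.tail
        else:
            previous = parent[index - 1]
            previous.tail = (previous.tail or "") + child.tail
    parent.remove(child)


def to_external(production_root):
    """Return a copy of the document with production-only elements removed."""
    root = copy.deepcopy(production_root)
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in PRODUCTION_ONLY:
                _drop(parent, child)
    return root


PRODUCTION = ET.fromstring(
    "<article><title>Sample</title>"
    "<p>Text <production-note>check units</production-note>continues.</p>"
    "<layout-hint float='top'/></article>"
)

external = to_external(PRODUCTION)
print(ET.tostring(external, encoding="unicode"))
```

Keeping the translation in a separate step like this means the production flavor can gain or lose internal-only tags without any change to what external recipients see.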

Use of XInclude for Books & Magazine

The addition of XInclude to our Book and Magazine tagsets has proved to be a very positive and useful enhancement for ACS Pubs. Allowing the same content to be processed as either smaller stand-alone chunks or within the context of larger composite documents, without the need for intervening translations, has provided tangible benefits and flexibility to both our staff and our XML applications.

Beyond the DTD

Three-part packaging for defining XML tagging requirements

We have found that specifying tagging requirements by using only a DTD is insufficient. Instead, using a package consisting of three interdependent deliverables provides the highest level of success:

  1. The schema (a.k.a. DTD) itself.

  2. Documented tagging conventions, with validation tools & services as needed to enforce the conventions.

  3. Complete XML samples that are valid to both the schema and the documented tagging conventions.

Providing just the schema or DTD itself to another group or organization, with no additional guidelines, provides an incomplete picture regarding how to create or use the required tagging. Documented tagging conventions will articulate how to supply or interpret tagging that will meet product or process related requirements. Also, because written documentation can be incomplete or misinterpreted despite the best efforts of the author(s), sample XML files can help fill in any gaps in comprehending the requirements.

In addition to more clearly specifying our tagging to external consumers, we also found benefits internally during the development process. The mere process of creating all three deliverables reinforced a comprehensive specification of our tagging. Creating the "convention documentation" often revealed gaps within the schema that needed to be further refined, while creating XML samples often revealed gaps in the tagging conventions.

When we are asked for "a copy of our DTD," we do not supply just the DTD itself because this only tells part of the story; we supply the complete package of DTD, conventions, and samples. As a rule, we consider that development of an XML schema is not complete until the conventions and samples are also finalized.

"XML as a product" mentality

One significant value of an XML-based content production process is the ability to reuse content for future products or applications. Fully enabling this, however, requires a mindset that considers the XML content itself as an internal product in its own right. Without reinforcing this philosophy, an XML application can devolve into mere inter-process content glue, meeting only the needs of the immediate systems and processes with little regard to how the XML content could be constructed for higher degrees of reuse.

Within a busy production environment, where the primary staff focus can be to "just get it published," we found that staff occasionally took shortcuts by setting up informal tagging conventions to accommodate new product features. These informal tagging conventions sometimes caused problems further downstream within our web delivery platform or prevented us from cleanly re-using the content for other purposes. We addressed this within the Journal and Book production groups by forming a "Tag Team" that is charged with both establishing XML tagging conventions and participating in the new product development process to ensure that any new tagging requirements are carefully evaluated with all current and future XML uses in mind.

ACS Pubs' "Validations service" for Journals

In addition to the ACS Journal production DTD, some of the additional tagging conventions that we defined are critical for successful operation of our production systems. For these critical conventions, we developed a "validations service" that is used throughout the production workflow. Some of these business rules, such as ensuring that the tagged volume and issue numbers are valid for the tagged journal ID, are requirements that a DTD could not enforce. In addition, some validations enforce more basic tagging conventions, such as ensuring that the number of table cells in a row matches the number of table columns; again, this is not something that the DTD alone could enforce.

We had evaluated several public validation standards and frameworks, most notably Schematron, but none offered one thing that we needed: validation of the XML using external data sources, such as production schedule information defined within our workflow system. As a result, we built our own validation framework, using Groovy (9). This framework allows full programmatic expression of individual validations and use of existing Java application classes within our production workflow system. The validation system is hosted as a web service, allowing it to be called from many different XML processing clients: ACS staff working within Arbortext Editor, our XML conversion and composition vendors (before they return an XML file back to us), and the production workflow system itself at various checkpoints. It has been in production use by ACS Pubs with minor updates since 2007.
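The two kinds of rules described above can be sketched as follows. This is an illustration written in Python rather than Groovy, and it is not the ACS framework: the element names, the simplified table model (which ignores spanning cells), and the SCHEDULE dictionary standing in for an external workflow system are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for schedule data held in an external workflow system:
# (journal id, volume) -> set of valid issue numbers.
SCHEDULE = {("jx", "12"): {"1", "2", "3"}}


def check_issue(root, schedule):
    """Business rule: the tagged volume/issue must exist for the tagged journal."""
    key = (root.findtext("journal-id"), root.findtext("volume"))
    issue = root.findtext("issue")
    if issue not in schedule.get(key, set()):
        return [f"issue {issue} is not scheduled for journal {key[0]}, volume {key[1]}"]
    return []


def check_table_rows(root):
    """Convention: every row has as many cells as the table has columns."""
    errors = []
    for t_index, table in enumerate(root.iter("table"), start=1):
        columns = len(table.findall("colgroup/col"))
        for r_index, row in enumerate(table.iter("tr"), start=1):
            cells = len(row.findall("td")) + len(row.findall("th"))
            if cells != columns:
                errors.append(
                    f"table {t_index}, row {r_index}: {cells} cells, expected {columns}"
                )
    return errors


def validate(xml_text, schedule=SCHEDULE):
    """Run all checks and return a list of human-readable error messages."""
    root = ET.fromstring(xml_text)
    return check_issue(root, schedule) + check_table_rows(root)


SAMPLE = """<article>
  <journal-id>jx</journal-id><volume>12</volume><issue>7</issue>
  <table><colgroup><col/><col/></colgroup>
    <tr><th>A</th><th>B</th></tr>
    <tr><td>1</td></tr>
  </table>
</article>"""

for message in validate(SAMPLE):
    print(message)
```

The first check depends on data that lives outside the XML file, which is the capability we found missing from the public frameworks we evaluated; the second depends only on the document but still cannot be expressed in a DTD.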

Future direction: More Semantics Require More Tagsets or Extensions

After describing what we have done, we will take a moment to describe where we think we are headed. While the concept of a Semantic Web or Web 2.0 may no longer be top news, we believe that semantic content enrichment still represents a new frontier for STM publishers, and that momentum for exploration is rapidly growing. We expect that a renewed focus on content interchange, with a special emphasis on capturing and exchanging higher degrees of semantic markup, will emerge both within and between organizations, and that this focus will spur additional tagsets or tagset extensions. Any "content enrichment" services and systems that emerge will likely be active participants in – and driving forces behind – defining these additional semantic tagsets or extensions.

While the NLM Tagsets already provide a very high capability for capturing semantics, much of this is generic machinery (for example, using a general purpose <named-content> element or a @content-type attribute), and the actual expression of the semantics is left up to the XML application. This does not facilitate interchange between applications and services unless additional semantic conventions are also developed and shared outside of the tagsets. We expect that a need for greater exchange of common sets of semantic information will drive further extensions to the NLM and other publishing-related tagsets.
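The interchange problem can be made concrete with a short sketch. Two producers both use <named-content> with @content-type, but with different local values, so a consumer needs a shared convention to interpret them. The content-type values and the SHARED_VOCABULARY mapping below are hypothetical examples, not an existing standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical shared convention: map each producer's local content-type
# values onto one agreed vocabulary.
SHARED_VOCABULARY = {
    "chem-name": "chemical-name",
    "compound": "chemical-name",
    "gene": "gene-name",
}

PRODUCER_A = (
    '<p>Nitration of <named-content content-type="chem-name">benzene</named-content>'
    " was studied.</p>"
)
PRODUCER_B = (
    '<p>We used <named-content content-type="compound">toluene</named-content> of '
    '<named-content content-type="reagent-grade">ACS grade</named-content>.</p>'
)


def extract_semantics(xml_text, vocabulary=SHARED_VOCABULARY):
    """Return (recognized, unmapped) lists of (type, text) pairs."""
    root = ET.fromstring(xml_text)
    recognized, unmapped = [], []
    for item in root.iter("named-content"):
        local_type = item.get("content-type")
        text = "".join(item.itertext())
        if local_type in vocabulary:
            recognized.append((vocabulary[local_type], text))
        else:
            unmapped.append((local_type, text))
    return recognized, unmapped


for sample in (PRODUCER_A, PRODUCER_B):
    print(extract_semantics(sample))
```

Both files are valid against the same tagset, yet the second one carries a content-type value that the consumer cannot interpret without an agreement made outside the schema.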

Acknowledgments

We especially wish to thank the following folks for their assistance with this paper: Anne O'Melia, Jim King, Dave Levy, Carolyn O'Brien, and Madi Nassiri. We are also deeply indebted to the peer-reviewers of the paper's outline who provided fantastic input and feedback which helped to shape the final paper.

References

1. See http://dtd.nlm.nih.gov/
2. ScholarOne is a subsidiary of Thomson Reuters, http://scholarone.com/about/
3. As part of an ACS-funded enhancement to the ManuscriptCentral system, ScholarOne built an automatic content and metadata delivery system based on FTP and Web Services technologies.
4. Some ACS journals allow author manuscripts to be posted online immediately after acceptance; these "JAM" (Just Accepted Manuscripts) would represent the first instance of publication for these articles. A fully edited and composed "ASAP" version of the paper will eventually replace the JAM version.
5. For specifics of how the NLM Tagset authors describe their approach to tagset modularity, see http://dtd.nlm.nih.gov/#id49746 and http://dtd.nlm.nih.gov/#custom.
6. For specifics of how the NLM Tagset authors describe customization and extensibility options for the tagsets, see http://dtd.nlm.nih.gov/#custom.
7. Kasdorf, Bill. "The Benefits to be gained from the New DTD standard", presented at the Association of Learned and Professional Society Publishers (ALPSP) Technical Update: “A Standard XML Document Format: The case for the adoption of NLM DTD?”, December 3, 2007, London. http://www.alpsp.org/ForceDownload.asp?id=606.
8. The W3C XInclude specification allows multiple separate XML files to be linked and then processed as a single composite XML document. For more information, see http://www.w3.org/TR/xinclude/.
9. For more information on Groovy, see http://groovy.codehaus.org/

Customization Level Examples

In this section, we provide samples of the various Customization Levels (except "Informed By") using extracts of a schema expressed in DTD syntax.

Fig. 6. As-is. A tagset extract with no changes, showing a level of "As-is."

Fig. 7. Extended. A tagset extract showing customizations at a level of "Extended."

Fig. 8. Reduced. A tagset extract showing customizations at a level of "Reduced."

Fig. 9. Customized. A tagset extract showing customizations at a level of "Customized."

Fig. 10. Built From. A tagset extract showing customizations at a level of "Built From."

Terms & Definitions

ACS

American Chemical Society

ACS Pubs

The Publications Division of the American Chemical Society.

ASAPs

"As Soon As Publishable"; ACS Pubs' term for the initial online publication of a fully-edited article before it appears within an issue.

C&EN

"Chemical and Engineering News," ACS' weekly magazine

CIP data

"Cataloging In Publication"; information regarding the U.S. Library of Congress' registration of a book, usually displayed on a book's "copyright page" after its title page.

CMS

"Content Management System."

DTD

"Document Type Definition," a syntax for expressing an XML schema.

JAMs

"Just Accepted Manuscripts"; ACS Pubs' term for an initial publication of the author's originally submitted manuscript, before any editing or other production steps.

ManuscriptCentral

A service application from ScholarOne that facilitates the submission and peer-review processes of journal articles.

MARC records

"MAchine Readable Cataloging," a format developed by the US Library of Congress for exchanging bibliographic information, primarily for library cataloging systems.

Model

The part of a tag definition that governs what a tag may contain. E.g., some tags may only contain other specific tags, while some tags allow text.

Module

A container holding one or more tag definitions. Often related tag definitions are grouped into one module, then modules are linked together to form tagsets and schemas.

NIH

United States "National Institutes of Health."

NLM

"National Library of Medicine," part of the NIH.

Schema

A set of tag definitions that are linked together to form a set of tagging rules for a specific type of XML document.

ScholarOne

A Thomson Reuters business that produces systems and solutions for scholarly publishing, including ManuscriptCentral.

STM

"Scientific, Technical, and Medical" fields.

Tag

A bit of markup to identify something about a document or a piece of a document. For the purposes of this paper, we refer to XML elements, attributes, etc. as "tags."

Tagset

A related set of tags that, taken together, form a useful vocabulary for identifying components of a certain class of documents.

Technical Editing

Somewhat similar to copyediting, technical editing involves staff editors with scientific degrees applicable to the content on which they work, allowing editing within a context of retaining or clarifying the scientific intent of the paper.

XML

Extensible Markup Language, a format syntax from the W3C. http://www.w3.org/XML/

XML-First Production

An approach to content production in which content is tagged in XML as soon as possible in the workflow, usually before the copy editing step.

XSD

"XML Schema Definition," a W3C standard syntax for expressing an XML schema.

a Assuming that the new tags are designated as “optional,” and not “required,” within the customized schema.

b Assuming that the removed tags were designated as “optional,” and not “required,” within the public schema.

© 2010 American Chemical Society.

The copyright holder grants the U.S. National Library of Medicine permission to archive and post a copy of this paper on the Journal Article Tag Suite Conference proceedings website.
