NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Journal Article Tag Suite Conference (JATS-Con) Proceedings 2011 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2011.

Bookshelf ID: NBK62090

PMC Tagging Guidelines: A case study in normalization

Abigail Elbow, Breena Krick, and Laura Kelly.

NCBI, NLM

In 2003, when the NLM Tag Suite was being introduced, PMC was developing a new production system that would use XML data tagged in the NLM DTD. Since the DTD was new, there were no modeled examples and we were constantly asking ourselves how things should be tagged. We realized we needed more consistent tagging than the DTD required—we needed our own subset. The decisions we made while developing the current PMC software became the foundation of what is now the PMC Tagging Guidelines—a set of rules that describes the normalized XML used in PMC.

This paper looks at how and why PMC developed the Guidelines and how they changed over the years, and reviews the current implementation of the Guidelines and the associated tools.

PMC Overview

PubMed Central (PMC) is a free archive of life science journal literature at the U.S. National Library of Medicine (NLM). Content in the archive is supplied by journal publishers in either XML or SGML. Currently PMC accepts data submission in approximately 35 different schemas. Once received, the data is run through PMC's processing, which includes a transform into XML that is valid against the JATS Journal Archiving and Interchange (Green) DTD. The resulting XML is referred to as NLM XML, or NXML.

The NXML is archived in the PMC database and delivered to end users primarily through web browsers. Because the conversion from NXML to HTML is done on-the-fly and involves no manual work by PMC staff, the quality of the output is very heavily dependent on the quality of the archived NXML. The best way to guarantee quality output delivered to the users, then, is to ensure that quality NXML is archived.

The PMC Tagging Guidelines describe PMC's preferred style for XML documents, and the PMC Style Checker reviews XML files for compliance with those style rules. These two tools are fundamental to how PMC enforces the quality of the archived data.

History of the Style Rules

Development of the PMC Tagging Guidelines began in 2003 when the group started to archive data in the JATS format. Since the JATS (then the NLM DTD) had just been released earlier that year, virtually none of the data submitted to PMC was tagged in the JATS, so PMC developed XSL transforms to convert publisher-supplied data into the JATS.

Like many others who have made the switch to the JATS from another data format, when writing the XSL transforms we were faced with the reality that there are sometimes varying ways to tag the same or very similar structures. Because our output for end users needs to be produced in a "lights-out" process, we made the decision to normalize some of the structures in our NXML.

At first these normalization rules were very limited. We focused on a few pieces of article metadata, figures, and tables. These structures were the ones we'd seen with the least consistent tagging—both across data suppliers and sometimes within a single supplier's submissions. Our decisions on how to normalize data often came from situations we were troubleshooting during testing of the new production system. This meant that our first style rules were documented as notes on scraps of paper or whiteboards—whatever was handy when the situation arose.

Once we collected the style rules, we needed a way to enforce them. We chose to write the tests as a RelaxNG schema that also included the JATS validation rules. For the first implementation with the straightforward style rules, the RNG was successful. As the style rules increased in both number and complexity, however, the RNG was no longer able to accommodate the rules.

The biggest limitation of the RNG implementation was that non-compliance with the rules was always an error. As we learned early on, our style rules must be more flexible than that. We have some rules that are absolute (at least one <pub-date> element must be present) because our production system relies on them (without a <pub-date>, we have no citation or release date). Other rules, however, are really more strong preferences or recommendations (don't emphasize the entire contents of <title>). Not following these style rules won't break our processing or display, but it isn't what we consider best practice.
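The split between absolute rules and preferences can be sketched in a few lines of Python. This is only an illustration, not the Style Checker's XSL: the function name `check_style`, the message strings, and the choice of <bold> and <italic> as the emphasis elements are assumptions. The two rules themselves are the ones named above.

```python
import xml.etree.ElementTree as ET

# Assumption: these two elements stand in for "emphasis" in this sketch.
EMPHASIS_TAGS = {"bold", "italic"}


def check_style(nxml):
    """Return (severity, message) pairs for two illustrative style rules."""
    root = ET.fromstring(nxml)
    findings = []

    # Absolute rule: at least one <pub-date> must be present.
    if root.find(".//pub-date") is None:
        findings.append(("error", "no <pub-date> found"))

    # Preference: do not emphasize the entire contents of <title>.
    for title in root.iter("title"):
        children = list(title)
        fully_emphasized = (
            len(children) == 1
            and children[0].tag in EMPHASIS_TAGS
            and not (title.text or "").strip()
            and not (children[0].tail or "").strip()
        )
        if fully_emphasized:
            findings.append(("warning", "entire <title> is emphasized"))
    return findings
```

A schema can only say "valid" or "invalid"; returning a severity with each finding is what lets a checker treat the first rule as blocking and the second as advice.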

To gain the needed flexibility, we replaced the RNG with a set of XSL stylesheets which could easily handle the more complex tests and allow us to report both errors and warnings. Figure 1 shows an overview of the PMC data conversion process. The publisher-supplied content is first run through our XSL converters to generate NXML. The NXML, if valid according to the DTD, moves on to style checking. If the NXML is not style-compliant, the output is written as out-badstyle.XML. This file cannot be loaded into the PMC database. If the NXML file is style-compliant, the output is written as out.nxml which can be loaded into the database. This is also the case if the NXML generates style warnings. The conversion process log files document the style warnings, but the output is still out.nxml and can be loaded into the database.
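The routing described in this paragraph reduces to one decision. The sketch below assumes DTD validation has already passed and that style findings arrive as (severity, message) pairs; the function name `route_output` and that data shape are hypothetical, while the two file names come from the text.

```python
def route_output(findings):
    """Pick the output file name for one DTD-valid article.

    `findings` is a list of (severity, message) pairs, where severity is
    "error" or "warning" (an assumed format for this sketch).
    Returns the file name and the warnings that would go to the log.
    """
    errors = [msg for severity, msg in findings if severity == "error"]
    warnings = [msg for severity, msg in findings if severity == "warning"]

    # Any style error blocks loading; warnings alone do not.
    filename = "out-badstyle.XML" if errors else "out.nxml"
    return filename, warnings
```

For example, `route_output([("warning", "entire <title> is emphasized")])` yields `out.nxml` together with the warning to be logged.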

Fig. 1. PMC Conversion Overview.


The Guidelines

At the same time, our internal style rules were becoming more comprehensive, and increasing numbers of publishers were adopting the JATS. As publishers adopted the Tag Suite—whether in existing systems or new systems for publishers using XML for the first time—they looked to PMC for guidance. They were facing the same challenges we had faced and wanted help getting their data into the JATS, sometimes with the primary goal being PMC participation.

This increasing demand for guidance led us to document our own style rules—previously only written in RNG or XSLT—in prose. It wasn't until 2005 that the PMC Tagging Guidelines were released as HTML pages—two years after we had implemented those rules in our data processing.

When first released in 2005, the PMC Tagging Guidelines were a single HTML page that described the tagging style for PMC. As the project has grown and changed, the Guidelines have also matured. The Guidelines are now delivered as multiple HTML pages that include comprehensive examples, fully-tagged articles, and the citation tagging style described in NLM's Citing Medicine [1].

Structure of the content

The Guidelines include five sections: Introduction, General Tagging Practice, Document Objects, Elements, and Update History. The Introduction section is intended to acquaint new users with the layout and structure of the Guidelines. General Tagging Practice includes tagging suggestions that apply to XML documents as a whole. This includes information like which DTD to use and our preferred practice on object linking. This section does not include specific rules for any elements.

Two other sections, Document Objects and Elements, define rules that are narrower in scope. Document Objects describe rules for parts of an XML document that define a single concept and may refer to several elements. For example, the Licensing Information section in Document Objects provides general instructions on how to identify and tag licensing information. Those instructions include references to specific elements (<permissions>, <open-access>), but do not detail their usage. That detail is presented in the Elements section which includes style rules for the elements and their attributes. There is comprehensive linking between the two sections to help users navigate.

The last section, Update History, details the full history of changes to the Guidelines. This section can be particularly helpful for those who need to set up automated conversion processes on their own systems.

Structure of the XML document

The HTML pages that display the PMC Tagging Guidelines are generated from an XML document that includes style rules not just for PMC, but also for the NIH Manuscript Submission (NIHMS) system and NCBI Bookshelf. The systems that run these three projects overlap heavily and since many of the style rules are based on system requirements, there is also a lot of overlap in style rules for the projects.

To maintain style documentation for the projects, we created one XML file that uses attributes to indicate project applicability. This allows us to keep all of the projects' rules in sync and makes updates much simpler.

Figure 2 shows the XML describing style rules for <abstract>. The paragraphs on lines 14, 17, and 20 all have the scope attribute which specifies the project to which that paragraph applies. The value pmcpublic on line 17 indicates a rule that is only applicable to the public PMC rules. There is also a pmcdev value which is used in cases where we've needed to create workarounds to solve very specific problems. The pmcdev rules are strictly limited to PMC staff and are not made public. The values of scope can also specify which type of publication they should be used for: article or book. The article value indicates the PMC and NIHMS projects. This value is shown on lines 31 and 32 of Figure 2.

Fig. 2. Sample of Tagging Guidelines XML.


An exclude-scope attribute is also used in the Guidelines XML. As the name suggests, it allows us to specify one project to which the rule does not apply. This is useful for rules that apply to both articles and books, but may not apply to the NIHMS project. We use the attribute exclude-scope="nihms" and the rule is included for the PMC and Bookshelf Guidelines, but not NIHMS. This attribute is included on the first line of XML in Figure 4.
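How scope and exclude-scope select rules for one project can be sketched as a filter. The actual selection is done in XSLT; the Python below is a hedged approximation. The project keys (`pmc`, `nihms`, `bookshelf`), the grouping in `PROJECT_SCOPES`, and the use of plain <p> elements are assumptions. Only the attribute names and the values pmcpublic, pmcdev, article, book, and nihms come from the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a project to the scope values that apply to it.
# pmcdev is left out because those rules are not published.
PROJECT_SCOPES = {
    "pmc": {"pmcpublic", "article"},
    "nihms": {"nihms", "article"},
    "bookshelf": {"book"},
}


def applies_to(elem, project):
    """Decide whether one Guidelines element belongs in a project's output."""
    accepted = PROJECT_SCOPES[project]
    excluded = elem.get("exclude-scope")
    if excluded is not None and excluded in accepted:
        return False
    scope = elem.get("scope")
    if scope is None:
        return True  # no scope attribute: the rule applies to every project
    return bool(set(scope.split()) & accepted)


def rules_for(guidelines_xml, project):
    """Return the text of every <p> that applies to the given project."""
    root = ET.fromstring(guidelines_xml)
    return [p.text for p in root.iter("p") if applies_to(p, project)]
```

Keeping one source file and filtering at build time is what keeps the three sets of guidelines from drifting apart.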

Fig. 4. Guidelines XML Showing version Attribute.


The Guidelines XML is converted into the HTML Guidelines with XSLT. Each project's guidelines are generated separately by running the XSLT with different run-time variables. Figure 3 shows the web display for the XML in Figure 2 for each of the three projects. The run-time variable also indicates which CSS to call in the output HTML, allowing each project to maintain its own branding.

Fig. 3. Rendering of Tagging Guidelines.


In addition to documenting tagging style for multiple projects, the Guidelines XML documents style rules for both versions 2.3 and 3.0 of the JATS. Because v3.0 was a non-backwards compatible release and migrating existing systems to the newer version can be a difficult and sometimes expensive undertaking, PMC made the decision to maintain style rules for both versions.

The XML in Figure 4 shows the tagging for the <license> element. Because the model is different in versions 2.3 and 3.0, the style rules are different and must be distinctly identified. The guidelines for <license> include tagging examples which also carry the version attribute (lines 812 and 823).

Unlike the scope attribute, the version attribute is used to apply a specific style to the HTML output. Figure 5 shows the browser display for the XML in Figure 4 after conversion to HTML. All of the XML content is included in the output, but the content for versions 2.3 and 3.0 is distinguished through different styles.

Fig. 5. Guidelines HTML With Different Version Instructions.


This distinction is meant to help users quickly and easily identify tagging rules for the different versions. The HTML Guidelines also include a function that lets users turn the display of a specific version's rules on or off. The function is available at both the main page level and the individual element level.

Using the Style Tools

The XSL stylesheets that comprise the PMC Style Checker are available for download. Additionally, PMC has a web version of the Style Checker that allows people to check their XML without having to set up the XSL transforms. Both versions of the Style Checker are designed to be used with the Tagging Guidelines. When a style error is reported by either tool, the error includes a brief explanation and the URI for the corresponding style rule in the Tagging Guidelines. In the following example, the reported <related-article> error displays the URI http://www.pubmedcentral.nih.gov/pmcdoc/tagging-guidelines/article/tags.html#el-relart.

The user can then copy and paste the URI into a browser which will show them the specific style rule that is not being followed. See Figure 7 showing the <related-article> element and its descriptive paragraphs and table.

Fig. 7. PMC Tagging Guidelines Entry for <related-article>.


Each paragraph and table describes the style and functionality of the <related-article> tag. The links within the paragraphs direct the user to samples showing how to tag a <related-article> element. Since the error reported in Figure 6 indicates a problem with the element's attributes, the user would review the attribute information in the Guidelines and then view the linked samples to review examples of <related-article> tagging.

Fig. 6. Sample PMC Style Checker Error.


The XML tagging in Figure 8 is the "Sample 5" article from the fully-tagged examples and shows the tagging needed to resolve the style error reported in Figure 6.

Fig. 8. <related-article> Tagging Example in "Sample 5".


The user can re-tag the source XML file according to the sample in Figure 8 and then resubmit the updated XML file to the online PMC Style Checker tool. Having the file tagged correctly will produce a successfully validated and style-checked output XML file.

The Guidelines in PMC Production

A publisher's exposure to the Guidelines begins when they first apply for participation in PMC. The first step of the application process is a scientific review of the journal by the NLM Selections and Acquisitions Group to determine its eligibility for the NLM Collection. When the journal is selected for the collection, we notify the journal and include a list of next steps, one of which is a technical review of data by the PMC staff. To help prepare for the technical review, we ask the publisher to review the PMC Tagging Guidelines and to run their prepared XML files through the online Style Checker. Following this advice saves us considerable time answering questions and troubleshooting problematic XML files. It can also help identify sticking points in the publisher’s coding and save the journal from being rejected from PMC for failure to meet the technical requirements.

The technical evaluation includes reviewing roughly 50 sample article files to make sure the data have the quality and consistency to be in the PMC archive. We normally expect some small errors, and therefore allow three attempts at submitting fully correct files. However, we also maintain a set of “Minimum Requirements” [2] to demonstrate basic XML competence; failure to meet these requirements can result in immediate rejection. These require that the XML be valid against the declared DTD, but do not require that it conform to PMC style. However, there is a prominent note on the minimum requirements page “strongly suggest[ing]” PMC style.

During the technical evaluation, the sample files are processed as if they were production content (with a few minor exceptions). This means both programmatic testing—for example valid XML, PMC style, properly called-out images and other files—as well as a manual review of the data and its HTML representation. Errors can appear anywhere along the way and PMC style errors are fairly common since new publishers are often not cognizant of the difference between “valid according to the DTD” and “conforming to PMC style.” Therefore the evaluation reports that we send back to publishers frequently include reference to the PMC Tagging Guidelines.

Sometimes the errors result in *-badstyle.XML files, which prevents the article from being loaded into the database. At this stage, however, there are often style warnings where the provided tagging does not follow PMC's preferred style. During the technical evaluation especially, we encourage the publisher to follow all of the Guidelines in order to have fully correct data.

Current Participants

Publishers already participating in PMC use the PMC Style Checker and Tagging Guidelines as well. Some use the Online Style Checker to check individual files before submission. This is a reasonable step for publishers that have smaller quantities of data for PMC submission, or publishers who do not have the resources to integrate the XSL Style Checker into their system. This model tends to make a big difference in the quality of submissions, especially since these publishers tend to code their XML far more manually, and are therefore more likely to have mistakes and inconsistencies in their data than publishers who have more automated processes.

In contrast, many larger publishers or third-party data providers have downloaded the Style Checker XSLT files for local use as part of their production workflow. Again, this makes both their and our lives easier. Errors can be caught ahead of time, and if there is a question about how to tag something properly, it’s easy for them to check with us beforehand, instead of sending files that might fail processing. It also helps catch problems that might generate only a warning in PMC and thus not be flagged for review during normal QA procedures.

Tagging Guidelines

There are a number of particularly useful sections of the PMC Tagging Guidelines that we repeatedly refer publishers to in order to advise them on their tagging. It helps to have quick and consistent instructions to send out for some of the common errors that we see, or just simple advice and new practice suggestions for the publishers.

A good example is release-delay tagging: a processing instruction (PI) that PMC and publishers use to set the embargo on any particular article, especially if it differs from the default embargo for the journal as a whole (for example, for special "immediate release" articles). Processing instructions in general, and release-delay PIs in particular, are new to many publishers and data providers. The HTML Tagging Guidelines include clear instructions and examples of how to use them properly. The PMC Journal Managers no longer need to type out instructions each time; we can simply point the publisher to the exact instructions [6] in the Tagging Guidelines, saving everyone time.

Another scenario that is also made much easier with the PMC Tagging Guidelines is table coloring and shading. This is a fairly rare occurrence in standard journal tables. However, when it comes up, many publishers assume that the only way to deliver such a table is to supply an image of the complete table. This reduces the value of the article XML because it no longer includes complete searchable content. With the PMC Tagging Guidelines, we can now instruct publishers on the simple way to add shading or coloring to table cells, so that the entire table can be fully tagged in XML.

Two additional aspects of tagging make the Tagging Guidelines particularly useful for ongoing publishers: the article-type attribute and <related-article> tagging. These are invariably sources of confusion, even for seasoned PMC staff, but the Tagging Guidelines provide easy reference for appropriate tagging. Related-article tagging is also a good example of why PMC style is an essential layer on top of the DTD. The related-article-type attribute is required on <related-article>; however, its value is not prescribed in the Journal Publishing Tag Set. This makes sense for the Tag Set: a given publisher using the model may want to use any number of house-specific values to describe the relationship between two articles. For the purposes of PMC, however, we need a fixed set of possible values, because both the database and renderer have to know what to do with a given group of related articles. Therefore the Style Checker is employed to enforce the values used in PMC. Where necessary, our conversion processes can always take a publisher-specific value (used in their own system) and convert it to the appropriate PMC value, as long as the tagging is consistent and we know what values to expect from the publisher.
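The two steps in this paragraph, converting a publisher-specific value and then enforcing a fixed set, can be sketched as follows. The values in `ALLOWED_TYPES` and `PUBLISHER_TO_PMC` are illustrative placeholders rather than PMC's actual list, and the function name is hypothetical; the Tagging Guidelines remain the reference for the real values.

```python
import xml.etree.ElementTree as ET

# Illustrative placeholders only; see the Tagging Guidelines for real values.
ALLOWED_TYPES = {"corrected-article", "commentary-article", "retracted-article"}

# Hypothetical house values from one publisher, mapped to the fixed set.
PUBLISHER_TO_PMC = {
    "erratum-for": "corrected-article",
    "comment-on": "commentary-article",
}


def normalize_related_articles(root):
    """Rewrite known publisher values in place and report the rest."""
    problems = []
    for rel in root.iter("related-article"):
        value = rel.get("related-article-type")
        value = PUBLISHER_TO_PMC.get(value, value)
        if value in ALLOWED_TYPES:
            rel.set("related-article-type", value)
        else:
            problems.append(f"unexpected related-article-type: {value!r}")
    return problems
```

The mapping table only works when a publisher's values are consistent and known in advance, which is the condition the paragraph above places on conversion.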

Style Checker in Production

As mentioned above, all submissions to PMC are converted to NXML and run through the Style Checker to catch any non-compliant tagging. This step is particularly useful for catching a few kinds of problems.

The first of these is errors in MathML tagging. Authors and publishers are understandably concerned primarily with how a given equation renders: i.e. can an article user read and understand the math? At PMC, however, we want to ensure the accuracy of not only math rendering, but also the semantics of its tagging. For us, the MathML needs to reflect the meaning of the equation, not just generate a formula that looks correct. There are some common tagging practices that exemplify this tension and the Style Checker is a great way for us to find those. The two most common involve <mml:mn>, and sub- or super-scripts to a whole expression.

Many times, coders will use successive <mml:mn> elements (number tokens that represent a real number) to code the individual digits of a single number, for example:

<mml:mn>7</mml:mn><mml:mn>5</mml:mn>

to code the number “75”. This would render a “75” successfully, but clearly the semantic meaning is lost, as seventy-five is not the same as a seven next to a five. In a similar vein, consider an expression like (a + b)². Mathematically the entire expression is squared, but the equation would appear correct if (as is commonly done) the “2” were tagged as a superscript to only the closing “)”. See Figure 9 for both versions of the tagging and their rendering, along with the Style Checker test for this error.

Fig. 9. Variations of MathML Tagging. MathML tagging for an equation showing the same rendering for both incorrect and correct tagging. A) Incorrect tagging and the resulting image. B) Correct tagging and the resulting image. C) The XSL test from the Style Checker.

Both of these practices are monitored by the Style Checker, which will generate an error if they are present, thus alerting us to request a data correction.
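Both patterns are simple to detect structurally. The Python below is a rough approximation written for illustration, not a port of the Style Checker's XSL test. The function name and messages are assumptions, and the adjacent-number check is limited to <mml:mrow> and <mml:math> parents so that positional arguments such as the base and exponent of <mml:msup> are not flagged.

```python
import xml.etree.ElementTree as ET

MML = "{http://www.w3.org/1998/Math/MathML}"
ROW_LIKE = {MML + "mrow", MML + "math"}


def mathml_findings(xml_text):
    """Flag split numbers and superscripts hung on a closing parenthesis."""
    root = ET.fromstring(xml_text)
    findings = []
    for parent in root.iter():
        children = list(parent)

        # Two <mml:mn> siblings in a row inside a row-like container.
        if parent.tag in ROW_LIKE:
            for left, right in zip(children, children[1:]):
                if left.tag == MML + "mn" and right.tag == MML + "mn":
                    findings.append("adjacent <mml:mn> elements")

        # <mml:msup> whose base is only the closing parenthesis.
        if parent.tag == MML + "msup" and children:
            base = children[0]
            if base.tag == MML + "mo" and (base.text or "").strip() == ")":
                findings.append("superscript attached to closing parenthesis")
    return findings
```

Neither pattern changes how the equation renders, which is why a structural test catches what a visual review would miss.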

A second common finding of the Style Checker is DOI tagging errors. Since these identifiers follow a prescribed format, we can check every identified occurrence in an article (for example in an article citation, or in the reference list) to make sure that the DOIs are properly formatted. Although a properly-formatted DOI can still be wrong, having this check in place helps flag the majority of bad DOIs (for which we then request corrections), and thus improves the quality of the archive. This is especially useful given the value that DOIs add to users for finding original articles online.
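A format check of this kind can be sketched with a regular expression. The pattern below (the prefix "10.", a numeric registrant code, a slash, then a non-empty suffix) is a commonly used approximation and an assumption here, not the Style Checker's actual test. The sketch looks at <article-id> and <pub-id> elements whose pub-id-type is doi.

```python
import re
import xml.etree.ElementTree as ET

# Assumed approximation of DOI syntax: "10.", 4-9 digits, "/", a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def malformed_dois(xml_text):
    """Return the text of every DOI-typed identifier that fails the pattern."""
    root = ET.fromstring(xml_text)
    bad = []
    for elem in root.iter():
        if elem.tag in {"article-id", "pub-id"} and elem.get("pub-id-type") == "doi":
            value = elem.text or ""
            if not DOI_PATTERN.fullmatch(value):
                bad.append(value)
    return bad
```

As the paragraph notes, passing this test does not make a DOI correct; it only rules out values that cannot be DOIs, such as one carrying a leading "doi:" label.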

Thirdly, we use the Style Checker to check <related-article> tagging. This element is used in cases such as errata, commentaries, and letters to refer to an original or otherwise related article, but it is easy to miss some part of the tagging. In particular, the related-article-type attribute can cause confusion. Using the Style Checker, we check for numerous potential problems, including related-article-type values that don’t correspond with the article-type attribute on <article>; “correction” articles that are missing a <related-article>; etc. Again, this can catch many more problems than a manual quality assurance process would, and helps maintain the integrity of the PMC database.
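The cross-check between article-type and <related-article> might look like the sketch below. The pairing in `EXPECTED` is an assumption made for illustration, as are the function name and messages; the Guidelines define the real correspondence.

```python
import xml.etree.ElementTree as ET

# Assumed pairing of article-type with the related-article-type it requires.
EXPECTED = {
    "correction": "corrected-article",
    "retraction": "retracted-article",
}


def related_article_findings(xml_text):
    """Check that certain article types carry a matching <related-article>."""
    article = ET.fromstring(xml_text)  # the root is expected to be <article>
    article_type = article.get("article-type")
    expected = EXPECTED.get(article_type)
    if expected is None:
        return []  # no requirement is modeled for this article type

    related = list(article.iter("related-article"))
    if not related:
        return [f"{article_type} article has no <related-article>"]
    if not any(r.get("related-article-type") == expected for r in related):
        return [f"{article_type} article lacks related-article-type={expected!r}"]
    return []
```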

A fourth useful test is for empty elements. These can result from taggers using a template for each article when a particular article doesn’t have a given item, or they can indicate something missing from the article itself; either way, they rarely cause invalid XML. Although in the former case the empty elements could simply be discarded, there’s no systematic way to determine which type of problem has occurred in a given article. Having a style check for these cases flags the problem for the Journal Manager to investigate, thus preventing missing or mis-tagged content in an article.
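A minimal sketch of an empty-element test follows. It makes two simplifying assumptions that the real checker may not share: elements that carry attributes are treated as intentionally empty, and a small allow-list (`ALLOWED_EMPTY`, hypothetical) covers elements that are empty by design.

```python
import xml.etree.ElementTree as ET

# Hypothetical allow-list of elements that are empty by design.
ALLOWED_EMPTY = {"break", "hr"}


def empty_elements(xml_text):
    """Return the names of elements with no children, text, or attributes."""
    root = ET.fromstring(xml_text)
    return [
        elem.tag
        for elem in root.iter()
        if len(elem) == 0
        and not (elem.text or "").strip()
        and not elem.attrib
        and elem.tag not in ALLOWED_EMPTY
    ]
```

The function only reports; deciding whether an empty <volume> is template residue or missing data is left to a person, as described above.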

Finally, the Style Checker is useful in monitoring <xref> tagging, by checking the ref-type attribute against the referenced element. It’s an easy mistake to use the ref-type value bibr on an <xref> that points to a footnote, or the value table on one that points to a figure. Neither of these is invalid according to the Tag Suite, but they represent clear errors in the tagging. With the Style Checker, we can catch these problems programmatically.
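Checking ref-type against the referenced element amounts to an ID lookup plus a table of expected targets. In the sketch below the table `EXPECTED_TARGETS` covers only four ref-type values and is an assumption, as are the function name and message format; unknown values and unresolved IDs are skipped rather than reported.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from ref-type value to the element names it may point to.
EXPECTED_TARGETS = {
    "bibr": {"ref"},
    "fig": {"fig"},
    "table": {"table-wrap"},
    "fn": {"fn"},
}


def xref_findings(xml_text):
    """Report each <xref> whose ref-type disagrees with its target element."""
    root = ET.fromstring(xml_text)
    by_id = {e.get("id"): e for e in root.iter() if e.get("id")}
    findings = []
    for xref in root.iter("xref"):
        ref_type = xref.get("ref-type")
        allowed = EXPECTED_TARGETS.get(ref_type)
        if allowed is None:
            continue
        # rid may hold several space-separated IDs.
        for rid in (xref.get("rid") or "").split():
            target = by_id.get(rid)
            if target is not None and target.tag not in allowed:
                findings.append(
                    f"xref ref-type={ref_type!r} points to <{target.tag}> (rid={rid!r})"
                )
    return findings
```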

Continuing Development

The PMC Tagging Guidelines have proven to be an invaluable resource for PMC staff and data suppliers. Beyond just the Guidelines, the implementation of a specific style has helped maintain the quality of the PMC archive. One of the most important lessons about developing a project style has been that there is no finish line. As long as the project continues to evolve, the style and tools must also continue to evolve.

For PMC this evolution has involved, and will continue to involve, the needs of our data providers. As more groups participate in PMC, we constantly receive requests for features, tagging enhancements, and even simple guidance. All of these requests, as well as much of the input we receive from data providers, continue to shape PMC's style rules and PMC as a whole.

References

1. Citing Medicine [Internet]. Bethesda (MD): National Library of Medicine (US); 2007. Available from: http://www.ncbi.nlm.nih.gov/books/NBK7256/.
2. Add a Journal to PMC [Internet]. Bethesda (MD): National Library of Medicine (US); 2011. [updated 2011 Jul 7; cited 2011 Aug 8]. Minimum Data Requirements. Available from: http://www.ncbi.nlm.nih.gov/pmc/pub/pubinfo/#min-data-req.
3. PMC Style Checker [Internet]. Bethesda (MD): National Library of Medicine (US); 2011. [cited 2011 Aug 8]. Available from: http://www.pubmedcentral.nih.gov/utils/style_checker/stylechecker.cgi.
4. PMC XML Tagging Guidelines [Internet]. Bethesda (MD): National Library of Medicine (US); 2011. [cited 2011 Aug 10]. Fully-tagged sample article 2. Available from: http://www.ncbi.nlm.nih.gov/pmc/pmcdoc/tagging-guidelines/article/JournalPub-sample2.txt.
5. Mathematical Markup Language (MathML™) 1.01 Specification [Internet]. W3C; c1999. [updated 1999 Jul 7; cited 2011 Aug 10]. Available from: http://www.w3.org/TR/REC-MathML/chap3_2.html#sec3.2.3.
6. PMC XML Tagging Guidelines [Internet]. Bethesda (MD): National Library of Medicine (US); 2011. [cited 2011 Aug 22]. Document Objects: Processing Instructions. Available from: http://www.ncbi.nlm.nih.gov/pmc/pmcdoc/tagging-guidelines/article/dobs.html#dob-procinst.

This work is in the public domain and may be freely distributed and copied. However, it is requested that in any subsequent use of this work, the author be given appropriate acknowledgment.

The copyright holder grants the U.S. National Library of Medicine permission to archive and post a copy of this paper on the Journal Article Tag Suite Conference proceedings website.
