Journal Article Tag Suite Conference (JATS-Con) Proceedings 2011 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2011.

Bookshelf ID: NBK61837

Reality Check: What to expect from automated conversion to NLM XML

Devorah Bloom, Beth Friedman, and Gitty Kupferstein.

Data Conversion Laboratory, Inc.

When converting legacy content, SGML, or other flavors of XML to the NLM Journal Publishing DTD, there is always concern about how well automated tools can deliver quality results with minimal post-conversion clean-up. This paper looks at what an automated approach can (and can't) do, which issues are best dealt with pre-conversion, which are best dealt with post-conversion, and the kinds of tools that help ensure consistency and accuracy in delivered documents. This presentation is based on lessons learned in converting over 2 million pages to the NLM Journal Publishing DTD.

Introduction

With technology constantly changing and organizations always looking to keep up with the latest and greatest, revising data formats and making content work with the latest formats is often on people's minds. While a variety of automated and semi-automated conversion methods are available, no process is foolproof, and making use of the right tools makes the transition less painful. This paper looks at the reasons that a fully lights-out automated conversion is difficult to achieve, and shows how specialized pre- and post-conversion tools improve the process.

To make sure we are using consistent terminology, it is useful to look at how we describe various stages in the life cycle of a document as it gets converted from paper or PDF to NLM XML:

  • Pre-Conversion

    1. Analysis of document set

    2. For paper - scanning/zoning/OCR; for PDF - zoning/text extraction

    3. Image cropping/conversion

    4. Proofreading/clean-up of text

    5. Styling/pre-edit tagging to drive conversion software

  • Conversion (this is the lights-out part of the process)

  • Post-Conversion

    1. Parsing

    2. Viewing

    3. Quality Control

Both the pre- and post-conversion stages include steps that can be taken to maximize the effectiveness of the automated conversion. The remainder of this paper describes those steps, and the tools that support them, within this production life cycle framework.

Pre-Conversion

Pre-analyzing content

One of the first decisions to make is which DTD best suits the dataset. Our experience has shown that because the NLM Journal DTD is clearly laid out, well documented, amply robust, and public domain, it meets most requirements and is very often the DTD of choice. One downside is that the DTD can be too "loose," i.e., the same data can be captured in multiple ways. In one of last year's JATS-Con presentations, Debbie Lapeyre noted that because so many tags are available, there are benefits to creating a subset DTD in order to simplify the workflow. However, what if you don't have the expertise to edit a DTD and don't want the headache of maintaining one? Analyzing a representative sample of your content prior to conversion leads to a better understanding of the data structure and therefore to a more specific conversion that limits the tag usage for each tagging scenario. Legacy data tends to run the gamut - different style formats, different content, different sources - and all of these need to be taken into account.

When converting from paper or any other page-oriented format (e.g., PDF), there are a number of items that need to be kept in mind and decided prior to commencing the project. For example, should a table of contents or index be tagged, or will it be auto-generated? How true does the XML need to remain to the original source? Does the look of the original document need to be maintained, mandating capture of connective text and punctuation? Does the user need to keep elements such as equations, tables, and citations in multiple formats so as to keep "the look" and yet have the searching/update capability? Should leading/trailing punctuation be included in elements such as titles or labels? Should specific data, such as an abstract or acknowledgment, capture a title in the XML, or will that be generated in the renderer?

Another common issue is the appearance of objects (figures and tables) mid-paragraph. When a figure or table is technically in the middle of a paragraph, it doesn't disturb the reading on a print page, but in an electronic publication it would. So, do you want the figure/table to be tagged directly after its callout within the body, or in a separate section in the XML file?

Identifying source elements is sometimes content-based, in which case the rules need to cover all flavors of the text (e.g., Figure(s), fig., Figs., Illustration(s), Chart(s)), and sometimes appearance-based (placement, font, alignment, point size). On the target side, the usage and placement of elements need to be defined for the target application: where to place tables, figures, footnotes, and sidebars; whether affiliation and bio belong inside or outside a contrib tag; whether the word "Table" or "Figure" is included in the xref; whether citations include connective punctuation; and so on.
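A content-based rule of this kind can be sketched as a regular expression. The following is a minimal illustration in Python, not the software described in this paper; the label list and the function name `find_callouts` are assumptions, and a real list would come out of the analysis of the document set.

```python
import re

# Hypothetical label list; extend it after analyzing a sample of the real content.
CALLOUT = re.compile(
    r"\b(?P<label>Figures?|Figs?\.?|Illustrations?|Charts?|Tables?)"
    r"\s*(?P<num>\d+[A-Za-z]?)",
    re.IGNORECASE,
)

def find_callouts(text):
    """Return (label, number) pairs for every figure/table callout found."""
    return [(m.group("label"), m.group("num")) for m in CALLOUT.finditer(text)]

sample = "As shown in Fig. 2 and Figures 3a, see Table 1 and fig 4."
callouts = find_callouts(sample)
```

Each pair keeps the label exactly as it appeared in the source, so a later step can decide whether that word belongs inside the xref.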

The handling of equations/math is another decision to be made upfront. Should inline math and display formulae be captured as images, as MathML, or both? Equations captured as images have the advantage of remaining most true to the look of the object, but they can be neither revised nor searched, and they become pixelated upon enlargement.

The above-mentioned criteria and many more make up the conversion specification document. A specification such as the one below should become the go-to document that contains all conversion rules. A hand-tagged sample is helpful to test the results in the final user application.

Fig. 1. Sample conversion specification.

When the source data is electronic, tagged either to a different DTD or to an earlier version of the same DTD, a conversion specification should be created as well. Analysis is done to determine how the source tags and data map to the tags in the target DTD, and the results become the conversion specification.
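For electronic sources, the core of such a specification is a tag-to-tag mapping. The sketch below shows one way to apply a mapping and to list the source tags that have no rule yet; the source tag names in `TAG_MAP` are invented for illustration and do not come from any particular DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical source-to-target mapping; a real one comes from the conversion specification.
TAG_MAP = {
    "section": "sec",
    "heading": "title",
    "para": "p",
    "emph": "italic",
}

def remap(elem, unmapped):
    """Rename elements in place per TAG_MAP and record tags that have no rule."""
    if elem.tag in TAG_MAP:
        elem.tag = TAG_MAP[elem.tag]
    else:
        unmapped.add(elem.tag)
    for child in elem:
        remap(child, unmapped)

source = (
    "<section><heading>Methods</heading>"
    "<para>Some <emph>key</emph> text.</para><aside>Tip</aside></section>"
)
root = ET.fromstring(source)
unmapped = set()
remap(root, unmapped)
result = ET.tostring(root, encoding="unicode")
```

Collecting the unmapped tags rather than failing on the first one gives the analyst a complete list of decisions still missing from the specification.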

OCR from Paper hardcopy

Conversion from paper is accomplished with scanning and OCR software. Because page layouts vary, zoning the text using ABBYY guides the OCR software as to the data flow of the page.

Text Extraction for conversion from PDF

There are a variety of ways to extract text from a PDF. The two methods we use are:

  1. Adobe Acrobat with BCL Jade/Gemini - Zoning software maps the text flow before the text is extracted, which ensures that the data in the output is ordered correctly. This is particularly important for multi-column PDF pages: without zoning, text may be extracted across columns, merging sentences from different paragraphs and rendering the data almost useless. Telling the text extraction software what is on the page, and in what order to extract it, ensures that the text flows correctly. Zoning also makes it possible to identify objects, i.e., to specify what is a table or figure (producing better output), and to omit unnecessary text such as page headers and footers.

    Fig. 2. Zoning of page in Jade to identify objects and direct text flow.

  2. Comparison method - No single extraction method is perfect; each comes with its own strengths and weaknesses. Depending on how the PDF was constructed, text extraction tools can struggle to extract the text accurately, and a different approach may be necessary. Similar to OCR technology, where more than one algorithm is used and the common result is deemed to be correct, we have developed a tool to compare the text results of PDF extraction versus OCR.

    We OCR the PDF using ABBYY, extract the text using Acrobat's own comparison, compare the two outcomes, and examine the differences (using software to remove false positives). The output is a Word file that looks similar to a document with Track Changes turned on: the tool highlights the characters where the two results differ and that need further checking. This drastically reduces the time a full proofreading would take, since we identify what actually needs to be looked at rather than looking at everything.
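The idea behind the comparison method can be illustrated with Python's standard `difflib` module. This is only a sketch of the general technique, not the tool described above: it assumes both extractions are already available as plain strings, and the function name `flag_differences` is hypothetical.

```python
import difflib

def flag_differences(pdf_text, ocr_text):
    """Return (pdf_fragment, ocr_fragment) pairs where the two extractions disagree."""
    matcher = difflib.SequenceMatcher(None, pdf_text, ocr_text, autojunk=False)
    return [
        (pdf_text[i1:i2], ocr_text[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]

# The OCR pass misread the digit "1" as the letter "l".
differences = flag_differences("Dose: 10 mg", "Dose: l0 mg")
```

Only the flagged fragments need a proofreader's attention. Where the two extractions agree, the text is assumed to be correct.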

Proofreading/Text Clean-up

Obtaining optimal accuracy levels in OCR output requires extensive proofreading. Various tools are available on the market to improve the proofreading process. Along with these off-the-shelf products, we have found that specialized software is sometimes needed to make the process more efficient and reliable. For example, to help a proofreader distinguish between characters that are commonly confused, we used a font tool called TypeTool from FontLab to create modified versions of the fonts that make similar-looking characters easier to tell apart - "O" vs "0", "Z" vs "2", "1" vs "l".

Fig. 3. Proofreading with special fonts: (top) extracted text with applied font; (bottom) source PDF text.

Text extracted from PDF Normal files will still need some level of proofreading. The usual pitfalls in text extraction are hard/soft hyphens, special characters, and emphasis, which all need to be checked.

We make use of a hyphenation spellchecker to catch extraneous soft-hyphens or missing hard-hyphens. For words that can be spelled with or without a hyphen (for example, well-known), the software will check within the rest of the file to see if that word is found anywhere else as a guide.

Fig. 4. Results of hyphen-checking tool.
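The "check the rest of the file" heuristic can be sketched as follows. This is a simplified illustration rather than the hyphenation spellchecker itself: it has no dictionary, it only counts how often each hyphenated word also appears in joined form within the same text, and the name `hyphen_report` is an assumption.

```python
import re
from collections import Counter

def hyphen_report(text):
    """Map each hyphenated word to counts of its hyphenated and joined forms."""
    words = Counter(
        w.lower() for w in re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)
    )
    report = {}
    for word, count in words.items():
        if "-" in word:
            joined = word.replace("-", "")
            report[word] = {"hyphenated": count, "joined": words.get(joined, 0)}
    return report

sample = (
    "The con-version was slow. Each conversion needs a well-known tool; "
    "a well-known rule helps."
)
report = hyphen_report(sample)
```

A hyphenated word whose joined form also occurs in the file ("con-version" next to "conversion") is a likely leftover soft hyphen, while a word that only ever appears hyphenated ("well-known") is more likely a genuine hard hyphen.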

Styling/Pre-editing of the Word Content

Visually, there may be content that can be tagged in multiple ways - is it a bolded paragraph or a section title? Is a paragraph with a number/letter preceding it a list item with a label or a regular paragraph?

Styling of the content using Word paragraph styles to guide the conversion is key. For example, once the style is in place to indicate a paragraph is an author line, software can be developed to tag each of these authors. A Word template is created to include all styles necessary to drive the conversion software.

Fig. 5. Sample Word text styling.
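Once paragraphs carry styles, the conversion logic can dispatch on the style name. The sketch below assumes the styled paragraphs have already been read out of Word as (style, text) pairs; the style names, the `STYLE_TO_ELEMENT` table, and the author-splitting rule are all hypothetical.

```python
import re

# Hypothetical Word style names mapped to target element names.
STYLE_TO_ELEMENT = {
    "ArticleTitle": "article-title",
    "SectionTitle": "title",
    "BodyText": "p",
}

def split_authors(line):
    """Split an author line such as 'A B, C D, and E F.' into individual names."""
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", line.strip().rstrip("."))
    return [p for p in parts if p]

def tag_paragraphs(paragraphs):
    """Turn (style, text) pairs into (element, text) pairs, one per author for author lines."""
    tagged = []
    for style, text in paragraphs:
        if style == "AuthorLine":
            tagged.extend(("contrib", name) for name in split_authors(text))
        else:
            # Unknown styles fall back to a plain paragraph.
            tagged.append((STYLE_TO_ELEMENT.get(style, "p"), text))
    return tagged

tagged = tag_paragraphs([
    ("ArticleTitle", "Reality Check"),
    ("AuthorLine", "Devorah Bloom, Beth Friedman, and Gitty Kupferstein."),
    ("BodyText", "With technology constantly changing."),
])
```

Because the style, not the visual appearance, decides the element, a bolded paragraph and a section title no longer need to be told apart by the software.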

Pre-editing of the Word content is another way to help guide the conversion software. For example, pre-edit tags such as [[TABLE_START]] and [[TABLE_END]] can indicate to the software that all content in between is part of the same table, which is helpful for continuation tables or tables that include images or math equations that can confuse software. Other uses include indicating the placement of untitled graphics, page breaks, data to be moved or deleted, nested tables, content tagging, and the start/end of multiple citations in a reference. In many cases, pre-tagging is far easier than fixing the XML after the fact.
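A minimal sketch of how conversion software might honor the [[TABLE_START]] and [[TABLE_END]] tags, assuming the Word content has already been exported as plain text. The function name and the chunk labels are illustrative only.

```python
import re

TABLE_BLOCK = re.compile(r"\[\[TABLE_START\]\](.*?)\[\[TABLE_END\]\]", re.DOTALL)

def split_on_table_tags(text):
    """Split text into ('text', ...) and ('table', ...) chunks using the pre-edit tags."""
    chunks, pos = [], 0
    for m in TABLE_BLOCK.finditer(text):
        before = text[pos:m.start()].strip()
        if before:
            chunks.append(("text", before))
        chunks.append(("table", m.group(1).strip()))
        pos = m.end()
    after = text[pos:].strip()
    if after:
        chunks.append(("text", after))
    return chunks

sample = (
    "Intro paragraph.\n[[TABLE_START]]\nDose | Result\n(continued)\n"
    "5 mg | ok\n[[TABLE_END]]\nClosing paragraph."
)
chunks = split_on_table_tags(sample)
```

Everything between the two tags stays in one table chunk, including the "(continued)" line that would otherwise look like a break between two tables.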

Another helpful tool involves inserting hidden text into the data to aid the conversion software. For example, citations in legacy data tend to be inconsistent in their format. Older content may not have followed any citation guidelines, and subsequent content may cycle through a variety of citation formats before the publication settled on a normalized format. Tagging citations manually can be a real nightmare. We developed a macro tool that prompts an editor with specific questions about the type of citations (Harvard vs. numeric; the method of linking to citations, if numeric; the order of author names, last/first or first/last; etc.) and embeds key information into the data file to drive the handling of citations.

Conversion

Using the programming language of choice (DCL programmers prefer Perl and JavaScript), the programmers make use of the rules in the conversion specification, the hand-tagged sample, and the pre-edit tags/Word styles to engineer the conversion software. A representative sampling of data is used as a production sample to test the software. Once the results of the conversion sample meet with the user's approval, the production floodgates can be opened and full conversion begins.

Post-Conversion

Viewing of Parsed XML

Identifying errors in XML is complicated. Viewing tools provide visual indicators that let the editor confirm the correct styling and identify errors. There are several public-domain viewers available online for viewing NLM XML. XSLT stylesheets are very helpful for creating a visual rendering of the XML to ensure that it is not just valid but correctly represents the data. The purpose of this viewer is to provide a visual of the tagging, not to serve cosmetic or publishing purposes. The XSLT stylesheet makes use of a variety of colors, fonts, and point sizes so that elements are easy to identify during editorial review.

Fig. 6. Viewer making use of different fonts and colors to provide a visual for tagging.

The stylesheet enables the editor to view tables so as to check alignment, borders, and spanning. Images can be viewed within the HTML result to verify that the correct graphics are referenced. XSLT can be used to display MathML equations (along with math applications such as MathPlayer), to highlight all special characters, and to display tag attributes or any other information that needs to be checked.

The level of visual QA varies based on the complexity of the project. Initial viewing should be at a 100% level to ensure conversion quality. Once a comfort level is achieved with this first review, a smaller sampling of pages can be viewed, because the conversion is automated and most data is converted correctly. There are certain traditional "pitfall" elements that will always require visual checking, such as tables, figures, and math.

Reporting stylesheets

XSLT stylesheets can be used to create reports on chunks of data that need to be broken apart and have their components reviewed. The report displays each tag in a separate column, which makes possible problems easy to spot during editorial review. The same approach can be used for any element where a breakdown of the text is necessary and needs to be reviewed.

Fig. 7. Sample reports created to more easily find oddities in the tagging content.
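The column-per-tag idea can also be illustrated outside XSLT. The sketch below uses Python's standard `xml.etree.ElementTree` in place of a stylesheet, purely for illustration; the function name `citation_report` and the sample references are made up.

```python
import xml.etree.ElementTree as ET

def citation_report(xml_text, columns):
    """Return one row per <ref>: its id, then the text of each requested element."""
    root = ET.fromstring(xml_text)
    rows = []
    for ref in root.iter("ref"):
        row = [ref.get("id", "")]
        for tag in columns:
            found = ref.find(f".//{tag}")
            row.append("" if found is None else "".join(found.itertext()).strip())
        rows.append(row)
    return rows

sample = (
    '<ref-list>'
    '<ref id="r1"><element-citation><source>J Test</source>'
    '<year>2009</year><fpage>10</fpage></element-citation></ref>'
    '<ref id="r2"><element-citation><source>Other J</source>'
    '<year>20l0</year></element-citation></ref>'
    '</ref-list>'
)
rows = citation_report(sample, ["source", "year", "fpage"])
```

Laid out this way, the empty fpage cell and the malformed year in the second row stand out immediately, which is the point of the report.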

QC software

While a file may parse error-free and look fine in viewing, there may still be tagging errors. Post-conversion software-based QC checks help identify discrepancies between the XML files and the tagging specifications. It is helpful to make use of existing QC tools (such as the NIH Stylechecker); however, custom QC software should be developed to ensure that all rules in the conversion specification are being followed. Generic examples for QA software include checking that <lpage> is greater than <fpage>, that the position attribute is correct, and that the content of an element such as year or prefix matches certain criteria. Customized QC software would check the items specific to the user's needs and the data set being converted.

Fig. 8. Sample error log produced by QC software.
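The generic checks mentioned above can be sketched in a few lines. This is a hedged illustration, not the QC software described in this paper. It rests on three assumptions: the allowed position values are a project-specific choice, only purely numeric page values are compared, and a single-page item with equal first and last pages is treated as acceptable.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: this project only allows these two values for the position attribute.
ALLOWED_POSITIONS = {"anchor", "float"}

def qc_check(xml_text):
    """Return a list of messages, one per rule violation found in the XML."""
    root = ET.fromstring(xml_text)
    errors = []
    for parent in root.iter():
        fpage, lpage = parent.findtext("fpage"), parent.findtext("lpage")
        # Only compare purely numeric pages; values such as "e12" are skipped.
        if fpage and lpage and fpage.isdecimal() and lpage.isdecimal():
            if int(lpage) < int(fpage):
                errors.append(f"lpage {lpage} is less than fpage {fpage}")
    for year in root.iter("year"):
        text = (year.text or "").strip()
        if not re.fullmatch(r"(1[5-9]|20)\d{2}", text):
            errors.append(f"year '{text}' is not a plausible four-digit year")
    for elem in root.iter():
        position = elem.get("position")
        if position is not None and position not in ALLOWED_POSITIONS:
            errors.append(f"position '{position}' is not an allowed value")
    return errors

sample = (
    '<article><fig position="floating"/>'
    '<element-citation><year>20l0</year><fpage>120</fpage><lpage>112</lpage></element-citation>'
    '<element-citation><year>2009</year><fpage>5</fpage><lpage>9</lpage></element-citation>'
    '</article>'
)
errors = qc_check(sample)
```

Each rule maps to one line of the conversion specification, so the list of checks can grow alongside the specification itself.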

Conclusion

"Give me six hours to chop down a tree and I will spend the first four sharpening the axe."

- Abraham Lincoln

Pre- and post-conversion effort pays off in the end. No conversion to XML can be a totally hands-off/lights-out silver bullet. The analysis performed up-front will lead to a better result in the end. The final XML will still inevitably need some adjustments but the effort put in before and after conversion will minimize the amount of tweaking necessary.

Copyright © 2011, Data Conversion Laboratory, Inc.

The copyright holder grants the U.S. National Library of Medicine permission to archive and post a copy of this paper on the Journal Article Tag Suite Conference proceedings website.
