NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Journal Article Tag Suite Conference (JATS-Con) Proceedings 2010 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2010.

Portico: A Case Study in the Use of the Journal Archiving and Interchange Tag Set for the Long Term Preservation of Scholarly Journals

ITHAKA
100 Campus Drive, Suite 100, Princeton NJ 08540, USA

This paper explores the experience of Portico (www.portico.org), a not-for-profit digital preservation service providing a permanent archive of electronic journals, books, and other scholarly content, as both an implementer of and a partner in the ongoing development of the National Library of Medicine’s Journal Archiving and Interchange Tag Set. It reviews the shared origin of both entities in the 2001-2002 Mellon Foundation funded study of e-journal archiving projects, and the process by which Portico has attempted to distill and share with the larger JATS community the experience we have gained in converting approximately 70 e-journal formats (encompassing approximately 170 distinct variants, such as versions and provider usage profiles) into the Journal Archiving and Interchange Tag Set. It briefly details the challenges encountered by the authors of this paper (who make up the technical team that creates the transformations of publisher-supplied content to NLM) in normalizing these many formats to a single one. It describes the profile of the NLM Journal Archiving and Interchange Tag Set employed by Portico (which is fully convertible to that tag set), the rationale behind the usage choices it enforces, and its use in normalization of input, some of which is itself NLM (albeit reflecting the usage profiles, whether implicitly or explicitly documented and declared, of different publishers). Finally, it discusses possible additional refinements that might be desired, particularly in relation to the <named-content> and <custom-meta-group> elements, to ensure consistent and lossless transformation from various publisher formats to the Journal Archiving and Interchange Tag Set.

What is Portico?

Portico is a digital preservation service for electronic journals, books, and other content. Portico is a service of ITHAKA, a not-for-profit organization dedicated to helping the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways. Portico understands digital preservation as the series of management policies and activities necessary to ensure the enduring usability, authenticity, discoverability, and accessibility of content over the very long-term.

Portico serves as a permanent archive for the content of over 117 publishers (on behalf of over 2000 learned societies and associations), with, as of this writing, 11,954 committed electronic journal titles, 65,986 committed e-book titles, and 39 digitized historical collections. The archive contains nearly 15 million archival units (journal articles, e-books, etc.), comprising approximately 175 million preserved files.

The technological design goals of the Portico archive were, to the extent possible, to preserve content in an application-neutral manner, and to produce a "bootstrapable archive" of XML metadata plus the digital objects themselves. "Bootstrapable" in this context means that each archived object can be packaged in a ZIP file, with all original publisher-provided digital artifacts, along with any Portico-created digital artifacts and XML metadata associated with the object, and the entire archive can be reconstituted as a file system object, using non-platform-specific readers, completely independent of the Portico archive system. The archive is designed to be OAIS-compliant, and is subject to a process of continual review to ensure that it continues to conform to commonly accepted standards and best practices as they are understood in the digital preservation community. In 2010, Portico became the first digital preservation service to be independently audited by the Center for Research Libraries (CRL) and subsequently certified as a trusted, reliable digital preservation solution that serves the needs of the library community.
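
The "bootstrapable" idea can be illustrated with a minimal sketch. It assumes a hypothetical manifest layout and hypothetical file paths and role names; none of these reflect Portico's actual packaging or metadata schema. The point is only that an archival unit bundled with its own XML metadata can be unpacked again with nothing more than a generic ZIP reader.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def package_archival_unit(unit_id, files):
    """Bundle one archival unit into an in-memory ZIP.

    `files` maps a relative path to a (role, bytes) pair. The manifest
    element and attribute names are hypothetical, chosen for illustration.
    """
    manifest = ET.Element("manifest", {"unit": unit_id})
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path, (role, payload) in sorted(files.items()):
            archive.writestr(path, payload)
            ET.SubElement(manifest, "file", {"path": path, "role": role})
        # The XML metadata travels inside the same package as the objects.
        archive.writestr("manifest.xml", ET.tostring(manifest, encoding="utf-8"))
    return buffer.getvalue()


def restore_archival_unit(zip_bytes):
    """Rebuild the path -> bytes mapping using only a generic ZIP reader."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}
```

Because the restore step depends on nothing but the ZIP format and the XML manifest, it is independent of whatever system produced the package, which is the property the design goal describes.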

For each journal article in the archive, Portico preserves all original publisher-provided digital artifacts, including PDF page images, along with any Portico-created digital artifacts associated with the item. These latter include structural, technical, descriptive, and events metadata, and a normalization of the publisher-provided SGML or XML journal article files to Portico’s profile of the National Library of Medicine’s Journal Archiving and Interchange Tag Set.

Portico functions as a “dark archive”. While a limited number of credentialed users at both depositing publishers’ and subscribing libraries’ sites can access the archive content via an audit interface (hosted by Portico’s sister service, JSTOR), subscribers to content in the archive generally continue to access that content at the publishers’ host sites. Participating libraries, including their students, faculty, and staff, gain direct access to archived content when specific conditions or "trigger events" occur which cause titles no longer to be available from the publisher or any other source. Trigger events include cessation of a publisher’s operations, discontinuation of a title by a publisher, back issues no longer offered by a publisher, or catastrophic and sustained failure of a publisher’s delivery platform. Portico currently provides post-trigger event access to four journals. The majority of the titles archived by Portico are also available for post-cancellation access (PCA) if needed. Upon receipt of a claim from a participating institution and confirmation of the past subscription status by the publisher, campus-wide access is provided to the requesting participating library. Portico currently handles PCA claims from 136 institutions for 551 e-book and e-journal titles.

How did we get here?

The development of the Portico archiving service has been closely intertwined with the development of the NLM Journal Archiving Tag Suite (see Figure 1). Indeed, both trace back to a shared ancestor: a research project funded in December 2000 by the Andrew W. Mellon Foundation to investigate the feasibility of creating a permanent archive of scholarly e-journal literature. Linda Cantara summarizes the key events in that timeline:

In October 1999, the Council on Library and Information Resources (CLIR), the Digital Library Federation (DLF), and the Coalition for Networked Information (CNI) convened a group of publishers and librarians to discuss responsibility for archiving the content of electronic journals. A series of meetings led to the publication in May 2000 of the document, "Minimum Criteria for an Archival Repository of Digital Scholarly Journals". Soon after, the Andrew W. Mellon Foundation solicited proposals for one-year e-journal archiving planning projects which would incorporate the minimum criteria outlined in this document. Seven institutions were awarded grants for projects carried out from January 2001 through early 2002: the libraries of Cornell University, Harvard University, Massachusetts Institute of Technology (MIT), Stanford University, the University of Pennsylvania, and Yale University, and the New York Public Library (NYPL). Cornell and the NYPL took a subject-based approach, with Cornell addressing issues related to agricultural journals and the NYPL addressing those related to electronic resources in the performing arts. Harvard, Pennsylvania, and Yale took a publisher-based approach: Harvard worked with Blackwell Publishing, the University of Chicago Press, and John Wiley & Sons; Pennsylvania worked with Oxford and Cambridge; and Yale worked with Elsevier Science. MIT investigated the issues presented by "dynamic" e-journals, that is, those in which the content changes frequently, while Stanford focused on the development of tools to facilitate local caching of e-journal content.

One outcome of the Mellon-funded study was a decision by the Mellon Foundation in 2002 to fund the JSTOR Electronic Archiving Initiative, intended to develop a permanent archive of scholarly electronic journal articles via capture of publisher source files (as distinguished from publisher presentation files obtained via a web crawler). In 2005, the JSTOR E-Archive was constituted as the free-standing archiving service, Portico.

Concurrently with these developments, PubMed Central (PMC) was created in 2000 by the National Center for Biotechnology Information (NCBI) to enable free access to full-text articles in life-science journals. In late 2000, PubMed Central modified its original workflow from one which stored publisher content in the publisher’s native DTD, serving up HTML on demand, to one that normalized publisher content to a common, access-optimized (PMC DTD 1.0) format before ingest to storage, and served up HTML from that single, common format. In 2001, PubMed Central declared a shift in its primary focus from journal access to journal preservation, and concomitantly was considering modifications to its access-oriented DTD, to alleviate some of the challenges of consistent normalization from an increasing number and variety of publisher DTDs (see Beck 2005, Beck 2010).

The Harvard University Library (HUL) Mellon Project in particular proved a key common factor in the development of both PMC and Portico. The University of Chicago Press (UCP) was one of the original publishing partners in this project. UCP’s Evan Owens, who consulted on the project, subsequently became Chief Technology Officer of the JSTOR E-Archive initiative at its inception. And, crucially, when HUL commissioned Inera to study the feasibility of creating a common archival DTD that would enable the archive to tag material received from any publishing partners in the same format, both the UCP and PMC DTDs were part of that study.

The possible benefits of a common archival DTD were as clear to Portico as they were to PMC, not least for the workflow simplification that such a DTD would enable. As the architecture for the Portico archive was being formulated, it was clear that the creation of a normalized form of publisher content provided added possibilities for consistently curated bibliographic metadata, and thus for overall archive management. Additionally, the very process of analyzing publisher content with a view toward normalization, and the quality assurance checks performed on that content both before and after normalization (both practices, incidentally, also recommendations of the Feasibility Study), would ensure that Portico understood the distinctive publisher vocabularies and their varying usages while those knowledgeable in their use were still available as resources, and while the possibility still existed to correct errors and inconsistencies in the application of those vocabularies.

Fig. 1. A Shared History.

The development and publication of the first version of the NLM Journal Archiving DTD coincided with the early formulations of the Portico archive architecture. Portico was very happy to respond to the proactive outreach of NCBI to include the community of interest in the ongoing refinement of NLM JATS. Portico has been part of the NLM Working Group from its inception, with CTO Evan Owens and Director of Data Technology John Meyer actively sharing the experience of the Portico data team in using the NLM DTD as a normalization target, as the number of new publisher content streams and DTDs handled by the archive, and the number of normalization XSL transformations created for them, steadily increased.

DTD Evolution: Inera’s Crystal Ball

The analysis and recommendations in Inera’s “E-Journal Archive DTD Feasibility Study” not only informed Portico’s articulation of normalization policy and practice, they also proved predictive of the cruxes of the DTD’s gradual refinement and evolution.

Creating a common archival DTD that could encompass content expressed in SGML and XML vocabularies designed for multiple scholarly disciplines, each meeting the unique needs of a custom production process, entailed, as the study declared, two key challenges:

First, can a common structure (DTD or Schema) be designed and developed into which publishers' proprietary SGML files can be transformed to meet the requirements of an archiving institution?

Second, if such a structure can be developed, what are the issues that will be encountered when transforming publishers' SGML files into the archive structure for deposit into the archive?

From Portico’s experience, the NLM JATS provides a clear positive answer to the first question. The ongoing challenge, and the impetus for ongoing refinements to the DTD, is continually to narrow the impedance mismatch between individual publisher DTDs and the NLM DTD, without needlessly elaborating the NLM DTD itself, while simultaneously ensuring that there is no significant, irrecoverable semantic loss in transforming to the archival format.

The study lists the following as key challenges in the use of an archival DTD:

  • Use of generated and boilerplate text, especially in
    • Label text for figure captions
    • Citation text
    • Author name and affiliation
    • Dates
  • Expression of links between author and affiliation
  • Reference elements
  • Expression of non-article and other content
  • Abbreviations and definitions
  • Keywords
  • Sections, including handling of sections without headers
  • Placement of floating objects, such as figures, tables, graphs
  • Tables, including cell formatting issues (cells with figures, content alignment, etc.)
  • Math
  • Intra-, inter- and extra-article linking
  • Publisher-specific elements

It is instructive to look at the minutes of the Working Group with this list in hand, to see just how predictive it is of the issues that engaged the Working Group’s attention as the NLM DTD went through its various refinements. Equally instructive is how well it predicted the difficulties an archive might face in normalizing content from multiple suppliers with varying levels of quality control in their production. Portico has seen publishers whose DTDs do not validate; content that does not validate against its own DTD; content with incorrect encoding declarations; content whose DOCTYPE statement declares conformance to one version of a publisher DTD when it in fact uses elements from a later version; documents that incorrectly declare that they are “standalone”; documents that are split over many files but contain no external entity declarations to indicate document components; documents whose external components are in fact HTML or other fragments; and documents with missing internal linking cues (IDs and IDREFs) for references and other objects. With respect to links to external files, we have found that the packaging of article components (images, tables, figures, and other supplemental files) follows no consistent naming conventions; nor is it feasible to attempt to enforce such naming conventions, as the Feasibility Study recommends. We certainly do follow its recommendation of performing quality assurance both on the files we receive (for example, validating them) and on the transformed files we produce (validating them against the DTD and with Schematron, and performing visual inspection of the transformation output) (see Morrissey et al.).
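
One of the checks named above, detecting missing internal linking cues, can be sketched in a few lines. This is an illustration only, not Portico's quality assurance tooling: it assumes that targets carry an id attribute and that references use an rid attribute (as xref does in the NLM tag set), and it relies on the Python standard library parser, which checks well-formedness but does not validate against a DTD.

```python
import xml.etree.ElementTree as ET


def dangling_references(xml_text, id_attr="id", ref_attr="rid"):
    """Return the sorted reference values that point at no declared id.

    Raises xml.etree.ElementTree.ParseError if the input is not
    well-formed, which is itself a useful first-pass check.
    """
    root = ET.fromstring(xml_text)
    declared = {el.get(id_attr) for el in root.iter() if el.get(id_attr)}
    missing = set()
    for el in root.iter():
        # An IDREFS-style attribute may hold several space-separated targets.
        for target in (el.get(ref_attr) or "").split():
            if target not in declared:
                missing.add(target)
    return sorted(missing)
```

Run over a received file before transformation and over the normalized file afterwards, a check of this kind flags references that a renderer would be unable to resolve.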

DTD Evolution: The Portico NLM Profile

Fig. 2. DTD Evolution.

Portico’s first profile of the NLM DTD was developed when that DTD was at version 2.0 (see Figure 2). In accord with recommendations of the Feasibility Study, Portico documented its policies with respect to the key issues highlighted in the report. There is, first of all, an articulation of the relation between the Portico profile and the NLM base. By policy, that profile was designed to be compatible with, and fully convertible to, NLM:

  1. The Portico Article DTD is derived from the NLM Archiving and Interchange DTD and NLM Article Tag Set. It is our goal that the derivation be done according to the recommended best practices for customization of the NLM DTDs according to the NLM documentation. That cannot be a requirement, because we don't know how they may constrain customization in the future, but it is a high priority goal.
  2. Our modifications should be as few as possible. Any change that we make should also be considered for submission to the NLM as a modification request for future inclusion in the official version. If time permits, we should try to get a change in the official DTD instead of making it ourselves as a local customization.
  3. Our changes should be designed so that we retain the ability to convert to 100% NLM compatible markup if needed, at the cost of some loss of information. In general, our changes are designed to preserve additional information that NLM does not currently support, hence the cost of conversion to 100% NLM-compatible is likely to involve some information loss.
  4. From time to time we will make changes of necessity before NLM incorporates that change into their DTD. In some cases, NLM will not implement the change exactly the same way that we did, creating an incompatibility. This will be treated as a future migration issue and dealt with as part of preservation planning. If the granularity is the same but the markup is different, an incompatibility can easily be rectified at the point of migration for other reasons.
  5. We expect that we will version our DTD over time and that content corresponding to different versions will exist in the archive at the same time. Whether we fix this through migration is a preservation planning issue for the Portico Article DTD format. As long as the DTD versions are backwards compatible, there is little pressing reason for migration, since the net change to most article instances would be zero. One alternative would be simply to revalidate against the latest DTD and upgrade the format identification in METS without changing the content. Again, that is a preservation planning decision and would be best made when other METS updates are also planned, such as during periodic audit.

Portico made use of the modularization capabilities of the NLM DTD, creating a PorticoCustom.ent file containing the overrides that define the Portico profile. The motivations for each of the changes are documented in a modular addition to the NLM XML documentation, both in a top-level “Portico Global Remarks” section included in the (unaltered) NLM “General Introduction”, and, at each element/attribute level, in a “Portico Remarks” section detailing any constraints on, or variations in, the usage of the element or attribute as described in the (unaltered) NLM documentation that follows it.

For example, in the “Portico Global Remarks”, there is a clearly articulated policy about generated text and its use in the title, label, author, citation, date and other fields which the Feasibility Study accurately predicted would be potential migration trouble spots:

Generated text is defined as text and/or punctuation and/or white space that appears in the publisher's printed edition of a document that is implicit in the mark-up. It does not appear explicitly, but the context implies its presence. The presence of generated text can only be determined by manually inspecting the print (or online) version and comparing it to the publisher's data. The publisher's rules for each context can then be extrapolated.

It is Portico's position to treat all generated text as "implied" data. That is, we consider the text as supplied by the publisher to be present and real, but invisible in the data.

The purpose of our generated text policy is that no renderer will need publisher-specific information about the data in order to make rendition decisions involving generated text.

As a result of this position we shall make our best effort to include all such text explicitly in the archival version of documents. Knowing this, a renderer of the documents need not worry about identifying and generating any such text. If there are exceptions to this rule for any document they will be described in the meta-data.

All generated text will be identified by being contained in the X - Generated Text and Punctuation element. Text generated by Portico will have the value of the x-type attribute of the X - Generated Text and Punctuation element set to "archive". Text in the publisher's data already contained in the x element or its equivalent will have that attribute set to "publisher". The attribute value will also be "publisher" if we use the X - Generated Text and Punctuation element as a wrapper around extant text, punctuation or white space.

If the only content of an element is the X - Generated Text and Punctuation element then that element is also assumed to be generated by Portico (e.g. <title><x x-type="archive">Generated title</x></title>).

There is a possible undesirable side-effect to this policy. Generated text may be duplicated in the #PCDATA. For instance, if a particular editor chooses to insert something like "<ital>Bibliography</ital>" into the #PCDATA of the first item of a bibliography and our analysis has concluded that we should generate a title for the bibliography element, the word "Bibliography" will appear twice. This possible duplication is a result of a combination of our generated text policy stated here and our policy of not parsing #PCDATA for semantic meaning. We therefore make no claim that the archival data will not contain duplicated text.
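
The marking convention described in this policy can be sketched as follows. The helper names are hypothetical and the code is not taken from Portico's transforms (which are XSL); it only shows generated text being wrapped in an x element whose x-type is "archive" or "publisher", and the rule that an element containing nothing but an x element is itself taken to be generated. Treating whitespace-only text as "no content" is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Namespaced key that ElementTree serializes as xml:space.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
X_TYPES = ("archive", "publisher")


def append_generated(parent, text, x_type="archive", preserve_space=False):
    """Append generated text to `parent`, wrapped in an x element."""
    if x_type not in X_TYPES:
        raise ValueError("x-type must be 'archive' or 'publisher'")
    x = ET.SubElement(parent, "x", {"x-type": x_type})
    if preserve_space:
        x.set(XML_SPACE, "preserve")
    x.text = text
    return x


def is_fully_generated(element):
    """True when the element's only content is a single x element."""
    children = list(element)
    if len(children) != 1 or children[0].tag != "x":
        return False
    # Assumption: whitespace around the x element does not count as content.
    stray_text = (element.text or "").strip() or (children[0].tail or "").strip()
    return not stray_text


title = ET.Element("title")
append_generated(title, "Generated title")
print(ET.tostring(title, encoding="unicode"))
# prints <title><x x-type="archive">Generated title</x></title>
```

Because every generated string is wrapped rather than merged into the surrounding text, a later process can strip or restyle it without publisher-specific knowledge.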

The remarks expand on the contexts in which generated text may appear, and how Portico transforms are to treat it.

There are four major classifications of generated text.

  1. Titles and title-like text
  2. Boilerplate text
  3. Labels
  4. Punctuation and white space used as separators and/or connectors

There are myriad ways that publishers use generated text. The X - Generated Text and Punctuation element will indicate all generated text - whatever the context.

  • Titles and title-like text
    If the printed version of a document contains titles that are not in the data, we will supply the title. In the rare case that the DTD does not allow us to insert a title element in that place, we will prepend the title to the #PCDATA or use some other appropriate element.
  • Boilerplate text
    If there is implied boilerplate text (e.g. copyright notices, disclaimers, "Notice" or "Warning") we will generate the text inside the appropriate element and the entire text will be contained within the X - Generated Text and Punctuation element. We will take all necessary steps to ensure the legality of such notices and/or compliance with the publisher's policy. Boilerplate text can be as simple as an implied superscripted asterisk for the first footnote.
  • Labels
    Labels have parts. The main part is the identifier. Identifiers look like "1", "A", "I", "a", "i", etc. The identifier is what uniquely defines the object being labeled. The identifier can be preceded and/or followed by text or punctuation that is common to all such objects. Such prefix/suffix material could be "Figure ", "Fig. ", ".", ": ", or "Table ", for example. Thus, for the label "Graph 2a." we call "Graph " the label prefix, "2a" the label identifier, and "." the label suffix.
    If a label looks like "(iv)" we consider the "(" to be the prefix and the ")" to be the suffix.
    An identifier may not be used at all, making the terms "prefix" and "suffix" superfluous. In such a case the text itself ("Graph", "Figure", "Illustration", etc.) is the label. Thus, when speaking of a prefix and suffix the presence of an identifier is implied.
    It is possible that either the label prefix or suffix or both may be generated.
    It is quite common for the "Figure " of figures or "Table " of tables (for instance) to be generated text.
    To avoid confusion, we speak of the prefix and suffix material separately from the label identifier.
    If a particular publisher has a convention of generating either prefix or suffix material, or both, we will also generate it in the archive if and only if the label identifier is available in the markup (as opposed to the #PCDATA or not at all). For example, if the publisher's data is "<label>6</label>" and the observed printed convention for the particular labeled object is "6.) " we will archive "<label>6<x x-type="archive">.) </x></label>".
    If a label is contained in the attribute of an element we will migrate it into a Label (of a Figure, Reference, Etc.) element for that element. If the publisher's data for that element already contains their equivalent of a label element, and the element's attribute gives the prefix, we will prepend the attribute value to the content of the extant label's #PCDATA.
    If, however, a label identifier (as defined above) does not appear either in an attribute of an element, or in that element's "label" element, we will not generate a label even if the printed version of the data shows a label. Another way of saying this is that we will not do auto-numbering.
    The only exception to that rule is if the entire document is auto-numbered for one or more elements. In that case we will auto-number those elements and make no claim that they are in the correct order.
    In summary, for most cases we will only generate any appropriate implied text and/or punctuation if it is fixed (e.g. always "Fig.", always "Chart"). All such generated text will be contained within the x element. In the case where auto-numbering is necessary throughout the entire document (the numbers are in the printed version but not in the data) we will generate the numbers (identifiers) and any appropriate prefix and/or suffix, put the entire thing in an x element, and make no claim that the elements occur in the data in the order they are to be numbered. That is, the numbering may not agree with the print version.
  • Punctuation and white space used as separators and/or connectors
    This case applies most often to names, addresses, and formatted strings (such as telephone numbers or social security numbers). It is often the case that such things are supplied in a database-like manner rather than the manner in which they are printed.
    Two issues arise here: the punctuation and white space between the data items, and the order the data items appear in the data.
    We take responsibility for the punctuation and white space. If the order in which the data items occur is implicit or explicit in the data, we will archive it in that order. If the order is not somehow specified, we will archive it in the order it appears in the publisher's data.
    Thus, given as input: "<name><fname>Frank</fname><lname>Oz</lname></name>", if analysis shows that the printed name looks like "Oz, Frank" in the general case, then we would archive it thus: "<string-name><lname>Oz</lname><x x-type="archive" xml:space="preserve">, </x><fname>Frank</fname></string-name>". Another common practice of publishers is to use an implied "and" before the last contributor of an article. This would appear in the mixed content of the element thus: "<x x-type="archive" xml:space="preserve"> and </x>".
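
The label and separator rules above can be sketched in code. The function names are hypothetical and the sketch is a simplification written in Python rather than the XSL used in practice. It generates prefix and suffix material only when a label identifier is present (no auto-numbering), and it supplies the separator between name parts, keeping the fname and lname element names of the example above.

```python
import xml.etree.ElementTree as ET

XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def _generated(parent, text, preserve_space=False):
    """Wrap archive-generated text in an x element appended to `parent`."""
    x = ET.SubElement(parent, "x", {"x-type": "archive"})
    if preserve_space:
        x.set(XML_SPACE, "preserve")
    x.text = text
    return x


def archive_label(identifier, prefix="", suffix=""):
    """Build a label; return None when the markup supplies no identifier."""
    if not identifier:
        return None  # the policy rules out auto-numbering in the general case
    label = ET.Element("label")
    if prefix:
        _generated(label, prefix).tail = identifier
    else:
        label.text = identifier
    if suffix:
        _generated(label, suffix)
    return label


def archive_name(fname, lname, separator=", "):
    """Order name parts as printed and make the separator explicit."""
    name = ET.Element("string-name")
    ET.SubElement(name, "lname").text = lname
    _generated(name, separator, preserve_space=True)
    ET.SubElement(name, "fname").text = fname
    return name


print(ET.tostring(archive_label("6", suffix=".) "), encoding="unicode"))
# prints <label>6<x x-type="archive">.) </x></label>
```

Keeping the identifier outside the x element while the fixed prefix and suffix go inside it preserves the distinction the policy draws between publisher-supplied data and text the archive has made explicit.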

As the Feasibility Study predicted, determining the correct approach to the consistent handling of generated text (or, perhaps more precisely, the institutional policy regarding it) was a key issue for the archive, one that reared its head at any level of granularity in the proposed target DTD. This policy of making all implicit publisher data explicit during normalization, principally by use of the <x> element, was the source of nearly half of the customizations Portico made to the NLM DTD in defining its own profile, as an inspection of the PorticoCustom.ent file and its documenting comments reveals. In accordance with Portico’s above-documented intent that “[a]ny change that we make should also be considered for submission to the NLM as a modification request for future inclusion in the official version,” these customizations, together with their motivation and rationale, were communicated by Portico to the NLM Working Group. That the need for such a fine-grained facility was generally felt is reflected in the fact that virtually all the proposed new contexts for the <x> element included in the Portico customizations have made their way into the 3.0 version of the NLM Journal Archiving DTD.

A key recommendation of the Feasibility Study was that the common archival DTD not attempt to be a superset of all journal publishing DTDs, but rather “should fall between the intersection and union of structural elements common to most publishers.” The Portico profile hewed to that directive, and to Portico’s own stated policy of both minimizing modifications and communicating those modifications and their rationales to the community of use, with the intent of seeing them included in the shared public DTD. As with the extension of the use of the <x> element, virtually every modification comprising that profile was in fact included in the public DTD (3.0). Those modifications include:

  • Change id attribute in def-list, list, list-item, and tex-math elements from type CDATA to type ID
  • Add chem-struct-wrapper (now chem-struct-group) to element citation
  • Create floats (now floats-group) element
  • Add xml:space to attribute list of the x element
  • Add copyright-holder element
  • Add style attributes to td and th elements
  • Add break element to emphasized-text model
  • Add seq attribute to volume and issue elements
  • Create new journal-subtitle element
  • Create new trans-journal-title element
  • Create facility for handling keyword strings

As with the addition of the <x> element to more element contexts, the changes to the NLM DTD should not be construed as a convergence of all users of that DTD to the Portico profile, but rather as indication of both the expressiveness of the over-arching DTD architecture, and its ability flexibly to extend that expressiveness while heeding Occam’s caution that entities (or in this case, elements and attributes) should not be multiplied beyond necessity.

Neither Lossiness nor Tag Abuse

Simplicity and expressiveness come at a cost – not least because those characteristics can be in opposition to each other. To make a simple tag set richly expressive, its constraints often have to be relaxed (IMPLIED rather than REQUIRED attributes, optional children or child order, CDATA with, perhaps, externally specified controlled lists of values rather than DTD-constrained values, some perhaps surprising valid contexts for elements).

In discussing the virtues of mark-up languages such as SGML and XML, we are accustomed to speak of certain aspects of text content and markup in pairs of what we consider to be disjoint, contrasting terms: presentational vs. structural markup; descriptive vs. procedural markup. The literature, including the Feasibility Study, cautions us however against cleaving too closely to these distinctions, lest they cleave apart in our hands (see Piez 2005, Piez 2001, Usdin). This was a thread of vexed discussion throughout the Working Group discussions, and internally at Portico as well, as we struggled to determine the minimum profiling of the NLM DTD that would still make it possible for us to map correctly from publisher content to NLM without distorting or losing provided semantic content on the one hand, or abusing the semantics of the target DTD itself on the other.

As both the minutes of the working group and Portico’s own experience transforming source documents that come to us in some variant of NLM indicated, this balancing of simplicity and expressiveness, and the discrimination between “display” and semantics, opens the possibility for, at the very least, divergence in implementation, doubt or confusion about “best practice,” and, at worst, tag abuse. Internally, Portico’s response to this “design feature” of the NLM DTD is to document Portico policies in the DTD documentation down to the tag and attribute level, and to enforce these policies with a Schematron validation of the output of every transform. NLM, in its version 3 documentation, has added a comprehensive section on “Common Tagging Practice”.
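
As a rough illustration of tag-level policy enforcement, the sketch below checks one rule of the kind a profile might impose: every named-content element must carry a content-type value drawn from a controlled list. Portico's actual checks are written in Schematron; the Python stand-in, the function name check_named_content, and the list of allowed values are all assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical controlled list; a real profile would document its own values.
ALLOWED_CONTENT_TYPES = {"synonymy", "taxonomic-key", "drug-name"}


def check_named_content(xml_text):
    """Return a list of policy violations for named-content elements."""
    root = ET.fromstring(xml_text)
    problems = []
    for elem in root.iter("named-content"):
        content_type = elem.get("content-type")
        if content_type is None:
            problems.append("named-content is missing content-type")
        elif content_type not in ALLOWED_CONTENT_TYPES:
            problems.append(f"unexpected content-type: {content_type}")
    return problems


# Invented sample: the second element uses a value outside the list.
sample = """<p>The drug <named-content content-type="drug-name">aspirin</named-content>
was sold at <named-content content-type="price">$5</named-content>.</p>"""

for problem in check_named_content(sample):
    print(problem)
```

A check of this kind does not replace DTD validation. It covers constraints that the DTD deliberately leaves open, such as CDATA attribute values governed by an externally specified list.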

While the focus of NCBI is biological information, the NLM tag set is intended to be applicable to, and suitable for, electronic journal content in any scholarly domain. From the outset, it eschewed domain-specific vocabulary elements, in accord with the recommendations of the Feasibility Study. The study felt that omitting such subject-specific elements made an acceptable trade-off between simplicity and lossiness, particularly if an archive followed the recommendation to archive the source document along with the transformed one (which is Portico’s practice). However, again as envisioned by the study, it rapidly proved important to include elements for expressing mathematics (the tex-math element was added in version 1.1; MathML in version 2.2).

There are, in addition, two other elements for capturing domain-specific semantics: named-content and custom-meta. In instances based on earlier versions of the DTD, usage suggests that these elements were sometimes used for their intended purpose, but sometimes as “escape hatches” when the content model for a specific element did not yet include a needed child (what do we do, for example, if we have the equivalent of a “price” element in a citation?).

The named-content element has been available since version 1.0. It is documented as a “word or phrase whose content/subject matter has special semantics or content-related significance.” It has been elaborated, in version 3.0, with the new styled-content element, to allow for finer-grained distinction for occasions where we do, or do not, have semantic as well as stylistic tagging in the source document to indicate the reason for special display features for the content of the source element. The documentation for the element, particularly in earlier versions of the NLM DTD, suggests that its content is merely “a word or phrase”, applicable to “a drug name, company name, or product name”, or to define “systematics terms” or “biological components”. However, its content model has always been quite rich. It is illustrative of the need for such a semantic marker to note that the number of contexts in which a named-content element can appear has increased from 35 to 112.

The custom-meta element, with its varyingly-named wrapper elements, custom-meta-wrap and custom-meta-group, did not appear until version 2.0. By definition,

Some DTDs and schemas allow for metadata above and beyond that which can be specified by this DTD. This element is used to capture a metadata element that has not been defined explicitly in the models for this DTD, so that the intellectual content will not be lost.

So the element seems to have been intended as a sort of semantic “miscellaneous” element, the instrument for capturing subject-specific semantic mark-up, without multiplying DTD elements. However, the “Remarks” section seems to constrain the element’s applicability, focusing on “business” data:

This element will probably be used for special cases, product-specific material, or other unusual metadata, for example, the journal-history information preserved in at least one publisher’s DTD.

The element is further constrained by the fact that it (or more correctly, its wrapper parent) was allowed in the content models of only two elements (article-meta, journal-meta) in versions 2.0, 2.1, and 2.3, and was added to only a third (front-stub) in version 3.0.
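
To make this constraint concrete, the following sketch reports any custom-meta wrapper that appears outside the parents named above. It is not Portico code. The function name misplaced_wrappers and the sample document are invented, and the assignment of custom-meta-wrap to the 2.x versions and custom-meta-group to version 3.0 reflects our reading of the tag set documentation rather than anything stated in this paper.

```python
import xml.etree.ElementTree as ET

# Parents that may hold each wrapper, following the description in the text.
# Which wrapper name belongs to which version is an assumption (see lead-in).
ALLOWED_PARENTS = {
    "custom-meta-wrap": {"article-meta", "journal-meta"},
    "custom-meta-group": {"article-meta", "journal-meta", "front-stub"},
}


def misplaced_wrappers(xml_text):
    """Return (wrapper, parent) pairs where the wrapper is not allowed."""
    root = ET.fromstring(xml_text)
    found = []
    for parent in root.iter():
        for child in parent:
            allowed = ALLOWED_PARENTS.get(child.tag)
            if allowed is not None and parent.tag not in allowed:
                found.append((child.tag, parent.tag))
    return found


# Invented sample: one wrapper in article-meta, one inside a body section.
sample = """<article>
  <front>
    <article-meta>
      <custom-meta-wrap>
        <custom-meta>
          <meta-name>journal-history</meta-name>
          <meta-value>Formerly published under another title</meta-value>
        </custom-meta>
      </custom-meta-wrap>
    </article-meta>
  </front>
  <body>
    <sec>
      <custom-meta-wrap>
        <custom-meta>
          <meta-name>taxonomic-key</meta-name>
          <meta-value>couplet 1</meta-value>
        </custom-meta>
      </custom-meta-wrap>
    </sec>
  </body>
</article>"""

print(misplaced_wrappers(sample))
```

The wrapper inside the body section is reported, which is the situation a transform faces when domain-specific content sits in the body of an article rather than in its metadata.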

These constraints have proved something of a challenge to Portico, where we have seen many such domain-specific semantic elements in the various content streams we receive. These include, for example, from biological journals, elements for synonymies (one of the publisher-specific elements cited in the Feasibility Study; used to specify multiple taxonomic names for the same plant or animal) and for taxonomic keys (a sort of decision table for determining to which species a specimen plant belongs). Because of the location of these elements in publisher content, we have used named-content to capture this information – although doing so meant extending the content model of that element in our profile (changes later propagated through to the NLM DTD).

We continue to wonder whether custom-meta is not a more appropriate target element for such semantic capture. If so, we should consider expanding the number of elements that are allowed to contain custom-meta; or, if usage has constrained this element’s understood semantics to relate only to business data, we should consider whether a new “miscellaneous” or “domain-specific-content” element is required. Alternatively, it is perhaps worth considering whether, for such domain-specific semantic markup, it makes sense to follow the precedent set in mathematical markup, and allow the use of standard namespaced elements from relevant domains. This might be worth considering for archives with a collection spanning multiple disciplines, and for domains where, as for mathematics, there is a clearly defined, broadly applicable, and widely used XML vocabulary.
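
If namespaced domain vocabularies were allowed in this way, an archive would probably want to inventory which vocabularies appear in its content. The sketch below shows one way to do that. The MathML namespace URI is the standard one; the taxonomy namespace, its element names, and the function name foreign_namespaces are purely hypothetical and do not refer to any existing vocabulary.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
# Hypothetical namespace standing in for a domain-specific vocabulary.
TAXONOMY_NS = "http://example.org/ns/taxonomy"


def foreign_namespaces(xml_text):
    """Return the set of namespace URIs used by elements in the document."""
    root = ET.fromstring(xml_text)
    namespaces = set()
    for elem in root.iter():
        # ElementTree spells a namespaced tag as "{uri}local-name".
        if elem.tag.startswith("{"):
            namespaces.add(elem.tag[1:].split("}", 1)[0])
    return namespaces


# Invented sample: a paragraph holding MathML and a made-up taxonomy element.
sample = f"""<p xmlns:mml="{MATHML_NS}" xmlns:tax="{TAXONOMY_NS}">
  <mml:math><mml:mi>x</mml:mi></mml:math>
  <tax:synonymy><tax:name>Quercus alba</tax:name></tax:synonymy>
</p>"""

for uri in sorted(foreign_namespaces(sample)):
    print(uri)
```

Elements in no namespace, such as the enclosing paragraph here, are ignored, so the result lists only the vocabularies brought in from outside the tag set.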

Looking Ahead

It is Portico’s experience that the Journal Archiving and Interchange Tag Suite, informed by the analysis and recommendations of the Inera Feasibility Study, disciplined by the consistency of approach and architecture emanating from that paper, from the HUL study of which it was a component, and from the deep experience of NCBI and other participants in that study, has met and continues to meet the challenge of developing “a common structure … into which publishers' proprietary SGML files can be transformed to meet the requirements of an archiving institution.” Certainly the convergence of many scholarly publishers on one or another of the tag sets in the suite is an indication of its applicability to many domains and production workflows. Nearly a third of the different publisher vocabularies currently processed by Portico are some version or variant of the NLM journal archiving or journal publishing DTDs. While normalizing this content to the Portico NLM profile involves more than an identity transform, and while some care must be taken to make explicit the implicit semantics sometimes buried in the publisher's implementation of the document type definition, these transforms are considerably simpler than those from document types outside of the NLM family.

This interchange from a publisher to an archive is a strong test of the viability of publisher content outside of the publisher’s own production system environment. While the use of implicit (generated) text, and, absent publisher-specific documentation, the reverse-engineering of controlled lists for attribute values in parallel with the interchange of documents still require analysis, it is altogether much more straightforward to ensure lossless transformation to a normalized archival format that we believe will be viable for the very long term.

References

    All links valid as of 2010.10.06

  1. . "Report from the Field: PubMed Central, an XML-based Archive of Life Sciences Journal Articles". Presented at International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML, Montréal, Canada, August 2, 2010. In Proceedings of the International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML. Balisage Series on Markup Technologies, vol. 6 (2010). doi:10.4242/BalisageVol6.Beck01.
  2. . "PubMed Central XML-based archive of life sciences literature at the NLM". XML 2005 Conference Proceedings. Available at http://docs​.google.com​/viewer?a=v&q​=cache:puQyWarEUGgJ:citeseerx​.ist.psu​.edu/viewdoc/download%3Fdoi%3D10​.1.1.83.1650​%26rep%3Drep1%26type​%3Dpdf+%22PubMed+Central+XML-based+archive+of+life+sciences+literature+at+the+NLM​%22&hl=en&gl​=us&pid​=bl&srcid​=ADGEESjyNpNzFRfKs8JZwatjFRf4b9JfeoLNRVHzhcWvewme9omTrnX0PEDuDAKToJWFOif0vq8v2G-exl1qdp1B7hsy61oHieF4CYq6x-yOdkuOhu7bTelcF-6CaLFbVtEsTISF0eKp&sig​=AHIEtbS4cP-vYQVDkcgymjX3lPDjAj461w.
  3. , Ed. "Archiving Electronic Journals Research Funded by the Andrew W. Mellon Foundation Edited, with an Introduction, by Linda Cantara, Indiana University". The Digital Library Federation Council on Library and Information Resources Washington, DC. 2003. Available at http://www​.diglib.org/preserve/ejp.htm.
  4. and , co-chairs. "Preserving Digital Information: Report of the Task Force on Archiving of Digital Information". The Commission on Preservation and Access and The Research Libraries Group. May 1996. Available at http://www.clir.org/pubs/reports/pub63watersgarrett.pdf.
  5. and . "Minimum Criteria for an Archival Repository of Digital Scholarly Journals Version 1.2". Digital Library Federation. Washington, DC. May 15, 2000. Available at http://www.diglib.org/preserve/criteria.htm.
  6. . "Report on the Planning Year Grant for the Design of an E-journal Archive Presented by: Harvard University Library Mellon Project Steering Committee Harvard University Library Mellon Project Technical Team To: The Andrew W. Mellon Foundation". April 1 2002. Available at http://www​.diglib.org​/preserve/harvardfinal.pdf.
  7. "E-Journal Archive DTD Feasibility Study. Prepared for the Harvard University Library, Office of Information Systems, E-Journal Archiving Project". 2001. Available at http://www​.diglib.org/preserve/hadtdfs​.pdf.
  8. , , , , , and . "Portico: A Case Study in the Use of XML for the Long-Term Preservation of Digital Artifacts". Presented at International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML, Montréal, Canada, August 2, 2010. In Proceedings of the International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML. Balisage Series on Markup Technologies, vol. 6 (2010). doi:10.4242/BalisageVol6.Morrissey01.
  9. . "Format and Content: Should they be separated? Can they be?: With a counter-example". In Proceedings of Extreme Markup Languages (Montréal, Québec), 2005. Available at http://conferences​.idealliance​.org/extreme​/html/2005/Piez01/EML2005Piez01.html.
  10. . "Beyond the 'descriptive vs. procedural' distinction". In Proceedings of Extreme Markup Languages (Montréal, Québec), 2001. Available at http://conferences.idealliance.org/extreme/html/2001/Piez01/EML2001Piez01.html.
  11. . "When 'It Doesn’t Matter' means 'It Matters'". In Proceedings of Extreme Markup Languages (Montréal, Québec), 2002. Available at http://conferences.idealliance.org/extreme/html/2002/Usdin01/EML2002Usdin01.html.
  12. . "Good Archives Make Good Scholars: Reflections on Recent Steps Toward the Archiving of Digital Information". 2002. Available at http://www​.clir.org/pubs​/reports/pub107/waters.html.
  13. NLM-NCBI, Archiving and Interchange Tagset Working Group Meeting Minutes. Available at http://dtd​.nlm.nih.gov/working-group​.html.
  14. NLM-NCBI. Archiving and Interchange Tag Set. Available at http://dtd​.nlm.nih.gov​/archiving/#id47369.
  15. Portico. Portico Journal Archiving DTD (internal documentation)

Copyright 2010 ITHAKA.

The copyright holder grants the U.S. National Library of Medicine permission to archive and post a copy of this paper on the Journal Article Tag Suite Conference proceedings website.

Bookshelf ID: NBK47087