NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Journal Article Tag Suite Conference (JATS-Con) Proceedings 2011 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2011.

Bookshelf ID: NBK62174

eDeposit for eSerials: Current Work and Plans at the Library of Congress

Erik Delfino and Jane Mandelbaum.

Library of Congress

This presentation will describe the current state of the Library of Congress effort to collect digital materials for our national collection through Copyright mandatory deposit. The Library is actively working with publishers to develop standards and workflows for the deposit of e-serials content in the NLM DTD and other formats.

Overview

Since the transfer of the copyright function to the Library of Congress (LC) in 1870, access to the works deposited via the Copyright Office has helped the Library to build the largest and finest research collection in the world (see 17 USC 407). This unique partnership helps preserve the vast array of American creativity while minimizing acquisitions costs for the Library. In coming years, copyright deposit will become even more vital to the Library, as more and more works are published in a growing variety of formats, both digital and analog. The Library is working toward expanding processes that accommodate both physical and digital items, particularly for items received through copyright registration and mandatory deposit.

In the fall of 2010, the Library implemented the first stage of its “eDeposit for eSerials” initiative, with the expectation of continuing and broadening the use of electronic processes to acquire materials for the national collection. Currently, eDeposit for eSerials is a series of manual and automated workflows for requesting, receiving and tracking electronic-only serials for Library collections, in accordance with requirements for copyright mandatory deposit. “Mandatory deposit” rules allow the Library to request via the Copyright Office copies of serials published/made available in the United States. LC expects to work with publishers and other interested parties to take advantage of standards such as JATS to ensure the program is scalable and sustainable.

Background

In summer of 2009, the Copyright Office issued a “Notice of Proposed Rulemaking” stating that the Library would begin acquiring electronic serials published only online via copyright mandatory deposit. After a comment period that autumn, the new regulation went into effect on February 24, 2010.

The Library established a “proof of concept” approach for the first phase of the “eDeposit for eSerials” initiative. For this phase, the Library selected 100 electronic-only serial titles from 41 publishers to be requested for the Library’s collections through mandatory deposit. The selected materials represented a wide variety of publishers and titles.

The Library started requesting titles in September 2010 for the first phase and received the first electronic deposits in October. The Library expects to migrate to the second phase of the initiative in the fall of 2011.

The goals of the “proof of concept” phase are to develop workflows to support (1) identification of eSerials needed for the Library’s collections; (2) issuing requests to publishers for the titles; (3) getting the digital content of the eSerials transferred to the Library; and (4) ensuring the received eSerials are under inventory control.

The next phases will focus on managing the growing collection of born-digital materials, making them available to researchers according to the Library’s access policies, and ensuring that the materials are preservable. As we move forward, we expect that we will be increasingly interested in leveraging JATS and other standards and tools that are being developed and promulgated through NLM and the publishers and other organizations working with NLM. To the extent feasible, we would like to identify areas of mutual interest to reduce burden on publishers and on the Library.

Current Phase: Overview

A preliminary workflow is in place for the proof of concept phase, and the Library is currently receiving and processing eSerials on an ongoing basis. Once an eSerial is requested and an initial deposit (generally one issue) is received and approved for the collections, the Library expects to continue receiving subsequent issues indefinitely, as with any other serial. (Publishers may also choose to submit backfiles for any titles the Library has requested.)

As of August 15, 2011, the Library has received and processed 124 deliveries of eSerial content representing 57 titles. The Library performs anti-virus checking and basic metadata extraction (where possible) on files received, but has not yet performed any normalization or validation.

The workflow for eSerials/eDeposit is currently as follows:

  1. Custodial units of the Library identify titles to be requested via mandatory deposit.

  2. Copyright Acquisitions Division (CAD) issues notices to publishers requesting needed titles.

  3. Publishers arrange for transfer of files to the Library of Congress.

  4. CAD verifies that the initial delivery is in compliance with the request.

  5. Library Services reviews the initial delivery and subsequent deliveries to confirm content, and to check for common serials changes (title/ISSN changes, merges, splits, etc.).

  6. Material is then transferred to custody of requesting service unit (Library Services or Law Library).

To support this workflow, the Library has developed a delivery tracking system and a lightweight Web-based browsing application for staff to browse and view rendered content and metadata.

In this phase, we have requested that each publisher submit:

  • at least one “issue” for initial deposit of a specified title.

  • issue-level metadata.

  • full-text of articles in XML, PDF or both (preferable).

  • a DTD or schema for submitted XML files.

In this phase, we have encouraged:

  • that each publisher deliver content in a defined “package” of content and metadata files, with a package “manifest” (to confirm completeness of deliveries) and file and package fixity information (e.g., checksums).

  • that each publisher deliver files with accurate and predictable file naming conventions and suffixes.
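The manifest and fixity practices encouraged above can be illustrated with a minimal sketch. It assumes a package is a directory tree and the manifest is a mapping from package-relative file paths to SHA-256 checksums, kept outside the package directory. The function names, the choice of SHA-256 and the manifest layout are assumptions made for illustration, not a format specified by the Library.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(package_dir: Path) -> dict[str, str]:
    """Map each file's package-relative path to its checksum (hypothetical layout)."""
    return {
        p.relative_to(package_dir).as_posix(): sha256_of(p)
        for p in sorted(package_dir.rglob("*"))
        if p.is_file()
    }


def verify_package(package_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return a list of problems: missing, altered, or unlisted files."""
    problems = []
    actual = build_manifest(package_dir)
    for name, expected in manifest.items():
        if name not in actual:
            problems.append(f"missing: {name}")       # completeness check
        elif actual[name] != expected:
            problems.append(f"checksum mismatch: {name}")  # fixity check
    for name in actual:
        if name not in manifest:
            problems.append(f"not in manifest: {name}")
    return problems
```

In this sketch a sender would run build_manifest before transfer and a receiver would run verify_package on arrival; an empty result means the delivery is complete and unaltered relative to the manifest.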

In this phase, we have accepted:

  • All publisher-submitted metadata and content file types.

Current Phase: Findings

The Library’s mandatory deposit process is guided by a “best edition” statement that specifies the Library’s preferred formats for deposited electronic serials (as well as other materials, both analog and digital). The formats are listed in order of preference, with JATS and other standard XML formats at the top, followed by PDF, HTML, and all other formats; there are also descriptions of needed metadata to accompany deposits. The preferences reflect the Library’s current assessment of the formats that will best meet its dual goals of usability to support access for our current researchers, and preservability to ensure access for researchers in the future.

The formats of the content files delivered to date have generally been within the best edition guidelines. However, there has been – as we expected – considerable variability in the naming and packaging schemes publishers have used to deliver content.

To date, we’ve seen:

  • Significant variations in file and directory naming

  • Variations in part identification (volume, issue, enumeration, chronology)

  • Variations in file format combinations (PDF content only, PDF content + plain text metadata, PDF content + XML metadata, XML content and metadata, varying tables of contents, content without tables of contents)

  • Many deliveries without a manifest
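As an illustration of how such variation might be surveyed, the following sketch summarizes a delivery by counting file suffixes and noting whether a manifest is present. The function name and the manifest file name "manifest.txt" are hypothetical; the paper does not describe the Library's actual tooling or any required manifest name.

```python
from collections import Counter
from pathlib import PurePosixPath


def summarize_delivery(file_names, manifest_name="manifest.txt"):
    """Summarize a delivery from its list of file paths.

    manifest_name is an assumed convention, compared case-insensitively.
    """
    suffix_counts = Counter(
        PurePosixPath(name).suffix.lower() or "(none)" for name in file_names
    )
    has_manifest = any(
        PurePosixPath(name).name.lower() == manifest_name for name in file_names
    )
    return {"suffix_counts": dict(suffix_counts), "has_manifest": has_manifest}
```

A suffix count cannot distinguish XML metadata from XML full text, so a summary like this would only be a first pass before closer inspection of the files.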

We are considering this first phase a learning period for all the participants. The Library has taken a conservative approach with the deposits, accepting content as delivered from the publishers, and for now we have decided not to normalize the content data.

Given the wide variations we are seeing, even in this relatively small sample, and the Library’s limited system development resources, we will be seeking ways to standardize both the packaging and the delivered content. For the latter, we hope to minimize duplicating the normalization and processing that has already been done to eSerials content by other organizations.

The Library’s general policy toward eSerials, at least for now, is to prioritize the preservation of the content rather than the look-and-feel of the original. To make the most of our limited resources, we will be focusing in coming months on acquiring eSerial content in a limited number of normalized forms. Greater standardization of formats and packaging will:

  • facilitate scalability to the thousands of eSerials titles we eventually expect to acquire,

  • help simplify collection management for our custodial units, and

  • enable us to develop more robust and efficient tools to manage access to the content.

The project team has met several times with publishers participating in the first phase of this effort. Based on their input in these conversations, we believe that the publishing community is moving to support JATS across the many subject areas in which the Library collects materials. We also believe the article-focused nature of JATS will become even more useful as more publishers move to article-based publishing.

Part of our strategy will be to “piggyback” wherever possible on the production and distribution practices that publishers already have in place with their other publishing partners, especially where those processes involve eSerial content in JATS format.

Goals for Processing of Content and Metadata in Next Phase

Our goals for the next phase are to make incremental improvements in standardizing incoming eSerial content and metadata processing. We hope more publishers will begin depositing requested titles in JATS wherever possible (to date, only one publisher has done so). Our goal is to acquire 80% of e-deposited content in JATS or other standard formats, and to normalize internally the remaining 20% that cannot be acquired that way.

Our emerging strategy will focus on acquiring eSerial content via the following methods, in descending order of preference:

  1. Acquire normalized (JATS wherever possible) data directly from publishers

  2. Acquire data in a limited number of standard formats from third party sources, to minimize the number and variability of incoming streams

  3. Acquire existing normalization routines from other organizations, and normalize content in-house

  4. Develop normalization routines in-house, and normalize data in-house

In support of this strategy, we would like to make improvements in the following areas, working with publishers and third party organizations as appropriate:

  • Delivery of eSerials to LC in JATS for metadata, container/package and article full text (with the addition of a PDF as an access copy)

  • Minimization of burden on publisher and LC by taking advantage of existing publisher distribution processes and schedules.

  • Development of an XML “wrapper” based on JATS and/or a METS serials profile to structure journal articles delivered in PDF or other non-XML format

  • Minimization of burden on publisher and LC by leveraging common publishing platforms (e.g., entities that host serials from multiple content providers).

  • Minimization of the number of different types of XML used and submitted by publishers.

  • Standardization on article-level metadata and identification.

  • Standardization on issue-level metadata.

  • Minimization of burden on publisher and receiving organization when serials move between publishers.

  • Standardization on metadata for components such as images, graphs, audio, and video incorporated into an article.

  • Standardization on metadata to put together an article from component files if delivered separately (e.g., an article may be composed of a text file and multiple image files).

  • Standardization on file and/or delivery fixity practices and/or authentication tools.
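The idea of an XML “wrapper” for articles delivered as PDF can be sketched as follows. The element names are borrowed from JATS front matter for illustration only; the output has not been validated against the JATS DTD, and neither the function name nor the structure represents a format the Library has specified. The journal title, ISSN and file name in the usage note are placeholders.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK)


def wrap_pdf_article(journal_title, issn, article_title, volume, issue, pdf_name):
    """Build a small JATS-like metadata wrapper that points at a PDF file."""
    article = ET.Element("article")
    front = ET.SubElement(article, "front")

    # Journal-level metadata
    journal_meta = ET.SubElement(front, "journal-meta")
    title_group = ET.SubElement(journal_meta, "journal-title-group")
    ET.SubElement(title_group, "journal-title").text = journal_title
    ET.SubElement(journal_meta, "issn").text = issn

    # Article-level metadata and part identification
    article_meta = ET.SubElement(front, "article-meta")
    article_titles = ET.SubElement(article_meta, "title-group")
    ET.SubElement(article_titles, "article-title").text = article_title
    ET.SubElement(article_meta, "volume").text = volume
    ET.SubElement(article_meta, "issue").text = issue

    # Link from the wrapper to the PDF content file
    ET.SubElement(
        article_meta,
        "self-uri",
        {f"{{{XLINK}}}href": pdf_name, "content-type": "pdf"},
    )
    return ET.tostring(article, encoding="unicode")
```

For example, wrap_pdf_article("Example Journal", "0000-0000", "An Example Article", "12", "3", "article-001.pdf") returns a single XML string that carries issue-level and article-level identification alongside a pointer to the PDF.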

Challenges and Opportunities for Practical Experience and Future Progress

The Library’s strategic goal is to acquire, preserve, and provide access to a universal collection of knowledge and the record of America’s creativity.

At no other time has the emergence of technology so directly affected how the Library performs these functions. The rapid evolution of digital technologies and the Internet continues to revolutionize how information and data are created, gathered, stored, distributed, preserved, and protected. Shifting media formats, increased production of and access to digital works, new uses of metadata and increasingly complex data rights issues have created new challenges and opportunities. We are at the beginning of a period in which we expect to be modeling, testing, implementing and iteratively improving new processes for acquiring, managing and delivering digital content. We know these processes will require practical engagement and new types of cooperation with those who create and provide that content, and we look forward to making progress in both the short and the long term.

Copyright © 2011, Library of Congress.

The copyright holder grants the U.S. National Library of Medicine permission to archive and post a copy of this paper on the Journal Article Tag Suite Conference proceedings website.
