Copyright © 2007, Authors.

Case study: lessons learned through digitizing the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research Collection*†

1 Email: ajhatfie/at/iupui.edu, Digital Initiatives Librarian
2 Email: sdkelley/at/iupui.edu, Digitizing Assistant; Ruth Lilly Medical Library, Indiana University School of Medicine, 975 West Walnut Street, IB-100 Room 208, Indianapolis, IN 46202-5121

Received September 2006; Accepted January 2007.

Readers may use articles without permission of copyright owners, as long as the author and MLA are acknowledged and the use is educational and not for profit.

INTRODUCTION

The Indiana University Center for Bioethics (IUCB) and the Ruth Lilly Medical Library (RLML), Indiana University School of Medicine, joined forces in 2005 to augment online access to bioethics-related materials by developing the Bioethics Digital Library (BEDL) [1]. BEDL's goal is to acquire or borrow unique bioethics-related materials and special collections for digitization, to preserve the digitized materials, and to provide open access to these materials through a full-text indexed, Web-integrated database. To enhance discoverability of BEDL materials, content will be linked to citation records in the Kennedy Institute of Ethics National Reference Center for Bioethics Literature ETHX on the Web database [2] as well as other appropriate digital repositories, creating a network of bioethics resources with multiple access points.

Interest in providing open access to digital scholarship is increasing, as evidenced by the National Institutes of Health's Public Access Policy [3] and the introduction of the Federal Research Public Access Act of 2006 (FRPAA, S.2695) [4]. One way to contribute to the open access initiative is to convert historical materials found in disparate locations to digital formats that are discoverable and freely accessible on the Internet.
This paper presents a digitization case study that illustrates the challenges of transforming a historical collection to a digital collection while attempting to retain the look and feel of the original historical materials.

COLLECTION SCOPE

The first complete collection digitized for BEDL belonged to a member of the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research (the National Commission), convened in 1974. The collection, which contains committee reports, memos, documented deliberations, letters from human research subjects, and much more (see sample documents in the online figures), was donated to the Truman G. Blocker History of Medicine Collections, Moody Medical Library, University of Texas Medical Branch. Based on the accepted digitizing plan, the Blocker History of Medicine Collections allowed RLML to borrow the materials and digitize the entire collection in return for archival-quality tagged image file format (TIFF) images produced during the scanning phase. The digital collection now exists in duplicate at separate geographic locations, a best practice for preservation in the digital age.

The collection provides a glimpse into the work of the commission as well as the many issues relating to the use of human subjects in research. Graduate, professional, doctoral, and post-doctoral students, as well as scholars involved in research policy, law, and bioethics, will likely be most interested in the contents of the collection. However, because the materials comprise a historical collection that informs current policy decisions and the conduct of ethical research involving human subjects, the materials have general appeal as well.

BEDL's Digitizing Team, including three library science graduate students and the digital initiatives librarian, identified four broad categories of material types during the collection preparation phase. Table 1 describes the categories.
TECHNICAL PLATFORM

BEDL is a valuable, stand-alone resource available through the Indiana University Purdue University–Indianapolis (IUPUI) campus's digital repository, the IUPUI Digital Archive (IDeA) [5]. The IUPUI University Library has provided technology resources for IDeA, including hardware, storage, backup, and system and network administration. The IDeA team is committed to developing scholarly communities and collections in the virtual environment and to ensuring standard migration of digital content through technological developments, thus making the platform for BEDL both stable and scalable as the collections and content grow.

IDeA utilizes DSpace, the open source digital repository software developed by the Massachusetts Institute of Technology and Hewlett Packard [6]. DSpace provides an academically oriented system for digital repository and preservation that complies with the Open Archives Initiative Protocol for Metadata Harvesting, which opens up networking and value chain opportunities. A value chain strategy includes acquiring raw materials (e.g., hard or soft copy content), adding value to the raw materials (e.g., metadata, full-text indexing, online accessibility), and returning the value-added end product to consumer audiences. Once started, the value chain can continue to grow, making the raw materials increasingly valuable to broader or niche audiences, depending on a collection's goals and the extent of the value added.

COLLECTION PROCESSING

Two implicit phases of a digitizing project encompass many steps and methods: (1) physical materials preparation and scanning and (2) scanned-image processing.

[Figure 1]
Prior to the National Commission project, the team experimented with digitizing a test collection: six volumes of the National Bioethics Advisory Commission (NBAC) Reports and Recommendations [7]. The methods and workflow applied to the National Commission digitization project were based on the lessons learned during that experimentation. For example, the NBAC reports were not processed with optical character recognition (OCR) during post-scan processing, and it became evident that the OCR step should be included to add value to the materials. With OCR, full-text indexing and internal document search capability can be applied, making it easier for consumers to discover and use the materials. The National Commission collection workflow therefore includes an OCR procedure, adding to the value chain of the content.

Phase I of the National Commission digitization project took approximately 180 scanning hours to produce 41,456 archival-quality TIFF images. The TIFF images were saved to dual-layer DVD storage media for backup during post-scan processing. The storage media will suffice for short-term backup until the archival TIFF images are uploaded to BEDL for long-term preservation and migration.

DIGITAL FORMATS

The National Commission master archival images were scanned at 600 dots-per-inch resolution and saved as uncompressed TIFF files, the most widely adopted format for storing preservation-quality digital masters [8]. The masters must be high quality so that they can be used to create access derivatives or be migrated to new types of digital formats that may become available. The final derivative format for the National Commission materials is Adobe portable document format (PDF), type PDF-A/RGB. This format is promoted as the most desirable for Internet access [9]. PDF is also recommended by the National Initiative for a Networked Cultural Heritage in their Guide to Good Practice [10].
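The scanning figures above lend themselves to a quick back-of-the-envelope estimate of throughput and raw storage. The sketch below is illustrative only: the resolution, image count, and scanning hours come from the text, but the page size (US Letter) and bit depth (24-bit RGB) are assumptions, since the article does not state them, and the actual masters may differ.

```python
# Rough throughput and storage estimate for uncompressed TIFF masters.
# From the article: 600 dpi, 41,456 images, about 180 scanning hours.
# Assumed (NOT stated in the article): US Letter pages, 24-bit RGB.

DPI = 600
IMAGE_COUNT = 41_456
SCANNING_HOURS = 180

PAGE_WIDTH_IN = 8.5    # assumption: US Letter width
PAGE_HEIGHT_IN = 11.0  # assumption: US Letter height
BYTES_PER_PIXEL = 3    # assumption: 24-bit RGB


def uncompressed_bytes(width_in, height_in, dpi, bytes_per_pixel):
    """Pixel-data size of one uncompressed image, ignoring TIFF header overhead."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


per_image = uncompressed_bytes(PAGE_WIDTH_IN, PAGE_HEIGHT_IN, DPI, BYTES_PER_PIXEL)
total = per_image * IMAGE_COUNT
images_per_hour = IMAGE_COUNT / SCANNING_HOURS

print(f"Per image:  {per_image / 1e6:.1f} MB")
print(f"Collection: {total / 1e12:.2f} TB")
print(f"Throughput: {images_per_hour:.0f} images per scanning hour")
```

Changing the assumed bit depth or page size scales the storage estimate linearly, so the same function can be reused with whatever scanner settings a project actually adopts.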
METADATA

Beyond adding to the content value chain with OCR and full-text indexing, authoritative metadata is also applied to the record of each collection item. A qualified Dublin Core metadata set, based on the Dublin Core Libraries Working Group Application Profile (LAP), is the default DSpace metadata record structure. Because the team uploads items when processing is complete, the DSpace online submission process is used, item by item. During the submission process, appropriate metadata is assigned to the record, as are Library of Congress Subject Headings, Medical Subject Headings, and National Reference Center for Bioethics Literature Classification Scheme and Subject Terms. The field of bioethics is multidisciplinary; therefore, all three authorities are used.

LESSONS LEARNED

As BEDL's Digitizing Team developed and moved through the workflow processes, challenges inevitably arose. One of the first was determining appropriate settings, standards, and efficient workflow steps. Once these were determined, documenting and communicating reproducible procedures became an additional challenge. The solution was to create a continually updated Digitizing Guide for use by the Digitizing Team.

Learning how to optimize conversion software such as Adobe Photoshop, ABBYY FineReader, and Adobe Acrobat Professional was another challenge. Employing team members who have the necessary skills or are not afraid to experiment with the software is advantageous. Sending team members to workshops and mini-training sessions is also helpful.

File formats and conversion methods quickly became a concern. In general, the project's workflow includes three major processes, each of which includes many individual steps: (1) image cleanup in Photoshop, (2) OCR in FineReader, and (3) PDF conversion and compression. The workflow required processed TIFF files from Photoshop to be converted into compressed PDF files before exporting to FineReader.
Because the team is committed to accurate and thorough OCR that maintains the original look of the materials, the OCR step is the most laborious and time-consuming. FineReader is used to spell-check and identify characters not recognized by the software. The team then painstakingly corrects the inaccuracies. The OCR process essentially embeds a full-text file in the digital file, which increases the final file size substantially. Determining a final derivative format for efficient Web downloading may seem straightforward, because PDF is documented as being the best Web-access format [11]. However, once OCR was applied in FineReader utilizing BEDL's preferred method of saving text under the image, the resulting PDF file sizes became too large for efficient Web downloading. The team has experimented with several OCR and compression techniques to find a combination that would meet its high-quality standards but has not yet identified a suitable solution.

While DSpace is functionally a good platform for BEDL, working within the organizational constraints of the DSpace system was restrictive. The DSpace concepts of community, sub-community, and collection were not always the best way to organize the digital materials in BEDL. In some cases, DSpace sub-communities were used extensively to provide the necessary organizational granularity. The display of community, sub-community, and collection hierarchy on the BEDL Web page can be difficult to understand and can potentially frustrate end users browsing the BEDL collections.

Finally, working with materials clearly under copyright as well as materials with unknown ownership required making decisions about how to best represent those materials as part of the whole collection while conforming to copyright law and orphan item guidelines. For example, many of the background materials are newspaper articles and peer-reviewed publications.
One way to work around rights issues is to compile and post a bibliography of the published materials in the National Commission collection while the team pursues permissions to post the full-text items.

CONCLUSION

The National Commission digitization project provided a “proof of concept” opportunity in several areas: (1) the team confirmed the DSpace platform provides the necessary functionality for a digital library; (2) the team was able to successfully borrow and digitize a complete special collection; and (3) the team was able to develop process and workflow methodologies that met high-quality standards while maintaining productivity expectations. Several issues in workflow and methods were identified through developing the National Commission digital collection. In most cases, the team worked through unexpected issues and documented best practices. However, two challenges have emerged that require further research: (1) OCR-PDF file size and (2) copyright and ownership with regard to a historical collection that includes publications. Solutions to these issues would benefit the many projects committed to producing high-quality, usable digital collections.
Footnotes

* Based on: “The Bioethics Digital Library: Best Practices Evolving from Ground Zero,” poster presented at: MLA '06, the 106th Annual Meeting of the Medical Library Association; Phoenix, Arizona; May 23, 2006.

† The Bioethics Digital Library digitization equipment and software were supported by National Institutes of Health grant no. NO1-LM-1-3513 from the National Library of Medicine.

Figures 1, 2, and 3 are available with the online version of this journal.