J Med Libr Assoc. 2008 July; 96(3): 223–229.
doi: 10.3163/1536-5050.96.3.008.
PMCID: PMC2479051
Digitizing Dissertations for an Institutional Repository: A Process and Cost Analysis*
Mary Piorun, MSLS, MBA, AHIP, Associate Director and Lisa A. Palmer, MSLS, Catalog Librarian
Mary Piorun, Associate Director for Technology Initiatives and Resource Management Email: mary.piorun/at/umassmed.edu;
Received October 2007; Accepted December 2007.
Objective:
This paper describes the Lamar Soutter Library's process and costs associated with digitizing 300 doctoral dissertations for a newly implemented institutional repository at the University of Massachusetts Medical School.
Methodology:
Project tasks included identifying metadata elements, obtaining and tracking permissions, converting the dissertations to an electronic format, and coordinating workflow between library departments. Each dissertation was scanned, reviewed for quality control, enhanced with a table of contents, processed through an optical character recognition function, and added to the institutional repository.
Results:
Three hundred and twenty dissertations were digitized and added to the repository for a cost of $23,562, or $0.28 per page. Seventy-four percent of the authors who were contacted (n = 282) granted permission to digitize their dissertations. Processing time per title was 170 minutes, for a total processing time of 906 hours. In the first 17 months, full-text dissertations in the collection were downloaded 17,555 times.
Conclusion:
Locally digitizing dissertations or other scholarly works for inclusion in institutional repositories can be cost effective, especially if small, defined projects are chosen. A successful project serves as an excellent recruitment strategy for the institutional repository and helps libraries build new relationships. Challenges include workflow, cost, policy development, and copyright permissions.
Highlights
  • The Lamar Soutter Library partnered with the University of Massachusetts Medical School Graduate School of Biomedical Sciences to digitize doctoral dissertations for inclusion in a newly created institutional repository.
  • Seventy-four percent of dissertation authors (209/282) gave permission for the digitization. The cost to process the entire dissertation collection in-house was $23,562, only $1,062 more than the estimate to outsource.
  • Digitizing the dissertation collection increased access: the print collection was used 723 times in the past 5 years, while the electronic collection was used 17,555 times in 17 months.
Implications
  • Digitizing student works is an effective way to begin populating an institutional repository.
  • In-house digitization projects can be cost-competitive with outsourced alternatives.
  • A repository can be a catalyst for developing relationships in the institution by providing the library with a new avenue for outreach.
  • Skills and experience gained from a small project can be applied to larger-scale projects.
Digitization projects in libraries seem ubiquitous as libraries become increasingly involved in the acquisition, development, and management of digital information [1]. Libraries typically target archival and special collections materials such as historical documents and photographs [2]. Projects to digitize vast collections of books began as early as 1971 with Project Gutenberg and are now getting widespread media attention with the launch of Google Book Search, the Internet Archive, and others [3]. In an April 2007 list of ten assumptions about the future that would significantly impact academic libraries and librarians, the Association of College & Research Libraries Research Committee placed digitization at the top of the list, stating, “There will be an increased emphasis on digitizing collections, preserving digital archives, and improving methods of data storage and retrieval” [4].
A related emergent trend in academic libraries is the implementation of institutional repositories (IRs), digital collections that capture and preserve the intellectual output of university communities [5]. A search of OpenDOAR, the Directory of Open Access Repositories, lists 298 academic repositories in North America [6]. Health sciences libraries are among those contributing to this trend; of 125 libraries that responded to a 2006 supplementary survey for the Annual Statistics of Medical School Libraries in the United States and Canada, 28 have established IRs and 70 are planning to add or are considering offering a repository [7]. According to Foster and Gibbons, libraries build IRs because they “provide an institution with a mechanism to showcase its scholarly output, centralize and introduce efficiencies to the stewardship of digital documents of value, and respond proactively to the escalating crisis in scholarly communication” [8].
Medical librarians are just beginning to report their experiences with institutional repositories in the professional literature [9–13]. In one case study, Krevit and Crays [13] describe challenges that the Texas Medical Center experienced in piloting a multi-institutional repository, including copyright concerns and lack of faculty participation. An analysis by Singarella and Schoening [14] of the surveys conducted between 2005 and 2007 by the Association of Academic Health Sciences Libraries and a survey conducted in 2006 by the Association of Research Libraries [15] confirmed that the challenges experienced at the Texas Medical Center were not unique. Libraries are the drivers of IRs at their institutions, as few faculty members identify and self-archive their own materials. Libraries struggle to recruit content and employ a variety of strategies to enlist submissions [16–19]. Content may vary, but a recent study by McDowell reports that student works account for the largest percentage of documents in institutional repositories, approximately 41.5% [19].
The following case study describes a nexus of these two trends: digitization of student scholarly works and institutional repositories. The first digitization project for the Lamar Soutter Library at the University of Massachusetts Medical School (UMMS) was to digitize 300 doctoral dissertations and add the full text to the school's new IR.
Founded in 1970, UMMS encompasses the graduate schools of medicine, nursing, and biomedical sciences. The Lamar Soutter Library holds 175,000 print volumes and provides access to 316 databases, 4,650 electronic journals, and 359 electronic books. The IR is the library's first comprehensive digital initiative.
In early 2006, the library purchased a license for ProQuest Digital Commons, a hosted institutional repository system, and named the repository “eScholarship@UMMS” <http://escholarship.umassmed.edu>. The team implementing the repository, a previously reported process [12], consisted of representatives from the library's systems (project management and technical support), cataloging (metadata support), and reference (outreach) departments. In March 2006, the dean of the UMMS Graduate School of Biomedical Sciences (GSBS) expressed interest in digitizing the school's dissertations. The GSBS had produced 300 dissertations, most of which were available only in print format. The team thought this would be an excellent demonstration project: it was supported by the dean, it was a manageable size, metadata could be reutilized from the library's online public access catalog (OPAC), and the dissertation authors held the copyright. In May 2006, the library and GSBS partnered to make the dissertations fully searchable on the web.
Outsourcing Versus Insourcing
The team investigated 2 options for digitizing the dissertations: outsourcing to UMI or performing the work in-house. UMI estimated the cost to be $75 per title ($22,500 total) and 8–12 weeks processing time. Library staff based the in-house estimate on scanning and locally preparing 3 sample dissertations. Table 1 shows the library's cost estimate of $27,750 (for staffing, project management, equipment, and software) and 725 hours of processing time (about 18 weeks at 40 hours per week). In all instances, except for project management, the team assumed the work would be performed by temporary help. The team had 2 concerns about outsourcing: at the time, electronic files created by UMI were not full-text searchable, and the graduate school would need to commit to sending all future dissertations to UMI to keep the database current.
Table 1
Estimate versus actual costs and processing times
The project team recommended that the library process the dissertations in-house, despite longer time to process and higher cost, in order to gain experience, retain access to materials throughout the project, and have tighter control over scanning quality. Library administration accepted the recommendation to do the digitization locally, citing “gaining experience” as the major benefit; however, $27,750 was not available to fund the project. Ten thousand dollars was allotted to hire temporary staff, with the understanding that circulation staff and interlibrary loan equipment would be utilized for scanning and team catalogers would add dissertations to the repository. It was also recognized that the project could not be completed in 18 weeks as staff assigned to the project would need to incorporate the dissertation tasks into their daily workload.
Metadata
To fully utilize metadata from the library's integrated library system, team catalogers customized the default templates in the Digital Commons software that control the indexing and display of a collection of records. The customizations were needed to describe the dissertations fully. They included activating live links in fields that might contain uniform resource locators (URLs), adding a field to record authors' UMMS departmental affiliations, and accommodating Medical Subject Headings by changing the field delimiter from a comma to a semicolon. For instance, “Libraries, Medical” and “Library Technical Services” previously displayed as “Libraries” and “Medical; Library Technical Services.” Catalogers copied and pasted title and subject data from the OPAC into the repository manually, using macros when possible. Although the Digital Commons software included a batch loader, it was not used for submission because it relied on a separate extensible markup language (XML) schema that, at the time, could not be configured to match the customized dissertation templates.
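The delimiter problem can be illustrated with a minimal sketch. It assumes a multi-valued subject field stored as a single delimited string; the function name and the stored values are hypothetical, and the sketch does not reproduce how Digital Commons actually stores or displays the field. It only shows why a comma delimiter breaks inverted headings such as “Libraries, Medical.”

```python
def split_subjects(field: str, delimiter: str) -> list[str]:
    """Split a multi-valued subject field on a delimiter and trim whitespace."""
    return [part.strip() for part in field.split(delimiter) if part.strip()]


# Hypothetical stored values for the same two headings.
comma_joined = "Libraries, Medical, Library Technical Services"
semicolon_joined = "Libraries, Medical; Library Technical Services"

# A comma delimiter cuts the inverted heading "Libraries, Medical" in two.
broken = split_subjects(comma_joined, ",")

# A semicolon delimiter leaves the comma inside the heading untouched.
fixed = split_subjects(semicolon_joined, ";")

print(broken)
print(fixed)
```

With the semicolon, each heading survives as a single value, which is the behavior the catalogers needed for Medical Subject Headings.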
Digitization and Submission Process
Using alumni contact data provided by the graduate school, library staff wrote to the dissertation authors to request copyright and digitization permissions. Alumni were asked to grant permission immediately, while current graduates were given the option to add only an abstract and delay adding the full text for one year to allow for publishing opportunities. Initially, only dissertations for which the library secured permissions were scanned and processed. Once those were completed, a decision was made by the project team to scan the remaining dissertations, add records with the abstracts to the repository, and store the full-text files until permission was obtained. Dissertations averaged 250 pages in length and were single-sided, with a mix of text, tables, graphs, and images. They were scanned using a Canon Image Runner 3300 with eCopy version 3.1, a software program used for scanning, optical character recognition (OCR), and portable document format (PDF) creation. Figures 1 and 2 illustrate the digitization process. Figure 3 shows a typical dissertation record in eScholarship@UMMS.
Figure 1
Digitization process
Figure 2
Process to convert and add dissertations to repository
Figure 3
Typical dissertation record
An Unexpected Step to Alleviate Privacy Concerns
As the project neared completion, the dean of the graduate school expressed concern about the signature pages of the dissertations being made public. The team asked ProQuest's UMI Dissertation Publishing its policy on this issue and learned UMI stopped scanning signature pages in 2005. The team concluded it was worth the additional time and cost to re-create a “blank” signature page for each dissertation, which would retain the names of the advisor and review committee without their signatures (this information was not stored elsewhere). The new signature pages were created and reinserted into the PDF files. Cataloging staff then substituted the revised PDFs in eScholarship@UMMS.
The total number of documents processed was 320: 300 dissertations from previous graduates and 20 submitted by new graduates over the course of the project. The project team was able to successfully contact 282 of the 320 authors, and 209 (74%, 209/282) granted permission to digitize their dissertations. The dissertations (or records providing abstracts only in cases where permission was not granted) were all available online by March 2007.
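The permission figures can be checked with a short calculation. The sketch below uses only the counts reported above. The share of all 320 titles with permission is a derived figure, not one the article reports, and it assumes that each author corresponds to exactly one dissertation.

```python
TOTAL_AUTHORS = 320      # dissertations processed, one author each (assumption)
CONTACTED = 282          # authors the team reached
GRANTED = 209            # authors who granted permission

contact_rate = CONTACTED / TOTAL_AUTHORS       # share of authors reached
permission_rate = GRANTED / CONTACTED          # the 74% reported in the article
full_text_share = GRANTED / TOTAL_AUTHORS      # derived, not reported in the article

print(f"Contacted: {contact_rate:.1%}")
print(f"Granted permission (of contacted): {permission_rate:.1%}")
print(f"Granted permission (of all titles): {full_text_share:.1%}")
```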
Processing Time
Actual processing times are summarized in Table 1. The total hours to process the materials were 906 hours, exceeding the original estimate of 725 hours by 181 hours. One-hundred and fifty-nine hours of this difference can be attributed to the unexpected need to replace the signature pages in each dissertation. The total duration of the project was 12 months, as the circulation staff members who scanned the dissertations were not assigned to the project full time. They scanned on average 2 dissertations per night and 5 on weekends. Spreading the work over the course of 1 year allowed for multiple attempts to contact alumni for permission.
Closer analysis of the estimated and actual time needed per dissertation shows 2 important factors. First, the initial time estimate to process a dissertation was low (145 minutes vs. 170 minutes); however, if the additional step of replacing the signature pages had not been required, the original estimate would have been accurate. Second, regardless of the difference in the total time needed per dissertation, some important areas were underestimated, such as the time to OCR the abstract and overall project management. Issues that contributed to this miscalculation include the extra time needed to correct scientific notation during the OCR process and the project management time required to obtain permissions from authors to digitize their work.
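The per-title figures can be reproduced from the reported totals. The sketch below assumes that the estimate of 725 hours covered the original 300 titles and that the actual 906 hours covered all 320 titles; the article does not state how the per-title minutes were derived, so this is one plausible reading rather than the authors' method.

```python
ESTIMATED_HOURS = 725        # original in-house estimate
ESTIMATED_TITLES = 300       # titles assumed in the estimate (assumption)
ACTUAL_HOURS = 906           # total hours actually spent
ACTUAL_TITLES = 320          # titles actually processed
SIGNATURE_PAGE_HOURS = 159   # hours attributed to replacing signature pages


def minutes_per_title(hours: float, titles: int) -> float:
    """Convert a total in hours to average minutes per title."""
    return hours * 60 / titles


estimated = minutes_per_title(ESTIMATED_HOURS, ESTIMATED_TITLES)
actual = minutes_per_title(ACTUAL_HOURS, ACTUAL_TITLES)
without_signature_work = minutes_per_title(
    ACTUAL_HOURS - SIGNATURE_PAGE_HOURS, ACTUAL_TITLES
)
overrun_hours = ACTUAL_HOURS - ESTIMATED_HOURS

print(f"Estimated: {estimated:.0f} min per title")
print(f"Actual: {actual:.0f} min per title")
print(f"Actual without signature pages: {without_signature_work:.0f} min per title")
print(f"Overrun: {overrun_hours} hours")
```

Under these assumptions, removing the signature-page work brings the actual figure close to the 145-minute estimate, which is consistent with the observation that the original estimate would otherwise have been accurate.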
Equipment and Software
The work was accomplished using existing library scanning equipment. The library already owned copies of the software used throughout the process: Microsoft Access, eCopy, Adobe Acrobat, and Adobe Illustrator. Because eCopy came with a scaled down version of the Readiris OCR software, the library purchased 3 copies of the full Readiris program for a total of $990; however, these were not used in the project because the 2 versions conflicted. Thus, the original estimate of $10,000 for equipment and software was too high.
Labor
Actual labor costs, as shown in Table 1, were $22,572 versus the estimated costs of $17,750. In the initial estimate, a temporary worker was assigned the task of adding the dissertation to the repository; however, 2 staff catalogers performed this work at a resulting higher rate. The $10,000 allotted for a temporary worker paid for quality control, OCR work, and editing of the signature pages for a total cost of $9,372. This labor cost would have been $6,446 if the extra step of editing the signature pages had not been necessary.
Budget
Total project costs were $23,562 ($990 software, $22,572 labor) or $0.28 per page (Table 1). This is $4,188 less than the original estimate of $27,750 to process the dissertations in-house and $1,062 more than the estimated cost to outsource the dissertations to UMI.
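The budget comparison can be verified with a few lines of arithmetic. The sketch below uses only figures reported in the article. The page count is back-calculated from the rounded $0.28 per-page cost and is an approximation, because the article does not report a total page count.

```python
SOFTWARE_COST = 990            # 3 copies of Readiris
LABOR_COST = 22_572            # actual labor
IN_HOUSE_ESTIMATE = 27_750     # original in-house estimate
UMI_COST_PER_TITLE = 75        # UMI outsourcing quote
UMI_TITLES = 300               # titles covered by the UMI quote
COST_PER_PAGE = 0.28           # reported per-page cost (rounded)

total_cost = SOFTWARE_COST + LABOR_COST
outsourcing_estimate = UMI_COST_PER_TITLE * UMI_TITLES
below_in_house_estimate = IN_HOUSE_ESTIMATE - total_cost
above_outsourcing_estimate = total_cost - outsourcing_estimate

# Back-calculated from a rounded figure, so only approximate.
implied_pages = total_cost / COST_PER_PAGE

print(f"Total cost: ${total_cost:,}")
print(f"Below in-house estimate by: ${below_in_house_estimate:,}")
print(f"Above outsourcing estimate by: ${above_outsourcing_estimate:,}")
print(f"Implied pages processed: about {implied_pages:,.0f}")
```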
Usage
Historical circulation data from May 1999 through November 2007 show the library's print dissertation collection was used 723 times. This is in stark contrast to the first 17 months the electronic collection was available (June 2006 through November 2007). Downloads of full-text PDF dissertations from eScholarship@UMMS totaled 17,555, with 10,497 originating from Google searches.
Staff Development
Team members became more familiar with the repository software, metadata standards, scanning, and OCR technologies and developed closer working relationships on the team and between departments. The team developed a greater awareness of the importance of copyright compliance.
Many libraries have viewed digitizing collections as too expensive an undertaking in this time of tight budgets [20]. Chapman of Harvard University states the costs for scanning, OCR, and quality control work can be as much as 48% of a project's total costs [21]. Equivalent costs throughout the Lamar Soutter Library's dissertation project match this estimate (47.79%). Using Chapman's group of activities (scanning, OCR, and quality control), the per-page cost to process black-and-white text in a bound volume can range from $0.10 to $1.40 [22, 23]. Both these figures are based on outsourcing the work. The Lamar Soutter Library's internal costs were competitive with these estimates, at $0.28 per page. This suggests the cost to digitize may be within the reach of many medical libraries and a viable option to populate institutional repositories.
The usage statistics for this collection indicate that by disseminating the dissertations through eScholarship@UMMS, which is indexed by Google, access and use increased substantially. Studies indicate that individuals who publish their research online in addition to publishing in traditional scholarly venues are cited more often than those who rely solely on paper publications [24–27]. In digitizing the GSBS dissertations, the library has assisted in making the school's research more widely available.
The team faced challenges such as workflow, cost concerns, policy development, and permissions. Communication and coordination between internal and external departments were vital and minimized errors. As the team learned, regardless of the amount of planning and thought that goes into a project, there is always the possibility that each record or file will need to be reworked. Decisions made in processing the dissertations set a precedent for future collections, such as adding documents without the full text if permission has not been obtained. The team acknowledges this could frustrate users who cannot get access to the full text. The team has worked hard to contact as many dissertation authors as possible to keep incomplete records to a minimum.
Nolan and Costanza described their experience in populating the repository at Trinity University, which also focused on student works, by noting, “it's important to start small, choosing projects that have usefulness to our constituents” [28]. The Lamar Soutter Library also found having a small, defined project had many benefits. It allowed the team to experience an early success and manage staff and resources by gradually incorporating the work. The team also gained experience with Digital Commons, metadata standards, and copyright. Additionally, through coordinated promotion by GSBS and the library, this project served as a strategy for recruiting content from other campus departments to further populate the institutional repository. New materials recruited include student works, nursing dissertations, and faculty publications, a small portion of which required digitization.
For UMMS, digitizing dissertations proved to be a successful and cost-effective recruitment strategy and helped the library build stronger relationships at the medical school to secure future content. The team's quick response to the dean's privacy concerns built a foundation of trust for future work. Currently, all dissertations are submitted to the library in both print and electronic format along with a signed permission form to digitize the work. The library anticipates that building this relationship with students will make it easier to recruit future scholarly works over the life of a researcher's career at the medical school.
Footnotes
*Based on a poster at MLA '07, the 107th Annual Meeting of the Medical Library Association; Philadelphia, PA; May 20, 2007; and a presentation at the Scanning Forum 2006; Charlottesville, VA; November 6, 2006.
In July 2007, Berkeley Electronic Press (bepress), the original developers of the software, resumed full support of the Digital Commons product. It is now called bepress Institutional Repository. For more information, see the product description available at <http://www.bepress.com/ir/>.
Contributor Information
Mary Piorun, Associate Director for Technology Initiatives and Resource Management Email: mary.piorun/at/umassmed.edu;
Lisa A. Palmer, Lamar Soutter Library, University of Massachusetts Medical School, 55 Lake Avenue North, Worcester, MA 01655 Email: lisa.palmer/at/umassmed.edu;
1. Muir A. Preservation, access and intellectual property rights challenges for libraries in the digital environment [Internet] London, UK: Institute for Public Policy Research; 2006 [rev. 5 Jun 2006; cited 21 Nov 2007]. < http://www.ippr.org/publicationsandreports/publication.aspid464>.
2. Institute of Museum and Library Services. Status of technology and digitization in the nation's museums and libraries [Internet] Washington, DC: The Institute; 2002 [rev. 23 May 2002; cited 21 Nov 2007]. < http://www.imls.gov/resources/TechDig02/2002Report.pdf>.
3. Coyle K. Mass digitization of books. J Acad Librariansh. 2006;32(6):641–5.
4. Mullins J.L., Allen F.R., Hufford J.R. Top ten assumptions for the future of academic libraries and librarians: a report from the ACRL Research Committee. C&RL News. 2007. Apr, (Available from: < http://www.lita.org/ala/acrl/acrlpubs/crlnews/backissues2007/april07/tenassumptions.cfm>. [cited 21 Nov 2007].).
5. Crow R. The case for institutional repositories: a SPARC position paper. release 1.0 [Internet] Washington, DC: Scholarly Publishing and Academic Resources Coalition; 2002 [rev. Aug 2002; cited 1 Aug 2007]. < http://www.arl.org/sparc/bmdoc/ir_final_release_102.pdf>.
6. University of Nottingham. OpenDOAR: the directory of open access repositories [Internet] Nottingham, UK: The University; 2007 [rev. 26 Nov 2007; cited 26 Nov 2007]. < http://www.opendoar.org>.
7. Association of Academic Health Sciences Libraries. Annual statistics of medical school libraries in the United States and Canada. [Internet]. The Association [password protected; cited 3 Aug 2007]. < http://aahsl.ccr.buffalo.edu>.
8. Foster N.F., Gibbons S. Understanding faculty to improve content recruitment for institutional repositories. D-Lib Mag [Internet] 2005. Jan, [cited 5 Oct 2007]. < http://www.dlib.org/dlib/january05/foster/01foster.html>.
9. Phillips H., Carr R., Teal J. Leading roles for reference librarians in institutional repositories: one library's experience. Ref Serv Rev. 2005;33(3):301–11.
10. Hatfield A.J., Kelley S.D. Case study: lessons learned through digitizing the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research Collection. J Med Libr Assoc. 2007 Jul;95(3):267–70.
11. Mower A. Developing an institutional repository: an insider's look at the University of Utah IR. Libr Student J [Internet] 2006. Sep, < http://informatics.buffalo.edu/org/lsj/>.
12. Piorun M.E., Palmer L.A., Comes J.F. Challenges and lessons learned: moving from image database to institutional repository. OCLC Syst Serv. 2007;23(2):148–57. doi: 10.1108/10650750710748450.
13. Krevit L., Crays L. Herding cats: designing DigitalCommons @ The Texas Medical Center, a multi-institutional repository. OCLC Syst Serv. 2007;23(2):116–24. doi: 10.1108/1065075071074844.
14. Association of Academic Health Sciences Libraries. Annual statistics of medical school libraries in the United States and Canada. [Internet]. The Association. Singarella T, Schoening P. AAHSL institutional repositories (IR) survey summary and analysis: 2005–2007 comparison. [password protected; cited 18 Sep 2007]. < http://aahsl.ccr.buffalo.edu>.
15. Bailey C. Institutional repositories. Washington, DC: Association of Research Libraries; 2006.
16. Lynch C.A., Lippincott J.K. Institutional deployment in the United States as of early 2005 [Internet]. D-Lib Mag [Internet] 2005. Sep, [cited 1 Aug 2007]. < http://www.dlib.org/dlib/september05/lynch/09lynch.html>.
17. Markey K., Rieh S.Y., St. Jean B., Kim J., Yakel E. Census of institutional repositories in the United States: MIRACLE project research findings [Internet] Washington, DC: Council on Library and Information Resources; 2007. [rev. Feb 2007; cited 19 Oct 2007]. < http://www.clir.org/pubs/reports/pub140/pub140.pdf>.
18. Davis P.M., Connolly M. Evaluating the reasons for non-use of Cornell University's installation of DSpace. D-Lib Mag [Internet] 2007. Mar, [cited 5 Sep 2007]. < http://www.dlib.org/dlib/march07/davis/03davis.html>.
19. McDowell C.S. Evaluating institutional repository development in American academe since early 2005: repositories by the numbers, part 2. D-Lib Mag [Internet] 2007. Sep, [cited 10 Oct 2007]. < http://www.dlib.org/dlib/september07/mcdowell/09mcdowell.html>.
20. Institute of Museum and Library Services. Status of technology and digitization in the nation's museums and libraries [Internet] Washington, DC: The Institute; 2006 [rev. Jan 2006; cited 21 Nov 2007]. < http://www.imls.gov/resources/TechDig05/Technology2BDigitization.pdf>.
21. Chapman S. Managing text digitization: making good digital text objects. Presented at: School for Scanning: Building Good Digital Collections; Jun 2, 2005; Boston, MA.
22. Boston Library Consortium. News release: Boston Library Consortium partners with Open Content Alliance to provide public access to digitized books [Internet]. The Consortium. [rev. 25 Sep 2007; cited 1 Oct 2007]. < http://www.blc.org/news/blc_oca_release.html>.
23. Kenny A.R., Rieger O.Y. Moving theory into practice: digital imaging for libraries and archives. Mountain View, CA: Research Libraries Group; 2000.
24. Lawrence S. Free online availability substantially increases a paper's impact. Nature. 2001 May 31;411(6837):521.
25. Antelman K. Do open access articles have a greater research impact. Coll Res Libr. 2004;65(1):372–82. (Available from: < http://www.ala.org/ala/acrl/acrlpubs/crljournal/crl2004/crlseptember/antelman.pdf>. [cited 12 Oct 2007].).
26. Eysenbach G. Citation advantage of open access articles. PLOS Biol. 2006. May, [cited 26 Nov 2007]. DOI: 10.1371/journal.pbio.0040157.
27. Piwowar H.A., Day R.S., Fridsma D.B. Sharing detailed research data is associated with increased citation rate. PLOS One. 2007. Mar, [cited 26 Nov 2007]. DOI: 10.1371/journal.pone.0000308.
28. Nolan C., Costanza J. Promoting and archiving student work through an institutional repository: Trinity University, LASR, and Digital Commons [Internet] San Antonio, TX: Trinity University; 2006 [rev. Jun 2006; cited 18 Sep 2007]. < http://hdl.handle.net/10090/502>.
