PMC Back Issue Digitization
Biomedical Journal Digitization (2014- )
In 2014, representatives of the U.S. National Library of Medicine (NLM), a component of the National Institutes of Health, and the Wellcome Trust signed a Memorandum of Understanding (MOU) to work together to make thousands of complete back issues of historically significant biomedical journals freely available online. The project is scanning original materials from NLM's collection at the article level.
The historical material in this project falls into one of three categories for clearances and permissions:
- Material currently under copyright for which the publisher has granted NLM permission to digitize and include in PMC. This material is made available with a Creative Commons license chosen by the publisher.
- Material that is in the public domain:
  - Titles published in their entirety in the United States prior to Jan 1, 1923;
  - Titles published in their entirety outside of the United States prior to Jan 1, 1877.
  This material falls under the Creative Commons Public Domain Mark and is free of known copyright restrictions.
- Material identified by the Wellcome Trust as an Orphan Work following a diligent search to ascertain the rights holder. This material is made available with a Creative Commons Attribution-NonCommercial 4.0 International License per the MOU.
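The three clearance categories can be read as a small decision rule. The following is a minimal sketch in Python, not part of the project's actual tooling: the names `Clearance`, `is_public_domain`, and `clearance_category` are hypothetical, and the order in which the categories are checked is an assumption made for illustration. "Published in their entirety" is modelled here as the date of the title's last issue.

```python
from datetime import date
from enum import Enum
from typing import Optional


class Clearance(Enum):
    """Hypothetical labels for the three categories described above."""
    PUBLISHER_PERMISSION = "Creative Commons license chosen by the publisher"
    PUBLIC_DOMAIN = "Creative Commons Public Domain Mark"
    ORPHAN_WORK = "Creative Commons Attribution-NonCommercial 4.0 International"


# Cutoff dates taken from the text above.
US_CUTOFF = date(1923, 1, 1)
NON_US_CUTOFF = date(1877, 1, 1)


def is_public_domain(last_issue_date: date, published_in_us: bool) -> bool:
    """True if the whole title was published before the relevant cutoff."""
    cutoff = US_CUTOFF if published_in_us else NON_US_CUTOFF
    return last_issue_date < cutoff


def clearance_category(
    last_issue_date: date,
    published_in_us: bool,
    publisher_permission: bool = False,
    orphan_after_search: bool = False,
) -> Optional[Clearance]:
    """Return the clearance category, or None if the title fits none of them.

    Assumption: public-domain status is checked first, then publisher
    permission, then orphan-work status. The source does not state an order.
    """
    if is_public_domain(last_issue_date, published_in_us):
        return Clearance.PUBLIC_DOMAIN
    if publisher_permission:
        return Clearance.PUBLISHER_PERMISSION
    if orphan_after_search:
        return Clearance.ORPHAN_WORK
    return None
```

For example, under these assumptions a title whose last issue appeared in the United States on December 31, 1922 would be classed as public domain, while one whose last issue appeared on January 1, 1923 would need publisher permission or orphan-work status to be included.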
For each article, available issue cover, available table of contents, and the administrative material of an issue, the following output is produced:
- JATS v1.1 XML metadata record
- 400-dpi 24-bit color LZW TIFF images of all pages
- PDF/A-2b of the article
- OCR text (unedited) of the full article. OCR text for some of these journals is available for bulk download via the Historical OCR Collection described on the PMC Article Datasets page.
Advertising in issues is also being captured with the above-listed output except the OCR text.
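As an illustration only, the set of outputs per item type can be written down as a small lookup. This sketch assumes hypothetical names (`BASE_OUTPUTS`, `ITEM_TYPES`, `expected_outputs`) and does not reflect any file-naming or packaging convention used by NLM, which the text above does not describe.

```python
from typing import FrozenSet

OCR_TEXT = "unedited OCR text"

# The four outputs listed above for each captured item.
BASE_OUTPUTS: FrozenSet[str] = frozenset({
    "JATS v1.1 XML metadata record",
    "400-dpi 24-bit color LZW TIFF page images",
    "PDF/A-2b",
    OCR_TEXT,
})

# Item types named in the text; the string labels are illustrative.
ITEM_TYPES = frozenset({
    "article",
    "issue cover",
    "table of contents",
    "administrative material",
    "advertising",
})


def expected_outputs(item_type: str) -> FrozenSet[str]:
    """Return the outputs expected for one captured item."""
    if item_type not in ITEM_TYPES:
        raise ValueError(f"unknown item type: {item_type!r}")
    if item_type == "advertising":
        # Advertising gets every output except the OCR text.
        return BASE_OUTPUTS - {OCR_TEXT}
    return BASE_OUTPUTS
```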
Defining and Identifying Orphan Works
The Wellcome Trust has chosen to include in this project certain journals that it has determined to be Orphan Works, as described below. Per the terms of the memorandum of understanding, articles from the orphan works will appear in PMC under the Creative Commons Attribution-NonCommercial 4.0 International License.
Using the definition created by the European Commission, a work shall be considered an orphan work if:
- none of the rights holders in that work is identified, or
- even if one or more of them is identified, none is located despite a diligent search for the rights holders having been carried out and recorded.
For this project, the Wellcome Library has made the assumption that the publisher owns the rights to all the content it has published. No attempt has been made to trace individual authors (of articles) or any third parties who may have content embedded within the journal.
Process of Diligent Search
- Using the NLM Locator Plus service, staff at the Wellcome Library identified the full name of the journal and all of its variants, and recorded these details along with the name(s) of the publisher.
- Wellcome Library staff then searched Ulrich's Periodicals Directory to try to identify the contact details of the publisher. Where contact details were found, these were recorded. When using Ulrich's, searching began with the journal name that was last used while it was being published. Only when this failed were "previously known" journal titles used.
- If the journal was not listed in Ulrich's, the search was expanded to a more general Internet search using Google. Where contact details were found, these were recorded.
- If a publisher address could not be found in Ulrich's or on the Internet, the Wellcome Library contacted the Publishers Licensing Society (PLS) in the UK to see if it had any contact details for the publisher. Where contact details were found, these were recorded.
- In cases where a search of Ulrich's, the Internet, and the PLS failed to identify an address for the publisher, these works were considered orphans.
- In all cases where the address of the rights holder was found, the Wellcome Library contacted them to seek permission to digitize. In cases where a publisher did not then give permission, these works were deemed out of scope.
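The search steps above amount to trying a fixed sequence of sources and classifying the journal by the result. The sketch below is a hypothetical model of that flow, not software used by the Wellcome Library. The names `Outcome`, `find_publisher_contact`, and `diligent_search` are invented for illustration, each source is represented by a caller-supplied lookup function, and trying earlier titles in every source (not only in Ulrich's) is an assumption.

```python
from enum import Enum
from typing import Callable, Optional, Sequence, Tuple

# A lookup takes a journal title and returns contact details, or None.
Lookup = Callable[[str], Optional[str]]


class Outcome(Enum):
    ORPHAN_WORK = "orphan work"
    PERMISSION_GRANTED = "permission granted"
    OUT_OF_SCOPE = "out of scope"


def find_publisher_contact(
    titles_latest_first: Sequence[str],
    sources: Sequence[Tuple[str, Lookup]],
) -> Optional[Tuple[str, str]]:
    """Try each source in order; return (source name, contact) or None.

    Within a source, the most recent title is tried first and earlier
    titles only if that fails.
    """
    for source_name, lookup in sources:
        for title in titles_latest_first:
            contact = lookup(title)
            if contact:
                return source_name, contact
    return None


def diligent_search(
    titles_latest_first: Sequence[str],
    sources: Sequence[Tuple[str, Lookup]],
    ask_permission: Callable[[str], bool],
) -> Outcome:
    """Classify a journal from the result of the search."""
    found = find_publisher_contact(titles_latest_first, sources)
    if found is None:
        # No address in any source: the work is treated as an orphan.
        return Outcome.ORPHAN_WORK
    _, contact = found
    if ask_permission(contact):
        return Outcome.PERMISSION_GRANTED
    # A publisher was located but did not give permission.
    return Outcome.OUT_OF_SCOPE
```

A caller would pass the sources in the order described above (Ulrich's, then a general Internet search, then the PLS), for instance as `[("Ulrich's", ulrichs_lookup), ("Internet", web_lookup), ("PLS", pls_lookup)]`, where the three lookup functions are placeholders the caller supplies.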
Journal Backfiles Digitization Project (2004-2010)
A number of journals that joined PMC prior to 2008 have benefited from NLM's back issue digitization project, offered to publishers whose archival content was not yet available in electronic form. By scanning back issues that were available only in print, NLM has helped create a complete digital archive of these journals in PMC.
- The full cost of scanning the back issues and creating the related OCR and XML files was covered by NLM and, in some cases, the Wellcome Trust and the U.K. Joint Information Systems Committee (JISC). See the announcement of the collaboration between NLM, the Wellcome Trust and JISC. The Wellcome Trust site has information about the journals sponsored by Wellcome and JISC.
- Participating journals have given NLM permanent rights to archive the scanned material and make it freely available to the public through PMC, subject to normal ‘fair use’ provisions of copyright law. In return, NLM offers to provide the publisher with a complete electronic copy of its material, at no cost. As with existing content in PMC, copyright for the scanned material remains with the publisher or with individual authors, as applicable.
- NLM scanned back to the first issue of each journal. Each issue was scanned cover to cover, with pages scanned at resolutions ranging from 300 dpi to 600 dpi, depending on the nature of the source material. A PDF file was created for every article or other discrete item in an issue. Grayscale and color graphics in an article were reproduced in the PDF file as true representations of the original pages.
- OCR text, of sufficient quality to build indexes for full text searching and to use for other background processing, was generated automatically from the scanned images. There was no manual correction of the OCR text to improve its accuracy, and PMC users do not have direct access to the OCR text.
- An XML record was created for the citation and abstract of any scanned article that is not already listed in NLM's PubMed abstracts database, and these abstracts were added to PubMed.
- For complete technical details, see the NLM Image Specifications and Functional Requirements for Citation Capture [PDF—750K].