FTP Service

Base URL: ftp://ftp.ncbi.nlm.nih.gov/pub/pmc

The PMC FTP Service provides access to:

The FTP service also allows users to cross-reference PMC articles with identifiers such as PubMed IDs, DOIs, and Manuscript IDs.

Open Access Subset FTP Clean Up

As of March 18, 2019, PMC will no longer provide bulk packages of Open Access (OA) Subset text and XML at the top-level directory of the FTP Service. These files have been superseded by the Commercial Use and Non-Commercial Use bulk packages located in the oa_bulk subdirectory. Read the complete announcement.

Please note the following:

  • After a series of experiments using FTP clients with NCBI's FTP server, we've found that the configuration of FTP clients can seriously affect performance. NCBI recommends setting the TCP buffer size to 32Mb. For more information, please see ftp://ftp.ncbi.nlm.nih.gov/README.ftp.
  • To access the complete OA Subset you will need to use the Commercial Use and Non-Commercial Use Collections. These collections complement each other, rather than duplicating files.
  • To prevent any one FTP folder from holding thousands of files, the .tar.gz and .pdf files in the oa_package and oa_pdf directories are distributed randomly in a two-level-deep directory structure. There are two ways to locate a specific article on the FTP site:
    • Use one of the file index lists described below.
    • Use the OA web service, which provides an API to locate articles on the FTP site by PMCID, or by an update date range.
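As a minimal sketch of what happens once an article's location is known, the Python below joins a path taken from an index file to the base URL given at the top of this page and fetches the file. It assumes that index paths are relative to that base URL. The names package_url and download are hypothetical helpers, not part of any NCBI tool.

```python
import shutil
import urllib.request

# Base URL as stated at the top of this page.
BASE_URL = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc"


def package_url(relative_path: str, base_url: str = BASE_URL) -> str:
    """Join an index-file path to the base URL.

    Assumption: paths in the index files are relative to BASE_URL.
    """
    return base_url.rstrip("/") + "/" + relative_path.lstrip("/")


def download(relative_path: str, dest: str) -> None:
    """Fetch one file over FTP and save it to dest (performs network access)."""
    url = package_url(relative_path)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)


if __name__ == "__main__":
    # Path copied from the .txt index example on this page.
    print(package_url("oa_package/66/8b/PMC555938.tar.gz"))
```

The two directory levels (66/8b in this example) carry no meaning a client could compute from the PMCID, so the path has to come from an index file or the OA web service rather than being guessed.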

If you have questions or comments about the FTP service, please write to the PMC help desk. Further information on retrieving full text and other common developer queries can be found on the Developer Resources page.

Index Files for the PMC Open Access Subset

The FTP site includes six index files to assist with locating an open access article on the FTP site. Search these index files for either a PMC accession number (PMCID) or a PubMed ID (PMID). The matching entry will point you to the specific FTP directory and file name for the article.

.txt Index Files

Filename | Location | Content of Index File
oa_file_list.txt | ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_file_list.txt | Complete Open Access Subset
oa_comm_use_file_list.txt | ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_comm_use_file_list.txt | Commercial Use Collection (i.e., Open Access Subset articles with a machine-readable CC BY or CC0 license)
oa_non_comm_use_pdf.txt | ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_non_comm_use_pdf.txt | Non-Commercial Use PDF Collection (i.e., Open Access Subset articles with machine-readable non-commercial use licenses that have PDFs)

The first line of each .txt index file gives the date and time at which it was last generated. Every subsequent line contains information about one article in PMC.

Each of these lines is divided into 5 fields, delimited by tab characters. For example:

oa_package/66/8b/PMC555938.tar.gz	BMC Bioinformatics. 2005 Mar 7; 6:44	PMC555938	PMID:15748298	CC BY

The 5 fields are:

  • The fully qualified name of the .tar.gz file for an article
  • The article citation, comprising the journal title abbreviation, publication date, volume, issue, and the page range or elocation ID
  • PMC accession number (PMCID)
  • PubMed ID (PMID)
  • License type*

* The field value for “license type” can be any of the standard Creative Commons license variants (e.g., CC BY; CC BY-NC; CC BY-NC-ND) or “NO-CC CODE”. “NO-CC CODE” appears when the license is missing, has custom terms (i.e., not a Creative Commons license), or is not machine decodable.
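The 5-field layout above can be read with a few lines of Python. This is a sketch rather than an official parser: the TxtIndexEntry type and both function names are hypothetical, and stripping the "PMID:" prefix is an assumption based on the single example line shown above.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class TxtIndexEntry(NamedTuple):
    path: str       # name of the .tar.gz file for the article
    citation: str   # journal abbreviation, date, volume, issue, pages
    pmcid: str      # e.g. "PMC555938"
    pmid: str       # digits only; the "PMID:" prefix is stripped
    license: str    # e.g. "CC BY", "CC0", or "NO-CC CODE"


def parse_txt_index(lines: Iterable[str]) -> Iterator[TxtIndexEntry]:
    """Yield one entry per article line of a .txt index file."""
    it = iter(lines)
    next(it, None)  # the first line holds the generation date and time
    for line in it:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        path, citation, pmcid, pmid, license_type = fields
        if pmid.startswith("PMID:"):
            pmid = pmid[len("PMID:"):]
        yield TxtIndexEntry(path, citation, pmcid, pmid, license_type)


def find_entry(lines: Iterable[str], identifier: str) -> Optional[TxtIndexEntry]:
    """Return the first entry whose PMCID or PMID equals identifier, else None."""
    wanted = identifier.strip()
    if wanted.upper().startswith("PMID:"):
        wanted = wanted[len("PMID:"):]
    for entry in parse_txt_index(lines):
        if wanted in (entry.pmcid, entry.pmid):
            return entry
    return None
```

With a downloaded copy of an index file, a call such as find_entry(open("oa_file_list.txt", encoding="utf-8"), "PMC555938") returns the matching entry, whose path field gives the article's location on the FTP site.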

.csv Index Files

Filename | Location | Content of Index File
oa_file_list.csv | ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_file_list.csv | Complete Open Access Subset
oa_comm_use_file_list.csv | ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_comm_use_file_list.csv | Commercial Use Collection (i.e., articles with a machine-readable CC BY or CC0 license)
oa_non_comm_use_pdf.csv | ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_non_comm_use_pdf.csv | Non-Commercial Use PDF Collection (i.e., articles with machine-readable non-commercial use licenses that have PDFs)

Metadata fields are the same as for the .txt index files, except that they are separated by commas and include an additional timestamp indicating the last update to the article in PMC. The timestamp appears before the PMID. For example:

oa_package/d2/6d/PMC2137107.tar.gz,Environ Health Perspect. 2007 Dec; 115(12):A580a,PMC2137107,2014-05-16 12:59:15,18087575,CC0
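The .csv layout can be handled the same way with Python's csv module. This sketch assumes six fields per row and a timestamp in the form shown in the example above. This page does not say whether the .csv files begin with a header or date line, so rows whose fourth field is not a timestamp are skipped. The CsvIndexEntry type and parse_csv_index function are hypothetical names.

```python
import csv
from datetime import datetime
from typing import Iterable, Iterator, NamedTuple

# Timestamp format inferred from the example row "2014-05-16 12:59:15".
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


class CsvIndexEntry(NamedTuple):
    path: str
    citation: str
    pmcid: str
    last_updated: datetime
    pmid: str
    license: str


def parse_csv_index(lines: Iterable[str]) -> Iterator[CsvIndexEntry]:
    """Yield one entry per well-formed row of a .csv index file."""
    for row in csv.reader(lines):
        if len(row) != 6:
            continue  # blank or malformed row
        path, citation, pmcid, updated, pmid, license_type = row
        try:
            last_updated = datetime.strptime(updated, TIMESTAMP_FORMAT)
        except ValueError:
            continue  # probably a header line; not an article row
        yield CsvIndexEntry(path, citation, pmcid, last_updated, pmid, license_type)
```

Because the timestamp is parsed into a datetime, entries can be filtered by update date, for example [e for e in parse_csv_index(f) if e.last_updated >= datetime(2018, 1, 1)].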

Directories and File Formats

The directories available via the FTP service include:

Directory | Contents | Format
oa_package | Open access individual article packages | .tar.gz including:
  • A .nxml file, which is XML for the full text of the article, encoded in the NLM/JATS DTD
  • Image files from the article, and graphics for display versions of mathematical equations or chemical schemes
  • Supplementary data, such as background research data or videos
  • PDF, if available
  • Converted video files, in a number of formats, suitable for streaming on the web. These files have the suffix "-pmcvs_normal" to distinguish them from original, publisher-supplied files
oa_pdf | Open access individual article PDFs available for non-commercial use* | .pdf (same PDF as found in the oa_package .tar.gz file)
oa_bulk | Open access bulk article packages for either XML or extracted text. Divided into two collections, Commercial Use and Non-Commercial Use. | .tar.gz
  • Index files in .txt and .csv format. Index files include:
    • File name
    • PMCID
    • PMID
    • Manuscript ID (MID)
  • XML or extracted text of author manuscripts collected under a funding agency’s public access policy
.tar.gz including:
  • Files ending in .xml.tar.gz include nxml files, which are the XML for the full text of each article, encoded in the NLM/JATS DTD
  • Files ending in .txt.tar.gz include text files, which contain the extracted text of each article
historical_ocr | Extracted text of
  • select OCR texts from Journals Backfiles Digitization project (2004-2010) and
  • all OCR texts from Biomedical Journal Digitization project (2014-present)

* To access PDFs that allow commercial use, use the oa_package directory to download articles you confirm are part of the Commercial Use Collection.
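To illustrate the oa_package contents listed above, the following sketch groups the members of a downloaded article package by kind using Python's tarfile module. The grouping rules (file extension, and the "-pmcvs_normal" suffix for converted video) are inferred from the descriptions on this page, and the function names are hypothetical.

```python
import io
import tarfile
from typing import BinaryIO, Dict, List


def summarize_package(fileobj: BinaryIO) -> Dict[str, List[str]]:
    """Group the regular files in an article .tar.gz package by kind."""
    groups: Dict[str, List[str]] = {
        "xml": [], "pdf": [], "converted_video": [], "other": [],
    }
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # ignore directories and links
            name = member.name.lower()
            if name.endswith(".nxml"):
                groups["xml"].append(member.name)
            elif name.endswith(".pdf"):
                groups["pdf"].append(member.name)
            elif "-pmcvs_normal" in name:
                groups["converted_video"].append(member.name)
            else:
                # images, equation graphics, supplementary data, original videos
                groups["other"].append(member.name)
    return groups


def summarize_package_bytes(data: bytes) -> Dict[str, List[str]]:
    """Same as summarize_package, for a package already held in memory."""
    return summarize_package(io.BytesIO(data))
```

For a package saved to disk, open it in binary mode and pass the file object: with open("PMC555938.tar.gz", "rb") as f: summarize_package(f). An empty "pdf" list would mean the package carries no PDF, matching the "PDF, if available" note above.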

Last updated: Fri, 4 Jan 2019