FTP Service

The PMC File Transfer Protocol (FTP) Service supports usage of the PMC Article Datasets with the following services:

Bulk download

Individual article download

  • Available for: PMC Open Access Subset only
  • Packages include: XML, plain text, PDF, media files and supplementary materials for a single article

PDF download

  • Available for: PMC Open Access Subset only
  • Individual PDFs of articles: only available for non-commercial use licensed articles

PMC ID Cross-referencing

  • Cross reference any PMC article ID with identifiers such as PubMed IDs, DOIs, and Author Manuscript IDs
  • File: PMC-ids.csv.gz, a file in the top-level FTP directory

Base FTP URL: https://ftp.ncbi.nlm.nih.gov/pub/pmc
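
As an illustration of the cross-referencing use case, the following Python sketch builds a lookup table from the contents of PMC-ids.csv.gz. It assumes the file is a gzip-compressed CSV with a header row; the column names used here ("PMCID", "PMID", "DOI") and the sample values are placeholders for illustration, so check them against the header of the real file before relying on them.

```python
import csv
import gzip
import io

# Built from the base URL and file name given above.
PMC_IDS_URL = "https://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz"


def build_id_index(gz_bytes, key_column="PMCID"):
    """Map each value of key_column to its full row (column name -> value).

    key_column is an assumed header name; adjust it to the real file.
    """
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle)
        return {row[key_column]: row for row in reader if row.get(key_column)}


# Tiny in-memory stand-in for the real file (placeholder values only).
sample_csv = "DOI,PMCID,PMID\n10.0000/placeholder,PMC0000001,00000001\n"
sample_gz = gzip.compress(sample_csv.encode("utf-8"))
index = build_id_index(sample_gz)
```

To use the real file, download the bytes from PMC_IDS_URL (for example with urllib.request.urlopen) and pass them to build_id_index. The full file is large, so a streaming approach may be preferable to loading it all into a dictionary.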

Bulk Download Updates (September 2021)

In September 2021 PMC released new bulk download directory structures and packages to our FTP service for two datasets: the PMC Open Access (OA) Subset and the Author Manuscript Dataset. The previous bulk download structure will remain in place during a transition period; it will be moved in November 2021 and deleted in March 2022.

Learn more about the updated FTP structures

*Tip* NCBI recommends setting the TCP buffer size to 32Mb for best performance. NCBI supports secure FTP via SFTP. For more information, please see https://ftp.ncbi.nlm.nih.gov/README.ftp.

If you have questions or comments about the PMC FTP Service, please write to the PMC help desk. Further information on retrieving full text and other common developer queries can be found on the Developer Resources page.

Bulk Download

If you are interested only in the metadata and text of articles or author manuscripts, bulk download is likely the best fit. Bulk packages group hundreds of thousands of articles in XML or plain text format into compressed packages (Note: the Historical OCR Dataset is available only in plain text format). If you also need media files, supplementary materials, or PDFs, please see the sections on Individual Article Download and PDF Download.

Details on the Update to Bulk Downloads (September 2021)

In September 2021 PMC released the following new bulk download directory structures and packages to our FTP service for two datasets: the PMC Open Access (OA) Subset and the Author Manuscript Dataset:

  • baseline packages that contain all articles available in PMC as of the baseline date for each respective dataset or grouping; and
  • daily incremental packages that only contain articles that are new to the dataset or that have been updated since the baseline or previous incremental package was created.

In November 2021 the previous packages will be moved to a new temporary location so that users who have automated workflows will be alerted to the coming change and can patch their workflows while they transition to using the new bulk download packages or move to the new cloud-based access to these datasets. The previous bulk download packages and their directories will be deleted in March 2022.

Baseline Packages Update Schedule

New baseline packages will be created at least two times per year. Previous baseline and incremental packages and the accompanying file lists will be deleted whenever a new baseline is created.

New baselines will be created:

  • mid-June
  • mid-December
  • as needed*

*PMC is sometimes required to suppress an article from public view for legal reasons if the case involves a legal injunction or a breach of patient privacy. In such cases, a new set of baseline packages will be created for the impacted dataset. This is not a frequent occurrence.
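
Because older baseline and incremental packages are deleted whenever a new baseline appears, a mirroring script needs to decide which packages in a directory listing are current. The Python sketch below is one possible approach, not an official tool. It assumes the file name pattern shown under Sample Bulk File Names, and it assumes that only incremental packages dated after the newest baseline need to be applied on top of it.

```python
import re
from datetime import date

# Pattern inferred from the sample bulk file names in this document.
PACKAGE_RE = re.compile(
    r"^(?P<prefix>[a-z_]+)\."
    r"(?:(?P<range>PMC\d{3}X{6})\.)?"
    r"(?P<kind>baseline|incr)\."
    r"(?P<date>\d{4}-\d{2}-\d{2})\.tar\.gz$"
)


def current_packages(file_names):
    """Return (baseline packages of the newest baseline date,
    incremental packages dated after that baseline, oldest first)."""
    parsed = []
    for name in file_names:
        match = PACKAGE_RE.match(name)
        if match:  # file lists and unrelated files are skipped
            parsed.append((match["kind"], date.fromisoformat(match["date"]), name))

    baseline_dates = [day for kind, day, _ in parsed if kind == "baseline"]
    if not baseline_dates:
        return [], []
    newest = max(baseline_dates)

    baselines = sorted(
        name for kind, day, name in parsed if kind == "baseline" and day == newest
    )
    incrementals = [
        name
        for day, name in sorted(
            (day, name) for kind, day, name in parsed if kind == "incr" and day > newest
        )
    ]
    return baselines, incrementals
```

Given a listing that contains a 2021-09-16 baseline and incremental packages for 2021-09-15 and 2021-09-17, the function would keep the baseline and only the 2021-09-17 incremental.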

Directories Organized by Dataset, License Terms, and File Content Type

Bulk downloads are available on the FTP Service by dataset:

PMC Open Access Subset - Bulk
Author Manuscript Dataset - Bulk
Historical OCR Dataset - Bulk

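We have further divided the PMC Open Access Subset bulk packages into three groups based on available license terms:

  • Commercial use allowed (oa_comm)
  • Non-commercial use only (oa_noncomm)
  • Other: no explicitly tagged Creative Commons license (oa_other)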

To access the complete PMC OA Subset you will need to retrieve ALL of the OA Subset packages. These groups are complementary rather than duplicative.

Each of these datasets or groupings is divided into separate directories by file content type: XML (xml/) and plain text (txt/). The baseline packages for each of these OA Subset groups and for the Author Manuscript Dataset are divided by PMCID range (e.g., PMC004XXXXXX) in order to keep package sizes reasonable.

The result is the following directory structure:

|_ historical_ocr/
|_ manuscript/
|___ txt/
|___ xml/
|_ oa_bulk/
|___ oa_comm/
|_____ txt/
|_____ xml/
|___ oa_noncomm/
|_____ txt/
|_____ xml/
|___ oa_other/
|_____ txt/
|_____ xml/
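
The directory tree above can be turned into download locations programmatically. The following Python sketch assumes that the tree is rooted at the base FTP URL given at the top of this page; the function name and the dataset keys are illustrative choices, not part of the service. The historical_ocr directory is left out because the tree shows no xml/ or txt/ subdirectories for it.

```python
BASE_URL = "https://ftp.ncbi.nlm.nih.gov/pub/pmc"

# Dataset key -> directory relative to BASE_URL, following the tree above.
DATASET_DIRS = {
    "oa_comm": "oa_bulk/oa_comm",
    "oa_noncomm": "oa_bulk/oa_noncomm",
    "oa_other": "oa_bulk/oa_other",
    "manuscript": "manuscript",
}


def bulk_directory_url(dataset, content_type):
    """Return the directory URL for a dataset and a content type ("xml" or "txt")."""
    if dataset not in DATASET_DIRS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    if content_type not in ("xml", "txt"):
        raise ValueError(f"unknown content type: {content_type!r}")
    return f"{BASE_URL}/{DATASET_DIRS[dataset]}/{content_type}/"
```

For example, bulk_directory_url("oa_comm", "xml") points at the XML packages for the commercial use group.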

File Lists

There are CSV and TXT formatted file lists available for each package. The file lists have been updated to:

  • include a flag indicating whether an article has been retracted (yes = retracted, no = not retracted); and
  • bring the CSV and TXT file lists into sync (in the current production file lists, the CSV files had been updated with extra fields but the TXT files had not).

Note: Author manuscripts have different metadata information available than PMC OA Subset articles, so do not assume the same structure for the file lists for these two different datasets.

Sample Bulk File Names

  • Baseline file list: oa_comm_xml.PMC003XXXXXX.baseline.2021-09-16.filelist.csv
  • Baseline: oa_comm_xml.PMC003XXXXXX.baseline.2021-09-16.tar.gz
  • Incremental file list: oa_comm_xml.incr.2021-09-17.filelist.csv
  • Incremental update: oa_comm_xml.incr.2021-09-17.tar.gz

In each of the sample file names above you can substitute various parts to get to the files you want, e.g.

  • Replace oa_comm with oa_noncomm to get PMC OA Subset non-commercial use articles, or with oa_other to get PMC OA Subset articles without explicitly tagged Creative Commons licenses. Replace it with author_manuscript to get author manuscripts.
  • Replace _xml with _txt to get plain text files vs. XML files
  • Replace baseline with incr to switch from a baseline file to one of the daily incremental files; be sure to update the date and remove the PMC00#XXXXXX from the file name
  • Replace PMC003XXXXXX with PMC008XXXXXX in baseline file names to get the articles in the specified grouping with PMCIDs in the range from PMC8000000 to PMC8999999; to get all articles you must retrieve all the PMCID ranges
  • Replace the date (e.g. 2021-09-16) with the new baseline date if the baseline has been updated since this documentation was written; replace the date for incremental files with the date you want to retrieve
  • Replace .csv with .txt as the file extension for the file list to get a tab separated plain text version of the file list
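
The substitution rules above can be captured in a small helper. This Python sketch is one way to assemble the file names; the function and parameter names are hypothetical. It treats a package as a baseline when a PMCID range is supplied and as an incremental otherwise, and it derives the range label from a PMCID on the assumption that each range covers one million IDs, as in the PMC008XXXXXX example.

```python
def pmcid_range_label(pmcid):
    """Return the baseline range label for a PMCID, e.g. PMC8123456 -> PMC008XXXXXX."""
    digits = pmcid[3:] if pmcid.upper().startswith("PMC") else pmcid
    return f"PMC{int(digits) // 1_000_000:03d}XXXXXX"


def bulk_file_name(group, content_type, package_date, pmcid_range=None, file_list=None):
    """Build a bulk package or file list name.

    group        -- "oa_comm", "oa_noncomm", "oa_other", or "author_manuscript"
    content_type -- "xml" or "txt"
    package_date -- date string in YYYY-MM-DD form
    pmcid_range  -- range label such as "PMC003XXXXXX" for a baseline; None for incremental
    file_list    -- "csv" or "txt" to name the file list instead of the tar.gz package
    """
    kind = f"{pmcid_range}.baseline" if pmcid_range else "incr"
    suffix = f"filelist.{file_list}" if file_list else "tar.gz"
    return f"{group}_{content_type}.{kind}.{package_date}.{suffix}"
```

Called with the values from the samples, bulk_file_name("oa_comm", "xml", "2021-09-16", "PMC003XXXXXX") reproduces the baseline package name, and bulk_file_name("oa_comm", "xml", "2021-09-17", file_list="csv") reproduces the incremental file list name.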

Individual Article Download (PMC Open Access Subset Only)

PMC Open Access Subset Individual Article Packages

If you only want to download some of the PMC OA Subset based on search criteria or if you want to download complete packages for articles that include XML, PDF, figures and supplementary materials, you will need to use the individual article download packages. To keep directories from getting too large, the packages have been randomly distributed into a two-level-deep directory structure. You can use the file lists in CSV or txt format to search for the location of specific files or you can use the OA Web Service API. The file lists and OA Web Service API also provide basic article metadata.

  • Filenames: PMCXXXXXXX.tar.gz where the X's represent a specific PMCID
  • File lists: oa_file_list.csv or oa_file_list.txt (located one level up, in the top-level PMC FTP directory)

The first line of each file list is the timestamp the file was written. Subsequent rows contain metadata for each article.

Each row is divided into 6 metadata fields in the CSV file (5 in the TXT file), delimited by commas in the CSV file and by tabs in the TXT file. For example, a row in the TXT format:

oa_package/66/8b/PMC555938.tar.gz BMC Bioinformatics. 2005 Mar 7; 6:44 PMC555938 PMID15748298 CC BY

The fields in the files are:

  • The fully qualified name of the .tar.gz file for an article
  • The article citation, comprising the journal title abbreviation, publication date, volume, issue, and the page range or elocation ID
  • PMC accession number (PMCID)
  • Last updated timestamp (YYYY-MM-DD HH:MM:SS) (NOT INCLUDED in TXT files)
  • PubMed ID (PMID)
  • License type*

*The field value for “license type” can be any of the standard Creative Commons license variants (e.g., CC BY; CC BY-NC; CC BY-NC-ND) or “NO-CC CODE”. “NO-CC CODE” appears when the license is missing, has custom terms (i.e., not a Creative Commons license), or is not machine decodable.
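
The file list layout described above can be read with a few lines of Python. The sketch below is a minimal illustration under stated assumptions: the field names are labels invented here for readability, the first line is treated as the timestamp, and the package path in each row is assumed to be relative to the base FTP URL. The sample timestamp is a placeholder; the sample row is the TXT example from this section.

```python
import csv
import io

BASE_URL = "https://ftp.ncbi.nlm.nih.gov/pub/pmc"

# Illustrative field labels, in the order listed in this section.
CSV_FIELDS = ["path", "citation", "pmcid", "last_updated", "pmid", "license"]
TXT_FIELDS = [name for name in CSV_FIELDS if name != "last_updated"]


def parse_file_list(text, fmt="csv"):
    """Return (timestamp line, list of row dicts) for a CSV or TXT file list."""
    lines = io.StringIO(text, newline="")
    timestamp = lines.readline().strip()
    if fmt == "csv":
        fields = CSV_FIELDS
        records = csv.reader(lines)
    else:
        fields = TXT_FIELDS
        records = (line.rstrip("\r\n").split("\t") for line in lines)
    # Rows with an unexpected number of fields are skipped rather than guessed at.
    rows = [dict(zip(fields, record)) for record in records if len(record) == len(fields)]
    return timestamp, rows


def package_url(row):
    """Assumed download URL: the row's path appended to the base FTP URL."""
    return f"{BASE_URL}/{row['path']}"


sample_txt = (
    "2021-01-01 00:00:00\n"  # placeholder timestamp line
    "oa_package/66/8b/PMC555938.tar.gz\tBMC Bioinformatics. 2005 Mar 7; 6:44\t"
    "PMC555938\tPMID15748298\tCC BY\n"
)
timestamp, rows = parse_file_list(sample_txt, fmt="txt")
```

Filtering the resulting rows by the license field is one way to select, for instance, only the CC BY packages before downloading anything.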

PDF Download (PMC Open Access Subset Only)

PMC Open Access Subset PDF Files

Individual article PDF downloads are only available for non-commercial use licensed articles. To keep directories from getting too large, the article PDFs have been randomly distributed into a two-level-deep directory structure. You can use the oa_non_comm_use_pdf file lists in CSV or txt format to search for the location of specific files, or you can use the OA Web Service API. The file lists and OA Web Service API also provide basic article citation and license information, as well as the date the article was last updated in PMC.

  • Filenames: filename.PMCXXXXXXX.pdf where filename is the original name of the source file and the X's represent a specific PMCID
  • File lists: oa_non_comm_use_pdf.csv or oa_non_comm_use_pdf.txt (Located in the top level PMC FTP directory)
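
Since each PDF file name embeds the PMCID, the identifier can be recovered from the name alone. This short Python sketch assumes the filename.PMCXXXXXXX.pdf pattern described above; the function name and the example file name in the note below are hypothetical.

```python
import re

# Source file name, then ".PMC<digits>.pdf" at the end of the name.
PDF_NAME_RE = re.compile(r"^(?P<source>.+)\.(?P<pmcid>PMC\d+)\.pdf$")


def split_pdf_name(file_name):
    """Return (original source name, PMCID), or None if the name does not fit the pattern."""
    match = PDF_NAME_RE.match(file_name)
    if match is None:
        return None
    return match["source"], match["pmcid"]
```

For a hypothetical name such as example-article.PMC1234567.pdf, the function separates the source name from the PMCID; names that do not end in .PMC<digits>.pdf yield None.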

Last updated: Wed, 29 Jun 2021