FTP Service

The PMC FTP Service provides

  1. files, file lists, and bulk packages for the articles in the PMC Open Access Subset;
  2. resources to enable cross-referencing PMC articles with identifiers such as PubMed IDs, DOIs, Manuscript IDs, ISSN, etc.

The FTP service also provides access to the Author Manuscript Collection.

The base URL of the FTP site is ftp://ftp.ncbi.nlm.nih.gov/pub/pmc.

Suggested FTP client configuration

After a series of experiments using FTP clients with NCBI's FTP server, we've found that FTP client configuration can seriously affect performance. NCBI recommends setting the TCP buffer size to 32 MB. For more information, please see ftp://ftp.ncbi.nlm.nih.gov/README.ftp.

If you have questions or comments about this service, please write to the PMC help desk. Further information on retrieving full text and other common developer queries can be found on the Developer Resources page.

Files from the PMC Open Access Subset

This FTP service may be used to download the source files for any article in the PMC Open Access Subset. There are two file formats provided:

.tar.gz - these are archive files that include all of the source material for the article, including:

  • A .nxml file, which is XML for the full text of the article, encoded in the NLM/JATS DTD.

  • Image files from the article, and graphics for display versions of mathematical equations or chemical schemes.

  • Supplementary data, such as background research data or videos.

  • PDF, if available

  • Converted video files, in a number of formats, suitable for streaming on the web. These files have the suffix "-pmcvs_normal" to distinguish them from the original, publisher-supplied files.

.pdf - the PDF associated with the article (same as that in the .tar.gz file). Note that not every article has a PDF.
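As a rough illustration of the archive contents listed above, the following sketch groups the members of a .tar.gz file by kind using Python's standard tarfile module. The function names and the member names in the demo archive are hypothetical, and the check for "-pmcvs_normal" assumes the suffix appears somewhere in the file name; real archives may be laid out differently.

```python
import io
import tarfile


def classify_archive_members(fileobj):
    """Group the regular files in a gzipped tar archive by kind."""
    groups = {"xml": [], "pdf": [], "converted_video": [], "other": []}
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories and links
            name = member.name
            if "-pmcvs_normal" in name:
                # Assumption: converted videos carry this marker in the name.
                groups["converted_video"].append(name)
            elif name.endswith(".nxml"):
                groups["xml"].append(name)
            elif name.endswith(".pdf"):
                groups["pdf"].append(name)
            else:
                groups["other"].append(name)
    return groups


def build_demo_archive():
    """Build a tiny in-memory archive with made-up member names."""
    names = [
        "example/article.nxml",
        "example/article.pdf",
        "example/figure1.jpg",
        "example/movie1-pmcvs_normal.mp4",
    ]
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name in names:
            data = b"placeholder"
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(classify_archive_members(build_demo_archive()))
```

For a downloaded archive, the same function could be called with a file opened in binary mode, for example open("article.tar.gz", "rb"), where the file name is a placeholder.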

Because there are so many articles, and to prevent any one FTP folder from holding thousands of files, these .tar.gz and .pdf files are distributed randomly in a two-level-deep directory structure. For example, the files for the article PMC13901 are assigned to the folder b0/ac/. For any particular article, there are two ways to discover the location of its files: using one of the file lists (described next) or using the OA web service.

There are also a few Entrez search filters that might be useful for finding OA articles that have files of a particular type, such as "open access"[filter], "has pdf"[filter], and "oa full text xml"[filter].

These can be combined. So, for example, to find all OA subset articles that have PDF but do not have XML, you can search for "open access"[filter] AND "has pdf"[filter] NOT "oa full text xml"[filter].
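The combined search above can also be assembled programmatically. The sketch below builds the search term and a query URL; it assumes the NCBI E-utilities ESearch endpoint and the "pmc" database name, neither of which is described on this page, and the function names are hypothetical. It only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode

# Assumption: the NCBI E-utilities ESearch endpoint, queried against db=pmc.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_filter_query(include, exclude=()):
    """Combine Entrez filter names into a single search term."""
    term = " AND ".join(f'"{name}"[filter]' for name in include)
    for name in exclude:
        term += f' NOT "{name}"[filter]'
    return term


def build_esearch_url(term, retmax=20):
    """Return an ESearch URL for the term; retmax caps the number of IDs."""
    params = {"db": "pmc", "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    term = build_filter_query(
        ["open access", "has pdf"], exclude=["oa full text xml"]
    )
    print(term)
    print(build_esearch_url(term))
```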

Using the file lists

There are several files on the FTP site that provide lists of the available open access (OA) articles from PMC: file_list.txt, file_list.csv, file_list.pdf.txt, and file_list.pdf.csv. These lists provide indices in two different formats (.txt or .csv) of all available OA articles or only those OA articles that have PDFs.


The file_list.txt file can be retrieved from ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.txt. The first line of the file gives the date and time at which it was last generated. Every subsequent line contains information about one article in PMC. For example:

b7/8b/Open_Ophthalmol_J_2009_Jun_11_3_26-28.tar.gz  Open Ophthalmol J. 2009 Jun 11; 3:26-28  PMC2701320  PMID:19554218

This line is divided into four main fields, delimited by tab characters. Those fields are:

  1. The fully qualified name of the .tar.gz file for an article.

  2. The article citation, comprising:

    • journal title abbreviation
    • publication date
    • volume
    • issue
  3. The PMC accession number

  4. The PubMed ID (PMID)
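The four tab-delimited fields described above can be split apart with a few lines of code. This is a minimal sketch: the FileListEntry type and the function name are hypothetical, and the sample line is the one from this page with tab characters written explicitly.

```python
from typing import NamedTuple


class FileListEntry(NamedTuple):
    path: str      # location of the .tar.gz file
    citation: str  # journal abbreviation, date, volume, and so on
    pmcid: str     # PMC accession number, e.g. "PMC2701320"
    pmid: str      # PubMed ID with the "PMID:" prefix removed


def parse_file_list_line(line):
    """Parse one article line from file_list.txt (not the first, date line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        raise ValueError(f"expected 4 tab-delimited fields, got {len(fields)}")
    # Only the first four fields are used; any extra fields are ignored.
    path, citation, pmcid, pmid = fields[:4]
    if pmid.startswith("PMID:"):
        pmid = pmid[len("PMID:"):]
    return FileListEntry(path, citation, pmcid, pmid)


SAMPLE_LINE = (
    "b7/8b/Open_Ophthalmol_J_2009_Jun_11_3_26-28.tar.gz\t"
    "Open Ophthalmol J. 2009 Jun 11; 3:26-28\t"
    "PMC2701320\t"
    "PMID:19554218\n"
)

if __name__ == "__main__":
    print(parse_file_list_line(SAMPLE_LINE))
```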


The file_list.csv file can be retrieved from ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv.

The fields are the same as in file_list.txt, except that they are separated by commas and a timestamp is added, indicating the last update to the article in PMC. The timestamp appears before the PMID. For example:

8d/2f/Int_J_Health_Geogr_2003_Sep_25_2_7.tar.gz,Int J Health Geogr. 2003 Sep 25; 2:7,PMC222916,2013-03-20 10:04:22,14561226
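The comma-delimited variant can be read with Python's csv module, which also copes with quoted fields should a citation ever contain a comma. The sketch below is illustrative only: the function name and dictionary keys are hypothetical, and the timestamp format is inferred from the single sample line above. Rows that do not have five fields or a parseable timestamp (for example, a possible leading date or header line) are skipped.

```python
import csv
import io
from datetime import datetime

SAMPLE = (
    "8d/2f/Int_J_Health_Geogr_2003_Sep_25_2_7.tar.gz,"
    "Int J Health Geogr. 2003 Sep 25; 2:7,"
    "PMC222916,2013-03-20 10:04:22,14561226\n"
)


def read_file_list_csv(stream):
    """Yield one dict per article row of a file_list.csv-style stream."""
    for row in csv.reader(stream):
        if len(row) < 5:
            continue  # not an article row
        path, citation, pmcid, updated, pmid = row[:5]
        try:
            # Assumption: timestamps look like "2013-03-20 10:04:22".
            updated_at = datetime.strptime(updated, "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # e.g. a column header row
        yield {
            "path": path,
            "citation": citation,
            "pmcid": pmcid,
            "updated": updated_at,
            "pmid": pmid,
        }


if __name__ == "__main__":
    for entry in read_file_list_csv(io.StringIO(SAMPLE)):
        print(entry["pmcid"], entry["updated"].isoformat(), entry["path"])
```

Because each entry carries a timestamp, a local mirror could compare it against the time of its last download to decide which archives to fetch again.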


The file_list.pdf.txt file can be retrieved from ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.pdf.txt.

This is the same as the file_list.txt file, but it lists only those articles that have PDFs, and it gives the location of the PDF rather than the location of the .tar.gz file.


The file_list.pdf.csv file can be retrieved from ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.pdf.csv.

This is the same as file_list.pdf.txt, except that it uses commas as delimiters and adds the timestamp indicating the last update to the article in PMC.

To find an article from PMC on the FTP site

Copy the PMC accession number from the PubMed Central URL (for example, in the URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13901/, the accession number is PMC13901), and then search for that accession number in the file_list.txt file.

Alternatively, you can use this accession number with the OA web service, by retrieving information from, for example, https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=PMC13901.
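Both lookup routes can be sketched in a few lines. The example below searches file_list.txt-style lines for an accession number and builds the OA web service URL shown above. The function names are hypothetical, the first sample line stands in for the file's date line with a made-up value, and joining the listed path directly onto the base URL is an assumption about the site layout.

```python
BASE_URL = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc"
OA_SERVICE_URL = "https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi"

SAMPLE_LINES = [
    "2015-01-01 00:00:00\n",  # made-up stand-in for the date line
    "b7/8b/Open_Ophthalmol_J_2009_Jun_11_3_26-28.tar.gz\t"
    "Open Ophthalmol J. 2009 Jun 11; 3:26-28\tPMC2701320\tPMID:19554218\n",
]


def find_article_url(file_list_lines, pmcid):
    """Return the FTP URL of the article's archive, or None if not listed."""
    for line in file_list_lines:
        fields = line.rstrip("\n").split("\t")
        # Compare the whole third field so that, for example, PMC13901
        # does not match a longer accession that merely starts with it.
        if len(fields) >= 3 and fields[2] == pmcid:
            # Assumption: listed paths are relative to the base URL.
            return f"{BASE_URL}/{fields[0]}"
    return None


def oa_service_url(pmcid):
    """Return the OA web service URL for one accession number."""
    return f"{OA_SERVICE_URL}?id={pmcid}"


if __name__ == "__main__":
    print(find_article_url(SAMPLE_LINES, "PMC2701320"))
    print(oa_service_url("PMC13901"))
```

Since the file list is large, iterating over an open file handle line by line, as this function allows, avoids loading the whole list into memory.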

Bulk Packages of OA Articles

In addition to the per-article files described above, PMC also makes available gzipped archive files that contain just the XML, and others that contain just the extracted full text, for all of the articles in the PMC open access subset. In the extracted full text (.txt) files, the full text is extracted either from the XML files, or, for those articles that don't have XML, from the PDFs.

Users who do not need PDFs, images, or supplementary data can use these files in data mining and other types of processing. Users are directly and solely responsible for compliance with copyright restrictions and are expected to adhere to the terms and conditions defined by the copyright holder (see the PMC Copyright Notice). Note that these files are quite large (2 to 6 GB each).

The files that contain the XML of all of the articles are:

The archive files containing the extracted full text are:

These files are updated once per week, on Saturday.

Obtaining DOIs and PubMed IDs for articles in PMC

Use PMC-ids.csv.gz to look up the PMC accession number, PubMed ID, and DOI that correspond to an article in PMC.

PMC-ids.csv.gz is a comma-delimited file with the following fields:

  • Journal Title
  • ISSN
  • Electronic ISSN
  • Publication Year
  • Volume
  • Issue
  • Page
  • DOI (if available)
  • PMC accession number
  • PubMed ID (if available)
  • Manuscript ID (if available)
  • Release Date (Mmm DD YYYY or live)


The fields appear in this order:

Journal Title,ISSN,eISSN,Year,Volume,Issue,Page,DOI,PMC accession number,PMID,Manuscript Id,Release Date

Sample entries:
  • Mol Biol Cell,1059-1524,1939-4586,2000,11,6,2019, ,PMC14900,10848626, ,live
  • J Neurosci,0270-6474,1529-2401,2005,25,24,5740,10.1523/JNEUROSCI.0913-05.2005,PMC1201448,15958740,NIHMS3372,live
  • Cancer Res,0008-5472,1538-7445,2007,67,17,8022,10.1158/0008-5472.CAN-06-3749,PMC1986634,17804713,NIHMS25090,Sep 1 2008
  • Proc Natl Acad Sci U S A,0027-8424,1091-6490,2007,104,43,17075,10.1073/pnas.0707060104,PMC2040460,17940018, ,live
  • Cell Host Microbe,1931-3128,1934-6069,2007,2,6,404,10.1016/j.chom.2007.09.014,PMC2184509,18078692,NIHMS36164,live
  • Proc Natl Acad Sci U S A,0027-8424,1091-6490,2008,105,21,7382,10.1073/pnas.0711174105,PMC2396716,18495922, ,Nov 27 2008
  • PLoS Med,1549-1277,1549-1676,2008, ,Immediate Access,e168,10.1371/journal.pmed.0050168,PMC2494565,18684010, ,live


  • If a value is not available, the field contains a single space.
  • Articles that show a date in the Release Date field are under embargo and not yet available on the PMC public site.
  • When embargoed articles are released to the PMC public site, the Release Date field value changes to "live".
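The conventions above (a single space for missing values, and "live" versus a date in the Release Date field) can be handled while reading the file. The following is a sketch under stated assumptions: it assumes the file begins with the header line shown earlier, the function name and the added "embargoed" and "release_date" keys are hypothetical, and the date format is inferred from the sample entries.

```python
import csv
import io
from datetime import datetime

# Header line and two sample entries copied from this page.
SAMPLE = (
    "Journal Title,ISSN,eISSN,Year,Volume,Issue,Page,DOI,"
    "PMC accession number,PMID,Manuscript Id,Release Date\n"
    "Mol Biol Cell,1059-1524,1939-4586,2000,11,6,2019, ,"
    "PMC14900,10848626, ,live\n"
    "Cancer Res,0008-5472,1538-7445,2007,67,17,8022,"
    "10.1158/0008-5472.CAN-06-3749,PMC1986634,17804713,NIHMS25090,"
    "Sep 1 2008\n"
)


def read_pmc_ids(stream):
    """Yield one dict per article, with blank fields mapped to None."""
    for row in csv.DictReader(stream):
        record = {
            key: ((value or "").strip() or None)
            for key, value in row.items()
            if key is not None  # ignore any surplus columns
        }
        release = record.get("Release Date")
        embargoed = release is not None and release != "live"
        record["embargoed"] = embargoed
        # Assumption: embargo dates look like "Sep 1 2008" (Mmm DD YYYY).
        record["release_date"] = (
            datetime.strptime(release, "%b %d %Y").date() if embargoed else None
        )
        yield record


if __name__ == "__main__":
    for record in read_pmc_ids(io.StringIO(SAMPLE)):
        print(record["PMC accession number"], record["DOI"], record["embargoed"])
```

To read the compressed file directly, the stream could come from gzip.open("PMC-ids.csv.gz", "rt", newline="") instead of the in-memory sample.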

Another way to programmatically determine DOIs and PubMed IDs for articles in PMC is to use the ID converter API.

Last updated: Mon, 23 Mar 2015