In March 2020, in response to a call from the White House Office of Science and Technology Policy, more than 50 publishers agreed to openly share articles related to the COVID-19 pandemic via PMC (and through PMC International, Europe PMC). As a complement to this effort, and in recognition that many researchers were publishing their results rapidly via preprints, a project was launched in July 2020 to collect as many full-text preprints relating to COVID-19 as possible and make them available via Europe PMC alongside peer-reviewed COVID-19 articles. By various means, preprints in PDF format are retrieved from a number of preprint servers (including medRxiv, bioRxiv, arXiv, ChemRxiv, Research Square, and SSRN) and converted to JATS XML. The Europe PMC plus manuscript submission system is used to manage the processing of preprint conversions and communications with authors. To date, around 25,000 COVID-19 preprints have been made available in this way. In this paper we will share what we’ve learned about using JATS XML for preprints and how this aligns or deviates from using JATS XML for peer-reviewed articles.
Collecting COVID-19 preprints
Europe PMC is an open science platform developed by the European Bioinformatics Institute (EMBL-EBI). It is a partner of PubMed Central, and a repository of choice for many international science funders. In 2018, Europe PMC began indexing preprint abstracts and metadata, making them searchable alongside journal-published article abstracts and full text, and including them in workflows such as grant reporting, article citation, and credit and attribution. Content from preprint servers is chosen for indexing based on various criteria, including the server’s screening policy, unrestricted access to the full text, and accessibility of metadata.1 If a preprint has a subsequent journal publication, and that version is also indexed, the preprint and published article are linked on Europe PMC.
In March 2020, more than 50 publishers agreed to make all their COVID-19 full-text content freely available and accessible through PubMed Central and Europe PMC.2 In a complementary effort, Europe PMC launched a project to tag and index the full text of COVID-19 preprints using JATS XML. In addition to creating new processing workflows, this project required a new database and set of internal APIs for the storage and retrieval of preprint full text. After significant development work, we were able to begin processing, indexing, and displaying preprint full text on Europe PMC in July 2020.
Preprints are selected for inclusion in the COVID-19 full text set from a collection of all COVID-19 resources in Europe PMC, created using relevant search terms.3 We plan to create full text JATS XML for all preprints in the COVID-19 collection. The XML will be indexed for search on Europe PMC, and made available for display and text mining where licensing or author approval permits.
Scripts to retrieve manuscript and supplementary files and related metadata have to be created and checked for quality and completeness of data individually for each preprint server. Crossref metadata is used where possible, but not all preprint servers participate in Crossref, or provide all relevant metadata through it. Additional server-specific APIs need to be accessed to collect all relevant metadata, including all author and license information. To ensure the highest data quality, we chose to add servers for full text processing one at a time, beginning with the servers with the highest number of preprints in the COVID-19 collection.
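As a minimal sketch of the completeness check described above, the following Python function inspects a record shaped like the "message" object of a Crossref works response and reports which fields would still have to come from a server-specific API. The list of required fields and the sample record are assumptions made for illustration, not the checks Europe PMC actually runs.

```python
# Fields this sketch treats as required. The real list is an assumption.
REQUIRED_FIELDS = ("title", "author", "license")


def missing_fields(crossref_message: dict) -> list:
    """Return the required fields that are absent or empty in the record."""
    return [name for name in REQUIRED_FIELDS if not crossref_message.get(name)]


def needs_server_api(crossref_message: dict) -> bool:
    """True when Crossref alone does not supply every required field."""
    return bool(missing_fields(crossref_message))


# Made-up record; the DOI and names are placeholders, not real data.
record = {
    "DOI": "10.9999/example",
    "type": "posted-content",
    "title": ["An example preprint"],
    "author": [{"given": "A.", "family": "Author"}],
    # no "license" entry was deposited for this record
}

print(missing_fields(record))
print(needs_server_api(record))
```

A check like this would be run per server while its retrieval script is being developed, so that gaps in Crossref coverage are found before full text processing starts.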
Table 1. Preprint servers indexed
As of April 2021, Europe PMC indexes the following preprint servers:
| Preprint server | Metadata/abstracts indexed | COVID-19 full text indexed |
|---|---|---|
| AAS Open Research | ✓ | |
| AMRC Open Research | ✓ | |
| arXiv | | ✓ |
| Authorea Preprints | ✓ | |
| Beilstein Archives | ✓ | |
| BioHackrXiv | ✓ | |
| bioRxiv | ✓ | ✓ |
| ChemRxiv | ✓ | ✓ |
| Emerald Open Research | ✓ | |
| F1000 Research | ✓ | |
| Gates Open Research | ✓ | |
| HRB Open Research | ✓ | |
| medRxiv | ✓ | ✓ |
| MNI Open Research | ✓ | |
| Open Research Europe | ✓ | |
| PeerJ Preprints | ✓ | |
| Preprints.org | ✓ | |
| SSRN | ✓ | ✓ |
| Research Square | ✓ | ✓ |
| Wellcome Open Research | ✓ | |
Once all relevant metadata and files have been retrieved for a specific version of a preprint, a package is created and picked up for processing through Europe PMC plus, a manuscript submission system primarily used for the processing of author manuscripts.4 A basic, metadata-only XML file, using a subset of JATS XML, is provided as part of the package. Once the package is loaded to Europe PMC plus, full text XML is created and checked by external vendors, sent for indexing, and approved for display through the system.
Using JATS XML for preprint full text
Prior to the start of this project, the Europe PMC plus system was still processing XML in the NLM 3.0 DTD. This project presented an opportunity to complete an upgrade to JATS 1.2, in order to utilize tags introduced in JATS that would most accurately represent preprint information and allow for efficient processing. Our requirements for preprint tagging included the accurate representation of all preprint-specific information and the differentiation of preprints from other types of full text articles.
Preprint-specific tagging
In order to differentiate preprints from journal-published article full text, which can be of various article types,5 preprints tagged for the Europe PMC COVID-19 project use the @article-type value "preprint". This makes them easy to separate from other full text, and to collect together into a set.
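To illustrate how the @article-type value makes preprints easy to separate, here is a small Python sketch using the standard library XML parser. The two sample documents are made up and reduced to the bare minimum; only the attribute value "preprint" comes from the text.

```python
import xml.etree.ElementTree as ET

# Two minimal, made-up JATS documents for illustration only.
DOCS = [
    '<article article-type="preprint"><front/></article>',
    '<article article-type="research-article"><front/></article>',
]


def is_preprint(jats_xml: str) -> bool:
    """Return True when the root <article> carries article-type="preprint"."""
    root = ET.fromstring(jats_xml)
    return root.tag == "article" and root.get("article-type") == "preprint"


# Collect the preprints into their own set, as described in the text.
preprint_set = [doc for doc in DOCS if is_preprint(doc)]
print(len(preprint_set))
```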
We put careful thought into the best way to accurately represent preprint servers in the JATS front matter, and the use of <journal-meta>.6 Though preprints are a version of a publication, often ultimately of a journal-published article, preprint servers are generally not journals. When Europe PMC first began considering creating preprint full text, multiple options for representation were considered. However, since the beginning of the COVID-19 preprint full text project in 2020, the majority of preprint servers added to the Europe PMC full text collection have appeared in the NLM Catalog and been assigned a National Library of Medicine journal title abbreviation (NLMTA).7 Therefore, in the preprint XML, preprint servers have been tagged as journals, including those NLMTAs, with the expectation that the other elements of the tagging will sufficiently mark the preprint status of the document.
Each preprint version is tagged with a <pub-date date-type="preprint"> containing the date the version was posted to the preprint server. This is the suggested date-type in JATS for a preprint dissemination date.8
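A consumer of the XML could read the posting date back out along the following lines. This is a sketch only: the sample document and its date are invented, and it assumes the date is tagged with <day>, <month>, and <year> child elements, treating a missing day or month as 1.

```python
import datetime
import xml.etree.ElementTree as ET

# Made-up sample; the date is a placeholder.
SAMPLE = """
<article article-type="preprint">
  <front>
    <article-meta>
      <pub-date date-type="preprint">
        <day>15</day><month>07</month><year>2020</year>
      </pub-date>
    </article-meta>
  </front>
</article>
"""


def preprint_posted_date(jats_xml: str):
    """Return the date-type="preprint" date, or None if it is not tagged."""
    root = ET.fromstring(jats_xml)
    node = root.find(".//pub-date[@date-type='preprint']")
    if node is None or node.findtext("year") is None:
        return None
    year = int(node.findtext("year"))
    # Assumption of this sketch: fall back to 1 when month or day is absent.
    month = int(node.findtext("month", default="1"))
    day = int(node.findtext("day", default="1"))
    return datetime.date(year, month, day)


print(preprint_posted_date(SAMPLE))
```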
License tagging and related workflows
As part of the switch, system wide, to JATS 1.2, Europe PMC plus began using ALI-namespaced tags for open access licenses, as recommended by JATS For Reuse (JATS4R),9 and subsequently the JATS Tag Library and the PMC Tagging Guidelines.10, 11 The correct tagging and accurate machine-readability of open access licenses became critical in the creation of new processing workflows through Europe PMC plus—workflows created to require minimal intervention or effort on the part of preprint authors.
Once full text XML has been tagged for a preprint version, and the XML is processed through the Europe PMC plus system, any open access license available in the XML is programmatically identified and recorded in the system database. The preprint is then sent through different processing workflows depending on the type of license:
- For each preprint version released under any Creative Commons license,12 the corresponding author is contacted and offered the opportunity to preview the full text 14 days before the full text is automatically released for display and text mining. If the author reviews and approves the full text, it can be released immediately.
- If the preprint version has no identifiable open access license, the corresponding author is contacted and asked to approve the full text. In such cases the full text is indexed for search on Europe PMC, but cannot be released for display and text mining without author approval.
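The routing described above can be sketched in Python as follows. The workflow names, the URL test for recognising a Creative Commons license, and the sample documents are all assumptions of this sketch; the text does not say how other open access licenses are routed, so here anything not recognised as Creative Commons falls through to author approval.

```python
import xml.etree.ElementTree as ET

ALI_NS = "http://www.niso.org/schemas/ali/1.0/"
PREVIEW_DAYS = 14  # preview window stated in the text


def license_url(jats_xml: str):
    """Return the text of the first <ali:license_ref>, or None."""
    root = ET.fromstring(jats_xml)
    ref = root.find(f".//{{{ALI_NS}}}license_ref")
    if ref is None or not (ref.text or "").strip():
        return None
    return ref.text.strip()


def choose_workflow(jats_xml: str) -> dict:
    """Pick a processing workflow from the license found in the XML."""
    url = license_url(jats_xml)
    # Hypothetical test: treat any creativecommons.org URL as a CC license.
    if url and "creativecommons.org" in url:
        return {"workflow": "preview-then-auto-release",
                "auto_release_after_days": PREVIEW_DAYS}
    return {"workflow": "author-approval-required",
            "auto_release_after_days": None}


# Made-up sample documents.
CC_DOC = (
    '<article xmlns:ali="http://www.niso.org/schemas/ali/1.0/"><front>'
    "<article-meta><permissions><license>"
    "<ali:license_ref>https://creativecommons.org/licenses/by/4.0/</ali:license_ref>"
    "</license></permissions></article-meta></front></article>"
)
NO_LICENSE_DOC = "<article><front><article-meta/></front></article>"

print(choose_workflow(CC_DOC))
print(choose_workflow(NO_LICENSE_DOC))
```

Reading the license from the namespaced <ali:license_ref> element rather than from free text in <license-p> is what makes this kind of automatic routing practical.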
With appropriate licensing or author approval, all preprint full text can be displayed and made freely available for text mining. However, the overlay or display of information resulting from text mining is considered a derivative work. The collection and display of text mined terms is a useful feature Europe PMC provides for researchers through its website and APIs.13 Therefore, during the approval process, authors of preprints with no license or a "No Derivatives" Creative Commons license are invited to add a license allowing unrestricted reuse to the version of their preprint full text on Europe PMC.
Box 1. Europe PMC License
<license>
<ali:license_ref xmlns:ali="http://www.niso.org/schemas/ali/1.0/">https://europepmc.org/downloads/openaccess</ali:license_ref>
<license-p>This preprint is made available via the <ext-link ext-link-type="uri" xlink:href="https://europepmc.org/downloads/openaccess">Europe PMC open access subset</ext-link>, for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original preprint source.</license-p>
</license>
During the approval process, authors are given the option to pre-approve any future versions of the specific preprint full text under review. Choosing this option waives the following review invitations and any waiting period for release of future versions. The aim of this feature is to further reduce the effort required for preprint authors to release full text preprints in Europe PMC for display and text mining.
Preprint versions, removals, and withdrawals
Preprint servers commonly offer authors the opportunity to make edits to their preprint documents, and release subsequent versions of a preprint as changes are made. Europe PMC is committed to keeping up to date with the versions of each preprint in the COVID-19 full text project. Europe PMC and Europe PMC plus workflows and processes were carefully planned and adjusted to handle multiple versions of a single preprint.
Version tagging and linking
Preprint versions are assigned persistent identifiers in different ways by different preprint servers. The final number of IDs associated with a preprint and version is ultimately based on the number of IDs assigned by the preprint server. Europe PMC has identified two main ways preprint servers assign IDs, or register preprints with DOI registration agencies:
- All versions of the preprint are registered under the same DOI/identifier. The DOI or other record is updated whenever there is a new version; or
- Each version of the preprint is assigned its own unique DOI/identifier.
Each preprint or version with a registered identifier is assigned another preprint ID (PPRID) by Europe PMC once indexed. In the course of the COVID-19 full text project, Europe PMC plus creates an additional manuscript ID (EMSID) for the full text. Europe PMC and Europe PMC plus reflect the policy of each preprint server when indexing preprint metadata and when processing and indexing the full text:
- Versions that share a single DOI/identifier also share a single PPRID and EMSID, both of which are updated to reflect the latest content.
- Versions which each have a unique DOI/identifier each receive their own PPRID and EMSID. Europe PMC links these using an internally developed matching process, and only displays the most recent version in search results. Previous versions are linked to the record for the most recent version.
The DOI, PPRID, and EMSID for a preprint version are all recorded in the JATS XML as different types of <article-id>. Additionally, we use <article-version> to record the version number of the preprint.14 When grouped with other versions of the same preprint, the article version number indicates which version is most up to date and should be displayed.
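The following Python sketch shows one way a group of versions could be read and the most recent one selected. The pub-id-type values "pprid" and "emsid", the identifiers, and the assumption that <article-version> holds a plain integer are all placeholders for illustration; the text does not specify the values Europe PMC uses.

```python
import xml.etree.ElementTree as ET

# Template for made-up sample documents. "pprid" and "emsid" are
# hypothetical pub-id-type values chosen for this sketch.
TEMPLATE = """
<article article-type="preprint">
  <front>
    <article-meta>
      <article-id pub-id-type="doi">{doi}</article-id>
      <article-id pub-id-type="pprid">{pprid}</article-id>
      <article-id pub-id-type="emsid">{emsid}</article-id>
      <article-version>{version}</article-version>
    </article-meta>
  </front>
</article>
"""


def version_record(jats_xml: str) -> dict:
    """Collect every <article-id> by type, plus the version number."""
    root = ET.fromstring(jats_xml)
    ids = {node.get("pub-id-type"): (node.text or "").strip()
           for node in root.iter("article-id")}
    text = root.findtext(".//article-version", default="").strip()
    # Assumption: the version is a plain integer; otherwise it is unknown.
    version = int(text) if text.isdigit() else None
    return {"ids": ids, "version": version}


def latest_version(records: list) -> dict:
    """Return the record with the highest known version number."""
    known = [r for r in records if r["version"] is not None]
    return max(known, key=lambda r: r["version"])


records = [
    version_record(TEMPLATE.format(doi="10.9999/example.v1", pprid="PPR0001",
                                   emsid="EMS0001", version=1)),
    version_record(TEMPLATE.format(doi="10.9999/example.v2", pprid="PPR0002",
                                   emsid="EMS0002", version=2)),
]
print(latest_version(records)["ids"]["doi"])
```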
In Europe PMC plus, preprint version IDs and version numbers are stored in the database as well as in the JATS XML, and are used to ensure the latest version of a preprint is tagged. Versions which fall under the same ID are processed in sequence, while versions each with their own ID can be processed simultaneously. We ensure that versions of a preprint are linked so that an author's most recent selections for version pre-approval and for adding the Europe PMC license are applied to each.
Removals and withdrawals
Preprints are suppressed from preprint servers for a variety of reasons. In the course of the COVID-19 full text project, we have found that in some instances, a preprint withdrawal notification is presented as a new version of the preprint, but previous versions are still available to read under their own IDs. In others, the preprint is removed in its entirety.
EMBL-EBI participated in the creation of ASAPbio recommendations for building trust in preprints.15 The ASAPbio recommendations suggest two categories of preprint removal:
- Withdrawal, for situations in which the preprint content is still accessible, but there is a new notification explaining that the preprint has been withdrawn. Equivalent to the retraction of a journal-published article.
- Removal, for situations in which some or all of the preprint and its metadata have been removed, and are no longer accessible.
For the purposes of tagging the full text we suggest and have begun using the following JATS article types:
- @article-type="preprint-withdrawal", when both the withdrawal notification and previous versions of the preprint are available at the preprint URL.
- @article-type="preprint-removal", when only the removal notification remains.
When accessing the preprint URL leads to a 404 page, and not even a notification remains, neither article type is appropriate. In such cases we remove the record from Europe PMC as well.
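The decision rules above can be written down as a small Python function. The function and parameter names are hypothetical, and the inputs (HTTP status, whether a notice is present, whether earlier versions are still available) stand in for facts that are currently established by manual checking.

```python
from typing import Optional


def notice_article_type(http_status: int, has_notice: bool,
                        previous_versions_available: bool) -> Optional[str]:
    """Return the @article-type for a notice, or None if neither applies."""
    if http_status == 404 or not has_notice:
        # No notification remains: neither article type is appropriate.
        return None
    if previous_versions_available:
        return "preprint-withdrawal"
    return "preprint-removal"


def should_delete_record(http_status: int) -> bool:
    """A 404 at the preprint URL means the record is removed as well."""
    return http_status == 404


print(notice_article_type(200, True, True))
print(notice_article_type(200, True, False))
print(notice_article_type(404, False, False), should_delete_record(404))
```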
Currently, neither the difference between these types nor even the fact that the latest version of a preprint is actually a withdrawal is machine readable in the Crossref metadata for the preprint servers we index. Tagging notices with the appropriate article type requires parsing the full text content to discover that it contains a removal notice, then visiting the preprint server to discover whether it is a removal or a withdrawal according to these standards. It would be beneficial if preprint servers and Crossref were to adopt similar, machine-readable metadata standards for preprint withdrawals and removals, so that such updates to the scientific record could be caught more easily and handled automatically.
Preprint withdrawal or removal notices usually contain no more than one paragraph of content. The Europe PMC plus system flags versions of a preprint based on length for manual checking and tagging with the appropriate article-type. Once the article-type is correctly tagged, the notice full text is released for display and linking to previous versions on Europe PMC.
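A length-based flag of this kind might look like the sketch below. The text does not say how length is measured, so counting <p> elements inside <body> and the one-paragraph threshold are assumptions, as are the sample documents.

```python
import xml.etree.ElementTree as ET

MAX_NOTICE_PARAGRAPHS = 1  # hypothetical threshold for this sketch


def body_paragraph_count(jats_xml: str) -> int:
    """Count <p> elements anywhere inside <body>."""
    root = ET.fromstring(jats_xml)
    body = root.find("body")
    if body is None:
        return 0
    return sum(1 for _ in body.iter("p"))


def needs_manual_check(jats_xml: str) -> bool:
    """Flag very short full text as a possible withdrawal or removal notice."""
    return body_paragraph_count(jats_xml) <= MAX_NOTICE_PARAGRAPHS


# Made-up sample documents.
NOTICE = ('<article article-type="preprint"><body>'
          "<p>This preprint has been withdrawn by the authors.</p>"
          "</body></article>")
FULL = ('<article article-type="preprint"><body><sec>'
        "<p>Introduction.</p><p>Methods.</p><p>Results.</p>"
        "</sec></body></article>")

print(needs_manual_check(NOTICE))
print(needs_manual_check(FULL))
```

A flag like this only narrows the set of versions a person has to look at; the article type itself is still assigned after manual checking, as described above.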
Results and future direction
The goal of the project to create a freely available, JATS XML full-text corpus of COVID-19 preprints is to benefit the scientific community. The Europe PMC preprints subset has already been used in articles and analysis including:
- Schwab S, Held L. Science after Covid‐19: Faster, better, stronger? Significance. 2020 Aug;17(4):8-9. doi:10.1111/1740-9713.01415
- Ivanova Y, Karapeev G, Butler D, et al. Fluctuations in SDG relevant research output in response to COVID-19. Open Science Framework. 2020 Oct 29. Available from: https://osf.io/ea37y/
- Kirkham JJ, Penfold NC, Murphy F, et al. Systematic examination of preprint platforms for use in the medical and biomedical sciences setting. BMJ Open. 2020 Dec;10(12):e041849. doi:10.1136/bmjopen-2020-041849
We will continue to work closely with the preprints community on standards for preprint metadata and full text content. Standards for notifications for preprint withdrawals and removals, as well as other metadata events, such as the linking of peer reviews or comments to a preprint, are of particular relevance for the enrichment of our preprint data.
Funding
This work was supported by the Wellcome Trust [221558/Z/20/Z]; UK Medical Research Council (MRC) and Swiss National Science Foundation (SNSF).
References
- 1.
- 2.
- 3. Coronavirus articles and preprints search results. Europe PMC. [Accessed 15 April 2021]. Available from: https://bit.ly/2OSt5B5
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
- 11.
- 12.
- 13.
- 14.
- 15. Beck J, Ferguson CA, Funk K, et al. Building trust in preprints: recommendations for servers and other stakeholders. OSF Preprints. 2020 Jul 21. doi:10.31219/osf.io/8dn4w