Journal Article Tag Suite Conference (JATS-Con) Proceedings 2020/2021 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2021.

A full text collection of COVID-19 preprints in Europe PMC using JATS XML

In March 2020, in response to a call from the White House Office of Science and Technology Policy, more than 50 publishers agreed to openly share articles related to the COVID-19 pandemic via PMC (and, through PMC International, Europe PMC). As a complement to this effort, and in recognition that many researchers were publishing their results rapidly via preprints, a project was launched in July 2020 to collect as many full-text preprints relating to COVID-19 as possible and make them available via Europe PMC alongside peer-reviewed COVID-19 articles. By various means, preprints in PDF format are retrieved from a number of preprint servers (including medRxiv, bioRxiv, arXiv, ChemRxiv, Research Square, and SSRN) and converted to JATS XML. The Europe PMC plus manuscript submission system is used to manage the processing of preprint conversions and communications with authors. To date, around 25,000 COVID-19 preprints have been made available in this way. In this paper we share what we have learned about using JATS XML for preprints, and how this aligns with or deviates from using JATS XML for peer-reviewed articles.

Collecting COVID-19 preprints

Europe PMC is an open science platform developed by the European Bioinformatics Institute (EMBL-EBI). It is a partner of PubMed Central, and a repository of choice for many international science funders. In 2018, Europe PMC began indexing preprint abstracts and metadata, making them searchable alongside journal-published article abstracts and full text, and including them in workflows such as grant reporting, article citation, and credit and attribution. Content from preprint servers is chosen for indexing based on various criteria, including the server’s screening policy, unrestricted access to the full text, and accessibility of metadata.1 If a preprint has a subsequent journal publication, and that version is also indexed, the preprint and published article are linked on Europe PMC.

In March 2020, more than 50 publishers agreed to make all their COVID-19 full-text content freely available and accessible through PubMed Central and Europe PMC.2 In a complementary effort, Europe PMC launched a project to tag and index the full text of COVID-19 preprints using JATS XML. In addition to creating new processing workflows, this project required a new database and set of internal APIs for the storage and retrieval of preprint full text. After significant development work, we were able to begin processing, indexing, and displaying preprint full text on Europe PMC in July 2020.

Fig. 1. COVID-19 preprint abstract and full text indexed in Europe PMC, cumulative by publication date.

Preprints are selected for inclusion in the COVID-19 full text set from a collection of all COVID-19 resources in Europe PMC, created using relevant search terms.3 We plan to create full text JATS XML of all preprints in the COVID-19 collection. The XML will be indexed for search on Europe PMC, and made available for display and text mining where there is appropriate licensing or author approval.

Scripts to retrieve manuscript and supplementary files and related metadata have to be created and checked for quality and completeness of data individually for each preprint server. Crossref metadata is used where possible, but not all preprint servers participate in Crossref, or provide all relevant metadata through it. Additional server-specific APIs need to be accessed to collect all relevant metadata, including all author and license information. To ensure the highest data quality, we chose to add servers for full text processing one at a time, beginning with the servers with the highest number of preprints in the COVID-19 collection.
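
The completeness check described above can be pictured with a minimal sketch. It assumes a record shaped like a Crossref REST API work (fields such as `DOI`, `title`, `author`, `license`, and `posted`); the `REQUIRED_FIELDS` list, the `missing_fields` helper, and the sample DOI are hypothetical and are not taken from the Europe PMC codebase.

```python
# Hypothetical completeness check for a Crossref-style record.
# The list of required fields is an assumption made for illustration.
REQUIRED_FIELDS = ("DOI", "title", "author", "license", "posted")


def missing_fields(record):
    """Return the required fields that are absent or empty in the record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


sample = {
    "DOI": "10.1234/example.000001",  # made-up DOI
    "title": ["An example preprint"],
    "author": [{"given": "A.", "family": "Author"}],
    "posted": {"date-parts": [[2020, 7, 1]]},
    # No "license" entry: it would have to come from a server-specific API.
}

print(missing_fields(sample))
```

Any field reported as missing would be a candidate for retrieval from the server's own API before the package is built.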

Table 1. Preprint servers indexed

As of April 2021, Europe PMC indexes the following preprint servers:

Preprint server | Metadata/abstracts indexed | COVID-19 full text indexed
AAS Open Research
AMRC Open Research
arXiv
Authorea Preprints
Beilstein Archives
BioHackrXiv
bioRxiv
ChemRxiv
Emerald Open Research
F1000 Research
Gates Open Research
HRB Open Research
medRxiv
MNI Open Research
Open Research Europe
PeerJ Preprints
Preprints.org
SSRN
Research Square
Wellcome Open Research

Once all relevant metadata and files have been retrieved for a specific version of a preprint, a package is created and picked up for processing through Europe PMC plus, a manuscript submission system primarily used for the processing of author manuscripts.4 A basic, metadata-only XML file, using a subset of JATS XML, is provided as part of the package. Once the package is loaded to Europe PMC plus, full text XML is created and checked by external vendors, sent for indexing, and approved for display through the system.

Using JATS XML for preprint full text

Prior to the start of this project, the Europe PMC plus system was still processing XML in the NLM 3.0 DTD. This project presented an opportunity to complete an upgrade to JATS 1.2, in order to utilize tags introduced in JATS that would most accurately represent preprint information and allow for efficient processing. Our requirements for preprint tagging included the accurate representation of all preprint-specific information and the differentiation of preprints from other types of full text articles.

Preprint-specific tagging

In order to differentiate preprints from journal-published article full text, which can be of various article types,5 preprints tagged for the Europe PMC COVID-19 project use the @article-type value "preprint". This makes them easy to separate from other full text, and to collect together into a set.

We put careful thought into the best way to accurately represent preprint servers in the JATS front matter, and the use of <journal-meta>.6 Though preprints are a version of a publication, often ultimately of a journal-published article, preprint servers are generally not journals. When Europe PMC first began considering creating preprint full text, multiple options for representation were considered. However, since the beginning of the COVID-19 preprint full text project in 2020, the majority of preprint servers added to the Europe PMC full text collection have appeared in the NLM Catalog and have been assigned a National Library of Medicine journal title abbreviation (NLMTA).7 Therefore, in the preprint XML, preprint servers have been tagged as journals, including those NLMTAs, with the expectation that the other elements of the tagging will sufficiently mark the preprint status of the document.

Each preprint version is tagged with a <pub-date date-type="preprint"> containing the date the version was posted to the preprint server. This is the suggested date-type in JATS for a preprint dissemination date.8
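
The front-matter choices in this section can be illustrated with a short sketch using Python's standard `xml.etree.ElementTree`. The helper `build_preprint_front`, the sample server name, and the NLMTA value passed in are assumptions for illustration, and the output is a skeleton rather than a schema-valid JATS document.

```python
import xml.etree.ElementTree as ET


def build_preprint_front(server_name, nlm_ta, posted):
    """Build a skeletal JATS <article> for one preprint version.

    posted is a (year, month, day) tuple: the date the version was posted.
    """
    article = ET.Element("article", {"article-type": "preprint"})
    front = ET.SubElement(article, "front")

    # The preprint server is tagged as if it were a journal.
    journal_meta = ET.SubElement(front, "journal-meta")
    journal_id = ET.SubElement(journal_meta, "journal-id", {"journal-id-type": "nlm-ta"})
    journal_id.text = nlm_ta
    title_group = ET.SubElement(journal_meta, "journal-title-group")
    ET.SubElement(title_group, "journal-title").text = server_name

    # The posting date of this version uses date-type="preprint".
    article_meta = ET.SubElement(front, "article-meta")
    pub_date = ET.SubElement(article_meta, "pub-date", {"date-type": "preprint"})
    year, month, day = posted
    ET.SubElement(pub_date, "day").text = f"{day:02d}"
    ET.SubElement(pub_date, "month").text = f"{month:02d}"
    ET.SubElement(pub_date, "year").text = str(year)
    return article


# Sample values are illustrative only.
article = build_preprint_front("medRxiv", "medRxiv", (2020, 7, 1))
print(ET.tostring(article, encoding="unicode"))
```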

License tagging and related workflows

As part of the switch, system wide, to JATS 1.2, Europe PMC plus began using ALI-namespaced tags for open access licenses, as recommended by JATS For Reuse (JATS4R),9 and subsequently the JATS Tag Library and the PMC Tagging Guidelines.10, 11 The correct tagging and accurate machine-readability of open access licenses became critical in the creation of new processing workflows through Europe PMC plus—workflows created to require minimal intervention or effort on the part of preprint authors.

Once full text XML has been tagged for a preprint version, and the XML is processed through the Europe PMC plus system, any open access license available in the XML is programmatically identified and recorded in the system database. The preprint is then sent through different processing workflows depending on the type of license:

  • For each preprint version released under any Creative Commons license,12 the corresponding author is contacted and offered the opportunity to preview the full text 14 days before the full text is automatically released for display and text mining. If the author reviews and approves the full text, it can be released immediately.
  • If the preprint version has no identifiable open access license, the corresponding author is contacted and asked to approve the full text. In such cases the full text is indexed for search on Europe PMC, but cannot be released for display and text mining without author approval.

With appropriate licensing or author approval, all preprint full text can be displayed and made freely available for text mining. However, the overlay or display of information resulting from text mining is considered a derivative work. The collection and display of text mined terms is a useful feature Europe PMC provides for researchers through its website and APIs.13 Therefore, during the approval process, authors of preprints with no license or a "No Derivatives" Creative Commons license are invited to add a license allowing unrestricted reuse to the version of their preprint full text on Europe PMC.
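
A minimal sketch of the license-dependent routing described above follows. It assumes that a Creative Commons license can be recognized from the `ali:license_ref` URL, that the "No Derivatives" term appears as `nd` in the license code, and that anything else is treated as having no identifiable open access license. The function names and workflow labels are hypothetical and do not come from the Europe PMC plus codebase.

```python
import xml.etree.ElementTree as ET

ALI_NS = "http://www.niso.org/schemas/ali/1.0/"


def find_license_url(jats_xml):
    """Return the first ali:license_ref URL in the document, or None."""
    root = ET.fromstring(jats_xml)
    ref = root.find(f".//{{{ALI_NS}}}license_ref")
    if ref is None or not (ref.text or "").strip():
        return None
    return ref.text.strip()


def route(license_url):
    """Pick a workflow and decide whether to offer the Europe PMC license."""
    url = (license_url or "").lower()
    is_cc = "creativecommons.org/" in url  # assumption: URL-based detection
    # License codes such as "by-nd" or "by-nc-nd" carry the No Derivatives term.
    parts = url.split("/licenses/")
    code = parts[1].split("/")[0] if is_cc and len(parts) > 1 else ""
    no_derivatives = "nd" in code.split("-")
    return {
        "workflow": "preview_then_auto_release" if is_cc else "author_approval_required",
        "offer_europepmc_license": (not is_cc) or no_derivatives,
    }


sample = """<article xmlns:ali="http://www.niso.org/schemas/ali/1.0/">
  <front><article-meta><permissions><license>
    <ali:license_ref>https://creativecommons.org/licenses/by-nc-nd/4.0/</ali:license_ref>
  </license></permissions></article-meta></front>
</article>"""

print(route(find_license_url(sample)))
```

In this sketch a CC BY-NC-ND preprint still follows the preview workflow, but its author would also be invited to add the Europe PMC license so that text-mined annotations can be displayed.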

Box 1. Europe PMC License

  <license>
    <ali:license_ref xmlns:ali="http://www.niso.org/schemas/ali/1.0/">https://europepmc.org/downloads/openaccess</ali:license_ref>
    <license-p>This preprint is made available via the <ext-link ext-link-type="uri" xlink:href="https://europepmc.org/downloads/openaccess">Europe PMC open access subset</ext-link>, for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original preprint source.</license-p>
  </license>

During the approval process, authors are given the option to pre-approve any future versions of the specific preprint full text under review. Choosing this option waives subsequent review invitations and any waiting period for the release of future versions. The aim of this feature is to further reduce the effort required for preprint authors to release full text preprints in Europe PMC for display and text mining.

Preprint versions, removals, and withdrawals

Preprint servers commonly offer authors the opportunity to make edits to their preprint documents, and release subsequent versions of a preprint as changes are made. Europe PMC is committed to keeping up to date with the versions of each preprint in the COVID-19 full text project. Europe PMC and Europe PMC plus workflows and processes were carefully planned and adjusted to handle multiple versions of a single preprint.

Version tagging and linking

Preprint versions are assigned persistent identifiers in different ways by different preprint servers. The final number of IDs associated with a preprint and version is ultimately based on the number of IDs assigned by the preprint server. Europe PMC has identified two main ways preprint servers assign IDs, or register preprints with DOI registration agencies:

  • All versions of the preprint are registered under the same DOI/identifier. The DOI or other record is updated whenever there is a new version, or,
  • Each version of the preprint is assigned its own unique DOI/identifier.

Each preprint or version with a registered identifier is assigned another preprint ID (PPRID) by Europe PMC once indexed. In the course of the COVID-19 full text project, Europe PMC plus creates an additional manuscript ID (EMSID) for the full text. Europe PMC and Europe PMC plus reflect the policy of each preprint server when indexing preprint metadata and when processing and indexing the full text:

  • Versions that share a single DOI/identifier also share a single PPRID and EMSID, both of which are updated to reflect the latest content.
  • Versions which each have a unique DOI/identifier each receive their own PPRID and EMSID. Europe PMC links these using an internally developed matching process, and only displays the most recent version in search results. Previous versions are linked to the record for the most recent version.

Fig. 2. Only the latest version of a preprint is shown in Europe PMC search results and, when multiple versions are available, the version number is indicated.

Fig. 3. When preprint versions have different IDs, all available versions are linked in the Preprint version history.

The DOI, PPRID, and EMSID for a preprint version are all recorded in the JATS XML as different types of <article-id>. Additionally, we use <article-version> to record the version number of the preprint.14 When grouped with other versions of the same preprint, the article version number indicates which version is most up to date and should be displayed.
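
As a sketch of how these identifiers and the version number might be read back out of the XML, the example below parses `<article-id>` and `<article-version>` and picks the most recent version from a group. The `pub-id-type` values used for the PPRID and EMSID (`archive` and `manuscript`), the `article-version-type` value, and all sample identifiers are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative template; attribute values for PPRID and EMSID are assumptions.
TEMPLATE = """<article article-type="preprint"><front><article-meta>
  <article-id pub-id-type="doi">{doi}</article-id>
  <article-id pub-id-type="archive">{pprid}</article-id>
  <article-id pub-id-type="manuscript">{emsid}</article-id>
  <article-version article-version-type="number">{version}</article-version>
</article-meta></front></article>"""


def read_version_info(jats_xml):
    """Collect every <article-id> by its pub-id-type, plus the version number."""
    root = ET.fromstring(jats_xml)
    ids = {el.get("pub-id-type"): (el.text or "").strip()
           for el in root.iter("article-id")}
    version = root.findtext(".//article-version")
    return {"ids": ids, "version": int(version) if version else None}


def latest(records):
    """Return the record with the highest version number."""
    return max(records, key=lambda r: r["version"] or 0)


records = [
    read_version_info(TEMPLATE.format(
        doi="10.1234/example.v1", pprid="PPR000001", emsid="EMS000001", version=1)),
    read_version_info(TEMPLATE.format(
        doi="10.1234/example.v2", pprid="PPR000002", emsid="EMS000002", version=2)),
]

print(latest(records)["ids"])
```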

In Europe PMC plus, preprint version IDs and version numbers are stored in the database as well as in the JATS XML, and are used to ensure the latest version of a preprint is tagged. Versions which fall under the same ID are processed in sequence, while versions each with their own ID can be processed simultaneously. We ensure that versions of a preprint are linked so that an author's most recent selections for version pre-approval and for adding the Europe PMC license are applied to each.
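
The ordering rule above can be sketched as a grouping step: versions that share an identifier form one ordered queue, and separate queues are independent of one another. The `plan_processing` helper and the sample identifiers are hypothetical, and the sketch says nothing about how Europe PMC plus actually schedules its work.

```python
from collections import defaultdict


def plan_processing(versions):
    """Group versions by identifier; each group is an ordered queue.

    versions is an iterable of (identifier, version_number) pairs.
    Entries inside one queue must run in order, while different
    queues do not depend on each other and could run in parallel.
    """
    queues = defaultdict(list)
    for identifier, number in versions:
        queues[identifier].append(number)
    return {identifier: sorted(numbers) for identifier, numbers in queues.items()}


plan = plan_processing([
    ("10.1234/shared", 2),
    ("10.1234/other.v1", 1),
    ("10.1234/shared", 1),
])
print(plan)
```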

Removals and withdrawals

Preprints are suppressed from preprint servers for a variety of reasons. In the course of the COVID-19 full text project, we have found that in some instances, a preprint withdrawal notification is presented as a new version of the preprint, but previous versions are still available to read under their own IDs. In others, the preprint is removed in its entirety.

EMBL-EBI participated in the creation of ASAPbio recommendations for building trust in preprints.15 The ASAPbio recommendations suggest two categories of preprint removal:

  • Withdrawal, for situations in which the preprint content is still accessible, but there is a new notification explaining that the preprint has been withdrawn. Equivalent to the retraction of a journal-published article.
  • Removal, for situations in which some or all of the preprint and its metadata have been removed, and are no longer accessible.

For the purposes of tagging the full text we suggest and have begun using the following JATS article types:

  • @article-type="preprint-withdrawal", when both the withdrawal notification and previous versions of the preprint are available at the preprint URL.
  • @article-type="preprint-removal", when only the removal notification remains.

When accessing the preprint URL leads to a 404 page, and not even a notification remains, neither article type is appropriate. In such cases we remove the record from Europe PMC as well.
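
The decision described above can be written as a small function. This is a sketch only: it assumes a notice has already been detected in the full text, and that the caller knows the HTTP status of the preprint URL and whether earlier versions are still available there. The function name and parameters are hypothetical.

```python
def notice_article_type(http_status, previous_versions_available):
    """Map what is found at the preprint URL to an @article-type value.

    Returns None when nothing remains at the URL, meaning the record
    should be removed rather than tagged.
    """
    if http_status == 404:
        return None
    if previous_versions_available:
        return "preprint-withdrawal"
    return "preprint-removal"


print(notice_article_type(200, True))
print(notice_article_type(200, False))
print(notice_article_type(404, False))
```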

Currently, neither the difference between these types nor even the fact that the latest version of a preprint is actually a withdrawal is machine readable in the Crossref metadata for the preprint servers we index. Tagging notices with the appropriate article type requires parsing the full text content to discover that it contains a removal notice, then visiting the preprint server to determine whether it is a removal or a withdrawal according to these standards. It would be beneficial if preprint servers and Crossref were to adopt similar, machine-readable metadata standards for preprint withdrawals and removals, so that such updates to the scientific record could be caught more easily and automatically.

Preprint withdrawal or removal notices usually contain no more than one paragraph of content. The Europe PMC plus system flags versions of a preprint based on length for manual checking and tagging with the appropriate article-type. Once the article-type is correctly tagged, the notice full text is released for display and linking to previous versions on Europe PMC.
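
A length-based flag of this kind could look like the sketch below, which counts non-empty paragraphs in the JATS `<body>`. The one-paragraph threshold is taken from the observation above, while the `needs_manual_check` helper and the exact criterion are assumptions. The source does not describe how Europe PMC plus measures length.

```python
import xml.etree.ElementTree as ET

MAX_NOTICE_PARAGRAPHS = 1  # assumed threshold: "no more than one paragraph"


def needs_manual_check(jats_xml):
    """Flag a version whose <body> is short enough to be a notice."""
    root = ET.fromstring(jats_xml)
    body = root.find("body")
    if body is None:
        return True  # no body at all is also worth a manual look
    paragraphs = [p for p in body.iter("p") if "".join(p.itertext()).strip()]
    return len(paragraphs) <= MAX_NOTICE_PARAGRAPHS


notice = "<article><body><p>This preprint has been withdrawn by the authors.</p></body></article>"
full = "<article><body><sec><p>Intro.</p><p>Methods.</p><p>Results.</p></sec></body></article>"

print(needs_manual_check(notice), needs_manual_check(full))
```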

Results and future direction

The goal of the project to create a freely available, JATS XML full-text corpus of COVID-19 preprints is to benefit the scientific community. The Europe PMC preprints subset has already been used in published articles and analyses.

We will continue to work closely with the preprints community on standards for preprint metadata and full text content. Standards for notifications for preprint withdrawals and removals, as well as other metadata events, such as the linking of peer reviews or comments to a preprint, are of particular relevance for the enrichment of our preprint data.

Data and code availability

All tagged Europe PMC full text preprints with appropriate licenses are freely available as the Europe PMC preprints subset: https://europepmc.org/downloads/preprints

Europe PMC plus is an open source project with a freely available codebase: https://gitlab.ebi.ac.uk/literature-services/public-projects/xpub-epmc

Funding

This work was supported by the Wellcome Trust [221558/Z/20/Z], the UK Medical Research Council (MRC), and the Swiss National Science Foundation (SNSF).

References

1. Europe PMC. Criteria for preprint servers. Preprints in Europe PMC. [Accessed 15 April 2021] Available from: https://europepmc.org/Preprints#preprint-criteria.
2. Kiley R. Open access: how COVID-19 will change the way research findings are shared. Wellcome. 2020 May 21. Available from: https://wellcome.org/news/open-access-how-covid-19-will-change-way-research-findings-are-shared.
3. Coronavirus articles and preprints search results. Europe PMC. [Accessed 15 April 2021] Available from: https://bit.ly/2OSt5B5.
4. Author Manuscripts in PMC. PubMed Central. [Accessed 15 April 2021] Available from: https://www.ncbi.nlm.nih.gov/pmc/about/authorms/.
5. @article-type: Type of Article. Journal Archiving and Interchange Tag Library NISO JATS Version 1.2. 2019 May. Available from: https://jats.nlm.nih.gov/archiving/tag-library/1.2/attribute/article-type.html.
6. <journal-meta>: Journal Metadata. Journal Archiving and Interchange Tag Library NISO JATS Version 1.2. 2019 May. Available from: https://jats.nlm.nih.gov/archiving/tag-library/1.2/element/journal-meta.html.
7. Construction of the National Library of Medicine Title Abbreviations. National Library of Medicine. 2019 Apr 24. Available from: https://www.nlm.nih.gov/tsd/cataloging/contructitleabbre.html.
8. @date-type: Type of Date. Journal Archiving and Interchange Tag Library NISO JATS Version 1.2. 2019 May. Available from: https://jats.nlm.nih.gov/archiving/tag-library/1.2/attribute/date-type.html.
9. Permissions. JATS4R Recommendations. 2020 Sep 16. Available from: https://jats4r.org/permissions.
10. <permissions>: Permissions. Journal Archiving and Interchange Tag Library NISO JATS Version 1.2. 2019 May. Available from: https://jats.nlm.nih.gov/archiving/tag-library/1.2/element/permissions.html.
11. Licensing information. PMC Tagging Guidelines. [Accessed 15 April 2021] Available from: https://www.ncbi.nlm.nih.gov/pmc/pmcdoc/tagging-guidelines/article/dobs.html#dob-license.
12. About the Licenses. Creative Commons. [Accessed 15 April 2021] Available from: https://creativecommons.org/licenses/.
13. Annotations. Europe PMC. [Accessed 15 April 2021] Available from: https://europepmc.org/Annotations.
14. <article-version>: Article Current Version Status or Number. Journal Archiving and Interchange Tag Library NISO JATS Version 1.2. 2019 May. Available from: https://jats.nlm.nih.gov/archiving/tag-library/1.2/element/article-version.html.
15. Beck J, Ferguson CA, Funk K, et al. Building trust in preprints: recommendations for servers and other stakeholders. OSF Preprints. 2020 Jul 21. doi:10.31219/osf.io/8dn4w.

Copyright Notice

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

The copyright holder grants the U.S. National Library of Medicine permission to archive and post a copy of this paper on the Journal Article Tag Suite Conference proceedings website.

Bookshelf ID: NBK569517
