Journal Article Tag Suite Conference (JATS-Con) Proceedings 2020/2021 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2021.
When Aries was looking for an XML editing platform to build on for the LiXuid XML workflow, a top priority was change tracking robust and usable enough to satisfy the needs of journal production staff. The solutions available at the time, based on processing instructions, either did not capture the full range of possible changes or were unreliable in their behavior. Aries also wanted to avoid having to superset JATS to add change tracking elements. Fortunately, Fonto, one of the candidate XML editors, was working on a solution based on DeltaXML called Document History. What sets Document History apart is its ability to display changes made across different revisions of an XML document. This principle is called a changelog, and it differs from the A/B comparisons typically employed in change tracking solutions. A changelog is a comparison between multiple, subsequent revisions of an XML document merged back into one annotated XML document, which is then visualized. This changelog contains the information needed to attribute changes to specific users and moments in time, something that is typically only offered by active change tracking systems. Another property of this changelog is its ability to show overlapping and conflicting changes. Both textual and XML changes are displayed in Document History as a redlined version of the document. Through the UI, users can choose the range of revisions they want to look at, navigate quickly to the associated XML editor, and mark changes as seen. Recently, Fonto developed its own differencing engine, FDiff, which is optimized to work on documents rather than data.
Introduction
In 2016, Aries Systems began planning Phase 2 of the LiXuid Manuscript project. LiXuid Manuscript is an initiative to use JATS XML as the main content file in our workflows and processes. Phase 2 focuses on creating an XML-through workflow for (post-acceptance) journal article production. The heart of any such system is the tool that allows you to author and edit the content: the XML editor.
Aries decided not to attempt building an XML editor from the ground up. Instead we had a list of features we considered essential. The editing platform had to resemble a word processor, hiding any hint of the XML under the hood, as most of its users, the authors, would be seeing the editor for the first time when they were tasked with reviewing their proofs. We required the platform to have an extensive API so that we could build specialized tools for adding complex elements like boxed text and definition lists. We also required the editor to be configurable so that we could offer different tools and behaviors to different roles and tasks.
One feature we could simply not do without was a robust way to track text changes. Especially in scholarly publishing, production editors need to be able to review all changes made by the author to ensure they conform to journal style but also, importantly, to confirm that the author has not expanded claims made in the paper, which would require it to go back through peer review. When I worked on such a workflow previously, no XML editor had a change tracking feature that satisfied the requirements of scholarly publishing, so we were forced to build our own (O'Connor et al., 2012). We were hopeful that the state of the art had advanced in the intervening years and that a robust change tracking feature would be available "off the shelf."
Requirements of Change Tracking Features
For a change tracking feature to meet the needs of scholarly journal production, it must meet several requirements. First and foremost, the feature must accurately capture all insertions and deletions of text. It must capture changes to inline formatting as well, as important distinctions may be made using formatting, for example, setting gene names in italics while their protein products are set in roman type.
A change tracking feature must attribute changes to the correct person, so that changes made by an author can be distinguished from those made by an editor. This aspect is especially important when a dispute arises over how an error may have been introduced into an article. Complicating this requirement is the likelihood that some changes will conflict or overlap. For example, a copy editor might enforce a journal policy to use American spellings of terms. At proof stage, the author might revert the terms to British spelling. Seeing the British spellings, the production editor would want to know that the copy editor hadn't missed making the edit in the first place.
Ideally, a change tracking feature will give the production editor a utility for accepting, rejecting, or altering changes made by the author or copy editor. They must be able to step through all the changes made in the document and have a clear indication that they have reviewed all of them. This requirement may also be complicated by changes that conflict or overlap.
Existing Change Tracking Features
Processing instruction-based features
A common method of tracking changes in XML is to capture them inline, in the document itself, using processing instructions. Processing instructions are XML nodes that provide information to the application that is rendering or otherwise handling the XML. In serialized XML, they are represented in the format:
Processing instruction format
<?target content?>
A common use of processing instructions is to designate the XSLT or CSS stylesheet to be used in rendering XML:
Example processing instruction
<?xml-stylesheet type="text/xsl" href="style.xsl"?>
In this example, we see the target is "xml-stylesheet" and the content is given in pseudo-attributes. Pseudo-attributes resemble attributes found within elements, in that they have a name followed by an equals sign and a quoted value, but they are not separate nodes as real attributes are.
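To make the pseudo-attribute convention concrete, the following short Python sketch pulls name/value pairs out of a processing instruction's content with a regular expression. This parsing is purely application-level (no standard XML API does it for you), and the function name is invented here for illustration.

Parsing pseudo-attributes (illustrative sketch)

import re

def parse_pseudo_attributes(pi_content):
    # To the XML parser the content is opaque text; name="value" pairs are only a convention.
    return dict(re.findall(r'([^\s=]+)\s*=\s*"([^"]*)"', pi_content))

# The content of <?xml-stylesheet type="text/xsl" href="style.xsl"?>
print(parse_pseudo_attributes('type="text/xsl" href="style.xsl"'))
# {'type': 'text/xsl', 'href': 'style.xsl'}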
An advantage of processing instructions is that they may be placed anywhere in an XML document without causing validation errors. So, the same processing instructions may be used in XML documents valid to any document type definition (DTD) or schema. This is important when designing a system that may be used to edit XML from different DTDs.
Processing instructions are effective at capturing simple insertions. In this example, the target is set as "insertion", and the content is a pseudo-attribute that gives the text that has been inserted:
Simple insertion
<?insertion text="the quick brown fox"?>
The processing instruction for simple insertions may be unambiguously placed where the text in question should appear. Retaining the placement of PIs for deletions, however, may be complicated when text is later added at the same location, separating the point where the text was deleted from the PI itself.
In either case, processing instructions struggle to capture mixed content, that is, text that has other elements interspersed, such as text formatting elements or cross references. The reason is that the working copy of the text needs to remain valid, and so, if a change including internal tagging captured in a processing instruction is rejected, there is a risk that the document will no longer be valid. As well, "unwraps" of tagging can be captured in processing instructions, but they and other changes are computationally expensive, as the system must update the PI with each user interaction. This quickly becomes too slow to be useful.
The complexity grows with nested formatting, deletions whose boundaries fall in the middle of formatted text, etc. For these reasons, some systems using PIs to track changes limit themselves to tracking insertions and deletions of text only, and this choice can lead to odd behaviors when coupled with functionality that allows users to accept or reject changes. If formatting is not tracked in a deletion, for example, rejecting that deletion may reinstate the text while losing the formatting.
A potential way to get around the issues of mixed content is to use pairs of processing instructions, associated by an id pseudo-attribute, that span the range of the deletion:
Deletion wrapped in processing instructions
<?deletion-start id="xyz"?>the quick brown fox<?deletion-end id="xyz"?>
However, processing instructions that carry the same "id" in a pseudo-attribute are not really related in a way that is meaningful in the context of XML, as matching ID and IDREF attributes are. So, any system using pairs of processing instructions to wrap changes must add a layer of its own logic to keep track of their positions and relation. This can be tricky, especially when the changes in a document build up, involve structural changes and moving large blocks of text, or include changes by different users that overlap or conflict. When evaluating an XML editor that used such a system, we could find combinations of changes and accept/reject actions that resulted in much of the document being deleted without that deletion being tracked.
We did not move forward with that system.
1:1 XML differencing
The accuracy and stability problems of tracking changes using processing instructions are largely solved by using XML differencing. XML differencing allows the separation of concerns around editing functionality, on one hand, and change tracking on the other. Changes are not tracked in real time while a user is editing and they are not stored in the edited document itself using XML nodes that are not quite up to the task. Instead, two versions of a document are compared, and the differences are presented to the user. So, the editing process does not suffer a performance hit for having to keep track of changes at the same time, and it is guaranteed that changes will not be "lost" due to garbled markup in the document itself.
Rendering changes identified by XML differencing can prove to be a challenge. The differencing engine must resolve the longest common subsequence problem so that the changes reported are meaningful to the editor reviewing them. In the example given in O'Connor et al. (2012), the text "The entrails told the future" is changed to "Then trains stole the furniture". A raw differencing of those strings of text yields the character-level jumble "Then entrailns stolde the furniture" (where underlines mark insertions and strikethroughs mark deletions), which is less than clear and gives the editor too many changes to review. Fortunately, more sophisticated differencing engines such as DeltaXML do a good job grouping changes into meaningful blocks, yielding "The entrails told Then trains stole the future furniture" (that is, "The entrails told" and "future" struck through as deletions, with "Then trains stole" and "furniture" underlined as insertions).
Typefi made use of DeltaXML and 1:1 XML differencing when they developed a redlining application for the International Organization for Standardization (ISO; Perera, 2014). ISO was publishing 1700 standards per year at that time, and most of the revisions were done by multiple people in MS Word files. Using Word's Track Changes feature was considered unreliable, as authors could make changes "disappear" by accepting or rejecting them. In addition, overlapping changes, moving blocks of text, and text that was added and then deleted could combine to create an overwhelming and misleading record of the changes made.
The solution was to use eXtyles to export the text of the "before" and "after" versions to XML and then compare them using DeltaXML. The resulting XML diff was then visualized using Typefi. The rendered version showed easily understood green underlining for insertions and red strikethrough for deletions. Because the comparison highlighted only the changes that made it into the final version, users were not distracted by the "noise" of back and forth deliberations that occurred in the Word file.
However, one workflow's noise is another's music (https://youtu.be/BcqC8LSoBws; warning: flashing and strobing lights). In a journal article production workflow, editors will want to see all the changes made at each correction stage and, importantly, they'll want to know who made what changes.
Supersetting with change tracking elements and attributes
To overcome the limitations of 1:1 XML differencing and the inherent instability of using processing instructions to try to capture complex changes in JATS XML, one can superset the DTD by adding specialized change tracking elements. This approach was taken by Dartmouth Journal Services when it developed its ArticleExpress JATS editor (O'Connor et al., 2012).
As in processing instruction-based change tracking systems, information about the alterations is captured inline in the XML itself. However, using elements and attributes to encode this information instead of processing instructions and pseudo-attributes makes such a system more robust, as the underlying XML editor is built to handle the manipulation of these XML nodes. An insertion or deletion element wraps the changed text, and if inserted text was, say, split into separate paragraphs, the native functionality of the XML editor duplicates the change tracking tagging, thus maintaining its integrity.
This system exposes internal (changes within changes) and overlapping changes by using XSLT to denormalize the change tracking elements added by the XML editor.
Nested insertions
Insert 1: <insert ins="1"><p>The rain stays mainly on the plain</p></insert>
Insert 2: <insert ins="1"><p>The rain <insert ins="2">in Spain </insert>stays mainly on the plain</p></insert>
Denormalized: <insert ins="1"><p>The rain </insert><insert ins="1"><insert ins="2">in Spain </insert></insert><insert ins="1">stays mainly on the plain</p></insert>
This structure gives necessary information to the system's accept/reject feature. In that feature, when changes are found within changes, the "outer" change must be acted upon first to maintain the integrity of the XML. The action taken on the outer change determines whether the inner change may be accepted or rejected. For example, the acceptance of an outer insertion allows either acceptance or rejection of an inner insertion. The rejection of an outer insertion, on the other hand, would force the rejection of the inner insertion. This prevents the acceptance of an inner insertion that may only be valid in the context of the outer insertion. A rules engine was built to handle all of the various combinations of insertions and deletions, and it worked recursively to any depth of nesting.
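The core of that rule is simple to state in code. The following sketch is a hedged reconstruction of the insertion-inside-insertion case only; the function and action names are invented, and it is not the ArticleExpress rules engine itself.

Nested insertion accept/reject rule (illustrative sketch)

def allowed_inner_actions(outer_action):
    # Actions permitted on an insertion nested inside another insertion,
    # given the action already taken on the outer insertion.
    if outer_action == "accept":
        # The surrounding inserted text survives, so the inner insertion may go either way.
        return {"accept", "reject"}
    if outer_action == "reject":
        # The surrounding insertion is removed, so the inner insertion must be rejected with it.
        return {"reject"}
    raise ValueError("the outer change must be accepted or rejected first")

print(allowed_inner_actions("accept"))  # {'accept', 'reject'}
print(allowed_inner_actions("reject"))  # {'reject'}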
There are drawbacks to such a system. It worked fine for text insertions, deletions, and formatting, but it did not capture changes to structure or attribute values. Being dependent on the event handlers and listeners of the XML editor added complexity and potential performance problems. Also, the XSLTs and the supersetted DTD needed to be maintained through revisions of JATS. Not insignificantly, the change tracking system took quite a bit of time and effort to develop, and it was sensitive to changes in the API of the underlying editor.
Aries Meets Fonto
When Aries was evaluating Fonto to determine whether it could serve as the base XML editor for our LiXuid workflow, Fonto had a processing instruction-based inline change tracking system that did not, for example, track formatting changes. However, they told us that they were working on a new change tracking system, Document History, that was based on XML differencing. Unlike other solutions based on XML differencing, theirs would show all the changes each user made at every step in a workflow.
We were intrigued . . .
Document History
The user interface of DH is designed with non-technical users in mind. It hides the complexity of the underlying XML, while emphasizing the concepts that users are familiar with in text processors like Microsoft Word. For example, textual additions are visualized in green, while deletions are visualized as a red strikethrough.
In addition, DH alleviates some well-known change tracking issues:
- People may be working on the same document concurrently, making it hard or even impossible to track changes across their versions of the document;
- Representing structural and semantic changes;
- Representing overlapping and conflicting changes;
- Interoperability between different processors.
DH does not rely on change information being stored inside the document. It compares the revisions stored in the CMS, DMS, or repository. Effectively, it generates a delta between revisions A and B.
This is not unique to DH, since there are many A/B comparison tools out there. A typical drawback of such tools is a lack of precision. For example, if you compare revision A with D, how would you know whether a change was made in B or C?
DH compares all the revisions and combines each 'delta' into a so-called continuous changelog, giving it the ability to determine precisely who made which change and when, even when changes overlap or revert changes made previously.
This makes DH 100% reliable: it gives evidence of each and every change. It is an intuitive and reliable tool for auditing documents in many domains, including life sciences, legislation, standardization, aviation, and journals.
Architecture
The DH application is developed as a stand-alone component. It contains only the logic for computing the changelog and the diffs; it does not store the revisions of an article. The revisions of an article are stored in the Editorial Manager database. The DH Architecture diagram depicts the logical overview of the components and their responsibilities.
The Document History backend contains the following main components, as depicted in the Main components diagram (a sketch of how they might fit together follows the list):
- The API component is a set of REST API services which are invoked by the DH UI.
- The CMS Client component retrieves from Editorial Manager the JATS XML content of the article revisions to be compared.
- The FDiff component is an XML diffing engine which provides an XML-based A/B comparison between two XML files.
- The Changelog component stitches multiple FDiff A/B comparisons together into a so-called changelog, which contains all changes across all the revisions of an article. The output of the changelog is JATS article XML enriched with change annotations.
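As a rough illustration of how these components combine, the following Python sketch diffs each pair of consecutive revisions and stitches the resulting deltas. The callables fdiff and stitch are stand-ins for the FDiff and Changelog components, and the list of revision XML strings is assumed to have been retrieved by the CMS Client; none of these names are the actual Document History API.

Backend pipeline (illustrative sketch)

from typing import Callable, List

def build_changelog(revision_xml: List[str],
                    fdiff: Callable[[str, str], dict],
                    stitch: Callable[[List[dict]], str]) -> str:
    # revision_xml: JATS XML of each revision, oldest first, as fetched by the CMS Client.
    deltas = [fdiff(left, right) for left, right in zip(revision_xml, revision_xml[1:])]
    return stitch(deltas)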
XML Encoding of changes
The DH component is schema agnostic; it operates on JATS, NISO-STS, DITA, or any other similar schema. As such, the changes are encoded using proprietary XML elements and attributes in a separate namespace. The consequence of this choice is that the resulting annotated JATS article XML no longer conforms to the JATS schema. For the DH UI, that is not a problem because it offers a read-only view with custom rendering logic.
Further downstream processing of the enriched article would require either an extension to the JATS schema or a transformation to equivalent elements and attributes in the schema. Neither JATS version 1.2 nor the closely related NISO STS 1.0 provides suitable constructs for encoding such changes. Alternatively, XML processing instructions could be used to encode the changes, but that option is not explored in this paper.
In its current state the following change types are recognized and encoded:
- Text additions and deletions. (See Example: Simple text changes)
- Wrapping and unwrapping of inline elements. (See Example: Inline formatting changes)
- Insertions and deletions of block-level elements.
- Moving of block-level elements (See Example: Moves).
- Block-level replacements for MathML equations and images.
- Attribute additions, deletions and modifications.
Changelog
Consider the Revisions and their comparisons diagram. Each box represents the XML of a specific revision of the article. So in this example there are three revisions of the article: R1, R2 and R3. R1 and R3 are attributed to the author Charles, while R2 is attributed to the author Bert. The arrows between the boxes represent comparisons.
In a typical diff application, only two revisions are compared, say R1 and R3. In such an application it would be impossible to attribute a change to a specific revision and thus to a specific author. A user of such an application would need to perform two comparisons in order to attribute changes to the respective authors. This can be done in a traditional diff application but is cumbersome, because the user would need to switch between the two comparisons manually.
Overlapping changes are changes made by one author, say Bert in R2, which are subsequently reverted by another author, say Charles in R3. This is also known as an edit war (Yasseri et al., 2012). The net result of both changes would be zero and thus would not show up in an A/B comparison between R1 and R3.
A changelog is a set of A/B comparisons stitched together in such a way that the result is a concatenation of all the individual comparisons. This is done in order to attribute changes to specific authors and to display overlapping changes. Each change incorporated in the changelog is annotated with the specific revision in which the change was made. This is how the DH UI can display when and by whom a particular change was made. This information, and some more, is stored in the fxd:changeId attribute.
The stitching algorithm, which creates the changelog, needs to split changes if a change in a later revision falls within a prior change. Consider the Example editor war fragment, where a change identified by id1 was split by a later change identified by id2.
Example editor war
A sentence with <fxd:text fxd:changeId="id1" fxd:addition="">added</fxd:text> and <fxd:text fxd:changeId="id2" fxd:deletion="">deleted</fxd:text><fxd:text fxd:changeId="id1" fxd:addition=""> text</fxd:text> and some more.
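The splitting itself can be pictured as span arithmetic: an earlier annotated span is cut around the interval touched by a later change, so that every resulting fragment still carries its own change id. The following sketch uses invented data shapes purely for illustration and is not the actual stitching algorithm.

Splitting an earlier change around a later one (illustrative sketch)

def split_span(span, start, end):
    # Split an annotated text span around the [start, end) interval of a later change.
    text, change_id = span["text"], span["changeId"]
    pieces = [text[:start], text[start:end], text[end:]]
    return [{"text": piece, "changeId": change_id} for piece in pieces if piece]

# A span added in one revision (id1) is later touched in the middle by another revision,
# so the id1 annotation is split into fragments that can each be attributed separately.
print(split_span({"text": "added and deleted text", "changeId": "id1"}, 6, 18))
# [{'text': 'added ', 'changeId': 'id1'}, {'text': 'and deleted ', 'changeId': 'id1'}, {'text': 'text', 'changeId': 'id1'}]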
The stitching algorithm also needs to be able to track moves temporally. For example, a paragraph can first be moved, then be edited, and then be moved again. The author must be able to see both the moves and the edits separately.
FDiff
The problem of document change detection is defined as the problem of finding a "minimum-cost edit script" that transforms one document into another. XML documents are, by definition, trees, so this is known as the tree-to-tree correction problem. Several algorithms are known, each with its own running time and behavior characteristics. A generic tree-to-tree correction algorithm (Tai, 1979) focuses on the syntactic structure; it works well for data-oriented documents but not for text-oriented documents. For our purposes we needed an algorithm that focuses on structured text.
The implementation of FDiff is a modified version of the X-tree Diff+ algorithm (Lee and Kim, 2006). This algorithm is optimized for structured text documents and supports insert, delete, update, move and copy (currently unused) operations.
Given two XML documents, referred to as left and right, the goal of the algorithm is to work out an optimal matching between the nodes in the left document and the nodes in the right document. Each match has a score between 0 and 1 to indicate how good the match is. The steps required are laid out in the X-tree Diff+ paper. The algorithm is O(n), where n is the sum of the number of nodes in left and right.
The exact implementation of this algorithm is non-trivial and deserves its own paper, so it is out of scope here. The main modifications to X-tree Diff+ are described below, though.
This algorithm has the following features:
- It operates on XML document representations of structured text.
- It leverages Fonto family concepts (e.g. block, frame, object, etc.) to optimize performance (Middel, 2019).
- It uses text similarity instead of stable identifiers.
- It produces an edit script that features the operations insert, delete, update and move. (copy is described in the paper but is not used at the moment).
- It computes an edit script in near linear time and space on average.
There are two modifications to the X-tree Diff+ algorithm, discussed in the following sections.
Block-based tree alignment
Instead of matching the two entire left and right trees, FDiff stops matching at the paragraph boundary which we'll refer to as the block boundary. This means we don't match individual text nodes or inline markup elements in the tree-to-tree-alignment phase.
The motivation for stopping at the block boundary is twofold:
- Detecting moves for individual words or phrases is generally not helpful to authors: the smaller the fragment, the higher the likelihood of it showing up in the text multiple times. Consider the word "the" for example.
- We want to detect wrapping and unwrapping of inline markup. This is exceedingly difficult in X-tree Diff+ because, in the case of a wrap in the middle of a text node, the entire text node would be marked as deleted while three new nodes are inserted (the text node before the wrap, the wrap element with the wrapped text as its child, and the text node after it).
When two blocks are matched based on similarity (as explained in the following section), we run a different algorithm which treats all words and inline element starts/ends as a sequence or a string.
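The following sketch shows what that flattening might look like: a block's words and inline element boundaries become one token sequence, so a later string diff can report a wrap or unwrap as ordinary insertions or deletions of start/end tokens. The tokenizer is written here for illustration only and is not FDiff's implementation.

Flattening a block into a token sequence (illustrative sketch)

import re
import xml.etree.ElementTree as ET

def tokenize_block(xml_fragment):
    # Turn a block-level fragment into a flat list of word tokens and element start/end markers.
    tokens = []

    def walk(element):
        tokens.append("<" + element.tag + ">")
        if element.text:
            tokens.extend(re.findall(r"\S+", element.text))
        for child in element:
            walk(child)
            if child.tail:
                tokens.extend(re.findall(r"\S+", child.tail))
        tokens.append("</" + element.tag + ">")

    walk(ET.fromstring(xml_fragment))
    return tokens

print(tokenize_block("<p>The rain in <italic>Spain</italic> stays</p>"))
# ['<p>', 'The', 'rain', 'in', '<italic>', 'Spain', '</italic>', 'stays', '</p>']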
Finding the changes between a left and a right string is a well-studied problem referred to as the string-to-string correction problem (Wagner and Fischer, 1974). It refers to determining the minimum number of edit operations necessary to transform one string into another. There are several algorithms to solve this problem. Typically these algorithms support the insert and delete operations.
One of the most used is the Myers O(ND) difference algorithm (Myers, 2005). See Elder (2017) and Fraser (2006) for details; these resources are generally more readable than Myers's "An O(ND) Difference Algorithm and Its Variations" paper itself. Two refinements to the linear-space Myers algorithm are also implemented. These refinements reduce the memory requirements of the classical algorithm from O(len(left) + len(right)) to O(min(len(left), len(right))) and the worst-case execution-time requirements from O((len(left) + len(right)) * D) to O(min(len(left), len(right)) * D).
Based on the insertions and deletions, the edit script is constructed for the matched blocks. Special care was required to ensure nested inline elements were constructed in the proper order.
Modification of Step 1 (Match identical subtrees with 1-to-1 correspondence and match nodes with ID attributes)
This step is heavily modified for two main reasons:
- xml:id or similar identification attributes are unstable. Consider, for example, an author splitting a paragraph: which half of the paragraph does the ID attribute belong to? What if the author copies just the text out of a paragraph? Therefore, we removed the ID matching sub-step.
- From an author's perspective, the primary content of the document is the text itself. We added a sub-step to match similar text blocks (e.g., paragraphs). This allows the algorithm to detect moves of text blocks even if they contain changes.
To compute the similarity between two blocks, we implemented a MinHash-based algorithm (Broder, 1997). The algorithm computes a MinHash of the N-shingles tokenized from the input. MinHash is designed to ensure that two similar inputs generate hashes that are themselves similar. In fact, the similarity of the hashes has a direct relationship to the similarity of the inputs they were generated from; this relationship approximates the Jaccard index (https://en.wikipedia.org/wiki/Jaccard_index). The implementation is a port of Andrei Gudkov's implementation (http://gudok.xyz/minhash1/).
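A minimal MinHash sketch follows, assuming word 3-shingles and Python's built-in hash salted with a per-function seed. It is not the Gudkov port used by FDiff, but it shows the core idea: the fraction of positions at which two signatures agree estimates the Jaccard similarity of the underlying shingle sets.

MinHash similarity (illustrative sketch)

def shingles(text, n=3):
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def minhash_signature(text, num_hashes=64):
    # One "hash function" per seed; the signature keeps the minimum hash value per seed.
    return [min(hash((seed, s)) for s in shingles(text)) for seed in range(num_hashes)]

def estimated_similarity(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "The rain in Spain stays mainly on the plain"
b = "The rain in Spain falls mainly on the plain"
print(estimated_similarity(minhash_signature(a), minhash_signature(b)))
# Approximately the Jaccard similarity of the two blocks' shingle sets.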
Conclusions
With Document History, Aries got just about everything it wanted in a change tracking system. Every insertion, deletion, attribute, and formatting change made throughout the article's production is tracked reliably and attributed to the correct user. Because it is based on XML differencing, the system does not affect the performance of editing functions, and there is no possibility that the change tracking system will mishandle inline tracking nodes and mangle or lose content.
One feature that Document History does not have is the ability for editors to accept and reject changes. The tracked changes version is, after all, a different document, and mapping changes back to the working copy would be a challenge. Instead, production editors keep track of their work by marking changes as "Seen," and if a change is not appropriate, they have the option to "Edit here," which places their cursor at the location of a change in the working copy. Thus, a production editor can be assured that they have seen and addressed all changes made to the article.
With Document History, the LiXuid workflow has the change tracking feature that no XML-through production system can do without.
References
- Broder, A. (1997). On the resemblance and containment of documents. Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171), 21-29.
- Elder, R. (2017). Myers Diff Algorithm - Code & Interactive Visualization. Available from: https://blog.robertelder.org/diff-algorithm/
- Fraser, N. (2006). Diff Strategies. Available from: https://neil.fraser.name/writing/diff/
- Lee, S.K., Kim, D.A. (2006). X-Tree Diff+: Efficient Change Detection Algorithm in XML Documents. In: Sha, E., Han, S.K., Xu, C.Z., Kim, M.H., Yang, L.T., Xiao, B. (eds) Embedded and Ubiquitous Computing. EUC 2006. Lecture Notes in Computer Science, vol 4096. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11802167_104
- Middel, M. (2019). How to configure an editor – An overview of how we built Fonto. XML Prague, 103-116.
- Myers, E. (2005). An O(ND) difference algorithm and its variations. Algorithmica, 1, 251-266.
- O'Connor, C., Gnanapiragasam, A., Hepp, M. (2012). Tracking Changes to JATS XML in an Online Proofing System. In: Journal Article Tag Suite Conference (JATS-Con) Proceedings 2013 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US). Available from: https://www.ncbi.nlm.nih.gov/books/NBK159965/
- Perera, C. (2014). Case Study on Redlining Application using JATS XML at the International Organization for Standardization. In: Journal Article Tag Suite Conference (JATS-Con) Proceedings 2013/2014 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US). Available from: https://www.ncbi.nlm.nih.gov/books/NBK190487/
- Tai, K. (1979). The Tree-to-Tree Correction Problem. J. ACM, 26, 422-433.
- Wagner, R., Fischer, M. (1974). The String-to-String Correction Problem. J. ACM, 21, 168-173.
- Yasseri, T., Sumi, R., Rung, A., Kornai, A., Kertész, J. (2012). Dynamics of Conflicts in Wikipedia. PLoS ONE, 7. [PMC free article: PMC3380063] [PubMed: 22745683]