NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.
Journal Article Tag Suite Conference (JATS-Con) Proceedings 2022 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2022. doi: 10.0000/00000000
JATS is today the de facto standard for the XML representation of journal articles. Academic publishers convert articles from tex, docx and odt to JATS with the benefit of carrying article information in a machine-readable and typesetter-tool-independent format. Following recent growth in higher education, the number of scientific articles has increased exponentially for more than a decade. Each of these articles has to go through a laborious process, from the initial screening through peer review, author revision rounds and the final decision made by the editor-in-chief. The acceptance decision is made by analysing and comparing reviewer comments to the revisions made by the authors, which is a manual process that requires a lot of time, attention and rigour. We describe within this article a new JATS comparison algorithm, called jats-diff, available on GitHub (https://github.com/milos-cuculovic/jats-diff). The algorithm is not only usable for the final decision makers, allowing them to compare different article versions using bijection between author modifications on one side and the detected differences on the other, but also for comparing preprints with their published article versions and other domains where scientific literature gets converted into JATS.
1. Introduction
The number of scientific articles being published has been growing exponentially for over a decade [1]. Senior scientists are not only pushed to publish but are also asked to conduct peer review and are members of journal editorial boards and conference committees. As editors-in-chief or conference chairs, they are in charge of making final decisions regarding new articles’ acceptance. Those two roles, together with the author and the peer reviewer, are the key players in academic publishing, and it is of interest to facilitate their tasks as much as possible. Currently, the decision-making process is manual and requires a lot of time, attention and rigour.
After the article revision round, in order to make the acceptance decision (see Figure 1), the peer reviewer (or the editor-in-chief) must assess whether the author made the changes requested by the peer reviewer. This is done by reading the review comments, comparing different versions of the article and reading the author response letter. Comparing the article versions and reading the review comments and the author response letter are time-consuming tasks. Moreover, the change description is written by the author and may not reflect all of the changes, or the actual changes, made during the revision.

Figure 1
Article peer review and decision-making process.
During the peer review process, we can define two types of change information: the requested changes, i.e., those detailed in the reviewer comments; and the effective changes, i.e., those made by the authors. For the decision maker, the effective changes are obtained by analysing the differences between the two versions of the article. The common approach here is the use of the existing text processor document compare or change-tracking tools. Although useful, both of those functions represent changes only visually, and change sets cannot be extracted for further processing. Those also rely on text processors, and in addition, the change-tracking function lies in the author’s hands and can be disabled at any time.
We explore within this article another way of comparing academic articles: comparing their JATS representations. Publishers convert the articles from their initial document types to JATS, the de facto standard for the academic articles’ XML representations, used by major indexing companies, including PubMed Central¹ and SciELO². JATS has the advantage of being machine readable and independent of text processors; it also has no layout information and carries only the article data and structure. The JATS article information is divided into three main elements: front, body and back. The front contains subtree elements about the article metadata: the journal, the title, the authors, the affiliations, etc. The body contains the article content organised in sections, subsections and paragraphs. It is the largest part of the document, dominated by text blocks represented as paragraphs. In addition to the text, paragraphs are also composed of <xref> elements (used for citing) and styling elements such as <b>, <i>, <sub>, <sup>, etc.
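The front/body/back division described above can be illustrated with a minimal Python sketch. The XML fragment is hand-written for illustration only (it is not a complete or validated JATS document), and the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily simplified fragment: front (metadata), body (content), back (references).
JATS_SAMPLE = """<article>
  <front><article-meta><title-group><article-title>Sample title</article-title></title-group></article-meta></front>
  <body>
    <sec id="sec1"><label>1</label><title>Introduction</title>
      <p>Growth is exponential [<xref ref-type="bibr" rid="B1">1</xref>].</p>
    </sec>
  </body>
  <back><ref-list><ref id="B1"><label>1</label></ref></ref-list></back>
</article>"""

root = ET.fromstring(JATS_SAMPLE)

# The three main elements directly under <article>.
top_level = [child.tag for child in root]

# Paragraph text with inline elements flattened, and the IDs cited through <xref>.
paragraphs = ["".join(p.itertext()) for p in root.iter("p")]
xref_targets = [x.get("rid") for x in root.iter("xref")]

print(top_level)
print(paragraphs)
print(xref_targets)
```

Note that the paragraph holds mixed content: the citation is a child node of the paragraph, not plain text, which is what makes paragraph-level comparison harder than comparing text lines.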
Regarding the existing text document comparison tools, with the appearance of digital text documents back in the 1970s, researchers and industry practitioners expressed their interest in extracting and understanding textual differences. Many text-diff algorithms were created, some of them still in use, such as Hunt–McIlroy’s [2] and Myers’ [3] algorithms. They are used in version control systems (Git, Apache Subversion), text editors (Notepad++ Compare), etc. Text-diff algorithms are line-based (each line of document A is compared with its equivalent line in document B) and use two core edit actions to represent the changes: Insert and Delete.
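As a small illustration of a line-based diff limited to Insert and Delete, the following Python sketch uses the standard difflib module. This is a stand-in for illustration: difflib implements its own matching heuristic, not the Hunt–McIlroy or Myers algorithm, and the function name line_edits is an assumption.

```python
import difflib

version_a = ["JATS is an XML format.", "It has three parts.", "The body holds the text."]
version_b = ["JATS is an XML format.", "It has three main parts.", "The body holds the text."]

def line_edits(a, b):
    """Return the line-level changes between a and b as (action, line) tuples,
    using only D (Delete) and I (Insert)."""
    edits = []
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A replaced line is expressed as a delete followed by an insert.
        if tag in ("delete", "replace"):
            edits += [("D", line) for line in a[i1:i2]]
        if tag in ("insert", "replace"):
            edits += [("I", line) for line in b[j1:j2]]
    return edits

for action, line in line_edits(version_a, version_b):
    print(action, line)
```

Adding a single word to the second line is reported as the deletion of the whole old line and the insertion of the whole new one, which shows how coarse a line-based delta is for prose.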
In the late 1990s–early 2000s, XML [4], a specific type of text document, was widely adopted in a short period of time. Compared to the initial text documents, XML has the particularity of having a hierarchical structure, also called a tree structure, where the main elements are no longer text lines but nodes that can also have attributes to carry specific information. The existing text diff algorithms are unsuitable for comparing XML documents due to the so-called tree-to-tree editing problem defined by Selkow [5]. This problem was further demonstrated by several research groups [6,7,8].
XML documents first played a role in data interchange and archiving. Those were later categorised as being of a data-centric type. Many XML diff algorithms were created for data-centric XML comparison, such as XmlDiff [6], XyDiff [8] and FC-XmlDiff [9]. Those algorithms were designed to compare versions of a large number of XML documents, each containing a lot of small- to medium-size nodes. Their main performance indicators were the execution time and the delta file size. In order to reduce the delta output, the differences between two XML documents were represented using the minimum-cost edit approach.
With the appearance of modern typesetter tools such as MS Office Word and Open Office, XML documents started to be used as storage systems for textual data. At that moment, the era of text-centric XML documents began. There are today different text-centric XML document types which are widely used, such as JATS³, DocBook⁴, TEI⁵ and DITA⁶. Several research groups [10,11,12] have demonstrated that XML diff algorithms initially created for data-centric XML documents are not suitable for text-centric XML documents. This is mainly due to the way the performance of data-centric XML diff algorithms was measured. Unlike those, for text-centric XML diff algorithms, the quality of the delta output (readability and accuracy) is the most important.
Among the existing XML diff algorithms, 12 of them (see Table 1) were tested, and none were entirely suitable for JATS comparison [13]. The main reason for this is the inability of those algorithms to determine a strong relationship between edits made by authors and the changes represented in their deltas. This makes it difficult for human readers to understand the real changes made by authors while reading the delta output. Making a bijection between author modifications on one side and the detected differences on the other (see Figure 2) would help the human reader to achieve that goal.
Table 1
XML diff algorithms.
We propose within this article a new JATS comparison algorithm, called jats-diff, which can help us to better understand and represent human edits made on text processors by analysing the impact of those edits on JATS documents. Among others, one of the practical application areas where this new algorithm can be used is the article comparison carried out by the decision maker during the peer review process. Having direct and easy access not only to basic differences between the two articles but to real modifications made by the authors is more convenient than reading the change description document and comparing different versions of the article in order to mindfully evaluate the changes. In addition, preprint publishers and readers could also use the jats-diff algorithm in order to compare different preprint versions or compare those with their corresponding journal-published articles.
Regarding the organisation of this paper, we start in Section 2 by presenting different author modification patterns made in text processors and their impact on JATS. In Section 3, we propose seven new edit actions specific to JATS: section upgrade and downgrade, paragraph split and merge, style edit, text move and citable object reference edits. An important factor also considered is the flexibility to detect all those changes when minor text edits interfere, this being very specific to author edits on typesetter tools. In Section 4, we present a similarity index calculation between two JATS documents, its benefits and propagation towards the XML tree. In Section 5, we focus on the delta output and show how the change information is represented. We provide the jats-diff DTD and present some examples of the delta output. In Section 6, we compare jats-diff with the three other state-of-the-art XML diff algorithms that had scored the best while evaluating their JATS comparison capacities [13]. Finally, in Section 7, we discuss additional text change semantics that could be beneficial for distinguishing between simple sentence rephrasing and changes in sentence meaning. Moreover, change semantics may later be correlated with the reviewer comments in order to facilitate final decision making even further.
2. Impact of author edits on JATS
The current XML diff algorithms generate delta outputs with a limited number of edit patterns while comparing JATS documents. Often, a single author edit action made on the typesetter tool is interpreted as a sequence of different lower-level edit actions (insert/delete). This makes the delta output difficult to understand for a human. In this section, we analyse some common edit actions carried out by the authors on academic articles during the revision rounds and correlate those with their impact on JATS.
While being written, academic articles follow a standard structure regarding different sections they embed and in which order those sections appear. On its cover page, an article usually starts with the journal, title, authors, affiliations, abstract and keywords information. What follows is the largest part with different sections, each of which can contain subsections, paragraphs, figures, tables, etc. At the end, we usually find the reference list. The largest text blocks we can observe in the article are paragraphs. This implies that most of the changes made by authors are mainly observed on those text blocks. Some common author edit actions are represented in Figure 3, and their impact on JATS is shown in Figure 4. In the following text, we use abbreviations I, D, A, U and M for Insert, Delete, Attribute edit, Update and Move representations of JATS edits, respectively.
2.1. Paragraph split and merge
Two of the first paragraph edit actions we observed are split and merge. Authors usually split large paragraphs into smaller ones or merge several small paragraphs into a bigger one.
Figure 3 (Section one) shows a paragraph merge on the text processor. Its impact on JATS is represented in Figure 4 (sect. 1). Compared to text processors where this edit action is relatively straightforward and consists of adding or removing line breaks between text blocks, its impact on JATS is more complex. We observe a combination of U+(n-1)D, n being the initial number of the paragraphs to merge. The existing XML diff algorithms usually show this edit action as three or more edit actions: U+2D in the best case scenario, or I+3D when U is represented as a combination of D+I.
The split edit action is presented in Figure 3 (Section two) and is the opposite of merge. In Figure 4 (sect. 2), we can see the split edit impact on JATS, and the delta representation for this change is symmetrical and of similar complexity to that of merge.
Representing a split or merge edit action as a combination of a sequence of basic edit actions is not convenient for the human reader and requires a higher-level interpretation.
2.2. Text move
Another edit action we observed during the author revision rounds is the text move. Authors move parts of paragraphs from one location within the article to another. Figure 3 (Section three) shows the text move action made by the author on the text processor where part of a text labelled "m" was moved from the second to the first paragraph in Section 3. Figure 4 (sect. 3) shows the impact of this edit on JATS, which appears as 2U in the best case scenario or as 2(D+I).
Again, representing a text move edit action as two paragraph updates is not convenient for the human reader and requires a higher-level interpretation.
2.3. Subsection upgrade/Section downgrade
Subsection upgrade is yet another edit action we observed during the author revision rounds. This edit action occurs when authors upgrade a subsection to a section. Each subsection/section is composed of one label, one title and the section body that is further composed of paragraphs, figures, etc. Figure 3 (Section four) shows a subsection upgrade made by the author in the text processor. This edit action is usually seen as a combination of label change, font increase and indent decrease. As JATS does not hold any layout information, the complexity of detecting such edits on JATS documents is higher, and multiple nodes are affected by the change, as we can observe in Figure 4 (sect 4.1 -> sect. 5). The entire Subsection 4.1 is removed and inserted as Section 5. Moreover, we also observe some induced changes in the following sections where their label and ID are automatically changed according to the section numbering plan. Within our example, the initial Section 5 is renumbered as Section 6. Within the current XML diff algorithms, comparing the two JATS files will result in a full D of Subsection 4.1 and a full I of Section 5, followed by an attribute and label change on the initial Section 5 becoming Section 6.
The downgrade edit action is the opposite of upgrade. Figure 3 (Section six) shows a section downgrade made by the author in the text processor. This edit action is seen as a combination of label change, font decrease and indent increase. Its impact on JATS—see Figure 4 (sect. 6.1)—is exactly the same as for an upgrade, but in the opposite direction.
Neither the subsection upgrade nor the section downgrade edit actions are well represented within the existing XML diff algorithms. Those edit actions are represented as sequences of delete–insert edit actions and are difficult to interpret from a human reader perspective.
2.4. Style edit
In Section 1, we described JATS as a structural text document with no layout information that only carries the article data and structure. However, JATS also holds textual styling that is an important piece of information in the article. The majority of the paragraphs within the article hold text styling information such as bold, italic and underline. Styling change is purely visual in the text processor—see Figure 3 (Section five -> Section six), where one part of the text was styled bold and another one italic and underlined. On the other hand, styling information is represented as node elements on JATS, and their change directly impacts the XML tree—see Figure 4 (sect. 5 -> sect. 6). Within our example, inserting a bold style around a text portion consists of inserting a new node that will wrap this text content, which adds another layer of complexity for the XML diff algorithms. Most of the existing XML diff algorithms will represent this change as a new node I containing the edited text portion and a U of the edited paragraph text node. This kind of delta output is, again, not easy for human readers and is hard to interpret as a style edit action made by the author.
2.5. Citable object reference edit
Another common element we can observe in paragraphs is the citable object reference. Authors usually cite bibliographies, figures, tables and other sections within the article. Those references appear as <xref> nodes in JATS and are visible in most paragraphs. In order to cite an object, its label and ID are used. Those auto-incremental values are assigned to each citable object and are dependent on its appearance in the citable objects list. A reference citation is seen as <xref ref-type="bibr" rid="B2-journal-xx-yyyyy">2</xref>, where "B2-journal-xx-yyyyy" represents the ID, and the number 2 represents the label. If the paragraph is changed from citing the bibliography B2-journal-xx-yyyyy to B3-journal-xx-yyyyy, both the ID and the label will be changed within the <xref> node. Those two properties make it difficult to track author edits on the citable objects list. Let us assume a scenario where there are five references that the author is using as the bibliography. If a new reference were to be added at position 2, the IDs and labels of all the following references would be incremented, i.e., the "B2-molecules-25-00430" would switch to "B3-molecules-25-00430", and so on. The same applies for their labels. This change would then also impact all the references those objects have within the article, where the <xref> from our previous example would change to <xref ref-type="bibr" rid="B3-journal-xx-yyyyy">3</xref>, and the same for the following references. Because of those induced changes, most XML diff tools would represent the addition of a citable object not as a simple citable object I but as I+n(A+U) in the best case scenario, or as I+n(D+I) if the A and U are seen as a full <xref> D+I, n being the number of auto-incremented bibliographies.
This type of simple author edit creates a lot of noise in the delta output that needs to be detected and eliminated by the XML diff algorithm. Without any filtering, one reference I or D can produce hundreds of other edits within JATS due to the induced changes it generates.
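The renumbering scenario above (five references, a new one inserted at position 2) can be simulated with a short Python sketch. It works on numeric labels only; the function names and the list of cited labels are assumptions made for illustration.

```python
def label_shift(n_refs, insert_at):
    """Map each old label (1..n_refs) to its label after inserting a new
    reference at 1-based position insert_at."""
    return {label: label + 1 if label >= insert_at else label
            for label in range(1, n_refs + 1)}

def induced_xref_edits(cited_labels, shift):
    """List the <xref> occurrences whose label (and therefore ID) changes."""
    return [(old, shift[old]) for old in cited_labels if shift[old] != old]

shift = label_shift(n_refs=5, insert_at=2)

# Hypothetical order in which the article body cites its references.
cited = [1, 2, 2, 4, 5, 3]
edits = induced_xref_edits(cited, shift)

print(shift)
print(len(edits), "induced <xref> changes for a single inserted reference")
```

Every citation of a reference at or after the insertion point is touched, so the number of induced changes grows with the number of citations rather than with the size of the author's edit.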
3. "Level 2" edit actions
As seen in Section 2, we can observe several edit patterns made by authors in their text processors that have a higher-level impact on JATS. Today, those edit patterns are not recognised by any of the existing XML diff tools and are usually represented as a combination of insert and delete edit actions. From a human reader perspective, having a bijection with the author edits on one side and the JATS diff output on the other—see Figure 2—is valuable and can help us to understand what modifications are really made by the author in a given article. The following edit patterns that the new jats-diff algorithm is able to interpret and correctly represent are described within this section: paragraph merge and split, text move, subsection upgrade, section downgrade, style and citable object edits.
3.1. XML diff base - JNDiff
Due to the high scoring of the JNDiff algorithm [13] in comparing JATS documents, we decided to ground our new edit pattern recognition on the core functionalities of JNDiff for "Level 1" XML edit detection. JNDiff has good performance and is highly reliable in detecting differences between two XML documents. The logic of the algorithm is as follows: it first builds one virtual tree object per document; then, it detects inserts and deletes as basic/"Level 1" edit actions; finally, it tries to refine the detected "Level 1" differences and convert them to "Level 2". Each time a specific insert–delete sequence is turned into a "Level 2" change, it is removed and replaced in the change list by its "Level 2" representation. Our new edit pattern detection is added to the JNDiff refinement logic so that these patterns are recognised and represented in the delta.
3.2. XML tree annotation
In order to represent the XML trees in the following edit pattern detection on JATS documents, we use the conventional labelled ordered tree model. As we always compare two JATS documents A and B (two XML trees), each node belonging to document A is labelled α and each node belonging to document B is labelled β. We assign to each node an identifier, m for document A and p for document B. In addition to the identifier, the node depth within the tree, represented by n for document A and q for document B, is also added, resulting in the annotation α[m,n] for document A and β[p,q] for document B.
As we can see in Figure 5 representing the XML tree of the document A, each node has a specific annotation. α[0,0] represents the article node, being the root node. α[0.0,1] and α[0.1,1] represent Section 1 and Section 2, respectively, Section 1 being the node number 0.0 at depth 1 and Section 2 the node number 0.1 at depth 1. Section titles and paragraphs are within depth 2 and have node numbers 0.0.0 to 0.1.1. Finally, text nodes are at depth 3 and have node numbers 0.0.1.0 and 0.1.1.0. On the left, we can observe shallow nodes, and on the right, deeper nodes.
Figure 5
Example of XML tree on document A using the conventional labelled ordered tree model α[m,n] and β[p,q], where m and p represent the node identifier and n and q the depth.
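The annotation scheme can be sketched in Python as follows. This is an illustrative assumption of how identifiers and depths could be assigned (dotted child indexes, text treated as child nodes), not the jats-diff implementation; the function name annotate and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

def annotate(element, identifier="0", depth=0):
    """Yield (identifier, depth, label) for each element and text node,
    in document order. Text is treated as a child node labelled '#text'."""
    yield identifier, depth, element.tag
    index = 0
    if element.text and element.text.strip():
        yield f"{identifier}.{index}", depth + 1, "#text"
        index += 1
    for child in element:
        yield from annotate(child, f"{identifier}.{index}", depth + 1)
        index += 1
        # Text that follows a child element belongs to the parent.
        if child.tail and child.tail.strip():
            yield f"{identifier}.{index}", depth + 1, "#text"
            index += 1

doc_a = ET.fromstring(
    "<article>"
    "<sec><title>One</title><p>ab</p></sec>"
    "<sec><title>Two</title><p>cd</p></sec>"
    "</article>"
)

annotation = {ident: (depth, label) for ident, depth, label in annotate(doc_a)}
for ident, (depth, label) in annotation.items():
    print(f"[{ident},{depth}] {label}")
```

With this sample, the two sections receive identifiers 0.0 and 0.1 at depth 1, and the paragraph text nodes receive 0.0.1.0 and 0.1.1.0 at depth 3, in line with the numbering described for Figure 5.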
3.3. Text similarity vs. text equality
The existing XML diff algorithms use text equality while trying to match different edit actions. Taking the example of tree merge—see Figure 6, where the P1 and P2 paragraphs are merged into P3—the merge is detected only if the text in paragraph P3 is exactly the same as the sum of the texts in P1 and P2. This approach is suitable for data-centric XML documents where high edit precision is required. On the other hand, text-centric XML documents are prone to small textual edits where grammar and sentence rephrasing are common. It is then important to replace text equality checks with text similarity while evaluating "Level 2" changes. This way, the algorithm is more flexible and can detect "Level 2" changes even with small textual change interference represented in Figure 6 by the addition of "but the". All the following "Level 2" change patterns we present use text similarity with a threshold that we experimentally defined at 95%; however, this value can be changed for further fine tuning. Once the "Level 2" change pattern has been detected, and if the text node has a similarity different from 100%, the text updates have to be detected in the usual way regardless of the "Level 2" change.

Figure 6
Paragraphs P1 and P2 merged into P3 while a text change common to text-centric XML document was introduced, showing the benefits of text similarity over text equality use for "Level 2" change detection.
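The difference between an equality check and a similarity check can be sketched in Python. The similarity measure used by jats-diff is not specified here, so the ratio from the standard difflib module is used as an assumed stand-in; only the 95% threshold comes from the text, and the sample paragraphs are invented.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.95  # experimentally defined value mentioned above

def similarity(a, b):
    """Similarity in [0, 1]; difflib's ratio is an assumed stand-in metric."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def is_similar(a, b, threshold=SIMILARITY_THRESHOLD):
    return similarity(a, b) >= threshold

p1 = "Existing XML diff algorithms match edit actions using strict text equality,"
p2 = "text-centric documents often contain small rephrasing edits made by authors."

expected_merge = p1 + " " + p2
# The author merged P1 and P2 and added "but the" in between.
p3 = p1 + " but the " + p2

print(expected_merge == p3)            # equality check fails
print(is_similar(expected_merge, p3))  # similarity check still matches
```

With equality, the small rephrasing hides the merge; with a similarity threshold, the merge is still recognised and the remaining text update can be reported separately.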
3.4. Order of the edit detection
The edit pattern detection order (see Figure 7) is important, as some "Level 2" edits are composed of a combination of "Level 1" or even other "Level 2" edits. In the example where two paragraphs are merged, seen as one paragraph delete ("Level 1") and one text update ("Level 2"), it is important to run the paragraph merge pattern detection before the tree delete and text update detection. Regarding the section upgrade, this edit pattern is composed of several tree moves and text updates, and it is thus important to run this pattern detection before tree move and text update. Following an empirical evaluation, we decided to choose the following "Level 2" edit pattern detection order:

Figure 7
Edit pattern detection order. Higher-level edits are usually composed of other "Level 1" or "Level 2" edits, which makes their detection order important.
3.5. Subsection upgrade/Section downgrade
As seen in Figure 4 (sect. 4 -> sect. 5), upgrading one subsection with x nodes into a section results in x(D+I) "Level 1" edits. Moreover, as the remaining section numberings will also change due to their auto-incremental nature, we observe an additional y(A+D+I) "Level 1" edits, y being the number of remaining sections following the upgrade: A for the ID; D+I for the title. In total, a simple section upgrade will result in x(D+I) + y(A+D+I) "Level 1" edits. Within our example (see Figure 8), upgrading Subsection 1.1 into Section 2 results in depth lowering on each of the upgraded section elements followed by the title and ID change.

Figure 8
Impact of subsection upgrade on JATS XML tree: we observe that the depth of the upgraded Subsection 1.1 is lowered by one, placing it at the same depth level as its previous parent (Section one).
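To give a sense of scale, the "Level 1" edit count x(D+I) + y(A+D+I) can be evaluated for sample values. The short Python sketch below counts each D, I and A as one edit; the function name and the sample values of x and y are assumptions.

```python
def level1_edit_count(x_nodes, y_following_sections):
    """Number of "Level 1" edits for one subsection upgrade:
    x(D+I) for the upgraded nodes plus y(A+D+I) for the renumbered sections."""
    upgrade_edits = 2 * x_nodes                # one D and one I per upgraded node
    renumber_edits = 3 * y_following_sections  # A (ID) plus D+I (title) per section
    return upgrade_edits + renumber_edits

# Hypothetical case: a subsection of 4 nodes, followed by 1 section to renumber.
print(level1_edit_count(4, 1))
```

In this hypothetical case, eleven "Level 1" edits describe what the author performed as a single action.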
By applying the following mathematical formula on the lists of node changes detected between documents A and B, respectively annotated cα and cβ, we can evaluate if the specific change pair fulfils the subsection upgrade condition:
The formula verifies whether the contents of Subsection 1.1 and Section 2 are similar, and in addition, whether the depth of Section 2 is lower than the depth of Subsection 1.1. Having this condition satisfied will result in a subsection upgrade detection. By running the formula on our example, the following scenario occurs: for each change detected in document B, evaluate whether the text content "title"+"cd" of the β element is similar to the text content "title"+"cd" of an α element changed in document A. In addition, evaluate whether the depth of the β element is lower than the depth of the α element. As both conditions are satisfied, the tree upgrade pattern is recognised.
The delta output of such pattern detection is represented as one "Level 2" change named "upgrade", having two information elements: "upgrade_from" and "upgrade_to".
Downgrade is the opposite of upgrade: it is enough to invert the depth comparison in order to adapt the upgrade formula to detect downgrade patterns. Once a downgrade pattern has been detected, the delta output will contain one "Level 2" change named "downgrade", having two information elements: "downgrade_from" and "downgrade_to".
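The upgrade and downgrade conditions described in this section can be sketched in Python. This is a simplified illustration under stated assumptions, not the jats-diff implementation: the Change record, the function names and the use of difflib's ratio as the similarity measure are all hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass(frozen=True)
class Change:
    node_id: str   # identifier in the tree annotation
    depth: int     # depth in the tree annotation
    content: str   # title + body text, label excluded since it is renumbered

def similar(a, b, threshold=0.95):
    # Assumed similarity measure; the threshold is the one quoted in Section 3.3.
    return SequenceMatcher(None, a, b, autojunk=False).ratio() >= threshold

def detect_level_shifts(changes_a, changes_b):
    """Pair changed nodes of A and B with similar content but different depth."""
    found = []
    for new in changes_b:
        for old in changes_a:
            if not similar(old.content, new.content):
                continue
            if new.depth < old.depth:
                found.append(("upgrade", old.node_id, new.node_id))
            elif new.depth > old.depth:
                found.append(("downgrade", old.node_id, new.node_id))
    return found

# Subsection 1.1 (depth 2 in A) becomes Section 2 (depth 1 in B); IDs are illustrative.
changes_a = [Change("0.0.2", 2, "title" + "cd")]
changes_b = [Change("0.1", 1, "title" + "cd")]
shifts = detect_level_shifts(changes_a, changes_b)
print(shifts)
```

Swapping the two lists turns the same pair into a downgrade, which mirrors the remark that only the depth comparison needs to be inverted.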
3.6. Paragraph merge and split
As seen in Figure 4 (sect. 1), merging x paragraphs into a unique paragraph will be seen as x+1 "Level 1" edits, composed of xD+I. Using the update "Level 2" edit, the number of edits observed is lowered to x, with (x-1)D+U. We propose here a new way of detecting tree merge. Within our example (see Figure 9), the nodes of paragraphs 2 and 3 are merged with the node of paragraph 1. Represented with "Level 1" changes, this edit pattern results in 3D (the α nodes with content "ab", "cd" and "ef"), followed by I of the β node with content "abcdef".

Figure 9
Impact of paragraph merge on JATS XML tree: we observe that the text contents "cd" and "ef" from paragraphs 2 and 3 are merged with the text content "ab" of paragraph 1.
By applying the following mathematical formula on the lists of node changes detected between documents A and B, respectively annotated cα and cβ, we can evaluate whether a specific node pair fulfils the tree merge condition, the merge being valid only if two or more nodes from cα fulfil the tree merge condition:
This mathematical formula verifies, for every changed node in document B (β), whether there are changed nodes in document A (α) whose content is a subset of the β node content and whose depth is identical to that of the β node. By running the formula on our example from Figure 9, the following scenario occurs: for each change detected in document B, test whether the mathematical condition is verified for a given node in document A; if so, the examined node is a merge candidate. For the β node with text content "abcdef", the algorithm will test all the α nodes at the same depth in document A, here the three paragraph nodes with their respective text content "ab", "cd" and "ef". As their text contents are all contained in "abcdef" and they all have the same depth 2, they will be added to the merge candidate pool. In the end, if there is more than one merge candidate in the pool, a "Level 2" tree merge edit is detected. The resulting delta using the merge pattern detection while merging n paragraphs into a unique paragraph will be seen as one "Level 2" edit, containing n-1 merge_from information elements and one merge_to information element.
Split is the opposite of merge: instead of evaluating whether the text content of changed α nodes is contained by a β node, we evaluate whether the text content of changed β nodes is contained by an α node. While using the split edit pattern detection, splitting one paragraph into n paragraphs will be represented as one "Level 2" edit containing one split_from and n split_to information elements.
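The merge candidate pool described above can be sketched in Python. For brevity the sketch uses strict substring containment, whereas the text states that similarity is used so that small textual edits do not prevent detection; the Node record and function name are assumptions, not the jats-diff implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    node_id: str
    depth: int
    content: str

def merge_candidates(changed_in_a, target_in_b):
    """Return the changed nodes of A whose text is contained in the B node and
    that sit at the same depth; a merge needs more than one candidate."""
    pool = [
        node for node in changed_in_a
        if node.content
        and node.depth == target_in_b.depth
        and node.content in target_in_b.content
    ]
    return pool if len(pool) > 1 else []

# Figure 9 scenario: three paragraphs at depth 2 merged into one (IDs are illustrative).
doc_a_changes = [Node("0.0.1", 2, "ab"), Node("0.0.2", 2, "cd"), Node("0.0.3", 2, "ef")]
merged = Node("0.0.1", 2, "abcdef")

pool = merge_candidates(doc_a_changes, merged)
print([node.content for node in pool])
```

Split detection follows from swapping the roles: the changed nodes of document B are tested for containment in a single node of document A.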
3.7. Style edit
As seen in Figure 4 (sect. 5 -> sect. 6), styling information is seen as XML nodes on JATS. The most often observed styling elements are <b> for bold, <i> for italic, <sub> for subscript, <sup> for superscript and <u> for underline. Style edits have no narrative impact, and the paragraph textual structure remains the same. On the other hand, the JATS XML structure is heavily impacted by those styling nodes, which makes their change detection complex. Most of the existing XML diff algorithms have difficulties representing text changes in paragraphs containing styling elements. As style edits have no impact on the narrative structure, one possible solution we propose is to separate styling and textual change detection on JATS. This is possible by encoding style nodes as simple text (XML tags to text conversion). This way, the bold "hello" text is encoded from initially <b>hello</b> to a pure text variant, for example, _|b|_hello_|/b|_. This greatly simplifies the detection of styling changes, as there is no need to operate with complex tree changes; everything is seen and treated as simple text.
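The tag-to-text conversion can be sketched in Python as follows. The marker format _|b|_ ... _|/b|_ is the one given as an example above; the function name, the set of style tags and the recursive approach are assumptions for illustration, not the jats-diff implementation.

```python
import xml.etree.ElementTree as ET

# Styling elements as named in this section.
STYLE_TAGS = {"b", "i", "u", "sub", "sup"}

def flatten_styles(element):
    """Return the text of element with styling children replaced by text markers."""
    parts = [element.text or ""]
    for child in element:
        inner = flatten_styles(child)
        if child.tag in STYLE_TAGS:
            parts.append(f"_|{child.tag}|_{inner}_|/{child.tag}|_")
        else:
            # Non-style children (e.g. <xref>) are reduced to their text in this sketch.
            parts.append(inner)
        parts.append(child.tail or "")
    return "".join(parts)

encoded = flatten_styles(ET.fromstring("<p>say <b>hello</b> now</p>"))
nested = flatten_styles(ET.fromstring("<p><b>a</b><i><u>b</u></i>c</p>"))
print(encoded)
print(nested)
```

After this step a styled paragraph is a single string, so style changes can be searched for with ordinary text comparison.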
In Figure 10, we can see an example where some parts of the paragraph 1 text content are styled: "a" is made bold, "b" is made italic and underlined and "c" remains unchanged. By observing the impact of this modification on JATS, we can see that the paragraph node changes from having one child text node "abc" to six nodes: the <b> element and its text node "a", the <i> and <u> elements and their text node "b", and the text node "c". By encoding all those newly added styling elements as simple text, we retrieve only one text node for paragraph 1, which facilitates change detection.

Figure 10
Impact of styling addition on JATS XML tree: we observe that adding bold to one part of the text and italic + underline to another changes the XML tree structure, as each styling addition is seen as a new element added to the existing XML tree.
Once we have simple text nodes on both sides, we split them into two lists, one per document, using the encoded styling tags as separators. The two lists are then compared using the Java DiffUtils⁷ library, which returns the differences, each containing two parameters: difference content and type. With the type having one of three values (insert, delete or change), we are able to find style insertions, deletions and updates. In our example, DiffUtils will return three style inserts: bold, italic and underline. Deletions are observed when styling is removed, and updates when the styling type or the styled text portion changes. An example can be demonstrated in Albert Einstein’s quote "Logic will get you from <b>A to Z</b>; imagination will get you everywhere" that is changed to "Logic will get you <b>from A to Z</b>; imagination will get you everywhere". Note the bold part change from <b>A to Z</b> to <b>from A to Z</b>. DiffUtils will return two differences in this example: the bold part content change with "from" added and the text part changed with "from" deleted. We interpret this change as a styling edit where the styled portion of text changed.
Using the described approach, jats-diff is able to detect three different style changes: "text–style–insert", "text–style–delete" and "text–style–update". "Text–style–update" is used for both style type changes and style content changes.
Using the styling "Level 2" pattern recognition in our previous example allows us to change the delta output from one D+6I to three text-style-insert. The change consumer can understand this way that there is only styling and no content changes applied by the author.
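To illustrate the split-and-compare step on the Einstein quote, the sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in for the Java DiffUtils library; it is not the jats-diff code. The helper names (`tokenize`, `style_diff`) and the `inside_style` flag are hypothetical, and the separator pattern assumes the `_|tag|_` markers from the earlier example.

```python
import re
from difflib import SequenceMatcher

# Encoded style markers such as _|b|_ or _|/b|_ (delimiters assumed from the text).
_SEP = re.compile(r"(_\|/?[a-z]+\|_)")


def tokenize(encoded):
    """Split an encoded paragraph on style markers, keeping the markers as tokens."""
    return [tok for tok in _SEP.split(encoded) if tok]


def is_open_marker(tok):
    return _SEP.fullmatch(tok) is not None and not tok.startswith("_|/")


def style_diff(old, new):
    """Return (type, old tokens, new tokens, inside_style) for every difference."""
    a, b = tokenize(old), tokenize(new)
    ops = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A difference directly after an opening marker touches styled text.
            inside_style = i1 > 0 and is_open_marker(a[i1 - 1])
            ops.append((tag, a[i1:i2], b[j1:j2], inside_style))
    return ops


old = "Logic will get you from _|b|_A to Z_|/b|_; imagination will get you everywhere"
new = "Logic will get you _|b|_from A to Z_|/b|_; imagination will get you everywhere"
differences = style_diff(old, new)
```

On this input the sketch reports two differences, one in the plain text before the bold marker and one in the bold content, which is the pair the text interprets as a single styling edit.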
3.8. Text move
As seen in Figure 4 (sect. 3), moving text portions from one paragraph to another results in four "Level 1" edit actions, 2(D+I). Text moves within the document are thus represented in the same way as real content changes, which does not reflect the modification actually made by the author. In our example (see Figure 11), text "c" has been moved from paragraph 2 to paragraph 1. There, we can observe two change pairs, one for each of the two affected paragraph nodes.

Figure 11
Impact of text move on JATS XML: we observe that moving some text content from paragraph 2 to paragraph 1 has an impact on both nodes.
A mathematical formula is applied to each of the change pairs detected between documents A and B in order to evaluate whether two specific change pairs fulfil the text move condition.
The formula evaluates whether the differences of the two change pairs have text in common. If so, the text move pattern is detected. Running the formula on our example gives the following scenario: for each change pair, made of a node in document A and its counterpart in document B, evaluate whether there is another change pair whose content difference is similar. This amounts to verifying whether the content difference between the A and B versions of the first pair is similar or equal to the content difference between the versions of the second pair. As both content differences return "c", the text move condition is satisfied.
The delta output of such pattern detection is represented as one "Level 2" change named "text-move", having two information elements: "text-move_from" and "text-move_to".
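The text move condition can be sketched as follows, assuming each change pair is available as the old and new text of one node. This is an illustrative reading of the description, not the published formula: the helper names, the word-level comparison and the `threshold` parameter are assumptions, and `difflib` replaces whatever comparison jats-diff uses internally. Only the "text-move_from" and "text-move_to" labels come from the text.

```python
from difflib import SequenceMatcher


def removed_and_added(old, new):
    """Return the words removed from `old` and the words added in `new`."""
    a, b = old.split(), new.split()
    removed, added = [], []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b, autojunk=False).get_opcodes():
        if tag in ("delete", "replace"):
            removed += a[i1:i2]
        if tag in ("insert", "replace"):
            added += b[j1:j2]
    return " ".join(removed), " ".join(added)


def find_text_moves(change_pairs, threshold=0.8):
    """change_pairs maps a node label to its (old text, new text) pair.

    A move is reported when the text removed from one pair is similar or equal
    to the text added to another pair (threshold is a hypothetical cut-off).
    """
    diffs = {node: removed_and_added(old, new) for node, (old, new) in change_pairs.items()}
    moves = []
    for src, (removed, _) in diffs.items():
        for dst, (_, added) in diffs.items():
            if src == dst or not removed or not added:
                continue
            if SequenceMatcher(None, removed, added).ratio() >= threshold:
                moves.append({"text-move_from": src, "text-move_to": dst, "text": removed})
    return moves


# Figure 11 scenario: "c" moves from paragraph 2 to paragraph 1.
pairs = {"p1": ("a b", "a b c"), "p2": ("c d", "d")}
moves = find_text_moves(pairs)
```

With these inputs, four "Level 1" edits collapse into one reported move from `p2` to `p1`.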
3.9. Citable objects
As seen in Section 2.5, in addition to styling nodes, paragraphs are also composed of references used to cite citable objects available in the article. The most common citable objects are bibliographies, but we also observe figures, tables and sections. References are inserted as <xref> nodes containing the "ref-type" and "ID" as attributes and the citing reference label as text. The "ID" and the label are auto-incremental values dependent on the citable object appearance order; thus, inserting or removing a citable object automatically changes the remaining citable objects’ auto-incremental values. Those induced changes are not interesting for the human reader and should be ignored as they are not directly made by the author.
Figure 12 shows the impact of adding one additional bibliography at position one in the bibliography list. This change moves the bibliography initially at position one to position two, which implies that its label and ID auto-incremental numbering value change from 1 to 2. This in turn has a direct impact on the xref node citing it, which now carries a different "ID" attribute and a different label, as the previous Reference 1 became Reference 2. This kind of induced change is not interesting for the human reader, for whom the only relevant information is the insertion of the new bibliography.
We propose here a solution to ignore those non-relevant changes and keep only the relevant changes made by the author. The main idea is to first scan the citable objects list and detect insertions, deletions and the impact of those edits on the positions within the list. An insertion increments, and a deletion decrements, the IDs and labels of all following citable objects, which in turn impacts all cited references within the paragraphs. A precise list containing the original and new citable object numbering values is then used to scan all cited references within the paragraphs; a detected change in which the original numbering value becomes the corresponding new value is ignored as an induced change. This way, only real cited reference changes are kept in the delta output.
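A minimal sketch of this filtering, under the assumption that each citable object can be recognised by a stable key (for example its content) in both versions. The function names and the sample keys are hypothetical; the real algorithm works on JATS nodes rather than plain lists.

```python
def renumbering_map(old_refs, new_refs):
    """Map the 1-based number of each surviving citable object in A to its number in B."""
    new_pos = {ref: i for i, ref in enumerate(new_refs, start=1)}
    return {i: new_pos[ref] for i, ref in enumerate(old_refs, start=1) if ref in new_pos}


def real_xref_changes(xref_changes, mapping):
    """Drop xref changes that only follow the renumbering of the cited object."""
    return [(old, new) for old, new in xref_changes if mapping.get(old) != new]


# Figure 12 scenario: one bibliography inserted at position one.
old_refs = ["ref-hunt", "ref-myers"]
new_refs = ["ref-new", "ref-hunt", "ref-myers"]
mapping = renumbering_map(old_refs, new_refs)

# Detected xref label changes as (number in A, number in B).
detected = [(1, 2), (2, 3), (2, 1)]
kept = real_xref_changes(detected, mapping)
```

Here the first two changes are induced by the insertion and are dropped, while the last one points to a different bibliography and is kept as a real edit.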
4. Similarity index between the two documents
JATS being a text-centric XML, text nodes are the most important part of the document. Having a similarity index between the two documents is beneficial for the final decision maker who can evaluate the impact that the modifications had on the textual content of the article. Due to the XML tree structure, using ordinary text diff algorithms is not possible, which is why we developed a simple and efficient algorithm that can calculate text similarity between modified text nodes and propagate upwards in the XML tree.
4.1. Text similarity index propagation
After evaluating different text diff algorithms, we decided to use the Jaccard index [21], which is calculated for every change node pair between document A and document B, regardless of whether the change is of "Level 1" or "Level 2". Once the similarity index has been calculated for every change in the delta, the values are propagated upwards in the XML tree by applying the following equation:
$$S(p) = \sum_{i=0}^{N-1} w(n_i)\, s(n_i)$$
where p is a parent node with N child nodes n_0, ..., n_{N-1}, s(n_i) is the similarity index of child node n_i and w(n_i) is the weight of that child node within its parent.
Figure 13 presents two JATS versions where the first paragraph node lost "b", representing half of its initial content, and the second paragraph node had 50% of "d" changed, "d" itself representing 50% of the entire text node content. In this example, the delta output will show one text update per modified node.
Using our similarity calculation algorithm, we can deduce that the first text node has a similarity of 50% compared to its document B version. Calculating the same for the second text node and its document B version, we can deduce that the two have a similarity of 75% ("d" representing 50% of the entire text node, and being modified by 50%). Once both similarities have been calculated, we can propagate them upwards in the tree in order to measure the similarity between the two section trees of documents A and B, each containing three paragraphs. The details of applying the previous formula to the Figure 13 example are as follows:
- N = 3 as Section 1 has three child nodes;
- n0 represents paragraph 1; n1 represents paragraph 2; n2 represents paragraph 3;
- s(n0) = 0.5; s(n1) = 0.75; s(n2) = 1;
- w(n0) = 0.25; w(n1) = 0.25; w(n2) = 0.5.
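The numbers in the list above can be checked with a short sketch. It assumes that the propagation is a weighted sum of the child similarities, reading the first row of values as per-child similarity indexes and the second row as per-child weights that sum to 1; this reading, the word-level Jaccard variant and the function names are assumptions made for illustration.

```python
def jaccard(text_a, text_b):
    """Jaccard index on word sets: size of the intersection over size of the union."""
    words_a, words_b = set(text_a.split()), set(text_b.split())
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


def propagate(children):
    """Parent similarity as the weighted sum of (similarity, weight) child pairs."""
    return sum(similarity * weight for similarity, weight in children)


# Paragraph 1 of Figure 13 loses "b", i.e. half of its content.
s_n0 = jaccard("a b", "a")

# Values from the list above: similarities 0.5, 0.75, 1 and weights 0.25, 0.25, 0.5.
section_similarity = propagate([(0.5, 0.25), (0.75, 0.25), (1.0, 0.5)])
```

Under these assumptions the section similarity is 0.5 × 0.25 + 0.75 × 0.25 + 1 × 0.5 = 0.8125, i.e. about 81%.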
The example provided in Figure 13 is deliberately simple for comprehension purposes, but the same mathematical formula can be applied to more complex cases involving, for example, text moves, node moves or subsection upgrades, as long as we convert each XML subtree to simple text by concatenating its individual text nodes into a single text block.
4.2. Element lists and special objects similarity
We previously saw how to calculate text similarity and propagate this similarity upwards on the XML tree. Having the text similarity makes sense for paragraphs, subsections and sections; it is, however, rather useless for other types of XML subtrees that are presented as lists (authors, references, tables and figures). For those, it makes more sense to express the similarity in number of changed/unchanged elements (4/5 authors, or 28/30 references).
Figure 14 shows a modification to the last name of author 2. If we used the similarity propagation for the parent node "authors", we would observe a similarity percentage that is highly influenced by the length of the modified last name. For one-word last names, which are the most common, our two algorithms would return a similarity of 0%, although only one character has changed in the last name. To accentuate this problem even more, let us assume authors 1 and 3 have very short first and last names, and author 2 has a short first and a very long last name. The author 2 last name could, for example, be composed of 10 characters, while the other first and last names are composed of only 2 characters, meaning that regarding its size, the author 2 last name is of the same size as all the other author text nodes.
We can conclude that text similarity calculation for those special types of XML subtrees can be inappropriate as this is purely based on text content. In such cases, it is much better to use child element counters and represent their parent element similarity that way. For this concrete example, we would say that authors have a similarity of 2/3, as two authors are exactly the same, and one was modified. In order to have even a higher precision, we propose to use the following semantic information for such lists:
- Initial: number of child elements on document A;
- Final: number of child elements on document B;
- Modified: number of modified child elements;
- Deleted: number of deleted child elements;
- Inserted: number of inserted child elements.
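The five counters above can be computed with a sketch like the following, assuming the child elements of a list (authors, references, tables, figures) can be matched across versions by a stable identifier. The function name, the identifiers and the sample author names are hypothetical.

```python
def list_similarity(old_items, new_items):
    """Count list-level changes; both arguments map an element identifier to its content."""
    deleted = [key for key in old_items if key not in new_items]
    inserted = [key for key in new_items if key not in old_items]
    modified = [key for key in old_items
                if key in new_items and old_items[key] != new_items[key]]
    return {
        "Initial": len(old_items),
        "Final": len(new_items),
        "Modified": len(modified),
        "Deleted": len(deleted),
        "Inserted": len(inserted),
    }


# Figure 14 scenario: three authors, the last name of author 2 is edited.
authors_a = {"au1": "An Li", "au2": "Bo Smith", "au3": "Cy Wu"}
authors_b = {"au1": "An Li", "au2": "Bo Smyth", "au3": "Cy Wu"}
counters = list_similarity(authors_a, authors_b)
unchanged = counters["Initial"] - counters["Modified"] - counters["Deleted"]
```

For this example the counters give two unchanged authors out of three, matching the 2/3 similarity discussed above regardless of how long the edited name is.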
As the new edit actions detection and the similarity index have been described, we will continue with the algorithm output description and provide some examples where we will analyse the algorithm delta outputs for the described edit actions.
5. Algorithm usage
While comparing two JATS documents, the jats-diff algorithm generates two distinct files: one XML file containing the delta output and a second text file containing the similarity index and refining the delta output using different change semantics proper to JATS documents: citable objects, special objects (math formulas and figures) and lists containing tables, references and authors.
We will see within this section how both of those files are constructed and the information they represent.
5.0.1. Delta output XML
The following Document Type Definition (DTD) defines the structure of the delta output XML document. It represents both the "Level 1" and the "Level 2" edits that can be detected by jats-diff. It covers 13 edit actions: Delete, Insert, Update Attribute, Upgrade, Downgrade, Merge, Split, Move, Text Style Insert, Text Style Delete, Text Style Update, Text Move and Text Update.
<!DOCTYPE jats-diff [
<!ELEMENT jats-diff (delete|insert|update-attribute|upgrade|downgrade|merge|split|move|text-style-insert|text-style-delete|text-style-update|text-move|text-update)*>
<!ELEMENT delete>
<!ATTLIST delete nodenumberA #REQUIRED>
<!ELEMENT insert>
<!ATTLIST insert at #REQUIRED children #REQUIRED nodenumberB #REQUIRED pos #REQUIRED>
<!ELEMENT update-attribute>
<!ATTLIST update-attribute name #REQUIRED newvalue #REQUIRED nodenumberA #REQUIRED nodenumberB #REQUIRED oldvalue #REQUIRED op #REQUIRED>
<!ELEMENT upgrade>
<!ATTLIST upgrade at #IMPLIED nodecount #REQUIRED nodenumberA #IMPLIED nodenumberB #IMPLIED op #REQUIRED pos #IMPLIED>
<!ELEMENT downgrade>
<!ATTLIST downgrade at #IMPLIED nodecount #REQUIRED nodenumberA #IMPLIED nodenumberB #IMPLIED op #REQUIRED pos #IMPLIED>
<!ELEMENT merge>
<!ATTLIST merge at #IMPLIED direction #REQUIRED nodenumberA #IMPLIED nodenumberB #IMPLIED op #REQUIRED pos #IMPLIED>
<!ELEMENT split>
<!ATTLIST split at #IMPLIED direction #REQUIRED nodenumberA #IMPLIED nodenumberB #IMPLIED op #REQUIRED pos #IMPLIED>
<!ELEMENT move>
<!ATTLIST move at #IMPLIED direction #REQUIRED nodecount #REQUIRED nodenumberA #REQUIRED nodenumberB #REQUIRED op #REQUIRED pos #IMPLIED>
<!ELEMENT text-style-insert>
<!ATTLIST text-style-insert nodenumberB #REQUIRED op #REQUIRED pos #REQUIRED>
<!ELEMENT text-style-delete>
<!ATTLIST text-style-delete nodenumberA #REQUIRED op #REQUIRED pos #REQUIRED>
<!ELEMENT text-style-update (bold|italic|underline|overline|sup|sub|monospace|preformat|named-content|sc|b|i|u)*>
<!ATTLIST text-style-update nodenumberA #IMPLIED nodenumberB #IMPLIED op #REQUIRED pos #REQUIRED>
<!ELEMENT text-move>
<!ATTLIST text-move nodecount #REQUIRED nodenumberA #IMPLIED nodenumberB #IMPLIED op #REQUIRED text-position-from #IMPLIED text-position-to #IMPLIED>
<!ELEMENT text-update>
<!ATTLIST text-update length #REQUIRED nodenumberA #REQUIRED nodenumberB #REQUIRED op #REQUIRED pos #REQUIRED>
]>
For the previously observed edits, in Table 2, we can also see the attributes used to give additional information on the specific edit action. Those attributes allow the reader to identify the exact position of the change inside the original document A or the modified document B.
Table 2
Delta output edit action attributes
We will continue with providing the delta output for specific examples per edit action:
Insert: addition of a new keyword "test keyword":
<insert at="373" nodecount="1" nodenumberB="388" pos="7">
  <kwd>test keyword</kwd>
</insert>
Delete: removal of an existing keyword "river monitoring":
<delete nodecount="2" nodenumberA="386">
  <kwd>river monitoring</kwd>
</delete>
Attribute Update: section 6 ID change from "sec6" to "sec6dot1":
<update-attribute name="id" newvalue="sec6dot1" nodenumberA="38" nodenumberB="36" oldvalue="sec6" op="change-attr"/>
Section Upgrade: Subsection 2.3 is upgraded to Section 6. The upgrade operation is composed of two parts, upgrade_to and upgrade_from:
<upgrade at="388" nodecount="5" nodenumberB="601" op="upgradedTo" pos="5">
  <sec id="sec6"/>
</upgrade>
<upgrade nodecount="4" nodenumberA="447" op="upgradedFrom">
  <sec id="sec2dot3"/>
</upgrade>
Section Downgrade: Section 5 is downgraded to Subsection 2.4. The downgrade operation is composed of two parts, downgrade_to and downgrade_from:
<downgrade at="410" nodecount="89" nodenumberB="452" op="downgradedTo" pos="5">
  <sec id="sec2dot4"/>
</downgrade>
<downgrade nodecount="88" nodenumberA="517" op="downgradedFrom">
  <sec id="sec5"/>
</downgrade>
Paragraph Merge: merge two paragraphs into one. The merge operation is composed of three parts, one merge_to and two merge_from:
<merge at="0" direction="9:9" nodenumberB="9" op="mergedTo" pos="4">
  <p>Text paragraph four with some additional text of paragraph five</p>
</merge>
<merge direction="9:9" nodenumberA="9" op="mergedFrom">
  <p>Text paragraph four</p>
</merge>
<merge direction="11:9" nodenumberA="11" op="mergedFrom">
  <p>with some additional text of paragraph five</p>
</merge>
Paragraph Split: split one paragraph into two different paragraphs. The split operation is composed of three parts, one split_from and two split_to:
<split at="0" direction="9:9" nodenumberB="9" op="splitedTo" pos="4">
  <p>Text paragraph four</p>
</split>
<split at="0" direction="9:11" nodenumberB="11" op="splitedTo" pos="5">
  <p>with some additional text of paragraph five</p>
</split>
<split direction="9:9" nodenumberA="9" op="splitedFrom">
  <p>Text paragraph four with some additional text of paragraph five</p>
</split>
Move: move one keyword from its initial position 2 to position 4 within the keywords list:
<move move="376::378" nodecount="2">
  <kwd>remote sensing</kwd>
</move>
Style insert: insert bold style around the word "disasters":
<text-style-insert nodenumberB="117" op="insert-style" pos="319">
  <b>disasters</b>
</text-style-insert>
Style delete: remove bold style around the word "monitoring":
<text-style-delete nodenumberA="117" op="delete-style" pos="14">
  <b>monitoring</b>
</text-style-delete>
Style edit: change the style of the word "monitoring" from bold to italic:
<text-style-update nodenumberB="117" op="update-style-to" pos="14">
  <i>monitoring</i>
</text-style-update>
<text-style-update nodenumberA="117" op="update-style-from" pos="14">
  <b>monitoring</b>
</text-style-update>
Text Move: move a portion of the text from one paragraph to another:
<text-move nodecount="1" nodenumberB="6" op="movedTo" text-position-to="19">
  text of paragraph five
</text-move>
<text-move nodecount="1" nodenumberA="12" op="movedFrom" text-position-from="21">
  text of paragraph five
</text-move>
Text Update: change one word from "central" to "centralised":
<text-update length="7" nodenumberA="368" nodenumberB="368" op="text-deleted" pos="33">
  central
</text-update>
<text-update length="11" nodenumberA="368" nodenumberB="368" op="text-inserted" pos="33">
  centralised
</text-update>
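As a hedged illustration of how a consumer might read such a delta, the sketch below parses a small delta document with Python's standard `xml.etree.ElementTree` and counts the edit actions by element name. The sample content is assembled from the examples above and wrapped in a `jats-diff` root element as declared in the DTD; the actual file layout produced by the tool may differ, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sample delta built from the examples above (root element assumed from the DTD).
DELTA = """<jats-diff>
  <insert at="373" nodecount="1" nodenumberB="388" pos="7"><kwd>test keyword</kwd></insert>
  <delete nodecount="2" nodenumberA="386"><kwd>river monitoring</kwd></delete>
  <text-update length="7" nodenumberA="368" nodenumberB="368" op="text-deleted" pos="33">central</text-update>
  <text-update length="11" nodenumberA="368" nodenumberB="368" op="text-inserted" pos="33">centralised</text-update>
</jats-diff>"""


def summarise(delta_xml):
    """Count the top-level edit actions of a delta document by element name."""
    root = ET.fromstring(delta_xml)
    return Counter(child.tag for child in root)


def text_updates(delta_xml):
    """Return (op, node number in A, text) for every text-update action."""
    root = ET.fromstring(delta_xml)
    return [(e.get("op"), e.get("nodenumberA"), (e.text or "").strip())
            for e in root.iter("text-update")]


summary = summarise(DELTA)
updates = text_updates(DELTA)
```

A summary of this kind gives a quick overview of a revision round before the individual edits are inspected.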
5.0.2. Semantics output XML
Once the initial "Level 1" and "Level 2" edits have been identified, jats-diff will refine those and use change semantics in order to improve the visual representation of the detected changes and avoid representing the so-called induced changes we saw in Section 2. Moreover, the similarity index will also be calculated, and the results will be presented in the form of an XML tree structure.
We will see below some real-life examples where change semantics are used in order to refine a long list of "Level 1" and "Level 2" induced edits and summarise those in a human-readable format:
Citable objects: insertion of a new bibliography item in the references list <ref-list> at position 153 (out of 156). By contrast, the delta XML shows over a dozen different edit actions (text updates, attribute updates, etc.) for this simple bibliography insert.
0 - article * depth: 0 * similar-word: 99.9
  606 - back * depth: 1 * similar-word: 99.8
    1932 - ref-list * Initial: 156 * Final: 157 * depth: 2 * similar-word: 99.7
      7993 - ref * id :B153 * depth: 3 * change-type: insert
jats-diff considers the following to be special objects: tables, bibliographies, figures and mathematical formulas. Those objects are special because their edits are not human readable and have to be represented in a different way: for tables (as seen in the following example), the similarity index is calculated per table content and caption.
Element lists and special objects: modification of a table:
0 - article * depth: 0 * similar-word: 100.0
  3977 - back * depth: 1 * similar-word: 99.8
    6312 - sec * id :sec-type="display-objects" * depth: 2 * similar-word: 99.5
    6313 - table * Initial: 6 * Modified: 1 * Final: 6 * depth: 2 * similar-word: 99.6
      6895 - table-wrap * id :t006 * depth: 3 * change-type: table-edit * table * 66.7 * caption * 98.0 * similar-word: 89.9
Most of the special objects (tables, figures) but also bibliographies are within so-called element lists. jats-diff also uses change semantics in order to represent those in a more readable way. In the previous example where a table edit is shown, we can observe that the entire table list is shown with its initial, modified and final values. The initial value represents the total number of tables in document A, the modified value represents the number of edited tables and the final value represents the total number of tables in document B.
Similarity index
As seen in the two previous examples, the similarity index is calculated and propagated through the JATS XML tree. jats-diff can calculate different similarity indexes: similartext, similartext-word, Jaccard and TFIDF. By default, the similartext-word index is used, which gives the reader insights into lexical changes. In addition to the existing lexical similarity indexes, work is in progress to add further similarity indexes, such as Topic Model and Word2Vec, which could allow extracting semantic changes and distinguishing between simple sentence rephrasing and changes in sentence meaning.
0 - article * depth: 0 * similar-word: 99.5
  155 - body * depth: 1 * similar-word: 99.3
    156 - sec * id :sec1 * 1. Introduction * depth: 2 * similar-word: 99.9
      169 - p * depth: 3 * change-type: text-update - text-inserted * similar-word: 99.3
    175 - sec * id :sec2 * 2. Materials * depth: 2 * similar-word: 96.2
      199 - sec * id :sec2dot2 * 2.2. Imagery Surveyed from UAV and Satellites * depth: 3 * similar-word: 77.8
        202 - p * depth: 4 * change-type: delete
We compare in the next section the new jats-diff algorithm with the other state-of-the-art XML diff algorithms that work well with text-centric documents.
6. Performance analyses
The performance analyses of jats-diff8 are divided into two subsections: one on information extraction capacity and one on execution performance. This being a state-of-the-art algorithm, our main effort was dedicated to the capacity to detect new edit patterns and change semantics extraction, rather than its implementation performance.
6.1. Information extraction capacity
The initial evaluation phase consists of comparing the "Level 1" and "Level 2" information extraction capacities of the new jats-diff algorithm with those of JNDiff, XyDiff and XCC. Table 3 shows the results grouped by algorithm and level. For this performance analysis, we created different JATS XML file pairs, each made of one original and one modified version of the same article. The modified version contains one of the human edits described in the "Human edit description" column. The output of each of the compared algorithms is then verified for its ability to detect the given edit type.
Table 3
Level 1/2 edit detection and similarity index calculation capacities for jats-diff, JNDiff, XyDiff and XCC.
As seen in Table 3, jats-diff is able to detect all of the "Level 1" and "Level 2" edits. In addition, a similarity index is calculated for each detected change and propagated upwards in the XML tree. JNDiff comes next, with a perfect score for "Level 1" and a low score for "Level 2" edits; JNDiff can also detect "wrap" and "unwrap" edit patterns, which are similar to style edits. XyDiff follows with similar results, except for text insert detection, where XyDiff mostly uses text updates to represent text inserts. This is because XyDiff calculates the longest common sub-string (LCS) and minimises the edit distance, which makes the results harder for a human reader to interpret. XCC comes after XyDiff, with additional issues in detecting tree delete and tree move edits compared to JNDiff.
Concerning the delta output, Table 3 shows that jats-diff uses the minimal number of edit actions for almost all edit pattern detections. In the few cases where JNDiff, XyDiff or XCC output a lower number of edit actions, the edits are usually represented as a simple delete-insert combination, which does not reflect the real changes made by humans at all. We observe this in the next section, where we evaluate the delta file size for each of the algorithms. Pushing the idea of minimising the number of edit actions to its limit, one could apply a single delete-insert combination to the complete document, which would minimise the number of edit actions but maximise the delta file size.
6.2. Execution performance
Although not critical in our working environment, algorithm performance in terms of both execution time and memory usage remains important. Unlike other XML diff algorithms that are built to compare hundreds of thousands of XML documents (for example, XyDiff and webpage difference extraction), jats-diff has to compare academic articles in JATS XML format during the publication process. Their number is counted in hundreds per day, which is far below the number of documents that XyDiff usually faces.
As jats-diff uses the JNDiff core functions for "Level 1" change detection, its execution time and memory usage are at least those of JNDiff. The new "Level 2" pattern detection requires additional analysis of the detected "Level 1" differences and therefore more time and memory. The similarity calculation and propagation is carried out separately by analysing the delta output and requires additional execution time.
We divide jats-diff into two parts: first, the new "Level 2" pattern detection, and second, the similarity calculation and propagation. JATS articles are large XML files that may vary from 100 KB to 400 KB. The tests9 were run on a JATS document pair A and B representing real-life author changes during a revision round, affecting every aspect of the article: the title, authors, affiliations, paragraphs, figures, tables and references.
Figure 15 shows both the execution time (15a) and the maximum and average memory used (15b) during a comparison of two JATS documents containing real-life author changes. As expected, both parts of the new tool take more time and memory to perform the diff and the semantics extraction; however, this is acceptable within our environment, as the information consumers are humans who aim to compare the original and revised versions of academic articles, which does not need to be done in real time while the authors submit their revised version.

Figure 15
Execution time and memory usage for comparing two real-life author changes in JATS documents.
6.3. Current limitations
Although the jats-diff algorithm was tested for all previously described edit actions, its implementation is still a work in progress, and the algorithm has not yet been tested in production; there are therefore still some edge cases where jats-diff fails. We mostly tested the algorithm up to the second level of complexity, where up to two edit actions are combined at the same time. Below is a list of some of those combinations that currently fail:
- Upgrade Subsection 2.3 to Section 6 and move one <p> of the upgraded subsection to Section 5;
- Split the last paragraph of Section 1 into two paragraphs below each other and move Section 1 after Section 2;
- Split the last paragraph of Section 1 into two paragraphs below each other and update the text in the second split part;
- Merge the last two paragraphs of Section 1 and move Section 1 after Section 2;
- Merge the last two paragraphs of Section 1 and update style range in the merged paragraph;
- ...
7. Discussion
We have seen in the previous sections how, both in theory and in practice, the jats-diff algorithm uses new "Level 2" edit pattern detection in order to have a bijection between author edits on one side and the changes between the two JATS documents on the other. Using the similarity index and change semantics, jats-diff can not only detect syntactic changes but also use JATS-specific change semantics in order for the human reader to be able to obtain a broader picture of the changes made by the author. Being still a work in progress, jats-diff is striving to improve the parts described below.
The current delta and semantics output files still need to be improved. For now, they are presented as a delta XML document for the "Level 2" edits and as a tree-structured text document for the semantics and similarity index. A better and more understandable visualisation of those data could help the human reader even more. One idea would be to convert the JATS XML document pair to HTML in order to obtain a readable representation of the article, similar to the approach used by version control systems (Git, Subversion, etc.). Additionally, we could use our change pattern detection and change semantics information to visually annotate the changes made by the author, as well as the textual impact those changes had on specific key elements of the article, for example, on specific paragraphs, sections, titles, etc.
Another practical use-case of the presented jats-diff algorithm would be with regard to assisting the final decision maker’s work even further. In Figure 1, we can see that jats-diff helps with comparing different article versions; however, the final decision maker must still read the reviewer comments and match them with the changes made by the authors. In order to assist the final decision maker, we could use Named Entity Recognition (NER) on the reviewer comments in order to extract information on where within the article the change should happen. Once this new information has been extracted, we could match it to the existing change location and correlate those two pieces of information together. This means that we could connect the expected changes, i.e., those requested by the reviewers, to the effective changes, i.e., those made by the authors. Representing this on a graphical interface could further help to simplify the final decision-making process.
8. Conclusion
Within this article, we have described a new JATS comparison algorithm called jats-diff, which is able to detect additional "Level 2" edit patterns that are closely related to the text processor edits made by the authors. This allows us to have a bijection between author modifications on one side and changes detected between two JATS documents on the other. In order to assess the need for the new "Level 2" edit pattern recognition, we started by evaluating the different edit actions authors make during the revision rounds. We then assessed the impact that those edits have on JATS XML and realised that there is a need for new XML edit pattern recognition: paragraph split and merge, text move, subsection upgrade, section downgrade, style and citable object edits. Afterwards, we proposed solutions, based on mathematical formulas, to detect those new edit patterns that are a combination of existing "Level 1" and "Level 2" edits. We also proposed a way to calculate the similarity index between different parts of the JATS document and propagate it through the XML tree. We then described the delta and the semantics outputs created by jats-diff by providing some examples of how different edit actions are represented by the algorithm. Finally, we conducted a performance analysis comparing jats-diff with three other state-of-the-art XML diff algorithms: JNDiff, XyDiff and XCC. First, we evaluated the "Level 2" edit capacities, where we could clearly observe that jats-diff is able to detect and represent all existing and new edit patterns described within this article. Afterwards, we evaluated the execution performance, where we measured the impact of the new "Level 2" edit detection and text similarity index computation on the time and memory used to compare two documents containing real-life author changes.
The use of the jats-diff algorithm can facilitate peer reviewers’ and the Editors-in-Chief’s decision-making process by automating the manual comparison of different article versions. Compared with existing XML diff algorithms that represent differences between two documents with a limited set of edit patterns, jats-diff enables a bijection of author modifications and changes detected by comparing the two JATS article versions. The similarity index computed on different parts of the article also provides a clearer picture to the final decision maker in order to understand which parts of the articles are most impacted by the change.
As for the future of jats-diff, there is still work to be done on a better visualisation of recognised edits and the display of the similarity index. Converting JATS to HTML and annotating the document with the detected differences and similarity index will be the focus of future experiments.
References
- 1. To W., Yu B. Rise in higher education researchers and academic publications [version 1; peer review: 2 approved]. Emerald Open Research 2020. doi: 10.35241/emeraldopenres.13437.1.
- 2. Hunt J.W., McIlroy M.D. An algorithm for differential file comparison. Bell Laboratories; Murray Hill, NJ, USA: 1976.
- 3. Myers E.W. An O(ND) difference algorithm and its variations. Algorithmica 1986; 251–266. doi: 10.1007/BF01840446.
- 4. W3C. Extensible Markup Language (XML), 2016.
- 5. Selkow S.M. The tree-to-tree editing problem. Information Processing Letters 1977; 184–186. doi: 10.1016/0020-0190(77)90064-3.
- 6. Chawathe S.S., Rajaraman A., Garcia-Molina H., Widom J. Change detection in hierarchically structured information. ACM SIGMOD Record 1996; 493–504. doi: 10.1145/235968.233366.
- 7. Chawathe S.S., Garcia-Molina H. Meaningful change detection in structured data. ACM SIGMOD Record 1997; 26–37. doi: 10.1145/253262.253266.
- 8. Cobena G., Abiteboul S., Marian A. Detecting changes in XML documents. Proceedings 18th International Conference on Data Engineering. IEEE; San Jose, CA, USA: 2002; 41–52.
- 9. Lindholm T., Kangasharju J., Tarkoma S. Fast and Simple XML Tree Differencing by Sequence Alignment. Proceedings of the 2006 ACM Symposium on Document Engineering. Association for Computing Machinery; New York, NY, USA: 2006. doi: 10.1145/1166160.1166183.
- 10. Rönnau S., Scheffczyk J., Borghoff U.M. Towards XML Version Control of Office Documents. Proceedings of the 2005 ACM Symposium on Document Engineering. Association for Computing Machinery; New York, NY, USA: 2005. doi: 10.1145/1096601.1096606.
- 11. Rönnau S., Borghoff U.M. Versioning XML-based office documents. Multimedia Tools and Applications 2009; 253–274. doi: 10.1007/s11042-009-0271-2.
- 12. Ciancarini P., Iorio A.D., Marchetti C., Schirinzi M., Vitali F. Bridging the gap between tracking and detecting changes in XML. Software: Practice and Experience 2016; 227–250. doi: 10.1002/spe.2305.
- 13. Cuculovic M., Fondement F., Devanne M., Weber J., Hassenforder M. Change Detection on JATS Academic Articles: An XML Diff Comparison Study. Proceedings of the ACM Symposium on Document Engineering 2020; 1–10.
- 14. Wang Y., DeWitt D.J., Cai J.Y. X-Diff: An effective change detection algorithm for XML documents. Proceedings 19th International Conference on Data Engineering (Cat. No. 03CH37405). IEEE; Bangalore, India: 2003; 519–530.
- 15. Chen Y., Madria S., Bhowmick S. DiffXML: Change Detection in XML Data. Database Systems for Advanced Applications. Lee Y., Li J., Whang K.Y., Lee D., editors. Springer; Berlin, Heidelberg, Germany: 2004; 289–301.
- 16. Langhammer F. Bauen statt modellieren. iX 2004; 100–103.
- 17. Norman Walsh. DiffMK, 2015.
- 18. Rönnau S., Philipp G., Borghoff U.M. Efficient Change Control of XML Documents. Proceedings of the 9th ACM Symposium on Document Engineering. Association for Computing Machinery; New York, NY, USA: 2009. doi: 10.1145/1600193.1600197.
- 19. Rönnau S., Borghoff U.M. XCC: change control of XML documents. Computer Science - Research and Development 2012; 95–111. doi: 10.1007/s00450-010-0140-2.
- 20. Lorenz Schori. Delta.js - A JavaScript diff and patch engine for DOM trees, 2020.
- 21. Jaccard P. Distribution of the alpine flora in the Dranse’s basin and some neighbouring regions. Bulletin de la Societe vaudoise des Sciences Naturelles 1901; 241–272.
Footnotes
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9. The evaluation was done on an Apple MacBook Pro (16-inch, 2019); processor: 2.4 GHz 8-Core Intel Core i9; memory: 32 GB 2667 MHz DDR4; SSD