Information in morphological characters

Abstract The construction of morphological character matrices is central to paleontological systematic study, which extracts paleontological information from fossils. Although the word information has been mentioned repeatedly in a wide array of paleontological systematic studies, its meaning has rarely been clarified or specifically defined. It is important, however, to establish a standard for measuring paleontological information because fossils are rarely complete, rendering the recognition of homologous and homoplastic structures difficult. Here, based on information theory, we show the deep connections between paleontological systematic study and communication system engineering. Information is defined as the decrease of uncertainty, and it is the information in morphological characters that allows distinguishing operational taxonomic units (OTUs) and reconstructing evolutionary history. We propose that concepts in communication system engineering such as source coding and channel coding correspond to the construction of diagnostic features and of entire character matrices in paleontological studies. The two coding strategies should be distinguished, following typical communication system engineering, because they serve different purposes. With character matrices from six different vertebrate groups, we analyzed their information properties, including source entropy, mutual information, and channel capacity. Estimation of channel capacity shows character saturation in all matrices in transmitting paleontological information, indicating that, due to the presence of noise, oversampling characters not only increases the burden in character scoring but may also decrease the quality of matrices. We further test the use of information entropy, which measures how informative a variable is, as a character weighting criterion in parsimony-based systematic studies. The results show high consistency with existing knowledge, with both good resolution and interpretability.


KEYWORDS
character weighting, information theory, morphology, systematics

| INTRODUCTION
Most extinct organisms preserved only their morphology rather than macro-biomolecules such as DNA and proteins. Therefore, researchers need to convert the morphology of fossils into sequences, for example a series of scored morphological characters, and analyze such sequences to identify each OTU (operational taxonomic unit; classification) and reconstruct their evolutionary history (systematics). However, unlike DNA or protein sequences coded by fixed alphabets (4 nucleotides and 20 amino acids), there is no universal morphological alphabet that can digitize the morphology of extinct organisms into sequences. A practical, and probably the most common, way to convert morphology into sequences is constructing morphological character matrices, which contain various OTUs and characters. According to the morphology of different OTUs, they are assigned different states, usually 0 and 1, for each character. The difficulties in constructing morphological characters were recognized early (Wilkinson, 1995), and many early attempts to propose methods or guidance for character construction are far from satisfactory (Estabrook et al., 1975; Hawkins et al., 1997; Sereno, 2007). The definition of "character" (in cladistic analysis) has also been widely discussed (see review by Sereno, 2007), but no definition is universally applied.
Besides the most basic question of what a character is, discussions have been ongoing on whether to use giant matrices (Laing et al., 2018) or not (Simões et al., 2017), which anatomical structures should be represented by characters (Brocklehurst & Benevento, 2020), whether to combine morphological characters with molecular data and shape data (Nylander et al., 2004;Catalano et al., 2010), etc. Moreover, due to the incompleteness and distortions from preservation environments, most morphological character matrices can only be partially scored. If morphological characters are the most basic units in morphology-based systematic studies, which resemble the nucleotides in DNA sequences and amino acids in proteins, analyzing character matrices under the framework of information theory may help to better understand those arguments.
The word information is used repeatedly in systematic studies (Cracraft, 1974; Farris, 1979; Mickevich & Platnick, 1989; Wilkinson et al., 2004; Sereno, 2007; Simões et al., 2017; Laing et al., 2018), but it often seems to be confused with data, signal, or its embedded semantic meaning. Few studies have connected information theory with systematic studies, especially fossil-based analyses.
Similarly, during the early development of telecommunication systems, even after the extensive application of telegraph, telephone, and broadcast in the 1940s, a complete theory of communication system engineering was not formulated until information theory was proposed by Shannon (1948). Before constructing any communication system, it should be noted that the signals themselves are independent of their semantic meaning. Imagine a paleontologist and a local guide working at a remote fossil locality; the guide stays in the camp while the paleontologist looks for fossils. The paleontologist finds a dinosaur skeleton and needs tools from the camp to dig it out, but walking back takes too long, so the paleontologist wants the local guide to bring those tools. A smart way to do so is to make an agreement with the local guide before leaving the camp, as follows: raising a red flag means the paleontologist needs fossil-digging tools; raising a blue flag means he needs food.
With such an agreement, the paleontologist and the local guide communicate fairly efficiently; the only concern is whether the local guide can see the color of the flag, not the meaning of the color itself.
The mixture of signals and their semantic meaning can cause serious problems in communication, because exactly the same signal may have totally different meanings. This conflation hindered improvements in communication quality because no guidance existed to maximize the efficiency of coding the information source or to minimize the influence of noise in communication channels. Shannon (1948) indicated that information is the decrease of uncertainty and that a typical communication system can be divided into five parts: the information source, encoder, channel (which usually introduces noise), decoder, and destination (Figure 1a). Shannon (1948, p. 379) stated that "The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point." Paleontological systematic studies share abundant similarities with a communication system (Figure 1b) and focus on reconstructing the evolutionary history of extinct organisms. Most modern communication systems such as telephone, email, and instant messaging apps communicate in the spatial domain, whereas paleontological systematic studies represent communication in the temporal domain. The original organisms can be treated as the information source, the fossils discovered as the received message, and the preservation environments as noisy channels.

FIGURE 1 (a) Typical communication system, modified from Shannon (1948); (b) paleontological systematic studies in abstract

| PALEONTOLOGICAL SYSTEMATIC STUDY AS A COMMUNICATION SYSTEM
Although some signals are lost or distorted during preservation, we are interested in how much information is preserved and in whether, or how, we can reconstruct the lost or distorted signals from the known ones. The encoder in Figure 1a encodes the original messages into signals, for example, encoding "I need fossil digging tools" into a red flag as in the example above, and the decoder does the reverse. In paleontology, a widely used encoder is the morphological character matrix, which encodes each OTU as a sequence of character states. The fundamental problem of paleontological studies is reproducing at present, either exactly or approximately, organisms that lived in another age. Two questions must be answered to do so: (a) how much information was in an organism or taxon? and (b) how much information can be preserved?
For the first question, numerous publications have described the morphology of fossils in fine detail, but such descriptions cannot be applied directly in paleontological systematic studies. As the morphological character matrix is probably the most common encoder in paleontology, the number of characters in it determines how many signals can be encoded, which sets the upper limit on how much information can be transmitted.
For the second question, although most fossils are not well preserved, the preserved elements may help to reconstruct the missing parts. Many dinosaur species were named based on limited fossil material. For example, Deinocheirus (Dinosauria, Theropoda) was first collected in 1965 (Kielan-Jaworowska & Dovchin, 1966), when only the forelimbs and several other fragments were discovered; Ostrom (1972) then recognized its affinity with ornithomimosaurs, which was later supported by Senter (2007); finally, the discovery of nearly complete skeletons (Lee et al., 2014) settled most arguments.
In communication system engineering, such processes are named source coding and channel coding; their differences are listed in Table 1. Source coding focuses on minimizing the cost of encoding all original messages. For example, Morse Code uses codes of different lengths to represent each letter of the alphabet; coding cost is minimized by attributing the shortest code (a single dot) to the letter with the highest frequency in English (the letter E), whereas rarer letters such as X, Y, and Z have longer codes. Channel coding, on the other hand, is designed to resist noise in the channel. The simplest but inefficient example of channel coding is the repetition code. Suppose an information source randomly sends 0 and 1 via a noisy channel that has a 30% chance of flipping the original message, so any 0 or 1 received has a 70% chance of being correct. To resist such noise, the encoder repeats each message three times, turning "0" into "000" and "1" into "111"; under the maximum likelihood decoding principle, "000," "100," "010," and "001" are decoded as "0" and the others as "1." The received message then has a 78.4% chance of being correct (0.7³ + 3 × 0.7² × 0.3 = 0.784), which is better than the original encoding. However, repetition codes are usually inefficient: in this example, encoding has tripled the cost, but accuracy improves by only 8.4 percentage points.
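The arithmetic above can be checked with a short script. A minimal sketch, assuming a binary symmetric channel with flip probability 0.3 and majority-vote decoding:

```python
from itertools import product

def repetition_success(p_flip, n=3):
    """Probability that an n-fold repetition code, decoded by majority
    vote, recovers the original bit over a binary symmetric channel
    with flip probability p_flip."""
    prob = 0.0
    for flips in product([0, 1], repeat=n):
        # A received word is decoded correctly when fewer than half
        # of its bits were flipped by the channel.
        if sum(flips) < n / 2:
            p = 1.0
            for f in flips:
                p *= p_flip if f else (1 - p_flip)
            prob += p
    return prob

print(repetition_success(0.3))        # ≈ 0.784, better than the bare 0.7
print(repetition_success(0.3, n=5))   # higher accuracy still, at 5x the cost
```

Increasing n improves accuracy further, but at an ever-growing transmission cost, which is why practical channel codes are far more sophisticated than repetition.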
The joint source-channel coding theorem (Shannon, 1948), also known as the source-channel separation theorem, shows that source coding and channel coding can be designed separately without one influencing the other. If the channel capacity is strictly greater than the source information entropy, noiseless communication can be achieved through sophisticated engineering, even over a noisy channel. In practice, the information encoder is often engineered as decoupled source and channel encoders that serve different purposes, as in Table 1.
Similarly, the differences between source coding and channel coding have been recognized and practiced in many paleontological systematic studies. In various studies, including Nelson (1972) and Cracraft (1974), researchers have shown the differences between classification (Linnaean classification and its variants) and systematics (phylogenetic classification, evolutionary classification, evolutionary systematics, etc.; see also Harrison, 1993).

Morphological characters usually have two states, which can be coded as 0 and 1. Although more states are sometimes available, multistate characters can always be split into several binary ones.
A morphological character can be treated as a variable with a discrete distribution over a group of organisms, and we are interested in how much information is in such a character. Information is defined as the decrease of uncertainty (Shannon, 1948). If a character is scored the same for all OTUs in a group, the information given by this character is 0 because it does not decrease any uncertainty. The information given by a variable is its information entropy, which can be calculated as follows:

H = −Σᵢ pᵢ log₂ pᵢ,

where H is the information entropy (measured in bits if the base of the logarithm is 2) and pᵢ represents the probability of the ith possible value of the source variable, that is, the possible states of a morphological character in paleontological studies. As the distribution of character states changes, the information given by a character also changes, reaching its maximum at the uniform distribution. For a binary variable P taking its two values with probabilities p and (1 − p), the information entropy is:

H(P) = −p log₂ p − (1 − p) log₂(1 − p),

and this relationship between information entropy and probability is illustrated in Figure 2a. The distribution of character states may change as new fossils are scored, which will change the character's information entropy; as the number of OTUs increases, the influence of each newly added OTU decreases. If every binary character has an information entropy of 1 bit, that is, its states are equally distributed, then n binary characters can distinguish 2ⁿ taxa in the ideal situation. Table 2 is an example character matrix including 9 taxa and 3 scored binary characters (0 means absence and 1 means presence of a structure), which illustrates the differences between source coding and channel coding in paleontological systematic studies.

TABLE 1 Comparison between source coding and channel coding (columns: source coding, channel coding)
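The entropy formula above can be computed directly from a character's scored states. A minimal sketch (the example state strings are hypothetical, not taken from Table 2):

```python
from collections import Counter
from math import log2

def entropy(states):
    """Information entropy (in bits) of a character's observed states."""
    counts = Counter(states)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# A binary character scored across 8 OTUs:
print(entropy("00001111"))  # equal distribution: 1.0 bit
print(entropy("00000001"))  # skewed distribution: about 0.544 bit
print(entropy("00000000"))  # invariant character: 0 bits of information
```

As the formula predicts, the balanced character is maximally informative and the invariant one carries no information at all.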
In the character matrix shown in Table 2, there are 9 taxa and only 3 scored binary characters. The sequences of taxa 7 and 7.1 are identical for the three given characters, hence they cannot be distinguished unless other characters are observed. If we combine taxa 7 and 7.1 into a single OTU, all three characters have an information entropy of 1 bit. Although these 3 binary characters are sufficient to distinguish the 8 OTUs, they are far from enough to produce a resolved cladogram. In practice, the number of characters is usually much larger than the number of taxa in a character matrix, and larger character matrices seem to be a trend in paleontological systematic studies (O'Leary et al., 2013; Baron et al., 2017a; Laing et al., 2018). In Table 2, additional characters would likewise be needed to separate taxa 7 and 7.1. From this simplified example, we can conclude that, to construct a comprehensive and robust character matrix, the sequences of character states should represent the source information entropy completely, and enough redundancy, based on mutual information, should be incorporated to minimize the influence of incomplete fossils and misidentified character states.
Since there are many characters in a character matrix, we are interested in their mutuality, which strongly influences the quality of the matrix. If two characters are strongly dependent, we can infer the state of a missing character from the observed state of the other, which may provide insight for dealing with incomplete specimens and for delimiting modules in studies of mosaic evolution. As shown above, the information of a variable is defined as the uncertainty it decreases; accordingly, the decrease in uncertainty of a variable A given another variable B measures the mutuality between them, the mutual information, calculated as follows:
I(A; B) = H(A) − H(A|B) = Σₐ Σ_b p(a, b) log₂ [p(a, b) / (p(a) p(b))],

where p(a, b) is the joint probability of states a and b, and p(a) and p(b) are the marginal probabilities. In Table 2, we can calculate the mutual information between each pair of characters; the results are 0 for all pairs (taxa 7 and 7.1 are treated as a single OTU). The zero mutual information in this designed character matrix indicates that the tail, feather, and five-digit characters are independent of one another; that is, knowing one character of a taxon does not decrease the uncertainty of another character. This lack of dependence also explains the vulnerability of a diagnosis when a character cannot be observed.
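The pairwise calculation can be sketched in a few lines. The two example characters below are hypothetical stand-ins for columns of a matrix like Table 2, chosen so that every state combination is equally frequent:

```python
from collections import Counter
from math import log2

def mutual_information(a, b):
    """Mutual information I(A;B) in bits between two characters
    scored over the same OTUs."""
    n = len(a)
    pa, pb = Counter(a), Counter(b)
    pab = Counter(zip(a, b))
    return sum((c / n) * log2((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())

tail = "00001111"
feather = "00110011"
print(mutual_information(tail, feather))  # 0.0: independent characters
print(mutual_information(tail, tail))     # 1.0: a character fully determines itself
```

Zero mutual information means knowing one character's state tells us nothing about the other, exactly the situation described for the designed matrix.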
Mutual information and joint information entropy can be further generalized to multiple variables. We can calculate the joint information entropy of an entire character matrix according to the joint distribution of character states as follows:

H(A₁, A₂, ……, Aₙ) = −Σ_{a₁} Σ_{a₂} …… Σ_{aₙ} P(a₁, a₂, ……, aₙ) log₂ P(a₁, a₂, ……, aₙ),  (5)

where Aᵢ represents one of the multiple variables. To simplify expression (5), we use sᵢⱼ to represent the j-th unique sequence of the first i characters in the matrix. For example, in the character matrix given by Table 2, s₂₁ = 00 and s₃₃ = 010.
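In practice, expression (5) reduces to the entropy of the distribution of unique state sequences across OTUs. A minimal sketch, using three hypothetical independent binary characters over 8 OTUs (as in Table 2 with taxa 7 and 7.1 merged):

```python
from collections import Counter
from math import log2

def joint_entropy(columns):
    """Joint information entropy (bits) of several characters, computed
    from the distribution of state sequences across OTUs."""
    sequences = list(zip(*columns))  # one state sequence per OTU
    n = len(sequences)
    counts = Counter(sequences)
    return -sum((c / n) * log2(c / n) for c in counts.values())

chars = ["00001111", "00110011", "01010101"]
print(joint_entropy(chars[:1]))  # 1.0 bit
print(joint_entropy(chars[:2]))  # 2.0 bits
print(joint_entropy(chars))      # 3.0 bits: all 8 sequences are distinct
```

Because the three characters are independent, joint entropy grows by a full bit per character; dependent (mutually informative) characters would add less than a bit each.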
A study by Baron et al. (2017a) proposed a significantly different dinosaur phylogeny, in which Theropoda and Ornithischia are sister taxa; the character matrices in these studies are larger than in many previous studies. Comparably, even before Shannon proposed information theory, communication engineers had designed codes, for example, Morse Code, and had identified factors influencing transmission quality in noisy channels (Nyquist, 1924, 1928). A general problem had been realized: blindly increasing signal power cannot improve communication quality beyond certain thresholds in noisy channels.
In typical digital communication systems, all messages are coded in 0 and 1 for transmission. The frequency of the transmitter is defined as how many changes can be made during 1 s, with unit Hz. With the increase of frequency, more signals can be sent within a given period of time.

| MATERIAL AND METHODS
In this study, we calculated the information properties of, and ran parsimony-based phylogenetic analyses on, six published character matrices (e.g., Tschopp et al., 2018). We first quantified the information entropy of each character in the six matrices. To investigate the mutuality among characters, we calculated the mutual information within each character matrix. To assess the differences between source coding and channel coding, we then calculated the joint information entropy of the first n (n ≤ total character number) characters.
Last, we used a discrete AWGN (additive white Gaussian noise) channel model to estimate the channel capacity of fossil preservation environments.
For characters with missing data in the character matrices, we estimated the missing parts to be equally distributed among the possible states. For example, if a binary character is scored 0 in 20% of OTUs, 1 in 40% of OTUs, and missing in 40% of OTUs, the estimated distribution is 0 in 20% + 0.5 × 40% = 40% and 1 in 40% + 0.5 × 40% = 60%. We also calculated these values without the estimation of missing data, as a reference. Calculations were done with custom Python 3.7 scripts.
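The redistribution rule above can be sketched as follows; the helper name and the '?' symbol for missing scores are our illustrative choices, not taken from the authors' scripts:

```python
def estimated_distribution(scores, states=("0", "1")):
    """Estimate a character's state distribution, spreading missing
    scores ('?') evenly across the possible states."""
    n = len(scores)
    missing = scores.count("?") / n
    return {s: scores.count(s) / n + missing / len(states) for s in states}

# A binary character over 10 OTUs: scored 0 in 20%, 1 in 40%, missing in 40%.
scores = "00" + "1111" + "????"
print(estimated_distribution(scores))  # '0': 40%, '1': 60%, as in the worked example
```

The same estimated distribution can then be fed into the entropy formula; dropping the redistribution step gives the reference values computed without missing-data estimation.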
Phylogenetic analyses were done in TNT 1.5 using traditional search with default settings (Goloboff & Catalano, 2016); implied weighting used k = 3 and k = 12. The strict consensus tree was appended to the end of the tree list for each group. CI and RI are discussed for the strict consensus trees.

| RESULTS
The distribution of character information entropy is shown in Figure 2b. The mutual information within the six matrices was calculated (Figure 2d) to test the mutuality between characters. Due to missing data, the numbers on the diagonal, showing the mutuality between a character and itself, do not strictly correspond to that character's information entropy, but they are still generally higher than the rest of the heat maps. The distribution of mutuality shows no obvious pattern in most matrices. After reordering and partitioning characters by anatomical structure (cranium, pectoral girdle and forelimb, pelvic girdle and hindlimb, axial skeleton, and others), some parts exhibit relatively high mutuality; for example, the forelimbs and hindlimbs of Carnivoramorpha (Spaulding & Flynn, 2012) show higher inter- and intra-mutuality than other anatomical structures. The distributions of noise power in the taxa domain and the character domain are shown in Figure 3a,b, respectively. The results show saturation of channel capacity when increasing the bandwidth, that is, the number of characters (Figure 3c). Different character matrices reach their maximum channel capacity at 62.5% (Multituberculata) to 89.7% (Diplodocidae) of the total characters.

| Information source
No matter what algorithm is used in systematic studies, the common aspect is using sequences (DNA, amino acids, or morphological characters) to characterize organisms and to interpret their evolutionary history. With fixed alphabets, DNA and protein sequences resemble digital signals in modern communication systems, while the morphology of fossils is more like an analog signal. Therefore, the process of character construction resembles sampling digital signals from analog signals; the probably infinite original information entropy of fossil morphology is converted into a finite entropy, represented by hundreds to thousands of morphological characters, that can be more easily compared. More morphological characters usually describe organisms more completely, but it is extremely difficult to measure how completely a character matrix characterizes the overall morphology of a group of organisms. There is no standard guidance on character selection, and many characters in matrices are selected because researchers believe they carry morphological information. The interrelationships among morphological characters, and how they connect to the overall morphology, remain uncertain. At least from the results of mutual information and of channel capacity against bandwidth (the number of characters), we show that the dependence between characters and among anatomical structures is complex, and that current morphological character matrices already seem to have reached character saturation. Shannon (1949) proposed the Sampling Theorem (also known as the Nyquist-Shannon Sampling Theorem, because early work was done by Nyquist, 1924, 1928), which bridges continuous and discrete signals. For a continuous signal source of finite bandwidth, the Sampling Theorem gives the lowest sampling rate that captures all information: twice the highest frequency of the original signal.
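What goes wrong below that rate can be demonstrated numerically. A small sketch of aliasing (the frequencies are arbitrary illustrative choices): a 5 Hz sine sampled at 8 Hz, below its Nyquist rate of 10 Hz, is indistinguishable at the sample points from a 3 Hz sine of opposite sign, whereas sampling above the Nyquist rate keeps the two signals distinct:

```python
from math import sin, pi

# Sampled at 8 Hz (below the 10 Hz Nyquist rate), 5 Hz aliases onto 3 Hz:
fs = 8
for n in range(16):
    t = n / fs
    assert abs(sin(2 * pi * 5 * t) + sin(2 * pi * 3 * t)) < 1e-9

# Sampled at 12 Hz (above the Nyquist rate), the two signals clearly differ:
fs = 12
diffs = [abs(sin(2 * pi * 5 * n / fs) + sin(2 * pi * 3 * n / fs))
         for n in range(16)]
print(max(diffs))  # well above zero: no aliasing, the 5 Hz signal is recoverable
```

By analogy, undersampled morphology cannot be told apart from a different morphology that happens to agree at the sampled characters.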
As the connection between bandwidth in typical communication systems and character number in paleontological systematic studies was discussed above, the Sampling Theorem may be a bridge between raw morphology and morphological characters. However, the saturation of channel capacity (Figure 3c) does not necessarily mean that those morphological character matrices fully represent the entire morphology of the fossil specimens. Such saturation only shows that these matrices cannot transmit more than the morphological information sampled into them, while other information may be left out because the sampling of characters is strongly biased.
The morphological matrix of Multituberculata (Wang et al., 2019) comprises only characters from the cranial region, but the postcrania of those organisms also carry information.
With the wide application of advanced imaging techniques such as CT (computed tomography) scanning, it is feasible to capture the complete morphology of fossil specimens without destruction. This unprecedented amount of data may be the stepping stone to establishing the connection between analog morphological data and digital character data. A standard workflow for morphological studies may become possible with the facilitation of information theory and high-resolution imaging.

FIGURE 3 (a) Noise power distribution in the taxa domain; (b) noise power distribution in the character domain; (c) channel capacity and bandwidth in character matrices

| The properties of the channel (bandwidth, channel capacity, noise)
In this study, we use one of the most basic models, the AWGN channel, to mimic preservation environments, acknowledging its limited explanatory power.
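The saturation behavior seen in Figure 3c has a textbook analog in the Shannon-Hartley capacity of an AWGN channel. A sketch under assumed, arbitrary values of signal power and noise spectral density (not the values estimated for the fossil matrices): when total noise power grows with bandwidth, capacity C = B log₂(1 + S/(N₀B)) increases with B but saturates toward the wideband limit (S/N₀) log₂ e:

```python
from math import log2, e

def awgn_capacity(bandwidth, signal_power, noise_density):
    """Shannon-Hartley capacity (bits/s) of an AWGN channel whose total
    noise power scales with bandwidth: C = B * log2(1 + S / (N0 * B))."""
    return bandwidth * log2(1 + signal_power / (noise_density * bandwidth))

S, N0 = 1.0, 0.1                     # assumed example values
limit = (S / N0) * log2(e)           # wideband limit, about 14.43 bits/s
for B in (1, 10, 100, 1000):
    print(B, awgn_capacity(B, S, N0))
# Capacity keeps rising with bandwidth but flattens out below the limit,
# mirroring the character-saturation pattern: adding characters (bandwidth)
# eventually yields diminishing returns.
```

This is only an analogy under the AWGN assumption; it does not by itself justify the model for preservation environments.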

| Character matrix construction and weighting
The construction of (morphological) character matrices is central to systematic studies and has been discussed extensively. In this study, we make an initial attempt to quantify the information in existing morphological character matrices. Many results are consistent with the common understanding of morphological characters: different characters carry different amounts of information, mutuality exists among characters, more characters usually carry more information, and so on. We also propose that the information entropy of each character can be used as its weight in phylogenetic analysis.
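The proposed weighting can be sketched directly from the entropy definition: each character's weight is its information entropy over the observed (non-missing) scores. The matrix and character names below are hypothetical illustrations, and treating '?' as missing is our assumption:

```python
from collections import Counter
from math import log2

def entropy_weight(states):
    """Proposed weight for a character: its information entropy in bits,
    ignoring missing scores ('?'). Invariant characters get weight 0."""
    observed = [s for s in states if s != "?"]
    counts = Counter(observed)
    n = len(observed)
    return -sum((c / n) * log2(c / n) for c in counts.values())

matrix = {"tail": "00001111", "feather": "0000000?", "digits": "01110111"}
weights = {name: round(entropy_weight(col), 3) for name, col in matrix.items()}
print(weights)  # balanced characters weigh more than near-invariant ones
```

Unlike implied weighting, these weights require no initial tree: they depend only on the state distribution within each character.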
As information entropy represents how informative a character is, it is a candidate criterion for character weighting in phylogenetic analysis. Most researchers agree that some kind of weighting should be applied in systematic analysis, and equal weighting is itself one such method (Farris, 1969; Sereno, 2007). Based on the successive weighting proposed by Farris (1969), Goloboff (1993) proposed implied weighting and later extended implied weighting (Goloboff, 2014). These weighting methods refine the weights of different characters to reduce homoplasy. However, Congreve and Lamsdell (2016) indicated that implied weighting is not consistent with the idea of parsimony and that it increases both correctly and incorrectly resolved nodes in simulated datasets. Its wide use suggests that implied weighting and its variants probably point toward better resolved trees, but neither the theoretical basis nor the practice answers the core question of how much information is in each character, and the method may fail on character matrices with too many homoplastic characters.
Birds and modern mammals are both endothermic, are covered with filaments rather than scales, have four-chambered hearts, etc. If we deliberately sampled too many characters describing these features, the conclusion could easily be forced into birds being mammals, and many synapomorphies between birds and other reptiles would be obscured.

Successive weighting, implied weighting, and their variants require an initial weight or an existing tree topology, whereas information entropy weighting depends only on the information entropy of each character. In matrix construction, the choice of characters is often extremely biased toward cranial characters in vertebrate paleontology (Figure 2d). In the six datasets analyzed here, the proportion of cranial characters ranges from 40.7% to 100%, with an average of 63.2%, which immediately shows that some parts are considered to carry more morphological information (or to be "more important") than others in systematic studies. Källersjö et al. (1999) studied plant nucleotide data and showed that the fast-evolving and highly homoplastic third codon positions, contrary to traditional thought, carry the strongest phylogenetic information; they also suggested that the frequency of change should be used in character weighting and selection.
Although these authors tried to quantify the information in different nucleotide sites, that is, molecular characters, they did not explain how they defined information or informative sites.
We tested the results from equal weighting, implied weighting (k = 3 and 12), and information entropy weighting for the six matrices analyzed above. Results for Ceratopsia are illustrated in Figure 4. To save space and show the differences among trees, colored columns replace the OTU names on the right side of the trees, and the color gradients correspond to the taxon order in the character matrix. Detailed phylogenetic results are provided online at https://doi.org/10.5061/dryad.8sf7m0cnc.
Generally, information entropy weighting shows an unexpected consistency with both equal weighting and implied weighting, though slight differences are common.
The CI (consistency index) and RI (retention index) were also calculated for the most parsimonious tree of each group (Table 3). The CI of entropy weighting is generally slightly lower than that of the other methods, and the RI slightly higher, suggesting that more characters are interpreted as homologous and that the trees fit the entropy-weighted characters better.

CONFLICT OF INTEREST
All authors declare that they have no conflicts of interest.

DATA AVAILABILITY STATEMENT
Phylogenetic results for Ornithischia, Ceratopsia, Diplodocidae, Multituberculata, Carnivoramorpha, and lizards, by equal weighting, implied weighting (k = 3 and 12), and information entropy weighting, are presented in .nex files. The strict consensus tree is appended at the end of the trees in each file. The phylogenetic files can be found online at https://doi.org/10.5061/dryad.8sf7m0cnc.