On a 2-Relative Entropy

We construct a 2-categorical extension of the relative entropy functor of Baez and Fritz, and show that our construction is functorial with respect to vertical morphisms. Moreover, we show that such a ‘2-relative entropy’ satisfies natural 2-categorical analogues of convex linearity, vanishing under optimal hypotheses, and lower semicontinuity. While relative entropy is a relative measure of information between probability distributions, we view our construction as a relative measure of information between channels.


Introduction
Let X and Y be finite sets which are the input and output alphabets of a discrete memoryless channel f : X ⇝ Y with probability transition matrix f_{yx}, representing the probability of the output y given the input x. Every input x then determines a probability distribution on Y which we denote by f_x, so that f_x(y) = f_{yx} for all x ∈ X and y ∈ Y.
The channel f : X ⇝ Y together with the choice of a prior distribution p on X will be denoted (f|p), and such data then determine a distribution ϑ(f|p) on X × Y given by ϑ(f|p)(x, y) = p_x f_{yx}. Given a second channel g : X ⇝ Y with prior distribution q on X, the chain rule for relative entropy says that the relative entropy D(ϑ(f|p), ϑ(g|q)) is given by

D(ϑ(f|p), ϑ(g|q)) = D(p, q) + ∑_{x∈X} p_x D(f_x, g_x). (1)

As the RHS of (1) involves precisely the data of the channels f and g together with the prior distributions p and q, we view the quantity D(ϑ(f|p), ϑ(g|q)) as a relative measure of information between the channels (f|p) and (g|q). In particular, since from a Bayesian perspective D(p, q) may be thought of as the amount of information gained upon discovering that the assumed prior distribution p is actually q, it seems only natural to think of D(ϑ(f|p), ϑ(g|q)) as the amount of information gained upon learning that the assumed channel (f|p) is actually the channel (g|q).
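The chain rule (1) is easy to verify numerically. The following sketch (with hypothetical two-letter alphabets and made-up transition matrices) checks it directly:

```python
from math import log

def D(p, q):
    """Relative entropy D(p, q) = sum_i p_i log(p_i / q_i), with 0 log 0 = 0."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Channels as row-stochastic matrices: f[x][y] = f_{yx} = f_x(y).
f = [[0.9, 0.1], [0.2, 0.8]]
g = [[0.7, 0.3], [0.4, 0.6]]
p = [0.6, 0.4]   # prior for (f|p)
q = [0.5, 0.5]   # prior for (g|q)

# Joint distributions theta(f|p)(x, y) = p_x f_{yx} on X x Y, flattened.
theta_fp = [p[x] * f[x][y] for x in range(2) for y in range(2)]
theta_gq = [q[x] * g[x][y] for x in range(2) for y in range(2)]

lhs = D(theta_fp, theta_gq)
rhs = D(p, q) + sum(p[x] * D(f[x], g[x]) for x in range(2))
assert abs(lhs - rhs) < 1e-12  # the chain rule (1) holds
```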
To make such a Bayesian interpretation more precise, we build upon the work of Baez and Fritz [1], who formulated a type of Bayesian inference as a process X → Y (including a set of conditional hypotheses on the outcome of the process), which, given a prior distribution p on X, yields distributions r on Y and q on X in such a way that the relative entropy D(p, q) has an operational meaning as a quantity associated with a Bayesian updating with respect to the process X → Y. (Here, X may be thought of more generally as the set of possible states of some system to be measured, while Y may be thought of as the possible outcomes of the measurement.) Baez and Fritz then proved that, up to a constant multiple, the map on such processes given by

(X → Y) ↦ D(p, q) (2)

is the unique map satisfying the following axioms.

1. Functoriality: Given a composition of processes X → Y → Z, the relative entropy assigned to the composite process X → Z is the sum of the relative entropies assigned to X → Y and Y → Z.

2. Convex Linearity: Given a collection of processes U_x → V_x indexed by the elements x ∈ X of a finite probability space (X, p), the relative entropy assigned to the associated convex combination of processes is the corresponding convex combination ∑_{x∈X} p_x RE(U_x → V_x) of the individual relative entropies.

3. Vanishing Under Optimal Hypotheses: If the conditional hypotheses associated with a process X → Y are optimal, then the relative entropy assigned to the process is zero.

4. Lower Semicontinuity: The relative entropy assignment is lower semicontinuous.

While Baez and Fritz facilitated their exposition using the language of category theory [2], knowing that a category consists of a class of objects together with a class of composable arrows (i.e., morphisms) between objects is all that is needed for an appreciation of their construction. From such a perspective, the aforementioned processes X → Y are morphisms in a category FinStat, and the relative entropy assignment given by (2) is then a map from morphisms in FinStat to [0, ∞].
In what follows, we elevate the construction of Baez and Fritz to the level of 2-categories (or, more precisely, double categories), whose 2-morphisms may be viewed as certain processes between processes, or rather, processes which connect one channel to another. In particular, in Section 2, we formally introduce the category FinStat of Baez and Fritz, and then, in Section 3, we review their functorial characterization of relative entropy using FinStat. In Section 4, we construct a category FinStat₂ which is a 2-level extension of FinStat, and in Section 5, we define a convex structure on FinStat₂. In Section 6, we define a relative measure of information between channels which we refer to as conditional relative entropy, and show that it is convex linear and functorial with respect to vertical morphisms in FinStat₂. The conditional relative entropy is then used in Section 7 to define a relative entropy assignment RE₂ on 2-morphisms via the chain rule as given by (1) (for more on the chain rule for relative entropy, one may consult Chapter 2 of [3]). Moreover, we show that such a '2-relative entropy' satisfies the natural 2-level analogues of axioms 1–4 as satisfied by the relative entropy map RE.
As abstract as a relative entropy of processes between processes may seem, Shannon's Noisy Channel Coding Theorem, which is a cornerstone of information theory, is essentially a statement about transforming a noisy channel into a noiseless one via a sequence of encodings and decodings. From such a viewpoint, information theory is fundamentally about processes (i.e., sequences of encodings and decodings) between processes (i.e., channels), and it is precisely this viewpoint with which we will proceed. Furthermore, there has been growing recent interest in axiomatic and categorical approaches to information theory [4–13], and the present work is a direct outgrowth of such activity.

The Category FinStat
In this section, we introduce the first-level structure of interest, which is the category FinStat introduced by Baez and Fritz [1]. Though we use the language of categories, knowing that a category consists of a class of composable arrows between a class of objects is sufficient for the comprehension of all categorical notions in this work.

Definition 1. Let X and Y be finite sets. A discrete memoryless channel (or simply channel for short) f : X ⇝ Y associates every x ∈ X with a probability distribution f_x on Y. In such a case, the sets X and Y are referred to as the set of inputs and the set of outputs of the channel f, respectively, and f_x(y) is the probability of receiving the output y given the input x, which will be denoted by f_{yx} for all x ∈ X and y ∈ Y.
Definition 2. If f : X ⇝ Y is a channel such that for every x ∈ X there exists a y ∈ Y with f_{yx} = 1, then such a y is necessarily unique given x, and as such, f may be identified with a function f : X → Y. In such a case, we say that the channel f is pure (or deterministic).

Remark 1. From here on, we will not distinguish between a pure channel and the associated function from its set of inputs to its set of outputs.

Definition 3.
If ∗ denotes a set with a single element, then a channel p : ∗ ⇝ X is simply a probability distribution on X, and, in such a case, we will use p_x to denote the probability of x as given by p for all x ∈ X. The pair (X, p) is then referred to as a finite probability space.

Notation 1. The datum of a channel f : X ⇝ Y together with a prior distribution p : ∗ ⇝ X on its set of inputs will be denoted (f|p).

Definition 4.
Let FinStat denote the category whose objects are finite probability spaces, and whose morphisms (X, p) −→ (Y, q) consist of the following data:

• a function f : X → Y such that f • p = q, and
• a channel s : Y ⇝ X such that f • s = 1_Y, referred to as a stochastic section of f,

as summarized by diagram (3). A composition of morphisms in FinStat is obtained via function composition and composition of stochastic sections; in such a case, it is straightforward to show that a composition of stochastic sections is again a stochastic section. The morphism corresponding to diagram (3) will often be denoted (f, p, s).

Remark 2. In diagram (3), a straight arrow is used for f : X −→ Y, as f is a function, as opposed to a noisy channel.

Remark 3. An operational interpretation of diagram (3) is as follows. The set X is thought of as the set of possible states of the system, and f : X → Y is then thought of as a measurement process, so that Y is then thought of as the set of possible states of some measuring apparatus. The stochastic section s : Y ⇝ X is then thought of as a set of hypotheses about the state of the system given a state of the measuring apparatus. In particular, s_{xy} is thought of as the probability that the system was in state x given the state y of the measuring apparatus.
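In the Baez–Fritz formulation, a stochastic section s of a function f : X → Y is a channel Y ⇝ X satisfying f • s = 1_Y. The following sketch (with hypothetical values for f and s) checks this condition directly:

```python
f = {0: 0, 1: 0, 2: 1}                    # f : X -> Y with X = {0,1,2}, Y = {0,1}
s = [[0.4, 0.6, 0.0], [0.0, 0.0, 1.0]]    # s[y][x] = s_{xy}, each row a distribution

# (f . s)_{y' y} = sum of s_{xy} over x in f^{-1}(y'); this must be 1_Y.
f_s = [[sum(s[y][x] for x in f if f[x] == yp) for y in range(2)]
       for yp in range(2)]
assert all(abs(f_s[i][j] - (1.0 if i == j else 0.0)) < 1e-12
           for i in range(2) for j in range(2))
```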

Definition 5.
If the stochastic section s : Y ⇝ X in diagram (3) is such that s • q = p, then s will be referred to as an optimal hypothesis for (f|p).
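For a morphism (f, p, s) with q = f • p, the Bayesian posterior s_{xy} = p_x [f(x) = y]/q_y always satisfies the optimality condition s • q = p. A numeric sketch with a hypothetical three-state system:

```python
# The Bayesian posterior s_{xy} = p_x * [f(x) = y] / q_y, with q = f . p,
# satisfies s . q = p, i.e., it is an optimal hypothesis for (f|p).
f = {0: 0, 1: 0, 2: 1}          # a function f : X -> Y, X = {0,1,2}, Y = {0,1}
p = [0.2, 0.3, 0.5]             # prior on X

q = [sum(p[x] for x in f if f[x] == y) for y in range(2)]       # pushforward q = f . p
s = [[p[x] * (1 if f[x] == y else 0) / q[y] for x in range(3)]  # s[y][x] = s_{xy}
     for y in range(2)]

s_q = [sum(s[y][x] * q[y] for y in range(2)) for x in range(3)]  # (s . q)_x
assert all(abs(a - b) < 1e-12 for a, b in zip(s_q, p))           # recovers p
```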
Definition 6. Let (X, p) be a finite probability space, and let µ_x : U_x ⇝ V_x be a collection of channels with prior distributions q_x on U_x for all x ∈ X. The convex combination of the channels (µ_x|q_x) with respect to (X, p) is the channel ⊕_{x∈X} µ_x from the disjoint union of the sets U_x to the disjoint union of the sets V_x, together with the prior distribution ∑_{x∈X} p_x q_x, where ⊕_{x∈X} µ_x is the channel given by (⊕_{x∈X} µ_x)_u = (µ_{x_u})_u, where x_u is such that u ∈ U_{x_u}. Such a convex combination will be denoted ∑_{x∈X} p_x (µ_x|q_x).
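Concretely, the convex combination assembles the channels µ_x into a block channel on the disjoint unions, weighting the priors by p. A sketch under this reading, with hypothetical two-component data:

```python
# Convex combination of channels: block-diagonal channel on the disjoint
# unions, with prior sum_x p_x q_x (all numbers hypothetical).
mu0 = [[0.9, 0.1]]                   # U_0 = {0}, V_0 has two elements
mu1 = [[0.3, 0.7], [0.5, 0.5]]       # U_1 = {0,1}, V_1 has two elements
q0, q1 = [1.0], [0.25, 0.75]         # priors on U_0 and U_1
p = [0.5, 0.5]                       # weights from the probability space (X, p)

# The combined channel acts as mu_{x_u} on each u, where u lies in U_{x_u}.
combined = [row + [0.0, 0.0] for row in mu0] + [[0.0, 0.0] + row for row in mu1]
prior = [p[0] * w for w in q0] + [p[1] * w for w in q1]

assert all(abs(sum(row) - 1.0) < 1e-12 for row in combined)  # row-stochastic
assert abs(sum(prior) - 1.0) < 1e-12                          # a distribution
```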

The Baez and Fritz Characterization of Relative Entropy
We now recall the Baez and Fritz characterization of relative entropy in FinStat.
Definition 7. Let (X, p) be a finite probability space, and let (µ_x, q_x, s_x) : (U_x, q_x) → (V_x, r_x) be a collection of morphisms in FinStat indexed by x ∈ X, where r_x = µ_x • q_x for all x ∈ X. The convex combination ∑_{x∈X} p_x (µ_x, q_x, s_x) is then the morphism in FinStat whose underlying channel and prior are the convex combination ∑_{x∈X} p_x (µ_x|q_x) of Definition 6, and whose stochastic section is assembled componentwise from the sections s_x.
Definition 8. Let (f, p, s) be a morphism in FinStat, and let r = s • f • p. The relative entropy of (f, p, s) is the non-negative extended real number RE(f, p, s) ∈ [0, ∞] given by

RE(f, p, s) = D(p, r),

where D(p, r) = ∑_x p_x log(p_x/r_x) is the relative entropy between the distributions p and r on X.
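A short computation of RE(f, p, s) = D(p, r) with r = s • f • p for hypothetical data (the hypothesis s below is deliberately non-optimal, so the relative entropy is strictly positive):

```python
from math import log

def D(p, q):
    """Relative entropy D(p, q) = sum_i p_i log(p_i / q_i), with 0 log 0 = 0."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

f = {0: 0, 1: 0, 2: 1}                   # a function f : X -> Y
p = [0.2, 0.3, 0.5]                      # prior on X
s = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]   # a hypothesis s : Y ~> X (not optimal)

q = [sum(p[x] for x in f if f[x] == y) for y in range(2)]      # q = f . p
r = [sum(s[y][x] * q[y] for y in range(2)) for x in range(3)]  # r = s . f . p
re = D(p, r)
assert re > 0  # strictly positive since s is not the Bayesian posterior
```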
Definition 9. Let F be a map from morphisms in FinStat to [0, ∞].

• F is said to be functorial if and only if F((g, q, t) • (f, p, s)) = F(f, p, s) + F(g, q, t) for every composable pair of morphisms in FinStat.
• F is said to be convex linear if and only if for every convex combination ∑_{x∈X} p_x (µ_x, q_x, s_x) of morphisms in FinStat, F(∑_{x∈X} p_x (µ_x, q_x, s_x)) = ∑_{x∈X} p_x F(µ_x, q_x, s_x).
• F is said to be lower semicontinuous if and only if F is lower semicontinuous with respect to the natural topologies on morphisms in FinStat and on [0, ∞].

Let S denote the class of maps from morphisms in FinStat to [0, ∞] which are functorial, convex linear and lower semicontinuous.

Theorem 1 (Baez and Fritz). The relative entropy map RE is an element of S; moreover, if F ∈ S, then F = cRE for some non-negative constant c ∈ R.

The Category FinStat₂
In this section, we introduce the second-level structure of interest, namely, the double category FinStat₂, which is a 2-level extension of FinStat.

Definition 10. Let FinStat₂ denote the 2-category whose objects and 1-morphisms coincide with those of FinStat, and whose 2-morphisms are constructed as follows. Given 1-morphisms (µ, p, s) : (X, p) → (X′, p′) and (ν, q, t) : (Y, q) → (Y′, q′), a 2-morphism ♠ : (µ, p, s) ⇒ (ν, q, t) consists of channels f : X ⇝ Y and f′ : X′ ⇝ Y′ such that f • p = q, f′ • p′ = q′ and ν • f = f′ • µ. The 2-morphism ♠ : (µ, p, s) ⇒ (ν, q, t) may then be summarized by the following diagram.

Remark 4.
A crucial point is that in the above diagram, all arrows necessarily commute except for any compositions involving the outer 'wings', s and t. For example, the compositions s • µ • p and t • f′ • µ need not be equal to p and f, respectively.

Remark 5. Diagram (5) should be thought of as a flattened-out pyramid, whose base is the inner square and whose vertex is obtained by the identification of the upper and lower stars in the diagram.

Remark 6.
For an operational interpretation of a 2-morphism in FinStat₂ as given by diagram (5), one may consider X and Y as sample spaces associated with all possible outcomes of experiments E_X and E_Y. As the sets X and Y are endowed with prior distributions p and q, the maps µ : X → X′ and ν : Y → Y′ are then random variables with values in X′ and Y′, and the stochastic sections s and t then represent conditional hypotheses about the outcomes of the measurements corresponding to µ and ν. The channels f : X ⇝ Y and f′ : X′ ⇝ Y′ then represent stochastic processes such that taking the measurement µ followed by f′ results in the same process as first letting X evolve according to f and then taking the measurement ν.

Example 1.
For a real-life scenario which realizes a 2-morphism in FinStat₂, suppose two experimenters, Alice and Bob, are collaborating on a project to verify the predictions of a theory. As such, Alice and her data-analyst partner Alicia travel to a mountain in Brazil during a solar eclipse to perform experiments, while Bob and his data-analyst partner Bernie travel to a mountain in Montenegro at the same time for the same purpose. Alice and Bob will then perform experiments in their separate locations and hand their results over to Alicia and Bernie, who will then analyze the data to produce numerical results. At the end of each day, Alice will report her results to Bob over a noisy channel, while Alicia will report her results to Bernie over a noisy channel, so that Bob and Bernie may compare their results with Alice and Alicia's. We then summarize such a scenario with the following 2-morphism in FinStat₂.

In diagram (6), p and q are assumed prior distributions on Alice and Bob's measurements, while s and t are empirical conditional distributions on Alice and Bob's measurements given the data outcomes of Alicia and Bernie's analysis. Moreover, if the communication channel f is less reliable than f′, then the composition t • f′ • µ provides a Bayesian updating for the channel f.
The vertical composition ♣ • ♠ is then summarized by the following diagram.
The horizontal composition ♥ • ♠ is then summarized by the following diagram.

Convexity in FinStat₂
We now generalize the convex structure on morphisms in FinStat to 2-morphisms in FinStat₂. For this, let (X, p) be a finite probability space, and let ♠_x be a collection of 2-morphisms in FinStat₂ indexed by X, where ♠_x is summarized by the following diagram.
Definition 11. The convex sum ∑_{x∈X} p_x ♠_x is the 2-morphism in FinStat₂ summarized by the following diagram.

Conditional Relative Entropy in FinStat₂
We now introduce a measure of information associated with 2-morphisms in FinStat₂ which we refer to as 'conditional relative entropy'. The results proved in this section are essentially all lemmas for the results proved in the next section, where we introduce a 2-level extension of the relative entropy map RE and show that it satisfies the 2-level analogues of the characterizing axioms of relative entropy.

Definition 12. With every 2-morphism ♠ : (µ, p, s) ⇒ (ν, q, t) in FinStat₂ as summarized by diagram (5), we associate the non-negative extended real number CE(♠) ∈ [0, ∞] given by

CE(♠) = ∑_{x∈X} p_x D(f_x, (t • f′ • µ)_x), (7)

where D(−, −) is the standard relative entropy. We refer to CE(♠) as the conditional relative entropy of ♠.
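Reading CE(♠) as the chain-rule term ∑_{x∈X} p_x D(f_x, (t • f′ • µ)_x), with t • f′ • µ the Bayesian-updating channel of Example 1, the quantity is easy to compute. The sketch below (all numbers hypothetical) takes µ = 1_X and ν : Y → {∗} the constant map, so that the stochastic section t is simply a distribution on Y and (t • f′ • µ)_x = t for every x:

```python
from math import log

def D(a, b):
    """Relative entropy, with 0 log 0 = 0."""
    return sum(ai * log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

p = [0.6, 0.4]                     # prior on X
f = [[0.85, 0.15], [0.25, 0.75]]   # f : X ~> Y, rows f_x
t = [0.7, 0.3]                     # conditional hypothesis, a distribution on Y

# CE = sum_x p_x D(f_x, t), since the hypothesis channel sends every x to t.
ce = sum(p[x] * D(f[x], t) for x in range(2))
assert ce >= 0  # each summand is a relative entropy, hence non-negative
```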

Remark 7.
We refer to CE(♠) as conditional relative entropy, as its defining formula (7) is structurally similar to the defining formula for conditional entropy. In particular, if f : X ⇝ Y is a channel with prior distribution p : ∗ ⇝ X, then the conditional entropy H(f|p) is given by

H(f|p) = ∑_{x∈X} p_x H(f_x),

where H(f_x) is the Shannon entropy of the distribution f_x on Y.
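The structural similarity can be checked numerically: the conditional entropy ∑_x p_x H(f_x) agrees with H(ϑ(f|p)) − H(p), a standard identity for the joint distribution (hypothetical numbers below):

```python
from math import log

def H(dist):
    """Shannon entropy, with 0 log 0 = 0."""
    return -sum(w * log(w) for w in dist if w > 0)

p = [0.6, 0.4]
f = [[0.9, 0.1], [0.2, 0.8]]       # f[x] = f_x, a distribution on Y

h_cond = sum(p[x] * H(f[x]) for x in range(2))                # H(f|p)
theta = [p[x] * f[x][y] for x in range(2) for y in range(2)]  # theta(f|p)
assert abs(h_cond - (H(theta) - H(p))) < 1e-9                 # chain rule for entropy
```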

Proposition 1.
Conditional relative entropy in FinStat₂ is convex linear, i.e., if (X, p) is a finite probability space and ♠_x is a collection of 2-morphisms in FinStat₂ indexed by X, then

CE(∑_{x∈X} p_x ♠_x) = ∑_{x∈X} p_x CE(♠_x).

Proof. Suppose ∑_{x∈X} p_x ♠_x is summarized by the following diagram. The statement then follows from a direct computation using formula (7) and Definition 11, as desired.
Lemma 1. Let X ⇝ Y ⇝ Z be a composition of channels f : X ⇝ Y and g : Y ⇝ Z.

1. If f is a pure channel, then (g • f)_x = g_{f(x)} for all x ∈ X.
2. If g is a stochastic section of a pure channel h : Z → Y, then g_{zy} = 0 whenever h(z) ≠ y.

Proof. The statements 1 and 2 follow immediately from the definitions of pure channel and stochastic section.

Lemma 2.
Let ♠ be a 2-morphism in FinStat₂ as summarized by diagram (5). Then the following statements hold.

1. (t • f′ • µ)_{yx} = t_{yν(y)} f′_{ν(y)µ(x)} for all x ∈ X and y ∈ Y.
2. The condition ν • f = f′ • µ holds if and only if f′_{y′x′} = ∑_{y∈ν^{−1}(y′)} f_{yx} for all x′ ∈ X′ and y′ ∈ Y′, where x is any element of µ^{−1}(x′).

Proof. To prove item (1), let x ∈ X and y ∈ Y. Then

(t • f′ • µ)_{yx} = ∑_{y′∈Y′} t_{yy′} (f′ • µ)_{y′x} = ∑_{y′∈Y′} t_{yy′} f′_{y′µ(x)} = t_{yν(y)} f′_{ν(y)µ(x)},

where the second equality follows from Lemma 1 since µ is a pure channel, and the third equality also follows from Lemma 1 since t is a stochastic section of a pure channel, as desired.

To prove item (2), note that the condition ν • f = f′ • µ is equivalent to the equation (ν • f)_{y′x} = (f′ • µ)_{y′x} for all y′ ∈ Y′ and x ∈ X. In addition, since (ν • f)_{y′x} = ∑_{y∈ν^{−1}(y′)} f_{yx} and (f′ • µ)_{y′x} = f′_{y′µ(x)}, it follows that f′_{y′µ(x)} = ∑_{y∈ν^{−1}(y′)} f_{yx}; thus, for all x′ ∈ X′ and y′ ∈ Y′, it follows that f′_{y′x′} = ∑_{y∈ν^{−1}(y′)} f_{yx} for all x ∈ µ^{−1}(x′), as desired.

Theorem 2. Conditional relative entropy is functorial with respect to vertical composition of 2-morphisms in FinStat₂, i.e., CE(♣ • ♠) = CE(♠) + CE(♣).

Proof of Theorem 2. Suppose ♠ and ♣ are such that the vertical composition ♣ • ♠ is summarized by the following diagram.
By item (1) in Lemma 2, we then have the analogous componentwise formula for ♣ • ♠, and since t is a section of the pure channel ν, by item (2) of Lemma 1, it follows that for every y ∈ Y we have

(t • t′)_{y(ν′•ν)(y)} = t_{yν(y)} t′_{ν(y)(ν′•ν)(y)}.
2-Relative Entropy

Definition 13. With every 2-morphism ♠ : (µ, p, s) ⇒ (ν, q, t) in FinStat₂, we associate the non-negative extended real number RE₂(♠) ∈ [0, ∞] given by

RE₂(♠) = RE(µ, p, s) + CE(♠), (8)

which we refer to as the 2-relative entropy of ♠. We note that the quantity RE(µ, p, s) appearing on the RHS of (8) is the relative entropy associated with the morphism (µ, p, s) in FinStat, so that RE(µ, p, s) = D(p, s • µ • p), where D(−, −) is the standard relative entropy.
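By the chain rule (1), the quantity RE(µ, p, s) + CE(♠) of (8) can be packaged as a single relative entropy between joint distributions. The sketch below (hypothetical data, with both primed spaces collapsed to a point so that s and t are simply distributions on X and Y, r = s • µ • p = s, and the hypothesis channel g = t • f′ • µ sends every x to t) checks this numerically:

```python
from math import log

def D(a, b):
    """Relative entropy, with 0 log 0 = 0."""
    return sum(ai * log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

p = [0.6, 0.4]                     # prior on X
s = [0.5, 0.5]                     # hypothesis on X, so r = s . mu . p = s
f = [[0.85, 0.15], [0.25, 0.75]]   # f : X ~> Y
t = [0.7, 0.3]                     # hypothesis on Y; g_x = t for every x

re1 = D(p, s)                                   # RE(mu, p, s)
ce  = sum(p[x] * D(f[x], t) for x in range(2))  # CE(spade)
re2 = re1 + ce                                  # RE_2(spade), formula (8)

# Chain rule (1): RE_2 equals the relative entropy between the joints
# theta(f|p) and theta(g|s), where g has every row equal to t.
theta_fp = [p[x] * f[x][y] for x in range(2) for y in range(2)]
theta_gs = [s[x] * t[y] for x in range(2) for y in range(2)]
assert abs(re2 - D(theta_fp, theta_gs)) < 1e-12
```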
Proposition 2. The 2-relative entropy is convex linear, i.e., if (X, p) is a finite probability space and ♠_x is a collection of 2-morphisms in FinStat₂ indexed by X, then

RE₂(∑_{x∈X} p_x ♠_x) = ∑_{x∈X} p_x RE₂(♠_x).

Proof. Suppose ∑_{x∈X} p_x ♠_x is summarized by the following diagram. By Theorem 1, we know that the relative entropy RE is convex linear over 1-morphisms in FinStat₂, and by Proposition 1, we know that conditional relative entropy is convex linear over 2-morphisms in FinStat₂; thus

RE₂(∑_{x∈X} p_x ♠_x) = RE(∑_{x∈X} p_x (µ_x, q_x, s_x)) + CE(∑_{x∈X} p_x ♠_x) = ∑_{x∈X} p_x RE(µ_x, q_x, s_x) + ∑_{x∈X} p_x CE(♠_x) = ∑_{x∈X} p_x RE₂(♠_x),

as desired.
Theorem 3. The 2-relative entropy is functorial with respect to vertical composition, i.e., RE₂(♣ • ♠) = RE₂(♠) + RE₂(♣).

Proof. Suppose ♠ and ♣ are such that the vertical composition ♣ • ♠ is summarized by the following diagram. Then

RE₂(♣ • ♠) = RE((µ′, p′, s′) • (µ, p, s)) + CE(♣ • ♠) = RE(µ, p, s) + RE(µ′, p′, s′) + CE(♠) + CE(♣) = RE₂(♠) + RE₂(♣),

where the second equality follows from Theorems 1 and 2.
Proposition 3. If the conditional hypotheses s and t associated with a 2-morphism ♠ : (µ, p, s) ⇒ (ν, q, t) are optimal, then RE₂(♠) = 0.

Proof. Since s and t are optimal hypotheses, it follows that RE(µ, p, s) = CE(♠) = 0, from which the proposition follows.
Proposition 4. The 2-relative entropy RE₂ is lower semicontinuous.

Proof. Since the 2-relative entropy RE₂ is a linear combination of 1-level relative entropies, and 1-level relative entropies are lower semicontinuous by Theorem 1, it follows that RE₂ is lower semicontinuous.

Conclusions, Limitations and Future Research
In this work, we have constructed a 2-categorical extension RE₂ of the relative entropy functor RE of Baez and Fritz [1], yielding a new measure of information which we view as a relative measure of information between noisy channels. Moreover, we have shown that our construction satisfies natural 2-level analogues of functoriality, convex linearity, vanishing under optimal hypotheses, and lower semicontinuity. As the relative entropy functor of Baez and Fritz is uniquely characterized by such properties, it is only natural to ask whether our 2-level extension RE₂ of RE is also uniquely characterized by the 2-level analogues of such properties. It would also be interesting to investigate alternative versions of FinStat₂ in which the 2-morphisms are less restrictive, such as when the base of the pyramid associated with a 2-morphism is not assumed to be commutative. While taking the 2-relative entropy associated with such morphisms would not be functorial, such less restrictive morphisms would provide more flexibility for potential applications. Finally, as there are many other categories of interest with respect to information theory [8,14,15], it would be interesting to investigate 2-level extensions of such categories as well.