Proc ACM Int Conf Inf Knowl Manag. Author manuscript; available in PMC 2015 Apr 27.
Published in final edited form as:
Proc ACM Int Conf Inf Knowl Manag. 2014; 2014: 211–220.
doi: 10.1145/2661829.2661989
PMCID: PMC4410801
NIHMSID: NIHMS679948
PMID: 25927011

A Mixtures-of-Trees Framework for Multi-Label Classification

Abstract

We propose a new probabilistic approach for multi-label classification that aims to represent the class posterior distribution P(Y|X). Our approach uses a mixture of tree-structured Bayesian networks, which can leverage the computational advantages of conditional tree-structured models and the abilities of mixtures to compensate for tree-structured restrictions. We develop algorithms for learning the model from data and for performing multi-label predictions using the learned model. Experiments on multiple datasets demonstrate that our approach outperforms several state-of-the-art multi-label classification methods.

Keywords: Multi-label classification, Bayesian network, Mixture of trees

1. INTRODUCTION

In many real-world applications, a data instance is naturally associated with multiple class labels. For example, a document can cover multiple topics [21, 42], an image can be annotated with multiple tags [6, 29] and a single gene may be associated with several functional classes [9, 42]. Multi-label classification (MLC) formulates such situations by assuming each data instance is associated with a subset of d labels. Alternatively, this problem can be defined by associating each instance with d binary class variables Y1, …Yd, where Yi denotes whether or not the i-th label is present in the instance. The goal is to learn a function that assigns to each instance, represented by a feature vector x = (x1, …, xm), the most probable assignment of the class variables y = (y1, …, yd). However, learning of such a function can be very challenging because the number of possible label configurations is exponential in d.

A simple solution to the above problem is to assume that all class variables are conditionally independent of each other and learn d functions to predict each class separately [9, 6]. However, this may not suffice for many real-world problems where dependences among output variables exist. To overcome this limitation, multiple machine learning methods that model class relations have been proposed in recent years. These include two-layer classification models [14, 8], classifier chains [31, 41, 10], output coding methods [18, 34, 44, 45] and multi-dimensional Bayesian network classifiers [38, 5, 1].

In this work, we develop and study a new probabilistic approach for modeling and learning an MLC. Our approach aims to represent the class posterior distribution P(Y1, …, Yd|X) such that it captures multivariate dependences among features and labels. Our proposed model is defined by a mixture of Conditional Tree-structured Bayesian Networks (CTBNs) [2]. A CTBN defines P(Y1, …, Yd|X) using a directed tree structure to model the relations among the class variables conditioned on the feature variables. The main advantage of CTBN is that it allows efficient learning and inference. A mixture of CTBNs leverages the computational advantages of CTBNs and the ability of a mixture to compensate for the tree-structure restriction.

Our new mixture model extends the work of [26], which models and learns the joint distribution over many variables using tree-structured distributions and their mixtures, to the learning of conditional distributions in which the multivariate relations among the Y components are conditioned on the inputs X. To support learning and inference in the new model, we develop and test new algorithms for: (1) learning the parameters of mixtures of conditional trees, (2) selecting individual tree structures and (3) inferring the maximum a posteriori (MAP) output label configurations.

An important advantage of our method compared to existing MLC methods is that it gives a well-defined model of posterior class probabilities. That is, our model lets us calculate P(Y = y|X = x) for any (x, y) input-output pair. This is extremely useful not only for prediction, but also for decision making [30, 3], conditional outlier analysis [15, 16, 17], or for performing any inference over subsets of output class variables. In contrast to our approach, the majority of existing MLC methods aim to only identify the best output configuration for the given x.

2. PROBLEM DEFINITION

In Multi-Label Classification (MLC), each instance is associated with d binary class variables Y1, …, Yd. We are given labeled training data $D = \{\mathbf{x}^{(n)}, \mathbf{y}^{(n)}\}_{n=1}^{N}$, where $\mathbf{x}^{(n)} = (x_1^{(n)}, \dots, x_m^{(n)})$ is an m-dimensional feature vector representing the n-th instance (the input) and $\mathbf{y}^{(n)} = (y_1^{(n)}, \dots, y_d^{(n)})$ is its corresponding d-dimensional class vector (the output). We want to learn a function h (from D) that assigns to each instance, represented by its feature vector, a class vector:

$$h : \mathbb{R}^m \rightarrow \{0,1\}^d$$

One way to approach this task is to model and learn the conditional joint distribution P(Y|X), where Y = (Y1, …, Yd) is a random variable for the class vector and X is a random variable for the feature vector. Assuming the 0–1 loss function, the optimal classifier h* assigns to each instance x the maximum a posteriori (MAP) assignment of class variables:

$$h^*(\mathbf{x}) = \arg\max_{\mathbf{y}} P(Y = \mathbf{y} \mid X = \mathbf{x}) = \arg\max_{y_1, \dots, y_d} P(Y_1 = y_1, \dots, Y_d = y_d \mid X = \mathbf{x})$$
(1)

A key challenge for modeling and learning P(Y|X) from data, as well as for defining the corresponding MAP classifier, is that the number of all possible class assignments one has to consider is $2^d$. The goal of this paper is to develop a new, efficient model and methods for its learning and inference that overcome this difficulty.
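To make the cost of the exhaustive approach concrete, the following sketch (illustrative only; `log_posterior` is a hypothetical scoring function standing in for any model of log P(y|x), not part of our method) performs MAP prediction by brute-force enumeration of all 2^d label vectors:

```python
# Illustrative only: brute-force MAP prediction over all 2^d label vectors.
from itertools import product
import numpy as np

def brute_force_map(x, d, log_posterior):
    """Return argmax_y log P(y | x) by enumerating all 2^d assignments."""
    best_y, best_score = None, -np.inf
    for y in product([0, 1], repeat=d):          # 2^d candidates
        score = log_posterior(x, np.array(y))
        if score > best_score:
            best_y, best_score = np.array(y), score
    return best_y

# Even for the moderate d = 14 of the yeast dataset this enumerates
# 2**14 = 16,384 configurations per test instance; for d = 53 (enron)
# it is infeasible, which motivates models with tractable inference.
```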

Notation

For notational convenience, we will omit the index superscript (n) when it is not necessary. We may also abbreviate the expressions by omitting variable names; e.g., P(Y1 = y1, …, Yd = yd|X= x) = P (y1, …, yd|x).

3. RELATED RESEARCH

In this section, we briefly review the research work related to our approach and pinpoint the main differences.

MLC methods based on learning independent classifiers were studied in [9, 6]. Zhang and Zhou [43] presented a multi-label k-nearest neighbor method, which learns a classifier for each class by combining k-nearest neighbor with Bayesian inference. To model possible class dependences, [14, 8] proposed adding a second layer of classifiers that combine input features with the outputs of independent classifiers. The limitation of these early approaches is that class dependences are either not modeled at all, or modeled in a very limited way.

The classifier chains (CC) method [31] models the class posterior distribution P(Y|X) by decomposing the relations among class variables using the chain rule:

$$P(Y_1, \dots, Y_d \mid X) = \prod_{i=1}^{d} P(Y_i \mid X, Y_1, \dots, Y_{i-1})$$
(2)

Each component in the chain is a classifier that is learned separately by incorporating the predictions of preceding classifiers as additional features. Zhang and Zhang [41] realized that the performance of CC is influenced by the order of classes in the chain and presented a method to learn such ordering from data. Dembczynski et al. [10] discussed the suboptimality of CC and presented probabilistic classifier chains to estimate the entire posterior distribution of classes. However, this method has to evaluate exponentially many label configurations, which greatly limits its applicability.

Another approach for modeling P(Y|X) relies on conditional random fields (CRFs) [24]. Ghamrawi and McCallum [13] presented a method called collective multi-label with features classifier (CMLF) that captures label co-occurrences conditioned on features. However, CMLF assumes a fully connected CRF structure which results in a high computational cost. Later, Shahaf et al. [32] and Bradley et al. [7] proposed to learn tractable (low-treewidth) structures of class variables for CRFs using conditional mutual information. More recently, Pakdaman et al. [28] used pairwise CRFs to model the class dependences and presented L2-optimization-based structure and parameter learning algorithms. Although the latter methods share similarities with our approach by modeling the conditional dependences in Y space using restricted structures, their optimization of the likelihood of data is computationally more costly. To alleviate this, CRF-based methods often resort to optimization of a surrogate objective function (e.g., the pseudo-likelihood of data [28]) or include specific assumptions (e.g., features are assumed to be discrete [13]; relevant features for each class are assumed to be known [32, 7]), which complicate the application of the methods.

Multi-dimensional Bayesian network classifiers (MBC) [38, 5, 1] build a generative model of P(X, Y) using special Bayesian network structures that assume all class variables are top nodes and all feature variables are their descendants. Although our approach can be compared to MBC, there are significant differences and advantages: (1) MBC only handles discrete features and, thus, all features should be a priori discretized; while we handle both continuous and discrete features. (2) MBC defines a joint distribution over both feature and class variables and the search space of the model increases with the input dimensionality m; while our search space does not depend on m. (3) Feature selection in MBC is done explicitly by learning the individual relationships between features and class variables; while we perform feature selection by regularizing the base classifiers. (4) MBC requires expensive marginalization to obtain class conditional distribution P(Y|X); while we directly estimate P(Y|X).

An alternative approach for MLC is based on output coding. The idea is to compress the output into a codeword, learn how to predict the codeword and then recover the correct output from the noisy predictions. A variety of approaches have been devised by using different compression techniques, such as compressed sensing [18], principal component analysis [34] and canonical correlation analysis [44]. The state-of-the-art in output coding utilizes a maximum margin formulation [45] that promotes both discriminative and predictable codes. The limitation of output coding methods is that they can only predict the single “best” output for a given input, and they cannot compute probabilities for different input-output pairs.

Several researchers proposed using ensemble methods for MLC. Read et al. [31] presented a simple method that averages the predictions of multiple random classifier chains trained on a random subset of the data. Antonucci et al. [1] proposed an ensemble of multi-dimensional Bayesian networks combined via simple averaging. These networks represent different Y relations (the structures are set a priori and not learned) and all of the networks adopt the naïve Bayes assumption (the features are independent given the classes). Unlike these methods, our approach learns the structures in the mixture, its parameters and mixing coefficients from data in a principled way.

4. PRELIMINARY

The MLC solution we propose in this work combines multiple base MLC classifiers using the mixtures-of-trees (MT) [26, 39] framework, which uses a mixture of multiple trees to define a generative model of P(Y) for discrete multidimensional domains. The base classifiers we use are based on the conditional tree-structured Bayesian networks (CTBN) [2]. To begin with, we briefly review the basics of MT and CTBN.

MT consists of a set of trees that are combined using mixture coefficients λk to represent the joint distribution P(y). The model is defined by the following decomposition:

$$P(\mathbf{y}) = \sum_{k=1}^{K} \lambda_k P(\mathbf{y} \mid T_k),$$
(3)

where P(y|Tk) are called mixture components that represent the distribution of outputs defined by the k-th tree Tk. Note that a mixture can be understood as a soft-multiplexer, where we have a hidden selector variable which takes a value k ∈ {1, …, K} with probability λk. That is, by having a convex combination of mutually complementary tree-structured models, MT aims at achieving a more expressive and accurate model.

While MT is not as computationally efficient as individual trees, it has been shown to be a useful approximation obtained at a fraction of the computational cost of learning general graphical models [22]. MT has been successfully adopted in a range of applications, including modeling of handwriting patterns, medical diagnostic networks, automated application screening, gene classification and identification [26], face detection [20], video tracking [19], road traffic modeling [39] and climate modeling [22].

In this work, we apply the MT framework in the context of MLC. In particular, we combine MT with CTBN to model the individual trees. CTBN is a recently proposed probabilistic MLC method that has been shown to be competitive and efficient on a range of domains. CTBN defines P(Y|X) using a collection of classifiers, each modeling the relation between the features and an individual label, that are tied together by a special Bayesian network structure approximating the dependence relations among the class variables. In modeling these dependences, each class variable is allowed to have at most one other class variable as a parent (without creating a cycle), in addition to the feature vector X.

A CTBN T defines the joint distribution of class vector (y1, …, yd) conditioned on feature vector x as:

$$P(y_1, \dots, y_d \mid \mathbf{x}, T) = \prod_{i=1}^{d} P\big(y_i \mid \mathbf{x}, y_{\pi(i,T)}\big),$$
(4)

where π(i, T) denotes the parent class of class Yi in T (by convention, π(i, T) = {} if Yi does not have a parent class). For example, the conditional joint distribution of class assignment (y1, y2, y3, y4) given x according to the network T in Figure 1 is defined as:

$$P(y_1, y_2, y_3, y_4 \mid \mathbf{x}, T) = P(y_3 \mid \mathbf{x}) \cdot P(y_2 \mid \mathbf{x}, y_3) \cdot P(y_1 \mid \mathbf{x}, y_2) \cdot P(y_4 \mid \mathbf{x}, y_2)$$
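As an illustration of how such a factorization can be evaluated in practice, the sketch below (our illustration, not the original CTBN code) computes log P(y | x, T) from per-node probabilistic classifiers; the `parents` map and `node_clfs` dictionary are assumed inputs, and any fitted classifier exposing `predict_proba` could be plugged in.

```python
# A minimal sketch of evaluating a CTBN factorization P(y | x, T):
# one binary probabilistic classifier per class node, whose inputs are the
# features x plus the value of the node's parent class (if any).
import numpy as np

def ctbn_log_prob(y, x, parents, node_clfs):
    """log P(y | x, T) = sum_i log P(y_i | x, y_parent(i)).

    parents   : dict i -> parent index, or None for a root node
    node_clfs : dict i -> fitted classifier with predict_proba; for a node
                with a parent, the classifier was trained on [x, y_parent].
    """
    logp = 0.0
    for i, clf in node_clfs.items():
        pa = parents[i]
        feats = x if pa is None else np.append(x, y[pa])
        p1 = clf.predict_proba(feats.reshape(1, -1))[0, 1]  # P(y_i = 1 | ...)
        logp += np.log(p1 if y[i] == 1 else 1.0 - p1)
    return logp

# For the tree in Figure 1 (y3 -> y2 -> {y1, y4}), the parents map would be
# {2: None, 1: 2, 0: 1, 3: 1} using 0-based indices.
```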

Although our proposed method is motivated by MT, there are significant extensions and differences. We summarize the key distinctions below.

  1. Model: Our model represents P(Y|X), the class posterior distribution for MLC, using CTBNs, each of which consists of a collection of logistic regression models linked together by a directed tree; in contrast, the MT model [26] represents the joint distribution P(Y) using standard tree-structured Bayesian networks.
  2. Structure learning: Our structure learning algorithm optimizes P(Y|X) using a weighted conditional log-likelihood criterion, whereas MT relies on the standard Chow-Liu algorithm [23] that optimizes P(Y) using mutual information.
  3. Parameter learning: Not surprisingly, both our parameter learning method and that of MT rely on the EM algorithm. However, the criteria and how to optimize them are very different. For example, the M-step of our algorithm corresponds to learning of instance-weighted logistic regression classifiers; while that of MT is based on simple (weighted) counting.

5. OUR METHOD

In this section, we describe the Mixture of Conditional Tree-structured Bayesian Networks (MC), which uses the MT framework in combination with CTBN classifiers to improve the classification accuracy on MLC tasks, and develop algorithms for its learning and prediction. In Section 5.1, we describe the mixture defined by the MC model. In Sections 5.2 through 5.4, we present the learning and prediction algorithms for the MC model.

5.1 Representation

By following the definition of MT in Equation (3), MC defines the multivariate posterior distribution of class vector y = (y1, …, yd) as:

$$P(\mathbf{y} \mid \mathbf{x}) = \sum_{k=1}^{K} \lambda_k P(\mathbf{y} \mid \mathbf{x}, T_k),$$
(5)

where $\lambda_k \ge 0$ for all $k$, and $\sum_{k=1}^{K} \lambda_k = 1$. Here each mixture component P(y|x, Tk) is the distribution defined by CTBN Tk (as in Equation (4)) and the mixture coefficients are denoted by λk. Figure 2 depicts an example MC model, which consists of K CTBNs and the mixture coefficients λk.
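For concreteness, a minimal sketch of evaluating Equation (5) in log-space is given below; it assumes the per-tree conditionals log P(y | x, T_k) are available as callables (a hypothetical interface, not the authors' code).

```python
# A minimal sketch of Equation (5): the MC posterior is a convex combination
# of the per-tree CTBN posteriors, computed stably in log-space.
import numpy as np
from scipy.special import logsumexp

def mc_log_prob(y, x, lambdas, tree_log_probs):
    """log P(y | x) = log sum_k lambda_k P(y | x, T_k)."""
    log_terms = [np.log(lam) + logp(y, x)
                 for lam, logp in zip(lambdas, tree_log_probs)]
    return logsumexp(log_terms)
```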

5.2 Parameter Learning

In this section, we describe how to learn the parameters of MC by assuming the structures of individual CTBNs are known and fixed. The parameters of the MC model are the mixture coefficients {λ1, …, λK} as well as the parameters of each CTBN in the mixture {θ1, …, θK}.

Given training data $D = \{\mathbf{x}^{(n)}, \mathbf{y}^{(n)}\}_{n=1}^{N}$, the objective is to optimize the log-likelihood of D, which we refer to as the observed log-likelihood:

$$\sum_{n=1}^{N} \log P\big(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}\big) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \lambda_k P\big(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}, T_k\big)$$

However, this objective is difficult to optimize directly because it contains the log of a sum. Hence, we cast this optimization in the expectation-maximization (EM) framework. Let us associate each instance (x(n), y(n)) with a hidden variable z(n) ∈ {1, …, K} indicating which CTBN it belongs to. The complete log-likelihood (assuming z(n) are observed) is:

$$\sum_{n=1}^{N} \log P\big(\mathbf{y}^{(n)}, z^{(n)} \mid \mathbf{x}^{(n)}\big)$$
(6)

$$= \sum_{n=1}^{N} \log \prod_{k=1}^{K} P\big(\mathbf{y}^{(n)}, T_k \mid \mathbf{x}^{(n)}\big)^{\mathbb{1}[z^{(n)}=k]} = \sum_{n=1}^{N} \log \prod_{k=1}^{K} \Big[\lambda_k P\big(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}, T_k\big)\Big]^{\mathbb{1}[z^{(n)}=k]} = \sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{1}[z^{(n)}=k] \Big[\log \lambda_k + \log P\big(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}, T_k\big)\Big],$$
(7)

where $\mathbb{1}[z^{(n)} = k]$ is the indicator function, which is one if the n-th instance belongs to the k-th CTBN and zero otherwise; and λk is the mixture coefficient of CTBN Tk, which can be interpreted as its prior probability in the data.

The EM algorithm iteratively optimizes the expected complete log-likelihood, which is always a lower bound to the observed log-likelihood [27]. In the E-step, the expectation is computed with the current set of parameters; in the M-step, the parameters of the mixture (λk, θk : k = {1, …, K}) are relearned to maximize the expected complete log-likelihood. In the following, we describe our parameter learning algorithm by deriving the E-step and the M-step for MC.

5.2.1 E-step

In the E-step, we compute the expectation of the hidden variables. Let γk(n) denote P(z(n) = k|y(n), x(n)), the posterior of the hidden variable z(n) given the observations and the current parameters. Using Bayes rule, we write:

$$\gamma_k^{(n)} = \frac{\lambda_k P\big(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}, T_k\big)}{\sum_{k'=1}^{K} \lambda_{k'} P\big(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}, T_{k'}\big)}$$
(8)

5.2.2 M-step

In the M-step, we learn the model parameters {λ1, …, λK, θ1, …, θK} that maximize the expected complete log-likelihood, which is a lower bound of the observed log-likelihood. Let us first define the following two quantities:

$$\Gamma_k = \sum_{n=1}^{N} \gamma_k^{(n)}, \qquad w_k^{(n)} = \frac{\gamma_k^{(n)}}{\Gamma_k}$$

Γk can be interpreted as the number of observations that belong to the k-th CTBN (hence, $\sum_{k=1}^{K} \Gamma_k = N$), and wk(n) is the renormalized posterior γk(n), which can be interpreted as the weight of the n-th instance on the k-th CTBN.

Note that when taking the expectation of the complete log-likelihood (Equation (6)), only the indicator $\mathbb{1}[z^{(n)} = k]$ is affected by the expectation. Using the notation introduced above, we rewrite the expected complete log-likelihood:

$$\sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_k^{(n)} \Big[\log \lambda_k + \log P\big(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}, T_k\big)\Big] = \sum_{k=1}^{K} \Gamma_k \log \lambda_k + \sum_{k=1}^{K} \Gamma_k \sum_{n=1}^{N} w_k^{(n)} \log P\big(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}, T_k\big)$$
(9)

We wish to maximize (9) with respect to {λ1, …, λK, θ1, …, θK} subject to the constraint $\sum_{k=1}^{K} \lambda_k = 1$. Notice that (9) consists of two terms, each involving a disjoint subset of the parameters, which allows us to maximize (9) term by term. By maximizing the first term with respect to λj (the mixture coefficient of Tj), we obtain:

$$\lambda_j = \frac{\Gamma_j}{\sum_{k=1}^{K} \Gamma_k} = \frac{\Gamma_j}{N}$$

To maximize the second term, we train θj (the parameters of Tj) to maximize:

$$\theta_j = \arg\max_{\theta_j} \sum_{n=1}^{N} w_j^{(n)} \log P\big(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}, T_j\big)$$
(10)

It turns out (10) is the instance-weighted log-likelihood, and we use instance-weighted logistic regression to optimize it. Algorithm 1 outlines our parameter learning algorithm.

5.2.3 Complexity

E-step

We compute γk(n) for each instance on every CTBN. Computing γk(n) requires estimating P(y(n)|x(n), Tk), which applies the logistic regression classifier at each node of Tk and takes O(md) multiplications. Hence, the complexity of the E-step is O(KNmd).

M-step

The major computational cost of the M-step is to learn the instance-weighted logistic regression models for the nodes of every CTBN. Hence, the complexity is O(Kd) times the complexity of learning logistic regression.

5.3 Structure Learning

In this section, we describe how to automatically learn multiple CTBN structures from data. We apply a sequential boosting-like heuristic, where in each iteration we learn the structure that focuses on the instances that are not well predicted by the previous structures (i.e., the MC model learned so far). In the following, we first describe how to learn a single CTBN structure from instance-weighted data. After that, we describe how to re-weight the instances and present our algorithm for learning the overall MC model.

Algorithm 1

learn-MC-parameters

Input: Training data D; base CTBNs T1, …, TK
Output: Model parameters {θ1, …, θK, λ1, …, λK}
1:  repeat
2:    E-step:
3:    for k = 1 to K, n = 1 to N do
4:      Compute γk(n) using Equation (8)
5:    end for
6:    M-step:
7:    for k = 1 to K do
8:      Γk = Σ_{n=1}^{N} γk(n)
9:      wk(n) = γk(n) / Γk
10:     λk = Γk / N
11:     θk = argmax_{θk} Σ_{n=1}^{N} wk(n) log P(y(n) | x(n), Tk)
12:   end for
13: until convergence
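The following Python sketch mirrors Algorithm 1 under two assumptions that are ours, not the paper's: each CTBN object exposes a `log_prob(y, x)` method returning log P(y | x, T_k) and a `fit(X, Y, sample_weight=...)` method that refits its per-node, instance-weighted logistic regressions.

```python
# A simplified sketch of Algorithm 1 (EM for the MC parameters).
import numpy as np
from scipy.special import logsumexp

def learn_mc_parameters(X, Y, trees, n_iter=50, tol=1e-4):
    N, K = len(X), len(trees)
    lambdas = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # log lambda_k + log P(y^(n) | x^(n), T_k), shape (N, K)
        log_joint = np.array([[np.log(lambdas[k]) + trees[k].log_prob(Y[n], X[n])
                               for k in range(K)] for n in range(N)])
        ll = logsumexp(log_joint, axis=1).sum()     # observed log-likelihood
        if ll - prev_ll < tol:                      # convergence check
            break
        prev_ll = ll
        # E-step: responsibilities gamma[n, k] = P(z^(n) = k | y^(n), x^(n))
        gamma = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))
        # M-step: closed-form mixture weights, weighted refit of each CTBN
        Gamma = gamma.sum(axis=0)                   # expected counts, size K
        lambdas = Gamma / N
        for k, tree in enumerate(trees):
            tree.fit(X, Y, sample_weight=gamma[:, k] / Gamma[k])
    return lambdas, trees
```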

5.3.1 Learning a Single CTBN Structure on Weighted Data

The goal here is to discover the CTBN structure that maximizes the weighted conditional log-likelihood (WCLL) on {D, Ω}, where $D = \{\mathbf{x}^{(n)}, \mathbf{y}^{(n)}\}_{n=1}^{N}$ is the data and $\Omega = \{\omega^{(n)}\}_{n=1}^{N}$ are the corresponding instance weights. We do this by partitioning D into two parts: training data Dtr and hold-out data Dh. Given a CTBN structure T, we train its parameters on Dtr using the corresponding instance weights, and we score T by the WCLL on Dh:

$$\text{score}(T) = \sum_{(\mathbf{x}^{(n)}, \mathbf{y}^{(n)}) \in D_h} \omega^{(n)} \log P\big(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}, T\big) = \sum_{(\mathbf{x}^{(n)}, \mathbf{y}^{(n)}) \in D_h} \sum_{i=1}^{d} \omega^{(n)} \log P\big(y_i^{(n)} \mid \mathbf{x}^{(n)}, y_{\pi(i,T)}^{(n)}\big)$$
(11)

In the following, we describe our algorithm for obtaining the CTBN structure that optimizes Equation (11) without having to evaluate all of the exponentially many possible tree structures.

Let us first define a weighted directed graph G = (V, E), which has one vertex Vi for each class label Yi and a directed edge Eji from each vertex Vj to each vertex Vi (i.e., G is complete). In addition, each vertex Vi has a self-loop Eii. The weight of edge Eji, denoted as Wji, is the WCLL of class Yi conditioned on X and Yj:

$$W_{ji} = \sum_{(\mathbf{x}^{(n)}, \mathbf{y}^{(n)}) \in D_h} \omega^{(n)} \log P\big(y_i^{(n)} \mid \mathbf{x}^{(n)}, y_j^{(n)}\big)$$

The weight of self-loop Eii, denoted as Wϕi, is the WCLL of class Yi conditioned only on X. Using the definition of edge weights, Equation (11) can be simplified as the sum of the edge weights:

$$\text{score}(T) = \sum_{i=1}^{d} W_{\pi(i,T)\,i}$$

Now we have transformed the problem of finding the optimal tree structure into the problem of finding the tree in G that has the maximum sum of edge weights. The solution can be obtained by solving the maximum branching (arborescence) problem [11], which finds the maximum weight tree in a weighted directed graph.
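A hedged sketch of this reduction is given below. It assumes the edge weights Wji and self-loop weights Wϕi have already been computed from the hold-out WCLL, and it reuses NetworkX's maximum-branching routine rather than implementing Edmonds' algorithm [11] directly; subtracting Wϕi from each incoming edge turns Equation (11) into a maximum-branching objective in which a class keeps no parent whenever no candidate parent improves its hold-out score.

```python
# A sketch of the structure search in Section 5.3.1: build the complete
# directed graph over class variables, weight edge j -> i by the gain of
# conditioning Y_i on Y_j rather than on X alone, and take a maximum branching.
import networkx as nx

def learn_ctbn_structure(W, W_self):
    """W[j][i]: hold-out WCLL of P(Y_i | X, Y_j); W_self[i]: WCLL of P(Y_i | X).
    Returns dict i -> parent j (or None), maximizing Equation (11)."""
    d = len(W_self)
    G = nx.DiGraph()
    G.add_nodes_from(range(d))
    for i in range(d):
        for j in range(d):
            if i != j:
                # gain of giving Y_i the parent Y_j instead of no parent
                G.add_edge(j, i, weight=W[j][i] - W_self[i])
    # a maximum branching keeps only edges that increase total weight, so a
    # node whose best parent does not improve on W_self[i] stays a root
    B = nx.maximum_branching(G)
    return {i: (next(iter(B.predecessors(i)), None) if i in B else None)
            for i in range(d)}
```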

5.3.2 Learning Multiple CTBN Structures

In order to obtain multiple CTBN structures for the MC model, we apply the algorithm described above multiple times with different sets of instance weights. We assign the weights such that we give higher weights for poorly predicted instances and lower weights for well-predicted instances.

We start with assigning all instances uniform weights (i.e., all instances are equally important a priori).

$$\omega^{(n)} = 1/N, \quad n = 1, \dots, N$$

Using this initial set of weights, we find the initial CTBN structure T1 (and its parameters θ1) and set the current model M to be T1. We then estimate the prediction error margin $\omega^{(n)} = 1 - P(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}, M)$ for each instance and renormalize such that $\sum_{n=1}^{N} \omega^{(n)} = 1$. We use {ω(n)} to find the next CTBN structure T2. After that, we set the current model to be the MC model learned by mixing T1 and T2 according to Algorithm 1.

We repeat the process by incrementally adding trees to the mixture. To stop the process, we use an internal validation approach. Specifically, the data used for learning are split into internal train and test sets. The structures of the trees and their parameters are always learned on the internal train set. The quality of the current mixture is evaluated on the internal test set. The mixture growth stops when the log-likelihood on the internal test set for the new mixture is worse than for the previous mixture. The trees included in the previous mixture are then fixed, and the parameters of the mixture are relearned on the full training data.
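The overall loop can be summarized by the following sketch; `learn_structure`, `fit_mixture` (Algorithm 1) and `mixture_log_prob` are hypothetical stand-ins for the components described above, and the re-weighting and stopping details are simplified.

```python
# A high-level sketch of the sequential structure-growing loop in Section 5.3.2.
import numpy as np

def grow_mc(X_tr, Y_tr, X_val, Y_val,
            learn_structure, fit_mixture, mixture_log_prob, max_trees=20):
    """Sequentially add CTBNs; stop when the internal-validation LL degrades."""
    N = len(X_tr)
    weights = np.full(N, 1.0 / N)                 # uniform weights a priori
    trees, lambdas, best_val_ll = [], [], -np.inf
    for _ in range(max_trees):
        candidate = trees + [learn_structure(X_tr, Y_tr, weights)]
        cand_lambdas, candidate = fit_mixture(X_tr, Y_tr, candidate)  # Algorithm 1
        val_ll = sum(mixture_log_prob(y, x, cand_lambdas, candidate)
                     for x, y in zip(X_val, Y_val))
        if val_ll <= best_val_ll:                 # the new tree did not help: stop
            break
        trees, lambdas, best_val_ll = candidate, cand_lambdas, val_ll
        # re-weight: emphasize instances the current mixture predicts poorly
        weights = np.array([1.0 - np.exp(mixture_log_prob(y, x, lambdas, trees))
                            for x, y in zip(X_tr, Y_tr)])
        weights /= weights.sum()
    # in the paper, the selected trees are then kept fixed and the mixture
    # parameters are relearned on the full training data
    return trees, lambdas
```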

5.3.3 Complexity

In order to learn a single CTBN structure, we compute edge weights for the complete graph G, which requires estimating P(Yi|X, Yj) for all $d^2$ pairs of classes. Finding the maximum branching in G takes $O(d^2)$ time using [35]. To learn K CTBN structures for the mixture, we repeat these steps K times. Therefore, the overall complexity is $O(Kd^2)$ times the complexity of learning logistic regression.

5.4 Prediction

In order to make a prediction for a new instance x, we want to find the MAP assignment of the class variables (see Equation (1)). In general, this requires evaluating all possible assignments of values to the d class variables, which is exponential in d.

One important advantage of the CTBN model is that the MAP inference can be done more efficiently by avoiding blind enumeration of all possible assignments. More specifically, the MAP inference on a CTBN is linear in the number of classes (O(d)) when implemented using a variant of the max-sum algorithm [23] on a tree structure.

However, our MC model consists of multiple CTBNs, and exact MAP inference for the mixture may ultimately require enumerating exponentially many class assignments. To address this problem, we rely on approximate MAP inference. Two commonly applied MAP approximation approaches are convex programming relaxation via dual decomposition [33], and simulated annealing using a Markov chain [40]. In this work, we use the latter approach. Briefly, we search the space of all assignments by defining a Markov chain that is induced by local changes to individual class labels. The annealed version of the exploration procedure [40] is then used to speed up the search. We initialize our MAP algorithm using the following heuristic: first, we identify the MAP assignment for each CTBN in the mixture individually, and then we pick the best assignment from among these candidates. We have found this (efficient) heuristic to work very well; it often results in the true MAP assignment.
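A simplified sketch of this procedure is shown below. The per-tree MAP routines (`tree_maps`) and the mixture scorer (`mc_log_prob`) are assumed callables, and the cooling schedule is an illustrative choice rather than the one used in [40].

```python
# A sketch of approximate MAP inference for the mixture: initialize from the
# best per-tree MAP, then anneal over single-label flips scored by the mixture.
import numpy as np

def mc_map(x, d, tree_maps, mc_log_prob, n_steps=150, t0=1.0, rng=None):
    rng = rng or np.random.default_rng(0)
    # heuristic initialization: best per-tree MAP under the full mixture
    candidates = [tm(x) for tm in tree_maps]               # one MAP per CTBN
    y = max(candidates, key=lambda c: mc_log_prob(c, x)).copy()
    score = mc_log_prob(y, x)
    for step in range(n_steps):
        temp = t0 / (1 + step)                             # cooling schedule
        i = rng.integers(d)
        y_new = y.copy()
        y_new[i] = 1 - y_new[i]                            # flip one label
        new_score = mc_log_prob(y_new, x)
        if new_score >= score or rng.random() < np.exp((new_score - score) / temp):
            y, score = y_new, new_score
    return y
```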

6. EXPERIMENTS

We perform experiments on ten publicly available multi-label datasets. These datasets are obtained from different domains such as music recognition (emotions [36]), semantic image labeling (scene [6] and image [10]), biology (yeast [12]) and text classification (enron [4] and RCV1 [25] datasets). Table 1 summarizes the characteristics of the datasets. We show the number of instances (N), the number of feature variables (m) and the number of class variables (d). In addition, we show two statistics: label cardinality (LC), which is the average number of labels per instance, and distinct label set (DLS), which is the number of all distinct configurations of classes that appear in the data. Note that, for the RCV1 datasets, we use the ten most common labels.

Table 1

Datasets characteristics

Dataset      | N     | m     | d  | LC   | DLS | Domain

Emotions     | 593   | 72    | 6  | 1.87 | 27  | music
Yeast        | 2,417 | 103   | 14 | 4.24 | 198 | biology
Scene        | 2,407 | 294   | 6  | 1.07 | 15  | image
Image        | 2,000 | 135   | 5  | 1.24 | 20  | image
Enron        | 1,702 | 1,001 | 53 | 3.38 | 753 | text
RCV1_subset1 | 6,000 | 8,394 | 10 | 1.31 | 69  | text
RCV1_subset2 | 6,000 | 8,304 | 10 | 1.21 | 70  | text
RCV1_subset3 | 6,000 | 8,328 | 10 | 1.22 | 74  | text
RCV1_subset4 | 6,000 | 8,332 | 10 | 1.22 | 79  | text
RCV1_subset5 | 6,000 | 8,367 | 10 | 1.31 | 76  | text

N: number of instances, m: number of features, d: number of labels, LC: label cardinality, DLS: distinct label set

6.1 Methods

We compare the performance of our proposed mixture-of-CTBNs (MC) model with simple binary relevance (BR) independent classification [9, 6] as well as several state-of-the-art MLC methods. These methods include classification with heterogeneous features (CHF) [14], multi-label k-nearest neighbor (MLKNN) [43], instance-based learning by logistic regression (IBLR) [8], classifier chains (CC) [31], ensemble of classifier chains (ECC) [31], probabilistic classifier chains (PCC) [10], ensemble of probabilistic classifier chains (EPCC) [10], multi-label conditional random fields (ML-CRF) [28], and maximum margin output coding (MMOC) [45]. We also compare MC with a single CTBN (SC) [2] model without creating a mixture.

For all methods, we use the same parameter settings as suggested in their papers: for MLKNN and IBLR, which use the k-nearest neighbor (KNN) method, we use Euclidean distance to measure similarity of instances and set the number of nearest neighbors to 10 [43, 8]; for CC, we set the order of classes to Y1 < Y2 < … < Yd [31]; for ECC and EPCC, we use 10 CCs in the ensemble [31, 10]; finally, for MMOC, we set the decoding parameter to 1 [45]. Also note that all of these methods except MLKNN and MMOC are considered meta-learners because they can work with several base classifiers. To eliminate additional effects that may bias the results, we use L2-penalized logistic regression for all of these methods and choose their regularization parameters by cross validation. For our MC model, we decide the number of mixture components using our stopping criterion (Section 5.3.2) and we use 150 iterations of simulated annealing for prediction.

6.2 Evaluation Measures

Evaluating the performance of MLC methods is more difficult than evaluating simple classification methods. The most suitable performance measure is the exact match accuracy (EMA), which computes the percentage of instances whose predicted label vectors are exactly the same as their true label vectors.

$$\text{EMA} = \frac{1}{N} \sum_{n=1}^{N} \delta\big(\mathbf{y}^{(n)}, h(\mathbf{x}^{(n)})\big)$$

However, this measure could be too harsh, especially when the output dimensionality is high. Another very useful measure is the conditional log-likelihood loss (CLL-loss), which computes the negative conditional log-likelihood of the test instances:

$$\text{CLL-loss} = \sum_{n=1}^{N} -\log P\big(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}\big)$$

CLL-loss evaluates how much probability mass is given to the true label vectors (the higher the probability, the smaller the loss).

Other evaluation measures commonly used in the MLC literature are based on F1 scores. Micro F1 aggregates the numbers of true positives, false positives and false negatives over all classes and then calculates the overall F1 score. Macro F1, on the other hand, computes the F1 score for each class separately and then averages these scores. Note that neither measure is ideal for MLC because they do not account for the correlations between classes (see [10] and [41]). However, we report them in our performance comparisons as they have been used in other MLC literature [37].
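For reference, the four measures can be computed as in the following sketch, assuming the true and predicted label matrices are in binary indicator form and `log_probs` holds log P(y(n)|x(n)) from a probabilistic MLC model; the micro and macro F1 computations use scikit-learn.

```python
# A minimal sketch of the four evaluation measures used in this section.
import numpy as np
from sklearn.metrics import f1_score

def mlc_metrics(Y_true, Y_pred, log_probs):
    ema = np.mean(np.all(Y_true == Y_pred, axis=1))    # exact match accuracy
    cll_loss = -np.sum(log_probs)                      # conditional log-lik. loss
    micro = f1_score(Y_true, Y_pred, average='micro')  # pooled over all classes
    macro = f1_score(Y_true, Y_pred, average='macro')  # per-class, then averaged
    return ema, cll_loss, micro, macro
```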

6.3 Results

6.3.1 Performance Comparisons

We have performed ten-fold cross validation for all of our experiments. To evaluate the statistical significance of performance difference, we apply paired t-tests at 0.05 significance level. We use markers */⊛ to indicate whether MC is significantly better/worse than the compared method.

Tables 2, 3, 4 and 5 show the performance of the methods in terms of EMA, CLL-loss, micro F1 and macro F1, respectively. We only show the results of MMOC on four datasets (emotions, yeast, scene and image) because it did not finish on the remaining data (MMOC did not finish one round of learning within a 24-hour time limit). For the same reason, we do not report the results of PCC, EPCC and MLCRF on the enron dataset. Also note that we do not report CLL-loss for MMOC, ECC and EPCC because they do not compute a probabilistic score for a given class assignment.

Table 2

Performance of each method on the benchmark datasets in terms of exact match accuracy

EMA          | BR      | CHF     | MLKNN   | IBLR    | CC      | ECC     | PCC     | EPCC    | MLCRF   | MMOC  | SC      | MC

Emotions     | 0.265 * | 0.300 * | 0.283 * | 0.335   | 0.268 * | 0.288 * | 0.317   | 0.344   | 0.303 * | 0.332 | 0.322   | 0.346
Yeast        | 0.151 * | 0.163 * | 0.179 * | 0.204 * | 0.193 * | 0.204 * | 0.230   | 0.219   | 0.180 * | 0.219 | 0.192 * | 0.235
Scene        | 0.541 * | 0.605 * | 0.629 * | 0.644 * | 0.632 * | 0.658 * | 0.666   | 0.671   | 0.583 * | 0.664 | 0.625 * | 0.680
Image        | 0.280 * | 0.360 * | 0.346 * | 0.387 * | 0.426 * | 0.413 * | 0.449   | 0.442   | 0.377 * | 0.448 | 0.414 * | 0.463
Enron        | 0.164 * | 0.170 * | 0.078 * | 0.163 * | 0.173 * | 0.180   | -       | -       | -       | -     | 0.167 * | 0.187
Rcv1_subset1 | 0.334 * | 0.357 * | 0.205 * | 0.279 * | 0.429 * | 0.410 * | 0.432 * | 0.420 * | 0.344 * | -     | 0.441 * | 0.457
Rcv1_subset2 | 0.439 * | 0.465 * | 0.288 * | 0.417 * | 0.516 * | 0.509 * | 0.523 * | 0.516 * | 0.475 * | -     | 0.531   | 0.536
Rcv1_subset3 | 0.466 * | 0.486 * | 0.327 * | 0.446 * | 0.539 * | 0.539 * | 0.548 * | 0.544 * | 0.489 * | -     | 0.560   | 0.561
Rcv1_subset4 | 0.510 * | 0.531 * | 0.354 * | 0.491 * | 0.579 * | 0.569 * | 0.588   | 0.576 * | 0.550 * | -     | 0.592   | 0.591
Rcv1_subset5 | 0.439 * | 0.456 * | 0.276 * | 0.411 * | 0.497 * | 0.494 * | 0.519 * | 0.513 * | 0.457 * | -     | 0.539   | 0.540

#win/#tie/#loss | 10/0/0 | 10/0/0 | 10/0/0 | 9/1/0 | 10/0/0 | 9/1/0 | 4/5/0 | 5/4/0 | 9/0/0 | 0/4/0 | 5/5/0 |

Marker */⊛ indicates whether MC is statistically superior/inferior to the compared method (using paired t-test at 0.05 significance level). The last row shows the total number of win/tie/loss for MC against the compared method (e.g., #win is how many times MC significantly outperforms that method).

Table 3

Performance of each method in terms of conditional log-likelihood loss

CLL-loss     | BR       | CHF      | MLKNN    | IBLR     | CC       | PCC      | MLCRF    | SC       | MC

Emotions     | 153.5 *  | 147.5 *  | 151.7 *  | 143.0 *  | 169.6 *  | 134.9    | 139.2 *  | 147.4 *  | 128.8
Yeast        | 1500.3 * | 1491.7 * | 1464.9 * | 1434.2 * | 2303.8 * | 932.1 ⊛  | 1175.4 * | 1097.0 * | 1000.0
Scene        | 344.7 *  | 318.4 *  | 310.9 *  | 283.9 *  | 395.0 *  | 258.9    | 313.2 *  | 306.3 *  | 260.1
Image        | 432.5 *  | 415.9 *  | 425.3 *  | 395.6 *  | 480.3 *  | 354.7    | 401.4 *  | 388.4 *  | 347.1
Enron        | 1287.3 * | 1272.5 * | 1301.2 * | 1287.4 * | 1293.5 * | -        | -        | 1437.9 * | 1224.4
Rcv1_subset1 | 1443.8 * | 2144.2 * | 1873.7 * | 1379.5 * | 1701.3 * | 1034.3 * | 1369.4 * | 962.7    | 951.1
Rcv1_subset2 | 1207.4 * | 2223.6 * | 1687.8 * | 1172.6 * | 1398.8 * | 923.0 *  | 1123.6 * | 893.5 *  | 855.8
Rcv1_subset3 | 1207.4 * | 2156.0 * | 1674.6 * | 1168.2 * | 1500.5 * | 896.7 *  | 1116.4 * | 939.7 *  | 837.2
Rcv1_subset4 | 1072.9 * | 1759.9 * | 1532.9 * | 1034.8 * | 1282.1 * | 823.0 *  | 951.4 *  | 790.7 *  | 770.6
Rcv1_subset5 | 1267.0 * | 2283.6 * | 1795.5 * | 1234.7 * | 1422.0 * | 1009.0 * | 1192.4 * | 924.0 *  | 894.3

#win/#tie/#loss | 10/0/0 | 10/0/0 | 10/0/0 | 10/0/0 | 10/0/0 | 5/3/1 | 9/0/0 | 9/1/0 |

Marker */⊛ indicates whether MC is statistically superior/inferior to the compared method (using paired t-test at 0.05 significance level). The last row shows the total number of win/tie/loss for MC against the compared method.

Table 4

Performance of each method in terms of micro F1

Micro_F1     | BR      | CHF     | MLKNN   | IBLR    | CC      | ECC     | PCC     | EPCC    | MLCRF   | MMOC    | SC      | MC

Emotions     | 0.645 * | 0.672   | 0.656 * | 0.692   | 0.621 * | 0.652 * | 0.664 * | 0.688   | 0.684   | 0.687   | 0.678   | 0.693
Yeast        | 0.635   | 0.637   | 0.646   | 0.661 ⊛ | 0.628   | 0.631   | 0.645   | 0.650   | 0.619 * | 0.651   | 0.631   | 0.640
Scene        | 0.696 * | 0.722 * | 0.736   | 0.758   | 0.697 * | 0.724 * | 0.722 * | 0.743   | 0.713 * | 0.711 * | 0.717 * | 0.745
Image        | 0.479 * | 0.541 * | 0.504 * | 0.573   | 0.550   | 0.563   | 0.565   | 0.577   | 0.558   | 0.572   | 0.561   | 0.573
Enron        | 0.551   | 0.569 ⊛ | 0.450 * | 0.566 ⊛ | 0.577 ⊛ | 0.583 ⊛ | -       | -       | -       | -       | 0.552   | 0.556
Rcv1_subset1 | 0.503 * | 0.516   | 0.257 * | 0.459 * | 0.511 * | 0.525   | 0.510 * | 0.529   | 0.505 * | -       | 0.512 * | 0.525
Rcv1_subset2 | 0.568 * | 0.584   | 0.317 * | 0.546 * | 0.586   | 0.589   | 0.588   | 0.591   | 0.582 * | -       | 0.591   | 0.587
Rcv1_subset3 | 0.576 * | 0.592   | 0.364 * | 0.564 * | 0.594   | 0.610 ⊛ | 0.594   | 0.613 ⊛ | 0.590   | -       | 0.596   | 0.599
Rcv1_subset4 | 0.622 * | 0.637   | 0.404 * | 0.606 * | 0.640   | 0.646 ⊛ | 0.644 ⊛ | 0.650 ⊛ | 0.635   | -       | 0.638   | 0.635
Rcv1_subset5 | 0.582 * | 0.597   | 0.314 * | 0.566 * | 0.595   | 0.603   | 0.600   | 0.605 ⊛ | 0.589 * | -       | 0.598   | 0.597

#win/#tie/#loss | 8/2/0 | 2/7/1 | 8/2/0 | 5/3/2 | 3/6/1 | 2/5/3 | 3/5/1 | 0/6/3 | 5/4/0 | 1/3/0 | 2/8/0 |

Marker */⊛ indicates whether MC is statistically superior/inferior to the compared method (using paired t-test at 0.05 significance level). The last row shows the total number of win/tie/loss for MC against the compared method.

Table 5

Performance of each method in terms of macro F1

Macro_F1     | BR      | CHF     | MLKNN   | IBLR    | CC      | ECC     | PCC     | EPCC    | MLCRF   | MMOC    | SC      | MC

Emotions     | 0.632 * | 0.667   | 0.656   | 0.690   | 0.620 * | 0.643 * | 0.659   | 0.683   | 0.667   | 0.679   | 0.670   | 0.686
Yeast        | 0.457 * | 0.461 * | 0.478   | 0.498 ⊛ | 0.467   | 0.477   | 0.486   | 0.496 ⊛ | 0.451 * | 0.473   | 0.467   | 0.477
Scene        | 0.703 * | 0.730 * | 0.743   | 0.765   | 0.709 * | 0.740   | 0.729 * | 0.753   | 0.721 * | 0.721 * | 0.728 * | 0.755
Image        | 0.486 * | 0.546 * | 0.516 * | 0.581   | 0.562   | 0.571   | 0.575   | 0.586   | 0.560 * | 0.578   | 0.572   | 0.584
Enron        | 0.478 ⊛ | 0.479   | 0.411 * | 0.475   | 0.484 ⊛ | 0.482 ⊛ | -       | -       | -       | -       | 0.470   | 0.470
Rcv1_subset1 | 0.495 * | 0.511   | 0.273 * | 0.463 * | 0.506 * | 0.516   | 0.504 * | 0.521   | 0.500 * | -       | 0.507   | 0.517
Rcv1_subset2 | 0.503 * | 0.526   | 0.264 * | 0.475 * | 0.531   | 0.539   | 0.531   | 0.538   | 0.516 * | -       | 0.536   | 0.531
Rcv1_subset3 | 0.513 * | 0.536   | 0.278 * | 0.497 * | 0.547   | 0.558 ⊛ | 0.548   | 0.561 ⊛ | 0.531   | -       | 0.543   | 0.542
Rcv1_subset4 | 0.499 * | 0.519   | 0.269 * | 0.477 * | 0.534 ⊛ | 0.540 ⊛ | 0.534 ⊛ | 0.539 ⊛ | 0.515   | -       | 0.526   | 0.522
Rcv1_subset5 | 0.500 * | 0.526   | 0.257 * | 0.487 * | 0.536   | 0.538 ⊛ | 0.534   | 0.538 ⊛ | 0.513 * | -       | 0.536   | 0.527

#win/#tie/#loss | 9/0/1 | 3/7/0 | 7/3/0 | 5/4/1 | 3/5/2 | 1/5/4 | 2/6/1 | 0/5/4 | 6/3/0 | 1/3/0 | 1/9/0 |

Marker */⊛ indicates whether MC is statistically superior/inferior to the compared method (using paired t-test at 0.05 significance level). The last row shows the total number of win/tie/loss for MC against the compared method.

In terms of EMA (Table 2), MC clearly outperforms the other methods on most datasets. MC is significantly better than BR, CHF, MLKNN and CC on all ten datasets, significantly better than IBLR, ECC and MLCRF on nine datasets, significantly better than EPCC and SC on five datasets and significantly better than PCC on four datasets (see the last row of Table 2). Although not statistically significant, MC performs better than MMOC on all datasets MMOC is able to finish. MLKNN and IBLR perform poorly on the high-dimensional (m > 1,000) datasets because Euclidean distances between data instances become indiscernible in high dimensions.

Interestingly, MC shows significant improvements over SC (a single CTBN) on five datasets, while SC also produces competitive results. We attribute the improved performance of MC to the ability of the mixture to compensate for the restricted dependences modeled by individual CTBNs, and to the ability of individual CTBNs to better fit the data under different weight sets. In contrast, ECC and EPCC do not show consistent improvements over their base methods (CC and PCC, respectively) and sometimes even deteriorate the accuracy. This is due to the ad-hoc nature of their ensemble learning and prediction (see Section 3), which limits the potential improvement and disturbs the predictions of the ensemble classifiers.

Table 3 compares MC to the other probabilistic MLC methods using CLL-loss. The results show that MC outperforms all other methods. This is expected because MC is tailored to optimize the conditional log-likelihood. Among the compared probabilistic methods, only PCC produces results comparable to MC, because PCC explicitly evaluates all possible class assignments to compute the entire class conditional distribution. On the other hand, CC greedily seeks the mode of the class conditional distribution (Equation (2)) and results in large losses. In addition, CHF and MLKNN perform very poorly because they apply ad-hoc classification heuristics without performing proper probabilistic inference. Again, MC shows consistent improvements over SC because mixing multiple CTBNs allows us to account for different patterns in the data and, hence, improves the generalization of the model.

Lastly, Tables 4 and 5 show that MC is also very competitive in terms of micro and macro F1 scores, although optimizing them was not our immediate objective. One noteworthy observation is that ECC and EPCC do particularly well in terms of F1 scores. We conjecture that averaging the predictions for each class variable enhances BR-like characteristics in their ensemble decisions. In the future, we plan to combine these two ensemble approaches (e.g., MCC/MPCC by applying our mixture framework and algorithms to CC/PCC; ECTBN using randomly structured CTBNs and simple averaging) and compare their performances.

6.3.2 Effect of the Number of Mixture Components

In the second part of our experiments, we investigate the effect of the number of mixture components on the MC model. Using three of the benchmark datasets (emotions, scene and image), we study how the performance of MC changes as we increase the number of trees in the model from 1 to 20. In particular, we use ten-fold cross validation and trace the average CLL-loss and EMA across the folds.

Figure 3 summarizes the results. Figures 3(a), 3(b) and 3(c) show how CLL-loss changes on emotions, scene and image, respectively. On all three datasets, adding the first few trees improves the CLL-loss of the mixture model rapidly. The improvement then slows down and eventually levels off, after which CLL-loss stops improving and remains stable.

Figure 3: Conditional log-likelihood loss and exact match accuracy of MC with different numbers of mixture components.

Figures 3(d), 3(e) and 3(f) show the corresponding changes in EMA. Notice that EMA is closely correlated with CLL-loss on all three datasets, and our stopping criterion is useful for optimizing EMA as well as CLL-loss. That is, EMA improves significantly while CLL-loss is improving rapidly. Once CLL-loss becomes stable, EMA also remains stable and does not show any signs of fluctuation or overfitting.

7. CONCLUSION

In this work, we proposed a new probabilistic approach to multi-label classification based on the mixture of Conditional Tree-structured Bayesian Networks. We devised and presented algorithms for learning the parameters of the mixture, finding multiple tree structures and inferring the maximum a posteriori (MAP) output label configurations for the model. Our experimental evaluation on a range of datasets shows that our approach outperforms the state-of-the-art multi-label classification methods in most cases.

Acknowledgments

This work was supported by grants R01LM010019 and R01GM088224 from the NIH. Its content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

Contributor Information

Charmgil Hong, Computer Science Dept., University of Pittsburgh, Pittsburgh, PA, USA.

Iyad Batal, GE Global Research, San Ramon, CA, USA.

Milos Hauskrecht, Computer Science Dept., University of Pittsburgh, Pittsburgh, PA, USA.

References

1. Antonucci A, Corani G, Mauá DD, Gabaglio S. An ensemble of bayesian networks for multilabel classification. IJCAI. 2013:1220–1225.
2. Batal I, Hong C, Hauskrecht M. An efficient probabilistic framework for multi-dimensional classification. Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, CIKM ’13; ACM; 2013. pp. 2417–2422.
3. Berger J. Statistical decision theory and Bayesian analysis. 2nd ed. Springer series in statistics. Springer; New York, NY: 1985.
4. Berkeley U. [Accessed: 2014-8-16]; Enron email analysis. http://bailando.sims.berkeley.edu/enronemail.html.
5. Bielza C, Li G, Larrañaga P. Multi-dimensional classification with bayesian networks. International Journal of Approximate Reasoning. 2011;52(6):705–727.
6. Boutell MR, Luo J, Shen X, Brown CM. Learning multi-label scene classification. Pattern Recognition. 2004;37(9):1757–1771.
7. Bradley JK, Guestrin C. Learning tree conditional random fields. International Conference on Machine Learning (ICML 2010); Haifa, Israel. 2010.
8. Cheng W, Hüllermeier E. Combining instance-based learning and logistic regression for multilabel classification. Machine Learning. 2009;76(2–3):211–225.
9. Clare A, King RD. Knowledge discovery in multi-label phenotype data. Lecture Notes in Computer Science. Springer; 2001. pp. 42–53.
10. Dembczynski K, Cheng W, Hüllermeier E. Bayes optimal multilabel classification via probabilistic classifier chains. Proceedings of the 27th International Conference on Machine Learning (ICML-10); Omnipress; 2010. pp. 279–286.
11. Edmonds J. Optimum branchings. Research of the National Bureau of Standards. 1967;71B:233–240.
12. Elisseeff A, Weston J. A kernel method for multi-labelled classification. NIPS. 2001:681–687.
13. Ghamrawi N, McCallum A. Collective multi-label classification. Proceedings of the 14th ACM International Conference on Information and Knowledge Management, CIKM ’05; ACM; 2005. pp. 195–200.
14. Godbole S, Sarawagi S. Discriminative methods for multi-labeled classification. PAKDD ’04. 2004:22–30.
15. Hauskrecht M, Batal I, Valko M, Visweswaran S, Cooper GF, Clermont G. Outlier detection for patient monitoring and alerting. Journal of Biomedical Informatics. 2013 Feb;46(1):47–55.
16. Hauskrecht M, Valko M, Batal I, Clermont G, Visweswaram S, Cooper G. Conditional outlier detection for clinical alerting. Annual American Medical Informatics Association Symposium; 2010.
17. Hauskrecht M, Valko M, Kveton B, Visweswaram S, Cooper G. Evidence-based anomaly detection. Annual American Medical Informatics Association Symposium; November 2007. pp. 319–324.
18. Hsu D, Kakade S, Langford J, Zhang T. Multi-label prediction via compressed sensing. NIPS. 2009:772–780.
19. Ioffe S, Forsyth D. Human tracking with mixtures of trees. Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV 2001); 2001. pp. 690–695.
20. Ioffe S, Forsyth DA. Mixtures of trees for object recognition. CVPR. 2. IEEE Computer Society; 2001. pp. 180–185.
21. Kazawa H, Izumitani T, Taira H, Maeda E. Maximal margin labeling for multi-topic text categorization. Advances in Neural Information Processing Systems. Vol. 17. MIT Press; 2005. pp. 649–656.
22. Kirshner S, Smyth P. Infinite mixtures of trees. Proceedings of the 24th International Conference on Machine Learning, ICML ’07; New York, NY, USA. ACM; 2007. pp. 417–423.
23. Koller D, Friedman N. Probabilistic Graphical Models: Principles and Techniques. MIT Press; 2009.
24. Lafferty JD, McCallum A, Pereira FCN. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01; 2001.
25. Lewis DD, Yang Y, Rose TG, Li F. Rcv1: A new benchmark collection for text categorization research. J Mach Learn Res. 2004 Dec;5:361–397.
26. Meilă M, Jordan MI. Learning with mixtures of trees. Journal of Machine Learning Research. 2000;1:1–48.
27. Moon T. The expectation-maximization algorithm. Signal Processing Magazine, IEEE. 1996;13(6):47–60.
28. Pakdaman M, Batal I, Liu Z, Hong C, Hauskrecht M. An optimization-based framework to learn conditional random fields for multi-label classification. SDM. SIAM; 2014.
29. Qi G-J, Hua X-S, Rui Y, Tang J, Mei T, Zhang H-J. Correlative multi-label video annotation. Proceedings of the 15th international conference on Multimedia, MULTIMEDIA ’07; ACM; 2007. pp. 17–26.
30. Raiffa H. Decision Analysis: Introductory Lectures on Choices Under Uncertainty. Mcgraw-Hill College; Jan, 1997.
31. Read J, Pfahringer B, Holmes G, Frank E. Classifier chains for multi-label classification. Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD ’09; Springer-Verlag; 2009.
32. Shahaf D, Guestrin C. Learning thin junction trees via graph cuts. AISTATS, volume 5 of JMLR Proceedings; 2009. pp. 113–120.
33. Sontag D. Approximate Inference in Graphical Models using LP Relaxations. PhD thesis. Massachusetts Institute of Technology; 2010.
34. Tai F, Lin H-T. Multi-label classification with principle label space transformation. The 2nd International Workshop on Multi-Label Learning; 2010.
35. Tarjan RE. Finding optimum branchings. Networks. 1977;7(1):25–35.
36. Trohidis K, Tsoumakas G, Kalliris G, Vlahavas IP. Multi-label classification of music into emotions. ISMIR. 2008:325–330.
37. Tsoumakas G, Zhang M-L, Zhou Z-H. Learning from multi-label data. ECML PKDD Tutorial. 2009.
38. van der Gaag LC, de Waal PR. Multi-dimensional bayesian network classifiers. Probabilistic Graphical Models. 2006:107–114.
39. Šingliar T, Hauskrecht M. Modeling highway traffic volumes. Proceedings of the 18th European Conference on Machine Learning, ECML ’07; Springer-Verlag; 2007. pp. 732–739.
40. Yuan C, Lu T-C, Druzdzel MJ. Annealed MAP. UAI. AUAI Press; 2004. pp. 628–635.
41. Zhang M-L, Zhang K. Multi-label learning by exploiting label dependency. Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’10; ACM; 2010. pp. 999–1008.
42. Zhang ML, Zhou ZH. Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering. 2006;18(10):1338–1351.
43. Zhang ML, Zhou ZH. Ml-knn: A lazy learning approach to multi-label learning. Pattern Recogn. 2007 Jul;40(7):2038–2048.
44. Zhang Y, Schneider J. Multi-label output codes using canonical correlation analysis. AISTATS. 2011.
45. Zhang Y, Schneider J. Maximum margin output coding. Proceedings of the 29th International Conference on Machine Learning; 2012. pp. 1575–1582.