Structure Learning of Bayesian Network Based on Adaptive Thresholding

Direct dependencies and conditional dependencies in restricted Bayesian network classifiers (BNCs) are two basic kinds of dependencies. Traditional approaches, such as filter and wrapper, have proved to be beneficial to identify non-significant dependencies one by one, whereas the high computational overheads make them inefficient especially for those BNCs with high structural complexity. Study of the distributions of information-theoretic measures provides a feasible approach to identifying non-significant dependencies in batch that may help increase the structure reliability and avoid overfitting. In this paper, we investigate two extensions to the k-dependence Bayesian classifier, MI-based feature selection, and CMI-based dependence selection. These two techniques apply a novel adaptive thresholding method to filter out redundancy and can work jointly. Experimental results on 30 datasets from the UCI machine learning repository demonstrate that adaptive thresholds can help distinguish between dependencies and independencies and the proposed algorithm achieves competitive classification performance compared to several state-of-the-art BNCs in terms of 0–1 loss, root mean squared error, bias, and variance.


Introduction
Classification is one of the most important tasks in machine learning. The basic problem of supervised classification is the induction of a model with feature set X = {X 1 , · · · , X n } that classifies testing instance (example) x = {x 1 , · · · , x n } into one of the several class labels {c 1 , · · · , c m } of class variable C. Bayesian network classifiers (BNCs) have many desirable properties over other numerous classification models, such as model interpretability, the ease of implementation, the ability to deal with multi-class classification problems and the comparable classification performance [1]. A BNC or B assigns the most probable label with the maximum posterior probability to x by calculating the posterior probability for each class label that is: where class label c ∈ {c 1 , · · · , c m }.
Although unrestricted BNCs are the least biased, the search-space that is needed to train such a model increases exponentially with the number of features [2]. The arising complexity issues limit the study of unrestricted BNCs and it has led to the study of restricted BNCs, from 0-dependence naive Bayes (NB) [3][4][5], 1-dependence tree-augmented naive Bayes (TAN) [6] to k-dependence Bayesian classifier (KDB) [7]. These classifiers take class variable as the common parent of all predictive C X 3 X 4 X 1 X 2 C X 3 X 4 X 1 X 2 3 (a) KDB (k=1) (b) KDB (k=2) There are two basic kinds of dependencies in restricted BNCs: (1) direct dependence between feature X i and C that can be quantified by mutual information (MI) I(X i ; C), and (2) conditional dependence between X i and X j given C that can be measured by conditional mutual information (CMI) I(X i ; X j |C). Many researchers have exploited methods, such as filter and wrapper [8][9][10][11][12][13], to select direct dependencies by removing redundant features. The filter approach operates independently of any learning algorithms that rank the features by some criteria and omit all features that do not achieve a sufficient score [14][15][16]. The wrapper approach evaluates the feature subsets every time and may produce better results. For example, Backwards Sequential Elimination (BSE) [17] uses a simple heuristic wrapper approach that seeks a subset of the available features that minimizes 0-1 loss on the training set. Forward Sequential Selection (FSS) [18] uses the reverse search direction to BSE. Although the filter and wrapper approaches have proved to be beneficial in domains with highly correlated features, the learning procedure ends only when there is no accuracy improvement, thus they are expensive to run and can break down with very large numbers of features [8,19,20]. Suppose that we need to select m from n features for classification, BSE or FSS will construct P m n or n! m! candidate BNCs to judge if there exist non-significant features or direct dependencies. It is even more difficult for BSE or FSS to select the conditional dependencies. For example, the network topology of KDB consists of nk − k 2 2 − k 2 conditional dependencies [21]. If BSE or FSS evaluate them one by one to identify those relatively non-significant ones, the high computational overheads is almost unbearable and few approaches are proposed to address this issue.
Obviously, how to efficiently identify non-significant direct and conditional dependencies are two key issues to learn BNC. Strictly speaking, there exist no direct or conditional independence due to the fact that the MI and CMI values are non-negative. However, weak dependencies, if introduced into the network topology, will result in overfitting and classification bias. For KDB, all features are indiscriminately conditionally dependent on at most k parent features even if the conditional dependencies are very weak. Discarding these redundant features or weak conditional dependencies can help increase structure reliability and avoid overfitting. Figure 2 presents the distributions of MI and CMI values for K 2 DB (KDB with k = 2) on dataset Connect-4, which has 67,557 instances (or examples), 42 features and three classes. As shown in Figure 2a, there exist minor differences among some MI values, thus the significance of corresponding direct dependencies is almost the same and they can be treated in batch. From Figure 2b, the same also applies to CMI and corresponding conditional dependencies. The filter approaches have computational efficiency while the wrapper approaches may produce better results. The algorithm proposed combines the characteristics of filter with wrapper approaches to exploit the complementary strengths. In this paper, we propose to group the direct (or conditional) dependencies into different batches using adaptive thresholds. We assume that there exists no significant difference between the MI (or CMI) values in the same batch. Then, the basic idea of filter and wrapper will be applied to a select batch rather than single dependence for each iteration. This learning strategy can help achieve much higher efficiency compared to BSE (or FSS) while retaining competitive classification performance, and above all it provides a feasible solution for selecting conditional dependencies, the number of which increases exponentially as the number of features increases.
We investigate two extensions to KDB, MI-based feature selection and CMI-based dependence selection based on a novel adaptive thresholding method. The final BNC, Adaptive KDB (AKDB), evaluates the subsets of features and conditional dependencies using leave-one-out cross validation (LOOCV). In the remaining sections, we prove that applying feature selection and dependence selection techniques to KDB can alleviate the potential redundancy problem. We present extensive experimental results, which prove that AKDB significantly outperforms several other state-of-the-art BNCs in terms of 0-1 loss, root mean squared error (RMSE), bias and variance.

Restricted Bayesian Network Classifiers
For convenience, except for the algorithm names, all the used acronyms in this work are listed in Table 1. The structure of BNC can be described as a directed acyclic graph [22]. Nodes in structure represent the class variable C or features, edge X i → X j denotes probabilistic dependency relationship between these two features and X i is one of the immediate parent nodes of X j . Thus, in a restricted BNC or B, class variable C is required as the common parent of all features and does not have any parents so the individual probability of C is P(c). We use P B (x i |π i ) to denote the individual probability of feature X i , where π i denotes the set of values of X i 's parents. The joint probability distribution can be calculated as the product of P B (x i |π i ) of all features and P(c) that is: Unfortunately, the inference of an unrestricted BNC has been proved to be an NP-hard problem [23,24] and learning a restricted or pre-fixed BNC is one approach to deal with the intractable complexity. For example, NB [25,26] is the simplest classifier among restricted BNCs that assumes each feature is conditionally independent given the class variable C. Since, in the real world, the dataset usually does not satisfy the independence assumption, this may cause a deterioration of the classification performance. KDB alleviates the independence assumption of NB that it constructs classifiers which allow feature X i within BNC to have at most k parent features. KDB firstly sets the feature order by comparing MI values and then calculates CMI values as the weights to measure the conditional relationship between features given C and select at most k parent features for one feature. MI and CMI are defined as follows: For KDB, I(X i ; C) measures the direct dependence between X i and C. I(X i ; X j |C) measures the conditional dependence between X i and X j given C. For a given training set with n features and the parameter k, KDB firstly calculates MI and CMI. Suppose that the feature order is {X 1 , · · · , X n } by comparing MI, X i will choose min(i − 1, k) features with the highest CMI values from the first i − 1 candidates. The structure learning procedure of KDB is depicted in Algorithm 1.
There have been some refinements that may improve KDB's performance. Rodríguez and Lozano [27] proposed to extend KDB to a multi-dimensional classifier, which learned a population of classifiers (nondominated solutions) by a multi-objective optimization technique and the objective functions for the multi-objective approach are the multi-dimensional k-fold cross-validation estimations of the errors. Louzada [28] proposed to generate multiple KDB networks via a naive bagging procedure by obtaining the predicted values from the adjusted models, and then combine them into a single predictive classification. Input: Training set T , list of features L = {X 1 , · · · , X n , C} and k. Output: The directed graph G. 1 Calculate I(X i ; C) from T for all features, calculate I(X i ; X j |C) from T for each pair of features (i = j); 2 Let the used features list, S, be empty; 3 Initialize the network structure G, with a single node, C; 4 repeat 5 Select and add a node X max to G, where X max = arg max I(X i ; C) and X i / ∈ S; 6 Add an edge C → X max in G; 7 Add min(|S| , k) edges from distinct features X j with the highest I(X max ; X j |C), where X j ∈ S; 8 Add X max to S; 9 until S includes all X i in L; 10 return G;

Adaptive KDB
MI and CMI are non-negative in Equation (3). I(X i ; C) = 0 (or I(X i ; X j |C) = 0) if X i and C are independent (or X i and X j are conditionally independent given C). If X i and C are regarded as independent, the edge connecting them will be removed. Practically, the estimated MI is compared to a small threshold, in order to distinguish pairs of dependent and pairs of independent features [29][30][31][32]. In the following discussion, we mainly discuss how to choose the threshold of MI. The test for conditional independence using CMI is similar.
To refine the network structure, AKDB uses an adaptive threshold to filter out those non-significant dependencies. If the threshold is high, too many dependencies will be identified as non-significant and removed, and a sparse network may underfit the training data. In contrast, if the threshold is low, few dependencies will be identified as non-significant and a dense network may overfit the training data. The thresholds control AKDB's bias-variance trade-off and, if appropriate thresholds are predefined, the lowest error will be achieved as this is a complex interplay between structure complexity and classification performance. Unfortunately, for different training datasets, the thresholds may differ and there are no formal methods to preselect the thresholds.
To guarantee satisfactory performance and overcome exhaustive experimentation, for KDB, given the feature order selected by KDB based on MI comparison, if feature X i is assumed to be independent of C when I(X i ; C) = 0, it will be at the end of the order and the edge C → X i will be removed. Furthermore, X i may be dependent on other features, whereas no feature depends on it. That is, X i will be irrelevant to classification directly or indirectly. The problem of choosing the threshold of MI turns to choosing a feature subset. Many feature selection algorithms are based on forward selection or backwards elimination strategies [18,33]. They start with either an empty set of features or a full set of features, and then only one feature is added to BNC or removed from BNC for each iteration. Feature selection is a complex task that the search space for n features is O(n 2 ). Thus, it is impractical to search the whole space exhaustively, unless n is small. Our proposed algorithm, AKDB, extends KDB to adaptively select a threshold of MI and the threshold can help remove more than one feature at each step.
To clarify the basic idea, we take datasets Hypo and Waveform for a case study. Dataset Hypo has 3772 instances, 29 features and four classes. Dataset Waveform has 100,000 instances, 21 features and three classes. Corresponding MI values (see details in Tables A1 and A2 in the Appendix A) and CMI values (see details in Tables A3 and A4 in the Appendix A) for K 2 DB are, respectively, presented in    As Figure 3 shows, the features can be divided into different parts according to the distribution of MI values. In dataset Hypo, we can see that the difference in MI values of the first 26 features is not obvious and that these features can be grouped into one part. The 27th and 28th features can be grouped into another part, and the 29th feature is the last part. On dataset Waveform, the features can also be divided into three parts. The distribution of CMI values is similar. As Figure 4 shows, the CMI values on datasets Hypo and Waveform are both divided into five groups. The difference in MI values in the same part should be non-significant and, if the MI values are small, corresponding features can be identified as redundant for classification and removed from BNC. The test for redundant conditional dependencies is similar. From Figures 3 and 4, we can see that the thresholds for identifying redundancy differ greatly for different datasets. Thus, a threshold that maximizes a performance measure should be adapted to different datasets. MI and CMI are introduced as the adaptive thresholds of redundant features and redundant conditional dependencies, respectively. AMI and ACMI respectively denote the average MI and the average CMI, which are defined as follows, and are introduced in this paper as the benchmark thresholds to distinguish between strong and weak dependencies: where |π i | is the size of π i , |X| and |F| denotes the cardinality of feature set X and feature subset F. To guarantee satisfactory performance and overcome exhaustive experimentation, we require that AMI> MI and ACMI> CMI hold. AKDB applies the greedy-search strategy to iteratively identify redundant and near-redundant features. For feature selection, we take advantage of the feature order that is determined by comparing MI. For simplicity, we adaptively provide the threshold value of MI that cuts off an entire region at the end of the order. Let δ be a user-specified parameter, 0% < δ ≤ 100% (see detail in Section 4.1). We suppose that the difference between features X i and X j is non-significant if I(X i ; C) ≤ I(X j ; C) ≤ I(X i ; C) * (1 + δ). Correspondingly, we regard the feature X j as near-redundant if X i is redundant and the difference between features X i and X j is non-significant. Given the feature order, AKDB firstly selects the feature, e.g., X i , at the end of the order and identifies it as a redundant feature. Then, we identify the near-redundant features. Finally, LOOCV is introduced to evaluate the classification performance after removing redundant and near-redundant features as it can provide the out-of-sample error with an unbiased low-variance estimation. In addition, the 0-1 loss is used as a loss function since it is an effective measure to evaluate the quality of a classifier's classification performance.
Finally, the feature subset is selected with the lowest 0-1 loss. In case of a draw, preference is given to the smallest number of features. If the MI values are distributed densely, then all redundant and near-redundant features can be identified only in a few iterations. After that, the greedy-search strategy is applied to identify redundant and near-redundant conditional dependencies. In this paper, we proposed to extend KDB by using information-threshold based techniques, FeatureSelection (FS) and DependenceSelection (DS), to respectively identify redundant features and redundant conditional dependencies. Both techniques are based on backward elimination that begins at the full set of features or conditional dependencies.
The learning procedure of FS is shown in Algorithm 2. By applying BSE, FS aims to seeks a subset of the available features that minimizes 0-1 loss on the training set. FS starts from the full set of features and corresponding MI values have been grouped into several batches. There should exist significant differences between the MI values in different batches. Suppose that, for successive batches B i and B i+1 , I m = min{I(X j ; C)} for any I(X j ; C) ∈ B i and I m+1 = min{I(X k ; C)} for any I(X k ; C) ∈ B i+1 . In this paper, FS requires that, for batches B i and B i+1 , the criterion I m * (1 + δ) < I m+1 holds, or for batch B i the criterion I m ≤ I(X j ; C) ≤ I m * (1 + δ) holds. BSE operates by iteratively removing successive batches. Then, the threshold of MI, or MI, will change from I m to I m+1 if the removal can help reduce the 0-1 loss. The features in the batch or corresponding direct dependencies will be removed from the network structure and the classification performance will be evaluated iteratively using LOOCV. This procedure will terminate if there is no 0-1 loss improvement or I m >AMI.
When the learning procedure of FS terminates, DS is applied to identify non-significant conditional dependencies and its learning procedure is similar except that CMI rather than MI values will be grouped into several batches and we need to remove batch of CMI values iteratively to improve 0-1 loss. The learning procedure of DS is shown in Algorithm 3.
The description of a complete AKDB algorithm, which includes FS and DS techniques, is shown in Algorithm 4. Both FS and DS firstly apply the filter approach to rank feature or conditional dependence by MI or CMI criteria, then use the wrapper approach to evaluate the feature subset or dependence subset every time for better 0-1 loss results.

Algorithm 2: FeatureSelection(T , BN, L, AMI)
Input: Training set T , list of features L = {X 1 , · · · , X n } and AMI. Output: KDB with the best feature subset. 1 Let MI denote the adaptive threshold of MI; 2 Let the directed graph G 0 =BN; 3 t = 0; 4 repeat 5 Let the tth Batch of features or B t = ∅; 6 Let the feature X min = arg min I( Let MI=I(X min ; C); 8 Select nodes X j from G t and add them to Remove all nodes belonging to B t and edges connecting to them from G t ; for instance x ∈ T do 12 Predict x using G t and accumulate the classification accuracy; Let the tth Batch of conditional dependencies or B t = ∅; Remove all edges belonging to B t from G t ;

11
for instance x ∈ T do 12 Predict x using G t and accumulate the classification accuracy; 17 Select G t with the lowest 0-1 loss; 18 return G t ;

Algorithm 4: AKDB
Input: Training set T with features L = {X 1 , · · · , X n , C} and k. Output: AKDB model. 1 Calculate I(X i ; C)(1 ≤ i ≤ n) from T for each feature and AMI; 2 Calculate I(X i ; X j |C)(i = j) from T for every pair of features and ACMI; 3 Let L be a list which includes all X i in decreasing order of I(X i ; C); 4 Initialize the network structure G = LearnStructure(T , L, k); // Algorithm 1 5 G = FeatureSelection(T , G, L, AMI); // Algorithm 2 6 G = DependenceSelection(T , G, ACMI); // Algorithm 3 7 return G;

Selection of the Value of Parameter for AKDB
Removing redundant features or conditional dependencies from BNC may positively affect its classification performance if the threshold value δ is selected appropriately. However, there is no priori work that can achieve this goal. We perform an empirical study to select an appropriate δ. The 0-1 loss results for all datasets with different δ values are presented in Table 3. We can see that AKDB achieves the lowest 0-1 loss results more often when δ = 10%. Although on some datasets AKDB with δ = 10% may perform relatively poorer, the difference between the 0-1 loss when δ = 10% and the lowest 0-1 loss is not significant (less than 5%). For example, on dataset Splice-C4.5, AKDB achieves the lowest 0-1 loss (0.0468) when δ = 80%, and when δ = 10% the 0-1 loss is 0.0469. From the experimental results, we argue that δ = 10% is appropriate to help identify the threshold efficiently.  The lowest 0-1 loss results for datasets are shown in bold.

Effects of Feature Selection and Conditional Dependence Selection on KDB
FS and DS are two information-threshold based techniques which are used in the proposed algorithm AKDB. Using these techniques will cause a portion of features and conditional dependencies to be removed. To prove that they can work severally, we present respectively two versions of KDB as follows: • KDB-FS, KDB with only feature selection, • KDB-DS, KDB with only conditional dependence selection.
In order to evaluate the difference between two classifiers, we define the relative ratio as follows: The values of parameter M represents different measures. Corresponding values of R M (A|B) represent the difference in percentage between two classifiers A and B based on parameter M.
In this paper, N F and N D are respectively used to denote the number of features and the number of conditional dependencies in BNC. SMI and SCMI are used to indicate the sum of MI and CMI in BNC, respectively. The results of R M (KDB-FS|K 2 DB) and R M (KDB-DS|K 2 DB) are shown in Figure 5a,b, respectively.   Furthermore, removing features based on the FS technique will not result in strong direct dependencies to be removed. For example, the value of R SMI (KDB-FS|K 2 DB) on datasets Hypo is 4.14%, although 79.31% of features are removed. On dataset Wavement (No. 30), the value of R SMI (KDB-FS|K 2 DB) is close to 0% after removing 9.52% of features. These facts suggest that those removed features in KDB show weak direct dependencies. In addition, removing weak direct dependencies may help improve the classification performance. The values of R 0-1 Loss (KDB-FS|K 2 DB) on datasets Hypo and Wavement are 21.05% and 24.61%, respectively. That is, the classification performance is improved after removing the weak direct dependencies. The significant improvement in 0-1 loss (the value of R 0-1 Loss (•) > 5%) on 12 datasets has demonstrated that the FS technique demonstrates a positive influence on classification performance.
The redundancy of conditional dependencies may also exist in KDB. Figure 5b  The factor that contributes to this high value is that the SCMI value of these 48 removed conditional dependencies reaches 2.3355, but the CMI value of each removed conditional dependence is lower than the ACMI value. When it comes to 0-1 loss, the value of R 0-1 Loss (KDB-DS|K 2 DB) on dataset Hypo is 14.04%. It indicates that deleting those weak conditional dependencies may help improve classification accuracy. KDB-DS achieves almost the same classification accuracy as K 2 DB with a simplified network structure on 14 datasets and achieves 0-1 loss improvement on 16 datasets. These results indicate that the DS technique is effective and can help reduce the structure complexity of KDB.
Both FS and DS techniques combine the characteristics of filter and wrapper approaches. The redundant features or conditional dependencies are filtered out and then we use classification accuracy to evaluate the feature subsets or the retained conditional dependencies, respectively. On the other hand, removing redundant features and conditional dependencies can reduce the parameters that are needed for probability estimates and may improve the classification accuracy. From the above discussion, we can see that both FS and DS techniques are efficient and can help improve the classification performance.

Comparison of AKDB vs. NB, NB-FSS, TAN, K 1 DB and K 2 DB
In this section, we conduct comparisons for related algorithms in terms of 0-1 loss, RMSE and bias-variance decomposition. RMSE [2] is computed as : where t is the number of training instances in training set T , c x is the true class label for the instance x, andp(c x |x) is the estimated posterior probability of the true class given x.
The win/draw/loss (W/D/L) records of 0-1 loss, RMSE and bias-variance decomposition are presented in Tables 4-6, respectively. The W/D/L record of the comparison results of every two different algorithms are presented in each cell[i; j] of every table. When one algorithm in row i (Al i ) and the another algorithm in column j (Al j ) are compared, we can observe which algorithm performs better on all datasets from cell[i; j]. This is because, in cell[i; j], a win denotes that Al i obtains a lower 0-1 loss than Al j , a loss denotes Al j that obtains a lower 0-1 loss than Al i , and a draw denotes that Al i and Al j perform comparably. We regard a difference as significant if the outcome of a one-tailed binomial sign test is less than 0.05 [39,40].    Table 4, we can see that NB-FSS performs better than NB in terms of a 0-1 loss. It indicates that FSS is feasible to NB. Surprisingly, K 2 DB does not have an obvious advantage when compared to 1-dependence classifiers. In addition, it even performs poorer when compared to TAN. However, when it comes to large datasets, as Table 7 shows, K 2 DB performs better than both TAN and K 1 DB. We can see that AKDB significantly outperforms all other algorithms. Most importantly, when compared to K 2 DB, AKDB has a 0-1 loss improvement with 15 wins and only one loss, which proves that the proposed two information-threshold based techniques are effective. This advantage is even greater on small datasets. From Table A5, AKDB never loses on small datasets and it obtains a significantly lower 0-1 loss on 11 out of 17 small datasets. On dataset Lymphography, the error is substantially reduced from 0.2365 to 0.1554. Compared to K 2 DB on large datasets, AKDB achieves W/D/L record of 4/8/1. Although the improvement is not significant, AKDB only loses on dataset Spambase. Based on these facts, we argue that AKDB is a more effective algorithm in terms of 0-1 loss. What is revealed in Table 5 is similar to that in Table 4. NB and NB-FSS perform worse, which demonstrates the limitations of the independence assumption in NB. TAN and K 1 DB get better performance than NB and NB-FSS. In addition, AKDB still achieves lower RMSE significantly more often than the other five algorithms. On average, 72.4% of the features and 59.6% of conditional dependencies are selected to build the network structure of AKDB, although in some cases the improvement in terms of RMSE is not significant. Considering that AKDB has significantly lower 0-1 loss and RMSE in comparison to other algorithms, we argue that the FS technique in tandem with the DS technique used in the proposed algorithm is powerful to improve classification accuracy.
The W/D/L records of bias-variance decomposition are presented in Table 6. We may observe that NB and NB-FSS achieve higher bias and lower variance significantly more often than the other algorithms because their structures are definite without considering the true data distribution. TAN, K 1 DB and K 2 DB are all low-bias and high-variance learners because they are derived from higher-dimensional probability estimates. Thus, these classifiers are more sensitive to the changes in the training data. AKDB performs best in terms of bias. When compared to K 1 DB and K 2 DB, AKDB obtains lower bias more often than them, as jointly applying both FS and DS to KDB can simplify the network structure. Furthermore, we can observe that AKDB shows an advantage over K 2 DB in variance. The average of variance of K 2 DB and AKDB are 0.045 and 0.025 on 13 large datasets, respectively. Based on these facts, we argue that the proposed AKDB is more stable for classification.

Tests of Significant Differences
Friedman proposed the Friedman test [41] for comparisons of multiple algorithms over multiple datasets. It first calculates the ranks of algorithms for each dataset separately, and then compares the average ranks of algorithms over datasets. The best performance algorithm getting the rank of 1, the second best rank of 2, and so on. The null-hypothesis is that there is no significant difference in terms of average ranks. The Friedman test is a non-parametric measure which can be computed as follows: where N and t respectively denote the number of datasets and the number of algorithms, and R j is the average rank of the j-th algorithm. With the 30 datasets and 6 (t = 6) algorithms, the critical value of χ 2 α for α = 0.05 with (t − 1) degrees of freedom is 11.07. The Friedman statistics χ 2 F of experimental results in Tables A5 and A6 are respectively 36.56 and 22.90, which are larger than χ 2 α , 11.07. Hence, we reject all the null-hypothesis. Figure 6 presents the results of average ranking in terms of 0-1 loss and RMSE for six algorithms. The average ranks of different algorithms based on 0-1 loss on all datasets are, respectively, {NB(4.32), NB-FSS(4.33), K 1 DB(3.07), TAN(3.83), K 2 DB(3.53), and AKDB(1.92)}. That is, the ranking of AKDB is higher than that of other algorithms, followed by TAN, K 2 DB, K 1 DB, NB, and NB-FSS. When assessing performance using RMSE, AKDB still obtains the advantage of ranking with the lowest average rank, i.e., 2.42. In order to determine which algorithm has a significant difference to others, we further employ the Nemenyi test [42]. The comparisons of six algorithms against each other with the Nemenyi test on 0-1 loss and RMSE are shown in Figure 7. Critical difference (CD) is also presented in the figure that is calculated as follows: As shown in Figure 7a, these algorithms are divided into two groups clearly in terms of 0-1 loss. One group includes AKDB and TAN, and other algorithms are in another group. AKDB ranks first although it does not have a significant advantage when compared to TAN. AKDB enjoys a significant 0-1 loss advantage relative to K 2 DB, K 1 DB, NB and NB-FSS, proving the effectiveness of the proposed information-threshold based techniques in our algorithm. As shown in Figure 7b, when RMSE is compared, AKDB still achieves lower mean ranks than the other algorithms, although the differences between AKDB, K 1 DB, K 2 DB are not significant.

Discussion
KDB is a form of restricted BNCs, and the weak direct dependencies and conditional dependencies may exist in KDB and they may be redundant. To alleviate the potential redundancy problem, we develop an extension to KDB, called AKDB, which applies feature selection and conditional dependence selection to remove redundant features and conditional dependencies. These two techniques presented in this paper, MI-based feature selection and CMI-based dependence selection, are based on adaptive thresholding. They are designed to iteratively identify relevant features and conditional dependencies in certain circumstances, and they combine the characteristics of filter and wrapper approaches. Both techniques are efficient and complementary. By providing experiments on 30 UCI datasets and comparisons with other state-of-the-art BNCs, we prove that adaptive thresholding can help select the most relevant features and conditional dependencies with an improvement in classification performance. On average, 72.4% of the features and 59.6% of conditional dependencies are selected to build the network structure of AKDB. Overall, AKDB achieves significant advantage over KDB in terms of 0-1 loss by a 8.54% reduction on average. The statistical significance of the experiment results is further confirmed by the Friedman test and Nemenyi test.

Conclusions
To efficiently identify non-significant direct and conditional dependencies, we investigate two techniques to extend KDB, MI-based feature selection and CMI-based dependence selection based on adaptive thresholding. These two techniques combine the characteristics of filter and wrapper approaches and when applied to KDB, they are severally efficient for filtering out redundancy and can help improve the classification performance. The extensive experimental results show that the final classifier, AKDB, significantly outperforms several state-of-the-art BNCs, including NB, TAN and KDB.

Conflicts of Interest:
The authors declare no conflict of interest.