Since the beginning of Summer 2011, Zhifa Liu and I, along with our advisor Dr. Changhe Yuan, have collaborated in an effort to learn better Bayesian network classifiers. In general, search-and-score structure learning algorithms, such as those I have developed, use scoring functions which approximate the posterior probability of the network structure given the dataset. However, in discriminative learning situations (i.e., classification), this is not an appropriate objective; a network could have a high posterior probability yet still perform poorly as a classifier. The conditional likelihood measures the conditional probability of one variable given the rest and is better suited to this scenario. We would like to learn a network structure that maximizes the conditional likelihood of the class variable given the other variables.
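In standard notation (a sketch; the symbols B, D, c_i, and x_i are my own labels here rather than anything from our write-ups), the contrast between the two objectives is:

```latex
\begin{align*}
% Generative score: log-likelihood of the full joint distribution
\mathit{LL}(B \mid D)  &= \sum_{i=1}^{N} \log P_B(c_i, \mathbf{x}_i) \\
% Discriminative score: conditional log-likelihood of the class variable
\mathit{CLL}(B \mid D) &= \sum_{i=1}^{N} \log P_B(c_i \mid \mathbf{x}_i)
  = \sum_{i=1}^{N} \bigl[ \log P_B(c_i, \mathbf{x}_i) - \log P_B(\mathbf{x}_i) \bigr]
\end{align*}
```

Maximizing the conditional log-likelihood rewards exactly the quantity used at classification time, P(class | attributes), rather than fit to the full joint distribution.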
The posterior probability of a network is closely related to its likelihood, and both scores are decomposable: they break into a sum of per-variable terms, each depending only on that variable and its parents. Such a score can be computed by simply looping over the variables. In contrast, the conditional likelihood is not decomposable, so we cannot calculate it with a simple loop over the variables; however, only the variables in a variable's Markov blanket affect its conditional likelihood.
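To make the distinction concrete, here is a minimal Python sketch (the data structures and function names are hypothetical, not taken from our implementation): the decomposable score is a single loop over variables, while the conditional log-likelihood requires normalizing over class values for every record.

```python
import math

# A discrete network is sketched as a dict mapping each variable to
# (parents, cpt), where cpt maps (value, parent_values) -> probability.

def log_likelihood(network, data):
    """Decomposable score: one independent term per variable and its parents."""
    total = 0.0
    for var, (parents, cpt) in network.items():  # a simple loop suffices
        for row in data:
            pa_vals = tuple(row[p] for p in parents)
            total += math.log(cpt[(row[var], pa_vals)])
    return total

def conditional_log_likelihood(network, data, class_var, class_values):
    """Non-decomposable score: needs P(class | other variables) per record."""
    total = 0.0
    for row in data:
        # Joint log-probability of each candidate class label with the attributes.
        joints = {}
        for c in class_values:
            assignment = dict(row, **{class_var: c})
            logp = 0.0
            for var, (parents, cpt) in network.items():
                pa_vals = tuple(assignment[p] for p in parents)
                logp += math.log(cpt[(assignment[var], pa_vals)])
            joints[c] = logp
        # Terms for variables outside the class's Markov blanket are identical
        # for every class value and cancel in this normalization, which is why
        # only the Markov blanket matters.
        norm = math.log(sum(math.exp(v) for v in joints.values()))
        total += joints[row[class_var]] - norm
    return total
```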
Our research has focused on adapting our existing optimal search techniques to use estimates of the conditional likelihood. The main novelty of the discriminative structure learners stems from the process of selecting parents for variables. In particular, we often force all variables to be in the Markov blanket of the class variable. This ensures we use all variables when predicting the class. Alternatively, we sometimes do not require that a variable belong to the Markov blanket of the class variable. In this case, variables which are not in the Markov blanket do not affect classification. In this light, our algorithms can also be considered a form of feature selection. We have also investigated approximations for the conditional likelihood and methods to incorporate this non-decomposable score into our heuristic search formulation.
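As an illustration of the two variants (the representation and function names below are hypothetical), both the constraint and the feature-selection reading can be stated directly in terms of the class variable's Markov blanket:

```python
def markov_blanket(parents_of, var):
    """Markov blanket of `var` in a DAG given as {node: set of parents}."""
    parents = set(parents_of[var])
    children = {v for v, pa in parents_of.items() if var in pa}
    spouses = {p for c in children for p in parents_of[c]} - {var}
    return parents | children | spouses

def selected_features(parents_of, class_var):
    """Relaxed variant: only these variables affect classification;
    everything else has effectively been pruned away (feature selection)."""
    return markov_blanket(parents_of, class_var)

def satisfies_full_blanket_constraint(parents_of, class_var):
    """Strict variant: every attribute must lie in the class's Markov blanket."""
    attributes = set(parents_of) - {class_var}
    return attributes <= markov_blanket(parents_of, class_var)
```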
Initial results suggest the Bayesian network classifiers we learn are competitive, in terms of accuracy, with other classification methods, such as neural networks and SVMs. For some datasets, our feature selection algorithm was able to attain competitive accuracies using only a very small subset of the attribute variables. This suggests that the structures we learn are better able to compress the data than other classifiers.
This line of research is Zhifa's master's thesis, so he has more say over future work than I do. Some possible directions include relaxing the optimality guarantees of the heuristic search algorithms, deriving new metrics which more closely approximate the conditional likelihood, improving the efficiency of directly calculating the conditional likelihood, and extending the learning algorithms to learn Bayesian classifiers from incomplete data.