
Research and Implementation of a Structure Learning Algorithm for Hybrid Bayesian Networks

Posted on: 2011-02-28
Degree: Master
Type: Thesis
Country: China
Candidate: P F Yan
Full Text: PDF
GTID: 2178360305455198
Subject: Computer software and theory
Abstract/Summary:
Data mining, a KDD process that integrates techniques from multiple disciplines, explores interesting trends, features, and correlations hidden in large data sets. According to the CRISP-DM reference model, a data mining project comprises six phases. Five common data mining tasks are prediction, classification, clustering, association-rule mining, and time-series analysis. Predicting the values of some variables from the values of others is called classification precisely when the predicted variable is discrete. Many models are available for classification, including decision trees, Bayesian networks, neural networks, association rules, the k-Nearest Neighbor method, genetic algorithms (GA), rough sets, and fuzzy sets; this thesis focuses on Bayesian networks.

A Bayesian network consists of a qualitative component, a directed acyclic graph (DAG), and a quantitative component, the conditional probability tables (CPTs). It decomposes a multivariate joint probability into a product of local terms, each corresponding to a local structure in the DAG, where a local structure encodes a group of conditional independence assertions. A specific Bayesian network thus describes a set of conditional independence assertions quantified by the CPTs. This thesis concentrates on learning the DAG. Traditional Bayesian network structure learning usually requires domain experts to supply prior information that narrows the search space of network structures, so the accuracy of the result depends to some extent on the experts' understanding of the data set; this is a bottleneck of structure learning.
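The factorization described above can be sketched as follows. This is a minimal illustration with a hypothetical three-node network (Rain, Sprinkler, WetGrass); the CPT values are invented for the example and are not from the thesis:

```python
from itertools import product

# Toy network: Rain -> WetGrass <- Sprinkler (hypothetical CPT values).
# Each entry maps a variable to (parents, CPT), where the CPT maps
# (parent-assignment tuple, value) -> probability.
network = {
    "Rain":      ((), {((), 1): 0.2, ((), 0): 0.8}),
    "Sprinkler": ((), {((), 1): 0.4, ((), 0): 0.6}),
    "WetGrass":  (("Rain", "Sprinkler"), {
        ((1, 1), 1): 0.99, ((1, 1), 0): 0.01,
        ((1, 0), 1): 0.80, ((1, 0), 0): 0.20,
        ((0, 1), 1): 0.90, ((0, 1), 0): 0.10,
        ((0, 0), 1): 0.05, ((0, 0), 0): 0.95,
    }),
}

def joint_probability(assignment):
    """P(x1..xn) as the product of local terms P(xi | parents(xi))."""
    p = 1.0
    for var, (parents, cpt) in network.items():
        parent_vals = tuple(assignment[par] for par in parents)
        p *= cpt[(parent_vals, assignment[var])]
    return p

# The factorized terms over all assignments must sum to 1.
total = sum(
    joint_probability(dict(zip(network, values)))
    for values in product([0, 1], repeat=len(network))
)
```

The product of local terms replaces one exponentially large joint table with tables whose size depends only on each node's number of parents, which is what makes the DAG the key object to learn.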
To overcome this drawback, the thesis proposes a novel hybrid structure learning algorithm, HBN. It first applies the concept of a pseudo-BN, designs a new scoring function, and improves the efficiency of the algorithm used to discover an approximate minimal d-separating set. Extensive experiments show that HBN is well suited to constructing small and medium-sized networks, and its classification accuracy is also satisfactory.

Chapter 1 introduces the definition of data mining, the state of the art, its motivation, its life cycle, and the five tasks and their relations. It then introduces the categories, process, and models of classification, and finally summarizes the objectives and content of the thesis.

Chapter 2 first introduces the theoretical foundations of Bayesian networks, drawn mainly from probability theory and information theory, and then briefly covers their formal definition, stages of development, and advantages. It subsequently presents structure learning and parameter learning in detail, with structure learning as the focus of the chapter: after a brief review of the relevant theory, it surveys recent results on three kinds of structure learning algorithms. For parameter learning, it describes methods for handling incomplete data sets, including fill-in methods, Monte Carlo methods, Gaussian approximation, maximum likelihood (ML), maximum a posteriori (MAP), and other optimization methods. Finally, it introduces three BN classifier models, NBC, TAN, and BN, and analyzes their merits and drawbacks.

Chapter 3 is the core of the thesis and elaborates algorithm HBN in full. HBN is a three-phase hybrid algorithm comprising variable-order (VO) learning, pseudo-BN construction, and a closing operation.
In the VO learning phase, we first explain why a variable order must be learned and survey some popular VO learning algorithms. We then propose an algorithm that borrows the information gain (IG) measure from ID3 to sort the variables, and we summarize the benefits of doing so. At the end of this phase, we describe algorithm HBN and analyze its time complexity.

After explaining what determines a pseudo-BN, we show that constructing its structure requires three steps: frame learning, arc addition, and arc deletion. Algorithm FRAMELEARN describes in detail the construction of the network frame G1: it uses conditional mutual information to measure the strength of dependence between two nodes, and it uses the VO generated in the previous phase to orient the edges of G1. Algorithm ADDARC, a search-and-score method, adds the arcs needed to make G2 an I-map of the data model M. For this algorithm we design a new scoring function that measures how far the network model G2 and the data model M differ in their marginal independence and first-order conditional independence relations, and we adopt a greedy search strategy to guarantee that a suboptimal network structure can always be attained. Because the previous step considers only arc additions, G2 tends to over-fit M; at the end of this phase, algorithm DELETEREDUNDANTARC therefore deletes redundant arcs so that G2 does not depart from the underlying network structure G. This algorithm is a dependence-analysis method that uses conditional mutual information I(x, y | S) to measure the degree of dependence between nodes x and y, where S is an approximate minimal d-separating set.
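The conditional mutual information I(x, y | S) used by FRAMELEARN and DELETEREDUNDANTARC can be estimated from co-occurrence counts in the data. The following is a minimal sketch; the record format and function name are illustrative, not the thesis implementation:

```python
from collections import Counter
from math import log

def conditional_mutual_information(data, x, y, s):
    """Estimate I(X; Y | S) = sum over (x,y,s) of
    p(x,y,s) * log[ p(x,y|s) / (p(x|s) p(y|s)) ]
    from a list of records (dicts); s is a tuple of conditioning columns."""
    n = len(data)
    xys, xs, ys, ss = Counter(), Counter(), Counter(), Counter()
    for row in data:
        key_s = tuple(row[c] for c in s)
        xys[(row[x], row[y], key_s)] += 1
        xs[(row[x], key_s)] += 1
        ys[(row[y], key_s)] += 1
        ss[key_s] += 1
    cmi = 0.0
    for (vx, vy, vs), c in xys.items():
        # p(x,y,s) * log[ p(x,y,s) p(s) / (p(x,s) p(y,s)) ], with the
        # n factors cancelling inside the logarithm.
        cmi += (c / n) * log((c * ss[vs]) / (xs[(vx, vs)] * ys[(vy, vs)]))
    return cmi
```

A value near zero indicates that x and y are (approximately) conditionally independent given S, which is the criterion an arc-deletion step of this kind uses to identify redundant arcs.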
The thesis proposes an improved algorithm, MINIMUMD-SEPARATINGSET, whose efficiency markedly exceeds that of Silvia's algorithm while the accuracy of its results remains highly acceptable. In the final phase of HBN, we eliminate the "pseudo" character of the pseudo-BN G3, that is, we bring the class attribute into the network G3. When closing the pseudo-BN, we consider two factors, the out-degree of each node and the direct probabilistic dependence between each node and the class node, so that the HBN model simultaneously possesses the merits of BN and TAN while avoiding their drawbacks. This is another highlight of the thesis. Finally, we briefly evaluate algorithm HBN.

Chapter 4 presents the implementation of algorithm HBN. Extensive experiments show that HBN is well suited to constructing small and medium-sized network structures, with fast construction and satisfactory classification accuracy. Because it adopts a greedy search strategy, however, HBN is relatively time-consuming on large-scale network structures, though its classification accuracy remains satisfactory. Finally, we compare HBN with TAN and NBC: the experiments show that our model is markedly more accurate than both, though its run time is longer.

Chapter 5 summarizes the thesis and, in view of the drawbacks of the HBN model, outlines directions for future research.
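The arc-deletion and d-separating-set phases above rest on testing d-separation in a DAG. A standard test, separation in the moralized ancestral graph, can be sketched as follows; this is a simplified illustration of the underlying check, not the thesis's MINIMUMD-SEPARATINGSET algorithm, which searches for a minimal such set:

```python
def d_separated(dag, x, y, z):
    """Test whether nodes x and y are d-separated by set z in a DAG.
    dag: {node: list of parents}. Classical reduction: x is d-separated
    from y given z iff x and y are disconnected in the moralized graph
    of the ancestral subgraph of {x, y} | z, after removing z."""
    # 1. Collect the ancestral subgraph of {x, y} and z.
    relevant, stack = set(), [x, y, *z]
    while stack:
        node = stack.pop()
        if node not in relevant:
            relevant.add(node)
            stack.extend(dag.get(node, []))
    # 2. Moralize: link each node to its parents, marry co-parents,
    #    and drop edge directions.
    adj = {v: set() for v in relevant}
    for v in relevant:
        parents = [p for p in dag.get(v, []) if p in relevant]
        for p in parents:
            adj[v].add(p); adj[p].add(v)
        for i, p in enumerate(parents):
            for q in parents[i + 1:]:
                adj[p].add(q); adj[q].add(p)
    # 3. Remove z and test undirected reachability from x to y.
    blocked = set(z)
    stack, seen = [x], {x}
    while stack:
        node = stack.pop()
        if node == y:
            return False  # connected, hence NOT d-separated
        for nb in adj[node]:
            if nb not in seen and nb not in blocked:
                seen.add(nb); stack.append(nb)
    return True
```

For example, in the chain x → s → y the set {s} d-separates x and y, while in the collider x → c ← y the empty set d-separates them but conditioning on c does not; a minimal d-separating set is a smallest z for which this test succeeds.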
Keywords/Search Tags: Data Mining, Bayesian Networks, VO, pseudo-BN, Minimal d-separating