
Research On Bayesian Network Based Approach For Data Mining

Posted on: 2010-04-06    Degree: Doctor    Type: Dissertation
Country: China    Candidate: J H Guan    Full Text: PDF
GTID: 1118360302465948    Subject: Computer application technology
Abstract/Summary:
With the development of computer technology and the spread of computer networks, the volume of data generated daily is increasing at an unprecedented speed. How to manage and use these data, transforming them into information and knowledge useful for decision-making, has become an important problem. Data mining arose in response to this demand. In short, data mining is the process of searching for useful knowledge hidden in large databases. It draws on statistics and artificial intelligence to process and analyze data, uncover hidden knowledge, and construct models suited to practical problems, thereby providing a basis for decision-making.

The Bayesian network, a key technology for handling uncertainty, has attracted growing attention from researchers and application developers because of three features: first, it has rich power to express probabilistic knowledge; second, it can incorporate prior knowledge and supports incremental learning; third, it rests on a solid mathematical foundation. A large number of Bayesian-network-based data mining methods have therefore emerged. Although Bayesian networks perform well in model diagnosis and other areas, some issues remain open: for example, learning an unconstrained Bayesian network structure is NP-hard, and handling data streams with concept drift is also difficult.

The main results and contributions of this thesis are Bayesian-network-based algorithms for classification and clustering, together with dynamic ensemble integration approaches for handling concept drift. The main achievements are as follows:

(1) The task of classification is to assign one of a set of predefined classes to instances described by attribute (feature) values. Classification is a very active research topic in machine learning and data mining.
Classification applies to a wide variety of domains: pattern recognition, forecasting, medical diagnosis, fault diagnosis, loan applications, user modeling, biomedical informatics, etc. A great many techniques have been developed, based on logic (decision trees, rule-based techniques), on perceptrons, on statistics (Bayesian networks, SVM), and on instances (lazy learning). In recent years many researchers have turned to Bayesian networks, which are efficient for classification. Naive Bayes is a simple kind of Bayesian network that assumes the attributes are independent given the class node.

This thesis empirically evaluates algorithms for learning four types of Bayesian network classifiers: NB, TAN, BAN and GBN, where the latter two are constructed with the BN structure learning algorithm DSBNL. These learners were implemented and their classification performance tested. The thesis also examines factors that influence classifier accuracy, such as the threshold ε and the training sample size. The experiments show that larger data sets yield fewer prediction errors, and that for some data sets the algorithm cannot reach its best accuracy with a fixed threshold. Analysis of the experimental results suggests two ways to improve the predictive performance of BN classifiers. First, the thesis selects the threshold according to prediction accuracy, to avoid overfitting and underfitting when constructing classification models. Second, it proposes a weighted integration algorithm of NB, TAN and BAN based on example selection, and demonstrates empirically that this integration works as expected. Collectively, these results argue that BN classifiers deserve more attention in the machine learning and data mining communities.

(2) More and more real-world data are dynamic rather than static.
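The Naive Bayes classifier discussed above, with its independence assumption among attributes given the class, can be illustrated with a minimal sketch (a from-scratch categorical implementation with Laplace smoothing; the toy data set and attribute names are illustrative and not from the thesis):

```python
from collections import Counter, defaultdict
import math

class NaiveBayes:
    """Categorical Naive Bayes with Laplace (add-one) smoothing.

    Assumes attributes are conditionally independent given the class:
    P(c | x) is proportional to P(c) * product_i P(x_i | c).
    """

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.class_counts = Counter(y)
        self.n = len(y)
        self.attr_counts = defaultdict(int)   # (attr index, value, class) -> count
        self.attr_values = defaultdict(set)   # distinct values seen per attribute
        for xi, c in zip(X, y):
            for j, v in enumerate(xi):
                self.attr_counts[(j, v, c)] += 1
                self.attr_values[j].add(v)
        return self

    def predict(self, x):
        best, best_lp = None, float("-inf")
        for c in self.classes:
            lp = math.log(self.class_counts[c] / self.n)   # log prior
            for j, v in enumerate(x):
                num = self.attr_counts[(j, v, c)] + 1       # Laplace smoothing
                den = self.class_counts[c] + len(self.attr_values[j])
                lp += math.log(num / den)
            if lp > best_lp:
                best, best_lp = c, lp
        return best

# Toy "play tennis"-style data: (outlook, windy) -> play?
X = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
y = ["yes", "yes", "yes", "no"]
nb = NaiveBayes().fit(X, y)
print(nb.predict(("rain", "yes")))   # -> yes
```

The same counting machinery, with the class node replaced or augmented by learned structure, underlies the TAN, BAN and GBN classifiers evaluated in the thesis.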
The patterns hidden in such data may change as time goes by, so incremental learning ability becomes important for data mining algorithms. Concept drift is a common problem in incremental learning: concepts are often not stable but change with time. To adapt to changing concepts we must improve the adaptability of the system, but the stability of the algorithm then declines; conversely, improving stability lets the system make better use of historical information and knowledge, but adaptability suffers. It is therefore hard to learn correctly when concept drift occurs. Concept drift complicates the learning task and requires solutions different from commonly used batch learning techniques. We consider synthetically generated concept drifts and an example of real drift from the area of antibiotic resistance.

The problem of concept drift has received considerable attention. There are two broad approaches to handling it: instance-based approaches and ensemble learning. Among the most popular and effective is ensemble learning, in which a set of concept descriptions built on data blocks from different time intervals is maintained and the final prediction aggregates the predictions of the ensemble members; the literature shows this to be the most promising approach. In this thesis we propose an ensemble learning algorithm, AMCE (Adaptive Multiple Classifiers Ensemble), which adaptively adjusts the weights of its classifiers. To improve adaptivity, the algorithm assigns each classifier its own weight-adjustment parameters and tunes them dynamically during online learning to maximize performance.
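The general weighted-voting scheme that such drift-handling ensembles share can be sketched as follows. This is a generic illustration in the spirit of weighted-majority voting, not the thesis's AMCE algorithm (whose per-classifier parameter-adjustment rule is specific to the thesis): each member's weight is decayed when it errs on a new stream instance, so members fitted to an outdated concept fade out.

```python
class WeightedEnsemble:
    """Illustrative weighted-voting ensemble for a drifting data stream.

    A member's weight is multiplied by `beta` whenever it misclassifies,
    so classifiers trained on an outdated concept lose influence.
    """

    def __init__(self, members, beta=0.5):
        self.members = list(members)            # callables: x -> label
        self.weights = [1.0] * len(self.members)
        self.beta = beta

    def predict(self, x):
        votes = {}
        for m, w in zip(self.members, self.weights):
            votes[m(x)] = votes.get(m(x), 0.0) + w
        return max(votes, key=votes.get)

    def update(self, x, true_label):
        """Process one labelled stream instance: decay wrong members."""
        for i, m in enumerate(self.members):
            if m(x) != true_label:
                self.weights[i] *= self.beta
        total = sum(self.weights)
        self.weights = [w / total for w in self.weights]  # renormalize

# Two stub "classifiers": one tracks the current concept, one is stale.
good = lambda x: "pos" if x >= 0 else "neg"
stale = lambda x: "neg"                  # always predicts the old concept
ens = WeightedEnsemble([good, stale])
for x, y in [(1, "pos"), (2, "pos"), (-1, "neg"), (3, "pos")]:
    ens.update(x, y)
print(ens.predict(5))                    # -> pos
print(round(ens.weights[1], 3))          # -> 0.111, the stale member has faded
```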
We also use a pruning method based on KL divergence to eliminate redundant classifiers. Experimental results show that the proposed algorithm improves predictive accuracy and speed, reduces memory usage, and is effective in tracking concept drift under noise.

The thesis further proposes a selective integration algorithm, OSEN (Orientation based Selected ENsemble), for concept-drifting data streams. OSEN selects a near-optimal subset of base classifiers based on each base classifier's output on a validation dataset. Experiments with synthetic data sets simulating abrupt (SEA) and gradual (Hyperplane) concept drift demonstrate that selective integration of classifiers built over small time intervals or fixed-size data blocks can be significantly better than majority voting and weighted voting, currently the most common integration techniques for handling concept drift with ensembles. The thesis also explains the working mechanism of OSEN through the error-ambiguity decomposition: in the experiments, OSEN improves generalization by reducing the average generalization error of the base classifiers constituting the ensemble. A second selective integration algorithm, DGASEN (Dynamic GASEN), is proposed along the same lines, and experiments show that DGASEN likewise improves generalization.

(3) Clustering is the unsupervised classification of patterns into clusters: unlike classification (supervised learning), no a priori labeling of patterns is available for categorizing others and inferring the cluster structure of the whole data.
The clustering problem has been addressed in many contexts and disciplines, reflecting its broad appeal and usefulness as a step in data analysis. Many algorithms have been proposed: model-based, distance-based, density-based and deviation-based. This thesis concentrates on model-based clustering, whose most frequently used approaches are mixture density models and Bayesian networks.

The thesis describes how to apply the DPSO method effectively to parameter estimation for Naive Bayes clustering. Experiments show that DPSO converges much more slowly than the EM algorithm but finds better global solutions, while EM is sensitive to the initial solution and easily becomes trapped in local optima. Because DPSO and EM each have their own drawbacks, the EM algorithm is introduced into the traditional DPSO method as a local search: EM lets every particle find the local optimum in its current region of the search space, and this local search improves the performance of the swarm.

The research results of this thesis enrich and advance the study of Bayesian-network-based clustering and classification algorithms, as well as algorithms for the concept drift problem, in both theoretical and technological aspects.
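The EM local-search step that the thesis embeds in DPSO can be illustrated on the simplest model-based clustering case, a two-component one-dimensional Gaussian mixture with fixed, equal variance (an illustrative sketch only; the thesis applies EM to Naive Bayes clustering, and the data below are made up):

```python
import math

def em_gmm_1d(data, mu, n_iter=50, sigma=1.0):
    """EM for a two-component 1-D Gaussian mixture with fixed variance.

    Alternates an E-step (compute each component's responsibility for
    each point) with an M-step (update the component means and mixing
    weights from those responsibilities).
    """
    pi = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibilities r_k(x) proportional to pi_k * N(x; mu_k, sigma)
        resp = []
        for x in data:
            p = [pi[k] * math.exp(-(x - mu[k]) ** 2 / (2 * sigma ** 2))
                 for k in range(2)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: responsibility-weighted means and mixing proportions
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            pi[k] = nk / len(data)
    return mu, pi

# Two well-separated clusters around 0 and 10.
data = [-0.5, 0.0, 0.5, 9.5, 10.0, 10.5]
mu, pi = em_gmm_1d(data, mu=[1.0, 9.0])
print([round(m, 2) for m in mu])   # converges close to [0.0, 10.0]
```

In the hybrid scheme the thesis describes, each DPSO particle's position would be refined by such an EM pass before the swarm's global update, trading EM's fast local convergence against DPSO's broader search.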
Keywords/Search Tags: Bayesian Network, Classification, Clustering, Ensemble learning, Concept Drift, Particle Swarm Optimization, Genetic Algorithm