
Research On Classifier Combination And Its Relevant Techniques Of Distributed Data Mining

Posted on: 2005-05-07
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Wei
Full Text: PDF
GTID: 2168360122998548
Subject: Computer application technology
Abstract/Summary:
With the growing popularity of the Internet, mining distributed data sources has become one of the main challenges in data mining. Because data sites are geographically dispersed, data volumes are massive, raw data must be kept secure, and non-shared data must remain private, distributed data mining (DDM) builds a global model by integrating multiple models constructed at different data sites; this approach has attracted more and more attention. In general, classifier combination techniques are used for distributed classification mining, and one notable approach to learning a distributed classifier is the Stacking framework. This thesis discusses and explores the key problems and related techniques of Stacking as applied to the distributed scenario. The following aspects of Stacking are examined.

First, we identify the issues specific to applying Stacking in distributed mining and propose a combination framework for multiple classifiers. In this framework, 10-fold cross-validation is employed to avoid "testing on the training data", guaranteeing that all feature values used for level-1 generalization reflect the actual classification behavior of each local classifier.

Second, the scaling problem of Stacking must be solved for distributed mining involving a large number of nodes, because the high performance of Stacking generally depends on a high-dimensional level-1 feature space. A per-class feature vector for level-1 generalization may be one of the best solutions, but it yields lower predictive accuracy. We investigate the generalization mechanism at level-1 and, by evaluating the differences between two classifiers when they classify a new example, present an improved approach to constructing the class feature vector for level-1 generalization, based on posterior probabilities weighted by the accuracies of the level-0 classifiers. This method makes the level-1 learning algorithm pay more attention to the classifiers with higher accuracy. Next, based on the idea of majority voting, a new way to form the per-class feature vector used in level-1 generalization is proposed, which uses the binary prediction for each class made by the level-0 classifiers. Experimental comparison shows that this approach is superior to the one based on weighted averaging of posterior probabilities. Finally, to overcome the shortcoming of binary prediction, a vote-based approach to level-1 generalization is given, in which all the level-0 classifiers "vote but do not make" the final prediction, and the voting patterns are induced by the level-1 learning algorithm. Experimental evaluation shows that the presented methods achieve good performance on datasets with highly skewed class distributions.

In conclusion, the research in this thesis provides a theoretical foundation for implementing distributed classification mining and improving mining efficiency, and it also offers a scientific reference for the design and application of distributed data mining algorithms.
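As a rough illustration of the combination framework described above, the sketch below builds level-1 training data from out-of-fold posterior probabilities using 10-fold cross-validation, so that no level-0 classifier is evaluated on its own training data. It assumes scikit-learn and NumPy; the choice of level-0 learners and the function name build_level1_data are illustrative, not the thesis's actual implementation.

```python
# Minimal sketch of level-1 data construction via 10-fold cross-validation.
# Assumes scikit-learn/NumPy; learner choices and names are illustrative only.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

def build_level1_data(X, y, level0_learners):
    """Out-of-fold posterior probabilities form the level-1 features,
    so no level-0 classifier is tested on the data it was trained on."""
    blocks = []
    for clf in level0_learners:
        # cross_val_predict returns, for each sample, the prediction of a
        # model that never saw that sample during training (10 folds here).
        probs = cross_val_predict(clf, X, y, cv=10, method="predict_proba")
        blocks.append(probs)
        clf.fit(X, y)  # refit on all data for later use at prediction time
    return np.hstack(blocks), y

level0 = [DecisionTreeClassifier(), GaussianNB(),
          LogisticRegression(max_iter=1000)]
# The returned features can then train any level-1 learner.
```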
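The accuracy-weighted class feature vector could be formed along the following lines: one level-1 feature per class, computed as an average of the level-0 posterior probabilities weighted by each classifier's estimated accuracy, so that more accurate level-0 classifiers contribute more. The normalization shown is one plausible reading of the method, not necessarily the exact formula used in the thesis.

```python
# Hedged sketch of the accuracy-weighted per-class feature vector.
import numpy as np

def weighted_class_features(prob_list, accuracies):
    """prob_list: list of (n_samples, n_classes) posterior matrices,
    one per level-0 classifier; accuracies: their estimated accuracies."""
    w = np.asarray(accuracies, dtype=float)
    w = w / w.sum()  # normalize so the weights form a convex combination
    stacked = np.stack(prob_list)            # (n_classifiers, n_samples, n_classes)
    return np.tensordot(w, stacked, axes=1)  # (n_samples, n_classes)
```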
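Similarly, the vote-based level-1 representation might be sketched as follows: each level-0 classifier casts one vote for its predicted class ("vote but do not make" the final prediction), the per-class vote counts form the level-1 feature vector, and the level-1 learner, rather than a fixed majority rule, induces the final decision from these voting patterns. The function name and the assumption of integer-coded class labels are hypothetical.

```python
# Hedged sketch of vote-count features for level-1 generalization.
# Assumes class labels are integer-coded as 0 .. n_classes-1.
import numpy as np

def vote_features(pred_list, n_classes):
    """pred_list: list of (n_samples,) label predictions, one per
    level-0 classifier; returns (n_samples, n_classes) vote counts."""
    n_samples = len(pred_list[0])
    votes = np.zeros((n_samples, n_classes))
    for preds in pred_list:
        votes[np.arange(n_samples), preds] += 1  # one vote per classifier
    return votes  # the level-1 learner generalizes over these voting patterns
```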
Keywords/Search Tags: distributed data mining, classifier combination, Stacking, classification, inductive learning