
Research On Ensemble Classification Algorithm For Incomplete Data

Posted on: 2014-01-03
Degree: Master
Type: Thesis
Country: China
Candidate: J Lv
Full Text: PDF
GTID: 2248330398979205
Subject: Computer application technology
Abstract/Summary:
Classification is widely used in production, scientific research, and many aspects of daily life. As machine learning finds more applications and information technology and the Internet develop rapidly, a large amount of data is collected every day, and new challenges and problems have emerged. In psychological studies, subjects may leave some experimental data blank to protect their privacy. In medical research, an experiment is sometimes terminated unexpectedly because a subject dies. In such cases the experimental results are incomplete. Current mainstream classification algorithms assume complete data sets rather than incomplete ones, so how to use incomplete data effectively has become a research hotspot in machine learning.

Ensemble learning has been widely used because its algorithms are simple and generalize well. In recent years it has been applied to the classification of incomplete data with good results. At present, however, the weight of each sub-classifier in ensemble classification algorithms for incomplete data is determined mainly by the size and dimensionality of the corresponding sub-dataset, whereas in fact different attributes differ in importance.

The amount of information is an abstract concept; whether a system carries more or less information is only an impression. How, then, can the information in a system be quantified? In 1948, C. E. Shannon quantified information using information entropy, and since then information can be described with a mathematical formula. In this dissertation, information entropy and mutual information are used to measure the differences between attributes, and the weight of each sub-classifier is then calculated from them. As a result, the weighted voting is fairer and the result is more accurate.

The main work of this dissertation is as follows:

Firstly, I illustrate the background and significance of research on incomplete data and introduce the main methods for dealing with incomplete data, together with their advantages and disadvantages. I then introduce the theory that weak learners can be boosted into strong learners and elaborate on the concepts, principles, and advantages of ensemble learning and its two main algorithms, Bagging and Boosting. Finally, I introduce the concepts, significance, and formulas of information entropy, joint entropy, conditional entropy, and mutual information.

Secondly, to address the shortcomings of current ensemble learning in dealing with incomplete data, I propose a new algorithm: a classification algorithm for incomplete datasets based on conditional entropy (CEECA). I propose a method to compute the conditional entropy for each sub-dataset and discuss the validity and correctness of the algorithm in detail. Experiments are carried out based on Bagging and AdaBoost with UCI data sets, and the proposed algorithm performs better than traditional methods.

Thirdly, I propose another new algorithm: a classification algorithm for incomplete datasets based on mutual information (MIECA). This algorithm weights each sub-classifier using the mutual information between the missing attributes and the class attribute. Experiments are carried out based on Bagging and AdaBoost with UCI data sets, and the results demonstrate the effectiveness of the proposed algorithm.

Finally, I summarize the dissertation and outline future work.
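
The abstract does not give the exact weighting formulas used by CEECA or MIECA, so the following is only a minimal sketch of the general idea: estimate how much each attribute tells us about the class, then give a sub-classifier less weight when the attributes missing from its sub-dataset are the informative ones. It assumes discrete attributes, estimates mutual information from fully observed columns for simplicity, and uses the sum of per-attribute mutual information over the retained attributes as a crude score. The function names (`subclassifier_scores`, `normalized_weights`, `weighted_vote`) and the scoring rule are illustrative assumptions, not the thesis's method.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy H(X) in bits of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(labels, attribute):
    """H(Y | X): uncertainty left in the labels once the attribute is known."""
    n = len(labels)
    groups = {}
    for x, y in zip(attribute, labels):
        groups.setdefault(x, []).append(y)
    return sum(len(ys) / n * entropy(ys) for ys in groups.values())


def mutual_information(labels, attribute):
    """I(X; Y) = H(Y) - H(Y | X)."""
    return entropy(labels) - conditional_entropy(labels, attribute)


def subclassifier_scores(columns, labels, missing_sets):
    """Score each sub-classifier by the information its retained attributes carry.

    Assumption (not from the thesis): the score is the sum of I(attribute; class)
    over the attributes that are NOT missing from the sub-dataset.
    """
    info = {name: mutual_information(labels, col) for name, col in columns.items()}
    return [
        sum(mi for name, mi in info.items() if name not in missing)
        for missing in missing_sets
    ]


def normalized_weights(scores):
    """Scale non-negative scores to sum to 1; fall back to uniform weights."""
    total = sum(scores)
    if total == 0:
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight."""
    tally = {}
    for label, w in zip(predictions, weights):
        tally[label] = tally.get(label, 0.0) + w
    return max(tally, key=tally.get)


if __name__ == "__main__":
    # Toy data: one attribute determines the class, the other is unrelated.
    labels = ["a", "a", "b", "b"]
    columns = {"informative": [0, 0, 1, 1], "noise": [0, 1, 0, 1]}
    # Missing-attribute pattern of the sub-dataset behind each sub-classifier.
    missing_sets = [set(), {"informative"}, {"noise"}]
    weights = normalized_weights(subclassifier_scores(columns, labels, missing_sets))
    print(weights)
    print(weighted_vote(["a", "b", "a"], weights))
```

In this toy setup, the sub-classifier trained without the informative attribute receives the lowest weight, which is the behaviour the abstract motivates when it notes that size and dimensionality alone ignore differences between attributes.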
Keywords/Search Tags: Ensemble learning, Information entropy, Mutual information, Incomplete data