Research On Multi-view Ensemble Classification For Incomplete Data

Posted on: 2017-02-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y T Yan
Full Text: PDF
GTID: 1108330485464105
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of communication technology, the Internet of Things, sensor technology, and related technologies, data are now generated almost everywhere. Data generated by real-world applications are usually incomplete for various reasons. For example, in a social survey, respondents may refuse to answer some questions; in bioinformatics, gene expression data often contain missing values, which can arise from image corruption or from contamination by dust or scratches on the chip. However, traditional machine learning techniques are designed for complete data, so how to deal with incomplete data has become a crucial issue for machine learning. Most current approaches to incomplete data classification, such as missing value imputation methods, rely on assumptions such as the Missing At Random (MAR) assumption or the attribute independence assumption. Ensemble learning is one of the effective methods for incomplete data classification. Unlike imputation methods, it does not rely on the MAR assumption. However, current ensemble learning methods still suffer from high complexity, and their performance also needs to be improved.
Granular computing is a new approach in artificial intelligence that simulates human thinking to solve problems. It covers all the theories, methods, and tools related to granularity, and it has become an important computing tool for solving problems that involve uncertainty, fuzziness, and complexity. The main theoretical models of granular computing include the rough set model, the fuzzy set model, and the quotient space problem-solving model. Among these three models, quotient space theory is the main theoretical model for multi-granular computing.
Based on the idea of multi-side, multi-view problem solving, this dissertation first proposes constructing multi-view classifiers according to the characteristics of incomplete data, and then studies weighting methods for measuring the importance degree of each classifier. Next, an optimization method for multi-view classifiers is studied. Finally, according to the characteristics of cancer gene expression data, this dissertation proposes first conducting feature selection and then applying a selective multi-view ensemble method to the remaining data based on the best-first search strategy. The main research content is summarized as follows:
(1) Multi-view classifier construction for incomplete data and importance degree measurement for sub-classifiers. A missing attribute tree (MAT) is constructed for incomplete data based on the combinations of incomplete features, and a group of data subsets is then obtained according to the MAT. For each data subset, a sub-classifier is trained using a neural network as the base learning algorithm and bagging/AdaBoost as the ensemble strategy. For a test sample, the algorithm selects the suitable sub-classifiers to predict it and then uses majority voting to determine its final prediction. This dissertation introduces information entropy to measure each sub-classifier's importance degree, and also discusses the relationship between prediction accuracy and several weighting methods.
(2) Optimization method for multi-view classifiers. When the number of sub-classifiers is too large, the MAT-based ensemble method suffers from high computational complexity. To overcome this shortcoming, this dissertation proposes an optimization method for multi-view ensemble classification (SNNE). While preserving prediction ability, this method improves execution efficiency by removing some redundant data subsets according to a given threshold.
Experimental results show that, with the parameter set to 0.05, SNNE effectively improves execution efficiency while preserving accuracy.
(3) Multi-view ensemble learning based on the chi-square test and extreme learning machine. Features that are irrelevant to the class variable may degrade classification performance. To overcome this limitation, this dissertation proposes a chi-square based feature selection algorithm (C_ELM). The method extends the chi-square test to handle incomplete data and then removes the most irrelevant features according to a threshold. In addition, because traditional gradient descent based algorithms are mostly time-consuming, the extreme learning machine (ELM) is applied as the base learning algorithm, and a group of voting-based ELMs (v-elm) is generated according to the MAT. Furthermore, to guarantee that all instances can be predicted, a group of candidate classifiers is trained, each on a single feature. Experimental results show that this method effectively improves performance by removing the most irrelevant features and applying ELM as the base learning algorithm.
(4) Selective multi-view ensemble learning for cancer gene expression data. Cancer gene expression data are a typical kind of data with missing values. One characteristic of gene expression data is that the dimensionality is often high while the number of samples is very small. According to this characteristic, this dissertation proposes first removing the most irrelevant genes and then sorting the remaining genes by their degree of relevance to the class variable. The best-first search strategy is then applied to construct a selective multi-view ensemble classification algorithm. Experiments show that this method removes a large number of irrelevant features on the one hand and, on the other hand, effectively improves prediction accuracy by selecting the most relevant features.
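The multi-view idea in item (1) can be illustrated with a minimal sketch. It is not the dissertation's implementation: the abstract does not specify how the MAT is built, so the sketch assumes a flat grouping in which each distinct set of observed features in the training data defines one view, and it swaps in a nearest-centroid rule for the neural network with bagging/AdaBoost. All function names (`observed_features`, `build_views`, `fit`, `predict`) and the toy data are hypothetical.

```python
from collections import Counter


def observed_features(row):
    """Return the set of feature indices that are present (not None) in a row."""
    return frozenset(i for i, v in enumerate(row) if v is not None)


def build_views(X, y):
    """Assumed stand-in for the MAT: one view per observed-feature set.

    A view's training subset holds every sample in which all of the
    view's features are observed, projected onto those features.
    """
    views = {}
    for feats in {observed_features(r) for r in X}:
        if not feats:
            continue  # a row with no observed features defines no view
        cols = sorted(feats)
        rows = [([r[i] for i in cols], label)
                for r, label in zip(X, y)
                if feats <= observed_features(r)]
        views[feats] = (cols, rows)
    return views


def train_centroids(rows):
    """Nearest-centroid base learner (swapped in for the neural network)."""
    sums, counts = {}, Counter()
    for vec, label in rows:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, v in enumerate(vec):
            acc[j] += v
        counts[label] += 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}


def predict_centroid(centroids, vec):
    """Return the label whose centroid is closest in squared distance."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2
                                 for a, b in zip(centroids[c], vec)))


def fit(X, y):
    """Train one sub-classifier per view."""
    return {feats: (cols, train_centroids(rows))
            for feats, (cols, rows) in build_views(X, y).items()}


def predict(model, row):
    """Majority vote over the sub-classifiers whose features are all observed."""
    seen = observed_features(row)
    votes = Counter()
    for feats, (cols, centroids) in model.items():
        if feats <= seen:
            votes[predict_centroid(centroids, [row[i] for i in cols])] += 1
    return votes.most_common(1)[0][0] if votes else None


# Toy data (hypothetical): None marks a missing value.
X = [
    [1.0, 1.0, None],
    [1.2, 0.8, 1.0],
    [5.0, None, 5.0],
    [5.2, 5.1, 4.9],
    [0.9, None, None],
    [None, 5.0, 5.2],
]
y = ["a", "a", "b", "b", "a", "b"]
model = fit(X, y)
```

In this sketch a test sample that matches no view, for example one with only the third feature observed, receives no prediction. The single-feature candidate classifiers described in item (3) address exactly that gap, and the unweighted vote here could be replaced by the entropy-based weights the dissertation studies.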
Keywords/Search Tags: incomplete data, quotient space theory, neural networks, multi-view, ensemble learning