
Research On The Application Of Information Gain In Data Mining Classification

Posted on: 2017-08-04    Degree: Master    Type: Thesis
Country: China    Candidate: L C Mao    Full Text: PDF
GTID: 2348330485476554    Subject: Statistics
Abstract/Summary:
Data mining is an emerging discipline that analyzes the characteristics of data, builds data models, and uncovers the inner relationships within data in order to make predictions; one of its most widely used techniques is pattern classification. Among linear classifiers, the Fisher discriminant criterion is the most widely used linear feature-extraction method, and many improvements of it exist, such as the weighted Fisher discriminant; however, when the dimension is too large, classification accuracy after linear discriminant extraction drops significantly. To address this, the optimal-factor-combination Fisher discriminant method constructs a linear discriminant for every possible combination of factors, evaluates each combination by its resubstitution accuracy, and selects the combination with the highest resubstitution accuracy as the optimal factor combination, thereby improving linear classification accuracy. But when there are too many factors, the computational cost grows exponentially, and with more than 15 factors the algorithm becomes infeasible. The idea of the KNN algorithm is to select the k samples in feature space that are most similar to (that is, nearest to) the unknown sample; if the majority of these k samples belong to a certain class, the unknown sample is assigned to that class.
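The KNN rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis's implementation; the function name and the toy data are invented for the example.

```python
from collections import Counter
import math

def knn_classify(x, train, k=3):
    """Classify point x by majority vote among its k nearest training samples.

    train: list of (feature_vector, label) pairs.
    """
    # Sort training samples by Euclidean distance to x.
    by_dist = sorted(train, key=lambda s: math.dist(x, s[0]))
    # Majority vote among the k nearest neighbours.
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

# Toy example: two well-separated clusters.
train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.2, 0.1), "A"),
         ((3.0, 3.0), "B"), ((3.1, 2.9), "B"), ((2.9, 3.2), "B")]
print(knn_classify((0.2, 0.0), train))  # → A
```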
Since this method decides the class of the sample to be classified based only on the one or few nearest samples, the decision, although in principle justified by limit theorems, depends on a very small number of neighboring samples and treats all of them as equally important; when sample densities are uneven, misclassification easily occurs. The evidence-theory KNN algorithm introduces evidence functions into KNN: an evidence function is generated from the distance between the test sample and each training sample, the evidence functions are fused, and the most credible evidence determines the final classification. This algorithm effectively remedies the defect of treating all neighbors as equally important and makes full use of the information in the neighboring samples. However, when the sample dimension is too high and there are too many attributes, the computational complexity becomes high and the applicability weak. In improving classification methods, information gain is widely applied to raise accuracy. This thesis introduces information gain to establish an optimal-factor-combination Fisher discriminant classifier based on information gain: the information gain of each factor is computed and sorted in descending order; the top-ranked factors are taken in turn as candidate combinations; for each combination the corresponding discriminant is derived and its resubstitution accuracy computed; and the combination with the highest resubstitution accuracy is selected as the optimal combination. This reduces the computational complexity from exponential to linear, realizing factor optimization for the optimal-combination discriminant classifier.
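The information-gain ranking step can be sketched as follows. This is a simplified sketch assuming discrete-valued factors; the function names are invented for illustration.

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction obtained by splitting the labels on one factor."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def rank_factors(columns, labels):
    """Return factor indices sorted by information gain, descending."""
    gains = [(information_gain(col, labels), i) for i, col in enumerate(columns)]
    return [i for _, i in sorted(gains, reverse=True)]

labels = ["A", "A", "B", "B"]
col_perfect = [0, 0, 1, 1]   # perfectly predicts the label: gain = 1 bit
col_noise = [0, 1, 0, 1]     # uninformative: gain = 0
print(rank_factors([col_noise, col_perfect], labels))  # → [1, 0]
```

With the factors ranked, only the d prefix combinations {top-1}, {top-1, top-2}, … need be evaluated by resubstitution accuracy, rather than all 2^d subsets, which is the exponential-to-linear reduction the abstract describes.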
At the same time, information gain is introduced into the evidence-theory KNN algorithm, yielding an information-gain-based evidence-theory KNN algorithm: before the evidence functions are established, the information gain of each factor is computed, the factors with the largest information gain are selected as the combination, classification proceeds as in the classic KNN algorithm, and combinations are compared by the resubstitution accuracy of the KNN classification. Redundant factors are deleted, screening out the important attributes; the nearest-neighbor samples are then selected on the basis of the screened important attributes, effectively reducing the number of neighbor samples and the computational complexity of evidence fusion. Experiments show that the optimized classifiers effectively eliminate redundant factors and perform well on low-dimensional data, not only achieving good classification accuracy but also effectively improving on the original classification methods, whose accuracy falls sharply on high-dimensional data.
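The evidence-fusion step can be sketched in the spirit of Denoeux-style evidential k-NN: each neighbor contributes a mass function supporting its own class in proportion to its closeness, with the remaining mass left on "unknown" (Theta), and the mass functions are fused by Dempster's rule. The parameters alpha and gamma, the function names, and the toy data are illustrative assumptions, not the thesis's settings.

```python
from collections import defaultdict
import math

def dempster_combine(m1, m2):
    """Dempster's rule for mass functions whose focal sets are singletons or Theta."""
    combined = defaultdict(float)
    conflict = 0.0
    for a, pa in m1.items():
        for b, pb in m2.items():
            if a == "Theta":
                combined[b] += pa * pb          # Theta ∩ B = B
            elif b == "Theta" or a == b:
                combined[a] += pa * pb          # A ∩ Theta = A, or A ∩ A = A
            else:
                conflict += pa * pb             # disjoint singletons
    norm = 1.0 - conflict
    return {s: v / norm for s, v in combined.items()}

def evidential_knn(x, train, k=3, alpha=0.95, gamma=1.0):
    """Evidence-theory KNN: fuse distance-based evidence from the k neighbours.

    Each neighbour at distance d puts mass alpha*exp(-gamma*d^2) on its class
    and the remainder on Theta (ignorance); fused singleton masses decide.
    """
    neighbours = sorted(train, key=lambda s: math.dist(x, s[0]))[:k]
    mass = {"Theta": 1.0}                        # start from the vacuous assignment
    for feats, label in neighbours:
        d = math.dist(x, feats)
        m_c = alpha * math.exp(-gamma * d * d)   # evidence strength for this class
        mass = dempster_combine(mass, {label: m_c, "Theta": 1.0 - m_c})
    # Decide by the class with the largest fused singleton mass.
    return max((c for c in mass if c != "Theta"), key=lambda c: mass[c])

# Toy example.
train = [((0.0, 0.0), "A"), ((0.1, 0.1), "A"),
         ((2.0, 2.0), "B"), ((2.1, 1.9), "B")]
print(evidential_knn((0.05, 0.05), train))  # → A
```

Restricting the focal sets to singletons and Theta keeps Dempster's rule linear in the number of classes; the attribute screening described above would shrink the feature vectors fed to `math.dist` before this fusion runs.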
Keywords/Search Tags: information gain, Fisher discriminant, evidence-theory KNN classification method