Study On Text Categorization Based On Decision Tree And K Nearest Neighbors

Posted on: 2007-08-29    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y Wang    Full Text: PDF
GTID: 1119360212470835    Subject: Management Science and Engineering
Abstract/Summary:
Text categorization, regarded as a basic form of cognition, is one of the most important problems in text mining. Existing methods for feature dimension reduction, text categorization, and categorization rule extraction still fall short of what practical applications require. This dissertation investigates text feature dimension reduction and categorization rule extraction based on decision trees, and presents several new KNN algorithms for text categorization.

Three methods for text feature dimension reduction are presented. The first reduces the dimensionality using pattern aggregation theory together with an improved χ² statistic, and achieves better categorization accuracy. The second reduces the dimensionality using the CHI value and rough set theory, after which text categorization rules are extracted with a decision tree; the resulting rules are easy to understand and also give better categorization accuracy. The third is based on neural network theory: features are ranked with a sensitivity method and selected with a dichotomy (binary search) method, so that the number of dimensions is reduced and the heavy computation of the neural network is avoided.

Two methods for extracting fuzzy text categorization rules from a fuzzy decision tree are presented. The first builds a fuzzy decision tree in which some branches are merged, which greatly reduces the number of categorization rules. The second introduces a new method for constructing membership functions, which greatly reduces the time spent on data fuzzification, reduces the number of rules, and consequently increases categorization accuracy.

Three improvements to the KNN algorithm are presented. The first concerns the weights in the Euclidean distance formula, for which two approaches are given. In the first approach, the weight of each feature is obtained with a sensitivity method, so that the distance formula reflects the different roles the same feature plays for different classes. The other approach is based on chi-square distance theory: k0 approximate nearest neighbors are first retrieved with an SS-tree, and the weights are then computed from these k0 neighbors using chi-square distance. Both approaches improve the accuracy of the KNN algorithm.
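To make the feature-reduction part concrete, below is a minimal Python sketch of the classic chi-square (CHI) statistic commonly used to score term-class association in text feature selection. The dissertation uses an improved χ² variant and combines it with pattern aggregation and rough set theory; those refinements are not reproduced here, and the function and variable names are illustrative only.

```python
# Minimal sketch of the standard chi-square (CHI) score for a term/class pair.
# This is the textbook statistic, not the dissertation's improved variant.

def chi_square(term_docs, class_docs, n_docs):
    """Chi-square association between a term and a class.

    term_docs  -- set of document ids containing the term
    class_docs -- set of document ids labelled with the class
    n_docs     -- total number of documents
    """
    a = len(term_docs & class_docs)   # term present, class present
    b = len(term_docs - class_docs)   # term present, class absent
    c = len(class_docs - term_docs)   # term absent, class present
    d = n_docs - a - b - c            # term absent, class absent
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n_docs * (a * d - c * b) ** 2 / denom
```

Features would then typically be ranked by their maximum or class-averaged χ² score over all classes, and only the top-scoring terms kept.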
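For the fuzzy decision tree part, the following is a minimal sketch of fuzzifying a numeric feature with standard triangular membership functions. The dissertation proposes its own membership-function construction method, which is not reproduced here; this only illustrates the kind of fuzzification a fuzzy decision tree operates on, and the fuzzy-set boundaries chosen below are arbitrary.

```python
# Minimal sketch of triangular membership functions for data fuzzification.

def triangular(x, a, b, c):
    """Membership degree of x in a triangular fuzzy set with support [a, c] and peak b."""
    if x <= a or x >= c:
        return 0.0
    if x == b:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Example: fuzzify a normalized term-frequency value into "low", "medium", "high".
tf = 0.42
memberships = {
    "low": triangular(tf, -0.5, 0.0, 0.5),
    "medium": triangular(tf, 0.0, 0.5, 1.0),
    "high": triangular(tf, 0.5, 1.0, 1.5),
}
```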
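The improved KNN described above weights each feature in the Euclidean distance formula. A hedged sketch of KNN classification with such a weighted distance follows; the weight vector is assumed to come from an external step (for example the sensitivity analysis or the chi-square-distance estimation mentioned in the abstract), which is not shown, and all names are illustrative.

```python
import heapq
import math

def weighted_euclid(x, y, w):
    """Weighted Euclidean distance between two feature vectors."""
    return math.sqrt(sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y)))

def knn_classify(query, train_vectors, train_labels, weights, k=5):
    """Return the majority label among the k nearest training vectors
    under the per-feature weighted Euclidean distance."""
    neighbors = heapq.nsmallest(
        k,
        zip(train_vectors, train_labels),
        key=lambda item: weighted_euclid(query, item[0], weights),
    )
    votes = {}
    for _, label in neighbors:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)
```

With uniform weights this reduces to ordinary KNN; the gain described in the abstract comes entirely from how the weights are estimated.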
Keywords/Search Tags: text categorization, decision tree, KNN algorithm, fuzzy logic, rough set theory, neural network