Research On Text Feature Selection And Classification Algorithm Based On CHI And KNN

Posted on: 2017-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: C J Fan
Full Text: PDF
GTID: 2348330503492738
Subject: Electronic Science and Technology
Abstract/Summary:
With the rapid expansion of text information, it is increasingly difficult to extract useful information from large, disorganized collections. Text classification, a data mining method for organizing and managing massive amounts of text effectively, addresses this problem and is widely applied in information retrieval, information filtering, spam filtering, digital libraries and so on. It is therefore of great significance and practical value to study text classification methods with high accuracy, high efficiency and good stability.

Feature selection, term weighting and classification are key parts of text categorization, and they are the main subjects of this paper. The paper first briefly describes the research background and significance of text classification, reviews the state of research in China and abroad, and then summarizes the research contents and chapter structure. It also introduces the key technologies of text classification and the evaluation metrics for classification performance, to lay a foundation for the later work.

Aiming to improve chi-square statistic feature selection, TFIDF feature weighting, and the KNN classification method, the research of this paper is as follows.

(1) The chi-square statistic neglects term frequency and inflates the weight of features that rarely appear in a specified class but appear frequently in other classes. To address this, an adaptive feature selection method based on the chi-square statistic is proposed. An adaptive scaling factor is introduced into the chi-square statistic to automatically adjust the proportion of terms that are positively and negatively correlated with a category, eliminating the error introduced by choosing the scaling factor manually.
A term frequency factor and the variance among classes are also introduced into the traditional chi-square statistic, in order to select terms that appear frequently in a specified class and rarely in other classes. Combined with the KNN method, experimental results show that the proposed feature selection algorithm performs well on both balanced and unbalanced corpora, and improves classification performance markedly on the unbalanced corpus.

(2) The TFIDF method ignores how a feature is distributed among different classes and inside a single class. To address this, an improved TFIDF feature weighting method based on the chi-square statistic and information entropy is proposed. A chi-square statistic factor and an intra-class distribution entropy factor are introduced into the traditional TFIDF method to compensate for this defect and improve the accuracy of feature weighting. Combined with the KNN method, experimental results show that the proposed feature weighting algorithm improves classifier performance and has good stability.

(3) The classification efficiency of the KNN method drops as the number of training samples grows. To address this, an improved KNN text classification algorithm based on K-Medoids and membership degree is proposed. On the basis of the traditional KNN algorithm, an improved K-Medoids clustering algorithm is adopted to remove training samples that contribute little to classification, reducing the similarity computation in the classification process. A membership degree is introduced into the KNN algorithm so that the K nearest neighbor samples of a test text are treated differently. Experimental results show that the classification efficiency of the KNN method is improved while high classification accuracy is maintained, and the effectiveness of the three methods proposed in this paper is further verified.
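The abstract does not give the formulas of the proposed variants, so the following is only a minimal sketch of the conventional baseline it builds on: the standard chi-square statistic for feature selection, plain TFIDF weighting, and KNN over cosine similarity. All function names and the toy corpus `TRAIN` are hypothetical and introduced purely for illustration. The similarity-weighted vote is a common stand-in and is not the membership degree described in the thesis; the adaptive scaling factor, entropy factor and K-Medoids pruning are not reproduced here.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: (tokens, label) pairs, for illustration only.
TRAIN = [
    (["ball", "goal", "team"], "sport"),
    (["team", "match", "goal"], "sport"),
    (["stock", "market", "bank"], "finance"),
    (["bank", "loan", "market"], "finance"),
]

def chi_square(term, category, docs):
    """Conventional chi-square score of `term` for `category` (document counts only)."""
    a = sum(1 for toks, lab in docs if term in toks and lab == category)
    b = sum(1 for toks, lab in docs if term in toks and lab != category)
    c = sum(1 for toks, lab in docs if term not in toks and lab == category)
    d = sum(1 for toks, lab in docs if term not in toks and lab != category)
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    # (a*d - c*b) is squared, so positive and negative correlation score alike.
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def select_features(docs, k):
    """Keep the k terms with the highest chi-square score over all categories."""
    vocab = {t for toks, _ in docs for t in toks}
    cats = {lab for _, lab in docs}
    scores = {t: max(chi_square(t, c, docs) for c in cats) for t in vocab}
    return sorted(vocab, key=lambda t: (-scores[t], t))[:k]

def idf_table(docs, features):
    """Plain IDF, log(N / df), for each selected feature."""
    n = len(docs)
    return {t: math.log(n / sum(1 for toks, _ in docs if t in toks))
            for t in features}

def tfidf_vector(tokens, idf):
    """Sparse TFIDF vector restricted to the selected features."""
    tf = Counter(t for t in tokens if t in idf)
    return {t: count * idf[t] for t, count in tf.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(tokens, docs, idf, k):
    """Label `tokens` by a similarity-weighted vote of its k nearest neighbors."""
    query = tfidf_vector(tokens, idf)
    sims = sorted(
        ((cosine(query, tfidf_vector(toks, idf)), lab) for toks, lab in docs),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = defaultdict(float)
    for sim, lab in sims[:k]:
        votes[lab] += sim
    return max(votes, key=votes.get)

FEATURES = select_features(TRAIN, 4)
IDF = idf_table(TRAIN, FEATURES)
```

Calling `knn_classify(["goal", "team", "win"], TRAIN, IDF, 3)` classifies an unseen document against the toy corpus. Note that `knn_classify` compares the query with every training document, which is the cost that item (3) aims to reduce by pruning training samples.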
Keywords/Search Tags: Text classification, Feature selection, Chi-square statistic, TFIDF method, KNN algorithm