
Research Of Weight Algorithm In KNN Text Classification

Posted on: 2011-11-21
Degree: Master
Type: Thesis
Country: China
Candidate: X H Zhao
Full Text: PDF
GTID: 2178360305971649
Subject: Computer application technology
Abstract/Summary:
With the rapid expansion of network information resources, finding resources and information quickly and efficiently has become a pressing need. Text classification is an effective way to address this problem and an important means of collecting and organizing text data.

Text classification assigns each document in a large collection to one or more categories, so that the content of each category represents a distinct theme. Current text classification mainly uses the statistics-based vector space model, and involves text pre-processing, Chinese word segmentation, feature selection, feature weighting, classification algorithms, and classification performance evaluation.

Feature weighting is an important issue in text classification based on the vector space model and directly affects the final classification results. The basic idea of TFIDF, one such weighting method, is to take a term's frequency in the text as the TF weight and then multiply it by the IDF function to adjust the weight. The purpose of this adjustment is to emphasize important words and suppress unimportant ones. However, the simple structure of the IDF function prevents it from performing this adjustment well, so the classification accuracy of the TFIDF method is not satisfactory.

Observing these shortcomings of the TFIDF algorithm, we considered whether a new kind of feature weight function, built by incorporating a feature selection function, could avoid the defects of TFIDF. In this thesis we discuss the feature selection and feature weighting methods involved in text classification, and we propose a new feature weighting method based on class distribution and location information. Our experiments use the Chinese Lexical Analysis System (ICTCLAS) developed by the Institute of Computing Technology.
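To make the baseline concrete, the following is a minimal sketch of the classic TFIDF weighting described above: the raw term frequency in a document is multiplied by log(N / df(t)), where df(t) is the number of documents containing the term. The function name and data layout are illustrative, not taken from the thesis.

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """Compute TFIDF weights: TF(t, d) * log(N / df(t)).

    documents: list of tokenized documents (lists of terms).
    Returns one {term: weight} dict per document.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # raw term frequency as the TF component
        weights.append({
            term: count * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [["cat", "sat", "mat"], ["cat", "cat", "dog"], ["dog", "ran"]]
w = tfidf_weights(docs)
# "cat" appears in 2 of 3 documents, so its IDF factor is log(3/2)
```

Note that a term occurring in every document gets IDF = log(1) = 0, which is exactly the "simple structure" criticized above: IDF looks only at document counts and ignores how a term is distributed across categories.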
We use the KNN classifier in this system to test the improved TFIDF algorithm. From the experimental results, the following conclusions can be drawn:

(1) We selected different feature selection algorithms (mutual information, expected cross entropy, information gain, weight of evidence for text, and CHI) and used the KNN classification model to compare their impact on the classification results. Among these algorithms, weight of evidence for text performs best, followed by information gain, CHI, and expected cross entropy, while mutual information has the worst classification effect.

(2) We compared the micro-averaged precision of individual categories when the weight formula was TFIDF, TF * CHI, or TFIDF * CHI, which shows how the different feature weight formulas affect the classification of each category. The experiments support the following conclusions: the TF * CHI weight formula is the worst in both recall and precision; overall, the TFIDF * CHI weight formula and the traditional TFIDF formula show no significant difference. On closer examination, however, TFIDF * CHI has little effect on categories that already achieve high precision, but it greatly improves the categories with the worst classification results.

(3) We selected different feature selection algorithms (mutual information, expected cross entropy, information gain, weight of evidence for text, and CHI) together with different feature weighting methods (TFIDF, TF * feature selection function, and TFIDF * feature selection function), and used the KNN classification model in comparative experiments to measure the impact of the improved weight calculation methods on the final classification results.
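The three weight formulas compared in conclusion (2) can be sketched as follows. The CHI (chi-square) statistic is computed from the standard 2x2 term-category contingency table; the function names and the example counts are illustrative assumptions, not values from the thesis experiments.

```python
import math

def chi_square(A, B, C, D):
    """CHI statistic for a term-category pair.

    A: documents in the category that contain the term
    B: documents outside the category that contain the term
    C: documents in the category that lack the term
    D: documents outside the category that lack the term
    """
    N = A + B + C + D
    denom = (A + B) * (C + D) * (A + C) * (B + D)
    return N * (A * D - B * C) ** 2 / denom if denom else 0.0

def combined_weight(tf, idf, chi, scheme="tfidf_chi"):
    """The three weighting schemes compared in the experiments."""
    if scheme == "tfidf":
        return tf * idf        # traditional TFIDF
    if scheme == "tf_chi":
        return tf * chi        # TF * CHI (worst in the experiments)
    return tf * idf * chi      # TFIDF * CHI

# A term strongly associated with one category (hypothetical counts):
chi = chi_square(A=40, B=5, C=10, D=45)
```

Unlike IDF, the CHI factor grows when a term is concentrated in one category, which is why multiplying it into the weight helps most on the categories that TFIDF classifies worst.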
The experiments show that, whichever feature selection function is chosen, the TF * feature selection function weighting produces worse classification results than the TFIDF * feature selection function weighting. Moreover, for every feature selection function other than information gain, TFIDF * feature selection function achieves higher classification accuracy. This shows that we cannot say with certainty that any particular feature selection algorithm or weight calculation method will always give good results; the outcome depends on the combination of feature selection algorithm and feature weighting algorithm, and only an appropriate combination plays to the strengths of both and optimizes the classification rate.
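The KNN classifier used throughout these experiments can be sketched as below: each document is a sparse term-weight vector (produced by any of the weighting schemes above), similarity is measured by cosine, and the query takes the majority label among its k nearest neighbours. The exact distance measure and voting rule of the thesis system are not specified here, so this is a generic sketch with illustrative data.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query, training, k=3):
    """Label a query vector by majority vote among its k nearest neighbours.

    training: list of (weight_vector, label) pairs.
    """
    neighbours = sorted(training, key=lambda p: cosine(query, p[0]),
                        reverse=True)[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [({"cat": 1.0, "pet": 0.5}, "animal"),
         ({"dog": 1.0, "pet": 0.7}, "animal"),
         ({"stock": 1.0, "bank": 0.8}, "finance"),
         ({"bond": 0.9, "bank": 0.6}, "finance")]
label = knn_classify({"cat": 0.8, "dog": 0.3}, train, k=3)  # -> "animal"
```

Because KNN has no training phase beyond storing the vectors, the feature weighting scheme is the main lever on its accuracy, which is why the weight-formula comparisons above translate directly into classifier performance.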
Keywords/Search Tags:text classification, feature selection, feature weight method, KNN, TFIDF