
Chinese Text Classification Algorithm Based On Multiple-factors Feature Weighting

Posted on: 2012-01-20
Degree: Master
Type: Thesis
Country: China
Candidate: H Dong
GTID: 2178330335478015
Subject: Computer application technology
Abstract/Summary:
The development of computer networks lets people share resources and results in real time, and it has also produced a huge volume of information resources. Because users need to extract useful knowledge from this disorganized mass of information quickly and accurately, text classification has attracted widespread attention. Text classification can largely resolve information clutter, allowing users to locate information accurately and conveniently.

This paper analyzes several key techniques of text classification, including text representation, word segmentation, stop-word removal, feature selection, text classification algorithms, and performance evaluation. Feature weighting and the KNN classification algorithm are two important issues in the text classification process, so this paper focuses on them.

First, we study the traditional tf * idf weighting algorithm in depth and analyze its shortcoming: it considers only two factors, the term frequency (tf) and the inverse document frequency (idf), and ignores the characteristics of the word itself. Building on the traditional tf * idf algorithm, we take several such characteristics into account, namely the distribution of a word's positions in the document, the word's length, and the word's category, and propose a multiple-factors feature weighting algorithm. We extend the original formula and then adjust the algorithm so that the weighted features are more representative.

Second, we survey some commonly used text classification algorithms and then focus on the KNN classification algorithm.
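
As a rough illustration of the multiple-factors weighting idea described above, the following Python sketch multiplies tf * idf by three extra factors for word position, word length, and word category. The abstract does not give the thesis's actual formula, so everything specific here is an assumption made for illustration: the multiplicative combination, the `Token` record, the zone and category weight values, the logarithmic length factor, the smoothed idf, and the final cosine normalization.

```python
import math
from collections import Counter, namedtuple

# Hypothetical token record: the word, where it occurs, and its category tag.
Token = namedtuple("Token", ["word", "zone", "pos"])

# Assumed factor values; the thesis abstract does not specify them.
ZONE_WEIGHT = {"title": 2.0, "body": 1.0}
POS_WEIGHT = {"n": 1.0, "v": 0.8}
DEFAULT_POS_WEIGHT = 0.5


def length_factor(word):
    # Assumed form: longer words get a mild logarithmic boost.
    return 1.0 + math.log(len(word))


def multi_factor_weights(docs):
    """Return one {word: weight} dict per document (cosine-normalized)."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update({tok.word for tok in doc})  # each word counted once per document

    result = []
    for doc in docs:
        tf = Counter(tok.word for tok in doc)
        zone, pos = {}, {}
        for tok in doc:
            # Keep the strongest zone and category factor seen for each word.
            zone[tok.word] = max(zone.get(tok.word, 0.0),
                                 ZONE_WEIGHT.get(tok.zone, 1.0))
            pos[tok.word] = max(pos.get(tok.word, 0.0),
                                POS_WEIGHT.get(tok.pos, DEFAULT_POS_WEIGHT))
        weights = {}
        for word, count in tf.items():
            idf = math.log(n_docs / df[word]) + 1.0  # smoothed so idf stays positive
            weights[word] = (count * idf * zone[word]
                             * length_factor(word) * pos[word])
        norm = math.sqrt(sum(w * w for w in weights.values()))
        result.append({w: v / norm for w, v in weights.items()} if norm else weights)
    return result
```

With these assumed values, a word that appears in the title receives twice the weight of an otherwise identical word that appears only in the body.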
We introduce the idea of the arithmetic average to protect the results of the KNN classification algorithm from data skew, propose a correspondingly improved algorithm, and run experiments to verify its validity. The experimental results show that the proposed algorithm performs satisfactorily and, to a certain extent, improves classification accuracy and recall.
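
One plausible reading of the arithmetic-average idea is sketched below in Python: among the k nearest neighbours, each class is scored by the average similarity of its members rather than by the sum, so a large class cannot win merely by contributing more neighbours. This is an assumption about the approach, not the thesis's confirmed algorithm; the function names, the sparse-dictionary vectors, and the use of cosine similarity are all illustrative choices.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_average_classify(query, training, k=5):
    """Classify `query` given `training`, a list of (vector, label) pairs.

    Assumed variant: score each class by the mean similarity of its
    members among the k nearest neighbours, instead of the usual sum.
    """
    neighbours = sorted(
        ((cosine(query, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    sims = defaultdict(list)
    for sim, label in neighbours:
        sims[label].append(sim)
    return max(sims, key=lambda label: sum(sims[label]) / len(sims[label]))
```

For example, if four loosely related documents from a large class and one closely matching document from a small class fall among the five nearest neighbours, summed voting favours the large class, while the averaged score favours the close match.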
Keywords/Search Tags: text categorization, multiple-factors feature, feature selection, feature weighting, KNN classification algorithm