Research On Feature Selection Algorithm And Classification Algorithm In Chinese Text Categorization

Posted on: 2011-06-22; Degree: Master; Type: Thesis
Country: China; Candidate: L Chi; Full Text: PDF
GTID: 2178360302994670; Subject: Computer software and theory
Abstract/Summary:
In recent years, with the rapid development of information technology and especially the spread of the Internet, the volume of electronic text on web pages has grown dramatically. How to organize and manage this information effectively, and how to retrieve what users need quickly and accurately, is a major challenge for information resource management. Automatic text classification allows electronic text to be organized and managed by category, which meets the demand for convenient, efficient information processing and helps locate information resources accurately. This thesis studies segmentation algorithms, feature selection methods, and text classification algorithms in depth.

First, we analyze the characteristics of Chinese text categorization in pre-processing, the vector space model representation, and two mechanical segmentation methods. On that basis we improve the segmentation method in four respects: the dictionary structure, the matching method, the strategy for handling ambiguous words, and the strategy for handling unknown words. The improvements are validated experimentally.

Second, building on text pre-processing, and in order to improve classification accuracy and reduce the computational cost of the classification algorithms, we analyze the Categorical Proportional Difference (CPD) feature selection method, improve it with respect to the frequency and redundancy of feature terms, and validate the improvement through comparative experiments.

Finally, we analyze two shortcomings of the KNN algorithm: its enormous computational cost, and the decline in classification accuracy when categories have much in common, that is, when training samples from different categories share many features. We propose an improved KNN algorithm for text classification and compare it experimentally with the traditional KNN algorithm on the Chinese text categorization corpus TanCorpV1.0 and a Sohu web page corpus.
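
To make the feature selection step concrete, the following is a minimal sketch of the baseline CPD score as it is commonly defined, not the improved variant developed in the thesis (the abstract does not give its formula). It assumes CPD(t, c) = (A - B) / (A + B), where A is the number of documents in category c that contain term t and B is the number of documents in all other categories that contain t, and it scores each term by its maximum over categories. The function names `cpd_scores` and `select_features` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def cpd_scores(docs, labels):
    """Return {term: max over categories of (A - B) / (A + B)}.

    A = number of documents in the category that contain the term,
    B = number of documents in all other categories that contain it.
    """
    # df[term][category] = document frequency of the term within the category
    df = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            df[term][label] += 1

    scores = {}
    for term, per_cat in df.items():
        total = sum(per_cat.values())  # A + B, the same for every category
        # (A - B) / (A + B) with B = total - A
        scores[term] = max((2 * a - total) / total for a in per_cat.values())
    return scores


def select_features(docs, labels, k):
    """Keep the k terms with the highest CPD score (ties broken alphabetically)."""
    scores = cpd_scores(docs, labels)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Toy, already-segmented documents (hypothetical, not from the thesis corpora).
docs = [
    ["football", "match", "team"],
    ["team", "coach", "match"],
    ["stock", "market", "team"],
    ["stock", "bank", "market"],
]
labels = ["sports", "sports", "finance", "finance"]

scores = cpd_scores(docs, labels)
selected = select_features(docs, labels, 6)
```

In this toy data, "team" appears in both categories and receives a low score, so it is the one term dropped when six features are kept. A term that occurs in only a single document, such as "football", still receives the maximum score, which is one plausible reason a frequency-aware refinement of CPD would be worth studying.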
Keywords/Search Tags: Text classification, Segmentation algorithm, Feature selection, Categorical proportional difference, K-nearest neighbors, Precision rate, Recall rate