Research On Chinese Text Categorization

Posted on: 2011-12-06  Degree: Master  Type: Thesis
Country: China  Candidate: B X Li  Full Text: PDF
GTID: 2178330332474121  Subject: Computer technology
Abstract/Summary:
Text classification plays an important role in search engines and data processing. Classification makes a search engine more user-friendly and improves search accuracy, because relevant documents are returned grouped by category. Chinese text categorization has likewise received growing attention in China. At present, text classification algorithms, whether under development or in practical use, rarely consider the case of classifying large batches of text. To address batch classification and the practical demand for it, this thesis combines K-means clustering with a multi-class SVM classifier in a step-by-step scheme. The experiments use the link between clustering and classification and obtain fairly satisfactory results. I carried out the following work:

1. To address the excessive dimensionality of Chinese text classification, the feature selection methods DF (document frequency), IG (information gain), CHI (chi-square statistic) and MI (mutual information), as well as combinations of them, were tested with a Naive Bayes classifier and a support vector machine. The results show that classifier performance improves under a two-stage procedure. First, DF removes low-frequency words that fall below a set threshold, which eliminates the reliance of IG, CHI or MI on low-frequency words. Then IG, CHI or MI removes, from the remaining terms, noise words that carry little class information.

2. Experiments were carried out on batch text classification with the combined K-means and multi-class SVM classifier. The experiment has four steps. First, the texts are clustered, with the K-means algorithm implemented in Java. Second, a small sample is drawn from each resulting cluster; clusters whose samples meet the requirements need no further classification. Third, the clusters whose sampling results are unsatisfactory are pooled together. Fourth, the pooled texts are classified with the LIBSVM software. In the experiments, the combined classifier achieved good performance.
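The two-stage feature selection in item 1 can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function name `select_features`, the parameters `min_df` and `top_k`, and the toy corpus are all hypothetical, and the abstract does not state the thresholds or corpus actually used. CHI is chosen here for the second stage; IG or MI could be substituted in the same position.

```python
from collections import Counter

def select_features(docs, labels, min_df=2, top_k=2):
    """Two-stage selection: DF filter, then chi-square ranking.

    docs: list of token lists; labels: parallel list of class names.
    min_df and top_k are illustrative values, not taken from the thesis.
    """
    n = len(docs)
    doc_sets = [set(d) for d in docs]

    # Stage 1: drop terms whose document frequency is below the threshold.
    df = Counter(t for s in doc_sets for t in s)
    kept = [t for t, count in df.items() if count >= min_df]

    # Stage 2: score each surviving term by its maximum chi-square over classes.
    class_sizes = Counter(labels)
    scores = {}
    for t in kept:
        best = 0.0
        for cls, n_cls in class_sizes.items():
            a = sum(1 for s, y in zip(doc_sets, labels) if t in s and y == cls)
            b = df[t] - a          # term present, other classes
            c = n_cls - a          # term absent, this class
            d = n - a - b - c      # term absent, other classes
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            chi = n * (a * d - b * c) ** 2 / denom if denom else 0.0
            best = max(best, chi)
        scores[t] = best

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_k], scores

# Toy corpus (already tokenized; real Chinese text would need word segmentation first).
docs = [
    ["goal", "team", "the"],
    ["goal", "match", "the"],
    ["stock", "bank", "the"],
    ["stock", "market", "the"],
]
labels = ["sports", "sports", "finance", "finance"]
selected, scores = select_features(docs, labels)
```

In this toy run the DF filter discards the words that occur in only one document before chi-square is computed, and the word that appears in every document receives a zero chi-square score, so it ranks below the class-specific terms.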
Keywords/Search Tags: Text Classification, K-means clustering algorithm, multi-label classifier