Research On Chinese Text Categorization

Posted on: 2008-12-11
Degree: Master
Type: Thesis
Country: China
Candidate: J X Weng
GTID: 2178360212494644
Subject: Computer system architecture
Abstract/Summary:
Text categorization and clustering are two important computations in the design of web search engines, and they are also crucial in applications such as data processing and data mining. Research on text categorization and clustering is therefore important both in theory and in practice. Since the Internet is currently the main source of information for most people, the study of text categorization and clustering is not limited to general text files but also covers web pages. This thesis therefore discusses text categorization and clustering for both ordinary text documents and web pages.

The main contributions of this thesis are as follows:

1) A thorough study of feature selection methods and text categorization methods for text documents is carried out, and the advantages of each method are analyzed. Based on this study, the traditional support vector machine (SVM) algorithm is improved by using a new kernel function that combines the RBF and polynomial (POLY) kernels. Experiments show that the new SVM achieves better categorization results.

2) To address the shortcomings of traditional feature selection, a new method for web page classification is proposed, based on the observation that most web pages are semi-structured. The new method uses both the structural information and the content of web pages, overcoming a weakness of traditional classification, which ignores or overlooks structural information. The main idea is to classify web pages first by their structural information and then by their text features. Experiments show that the new method improves the precision and efficiency of web page classification.

3) Preliminary research on multi-classifier combination is carried out. The combined classifier is built from two classifier models; the combination is based on Naive Bayes theory and is used to classify web pages. Experiments show that, compared with a single classifier, the combined classifier performs fairly well and effectively improves classification precision.

The thesis is organized as follows. Chapter 1 reviews related work on automatic text categorization. Chapter 2 introduces the main concepts of text categorization, the classical theoretical models, the feature selection methods, the classifier models, and our support vector machine model; it also compares our SVM with KNN and Naive Bayes experimentally. Chapter 3 introduces a hierarchical classification method based on a taxonomy, the structural information of web pages, and the text of web pages, and verifies the validity of the algorithm through experiments. Chapter 4 presents preliminary research on multi-classifier combination. Chapter 5 summarizes the whole thesis.
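
The abstract does not say how the RBF and POLY kernels are combined in the new SVM kernel. As a minimal sketch only, one common way to build such a kernel is a convex combination of the two, which is again a valid kernel. The function names, the mixing parameter `weight`, and the default values of `gamma`, `degree`, and `coef0` below are illustrative assumptions, not the thesis's actual formulation.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def poly_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel: (<x, y> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree


def combined_kernel(x, y, weight=0.5, gamma=0.5, degree=2, coef0=1.0):
    """Assumed form: weight * RBF + (1 - weight) * POLY, with 0 <= weight <= 1.

    This convex combination is a hypothetical stand-in for the combined
    kernel mentioned in the abstract.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return (weight * rbf_kernel(x, y, gamma)
            + (1.0 - weight) * poly_kernel(x, y, degree, coef0))


def gram_matrix(vectors, **kernel_params):
    """Pairwise kernel values for a list of document feature vectors."""
    return [[combined_kernel(u, v, **kernel_params) for v in vectors]
            for u in vectors]


# Toy term-weight vectors standing in for three documents (made-up values).
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gram = gram_matrix(docs, weight=0.5)
```

A Gram matrix of this kind could then be handed to an SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`. The mixing weight would normally be chosen by validation, since it trades the local behavior of the RBF kernel against the global behavior of the polynomial kernel.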
Keywords/Search Tags: text categorization, web pages classification, feature selection, KNN, support vector machine