Research On Chinese Text Categorization

Posted on:2008-12-11

Degree:Master

Type:Thesis

Country:China

Candidate:J X Weng

GTID:2178360212494644

Subject:Computer system architecture

Abstract/Summary:

Text categorization and clustering are two important computations in the design of web search engines, and they are also central to applications such as data processing and data mining. Research on text categorization and clustering is therefore important both in theory and in practice.

Since the Internet is currently the main source of information for most people, the study of text categorization and clustering is not limited to general text files but also covers web pages. This thesis therefore discusses text categorization and clustering for both ordinary text documents and web pages.

The main contributions of this thesis are as follows:

1) A thorough study of feature selection methods and text categorization methods for text documents is carried out, and the advantages of each method are analyzed. Based on this study, the traditional support vector machine (SVM) algorithm is improved by using a new kernel function that combines the RBF and polynomial (POLY) kernels. The experiments show that the new SVM achieves better categorization results.

2) To address the shortcomings of traditional feature selection, a new method for web page classification is proposed, based on the observation that most web pages are semi-structured. The new method uses both the structural information and the content of web pages, overcoming the weakness of traditional classification, which ignores or overlooks structural information. The main idea is to classify web pages first by their structural information and then by their text features. The experiments show that the new method improves the precision and efficiency of web page classification.

3) Preliminary research on multi-classifier combination is carried out. The combined classifier is built from two classifier models; the combination is based on Naive Bayes theory and is used to classify web pages. Our experiments show that, compared with a single classifier, the combined classifier works fairly well and effectively improves classification precision.

The thesis is organized as follows. Chapter 1 reviews related work on automatic text categorization. Chapter 2 introduces the main concepts of text categorization, the classical theoretical models, the feature selection methods, the classifier models, and our support vector machine model; it also compares our SVM with KNN and Naive Bayes experimentally. Chapter 3 introduces a hierarchical classification method based on taxonomy, the structural information of web pages, and the text of web pages, and verifies the validity of the algorithm experimentally. Chapter 4 presents preliminary research on multi-classifier combination. Chapter 5 summarizes the whole thesis.
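The abstract does not state how the RBF and polynomial kernels are combined. One common construction, shown here only as an illustrative sketch and not as the thesis's actual formula, is a convex combination of the two, which remains a valid kernel because a non-negative weighted sum of valid kernels is itself valid. The function names, the mixing parameter `weight`, and the default values of `gamma`, `degree`, and `coef0` are all assumptions made for this example.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian RBF kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def poly_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel: (x . y + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree


def mixed_kernel(x, y, weight=0.5, gamma=0.5, degree=2, coef0=1.0):
    """Assumed form: weight * RBF + (1 - weight) * POLY, with weight in [0, 1]."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    rbf_part = rbf_kernel(x, y, gamma)
    poly_part = poly_kernel(x, y, degree, coef0)
    return weight * rbf_part + (1.0 - weight) * poly_part


def gram_matrix(docs, **params):
    """Pairwise kernel values for a list of document feature vectors."""
    return [[mixed_kernel(a, b, **params) for b in docs] for a in docs]


if __name__ == "__main__":
    # Toy term-weight vectors standing in for three documents (made-up values).
    docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for row in gram_matrix(docs, weight=0.5):
        print(row)
```

A Gram matrix built this way can be handed to any SVM implementation that accepts a precomputed kernel. Setting `weight` to 1 recovers the plain RBF kernel and setting it to 0 recovers the plain polynomial kernel, so the mixing weight can be tuned like any other hyperparameter.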

Keywords/Search Tags:

text categorization, web pages classification, feature selection, KNN, support vector machine

Related items

1	Research On Text Classification Based On Feature Selection And Its Application
2	A Study On Text Categorization Based On Machine Learning
3	Normal Weight Based Feature Selection Method In SVM Text Categorization
4	Research Of Text Categorization System Based On SVM
5	Research On Chinese Text Categorization Based On Support Vector Machine
6	Design And Implementation Of Web Automatic Text Categorization
7	Study On Multi-classification Method Of Chinese Agricultural Web Pages
8	The Design And Implementation Of Text Classification System Based On SVM-KNN
9	The Research And Implementation Of Automatic Text Categorization For Chinese Web Documents
10	Research On Text Categorization Based On Support Vector Machine