The Research And Implementation Of Chinese Web Categorization

Posted on:2009-11-23

Degree:Master

Type:Thesis

Country:China

Candidate:J H Zhu

GTID:2298360245989054

Subject:Computer application technology

Abstract/Summary:

With the development of network and communication technology, and especially the spread of the Internet, information resources have become extremely abundant. Automatic text categorization, as a way of bringing order to disordered and unsystematic information, is an important research topic. This thesis analyzes and studies in depth the key technologies of Chinese web page categorization, including web page preprocessing, feature extraction, and categorization methods.

First, the noise in Chinese web pages is divided into two classes according to how it appears in the page. The thesis provides a separate method for handling each class and implements an HTMLParser-based preprocessor for Chinese web pages.

Second, because the pages are written in Chinese, the thesis studies Chinese word segmentation in depth and proposes a dictionary-based word segmentation system. In this system the dictionary is divided into many smaller dictionaries according to the first character of each word, and binary search is used within each of them, which effectively improves segmentation speed.

Third, the thesis introduces an improved TFIDF scheme that addresses the limitations of TFIDF in feature extraction and text representation. TFIDF ignores how the distribution of a term within and across classes affects how well the term discriminates between classes. The scheme adds two variances to adjust the TFIDF weight, and it is shown to describe a page better.

Finally, the thesis compares several traditional text categorization algorithms and summarizes existing improvements to them. It then proposes an improved kNN categorization method based on the core vector of each class. The method first eliminates pages that lie far from the core vector, in order to reduce their influence on it. Then, when it sums the distances that fall into each class, it applies a weight to each distance; this weight depends on the distance between the page and the core vector. Experiments indicate that this method performs better than the kNN method.

The thesis also implements a categorization system to support further study. The experiments indicate that the techniques discussed improve categorization performance and achieve the intended purpose.
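The dictionary layout described in the abstract (sub-dictionaries keyed by first character, with binary search inside each) can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the abstract does not say which matching strategy is used, so forward maximum matching is assumed here, and the names `build_index`, `contains`, `segment`, and the `max_len` parameter are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(words):
    """Split the dictionary into sorted sub-dictionaries keyed by first character."""
    index = defaultdict(list)
    for word in set(words):
        index[word[0]].append(word)
    for bucket in index.values():
        bucket.sort()  # sorted order is what makes binary search valid
    return dict(index)


def contains(index, word):
    """Binary-search the sub-dictionary selected by the word's first character."""
    bucket = index.get(word[0])
    if not bucket:
        return False
    i = bisect_left(bucket, word)
    return i < len(bucket) and bucket[i] == word


def segment(text, index, max_len=4):
    """Forward maximum matching (an assumed strategy, not stated in the abstract)."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and shrink until a dictionary hit;
        # a single character is always accepted as a fallback token.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if size == 1 or contains(index, candidate):
                tokens.append(candidate)
                pos += size
                break
    return tokens


index = build_index(["中文", "网页", "分类", "中文网页"])
print(segment("中文网页分类", index))  # ['中文网页', '分类']
```

Selecting a sub-dictionary by first character narrows each lookup to a small sorted list, so the binary search runs over far fewer entries than a search over the whole dictionary would.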
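The two steps of the improved kNN method can be illustrated with a small sketch under several assumptions that the abstract does not confirm: the core vector is taken to be the class centroid, cosine similarity stands in for distance, a fixed `keep_ratio` decides which pages count as "far away", and each neighbour's vote is weighted by that neighbour's similarity to its own class core. All function and parameter names are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Component-wise mean, used here as the assumed 'core vector'."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def prune(samples, keep_ratio=0.75):
    """Step 1: per class, drop the pages least similar to the core vector."""
    by_label = defaultdict(list)
    for vec, label in samples:
        by_label[label].append(vec)
    kept, cores = [], {}
    for label, vecs in by_label.items():
        core = centroid(vecs)
        ranked = sorted(vecs, key=lambda v: cosine(v, core), reverse=True)
        survivors = ranked[:max(1, math.ceil(keep_ratio * len(ranked)))]
        cores[label] = centroid(survivors)  # recompute the core without outliers
        kept.extend((v, label) for v in survivors)
    return kept, cores


def classify(query, kept, cores, k=3):
    """Step 2: kNN vote where each neighbour is weighted by its closeness to its class core."""
    neighbours = sorted(kept, key=lambda s: cosine(query, s[0]), reverse=True)[:k]
    scores = defaultdict(float)
    for vec, label in neighbours:
        scores[label] += cosine(query, vec) * cosine(vec, cores[label])
    return max(scores, key=scores.get)
```

Under this reading, a training page that sits near the edge of its class contributes less to the vote than one close to the core, which is one plausible way to realise the weighting the abstract describes.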

Keywords/Search Tags:

Chinese web categorization, Chinese word segment, feature extraction, TFIDF

Related items

1	Research And Implementation Of The Automatic Chinese Text Categorization
2	Study On Chinese Text Categorization
3	Research Of The Automatic Chinese WEB Text Categorization In Search Engine
4	Chinese Word Segmentation Using Rule And Statistic
5	A Study On Key Issues Of Automated Text Categorization For Chinese Documents
6	Research And Implementation Of Chinese Automatic Text Classification System Based On SVM
7	The Studies On Chinese Text Categorization Based On Pso And Svm
8	Research Of Chinese Text Categorization Algorithms Based On Information Entropy
9	Research And Implementation Of Text Categorization System Based On VSM
10	Implementation Of Chinese Text Categorization System Based On SVM