
The Research And Implementation Of Chinese Web Categorization

Posted on: 2009-11-23
Degree: Master
Type: Thesis
Country: China
Candidate: J H Zhu
GTID: 2298360245989054
Subject: Computer application technology
Abstract/Summary:
With the development of network and communication technology, and especially the spread of the Internet, information resources have become extremely abundant. Automatic text categorization, as a way of bringing order to disordered and unsystematic information, is an important research topic. This thesis analyzes and studies the key technologies of Chinese web page categorization, including web page preprocessing, feature extraction, and categorization methods. First, the noise in Chinese web pages is sorted into two classes according to how it is represented; the thesis provides a separate method for each class and implements an HTMLParser-based preprocessing step for Chinese web pages. Second, because the pages are written in Chinese, the thesis studies Chinese word segmentation in depth and proposes a dictionary-based word segmentation system. In this system the dictionary is divided into many smaller dictionaries according to the first character of each word, and binary search within those smaller dictionaries effectively improves segmentation speed. Third, the thesis introduces an improved TFIDF scheme that addresses the limitations of TFIDF in feature extraction and page representation. TFIDF ignores how the distribution of a term within and across classes affects how well that term distinguishes the classes; the scheme adds two variance terms to adjust the TFIDF weight, and it is shown to describe pages better. Finally, the thesis compares several traditional text categorization algorithms, summarizes existing improvements, and proposes an improved kNN categorization method based on the core vector of each class. The method first eliminates pages that lie far from the core vector, to reduce their influence on it. Then, when summing the distances of the neighbors that fall into each class, it applies a weight that depends on the distance between the page and the core vector. Experiments indicate that this method performs better than the standard kNN method. The thesis also implements a categorization system to support further study. The experiments indicate that the techniques discussed improve categorization performance and achieve the intended goals.
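The abstract gives no implementation details, so the following is only a minimal sketch of two of the ideas it describes: a dictionary split into sub-dictionaries by first character and searched with binary search, and a kNN classifier that prunes pages far from the class core vector and weights votes by distance to that core vector. Everything not stated in the abstract is an assumption made for illustration: forward maximum matching as the segmentation strategy, the `max_len` limit, Euclidean distance, taking the core vector to be the class mean, the `ratio` pruning threshold, the `1 / (1 + d)` weighting, and all function names.

```python
from bisect import bisect_left
from collections import defaultdict
import math


def build_index(words):
    """Split the dictionary into sorted sub-dictionaries keyed by first character."""
    index = defaultdict(list)
    for word in set(words):
        if word:
            index[word[0]].append(word)
    for bucket in index.values():
        bucket.sort()
    return index


def in_dictionary(index, word):
    """Binary-search the sub-dictionary selected by the word's first character."""
    bucket = index.get(word[0], [])
    i = bisect_left(bucket, word)
    return i < len(bucket) and bucket[i] == word


def segment(text, index, max_len=4):
    """Forward maximum matching (assumed; the abstract names no strategy)."""
    tokens, pos = [], 0
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or in_dictionary(index, candidate):
                tokens.append(candidate)
                pos += size
                break
    return tokens


def distance(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def core_vectors(train):
    """Mean vector per class; train is a list of (vector, label) pairs."""
    groups = defaultdict(list)
    for vec, label in train:
        groups[label].append(vec)
    return {label: tuple(sum(col) / len(vecs) for col in zip(*vecs))
            for label, vecs in groups.items()}


def prune(train, cores, ratio=1.5):
    """Drop pages farther from their core than ratio * the class mean distance."""
    dists = defaultdict(list)
    for vec, label in train:
        dists[label].append(distance(vec, cores[label]))
    mean = {label: sum(d) / len(d) for label, d in dists.items()}
    return [(vec, label) for vec, label in train
            if distance(vec, cores[label]) <= ratio * mean[label]]


def classify(query, train, k=3, ratio=1.5):
    """kNN vote in which each neighbor's contribution is scaled by a core-vector weight."""
    kept = prune(train, core_vectors(train), ratio)
    cores = core_vectors(kept)  # recompute cores without the pruned pages
    neighbours = sorted(kept, key=lambda item: distance(query, item[0]))[:k]
    scores = defaultdict(float)
    for vec, label in neighbours:
        similarity = 1.0 / (1.0 + distance(query, vec))
        weight = 1.0 / (1.0 + distance(query, cores[label]))  # assumed form
        scores[label] += weight * similarity
    return max(scores, key=scores.get)
```

In this sketch, `segment("中文网页分类", build_index(["中文", "网页", "分类", "中文网页"]))` prefers the longest dictionary match at each position, and `classify` expects pages that have already been converted to numeric feature vectors (for example, TFIDF weights).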
Keywords/Search Tags: Chinese web categorization, Chinese word segmentation, feature extraction, TFIDF