Research Of Chinese Text Automatic Classification Based On Statistical Method

Posted on:2005-02-19

Degree:Master

Type:Thesis

Country:China

Candidate:C R Luo

GTID:2168360122991531

Subject:Computer application technology

Abstract/Summary:

With the development of information technology, people have moved from an age in which information was scarce into one in which it is extremely abundant and digitized. How to acquire useful information quickly and effectively from this sea of information has become a very important problem. For this purpose, automatic text classification has been proposed and studied in practical applications.

This paper details research on the theory and practice of automatic text classification using statistical methods. The main aspects of the paper are as follows:

1. The definition of classification and its commonly used models and algorithms are discussed theoretically.

2. The general methods and key technologies for constructing a classifier are discussed.

3. We employ the vector-distance weighted algorithm, the representative-vector-distance algorithm and the center-vector algorithm to construct classifiers. Experiments with the three classification algorithms were then carried out with different feature sets (a Chinese-character feature set and a Chinese-word feature set). From the analysis of the experimental results, we find that: (1) with the same classifier, using Chinese words as features gives better classification results than using Chinese characters; (2) the choice of classifier strongly affects the classification result; for example, the center-vector algorithm obtains better results than the other two algorithms. With the character feature, the average recall is 80.73% and the average precision is 82.94%; with the Chinese-word feature, the average recall is 83.6% and the average precision is 85.97%. Different corpora also influence the classification result. For example, the average recall is 89.31% and the average precision is 88.33% when news web pages from the web site "www.sina.com.cn" are used as the corpus, with the center-vector algorithm used to construct the classifier and Chinese words selected as features.

4. For the improved algorithm, the experimental results show an average recall of 96.35% and an average precision of 90.87%. These results indicate that the improved algorithm is suitable for Chinese text automatic classification.

This study can be used in network information retrieval, information filtering, Chinese text automatic classification, Chinese web page automatic classification and other application fields.
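As an illustration of the center-vector idea named in item 3, the following is a minimal Python sketch and not the thesis's actual implementation. It assumes L2-normalised term-frequency weights, cosine similarity, and macro-averaging for the "average" precision and recall; the abstract specifies none of these. The function names and the toy training data are hypothetical. Documents are passed in as token lists, so the same code covers both feature sets: pre-segmented words for the word feature, or `list(text)` for the character feature.

```python
import math
from collections import Counter, defaultdict


def to_vector(tokens):
    """L2-normalised term-frequency vector (assumed weighting scheme)."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def train_centroids(labelled_docs):
    """Average the document vectors of each class into one center vector."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter()
    for tokens, label in labelled_docs:
        for t, w in to_vector(tokens).items():
            sums[label][t] += w
        counts[label] += 1
    return {
        label: {t: w / counts[label] for t, w in vec.items()}
        for label, vec in sums.items()
    }


def classify(tokens, centroids):
    """Assign the class whose center vector is most similar to the document."""
    vec = to_vector(tokens)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


def macro_precision_recall(gold, predicted):
    """Per-class precision and recall, averaged over classes (assumed macro-average)."""
    labels = set(gold) | set(predicted)
    precisions, recalls = [], []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)


# Hypothetical toy corpus with pre-segmented Chinese words as features.
TRAIN = [
    (["股票", "市场", "上涨"], "finance"),
    (["银行", "利率", "市场"], "finance"),
    (["球队", "比赛", "进球"], "sports"),
    (["比赛", "冠军", "球队"], "sports"),
]
centroids = train_centroids(TRAIN)
print(classify(["市场", "利率"], centroids))
```

On this toy corpus the test document shares terms only with the finance class, so it is assigned to "finance". Training reduces each class to a single vector, so classifying a document costs one similarity computation per class rather than one per training document.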

Keywords/Search Tags:

Center vector algorithm, Text automatic classification, Vector space model, Statistical method

Related items

1	Research And Improvement Of Automatic Text Classification Algorithm Based On The Vector Space Model
2	The Research And Implement Of Automatic Text Classification System Which Is Based On Vector Space Model
3	Automatic Classification Research On Chinese Web Document Orientation
4	Text Classification Based On Word Vector And Topic Vector
5	Automatic Classification Research On HTML Document And Implentation Of The Tool
6	Study Of Text Classification Model Based On Key Vector
7	Improvement And Application To Weighting Terms Based On Text Classification
8	Research Of Chinese Page Automatic Classification Based On Vector Space Model
9	Automatic Classification Based On The Concept Of The Text
10	A Research On Automatic Web Text Classification Technology