Study On Chinese Text Categorization

Posted on:2011-07-18

Degree:Master

Type:Thesis

Country:China

Candidate:Y F Chen

Full Text:PDF

GTID:2178330332476287

Subject:Computer software and theory

Abstract/Summary:

With the development of information technology and the growth in electronic text documents, text categorization, a key technology for organizing and processing large document collections, is attracting increasing attention. This thesis presents a thorough study of text categorization and its related techniques, including text preprocessing, text representation, feature selection, feature weighting, and classification methods.

In the preprocessing stage, traditional Chinese text categorization usually applies a Chinese tokenizer or N-grams to generate features directly. However, a Chinese tokenizer requires a comprehensive Chinese dictionary, and its segmentation process is complex; the efficiency and accuracy of segmentation-based classification still need improvement. N-grams, in turn, usually produce a much higher feature dimension and do not yield satisfactory classification results. We therefore propose a novel text preprocessing method that first encodes Chinese text with Base64 and then tokenizes the encoded text with N-grams. Experimental results show that our method performs better than the traditional method based on a Chinese tokenizer.

In the text representation stage, we apply the vector space model (VSM), a widely used text representation method. Each dimension in the VSM corresponds to a feature term, whose value is calculated by a feature weighting method. A comparison between N-gram features and Chinese word features shows that the former represent text more efficiently.

Feature selection is a crucial part of text categorization. Because the feature space of Chinese documents usually has a very high dimension, a feature selection step is needed to choose the features that best represent the documents. In addition, a good feature selection method can improve classification accuracy.
At this stage, we discuss several feature selection methods for reducing the dimension of the feature space, which is composed of 4-gram features.

Finally, we design and implement a prototype Chinese text categorization system based on Base64 encoding, which consists of a Chinese text preprocessing module, a feature selection module, and a classification and result evaluation module. Experiments on the Fudan University corpus show that the system is both effective and feasible.
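
The preprocessing idea described in the abstract (Base64 encoding followed by N-gram tokenization) can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the function names `base64_ngrams` and `term_frequencies` are hypothetical, UTF-8 is assumed as the byte encoding (the abstract does not state one), n = 4 is taken from the 4-gram features mentioned in the abstract, and raw counts stand in for whichever feature weighting method the thesis actually uses.

```python
import base64
from collections import Counter
from typing import List


def base64_ngrams(text: str, n: int = 4, encoding: str = "utf-8") -> List[str]:
    """Encode text as Base64, then slide a window of n characters over it.

    Assumption: the byte encoding (UTF-8 here) is not given in the abstract.
    """
    encoded = base64.b64encode(text.encode(encoding)).decode("ascii")
    # Overlapping character N-grams of the Base64 string; empty if too short.
    return [encoded[i:i + n] for i in range(len(encoded) - n + 1)]


def term_frequencies(text: str, n: int = 4) -> Counter:
    """Count N-gram occurrences; a stand-in for a real feature weighting step."""
    return Counter(base64_ngrams(text, n))
```

For example, `base64_ngrams("中文")` returns the five 4-grams of the Base64 string `5Lit5paH`. No dictionary or word segmenter is involved, which is the property the abstract contrasts with tokenizer-based preprocessing.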

Keywords/Search Tags:

Chinese text categorization, Chinese word segmentation, N-gram, Base64 encoding, Feature selection, Feature weighting

Related items

1	A Study On Key Issues Of Automated Text Categorization For Chinese Documents
2	Research And Implementation Of The Automatic Chinese Text Categorization
3	The Studies On Chinese Text Categorization Based On PSO And SVM
4	Research And Implementation Of Chinese Text Categorization
5	Research Of Chinese Text Categorization Algorithms Based On Information Entropy
6	Research On Chinese Text Categorization Algorithms Based On Technology Text
7	Research And Implementation Of Chinese Automatic Text Classification System Based On SVM
8	Research Of The Automatic Chinese WEB Text Categorization In Search Engine
9	Design And Implementation Of Web Automatic Text Categorization
10	Chinese Text Data Classification