Study On Chinese Text Categorization

Posted on: 2011-07-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Chen
GTID: 2178330332476287
Subject: Computer software and theory
Abstract/Summary:
With the development of information technology and the growth in the number of electronic text documents, text categorization, a key technology for organizing and processing large document collections, is attracting increasing attention. This thesis studies text categorization and its related techniques, including text preprocessing, text representation, feature selection, feature weighting and classification methods.

In the preprocessing stage, traditional Chinese text categorization usually applies a Chinese tokenizer or N-grams to generate features directly. However, a Chinese tokenizer requires a comprehensive Chinese dictionary, and its segmentation process is complex. The efficiency and accuracy of classification based on segmentation still need to be improved. N-grams, in turn, usually produce a much higher feature dimension and do not give satisfactory classification results. We therefore propose a novel text preprocessing method that first encodes Chinese text with Base64 and then tokenizes the encoded text with N-grams. Experimental results show that this method achieves better performance than the traditional method based on a Chinese tokenizer.

In the text representation stage, we apply the vector space model (VSM), a widely used text representation method. Each dimension of the vector space corresponds to a feature term, and its value is computed by a feature weighting method. A comparison between N-gram features and Chinese word features shows that the former represent text more efficiently.

A crucial part of text categorization is feature selection. Because the feature space of Chinese documents usually has a very high dimension, a feature selection step is needed to choose the features that best represent the documents. In addition, a good feature selection method can improve the accuracy of text classification. In this stage, we discuss several feature selection methods for reducing the dimension of the feature space, which is composed of 4-gram features.

Finally, we design and implement a prototype Chinese text categorization system based on Base64 encoding, which consists of a Chinese text preprocessing module, a feature selection module, and a classification and result evaluation module. Experiments on the Fudan University corpus show that the system is both effective and feasible.
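
The preprocessing idea described above (Base64 encoding followed by N-gram tokenization) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names `base64_ngrams` and `term_frequencies` are hypothetical, the input is assumed to be encoded as UTF-8 before Base64 (the abstract does not state which byte encoding was used), and raw term counts stand in for whichever feature weighting method the thesis applies. The default n = 4 follows the 4-gram features mentioned in the abstract.

```python
import base64
from collections import Counter


def base64_ngrams(text, n=4, encoding="utf-8"):
    """Encode text with Base64, then slide a window of n characters over it.

    The byte encoding (UTF-8 here) is an assumption made for this sketch.
    """
    encoded = base64.b64encode(text.encode(encoding)).decode("ascii")
    # Overlapping character n-grams of the Base64 string.
    return [encoded[i:i + n] for i in range(len(encoded) - n + 1)]


def term_frequencies(text, n=4):
    """Count how often each n-gram occurs; a stand-in for feature weighting."""
    return Counter(base64_ngrams(text, n))


if __name__ == "__main__":
    for gram, count in term_frequencies("中文文本分类").most_common(5):
        print(gram, count)
```

No dictionary or word segmenter is involved at any point: the features are produced purely from the encoded character stream, which is the property the abstract contrasts with tokenizer-based preprocessing.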
Keywords/Search Tags: Chinese text categorization, Chinese word segmentation, N-gram, Base64 encoding, Feature selection, Feature weighting