Research On The Document Indexing And Classifying Models In Chinese Text Categorization

Posted on: 2005-06-13
Degree: Master
Type: Thesis
Country: China
Candidate: X D Zhou
Full Text: PDF
GTID: 2168360155971757
Subject: Computer Science and Technology
Abstract/Summary:
Automatic text categorization (or classification), the assignment of unstructured text documents to one or more predefined categories based on their content, has attracted growing interest over the last 10 years, owing to the increased availability of machine-readable electronic documents and the ensuing need to organize them. Many techniques and algorithms for automatic text categorization have been devised and proposed. However, there is still much room for improving the effectiveness of these classifiers, and new methods need to be investigated further, especially for Chinese text categorization, on which relatively few studies have been done.
Document indexing is a crucial step for learning algorithms and classifying systems. Before these can be applied, documents must be transformed into suitable formal representations that capture their meaning to some extent. The techniques of document indexing include the indexing strategy (the choice of textual representation unit), the reduction of the high dimensionality of the original feature space, and the methods for calculating index weights. Many existing methods, together with the new methods proposed in this paper, have been examined exhaustively, and some useful and constructive conclusions have been drawn from the experimental results.
In the text categorization research community there are two dominant approaches: rule-based and machine-learning-based. This paper combines them into a single classifier, taking the rule-based learner as a component classifier. A new optimized rule induction algorithm is proposed for automatically generating "strong" decision rules. The experimental results show an 8% performance improvement compared with the single classifying method based on machine learning.
After reviewing the traditional classifying methods, this paper puts forward a new method, which applies n-gram language models to classify Chinese text. Several factors that strongly affect the performance of n-gram models have been investigated, including the order n, different smoothing techniques, the size of the training corpus, and the granularity of the textual representation unit in Chinese: character-based or word-based. In addition, a Rocchio classifier and a Naive Bayes classifier have been constructed for comparison with the n-gram models. The experiments show that n-gram language models achieve better text classification accuracy than the traditional classifying models.
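As a rough illustration of the n-gram approach described above, the following Python sketch trains one character-based n-gram model per category and assigns a document to the category whose model gives it the highest log-probability. It is not the implementation used in the thesis: the class name `CharNgramClassifier`, the add-alpha smoothing, the boundary markers, and the toy training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class CharNgramClassifier:
    """Hypothetical per-category character n-gram language model classifier."""

    START, END = "\x02", "\x03"  # assumed boundary markers, not from the thesis

    def __init__(self, n=2, alpha=1.0):
        self.n = n          # order of the n-gram model
        self.alpha = alpha  # add-alpha (Laplace) smoothing constant
        self.ngram_counts = defaultdict(Counter)    # label -> (context, char) counts
        self.context_counts = defaultdict(Counter)  # label -> context counts
        self.doc_counts = Counter()                 # label -> number of training docs
        self.vocab = set()                          # all predicted characters seen

    def _ngrams(self, text):
        # Yield (context, next_char) pairs over the padded character sequence.
        padded = self.START * (self.n - 1) + text + self.END
        for i in range(len(padded) - self.n + 1):
            yield padded[i:i + self.n - 1], padded[i + self.n - 1]

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            self.doc_counts[label] += 1
            for context, char in self._ngrams(text):
                self.ngram_counts[label][(context, char)] += 1
                self.context_counts[label][context] += 1
                self.vocab.add(char)
        return self

    def log_score(self, text, label):
        # log P(label) + sum of smoothed log P(char | context, label).
        v = len(self.vocab) + 1  # one extra slot reserved for unseen characters
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for context, char in self._ngrams(text):
            num = self.ngram_counts[label][(context, char)] + self.alpha
            den = self.context_counts[label][context] + self.alpha * v
            score += math.log(num / den)
        return score

    def predict(self, text):
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))


# Toy data, invented purely for this example.
train_docs = ["球队赢得比赛", "球员进球得分", "股票市场上涨", "银行降低利率"]
train_labels = ["sports", "sports", "finance", "finance"]

model = CharNgramClassifier(n=2).fit(train_docs, train_labels)
print(model.predict("球队比赛进球"))
```

Working on characters avoids the need for a Chinese word segmenter; a word-based variant would pass a list of segmented tokens in place of the raw string. The order `n` and the smoothing constant `alpha` correspond to two of the factors the abstract says were investigated.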
Keywords/Search Tags: Text Categorization, Document Indexing, Decision Rules Learning, N-gram, Language Model