
Research And Implementation Of Chinese Text Categorization

Posted on: 2008-04-01
Degree: Master
Type: Thesis
Country: China
Candidate: G L Zhang
GTID: 2178360212496727
Subject: Computer software and theory
Abstract/Summary:
Automatic text categorization is the assignment of predefined categories to documents based on their content. It is used for topic indexing of knowledge, distributing news, sorting e-mail, and learning users' interests, and it is also the basis of information retrieval, automatic summarization, and information filtering. To meet the realistic requirements of practical and scalable systems that can process real text, we carry out our research on a Chinese text categorization system in the following aspects:

(1) Automatic Chinese word segmentation. Automatic Chinese word segmentation is a basic research issue for Chinese information processing tasks such as information extraction, information retrieval, machine translation, text classification, automatic text summarization, speech recognition, text-to-speech, and natural language understanding. Although it has been investigated for more than twenty years, it is still a bottleneck for Chinese information processing. The main problems are the recognition of unknown words and the handling of segmentation ambiguities. We put forward a new Chinese word segmentation method based on context and bi-directional matching. We also made substantial improvements to the dictionary and added phrases to it in order to reduce ambiguities. This was verified to be helpful for Chinese word segmentation.

(2) Feature selection. The vector space model (VSM) is currently the general model for text processing. In it, every text is represented by items (terms), so feature selection has a great impact on the result of text categorization. Usually all text categories share a common feature set, but this raises two issues.

First, the threshold of item weighting. Because the items in a text collection have very high dimensionality, only part of them can be chosen as the feature set, by applying a threshold to the item weight. How to select a suitable threshold is an issue that must be faced.

Second, the representativeness of items. Because all text categories share one feature set, some items in the feature set cannot represent certain categories, while some items that can represent certain categories are not chosen into the feature set.

In this paper, we give every text category an independent feature set and take the correlation between an item and a text category into account. On the one hand, because every text category has an independent feature set, we can include every item whose weight is not 0 in the category's feature set, without having to consider a threshold on the item weight. On the other hand, every item in a feature set represents its text category well, and no item that should be in the category's feature set is omitted.

(3) The weighting scheme. The tfc weighting is currently the general weighting scheme, but it does not consider the correlation between an item and a text category. We therefore introduce the χ² statistic into the weighting scheme so that the items in a feature set represent the corresponding category well. In addition, we introduce DFI (Document Frequency In category) into the weighting scheme in order to filter out items that correlate strongly with a text category but occur in only a few of that category's documents.

(4) The text categorization algorithm. In this paper, the text categorization algorithm is the item-scoring method. Given a document without a category label, we extract its terms and then use all of them except stop words to score every category's feature set. Finally, the category with the maximum score is assigned to the document.

(5) Experimental results. The experimental results indicate that the improved weighting scheme is superior to tfc weighting and that the item-scoring algorithm is suitable for automatic Chinese text categorization.
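
The bi-directional matching mentioned in part (1) can be illustrated with a minimal sketch. The abstract does not describe the thesis's actual procedure or how it uses context, so the code below shows only the generic, dictionary-based bi-directional maximum matching technique. The function names, the toy dictionary, and the tie-breaking rule (fewer words, then fewer single-character words, then the backward result) are assumptions made for illustration.

```python
def forward_max_match(text, dictionary, max_len):
    """Scan left to right, taking the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    words, end = [], len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            if size == 1 or piece in dictionary:
                words.append(piece)
                end -= size
                break
    words.reverse()
    return words


def bidirectional_match(text, dictionary):
    """Run both scans and keep the result that looks less fragmented.

    Assumed heuristic: prefer fewer words, then fewer single-character
    words; on a full tie, keep the backward result.
    """
    max_len = max((len(w) for w in dictionary), default=1)
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)

    def cost(words):
        return (len(words), sum(1 for w in words if len(w) == 1))

    return fwd if cost(fwd) < cost(bwd) else bwd


# Toy dictionary (hypothetical) containing an overlapping pair of words.
DICTIONARY = {"研究", "研究生", "生命", "起源"}
SEGMENTS = bidirectional_match("研究生命起源", DICTIONARY)
```

In this toy case the forward scan greedily takes 研究生 and leaves the single character 命, while the backward scan finds 研究 / 生命 / 起源, so the heuristic keeps the backward result. Adding phrases to the dictionary, as the abstract describes, changes which candidates each scan can find.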
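
Parts (2) to (4) can be tied together in a small sketch: per-category feature sets, a weight built from the χ² statistic and DFI, and item scoring. The abstract does not give the thesis's formulas, so several details here are assumptions: the weight is taken as χ²(t, c) multiplied by the fraction of the category's documents that contain the term, the term-frequency part of tfc is left out, only positively correlated terms are kept, and the score of a category is the sum of the weights of the document's terms. The χ² formula itself is the standard 2×2 contingency form. All names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

STOP_WORDS = {"的", "了"}  # hypothetical stop-word list


def chi_square(a, b, c, d):
    """Standard chi-square statistic for a 2x2 term/category table.

    a: docs in the category containing the term
    b: docs outside the category containing the term
    c: docs in the category without the term
    d: docs outside the category without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def build_feature_sets(labeled_docs):
    """Build one weighted feature set per category.

    labeled_docs: list of (category, list_of_terms), already segmented.
    """
    docs_per_cat = Counter(cat for cat, _ in labeled_docs)
    df_in_cat = defaultdict(Counter)  # category -> term -> DFI
    df_total = Counter()              # term -> document frequency overall
    for cat, terms in labeled_docs:
        for term in set(terms) - STOP_WORDS:
            df_in_cat[cat][term] += 1
            df_total[term] += 1

    n_docs = len(labeled_docs)
    feature_sets = {}
    for cat, n_cat in docs_per_cat.items():
        weights = {}
        for term, a in df_in_cat[cat].items():
            b = df_total[term] - a
            c = n_cat - a
            d = n_docs - n_cat - b
            if a * d <= c * b:
                continue  # assumption: drop terms not positively correlated
            # Assumed combination: chi-square damped by relative DFI.
            weight = chi_square(a, b, c, d) * (a / n_cat)
            if weight > 0:  # no threshold other than "non-zero"
                weights[term] = weight
        feature_sets[cat] = weights
    return feature_sets


def classify(terms, feature_sets):
    """Item scoring: sum each category's weights for the document's terms."""
    scores = {
        cat: sum(weights.get(t, 0.0) for t in terms if t not in STOP_WORDS)
        for cat, weights in feature_sets.items()
    }
    return max(scores, key=scores.get)


# Toy, pre-segmented training data (hypothetical).
TRAINING = [
    ("sports", ["足球", "比赛", "球员"]),
    ("sports", ["篮球", "比赛", "冠军"]),
    ("finance", ["股票", "市场", "上涨"]),
    ("finance", ["银行", "市场", "利率"]),
]
FEATURE_SETS = build_feature_sets(TRAINING)
PREDICTED = classify(["比赛", "的", "冠军"], FEATURE_SETS)
```

Because each category keeps its own feature set, no global cut-off on the weight is needed. The relative-DFI factor lowers the weight of a term such as 足球, which appears in only one of the two sports documents, compared with 比赛, which appears in both.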
Keywords/Search Tags: Implementation