Research And Implementation Of Chinese Text Categorization

Posted on: 2009-07-23
Degree: Master
Type: Thesis
Country: China
Candidate: W B Chen
Full Text: PDF
GTID: 2178360272476386
Subject: Software engineering
Abstract/Summary:
Automatic text categorization is the assignment of predefined categories to documents based on their content. It is used to distribute news, sort e-mail, and learn users' interests, and it is also applied in search, automatic summarization, and information filtering. To meet the requirements of practical, scalable systems that can process real text, this thesis studies a Chinese text categorization system in the following aspects:

(1) Automatic Chinese word segmentation. Automatic segmentation [...] processing tasks such as information extraction, information retrieval, machine translation, text classification, automatic text summarization, speech recognition, text-to-speech, natural [...] of unknown [...] based on the context and bi-directional [...] in order to reduce ambiguities. Tests showed that this helps Chinese word segmentation.

(2) Feature selection. The vector space model (VSM) is currently the general model for text processing, and every text can be represented by items (terms), so the [...]. Usually all text categories share one common feature set, which raises two issues. First, the threshold on item weight: [...] of items in a text, so only part of the items can be chosen as the feature set, using a threshold on item weight, and selecting a suitable threshold is a problem that must be faced. Second, the representativeness of items: because all categories share one feature set, there must be some [...] not able to represent some text categories, and some [...] represent some text categories but are not chosen into the feature set. In this thesis, every text category has an independent feature set, and the correlation between an item and a text category is taken into account. On the one hand, because every category has its own feature set, all items whose weight is [...] can be chosen, with no need to consider a threshold on item weight. On the other hand, every item in a feature set can represent the text category [...] in the related feature set.

(3) The weighting scheme. tfc weighting is currently the general weighting scheme, but it does not consider the correlation between an item and a text category. The χ² statistic is therefore introduced into the weighting scheme so that the items in a feature set represent the corresponding category well. In addition, DFI (Document Frequency In category) is introduced into the weighting scheme, so [...] correlation with the text category but occur in only a few documents of the text category.

(4) The categorization algorithm. The algorithm used in this thesis is the item-scoring method. When a document that does not have a category [...] the terms of this document, all terms except stop words are then used to score every category; finally, the category with the maximum score is the document's category.

(5) Experimental results. The experimental results [...] weighting scheme is superior to tfc weighting, and the item-scoring algorithm is suitable for automatic Chinese text categorization.
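
The ideas in (2) to (4) can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and the abstract does not give the exact formulas, so several choices here are assumptions: each category keeps only items that are positively correlated with it, an item's weight is its χ² statistic multiplied by the fraction of the category's documents that contain it (a stand-in for DFI), the tfc component is left out, and the input is assumed to be already segmented into words. The names `chi_square`, `train`, `classify`, and `TOY_DOCS` are hypothetical, and the English tokens stand in for segmented Chinese words.

```python
from collections import Counter, defaultdict


def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 item/category contingency table.

    a: documents in the category that contain the item
    b: documents outside the category that contain the item
    c: documents in the category that lack the item
    d: documents outside the category that lack the item
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def train(docs, stop_words=frozenset()):
    """Build one independent weighted feature set per category.

    docs: list of (tokens, category) pairs, tokens already segmented.
    Returns {category: {item: weight}}.
    """
    n_docs = len(docs)
    cat_size = Counter(cat for _, cat in docs)
    df = Counter()                    # item -> documents containing it
    df_in_cat = defaultdict(Counter)  # category -> item -> documents containing it
    for tokens, cat in docs:
        for item in set(tokens) - set(stop_words):
            df[item] += 1
            df_in_cat[cat][item] += 1

    model = {}
    for cat, size in cat_size.items():
        weights = {}
        for item, a in df_in_cat[cat].items():
            b = df[item] - a
            c = size - a
            d = n_docs - size - b
            # Assumption: keep only items positively correlated with the category.
            if a * d <= b * c:
                continue
            # Assumption: weight = chi-square * share of category documents
            # containing the item (a simple proxy for DFI).
            weights[item] = chi_square(a, b, c, d) * (a / size)
        model[cat] = weights
    return model


def classify(tokens, model, stop_words=frozenset()):
    """Item scoring: every non-stop-word item adds its weight to each category."""
    scores = {cat: 0.0 for cat in model}
    for item in tokens:
        if item in stop_words:
            continue
        for cat, weights in model.items():
            scores[cat] += weights.get(item, 0.0)
    return max(scores, key=scores.get), scores


# Toy corpus (hypothetical); tokens stand in for segmented Chinese words.
TOY_DOCS = [
    (["football", "match", "goal"], "sports"),
    (["basketball", "match", "score"], "sports"),
    (["stock", "market", "rise"], "finance"),
    (["bank", "market", "rate"], "finance"),
]

model = train(TOY_DOCS)
label, scores = classify(["match", "the", "goal"], model, stop_words={"the"})
print(label)
```

Because each category has its own feature set, no global weight threshold is needed in this sketch: an item is kept for a category whenever it is positively correlated with it, and the DFI-like factor lowers the weight of items that appear in only a few of the category's documents.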
Keywords/Search Tags:Text Classification, Chinese word segmentation, VSM, feature selection, weight