Research And Implementation Of Chinese Text Categorization

Posted on:2009-07-23

Degree:Master

Type:Thesis

Country:China

Candidate:W B Chen

GTID:2178360272476386

Subject:Software engineering

Abstract/Summary:

Automatic text categorization is the assignment of predefined categories to documents based on their content. It is used to distribute news, sort e-mail, and learn users' interests; meanwhile it is [...] search, automatic summarization, and information filtering too. To meet the requirements of practical, scalable systems that can process real text, we study a Chinese text categorization system in the following aspects:

(1) Automatic Chinese word segmentation. The automatic [...]ation processing tasks such as information extraction, information retrieval, machine translation, text classification, automatic text summarization, speech recognition, text-to-speech, natur[...]ization of unknown [...]on that is based on the context and Bi-dir[...] in order to reduce ambiguity. Tests confirmed that this is helpful for Chinese word segmentation.

(2) Feature selection. The vector space model (VSM) is currently the general model for text processing, and every text can be represented by items, so the [...] As usual, all text categories share a common feature set, but this raises two issues. First, the threshold of item weighting. B[...]ns of items in a text, so we must choose a subset of the items as the feature set by thresholding the item weights, but how to select a suitable threshold is an issue we must face. Second, the representativeness of items. Because all text categories share a common feature set, there must be som[...] not be able to represent some text categories, and so[...] represent some text categories but could not be chosen for the feature set. In this paper, we give every text category an independent feature set and take the correlation between items and text categories into account. On the one hand, because every text category has an independent feature set, we can choose all the items whose weighting is [...] and need not consider the threshold of item weighting. On the other hand, every item in the feature set can represent the text category [...]e in relation feature set.

(3) The weighting scheme. tfc-weighting is currently the general weighting scheme, but it does not consider the correlation between an item and a text category, so we introduced the χ² statistic into the weighting scheme so that the items in a feature set represent the corresponding category well. In addition, we introduced DFI (Document Frequency In category) into the weighting scheme, so [...]ativity with the text category but only occur in a few documents in the text category.

(4) The text categorization algorithm. In this paper, the text categorization algorithm is the item-scoring method. When a document that does not have cat[...] the terms of this document, and then use all the terms except stop words to score every ca[...]. At last, the category with the maximum score is taken as the document's category.

(5) Experimental results. The experimental resu[...]g scheme is superior to tfc-weighting, and the item-scoring algorithm is suitable for automatic Chinese text categorization.
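
The abstract does not give the exact formulas, so the following is only a minimal sketch of how per-category feature sets, a χ²-based weight, and item scoring could fit together. It assumes documents are already segmented into tokens. The function names (`chi_square`, `train`, `classify`), the rule of keeping only positively associated terms, and the way χ² is multiplied by the in-category document frequency ratio are illustrative assumptions, not the thesis's actual weighting scheme.

```python
from collections import Counter, defaultdict


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/category contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def train(docs, stop_words=frozenset()):
    """Build one weighted feature set per category.

    docs is a list of (tokens, category) pairs; the result maps
    category -> {term: weight}.
    """
    n_docs = len(docs)
    cat_size = Counter(cat for _, cat in docs)
    df = Counter()                    # document frequency over the whole corpus
    df_in_cat = defaultdict(Counter)  # document frequency inside each category
    for tokens, cat in docs:
        for term in set(tokens) - stop_words:
            df[term] += 1
            df_in_cat[cat][term] += 1

    model = {}
    for cat, size in cat_size.items():
        weights = {}
        for term, a in df_in_cat[cat].items():
            b = df[term] - a
            c = size - a
            d = n_docs - size - b
            # Assumption: keep only terms positively associated with the category,
            # so no global weight threshold is needed.
            if a * d <= b * c:
                continue
            # Assumption: scale chi-square by the share of the category's
            # documents that contain the term (a stand-in for DFI).
            weights[term] = chi_square(a, b, c, d) * (a / size)
        model[cat] = weights
    return model


def classify(tokens, model, stop_words=frozenset()):
    """Score every category by summing term weights; return the best one."""
    scores = {
        cat: sum(weights.get(t, 0.0) for t in tokens if t not in stop_words)
        for cat, weights in model.items()
    }
    return max(scores, key=scores.get)
```

With a training list such as `[(["stock", "market"], "finance"), (["match", "goal"], "sports")]`, `train` returns the per-category weights and `classify(tokens, model)` returns the highest-scoring category for a new token list.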

Keywords/Search Tags:

Text Classification, Chinese word segmentation, VSM, feature selection, weight

Related items

1	Research And Implement On The Related Algorithms Of Chinese Text Classification
2	Research On Word Segmentation And Feature Selection Of Chinese Text Chinese Text Classification
3	Research On Core Technology Of The Chinese Text Classification
4	The Research And Implementation Of Chinese Text Classification System
5	Research On Network Text Classification Technique
6	Research And Implementation Of Chinese Automatic Text Classification System Based On SVM
7	Research And Application Of Internet Chinese Text Classification
8	Research On Chinese Text Categorization Algorithms Based On Technology Text
9	Design And Implementation Of Web Automatic Text Categorization
10	Chinese Text Data Classification