A Text Feature Weight Calculation Method Based On Meaning Information Gain

Posted on:2005-01-29

Degree:Master

Type:Thesis

Country:China

Candidate:Y X Bei

GTID:2208360125467133

Subject:Computer application technology

Abstract/Summary:

With rapid technological change, and especially the rapid development of the Internet, the volume of information of all kinds is growing explosively. People can quickly obtain large amounts of text data in many ways, but they then face an unavoidable and important task: managing that data scientifically and effectively. Text categorization is a very common method for managing text data, and it is the basis for more detailed text processing.

In the past, text categorization was done manually: people read through all the articles and then filed them into classes according to their own judgment. This requires many experienced classifiers with specialized knowledge and a great deal of work. Setting aside differences in individual judgment, the advantage of manual categorization is its high accuracy. On the other hand, it is slow, expensive, and inefficient, and it cannot meet today's practical needs. Automatic text categorization by computer has therefore become a research hotspot in modern information processing.

The main method currently used in text categorization is the Vector Space Model (VSM). Its idea is to divide a text into feature items composed of words or characters and then represent the text as a point in the vector space spanned by those feature items. The similarity between texts can be measured by the angle between their vectors.

The core of VSM is the calculation of feature item weights, which affects categorization performance. TFIDF is a feature item weighting method that is widely applied in text categorization and gives good results. Its disadvantage is that it cannot capture the distribution ratio of feature items across the text collection, which affects the final categorization results.

To weigh the distribution ratio of feature items among texts, this paper uses meaning information gain to improve TFIDF and proposes a new feature item weight calculation method, M-TFIDF (Modified TFIDF). M-TFIDF takes into account not only how feature items are distributed among texts but also their distribution ratio, so that the weights computed for a text express its content well. Experiments show that M-TFIDF performs better than the original TFIDF: it improves categorization performance and is valid and feasible.
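
To make the VSM and TFIDF ideas in the abstract concrete, here is a minimal Python sketch. It assumes the common textbook form of TFIDF (relative term frequency multiplied by the logarithm of the inverse document frequency) and measures similarity as the cosine of the angle between two weight vectors. It does not implement M-TFIDF, whose formula the abstract does not give. The function names and the toy documents are illustrative assumptions, not taken from the thesis.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {feature item: weight} dict per tokenized document.

    Assumed weighting: (count / document length) * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each feature item.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy corpus (hypothetical): two finance texts and one sports text.
docs = [
    "stock market prices rise".split(),
    "stock market prices fall".split(),
    "football team wins match".split(),
]
vecs = tfidf_vectors(docs)
sim_finance = cosine_similarity(vecs[0], vecs[1])  # texts sharing feature items
sim_mixed = cosine_similarity(vecs[0], vecs[2])    # texts sharing none
```

In this toy corpus the two finance texts share three feature items, so their similarity is greater than zero, while the finance and sports texts share none and their similarity is zero. Note that the inverse document frequency term only counts how many texts contain a feature item; it does not reflect how the item is spread across those texts, which is the limitation the abstract attributes to TFIDF.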

Keywords/Search Tags:

text categorization, Vector Space Model, feature item weight, meaning information gain, weighted entropy

Related items

1	Research On Chinese Text Categorization Algorithms Based On Technology Text
2	The Research And Implementation Of Chinese Text Categorization System
3	Design And Realization Of Text Categorization System
4	The Research And Implementation Of Chinese Text Categorization
5	Research Of Text Categorization Based On Vector Space Model
6	Modeling And Implementation Of Chinese Text Categorization System Based On SVM
7	Research Of Text Categorization Base On Vector Space Model And Association Rules
8	Text Classification Technology And Applied Research
9	Research Of Chinese Text Categorization Algorithms Based On Information Entropy
10	Research On Classification Module Of Core Competency Assessment System