A Text Feature Weight Calculation Method Based on Meaning Information Gain

Posted on: 2005-01-29; Degree: Master; Type: Thesis
Country: China; Candidate: Y X Bei; Full Text: PDF
GTID: 2208360125467133; Subject: Computer application technology
Abstract/Summary:
With the rapid pace of technological change, and especially the rapid development of the Internet, the volume of information of all kinds is growing explosively. People can quickly obtain large amounts of text data in many ways, but they then face an unavoidable and important task: managing that data scientifically and effectively. Text categorization is a very common method of managing text data, and it is the basis for more detailed text processing.

In the past, text categorization was done manually: people read through all the articles and filed them into classes according to their own judgment. This requires many experienced classification staff with specialized knowledge to do a large amount of work. Apart from differences in individual judgment, the advantage of manual categorization is its high accuracy. On the other hand, it is slow, expensive, and inefficient, and it can hardly meet today's practical needs. Automatic text categorization by computer is therefore a research hotspot in modern information processing.

The main model currently used in text categorization is the Vector Space Model (VSM). Its idea is to split a text into feature items consisting of words or characters, and then to represent the text as a point in the vector space spanned by those feature items. The similarity between texts can be measured by the angle between their vectors.

The core of VSM is the calculation of feature item weights, which affects categorization performance. TFIDF is a feature item weight calculation method that is widely applied in text categorization and performs well. Its disadvantage is that it cannot capture the distribution ratio of feature items across the text collection, which affects the final categorization results.

To measure the distribution ratio of feature items among texts, this paper uses meaning information gain to improve TFIDF and proposes a new feature item weight calculation method, M-TFIDF (Modified TFIDF). M-TFIDF takes into account not only how feature items are distributed among texts but also their distribution ratio, so that the resulting weights represent the content of a text well. Experiments show that M-TFIDF outperforms the original TFIDF: it improves categorization performance and is both valid and feasible.
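
As an illustration of the baseline the abstract describes, the following minimal Python sketch builds plain TFIDF vectors and compares texts by the cosine of the angle between them. It is not the thesis's M-TFIDF, whose formula the abstract does not give. The weighting variant (relative term frequency times log(N/df)), the function names `tfidf_vectors` and `cosine`, and the toy documents are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per tokenized document.

    Assumed weighting: (count / document length) * log(N / df).
    Other TFIDF variants exist; this one is chosen only for illustration.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is zero)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: two finance-like texts and one sports-like text.
docs = [
    "stock market price rise".split(),
    "stock price fall market".split(),
    "football match goal score".split(),
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
```

In this sketch the weight of a term depends only on how many documents contain it, not on how its occurrences are spread across those documents or across classes. That is the kind of distribution information the abstract says plain TFIDF cannot capture.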
Keywords/Search Tags: text categorization, Vector Space Model, feature item weight, meaning information gain, weighted entropy