Dimension Reduction Method Research In Text Classification

Posted on: 2006-12-23
Degree: Master
Type: Thesis
Country: China
Candidate: H F Chen
Full Text: PDF
GTID: 2178360212982669
Subject: Computer software and theory
Abstract/Summary:
The rapid growth of information on the web makes it hard for people to find what they want. Text classification (TC), a technology for organizing and managing information, was introduced to bring order to information retrieval. Compared with manual classification, however, automatic text classification faces several problems, chiefly:

1) The vector space model (VSM) used to represent text has such a high dimension that a classification algorithm applied to it has difficulty distinguishing between classes.
2) The training text set must cover all the feature words in the vector space model, or the trained classifier may be biased. Covering every feature word is difficult when the vector has so many dimensions.

Dimension reduction (DR) was proposed to address these two problems, and DR methods have received wide attention and study in recent years. Building on prior work, this thesis studies in depth a DR method based on concept statistics.

First, the basic concepts and background of TC are summarized, and the representational effectiveness of the VSM and the factors through which it affects classification are analyzed. The necessity of DR and the basic ideas behind it are discussed. Based on an analysis of the local properties of feature words, the advantages of local DR are discussed. Existing DR algorithms are analyzed and their merits and drawbacks summarized, and the principles and main methods of term selection and term extraction are discussed. On this basis, the limitations of statistics based on word form are discussed, the advantage of introducing concepts is explained, and the hierarchical relations among concepts are analyzed.

Drawing on this analysis of existing DR techniques together with concept analysis, a DR algorithm based on concept statistics is proposed. The algorithm is then refined to address problems found in experiments.
The process is as follows: extract the original feature words from the training text set, remove stop words, disambiguate word senses, and perform DR based on concept statistics within a local domain, using the hierarchical relations among concepts. Compared with the vector before dimension reduction, the resulting vector has fewer low-frequency feature words and more high-frequency ones, the high frequencies are reinforced, the number of feature words is reduced, and the dimension is lowered. The vector is more strongly associated with the class it belongs to, represents that class better, and in particular has less redundancy and noise than before. The goal of DR is therefore well achieved.

Following a detailed explanation of the algorithm, its validity and feasibility are evaluated by experiment. The experimental results are analyzed, and the reasons for the frequency distribution of the various feature words are discussed. The likely outcome of using the reduced vector in actual classification is also predicted. In addition, the policy for accepting or rejecting terms during DR, such as threshold selection and its basis, is studied further. The experimental results show that the threshold selected in this thesis meets practical needs.
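
The steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis's algorithm: the stop-word list, the tiny `HYPERNYM` concept hierarchy (standing in for a resource such as WordNet, which the keywords mention), the function names, and the rule of folding any term below a frequency threshold into its parent concept are all assumptions made for illustration. Word-sense disambiguation is omitted.

```python
from collections import Counter

# Hypothetical stop-word list (assumption, not from the thesis).
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "on", "to"}

# Hypothetical concept hierarchy: each term maps to its parent concept.
HYPERNYM = {
    "poodle": "dog",
    "terrier": "dog",
    "dog": "animal",
    "cat": "animal",
    "sedan": "car",
    "car": "vehicle",
}


def term_counts(docs):
    """Count feature words over the training texts, skipping stop words."""
    counts = Counter()
    for doc in docs:
        for token in doc.lower().split():
            if token not in STOP_WORDS:
                counts[token] += 1
    return counts


def reduce_by_concept(counts, threshold):
    """Fold terms whose count is below `threshold` into their parent concept.

    Passes repeat until no term can be folded further. The result depends on
    the order in which terms are visited (insertion order of `counts`).
    """
    reduced = Counter(counts)
    changed = True
    while changed:
        changed = False
        for term in list(reduced):
            if reduced[term] < threshold and term in HYPERNYM:
                parent = HYPERNYM[term]
                reduced[parent] += reduced.pop(term)
                changed = True
    return reduced


docs = [
    "the poodle and the terrier",
    "a dog and a cat",
    "the sedan is a car",
    "a car",
]
counts = term_counts(docs)
reduced = reduce_by_concept(counts, threshold=2)
print(len(counts), "dimensions before,", len(reduced), "after")
```

In this toy run, low-frequency terms such as "poodle" and "sedan" are merged into more general concepts, so the vector ends up with fewer dimensions and higher counts on the terms that remain. A term that stays below the threshold with no parent left to fold into would still need an accept-or-reject decision, which is where the threshold policy discussed above comes in.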
Keywords/Search Tags: Text Classification, Vector Space Model, Dimension Reduction, Concept Statistics, WordNet