Study For Text Categorization Based On Feature Weighting

Posted on: 2008-07-21
Degree: Master
Type: Thesis
Country: China
Candidate: D Y Tai
GTID: 2178360215951583
Subject: Computer application technology
Abstract/Summary:
Automatic text classification is the task of assigning pre-defined category labels to documents. An automatic text categorization system can organize and manage text information effectively, locate information accurately and rapidly, and support information extraction. The essential technologies in text categorization based on the vector space model (VSM) currently include pre-processing, word segmentation, weight computation, feature selection and extraction, and dimensionality reduction. The feature term weighting algorithm used with the VSM is an important factor in text categorization performance. Traditional TF-IDF considers term frequency and inverse document frequency, but ignores how a term is distributed among classes and where it appears in a document. This thesis presents an improved feature term weighting algorithm that takes into account both the distribution of terms among classes and the position of terms in the document.
The main research work is as follows:
(1) The basic concepts and relevant background of text categorization are introduced, together with the research background, the current state of the field, and its existing problems.
(2) The essential technologies in the text categorization process are discussed: pre-processing, Chinese word segmentation, the vector space model, weight computation, feature selection and extraction, dimensionality reduction, and criteria for evaluating text categorization performance.
(3) The theory and characteristics of traditional text categorization algorithms are analyzed.
(4) Based on an analysis of the traditional TF-IDF algorithm, a new improved feature weighting algorithm is presented that takes into account the distribution of terms among classes and the position of terms in the document. Experimental results show that the improved algorithm outperforms the traditional methods in classification precision.
(5) A spam filtering system based on text classification technology is studied.
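The abstract does not give the improved weighting formula, so the following Python sketch is only an illustration of the general idea, not the thesis's algorithm. It starts from traditional TF-IDF and adds two hypothetical factors: a position weight (`TITLE_WEIGHT`, an assumed value that counts title occurrences more heavily than body occurrences) and a class concentration factor (the share of a term's document occurrences that fall in its most frequent class). The document layout (`title`, `body`, `label` keys) and all function names are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

# Assumed position weight: a title occurrence counts twice as much as a
# body occurrence. The value 2.0 is illustrative, not taken from the thesis.
TITLE_WEIGHT = 2.0


def position_weighted_tf(doc):
    """Normalized term frequency where title tokens are up-weighted."""
    tf = Counter()
    for term in doc["title"]:
        tf[term] += TITLE_WEIGHT
    for term in doc["body"]:
        tf[term] += 1.0
    total = sum(tf.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in tf.items()}


def term_weights(docs):
    """Return one {term: weight} dict per document.

    weight = position-weighted TF * IDF * class concentration, where class
    concentration is a hypothetical stand-in for the "distribution among
    classes" idea: max_c df_c(t) / df(t), ranging from 1/k up to 1.
    """
    n_docs = len(docs)
    df = Counter()                   # term -> number of documents containing it
    class_df = defaultdict(Counter)  # term -> {label: number of documents}
    for doc in docs:
        for term in set(doc["title"]) | set(doc["body"]):
            df[term] += 1
            class_df[term][doc["label"]] += 1

    weights = []
    for doc in docs:
        doc_weights = {}
        for term, freq in position_weighted_tf(doc).items():
            idf = math.log(n_docs / df[term])
            concentration = max(class_df[term].values()) / df[term]
            doc_weights[term] = freq * idf * concentration
        weights.append(doc_weights)
    return weights


# Tiny made-up corpus in the spirit of the spam filtering application.
docs = [
    {"title": ["cheap", "pills"], "body": ["buy", "cheap", "pills", "now"], "label": "spam"},
    {"title": ["meeting"], "body": ["project", "meeting", "now"], "label": "ham"},
    {"title": ["cheap", "offer"], "body": ["buy", "now"], "label": "spam"},
]
weights = term_weights(docs)
```

In this toy corpus, a term that occurs in every document (such as "now") receives zero weight through the IDF factor, while a term that appears in a title and is confined to one class (such as "cheap") is weighted above a body-only term from the same class (such as "buy").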
Keywords/Search Tags: Text Categorization, Vector space model, Term weighting, Feature selection, E-mail filter