
A Method Of Chinese Text Classification Based On The Expansion Of VSM

Posted on: 2011-08-09
Degree: Master
Type: Thesis
Country: China
Candidate: Z Q Jing
GTID: 2178330332460435
Subject: Signal and Information Processing
Abstract/Summary:
As the Internet develops rapidly, text, its main resource, is growing quickly. How to organize and manage information effectively, and how to find useful information quickly, accurately and comprehensively, are important issues in information science and technology. Text classification, a key technology for organizing and processing text data, can solve these problems to a large extent and help people locate and sort information accurately and efficiently. It therefore has broad application prospects.

The most widely used model for automatic text categorization is the vector space model (VSM). Characteristic words are usually used as the features from which the vector space model is built. Early studies relied on knowledge engineering methods, in which feature items were determined by hand-written rules. With the development of statistical machine learning theory and statistical natural language processing, machine learning methods have been applied to determine the feature items and have achieved good results. However, because of limits on training corpus resources and training time, machine learning has limitations: many feature items that contribute to topic determination cannot be obtained through conventional machine learning methods. Text classification does not achieve satisfactory results with a vector space model built from such feature vectors, so the vector space model needs to be reconstructed.

In this paper, a method of Chinese text classification based on the expansion of VSM is proposed. The features of each type of text are analyzed, and then, with the help of Hownet, the sememes most closely related to the theme are extracted. These sememes are used to expand the feature items. Combined with a synonym table, a feature expansion set is generated, and each expansion term is given a weight that reflects its descriptive power. Finally, the expansion set is used to classify texts.
This thesis focuses on how to extract features, how to assign appropriate weights to expansion terms, and how to construct the new VSM. Experimental results show that the method increases the number of effective features, so that both classification accuracy and stability are improved. Finally, the thesis is summarized and an outlook is given, pointing out what needs further research and improvement.
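The idea of expanding a term vector with weighted related terms can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the `EXPANSIONS` table is a hypothetical stand-in for sememes and synonyms that would come from Hownet and a synonym table, the fixed `EXPANSION_WEIGHT` of 0.5 is an assumed value, the tokens are toy English words standing in for already-segmented Chinese text, and cosine similarity against one vector per class is an assumed classifier.

```python
import math
from collections import Counter

# Hypothetical expansion table: feature term -> related terms.
# In the thesis these would be sememes and synonyms; here they are toy data.
EXPANSIONS = {
    "football": ["soccer", "sport"],
    "stock": ["share", "finance"],
}

# Assumed weight for expansion terms relative to an original term (weight 1.0).
EXPANSION_WEIGHT = 0.5


def expand_vector(tokens, expansions=EXPANSIONS, weight=EXPANSION_WEIGHT):
    """Build a term-frequency vector and add down-weighted expansion terms."""
    counts = Counter(tokens)
    expanded = {term: float(tf) for term, tf in counts.items()}
    for term, tf in counts.items():
        for related in expansions.get(term, []):
            expanded[related] = expanded.get(related, 0.0) + weight * tf
    return expanded


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(tokens, class_vectors):
    """Return the class whose vector is most similar to the expanded document."""
    doc = expand_vector(tokens)
    return max(class_vectors, key=lambda label: cosine(doc, class_vectors[label]))


# One expanded vector per class, built from toy training tokens.
class_vectors = {
    "sports": expand_vector(["football", "match"]),
    "finance": expand_vector(["stock", "market"]),
}
label = classify(["soccer", "game"], class_vectors)
```

In this toy setup the document shares no term with the unexpanded "sports" tokens, yet it still matches that class because "soccer" was added to the class vector as an expansion of "football". This is the sense in which expansion increases the number of effective features.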
Keywords/Search Tags: text classification, VSM, Hownet, sememe