
An Algorithm Research On Text Classification Based On Vector Space Model

Posted on: 2012-12-16
Degree: Master
Type: Thesis
Country: China
Candidate: Z F Zhang
GTID: 2248330395462353
Subject: Computer application technology
Abstract/Summary:
With the rapid growth of online information, acquiring useful information quickly and accurately from vast text resources has become a key problem in information processing. Text classification, a key technology for processing and organizing large-scale data, can largely resolve the confusion in how information is categorized and helps locate and separate information more accurately and efficiently. It already plays an indispensable role in information retrieval, automatic question answering, and other fields, and has quickly become a research focus in related fields. Statistical text similarity algorithms, with the VSM (Vector Space Model) as a typical example, are widely used because they are simple to compute and efficient to implement. However, with the rapid development of network technology and the sharp increase in the quantity of information resources, text types and text complexity have changed dramatically, which increasingly exposes the defects of the traditional VSM algorithm in text categorization.

Drawing on the application of the HowNet semantic dictionary and LSI in text classification, this paper first compares and analyzes in depth the defects of the traditional VSM algorithm in the classification process. The traditional VSM builds its vector space from word forms, without examining the semantic information of each entry, and ignores the diversity and uncertainty of word forms that share the same meaning. For these reasons, the traditional VSM has poor accuracy in text classification. In addition, because the vector space is built over the massive number of entries in the text library, its dimensionality becomes so large that the traditional VSM is inefficient.
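The word-form limitation described above can be illustrated with a minimal sketch of a VSM similarity computation. It assumes raw term-frequency weights and cosine similarity, which the abstract does not specify; the function names `tf_vector` and `cosine_similarity` and the toy documents are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def tf_vector(tokens):
    """Map a tokenized text to a sparse term-frequency vector (term -> count)."""
    return Counter(tokens)


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * vec_b[term] for term, weight in vec_a.items() if term in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two toy documents with the same meaning but different word forms (assumed data).
doc_a = tf_vector(["buy", "car", "cheap"])
doc_b = tf_vector(["purchase", "automobile", "cheap"])

# Only "cheap" matches by word form, so the similarity stays low
# even though the two texts are near-synonymous.
sim = cosine_similarity(doc_a, doc_b)
```

Because the vectors are keyed by surface form, "buy"/"purchase" and "car"/"automobile" occupy separate dimensions, so they contribute nothing to the similarity and each adds to the dimensionality of the space.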
Therefore, the paper semantically expands the feature vectors in the VSM using the semantic hierarchy in HowNet, based on the IS-A relations in that hierarchy. Each feature entry is expanded with a set of entries that share relevant semantic characteristics, and each expanded entry is given an appropriate weight. In addition, using the vocabulary similarity formula in HowNet, we group synonymous entries into synonym sets, introduce the concept of a "flag word", and replace each entry in a set with the set's flag word. These two stages, semantic expansion and the construction of synonym sets, achieve the semantic reconstruction of the VSM feature entries, which in turn gives higher precision when the VSM computes similarity between feature vectors.

To examine the defects of the traditional VSM in text categorization, this paper built a data set from a large amount of text resources belonging to different fields and measured experimentally the difference between the traditional VSM and the improved algorithm in recall and precision. The experimental results show that the improved algorithm outperforms the traditional VSM in classification accuracy and, to some extent, in efficiency. Finally, the paper summarizes the thesis and outlines future work, noting the shortcomings of the improved algorithm in areas such as deduplication and disambiguation of feature entries, and points out the problems in VSM text classification based on semantic features that still need to be studied and improved.
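The semantic reconstruction described in this abstract (flag-word replacement plus weighted IS-A expansion) might look roughly like the sketch below. Everything concrete in it is an assumption made for illustration: the synonym sets, the IS-A parents, the expansion weight of 0.5, the order of the two steps, and the function name `reconstruct`. The thesis derives these from HowNet, which this sketch does not query.

```python
from collections import Counter

# Hypothetical synonym sets keyed by their flag word. In the thesis these
# would come from HowNet's vocabulary similarity formula.
SYNONYM_SETS = {
    "car": {"car", "automobile"},
    "buy": {"buy", "purchase"},
}

# Hypothetical IS-A parents standing in for HowNet's semantic hierarchy.
IS_A_PARENT = {"car": "vehicle", "bus": "vehicle"}

# Assumed weight for an expanded (parent) entry; the abstract only says
# each expanded entry receives "an appropriate weight".
PARENT_WEIGHT = 0.5

# Reverse index: every member of a synonym set maps to that set's flag word.
FLAG_OF = {
    member: flag
    for flag, members in SYNONYM_SETS.items()
    for member in members
}


def reconstruct(tokens):
    """Build a feature vector after flag-word replacement and IS-A expansion."""
    vector = Counter()
    for token in tokens:
        # Step 1: collapse synonyms onto a single flag word.
        flag = FLAG_OF.get(token, token)
        vector[flag] += 1.0
        # Step 2: add the IS-A parent as an extra, lower-weighted feature.
        parent = IS_A_PARENT.get(flag)
        if parent is not None:
            vector[parent] += PARENT_WEIGHT
    return dict(vector)


vec_a = reconstruct(["buy", "car", "cheap"])
vec_b = reconstruct(["purchase", "automobile", "cheap"])
```

Under these assumed tables, the two toy texts map to identical vectors, so a similarity measure over them no longer penalizes differing word forms, and synonyms share one dimension instead of several.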
Keywords/Search Tags:VSM, HowNet, Text Classification, Semantic