An Algorithm Research On Text Classification Based On Vector Space Model

Posted on:2012-12-16

Degree:Master

Type:Thesis

Country:China

Candidate:Z F Zhang

Full Text:PDF

GTID:2248330395462353

Subject:Computer application technology

Abstract/Summary:

With the rapid growth of online information, acquiring useful information quickly and accurately from vast text resources has become a key problem in information processing. Text classification, a key technology for processing and organizing large-scale data, can largely resolve the disorder in how information is categorized, and it helps locate and separate information more accurately and efficiently. Text classification now plays an indispensable role in information retrieval, automatic question answering, and other fields, and has quickly become a research focus in related areas. Text similarity algorithms based on data statistics, of which the VSM (Vector Space Model) is a representative example, are widely used because they are simple to compute and efficient to implement. However, with the rapid development of network technology and the sharp increase in the quantity of information resources, text types and text complexity have changed dramatically, which increasingly exposes the defects of the traditional VSM algorithm in text categorization. Drawing on the application of the HowNet semantic dictionary and LSI in text classification, this thesis first compares and analyzes in depth the defects of the traditional VSM algorithm in the classification process. The traditional VSM builds its vector space from word forms alone: it does not examine the semantic information of terms, and it ignores the fact that the same meaning can appear under diverse and uncertain word forms. For these reasons, the traditional VSM has poor accuracy in text classification. In addition, because the vector space is built over the massive number of terms in the text library, its dimensionality becomes so large that the traditional VSM is inefficient.
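The word-form defect described above can be illustrated with a minimal sketch of a word-form VSM. The sketch assumes raw term-frequency weights and already-tokenized English documents. The abstract does not state the weighting scheme, and the example documents and the names `tf_vector` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def tf_vector(tokens):
    """Map each distinct word form to its raw term frequency (assumed weighting)."""
    return Counter(tokens)


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * vec_b[t] for t, w in vec_a.items() if t in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents: doc_b and doc_c paraphrase doc_a with other word forms.
doc_a = ["computer", "price", "rises"]
doc_b = ["pc", "cost", "rises"]
doc_c = ["laptop", "fee", "increases"]

# Only the shared word form "rises" contributes to the similarity.
sim_ab = cosine_similarity(tf_vector(doc_a), tf_vector(doc_b))
# No word form is shared, so the similarity is zero despite the related meaning.
sim_ac = cosine_similarity(tf_vector(doc_a), tf_vector(doc_c))
```

Because every distinct word form gets its own dimension, synonyms never overlap, and the number of dimensions grows with the vocabulary of the whole text library.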
Therefore, this thesis semantically expands the feature vectors in the VSM using the semantic hierarchy in HowNet, based on the IS-A relations in that hierarchy. Each feature term is expanded with a set of terms that share relevant semantic characteristics, and each is given an appropriate weight. In addition, using the vocabulary similarity formula in HowNet, we build synonym sets from synonymous terms, introduce the concept of a "flag word", and replace each term in a set with that set's flag word. These two stages, semantic expansion and synonym-set construction, together reconstruct the semantics of the VSM feature terms, and VSM similarity calculation over the reconstructed feature vectors becomes more precise. To examine the defects of the traditional VSM in text categorization, the thesis builds a data set from a large amount of text resources belonging to different fields and measures experimentally how the traditional VSM and the improved algorithm differ in recall and precision. The experimental results show that the improved algorithm is better than the traditional VSM, to some extent, in both classification accuracy and efficiency. Finally, the thesis summarizes the work and, with respect to the shortcomings of the improved algorithm in areas such as deduplication and disambiguation of feature terms, outlines future work and points out the problems in semantic-feature-based VSM text classification that still need further study and improvement.
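The two reconstruction stages described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the thesis's implementation. The `IS_A` table and `SYNONYM_SETS` are hypothetical toy data standing in for HowNet. The expansion weight `PARENT_WEIGHT`, the rule for choosing a flag word, and the order of the two stages are all assumptions that the abstract does not specify.

```python
import math
from collections import Counter

# Hypothetical toy data standing in for HowNet (not taken from the thesis).
IS_A = {"computer": "machine", "car": "machine"}  # term -> IS-A parent
SYNONYM_SETS = [{"computer", "pc"}, {"price", "cost"}]
PARENT_WEIGHT = 0.5  # assumed weight given to an expanded term


def build_flag_map(synonym_sets):
    """Choose one flag word per synonym set (assumed rule: alphabetically first)."""
    flag_map = {}
    for syn_set in synonym_sets:
        flag = min(syn_set)
        for term in syn_set:
            flag_map[term] = flag
    return flag_map


def reconstruct(tokens, flag_map, is_a, parent_weight):
    """Replace synonyms by their flag word, then expand along IS-A."""
    vector = Counter()
    for token in tokens:
        term = flag_map.get(token, token)
        vector[term] += 1.0
        parent = is_a.get(term)
        if parent is not None:
            vector[parent] += parent_weight
    return vector


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * vec_b[t] for t, w in vec_a.items() if t in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


flag_map = build_flag_map(SYNONYM_SETS)
vec_a = reconstruct(["computer", "price", "rises"], flag_map, IS_A, PARENT_WEIGHT)
vec_b = reconstruct(["pc", "cost", "rises"], flag_map, IS_A, PARENT_WEIGHT)
vec_c = reconstruct(["car", "rises"], flag_map, IS_A, PARENT_WEIGHT)

sim_ab = cosine_similarity(vec_a, vec_b)  # synonyms now share dimensions
sim_ac = cosine_similarity(vec_a, vec_c)  # shared IS-A parent adds overlap
```

In this toy setup the two paraphrased documents map to identical vectors after flag-word replacement, and documents about different machines gain some overlap through their shared IS-A parent. Collapsing each synonym set to one flag word also reduces the number of dimensions.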

Keywords/Search Tags:

VSM, HowNet, Text Classification, Semantic

Related items

1	An Algorithm Research On Text Classification Based On Vector Space Model
2	Design And Implementation Of The Computer-aided Secret-level Classification System Based On Text Semantic Similarity
3	The Research On Conducting Chemical Domain Text Classifier Based On Hownet
4	The Study Of Short Text Classification Algorithm Based On Semantic
5	Research On Semantic Similarity And Feature Weight Relation In Text Classification
6	Research On Deep Learning Text Classification Method Based On HowNet
7	The Research Of HowNet Based Word Similarity Computation And Its Application
8	The Research And Realization Of Text Retrieval Technology Based On Semantic Field
9	A Method Of Chinese Text Classification Based On The Expansion Of VSM
10	Research Of Web Text Clustering Based On Semantic