Study On Text Representation Model Based On Concept

Posted on: 2007-03-15
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhang
Full Text: PDF
GTID: 2178360212485407
Subject: Software engineering
Abstract/Summary:
In this thesis, we present a concept-based text representation model. The model takes WordNet as its main source of knowledge: each synonym set (synset) in WordNet is treated as a concept with a definite meaning. We describe a text by building a concept vector space in which terms are replaced by their WordNet synsets, and we adjust the weights of the concept vectors according to the hypernymy-hyponymy relations between synsets. This lets us extract high-level information from the text.

We present two concept-based text representation models (TRMC) in this thesis. The first, TRMC-TCA, represents texts for text categorization. The second, TRMC-TCL, represents texts for text clustering. In TRMC-TCA, we adjust the weights of the concept vectors using the category information of the training texts; that is, the inverse category frequency of a concept is one of the factors that determine its weight.

We conduct two groups of experiments to evaluate TRMC-TCA and TRMC-TCL. In Group I, we choose documents from the Reuters Corpus Volume I (RCV1) dataset to form the training and test sets, and compare TRMC-TCA with a term-based text representation model under the same text categorization algorithm. The results show that TRMC-TCA maintains satisfactory precision when the number of training texts is small and, when the number of training texts is large, allows the dimensionality of the concept vector space to be kept small without reducing precision. In Group II, we use the 20Newsgroups dataset to form the test set, and compare TRMC-TCL with a term-based text representation model under the same text clustering algorithm. The results show that TRMC-TCL improves the performance of the agglomerative hierarchical clustering algorithm.
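
The concept vector construction described in the abstract can be illustrated with a minimal sketch. The abstract does not give the weighting formulas, so everything specific here is an assumption made for illustration: the toy `TERM_TO_SYNSET` and `HYPERNYM` tables stand in for WordNet, the `decay` factor for passing weight up to hypernyms is hypothetical, and the inverse category frequency is assumed to be log(number of categories / number of categories containing the concept), by analogy with inverse document frequency.

```python
import math
from collections import Counter

# Hypothetical toy tables standing in for WordNet (not real WordNet data).
TERM_TO_SYNSET = {
    "car": "car.n.01",
    "automobile": "car.n.01",   # synonyms share one concept
    "truck": "truck.n.01",
    "bank": "bank.n.01",
}
HYPERNYM = {
    "car.n.01": "motor_vehicle.n.01",
    "truck.n.01": "motor_vehicle.n.01",
}


def concept_vector(tokens, decay=0.5):
    """Replace terms with synsets, then pass a decayed weight to each hypernym.

    The decay scheme is an assumption; the thesis may adjust weights differently.
    """
    counts = Counter(TERM_TO_SYNSET[t] for t in tokens if t in TERM_TO_SYNSET)
    vector = dict(counts)
    for synset, count in counts.items():
        weight, parent = count * decay, HYPERNYM.get(synset)
        while parent is not None:
            vector[parent] = vector.get(parent, 0.0) + weight
            weight, parent = weight * decay, HYPERNYM.get(parent)
    return vector


def inverse_category_frequency(category_vectors):
    """Assumed form: icf(c) = log(N / number of categories containing c)."""
    n = len(category_vectors)
    seen = Counter(c for vec in category_vectors.values() for c in vec)
    return {c: math.log(n / k) for c, k in seen.items()}


doc_vector = concept_vector(["car", "automobile", "truck"])
icf = inverse_category_frequency({
    "autos": concept_vector(["car", "truck"]),
    "finance": concept_vector(["bank", "truck"]),
})
```

In this sketch, "car" and "automobile" collapse into one dimension, and the shared hypernym receives weight from both "car" and "truck", which is one way the representation could capture higher-level information. A concept that occurs in every category gets an inverse category frequency of zero, so it contributes nothing to distinguishing categories.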
Keywords/Search Tags: Text Representation, WordNet, Concept Vector Space