Research On Text Similarity Algorithm Based On Vector Space Model

Posted on: 2016-12-04
Degree: Master
Type: Thesis
Country: China
Candidate: J Tan
GTID: 2208330470452895
Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technology and the arrival of the information revolution, computing the similarity of all kinds of text has become both a research hotspot and a research difficulty. At present, text similarity computation is widely used in text data mining, text classification, information retrieval, information filtering, machine translation, duplicate-text checking, and other fields, and similarity research in these fields is in full swing. This research focuses mainly on improving precision and search speed. Many results have been achieved: mature text representation models (the Boolean Model, the Probability Model, and the Vector Space Model), similarity calculation methods (similarity measures and distance measures), and work on word segmentation and semantics. These technologies are widely applied, but they still have efficiency and performance problems that cannot be ignored. The Vector Space Model, the focus of this thesis, cannot express the order of terms, has a high vector dimension, and has low computational efficiency. All of these problems call for further study and improvement.
In this thesis, we study the basis of text similarity computation. Because the traditional vector space model cannot reflect the different expressive power that feature terms have at different positions in a text, we study an improved model, the Text Segment Vector Space Model. That model processes text segments coherently, which gives non-ideal precision for texts whose structure resembles a table; to address this, the Independent Weighted Text Segment Vector Space Model is presented.
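The abstract does not give the formulas for the segment-based models. As a minimal sketch of the general idea only, the Python below assumes a text is split into named segments (the hypothetical `title` and `body`), builds a term-frequency vector per segment, and combines per-segment cosine similarities with independent weights. The segment names, the weights in `SEGMENT_WEIGHTS`, and the helpers `cosine` and `segment_similarity` are illustrative assumptions, not the thesis's definitions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two sparse term-frequency vectors (dict-like)."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty segment contributes no similarity
    return dot / (norm_a * norm_b)


def segment_similarity(doc_a, doc_b, segment_weights):
    """Weighted sum of per-segment cosine similarities.

    Each segment is scored on its own, so a term in one segment
    never matches the same term in a different segment.
    """
    total = 0.0
    for name, weight in segment_weights.items():
        vec_a = Counter(doc_a.get(name, "").lower().split())
        vec_b = Counter(doc_b.get(name, "").lower().split())
        total += weight * cosine(vec_a, vec_b)
    return total


# Hypothetical weights: the title is assumed to matter more than the body.
SEGMENT_WEIGHTS = {"title": 0.6, "body": 0.4}

doc_1 = {"title": "text similarity", "body": "vector space model for text"}
doc_2 = {"title": "text similarity", "body": "probability model for retrieval"}
score = segment_similarity(doc_1, doc_2, SEGMENT_WEIGHTS)
```

Here the identical titles contribute their full weight, while the bodies share only two terms, so `score` falls between the title weight and 1.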
To address the low computational efficiency that the high vector dimension causes in the traditional vector space model, two nonzero-weight vector space models are proposed: the Nonzero Weights Union Set Vector Space Model and the Nonzero Weights Benchmark Vector Space Model. The two models are suited to different environments. Finally, based on the above theory, a text filtering system was designed and implemented, and the three improved models presented in this thesis were tested on it. The results show that the Independent Weighted Text Segment Vector Space Model is feasible and effective in improving precision and calculation efficiency, and that the two nonzero-weight vector space models are feasible and effective in reducing the dimension of the calculation and improving computational efficiency.
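The abstract does not define the two nonzero-weight models. As a rough illustration of why restricting the computation to nonzero weights helps, the sketch below stores each vector as a dict of nonzero weights and computes cosine similarity over the union of the two documents' nonzero terms, then compares the result with a dense computation over a full vocabulary. The function names, the sample weights, and the vocabulary are assumptions made for this example.

```python
import math


def sparse_cosine(a, b):
    """Cosine similarity using only terms with nonzero weight in a or b."""
    terms = set(a) | set(b)  # union of nonzero terms, not the whole vocabulary
    dot = sum(a.get(t, 0.0) * b.get(t, 0.0) for t in terms)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dense_cosine(x, y):
    """Reference cosine similarity over full-length vectors."""
    dot = sum(p * q for p, q in zip(x, y))
    norm_x = math.sqrt(sum(p * p for p in x))
    norm_y = math.sqrt(sum(q * q for q in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


# Hypothetical vocabulary; most entries are zero for any one document.
VOCAB = ["text", "model", "vector", "space", "boolean",
         "filter", "segment", "weight", "union", "benchmark"]

doc_a = {"text": 2.0, "model": 1.0}
doc_b = {"text": 1.0, "vector": 3.0}

sparse_score = sparse_cosine(doc_a, doc_b)
dense_score = dense_cosine([doc_a.get(t, 0.0) for t in VOCAB],
                           [doc_b.get(t, 0.0) for t in VOCAB])
```

Both scores agree, but the sparse version touches three terms instead of ten; with a realistic vocabulary the gap is far larger.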
Keywords/Search Tags: text similarity, VSM, text segment, independent weight, nonzero weight