Research On Text Similarity Algorithm Based On Vector Space Model

Posted on:2016-12-04

Degree:Master

Type:Thesis

Country:China

Candidate:J Tan

GTID:2208330470452895

Subject:Computer application technology

Abstract/Summary:

With the rapid development of Internet technology and the arrival of the information revolution, computing the similarity of all kinds of text has become both a research hotspot and a difficult problem. Text similarity computation is now widely used in text data mining, text classification, information retrieval, information filtering, machine translation, duplicate-text checking, and related fields, and research on text similarity in these fields is very active. This research aims mainly at improving precision and search speed. Many results have been achieved, including mature text representation models (the Boolean Model, the Probability Model, and the Vector Space Model), similarity calculation methods (similarity measures and distance measures), and work on word segmentation and semantics. These technologies are widely applied, but they still have efficiency and performance problems that cannot be ignored. The Vector Space Model, which is the focus of this paper, cannot represent the order of terms, produces high-dimensional vectors, and has low computational efficiency. All of these problems call for further study and improvement.

In this paper, we study the foundations of text similarity computation. Because the traditional vector space model cannot reflect the different expressive power that feature terms have at different positions in a text, we study an improved model, the Text Segment Vector Space Model. When this model computes similarity for texts with a table-like structure, it treats the text segments as coherent with one another, which leads to unsatisfactory precision. To address this problem, the Independent Weighted Text Segment Vector Space Model is presented.

To address the low computational efficiency that high vector dimension causes in the traditional vector space model, we propose two nonzero-weight vector space models: the Nonzero Weights Union Set Vector Space Model and the Nonzero Weights Benchmark Vector Space Model. The two models are suited to different environments. Finally, based on the above theory, a text filtering system was designed and implemented. Using this system, we carried out experiments on the three improved models presented in this paper. The results show that the Independent Weighted Text Segment Vector Space Model is feasible and effective in improving precision and calculation efficiency, and that the two nonzero-weight vector space models are feasible and effective in reducing the dimension of the calculation and improving computational efficiency.
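The two ideas in the abstract can be illustrated with a minimal Python sketch. It does not reproduce the thesis's actual models, whose definitions are not given here. The sketch assumes plain term-frequency weights, cosine similarity as the similarity measure, and a simple weighted average of per-segment similarities. The names `term_weights`, `sparse_cosine`, and `segment_similarity` are hypothetical. Storing only nonzero weights in a dictionary keeps the computation on the terms that actually occur in the texts, rather than on the full vocabulary dimension.

```python
import math
from collections import Counter


def term_weights(text):
    """Return term-frequency weights for whitespace-separated tokens.

    Assumption: plain term frequency; the thesis may use another weighting.
    """
    return Counter(text.lower().split())


def sparse_cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight}.

    Only nonzero weights are stored, so the cost depends on the number of
    terms present in the two texts, not on the vocabulary size.
    """
    if not a or not b:
        return 0.0
    # Iterate over the smaller vector when computing the dot product.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)


def segment_similarity(doc_a, doc_b, segment_weights):
    """Weighted average of per-segment cosine similarities.

    Each document is a {segment_name: text} mapping. Segments are compared
    independently of one another (an assumption made for this sketch).
    """
    total = sum(segment_weights.values())
    if total == 0:
        return 0.0
    score = 0.0
    for name, weight in segment_weights.items():
        seg_a = term_weights(doc_a.get(name, ""))
        seg_b = term_weights(doc_b.get(name, ""))
        score += weight * sparse_cosine(seg_a, seg_b)
    return score / total


if __name__ == "__main__":
    doc_a = {"title": "vector space model", "body": "text similarity"}
    doc_b = {"title": "vector space model", "body": "image retrieval"}
    # Hypothetical segment weights: the title counts twice as much as the body.
    print(segment_similarity(doc_a, doc_b, {"title": 2.0, "body": 1.0}))
```

In this sketch, two documents with identical titles and unrelated bodies score about two thirds, because the title segment carries twice the weight of the body. A table-like text would map each field to its own segment in the same way.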

Related items

1	Research On The Calculation Method Of Similarity Based On The Fusion Of Tibetan Text Segment
2	Research On Chinese Text Similarity Detection Technology Based On Word Weight Analysis
3	Research On Semantic Similarity And Feature Weight Relation In Text Classification
4	Research Of Chinese Text Classification Based On KNN
5	A Bad Text Filtering Method
6	Research On Web Text Mining
7	Research And Application On Text Similarity Algorithm Based On Semantics
8	Study Of Chinese Text Classification
9	Automatic Summarization Algorithm For Chinese Short Text
10	Research Of Weight Algorithm In KNN Text Classification