
The Research And Implementation Of Text Similarity Computing Based On Topic Model

Posted on: 2013-06-27
Degree: Master
Type: Thesis
Country: China
Candidate: C N Sun
Full Text: PDF
GTID: 2248330371499433
Subject: Computer software and theory
Abstract/Summary:
The Internet has entered the mobile era: not only traditional PCs but also cell phones, tablet PCs, and other mobile devices can access it. Computer information processing has likewise entered the age of big data. Much of this data is text, such as Google search logs, the daily updates on Twitter and other microblogging services, and the content users generate every day on Facebook and Tencent, and its volume is measured in terabytes rather than gigabytes. The central problem is how to analyze data at this scale to support corporate decision-making or improve the user experience. This thesis addresses text similarity computation; its main goal is a robust similarity calculation method that applies to as wide a range of applications as possible. It begins with the vector space model and its problems, and then explores solutions to those problems. The main work is as follows.

First, the thesis briefly introduces the basic principles of the vector space model and the similarity calculation method based on it. In the same way, it briefly introduces the topic model and the topic-model-based similarity calculation method. It then describes in detail the collection (set) interpretation and the algebraic interpretation of the topic model, from which it can be seen that the topic model has a richer mathematical and statistical basis than the vector space model.

Second, the thesis briefly introduces the LSI, pLSI, and LDA models and their parameter estimation methods. Topic models that build on LDA are only just emerging, so the thesis surveys recent research progress on topic models from three aspects: adding new observed variables, adapting the model to the characteristics of the task, and introducing semantic information.

The thesis then describes a word co-occurrence clustering algorithm based on pLSI and models texts in terms of co-occurring phrases, under the assumption that the more co-occurring phrases two texts share, the greater their similarity. Experiments verify that the algorithm is valid.

Finally, the thesis presents a Chinese text modeling method based on the LDA model. Gibbs sampling is used to obtain the representation of each text in topic space, and the JS distance between these representations is used to measure the similarity of the texts. Experiments show that this method performs better than traditional methods based on the vector space model.
Keywords/Search Tags: Text Similarity, VSM, Topic Model, Gibbs Sampling
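The two similarity measures contrasted in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes that "JS distance" refers to the Jensen-Shannon divergence (here with base-2 logarithms, so values lie between 0 and 1), and the function names, the toy documents, and the topic proportions are all hypothetical; in the thesis the topic distributions would come from an LDA model fitted with Gibbs sampling.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of two bag-of-words term-count vectors (VSM)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Two toy documents on the same subject that share no words.
doc_a = ["car", "engine", "wheel"]
doc_b = ["automobile", "motor", "tire"]

# Hypothetical per-document topic proportions, standing in for LDA output.
theta_a = [0.80, 0.15, 0.05]
theta_b = [0.75, 0.20, 0.05]

print("VSM cosine similarity:", cosine_similarity(doc_a, doc_b))
print("JS divergence in topic space:", js_divergence(theta_a, theta_b))
```

In this toy case the word vectors do not overlap, so the cosine similarity is zero, while the topic distributions are close and the JS divergence is small. This is the kind of vocabulary mismatch that motivates comparing texts in topic space. Some libraries define the JS distance as the square root of this divergence, so the convention should be checked before comparing numbers.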