Study Of Text Similarity Computing Based On Markov Model

Posted on: 2008-02-27
Degree: Master
Type: Thesis
Country: China
Candidate: Z K Su
GTID: 2178360242467566
Subject: Software engineering
Abstract/Summary:
In information processing, text similarity computation is widely applied in retrieval, classification, clustering, and other knowledge-management fields. It is a basic and important problem that has been studied for a long time. Current approaches to text similarity rely mainly on statistical theory and concentrate heavily on word frequency, while neglecting an equally important aspect: word order (the order in which words appear in the text). To obtain better similarity results, word order should be taken into account. This paper uses the matrix of a Markov model, the longest common subsequence, and all common substrings of two texts to record word order, and uses the traditional TF-IDF method of the vector space model (VSM) to record word frequency.

An efficient algorithm is proposed in this paper. First, the texts are preprocessed, and the TF and IDF terms are generated with the TF-IDF method (the tree is built and searched at the same time, which makes this step efficient). The matrix of the Markov model is then built from the source texts, with each word treated as a state of the model. Next, the longest common subsequence, the matrix of the Markov model, and the TF-IDF method are combined to produce a preliminary similarity result. This preliminary result is compared with a threshold to decide whether the common substrings of the two texts should be used. If the preliminary result is greater than the threshold, an algorithm based on the difference in word order is used to find all common substrings of the two texts, and the preliminary result is adjusted according to the length and number of those substrings. This adjustment makes the computed similarities separate the dataset more effectively.

Finally, the paper takes manually labeled class information as the criterion for evaluating the experimental results, using the common KNN method. The Markov-model-based text similarity is tested on part of the TREC-9 dataset. The experimental results suggest that it improves on the VSM-based TF-IDF method by 5% to 15% under the same word segmentation and evaluation conditions.
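The preliminary similarity step described above can be illustrated with a minimal Python sketch. It is not the author's implementation: the abstract does not say how the three signals are combined, so the weighted sum, the default `weights`, the cosine comparison of flattened transition matrices, and all function names here are assumptions made for illustration. IDF is computed over only the two input texts, and the threshold test and common-substring adjustment are omitted.

```python
import math
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def tfidf_vectors(docs):
    """TF-IDF vectors for tokenized docs (smoothed IDF; an assumed variant)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    return [
        {t: (c / len(doc)) * idf[t] for t, c in Counter(doc).items()}
        for doc in docs
    ]


def transition_matrix(tokens):
    """First-order Markov matrix: each word is a state, rows sum to 1."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for prev, row in counts.items()
    }


def markov_similarity(a, b):
    """Assumed comparison: cosine between the flattened transition matrices."""
    flat = lambda m: {(p, q): v for p, row in m.items() for q, v in row.items()}
    return cosine(flat(transition_matrix(a)), flat(transition_matrix(b)))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def preliminary_similarity(a, b, weights=(0.4, 0.4, 0.2)):
    """Hypothetical weighted sum of TF-IDF, Markov, and LCS signals."""
    vec_a, vec_b = tfidf_vectors([a, b])
    longest = max(len(a), len(b))
    lcs_ratio = lcs_length(a, b) / longest if longest else 0.0
    w_tfidf, w_markov, w_lcs = weights
    return (w_tfidf * cosine(vec_a, vec_b)
            + w_markov * markov_similarity(a, b)
            + w_lcs * lcs_ratio)


if __name__ == "__main__":
    a = "the cat sat on the mat".split()
    b = "the mat sat on the cat".split()
    print(preliminary_similarity(a, b))
```

The two example sentences contain exactly the same words in a different order. A frequency-only measure therefore treats them as identical, while the Markov and LCS terms lower the combined score. That is the word-order effect the abstract argues for.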
Keywords/Search Tags:Text Similarity, Markov Model, VSM, TF-IDF