Study Of Text Similarity Computing Based On Markov Model

Posted on:2008-02-27

Degree:Master

Type:Thesis

Country:China

Candidate:Z K Su

Full Text:PDF

GTID:2178360242467566

Subject:Software engineering

Abstract/Summary:

In information processing, text similarity computation is widely applied in retrieval, classification, clustering, and other knowledge-management fields. It is a basic and important problem that has been studied for a long time. Current approaches are largely statistical and concentrate heavily on word frequency, neglecting an equally important aspect: word order (the order in which words appear in the text). To obtain better similarity results, word order should be taken into account. The paper records word order using the matrix of a Markov model, the longest common sequence, and all common substrings of the two texts. It records word frequency using the traditional TF-IDF method of the vector space model (VSM).

The paper proposes an efficient algorithm. First, the texts are preprocessed, and the TF and IDF terms are generated with the TF-IDF method (the tree is built and searched at the same time, which makes this step efficient). Next, the matrix of the Markov model is built from the source texts, with each word treated as a state of the model. The longest common sequence, the Markov model matrix, and the TF-IDF method are then combined to produce a preliminary similarity result. This preliminary result is compared with a threshold to decide whether the common substrings of the two texts should be used. If the preliminary result exceeds the threshold, an algorithm based on word-order differences finds all common substrings of the two texts, and the preliminary result is adjusted according to the length and number of those substrings. This adjustment effectively improves how well the computed similarities separate the dataset.

Finally, the paper uses manually labeled classification information as the criterion for evaluating the experimental results, with a standard KNN method as the evaluator. The Markov-model-based text similarity is tested on part of the TREC-9 dataset. The experimental results suggest that it improves on the VSM-based TF-IDF method by 5% to 15% under the same word segmentation and evaluation conditions.
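To make the word-order idea concrete, here is a minimal Python sketch of two of the signals the abstract mentions. It is not the thesis's implementation: it assumes a first-order Markov model whose matrix holds word-to-word transition probabilities, it assumes "longest common sequence" means the standard longest common subsequence, and it compares the matrices by cosine similarity, which is a choice made here for illustration. The function names (`transition_probs`, `lcs_length`, `word_order_similarity`) are hypothetical, and the TF-IDF term, the threshold test, and the common-substring adjustment are left out because the abstract does not say how they are combined.

```python
from collections import Counter, defaultdict
import math


def transition_probs(words):
    """Sparse first-order transition matrix: (current, next) -> P(next | current)."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(words, words[1:]):
        counts[cur][nxt] += 1
    probs = {}
    for cur, row in counts.items():
        total = sum(row.values())
        for nxt, c in row.items():
            probs[(cur, nxt)] = c / total
    return probs


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def lcs_length(x, y):
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def word_order_similarity(text_a, text_b):
    """Return (Markov-matrix similarity, normalized LCS) for two texts.

    Whitespace tokenization is an assumption; the thesis uses its own
    preprocessing and word segmentation.
    """
    a, b = text_a.lower().split(), text_b.lower().split()
    markov_sim = cosine(transition_probs(a), transition_probs(b))
    lcs_sim = lcs_length(a, b) / max(len(a), len(b), 1)
    return markov_sim, lcs_sim
```

For example, "the dog bit the man" and "the man bit the dog" contain exactly the same words, so a pure word-frequency comparison cannot tell them apart, while both scores from `word_order_similarity` come out at about 0.6.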

Related items

1	Study Of Chinese Text Similarity Research Based On Markov Word Order Gene
2	Research On Geometric Similarity Of Machine Parts By Hidden Markov Model
3	Text Similarity Computing Theory And Applied Research
4	Algorithm Research For Text Information Extraction Based On Hidden Markov Model
5	Research Of Blog Similarity Analysis Based On Hidden Markov Model
6	Research On Text Similarity Algorithm Based On Vector Space Model
7	Research And Implementation Of Text Similarity Algorithm Based On Semantic Fusion
8	Research On Short Text Similarity Based On Deep Learning
9	Research Of Web Text Mining Technology Based On Hidden Markov Model
10	Research On Error Correction Technology Of Text Recognition Based On Hidden Markov Model