The Research About Text Similarity Measuring Through Hamming-Distance And Semantics

Posted on: 2017-09-26  Degree: Master  Type: Thesis
Country: China  Candidate: Q Bao  Full Text: PDF
GTID: 2348330482986926  Subject: Computer application technology
Abstract/Summary:
Humanity's growing dependence on networks has driven explosive growth in the scale of network data. Text is an important carrier of this data, so text information processing technology has received increasing attention. Text-similarity measurement is a key part of that technology, and its accuracy directly affects the results of text information processing. One of the main current approaches uses the relationship between vectors in the vector space model (VSM) to reflect the degree of similarity between texts. The concept is simple and easy to compute, but the method requires processing a high-dimensional sparse matrix, which has high computational complexity, and it ignores the semantic information in the text. A second kind of similarity algorithm, based on semantics, overcomes this shortcoming, but it needs the support of a domain-specific knowledge base, and the complexity of building one means that such algorithms remain more theoretical than practical. This thesis proposes a new method, HSim, that builds on both approaches. It combines the space model of the first approach with the semantic information of the second, and finally uses the Hamming distance to calculate text similarity, thereby avoiding direct handling of the high-dimensional sparse matrix. On the one hand, using the Hamming distance overcomes the low computational efficiency of the high-dimensional sparse matrix in the first approach. On the other hand, integrating the VSM with the Hamming distance lets HSim use a semantic dictionary directly as a reference, which avoids the complexity of building the domain-specific knowledge base required by the second approach. The experiments compare HSim with other text-similarity measures on a clustering task, using the training corpus and F-measures. The results show that HSim performs better than the other methods, but also that it has some deficiencies in applicability. To overcome these deficiencies, this thesis improves the steps of the algorithm, including the two mappings and the input set used in the final calculation, and carries out a new set of experiments, whose results show that the applicability of the improved method is greatly improved.
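The abstract does not spell out HSim's two mappings or how the semantic dictionary is used, so the following is only a minimal sketch of the two general building blocks it contrasts: cosine similarity over VSM term-count vectors, and a similarity derived from the Hamming distance between fixed-length bit vectors. The function names, the term-presence encoding, and the normalization `1 - distance / length` are illustrative assumptions, not the thesis's actual method.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """VSM baseline: cosine of the angle between two term-count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def to_bit_vector(tokens, vocabulary):
    """Assumed encoding: 1 if the vocabulary term occurs in the text, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


def hamming_distance(bits_a, bits_b):
    """Number of positions at which two equal-length bit vectors differ."""
    if len(bits_a) != len(bits_b):
        raise ValueError("bit vectors must have the same length")
    return sum(x != y for x, y in zip(bits_a, bits_b))


def hamming_similarity(bits_a, bits_b):
    """Assumed normalization: 1 minus the fraction of differing positions."""
    if not bits_a:
        return 1.0
    return 1.0 - hamming_distance(bits_a, bits_b) / len(bits_a)


# Hypothetical toy documents, already tokenized.
doc_a = ["text", "similarity", "measure"]
doc_b = ["text", "similarity", "semantic"]
vocabulary = sorted(set(doc_a) | set(doc_b))

bits_a = to_bit_vector(doc_a, vocabulary)
bits_b = to_bit_vector(doc_b, vocabulary)
print(cosine_similarity(doc_a, doc_b))
print(hamming_distance(bits_a, bits_b), hamming_similarity(bits_a, bits_b))
```

With these toy inputs the two documents differ in two of four vocabulary positions, so the Hamming distance is 2 and the derived similarity is 0.5. The Hamming computation needs only position-wise comparisons on compact bit vectors, which is the kind of saving over sparse floating-point vector arithmetic that the abstract points to.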
Keywords/Search Tags: text-similarity, VSM (Vector Space Model), semantic information, Hamming-distance, mapping