The Research About Text Similarity Measuring Through Hamming-Distance And Semantics

Posted on:2017-09-26

Degree:Master

Type:Thesis

Country:China

Candidate:Q Bao

Full Text:PDF

GTID:2348330482986926

Subject:Computer application technology

Abstract/Summary:

Humanity's growing dependence on networks has led to explosive growth in the scale of network data. Text is an important carrier of this data, so text information processing technology has received increasing attention. Text-similarity measurement is a key part of that technology, and its accuracy directly affects the results of text information processing. Currently, one of the main approaches to text-similarity measurement uses the relationship between vectors in the vector space model (VSM) to reflect the degree of similarity between texts. The concept is simple and readily computable. However, this approach requires processing a high-dimensional sparse matrix, which has high computational complexity, and it ignores the semantic information in the text. A second kind of similarity algorithm, based on semantics, overcomes this shortcoming, but it needs the support of a domain-specific knowledge base, and the difficulty of building one means that such algorithms remain more theoretical than practical.

This thesis proposes a new method (HSim) that builds on the two approaches above. It combines the advantages of the space model in the first approach with the semantic information of the second, and then uses the Hamming distance to calculate text similarity, thus avoiding direct handling of the high-dimensional sparse matrix. On the one hand, using the Hamming distance overcomes the low computational efficiency of the high-dimensional sparse matrix in the first approach; on the other hand, integrating the VSM with the Hamming distance lets HSim use a semantic dictionary directly as a reference, which avoids the complexity of building the domain-specific knowledge base required by the second approach.

The experiments compare HSim with other text-similarity measures on a clustering task, using the training corpus and F-measures. The results show that HSim outperforms the other methods, but also that it has some deficiencies in applicability. To overcome these deficiencies, this thesis optimizes and improves steps of the algorithm, including the two mappings and the input set used in the final calculation, and carries out new experiments, whose results show that the applicability of the improved method is greatly improved.
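The abstract does not give HSim's actual steps, so the following is only a minimal sketch of the general idea it describes: map words to shared concepts through a semantic dictionary, encode each text as a bit vector, and compare the vectors with the Hamming distance. The dictionary `SYNONYMS`, the function names, and the normalization by vocabulary size are all assumptions made for illustration and are not taken from the thesis.

```python
from typing import Dict, Set

# Hypothetical toy semantic dictionary: maps surface words to a shared concept.
# A real system would load a full dictionary; this one is for illustration only.
SYNONYMS: Dict[str, str] = {
    "car": "vehicle",
    "automobile": "vehicle",
    "quick": "fast",
    "rapid": "fast",
}


def to_concepts(text: str) -> Set[str]:
    """First mapping (assumed): words -> concepts via the dictionary."""
    return {SYNONYMS.get(word, word) for word in text.lower().split()}


def to_bits(concepts: Set[str], index: Dict[str, int]) -> int:
    """Second mapping (assumed): concepts -> bit vector stored in an int."""
    bits = 0
    for concept in concepts:
        bits |= 1 << index[concept]
    return bits


def hamming_similarity(a: str, b: str) -> float:
    """Return 1 - (Hamming distance / vector length) for two texts."""
    concepts_a, concepts_b = to_concepts(a), to_concepts(b)
    vocab = sorted(concepts_a | concepts_b)
    if not vocab:
        return 1.0  # two empty texts are treated as identical
    index = {concept: i for i, concept in enumerate(vocab)}
    # XOR leaves a 1 wherever the two bit vectors differ; count those bits.
    distance = bin(to_bits(concepts_a, index) ^ to_bits(concepts_b, index)).count("1")
    return 1.0 - distance / len(vocab)
```

Because the bit vectors are compared with a single XOR and a bit count, no sparse matrix is ever materialized. Note that with the vocabulary restricted to the union of the two texts, as here, this score coincides with the Jaccard similarity of the concept sets; a fixed corpus-wide vocabulary would give different values.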

Related items

1	Semantic Similarity Calculation Text Field Vector Space Model
2	Text Similarity Computing Theory And Applied Research
3	Research And Implementation Of Information Retrieval Based On Semantic Expansion And Matching In P2P
4	Research And Implementation Of Text Similarity Algorithm Based On Semantic Fusion
5	Research On The Method Of Determining Semantic Similarity Oriented Ontology Mapping
6	Research On Text Semantic Similarity Based On Deep Learning
7	Research On Semantic Representation Of Text Based On Topic Model
8	Chinese-Old Bilingual Text And Sentence Similarity Calculation Research
9	The Implementation And Research Of The Probabilistic Latent Semantic Analysis Model In The Search Engine's Business Text Classification System
10	The Semantic Information Retrieval Research Based On Multilayer Vector Space Model