Research On Fast Retrieval Algorithm For English Sentences Based On Simhash

Posted on:2016-10-16

Degree:Master

Type:Thesis

Country:China

Candidate:Y H Cai

Full Text:PDF

GTID:2298330470451419

Subject:Signal and Information Processing

Abstract/Summary:

With the popularization of computers, the rapid development of computer application technology, and the deepening of global integration, the communication barrier between people who use different languages has become more and more prominent. To address this problem, the new discipline of machine translation emerged; it is also a hot research field in artificial intelligence. Machine translation draws on many disciplines, such as mathematics, linguistics, and computer science, and is a typical interdisciplinary field. It is no exaggeration to say that, since the beginning of the 21st century, almost everyone living in the Information Age deals with machine translation directly or indirectly. Whether in science and technology, business, or politics, machine translation is undoubtedly a very important practical subject. The ultimate goal of machine translation is to produce translations that are reliable and elegant. However, constrained by the current level of research on human cognition, translation does not always achieve the desired result. After nearly a century of development, there is now a wide variety of machine translation systems, such as rule-based machine translation (RBMT) and statistics-based machine translation (SBMT). They are used in different environments, and each has advantages and disadvantages. In recent years, as statistics-based machine translation systems have reached a bottleneck and storage technology has developed rapidly, the example-based machine translation (EBMT) method has become increasingly favored. EBMT does not need to parse sentences; it uses only similar instances from the original corpus for matching and replacement. The more similar examples the corpus contains, and the higher their degree of similarity, the more accurate the results will be.
This paper studies the lookup of similar instances in EBMT, with the aim of finding a fast and accurate similar-instance retrieval algorithm. It first introduces the research status of similar text retrieval and of machine translation systems, together with their main problems. It then studies the principles of the Simhash algorithm and of the TF-IDF method based on the vector space model (VSM), and presents the principle of the algorithm proposed in this paper. The new algorithm uses Simhash to implement fast retrieval of similar instances. The paper examines several key aspects of the new algorithm, as well as WordNet, the semantic dictionary the algorithm requires. Following the proposed method, a model of a similar-instance retrieval system was built in C++ on the VS2010 platform. This model can be used as part of an example-based machine translation system, and it was used to test the proposed method. Finally, the proposed method was compared with the same-word-based method, the edit-distance-based method, and the method using TF-IDF alone, in terms of time and similarity calculation results. The experimental results show that the method reduces the time consumption of similar-instance retrieval, and the larger the corpus, the more obvious the effect. When the candidate similar examples contain synonyms, the result of this method reflects the degree of similarity between two sentences more objectively.
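
As a rough illustration of how Simhash and TF-IDF can be combined for fast similar-sentence lookup, the following is a minimal sketch in Python. It is not the algorithm proposed in the thesis, which also relies on WordNet and a replacement cost that are not reproduced here. The function names, the 64-bit fingerprint width, the use of MD5 as the per-word hash, and the smoothed IDF formula are all assumptions made for the example.

```python
import hashlib
import math
from collections import Counter


def tf_idf_weights(tokens, corpus):
    """Return a TF-IDF weight for each distinct token of one sentence.

    `corpus` is a list of tokenized sentences. The smoothed IDF below is
    an assumed variant, chosen so that no weight is ever zero.
    """
    n_docs = len(corpus)
    weights = {}
    for token, count in Counter(tokens).items():
        df = sum(1 for doc in corpus if token in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[token] = (count / len(tokens)) * idf
    return weights


def simhash(weights, bits=64):
    """Fold weighted token hashes into a single `bits`-wide fingerprint."""
    totals = [0.0] * bits
    for token, weight in weights.items():
        # Assumption: the first 8 bytes of MD5 serve as the 64-bit word hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            # A set bit adds the weight to that position; a clear bit subtracts it.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Hypothetical toy corpus of tokenized example sentences.
corpus = [
    "the cat sat on the mat".split(),
    "the cat lay on the mat".split(),
    "stock prices fell sharply today".split(),
]
fingerprints = [simhash(tf_idf_weights(s, corpus)) for s in corpus]

query = "the cat sat on the mat".split()
query_fp = simhash(tf_idf_weights(query, corpus))

# Rank corpus sentences by Hamming distance to the query fingerprint.
ranked = sorted(range(len(corpus)), key=lambda i: hamming(query_fp, fingerprints[i]))
```

Because each sentence is reduced to one fixed-width integer, candidates can be screened with an XOR and a bit count rather than a word-by-word comparison. A finer similarity measure can then be applied only to the few sentences whose fingerprints lie within a small Hamming distance of the query.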

Keywords/Search Tags:

EBMT, Sentence Similarity, Simhash, Replacement cost, Similar Instance Retrieval

Related items

1	Based On The Instance Of English-chinese Translation System
2	Research On Text Similarity Detection Algorithm Based On Simhash
3	Research On Similar Sentence Retrieval Technology For Patents
4	Modification And Application Research On SIMHASH Algorithm
5	The Design And Implementation Of Multi-features Combination In Sentence Similarity Computation
6	Sentence-embedding And Similarity Via Hybrid Bidirectional-LSTM And CNN Utilizing Weighted-pooling Attention
7	Research Of English Sentence Similarity Measure Based On Wordnet
8	Research On Sentence Similarity Calculation Based On Neural Network
9	The Research And Implementation On Sentence Similarity Based On Deep Neural Networks
10	Research And Improvement Of Text Similarity Detection Based On Simhash