Research on a Fast Retrieval Algorithm for English Sentences Based on Simhash

Posted on: 2016-10-16
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Cai
Full Text: PDF
GTID: 2298330470451419
Subject: Signal and Information Processing
Abstract/Summary:
With the popularization of computers, the rapid development of computer application technology, and the deepening of global integration, the communication barrier between people who use different languages has become increasingly prominent. Machine translation emerged as a discipline to address this problem, and it is also an active research field within artificial intelligence. Machine translation draws on many disciplines, including mathematics, linguistics, and computer science, and is a typical interdisciplinary field. It is no exaggeration to say that, in the 21st century, almost everyone living in the Information Age deals with machine translation directly or indirectly. Whether in science and technology, business, or politics, machine translation is undoubtedly an important practical subject.
The ultimate goal of machine translation is to produce translations that are both reliable and elegant. However, constrained by the limits of research on human cognition, machine translation does not always achieve the desired result. After nearly a century of development, a wide variety of machine translation systems have appeared, such as rule-based machine translation (RBMT) and statistics-based machine translation (SBMT). They suit different environments, and each has its own advantages and disadvantages. In recent years, as statistics-based machine translation has reached a bottleneck and storage technology has developed rapidly, example-based machine translation (EBMT) has become increasingly favored. EBMT does not need to parse sentences; it uses only similar instances from an existing corpus for matching and replacement. The more similar examples the corpus contains, and the higher their similarity to the input, the more accurate the results.
This thesis addresses the retrieval of similar instances in EBMT, aiming to find a similar-instance retrieval algorithm that is both fast and accurate.
It first reviews the state of research on similar text retrieval and machine translation systems and outlines the main problems of each. It then studies the principles of the Simhash algorithm and of the TF-IDF method based on the vector space model (VSM), and presents the principle of the algorithm proposed here, which uses Simhash to retrieve similar instances quickly. Several key aspects of the new algorithm are examined, along with WordNet, the semantic dictionary that the algorithm requires. Following the proposed method, a model of a similar-instance retrieval system was built in C++ on the VS2010 platform. This model can serve as one component of an example-based machine translation system, and it was used to test the proposed method. Finally, the proposed method was compared with a method based on shared words, a method based on edit distance, and TF-IDF used alone, in terms of both running time and similarity calculation results. The experimental results show that the method reduces the time consumed by similar-instance retrieval, and that the effect becomes more pronounced as the corpus grows. When the candidate similar examples contain synonyms, the similarity computed by this method reflects the degree of similarity between two sentences more objectively.
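To make the Simhash step concrete, the following is a minimal Python sketch of fingerprinting two sentences and comparing them by Hamming distance. It is an illustration only, not the thesis's C++ implementation. It assumes whitespace tokenization, raw term frequency as the token weight (a TF-IDF weight could be substituted), and MD5 as the per-token hash. The names `simhash`, `hamming_distance`, and `BITS` are hypothetical.

```python
import hashlib
from collections import Counter

BITS = 64  # fingerprint width; an assumption, not taken from the thesis


def simhash(sentence: str) -> int:
    """Return a 64-bit Simhash fingerprint of a whitespace-tokenized sentence."""
    # Assumed weighting: raw term frequency of each lowercased token.
    weights = Counter(sentence.lower().split())
    totals = [0] * BITS
    for token, weight in weights.items():
        # Assumed token hash: first 8 bytes of MD5, chosen for determinism.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: BITS // 8], "big")
        for i in range(BITS):
            # A set bit adds the token's weight, a clear bit subtracts it.
            if (token_hash >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    # Bit i of the fingerprint is 1 where the weighted total is positive.
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    query = "the cat sat on the mat"
    candidates = ["the cat sat on a mat", "stock prices fell sharply today"]
    for candidate in candidates:
        print(candidate, hamming_distance(simhash(query), simhash(candidate)))
```

Sentences that share most of their tokens tend to produce fingerprints with a small Hamming distance, so candidates can be filtered by comparing fixed-width integers before any more expensive similarity measure is applied. This sketch does not handle synonyms, which the thesis addresses with WordNet.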
Keywords/Search Tags: EBMT, Sentence Similarity, Simhash, Replacement Cost, Similar Instance Retrieval