Research On Similar Sentence Retrieval Technology For Patents

Posted on: 2011-02-02 | Degree: Master | Type: Thesis
Country: China | Candidate: Y K Lu | Full Text: PDF
GTID: 2178360302488549 | Subject: Computer application technology
Abstract/Summary:
In natural language processing, sentence retrieval is widely applied and has attracted considerable attention. In question answering (QA), automatic text summarization, example-based machine translation (EBMT), and translation memory systems, the quality of the sentence retrieval module directly affects system performance. However, there is no unified standard for judging whether two sentences are similar, and it is impossible to define one, because the judgment criteria depend on the specific application. For example, in an example retrieval system two sentences can be considered similar if their syntactic structures are similar, whereas in FAQ-based automatic question answering they are considered similar when their meanings are similar.

With growing awareness of intellectual property rights and an urgent need for international exchange, traditional human translation of patents can no longer keep up with the rapidly growing demand for patent translation. To some extent, this also hinders the dissemination and exchange of patented technology between China and the rest of the world. With the rapid development of machine translation, automatic machine translation and computer-human cooperative translation have become an effective way to solve this problem.

The main task of this thesis is to design a sentence retrieval algorithm for a computer-human cooperative translation system, tailored to the features of patents, so as to improve the performance of the system. Compared with ordinary documents, patent documents have a canonical format, precise expression, and an abundance of terms. Targeting these characteristics, this thesis presents a sentence similarity computation method based on pseudo-LCS. The method achieves fuzzy alignment by improving the conventional longest common subsequence (LCS) algorithm. In addition, it incorporates word meaning, part of speech, term similarity, and other related information, which the experimental results show makes it more effective for sentence similarity computation on patent documents. The accuracy of the proposed method reaches 83.5%, compared with 63.5% for the improved edit-distance method and 66.5% for the vector space model (VSM) method.
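
To illustrate the general idea of relaxing LCS toward fuzzy alignment, the following is a minimal Python sketch. It is not the pseudo-LCS algorithm of the thesis, whose details are not given in this abstract. It assumes that two tokens may align whenever a token-level similarity function returns a value at or above a threshold, and that an aligned pair contributes that value instead of 1. The names `soft_lcs_score`, `sentence_similarity`, and `toy_token_sim`, the synonym pairs, the 0.8 weight, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from typing import Callable, Sequence

TokenSim = Callable[[str, str], float]


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Classic LCS length over tokens, using exact matching only."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def soft_lcs_score(a: Sequence[str], b: Sequence[str],
                   token_sim: TokenSim, threshold: float = 0.5) -> float:
    """Weighted LCS: tokens align when token_sim >= threshold (an assumption),
    and each aligned pair adds its similarity value to the score."""
    prev = [0.0] * (len(b) + 1)
    for x in a:
        curr = [0.0]
        for j, y in enumerate(b, start=1):
            best = max(prev[j], curr[j - 1])      # skip a token on either side
            s = token_sim(x, y)
            if s >= threshold:
                best = max(best, prev[j - 1] + s)  # align x with y
            curr.append(best)
        prev = curr
    return prev[-1]


def sentence_similarity(a: Sequence[str], b: Sequence[str],
                        token_sim: TokenSim, threshold: float = 0.5) -> float:
    """Normalize the weighted LCS score to [0, 1] by the two sentence lengths."""
    if not a or not b:
        return 0.0
    return 2.0 * soft_lcs_score(a, b, token_sim, threshold) / (len(a) + len(b))


# Hypothetical stand-in for word-meaning / term similarity resources.
_SYNONYMS = {frozenset(("device", "apparatus")),
             frozenset(("comprises", "includes"))}


def toy_token_sim(x: str, y: str) -> float:
    """1.0 for identical tokens, 0.8 for listed synonym pairs, otherwise 0.0."""
    if x == y:
        return 1.0
    if frozenset((x, y)) in _SYNONYMS:
        return 0.8
    return 0.0


def exact_sim(x: str, y: str) -> float:
    """Exact-match similarity; reduces soft_lcs_score to the classic LCS."""
    return 1.0 if x == y else 0.0


s1 = "the device comprises a sensor".split()
s2 = "the apparatus includes a sensor".split()
print(lcs_length(s1, s2))
print(sentence_similarity(s1, s2, exact_sim))
print(sentence_similarity(s1, s2, toy_token_sim))
```

For this toy pair, exact matching aligns only three of the five tokens and gives a similarity of 0.6, while the fuzzy variant also aligns the two synonym pairs and gives about 0.92. A real system would replace `toy_token_sim` with similarity scores derived from a thesaurus, part-of-speech tags, or a term dictionary, along the lines the abstract describes.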
Keywords/Search Tags: sentence similarity computation, pseudo-LCS, fuzzy-alignment, term similarity computation