Font Size: a A A

Approximate Chinese String Matching Techniques Based On Pinyin Input Method

Posted on:2011-04-08Degree:MasterType:Thesis
Country:ChinaCandidate:B LiuFull Text:PDF
GTID:2248330395458363Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
String matching is one of the most typical problems in computer science. Many researchers have focused on this problem for a long time and it has been applied in many fields. Previous studies mainly focused on accurate string matching problem. Many single pattern matching algorithms and multi-pattern matching algorithms were proposed. However, with the rapid development of the computer and Internet as well as the continuously rising of new issues, people find that sometimes approximate string matching is more necessary and more important in practical applications. In more detail, approximate string matching has important applied values in many fields such as query and extraction of information, pattern recognition, voice recognition, text editing, optical character recognition, intrusion detection, computational biology and so on. Therefore, it has very important theoretical value and practical meaning to research and design efficient approximate string matching algorithms.Approximate string matching is also called "string matching that allows errors", which mainly aims to find the pattern string in the text and database and allows to exist k differences between the pattern string and its occurring forms in texts. For the problem of approximate string matching, though a number of algorithms have been proposed, there are few studies which focus on large size of alphabet∑. Most of experts are interested in small or middle size of alphabet∑. For large size of∑, especially for Chinese characters and Asian phonetics, there are few efficient algorithms.For the above reasons, this thesis focuses on the approximate Chinese characters matching problem based on the input method. The work of this thesis is organized as follows.First of all, the thesis introduces the research background, research purpose, main contained and organization structure in brief, summarizes the approximate string matching problem and states the basic theories and main research approaches at the same time. Then we propose our research goal and descript the measurement criteria of approximate Chinese character matching problem based on pinyin generally. We propose the similarity measurement criteria and the approximate string matching technique for Chinese characters based on pinyin input method, which includes converting Chinese strings into Chinese pinyin strings, searching target strings and deciding the possible positions of the approximate pinyin strings to construct the candidate set facing pinyin strings, constructing candidate set of approximate Chinese strings and finally outputting approximate Chinese strings. The major feature of this algorithm is that, when searching for the target strings to decide the possible positions of the approximate pinyin strings, the algorithm establishes double elements inverted lists and makes full use of q-gram technique in order to well excavate the potential of filtering technology, inverted list and the q-gram technique. The algorithm performs well both in theory and in practice. This thesis also shows the demonstration of the system interface and the analysis of the experiment results. The experiment results show the proposed algorithm can solve the approximate Chinese character matching problem based on pinyin input method efficiently.
Keywords/Search Tags:pinyin input method, Chinese string, approximate string matching, pruning, q-gram technique
PDF Full Text Request
Related items