Font Size: a A A

Research And Application Of String Approximate Matching Algorithm Based On Multivariate Information

Posted on:2020-02-16Degree:MasterType:Thesis
Country:ChinaCandidate:Z X NiuFull Text:PDF
GTID:2428330578457317Subject:Computer technology
Abstract/Summary:PDF Full Text Request
String matching is one of the most classic problems in computer science.Early research focused on the field of exact matching,and many studies have focused on small character sets such as DNA fragments or medium-sized character sets such as English.However,there are not many studies on large character sets such as Chinese characters and even Asian languages.With the continuous appearance of new problems,people have found that it is more necessary to approximate string matching in practical applications.Approximate string matching has been applied to many fields,such as spelling check,pattern recognition,web searching,OCR error correction,DNA sequence matching and so onApproximate matching algorithm of Chinese strings is studied in this paper,and the main research results are as follows(1)A method of combining the edit distance of multivariate information is proposed In the traditional editing distance algornthm,the cost of adding,deleting,and replacing a character is simply considered,and the degree of importance of each character is not distinguished.In addition,the glyph information of the Chinese character is not considered during the matching process.Based on the traditional editing distance algorithm,a new method is proposed which combines the word frequency information and font information of Chinese characters.For strings in fixed libraries,the frequency of occurrence of characters is constant.In this paper,low-frequency characters are defined as important characters,which reduce the matching cost and increase the probability of matching for important characters.Chinese characters are different from English characters,and have unique information on the glyphs,including Wubi,structures,four corners,and strokes.They respectively represent the differences in the components,structure,shape,and complexity of Chinese characters.For two characters with large differences in glyphs,increase the cost of matching.The proposed method not only considers the word frequency of Chinese characters,but also makes full use of the differences in the glyphs of Chinese characters.The experimental results show that the edit distance method combining multiple kinds of information significantly improves the accuracy of string approximate matching(2)A method of combining the editing distance of multivariate information with the Trie tree is proposed.For approximate matching of a large number of strings,since each of them is compared by the method of editing distance,it will take a lot of time.In this paper,the algorithm based on the Trie tree is designed,combining the idea of dynamic programming and the characteristics of sharing prefix.Since it prunes the target string to reduce the number of string matching,the time of string approximate matching is reduced.The experimental results show that the combination of the edit distance of the multi-information and the Trie tree greatly reduces the time of string approximate matching while ensuring the same accuracy.
Keywords/Search Tags:Chinese string matching, Multiple information, String similarity search, Levenstein distance, Trie tree
PDF Full Text Request
Related items