Font Size: a A A

Studies On String Mathching Algorithm Based On The Character’s Law

Posted on:2013-12-28Degree:MasterType:Thesis
Country:ChinaCandidate:G H LiFull Text:PDF
GTID:2248330371476613Subject:Computer system architecture
Abstract/Summary:PDF Full Text Request
The string matching problem is a very important topic in the field of information processing. With the rapid development of network and IT, the information explosion caused by the extreme expansion of the amount of information, making the rapid information processing becomes more and more important. A point of view, the efficiency of the string matching even become a bottleneck for the accurate of the information required by the people.In the field of science (such as computer science, biology, demographic, financial, etc.), nature and social life (such as word frequency, the frequency of first digital numbers, etc.), there are a lot of phenomena or systems, make a confirmed that many of the inherent features are the power-law distribution. The power-law distribution of the statistical law reveals the relationship between many seemingly in no way associated things, but this relationship is known as the unbalanced distribution of the characteristics of a certain perspective, can often move things forward. The Zipfs law, Heaps’ law, Benford’s Law are to study the long-tailed distribution of several important law.As a reference of Zipfs law, Heaps’law and Benford’s law, We make the first study of the intrinsic properties of different character inherent in the vocabulary of English words, and put forward the concept of the character characterization of the number of words that the character’s information, the results verify that the characters in reverse ranking with the distribution of the character’s information is also consistent with power-law distribution, and then, we made the character’s law. Furthermore, the imbalance of the character’s information used to guide the string matching algorithm. Obtained and after the experiments, we found that the character’s law can accelerate the process of string matching and improve the performance of string matching algorithms, we presents the string match algorithm based on the character’s law, the algorithm of three variants of the algorithm is SMCI-noSort algorithm, SMCI-partial-sort algorithm and SMCI-all-sort algorithm. The experiments show that the character’s law can very fast accelerate the process of the string matching, and thus speeding up the speed of the string matching algorithm.Specifically, the innovative research of this paper is mainly reflected in the following points:1. This paper defines a concept of using the characters to represent words, and a concept of the information of a character, through experimental demonstration, the reciprocal of the character’s order and the information of a character are all in line with the power distribution, and then put forward the character’s law.2. This paper proposed SMCI-no-sort algorithm; the idea of the algorithm is completely character-based inverted index structure for string search, the searching process make use of the uneven distribution of the characters. The algorithm and the process simply create the inverted index structure, rather than sorting the substrings, the time of preprocess is very low.3. This paper proposed the SMCI-all-sort algorithm; the algorithm is corresponding to the SMCI-no-sort algorithm to sort all the list in the preprocessing stage, in the searching process, according to the fully ordered list, the operation is only binary search process.4. This paper Proposed the SMCI-partial-sort algorithm; the algorithm makes full use of the amount of the uneven distribution of the character, it select the part of the sort the list in the preprocess phase, so as to enhance the efficiency of the preprocessing phase and search phase.
Keywords/Search Tags:The feature of the character, Character’s Information, String Matching, Pattern Matching
PDF Full Text Request
Related items