Font Size: a A A

Research On The Identification Technique Based On The Pattern Matching Algorithm

Posted on:2012-07-02Degree:MasterType:Thesis
Country:ChinaCandidate:J B WangFull Text:PDF
GTID:2218330371462548Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Malicious tampering to the information of the file format to cover up the true type is the most common means in the computer crime; in order to fight the crime, the technology of correctly identifying the true file format becomes more important. This paper focuses on the file format recognition technology and the supporting algorithm of the file format recognition technology the pattern matching algorithm, and then the improved algorithm of the pattern matching algorithms, the feature extraction algorithms, the file format recognition model and the unknown file type recognition system were proposed, which have been proved effective through experiments verification. The main contents are as follows:Firstly, a matching algorithm is proposed to adapt the long string matching. The characters of a long string in the distribution have significant statistical properties but the existing algorithms often ignore this feature, which results in the result that the algorithm is not optimized. Therefore, this algorithm by using the seat shifted table and the technology of the average segmentation, is full use of the characteristics of the pattern string to speed up the jumping distance under non-matching, and reduce the number of matches, so as to achieve improving the speed of matching under the effect of the long string. Theoretical analysis and experimental tests show that the time complexity of the algorithm is low, and matching is more efficient.Secondly, a bit-parallel matching algorithm is proposed based on the size of the machine word. More of the existing algorithms focus on the optimal time complexity, while ignoring the potential capability of the computer itself, which makes that the actually practical application of the algorithms is not optimized in speed. Based on the characteristics of the pattern string itself, the algorithm uses the bit-parallel technology to fully tap the potential of the machine itself, so as to achieve the maximum effect of accelerating the algorithm. Theoretical analysis and experimental tests show that the time complexity of the algorithm is low; in the context of a machine word the matching is more efficient.Thirdly, an approximate multi-pattern matching algorithm is proposed based on the tampering. As the existing approximate matching algorithms in the multi-mode matching have high requirements to the coupling and the fault tolerance, when the coupling is large or tolerance lower, the efficiency of the algorithm will rapidly decline. More seriously, some algorithms in existence are very complex. With the characteristics of the file tampered with, which is the way of file formats in changes is the tamper instead of using multi-deletion, using the bit-parallel technology and the seat shifted table, the algorithm can achieve matching up the characteristics in units of the number of the machine word size, and effectively solve the problems of the low rate of fault tolerance and high coupling. Theoretical analysis and experimental tests show that the algorithm in construction is relatively simple contrast to the congener algorithms, the time complexity is low, and practical application is better.Fourthly, an algorithm is proposed to solving comprehensively common sub-sequences of the files. Because the existing longest common subsequence algorithms just solve the longest common subsequence of the two text data, one and only one. But in practice, we find that the key information is not necessarily relevant in the most long common subsequence, and in the middle of the other common subsequence there may have a lot of useful information. Therefore, in order to draw a more comprehensive document features, position integrated the file types by the multi-features, combined with the application of the pattern matching algorithm, by using the seat shifted table and effective pruning techniques, the algorithm arrives at the effectively common sub-sequences. Theoretical analysis and experimental tests show that the time complexity of the algorithm is low, and practical application is better.Fifthly, a file feature model is designed. By effective classification and organic organizations, there has achieved the application of the characteristics in level, and provided a solid foundation to the establishment of the file type recognition system.Sixthly, an unknown file type recognition system is designed. Through the comprehensive utilization of the pattern matching algorithms and the feature extraction algorithm, in conjunction with a document feature model, there has designed an unknown file recognition system. Used by the system for several file types, namely ppt,word,xls and pdf's into the experiment, the results show that when the filename extension is tampered, the recognition rate of the system is up to 100%,and even when the file contents are tampered to 15%, the recognition rate of the system is still up to 80%, so that the system can effectively identify unknown files.Finally, the whole work is summarized, and it makes the prospect of future work, proposes directions for further research.
Keywords/Search Tags:pattern matching, file identification, tamper technology, bit-parallel, the seat shifted table, feature model
PDF Full Text Request
Related items