A Study of Chinese Web Encoding Identification Based on Word Frequency Distribution

Posted on: 2015-02-23
Degree: Master
Type: Thesis
Country: China
Candidate: H Zhang
Full Text: PDF
GTID: 2308330473459331
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of computer technology, the Internet has become an inseparable part of people’s daily lives and an important way to share information. However, network security problems are becoming increasingly serious and have aroused wide concern. Web content filtering is an important research topic in network security, and encoding identification is a necessary prerequisite for it. For historical and geographical reasons, there are many Chinese encoding standards, which complicates content filtering of Chinese webpages. How to identify a webpage’s encoding quickly and accurately has therefore become a hot research topic.
This dissertation introduces several Chinese encodings, such as GB2312, GBK, BIG5, and UTF8, and studies several encoding identification algorithms, such as the Bayes classifier algorithm, the Unigram algorithm, and the CodeFinder algorithm. These algorithms cannot exclude the interference of ASCII content in webpages, which leads to poor identification accuracy and low time efficiency. To solve this problem, the dissertation proposes FKI, a Chinese web encoding identification algorithm based on word frequency distribution. According to the frequency distribution of Chinese characters, FKI selects high-frequency characters to construct a frequency-character-encoding table and searches the webpage using the encodings of these high-frequency characters as keywords, which lets it skip noise such as ASCII codes. FKI then compares the matches obtained for the different encoding types in the webpage and determines the actual encoding from the comparison. Because it uses high-frequency characters as keywords, FKI is applicable to the vast majority of Chinese webpages.
The AC algorithm is improved in order to find encodings in Chinese webpages efficiently. The improved AC algorithm constructs a reverse finite-state automaton and matches keywords byte by byte. To increase the jump distance on a mismatch, it takes the byte currently corresponding to state "0" as the mismatching byte, which improves matching efficiency for Chinese encodings. Finally, the accuracy and time performance of FKI are tested. The experimental results show that FKI achieves better time performance and accuracy on Chinese webpages than the Unigram and CodeFinder algorithms, and that it is suitable for identifying the encoding of Chinese webpages accurately and quickly.
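
As an illustration only, the keyword-matching idea described in the abstract might be sketched in Python as follows. This is not the thesis's implementation: the two short character lists, the function names (build_table, count_matches, identify_encoding), and the use of plain bytes.count in place of the improved AC automaton are all assumptions made for this sketch. GB2312 and GBK are treated as a single candidate here, because the sample characters have identical byte codes in both.

```python
from typing import Dict, Optional, Set

# Hypothetical samples of high-frequency characters. The thesis builds its
# table from a real frequency distribution that the abstract does not list.
SIMPLIFIED = "的一是不了人我在有他这中大来上国"
TRADITIONAL = "的一是不了人我在有他這中大來上國"


def build_table() -> Dict[str, Set[bytes]]:
    """Map each candidate encoding to the byte keywords of its frequent characters."""
    return {
        "gbk": {ch.encode("gbk") for ch in SIMPLIFIED},
        "big5": {ch.encode("big5") for ch in TRADITIONAL},
        "utf-8": {ch.encode("utf-8") for ch in SIMPLIFIED + TRADITIONAL},
    }


def count_matches(page: bytes, table: Dict[str, Set[bytes]]) -> Dict[str, int]:
    """Count keyword occurrences per encoding in the raw page bytes.

    Every keyword contains at least one byte >= 0x80, so pure ASCII markup
    never matches. bytes.count is a simple stand-in for a multi-pattern
    matcher and is not the improved AC algorithm from the thesis.
    """
    return {
        enc: sum(page.count(keyword) for keyword in keywords)
        for enc, keywords in table.items()
    }


def identify_encoding(page: bytes,
                      table: Optional[Dict[str, Set[bytes]]] = None) -> Optional[str]:
    """Return the encoding with the most keyword matches, or None if there are none."""
    table = table or build_table()
    scores = count_matches(page, table)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    sample = b"<html><body>" + SIMPLIFIED.encode("gbk") + b"</body></html>"
    print(identify_encoding(sample))
```

Counting each keyword separately rescans the page once per keyword, which is why a single-pass multi-pattern automaton, as the abstract describes, would be the natural replacement for count_matches on real pages.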
Keywords/Search Tags: Chinese encoding charset, Web content filtering, Word frequency distribution, Pattern matching