Chinese Web Coded Identification Study Based On Word Frequency Distributions

Posted on:2015-02-23

Degree:Master

Type:Thesis

Country:China

Candidate:H Zhang

GTID:2308330473459331

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of computer technology, the Internet has become an inseparable part of people's daily lives and an important way to share information. However, network security is becoming an increasingly serious problem and has attracted wide concern. Web content filtering is an important research topic in network security, and encoding identification is a necessary prerequisite for it. For historical and geographical reasons, there are many Chinese encoding standards, which complicates content filtering of Chinese webpages. Identifying the encoding of a webpage quickly and accurately has therefore become an active research topic.

This dissertation introduces several Chinese encodings, such as GB2312, GBK, BIG5, and UTF8, and studies several encoding identification algorithms, such as the Bayes classifier algorithm, the Unigram algorithm, and the CodeFinder algorithm. These algorithms cannot exclude the interference of ASCII content in a webpage, which leads to poor identification accuracy and low time efficiency. To solve this problem, the dissertation proposes FKI, a Chinese web encoding identification algorithm based on word frequency distribution.

According to the frequency distribution of Chinese characters, FKI selects high-frequency characters to construct a frequency-character-encoding table and searches the webpage using the encodings of these high-frequency characters as keywords, which lets it skip noise such as ASCII code. FKI compares the number of matches obtained for each encoding type in the webpage and determines the actual encoding from the comparison. Because its keywords are high-frequency characters, FKI is applicable to the vast majority of Chinese webpages.

The AC algorithm is improved so that encodings can be found effectively in Chinese webpages. The improved AC algorithm constructs a reverse finite state automaton and matches keywords byte by byte. To increase the jump distance on a mismatch, it uses the byte currently corresponding to state "0" as the mismatching byte, which improves matching efficiency for Chinese encodings.

Finally, the accuracy and time performance of the FKI algorithm are tested. The experimental results show that FKI achieves better time performance and accuracy on Chinese webpage encodings than the Unigram and CodeFinder algorithms, and that it is suitable for identifying the encoding of Chinese webpages accurately and quickly.
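The keyword-counting idea behind FKI can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the character list `HIGH_FREQ_CHARS`, the candidate list `CANDIDATE_ENCODINGS`, and the function names are assumptions made for illustration, and plain `bytes.count` stands in for the improved AC automaton, which the abstract does not specify in enough detail to reproduce.

```python
# Hypothetical stand-in for the frequency-character-encoding table: a handful
# of very common characters, with simplified and traditional forms mixed.
HIGH_FREQ_CHARS = "的一是不了在人有我他中大上这个们国来说這個們國來說"
CANDIDATE_ENCODINGS = ("gb2312", "big5", "utf-8")  # assumed candidate set


def build_table(chars=HIGH_FREQ_CHARS, encodings=CANDIDATE_ENCODINGS):
    """Map each candidate encoding to the byte patterns of the characters it can encode."""
    table = {}
    for enc in encodings:
        patterns = set()
        for ch in chars:
            try:
                patterns.add(ch.encode(enc))
            except UnicodeEncodeError:
                pass  # character does not exist in this encoding; skip it
        table[enc] = patterns
    return table


def match_counts(page, table):
    """Count keyword hits per encoding.

    Every keyword starts with a non-ASCII byte, so runs of pure ASCII
    (tags, scripts, URLs) cannot produce a hit.
    """
    return {
        enc: sum(page.count(pattern) for pattern in patterns)
        for enc, patterns in table.items()
    }


def identify_encoding(page, table=None):
    """Return the encoding with the most keyword hits, or None if nothing matched."""
    if table is None:
        table = build_table()
    scores = match_counts(page, table)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Calling `identify_encoding(raw_bytes)` on the undecoded bytes of a page returns one of the candidate names or `None`. A realistic table would hold far more characters, and a single multi-pattern scan would replace the repeated `count` calls, since this sketch rescans the page once per keyword.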

Keywords/Search Tags:

Chinese encoding charset, Web content filtering, Word frequency distribution, Pattern matching

Related items

1	The Content Filtering Of Chinese Information In The Internet Web
2	Research On A Pattern Matching Algorithm Based On Word Frequency
3	Algorithm Based On Chinese For Matching Multiple Patterns And Its Application Research
4	Text Coverless Information Hiding Research Based On Chinese Character Encoding
5	The Chinese Web Page Filtering System Based On Content Security
6	Research And Implementation Of SMS Content Filtering Technology Under The Chinese Mobile Platforms
7	Research On Search Technology Of Chinese Information
8	Chinese Word Auto-segmentation Design And Algorithm Realization For Chinese Network Information Retrieval
9	Key Technology Research On Content-Based Text Filtering
10	Analysis And Research For Key Technology On Content-Based Web Text Filtering