
Study Of Text Filtering Based On WEB Content Security

Posted on: 2018-02-14
Degree: Master
Type: Thesis
Country: China
Candidate: S Cui
Full Text: PDF
GTID: 2348330518996541
Subject: Information and Communication Engineering
Abstract/Summary:
With the rapid development of the Internet, the efficiency of real-time information sharing and transmission has increased greatly, causing a boom of information online. However, the network is a double-edged sword. On the one hand, users can obtain the information they want more conveniently and efficiently. On the other hand, some lawbreakers spread unhealthy information through the Internet, which affects social stability and people's lives, and some illegal content endangers the healthy development of young people. Therefore, cleaning up the network environment and filtering objectionable content is a problem that must be solved.

Text makes up a large part of the information on the Internet, so text filtering is an integral part of filtering unhealthy information. The traditional approach divides text into two categories, normal text and undesirable text, and does not account for differences among kinds of undesirable text. The goal of this thesis is to analyze the features of different kinds of undesirable text and provide targeted filtering methods in order to improve accuracy and reduce complexity. The main contributions of this thesis are:

This thesis reviews common approaches to text filtering, especially content-based text filtering.

According to the content and distribution of the text, this thesis proposes a classification system for undesirable text and uses an appropriate method to filter each kind. After features of the text and its structure are extracted, the input vectors are matched using machine learning techniques, particularly logistic regression and combined decision trees. The output value represents the similarity between the input text and the category templates. This classification system improves filtering performance and avoids over-fitting.

Text on the Internet varies in length and expression. This thesis determines the length of each text and extracts different features for long and short text. This enriches the features of short text and reduces the computational burden for long text.

Undesirable text is scarcer and more difficult to crawl than normal text, which causes an imbalance in the training data. In addition to under-sampling, this thesis re-computes the feature weights to improve classification accuracy.
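
The length-dependent feature extraction and the imbalance handling described above might be sketched as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the threshold `SHORT_TEXT_MAX_TOKENS`, the choice of bigrams for short text, the top-50 term cut-off for long text, and all function names are hypothetical. The abstract does not say how feature weights are re-computed, so inverse-frequency class weights are shown as a common stand-in.

```python
import random
from collections import Counter

# Hypothetical threshold; the abstract does not give a value.
SHORT_TEXT_MAX_TOKENS = 20


def extract_features(text):
    """Route a text to a length-dependent feature extractor."""
    tokens = text.lower().split()
    if len(tokens) <= SHORT_TEXT_MAX_TOKENS:
        # Short text: enrich sparse unigrams with adjacent-word bigrams.
        bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
        return Counter(tokens + bigrams)
    # Long text: keep only the most frequent terms to limit computation.
    return Counter(dict(Counter(tokens).most_common(50)))


def undersample(samples, labels, seed=0):
    """Randomly drop samples until every class matches the smallest class."""
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    n = min(len(items) for items in by_label.values())
    balanced = []
    for label, items in by_label.items():
        balanced.extend((sample, label) for sample in rng.sample(items, n))
    rng.shuffle(balanced)
    return balanced


def class_weights(labels):
    """Inverse-frequency weights (an assumed stand-in for the re-weighting)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / (len(counts) * c) for label, c in counts.items()}
```

In a setup like this, the balanced samples and the weights would be passed to whichever classifier is trained afterwards, so that the scarce undesirable class is not swamped by normal text.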
Keywords/Search Tags: undesirable text, text filtering, feature extraction, text classification