Study Of Text Filtering Based On WEB Content Security

Posted on:2018-02-14

Degree:Master

Type:Thesis

Country:China

Candidate:S Cui

GTID:2348330518996541

Subject:Information and Communication Engineering

Abstract/Summary:

With the rapid development of the Internet, information can be shared and transmitted in real time far more efficiently, which has led to an explosion of information online. However, the network is a double-edged sword. On the one hand, users can obtain the information they want more conveniently and efficiently. On the other hand, some lawbreakers spread unhealthy information through the Internet, which affects social stability and people's lives, and some illegal content endangers the healthy development of young people. Cleaning up the network environment and filtering objectionable content is therefore a problem that must be solved. Text makes up a large part of the information on the Internet, so text filtering is an integral part of filtering unhealthy information.

The traditional approach to text filtering divides text into two categories, normal and undesirable, and does not account for the differences among kinds of undesirable text. The goal of this thesis is to analyze the features of different kinds of undesirable text and to provide targeted filtering methods that improve accuracy and reduce complexity. The main contributions of this thesis are as follows.

This thesis reviews common approaches to text filtering, especially content-based text filtering. Based on the content and distribution of the text, it proposes a classification system for undesirable text and applies an appropriate method to filter each kind. After text and structural features are extracted, the input vectors are matched using machine learning techniques, in particular logistic regression and combined decision trees. The output value represents the similarity between the input text and the category templates. This classification system improves filtering performance and avoids over-fitting.

Text on the Internet varies in length and in style of expression. This thesis determines the length of each text and extracts different features for long and short texts, which enriches the features of short texts and reduces the computational burden of long ones.

Undesirable texts are scarcer and harder to crawl than normal texts, which leaves the training data imbalanced. In addition to under-sampling, this thesis recomputes the feature weights to improve classification accuracy.
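
The imbalance handling described above can be illustrated with a minimal Python sketch. The abstract does not say how the feature weights are recomputed, so everything here is an assumption made for illustration: the function names `undersample` and `balanced_term_weights` are hypothetical, documents are assumed to be pre-tokenized lists of terms, and the weighting shown (the spread of per-class document rates) is only one plausible choice, not the thesis's method.

```python
import random
from collections import Counter


def undersample(docs, labels, majority_label, ratio=1.0, seed=0):
    """Randomly drop majority-class documents until the majority class
    holds at most ratio * (number of other documents) items."""
    rng = random.Random(seed)
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    keep = min(len(majority), int(ratio * len(minority)))
    # Keep the original document order after sampling.
    chosen = sorted(rng.sample(majority, keep) + minority)
    return [docs[i] for i in chosen], [labels[i] for i in chosen]


def balanced_term_weights(docs, labels):
    """Assumed weighting scheme (not taken from the thesis): score each
    term by the gap between its highest and lowest per-class document
    rate. Using rates instead of raw counts keeps the larger class from
    dominating the weight."""
    class_sizes = Counter(labels)
    doc_freq = {}  # term -> Counter(label -> docs containing the term)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq.setdefault(term, Counter())[label] += 1
    weights = {}
    for term, per_class in doc_freq.items():
        rates = [per_class[c] / class_sizes[c] for c in class_sizes]
        weights[term] = max(rates) - min(rates)
    return weights
```

Calling `undersample(docs, labels, "normal")` and then `balanced_term_weights` on the result gives terms that occur in only one class a weight near 1 and terms that occur equally often in every class a weight near 0.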

Keywords/Search Tags:

undesirable text, text filtering, feature extraction, text classification

Related items

1	Research On Network Undesirable Text Filtering Based On Social Platform
2	Information Filtering Systems Based On Web Text Content And Design
3	Learning-Based Text Extraction In Natural Background
4	Text Filtering Key Technologies
5	Research On Text-Content-based Web Filtering Technology
6	Research And Application Of Talent Job Online Matching Based On Text Feature Extraction Technology
7	Design And Implementation Of Text Classification Model Based On The Improved TF-IDF Feature Extraction
8	Research On Network Text Classification Technique
9	A Research On Feature Extraction Applied For Text Classification
10	Research And Implementation Of Text Comprehensive Processing Platform