The Research And Implementation Of Text Feature-based Bad Web Page Detection System

Posted on: 2021-11-29
Degree: Master
Type: Thesis
Country: China
Candidate: G J Zhu
GTID: 2518306551952369
Subject: Control Engineering
Abstract/Summary:
With the continued growth of the Internet, web pages carrying pornography, gambling, and similar content do great harm to individuals and to society. Improving the ability to detect web pages that spread harmful information, and thereby building a healthier network environment, has become a pressing need. In recent years, deep learning has made great progress in feature extraction and feature representation and has achieved satisfactory results on natural language processing tasks. From word embeddings in 2013 to sequence-to-sequence models, attention mechanisms, and pre-trained language models, natural language processing has developed rapidly. Because text makes up most of the content of web pages, this thesis uses text processing techniques to detect and identify bad web pages. The main work of this thesis is as follows:

1. The text classification techniques used for detecting and recognizing bad web pages are studied, including text representation, preprocessing, feature selection, common classification algorithms, and recent deep learning algorithms.

2. A text classification algorithm combining the self-attention mechanism of the Transformer model with a convolutional neural network is proposed, and its effectiveness is verified through experiments.

3. Web page parsing and preprocessing are performed on a real web page data set, and data augmentation is applied. A BiLSTM (Bi-directional Long Short-Term Memory) text classification algorithm with an attention mechanism and TF-IDF (Term Frequency-Inverse Document Frequency) is proposed, and its effectiveness is verified through experiments. This algorithm is then used to extract features, which are fused with other features to obtain the final web page classification model. At the same time, web page text keywords are extracted using the attention mechanism.

4. Bad web page detection software is designed and implemented. The detection result is obtained by entering a web address directly, and the keywords extracted during recognition, together with the analysis of the web page, are displayed.
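
The TF-IDF (Term Frequency-Inverse Document Frequency) weighting named in the abstract can be illustrated with a minimal sketch. The abstract does not state which TF-IDF variant the thesis uses or how it is combined with the BiLSTM attention weights, so the code below assumes the common textbook form (relative term frequency times the log of inverse document frequency) and shows only the TF-IDF part. The function names `tfidf` and `top_keywords` and the sample tokens are hypothetical, chosen for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Compute TF-IDF scores for a list of tokenized documents.

    Assumed variant (not taken from the thesis):
    tf = count / document length, idf = log(N / document frequency).
    Returns one dict (term -> score) per document.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc)
        scores.append({
            term: (count / length) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_keywords(score, k=3):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical tokenized page texts, for illustration only.
pages = [
    ["casino", "bet", "win", "bet"],
    ["news", "weather", "win"],
    ["news", "sports", "score"],
]
page_scores = tfidf(pages)
keywords = top_keywords(page_scores[0], k=2)
```

In this sketch, terms that are frequent in one page but rare across the collection receive the highest scores, while a term that appears in every page scores zero. A system like the one described would presumably use such weights alongside learned attention weights rather than on their own.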
Keywords/Search Tags: Bad Webpage Detection, Text Classification, Attention Mechanism, Feature Fusion