Research On Internet Sensitive Content Detection Methods

Posted on: 2023-03-22
Degree: Master
Type: Thesis
Country: China
Candidate: J R Dou
GTID: 2558306905999479
Subject: Engineering
Abstract/Summary:
With the spread of the Internet, information circulates more easily and freely, which also gives sensitive content such as pornography, violence, terrorism and politics the opportunity to be widely disseminated. To maintain national honor and protect the physical and mental health of teenagers, all kinds of sensitive information need to be detected and monitored effectively. The traditional detection method, based on character matching, compares an existing sensitive word table with the text to be detected. Although it achieves high detection accuracy, it depends heavily on the word database: if the database used for retrieval is incomplete, the detection result is greatly compromised. In recent years, to evade platform censorship, many malicious users have deformed the sensitive words in their text so that network platforms cannot identify them. It is therefore necessary to improve the detection of sensitive words and their variants in order to create a healthy and safe network space.

The research object of this thesis is web text data containing sensitive information. Different text lengths call for different detection methods, so the text to be detected is divided into short text and long text. Short text carries little information, and detection focuses on key words. Long text carries a large amount of information, and detection focuses on the semantic and logical relationships of the context. Based on these considerations, this thesis studies sensitive information detection in short text and long text separately and proposes a detection method for each, to improve the efficiency of sensitive information recognition.

For short text, which carries little information and has weak semantic and logical relationships in context, this thesis designs character matching detection algorithms for several kinds of sensitive word variants. First, four common variant expressions of sensitive words are analyzed and summarized (homophones, visually similar words, abbreviations and split words), and detection methods adapted to them are proposed. For variants with similar pronunciation or shape, a detection algorithm based on shape code is proposed. Using the pronunciation and structure of the text to be detected and of the sensitive word, the algorithm first parses both into shape codes, then decides whether the text contains the sensitive word by computing the edit distance between the shape code of the unknown text and that of the sensitive word, and outputs the category of the detected sensitive word. For abbreviation variants, a detection algorithm based on the Boyer-Moore (BM) algorithm is adopted: it extracts the first letters of the text and of the sensitive words, combines them, and matches them with the BM algorithm. For split variants, a detection algorithm based on location code is used: it converts the split text and the sensitive words into a location code representation and uses the Knuth-Morris-Pratt (KMP) algorithm to detect sensitive words in the unknown text. Finally, the three algorithms are combined by voting to judge the unknown short text jointly. Experiments show that the voting method achieves higher accuracy than any single detection algorithm, which demonstrates the feasibility of the scheme.

Compared with short text, long text contains redundant information. This thesis therefore proposes a sensitive information detection framework based on a CNN-BiLSTM-Attention model. The framework consists of three stages. In the data acquisition stage, existing sensitive word lexicons are used to build keyword matching methods that crawl data sets of sensitive text from the Twitter and Weibo platforms. In the data preprocessing stage, because redundant information in long text affects the subsequent judgment, a TextRank method based on Word2vec is used to extract the key information from the data. In the model training stage, CNN and BiLSTM are combined to build the detection model: CNN first captures local key features in the text, and BiLSTM then captures the semantic features of the context. Finally, an Attention mechanism is added to assign different weights to features in the text, improving the overall detection performance of the model. Experimental results show that the model reaches an accuracy of 83.72%, which is better than the CNN and BiLSTM-Attention models, and demonstrate its effectiveness in sensitive information detection for long text.
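
The abstract names edit distance and KMP as the matching primitives behind the shape-code and location-code detectors but does not give an implementation. The following is a minimal Python sketch of those two primitives only, assuming the text and the sensitive word have already been encoded as plain strings. The shape-code and location-code encoders are not described here and are not reproduced. The function names, the `max_dist` threshold and the fixed-length sliding window are illustrative assumptions, not the thesis's actual design.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_contains(text_code: str, word_code: str, max_dist: int = 1) -> bool:
    """True if some window of text_code is within max_dist edits of word_code.

    Assumption: windows have the same length as word_code, so this mainly
    catches substituted symbols. max_dist is a hypothetical threshold.
    """
    n = len(word_code)
    if n == 0 or len(text_code) < n:
        return False
    return any(edit_distance(text_code[i:i + n], word_code) <= max_dist
               for i in range(len(text_code) - n + 1))


def kmp_find(text: str, pattern: str) -> int:
    """Index of the first exact occurrence of pattern in text, or -1 (KMP)."""
    if not pattern:
        return 0
    # fail[i] = length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of it.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text, reusing the failure table instead of backing up.
    k = 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1
```

In a pipeline like the one described, `fuzzy_contains` would be applied to shape codes and `kmp_find` to location codes, with each detector's verdict then passed to the voting step.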
Keywords/Search Tags: sensitive information detection, text classification, deep learning, character match, TextRank