Research On Filtration Technology Of Sensitive Information Based On ERNIE-TextCNN

Posted on: 2022-07-16
Degree: Master
Type: Thesis
Country: China
Candidate: W Luo
Full Text: PDF
GTID: 2518306572497074
Subject: Computer technology
Abstract/Summary:
Text data is one of the most common types of information in people's online lives. Even as it helps people express their opinions, it is likely to be mixed with sensitive information such as violence, terrorism, vulgar pornography, spam advertisements, and swear words. Algorithms based on character matching and decision trees can locate sensitive information accurately, but they work poorly once variants appear that are written with traditional Chinese characters, disassembled characters, or the Chinese phonetic alphabet. Traditional machine learning techniques such as Bayesian classifiers and random forests can classify and filter information according to selected features, but they require a lot of feature engineering in the early stage. The limited features make it difficult for the model to extract context-based semantic information, which leads to a low screening rate for sensitive information and prevents effective filtering at large scale.

To solve these problems, the ERNIE-TextCNN model, which combines the ERNIE pre-trained model with a text convolutional neural network, is designed and implemented. The model starts from word embeddings that have been pre-trained on a large amount of language data, then adjusts them dynamically according to the features of sensitive information, so that the resulting word embeddings carry attention information and multi-dimensional features. Because sensitive information usually appears in the form of words and phrases, the convolutional neural network can extract the locally sensitive information in the text. After information fusion and dimension reduction, a decision vector with the same number of dimensions as there are sensitive-information categories is obtained. Finally, the category of sensitive information is predicted through softmax regression.

The construction of the ERNIE-TextCNN model also takes overall convergence speed into account. Since the multi-head attention of ERNIE and the multi-dimensional convolution of TextCNN can both be computed in parallel, the ERNIE-TextCNN model can improve the accuracy of sensitive information filtering while still converging quickly. Comparison models are constructed under the same experimental environment. Analysis of the test results shows that the ERNIE-TextCNN model achieves the best classification accuracy for sensitive information with hardly any effect on convergence speed. It assigns 99.63% of the test dataset to the correct category.
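The classification stage described in the abstract (convolution over word embeddings, fusion, dimension reduction, softmax) can be illustrated with a minimal pure-Python sketch. It is not the thesis implementation: the random vectors stand in for ERNIE output embeddings, "information fusion" is assumed to mean concatenating max-pooled filter outputs, "dimension reduction" is assumed to be a single dense layer, and the sizes (`SEQ_LEN`, `DIM`, `NUM_CLASSES`, filter widths 2 to 4) are hypothetical.

```python
import math
import random


def conv_relu_maxpool(embeddings, kernel, bias):
    """Apply one convolution filter along the token axis, then ReLU and max-over-time pooling.

    embeddings: token vectors, shape [seq_len][dim]
    kernel:     filter weights, shape [width][dim]
    """
    width = len(kernel)
    best = 0.0  # ReLU floor; also the result when the text is shorter than the filter
    for start in range(len(embeddings) - width + 1):
        activation = bias
        for offset in range(width):
            activation += sum(
                w * x for w, x in zip(kernel[offset], embeddings[start + offset])
            )
        best = max(best, activation)
    return best


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def textcnn_head(embeddings, filters, dense_weights, dense_bias):
    """Pooled filter outputs -> dense layer -> class probabilities."""
    # Assumed "fusion": one pooled feature per filter, collected into a single vector.
    pooled = [conv_relu_maxpool(embeddings, kernel, bias) for kernel, bias in filters]
    # Assumed "dimension reduction": a dense layer with one output per category.
    logits = [
        sum(w * p for w, p in zip(row, pooled)) + b
        for row, b in zip(dense_weights, dense_bias)
    ]
    return softmax(logits)


# Hypothetical sizes and random weights, for illustration only.
rng = random.Random(0)
SEQ_LEN, DIM, NUM_CLASSES = 8, 4, 6


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


embeddings = rand_matrix(SEQ_LEN, DIM)  # stand-in for ERNIE token embeddings
filters = [(rand_matrix(width, DIM), 0.0) for width in (2, 3, 4)]
dense_weights = rand_matrix(NUM_CLASSES, len(filters))
dense_bias = [0.0] * NUM_CLASSES

probs = textcnn_head(embeddings, filters, dense_weights, dense_bias)
predicted = max(range(NUM_CLASSES), key=probs.__getitem__)
print(predicted, [round(p, 3) for p in probs])
```

Each filter is independent of the others, which is why the convolutions can be evaluated in parallel, as the abstract notes for TextCNN.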
Keywords/Search Tags: Sensitive information, Word embedding, ERNIE, Neural network