
Research On Data Desensitization Based On Deep Learning

Posted on: 2021-01-23
Degree: Master
Type: Thesis
Country: China
Candidate: X R Zheng
Full Text: PDF
GTID: 2428330611999751
Subject: Computer technology
Abstract/Summary:
With the advent of the information society, the explosive growth of data has catalyzed the vigorous development of the data industry. Effectively protecting the sensitive information contained in data of various forms, while minimizing the loss of data utility, can accelerate data circulation and further promote the growth of the industry. Data desensitization is a data security technique that protects sensitive data in a targeted manner while retaining as much of the original information as possible. This thesis takes data desensitization technology as its research object: it studies deep-learning-based recognition of sensitive information in unstructured data (text data, in particular electronic medical records), and it studies desensitization of structured data based on GANs (Generative Adversarial Networks).

For the desensitization of electronic medical records, traditional sensitive-information recognition methods based on rules and regular expressions require substantial expert knowledge, transfer poorly across domains, and recognize only rigid patterns. Recognition techniques based on machine learning and deep learning emerged in response, and recognition systems built on recurrent neural networks greatly improved recognition performance; however, their semantic-extraction ability is limited, their parallelism is relatively poor, and traditional static word vectors cannot use context to represent polysemous words accurately. BERT (Bidirectional Encoder Representations from Transformers), a dynamic word representation based on the attention mechanism, brings substantial improvements in feature extraction, in handling polysemy, and in parallelism. Building on BERT, the thesis designs the sensitive-information recognition models BERT-CRF and CharCNN-BERT-CRF. These models use BERT's pre-trained contextual word vectors as input features, providing rich semantic context for the downstream labeling task; in the labeling stage, a conditional random field (CRF) is introduced to optimize the label sequence; in addition, for English medical texts, a character convolution layer (CharCNN) is introduced to provide word-formation features. Experimental results show that the BERT-CRF and CharCNN-BERT-CRF models designed in the thesis achieve the best results on both Chinese and English datasets.

For the desensitization of structured data, methods based on anonymization or perturbation suffer from a one-to-one mapping between the desensitized data and the original data, which creates a risk that the desensitized data can be reversed. To address this problem, the thesis designs a generative network, ResTGAN, that synthesizes desensitized structured data: it uses ResNet as its main structure, adopts training ideas from WGAN and CGAN, and incorporates multi-task learning to optimize the training process. A statistical-information loss function is introduced to measure the statistical difference between real and generated data over high-dimensional features. Experiments show that the model achieves the best results in terms of data utility, while its security depends on the application scenario.
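To make the BERT-CRF architecture described above concrete, the following is a minimal sketch of a BERT-CRF tagger for sensitive-entity recognition. It assumes the Hugging Face `transformers` package and the `pytorch-crf` package (`torchcrf`); the model name, label set, and hyperparameters are illustrative assumptions, not the thesis's exact configuration.

```python
# Minimal BERT-CRF tagging sketch (assumes transformers + pytorch-crf).
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF


class BertCrfTagger(nn.Module):
    def __init__(self, num_labels: int, bert_name: str = "bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)      # contextual (dynamic) word vectors
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)
        self.crf = CRF(num_labels, batch_first=True)           # models label-transition constraints

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        emissions = self.classifier(self.dropout(hidden))
        mask = attention_mask.bool()
        if labels is not None:
            # training: negative log-likelihood of the gold label sequence under the CRF
            return -self.crf(emissions, labels, mask=mask, reduction="mean")
        # inference: Viterbi decoding of the best label sequence per sentence
        return self.crf.decode(emissions, mask=mask)
```

The CharCNN-BERT-CRF variant would additionally concatenate character-level convolution features to the BERT output before the linear emission layer; that extension is omitted here for brevity.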
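The thesis does not spell out the exact form of ResTGAN's statistical-information loss in this abstract, so the sketch below is only one plausible reading: a term that penalizes differences between the per-feature means and the feature correlation matrices of a real batch and a generated batch, added to the generator's adversarial objective. The weighting coefficient `lambda_stat` is hypothetical.

```python
# Illustrative statistical-information loss for tabular GAN training
# (an assumption, not the thesis's exact ResTGAN formulation).
import torch


def statistical_loss(real: torch.Tensor, fake: torch.Tensor) -> torch.Tensor:
    """real, fake: (batch, num_features) tensors of encoded tabular records."""
    # gap between per-feature means
    mean_gap = torch.mean((real.mean(dim=0) - fake.mean(dim=0)) ** 2)

    def corr(x: torch.Tensor) -> torch.Tensor:
        # Pearson correlation matrix of a batch
        x = x - x.mean(dim=0, keepdim=True)
        cov = x.t() @ x / (x.shape[0] - 1)
        std = torch.sqrt(torch.diag(cov) + 1e-8)
        return cov / (std.unsqueeze(0) * std.unsqueeze(1))

    # gap between pairwise feature correlations
    corr_gap = torch.mean((corr(real) - corr(fake)) ** 2)
    return mean_gap + corr_gap


# During generator updates, this term would be added to the adversarial
# (e.g. WGAN-style) generator loss with a hypothetical weight lambda_stat:
#   g_loss = adversarial_loss + lambda_stat * statistical_loss(real_batch, fake_batch)
```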
Keywords/Search Tags: data desensitization, deep learning, unstructured data, structured data, Generative Adversarial Networks