
Research On Data Desensitization Based On Deep Learning

Posted on: 2021-01-23
Degree: Master
Type: Thesis
Country: China
Candidate: X R Zheng
Full Text: PDF
GTID: 2428330611999751
Subject: Computer technology
Abstract/Summary:
With the advent of the information society, the explosive growth of data has catalyzed the vigorous development of the data industry. Effectively protecting the sensitive information contained in data of various forms, while minimizing the loss of data utility, can accelerate data circulation and further promote the growth of the industry. Data desensitization is a data security technique that protects sensitive data in a targeted manner while retaining as much of the original information as possible. This thesis takes data desensitization technology as its research object: it studies deep-learning-based recognition of sensitive information in unstructured data (text data, in particular electronic medical records), and it studies desensitization of structured data based on GANs (Generative Adversarial Networks).

For the desensitization of electronic medical records, traditional sensitive-information recognition methods based on rules and regular expressions require substantial expert knowledge, transfer poorly across domains, and recognize only rigid patterns. Recognition techniques based on machine learning and deep learning emerged in response, and recognition systems built on recurrent neural networks greatly improved recognition performance; however, their semantic-extraction ability is limited, their parallelism is relatively poor, and traditional static word vectors cannot use context to represent polysemous words accurately. BERT (Bidirectional Encoder Representations from Transformers), a dynamic word representation based on the attention mechanism, brings substantial improvements in feature extraction, in handling polysemy, and in parallelism. Building on BERT, the thesis designs the sensitive-information recognition models BERT-CRF and CharCNN-BERT-CRF. These models use BERT's pre-trained contextual word vectors as input features, providing rich semantic context for the downstream labeling task; in the labeling stage, a conditional random field (CRF) is introduced to optimize the label sequence; in addition, for English medical texts, a character convolution layer (CharCNN) is introduced to provide word-formation features. Experimental results show that the BERT-CRF and CharCNN-BERT-CRF models designed in the thesis achieve the best results on both Chinese and English datasets.

For the desensitization of structured data, methods based on anonymization or perturbation suffer from a one-to-one mapping between the desensitized data and the original data, which creates a risk that the desensitized data can be reversed. To address this problem, the thesis designs a generative network, ResTGAN, that synthesizes desensitized structured data: it uses ResNet as its main structure, adopts training ideas from WGAN and CGAN, and incorporates multi-task learning to optimize the training process. A statistical-information loss function is introduced to measure the statistical difference between real and generated data over high-dimensional features. Experiments show that the model achieves the best results in terms of data utility, while its security depends on the application scenario.
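To make the BERT-CRF architecture described above concrete, the following is a minimal sketch of a BERT-CRF tagger for sensitive-entity recognition. It assumes the Hugging Face `transformers` package and the `pytorch-crf` package (`torchcrf`); the model name, label set, and hyperparameters are illustrative assumptions, not the thesis's exact configuration.

```python
# Minimal BERT-CRF tagging sketch (assumes transformers + pytorch-crf).
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF


class BertCrfTagger(nn.Module):
    def __init__(self, num_labels: int, bert_name: str = "bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)      # contextual (dynamic) word vectors
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)
        self.crf = CRF(num_labels, batch_first=True)           # models label-transition constraints

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        emissions = self.classifier(self.dropout(hidden))
        mask = attention_mask.bool()
        if labels is not None:
            # training: negative log-likelihood of the gold label sequence under the CRF
            return -self.crf(emissions, labels, mask=mask, reduction="mean")
        # inference: Viterbi decoding of the best label sequence per sentence
        return self.crf.decode(emissions, mask=mask)
```

The CharCNN-BERT-CRF variant would additionally concatenate character-level convolution features to the BERT output before the linear emission layer; that extension is omitted here for brevity.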
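The thesis does not spell out the exact form of ResTGAN's statistical-information loss in this abstract, so the sketch below is only one plausible reading: a term that penalizes differences between the per-feature means and the feature correlation matrices of a real batch and a generated batch, added to the generator's adversarial objective. The weighting coefficient `lambda_stat` is hypothetical.

```python
# Illustrative statistical-information loss for tabular GAN training
# (an assumption, not the thesis's exact ResTGAN formulation).
import torch


def statistical_loss(real: torch.Tensor, fake: torch.Tensor) -> torch.Tensor:
    """real, fake: (batch, num_features) tensors of encoded tabular records."""
    # gap between per-feature means
    mean_gap = torch.mean((real.mean(dim=0) - fake.mean(dim=0)) ** 2)

    def corr(x: torch.Tensor) -> torch.Tensor:
        # Pearson correlation matrix of a batch
        x = x - x.mean(dim=0, keepdim=True)
        cov = x.t() @ x / (x.shape[0] - 1)
        std = torch.sqrt(torch.diag(cov) + 1e-8)
        return cov / (std.unsqueeze(0) * std.unsqueeze(1))

    # gap between pairwise feature correlations
    corr_gap = torch.mean((corr(real) - corr(fake)) ** 2)
    return mean_gap + corr_gap


# During generator updates, this term would be added to the adversarial
# (e.g. WGAN-style) generator loss with a hypothetical weight lambda_stat:
#   g_loss = adversarial_loss + lambda_stat * statistical_loss(real_batch, fake_batch)
```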
Keywords/Search Tags: data desensitization, deep learning, unstructured data, structured data, Generative Adversarial Networks