Research On Data Augmentation Method For Natural Language Processing

Posted on:2023-08-07

Degree:Master

Type:Thesis

Country:China

Candidate:J Li

GTID:2558307154479354

Subject:Engineering

Abstract/Summary:

With the help of data augmentation, deep neural networks now perform well on many small-sample tasks, especially in computer vision. Generating new image data is relatively easy: translation, rotation, compression, color adjustment, and similar operations produce a "new" image that looks almost identical to the original to the human eye, yet these operations are very effective for the model and preserve the original label, so the model can still classify the image correctly. Natural language is different. For computers, natural language is inherently discrete. For example, a model's judgment of a text's sentiment is driven by sentiment-bearing words, and adding a negation word such as "no" to a sentence can flip its sentiment label and shift its meaning. As a result, data augmentation is used less often in natural language processing tasks, especially in supervised learning, which instead relies on large amounts of manual annotation to enlarge the dataset. Given the high cost of manual annotation, data augmentation remains a technique that deserves in-depth research and exploration.

First, this thesis proposes a Diverse Mix Data Augmentation (DMDA) method that improves performance on text classification tasks. It combines four methods, back-translation, Mixup, Cutoff, and word replacement, by stacking them sequentially. On five text classification datasets (SST-2, CR, SUBJ, TREC, and PC), the method improves the accuracy of convolutional neural networks and long short-term memory networks. DMDA shows stronger gains on datasets with fewer available samples, reaching a maximum accuracy of 88.5%, and with only 60% of the available data it performs on par with the four basic methods trained on the full datasets. In addition, this thesis conducts low-resource experiments, ablation experiments, and experiments on the number of augmented samples, which show the applicability of the DMDA method.

Second, inspired by data augmentation methods in computer vision, this thesis proposes DMDA-AT (Diverse Mix Data Augmentation Based on Adversarial Training), which adds an adversarial training module on top of DMDA. Label-preserving transformation experiments are conducted to demonstrate the effectiveness of data augmentation for classification tasks. On the GLUE benchmark, DMDA-AT is evaluated with the RoBERTa-base language model, using both consistency regularization and contrastive regularization. The experimental results show that the most diverse and highest-quality augmented samples are obtained by sequentially stacking back-translation and adversarial training. With the added adversarial training module, DMDA-AT also outperforms both the single data augmentation methods and the hybrid method without adversarial training on the classification model, and low-resource and ablation experiments further confirm the effectiveness of the method.
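The idea of sequentially stacking augmentation operations can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The synonym table, function names, probabilities, and the text-level masking used for Cutoff are all assumptions made for illustration (Cutoff is commonly applied to embeddings rather than raw tokens), and back-translation is omitted because it requires a translation model.

```python
import random

# Hypothetical toy synonym table; the abstract does not say which
# replacement source the thesis uses.
SYNONYMS = {"good": ["great", "fine"], "movie": ["film"], "bad": ["poor"]}

def word_replace(tokens, rng, p=0.3):
    """Replace each token that has a known synonym with probability p."""
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS and rng.random() < p else t
            for t in tokens]

def token_cutoff(tokens, rng, ratio=0.2, mask="[MASK]"):
    """Mask one contiguous span covering about `ratio` of the tokens.

    Simplified text-level stand-in for Cutoff (an assumption).
    """
    if not tokens:
        return []
    n = max(1, int(len(tokens) * ratio))
    start = rng.randrange(0, len(tokens) - n + 1)
    return tokens[:start] + [mask] * n + tokens[start + n:]

def mixup(x1, y1, x2, y2, lam):
    """Linearly interpolate two feature vectors and their one-hot labels."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

def stack(tokens, transforms, rng):
    """Apply token-level transforms one after another (sequential stacking)."""
    for transform in transforms:
        tokens = transform(tokens, rng)
    return tokens

if __name__ == "__main__":
    rng = random.Random(0)
    sentence = "this is a good movie".split()
    print(stack(sentence, [word_replace, token_cutoff], rng))
    # Mixup acts on vector representations; lam is often drawn from a
    # Beta distribution, here Beta(0.4, 0.4) as an arbitrary choice.
    lam = rng.betavariate(0.4, 0.4)
    print(mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam))
```

Word replacement and cutoff act on the token sequence, while Mixup acts on vectors and soft labels, so in this sketch the token-level steps are chained first and Mixup is applied separately to the encoded examples.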

Keywords/Search Tags:

Data Augmentation, Sample Diversity, Adversarial Training, Low Resources

Related items

1	Research And Application Of Image Data Augmentation Technology Based On Generative Adversarial Networks
2	Research On Recommendation Algorithm Based On Adversarial Training
3	Research Of SAR Image Data Diversity And Data Augmentation Method
4	Enhancing Adversarial Training With Adaptive Attack Methods And Sample Case Aware Strategies
5	ML-NIDS-oriented Adversarial Sample Generation
6	Research On HRRP Generative Data Augmentation Method Based On Feature Decoupling
7	Research On Data Augmentation Method Of Surface Roughness Image Sample
8	Research On Time Series Data Classification Method Under Small Sample Condition
9	Application Research Of Virtual Sample Generation Technology Based On Co-training
10	Deep Adversarial Data Augmentation For Extremely Low Data Regimes