Text Data Augmentation Technique Based On Field Features

Posted on: 2022-01-10
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Li
Full Text: PDF
GTID: 2518306725984289
Subject: Master of Engineering
Abstract/Summary:
With the development of deep neural network technology, models trained on field text data sets are increasingly applied across many areas of society to solve practical problems. Building such deep learning models requires large-scale, high-quality field text data as a training set. In practice, factors such as the high cost of acquiring field text lead to insufficient training data and unbalanced sample distributions, which in turn weaken the generalization ability of deep learning models. Data augmentation is a technique for increasing the size of the training set. However, commonly used text data augmentation techniques tend to disturb the words and semantic structure that carry a text's field features, so the augmented text is of poor quality and does little to improve the model's generalization ability. To address this, this thesis takes a judicial-field data set as an example and designs and implements a text data augmentation technique based on field features, consisting of preprocessing steps for the field text data set and four feature augmentation methods. Data set preprocessing provides support for the subsequent augmentation of text data based on field features. The feature pruning augmentation method based on TF-IDF weights uses the TF-IDF values of the segmented words in the data set, combined with dependency syntax analysis, to perform pruning operations. The feature fusion augmentation method based on the topic model uses topic model technology to cluster similar texts in the data set and exchanges content between the text to be augmented and similar target texts. The feature transformation augmentation method based on dependency syntax uses dependency syntax analysis to deconstruct the text and exchanges branches that have the same dependency relation in the syntax tree.
The feature replacement method based on word frequency and part of speech builds a high-frequency vocabulary and a word vector model from an analysis of the field data set, and replaces words in the text that are high-frequency and have the relevant parts of speech with field synonyms recommended by the word vector model. This thesis designs comparative experiments. First, a high-quality text classification model is built on the judicial data set, and feature-augmented text and EDA-augmented text are used as test sets. The results show that feature-augmented text preserves category labels well and effectively retains the field features of the text. Second, data augmented with the feature augmentation methods and with the EDA method are added to the original training sets of the judicial and media fields. Compared with CNN and RNN models trained on the original data alone, models trained with the added augmented data achieve higher accuracy. Overall, models trained with feature-augmented text show a larger accuracy gain on the test set than models trained with EDA-augmented text. The experiments indicate that the text data augmentation technique based on field features has a degree of practicality and effectiveness.
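As a rough illustration of the TF-IDF-weighted pruning idea mentioned in the abstract, the following Python sketch scores tokens by TF-IDF and drops the lowest-weighted ones. It is not the thesis's implementation: the function names `tfidf_weights` and `prune_low_weight`, the `keep_ratio` parameter, the toy corpus, and the plain `tf * log(N / df)` weighting are all assumptions made for this example, and the dependency syntax analysis that the thesis combines with TF-IDF is omitted.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {token: tf-idf} dict per tokenized document (hypothetical helper)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            tok: (cnt / total) * math.log(n_docs / df[tok])
            for tok, cnt in counts.items()
        })
    return weights


def prune_low_weight(doc, weights, keep_ratio=0.8):
    """Keep roughly keep_ratio of the tokens, dropping the lowest-weighted ones.

    keep_ratio is an assumed knob for this sketch, not a value from the thesis.
    """
    n_keep = max(1, round(len(doc) * keep_ratio))
    # Rank token positions by weight (highest first), then restore text order.
    ranked = sorted(range(len(doc)), key=lambda i: weights[doc[i]], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [doc[i] for i in kept]


# Toy, pre-tokenized corpus invented for illustration only.
docs = [
    ["the", "court", "sentenced", "the", "defendant"],
    ["the", "defendant", "appealed", "the", "verdict"],
    ["the", "court", "dismissed", "the", "appeal"],
]
weights = tfidf_weights(docs)
pruned = prune_low_weight(docs[0], weights[0], keep_ratio=0.6)
```

In this toy corpus "the" appears in every document, so its inverse document frequency is zero and it is pruned first, while the field-bearing words are kept. A fuller version would presumably consult the dependency tree before removing a token so that the sentence stays grammatical.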
Keywords/Search Tags: Text data augmentation, Natural language processing, TF-IDF algorithm, Topic model, Field features