Text Data Augmentation Technique Based On Field Features

Posted on:2022-01-10

Degree:Master

Type:Thesis

Country:China

Candidate:Z Y Li

Full Text:PDF

GTID:2518306725984289

Subject:Master of Engineering

Abstract/Summary:


With the development of deep neural networks, models trained on field-specific text data sets are increasingly applied to practical problems in many areas of society. Building such models requires large-scale, high-quality field text as training data. In practice, factors such as the high cost of acquiring field text lead to insufficient training data and unbalanced sample distributions, which in turn weaken the generalization ability of deep learning models. Data augmentation is a technique for enlarging the training set. However, commonly used text data augmentation techniques tend to disturb the words and semantic structures that carry a text's field features, so the augmented text is of poor quality and does little to improve model generalization.

Given this, this thesis takes a judicial data set as an example and designs and implements a text data augmentation technique based on field features, consisting of a preprocessing step for the field text data set and four feature augmentation methods. Preprocessing supports the subsequent augmentation of text data based on field features. The feature pruning augmentation method based on TF-IDF weights uses the TF-IDF values of the segmented words in the data set, combined with dependency syntax analysis, to prune the text; the feature fusion augmentation method based on a topic model clusters similar texts in the data set with a topic model and exchanges content between the text to be augmented and similar target texts; the feature transformation augmentation method based on dependency syntax deconstructs the text with dependency syntax analysis and exchanges branches of the syntax tree that share the same dependency relation; and the feature replacement method based on word frequency and part of speech builds a high-frequency vocabulary and a word vector model from the field data set, then replaces words that are high-frequency and have the relevant parts of speech with field synonyms recommended by the word vector model.

A comparative experiment first builds a high-quality text classification model on the judicial data set and uses the feature-augmented text and EDA-augmented text as test sets. The results show that the feature-augmented text preserves category labels well and effectively retains the field features of the text. Second, data produced by the feature augmentation methods and by EDA is added to the original training sets for the judicial and media fields. Compared with CNN and RNN models trained on the original data alone, the models trained with the augmented data achieve higher accuracy, and in general the models trained with feature-augmented text improve more in test-set accuracy than those trained with EDA-augmented text. The experiments indicate that text data augmentation based on field features has a degree of practicality and effectiveness.
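To make the idea behind TF-IDF-based feature pruning more concrete, the following is a minimal sketch and not the thesis's implementation. It assumes that texts are already segmented into tokens, that low TF-IDF weight marks a token as carrying little field information, and that such tokens are the ones to prune. The dependency syntax analysis that the abstract combines with TF-IDF is omitted here. The function names `tfidf_weights` and `prune_low_weight`, the `max_drop` parameter, and the toy English corpus are all hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {token: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            tok: (count / len(doc)) * math.log(n_docs / df[tok])
            for tok, count in tf.items()
        })
    return weights


def prune_low_weight(doc, weights, max_drop=1):
    """Drop up to max_drop tokens with the lowest TF-IDF weight.

    Assumption for this sketch: low-weight tokens carry the least
    field-specific information, so removing them changes the surface
    form while keeping the field features.
    """
    # Rank token positions by weight, lowest first; ties go to the
    # earlier position so the result is deterministic.
    ranked = sorted(range(len(doc)), key=lambda i: (weights[doc[i]], i))
    drop = set(ranked[:max_drop])
    return [tok for i, tok in enumerate(doc) if i not in drop]


# Hypothetical toy corpus standing in for segmented judicial text.
docs = [
    ["the", "court", "sentenced", "the", "defendant"],
    ["the", "plaintiff", "filed", "the", "appeal"],
    ["the", "court", "dismissed", "the", "appeal"],
]
weights = tfidf_weights(docs)
print(prune_low_weight(docs[0], weights[0], max_drop=2))
# prints ['court', 'sentenced', 'defendant']
```

In this toy corpus "the" appears in every document, so its weight is zero and it is pruned first, while rarer tokens such as "defendant" are kept. A version closer to the abstract would also consult a dependency parse before removing a token, so that pruning does not break the sentence structure.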

Keywords/Search Tags:

Text data augmentation, Natural language processing, TF-IDF algorithm, Topic model, Field features


Related items

1	Modeling And Improvement Of Recommendation System Combining Attention Mechanism And Bidirectional Text Features
2	Research And Application Of Topic Model For Short Texts Based On Part-of-Speech Feature And Semantic Enhancement
3	Research On Joint Learning Of Topic And Embedding Model
4	Identification And Empirical Study Of Content Features Of Domain Emerging Topics
5	Research On Text Clustering Algorithm Based On Word Frequency And Semantic
6	Research On Text Representation Model And Application In Text Classification And Natural Language Inference
7	Natural Language Processing Based On Semantic And Sentiment Aspects For Recommendation System
8	Word Embeddings Towards Text Classification Of Emotion And Topic
9	Research On The Construction Method Of Technology Domain Thematic Library Based On Multilevel Topic Vector
10	Research On Natural Language Understanding Of Air Travel Based On Joint Modeling