
Research on Key Technologies of Word and Sentence Representation Based on Deep Learning Approaches

Posted on: 2021-01-08    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Q C Fu    Full Text: PDF
GTID: 1368330605981273    Subject: Software engineering
Abstract/Summary:
With the continuous development of the Internet, a large amount of text data is generated every day. How to process this massive text data with unsupervised methods and extract effective semantic or syntactic features from it is a research hotspot in the field of Natural Language Processing (NLP), and text representation research pursues precisely this goal. Unsupervised text representation learning provides low-dimensional dense feature vectors for downstream NLP tasks, mitigating the impact of dimensional explosion and data sparseness, while text representation used as a regularization strategy can effectively improve the generalization performance of NLP algorithms. In recent years, with continuous improvements in hardware computing power and in model optimization methods, deep learning has gradually become mainstream in artificial intelligence and has achieved leading results in image, speech, and natural language processing research. This thesis focuses on text representation based on deep learning, systematically summarizes and analyzes text representation at the word level and the sentence level, and proposes its own representation methods. The main research contents are as follows.

1. A word embedding evaluation method based on domain keywords. Aiming at the shortcomings of word embedding technology, this thesis proposes a word embedding evaluation method based on domain keywords to reduce the impact of those shortcomings on downstream tasks. First, task-related domain keywords are extracted with unsupervised methods; then a score is computed from the word vectors of these keywords to judge whether the embedding is suitable for the task. The proposed method also addresses three shortcomings of traditional evaluation methods such as semantic-similarity-based evaluation: heavy dependence on manually annotated data sets, lack of correlation with the target task, and inability to distinguish polysemous words. It solves these three problems while retaining the low computational cost of intrinsic evaluation. In the experimental part, it is first shown on the semantic similarity task that the proposed method can judge whether an embedding correctly captures the semantics of similar word pairs, so it can replace manually labeled data sets for evaluating embedding quality. Second, a TOEFL text classification data set covering sub-disciplinary fields is used to verify that the evaluation method is relevant to downstream tasks: the Pearson correlation coefficient between the evaluation score and the actual classification accuracy is 0.795, demonstrating a strong correlation between the proposed evaluation score and actual downstream task performance.

2. Dynamic context word vectors based on a bidirectional Transformer language model and their application. Addressing the shortcomings of static word vector representation, this thesis studies dynamic context word vector representation based on language models and proposes a dynamic context word vector model built on a bidirectional Transformer language model. For text classification tasks, multi-probing tasks are used to introduce additional language features that are helpful for classification; combining the two improves text classification performance. The application is a semi-supervised model: a language model is first pre-trained on a large raw corpus in an unsupervised manner to learn a general text representation, which is then used in supervised training for downstream text classification. In the language model part, a bidirectional Transformer is used to train the language model, improving its operating efficiency, and the designed multi-probing tasks fine-tune the language model so that the text representation it captures becomes more suitable for text classification. When training the downstream classifiers, the pre-trained language model is introduced, and a hierarchical optimization method together with stepwise unfreezing is used to speed up training and prevent overfitting. In the experimental part, the model was tested on six text classification data sets across three categories: sentiment analysis, question classification, and topic classification. Classification performance improved to a certain extent, especially on small-sample data sets.

3. An improved attention mechanism for learning general sentence vectors. For general sentence representation, this thesis proposes an encoder-decoder method with an improved attention mechanism that learns distributed sentence representations which can be used directly in other NLP tasks. A convolutional neural network serves as the sentence encoder, mapping the input sentence to a low-dimensional vector of fixed dimension; a recurrent neural network decoder then decodes this vector back into a sentence. The design is inspired by the linguistic properties of word vectors, in which different dimensions can correspond to different linguistic features. Since the encoder output is a single one-dimensional vector, the standard attention mechanism cannot be applied, so a new attention mechanism is proposed to optimize each dimension of the sentence vector. Using a convolutional neural network as the encoder also improves the efficiency of sentence encoding. In the experimental part, the model is first pre-trained on a large raw data set to obtain a general sentence encoder; experiments were then performed on seven data sets covering text classification, paraphrase detection, and semantic relatedness tasks. The experimental results improved, indicating that the general sentence encoder extracts effective features and is suitable for multiple tasks.
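The keyword-based intrinsic evaluation in contribution 1 can be sketched as follows. The abstract does not state the exact scoring formula, so the clustering-style score below (mean cosine similarity of each domain keyword to its nearest keyword neighbours) is an assumption, as is the idea that keywords would come from an unsupervised extractor such as TF-IDF:

```python
import numpy as np

def domain_keyword_score(embeddings, keywords, top_k=5):
    """Score an embedding table by how coherently it places
    task-specific domain keywords.  This clustering-based rule is a
    hypothetical reading of the thesis, not its exact formula."""
    vecs = np.array([embeddings[w] for w in keywords if w in embeddings])
    # Normalize rows to unit length so dot products are cosine similarities.
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T
    np.fill_diagonal(sims, -np.inf)          # ignore self-similarity
    # Average similarity of each keyword to its top_k nearest keywords.
    top = np.sort(sims, axis=1)[:, -top_k:]
    return float(top.mean())
```

A higher score would suggest the embedding keeps the task's vocabulary in a coherent region of the space; comparing this score against downstream accuracy is what the reported Pearson correlation of 0.795 measures.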
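The "hierarchical optimization" and "stepwise unfreezing" used when fine-tuning the classifier in contribution 2 resemble ULMFiT-style discriminative learning rates and gradual unfreezing; a minimal sketch under that assumption (the decay factor 2.6 is borrowed from that literature, not from the thesis):

```python
def layer_learning_rates(base_lr, n_layers, decay=2.6):
    """Discriminative (layer-wise) learning rates: earlier layers get
    progressively smaller rates than the top layer."""
    return [base_lr / (decay ** (n_layers - 1 - i)) for i in range(n_layers)]

def unfreeze_schedule(n_layers):
    """Gradual unfreezing: at stage e, only the top e+1 layers train;
    earlier layers stay frozen to prevent catastrophic forgetting."""
    return [list(range(n_layers - 1 - e, n_layers)) for e in range(n_layers)]
```

In a real training loop these would map onto per-layer optimizer parameter groups, with one unfreezing stage per epoch until all layers are trainable.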
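The per-dimension attention of contribution 3 addresses the fact that the CNN encoder emits a single fixed vector, so there are no time steps to attend over. One hedged reading is a softmax gate over feature dimensions; the matrix `W` is a hypothetical learned parameter, and the thesis's exact formulation may differ:

```python
import numpy as np

def dimension_attention(sent_vec, W):
    """Dimension-wise attention sketch: compute one score per feature
    dimension of the sentence vector, softmax the scores, and reweight
    each dimension accordingly.  An assumed form, not the thesis's."""
    scores = W @ sent_vec                    # one score per dimension
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()                 # softmax over feature dimensions
    return weights * sent_vec                # emphasize informative dimensions
```

The intuition follows the abstract's observation that different dimensions of a representation can carry different linguistic features, so each dimension is weighted individually rather than attending over positions.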
Keywords/Search Tags:deep learning, unsupervised, word representation, sentence representation, semi-supervised