
A Study On Optimization Of Pre-trained Chinese Word Embedding In Transfer Learning

Posted on: 2019-04-03  Degree: Master  Type: Thesis
Country: China  Candidate: C W Pan  Full Text: PDF
GTID: 2428330545952593  Subject: Electronics and Communications Engineering
Abstract/Summary:
With the rise of deep learning, distributed and distributional representations have become one of the most common feature representations in natural language processing (NLP). By capturing the information in a target word's context, word embeddings serve as input features for machine translation, text classification, automatic summarization, and other NLP tasks.

The quality of a word embedding depends on the size of its training corpus, but the corpora available to practical systems are limited. Embeddings pre-trained on a large-scale corpus are therefore commonly reused as word features, a practice known as transfer learning of word embeddings. This transfer brings problems of its own: significant differences between the pre-training corpus and the target-task corpus cause semantic deviation in some word vectors and make rare words and new words difficult to represent, which introduces noise into model training and degrades model performance. The common remedy is to correct the pre-trained vectors through one specific NLP task, but a real system may contain multiple tasks, and whether the tasks are trained one by one or jointly via multi-task learning, it is difficult to accommodate them all at once.

Focusing on these issues, we study the pre-training and transfer of word embeddings. Our contributions are as follows:

(1) Improving on the traditional second-stage ("quadratic") training of word embeddings, we propose an unsupervised second-stage optimization method that requires no labeled data and therefore applies more widely. During transfer, the method adaptively adjusts the pre-trained word vectors to the target corpus so that they perform better across NLP tasks (a hedged sketch of this idea follows the abstract). In experiments on a Sohu web-news corpus, the proposed optimization improves performance on semantic-similarity and text-categorization tasks; the feature-level improvement reduces classification error by 7%.

(2) To address the representation of rare and unfamiliar words, we exploit the semantic information in the glyph structure of Chinese characters, extracting glyph features to refine the word vectors (a second sketch follows the abstract). This is especially useful in specialized domains, where corpora are small and rare or unfamiliar words are frequent. Experiments on the Sohu web-news dataset show that the method improves embedding performance in classification: for texts of 250 words, classification error falls by 11.4% relative to existing algorithms, particularly in data-sparse categories containing many rare words.

Our study concerns a fundamental research topic in natural language processing, and the results have academic value. At the same time, the proposed methods can be applied to a wide range of Chinese NLP tasks, making them general techniques with broad practical value.
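To make contribution (1) concrete, here is a minimal sketch of one standard way to realize unsupervised second-stage optimization: continue skip-gram training with negative sampling on the unlabeled target corpus, starting from the pre-trained vectors. The thesis does not publish its exact update rule, so the function name `fine_tune` and all hyperparameters below are illustrative assumptions, not the author's implementation.

```python
# Hedged sketch: second-stage optimization of pre-trained word vectors by
# continuing skip-gram training with negative sampling on the unlabeled
# target-domain corpus. Names and hyperparameters are assumptions.
import numpy as np

def fine_tune(vectors, vocab, corpus, window=2, negatives=5,
              lr=0.01, epochs=3, seed=0):
    """vectors: (V, d) pre-trained matrix; vocab: word -> row index;
    corpus: list of tokenized sentences from the target domain."""
    rng = np.random.default_rng(seed)
    W = vectors.copy()               # input (target-word) vectors, warm-started
    C = rng.normal(0.0, 0.1, W.shape)  # freshly initialized context vectors
    V = W.shape[0]
    for _ in range(epochs):
        for sent in corpus:
            ids = [vocab[w] for w in sent if w in vocab]
            for i, center in enumerate(ids):
                lo, hi = max(0, i - window), min(len(ids), i + window + 1)
                for j in range(lo, hi):
                    if j == i:
                        continue
                    # one positive context pair plus sampled negatives
                    targets = [ids[j]] + list(rng.integers(0, V, negatives))
                    labels = np.array([1.0] + [0.0] * negatives)
                    vecs = C[targets]                          # (k, d) copy
                    scores = 1.0 / (1.0 + np.exp(-vecs @ W[center]))
                    grad = scores - labels                     # (k,)
                    C[targets] -= lr * grad[:, None] * W[center]
                    W[center] -= lr * grad @ vecs
    return W
```

Warm-starting from the pre-trained matrix with a small learning rate keeps well-covered words close to their transferred positions while letting domain-specific words drift toward their usage in the target corpus, which matches the adaptive adjustment described above.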
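For contribution (2), the following is a minimal sketch of a glyph-feature back-off, assuming a character-to-component decomposition table is available: a word missing from the pre-trained vocabulary is represented by its characters, and each character mixes its own vector with the average of its glyph-component (radical) vectors. The toy tables, the decomposition dict `COMPONENTS`, and the mixing weight `alpha` are all assumptions for illustration, not the thesis's actual model.

```python
# Hedged sketch: representing rare/out-of-vocabulary Chinese words from the
# glyph structure of their characters. All tables here are toy placeholders.
import numpy as np

DIM = 100
rng = np.random.default_rng(0)

# Toy lookup tables; a real system would load these from trained data.
word_vecs = {"深度": rng.normal(size=DIM)}            # pre-trained word vectors
char_vecs = {"深": rng.normal(size=DIM), "度": rng.normal(size=DIM)}
comp_vecs = {"氵": rng.normal(size=DIM), "木": rng.normal(size=DIM)}
COMPONENTS = {"深": ["氵"], "桥": ["木"]}              # char -> glyph components

def char_vector(ch, alpha=0.5):
    """Mix a character's own vector with its glyph-component average."""
    comps = [comp_vecs[c] for c in COMPONENTS.get(ch, []) if c in comp_vecs]
    glyph = np.mean(comps, axis=0) if comps else None
    own = char_vecs.get(ch)
    if own is not None and glyph is not None:
        return alpha * own + (1 - alpha) * glyph
    return own if own is not None else glyph

def word_vector(word):
    """Pre-trained vector if available, otherwise glyph-aware back-off."""
    if word in word_vecs:
        return word_vecs[word]
    parts = [v for v in (char_vector(ch) for ch in word) if v is not None]
    return np.mean(parts, axis=0) if parts else np.zeros(DIM)

print(word_vector("深桥").shape)  # (100,) even for an unseen word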
Keywords/Search Tags:Machine Learning, Deep learning, NLP, word2vec, Text categorization