
A Study On Optimization Of Pre-trained Chinese Word Embedding In Transfer Learning

Posted on: 2019-04-03  Degree: Master  Type: Thesis
Country: China  Candidate: C W Pan  Full Text: PDF
GTID: 2428330545952593  Subject: Electronics and Communications Engineering
Abstract/Summary:
With the rise of deep learning, distributed and distributional representations have become one of the most common feature representations in natural language processing (NLP). By capturing the information in a target word's context, word embeddings serve as input features for machine translation, text classification, automatic summarization, and other NLP tasks.

The quality of a word embedding depends on the size of its training corpus, but the corpora available to practical systems are limited. Embeddings pre-trained on a large-scale corpus are therefore commonly reused as word features, a practice known as transfer learning of word embeddings. This transfer brings problems of its own: significant differences between the pre-training corpus and the target-task corpus cause semantic deviation in some word vectors and make rare words and new words difficult to represent, which introduces noise into model training and degrades model performance. The common remedy is to correct the pre-trained vectors through one specific NLP task, but a real system may contain multiple tasks, and whether the tasks are trained one by one or jointly via multi-task learning, it is difficult to accommodate them all at once.

Focusing on these issues, we study the pre-training and transfer of word embeddings. Our contributions are as follows:

(1) Improving on the traditional second-stage ("quadratic") training of word embeddings, we propose an unsupervised second-stage optimization method that requires no labeled data and therefore applies more widely. During transfer, the method adaptively adjusts the pre-trained word vectors to the target corpus so that they perform better across NLP tasks (a hedged sketch of this idea follows the abstract). In experiments on a Sohu web-news corpus, the proposed optimization improves performance on semantic-similarity and text-categorization tasks; the feature-level improvement reduces classification error by 7%.

(2) To address the representation of rare and unfamiliar words, we exploit the semantic information in the glyph structure of Chinese characters, extracting glyph features to refine the word vectors (a second sketch follows the abstract). This is especially useful in specialized domains, where corpora are small and rare or unfamiliar words are frequent. Experiments on the Sohu web-news dataset show that the method improves embedding performance in classification: for texts of 250 words, classification error falls by 11.4% relative to existing algorithms, particularly in data-sparse categories containing many rare words.

Our study concerns a fundamental research topic in natural language processing, and the results have academic value. At the same time, the proposed methods can be applied to a wide range of Chinese NLP tasks, making them general techniques with broad practical value.
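To make contribution (1) concrete, here is a minimal sketch of one standard way to realize unsupervised second-stage optimization: continue skip-gram training with negative sampling on the unlabeled target corpus, starting from the pre-trained vectors. The thesis does not publish its exact update rule, so the function name `fine_tune` and all hyperparameters below are illustrative assumptions, not the author's implementation.

```python
# Hedged sketch: second-stage optimization of pre-trained word vectors by
# continuing skip-gram training with negative sampling on the unlabeled
# target-domain corpus. Names and hyperparameters are assumptions.
import numpy as np

def fine_tune(vectors, vocab, corpus, window=2, negatives=5,
              lr=0.01, epochs=3, seed=0):
    """vectors: (V, d) pre-trained matrix; vocab: word -> row index;
    corpus: list of tokenized sentences from the target domain."""
    rng = np.random.default_rng(seed)
    W = vectors.copy()               # input (target-word) vectors, warm-started
    C = rng.normal(0.0, 0.1, W.shape)  # freshly initialized context vectors
    V = W.shape[0]
    for _ in range(epochs):
        for sent in corpus:
            ids = [vocab[w] for w in sent if w in vocab]
            for i, center in enumerate(ids):
                lo, hi = max(0, i - window), min(len(ids), i + window + 1)
                for j in range(lo, hi):
                    if j == i:
                        continue
                    # one positive context pair plus sampled negatives
                    targets = [ids[j]] + list(rng.integers(0, V, negatives))
                    labels = np.array([1.0] + [0.0] * negatives)
                    vecs = C[targets]                          # (k, d) copy
                    scores = 1.0 / (1.0 + np.exp(-vecs @ W[center]))
                    grad = scores - labels                     # (k,)
                    C[targets] -= lr * grad[:, None] * W[center]
                    W[center] -= lr * grad @ vecs
    return W
```

Warm-starting from the pre-trained matrix with a small learning rate keeps well-covered words close to their transferred positions while letting domain-specific words drift toward their usage in the target corpus, which matches the adaptive adjustment described above.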
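For contribution (2), the following is a minimal sketch of a glyph-feature back-off, assuming a character-to-component decomposition table is available: a word missing from the pre-trained vocabulary is represented by its characters, and each character mixes its own vector with the average of its glyph-component (radical) vectors. The toy tables, the decomposition dict `COMPONENTS`, and the mixing weight `alpha` are all assumptions for illustration, not the thesis's actual model.

```python
# Hedged sketch: representing rare/out-of-vocabulary Chinese words from the
# glyph structure of their characters. All tables here are toy placeholders.
import numpy as np

DIM = 100
rng = np.random.default_rng(0)

# Toy lookup tables; a real system would load these from trained data.
word_vecs = {"深度": rng.normal(size=DIM)}            # pre-trained word vectors
char_vecs = {"深": rng.normal(size=DIM), "度": rng.normal(size=DIM)}
comp_vecs = {"氵": rng.normal(size=DIM), "木": rng.normal(size=DIM)}
COMPONENTS = {"深": ["氵"], "桥": ["木"]}              # char -> glyph components

def char_vector(ch, alpha=0.5):
    """Mix a character's own vector with its glyph-component average."""
    comps = [comp_vecs[c] for c in COMPONENTS.get(ch, []) if c in comp_vecs]
    glyph = np.mean(comps, axis=0) if comps else None
    own = char_vecs.get(ch)
    if own is not None and glyph is not None:
        return alpha * own + (1 - alpha) * glyph
    return own if own is not None else glyph

def word_vector(word):
    """Pre-trained vector if available, otherwise glyph-aware back-off."""
    if word in word_vecs:
        return word_vecs[word]
    parts = [v for v in (char_vector(ch) for ch in word) if v is not None]
    return np.mean(parts, axis=0) if parts else np.zeros(DIM)

print(word_vector("深桥").shape)  # (100,) even for an unseen word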
Keywords/Search Tags:Machine Learning, Deep learning, NLP, word2vec, Text categorization