
Research On Chinese Word Representation Learning Based On Deep Learning

Posted on: 2019-07-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Zhuang
GTID: 1368330551956898
Subject: Computer system architecture
Abstract/Summary:
With China's growing national strength, Chinese Language Processing (CLP) has received increasing attention. At present, research on deep-learning-based natural language processing focuses mainly on alphabetic languages, and most deep-learning methods for Chinese are borrowed from methods developed for alphabetic languages. Chinese and English share some properties and differ in others at the lexical, syntactic, and semantic levels and at different granularities. Chinese still suffers from out-of-vocabulary (OOV) and low-frequency words, a problem that character-level natural language models can address. However, because of the wide variety of Chinese characters, it is difficult to handle them in a unified manner. In addition, the special encoding of Chinese characters makes data processing time-consuming.

This dissertation constructs a stroke-based Chinese word representation learning method grounded in the way Chinese words are formed and in the internal structure of Chinese characters. Building on this, it exploits the hierarchical features of Chinese at multiple granularities and dimensions, combining characters, words, sounds, shapes, and other features to build a new Chinese word representation, which is then applied to Chinese information processing tasks. Finally, to improve the efficiency of data preprocessing, distributed processing methods are used to accelerate model training. The main contributions are as follows:

1. This dissertation designs and proposes a Chinese character encoding method based on universal strokes. After several rounds of supplementation, it covers the 20,902 Chinese characters of the CJK Unified Ideographs, which lays a solid foundation for representation learning. The encoding method is also applied to handwritten Chinese character recognition, where it offers a new solution.

2. This dissertation proposes a stroke-based Chinese word vector representation learning method, which provides a true character-level representation learning approach for Chinese natural language processing and effectively addresses out-of-vocabulary and low-frequency words in Chinese word representation learning. Similar components are introduced when training the stroke vectors so that the association features between characters are extracted more effectively.

3. Through an analysis of features at different granularities, the dissertation proposes a Chinese word representation learning algorithm based on multi-dimensional features. The method combines several dimensions of Chinese words: sound, shape, character, and word. While retaining the advantage of fine-grained features on unknown and low-frequency words, it integrates features from more dimensions. To a certain extent, it reduces the impact of typos and homophones on text comprehension.

4. To improve the efficiency of model training, this dissertation proposes a memory-based hotspot data acceleration strategy that addresses the hotspot data aggregation problem in data preprocessing, and adopts strategies such as data parallelism and asynchronous parameter updates to speed up training. To improve the efficiency of preprocessing, it further proposes a memory-based data backup strategy and a data migration strategy that address the hotspot data problem and the hotspot data aggregation problem in data preprocessing.
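To make the stroke-level idea concrete, the following is a minimal sketch of how a word can be broken into stroke n-grams, so that a word never seen as a whole still shares sub-units with known words. It is an illustration only, not the dissertation's universal-stroke encoding: the four-class stroke coding, the tiny `STROKES` table, and the function names are all assumptions made for this example.

```python
# Toy stroke table (an assumption for illustration, not the dissertation's encoding).
# Stroke classes: 1 = horizontal, 2 = vertical, 3 = left-falling, 4 = right-falling or dot.
STROKES = {
    "人": "34",
    "大": "134",
    "木": "1234",
    "林": "12341234",
}


def word_to_strokes(word):
    """Concatenate the stroke codes of every character in the word."""
    # Raises KeyError for characters missing from the toy table.
    return "".join(STROKES[ch] for ch in word)


def stroke_ngrams(word, n_min=3, n_max=5):
    """Return all stroke n-grams of the word for n in [n_min, n_max]."""
    seq = word_to_strokes(word)
    return [
        seq[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(seq) - n + 1)
    ]


def shared_ngrams(a, b, n_min=3, n_max=5):
    """Stroke n-grams that two words have in common."""
    return set(stroke_ngrams(a, n_min, n_max)) & set(stroke_ngrams(b, n_min, n_max))


if __name__ == "__main__":
    print(stroke_ngrams("木", 3, 4))
    print(sorted(shared_ngrams("木", "林")))
```

In a model of this kind, each n-gram would typically own an embedding vector, and a word vector would be composed from the vectors of its n-grams. That composition is what lets rare or unseen words inherit information from characters with similar components, as with 木 and 林 here.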
Keywords/Search Tags: Representation Learning, Chinese Language Processing, Neural Network, Stroke-based Representation Learning, Multidimensional Representation Learning, Hotspot