
Research On Chinese Word Segmentation Methods Based On Deep Learning

Posted on: 2019-01-27
Degree: Master
Type: Thesis
Country: China
Candidate: Y D Liu
GTID: 2428330566986070
Subject: Communication and Information System
Abstract/Summary:
In recent years, as the volume of textual information has grown, there is an urgent need for natural language processing techniques to mine and use massive amounts of text data. Chinese word segmentation is a fundamental task in natural language processing: most downstream tasks must perform word segmentation first, and the quality of the segmentation method strongly affects them. The accuracy of Chinese word segmentation is limited by ambiguity and by out-of-vocabulary (unregistered) words, and methods based on dictionary matching and traditional statistical techniques still have many limitations. With the success of deep learning in many fields, applying deep learning to Chinese word segmentation has become a trend. Chinese word segmentation methods are divided into character-based and word-based approaches. Because the character tagging method can effectively reduce the impact of out-of-vocabulary words, the models in this thesis are based on character tagging.

This thesis makes two contributions to Chinese word segmentation with deep learning. The first is an improved scheme for the BiLSTM+CRF Chinese word segmentation model. First, three effective character features are introduced: contextual features, glyph features, and pinyin features. The contextual features are extracted using a convolutional neural network with GLU units, and two different convolution methods are proposed for the model. The glyph and pinyin features are both obtained by using a feed-forward neural network to encode the Wubi code and the pinyin code of each Chinese character. To combine the three features, this thesis proposes a feature combination method based on the attention mechanism, which achieves good results. Then, replacing the LSTM unit with a GRU unit in the recurrent network effectively improves the training speed of the model.

The second contribution is a Chinese word segmentation method based on the seq2seq model. First, a basic seq2seq model is proposed that exploits the fact that, in Chinese word segmentation, the input and output sequences have the same length. The basic seq2seq model is then improved with a global attention mechanism and a local attention mechanism, and a special model variant is proposed. Experiments show that the seq2seq model with local attention outperforms the one with global attention on Chinese word segmentation. This thesis also tests several scoring functions for the global attention model and verifies that a scoring function without decoder feedback is equally effective. In addition, introducing the attention mechanism improves the interpretability of the model.

The experimental results show that the improved BiLSTM+GRU model and the seq2seq model presented in this thesis both achieve results close to the state of the art in Chinese word segmentation. When the embeddings trained on the PKU dataset are transferred to the corresponding model for the MSR dataset and training continues, the two models obtain F1 scores of 96.8% and 97.0%, respectively.
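
The character tagging formulation that the abstract relies on can be illustrated with a minimal Python sketch. It assumes the common four-tag B/M/E/S scheme (begin, middle, end, single); the abstract does not state which tag set the thesis uses, and the function names `words_to_tags` and `tags_to_words` are hypothetical. The sketch also shows why the input and output sequences have the same length: every character receives exactly one tag.

```python
def words_to_tags(words):
    """Map a segmented sentence to one B/M/E/S tag per character.

    Assumption: the four-tag scheme B (begin), M (middle), E (end),
    S (single-character word). The thesis may use a different tag set.
    """
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild the word list from characters and their B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends without E or S
        words.append(current)
    return words


sentence = ["我", "喜欢", "自然语言"]
tags = words_to_tags(sentence)
chars = "".join(sentence)
print(list(zip(chars, tags)))
print(tags_to_words(chars, tags))
```

Under this formulation a model such as BiLSTM+CRF or seq2seq only has to predict one tag per input character, and the segmentation is recovered afterwards by a decoding step like `tags_to_words`.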
Keywords/Search Tags:Chinese word segmentation, Deep Learning, character features, seq2seq, Attention mechanism