Research On Chinese Word Segmentation Based On Deep Learning

Posted on: 2019-02-22
Degree: Master
Type: Thesis
Country: China
Candidate: M G Wang
Full Text: PDF
GTID: 2428330545964168
Subject: Engineering
Abstract/Summary:
Chinese automatic word segmentation is the process of dividing a continuous Chinese text sequence into a sequence of separate words according to certain rules. The smallest writing unit in Chinese is a single character, and, unlike English, Chinese has no spaces between words to act as a fixed delimiter. As a result, whatever the sub-domain of Chinese Natural Language Processing, the first step depends on word segmentation. Chinese word segmentation is a basic part of Natural Language Processing and a key link in the early text processing of other Chinese information processing tasks, and its result directly affects the outcome of subsequent processing.

Most Chinese word segmentation systems rely on matching against lexical dictionaries. However, with the rapid development of the Web2.0 and Web3.0 Internet information age, language evolution has produced a great number of new words, and a large amount of unstructured information is distributed over the Internet, which reduces the coverage of lexical dictionaries. The accuracy of word segmentation systems on text corpora decreases accordingly, so the study of Chinese automatic word segmentation is of great significance. At present, most traditional machine learning word segmentation methods rely on manual feature engineering, which requires a lot of work to verify the effectiveness of the features and is therefore relatively inefficient. With the gradual development of deep learning algorithms, it has become possible to train neural network models that extract features automatically. This not only frees many workers from feature engineering but also improves efficiency.

Against this background, in order to improve the accuracy and recall rate of Chinese automatic word segmentation, this paper applies a Chinese word segmentation model that combines a Long Short-Term Memory neural network (LSTM) with a Conditional Random Field (CRF). Firstly, character embeddings are trained on a large amount of unlabeled corpus data with the deep learning tool Word2Vec. Then the character embeddings are input to the LSTM to compute their context representation vectors. Finally, these representation vectors are passed to the CRF model as features for supervised Chinese word segmentation. Experiments were conducted on the 2014 People's Daily corpus and on the corpus of the 4th CCF Conference on Natural Language Processing & Chinese Computing (NLPCC2015). The experimental results show that the models based on LSTM and CRF not only reduce the manual feature engineering required by traditional machine learning word segmentation methods, but also achieve better performance than those methods and are more versatile. Among the proposed models, the two-layer Bi-LSTM+CRF model achieved the best segmentation results on the People's Daily corpus, with accuracy, recall rate and F value of 99.02%, 98.97% and 98.99% respectively.
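The final step of such a pipeline, turning per-character scores into a segmentation, can be sketched as follows. This is a minimal illustration and not the thesis's implementation: the abstract does not state the tagging scheme, so the common BMES scheme (Begin, Middle, End, Single) is assumed; the emission scores are made-up numbers standing in for LSTM outputs; and the CRF's learned transition scores are simplified to hard constraints on which tag may follow which. The names `viterbi` and `tags_to_words` are hypothetical.

```python
from typing import Dict, List

TAGS = ["B", "M", "E", "S"]  # assumed BMES tagging scheme

# Hard transition constraints (a simplification of learned CRF transitions):
# a word that has begun must continue or end; a finished word starts a new one.
ALLOWED = {
    "B": {"M", "E"},
    "M": {"M", "E"},
    "E": {"B", "S"},
    "S": {"B", "S"},
}
NEG_INF = float("-inf")


def viterbi(emissions: List[Dict[str, float]]) -> List[str]:
    """Return the highest-scoring valid BMES tag path for per-character scores."""
    if not emissions:
        return []
    # A sentence can only start with B or S.
    score = {t: (emissions[0][t] if t in ("B", "S") else NEG_INF) for t in TAGS}
    backpointers = []
    for em in emissions[1:]:
        new_score, ptr = {}, {}
        for cur in TAGS:
            best_prev, best = None, NEG_INF
            for prev in TAGS:
                if cur in ALLOWED[prev] and score[prev] > best:
                    best_prev, best = prev, score[prev]
            new_score[cur] = best + em[cur] if best_prev is not None else NEG_INF
            ptr[cur] = best_prev
        score = new_score
        backpointers.append(ptr)
    # A sentence can only end with E or S.
    last = max(("E", "S"), key=lambda t: score[t])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Cut the character sequence after every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # defensive: keep any trailing unfinished word
        words.append(current)
    return words


# Hypothetical emission scores for the four characters of "我爱北京".
emissions = [
    {"B": 0.1, "M": 0.0, "E": 0.0, "S": 0.9},
    {"B": 0.2, "M": 0.1, "E": 0.1, "S": 0.8},
    {"B": 0.9, "M": 0.3, "E": 0.0, "S": 0.1},
    {"B": 0.1, "M": 0.0, "E": 0.9, "S": 0.2},
]
tags = viterbi(emissions)
words = tags_to_words("我爱北京", tags)
```

Decoding over whole paths rather than picking the best tag per character is what lets the CRF layer rule out invalid sequences such as B followed by B, which a per-character classifier could emit.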
Keywords/Search Tags:Deep Learning, Word Embedding, Long Short-Term Memory (LSTM), Conditional Random Field (CRF), Chinese word segmentation, Natural Language Processing