Research On Chinese Word Segmentation Based On Deep Learning

Posted on:2019-02-22

Degree:Master

Type:Thesis

Country:China

Candidate:M G Wang

GTID:2428330545964168

Subject:Engineering

Abstract/Summary:

Chinese automatic word segmentation is the process of dividing a continuous Chinese text sequence into a sequence of separate words according to certain rules. Because the smallest writing unit in Chinese is a single character, and words are not separated by spaces as they are in English, every subfield of Chinese natural language processing depends on word segmentation as its first step. Chinese word segmentation is a basic part of natural language processing and a key early stage of text processing for other Chinese information processing tasks. The segmentation result directly affects the outcome of subsequent processing.

Most Chinese word segmentation systems rely on matching against lexical dictionaries. However, with the rapid development of the Web 2.0 and Web 3.0 Internet era, language evolution has produced a great number of new words, and a large amount of unstructured information is distributed over the Internet, which reduces the coverage of lexical dictionaries. As a result, the accuracy of word segmentation systems on text corpora also decreases. The study of Chinese automatic word segmentation is therefore of great significance.

At present, most traditional machine learning methods for word segmentation rely on manual feature engineering, which requires a lot of work to verify the effectiveness of the features, so efficiency is relatively low. With the gradual development of deep learning algorithms, it has become possible to train neural network models that extract features automatically. This approach not only frees workers from much of the feature engineering but also improves efficiency.

Against this background, in order to improve the accuracy and recall rate of Chinese automatic word segmentation, this paper applies a Chinese segmentation model that combines a Long Short-Term Memory neural network (LSTM) with a Conditional Random Field (CRF). First, character embeddings are trained on a large amount of unlabeled corpus data with the deep learning tool Word2Vec. Then the character embeddings are fed to the LSTM to compute their context representation vectors. Finally, these representation vectors are passed to the CRF model as features for supervised Chinese word segmentation.

Experiments were conducted on the 2014 People's Daily corpus and on the corpus of the 4th CCF Conference on Natural Language Processing & Chinese Computing (NLPCC2015). The experimental results show that the models based on LSTM and CRF not only reduce the manual feature engineering required by traditional machine learning word segmentation methods, but also achieve better performance than those methods and are more versatile. Among the proposed models, the two-layer Bi-LSTM+CRF model achieved the best segmentation results on the People's Daily corpus, with accuracy, recall rate and F value of 99.02%, 98.97% and 98.99%, respectively.
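The abstract treats segmentation as supervised labeling of characters and reports accuracy, recall rate and F value. The following is a minimal Python sketch of how such labels and scores are commonly computed. It assumes a B/M/E/S tag set (begin, middle, end, single-character word) and word-span matching for scoring; the abstract does not state the thesis's actual tag set or scoring script, and all function names and the example sentence here are hypothetical.

```python
from typing import List, Set, Tuple


def words_to_tags(words: List[str]) -> List[str]:
    """Convert a segmented sentence into per-character B/M/E/S tags (assumed scheme)."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their predicted tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:  # keep a trailing unfinished word instead of dropping it
        words.append(current)
    return words


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Represent each word by its (start, end) character offsets."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def precision_recall_f(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Score a predicted segmentation against the gold one by exact span match."""
    gold_spans, pred_spans = word_spans(gold), word_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


if __name__ == "__main__":
    gold = ["我", "爱", "自然语言", "处理"]          # hypothetical gold segmentation
    pred = ["我", "爱", "自然", "语言", "处理"]      # hypothetical model output
    print(words_to_tags(gold))
    print(precision_recall_f(gold, pred))
```

In this framing, the LSTM supplies a context vector for each character and the CRF chooses the best tag sequence; `tags_to_words` then turns that sequence back into words for scoring.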

Keywords/Search Tags:

Deep Learning, Word Embedding, Long Short-Term Memory (LSTM), Conditional Random Field (CRF), Chinese word segmentation, Natural Language Processing

Related items

1	Applied Study On Chinese Word Segmentation Based On Deep Learning
2	Research On Chinese Word Segmentation Based On Deep Learning
3	Research On Chinese Word Segmentation Method Based On Two-way Long And Short-term Memory Model
4	Research On Chinese Word Segmentation Based On Neural Network
5	Research Of Chinese Word Segmentation With Conditional Random Fields
6	Research On Tibetan Word Segmentation Algorithm Based On Deep Neural Network
7	Research On Key Techniques For Chinese Word Segmentation With The Combination Of Deep Learning Features And Shallow Machine Learning Features
8	Research And Application Of Chinese Word Segmentation Based On Conditional Random Fields
9	Research On Chinese Word Segmentation For Domain Literature
10	Chinese Word Segmentation Analysis Based On Bidirectional LSTMN Recurrent Neural Network