BiLSTM And CNN Based Joint Model For Chinese Word Segmentation And Part-of-speech Tagging

Posted on: 2020-04-29
Degree: Master
Type: Thesis
Country: China
Candidate: J H Zhang
Full Text: PDF
GTID: 2428330623963757
Subject: Electronic and communication engineering
Abstract/Summary:
Chinese word segmentation and part-of-speech tagging are the most fundamental components of Chinese natural language processing, and their performance has a crucial impact on many downstream tasks. Over the decades, solutions have evolved from the earliest simple string-matching methods, to methods based on various statistical machine learning models, and then to the deep learning methods that have become popular in recent years. Considering the shortcomings of the two-step pipeline model, this paper uses a deep learning based joint model to handle Chinese word segmentation and part-of-speech tagging. The main work consists of the following three parts:

1. Based on the BiRNN-CRF sequence labeling model, Chinese word segmentation and part-of-speech tagging are handled in one step. More specifically, the joint model is based on the idea of sequence labeling. Character embeddings are adopted as the input of the model. An LSTM (long short-term memory) network is used to extract features from the raw sentences and capture long-term dependencies. A CRF (conditional random field) is placed at the output to model the dependencies between output tags, which improves the performance of the model.

2. In addition to the BiRNN-CRF framework, a neural network language model is introduced as an auxiliary task, which shares the LSTM and is co-trained with the primary task. A Highway Network is introduced as an additional nonlinear layer that maps the output of the LSTM to different feature spaces, so that the differing training objectives of the two tasks can be reconciled.

3. The input and output of the model are further optimized. On the input side, a CNN (convolutional neural network) is applied between the input and the LSTM in order to model the complicated combinations of Chinese characters and enrich the features of the character embeddings. The CNN can also simulate traditional n-gram features without the problem of data sparseness. On the output side, a new auxiliary loss function is adopted to directly guide the model to learn the differences between high-frequency and low-frequency characters, which benefits the generalization ability of the model.

Finally, this paper conducts a detailed experimental analysis and develops an easy-to-use prototype system for Chinese word segmentation and part-of-speech tagging. Experimental results show that the basic BiRNN-CRF model benefits from every component proposed in this paper, and the joint F1 scores on CTB5 and CTB7 are 94.98% and 91.52%, respectively. Comparisons with related joint models show improvements in F1 score of 0.92% and 0.98% on CTB5 and CTB7, respectively. For Chinese word segmentation alone, the performance of our model on PKU and MSR is competitive with the best in the literature, improving by 0.93% and 1.84%, respectively, over the state-of-the-art BiRNN-CRF based model. In addition, the joint model is compared with the pipeline model. Results on CTB5 and CTB7 show that the joint model is 3.26% and 2.16% higher, respectively, than the pipeline model in terms of F1 score, which demonstrates the effectiveness of the joint model.
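The one-step formulation described in part 1 can be illustrated with a small sketch. The abstract does not state which tag scheme the thesis uses, so the code below assumes a common choice: each character receives a single label that joins a BMES boundary tag (Begin, Middle, End, Single) with the word's part-of-speech tag. The function names `encode_joint_tags` and `decode_joint_tags` and the example sentence are hypothetical, chosen only for illustration; the neural model itself (embeddings, CNN, LSTM, CRF) is not shown.

```python
from typing import List, Tuple

# Assumption: BMES boundary tags joined with a POS tag, e.g. "B-NN".
# The thesis abstract does not specify its tag scheme; this is one common option.

def encode_joint_tags(words: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Convert (word, pos) pairs into per-character (char, joint_tag) pairs."""
    tagged = []
    for word, pos in words:
        if len(word) == 1:
            boundaries = ["S"]  # single-character word
        else:
            boundaries = ["B"] + ["M"] * (len(word) - 2) + ["E"]
        for ch, boundary in zip(word, boundaries):
            tagged.append((ch, f"{boundary}-{pos}"))
    return tagged


def decode_joint_tags(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Recover (word, pos) pairs from per-character joint tags.

    Assumes a well-formed tag sequence; a trailing unfinished word is
    flushed with the POS tag of its last character.
    """
    words = []
    buffer = ""
    pos = ""
    for ch, tag in tagged:
        boundary, pos = tag.split("-", 1)
        buffer += ch
        if boundary in ("E", "S"):  # a word ends at this character
            words.append((buffer, pos))
            buffer = ""
    if buffer:
        words.append((buffer, pos))
    return words


# Hypothetical example sentence with CTB-style POS tags.
sentence = [("我", "PN"), ("喜欢", "VV"), ("自然语言", "NN")]
char_tags = encode_joint_tags(sentence)
restored = decode_joint_tags(char_tags)
```

With this encoding, a single sequence labeler that predicts one joint tag per character yields both the segmentation and the part-of-speech tags, which is what lets the two tasks share one model instead of running as a two-step pipeline.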
Keywords/Search Tags: Chinese word segmentation, part-of-speech tagging, deep learning, joint model