Research On Chinese Word Segmentation Integrating Pinyin And Tone Information

Posted on: 2021-01-09
Degree: Master
Type: Thesis
Country: China
Candidate: X T Zhou
Full Text: PDF
GTID: 2428330614971184
Subject: Computer technology
Abstract/Summary:
Chinese word segmentation divides a sentence into individual words according to an established specification. It is a basic task of Chinese natural language processing (NLP) and a key underlying step for many NLP applications, including information retrieval, intelligent question answering, and machine translation. Because segmentation happens early in text processing, its quality directly affects every subsequent task.

Traditional rule-based and statistical segmentation methods mostly rely on manually designed feature templates, and verifying the effectiveness of those templates takes a great deal of work, so efficiency is low. In recent years, with the rapid development of the Internet, the emergence of a large number of new words has made segmentation increasingly difficult: if vocabulary coverage is incomplete, the accuracy of traditional methods drops accordingly. At the same time, deep learning has developed rapidly, and neural-network-based segmentation methods have been widely used in NLP. A model that a neural network learns iteratively from larger-scale data generalizes better and segments more accurately.

The input vector of the model is composed of a word vector and a label vector. Word embeddings are produced with the Word2Vec preprocessing tool. A Chinese word segmentation model that combines a bidirectional long short-term memory network (Bi-LSTM) with a conditional random field (CRF) serves as the baseline model. In addition to comparing against the traditional CRF model, this thesis studies the effect of prosody information on Chinese word segmentation using existing open-source Chinese word segmentation datasets. Because prosody is only weakly expressed in modern-text datasets, this thesis also collects Tang poetry and Song Ci, the genres that retain the strongest prosodic structure, to build an ancient poetry dataset, and obtains the audio corresponding to each line of text in order to capture the text, reading duration, and rhythmic structure of the ancient poetry data.

The main contributions of this thesis are as follows:

(1) A dataset for the segmentation of ancient poems and Ci is constructed. The allusion library and the texts are obtained by crawling, then pre-processed and combined with matching algorithms and manual proofreading to produce a standard dataset. In addition, to obtain sound feature information, the crawler collects the TTS audio corresponding to each line of the poetry texts.

(2) To study the influence of prosody on word segmentation, prosody information, including level and oblique tone (pingze) information, vowel information, tone information, and sound feature information, is integrated into the mainstream framework and compared with the baseline model.

(3) For the mainstream neural Chinese word segmentation system Bi-LSTM + CRF, an improved method that incorporates BERT is proposed. The thesis studies the use of BERT for word vector preprocessing, the fusion of richer semantic information, and the resulting effect on model recognition.

This thesis studies the application of prosody in a Chinese word segmentation system and the transfer of BERT as a word vector preprocessing step. Extensive experiments show that the method improves segmentation accuracy to some extent.
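As an illustration of the kind of per-character input the abstract describes, the following minimal Python sketch derives segmentation labels and simple prosody features for one line of poetry. Everything in it is an assumption made for illustration rather than the thesis's actual pipeline: the B/M/E/S label scheme, the function names `bmes_tags` and `prosody_features`, the numbered-pinyin input format (for example `chuang2`), the treatment of `y` and `w` as initials, and the modern-Mandarin approximation that tones 1 and 2 count as level (ping) while tones 3 and 4 count as oblique (ze). The example segmentation of the line is likewise only illustrative.

```python
from typing import Dict, List

def bmes_tags(words: List[str]) -> List[str]:
    """Turn a segmented sentence into per-character B/M/E/S labels."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # Begin, zero or more Middle, End
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags

# Two-letter initials are listed first so they match before "z", "c", "s".
# Treating "y" and "w" as initials is a simplification (assumption).
INITIALS = ["zh", "ch", "sh"] + list("bpmfdtnlgkhjqxrzcsyw")

def prosody_features(syllable: str) -> Dict[str, str]:
    """Split a numbered-pinyin syllable such as 'chuang2' into features."""
    # A missing tone digit is treated as the neutral tone "5" (assumption).
    tone = syllable[-1] if syllable[-1].isdigit() else "5"
    body = syllable.rstrip("012345")
    initial = next((i for i in INITIALS if body.startswith(i)), "")
    final = body[len(initial):]
    # Modern-Mandarin approximation of level/oblique classes (assumption).
    if tone in ("1", "2"):
        pingze = "ping"
    elif tone in ("3", "4"):
        pingze = "ze"
    else:
        pingze = "none"
    return {"initial": initial, "final": final, "tone": tone, "pingze": pingze}

if __name__ == "__main__":
    words = ["床前", "明月", "光"]  # illustrative segmentation
    pinyin = ["chuang2", "qian2", "ming2", "yue4", "guang1"]
    chars = "".join(words)
    for ch, tag, syl in zip(chars, bmes_tags(words), pinyin):
        print(ch, tag, prosody_features(syl))
```

In a setup like the one the abstract outlines, each of these categorical features would typically be mapped to an embedding and concatenated with the character's word vector before being fed to the Bi-LSTM + CRF model; how the thesis actually encodes and fuses them is not stated in the abstract.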
Keywords: Chinese Word Segmentation, Neural Networks, Prosody, Natural Language Processing, Word Embedding