
Automatic Segmentation And Punctuation Of Ancient Chinese Based On Deep Learning

Posted on: 2022-11-19
Degree: Master
Type: Thesis
Country: China
Candidate: B C Zhuang
GTID: 2518306749472054
Subject: Automation Technology
Abstract/Summary:
With the application of natural language processing to different areas of Chinese and the development of deep learning algorithms, the use of natural language processing to process and mine ancient Chinese classics has attracted wide attention. Automatic processing of ancient prose is the first step in processing and mining ancient prose information, whether by humans or by computers. Given the vast body of ancient Chinese classics, accurate and fast automatic sentence segmentation and punctuation supports further research on ancient Chinese corpus processing. This thesis takes the biographies in historical books as its research object. The Bi-LSTM-CRF baseline model is limited to character-granularity processing, which does not suit the characteristics of these biographies. To address this, the Lattice model integrates vocabulary-granularity information and improves the mixed encoding of character vectors and word vectors, and this vocabulary enhancement improves accuracy. In combination with the BERT pre-trained model, a BERT-FLAT-CRF double-channel feature model is proposed, which improves generalization ability. The main work of this thesis is as follows:
(1) For the task of automatic segmentation and punctuation of ancient Chinese, the biographies in historical books are analyzed and selected as the specific research object. The data is cleaned, an annotation system is designed according to the characteristics of the corpus, and a data set is constructed to address the lack of open-source data sets.
(2) For the historical corpus, a word segmentation model based on new word discovery is designed to obtain an external vocabulary feature table specific to the target corpus. This addresses the problem that existing word vectors cannot accurately represent the words of historical texts, so that lexical information can be integrated accurately.
(3) Based on the characteristics of ancient biographical text, this thesis improves the Lattice model and proposes a BERT-FLAT deep learning model. The network combines BERT pre-training with the improved Lattice algorithm to implement a double-channel model of characters and vocabulary.
(4) Experiments are performed, and automatic segmentation and punctuation of ancient Chinese is realized. The precision and F1 of the segmentation task reached 87.11% and 80.57%, increases of 8.90% and 9.98% over Bi-LSTM-CRF. The corresponding values for the punctuation task reached 74.91% and 70.32%, increases of 11.19% and 9.25% over Bi-LSTM-CRF.
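The abstract treats segmentation and punctuation as sequence labeling over characters but does not give the annotation scheme. As a minimal sketch of how such training data is commonly prepared, the following strips punctuation from a punctuated passage and gives each remaining character a label. The function name `to_labels`, the punctuation set, and the tags `O` and `S` are illustrative assumptions, not the thesis's annotation system.

```python
# Hypothetical labeling scheme (an assumption, not the thesis's own):
#   "O" = no mark follows this character
#   "S" = a sentence break follows (segmentation task)
#   the mark itself, e.g. "," = that mark follows (punctuation task)
PUNCT = set(",。?!;:、")

def to_labels(punctuated, segmentation_only=False):
    """Strip punctuation and return (characters, per-character labels)."""
    chars, labels = [], []
    for ch in punctuated:
        if ch in PUNCT:
            if chars:  # attach the mark to the preceding character
                labels[-1] = "S" if segmentation_only else ch
            continue
        chars.append(ch)
        labels.append("O")
    return chars, labels

chars, punct_labels = to_labels("學而時習之,不亦說乎?")
_, seg_labels = to_labels("學而時習之,不亦說乎?", segmentation_only=True)
print("".join(chars))  # the unpunctuated input a model would see
print(punct_labels)    # targets for the punctuation task
print(seg_labels)      # targets for the segmentation task
```

Under this scheme both tasks share the same unpunctuated input and differ only in the label set, so one tagger architecture such as Bi-LSTM-CRF can be trained on either.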
Keywords/Search Tags:Deep Learning, BERT, Lattice, Sequence Label, Automatic Segmentation