
Research On Automatic Sentence Segmentation And Punctuation In Ancient Texts

Posted on: 2023-02-13    Degree: Master    Type: Thesis
Country: China    Candidate: Y Wang    Full Text: PDF
GTID: 2545306836473604    Subject: Computer technology
Abstract/Summary:
With the development of natural language processing technology, some researchers have tried to apply deep learning models to the processing of ancient texts. Compared with modern writing, ancient writing not only differs greatly in vocabulary and grammar but also lacks punctuation. At present, only a small portion of ancient texts has been manually segmented or punctuated, and a large number of ancient texts remain without sentence breaks or punctuation. Manually segmenting or punctuating ancient texts requires not only a high level of professional knowledge but also a certain understanding of the history and culture of the period in which the text was written, so this work proceeds slowly. To speed up the segmentation and punctuation of ancient texts, some researchers have tried to apply deep learning models to the task.

This thesis studies deep-learning-based algorithms for automatic sentence segmentation and punctuation of ancient texts, and optimizes and improves the model to further raise its performance on these tasks. The main work is as follows:

(1) Combine the pre-trained language model BERT, the bidirectional long short-term memory network (BiLSTM), and the conditional random field (CRF) for automatic punctuation of ancient texts. First, the BERT model is applied to the ancient-text processing task so that the model can fully learn the semantic information of the text. At the same time, the BiLSTM+CRF layers learn the regularities among the labels, further improving the consistency of the output and making the model's predictions more accurate.

(2) Propose a new data preprocessing method that processes the data by paragraph. The preprocessing is changed from splitting lines at punctuation marks to splitting lines at paragraph boundaries, with one line of data serving as one processing unit of the model. Under this scheme each processing unit contains more text, so the model can learn more contextual information. A sketch of this pipeline is given after the abstract.

(3) Propose a BERT deep learning model that combines dynamic coding with the paragraph-based preprocessing method. Considering that paragraphs in ancient texts vary in length, dynamic coding is used for data vectorization, which reduces the amount of unnecessary information added during encoding, decreases the number of processing units the model must handle, and improves the accuracy of the model's predictions. In addition, an automatic sentence segmentation and punctuation system for ancient texts was designed and developed, so that users can directly submit through the system the text they want to segment or punctuate.

Experiments are conducted on a self-collected data set and evaluated with unified evaluation metrics. The final results show that the improved BERT model can not only better learn the semantic and contextual information of ancient texts, but also, through the paragraph-based preprocessing and dynamic coding methods, learn the regularities of the label sequence, which effectively improves the accuracy of automatic sentence segmentation and punctuation of ancient texts.
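The abstract does not give implementation details, so the following is only a minimal sketch of the kind of BERT + BiLSTM + CRF tagger described in point (1), treating punctuation restoration as per-character sequence labeling. It assumes the `transformers` and `pytorch-crf` packages; the model name `bert-base-chinese`, the tag set, and all hyperparameters are placeholders, not the thesis's actual configuration.

```python
# Sketch of a BERT + BiLSTM + CRF punctuation tagger (assumed, not the thesis code).
import torch
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # pip install pytorch-crf

# Hypothetical tag set: "O" means no mark after the character,
# the others insert the named punctuation mark after it.
TAGS = ["O", "COMMA", "PERIOD", "QUESTION", "COLON"]

class BertBiLstmCrfPunctuator(nn.Module):
    def __init__(self, bert_name="bert-base-chinese",
                 lstm_hidden=256, num_tags=len(TAGS)):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        # Contextual character representations from BERT.
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        # BiLSTM re-encodes the sequence before computing emission scores.
        hidden, _ = self.lstm(hidden)
        emissions = self.classifier(hidden)
        mask = attention_mask.bool()
        if labels is not None:
            # The CRF returns a log-likelihood; negate it for a training loss.
            return -self.crf(emissions, labels, mask=mask, reduction="mean")
        # Viterbi decoding yields one tag per character at inference time.
        return self.crf.decode(emissions, mask=mask)
```

One plausible reading of the "dynamic coding" in point (3) is that each batch of paragraphs is padded only to the length of its longest member rather than to a fixed maximum, e.g. `tokenizer(batch, padding="longest", truncation=True, return_tensors="pt")` with a `transformers` tokenizer; this reduces wasted padding when paragraph lengths vary widely, which matches the motivation stated in the abstract.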
Keywords/Search Tags:ancient Chinese sentence, BERT model, Bidirectional Long Short-Term Memory Network, conditional random field, data preprocessing, dynamic coding