
Research On Automatic Sentence Segmentation And Punctuation In Ancient Texts

Posted on: 2023-02-13    Degree: Master    Type: Thesis
Country: China    Candidate: Y Wang    Full Text: PDF
GTID: 2545306836473604    Subject: Computer technology
Abstract/Summary:
With the development of natural language processing technology, some researchers have tried to apply deep learning models to the processing of ancient texts. Compared with modern writing, ancient writing not only differs greatly in vocabulary and grammar but also lacks punctuation. At present, only a small portion of ancient texts has been manually segmented or punctuated, and a large number of ancient texts remain without sentence breaks or punctuation. Manually segmenting or punctuating ancient texts requires not only a high level of professional knowledge but also a certain understanding of the history and culture of the period in which the text was written, so this work proceeds slowly. To speed up the segmentation and punctuation of ancient texts, some researchers have tried to apply deep learning models to the task.

This thesis studies deep-learning-based algorithms for automatic sentence segmentation and punctuation of ancient texts, and optimizes and improves the model to further raise its performance on these tasks. The main work is as follows:

(1) Combine the pre-trained language model BERT, the bidirectional long short-term memory network (BiLSTM), and the conditional random field (CRF) for automatic punctuation of ancient texts. First, the BERT model is applied to the ancient-text processing task so that the model can fully learn the semantic information of the text. At the same time, the BiLSTM+CRF layers learn the regularities among the labels, further improving the consistency of the output and making the model's predictions more accurate.

(2) Propose a new data preprocessing method that processes the data by paragraph. The preprocessing is changed from splitting lines at punctuation marks to splitting lines at paragraph boundaries, with one line of data serving as one processing unit of the model. Under this scheme each processing unit contains more text, so the model can learn more contextual information. A sketch of this pipeline is given after the abstract.

(3) Propose a BERT deep learning model that combines dynamic coding with the paragraph-based preprocessing method. Considering that paragraphs in ancient texts vary in length, dynamic coding is used for data vectorization, which reduces the amount of unnecessary information added during encoding, decreases the number of processing units the model must handle, and improves the accuracy of the model's predictions. In addition, an automatic sentence segmentation and punctuation system for ancient texts was designed and developed, so that users can directly submit through the system the text they want to segment or punctuate.

Experiments are conducted on a self-collected data set and evaluated with unified evaluation metrics. The final results show that the improved BERT model can not only better learn the semantic and contextual information of ancient texts, but also, through the paragraph-based preprocessing and dynamic coding methods, learn the regularities of the label sequence, which effectively improves the accuracy of automatic sentence segmentation and punctuation of ancient texts.
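The abstract does not give implementation details, so the following is only a minimal sketch of the kind of BERT + BiLSTM + CRF tagger described in point (1), treating punctuation restoration as per-character sequence labeling. It assumes the `transformers` and `pytorch-crf` packages; the model name `bert-base-chinese`, the tag set, and all hyperparameters are placeholders, not the thesis's actual configuration.

```python
# Sketch of a BERT + BiLSTM + CRF punctuation tagger (assumed, not the thesis code).
import torch
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # pip install pytorch-crf

# Hypothetical tag set: "O" means no mark after the character,
# the others insert the named punctuation mark after it.
TAGS = ["O", "COMMA", "PERIOD", "QUESTION", "COLON"]

class BertBiLstmCrfPunctuator(nn.Module):
    def __init__(self, bert_name="bert-base-chinese",
                 lstm_hidden=256, num_tags=len(TAGS)):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        # Contextual character representations from BERT.
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        # BiLSTM re-encodes the sequence before computing emission scores.
        hidden, _ = self.lstm(hidden)
        emissions = self.classifier(hidden)
        mask = attention_mask.bool()
        if labels is not None:
            # The CRF returns a log-likelihood; negate it for a training loss.
            return -self.crf(emissions, labels, mask=mask, reduction="mean")
        # Viterbi decoding yields one tag per character at inference time.
        return self.crf.decode(emissions, mask=mask)
```

One plausible reading of the "dynamic coding" in point (3) is that each batch of paragraphs is padded only to the length of its longest member rather than to a fixed maximum, e.g. `tokenizer(batch, padding="longest", truncation=True, return_tensors="pt")` with a `transformers` tokenizer; this reduces wasted padding when paragraph lengths vary widely, which matches the motivation stated in the abstract.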
Keywords/Search Tags:ancient Chinese sentence, BERT model, Bidirectional Long Short-Term Memory Network, conditional random field, data preprocessing, dynamic coding