Research On Text Proofreading Method Based On Deep Learning

Posted on: 2021-05-13
Degree: Master
Type: Thesis
Country: China
Candidate: B Wang
GTID: 2428330611480623
Subject: Computer science and technology
Abstract/Summary:
With the rapid growth of the Internet, the volume of online text has increased sharply while its quality has declined. Manual proofreading can no longer keep up with this workload, which has motivated automatic text proofreading. The technology speeds up publishing, reduces errors in the large numbers of electronic documents that enterprises must archive, and in education helps teachers review test papers and find spelling errors.

Traditional proofreading methods based on statistics and rules have several problems. On the one hand, formulating rules requires extensive experience and incurs high labor costs, and pipeline-based models easily accumulate errors because of the noise introduced by word segmentation. On the other hand, existing methods use only character or word features and do not make effective use of all three kinds of feature information: characters, words, and pinyin.

To address these problems, this thesis proposes BLSTM-CRF, a deep learning-based sequence labeling model. It requires no manual intervention, which saves labor costs, and it operates at character granularity, which avoids the noise introduced by word segmentation. The BLSTM-CRF model is then improved to address the inefficient use of multiple features: a lattice LSTM and a gating mechanism are used to fuse character, word, and pinyin features. The main contributions are twofold:

(1) A neural network architecture, BLSTM-CRF, is proposed for Chinese spell checking. It combines a bidirectional long short-term memory network with a conditional random field. It is a true end-to-end model that does not rely on task-specific resources, feature engineering, or data preprocessing, and its character-level vector input avoids introducing word segmentation noise. Experiments on news and novel data sets show that the model's F1 score improves substantially over the baseline model on both test sets.

(2) A novel spell checking model, FL-LSTM-CRF, is proposed, which combines character, word, and pinyin features to make full use of latent information. Experimental results on the SIGHAN dataset demonstrate the feasibility of the end-to-end framework for spelling error checking and verify that fusing character, word, and pinyin features is effective for error detection. With the same external resources, the FL-LSTM-CRF model performs significantly better than other models.
Keywords/Search Tags:Chinese text proofreading, deep learning, sequence labeling, multi feature fusion
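
The abstract frames spell checking as character-level sequence labeling, with a CRF layer on top of the BLSTM scores. The following is a minimal sketch of the decoding step only, not the thesis implementation. It assumes a hypothetical two-tag scheme (O = correct character, E = erroneous character), and the emission and transition scores are made-up numbers standing in for what a trained BLSTM and CRF would produce.

```python
# Hypothetical tag set for illustration: index 0 = "O" (correct), 1 = "E" (error).
TAGS = ["O", "E"]


def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag index sequence.

    emissions[t][j]   : score of tag j at character position t
                        (assumed to come from a BLSTM; invented here).
    transitions[i][j] : score of moving from tag i to tag j
                        (assumed to be learned CRF parameters; invented here).
    """
    if not emissions:
        return []
    n_tags = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for reaching tag j at this position.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final tag.
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


# Made-up scores for a four-character sentence.
emissions = [[2.0, 0.1], [0.4, 0.6], [1.5, 0.2], [0.3, 1.9]]
transitions = [[0.5, -0.5], [0.0, -1.0]]  # rows: from O, from E

tags = [TAGS[i] for i in viterbi_decode(emissions, transitions)]
print(tags)
```

With these invented scores, choosing the best tag independently at each position would mark the second character as an error (0.6 > 0.4), whereas decoding over the whole sequence takes the transition scores into account and does not. That is the usual motivation for placing a CRF on top of per-character scores.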