Research And Implementation Of Chinese Text Automatic Proofreading Based On Deep Learning

Posted on: 2020-11-24
Degree: Master
Type: Thesis
Country: China
Candidate: Z L Yang
Full Text: PDF
GTID: 2428330590996481
Subject: Software engineering
Abstract/Summary:
The rapid development of the Internet has led a large number of users to generate a large amount of online text. Correcting the grammatical errors in this text makes it smoother and easier to read. Manual proofreading of text at this scale is unrealistic, so automatic text proofreading has received growing attention in recent years. With the rise of deep learning, sequence-to-sequence learning has become widely used for text proofreading. This thesis improves existing deep-learning-based methods for Chinese text automatic proofreading in the following respects:

(1) The training data are preprocessed and the standard performance evaluation is implemented. The NLPCC 2018 GEC shared task training set is preprocessed and converted into a parallel proofreading corpus for training the Chinese proofreading model. The Chinese Wikipedia corpus is preprocessed, and the segmented text is used to pre-train word vectors and statistical language models. The output of the proofreading system is re-segmented with the official word segmentation tool, and the official evaluation script is used to compute the standard evaluation metrics.

(2) Given the nature of the proofreading task, a Chinese proofreading model based on a char-level convolutional encoder-decoder network is proposed. Model performance is further improved by extending the parallel training corpus, removing abnormal sentence pairs, and initializing the embedding matrix with pre-trained word vectors. Data fusion comparison experiments show that using more high-quality parallel proofreading data significantly improves model performance. Data cleaning comparison experiments show that filtering out sentence pairs whose target side is much shorter than the source side effectively improves performance. Comparison experiments on modeling granularity show that char-level segmentation is superior to word-level and sub-word-level segmentation. Comparison experiments on embedding matrix initialization show that using pre-trained word2vec char vectors effectively improves the performance of char-level models.

(3) A Chinese text automatic proofreading method based on ensemble decoding and re-ranking is proposed. It applies a re-ranking mechanism to the N-best output produced by ensemble decoding of the GEC models. The re-ranking mechanism combines the GEC model decoder score, edit operation features designed for the proofreading task, and a 5-gram language model score to re-score the N-best candidates for each erroneous input sentence; the candidate with the highest score is taken as the corrected output. Experiments show that combining ensemble decoding with re-ranking is effective and that the re-ranking mechanism significantly improves model performance.

(4) To address the limitations of char-level and sub-word-level models, a Chinese text automatic proofreading method based on multi-channel fusion and re-ranking is proposed. The method combines the char-level and sub-word-level proofreading models through three prediction channels, each of which uses ensemble decoding and outputs its N best candidates. The outputs of all channels are then aggregated and re-ranked with the standardized LM feature re-ranking mechanism, and the sentence with the highest score is taken as the final output. The experimental results show the effectiveness of the proposed method.
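
The re-ranking step described in (3) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the linear scoring form, the weight values, the length normalization of the language model score, and the names `Candidate`, `edit_counts`, and `rerank` are all assumptions made for illustration. The decoder and language model scores are taken as given log-probabilities, and the edit operation features are approximated by char-level edit counts from Python's `difflib`.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Candidate:
    text: str
    decoder_score: float  # log-probability from the GEC decoder (assumed given)
    lm_score: float       # log-probability from an n-gram LM (assumed given)


def edit_counts(source: str, hypothesis: str) -> dict:
    """Count char-level replace/insert/delete edits between two sentences."""
    counts = {"replace": 0, "insert": 0, "delete": 0}
    matcher = SequenceMatcher(None, source, hypothesis, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            counts[tag] += max(i2 - i1, j2 - j1)
    return counts


def rerank(source: str, candidates: list, weights: dict) -> list:
    """Sort N-best candidates by a weighted sum of features, best first."""
    def score(cand: Candidate) -> float:
        edits = edit_counts(source, cand.text)
        length = max(len(cand.text), 1)
        total = weights["decoder"] * cand.decoder_score
        # Length-normalize the LM score so longer outputs are not penalized
        # (an assumption, not necessarily the thesis's normalization).
        total += weights["lm"] * cand.lm_score / length
        for op in ("replace", "insert", "delete"):
            total += weights[op] * edits[op]
        return total

    return sorted(candidates, key=score, reverse=True)


# Hypothetical weights and scores, chosen only to make the example concrete.
weights = {"decoder": 1.0, "lm": 1.0, "replace": -0.1, "insert": -0.1, "delete": -0.1}
source = "我今天很高心"
n_best = [
    Candidate("我今天很高心", decoder_score=-1.0, lm_score=-12.0),
    Candidate("我今天很高兴", decoder_score=-1.2, lm_score=-6.0),
]
best = rerank(source, n_best, weights)[0]
print(best.text)
```

In practice the feature weights would be tuned on a development set rather than fixed by hand, and the multi-channel method in (4) would pass the pooled N-best lists from all channels to the same kind of scoring function.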
Keywords/Search Tags: Chinese automatic proofreading, Neural machine translation, Convolutional encoder-decoder neural network, Re-ranking mechanism, Multi-channel fusion