
Research On Chinese Text Real-Word Error Automatic Detection And Correction Algorithm

Posted on: 2019-02-26
Degree: Master
Type: Thesis
Country: China
Candidate: L Wang
GTID: 2428330575450447
Subject: Applied statistics
Abstract/Summary:
Text detection and correction is an important part of publishing work and is widely used in information retrieval, optical character recognition and speech recognition. With the development of information technology and the electronic publishing industry, traditional manual proofreading can no longer keep pace with the rapid growth in the volume of electronic text. A Chinese real-word error is a word that exists in the dictionary but does not fit its context. At present, traditional automatic proofreading algorithms for real-word errors rely mainly on local context features and make little use of the relationships between words. In the past two years, some scholars have proposed seq2seq algorithms for text detection and correction. The advantage of this approach is that it uses word vectors and an LSTM to proofread the target word through the semantics of both nearby and distant words. However, because detection and correction are decided entirely by the context word vectors, and both the input and the output are sequences, the output is hard to control, and sentences with strange or unintelligible meanings may be produced.

First, this paper improves the traditional proofreading algorithm based on a statistical model and proposes CS-N-gram, a text detection and correction algorithm based on confusion sets and N-grams. Second, drawing on both the traditional proofreading algorithm and sequence labeling algorithms, this paper proposes CS-BiLSTM-CRF, a text detection and correction algorithm based on confusion sets and sequence labeling. Experimental results show that the CS-BiLSTM-CRF algorithm achieves higher recall and accuracy in proofreading than the CS-N-gram algorithm.

This paper also classifies the errors made by the CS-N-gram algorithm according to their causes and analyzes the advantages and disadvantages of the CS-N-gram and CS-BiLSTM-CRF algorithms. The CS-BiLSTM-CRF algorithm effectively solves the most difficult problem of the CS-N-gram algorithm, namely the case where the adjacent words are out-of-vocabulary (unregistered) words, and it also proofreads better by drawing on the semantics of long-distance words. However, in some cases that can be proofread directly from the local context, the CS-BiLSTM-CRF algorithm is slightly inferior to the CS-N-gram algorithm because the word vector contains multi-dimensional information. Based on this analysis of the merits and demerits of the two algorithms, this paper proposes a hybrid automatic proofreading algorithm. The hybrid algorithm can be applied to the automatic proofreading of real-word errors in different corpora without manual intervention such as external corpora or rule dictionaries, which makes it significant for the automatic proofreading of Chinese real-word errors.
Keywords/Search Tags: Text Detection and Correction, Real-word Errors, Confusion Sets, N-gram, BiLSTM-CRF
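The abstract names the CS-N-gram idea (confusion sets plus N-grams) without giving its details, so the following is only a minimal sketch of the general technique, not the thesis's actual algorithm. It assumes a bigram model with add-one smoothing, pre-segmented input, and a word is replaced only when another member of its confusion set scores strictly higher against its immediate neighbours. The toy corpus, the confusion set {在, 再}, and all function names are hypothetical and chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already segmented into words (assumption).
CORPUS = [
    ["我", "在", "公园", "散步"],
    ["他", "在", "家", "看书"],
    ["我", "再", "说", "一遍"],
    ["请", "再", "说", "一次"],
]

# Hypothetical confusion set: words that are easily mistaken for each other.
CONFUSION_SETS = [{"在", "再"}]


def train_bigrams(corpus):
    """Count unigrams and bigrams, padding each sentence with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_logprob(prev, word, unigrams, bigrams):
    """Log P(word | prev) with add-one smoothing (a simplifying assumption)."""
    vocab_size = len(unigrams)
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))


def local_score(tokens, i, candidate, unigrams, bigrams):
    """Score a candidate at position i using only its left and right neighbours."""
    prev = tokens[i - 1] if i > 0 else "<s>"
    nxt = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return (bigram_logprob(prev, candidate, unigrams, bigrams)
            + bigram_logprob(candidate, nxt, unigrams, bigrams))


def correct(tokens, confusion_sets, unigrams, bigrams):
    """Replace a word only if another confusion-set member scores strictly higher."""
    result = list(tokens)
    for i, word in enumerate(tokens):
        for confusion_set in confusion_sets:
            if word not in confusion_set:
                continue
            best = word
            best_score = local_score(result, i, word, unigrams, bigrams)
            for candidate in sorted(confusion_set - {word}):
                score = local_score(result, i, candidate, unigrams, bigrams)
                if score > best_score:
                    best, best_score = candidate, score
            result[i] = best
    return result


if __name__ == "__main__":
    unigrams, bigrams = train_bigrams(CORPUS)
    # "在" is a real word but does not fit before "说" in this toy corpus.
    print(correct(["我", "在", "说", "一遍"], CONFUSION_SETS, unigrams, bigrams))
```

Because the score looks only at adjacent words, a sketch like this has nothing useful to go on when those neighbours never appear in the training counts, which is the out-of-vocabulary weakness the abstract attributes to the N-gram approach.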