Research On Chinese Text Real-Word Error Automatic Detection And Correction Algorithm

Posted on:2019-02-26

Degree:Master

Type:Thesis

Country:China

Candidate:L Wang

Full Text:PDF

GTID:2428330575450447

Subject:Applied statistics

Abstract/Summary:

Text detection and correction is an important part of publishing work and is widely used in information retrieval, optical character recognition, and speech recognition. With the development of information technology and the electronic publishing industry, traditional manual proofreading can no longer keep up with the rapid growth in the volume of electronic text. A Chinese real-word error is a word that exists in the dictionary but does not fit its context. At present, traditional automatic proofreading algorithms for real-word errors rely mainly on local context features and make little use of the relationships between words. In the past two years, some scholars have proposed seq2seq algorithms for text detection and correction. Their advantage is that they can use word vectors and LSTM to proofread target words using the semantics of both nearby and distant words. However, because detection and correction are decided entirely by context word vectors, and both the input and the output are sequences that are hard to control, the output may contain sentences with strange or unintelligible meanings.

First, this paper improves the traditional proofreading algorithm based on statistical models and proposes CS-N-gram, a text detection and correction algorithm based on confusion sets and N-grams. Second, drawing on both the traditional proofreading algorithm and sequence labeling algorithms, it proposes CS-BiLSTM-CRF, a text detection and correction algorithm based on confusion sets and sequence labeling.

Experimental results show that CS-BiLSTM-CRF achieves higher recall and accuracy in proofreading than CS-N-gram. This paper also categorizes the errors made by CS-N-gram by their causes and analyzes the advantages and disadvantages of both algorithms. CS-BiLSTM-CRF effectively handles the most difficult case for CS-N-gram, in which adjacent words are out-of-vocabulary, and it proofreads better when long-distance word semantics are needed. However, in some cases that can be proofread directly from local context, CS-BiLSTM-CRF is slightly inferior to CS-N-gram because the word vector contains multi-dimensional information. Based on this analysis, this paper proposes a hybrid automatic proofreading algorithm. The hybrid algorithm can be applied to the automatic proofreading of real-word errors in different corpora without manual intervention such as external corpora or rule dictionaries, which makes it significant for the automatic proofreading of Chinese real-word errors.
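
The abstract does not spell out how CS-N-gram works, so the following is only a minimal sketch of the general idea of combining a confusion set with N-gram scoring, not the thesis's actual algorithm. It assumes pre-segmented text, a bigram model with add-one smoothing, and a tiny hand-made corpus and confusion set; the names `CORPUS`, `CONFUSION_SETS`, `local_score`, and `correct` are all hypothetical.

```python
import math
from collections import Counter

# Toy pre-segmented training corpus (hypothetical, for illustration only).
CORPUS = [
    ["我", "在", "图书馆", "看书"],
    ["他", "在", "学校", "学习"],
    ["我", "再", "说", "一", "遍"],
    ["请", "再", "说", "一", "次"],
]

# Hypothetical confusion set: words that are easily mistaken for each other.
CONFUSION_SETS = {"在": {"在", "再"}, "再": {"在", "再"}}

# Count unigrams and bigrams, with sentence boundary markers.
unigrams = Counter()
bigrams = Counter()
for sentence in CORPUS:
    padded = ["<s>"] + sentence + ["</s>"]
    unigrams.update(padded)
    bigrams.update(zip(padded, padded[1:]))

VOCAB_SIZE = len(unigrams)


def bigram_logprob(prev, word):
    """Log P(word | prev) with add-one smoothing."""
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + VOCAB_SIZE))


def local_score(tokens, i, candidate):
    """Score a candidate at position i using its left and right neighbours."""
    prev = tokens[i - 1] if i > 0 else "<s>"
    nxt = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return bigram_logprob(prev, candidate) + bigram_logprob(candidate, nxt)


def correct(tokens):
    """Replace a word only if a confusable alternative scores strictly higher."""
    result = list(tokens)
    for i, word in enumerate(tokens):
        candidates = CONFUSION_SETS.get(word)
        if not candidates:
            continue  # word has no confusion set, leave it unchanged
        best = max(sorted(candidates), key=lambda c: local_score(result, i, c))
        if local_score(result, i, best) > local_score(result, i, word):
            result[i] = best
    return result


print(correct(["我", "在", "说", "一", "遍"]))
```

Because the score depends only on the immediate neighbours, a sketch like this has little to go on when those neighbours never appear in the training counts, which is the out-of-vocabulary weakness the abstract attributes to the N-gram approach.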

Keywords/Search Tags:

Text Detection and Correction, Real-word Errors, Confusion Sets, N-gram, BiLSTM-CRF

Related items

1	Research On Spelling Checker/Corrector For Kazakh Corpora
2	Research On Chinese Real-word Error Automatic Detection And Correction
3	Research On Word Error Correction Methods Of Chinese Text
4	Research On Query Correction Method Based On Multiple Characteristics Mining
5	Research And Application Of The Dependency Grammar And Valence Grammar In The Real-word Errors Correction
6	Research On Text Proofreading Method Based On The Analysis Of The Mongolian Syllable
7	Research On Chinese Text Error Correction Based On N-gram And Dependency Parsing
8	Research On Short Text Emotion Classification Method Based On Word2Vec And N-Gram
9	Research Of Problems In Spoken Term Detection
10	Research On The Contextual Cohesion Of Social Media Texts For News