
Research And Implementation Of Error Detection And Error Correction Efficiency Optimization Of Chinese Text

Posted on: 2022-09-10
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Yang
GTID: 2518306575469074
Subject: Electronics and Communications Engineering
Abstract/Summary:
With the rapid development of the Internet, the volume of online data is growing quickly. The quality of that information, however, keeps declining, and the misuse of characters has become a pressing problem. Traditional manual proofreading is time-consuming and inefficient, and it cannot keep up with such large volumes of data. Research on automatic proofreading of Chinese text therefore has real practical significance.

This thesis analyzes the sources and types of errors in Chinese text. Errors are divided into "non-word errors" and "true word errors", and a different proofreading algorithm is applied to each type. Automatic proofreading of Chinese text consists of two parts, error detection and error correction, and the research is organized as follows:

1. Error detection. A method based on rules and a dictionary is used to detect "non-word errors". Each sentence is segmented into words, and the segmentation result is checked against the dictionary. Consecutive scattered strings and words that do not exist in the dictionary are treated as errors. In addition, an N-gram language model is used to detect "true word errors" by analyzing the connection between adjacent phrases: if the probability linking two phrases is below a threshold, the word is considered to be an error.

2. Error correction. The characteristics of the N-gram language model and the Long Short-Term Memory (LSTM) language model are analyzed separately, and a joint proofreading algorithm based on the Tri-gram and LSTM language models is proposed. Candidate sentences are first scored with the Tri-gram language model. If the scores differ only slightly, the candidates are scored a second time with the LSTM language model for further disambiguation. The scores of all candidate sentences are then compared, and the candidate with the highest score is output as the correction suggestion, which improves the correction result.

3. Efficiency optimization. Although the LSTM language model captures long-distance information between words better, it is slow to compute. To improve the error correction efficiency of the LSTM model, an optimization scheme based on prefix tree merging is proposed. An analysis of a large number of candidate sentences shows that the correction candidates for a given sentence are highly similar to one another. The shared parts of the candidates can be merged into a prefix tree, and the score of each sentence can then be computed in parallel with a multi-threaded pipeline. With this scheme, the error correction efficiency of the LSTM language model is further improved on top of the joint proofreading algorithm.

Finally, the proofreading method described above is evaluated on the test set. The experimental results show that the proposed optimization scheme greatly improves both error correction performance and efficiency compared with the original method.
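The two-stage scoring in item 2 can be illustrated with a minimal sketch. It is not the thesis implementation: the function name `pick_correction`, the `margin` parameter, and the rule "rescore every candidate within `margin` of the best Tri-gram score" are assumptions made for illustration, and the two scorers are toy lookup tables standing in for real Tri-gram and LSTM language models.

```python
from typing import Callable, List

# A scorer maps a candidate sentence to a log-probability-like score
# (higher is better). Both scorers below are hypothetical stand-ins.
Scorer = Callable[[str], float]


def pick_correction(candidates: List[str],
                    trigram_score: Scorer,
                    lstm_score: Scorer,
                    margin: float = 0.5) -> str:
    """Pick a correction; call the slow LSTM scorer only for near-ties."""
    if not candidates:
        raise ValueError("no candidates")
    # First pass: cheap Tri-gram score for every candidate.
    scored = sorted(((trigram_score(c), c) for c in candidates),
                    key=lambda pair: pair[0], reverse=True)
    best_score, best = scored[0]
    # Candidates whose Tri-gram score is within `margin` of the best one.
    close = [c for s, c in scored if best_score - s < margin]
    if len(close) == 1:
        return best  # clear winner, no second pass needed
    # Second pass: disambiguate the near-ties with the LSTM score.
    return max(close, key=lstm_score)


# Toy scores (invented for this example only).
TRIGRAM = {"cand_a": -10.0, "cand_b": -10.2, "cand_c": -15.0}
LSTM = {"cand_a": -9.0, "cand_b": -7.0, "cand_c": -1.0}

chosen = pick_correction(list(TRIGRAM), TRIGRAM.get, LSTM.get, margin=0.5)
```

Here `cand_a` and `cand_b` are near-ties under the Tri-gram score, so only those two are rescored; `cand_c` is never passed to the LSTM scorer even though its toy LSTM score is the highest.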
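The prefix tree merging idea in item 3 can be sketched as follows. This is a single-threaded illustration under stated assumptions, not the thesis code: `toy_step` is a hypothetical stand-in for one LSTM step (it takes a state and a token and returns a new state and a log-probability), the example sentences are invented, and the multi-threaded pipeline is left out. The point it shows is that a shared prefix is fed through the model once instead of once per candidate.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


class Node:
    """One prefix-tree node; `ends` lists candidates that finish here."""

    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.ends: List[int] = []


def build_trie(candidates: Sequence[Sequence[str]]) -> Node:
    root = Node()
    for index, sentence in enumerate(candidates):
        node = root
        for token in sentence:
            node = node.children.setdefault(token, Node())
        node.ends.append(index)
    return root


def toy_step(state: Tuple[str, ...], token: str) -> Tuple[Tuple[str, ...], float]:
    """Hypothetical stand-in for one LSTM step (state, token) -> (state, logp)."""
    return state + (token,), -float(len(token) + len(state))


def score_naive(candidates, step, init_state: Hashable):
    """Score each candidate separately: one model step per token."""
    scores, calls = [], 0
    for sentence in candidates:
        state, total = init_state, 0.0
        for token in sentence:
            state, logp = step(state, token)
            total += logp
            calls += 1
        scores.append(total)
    return scores, calls


def score_with_trie(candidates, step, init_state: Hashable):
    """Score all candidates with one model step per prefix-tree node."""
    scores = [0.0] * len(candidates)
    calls = 0
    stack = [(build_trie(candidates), init_state, 0.0)]
    while stack:
        node, state, total = stack.pop()
        for index in node.ends:
            scores[index] = total
        for token, child in node.children.items():
            new_state, logp = step(state, token)  # computed once per node
            calls += 1
            stack.append((child, new_state, total + logp))
    return scores, calls


# Three invented candidates that share prefixes.
CANDIDATES = [["我", "喜欢", "苹果"], ["我", "喜欢", "平果"], ["我", "喜换", "苹果"]]
naive_scores, naive_calls = score_naive(CANDIDATES, toy_step, ())
trie_scores, trie_calls = score_with_trie(CANDIDATES, toy_step, ())
```

For these three candidates the naive loop makes 9 step calls and the prefix-tree traversal makes 6, with identical scores. In a real system the sibling branches of the tree are independent of one another, which is what would make parallel evaluation possible.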
Keywords/Search Tags: Chinese text automatic proofreading, N-gram language model, LSTM language model, pipeline