
Chinese Text Automatic Proofreading System

Posted on: 2016-04-11
Degree: Master
Type: Thesis
Country: China
Candidate: M Shi
Full Text: PDF
GTID: 2308330479998252
Subject: Software engineering
Abstract/Summary:
With the development of computer and information technology, statistical natural language processing has advanced rapidly and made remarkable achievements. The demand for automatic proofreading of electronic text has given rise to research on automatic text proofreading. Automatic proofreading of Chinese text consists of two steps: automatic error detection and automatic correction. This paper has done the following work:

1. Local proofreading of Chinese homophones
Errors in Chinese text take many forms. A detailed analysis of each error type, combined with errors found in practice, showed that homophone errors account for a large proportion, so this work focuses on proofreading homophone errors. We first used the simplest n-gram model, the 2-gram model, and then combined the 2-gram model with context. After analyzing the results, this paper proposes a method that generalizes the context using synonyms, which alleviates the data sparsity problem and improves system performance. Finally, the system was tested on real text: the recall rate was 81.2%, the accuracy was 73.4%, and the correct rate was 88.9%.

2. Long-distance proofreading of Chinese homophones
For errors that cannot be identified from local features, we used Chinese collocations. First, collocations were acquired automatically from a corpus and served as the basic resource. Then we extracted the collocation information of each word in the text and computed the collocation support of every word in its confusion set. Whether the original text was wrong was judged by comparing these support values, and the two words with the highest support were offered as suggestions.

3. Non-word error proofreading
This paper also studied how to proofread non-word errors. Only errors in long terms are considered here, namely terms of four, five, or six words; these are commonly idiom-type errors. Non-word error is originally a concept from English text proofreading; in this paper it applies to long terms rather than to characters. To solve this problem, we constructed a set of wrong words from a dictionary and a massive corpus by fuzzy matching, which yielded pairs of the form "right word, wrong word". If the text matches a wrong word, the system can supply the correct word during proofreading. We applied this method to proofreading compositions, and the effect was obvious.

Finally, we built an automatic text proofreading system that mainly proofreads these two kinds of errors. After testing it on real text, we pointed out some shortcomings and future research directions.
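
The local homophone check in item 1 can be illustrated with a minimal sketch. It is not the thesis's implementation: the toy corpus, the confusion set for 在/再, the add-one smoothing, and all function names are assumptions made for illustration. Each word that belongs to a confusion set is rescored with its left and right 2-grams, and a homophone is suggested only when it scores strictly higher than the original word.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count words and adjacent word pairs in tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + list(words) + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed estimate of P(word | prev) (smoothing is an assumption)."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def local_score(words, i, candidate, unigrams, bigrams):
    """Score a candidate at position i using its left and right 2-grams."""
    padded = ["<s>"] + list(words) + ["</s>"]
    left, right = padded[i], padded[i + 2]  # neighbours of words[i]
    return (bigram_prob(left, candidate, unigrams, bigrams)
            * bigram_prob(candidate, right, unigrams, bigrams))


def check_homophones(words, confusion_sets, unigrams, bigrams):
    """Return (position, original, suggestion) for words a homophone outscores."""
    findings = []
    for i, word in enumerate(words):
        candidates = confusion_sets.get(word)
        if not candidates:
            continue
        # Put the original first so that ties keep the original word.
        options = [word] + [c for c in candidates if c != word]
        best = max(options,
                   key=lambda c: local_score(words, i, c, unigrams, bigrams))
        if best != word:
            findings.append((i, word, best))
    return findings


# Hypothetical toy corpus and confusion set, for illustration only.
corpus = [
    ["我", "在", "家"],
    ["他", "在", "学校"],
    ["请", "再", "说", "一遍"],
    ["我", "再", "试", "一次"],
]
confusion_sets = {"在": ["在", "再"], "再": ["在", "再"]}
unigrams, bigrams = train_bigrams(corpus)

suspect = check_homophones(["请", "在", "说", "一遍"],
                           confusion_sets, unigrams, bigrams)
clean = check_homophones(["我", "在", "家"],
                         confusion_sets, unigrams, bigrams)
```

With this toy data, `suspect` flags position 1 and suggests 再 for 在, while `clean` is empty. A real system would need a much larger corpus, which is where the data sparsity mentioned in item 1 arises.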
Keywords/Search Tags: automatic text proofreading, homophone proofreading, Chinese collocation, fuzzy matching, non-word error proofreading