
Research On Corpus Parallel Processing In Chinese Proofreading

Posted on: 2013-04-07 | Degree: Master | Type: Thesis
Country: China | Candidate: T Liu | Full Text: PDF
GTID: 2298330422474291 | Subject: Photogrammetry and Remote Sensing
Abstract/Summary:
With the rapid development of computer technology and natural language processing, processing large-scale corpora by computer has become a trend. On the one hand, the growth of the Internet means that corpora appear ever faster and at ever larger scale, so corpus processing must become more efficient. On the other hand, parallel computing models, multi-core processors, and cluster-based parallel hardware architectures provide the technical support that large-scale corpus processing needs. This dissertation therefore studies corpus parallel processing technology, with the following contributions.

Firstly, the dissertation analyzes the causes of errors in Chinese text, studies how a wrong-word library is constructed and how proofreading is carried out against it, and proposes a method for constructing the wrong-word library based on confusion sets. This method can effectively handle non-word errors.

Secondly, the dissertation studies the feasibility of proofreading with shallow parsing, proposes a method for identifying Chinese chunks based on mutual information, and parallelizes chunk identification using MapReduce. The method works well on untagged corpora and can effectively handle real-word errors in proofreading.

Thirdly, the dissertation studies Chinese part-of-speech tagging based on Conditional Random Fields and proposes a parallel Conditional Random Fields method using MapReduce. The method improves efficiency.

Finally, the dissertation implements a prototype system that uses the proofreading methods based on the wrong-word library and on Chinese chunks. The system can effectively address the problem of Chinese proofreading.
Keywords/Search Tags: Chinese Automatic Proofreading, Corpus, Wrong Word Library, Chinese Chunk, MapReduce, Part-of-speech Tagging
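
The abstract does not give the formula or the MapReduce job layout used for chunk identification, so the following is only a minimal sketch of the general idea, under stated assumptions: it scores adjacent token pairs with pointwise mutual information, and it imitates the map and reduce phases in-process with Python's built-in `map` and `functools.reduce` rather than a real cluster framework. The function names `map_counts`, `reduce_counts`, and `pmi_table` are hypothetical and do not come from the dissertation.

```python
import math
from collections import Counter
from functools import reduce


def map_counts(sentence):
    """Map step (assumed): count tokens and adjacent token pairs in one sentence."""
    unigrams = Counter(sentence)
    bigrams = Counter(zip(sentence, sentence[1:]))
    return unigrams, bigrams


def reduce_counts(left, right):
    """Reduce step (assumed): merge two partial (unigram, bigram) count pairs."""
    return left[0] + right[0], left[1] + right[1]


def pmi_table(sentences):
    """Return pointwise mutual information (base 2) for every adjacent pair seen."""
    unigrams, bigrams = reduce(
        reduce_counts, map(map_counts, sentences), (Counter(), Counter())
    )
    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())
    table = {}
    for (x, y), count in bigrams.items():
        p_xy = count / total_bigrams
        p_x = unigrams[x] / total_unigrams
        p_y = unigrams[y] / total_unigrams
        table[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return table
```

Under this sketch, pairs with high scores would be candidates for belonging to the same chunk. Because each sentence is counted independently and the partial counts merge associatively, the same two functions could in principle be distributed across workers. How the dissertation actually partitions the corpus is not stated in the abstract.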