Study On The Method Of Constructing The Confusion Set Of Chinese Characters

Posted on: 2015-07-16
Degree: Master
Type: Thesis
Country: China
Candidate: H L Shi
Full Text: PDF
GTID: 2208330422488592
Subject: Computer technology
Abstract/Summary:
Research on Chinese confusion sets is both a fundamental topic in the automatic proofreading of Chinese text and one of its bottlenecks, and it plays an important role in the development of that technology. This thesis studies the main techniques related to confusion sets in depth, including the error types of seed characters in Chinese text, large-scale storage of the data dictionary, the Chinese text segmentation algorithm, and the ranking of confusion sets.

The thesis approaches the confusion set from a new perspective. We artificially create 11,935 wrongly written characters of various types. Taking these characters as nodes and "possibly wrong character" relations as edges, we organize the sets of easily confused characters into a graph. On the basis of this graph, we design an internal-expansion algorithm to find inner rules and to verify the wrongly written character sets. Using external data sources, we supplement the sets, discover new pairs of wrongly written characters, and rank every set. Finally, we build a dictionary of wrongly written character sets. In an experiment on randomly selected proofreading samples, the accuracy reaches 87.35%.

The main contributions of this thesis are as follows.

First, we study in depth the forms in which wrongly written characters appear in Chinese text. From a large quantity of texts we sort out the main error types, including similar sound, similar shape, adjacent-key keystroke errors, and pinyin phrase-combination errors. We then analyze these types and put forward a solution.

Second, we automatically extend the confusion set of wrongly written characters from a new angle: we put forward the concept of a wrongly written character set graph and find rules for complementing the set.

Third, after supplementing and validating the sets through the graph, we further draw on large data sets and eventually build a dictionary of wrongly written character sets.

Finally, we rank each wrongly written character set by character and word frequency and by shape similarity, which benefits the error correction system. Because the large-scale corpus and the unknown-word dictionary are not limited to any particular field, the result is of great help to text proofreading.
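The graph construction, internal expansion, and ranking steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the actual rules, so everything here is an assumption made for illustration: the function names, the "shared neighbours" expansion rule with its `min_shared` threshold, and the linear `weight` that mixes frequency with shape similarity are all hypothetical and are not the thesis's algorithm.

```python
from collections import defaultdict


def build_confusion_graph(pairs):
    """Build an undirected graph: nodes are characters, edges are
    'possibly wrong character' relations between them."""
    graph = defaultdict(set)
    for a, b in pairs:
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def expand_internally(graph, min_shared=2):
    """Hypothetical expansion rule (an assumption, not the thesis's rule):
    propose a new pair when two unlinked characters already share at
    least `min_shared` confusable neighbours."""
    proposed = set()
    chars = sorted(graph)
    for i, a in enumerate(chars):
        for b in chars[i + 1:]:
            if b not in graph[a] and len(graph[a] & graph[b]) >= min_shared:
                proposed.add((a, b))
    return proposed


def rank_confusion_set(char, graph, freq, shape_sim, weight=0.5):
    """Order the candidates for `char` by an assumed weighted mix of
    normalized frequency and shape similarity, highest score first."""
    candidates = graph.get(char, set())
    if not candidates:
        return []
    max_freq = max(freq.get(c, 0) for c in candidates) or 1

    def score(c):
        similarity = shape_sim.get(frozenset((char, c)), 0.0)
        return weight * freq.get(c, 0) / max_freq + (1 - weight) * similarity

    # Ties are broken by the character itself so the order is deterministic.
    return sorted(candidates, key=lambda c: (-score(c), c))


# Toy seed pairs; the frequencies and similarities below are invented values.
seed_pairs = [("已", "己"), ("已", "巳"), ("以", "己"), ("以", "巳")]
graph = build_confusion_graph(seed_pairs)
new_pairs = expand_internally(graph)
ranked = rank_confusion_set(
    "已",
    graph,
    freq={"己": 120, "巳": 3},
    shape_sim={frozenset(("已", "己")): 0.9, frozenset(("已", "巳")): 0.8},
)
```

In this toy input, 已 and 以 are never listed as a pair, but both are linked to 己 and 巳, so the assumed rule proposes them as a new candidate pair for later verification against external data.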
Keywords/Search Tags: Wrongly Written Characters Set, Self-Expansion, Open Source Data, Rule and Statistics Base