Study On The Method Of Constructing The Confusion Set Of Chinese Characters

Posted on:2015-07-16

Degree:Master

Type:Thesis

Country:China

Candidate:H L Shi

Full Text:PDF

GTID:2208330422488592

Subject:Computer technology

Abstract/Summary:

Research on Chinese confusion sets is both an important foundation of automatic Chinese text proofreading and one of its bottlenecks, and it plays an important role in the development of that technology. This thesis studies the main techniques related to confusion sets in depth, including the types of wrong seed characters in Chinese text, the storage of large data dictionaries, Chinese text segmentation algorithms, and the sorting of confusion sets.

The thesis approaches the confusion set from a new perspective. We artificially create 11,935 wrongly written characters of various types. Taking those characters as nodes and "possibly wrong character" relations as edges, we organize the sets of easily confused characters into a graph. On the basis of this graph, we design an internal-expansion algorithm to find inner rules and to verify the wrongly written character sets. Using external data sources, we supplement the sets, discover new pairs of wrongly written characters, and sort every set. Finally, we build a dictionary of wrongly written character sets. In an experiment that proofreads randomly chosen samples, the accuracy reaches 87.35%.

The main contributions of this thesis are as follows:

First, the forms in which wrongly written characters appear in Chinese text are studied in depth. From a large quantity of texts we sort out several types of errors, including similar sound, similar shape, adjacent-row keystroke errors, and pinyin phrase combination errors. We then analyze them and propose solutions.

Second, we automatically extend the confusion set of wrongly written characters from a new angle, introduce the concept of a wrongly written character set graph, and find rules for supplementing it.

Third, after supplementing and validating the sets through the wrongly written character graph, we further bring in large data sets and eventually build a dictionary of wrongly written character sets.

Finally, we sort each wrongly written character set by character and word frequency and by shape similarity, which benefits the error correction system. Because the large-scale corpus and the unfamiliar-word dictionary are not limited to any particular field, the approach is of great help to text proofreading.
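The graph-based idea in the abstract can be illustrated with a small Python sketch. This is not the thesis's algorithm: the abstract does not state the expansion rule, so the sketch assumes a simple one (two characters that share a confusable neighbor are proposed as a new candidate pair), and it ranks by frequency only, leaving out shape similarity. The function names and the frequency values are hypothetical.

```python
from collections import defaultdict


def build_confusion_graph(pairs):
    """Build an undirected graph: characters are nodes, confusable pairs are edges."""
    graph = defaultdict(set)
    for a, b in pairs:
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def expand_internally(graph):
    """Assumed rule: propose (a, b) when a and b share a neighbor but are not linked."""
    candidates = set()
    for neighbors in graph.values():
        ordered = sorted(neighbors)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                if b not in graph.get(a, set()):
                    candidates.add((a, b))
    return candidates


def rank_confusion_set(graph, char, freq):
    """Sort the confusion set of `char` by descending frequency (ties by code point)."""
    return sorted(graph.get(char, set()), key=lambda c: (-freq.get(c, 0), c))


# Illustrative seed pairs of similar-shape characters.
seed_pairs = [("己", "已"), ("已", "巳")]
graph = build_confusion_graph(seed_pairs)

# Made-up frequencies, for illustration only.
freq = {"己": 50, "已": 80, "巳": 2}

new_pairs = expand_internally(graph)
ranked = rank_confusion_set(graph, "已", freq)
```

Candidate pairs produced this way would still need verification, for example against an external data source as the abstract describes, before being added to the dictionary.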

Keywords/Search Tags:

Wrongly Written Characters Set, self-expansion, Open Source Data, Rule and Statistics Base

Related items

1	The Design And Implementation Of Wrongly Written Chinese Characters Processing Toolkit Oriented Chinese Characters Teaching
2	Research On Communication And Navigation Integration Technology Based On 5G Open-source Base Station
3	Rule-based Dynamic Data Acquisition Technology And Its Application In The Published Statistics
4	Research And Implement Of Intrusion Detection System Rule Based On CVE Characters
5	Research And Design On Open Source Community Data Mining Key Technologies
6	Fingerprints and hand-written/printed characters processing methods for recognition via multi-stage self organized learning in a distributed computing environment
7	Research On Technologies Of Web Data Extraction On Open Source Community
8	Research On Key Technologies Of Web Data Extraction And Mining On Open Source Community
9	Research On Open Source Project Team Expansion Via Attributed Network Representation Learning
10	The Design And Implementation Of Content Management System For Biomedical Statistics Consultation