
Structure And Coding System Of Commonly Used Chinese Characters

Posted on: 2014-08-12
Degree: Master
Type: Thesis
Country: China
Candidate: S Huang
GTID: 2208330434470702
Subject: Computer application technology
Abstract/Summary:
As computer technology continues to evolve, people want computers to complete more challenging tasks independently. Optical Character Recognition (OCR) is the technology that lets a computer automatically recognize text printed or written on paper, according to rules supplied by people. OCR is an important research field in natural language processing and is related to many disciplines, including artificial intelligence, digital image processing, pattern recognition, natural language understanding, and information theory. In today's era of rapid scientific and technological development, higher demands are placed on processing characters efficiently and extracting valuable information from them.
Currently, character recognition technology still falls short of practical needs. Existing recognition methods have not achieved the desired results because characters vary widely in shape and structure. For example, printed characters come in different fonts, and handwritten characters follow no unified rules of shape. Chinese character recognition faces additional difficulties because of the large number of characters, their complex structure, and the variety of fonts. Most Chinese character recognition systems are designed to cover all Chinese characters, which means they have to handle more than 10,000 characters. In addition, many pairs of characters have similar shapes. The only difference within a pair is often very small: for example, one character has a dot in the upper-right corner and the other does not. This makes such characters harder to distinguish.
In daily life, about 3,500 Chinese characters are used frequently; they are referred to as the commonly used characters. Rare characters do appear in some material, but they are used only on specific occasions.
According to a computer sampling test, the usage rate of the commonly used characters reaches 99.48%. In practical applications, the characters used in writing fall within the scope of the commonly used characters. Accordingly, limiting the characters to be processed to the 3,500 commonly used characters can meet general communication requirements.
On the one hand, reducing the number of characters reduces the difficulty of recognition. On the other hand, although Chinese characters vary widely in form, the structural patterns that make up Chinese characters are fixed stroke structures, which are called Wubi roots in the Wubi input method. Wubi roots are the basic structures of Chinese characters. They have stable shapes and carry meaning themselves: because Chinese characters evolved from pictographs, some characters are in fact visual representations of objects. The Wubi input method has about 125 Wubi roots, which are used to represent characters. To input a character into a computer, the user enters the Wubi roots of the character in order. The roots can therefore be seen as two-dimensional symbols, and Chinese characters as two-dimensional codes. Because no two characters have the same structural pattern, a coding method that uses structural patterns to encode characters yields unique results.
By analyzing the structural patterns used in the Wubi input method, this paper selects 93 Wubi roots to encode 3,149 commonly used Chinese characters and obtains a coding table that covers each of them. At the same time, a detailed analysis of character features is carried out to classify the Wubi roots. Firstly, the number and distribution of the horizontal and vertical strokes in each character's structure are extracted.
Secondly, the paper analyzes the structure of the characters in each subclass. The goal is to extract features that are more discriminative and use them to describe these characters. Classification trees are also built for each subclass on the basis of those features to improve matching speed and recognition accuracy.
This paper also proposes a framework for a character recognition system based on root encoding. To input a Chinese character into the computer, the user inputs the structure roots of the character instead of the whole character. The recognition system then processes the roots. After identifying the structure roots, it searches the root code table for the corresponding Chinese character, which becomes the output.
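The final lookup step of the framework, searching a root code table with the recognized roots, might be sketched in Python as follows. This is only an illustration under stated assumptions: the four sample decompositions, the names `build_code_table` and `lookup_character`, and the use of a dictionary keyed by root tuples are hypothetical and are not taken from the thesis, whose actual 93-root code table is not reproduced here.

```python
from typing import Dict, Optional, Sequence, Tuple

RootCode = Tuple[str, ...]

# Illustrative decompositions only (character -> ordered structure roots).
# They are assumptions for this sketch, not entries from the thesis's code table.
SAMPLE_DECOMPOSITIONS: Dict[str, RootCode] = {
    "明": ("日", "月"),
    "好": ("女", "子"),
    "林": ("木", "木"),
    "朋": ("月", "月"),
}


def build_code_table(decompositions: Dict[str, RootCode]) -> Dict[RootCode, str]:
    """Invert character -> roots into roots -> character, rejecting duplicate codes."""
    table: Dict[RootCode, str] = {}
    for char, roots in decompositions.items():
        if roots in table:
            # Two characters with the same root sequence would make the code ambiguous.
            raise ValueError(f"{char!r} and {table[roots]!r} share the code {roots}")
        table[roots] = char
    return table


def lookup_character(table: Dict[RootCode, str],
                     recognized_roots: Sequence[str]) -> Optional[str]:
    """Return the character whose code matches the recognized roots, or None."""
    return table.get(tuple(recognized_roots))


CODE_TABLE = build_code_table(SAMPLE_DECOMPOSITIONS)
print(lookup_character(CODE_TABLE, ["日", "月"]))  # 明
print(lookup_character(CODE_TABLE, ["月", "日"]))  # None (root order matters)
```

Keying the table by an ordered tuple reflects the statement that roots are entered in order, and the duplicate check in `build_code_table` mirrors the claim that each character's code is unique.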
Keywords/Search Tags: Chinese character coding, pattern recognition, structure root classification, character recognition, structure feature