Off-line Recognition Of Chinese Handwriting: From Isolated Character To Realistic Text

Posted on: 2009-06-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T H Su
GTID: 1118360278962040
Subject: Computer application technology
Abstract/Summary:
Owing to its great application potential and its intellectual challenges, off-line recognition of handwritten Chinese characters has been studied intensively by numerous researchers. Over the last three decades, great effort has gone into reliably identifying handprinted Chinese characters, and considerable advances have been achieved in shape normalization, feature extraction, classifier design, and linguistic postprocessing. Together, these state-of-the-art results pave the way for the era of handwritten text. This thesis aims to establish a fundamental framework for the off-line recognition of Chinese handwritten text. Its contributions range from gathering essential data to defining evaluation criteria, and from enhancing traditional methods to putting forward novel strategies. As the first step, the HIT-MW database is presented to facilitate the off-line recognition of Chinese handwritten text. To support sound assessment, a series of evaluation criteria is then defined for character segmentation and text recognition. Subsequently, the recognition problem is tackled with two distinct strategies: the segmentation-based strategy and the segmentation-free one. Finally, because the two strategies show clear complementary capacities, systems that combine them are proposed.

This thesis attempts to infer future trends, which in turn shape its logical structure. The history of off-line character recognition is first systematically summarized, focusing on how the recognition unit has evolved. Reflecting further on the state of the art in Chinese character recognition, in terms of both database collection and recognition methods, the thesis concludes that Chinese handwritten text recognition will be the next trend. A new era is thus emerging, which can be termed "the era of handwritten text".
Since the new era originates from "the era of isolated character", the recognition techniques for isolated handwritten Chinese characters are surveyed, and the main achievements are reviewed under the headings of shape normalization, feature extraction, classifier design, and linguistic postprocessing.

This thesis establishes the HIT-MW database from a novel perspective. It is the first text-level database of Chinese handwriting in the field, and its success helps initiate the new era of handwritten text. The underlying texts are sampled from the China Daily Corpus; as a result, a high character coverage of 99.33% is obtained on a large corpus of about 80 million characters. The writers are carefully selected so that their distribution closely matches real-world statistics. Owing to the systematic sampling mechanism and the strict quality assurance process, the database includes not only skewed, overlapping, and touching text lines but also realistic phenomena such as miswriting and erasure. Sufficient evidence supports that the HIT-MW database can represent the whole population of Chinese handwritten text, and that recognition results obtained on it are statistically meaningful. Currently, the database is used by dozens of research groups throughout the world.

This thesis first presents basic evaluation criteria for character segmentation and text recognition. To capture the balance among deletion, insertion, and substitution errors, the recognition correct rate and the recognition accuracy rate are defined. To compare different character segmentation methods, the segmentation correct rate, the segmentation precision rate, and the segmentation bias rate are provided. With these three segmentation rates, one can assess segmentation ability on digits, punctuation marks, and Chinese characters, as well as a method's tendency toward under-segmentation or over-segmentation.
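The abstract does not state the formulas for the two recognition rates. As a minimal sketch, assuming the definitions commonly used in text recognition, correct rate = (N - D - S) / N and accuracy rate = (N - D - S - I) / N, where N is the number of reference characters and S, D, I are the substitution, deletion, and insertion counts of a minimum edit-distance alignment, the rates could be computed as follows. The function names `align_counts` and `recognition_rates` are hypothetical and do not come from the thesis.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of one minimum
    edit-distance alignment between a reference and a hypothesis string."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = (cost, subs, dels, ins) for ref[:i] versus hyp[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)  # every reference character deleted
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j)  # every hypothesis character inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c, s, d, k = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (c, s, d, k)          # match, no cost
            else:
                diag = (c + 1, s + 1, d, k)  # substitution
            c, s, d, k = dp[i - 1][j]
            dele = (c + 1, s, d + 1, k)      # deletion
            c, s, d, k = dp[i][j - 1]
            ins = (c + 1, s, d, k + 1)       # insertion
            dp[i][j] = min(diag, dele, ins, key=lambda t: t[0])
    return dp[n][m][1:]


def recognition_rates(ref, hyp):
    """Assumed definitions (not taken from the thesis):
    correct rate  = (N - D - S) / N
    accuracy rate = (N - D - S - I) / N
    """
    subs, dels, ins = align_counts(ref, hyp)
    n = len(ref)
    correct = (n - dels - subs) / n
    accuracy = (n - dels - subs - ins) / n
    return correct, accuracy
```

Under these assumed definitions, an inserted character lowers only the accuracy rate, so a system that over-generates characters keeps a high correct rate but is penalized in accuracy.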
In addition, the transcription of realistic handwritten text with the segmentation-based strategy is studied, and two crucial suggestions are given. First, the advantage of a new method is doubtful if the evidence is collected from only a single shape normalization setup; instead, methods should be compared under their own best shape normalization setups. Second, the performance of classifiers based on the modified quadratic discriminant function improves clearly after incorporating the a priori probability of each character class, and estimating this prior from the corpus rather than from the training data yields more robust results.

This thesis proposes a segmentation-free strategy to transcribe realistic handwritten Chinese text. During training, character positions are not needed. Comparisons with a segmentation-based system using the same type of features show the feasibility and potential of this strategy. An enhanced four-plane feature (en-FPF) is also proposed within the segmentation-free recognition framework. Unlike previous directional planes, the planes of en-FPF can reconstruct the original image. Experimental results show that en-FPF yields better recognition performance, and that it achieves the highest recognition rates when only one kind of feature is used. Once the fusion of en-FPF and a simple cellular feature is processed with principal component analysis and data sharing techniques, the recognition correct rate for Chinese characters exceeds 50%, even in the presence of data sparseness.

This thesis combines the segmentation-based strategy and the segmentation-free one in a serial structure and in a parallel structure, in view of their potential complementary capacities. To explore the complementary capacities of the two systems, the character matching rate (CMR) is first defined.
With the help of CMR, the complementary capacities of the two strategies are verified, even when they use the same training data and the same type of feature. Two combined systems are then constructed, one with a serial combination structure and one with a parallel combination structure. These methods broaden the scope of research on multiple classifier combination. In the serial structure, the segmentation-free system is used to estimate the initial character boundaries; after a boundary refinement process, the segmentation-based system is launched. In the parallel structure, the segmentation-free system runs simultaneously with the segmentation-based system, and the recognition confidence of the segmentation-based system then determines which system's result is delivered. Experimental results demonstrate the effectiveness of the combinations.
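The parallel structure described above can be illustrated with a small sketch. The abstract says only that the confidence of the segmentation-based system decides whose result is delivered; the fixed threshold rule, the `Result` type, and the name `parallel_combine` below are assumptions made for illustration, not the decision rule of the thesis.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical container for one text line from the segmentation-based system."""
    text: str
    confidence: float  # assumed to lie in [0, 1]


def parallel_combine(seg_based: Result, seg_free_text: str,
                     threshold: float = 0.5) -> str:
    """Deliver the segmentation-based transcription when its confidence
    reaches the (assumed) threshold; otherwise fall back to the
    segmentation-free transcription of the same text line."""
    if seg_based.confidence >= threshold:
        return seg_based.text
    return seg_free_text
```

Both systems are assumed to have processed the same text line independently before this selection step, which is what distinguishes the parallel structure from the serial one, where the segmentation-free output feeds the segmentation-based system.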
Keywords/Search Tags: handwritten text recognition, Chinese character recognition, assessment framework, segmentation-free strategy, segmentation-based strategy, multiple classifier combination