Writer-independent Unconstrained Handwritten Offline Chinese Text Line Recognition

Posted on: 2011-05-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: N X Li
GTID: 1118330332972016
Subject: Communication and Information System
Abstract/Summary:
Writer-independent, unconstrained handwritten offline Chinese text line recognition is a difficult problem in the field of handwritten character recognition. Compared with earlier handwritten character recognition tasks, it has the following characteristics: (1) The object to be recognized is an image of a Chinese text line, i.e., an image of a Chinese sentence, which usually contains more than ten and fewer than a hundred characters. A holistic recognition approach is therefore not feasible, and the boundary of each character must be delimited. (2) The text line is written freely, i.e., there is no constraint on the writing style of the writer. As a result, large variations occur not only in the shape of a character but also in the relationships between characters. (3) Text lines are written by non-specific writers, i.e., anyone may be a writer. Consequently, the writing styles of the text lines may be highly diverse, which makes writer-adaptation techniques inapplicable. These characteristics add to the difficulty of handwritten character recognition, and research on writer-independent, unconstrained handwritten offline Chinese text line recognition remains at the laboratory stage.

Unlike Western characters or digits, Chinese characters have a large number of classes and complex structures, which makes it difficult to reach recognition accuracy comparable to that achieved for Western characters or digits. This dissertation investigates the problem of writer-independent, unconstrained handwritten offline Chinese text line recognition and establishes a recognition system to address it. The main contributions are:

1. Segmentation-based text line recognition methods usually require the boundaries of each character in a text line to be determined, so that isolated character recognition can then be carried out. A novel pre-segmentation method is proposed to segment each character or radical in a text line. It handles the three common cases in writer-independent, unconstrained handwritten offline Chinese text lines: naturally separated characters, overlapping characters, and touching characters. Because it can quickly generate curved segmentation paths for a text line, the proposed method reaches relatively high segmentation accuracy at a higher speed than other pre-segmentation methods.

2. A traditional isolated character classifier is trained using only positive samples (i.e., samples of valid characters), and hence lacks the ability to determine whether an input sample is a negative sample (i.e., a sample of an invalid character). In segmentation-based text line recognition methods, pre-segmentation produces a great many negative samples, which severely interfere with text line recognition. To reduce this interference, an isolated character classifier can adopt negative training to improve its ability to recognize negative samples. Previous negative training methods for isolated Chinese character classifiers are not well suited to writer-independent, unconstrained handwritten Chinese characters. A negative training method based on Linear Discriminant Analysis (LDA) is proposed. Both positive and negative samples are fed into a traditional isolated character classifier, and an LDA transform is then performed on the output of the classifier to estimate the probability distributions of positive and negative samples. The output of the original classifier is then modified using the estimated probabilities, which achieves the negative training of the isolated character classifier. Experiments show that the proposed method performs better than other negative training methods.

3. In text line recognition, besides the recognition of isolated characters, the relationships among the characters in a text line are also important and helpful. These relationships include the geometrical layout information among character images, the linguistic context information in a sequence of character labels, etc. The accuracy of text line recognition may be improved by integrating the information of isolated characters with the information among characters. However, traditional methods of multiple information fusion in text line recognition either adopt too many verifiers, which imposes a heavy computational burden, or simplify the computation through empirical assumptions, which often deviate from the situation in real handwritten text lines. A new probabilistic model based on Bayes' theorem is proposed, which can integrate multiple kinds of information: the recognition information of isolated characters, the geometrical information of a text line, and the linguistic context information of the Chinese language. The proposed method uses only two classifiers. The first is used for isolated character recognition and outputs a posterior probability of the recognized character. The second classifies the major classes of the characters in Chinese text lines and also outputs a posterior probability. The two posterior probabilities and the probability from an n-gram language model are then multiplied. This process achieves multiple information fusion in a simple way.

Tests on a large-scale public database, the HIT-MW database, show that the established recognition system works well on writer-independent, unconstrained handwritten offline Chinese text line recognition. Using a bi-gram language model, a character-level correct recognition rate of 78.82% is achieved, outperforming the latest reported results on the same data.

Writer-independent, unconstrained handwritten offline Chinese text line recognition is a comprehensive subject that involves pattern recognition, image processing, natural language processing, etc. It is of practical value and theoretical significance to future technologies such as handwritten character recognition and artificial intelligence.
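
The fusion step in the third contribution, multiplying two classifier posteriors with an n-gram probability, can be illustrated with a small sketch. This is not the dissertation's implementation: it assumes the segmentation is already fixed, that each character position comes with a short candidate list, and that a bi-gram table is available. The function name `fuse_and_decode`, the `<s>` start symbol, the probability floor, and all numbers are hypothetical. Products are accumulated as sums of logarithms to avoid underflow, and the best label sequence is found by dynamic programming.

```python
import math

def fuse_and_decode(candidates, bigram, floor=1e-6):
    """Pick the label sequence with the highest fused score.

    candidates: one list per character position; each entry is
        (label, p_char, p_class), where p_char is the isolated-character
        posterior and p_class is the second classifier's posterior.
    bigram: dict mapping (previous_label, label) -> probability.
    floor: assumed smoothing value for unseen bi-grams and zero posteriors.
    Returns (best_labels, log_score).
    """
    # best[label] = (log score of the best path ending in label, that path)
    best = {"<s>": (0.0, [])}
    for position in candidates:
        new_best = {}
        for label, p_char, p_class in position:
            # Product of the two posteriors, taken in the log domain.
            local = math.log(max(p_char, floor)) + math.log(max(p_class, floor))
            for prev, (score, path) in best.items():
                p_lm = bigram.get((prev, label), floor)
                total = score + local + math.log(p_lm)
                if label not in new_best or total > new_best[label][0]:
                    new_best[label] = (total, path + [label])
        best = new_best
    score, path = max(best.values(), key=lambda item: item[0])
    return path, score

# Hypothetical two-character line: the isolated classifier slightly prefers
# the wrong first character, and the bi-gram context corrects it.
candidates = [
    [("大", 0.45, 0.9), ("太", 0.55, 0.9)],
    [("学", 0.70, 0.8), ("字", 0.30, 0.8)],
]
bigram = {
    ("<s>", "大"): 0.5, ("<s>", "太"): 0.5,
    ("大", "学"): 0.6, ("大", "字"): 0.1,
    ("太", "学"): 0.05, ("太", "字"): 0.05,
}
path, score = fuse_and_decode(candidates, bigram)
print(path, math.exp(score))
```

In this toy input the first position's top isolated-character candidate is overridden because the bi-gram term favors the other sequence, which is the kind of effect linguistic context is meant to provide. A full system would also have to search over alternative segmentations, which this sketch leaves out.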
Keywords/Search Tags: Handwritten character recognition, Chinese text line recognition, pre-segmentation, negative training, multiple information fusion