Research Of Layout Analysis On Complex Chinese Document Images

Posted on: 2011-10-17  Degree: Master  Type: Thesis
Country: China  Candidate: X Dang  Full Text: PDF
GTID: 2178360305976533  Subject: Computer application technology
Abstract/Summary:
Optical character recognition (OCR) provides a faster and easier way to input text automatically and is widely used in online databases and digital libraries. Layout analysis is the first step of automated OCR, and its accuracy directly affects the semantic and logical relations in the output. Among the many kinds of document layouts, Chinese documents, with their diverse backgrounds and complicated layouts, are harder to analyze than documents in alphabetic languages. The study of layout analysis therefore has both theoretical significance and practical value. Existing algorithms for skew detection, page segmentation, and plain-text layout analysis are vulnerable to complex layout structure. To address this, we carried out extensive experiments and obtained a series of results, summarized as follows:

1. Existing nearest-neighbor algorithms for skew detection have low precision because the selected nearest component may be wrong. We propose an improved k-nearest-neighbor chain algorithm that takes into account whether a pair of similar components lies in the same row or column. It avoids interference from mistaken nearest-neighbor chains and so improves the accuracy of the estimated skew angle.

2. Traditional run-length smoothing algorithms (RLSA) are sensitive to their thresholds. We propose a new constrained run-length smoothing algorithm that selects components according to between-region and within-region distances. The new algorithm removes the dependence on character size and spacing and improves page segmentation for documents with a single background.

3. For page segmentation under complex backgrounds, we use an improved color-to-gray algorithm and a dynamic clustering algorithm based on edge detection to resolve the trade-off between running time and accuracy. Experiments show that the new method speeds up page segmentation without reducing its accuracy, because it avoids the loss of color information and segments only the edge image.

4. Most algorithms for document layout analysis are sensitive to their parameters and have limited applicability. To make up for these deficiencies, we present an SVM-based region-formation algorithm for analyzing Chinese documents. Seed connected components are selected as the first feature for training and are used to form regions; the reading order is then determined with a projection method. Extensive experiments show that the proposed algorithm analyzes different kinds of document layouts more effectively than other methods.
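
For readers unfamiliar with the traditional run-length smoothing algorithm that the thesis sets out to improve, the following is a minimal sketch of the classic baseline only, not the constrained variant proposed in the thesis (the abstract does not specify it). It assumes a binary image stored as a list of rows with 1 for foreground (ink) and 0 for background. The function names, the thresholds, and the choice to fill gaps of length less than or equal to the threshold are illustrative assumptions.

```python
def smooth_row(row, threshold):
    """Fill background runs (0s) of length <= threshold that lie
    between two foreground pixels (1s). Returns a new list."""
    out = list(row)
    last_fg = None  # index of the most recent foreground pixel seen
    for i, pixel in enumerate(row):
        if pixel == 1:
            if last_fg is not None:
                gap = i - last_fg - 1
                if 0 < gap <= threshold:
                    for j in range(last_fg + 1, i):
                        out[j] = 1
            last_fg = i
    return out


def rlsa_horizontal(image, threshold):
    """Apply run-length smoothing to every row."""
    return [smooth_row(row, threshold) for row in image]


def rlsa_vertical(image, threshold):
    """Apply run-length smoothing to every column (via transpose)."""
    columns = [list(col) for col in zip(*image)]
    smoothed = [smooth_row(col, threshold) for col in columns]
    return [list(row) for row in zip(*smoothed)]


def rlsa(image, h_threshold, v_threshold):
    """Classic two-pass RLSA: smooth horizontally and vertically,
    then keep only pixels set in both results (logical AND)."""
    horizontal = rlsa_horizontal(image, h_threshold)
    vertical = rlsa_vertical(image, v_threshold)
    return [[a & b for a, b in zip(h_row, v_row)]
            for h_row, v_row in zip(horizontal, vertical)]


if __name__ == "__main__":
    page = [
        [1, 0, 0, 1, 0, 0, 0, 0, 1],
        [1, 0, 0, 1, 0, 0, 0, 0, 1],
    ]
    for row in rlsa_horizontal(page, threshold=2):
        print(row)
```

The fixed thresholds are what make this baseline sensitive to character size and spacing: a value that merges the characters of one text line at one font size may merge separate columns at another, which is the weakness item 2 above addresses.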
Keywords/Search Tags: Optical Character Recognition, skew detection, page segmentation, plain text layout analysis