Mathematical Formula Feature Extraction And Locating In Chinese Scanned Printed Document

Posted on:2011-10-11

Degree:Master

Type:Thesis

Country:China

Candidate:Z F Guo

GTID:2178360305977850

Subject:Pattern Recognition and Intelligent Systems

Abstract/Summary:

Scientific and technical literature plays a vital role in human civilization and in the development of science and technology. In today's digital era, people want to use this literature more conveniently and efficiently, so documents are scanned and stored on computers as images. This creates several problems: the images occupy a large amount of storage, transmit slowly over a network, and the formulas and tables they contain cannot be reused. Current OCR technology cannot fully solve these problems. In particular, it does not recognize mathematical formulas in document images, even though formulas are among the most important elements of scientific and technical documents. Researchers have therefore begun to study how to recognize mathematical formulas in document images automatically, a problem regarded worldwide as difficult.
This paper mainly studies mathematical formula locating in printed Chinese documents. Building on previous research, I implement a mathematical formula feature extraction system, and use feature data drawn from the feature database together with the Parzen window algorithm to determine the locating rate for isolated mathematical formulas in document images and the corresponding exact window width.
I first used a scanner to produce 200 experimental images. Because of factors such as the print quality of the scanned documents, paper quality, scanner resolution, and scanning mistakes, the images must be preprocessed before use. After cropping, tilt correction, conversion to 256-level grayscale, binarization, and noise removal, the images can be used.
The system extracts seven features for each line of a document image: line height (HL), upper spacing (AS), lower spacing (BS), left indent (LI), right indent (RI), the distance between a formula and its corresponding serial number (LD), and line density (DE). It imports these features and their related information into a database to build an image feature database. The database holds 4,963 records, corresponding to the 4,963 lines of the 200 experimental images, with 23 fields, including: record serial number (Serial Number), image serial number (Image Number), line number (Line Number), top position of the line (Line Top), bottom position of the line (Line Bottom), classification (Formula or Text), line height (h, HL), line length (l), average character height within the line (h0), upper spacing (as, AS), lower spacing (bs, BS), left indent (left indent, LI), right indent (right indent, RI), distance between the formula and its corresponding serial number (large distance, LD), number of black pixels (NBP), line density (DE), and classification result (Recognition Result, 'F' or 'T').
On this basis, the data in the feature database are divided into four categories: isolated formula, embedded formula, text, and others. Of the 4,963 records, 4,959 are valid: 704 pure text lines, 1,594 isolated formula lines, 2,410 embedded formula lines, and 251 other lines. Six of the seven features (HL, AS, BS, LI, RI, DE) are then used to form the feature vectors of the isolated formula, embedded formula, and text classes. One-tenth of the isolated formula data is taken as the training set for the isolated formula Parzen window, and the same number of samples is taken as the training set for each of the embedded formula and text Parzen windows.
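As a rough illustration of the per-line features named above, the following Python sketch computes HL, AS, BS, LI, RI, NBP, and DE from a binarized image. It is not the system described in the abstract: the function name `extract_line_features`, the row-projection line segmentation, and the definition of density as black pixels divided by line height times line length are all assumptions made for this example. LD is omitted because the abstract does not say how the formula's serial number is detected.

```python
def extract_line_features(image):
    """image: list of rows, each a list of 0/1 ints (1 = black pixel).

    Hypothetical helper: lines are found as runs of rows containing ink.
    """
    height = len(image)
    width = len(image[0]) if image else 0
    row_sums = [sum(row) for row in image]

    # Group consecutive non-empty rows into lines (top inclusive, bottom exclusive).
    lines, start = [], None
    for y, count in enumerate(row_sums):
        if count > 0 and start is None:
            start = y
        elif count == 0 and start is not None:
            lines.append((start, y))
            start = None
    if start is not None:
        lines.append((start, height))

    features = []
    for i, (top, bottom) in enumerate(lines):
        # Column indices of every black pixel in this line.
        cols = [x for y in range(top, bottom)
                for x, v in enumerate(image[y]) if v]
        left, right = min(cols), max(cols)
        hl = bottom - top                      # line height
        length = right - left + 1              # line length
        nbp = sum(row_sums[top:bottom])        # black pixel count
        prev_bottom = lines[i - 1][1] if i > 0 else 0
        next_top = lines[i + 1][0] if i + 1 < len(lines) else height
        features.append({
            "HL": hl,
            "AS": top - prev_bottom,           # gap to the line above
            "BS": next_top - bottom,           # gap to the line below
            "LI": left,                        # left indent
            "RI": width - 1 - right,           # right indent
            "NBP": nbp,
            "DE": nbp / (hl * length),         # assumed density definition
        })
    return features


# Tiny synthetic page: one full-width line and one short, centered line.
demo = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
feats = extract_line_features(demo)
```

In this toy page the second line has larger indents on both sides than the first, which is the kind of cue the LI and RI features are meant to capture for isolated formulas.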
The remaining nine-tenths of the isolated formula data are then used as the validation set for the three class Parzen windows. These samples are fed into the Parzen window algorithm to obtain the class-conditional probability density for each class, and the Bayesian decision rule for minimum error rate assigns each validation feature vector to a class, which yields the isolated formula locating rate.
The window width of the Parzen window is a very important value and has a great influence on the final locating rate. Since the window width h is a real number greater than zero, my approach is to set a small initial value and a larger termination value for it, then traverse the interval with a small step to find the best window width. The experimental results show that this method is effective.
Compared with the previous literature, this paper uses the same method but not only obtains a higher isolated formula locating rate, it also determines the corresponding exact window width of the Parzen window, which has not been reported in previous papers. This is the innovation of this paper, and it lays a solid foundation for further study.
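The classification and window width search described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the abstract does not name the kernel, so an isotropic Gaussian kernel is assumed; the function names, the equal priors, and the 2-D toy data are hypothetical and stand in for the six-dimensional feature vectors (HL, AS, BS, LI, RI, DE).

```python
import math


def parzen_density(x, samples, h):
    """Parzen estimate of p(x) using an assumed Gaussian kernel of width h."""
    d = len(x)
    norm = (math.sqrt(2.0 * math.pi) * h) ** d
    total = 0.0
    for s in samples:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, s))
        total += math.exp(-sq_dist / (2.0 * h * h))
    return total / (len(samples) * norm)


def classify(x, train, priors, h):
    """Minimum-error Bayes rule: choose the class maximizing prior * density."""
    return max(train, key=lambda c: priors[c] * parzen_density(x, train[c], h))


def sweep_window_width(train, priors, validation, h_start, h_stop, step):
    """Try every h on a regular grid; return (best_h, best_rate)."""
    best_h, best_rate = None, -1.0
    n_steps = int(round((h_stop - h_start) / step))
    for k in range(n_steps + 1):
        h = h_start + k * step
        correct = sum(classify(x, train, priors, h) == label
                      for x, label in validation)
        rate = correct / len(validation)
        if rate > best_rate:               # keep the first h reaching the best rate
            best_h, best_rate = h, rate
    return best_h, best_rate


# Toy data (hypothetical): three well-separated classes in 2-D.
train = {
    "isolated": [(0.0, 0.0), (0.2, 0.1)],
    "embedded": [(3.0, 3.0), (3.2, 2.9)],
    "text": [(6.0, 0.0), (6.1, 0.2)],
}
priors = {c: 1.0 / 3.0 for c in train}
# As in the abstract, the validation set holds isolated formula samples only.
validation = [((0.1, 0.0), "isolated"), ((0.3, 0.2), "isolated")]

best_h, best_rate = sweep_window_width(train, priors, validation, 0.1, 1.0, 0.1)
```

Because the locating rate is evaluated on a grid, the reported width is only as precise as the step size, so a smaller step trades running time for a finer estimate of h.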

Keywords/Search Tags:

feature extraction, Parzen window, window width, formula locating, isolated formula

Related items

1	Mathematical Formula Extraction In Printed-Chinese Documents Based On EEN Feature Function
2	Identification Of Component And Its Type Under Different Attitude Based On SVM And Parzen Window
3	The Research On Formula Extraction In Digital Image
4	The Study Of Mathematical Formula Extraction With The Script Identification
5	Research On Technology Of Optical Formula Recognition
6	Research On Image Segmentation Based On Parzen Window And Q-learning
7	Contributions To Classification And Clustering Methods Based On Parzen Window Density Estimation
8	Research On The Mathematical Formula Recognition Technology For Printed Document
9	Based On Parzen Window Estimate Dynamic Random Renyi's Entropy Of The System Output Distribution Control Study
10	Formulas Extraction And Symbols Location In Printing Mathematic Expressions Recognition