
The Research On Formula Extraction In Digital Image

Posted on: 2015-09-23
Degree: Master
Type: Thesis
Country: China
Candidate: Q Yang
Full Text: PDF
GTID: 2308330464470440
Subject: Mechanical engineering
Abstract/Summary:
The application of text recognition to digital images is becoming increasingly widespread. Although recognition of pure text is highly developed, recognizing documents that contain mathematical formulas remains a challenge. Formula extraction is a key step in formula recognition: a formula can be recognized only after it has been separated from the surrounding text, and the same holds for the document as a whole.

This thesis presents an automatic formula extraction method in which every extraction step is carried out by computer without human assistance. Formulas are extracted directly from the imported original image with an overall accuracy of around 80%.

First, the original image is preprocessed by binarization, image enhancement and image segmentation. The resulting image occupies less storage space, is free of noise, and shows formula features more clearly. The document image is divided into independent textline images. This preprocessing significantly increases the accuracy and efficiency of formula extraction.

Second, a formula extraction method based on a "black connected components neighbor graph" is proposed for text images that contain only displayed expressions. The method separates pure textlines from displayed expressions using only the features of the graph's nodes and edges, without relying on any recognition result. Experiments show that the two classes are widely separated under this method and that the formula extraction accuracy exceeds 80%.

Third, for the more common case of text images that also contain embedded formulas, a second stage of intensified formula extraction is built on top of the first, graph-based stage. The first stage distinguishes pure textlines from textlines that contain formulas. The second stage then obtains the formulas within those textlines (both displayed and embedded) from the recognition results for special symbols combined with a lexical method.

Experiments are carried out to verify the accuracy of formula extraction. The results show that the method achieves accuracies of at least 80% and 75% for displayed and embedded equations, respectively.

Finally, the author summarizes the achievements and shortcomings of this work and outlines directions for further research.
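
The first-stage pipeline described above (binarization, textline segmentation, then a graph over black connected components) can be illustrated with a minimal Python sketch. It is not the author's implementation: the abstract does not state which node and edge features are used, so the fixed threshold, the row-projection line splitter, the left-to-right neighbor rule, and the two edge features (horizontal gap and vertical center offset) are all assumptions chosen for illustration, as are the function names.

```python
from collections import deque


def binarize(gray, threshold=128):
    """Map a grayscale grid (0-255) to 1 for ink (dark) and 0 for background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def split_textlines(binary):
    """Return (top, bottom) row ranges, bottom exclusive, separated by blank rows."""
    lines, start = [], None
    for y, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            lines.append((start, y))
            start = None
    if start is not None:
        lines.append((start, len(binary)))
    return lines


def connected_components(binary, top, bottom):
    """Bounding boxes (x0, y0, x1, y1) of 8-connected ink blobs in rows [top, bottom)."""
    width = len(binary[0])
    seen, boxes = set(), []
    for y in range(top, bottom):
        for x in range(width):
            if not binary[y][x] or (x, y) in seen:
                continue
            queue = deque([(x, y)])
            seen.add((x, y))
            x0 = x1 = x
            y0 = y1 = y
            while queue:  # breadth-first flood fill of one component
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and top <= ny < bottom
                                and binary[ny][nx] and (nx, ny) not in seen):
                            seen.add((nx, ny))
                            queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return sorted(boxes)  # left-to-right order


def neighbor_graph(boxes):
    """Link each component to the next one on its right.

    Each edge is (i, j, horizontal_gap, vertical_center_offset); both
    features are illustrative assumptions, not taken from the thesis.
    """
    edges = []
    for i in range(len(boxes) - 1):
        a, b = boxes[i], boxes[i + 1]
        gap = b[0] - a[2] - 1
        offset = (b[1] + b[3]) / 2 - (a[1] + a[3]) / 2
        edges.append((i, i + 1, gap, offset))
    return edges


# Toy page: "#" is ink, "." is background; two textlines separated by a blank row.
page = [
    ".........",
    "##..#....",
    "##..#....",
    ".........",
    ".#....##.",
    ".#.#.....",
    ".........",
]
gray = [[0 if ch == "#" else 255 for ch in row] for row in page]
binary = binarize(gray)
for top, bottom in split_textlines(binary):
    boxes = connected_components(binary, top, bottom)
    print((top, bottom), boxes, neighbor_graph(boxes))
```

In this toy page the components of the first textline share a common vertical center, while those of the second are vertically displaced relative to their neighbors, as subscripts and superscripts would be. Edge features of this kind are one plausible way to separate textlines from expressions without running character recognition.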
Keywords/Search Tags: Formula extraction, Black connected components neighbor graph, Preprocessing, Feature extraction