
Research On Key Issues Of Printed Mathematical Formula Recognition

Posted on: 2018-06-03
Degree: Master
Type: Thesis
Country: China
Candidate: T W San
Full Text: PDF
GTID: 2348330515992366
Subject: Engineering
Abstract/Summary:
Printed mathematical formula recognition refers to identifying mathematical formulas in printed document images so that the formulas can be reused. At present, mainstream OCR systems achieve high accuracy on ordinary text but still need improvement on mathematical formulas. The main factors affecting recognition accuracy are the variety of touching characters caused by the complex two-dimensional structure of formulas and the diversity of symbol fonts and sizes. More effective methods for segmenting touching characters and for recognizing symbols are needed to improve overall recognition quality.

First, the thesis discusses the composition of a mathematical formula recognition system, its key technologies, and the difficulties of recognition. Second, it studies the segmentation of mathematical symbols in depth and proposes a segmentation algorithm based on optimal path assessment for italic symbols. The method finds the possible cutting positions, applies a two-step comprehensive evaluation of the cutting paths using characteristic factors, selects a reasonable cutting line, and constructs the corresponding combined polyline for each touching direction to complete the final segmentation. This improves the accuracy of cutting path prediction and effectively avoids breaking normal strokes.

In addition, a convolutional neural network from deep learning theory is constructed to recognize the formula symbols, and its network parameters are optimized by experiment. The network has two convolution layers alternating with sampling (pooling) layers. The convolution kernels and sampling windows are 5 × 5 and 2 × 2 respectively. The rectified linear unit (ReLU) is selected as the activation function, and Dropout is used in the network. Through extensive training, the network learns to extract and classify the features of mathematical symbol patterns. This addresses the inadequacy of manually extracted features in existing methods and improves the accuracy of mathematical symbol recognition.

To verify the validity of the method, the thesis implements the segmentation algorithm based on optimal path assessment in C++ with OpenCV in the Visual Studio 2010 environment. The Keras framework is used to construct the convolutional neural network for formula recognition. In the experimental comparison, the accuracy of the proposed segmentation method reaches 90.14%, and the accuracy of segmenting horizontally touching symbols reaches 85.60%. In the recognition test, the neural network is used to recognize a common formula character set, and the recognition rate is 97.46%. The formula recognition rate before and after segmentation is 88.24% and 97.29% respectively.
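To make the layer arithmetic of the network described above concrete, the following sketch traces feature-map sizes and convolution parameter counts through two 5 × 5 convolution layers, each followed by a 2 × 2 sampling layer. It is only an illustration under stated assumptions, not the thesis's actual model: the 32 × 32 grayscale input, the filter counts (6 and 16), unpadded convolutions with stride 1, and the helper names `conv_out`, `pool_out`, and `trace_network` are all hypothetical. The abstract does not give these values. The Dropout and classification layers are omitted because they do not change the feature-map sizes traced here.

```python
def conv_out(size, kernel, stride=1):
    """Output width/height of an unpadded convolution."""
    return (size - kernel) // stride + 1


def pool_out(size, window):
    """Output width/height of non-overlapping pooling (stride == window)."""
    return size // window


def trace_network(input_size, in_channels, filters, kernel=5, window=2):
    """Trace (layer name, spatial size, channels, trainable params) per layer.

    Each entry in `filters` adds one convolution layer (with ReLU) followed
    by one sampling layer. Filter counts are assumed, not from the thesis.
    """
    layers = []
    size, channels = input_size, in_channels
    for n_filters in filters:
        # Weights: kernel * kernel * input channels per filter, plus one bias each.
        params = kernel * kernel * channels * n_filters + n_filters
        size = conv_out(size, kernel)
        layers.append(("conv+relu", size, n_filters, params))
        size = pool_out(size, window)
        layers.append(("pool", size, n_filters, 0))  # pooling has no weights
        channels = n_filters
    return layers


if __name__ == "__main__":
    # Assumed input: a 32 x 32 single-channel symbol image.
    for name, size, ch, params in trace_network(32, 1, (6, 16)):
        print(f"{name:10s} {size}x{size}x{ch}  params={params}")
```

Under these assumptions the spatial size shrinks from 32 to 28, 14, 10, and finally 5, so the classifier that follows would receive a 5 × 5 × 16 feature map. A different input size or padding choice would change these numbers.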
Keywords/Search Tags:Mathematical formula recognition, Merged character, Formula segmentation, Convolutional neural network, Deep learning