Recognition And Index System Of Math Formula Based On Deep Learning

Posted on: 2021-02-27
Degree: Master
Type: Thesis
Country: China
Candidate: Y C Zhan
GTID: 2428330632462663
Subject: Information and Communication Engineering
Abstract/Summary:
With the continuous development of computer technology, many scientific documents are now published and stored electronically. These documents contain many formulas that traditional OCR methods cannot recognize. Online education applications, which have developed rapidly in recent years, also need to reuse and retrieve large numbers of formulas. To be reused and retrieved, formulas must first be converted into text data. However, because mathematical formulas are two-dimensional in structure and complex, recognizing and retrieving them remains a difficult problem. A mathematical formula recognition and retrieval system is therefore of great significance for the reuse of literature resources, for online education, and even for the prevention of academic misconduct.

The mathematical formula recognition and retrieval system implemented in this thesis consists of three parts: formula detection, formula recognition and formula retrieval. The thesis makes three main contributions. First, since existing data sets for formula detection and formula recognition are too small to support deep learning, the thesis uses regular expressions and image processing methods to generate the training data set automatically. Second, for formula detection, the thesis compares a variety of detection networks, then selects and optimizes the Faster-RCNN model based on the characteristics of mathematical formulas; the resulting detector finds the formulas in a document with an accuracy above 90%. Third, for formula recognition, the thesis proposes an end-to-end method based on deep learning that uses a convolutional neural network to extract image features and an attention-based Seq2Seq model to translate images into LaTeX text. The method achieves good results on formula data extracted from the KDD Cup dataset: when the image rendered from the recognized text is compared with the original image, the exact match rate exceeds 70%.

Based on the completed formula detection and formula recognition modules, the thesis designs a storage format for formulas and documents and builds a formula recognition and retrieval system on the open source Elasticsearch retrieval engine. The system achieves good results on the evaluation metrics of the retrieval engine.
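The image-based exact match rate mentioned above can be illustrated with a minimal sketch. The abstract does not say how the rendered and original images are compared, so everything here is an assumption: images are taken to be already binarized into rectangular nested lists (0 for background, 1 for ink), surrounding whitespace is trimmed, and two images count as a match only if the remaining pixels are identical. The function names `crop_to_ink`, `images_match` and `exact_match_rate` are hypothetical.

```python
from typing import Iterable, List, Tuple

# Hypothetical image format: rectangular nested lists, 0 = background, 1 = ink.
Image = List[List[int]]


def crop_to_ink(img: Image) -> Image:
    """Trim all-background rows and columns surrounding the ink."""
    rows = [r for r, row in enumerate(img) if any(row)]
    if not rows:
        return []  # blank image: nothing to keep
    cols = [c for c in range(len(img[0])) if any(row[c] for row in img)]
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in img[top:bottom + 1]]


def images_match(rendered: Image, original: Image) -> bool:
    """Assumed criterion: pixel-identical after trimming whitespace."""
    return crop_to_ink(rendered) == crop_to_ink(original)


def exact_match_rate(pairs: Iterable[Tuple[Image, Image]]) -> float:
    """Fraction of (rendered, original) pairs that match exactly."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(images_match(r, o) for r, o in pairs) / len(pairs)
```

Comparing rendered images rather than LaTeX strings means that two different LaTeX sources producing the same picture are counted as equal, which is one plausible reason to evaluate in image space.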
Keywords/Search Tags: deep learning, OCR, formula detection, formula recognition, retrieval engine