
Research And Implementation Of Knowledge Answering System In Insurance Field

Posted on: 2020-07-30 | Degree: Master | Type: Thesis
Country: China | Candidate: Q X Hu | Full Text: PDF
GTID: 2428330572472314 | Subject: Software engineering
Abstract/Summary:
At present, some books and publications are stored as images. To make this material easier to reuse and retrieve, the text in the images has to be converted into text data. Many books and publications, however, contain mathematical formulas, and the structural complexity of formulas prevents them from being converted into text completely and accurately. In image documents, formulas are usually surrounded by natural language and are hard to locate. They also differ from ordinary characters: their structure is irregular, so traditional character recognition methods alone cannot recognize them. A system that can quickly and accurately identify mathematical formulas in image documents is therefore of practical value.

This thesis studies in depth the formula localization technology used in a formula recognition system and locates the formulas in an image file with a deep neural network. After the advantages and disadvantages of several neural network models are compared, the Faster R-CNN model is selected as the most suitable for this work and is then improved. The model treats each formula in the image document as a whole and processes displayed formulas (formulas on their own line) and inline formulas (formulas embedded in a line of text) in the same way. Experiments on formula localization in different language contexts show that the model achieves better localization results.

Building on these results and on current research in mathematical formula recognition, the thesis implements a formula recognition system based on the open source framework Tesseract, using the Java and Python programming languages. After receiving a picture file from the user, the system performs five steps on it in turn: image preprocessing, formula location, formula recognition, character recognition, and result reorganization. It then returns the recognition result to the user.

In the formula recognition function, the system treats a formula in the image as a special kind of text. It segments the formula into individual glyphs, builds a classifier on top of traditional character recognition, recognizes the segmented glyphs, records the positional relationships between them, and merges the recognized glyphs according to a specific algorithm to obtain the recognition result for the formula. In the character recognition function, the system uses the existing functions of Tesseract to train a language set specific to the formula recognition domain and uses it to recognize the text in the image. Once the formula and text results are available, the result reorganization function recombines them according to their original location information to produce the final recognition result for the input picture.

For the test set, 200 pictures were randomly extracted from middle school final mathematics examination papers and from scientific and technical articles on HowNet. The system was tested on the 350 formulas contained in these 200 pictures, and satisfactory results were obtained.
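The result reorganization step described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the `Region` type, the `reorganize` function, the `line_tolerance` parameter, and the choice to wrap formulas in `$...$` are all assumptions made for illustration. The sketch assumes each recognized text or formula fragment carries the bounding box it was found at, groups fragments into lines by vertical centre, and orders each line from left to right.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    """A recognized fragment with its bounding box (hypothetical type)."""
    x: int        # left edge in pixels
    y: int        # top edge in pixels
    w: int        # width in pixels
    h: int        # height in pixels
    content: str  # recognized text or formula string
    kind: str     # "text" or "formula"


def reorganize(regions: List[Region], line_tolerance: float = 0.5) -> str:
    """Merge text and formula results back into reading order.

    Fragments whose vertical centres differ by at most
    line_tolerance * (the taller of the two heights) are placed on one line.
    """
    def centre_y(r: Region) -> float:
        return r.y + r.h / 2

    lines: List[List[Region]] = []
    for region in sorted(regions, key=centre_y):
        for line in lines:
            ref = line[0]
            limit = line_tolerance * max(region.h, ref.h)
            if abs(centre_y(region) - centre_y(ref)) <= limit:
                line.append(region)
                break
        else:
            # No existing line is close enough, so start a new one.
            lines.append([region])

    rendered = []
    for line in lines:
        line.sort(key=lambda r: r.x)  # left to right within a line
        parts = [
            f"${r.content}$" if r.kind == "formula" else r.content
            for r in line
        ]
        rendered.append(" ".join(parts))
    return "\n".join(rendered)


sample = [
    Region(180, 10, 80, 20, "for x.", "text"),
    Region(0, 10, 100, 20, "Solve", "text"),
    Region(110, 5, 60, 30, "x^2 = 4", "formula"),
]
print(reorganize(sample))
```

Grouping by vertical centre rather than by top edge lets an inline formula that is taller than the surrounding text stay on the same line as that text.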
Keywords/Search Tags: Deep learning, Faster R-CNN, Formula location, OCR, Formula recognition