A Research And Implementation Of Mathematical Papers' Layout Segmentation Algorithm

Posted on: 2021-04-17
Degree: Master
Type: Thesis
Country: China
Candidate: L B Guo
Full Text: PDF
GTID: 2428330623968573
Subject: Engineering
Abstract/Summary:
In recent years, the application of artificial intelligence technology in education has attracted growing attention. The further development of intelligent education cannot be separated from OCR technology. As an essential pre-processing step in an OCR system, layout segmentation of document images is widely used in tasks such as digitizing documents and reviewing students' assignments, so it has considerable significance and practical value. However, no layout segmentation algorithm is applicable to all layouts, because layouts in each field have their own characteristics. This paper therefore takes mathematical papers as the research object and aims to propose a stable and feasible method for their layout segmentation. The main research contents are as follows.

In view of the complicated layout structure of mathematical papers, this paper proposes a hybrid-strategy layout segmentation algorithm that combines bottom-up clustering of document components with top-down segmentation based on the maximum white rectangle and key information identification. First, Faster R-CNN is used to detect the document components (text lines, graphics, and brackets for the answers) in the document image. At the same time, the document components other than the graphics are clustered into complete text lines. The graphics and text lines are then clustered sequentially to obtain the typographical layout areas. Next, the maximum white rectangle algorithm is applied within each typographical layout to detect dividing lines, and a more precise typographical layout area is segmented according to the Y-axis relationship of the dividing lines. Then, the attributes of the text lines in each typographical layout are marked in order from top to bottom, and the title text areas are segmented according to these text line attributes. Finally, each graphic is matched to its corresponding title text area according to the positional relationship between the graphics and the title text areas.

Because complex Chinese and English text images mixed with mathematical formulas are difficult to recognize, this paper also proposes a fine-grained layout segmentation algorithm based on structural formula analysis. First, an improved Faster R-CNN is used to segment the structural parts of the mathematical formulas in the fine-grained layout. Then, the category of each character (Chinese or non-Chinese) in the remaining area is marked. Finally, each formula structure is taken as the main body and the non-Chinese characters on both sides are added to it to segment the complete mathematical formula, so that the two kinds of resulting regions can be recognized by two independent recognition engines.

In conclusion, a document image segmentation system based on the above methods was implemented, and detailed experiments were carried out on test data consisting of 100 mathematical test papers. The results show that the coarse-grained and fine-grained layout segmentation methods proposed in this paper can segment the title areas and structural formula areas from mathematical papers accurately and efficiently, which lays a solid foundation for further research.
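The last fine-grained step, growing each detected formula structure outward over its neighbouring non-Chinese characters, can be illustrated with a minimal one-dimensional sketch. This is not the thesis's implementation: the names `Char` and `extend_formula` are hypothetical, the characters are assumed to lie on a single text line with known horizontal extents, and the stopping rule (stop at the first Chinese character on each side) is an assumption, since the abstract does not state the exact criterion.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """Hypothetical character box on one text line (horizontal extent only)."""
    x0: int           # left edge
    x1: int           # right edge
    is_chinese: bool  # label from the character-marking step


def extend_formula(chars, core_x0, core_x1):
    """Grow the structural span [core_x0, core_x1] over adjacent
    non-Chinese characters, stopping at the first Chinese character
    on each side (assumed rule)."""
    ordered = sorted(chars, key=lambda c: c.x0)
    left = [c for c in ordered if c.x1 <= core_x0]    # characters before the core
    right = [c for c in ordered if c.x0 >= core_x1]   # characters after the core

    x0, x1 = core_x0, core_x1
    for c in reversed(left):      # walk leftwards from the core
        if c.is_chinese:
            break
        x0 = c.x0
    for c in right:               # walk rightwards from the core
        if c.is_chinese:
            break
        x1 = c.x1
    return x0, x1


# Example line: [Chinese] x = <fraction structure> + 1 [Chinese]
line = [
    Char(0, 10, True),
    Char(10, 15, False),   # "x"
    Char(15, 20, False),   # "="
    Char(40, 45, False),   # "+"
    Char(45, 50, False),   # "1"
    Char(50, 60, True),
]
formula_span = extend_formula(line, core_x0=20, core_x1=40)
print(formula_span)  # (10, 50)
```

Under these assumptions the span from 10 to 50 would go to a formula recognition engine and the rest of the line to a text recognition engine. A real system would also need to handle vertical extents and spacing between characters.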
Keywords/Search Tags: mathematical papers, layout segmentation, Faster R-CNN, fine-grained layout segmentation