
The Improved Algorithm For Identifying Mathematical Formulas In The Images Of PDF Documents

Posted on: 2017-01-12
Degree: Master
Type: Thesis
Country: China
Candidate: C Liu
Full Text: PDF
GTID: 2348330503981199
Subject: Computer Science and Technology
Abstract/Summary:
Mathematical formula identification is an important part of mathematical formula recognition for printed documents and the foundation of mathematical expression retrieval. PDF documents are an important carrier of mathematical formula information. Because the quality and acquisition parameters of the printed document images in a PDF file are unknown, they affect the result of formula identification. It is therefore necessary to improve the adaptability of the formula identification algorithm.

First, this dissertation designs a method for identifying mathematical formulas in English PDF document images, consisting of five steps: extracting images from PDF files, preprocessing, determining the column layout, extracting formula character blocks, and merging formula character blocks.

Second, by analyzing and summarizing the characteristics of document images in PDF files, the features of mathematical formulas, and their effects on formula identification, this dissertation designs a parameter adjustment algorithm for the factors that affect identification performance. Each step of the identification algorithm is analyzed to determine the affected parameters and thresholds. Based on the adaptively estimated character size, special rules dynamically adjust these parameters and thresholds to reduce the impact of factors such as noise, columns, images, tables, resolution, and the two-dimensional structure of formulas.

To address two problems, incomplete coverage of formulas and formula symbols being mistaken for English words, the dissertation designs a method that corrects misidentified character blocks by recognizing and judging whether formulas contain bound symbols and binary operators.
The experimental results show that the designed adaptive improvements increase the adaptability of the formula identification algorithm to changes in image quality and layout.
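
The idea of tying thresholds to character size and then checking blocks for operators can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the function names, the ratio values, the symbol sets, the use of the median as the character-size estimate, and the choice to accept a block when either kind of symbol is present are all assumptions made for illustration, since the abstract does not specify them.

```python
from statistics import median

# Hypothetical symbol sets; the abstract does not list the symbols it checks.
BINARY_OPERATORS = {"+", "-", "=", "<", ">", "×", "÷", "±", "≤", "≥"}
BOUND_SYMBOLS = {"∑", "∏", "∫", "(", ")", "[", "]", "{", "}"}


def estimate_char_size(box_heights):
    """Estimate the dominant character height in pixels as the median box height.

    Using the median is an assumption; it keeps specks of noise and tall
    formula symbols from skewing the estimate.
    """
    if not box_heights:
        raise ValueError("need at least one bounding box height")
    return median(box_heights)


def scaled_thresholds(char_size, noise_ratio=0.2, merge_gap_ratio=1.5):
    """Derive pixel thresholds from the character size instead of fixed constants.

    Both ratios are illustrative placeholders, not values from the dissertation.
    """
    return {
        # Components shorter than this are treated as noise.
        "min_component_height": noise_ratio * char_size,
        # Character blocks closer than this are candidates for merging.
        "max_merge_gap": merge_gap_ratio * char_size,
    }


def looks_like_formula(symbols):
    """Return True if the block contains a binary operator or a bound symbol."""
    return any(s in BINARY_OPERATORS or s in BOUND_SYMBOLS for s in symbols)
```

Because every threshold is a multiple of the estimated character size, the same rules apply unchanged to a low-resolution and a high-resolution scan of the same page. For example, `scaled_thresholds(estimate_char_size(heights))` can be recomputed per page, and `looks_like_formula` can be applied to blocks that were initially classified as ordinary words.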
Keywords/Search Tags: PDF Document, Printed Document Image, Mathematical Formula Identification, Adaptability, Parameter Adjustment, Correction