Mobile Platform Oriented Complex Document Structure Analysis System

Posted on:2015-09-16
Degree:Master
Type:Thesis
Country:China
Candidate:Y P Wu
Full Text:PDF
GTID:2308330479989704
Subject:Computer technology
Abstract/Summary:
Complex document structure analysis has long been an important part of OCR systems. It handles document layout processing, locating the elements on a page and determining their reading order. The technology is widely used in instrument and business card recognition, manuscript digitization, and other systems. One of these is Google News Archive, which scans newspapers from recent decades and builds an index so that a user searching for related news can directly view the newspaper of that day.
Image processing is typically used for preprocessing. Morphological or geometric methods are then adopted to obtain the physical structure of the document, while machine learning and pattern recognition methods are used for the logical structure. This paper presents a method based on image processing and element location information which, combined with empirical rules, processes document structure efficiently. In the preprocessing stage, image processing finds all text lines in the document. Treating these text lines as obstacles, a location-based algorithm then gradually finds the empty blocks on the page, taking into account factors such as aspect ratio. Once the columns have been obtained, the text lines are sorted by column and the document structure is output.
Previous methods for evaluating document structure analysis algorithms are cumbersome, mainly because labeling document structure takes great effort. This paper presents a highly efficient labeling method that moves the labeling work from the PC to the iPad. The user experience of the mobile platform simplifies labeling: a marquee can be dragged directly with a finger rather than with a mouse. For row-level labeling, the system uses image processing techniques to extract text lines and then presents them as prompts to the labeling staff. If a prompt is correct, no further manual labeling is needed. In most cases, the prompts produced by the labeling system are correct.
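The column-finding and sorting steps described above can be illustrated with a simplified sketch. This is not the thesis's algorithm: instead of searching for empty blocks with aspect-ratio rules, it only projects text-line boxes onto the x-axis and treats wide uncovered intervals as column gutters. The box format `(x0, y0, x1, y1)`, the function names, and the `min_gap` threshold are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical text-line box format: (x0, y0, x1, y1) in pixels.
Box = Tuple[int, int, int, int]


def find_gutters(lines: List[Box], page_width: int,
                 min_gap: int = 15) -> List[Tuple[int, int]]:
    """Return x-intervals inside the text area that no text line covers."""
    if not lines:
        return []
    # Text lines act as obstacles: mark every x position they occupy.
    covered = [False] * page_width
    for x0, _, x1, _ in lines:
        for x in range(max(0, x0), min(page_width, x1)):
            covered[x] = True
    left = min(box[0] for box in lines)
    right = min(page_width, max(box[2] for box in lines))
    gutters, start = [], None
    for x in range(left, right):
        if not covered[x] and start is None:
            start = x                      # an empty run begins
        elif covered[x] and start is not None:
            if x - start >= min_gap:       # keep only wide empty runs
                gutters.append((start, x))
            start = None
    return gutters


def reading_order(lines: List[Box], page_width: int,
                  min_gap: int = 15) -> List[Box]:
    """Sort text lines column by column, top to bottom within a column."""
    cuts = [(a + b) / 2 for a, b in find_gutters(lines, page_width, min_gap)]

    def column_of(box: Box) -> int:
        center = (box[0] + box[2]) / 2
        return sum(center > cut for cut in cuts)

    return sorted(lines, key=lambda box: (column_of(box), box[1]))
```

A full-width line such as a title covers the gutter and defeats this projection, which is one reason a block-based search over empty regions, as the abstract describes, is more robust than this sketch.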
For evaluation, this paper uses Levenshtein distance, precision, and recall. A series of experiments on 202 manually labeled pages from 30 papers shows that the proposed algorithm works well for academic papers, achieving 91.1% precision and 85.1% recall, which is some improvement over the conventional method.
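The three measures named above can be sketched as follows. The abstract does not say exactly how they are applied, so this sketch assumes that Levenshtein distance compares the predicted reading order with the labeled one as sequences of line identifiers, and that precision and recall are computed over sets of items. The function names and inputs are hypothetical.

```python
from typing import Hashable, Iterable, Sequence, Tuple


def levenshtein(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,             # delete x
                           cur[j - 1] + 1,          # insert y
                           prev[j - 1] + (x != y))) # substitute or match
        prev = cur
    return prev[-1]


def precision_recall(predicted: Iterable[Hashable],
                     truth: Iterable[Hashable]) -> Tuple[float, float]:
    """Precision and recall of predicted items against labeled items."""
    predicted, truth = set(predicted), set(truth)
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall
```

Under these assumptions, a lower Levenshtein distance means the predicted reading order is closer to the labeled one, while precision and recall measure how many predicted items are correct and how many labeled items are found.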
Keywords/Search Tags:image processing, document structure analysis, labeling method, optical character recognition