Research On PDF Structure Analysis Technology Of Academic Papers

Posted on: 2021-07-11; Degree: Master; Type: Thesis
Country: China; Candidate: Y L Zhou; Full Text: PDF
GTID: 2518306122468724; Subject: Computer technology
Abstract/Summary:
PDF greatly simplifies the dissemination of digital information and the distribution of electronic documents, and it has become the main carrier of academic papers. This large body of academic papers can be processed, integrated, and reorganized into reusable paper structure objects, which then serve as input data for academic paper applications. Ideally, the metadata of a PDF would store structured information. However, PDF is a layout-based format that does not provide structural information. Extracting the structure of PDF academic papers is therefore the main challenge for academic resource mining. This thesis introduces the key issues of structure analysis from two aspects, text elements and vector elements, analyzes the factors that affect the analysis results, and proposes solutions. For text areas, a recognition method based on rules over text coordinates is proposed; for non-text areas, a recognition method based on expanding the largest rectangular area is proposed. To handle redundancy and overlap between the identified blocks, a block merging algorithm is designed. A sorting algorithm is also designed to correct the wrong block order that results from the incorrect PDF rendering order of multi-column layouts. The experimental results show that the proposed recognition algorithms extract PDF structural information well and can automatically extract and process academic resources. This supports the further use of PDF in the field of academic papers and is of significance for research on mining knowledge from academic paper resources.
Keywords/Search Tags: PDF document analysis, text, vector image
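
The abstract names a block merging step and a block sorting step but does not give their details. The following Python sketch illustrates one generic way such steps could look; it is not the thesis's algorithm. All names (`Block`, `merge_blocks`, `reading_order`) are hypothetical. It assumes axis-aligned bounding boxes, a top-left origin with y increasing downward (native PDF coordinates start at the bottom-left, so a flip would be needed), and a simple two-column page with no full-width blocks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Axis-aligned bounding box of a detected region (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(a: Block, b: Block) -> bool:
    # True when the two rectangles share a region of positive area.
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def union(a: Block, b: Block) -> Block:
    # Smallest rectangle that contains both inputs.
    return Block(min(a.x0, b.x0), min(a.y0, b.y0),
                 max(a.x1, b.x1), max(a.y1, b.y1))


def merge_blocks(blocks: list[Block]) -> list[Block]:
    """Repeatedly fuse overlapping blocks until none overlap (assumed rule)."""
    merged = list(blocks)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if overlaps(merged[i], merged[j]):
                    merged[i] = union(merged[i], merged[j])
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan, the list has changed
    return merged


def reading_order(blocks: list[Block], page_width: float) -> list[Block]:
    """Sort blocks left column first, then right column, each top to bottom.

    Assumes exactly two columns split at the page midline.
    """
    mid = page_width / 2

    def key(b: Block) -> tuple[int, float, float]:
        column = 0 if (b.x0 + b.x1) / 2 < mid else 1
        return (column, b.y0, b.x0)

    return sorted(blocks, key=key)
```

For example, two overlapping fragments in the left column would be fused into one block by `merge_blocks`, and `reading_order` would then place all left-column blocks before any right-column block even when the PDF content stream lists them in a different order. Full-width elements such as titles or wide figures would need extra handling that this sketch leaves out.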