Research On Structured Conversion Methods For Tabular Data

Posted on: 2022-10-20
Degree: Master
Type: Thesis
Country: China
Candidate: Z X Hao
Full Text: PDF
GTID: 2518306485962309
Subject: Computer Science and Technology
Abstract/Summary:
Data in daily life falls into three types: structured, semi-structured, and unstructured. Compared with structured data, semi-structured and unstructured data have a lower value density, because they cannot be stored and analyzed directly, which makes data mining inconvenient. Tabular data, with its rich content, accounts for a large proportion of everyday information transmission and is an important data carrier. This thesis therefore takes tabular data as its research object and studies structured conversion methods for image-type and electronic-type tabular data.

An in-depth analysis of the relevant domestic and international literature shows that existing structured conversion methods mainly consist of two steps: data extraction and data organization. Extracting data from an electronic-type table is relatively simple and only requires reading the data into a cache through an API (Application Programming Interface). Data in an image-type table cannot be extracted directly; the table must first be converted into electronic-type tabular data. Data organization determines the logical relationships between data items from their positional relationships, processes the data while keeping those logical relationships unchanged, and converts it into a structured form suitable for database storage.

To redraw image-type tabular data as electronic-type tabular data, this thesis designs an image processing method. First, tilt correction is performed on the image to ensure the quality of the subsequent OCR (Optical Character Recognition). Then the LSD (Line Segment Detector) algorithm is used to detect line features in the table, and the line features of the table frame are determined through screening. Next, the Harris algorithm is used to obtain the corner features of the table image, the results are clustered to improve positioning accuracy, and the corner points of the table frame are filtered and extracted based on the line feature information. Finally, the cell images are segmented and recognized, and the corresponding electronic-type table is generated from the point, line, and text information.

To improve the overall anti-interference ability of the algorithm, edge detection is added. After comparing detection results, the Canny operator is selected as the experimental object, and several improvements are made to address problems encountered in practical applications. The improved algorithm was verified experimentally in a Python 3.7.0 environment. The results show that the algorithm preserves edges well, delivers good image processing quality and smoothly connected detection results, effectively improves edge detection, and meets the edge detection needs of image-type tables.

In the structured conversion of electronic-type tabular data, the table as a whole is divided into an index area, a title area, and a data area. The data in each area is extracted through the docx interface in Python, and the logical relationships between the data are recorded. The titles are merged and compressed according to their tree structure. Following the logical relationships between the data, the table content is reorganized and converted into structured data, which is stored in both an XML file and a database.

The test results show that the method can extract the feature information of image-type tabular data fairly accurately and redraw image-type tabular data as electronic-type tabular data, and that electronic-type tabular data can likewise be structurally converted and stored. Realizing the structured conversion of tabular data lays a good foundation for its effective use.
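
As a rough illustration of the title-merging and reorganization step described in the abstract, the following Python sketch flattens a two-level title area into tree paths, pairs each data row with its index entry, and serializes the result to XML. It is not the thesis's implementation: the function names (`flatten_titles`, `to_records`, `to_xml`), the `/` path separator, the XML element names, and the sample table are all assumptions made for this example. Merged title cells are assumed to have already been expanded so that a parent title is repeated in every column it spans.

```python
import xml.etree.ElementTree as ET


def flatten_titles(title_rows, sep="/"):
    """Merge a multi-row title area into one tree path per column.

    Assumption: a merged parent cell is repeated in every column it spans,
    so walking a column top to bottom yields the path from root to leaf.
    Empty cells and consecutive duplicates are skipped.
    """
    paths = []
    for column in zip(*title_rows):
        parts = []
        for cell in column:
            if cell and (not parts or parts[-1] != cell):
                parts.append(cell)
        paths.append(sep.join(parts))
    return paths


def to_records(index_col, title_rows, data_rows):
    """Pair each data row with its index entry and flattened titles.

    Assumption: no title is literally named "index".
    """
    titles = flatten_titles(title_rows)
    return [
        {"index": idx, **dict(zip(titles, row))}
        for idx, row in zip(index_col, data_rows)
    ]


def to_xml(records):
    """Serialize the records as a small XML document (hypothetical schema)."""
    root = ET.Element("table")
    for rec in records:
        row = ET.SubElement(root, "row", {"index": str(rec["index"])})
        for title, value in rec.items():
            if title == "index":
                continue
            cell = ET.SubElement(row, "cell", {"title": title})
            cell.text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical sample table: "Score" spans the Math and Physics columns.
    title_rows = [
        ["Score", "Score", "Remark"],
        ["Math", "Physics", "Remark"],
    ]
    index_col = ["Alice", "Bob"]
    data_rows = [[90, 85, "ok"], [70, 95, "late"]]

    records = to_records(index_col, title_rows, data_rows)
    print(records)
    print(to_xml(records))
```

Storing the full title path with every cell keeps the parent-child relationship of the titles even after the table is flattened into rows, which is one simple way to keep the logical relationships unchanged; the same records could equally be written to a database table with one column per path.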
Keywords/Search Tags:Tabular data, Edge detection, Side window Gaussian filtering, Feature detection, Data structured