Paper Form-based Data Service Platform Research

Posted on: 2018-02-15
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhou
Full Text: PDF
GTID: 2428330596490050
Subject: Software engineering
Abstract/Summary:
Despite the rapid development of information technology in recent years, business processes still inevitably involve a large number of paper forms. At present, the information on these forms is entered and processed manually before it can be stored and managed, which consumes considerable human resources and hinders enterprise informatization. Moreover, the information on paper forms is heterogeneous, comes from multiple sources, and is organized at multiple levels. For example, commonly used documents such as resumes and invoices come from different sources in different layouts and may contain a hierarchy of several business entities and attributes. These features make both form extraction and the management of the extracted information in an information system more difficult. How to transform unstructured paper forms into a computer-readable model, and thereby support the rapid development of enterprise information systems, is therefore an important problem in enterprise informatization.

To solve this problem, we propose a framework for a data service platform based on paper forms. We first recognize the text and the table layout of a form with OCR tools and combine the results to extract an instance model. We then analyze the instance information and construct an Entity Resource Model and data services to support the development of enterprise systems. The main contributions of the thesis are as follows:

1) Proposing the framework of a data service platform driven by paper forms. The framework not only meets the demands of paper form management in real business processes, but also supports the development of enterprise information systems.

2) Constructing an automatic method to extract instances from forms. After analyzing existing OCR tools, we select well-performing tools to extract the text and the form layout respectively. With the help of domain rules and a knowledge base, the integrated results are transformed into an instance model.

3) Designing the Entity Resource Model and its storage strategy. We consider the string and lexical information of attribute names and the text of attribute values to find matches across all form instances. Based on the heterogeneous and hierarchical nature of the data, we define the Entity Resource Model and its storage strategy, which is easy to manage and search and general across different paper forms.

4) Generating the prototype system of the data service platform driven by forms. We design the mapping strategy from the Entity Resource Model to data services and generate the prototype system. The prototype is developed in Java; it uses Tesseract and ABBYY Cloud SDK to recognize the text and layout of paper forms, Jena to handle instance models, open service APIs such as BosonNLP and HowNet to analyze Chinese semantic information, and MySQL to store the information. To verify the practicability and effectiveness of the method, the prototype system is applied to the development of a law management system.

To sum up, the thesis proposes a method to extract information from paper forms and generate data services, on which a prototype system is built. A series of experimental results, comparisons, and the prototype system demonstrate the practicability and generality of the method and framework.
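
The attribute matching described in the abstract, which uses the string and lexical information of attribute names, could look roughly like the sketch below. This is an illustration under stated assumptions, not the thesis's actual algorithm: the abstract does not give the similarity formula, so the class name AttributeMatcher, the small CONCEPTS table (a hypothetical stand-in for a lexical resource such as HowNet), the rule of taking the maximum of the two scores, and the 0.8 threshold are all invented for this example. The string score is a normalized Levenshtein distance.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: names, table contents, and thresholds are assumptions.
class AttributeMatcher {
    // Hypothetical stand-in for a lexical resource such as HowNet:
    // maps a normalized attribute name to a canonical concept label.
    private static final Map<String, String> CONCEPTS = new HashMap<>();
    static {
        CONCEPTS.put("phone", "telephone");
        CONCEPTS.put("tel", "telephone");
        CONCEPTS.put("telephone", "telephone");
        CONCEPTS.put("birthday", "birthdate");
        CONCEPTS.put("date of birth", "birthdate");
    }

    // Trim whitespace and lowercase so that "Name" and "name " compare equal.
    static String normalize(String name) {
        return name.trim().toLowerCase(Locale.ROOT);
    }

    // Levenshtein edit distance, computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // String similarity in [0, 1]: 1 minus the distance divided by the longer length.
    static double stringSimilarity(String a, String b) {
        String x = normalize(a);
        String y = normalize(b);
        int longer = Math.max(x.length(), y.length());
        if (longer == 0) {
            return 1.0;
        }
        return 1.0 - (double) editDistance(x, y) / longer;
    }

    // Lexical similarity: 1 if both names map to the same concept, otherwise 0.
    static double lexicalSimilarity(String a, String b) {
        String ca = CONCEPTS.get(normalize(a));
        String cb = CONCEPTS.get(normalize(b));
        return ca != null && ca.equals(cb) ? 1.0 : 0.0;
    }

    // Assumed combination rule: the stronger of the two signals wins.
    static double similarity(String a, String b) {
        return Math.max(stringSimilarity(a, b), lexicalSimilarity(a, b));
    }

    static boolean matches(String a, String b, double threshold) {
        return similarity(a, b) >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(matches("Tel", "Phone", 0.8));      // true (same concept)
        System.out.println(matches("Name", "name ", 0.8));     // true (equal after normalization)
        System.out.println(matches("Address", "Salary", 0.8)); // false
    }
}
```

Taking the maximum lets either signal establish a match on its own, so synonyms with no shared characters ("Tel" and "Phone") and near-identical spellings are both caught. A weighted sum, or an additional comparison of attribute values as the abstract mentions, would be equally plausible choices.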
Keywords/Search Tags: OCR Technology, Form Extraction, Model Integration, Lexical Similarity, Entity Resource Model, Data Service Platform