
Layout inference: File schema recognition via content-based oracles

Posted on: 2010-05-22
Degree: Ph.D
Type: Dissertation
University: University of Arkansas
Candidate: Phillips, Reid A
Full Text: PDF
GTID: 1448390002486303
Subject: Computer Science
Abstract/Summary:
Some organizations routinely (e.g., monthly) process tens of thousands of flat files received from third parties, where a flat file consists of records that each contain a fixed number of fields. Currently, characterizing each file's encoding, formatting elements, structure, and content is a manual process. It is expensive in that it costs human time, delays processing of the files, and is error-prone. This dissertation provides methods for automatically inferring this metadata.

The result of this process is called the layout, and the first step in defining it is to identify the properties to be inferred. These properties are those required to read and process the contents of a file. They include, but are not necessarily limited to, the schema of the data records contained within the file, the character encoding, and other formatting details. Layout inference is thus concerned with providing an encompassing description of a file rather than a single characteristic (e.g., only the character encoding). Once a layout is available, the final step in the layout inference problem is to communicate it in a meaningful manner to any interested parties.

The approach to this problem described in this paper is primarily statistical in nature. Statistical solutions, while potentially more ambiguous, can be considered better than other solutions because they are more adaptive: they gracefully handle a limited amount of error and incomplete information, along with many unforeseen circumstances. Another important characteristic of the approach is its use of a conglomeration of expert agents. Each agent is an expert on one file property and provides the means of identifying that property. By applying their respective knowledge in whatever way is appropriate to the property being determined, the agents allow the various layout characteristics to be inferred. Together, the statistical results of the expert knowledge agents provide a powerful approach to solving the layout inference problem.

The applicability of this approach to the layout inference problem will be shown through results generated by an implemented prototype. These results will indicate the prototype's performance (i.e., accuracy and run time) with respect to a representative set of data files, thereby showing the capability of the defined approach and the promise of certain areas of future work.

In order to mine, persist, transform, or in some other way process structured data contained within a flat file, the properties associated with the file must first be known. Within this paper, the identification of these properties will be referred to as the layout inference problem, where a layout is a specification of the characteristics associated with a file. Typically a manual task, layout inference can benefit from an automated tool designed to replace or assist human involvement in this process.
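
To make the idea of property-specific expert agents more concrete, the following is a minimal illustrative sketch in Python. It is not the prototype described in the dissertation. The function names (`infer_encoding`, `infer_delimiter`, `infer_layout`), the candidate delimiter list, and the confidence scores are all assumptions introduced here for illustration. Each hypothetical agent inspects one property and returns a guess with a score, and the delimiter agent uses a simple statistic (how often lines agree on a delimiter count) so that a few malformed records do not derail the result.

```python
from collections import Counter

# Hypothetical candidate set; a real tool would likely consider more.
CANDIDATE_DELIMITERS = [",", "\t", "|", ";"]


def infer_encoding(raw):
    """Hypothetical encoding agent: try strict decoders in order."""
    for enc in ("ascii", "utf-8"):
        try:
            raw.decode(enc)
            return enc, 1.0
        except UnicodeDecodeError:
            continue
    # latin-1 decodes any byte sequence, so treat it as a low-confidence fallback.
    return "latin-1", 0.5


def infer_delimiter(lines):
    """Hypothetical delimiter agent.

    Scores each candidate by the fraction of non-empty lines that share
    the most common occurrence count, which tolerates some bad records.
    """
    best, best_score = None, 0.0
    for delim in CANDIDATE_DELIMITERS:
        counts = Counter(line.count(delim) for line in lines if line)
        if not counts:
            continue
        mode_count, freq = counts.most_common(1)[0]
        if mode_count == 0:
            continue  # the candidate is absent from most lines
        score = freq / sum(counts.values())
        if score > best_score:
            best, best_score = delim, score
    return best, best_score


def infer_layout(raw):
    """Combine the agents' results into a single layout description."""
    encoding, enc_score = infer_encoding(raw)
    lines = raw.decode(encoding).splitlines()
    delimiter, delim_score = infer_delimiter(lines)
    field_count = None
    if delimiter is not None:
        per_line = Counter(line.count(delimiter) + 1 for line in lines if line)
        field_count = per_line.most_common(1)[0][0]
    return {
        "encoding": encoding,
        "encoding_score": enc_score,
        "delimiter": delimiter,
        "delimiter_score": delim_score,
        "field_count": field_count,
    }


if __name__ == "__main__":
    sample = b"id|name|city\n1|Ann|Tulsa\n2|Bob|Little Rock\n3|Cy, Jr.|Hope\n"
    print(infer_layout(sample))
```

In this sketch the stray comma in one record does not mislead the delimiter agent, because the comma's most common per-line count is zero, while the pipe character splits every line the same way.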
Keywords/Search Tags: Layout inference, File, Process