Design And Implementation Of The Web Table Data Extraction And Analysis System

Posted on: 2017-01-19
Degree: Master
Type: Thesis
Country: China
Candidate: Z X Cao
Full Text: PDF
GTID: 2308330509457560
Subject: Software engineering
Abstract/Summary:
Tables, as a form of data presentation, now appear on web pages in many fields. A table is a simple and intuitive way to express the relationships among pieces of information, so tables are widely used and have become a focus of Web information extraction. However, people often overlook the shortcomings of tables themselves, typically that a table's header (hereinafter referred to as the attribute names) determines how everything in it is interpreted. In reality, tables on the Web often appear without attribute names, or with attribute names that are difficult to understand. Research on automatically analyzing and repairing header information will therefore become increasingly important for Web mining, data understanding, and decision support.

This thesis first discusses the research status, including the research background, purpose, significance, related fields, and the main contents of the thesis. It then describes in detail the requirements analysis, the overall system design, the detailed design and implementation of the system modules, and the functional and non-functional testing of the system. Finally, it gives a summary and an outlook on future work.

The main contents of this thesis cover three aspects: Web table data extraction and storage, table data analysis, and automatic labeling of attribute names. The Web table data extraction work consists of HTML page parsing, data table identification, and table data extraction and storage. The table data analysis work consists of in-depth analysis of the data. Because different types of data have different characteristics, the first step is a simple classification of the table data. Different types of data are then processed with different methods to extract their specific features.
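The extraction and simple classification steps described above might look roughly like the following sketch. It is an illustration only, not the thesis implementation: the names `TableExtractor`, `extract_tables`, `is_data_table`, and `classify_column` are hypothetical, the "data table" test is an assumed heuristic (at least two rows and two columns of equal width), and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings.

    Assumption: tables are not nested inside one another.
    """

    def __init__(self):
        super().__init__()
        self.tables = []     # finished tables
        self._table = None   # rows of the table currently open
        self._row = None     # cells of the row currently open
        self._cell = None    # text fragments of the cell currently open

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and normalize internal whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._table.append(self._row)
            self._row = None
        elif tag == "table" and self._table is not None:
            self.tables.append(self._table)
            self._table = None


def extract_tables(html):
    """Parse an HTML string and return every table found in it."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


def is_data_table(rows, min_rows=2, min_cols=2):
    """Assumed heuristic: a data table has enough rows and a uniform width."""
    if len(rows) < min_rows:
        return False
    widths = {len(row) for row in rows}
    return len(widths) == 1 and widths.pop() >= min_cols


def classify_column(values):
    """Very coarse type classification: 'numeric' if every value parses as a float."""
    if not values:
        return "text"
    for value in values:
        try:
            float(value)
        except ValueError:
            return "text"
    return "numeric"
```

A caller would typically run `extract_tables` on a downloaded page, keep only the tables that pass `is_data_table`, and then classify each column before choosing a feature extraction method for it.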
In this thesis, we study the statistical and structural characteristics of the data. Regular expressions represent the structural features of the data, and two parameters of the statistical distribution, the mean and the variance, represent its statistical features. We then use a large amount of training data to build an "attribute name - value characteristic" feature library. In the study of automatic attribute name annotation, the main task is to match an attribute name to specific data. We propose different feature matching strategies and build matching models for the different data characteristics. For structural features represented by regular expressions, we use the edit distance algorithm to compute the similarity between regular expression strings, combined with simple string matching to improve accuracy. For statistical features represented by distribution parameters, we use the sample mean test from hypothesis testing to compute the similarity between two samples. Finally, to obtain the best attribute names, the candidate attribute names are refined through the matching process.

In the experiments for this thesis, we use laboratory data to build the table signatures and use cross-validation to tune the matching model parameters (the threshold and the significance level). Repeated tests show that combining the regular expression strategy with the statistical distribution strategy is a good solution to the problem of repairing attribute names for table data.
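The two matching strategies described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names are hypothetical, `to_pattern` is one simple way to generalize an ASCII value into a regular expression string, the similarity is the edit distance normalized by the longer string, and the sample mean test is written as a two-sample z-test with a normal approximation, which may differ from the test used in the thesis.

```python
import math
import re
import statistics


def to_pattern(value):
    """Generalize a value into a regex string (assumes ASCII letters and digits)."""
    parts = []
    for ch in value:
        if ch.isdigit():
            token, is_class = r"\d+", True
        elif ch.isalpha():
            token, is_class = "[A-Za-z]+", True
        elif ch.isspace():
            token, is_class = r"\s+", True
        else:
            token, is_class = re.escape(ch), False
        # Collapse runs of the same character class into a single token.
        if is_class and parts and parts[-1] == token:
            continue
        parts.append(token)
    return "".join(parts)


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def pattern_similarity(p, q):
    """Similarity in [0, 1]: 1 minus the edit distance over the longer length."""
    longest = max(len(p), len(q))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(p, q) / longest


def mean_test_p_value(sample_a, sample_b):
    """Two-sided p-value of a two-sample z-test on the means.

    Each sample needs at least two values so the sample variance is defined.
    """
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    std_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    if std_error == 0:
        return 1.0 if mean_a == mean_b else 0.0
    z = (mean_a - mean_b) / std_error
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))
```

Under these assumptions, a candidate attribute name would be kept for a text column when `pattern_similarity` exceeds the threshold, and for a numeric column when `mean_test_p_value` exceeds the significance level; both parameters are the ones the abstract says are tuned by cross-validation.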
Keywords/Search Tags: Web mining, Table data, Attribute names auto labeling, Data features, Hypothesis testing