Discovering Relations Between Web Tables

Posted on: 2016-12-11  Degree: Master  Type: Thesis
Country: China  Candidate: H W Ren  Full Text: PDF
GTID: 2298330467472501  Subject: Computer Science and Technology
Abstract/Summary:
In recent years, a large amount of structured tabular data has emerged on the Internet. However, the value of web tables depends not only on the data itself but also on the relatedness between tables. These structured data can be fully utilized only once the potential relatedness between them has been detected. Yet discovering related tables is challenging because of the heterogeneity and uncertainty of web tables. We propose two new types of relatedness between web tables, called the snapshot relationship and the reference relationship, which are beneficial for query optimization, helpful for returning partial results rapidly when querying big data, and useful for answering open-world queries in data fusion systems.

We propose an algorithm for discovering the snapshot relationship. The relatedness between an original web table and its snapshot is computed from entity consistency and schema consistency. To assign higher weights to tables that provide more fresh entities, the concept of entity freshness is introduced into our scoring method. In addition, applying Bayesian analysis to our relatedness-capturing framework strengthens the content consistency of web tables, which improves the accuracy of finding snapshots. Repeated experiments show that the algorithm captures snapshots with high quality and performs well in query precision and recall.

We also propose a probabilistic model for capturing the reference relationship between tables. To give more attention to entities that occur repeatedly in the reference column, the weight of an entity for the table is introduced into our scoring method. Moreover, web tables contain a large amount of noisy data. To reduce the effect of such noisy entities, our algorithm identifies noisy data probabilistically, so the weight of an entity for the concept is also considered. Extensive experiments on real datasets demonstrate that the algorithm for detecting the reference relationship finds referenced tables with high quality and performs well in query precision and recall on open-world queries.
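The abstract does not give the scoring formulas, so the following is only a minimal sketch of the two ideas under stated assumptions, not the thesis's method. It assumes a table is a dictionary with hypothetical keys `entities` and `schema`, uses Jaccard overlap as a stand-in for entity and schema consistency, blends them with a hypothetical weight `alpha`, and weights reference-column entities by how often they repeat. Entity freshness, the Bayesian analysis, and the noise model are left out.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard overlap of two collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def snapshot_score(original, candidate, alpha=0.5):
    """Blend entity consistency and schema consistency.

    `alpha` is a hypothetical mixing weight, not a value from the thesis.
    """
    entity_consistency = jaccard(original["entities"], candidate["entities"])
    schema_consistency = jaccard(original["schema"], candidate["schema"])
    return alpha * entity_consistency + (1 - alpha) * schema_consistency


def reference_score(reference_column, candidate_entities):
    """Frequency-weighted share of reference-column cells found in the candidate.

    An entity that repeats in the reference column counts once per occurrence,
    so frequently repeated entities carry more weight.
    """
    counts = Counter(reference_column)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    candidate_entities = set(candidate_entities)
    covered = sum(n for entity, n in counts.items() if entity in candidate_entities)
    return covered / total
```

For example, calling `snapshot_score` on an original table and a candidate that keeps the same schema but only half of the entities rewards the schema match while penalizing the missing entities. A real system would replace the Jaccard terms with whatever consistency measures the thesis defines.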
Keywords/Search Tags: Web tables, Relatedness, Snapshot, Reference relationship, Data integration, Query optimization