
Recovering Semantics Of Tabular Data On The Web

Posted on: 2015-03-05
Degree: Master
Type: Thesis
Country: China
Candidate: J Luo
GTID: 2298330434950215
Subject: Computer technology
Abstract/Summary:
With the rapid growth of Internet data in recent years, more and more services built on massive data have appeared. Because tabular data is well structured and rich in semantic information, it is receiving increasing attention. However, because data from different sources is uncertain and heterogeneous, it is difficult for computers to understand the semantics of web tables directly. As a result, many web tables with significant semantics remain buried on the Internet.

At present, research on recovering the semantics of tabular data on the web is still in its infancy. Existing algorithms focus only on column label annotation and on detecting a single analogous primary key. Tables with more than one analogous primary key are usually discarded, so the reference relationships between tables go undetected. Yet those discarded tables often contain richer semantic information, and finding the reference relationships between tables can improve retrieval results. This thesis studies the recovery of semantics of tabular data on the web. The main contributions are as follows:

1. The existing column label annotation algorithm is improved by considering the posterior probability between concepts and entities together with the importance of each concept, so that many concepts that are unimportant for a table are filtered out. This greatly improves the accuracy of the existing column label annotation algorithm.

2. The existing analogous primary key detection algorithm is improved by introducing a Possibility Degree for each column label, which combines two main factors: the posterior probability between concepts and attributes, and the score of a concept as a column label. After the Possibility Degree of each column label is computed, the candidates are ranked by it. This improves the precision of analogous primary key detection.

3. For tables containing more than one analogous primary key, the semantic similarity between tables is computed first and used as the edge weights of a bipartite graph. Using the weighted bipartite graph, the reference relationships between tables are marked, and analogous foreign keys are recommended for tables that have more than one analogous primary key. The reference relationships between entity tables and related tables are then annotated.

The experimental results show that the proposed column label annotation and analogous primary key detection algorithms improve accuracy. The semantic similarity calculation effectively filters out semantically unrelated tables and reduces unnecessary time and space overhead. The algorithm based on the weighted bipartite graph produces satisfactory results for marking reference relationships, which is helpful for web table search.
Keywords/Search Tags:Internet tabular data, Semantics Recovery, Semantics similarity, Analogous Primary Keys, Reference relationships
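
The abstract does not give formulas, so the following is only a rough sketch of how the second and third contributions might look in code, not the thesis's actual method. It assumes the Possibility Degree is the product of the two factors (the abstract only says they are "synthesized"), and it assumes that each table with several analogous primary keys is linked to the entity table with the highest similarity above a cutoff. The names `Candidate`, `possibility_degree`, `rank_candidates`, `recommend_references`, and the `threshold` parameter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    column: str         # column header as it appears in the table
    concept: str        # concept proposed as the column label
    posterior: float    # assumed P(concept | attributes), in [0, 1]
    label_score: float  # assumed score of the concept as a column label


def possibility_degree(c: Candidate) -> float:
    # Assumption: the two factors are combined by multiplication.
    return c.posterior * c.label_score


def rank_candidates(candidates):
    # Order analogous-primary-key candidates, highest Possibility Degree first.
    return sorted(candidates, key=possibility_degree, reverse=True)


def recommend_references(similarity, threshold=0.5):
    """Pick one entity table per multi-key table from a weighted bipartite graph.

    similarity maps (multi_key_table, entity_table) -> edge weight.
    Edges below the (hypothetical) threshold are treated as unrelated.
    """
    best = {}
    for (table, entity), weight in similarity.items():
        if weight < threshold:
            continue  # filter out semantically unrelated tables
        if table not in best or weight > best[table][1]:
            best[table] = (entity, weight)
    return {table: entity for table, (entity, _) in best.items()}


if __name__ == "__main__":
    ranked = rank_candidates([
        Candidate("capital", "City", 0.6, 0.5),
        Candidate("name", "Country", 0.9, 0.8),
    ])
    print([c.column for c in ranked])

    edges = {("t1", "e1"): 0.9, ("t1", "e2"): 0.4, ("t2", "e2"): 0.3}
    print(recommend_references(edges))
```

A per-table maximum is used here for brevity; a one-to-one assignment over the bipartite graph would be another reasonable reading of the abstract.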