Semantic Recovery Of Web Tables Based On Crowdsourcing

Posted on: 2017-03-25
Degree: Master
Type: Thesis
Country: China
Candidate: H X Liu
GTID: 2308330482487120
Subject: Computer Science and Technology
Abstract:
The Web contains a large number of structured tables, most of which lack header rows, primary keys, and foreign keys. This structural information is the basis of data search and integration for web tables. Algorithmic approaches have been proposed to recover structural information for web tables, but state-of-the-art techniques do not yet provide satisfactory accuracy and recall. In recent years, crowdsourcing has been applied to natural language processing, image recognition, and other tasks. We propose to improve the performance of web table annotation through crowdsourcing, which leverages human intelligence to complete annotation tasks.

For header and entity column recovery, we propose an improved K-means algorithm, based on a novel integrative distance, for task reduction; its goal is to minimize the number of tuples posed to the crowd. To recommend the most relevant tasks to human workers and to decide the final answers more accurately, we also implement an evaluation mechanism based on Answer Credibility, which measures the probability that a worker's intuitive answer becomes the final answer for a task. Extensive experiments on real-world datasets show that our framework clearly improves annotation accuracy and time efficiency for web tables, and that our task reduction and answer evaluation mechanisms are effective and efficient in improving answer quality.

For foreign key recovery, we introduce the notion of a similar foreign key, together with a corresponding scoring mechanism, to obtain candidate answers for crowdsourcing according to the characteristics of web tables. We also apply a mixed model that combines task reduction based on attribute dependency with dynamic question scheduling based on collision detection to reduce the number of tasks. Repeated experiments demonstrate that our hybrid framework performs well in precision and recall of foreign key annotation and clearly reduces the number of crowd tasks.
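The abstract does not give a formula for Answer Credibility, so the following is only a minimal sketch of one plausible reading: credibility is estimated as the fraction of a worker's past answers that matched the final answer, and the final answer for a new task is chosen by credibility-weighted voting. The function names, the neutral prior of 0.5 for workers without history, and the sample data are all assumptions made for illustration and are not taken from the thesis.

```python
from collections import defaultdict

# Assumed neutral score for a worker with no answer history (not from the thesis).
DEFAULT_CREDIBILITY = 0.5


def answer_credibility(history):
    """Estimate credibility from (worker_answer, final_answer) pairs.

    Returns the fraction of past tasks on which the worker's answer
    matched the final answer, or the assumed default if there is no history.
    """
    if not history:
        return DEFAULT_CREDIBILITY
    matches = sum(1 for given, final in history if given == final)
    return matches / len(history)


def weighted_vote(answers, credibility):
    """Pick the answer with the highest total credibility.

    answers: dict mapping worker id -> proposed answer
    credibility: dict mapping worker id -> credibility score
    """
    scores = defaultdict(float)
    for worker, answer in answers.items():
        scores[answer] += credibility.get(worker, DEFAULT_CREDIBILITY)
    return max(scores, key=scores.get)


# Hypothetical histories: w1 always matched the final answer, w2 and w3 rarely did.
histories = {
    "w1": [("city", "city"), ("name", "name"), ("year", "year"), ("id", "id")],
    "w2": [("city", "name"), ("name", "name"), ("year", "date"), ("id", "key")],
    "w3": [("city", "town"), ("name", "name"), ("year", "date"), ("id", "key")],
}
credibility = {w: answer_credibility(h) for w, h in histories.items()}

# Proposed header labels for one column of a new table.
answers = {"w1": "country", "w2": "city", "w3": "city"}
final = weighted_vote(answers, credibility)
print(credibility, final)
```

In this example a plain majority vote would choose "city", while the weighted vote favours the single worker whose past answers were consistently accepted. Whether the thesis combines credibility with votes in this way is not stated in the abstract.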
Keywords: Crowdsourcing, Web tables, Semantic recovery, Data integration