
Mining Entity Columns Of Web Tables Based On Functional Dependency

Posted on: 2020-06-02 | Degree: Master | Type: Thesis
Country: China | Candidate: S Y Chen | Full Text: PDF
GTID: 2428330578452529 | Subject: Computer Science and Technology
Abstract/Summary:
With the development of information technology, more and more tabular data has appeared on the Internet. These structured web tables have attracted much attention because they cover a wide range of topics and carry a large amount of information. Compared with traditional relational tables, web tables are irregular, uncertain, and heterogeneous, which makes it difficult for machines to recognize their semantics automatically. The entity column is the most semantically representative attribute column of a web table, so discovering it accurately greatly helps machines annotate the table topic and understand the table's semantics.

Existing entity column discovery methods fall into two categories: knowledge-base-based methods and rule-based methods. Both have limitations. The accuracy of knowledge-base-based methods depends entirely on how well the knowledge base covers the labels; traversing the knowledge base is also slow, and these methods cannot be applied to multi-entity web tables. Rule-based methods generally place high requirements on the quality of web tables, so their accuracy cannot be guaranteed. In addition, existing methods scale poorly and are difficult to extend to large-scale web tables.

To address these problems, this thesis proposes an entity column discovery method for web tables based on functional dependency. The main research work is as follows:

(1) We propose the approximate primary functional dependency for web tables, which focuses only on functional dependencies whose determinant sets consist entirely of primary attributes. This kind of dependency expresses the relationships between the attributes of a web table more accurately and is more helpful for entity column detection and topic discovery on web tables.

(2) We propose aPFDMiner, an evaluation and quantification framework for detecting approximate primary functional dependencies. We define two factors, Conf and InfoGain, to evaluate and quantify these dependencies. We also design pruning rules that select candidate dependencies effectively and narrow the search space, improving accuracy and time efficiency so that the method can be applied to large-scale web tables. Experimental results show that, compared with the traditional method, aPFDMiner achieves better time efficiency and accuracy and scales better on large datasets.

(3) We propose ECMiner, an entity column discovery framework based on functional dependency. ECMiner builds a dependency graph from the approximate primary functional dependency sets obtained by aPFDMiner. We design an entity column scoring model for web tables that transforms the problem of detecting entity columns into the problem of detecting the strongest node in the dependency graph. ECMiner is suitable for both single-entity and multi-entity tables. Experimental results on multiple datasets show that, compared with existing methods, ECMiner not only improves significantly in accuracy and time efficiency, but can also be applied to web tables that lack headers or contain multiple entities.
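The abstract names a confidence factor, Conf, but does not define it. As a rough illustration only, the sketch below uses one common measure for approximate functional dependencies: the fraction of rows that survive when, for each determinant value, only the most frequent dependent value is kept. The function name `fd_confidence`, the sample rows, and this choice of measure are assumptions for illustration, not the thesis's definition.

```python
from collections import Counter, defaultdict

def fd_confidence(rows, lhs, rhs):
    """Confidence of the approximate dependency lhs -> rhs.

    Assumed measure (not taken from the thesis): the share of rows kept
    when each distinct lhs value is mapped to its most frequent rhs value.
    A value of 1.0 means the dependency holds exactly on these rows.
    """
    if not rows:
        return 0.0
    # Count how often each rhs value occurs for every lhs value combination.
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[col] for col in lhs)
        groups[key][row[rhs]] += 1
    # Keep only the majority rhs value in each group.
    kept = sum(counter.most_common(1)[0][1] for counter in groups.values())
    return kept / len(rows)

# Hypothetical web table; the last row is a noisy entry.
rows = [
    {"country": "France", "capital": "Paris", "continent": "Europe"},
    {"country": "France", "capital": "Paris", "continent": "Europe"},
    {"country": "Japan", "capital": "Tokyo", "continent": "Asia"},
    {"country": "Italy", "capital": "Rome", "continent": "Europe"},
    {"country": "Italy", "capital": "Milan", "continent": "Europe"},
]

conf_capital = fd_confidence(rows, ["country"], "capital")
conf_continent = fd_confidence(rows, ["country"], "continent")
conf_reverse = fd_confidence(rows, ["continent"], "country")
```

On this toy table, the candidate entity column `country` determines `continent` exactly and `capital` approximately, while `continent` is a much weaker determinant of `country`. A threshold on such a score is one plausible way to decide which dependencies to keep.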
Keywords/Search Tags: Web tables, Entity column, Approximate primary functional dependency, Table schema dependency graph, LeaderRank
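The keywords list LeaderRank next to the table schema dependency graph, which suggests a LeaderRank-style ranking for finding the strongest node; the abstract does not describe the actual scoring model. The sketch below is a plain implementation of standard LeaderRank: a ground node is linked in both directions to every node, scores are spread evenly along outgoing edges until they stabilize, and the ground node's score is then shared equally. The edge direction (dependent column pointing to its determinant), the function name `leader_rank`, and the example columns are assumptions for illustration.

```python
def leader_rank(nodes, edges, tol=1e-9, max_iter=1000):
    """Standard LeaderRank scores for a directed graph.

    nodes: list of hashable node ids.
    edges: iterable of (src, dst) pairs; score flows from src to dst.
    """
    if not nodes:
        return {}
    ground = object()  # extra node linked to and from every real node
    out = {n: set() for n in nodes}
    out[ground] = set(nodes)
    for src, dst in edges:
        out[src].add(dst)
    for n in nodes:
        out[n].add(ground)

    # Every real node starts with one unit of score, the ground node with none.
    score = {n: 1.0 for n in nodes}
    score[ground] = 0.0
    for _ in range(max_iter):
        nxt = {n: 0.0 for n in out}
        for src, targets in out.items():
            share = score[src] / len(targets)
            for dst in targets:
                nxt[dst] += share
        delta = max(abs(nxt[n] - score[n]) for n in out)
        score = nxt
        if delta < tol:
            break

    # Share the ground node's final score equally among the real nodes.
    bonus = score[ground] / len(nodes)
    return {n: score[n] + bonus for n in nodes}

# Hypothetical dependencies: country -> capital, country -> continent,
# capital -> continent. Assumed edge direction: dependent -> determinant.
columns = ["country", "capital", "continent"]
edges = [("capital", "country"), ("continent", "country"), ("continent", "capital")]
scores = leader_rank(columns, edges)
best_column = max(scores, key=scores.get)
```

With edges pointing toward determinants, a column that determines many others accumulates the most score, so `best_column` is the natural entity column candidate in this toy graph. For a multi-entity table, one could take several top-scoring columns instead of a single maximum, though how ECMiner handles that case is not stated in the abstract.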