
A Data Augmentation Approach For Annotating Web Table Columns By Knowledge Base Classes

Posted on: 2022-04-10
Degree: Master
Type: Thesis
Country: China
Candidate: C Liu
Full Text: PDF
GTID: 2518306602990509
Subject: Master of Applied Statistics
Abstract/Summary:
Hundreds of millions of HTML tables are now published on websites. These tables span many fields and contain rich structured information, and tabular data is often used for information retrieval, automatic response, knowledge graph expansion and updating, data mining, and natural language recognition. Applying table data to downstream tasks first requires that a machine understand the structure and content of a table. This problem is usually cast as a matching task between a table and a cross-domain knowledge graph (KG). Table annotation comprises three subtasks: matching (i) column cells to KG entities, (ii) table columns to KG classes, and (iii) column pairs, i.e., inter-column relationships, to KG properties.

The challenges of table annotation come mainly from the following aspects. First, web tables vary widely in type and have no fixed size. Second, there is a knowledge gap: cells in a table may have no corresponding entity in the knowledge graph, or may be mapped to a completely irrelevant entity. Third, text accompanying the table, such as the title and other context information, is often missing. In particular, for the table column annotation task, the type hierarchy in the knowledge graph significantly increases the difficulty of evaluation.

To address the challenges caused by the knowledge gap, and to handle tables that are too small to annotate, this thesis proposes a data augmentation strategy based on an external dataset, such as Wikipedia, for mapping entity columns to classes in the knowledge graph. During data expansion, the semantic information of the target column is fully utilized, and the different surface forms that the same entity takes in different contexts are also taken into account. The augmentation strategy has four stages. First, tables and entity lists with specific tags are extracted from the Wikipedia dump. Second, entities are linked to their surface forms through the DBpedia Lexicalizations Dataset. Third, possible candidate columns are filtered by string matching. Finally, FastText with subword information is used to obtain vector representations of the candidate columns and the target column; once the cosine similarity between a candidate column and the target column exceeds a certain threshold, the candidate is regarded as a similar column. Building on this augmentation, the thesis feeds the prior information from the expanded data in the tables and lists extracted from the Wikipedia dump, together with statistical features, character distribution features, and word embedding features, into a Long Short-Term Memory (LSTM) network to predict the type of entity columns.

To evaluate the effectiveness of the augmentation strategy, this thesis conducts experiments on two gold standards, T2Dv2 and Limaye, which are commonly used in the table annotation task. The proposed model is compared with three baseline methods, covering both state-of-the-art and classical approaches, namely Lookup-Vote, ColNet, and T2K Match, on several metrics such as accuracy, recall, and F1 score. The two submodules, the data augmentation strategy and the feature-based prediction model, are also evaluated. Experimental results show that on the T2Dv2 dataset, which has a smaller knowledge gap, the augmentation-based annotation strategy performs close to prediction based on the knowledge graph. When the knowledge gap is larger, as in the Limaye benchmark, the proposed method performs significantly better than prediction based on the knowledge graph.
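
The last two augmentation stages (string-match filtering, then a cosine-similarity threshold on column vectors) can be illustrated with a minimal Python sketch. It is not the thesis implementation. Hashed character n-gram counts stand in for FastText subword embeddings so the example runs without a pretrained model. The abstract does not say how a column vector is formed from its cells, so averaging cell vectors is an assumption, as are the function names and the thresholds `min_overlap` and `min_cosine`.

```python
import math
import zlib

DIM = 64  # size of the toy embedding (hypothetical value)


def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of a word wrapped in boundary markers."""
    w = f"<{word.lower()}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]


def embed_cell(text):
    """Toy stand-in for a subword embedding: hashed n-gram counts."""
    vec = [0.0] * DIM
    for word in text.split():
        for gram in char_ngrams(word):
            # crc32 is deterministic across runs, unlike built-in hash().
            vec[zlib.crc32(gram.encode("utf-8")) % DIM] += 1.0
    return vec


def embed_column(cells):
    """Average the cell vectors into one column vector (an assumption)."""
    vecs = [embed_cell(c) for c in cells]
    if not vecs:
        return [0.0] * DIM
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(DIM)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def string_overlap(target, candidate):
    """Fraction of target cells found (case-insensitively) in the candidate."""
    if not target:
        return 0.0
    cand = {c.strip().lower() for c in candidate}
    return sum(c.strip().lower() in cand for c in target) / len(target)


def similar_columns(target, candidates, min_overlap=0.2, min_cosine=0.8):
    """Filter by string matching, then keep candidates above the cosine threshold."""
    target_vec = embed_column(target)
    hits = []
    for name, cells in candidates.items():
        if string_overlap(target, cells) < min_overlap:
            continue  # stage 3: discard columns with too few matching cells
        score = cosine(target_vec, embed_column(cells))
        if score > min_cosine:  # stage 4: similarity must exceed the threshold
            hits.append((name, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

Calling `similar_columns(target_cells, {"name": candidate_cells, ...})` returns the candidates that pass both stages, ordered by similarity. In a real pipeline `embed_cell` would be replaced by vectors from a trained FastText model, and both thresholds would need tuning on held-out data.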
Keywords/Search Tags: Knowledge graph, FastText, LSTM, Column type annotation, Data augmentation