Design And Implementation Of Web Data Table Detection System Based On Visual, Lexical And Semantic Features

Posted on: 2014-05-04  Degree: Master  Type: Thesis
Country: China  Candidate: W Zou  Full Text: PDF
GTID: 2208330434472100  Subject: Computer technology
Abstract/Summary:
Web information resources are growing quickly, and finding useful information on the Web is one of the most important open problems for the Internet. Tables are a compact and efficient way to present relational information, so they are used frequently in web documents. The data in a table is structured and valuable, and the automatic understanding of tables has many applications, including knowledge discovery, information retrieval, and web mining. According to one report, about 52% of HTML documents include a <table> element, but most of these tables serve only formatting and physical layout rather than storing data. Detecting the genuine data tables is therefore the first problem to solve in table mining.

The detection of web data tables can be done as follows. First, we use Nutch to crawl web pages, extract the HTML tables enclosed by <Table> and </Table>, and manually annotate each table as a genuine data table or not. Second, we extract a variety of features from those HTML tables, including layout features, content features, and semantic features. Finally, based on the table annotations and the extracted features, we use the classification algorithms implemented in WEKA to construct the detection system. Experimental results show that our method is effective for data table detection.
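The extraction and feature steps described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation, which is built on Nutch and WEKA. The class `TableCollector`, the functions `extract_tables` and `table_features`, and the five features are hypothetical choices made only for illustration; the abstract does not list the actual feature set, and semantic features are not covered here. The sketch assumes reasonably well-formed markup with explicit closing tags.

```python
from html.parser import HTMLParser
from statistics import pstdev


class TableCollector(HTMLParser):
    """Collect cell texts of every top-level <table> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tables = []    # each table is a list of rows; each row a list of strings
        self._depth = 0     # nesting depth of <table> elements
        self._rows = None
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
            if self._depth == 1:
                self._rows = []
        elif self._depth == 1:
            if tag == "tr":
                self._row = []
            elif tag in ("td", "th") and self._row is not None:
                self._cell = []

    def handle_data(self, data):
        # Text inside nested tables (depth > 1) is deliberately ignored.
        if self._depth == 1 and self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "table" and self._depth > 0:
            if self._depth == 1:
                self.tables.append(self._rows)
                self._rows = None
            self._depth -= 1
        elif self._depth == 1:
            if tag in ("td", "th") and self._cell is not None:
                # Join text fragments and collapse runs of whitespace.
                self._row.append(" ".join("".join(self._cell).split()))
                self._cell = None
            elif tag == "tr" and self._row is not None:
                self._rows.append(self._row)
                self._row = None


def extract_tables(html):
    """Return all top-level tables found in an HTML string."""
    collector = TableCollector()
    collector.feed(html)
    collector.close()
    return collector.tables


def table_features(rows):
    """Compute a few illustrative layout and content features for one table."""
    widths = [len(row) for row in rows]
    non_empty = [cell for row in rows for cell in row if cell]
    with_digits = [cell for cell in non_empty if any(ch.isdigit() for ch in cell)]
    count = len(non_empty)
    return {
        # Layout features: table shape and how regular the rows are.
        "rows": len(rows),
        "max_cols": max(widths, default=0),
        "col_count_stdev": pstdev(widths) if widths else 0.0,
        # Content features: cell length and share of cells containing digits.
        "avg_cell_length": sum(len(c) for c in non_empty) / count if count else 0.0,
        "numeric_ratio": len(with_digits) / count if count else 0.0,
    }
```

Each feature dictionary, paired with its manual genuine/non-genuine label, would form one training instance for a classifier. Ignoring nested tables is a simplification; a layout table that wraps a data table would need separate handling.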
Keywords/Search Tags: Web Table, Data Mining, Nutch, Feature extraction, Information extraction