Design And Implementation Of Web Data Table Detection System Based On Visual, Lexical And Semantic Features

Posted on:2014-05-04

Degree:Master

Type:Thesis

Country:China

Candidate:W Zou

GTID:2208330434472100

Subject:Computer technology

Abstract/Summary:

Web information resources are growing rapidly, and finding useful information on the Web is one of the most important open problems for the Internet. As a compact and efficient way to present relational information, tables are used frequently in web documents. The data in tables is structured and valuable. The automatic understanding of tables has many applications, including knowledge discovery, information retrieval, and web mining. According to one report, about 52% of HTML documents include a <table> element, but most of these tables are used only for page makeup and physical layout rather than for storing data. Detecting the genuine data tables is therefore the first problem to solve in table mining.

The detection of web data tables proceeds as follows. First, the HTML tables enclosed by <table> and </table> are extracted and annotated manually: we use Nutch to crawl web pages and extract HTML tables from them, then annotate each table as a genuine data table or not. Second, we extract a variety of features from those tables, including layout features, content features, and semantic features. Finally, based on the annotations and the extracted features, we use the classification algorithms implemented in WEKA to construct the detection system. Experimental results show that our method is effective in data table detection.
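
The feature-extraction step can be illustrated with a minimal sketch. It is not the thesis's implementation, which relies on Nutch and WEKA. It uses only the Python standard library, and the names `TableCollector`, `extract_tables`, and `table_features`, as well as the particular features chosen, are assumptions made for illustration. It covers a few plausible layout and content features only; semantic features and the classifier are left out.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collects the cell texts of each top-level <table>, row by row.

    Hypothetical helper: nested tables are not split out; their text
    is simply folded into the enclosing outer cell.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables: list of rows of cell strings
        self._depth = 0    # current nesting depth of <table> tags
        self._rows = None  # rows of the table being read
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <Table> arrives as "table".
        if tag == "table":
            self._depth += 1
            if self._depth == 1:
                self._rows = []
        elif self._depth == 1:
            if tag == "tr":
                self._rows.append([])
            elif tag in ("td", "th") and self._rows:
                self._cell = []

    def handle_endtag(self, tag):
        if tag == "table" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.tables.append(self._rows)
                self._rows = None
        elif self._depth == 1 and tag in ("td", "th") and self._cell is not None:
            self._rows[-1].append("".join(self._cell).strip())
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def extract_tables(html):
    """Return every top-level table in `html` as a list of rows."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    return parser.tables


def table_features(rows):
    """Compute a few illustrative layout and content features."""
    cells = [cell for row in rows for cell in row]
    widths = [len(row) for row in rows]
    n_cells = len(cells)
    # A cell counts as numeric if it is digits with at most one dot.
    numeric = sum(1 for c in cells if c.replace(".", "", 1).isdigit())
    return {
        "n_rows": len(rows),                        # layout
        "max_cols": max(widths, default=0),         # layout
        "consistent_cols": len(set(widths)) <= 1,   # layout
        "avg_cell_len": sum(map(len, cells)) / n_cells if n_cells else 0.0,
        "numeric_ratio": numeric / n_cells if n_cells else 0.0,
    }
```

Each feature dictionary, paired with its manual label (genuine data table or not), would form one training instance for a classifier. Genuine data tables tend to have several rows, a consistent column count, and short cells, whereas layout tables often consist of a single large cell.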

Keywords/Search Tags:

Web Table, Data Mining, Nutch, Feature extraction, Information extraction

Related items

1	Research Of Feature Extraction Algorithm Based On Rough Set
2	Design And Implementation Of The Web Table Data Extraction And Analysis System
3	The Data Mining Research Based On Comment Website
4	Study And Implication On A Feature Extraction Model Of Data Mining
5	TableSeer: Automatic table extraction, search, and understanding
6	The Design And Implementation Of Information Extraction Engine On Web Tables
7	Research On The Web Structure Mining Algorithm Based On Nutch
8	Fault-based Spectral Entropy Feature Extraction And Data Mining Technology Research
9	Study On The Application Method Of Data Mining In Analyzing Technique And Tactics Of Table Tennis Match
10	Mining rules in single-table and multiple-table databases