Research And Implementation Of The Web Page Table Structure Recognition

Posted on:2007-07-25

Degree:Master

Type:Thesis

Country:China

Candidate:K Q Lin

Full Text:PDF

GTID:2208360185456637

Subject:Computer software and theory

Abstract/Summary:


The growing ubiquity of computers and the Internet has produced an ever-increasing volume of electronic documents. As a compact and efficient way to present relational information, tables are used frequently in web documents; according to one report, about 52% of HTML documents include tables. Extracting information from web tables has therefore become one of the most important subfields of Information Extraction (IE). Information extraction from web tables has been studied since the late 1990s, yet research on it is still at a preliminary stage. Current techniques follow two methodologies. One is based on wrappers, a traditional IE methodology that is inherently specific to a web source or domain and is rarely reusable. The other is based on table structure recognition, which is the main subject of this thesis. The thesis investigates a domain-independent methodology that adapts to more web sources when recognizing table structure, so that applications such as IE systems can understand the tables.

Structure recognition of web tables proceeds in two phases. The first phase filters out non-genuine tables, which are used not to display relational information but to create multi-column layouts for easier viewing. From extensive observation of tables in web pages, the features that best distinguish genuine tables from non-genuine ones are identified, and heuristic rules are derived from them. Algorithms based on these rules are designed and implemented, and their performance is evaluated with the IE metrics recall, precision, and F-measure. The results show that the algorithms perform no worse than other approaches.

The second phase analyzes the inner structure of the web table, and the thesis discusses this phase in depth. The key to recognizing table structure is to extract an abstract logical table from the many variants of table structure. The thesis presents a table analysis model, according to which the physical table, the abstract logical table and...
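
As an illustration of the evaluation metrics named in the abstract, the following minimal sketch computes precision, recall, and F-measure for a genuine-table filter. The function name, the table identifiers, and the use of the balanced F-measure (the harmonic mean of precision and recall) are assumptions made for this example; the abstract does not state how the thesis computes these values or what data it uses.

```python
def precision_recall_f_measure(predicted, actual):
    """Score a genuine-table filter.

    predicted: set of table ids the filter labelled as genuine (hypothetical ids)
    actual:    set of table ids that really are genuine (hand-labelled ground truth)
    Returns (precision, recall, f_measure), using the balanced F-measure.
    """
    true_positives = len(predicted & actual)
    # Precision: share of tables labelled genuine that really are genuine.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of genuine tables that the filter found.
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Illustrative data only: the filter keeps four tables, three of them correctly,
# and misses two of the five genuine tables.
predicted_genuine = {"t1", "t2", "t3", "t4"}
actually_genuine = {"t1", "t2", "t3", "t5", "t6"}
p, r, f = precision_recall_f_measure(predicted_genuine, actually_genuine)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Treating the labels as sets keeps the sketch independent of how tables are located in a page; any filter that yields a set of table identifiers could be scored the same way.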

Keywords/Search Tags:

HTML, Web table, Information Extraction, table location, structure recognition


Related items

1	Study On Tables Information Extraction Based On Web
2	Research Of Web Information Extraction Based On Table Structure
3	Research On The Technology Of The Web Employment Information Extraction Based On The HTML
4	Design And Implementation Of PDF Format Based Table Extraction Method
5	The Research And Implementation Of Table Recognition System Based On Deep Learning
6	Research And Implementation On Table Detection And Table Structure Recognition Method Based On Deep Learning
7	Method Of Entity Table Information Extraction In Web Page
8	Table Information Extraction Based On Web Structure
9	Research On Key Technologies Of Cross-document Table Fusion Based On Deep Learnin
10	Data Extraction And Integration In HTML Tables