
Research And Implementation Of The Web Page Table Structure Recognition

Posted on: 2007-07-25
Degree: Master
Type: Thesis
Country: China
Candidate: K Q Lin
GTID: 2208360185456637
Subject: Computer software and theory
Abstract/Summary:
The spread of computers and the Internet has produced a steadily growing volume of electronic documents. Tables are a compact and efficient way to present relational information and are used frequently in web documents; according to one report, about 52% of HTML documents contain tables. Extracting information from web tables has therefore become one of the most important subfields of Information Extraction (IE). Research on information extraction from web tables began in the late 1990s and is still at a preliminary stage. Current techniques follow two methodologies. The first is based on wrappers, the traditional approach in IE, which is inherently specific to a web source or domain and is rarely reusable. The second is based on table structure recognition, which is the main subject of this thesis. The thesis investigates a domain-independent methodology that can recognize table structure across a wider range of web sources.
Applications such as IE systems can then interpret the table correctly. Structure recognition of a web table proceeds in two phases. The first phase filters out non-genuine tables, which are used not to display relational information but to create multi-column layouts for easier viewing. From extensive observation of tables in web pages, the features that best distinguish genuine tables from non-genuine ones are identified, and heuristic rules are derived from them. Algorithms based on these rules are designed and implemented, and their performance is measured with the IE evaluation metrics of recall, precision and F-measure. The results show that the algorithms perform no worse than other approaches. The second phase, which this thesis discusses in depth, analyzes the inner structure of the web table. The key to recognizing table structure is to extract an abstract logical table from the many variants of table structure. This thesis presents a table analysis model, according to which the physical table, the abstract logical table and...
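The first phase described in the abstract can be illustrated with a minimal sketch. The abstract does not state the thesis's actual heuristic rules, so the rule used here (at least two rows, an average of at least two cells per row, and no nested tables, forms or images) is purely an assumption for illustration, and the names `TableStats`, `looks_genuine` and `precision_recall_f` are hypothetical. The evaluation function uses the standard definitions of precision, recall and F-measure.

```python
from html.parser import HTMLParser


class TableStats(HTMLParser):
    """Collects simple counts for the outermost table in a fragment."""

    def __init__(self):
        super().__init__()
        self.rows = 0          # <tr> elements of the outermost table
        self.cells = 0         # <td>/<th> elements of the outermost table
        self.layout_tags = 0   # nested tables, forms and images
        self.depth = 0         # current <table> nesting depth

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            if self.depth > 1:
                self.layout_tags += 1
        elif tag == "tr" and self.depth == 1:
            self.rows += 1
        elif tag in ("td", "th") and self.depth == 1:
            self.cells += 1
        elif tag in ("form", "img"):
            self.layout_tags += 1

    def handle_endtag(self, tag):
        if tag == "table":
            self.depth -= 1


def looks_genuine(table_html):
    """Assumed heuristic, not the thesis's rule set."""
    stats = TableStats()
    stats.feed(table_html)
    return (stats.rows >= 2
            and stats.cells >= 2 * stats.rows
            and stats.layout_tags == 0)


def precision_recall_f(predicted, actual):
    """Standard precision, recall and F-measure over boolean labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


data_table = ("<table><tr><th>Name</th><th>Age</th></tr>"
              "<tr><td>Ann</td><td>30</td></tr></table>")
layout_table = ("<table><tr><td><table><tr><td>menu</td></tr></table></td>"
                "<td><img src='x.png'></td></tr></table>")
predictions = [looks_genuine(data_table), looks_genuine(layout_table)]
```

A real rule set would presumably weigh many more features, such as cell content types and consistency across rows and columns. The sketch only shows how a rule-based filter and its evaluation fit together.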
Keywords/Search Tags: HTML, Web table, Information Extraction, table location, structure recognition