The Implementation And Application Of Extracting Structured Data From Web Pages

Posted on:2008-01-03

Degree:Master

Type:Thesis

Country:China

Candidate:T Wang

GTID:2178360245493118

Subject:Computer application technology

Abstract/Summary:

The rapid growth of the Internet is steadily increasing the volume of structured data on the Web. However, most of this information is presented as semi-structured HTML documents, which makes it hard for applications to use. As a result, Web information extraction, a technique for automatically obtaining structured data from HTML documents, has become an active research topic. A considerable body of research already addresses how to extract information from Web pages, and a wide range of Web information extraction techniques exists, based on different principles.

In this thesis, we present two methods for extracting structured data from specific types of Web pages.

The first method handles pages generated from the same template. It takes as input a set of template-generated pages, deduces the unknown template used to generate them, and builds wrappers that extract the values encoded in the pages. The algorithm presented in this thesis, EXALG_tju, is based on the EXALG algorithm and deduces the unknown template layer by layer on the DOM tree of the HTML documents.

The second method extracts information from semi-structured text in Web pages within a specific domain. It extracts the structured data using rules and a manually built dictionary tailored to the features of the text, and automatically annotates the extracted data using a semantic technique called field-name fuzzy matching.

Our experiments on the two methods indicate that the first correctly extracts structured data from pages in most cases, while the second needs to be combined with other techniques to produce satisfactory results.

Finally, a vertical Web search engine focused on information about innovative technologies is introduced as an application of the two Web information extraction methods.
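The idea behind template deduction can be illustrated with a small Python sketch. It is not the EXALG_tju algorithm from the thesis: it works on a flat token stream rather than layer by layer on the DOM tree, and the function names (`tokenize`, `equivalence_classes`, `template_tokens`, `extract_values`) and the sample pages are assumptions made for illustration. Following the general EXALG idea, tokens that share the same occurrence counts across all input pages are grouped into equivalence classes, classes present in every page are treated as template, and the remaining text is treated as data.

```python
import re
from collections import defaultdict


def tokenize(page):
    """Split an HTML string into tag tokens and whitespace-separated words."""
    return re.findall(r"<[^>]+>|[^<\s]+", page)


def equivalence_classes(pages):
    """Group tokens by occurrence vector (count in page 0, page 1, ...)."""
    vectors = defaultdict(lambda: [0] * len(pages))
    for i, page in enumerate(pages):
        for tok in tokenize(page):
            vectors[tok][i] += 1
    classes = defaultdict(list)
    for tok, vec in vectors.items():
        classes[tuple(vec)].append(tok)
    return dict(classes)


def template_tokens(pages):
    """Treat tokens that occur in every page as template tokens.

    Simplifying assumption: a data word that happens to appear in
    every page would be misclassified as template.
    """
    result = set()
    for vec, toks in equivalence_classes(pages).items():
        if all(count > 0 for count in vec):
            result.update(toks)
    return result


def extract_values(page, template):
    """Return the runs of non-template tokens found between template tokens."""
    values, current = [], []
    for tok in tokenize(page):
        if tok in template:
            if current:
                values.append(" ".join(current))
                current = []
        else:
            current.append(tok)
    if current:
        values.append(" ".join(current))
    return values


# Hypothetical pages generated from one template.
pages = [
    "<html><b>Title:</b> Dune <b>Price:</b> 9.99 </html>",
    "<html><b>Title:</b> Emma <b>Price:</b> 4.50 </html>",
]
template = template_tokens(pages)
for page in pages:
    print(extract_values(page, template))
# ['Dune', '9.99']
# ['Emma', '4.50']
```

With only two sample pages the split between template and data is fragile; a larger set of pages makes the occurrence vectors more discriminating.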

Keywords/Search Tags:

HTML page, structured data, information extraction, template, equivalence class

Related items

1	Research And Implementation Of Fit-Template System Based On Mas
2	Design And Implementation Of A Conventional Template About Page Extraction
3	Research On Web-based Full-station Data Information Extraction Based On Template
4	Data Extraction And Integration In HTML Tables
5	ClusTex: Using clustering techniques for information extraction from HTML pages containing semi-structured data
6	Research And Application Of Automatic Data Extraction From Template-generated Web Pages
7	Research On Product Attribute Extraction From Semi-structured Web Pages
8	Research On Web Page Classification And Information Collection
9	Research On Data Acquisition And Information Extraction Technology For Dynamic Web Applications
10	The Research Of Semi-structured Web Pages Information Extraction