
The Implementation And Application Of Extracting Structured Data From Web Pages

Posted on: 2008-01-03
Degree: Master
Type: Thesis
Country: China
Candidate: T Wang
GTID: 2178360245493118
Subject: Computer application technology
Abstract/Summary:
The rapid growth of the Internet is steadily increasing the volume of structured data available on the Web. However, most of this information is presented as semi-structured HTML documents, which makes it hard for applications to use. Web information extraction, a technique for automatically obtaining structured data from HTML documents, has therefore become an active research topic. A great deal of research has addressed how to extract information from Web pages, and a wide range of extraction techniques based on different principles now exists.

In this paper, we present two methods for extracting structured data from specific types of Web pages.

The first method handles pages generated from the same template. It takes a set of template-generated pages as input, deduces the unknown template used to generate them, and builds wrappers that extract the values encoded in the pages. The algorithm presented in this paper, EXALG_tju, is based on the EXALG algorithm and deduces the unknown template layer by layer on the DOM tree of the HTML documents.

The second method extracts information from semi-structured text in Web pages of a specific domain. It extracts structured data from the text using rules and a hand-built dictionary tailored to the features of the text, and automatically annotates the extracted data using a semantic technique called field-name fuzzy matching.

Our experiments on the two methods indicate that the first correctly extracts structured data from pages in most cases, while the second needs to be combined with other techniques to produce satisfactory results.

Finally, a vertical Web search engine oriented toward information on innovative technologies is introduced as an application of the two Web information extraction methods.
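
The idea behind the first method, separating a shared template from the values it wraps, can be illustrated with a very small sketch. This is not EXALG or EXALG_tju: it is a much simpler stand-in that diffs the token sequences of two pages and treats whatever differs as data. The function names `tokenize` and `extract_values` and the two sample pages are hypothetical, chosen only for illustration.

```python
import re
from difflib import SequenceMatcher


def tokenize(html):
    """Split HTML into tag tokens and text tokens, dropping empty pieces."""
    pieces = re.findall(r"<[^>]+>|[^<]+", html)
    return [p.strip() for p in pieces if p.strip()]


def extract_values(page_a, page_b):
    """Return the tokens of each page that are NOT shared with the other.

    Assumption: both pages come from the same template, so tokens that
    align and match are template, and tokens that differ are data values.
    """
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    values_a, values_b = [], []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # replace / insert / delete: page-specific content
            values_a.extend(a[i1:i2])
            values_b.extend(b[j1:j2])
    return values_a, values_b


# Two hypothetical pages generated from one template.
page1 = ("<html><body><h1>Book</h1><p>Title: <b>Dune</b></p>"
         "<p>Price: <b>9.99</b></p></body></html>")
page2 = ("<html><body><h1>Book</h1><p>Title: <b>Emma</b></p>"
         "<p>Price: <b>4.50</b></p></body></html>")

print(extract_values(page1, page2))
# (['Dune', '9.99'], ['Emma', '4.50'])
```

A pairwise diff like this breaks down on optional or repeated template sections, which is part of why template-deduction algorithms work over many pages and over document structure rather than over a flat token diff.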
Keywords/Search Tags: HTML page, structured data, information extraction, template, equivalence class