Research on Deep Web Data Extraction Based on Tree Structures

Posted on: 2008-06-02; Degree: Master; Type: Thesis
Country: China; Candidate: X Wang; Full Text: PDF
GTID: 2208360212486533; Subject: Computer application technology
Abstract/Summary:
With the growth of the Internet, the number of Web sites and Web pages has exploded, and a huge amount of information is now available online. Because pages are built by different developers, their form and content differ, which makes Web data heterogeneous. For this reason, automatically acquiring useful information and data is a very challenging task.

Traditional Web search engines usually find only static Web pages, which are in fact a small portion of the pages on the Internet. A great deal of information cannot be found by traditional search engines; this part of the Web is the Deep Web. The Deep Web usually refers to the part of the Web that such engines cannot reach, in particular dynamically generated pages. To reach it, forms must be submitted and the relevant information extracted automatically from the returned result pages. How to make effective use of these hidden information resources has become a problem worthy of study.

This thesis uses the tag tree to purify sample pages, generate extraction rules, and extract information from target pages. Tidy is applied to transform each HTML document into an XHTML document. Based on the XHTML documents, the location of the data is found by comparing the tag trees of similar pages, and extraction rules for the target pages are then generated. The matching function is applied iteratively, starting from the root of the tag tree. The wrapper is constructed with XSLT to carry out the information extraction.

In summary, this thesis applies tag-tree matching to find data items and extract data. The extraction rules serve as parameters for extracting information from target pages, and the results are stored in an XML document.
Keywords/Search Tags:Deep Web, data extraction, Tag tree, XML, XSLT
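
As a rough illustration of the tag-tree comparison described in the abstract, the following Python sketch compares two structurally similar sample pages, records the positions where their text differs as extraction rules, and applies those rules to a target page, storing the result as XML. It is only a sketch under stated assumptions, not the thesis's implementation: the pages are assumed to be already well-formed XHTML (the Tidy step is skipped), the rules are plain child-index paths rather than an XSLT wrapper, and the names `diff_paths`, `extract`, and the sample pages are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample pages, assumed to be already cleaned into well-formed XHTML.
PAGE_A = "<html><body><h1>Book Store</h1><div><span>Title A</span><b>10.00</b></div></body></html>"
PAGE_B = "<html><body><h1>Book Store</h1><div><span>Title B</span><b>12.50</b></div></body></html>"
TARGET = "<html><body><h1>Book Store</h1><div><span>Title C</span><b>8.75</b></div></body></html>"


def diff_paths(a, b, path=()):
    """Walk two tag trees from the root and yield the child-index path of
    every element whose text differs between the two pages."""
    if a.tag != b.tag:
        return  # structure diverges; this sketch simply skips the subtree
    if (a.text or "").strip() != (b.text or "").strip():
        yield path  # differing text is treated as a data item
    for i, (child_a, child_b) in enumerate(zip(a, b)):
        yield from diff_paths(child_a, child_b, path + (i,))


def extract(root, paths):
    """Apply the extraction rules (index paths) to a target tree and
    collect the text found there in a new XML element."""
    out = ET.Element("records")
    for p in paths:
        node = root
        for i in p:
            node = node[i]  # raises IndexError if the target lacks this position
        item = ET.SubElement(out, "item", {"tag": node.tag})
        item.text = (node.text or "").strip()
    return out


# Generate rules from the two sample pages, then use them on the target page.
rules = list(diff_paths(ET.fromstring(PAGE_A), ET.fromstring(PAGE_B)))
result = extract(ET.fromstring(TARGET), rules)
print(ET.tostring(result, encoding="unicode"))
```

Text shared by both sample pages (here the `h1` heading) is treated as template content and ignored, while positions whose text varies are treated as data. A real wrapper would also need to handle repeated records and pages whose structure differs, which this sketch does not attempt.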