Semi-structured Web Information Extraction Technology And Its Application

Posted on: 2005-09-26  Degree: Master  Type: Thesis
Country: China  Candidate: S M Dong  Full Text: PDF
GTID: 2208360152966906  Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the Web has become a huge, distributed, and shared information resource. Most Web data, however, is represented in HTML, which applications cannot parse directly, so the data is not readily available to them. Web information extraction technology emerged to address this problem. After discussing general solutions for Web data extraction, we focus on two topics: the implementation of Web data extraction and the execution of ETL scripts. For the first, we introduce an algorithm driven by extraction rules. The algorithm first acquires the Web pages specified by the extraction rule, then uses HTML Tidy to transform the HTML data into a well-formed XML document, uses XMLParser to obtain the DOM tree of that document, and finally maps the data of interest, obtained according to the mapping rule, to the target schema. For the second, the author completed the core modules of ETL execution, namely EXTRACTOR and TRANSFORMER. The two modules receive an ETL script, then parse and execute it to accomplish the extraction or transformation tasks. This thesis integrates Web data extraction technology with ETL technology. Using the extraction and transformation functions provided by the ETL tool makes the data extracted from the Web better suited to users' needs, and provides a valuable tool for making the huge volume of Web data more available.
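
The rule-driven pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the page fetch and the HTML Tidy step are skipped (a hard-coded, already well-formed fragment stands in for Tidy's output), the standard library's `xml.etree.ElementTree` stands in for XMLParser and the DOM tree, and the rule format (`record_path`, `mapping`), the sample data, and the function names `extract` and `transform` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Stand-in for the output of HTML Tidy: a well-formed XHTML fragment.
# (Hypothetical sample data; a real run would fetch and tidy a page first.)
TIDIED_PAGE = """<html><body>
<table id="books">
  <tr><td>Web Mining</td><td>39.50</td></tr>
  <tr><td>XML Basics</td><td>24.00</td></tr>
</table>
</body></html>"""

# Hypothetical extraction rule: where the records are, plus a mapping rule
# from each target-schema field to a path relative to the record node.
RULE = {
    "record_path": ".//table[@id='books']/tr",
    "mapping": {"title": "td[1]", "price": "td[2]"},
}


def extract(xml_text, rule):
    """Parse the document into a tree and map the data of interest to the target schema."""
    root = ET.fromstring(xml_text)
    records = []
    for node in root.findall(rule["record_path"]):
        record = {}
        for field, path in rule["mapping"].items():
            cell = node.find(path)
            has_text = cell is not None and cell.text
            record[field] = cell.text.strip() if has_text else None
        records.append(record)
    return records


def transform(records):
    """A toy transformation step: convert the extracted price strings to floats."""
    return [{**r, "price": float(r["price"])} for r in records]


if __name__ == "__main__":
    for row in transform(extract(TIDIED_PAGE, RULE)):
        print(row)
```

Keeping the rule as data, separate from the code that interprets it, mirrors the abstract's split between extraction rules, mapping rules, and the modules that execute them; supporting a new page then means writing a new rule, not new code.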
Keywords/Search Tags: Web, HTTP, HTML, XML, XPath, ETL, DOM, Information Extraction, Extraction Rule, Mapping Rule