
Web Data Extraction Technology And Application

Posted on: 2013-01-04
Degree: Master
Type: Thesis
Country: China
Candidate: Q Xu
Full Text: PDF
GTID: 2218330371454534
Subject: Computer technology
Abstract/Summary:
Online information resources are growing explosively with the rapid development of the Internet, and obtaining the needed information quickly and efficiently has become an important problem. Much of the useful information on the Web is presented in HTML pages, where the data are semi-structured or unstructured. Web data extraction is the technology for extracting structured data from such pages.

After introducing the background and development history of Web data extraction, this dissertation describes its basic principles and the main existing extraction methods, focusing on methods based on HTML structural analysis and on the main ways of generating data extraction rules. Extraction by XPath absolute path and relative path, and location by anchor, are studied in detail, and the application scope and drawbacks of these methods are given.

A Web data extraction method based on XPath and regular expressions is proposed. Building on the above analysis, it combines the advantages of existing XPath methods, the anchor method, and regular expressions. The method uses regular expressions to locate anchors, which determine the base location of a data block. Data within the block are then matched using XPath relative paths, and individual data items are matched precisely using regular expressions.

To verify the effectiveness of the method, experiments are carried out and comparative test results are given.

A commodity price comparison website is designed and implemented using the proposed method based on XPath and regular expressions.

Its application in concrete projects shows that the method achieves a good balance between the automation of rule generation and the accuracy of the extracted data, and that it has good adaptability and maintainability.
Keywords/Search Tags:Data Extraction, XPath, Regular Expression, Anchor, Price Comparison
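
The three steps summarized in the abstract (a regular expression locates the anchor, an XPath relative path selects records inside the anchored block, and a regular expression matches each data item) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the sample page, the anchor text, the relative paths, the price pattern, and the function name `extract` are all hypothetical. It also assumes well-formed markup so that Python's standard-library `xml.etree.ElementTree` (which supports only a subset of XPath) can parse it; real-world HTML would likely need a more tolerant parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (not taken from the thesis).
PAGE = """
<html><body>
  <div id="nav"><span>Home</span></div>
  <div>
    <h2>Today's Offers</h2>
    <ul>
      <li><a href="/p/1">USB cable</a><span>Price: $3.50</span></li>
      <li><a href="/p/2">Keyboard</a><span>Price: $21.00</span></li>
    </ul>
  </div>
</body></html>
"""

ANCHOR_RE = re.compile(r"Today's\s+Offers")      # step 1: anchor text (assumed)
ITEM_XPATH = "./ul/li"                           # step 2: path relative to the block (assumed)
PRICE_RE = re.compile(r"\$(\d+(?:\.\d{2})?)")    # step 3: data item pattern (assumed)


def extract(page):
    """Return (name, price) pairs found in the block located by the anchor."""
    root = ET.fromstring(page)
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    # Step 1: the first element whose text matches the anchor regex
    # fixes the base location; its parent is treated as the data block.
    anchor = next(
        (el for el in root.iter() if el.text and ANCHOR_RE.search(el.text)),
        None,
    )
    if anchor is None:
        return []
    block = parent_of.get(anchor, root)

    records = []
    # Step 2: select records with a path relative to the block,
    # so layout changes outside the block do not affect the rule.
    for item in block.findall(ITEM_XPATH):
        name = item.findtext("./a", default="").strip()
        # Step 3: match the data item inside the record with a regex.
        match = PRICE_RE.search(item.findtext("./span", default=""))
        if match:
            records.append((name, float(match.group(1))))
    return records


if __name__ == "__main__":
    for name, price in extract(PAGE):
        print(name, price)
```

In this sketch the rule depends only on the anchor text and on paths relative to the anchored block, which is one plausible reading of why such a combination would tolerate page changes better than an absolute XPath from the document root.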