Research And Implementation Of Web Information Extraction Based On XML

Posted on: 2009-02-23 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Y Xuan | Full Text: PDF
GTID: 2178360245954997 | Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet and the explosive growth of Web data, it is becoming increasingly difficult for users to obtain useful information from the Web. How to find accurate information quickly and efficiently has become an urgent problem, and Web information extraction technology has emerged in response. A program that extracts information from the Web is called a wrapper, and the main task in constructing a wrapper is to prepare extraction rules. As a result, preparing robust and flexible extraction rules has become a research focus of information extraction.

Various approaches to constructing wrappers have been proposed, but all of them have limitations in practice. With the continuous development of XML technology, XML plays an increasingly important role in Web information extraction. Based on a study of existing information extraction technology, this dissertation applies standard XML to Web information extraction and proposes a general XML-based Web information extraction solution. The main contributions of this dissertation are as follows:

1. A general Web information extraction system is designed and developed. With this system, users can extract the information they are interested in from HTML pages, and the extraction results are expressed in XML, which is well structured and extensible. The system is general and flexible: users can quickly customize Web information extraction wrappers for different domains.

2. An XML-based Web data conversion algorithm is proposed and implemented. This algorithm converts HTML to XHTML or XML effectively. It provides the technical support for cleaning HTML pages in the system and greatly simplifies the extraction work.

3. A DOM-based XPath generation algorithm is proposed and implemented. Information is located with XPath in this dissertation, but it is difficult to locate information points and prepare XPath expressions in an XHTML document; the XPath generation algorithm presented here solves this problem. With this algorithm integrated into the system, XPath expressions are obtained automatically when users mark the information points they are interested in.

4. XSLT is used as the description language for extraction rules, and XPath is used to locate the information to be extracted, which helps unify extraction patterns. For single-block information, extraction rules are generated automatically. For multi-block information, extraction rules are obtained by merging the templates once the XPath expressions of all nodes to be extracted have been collected. The extraction rules can also be optimized with a data location optimization method.

The approach to Web information extraction presented in this dissertation addresses the problem well, and the system can achieve relatively high precision and recall.
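To make the third contribution more concrete, the following is a minimal sketch of how an XPath expression might be derived from a marked node in a DOM tree. It is not the algorithm from the dissertation, whose details are not given in this abstract; the function names `build_parent_map` and `xpath_for`, the sample document, and the choice of purely positional steps are all assumptions made for illustration. The sample omits the XHTML namespace to keep the paths short.

```python
import xml.etree.ElementTree as ET


def build_parent_map(root):
    """Map every element to its parent (ElementTree keeps no parent links)."""
    return {child: parent for parent in root.iter() for child in parent}


def xpath_for(node, root, parents):
    """Return an absolute, positional XPath for `node` (hypothetical helper).

    Walks from the marked node up to the root and records, at each level,
    the 1-based position of the node among siblings with the same tag.
    """
    steps = []
    while node is not root:
        parent = parents[node]
        same_tag = [child for child in parent if child.tag == node.tag]
        position = next(i for i, child in enumerate(same_tag) if child is node) + 1
        steps.append(f"{node.tag}[{position}]")
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))


# Assumed sample page, already cleaned into well-formed markup.
SAMPLE = (
    "<html><body>"
    "<div><p>intro</p></div>"
    "<div><p>price</p><p>42</p></div>"
    "</body></html>"
)

root = ET.fromstring(SAMPLE)
parents = build_parent_map(root)

# Simulate the user marking the third <p> element (the one containing "42").
target = root.findall(".//p")[2]
path = xpath_for(target, root, parents)
print(path)
```

A purely positional path like this is easy to generate but brittle when the page layout changes, which is one reason a real system would likely want a further optimization step for data location, as the fourth contribution mentions.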
Keywords/Search Tags:Web information extraction, XML, DOM tree, XPath, extraction rules