Research on Web Information Extraction Techniques

Posted on: 2011-06-21
Degree: Master
Type: Thesis
Country: China
Candidate: H Wang
Full Text: PDF
GTID: 2178360305974534
Subject: Computer software and theory
Abstract/Summary:
Over the past few decades, the Web has grown rapidly into the largest public data source in the world and has become an essential part of work and daily life. Because Web data covers a wide range of themes and is highly diverse in content, users can find almost any information on the Internet. The Web holds various types of data, such as structured forms, semi-structured Web pages, unstructured text and multimedia files. This information is heterogeneous and contains noisy data. It is therefore worth studying how to automatically extract useful information from the relevant pages of Web sites while avoiding interference from noisy data, which would give users a convenient and efficient platform for information search. Web information extraction technology arose to meet this need.
The core of Web information extraction technology is wrapper construction, namely the generation of extraction rules. A variety of methods for generating extraction rules have emerged, but they all have limitations in accuracy, robustness or versatility and struggle to meet demanding requirements. Meanwhile, the advantages of XML technology have become increasingly apparent as the Internet has developed. Owing to the independent nature of XML data, content is separated from presentation, and because XML documents are structured, they can easily be handled by database applications. In this thesis, we propose a Web information extraction technique based on standard XML. It uses standard XML-related technologies to extract the information of interest to the user from source HTML documents and to give quick access to the required information. A system based on this technique is easy to maintain and has good scalability.
The specific innovations are as follows.
(1) Taking advantage of the strength of XSLT in document conversion, we combine it with XPath to generate extraction rules and describe those rules in the XSLT language, which readily yields a unified extraction mode. The generated rules are easy to modify and maintain, which reduces the difficulty of information extraction and improves extraction efficiency.
(2) An XPath generation algorithm based on the DOM tree structure is designed and implemented. It traverses the DOM tree depth-first to locate information points quickly, and effectively handles the problem of positioning the information points to be extracted.
Experiments on multiple Web sites showed that the approach extracts data records from similar Web pages with an accuracy of over 90%, which would better meet the data precision requirements of many real-world applications.
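The idea of generating XPath expressions by a depth-first walk over the DOM tree can be illustrated with a minimal sketch. This is not the algorithm from the thesis, whose details are not given in the abstract. It assumes that the page has already been cleaned into well-formed XML, uses Python's standard `xml.etree.ElementTree` module, and introduces hypothetical names (`generate_xpaths`, `resolve`, the `is_target` predicate and the sample page) purely for illustration.

```python
import xml.etree.ElementTree as ET


def generate_xpaths(root, is_target):
    """Walk the tree depth-first and return a positional XPath for every
    element accepted by is_target (a hypothetical user-supplied predicate)."""
    results = []

    def visit(node, path):
        if is_target(node):
            results.append(path)
        # Count same-tag siblings so each step gets a 1-based position.
        seen = {}
        for child in node:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            visit(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    visit(root, f"/{root.tag}")
    return results


def resolve(root, path):
    """Evaluate a generated path with ElementTree's limited XPath support,
    which needs a path relative to the root element."""
    relative = "." + path[len(f"/{root.tag}"):]
    return root if relative == "." else root.find(relative)


# Illustrative sample page (assumed to be well-formed XML already).
PAGE = """<html><body>
<div class="nav"><a>Home</a></div>
<div class="list">
  <p class="title">First record</p>
  <p class="title">Second record</p>
</div>
</body></html>"""

root = ET.fromstring(PAGE)
# Assumption: the information points are the elements with class="title".
paths = generate_xpaths(root, lambda e: e.get("class") == "title")
for p in paths:
    print(p, "->", resolve(root, p).text)
```

A path produced this way could then serve as the `select` expression of an XSLT instruction such as `xsl:value-of`, which is one plausible reading of how the XPath and XSLT parts of the approach fit together. Positional paths are simple to generate but sensitive to layout changes, so a real wrapper would likely need something more robust.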
Keywords/Search Tags: Web information extraction, extraction rule, DOM tree, XSLT, XPath