Research Of Web Information Extraction Based On Tree Structure

Posted on: 2008-08-13
Degree: Master
Type: Thesis
Country: China
Candidate: Z S Ren
GTID: 2178360242979323
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, the Web has become a vast, distributed, and shared information resource. Most Web data is in the form of HTML. Because HTML pages are semi-structured, they are easy for humans to browse but difficult for applications to process and reuse. Web information extraction technology emerged to make Web data more available and to enable more value-added services: it wraps Web resources, extracts semi-structured data, and supports applications that use Web data. Research on Web information extraction is therefore one of the most active areas in the database field and has a promising future.

In this thesis, we first briefly introduce the basic concepts of Web information extraction and the development of the technology. We then define the kind of Web pages that our algorithm handles.

Second, we describe, compare, and analyze in detail several Web information extraction methods in common use, pointing out the advantages and disadvantages of each. We also discuss future directions for research and development in Web information extraction.

Finally, in view of the inadequacies of existing methods, we propose a tree-structure-based Web data extraction algorithm. It consists of four component algorithms: HTML tree construction, data region mining, data record mining, and record schema generation. The algorithm cleans Web pages using the position information of page elements, mines data regions by hierarchical clustering, and generates the record schema and completes data item extraction through tree matching. Theoretical analysis and experimental results show that the algorithm can improve the accuracy and efficiency of Web data extraction.
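To make the tree construction and tree matching steps concrete, the following is a minimal sketch and not the thesis's own implementation. It assumes that a tag tree built with Python's standard `html.parser` is an acceptable stand-in for the HTML tree, and it swaps in the well-known Simple Tree Matching recurrence as the matching step, since the abstract does not say which matching algorithm is used. The names `Node`, `TreeBuilder`, `build_tree`, `simple_tree_matching`, and `similarity` are hypothetical, and the page-cleaning and hierarchical-clustering steps are omitted.

```python
from html.parser import HTMLParser


class Node:
    """One element of the tag tree; text content is ignored in this sketch."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []


# Elements that never have a closing tag (assumed subset, not exhaustive).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TreeBuilder(HTMLParser):
    """Builds a tag tree under a synthetic 'root' node."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; tolerate unclosed tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def simple_tree_matching(a, b):
    """Size of the maximum order-preserving match between two tag trees."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = dp[i - 1][j - 1] + simple_tree_matching(
                a.children[i - 1], b.children[j - 1]
            )
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], paired)
    return dp[m][n] + 1  # +1 for the matched roots


def size(node):
    return 1 + sum(size(child) for child in node.children)


def similarity(a, b):
    """Match size normalized by the average size of the two subtrees."""
    return simple_tree_matching(a, b) / ((size(a) + size(b)) / 2)


page = "<ul><li><b>A</b><i>1</i></li><li><b>B</b><i>2</i></li><li><b>C</b></li></ul>"
records = build_tree(page).children[0].children
print(similarity(records[0], records[1]))
print(similarity(records[0], records[2]))
```

In a sketch like this, sibling subtrees with high pairwise similarity would be candidates for data records in the same data region, and the aligned nodes of matched records would suggest a shared record schema. How the thesis actually thresholds or clusters these scores is not stated in the abstract.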
Keywords/Search Tags: Web data extraction, Web mining, information extraction