Research Of Web Information Extraction Based On Tree Structure

Posted on:2008-08-13

Degree:Master

Type:Thesis

Country:China

Candidate:Z S Ren

Full Text:PDF

GTID:2178360242979323

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of the Internet, the Web has become a vast, distributed, and shared information resource. Most Web data is published as HTML. Because HTML pages are semi-structured, they are easy for people to browse but difficult for applications to process and reuse. Web information extraction technology emerged to make Web data more usable and to support value-added services: it wraps Web resources, extracts semi-structured data, and supplies that data to applications. Web information extraction is therefore one of the most active research areas in the database field and has a promising future.

In this thesis, we first briefly introduce the basic concepts of Web information extraction and the development of the technology, and then define the kind of Web pages that our algorithm targets.

Second, we describe, compare, and analyze in detail several Web information extraction methods in common use, pointing out the advantages and disadvantages of each. We also discuss future directions for research and development in Web information extraction.

Finally, in view of the inadequacies of existing methods, we propose a tree-structure-based Web data extraction algorithm. It consists of four parts: HTML tree construction, data region mining, data record mining, and record schema generation. The algorithm cleans Web pages using the position information of page elements, mines data regions by hierarchical clustering, and generates the record schema through tree matching, which completes data item extraction. Theoretical analysis and experimental results show that the algorithm improves the accuracy and efficiency of Web data extraction.
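
To make the tree-construction and tree-matching steps more concrete, here is a minimal Python sketch. It is not the thesis's implementation. The abstract does not say which matching measure is used, so the well-known Simple Tree Matching algorithm is substituted as a stand-in. The names Node, TreeBuilder, build_tree, simple_tree_matching, and similarity are hypothetical and chosen only for this illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the tag tree (text content is ignored in this sketch)."""
    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tag tree from HTML using only the standard library."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: add the node, open nothing.
        self.stack[-1].children.append(Node(tag))

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def build_tree(html):
    parser = TreeBuilder()
    parser.feed(html)
    parser.close()
    return parser.root


def simple_tree_matching(a, b):
    """Size of the largest order-preserving match between two tag trees."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            both = dp[i - 1][j - 1] + simple_tree_matching(
                a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], both)
    return dp[m][n] + 1  # +1 for the matched roots


def size(node):
    return 1 + sum(size(child) for child in node.children)


def similarity(a, b):
    """Match size normalised by the average size of the two trees (0.0 to 1.0)."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))
```

Under these assumptions, sibling subtrees with high pairwise similarity are candidate data records, and a hierarchical clustering step could group them using 1 - similarity as the distance.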

Keywords/Search Tags:

Web data extraction, Web mining, information extraction

Related items

1	The Design And Implementation Of Web Information Extraction System
2	Related Studies On Information Extraction And Information Recommendation Based On Web Data Mining
3	Research Of Web Information Extraction Based On Tree Structure
4	The Design And Implementation Of Web Information Extraction System
5	The Design And Implementation Of Information Extraction Engine On Web Tables
6	Automatic Ranking List Extraction From Web Pages Based On Visual And Semantic Information
7	Internet-based Intelligent Information Mining System Modeling And Key Technologies
8	Research And Implementation Of Data Extraction Oriented To Knowledge Graph
9	Study On Key Technology Of GHMM-Based Web Text Information Extraction And System Design
10	Multi-Users Online Visual Data Mining System