Research Of Data Extraction Technology Based On Tag Tree From List Pages

Posted on:2012-02-14

Degree:Master

Type:Thesis

Country:China

Candidate:H X Jing

GTID:2178330335965570

Subject:Computer application technology

Abstract/Summary:

With the rapid development of the Internet, the web has become a huge, shared, and distributed collection of information resources. How to automatically obtain data records of interest or other useful information from these vast resources has become a widely discussed topic. Today, most web data is presented in unstructured or semi-structured form: it lacks a description of the data itself and carries no explicit semantics or fixed template, so application programs cannot parse and use it directly. To avoid the predicament of "abundant data, scarce knowledge", to extract structured, topic-relevant data from the vast body of semi-structured web data, and to provide value-added services to users (e.g., monitoring the stock market in real time, comparing the prices of goods offered by different websites, tracking the activities of partners and competitors, and integrating an enterprise's internal and external information), a variety of web data extraction technologies have emerged and are playing an increasingly important role.
Web data extraction technology therefore has clear advantages and broad prospects. It applies data extraction, artificial intelligence, information retrieval, and natural language understanding techniques to web information processing, and it is one of the most active topics across several research fields.

This thesis addresses template-generated list pages: how to detect their common template, extract the embedded data, and thereby obtain the data in list pages automatically. First, we briefly introduce the concepts of semi-structured data, web data extraction, and list pages. Second, we survey the development of web data extraction and the existing techniques, compare their advantages and disadvantages, and outline directions for future work. Third, we describe in detail automatic web data extraction based on tree alignment, which we presented in an earlier academic publication. It is the foundation and core of the tag-tree-based data extraction system for list pages proposed in this thesis. We implemented this algorithm and improved the processing steps before and after tree alignment, yielding a complete web data extraction system. Finally, we describe in detail the design, implementation, and experimental evaluation of this system. It constructs the tag tree, mines the primary data region, identifies data records, and generates the record schema, narrowing the target region step by step in order to extract the data records of interest or other useful information. Experimental results show that the system handles list pages effectively, extracts their data, meets a broad range of practical needs, and is valuable enough in application to merit wider adoption.
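The pipeline named in the abstract (build a tag tree, locate the primary data region, identify data records) can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes the well-known Simple Tree Matching measure as a stand-in for the tree-alignment step, picks the data region as the node whose adjacent children are most similar to each other, and skips schema generation. All names (`Node`, `TagTreeBuilder`, `find_data_region`, `extract_records`, `SAMPLE_HTML`) and the 0.6 threshold are hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # tags with no closing tag


class Node:
    """One element of the tag tree (assumed structure, not from the thesis)."""
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], ""

    def size(self):
        return 1 + sum(c.size() for c in self.children)


class TagTreeBuilder(HTMLParser):
    """Step 1: construct the tag tree from raw HTML."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text += data.strip()


def simple_tree_matching(a, b):
    """Number of matched node pairs in the best order-preserving top-down mapping."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            table[i][j] = max(table[i - 1][j], table[i][j - 1],
                              table[i - 1][j - 1] + pair)
    return table[m][n] + 1


def similarity(a, b):
    """Normalize the match count to the range [0, 1]."""
    return 2 * simple_tree_matching(a, b) / (a.size() + b.size())


def find_data_region(root, threshold=0.6):
    """Step 2: pick the node whose adjacent children look most alike."""
    best, best_score = None, 0.0
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        sims = (similarity(x, y) for x, y in zip(node.children, node.children[1:]))
        score = sum(s for s in sims if s >= threshold)
        if score > best_score:
            best, best_score = node, score
    return best


def leaf_texts(node):
    out = [node.text] if node.text else []
    for child in node.children:
        out.extend(leaf_texts(child))
    return out


def extract_records(html):
    """Step 3: treat each child of the data region as one data record."""
    builder = TagTreeBuilder()
    builder.feed(html)
    region = find_data_region(builder.root)
    return [leaf_texts(child) for child in region.children] if region else []


SAMPLE_HTML = """<html><body><h1>Books</h1>
<ul>
<li><a href="/a">Alpha</a><span>10</span></li>
<li><a href="/b">Beta</a><span>12</span></li>
<li><a href="/c">Gamma</a><span>9</span></li>
</ul>
<p>footer</p></body></html>"""

records = extract_records(SAMPLE_HTML)
print(records)
```

In this toy page the `ul` element is chosen as the data region because its three `li` subtrees match each other exactly, while the heading and footer do not resemble their neighbours. A real list page would need more care, for example handling records that span several sibling nodes and ignoring runs of identical leaf tags.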

Keywords/Search Tags:

Web Data Extraction, Web Data Mining, Wrapper, List Page, Tag Tree Matching

Related items

1	Structure Information Extraction- Study And Implementation On Semi-auto Wrapper
2	Research On Data Extraction Of Deep Web Based On Visual Information And Tree Match
3	Research On Mining Structure Of WEB Page For Information Extraction
4	Research On Web Data Extraction Based On Web Page Structure
5	Research Of A Suffix Tree Based Automatic Wrapper Generation Method
6	Study On Automatic Extraction Of Web Data Based On DOM
7	Research On Deep Web Data Acquisition Based On Visual Information And DOM Tree
8	Research And Implementation Of Page Object Extraction Model For Vertical Search Engine
9	Web Page Attribute Extraction Method Research
10	Research And Implementation Of Chinese Web-page Classification Based On Web Data-mining