Research Of Data Extraction Technology Based On Tag Tree From List Pages

Posted on: 2012-02-14 Degree: Master Type: Thesis
Country: China Candidate: H X Jing Full Text: PDF
GTID: 2178330335965570 Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the web has become a huge, shared, and distributed collection of information resources. How to automatically obtain data records of interest or other useful information from these vast resources has become a widely discussed topic. Today, most web data is presented in unstructured or semi-structured form: it lacks a description of the data itself and has no explicit semantics or fixed template, so application programs cannot parse and use it directly. To avoid the predicament of "abundant data, scarce knowledge", to extract structured, topic-relevant data from the vast body of semi-structured web data, and to provide value-added services to users (e.g., monitoring the stock market in real time, comparing the prices of goods offered by different websites, following the activities of partners and competitors, and integrating an enterprise's internal and external information), various web data extraction technologies have emerged and play an increasingly important role.
Web data extraction technology therefore has distinct advantages and broad prospects. It applies data extraction, artificial intelligence, information retrieval, and natural language understanding to web information processing, and it is one of the most active research areas across several fields. In this paper, for template-generated list pages, we study how to detect their common template, extract the embedded data, and automatically obtain the data in list pages. First, we briefly introduce the concepts of semi-structured data, web data extraction, and list pages. Second, we survey the development and current state of web data extraction technology and, by comparison, present the advantages and disadvantages of existing techniques and the likely direction of future work. Third, we describe in detail automatic web data extraction based on tree alignment, as described in our earlier academic publication. It is the research foundation and core of the tag-tree-based data extraction system for list pages proposed in this paper. We implemented this algorithm and improved the processing steps before and after tree alignment to obtain a complete web data extraction system. Finally, we describe in detail the design, implementation, and experimental evaluation of this system. The system constructs a tag tree, mines the primary data region, identifies data records, and generates a record schema, narrowing the target region step by step to extract the data records of interest or other useful information. Experimental results show that the system handles list pages effectively, extracts their data, meets a broad range of practical needs, and has practical value worth promoting.
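As a rough illustration of the tag tree construction and tree comparison steps mentioned above, the following minimal sketch builds a tag tree from HTML and scores the structural similarity of two subtrees. It is not the system described in this thesis: it uses the well-known Simple Tree Matching recurrence as a stand-in for the tree alignment step, and all names (`Node`, `TagTreeBuilder`, `build_tag_tree`, `simple_tree_matching`, `similarity`) and the normalization formula are assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (a small illustrative subset, not exhaustive).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One element of the tag tree; text and attributes are ignored here."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree under a synthetic '#root' node."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tree_size(node):
    return 1 + sum(tree_size(child) for child in node.children)


def simple_tree_matching(a, b):
    """Number of node pairs in a maximum order-preserving top-down matching."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j],
                dp[i][j - 1],
                dp[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return dp[m][n] + 1  # +1 for the matched roots


def similarity(a, b):
    """Matching score normalized to [0, 1] (an assumed normalization)."""
    return 2 * simple_tree_matching(a, b) / (tree_size(a) + tree_size(b))


if __name__ == "__main__":
    page = (
        "<ul>"
        "<li><a>x</a><span>1</span></li>"
        "<li><a>y</a><span>2</span></li>"
        "<li><a>z</a></li>"
        "</ul>"
    )
    ul = build_tag_tree(page).children[0]
    first, second, third = ul.children
    print(similarity(first, second))  # identical structure
    print(similarity(first, third))   # third record lacks the <span>
```

In a sketch like this, sibling subtrees with high mutual similarity would be candidates for repeated data records within a data region; the threshold and the way regions are chosen are left open here, since the abstract does not specify them.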
Keywords/Search Tags: Web Data Extraction, Web Data Mining, Wrapper, List Page, Tag Tree Matching