Web Data Based On Semi-automatic Extraction Of Information Integration

Posted on:2011-11-09

Degree:Master

Type:Thesis

Country:China

Candidate:J L Wu

GTID:2208360302997032

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of Internet technology, the amount of information on the WEB has grown enormously, and how to make effective use of it has attracted increasing attention. WEB data source integration, however, challenges both traditional approaches to information integration: the data warehouse and the middleware. The middleware approach forwards the query terms to the websites at query time to find the information, which is not only inefficient but also makes the query results unpredictable. With the data warehouse approach, WEB site information changes constantly, so updating and maintaining the data becomes burdensome, and the database data cannot be used directly. To address these problems, this paper proposes an information integration architecture that combines the middleware approach with materialization: the Materialized Mediator Information Integration Framework, referred to as MMIIF. This architecture can effectively integrate traditional relational databases with WEB data sources. MMIIF answers queries using the middleware approach, which guarantees transparent data access and the autonomy of the underlying data sources. To improve the efficiency of access to WEB data sources, the data is first extracted by materialization and stored locally for users to access, and the system administrator can update the local data from the WEB according to actual demand. The paper analyzes the model integration and query processing of MMIIF in detail and introduces the design of the WEB data source wrapper and the database wrapper. Because MMIIF depends largely on the WEB data extractor to achieve materialization, the paper also analyzes current data extraction techniques in detail; each has its own advantages and disadvantages.
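The division of labour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MMIIF implementation: the class names `DatabaseWrapper`, `WebWrapper`, and `Mediator`, the `extract` callback, and the predicate-based query interface are all hypothetical. The sketch shows a mediator that queries every wrapper uniformly, while the WEB wrapper answers from a locally stored copy that is refreshed only on request.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, str]
Predicate = Callable[[Row], bool]


@dataclass
class DatabaseWrapper:
    """Hypothetical wrapper for a relational source; queried directly."""
    rows: List[Row]

    def query(self, predicate: Predicate) -> List[Row]:
        return [row for row in self.rows if predicate(row)]


@dataclass
class WebWrapper:
    """Hypothetical wrapper for a WEB source, answered from a local copy."""
    extract: Callable[[], List[Row]]  # stands in for the WEB data extractor
    local_copy: List[Row] = field(default_factory=list)

    def refresh(self) -> None:
        # Run on the administrator's demand, never as part of a user query.
        self.local_copy = self.extract()

    def query(self, predicate: Predicate) -> List[Row]:
        # No network access here: only the materialized data is read.
        return [row for row in self.local_copy if predicate(row)]


class Mediator:
    """Gives users one query interface over all wrapped sources."""

    def __init__(self, wrappers):
        self.wrappers = list(wrappers)

    def query(self, predicate: Predicate) -> List[Row]:
        results: List[Row] = []
        for wrapper in self.wrappers:
            results.extend(wrapper.query(predicate))
        return results
```

In this sketch a user query never triggers extraction, which is the point of materialization: the cost of visiting the WEB source is paid once per refresh rather than once per query.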
Fully automatic extraction requires less manual work and can extract from a large number of sites, but it often picks up a great deal of useless information that users are not interested in, and the extracted data lacks semantic information. Manual extraction is simple enough in principle, but working out the extraction rules is complex and tedious. After analyzing a large number of sites, this paper proposes a semi-automatic data extractor aimed at WEB sites made up of similar web pages. It finds similar pages through URL structure comparison and theme matching, uses XSLT as the extraction rule model, and interacts with users through a GUI to obtain the required data and its semantic information. As a result, the data extracted by the semi-automatic extractor is well structured and carries explicit semantic information. Finally, typical WEB e-commerce sites and portal websites are chosen as experimental data sources, and the standard data extraction measures of recall and precision are used to evaluate the extractor's performance. The experimental results indicate that the extractor achieves materialization well, turning WEB data queries into local queries.
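Two of the steps named above lend themselves to a short Python sketch. The abstract does not say how URL structures are compared, so `url_signature` and `similar_structure` below are assumptions: they treat two URLs as structurally similar when they share a host, a path pattern (with numeric segments generalized), and the same query parameter names. Theme matching and the XSLT rules are not modelled. `precision_recall` follows the standard definitions of the two measures; the helper name and the use of sets are illustrative choices.

```python
from urllib.parse import urlparse, parse_qs


def url_signature(url):
    """Reduce a URL to (host, path pattern, sorted query parameter names)."""
    parts = urlparse(url)
    # Numeric path segments are generalized so /item/123 matches /item/456.
    segments = tuple(
        "<num>" if segment.isdigit() else segment
        for segment in parts.path.split("/")
        if segment
    )
    keys = tuple(sorted(parse_qs(parts.query, keep_blank_values=True)))
    return parts.netloc, segments, keys


def similar_structure(url_a, url_b):
    """Assumed similarity test: identical signatures mean similar pages."""
    return url_signature(url_a) == url_signature(url_b)


def precision_recall(extracted, expected):
    """Precision = correct / extracted; recall = correct / expected."""
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall
```

For example, if an extractor returns three items of which two are among four expected items, `precision_recall` gives a precision of 2/3 and a recall of 1/2.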

Keywords/Search Tags:

information integration, WEB data source, materialization, semi-automatic data extractor

Related items

1	The Construct Of Generic Data Integration System And It's Application In Campus's Information Platform
2	Source Of Information In The Data Integration System Monitor Realization
3	Research On The Efficient Materialization And Fast Query Of Condensed Data Cube
4	Oil Depot Information System Data Integration And It's Applications
5	An Analysis Of Oil Depot Automation System Construction Of Marketing Company
6	Data Management And Integration For XML-Based Semi-Structured Data
7	Research Of Distributed Data Cube Partial Materialization Method Based On Genetic Algorithm
8	Design And Development Of Data Integration Middleware For Multi-source Heterogeneous Data
9	Research And Implementation Of P2P-Based SaaS Data Integration
10	Design And Implementation Of Real-time Data Integration Platform Of Logistics System