
Research on Domain-Oriented High-Quality Deep Web Data Integration Technologies

Posted on: 2011-03-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J W Tian
Full Text: PDF
GTID: 1118330332982898
Subject: Computer application technology
Abstract/Summary:
The Web can be divided into the Surface Web and the Deep Web according to the depth at which information resides. The Surface Web consists of static pages that common search engines can index, whereas the Deep Web consists of pages generated dynamically by querying online databases. With the advance of information technology and the growth of the Internet, the number of Web databases is increasing exponentially, and accessing the Deep Web has become a main way of acquiring information. Because Deep Web content is hidden and varies greatly in quality, making efficient use of Deep Web resources poses a great challenge for information retrieval. This dissertation studies domain-oriented, high-quality Deep Web data integration technologies. Our primary research work is as follows:

(1) Locating Deep Web sources on the same topic

Efficiently locating the query interfaces of Deep Web resources is the key problem that must be solved before Deep Web integration can begin. We therefore propose a Deep Web crawling strategy based on an ordinal scale model. In this strategy, an ordinal regression model is used to build a page classifier that assigns pages to three levels, and a link information extractor extracts links for these three levels. During crawling, the classifier's output serves as feedback that reveals whether the links extracted by the link information extractor satisfy the page classifier. From this feedback we learn the features of the links that do satisfy the classifier; these features guide the crawler to quickly select such links, avoid many off-topic links, and retain links with delayed benefit, which improves both the crawler's speed and its accuracy. The experimental results confirm that the crawler automatically learns the features of promising links and avoids many off-topic links, thereby improving crawling speed and accuracy.

(2) A uniform sampling approach

Only with objective samples can the data quality of Web databases be evaluated. We present a novel approach that exploits attribute correlation for sampling non-uniform hidden databases. First, we construct a sampling template from the dependencies among attributes to generate initial sampling queries, and propose a bottom-up algorithm to search the sampling template. We then restrict the initial sampling queries with heuristic rules based on mutual information in order to locate valid samples. Finally, we conduct extensive experiments on real Deep Web sites and controlled databases to demonstrate the quality and efficiency of our sampling techniques.

(3) A data-quality-based method for ranking Deep Web sources

A large number of Deep Web sites cover the same topic, but not all of them are of high quality; some are small or contain incorrect data, so high-quality Web databases need to be recommended to users. Compared with the traditional approach of ranking Deep Web sources by link authority, ranking them by the quality of their underlying data is much more precise. We therefore propose a method that ranks Deep Web sources according to data quality. The method constructs a quality vector that describes each source from several aspects: it first computes the value of each quality criterion in the vector and then uses all the criteria together to evaluate the value of the data source.
Experimental results show that this quality evaluation of Deep Web sources is accurate and practical.

(4) A non-repeating, complete Deep Web data extraction approach

To retrieve Deep Web resources effectively, structured data must be extracted from the high-quality sites identified by the source ranking. We propose a novel approach for siphoning structured data based on a hierarchy tree, which retrieves all the data in a hidden database without repetition. First, the hidden database is modeled as a hierarchy tree; under this framework, data retrieval becomes a tree-traversal problem. We also propose techniques that narrow the query space by ordering the attributes and by using a heuristic rule based on mutual information to guide the traversal. Extensive experiments on real Deep Web sites and controlled databases demonstrate the good coverage and efficiency of our techniques.

(5) A structured data integration approach

To make Deep Web data easy for users to retrieve, data from different sources must be integrated into a local warehouse. This first requires matching attributes and their corresponding attribute values, for which we present a matching method based on attribute semantics. Second, to extract structured data accurately and automatically, we present a clustering-based data positioning method that automatically generates extraction rules. Finally, we put forward an effective duplicate-removal method based on relational operations.

These techniques are essential for effectively screening and retrieving high-quality Deep Web data, and they are significant for making full use of Deep Web resources on the Internet. Minimal illustrative sketches of the five techniques above follow.
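The three-level page classifier in contribution (1) is described only at a high level. The sketch below shows one plausible realization using the common reduction of ordinal regression to cascaded binary classifiers; the bag-of-words features, scikit-learn models, and level semantics are illustrative assumptions, not the dissertation's actual implementation.

```python
# A minimal sketch of a three-level ordinal page classifier, built from two
# cascaded binary classifiers (a standard reduction of ordinal regression).
# Features, models, and level meanings are assumptions for illustration only.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

class OrdinalPageClassifier:
    """Assigns pages to ordered levels 0 < 1 < 2 (e.g. off-topic,
    topic-related, contains a Deep Web query interface)."""

    def __init__(self):
        self.vectorizer = TfidfVectorizer(max_features=5000)
        # One binary classifier per threshold: P(y > 0) and P(y > 1).
        self.thresholds = [LogisticRegression(max_iter=1000),
                           LogisticRegression(max_iter=1000)]

    def fit(self, pages, levels):
        X = self.vectorizer.fit_transform(pages)
        y = np.asarray(levels)
        for k, clf in enumerate(self.thresholds):
            clf.fit(X, (y > k).astype(int))
        return self

    def predict(self, pages):
        X = self.vectorizer.transform(pages)
        p_gt0 = self.thresholds[0].predict_proba(X)[:, 1]
        p_gt1 = self.thresholds[1].predict_proba(X)[:, 1]
        # Recover P(y=0), P(y=1), P(y=2) from the cumulative probabilities.
        probs = np.column_stack([1 - p_gt0, p_gt0 - p_gt1, p_gt1])
        return probs.argmax(axis=1)
```

In a focused crawler, the predicted level of a fetched page can serve as the feedback signal for the links that led to it, which is the feedback loop the abstract describes.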
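Contributions (2) and (4) both rely on heuristic rules based on mutual information between attributes. The snippet below is a minimal sketch of how such an attribute-correlation measure can be estimated from already-retrieved records; the record format and attribute names are assumed for illustration.

```python
# A minimal sketch of estimating mutual information between two categorical
# form attributes from a set of retrieved records. Attribute names and the
# example data are illustrative assumptions.
from collections import Counter
from math import log

def mutual_information(records, attr_a, attr_b):
    """Estimate I(A;B) in bits from a list of dict-like records."""
    n = len(records)
    pa = Counter(r[attr_a] for r in records)
    pb = Counter(r[attr_b] for r in records)
    pab = Counter((r[attr_a], r[attr_b]) for r in records)
    mi = 0.0
    for (a, b), c_ab in pab.items():
        p_ab = c_ab / n
        mi += p_ab * log(p_ab / ((pa[a] / n) * (pb[b] / n)), 2)
    return mi

# Usage: strongly dependent attribute pairs add little independent restriction
# when combined in a query, so such a score can prune sampling templates or
# guide which attribute to refine next during traversal.
sample = [{"make": "Toyota", "model": "Corolla"},
          {"make": "Toyota", "model": "Camry"},
          {"make": "Honda", "model": "Civic"}]
print(mutual_information(sample, "make", "model"))
```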
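Contribution (3) builds a quality vector for each source and combines the criteria into a single value. The following sketch shows one simple way to do this with min-max normalisation and a weighted sum; the criteria names and weights are illustrative assumptions rather than the dissertation's quality model.

```python
# A minimal sketch of quality-vector based ranking: each source is described by
# several quality criteria, the criteria are normalised to [0, 1], and a
# weighted sum orders the sources. Criteria and weights are assumptions.
def rank_sources(sources, weights):
    """sources: {name: {criterion: raw_value}}, weights: {criterion: weight}."""
    criteria = list(weights)
    lo = {c: min(v[c] for v in sources.values()) for c in criteria}
    hi = {c: max(v[c] for v in sources.values()) for c in criteria}

    def norm(c, value):
        return 0.0 if hi[c] == lo[c] else (value - lo[c]) / (hi[c] - lo[c])

    scores = {name: sum(weights[c] * norm(c, vec[c]) for c in criteria)
              for name, vec in sources.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

sources = {"siteA": {"completeness": 0.9, "accuracy": 0.95, "freshness": 0.6},
           "siteB": {"completeness": 0.7, "accuracy": 0.99, "freshness": 0.9}}
print(rank_sources(sources, {"completeness": 0.4, "accuracy": 0.4, "freshness": 0.2}))
```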
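Contribution (4) models the hidden database as a hierarchy tree and turns extraction into tree traversal. The sketch below illustrates the idea under simplifying assumptions: the attribute order is already chosen, `query_db` stands in for the site's query interface with a fixed per-query result limit, and each record carries an `id` used to keep the output non-repeated.

```python
# A minimal sketch of hierarchy-tree style siphoning: each tree level fixes one
# more attribute value, and a branch is only expanded when the current query
# overflows the site's result limit. query_db, the attribute domains, and the
# record ids are illustrative assumptions.
def siphon(query_db, attribute_domains, prefix=None, depth=0, seen=None):
    """query_db(constraints) -> (records, overflowed); records carry an 'id'.
    attribute_domains: ordered list of (attribute_name, candidate_values)."""
    prefix = prefix or {}
    seen = seen if seen is not None else set()
    records, overflowed = query_db(prefix)
    if not overflowed or depth == len(attribute_domains):
        # This node is covered by a single query; keep only unseen records.
        fresh = [r for r in records if r["id"] not in seen]
        seen.update(r["id"] for r in fresh)
        return fresh
    # Otherwise descend: refine the query with every value of the next attribute.
    attr, values = attribute_domains[depth]
    out = []
    for v in values:
        out += siphon(query_db, attribute_domains, {**prefix, attr: v}, depth + 1, seen)
    return out
```

Choosing the attribute order (and which values to try first) is where a correlation measure such as the mutual-information sketch above could plug in.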
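Contribution (5) mentions duplicate removal based on relational operations. Assuming attribute matching has already aligned the source schemas, the sketch below shows a minimal relational-style union followed by a projection onto key attributes and a distinct, using pandas; the column names and key choice are illustrative assumptions.

```python
# A minimal sketch of relational duplicate removal when records from several
# sources are loaded into the local warehouse. Column names and keys are
# illustrative assumptions.
import pandas as pd

def merge_sources(frames, key_attrs):
    """Union the per-source tables, then keep one row per key combination."""
    unioned = pd.concat(frames, ignore_index=True)        # relational union
    # drop_duplicates keeps the first occurrence of each key combination.
    return unioned.drop_duplicates(subset=key_attrs)

a = pd.DataFrame({"title": ["DB Systems", "AI Basics"], "year": [2009, 2010], "price": [30, 25]})
b = pd.DataFrame({"title": ["AI Basics", "Networks"], "year": [2010, 2008], "price": [26, 40]})
print(merge_sources([a, b], key_attrs=["title", "year"]))
```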
Keywords/Search Tags: Deep Web, Ordinal Regression, Data Sampling, Data Extraction, Hierarchy Tree