
Research On Deep Web Data Source Discovery And Sampling

Posted on: 2012-11-29  Degree: Master  Type: Thesis
Country: China  Candidate: X Du  Full Text: PDF
GTID: 2218330338962895  Subject: Computer software and theory
Abstract/Summary:
With the development of technology and the accumulation of knowledge, a great variety of rich resources have been connected to the network, causing a huge increase in the amount of data accessible on the internet. Furthermore, in recent decades, with the popularization of all sorts of commercial applications, large databases have been set up to support them. Internet users can find information about almost every field, such as banking, shopping, education, research, government, media, and books.

The web looks haphazard because the information on it is complex and diverse. Nevertheless, this information can be classified by access method, which divides the web into two parts: the surface web and the deep web. Generally, the surface web consists of static web pages that have stable hyperlink addresses and can be indexed by traditional search engines or reached through links on other pages. Data in the deep web, on the other hand, cannot be reached by traditional search engines; examples include data retrieved by submitting query requests and data generated by dynamic scripting languages such as JavaScript.

As earlier research shows, the deep web holds a huge variety of data. To make full use of these resources and to analyze and mine them deeply, we must first acquire high-quality deep web data, which will in turn stimulate research in related fields. Deep web data retrieval is also the first step of data integration; the other two steps are data extraction and data aggregation.
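The distinction above can be illustrated with a minimal heuristic: a page containing an HTML form with at least one free-text input is a candidate deep web entry point, while pages without such forms belong to the surface web. The sketch below, built on Python's standard `html.parser`, is an illustration of this idea only, not the detection algorithm used in the thesis.

```python
from html.parser import HTMLParser

class FormFinder(HTMLParser):
    """Counts <form> elements that contain at least one free-text input.

    Such forms are treated as *candidate* deep web query interfaces;
    forms with only hidden fields or buttons (e.g. login tokens) are
    ignored. This is an illustrative heuristic, not a full classifier.
    """
    def __init__(self):
        super().__init__()
        self.in_form = False
        self.text_inputs = 0
        self.candidate_forms = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
            self.text_inputs = 0
        elif self.in_form and tag == "input":
            # An <input> with no explicit type defaults to "text" in HTML.
            if attrs.get("type", "text") in ("text", "search"):
                self.text_inputs += 1

    def handle_endtag(self, tag):
        if tag == "form":
            if self.text_inputs > 0:
                self.candidate_forms += 1
            self.in_form = False

page = """
<html><body>
  <form action="/search"><input type="text" name="q"><input type="submit"></form>
  <form action="/login"><input type="hidden" name="token"><input type="submit"></form>
</body></html>
"""
finder = FormFinder()
finder.feed(page)
print(finder.candidate_forms)  # prints 1: only the search form qualifies
```

A real system would of course need further checks (e.g. submitting a probe query and inspecting the result page, as the thesis describes later) to separate deep web query interfaces from ordinary site-search boxes.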
The main task of data retrieval is to restrict the focus to a specific field, discover as many data sources as possible, evaluate them, select those of excellent quality, and finally retrieve as much data as possible from these high-quality sources.

The main goal of this thesis is to solve three problems that arise in search engine based deep web data source discovery and selection:

First, the discovery of deep web query interfaces relies on a traditional search engine, so keywords must be submitted to the search engine to retrieve results. To obtain good results and rank the pages containing deep web query interfaces as high as possible, we must construct a highly related keyword set, which remains a hard problem.

Second, after analyzing the pages where deep web query interfaces appear, we find that a page usually contains more than one query interface, such as traditional search interfaces, metadata search interfaces, and deep web query interfaces; after observing a few automobile web sites, we found as many as seven or eight different interfaces on a single page. We therefore need a more efficient method to distinguish deep web query interfaces from other kinds of search interfaces and to extract them.

Third, deep web data sources have several notable features: they are huge in number, wide-ranging in content, and large in data volume. Establishing a local mirror database would therefore cost a great deal of labor, material, and money. What is more, data in deep web sources are updated frequently, so a local database would need constant updating; this is difficult, because data are obtained by submitting query requests and the results cannot be restricted to new records only. An alternative is to build a local sample database and refresh the whole sample periodically.
But how to select the most important keywords as query inputs, so as to obtain data that are large in quantity and well distributed, remains a hard problem.

To solve the above problems, this thesis presents my research on search engine based deep web data source discovery and selection. The goal of the thesis is to propose a source code based page block division algorithm and a method for constructing a highly related keyword set; this method supplies good query results, which are then used to sample the deep web database, after which we analyze the results and compute the deviation. The main work and achievements are as follows:

First, the thesis proposes a field-oriented method to automatically discover deep web query interfaces. This method makes full use of web page source code and visual information to determine the deep web query interfaces on a page. It divides the web page into several separate areas, analyzes the source code to discover interfaces, forms query requests from highly related keywords to submit to the server, and finally determines the real deep web query interfaces by analyzing the results returned from the server.

Second, forming a query request requires a highly related keyword set. The thesis proposes a method to construct a field-oriented, highly related keyword set: it extracts documents from professional databases and, after processing them, outputs a graph based keyword-connected network in which every keyword carries its own weight.

Third, the thesis proposes a new method for sampling deep web databases. It introduces flexible keyword attributes that are no longer restricted to the range attributes in the query interface. Experimental results show that the method achieves good results.
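A graph based, weighted keyword network of the kind described above might be sketched as follows. Here keywords co-occurring in the same document are connected, and a keyword's weight is simply its number of distinct neighbours; both the weighting scheme and the `top_keywords` selection are my own illustrative assumptions, since the abstract does not specify these details.

```python
from collections import defaultdict
from itertools import combinations

def build_keyword_network(documents):
    """Build a keyword co-occurrence graph from tokenized documents.

    Nodes are keywords; an edge links two keywords appearing in the same
    document. Each keyword's weight is its number of distinct neighbours,
    a simple stand-in for the thesis's (unspecified) weighting scheme.
    """
    neighbours = defaultdict(set)
    for doc in documents:
        for a, b in combinations(sorted(set(doc)), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {kw: len(ns) for kw, ns in neighbours.items()}

def top_keywords(weights, k):
    """Return the k highest-weight keywords to use as sampling queries."""
    return sorted(weights, key=lambda kw: (-weights[kw], kw))[:k]

# Hypothetical tokenized documents from an automobile-domain corpus.
docs = [
    ["sedan", "engine", "price"],
    ["sedan", "engine", "mileage"],
    ["truck", "engine"],
]
weights = build_keyword_network(docs)
print(top_keywords(weights, 2))  # prints ['engine', 'sedan']
```

Submitting the highest-weight keywords first is one plausible way to obtain a large, well-distributed sample with few queries, since well-connected keywords tend to match many records across the database.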
Keywords/Search Tags:Deep Web Data Acquisition, Query Interface Judging, Deep Web Sampling, Deep Web Data Extraction