Research on Issues in Data Acquisition of Deep Web

Posted on: 2011-08-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z M Yan
Full Text: PDF
GTID: 1118330332481353
Subject: Computer software and theory
Abstract:
With the development of Internet technology, the Web has become a huge information source, and Deep Web information is richer, more thematic, and better structured than that of the surface Web. With the growing demands of analytical applications such as market intelligence analysis, public opinion analysis, and e-commerce, Deep Web data must be integrated so that useful knowledge can be analyzed and mined from the integrated data. However, Deep Web data sources are large in scale, autonomous, heterogeneous, distributed, and accessible only through specific methods, which makes acquiring their data automatically a major challenge. As the first step of data integration, Deep Web data acquisition provides the data foundation for integration and the prerequisite for subsequent data extraction and consolidation. It must resolve the following problems:

(1) Because web sites grow quickly and change at any time, while data analysis and mining require comprehensive data, analysis-oriented Deep Web data acquisition needs to discover as many Deep Web data sources as possible automatically.

(2) Because the quality of Deep Web data sources varies greatly, the crawling process is complicated, and the crawling period is long, the sources that have been discovered must be evaluated so that high-quality ones can be selected to obtain more comprehensive information.

(3) Because Deep Web data sources hold large amounts of data, and the result sets returned by different query words overlap heavily during crawling, the words submitted as queries must be chosen carefully in order to acquire data from a source efficiently.

This dissertation targets analysis-oriented Deep Web data acquisition and focuses on query interface judging, Deep Web sampling, Deep Web quality evaluation, and Deep Web crawling. The main research works and contributions are as follows.

1. A query interface judging approach based on ensemble learning is proposed, which effectively identifies Deep Web query interfaces among large numbers of web pages, distinguishes them from search engine query interfaces, and improves the accuracy of Deep Web interface identification.

To obtain Deep Web query interfaces more effectively, this dissertation proposes an ensemble learning method that builds the judging model from a decision tree classifier and multiple SVM classifiers. On one hand, by analyzing the inherent characteristics of query interface pages, the method derives 6 rules for judging whether a page contains a query interface and classifies pages with a simple, effective decision tree built from these rules. On the other hand, the method submits queries to the query interfaces of Deep Web sources or search engines and analyzes the result pages; from the result pages' features it constructs several training data sets with balanced classes, which reduces the impact of class imbalance on the learning algorithm, and trains an SVM classifier on each set. Finally, the method combines the decision tree and the SVM classifiers trained on the multiple data sets by voting, yielding the judging model for Deep Web query interfaces. By synthesizing the advantages of both the query-submitting and the non-submitting approaches, the model distinguishes Deep Web query interfaces from search engine interfaces and identifies Deep Web query interfaces more accurately. Experiments show that the method is feasible and efficient, and achieves higher recall and precision than recognition algorithms that use a single machine learning method.
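The following is a minimal sketch of such a voting ensemble. The structural rules, feature names, and functions (rule_decision_tree, judge) are illustrative stand-ins for the dissertation's actual 6 rules and features, and the SVM models are assumed to expose a scikit-learn-style predict method; none of this is the dissertation's own implementation.

```python
# Hypothetical sketch of the ensemble judging model: one rule-based
# decision tree plus several classifiers (e.g., SVMs) trained on
# class-balanced subsets, combined by majority vote. All rules and
# feature names here are illustrative, not the dissertation's own.
from dataclasses import dataclass, field

@dataclass
class PageFeatures:
    form_count: int            # number of <form> elements on the page
    text_input_count: int      # text boxes inside those forms
    has_select_or_radio: bool  # structured controls hint at a query interface
    result_vector: list = field(default_factory=list)  # result-page features

def rule_decision_tree(p: PageFeatures) -> int:
    """Cheap structural rules (stand-ins for the dissertation's 6 rules)."""
    if p.form_count == 0:
        return 0               # no form at all: cannot be a query interface
    if p.has_select_or_radio:
        return 1               # structured controls: likely a query interface
    return 1 if p.text_input_count >= 2 else 0

def judge(page: PageFeatures, svm_models: list) -> bool:
    """Majority vote over the rule tree and the SVM classifiers, each SVM
    trained on a class-balanced subset of result-page features."""
    votes = [rule_decision_tree(page)]
    votes += [int(m.predict([page.result_vector])[0]) for m in svm_models]
    return sum(votes) > len(votes) / 2
```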
2. A Deep Web sampling method based on a keyword selection model is proposed, which can efficiently obtain high-quality approximate random samples from Deep Web data sources for data source evaluation.

This dissertation proposes a sampling method based on a keyword selection model, which frees the sampling process from dependence on the attribute expressions of query interfaces. During sampling, for a keyword attribute, a value is selected from the current sample set according to its occurrence frequency and submitted to the query interface; a random walk strategy is used for categorical and range attributes. The method efficiently obtains high-quality approximate random samples from Deep Web data sources, and from a sample we can estimate useful features of a source, such as field relevance, accuracy, completeness, and data scale, in order to evaluate and select Deep Web data sources.
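A minimal sketch of the frequency-weighted sampling loop for a keyword attribute is given below. The query callable, the seed word list, and the treatment of records as plain text are assumptions of this sketch, and the random-walk handling of categorical and range attributes is omitted.

```python
# Hypothetical sketch of keyword-based sampling: the next query word is
# drawn from the words seen so far, weighted by frequency, so sampling
# does not depend on the interface's attribute expressions. query() and
# the plain-text record format are assumptions of this sketch.
import random
from collections import Counter

def sample_source(query, seed_words, rounds=100):
    """query: callable(word) -> list of text records from the data source."""
    sample = []
    freq = Counter(seed_words)
    for _ in range(rounds):
        # Frequency-weighted choice: words common in the current sample
        # are more likely to match many records in the source.
        words, weights = zip(*freq.items())
        word = random.choices(words, weights=weights, k=1)[0]
        for record in query(word):
            if record not in sample:
                sample.append(record)
                freq.update(record.split())  # grow the word pool from results
    return sample
```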
3. A Deep Web data source quality evaluation method based on multi-objective decision theory is proposed, which effectively solves the problem of evaluating the quality of large numbers of Deep Web data sources in the same domain.

This dissertation proposes a quantitative quality evaluation method for Deep Web data sources based on multi-objective decision theory. By establishing a quality evaluation system for Deep Web data sources, each source is quantified and scored, the evaluation problem is mapped to a multi-objective decision problem, the sources are ranked, and high-quality sources are selected. To meet the needs of Deep Web data integration in analytical applications, a quality evaluation system is proposed that, using the data samples already obtained, quantifies and scores each source on 16 quality evaluation factors across 4 dimensions: the quality of the data source's web site, of its query interfaces, of its result pages, and of its data. The system then maps the score matrix to a multi-objective decision solution, computes the weight of each evaluation factor, derives an overall evaluation value for each Deep Web data source, ranks the sources, and selects the high-quality ones, minimizing the number of sources that need to be crawled.

4. A Deep Web data crawling method based on a graph model of high-frequency word coverage is proposed, which effectively solves the problem of large-scale acquisition of Deep Web data pages in the Chinese-language environment.

This dissertation proposes a Deep Web data crawling method based on a graph model of high-frequency word coverage. For a particular domain, the method computes Chinese word frequencies to obtain a domain-oriented list of high-frequency attribute words, and builds a coverage graph over these words to estimate the new-data acquisition rate of each candidate word, so as to reach the highest possible data coverage with as few database queries as possible. The method is an effective solution to Deep Web data crawling in the Chinese-language environment. The attribute graph model and the high-frequency word coverage built during crawling also provide good guidance for other data sources in the same domain. Experimental results demonstrate the method's feasibility and effectiveness.
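As a rough illustration, the greedy selection behind such a coverage-driven crawler can be sketched as follows. Here the coverage graph is flattened into per-word record sets and each word's gain is probed directly rather than estimated from the graph, so query, the word list, and the stopping threshold are all assumptions of this sketch.

```python
# Hypothetical greedy crawler: repeatedly submit the high-frequency word
# that covers the most records not yet acquired, stopping once the gain
# falls below a threshold. Here each word's coverage is probed directly;
# the dissertation's coverage graph estimates it without extra queries.
def crawl(query, high_freq_words, min_gain=5):
    """query: callable(word) -> iterable of record ids from the source."""
    seen = set()                                            # records acquired
    coverage = {w: set(query(w)) for w in high_freq_words}  # one probe per word
    while coverage:
        # Greedy step: the word whose result set adds the most new records.
        best = max(coverage, key=lambda w: len(coverage[w] - seen))
        if len(coverage[best] - seen) < min_gain:
            break                                           # diminishing returns
        seen |= coverage.pop(best)
    return seen
```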
Keywords/Search Tags: Deep Web Data Acquisition, Query Interface Judging, Deep Web Sampling, Quality Evaluation, Deep Web Data Crawling