Research On Deep Web Source Discovery And Classification

Posted on:2012-08-15

Degree:Master

Type:Thesis

Country:China

Candidate:H L Wang

Full Text:PDF

GTID:2218330368992444

Subject:Computer application technology

Abstract/Summary:

With the rapid development of the Internet, the volume of information on the Web keeps expanding, providing users with a huge information resource. A large and growing share of this information is hidden behind query interfaces, where traditional search engines cannot reach it; this part of the Web is called the Deep Web. The fast-growing Deep Web has become a significant resource for information retrieval. Because Deep Web data is heterogeneous and dynamic, many researchers and companies have been studying how to integrate Deep Web resources into a single system. This thesis studies Deep Web oriented data extraction and integration technology and proposes corresponding models and mechanisms that address the limitations of traditional methods. The main work of this thesis is summarized as follows:

1. This thesis presents a data source classification method based on fuzzy sets and a probabilistic model. The words of each domain are divided into characteristic words and general words according to their contribution to the domain. Fuzzy sets are introduced as a vocabulary normalization tool to simplify the characteristic words and the common words, which makes it possible to find more precise vocabulary in the homepage text. After normalization, a vocabulary probabilistic model is built for each domain, and classification is performed by calculating the distance between the data source form vector and each domain vector.

2. This thesis proposes a method for data source discovery using a search engine. To submit high-quality keywords to the search engine, an ontology is introduced into the construction of the initial word set. All words are classified according to their frequency in the current domain and then reclassified according to the number of elements in the returned result set, which ensures that each keyword contributes substantially to the discovery of data source query interfaces.

3. This thesis proposes an improvement mechanism for Web form classifiers that reclassifies misclassified forms into the proper domains by combining pre-query and post-query techniques. Before classification, a chart model is built from the correlations among multiple domains, and forms are placed into the form sets of multiple domains at the same time. The forms in the intersection of the domains' form sets are then reclassified with probe queries to make form classification more accurate.

Moreover, this thesis designs and performs several experiments on the proposed methods. The experimental results show that these methods are feasible and effective.
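The distance-based classification step in item 1 can be illustrated with a minimal sketch. The abstract does not state the distance measure, the fuzzy-set normalization, or the vocabularies, so everything here is an assumption: cosine distance stands in for the unspecified distance, relative term frequencies stand in for the vocabulary probabilistic model, and the function names and the two example domains are hypothetical.

```python
import math
from collections import Counter


def term_distribution(words):
    """Relative frequency of each word (assumed stand-in for the
    thesis's vocabulary probabilistic model)."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def cosine_distance(p, q):
    """1 - cosine similarity of two sparse vectors stored as dicts.
    Assumption: the abstract does not name the distance it uses."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_p * norm_q)


def classify_form(form_words, domain_models):
    """Return the domain whose vector is closest to the form vector."""
    form_vec = term_distribution(form_words)
    return min(domain_models,
               key=lambda d: cosine_distance(form_vec, domain_models[d]))


# Hypothetical domain vocabularies, for illustration only.
domain_models = {
    "books": term_distribution(
        ["title", "author", "isbn", "publisher", "title"]),
    "flights": term_distribution(
        ["departure", "arrival", "airline", "date", "passengers"]),
}

label = classify_form(["author", "title", "keyword"], domain_models)
```

In this sketch a form whose labels overlap only with the hypothetical "books" vocabulary is assigned to that domain; the normalization of characteristic and common words described in item 1 would happen before the vectors are built and is not modeled here.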

Keywords/Search Tags:

data source discovery, form classification, probe query, chart model

Related items

1	Research On Discovery And Classification Based On Topic-related Data Sources
2	Multi-Join Query Algorithm Research Over Data Streams
3	Source discovery and schema mapping for data integration
4	Deep Web Data Resource Intelligent Mining System
5	A Study On The Methods Of Chinese Product Query Classification Based On User Behavior And Semantic Expansion
6	Dissemination Of Information Management, Data Mining Based On Rough Set Classification Model
7	Build, Based On Unit Outlier Algorithm And Customer Loyalty Analysis System
8	Integrated Query Processing Over Autonomous Heterogeneous Data Sources
9	Research On Source Discovery And Query Results Extraction Of Deep Web
10	Research On Deep Web Sources Classification Based On The Form Features