Research On Deep Web Source Discovery And Classification

Posted on: 2012-08-15
Degree: Master
Type: Thesis
Country: China
Candidate: H L Wang
Full Text: PDF
GTID: 2218330368992444
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the volume of Web information keeps expanding and provides a huge information resource for users. A large share of this information lies deep in the Web, hidden behind query interfaces where traditional search engines cannot reach it; this portion is called the Deep Web. The rapidly growing Deep Web has become a significant resource for information retrieval. Because Deep Web data is heterogeneous and dynamic, many researchers and companies have been studying how to integrate Deep Web resources into one system.

This thesis studies data extraction and integration technology for the Deep Web. It proposes corresponding models and mechanisms that address the limitations of traditional methods. The main work of this thesis is summarized as follows:

1. A classification method for data sources based on fuzzy sets and a probabilistic model is presented. The words of each domain are divided into characteristic words and general words according to their contribution to the current domain. Fuzzy sets are introduced into the simplification of characteristic words and general words as a vocabulary normalization tool, which makes it possible to find more precise vocabulary in the homepage text. After normalization, a vocabulary probabilistic model is built for each domain, and each data source is classified by calculating the distance between its form vector and each domain vector.

2. A method for data source discovery using a search engine is proposed. To submit high-quality keywords to the search engine, an ontology is introduced into the construction of the initial word set. All words are classified according to their frequency in the current domain and then reclassified according to the number of elements in the returned result set, which ensures that each keyword contributes substantially to the discovery of data source query interfaces.

3. An improvement mechanism for the Web form classifier is proposed. It reassigns misclassified forms to the proper domains by combining pre-query and post-query techniques. Before classification, a chart model is established from the correlation among multiple domains, and each form is sent to the form sets of several domains at the same time. The forms in the intersection of the domains' form sets are then reclassified with probe queries to make form classification more accurate.

In addition, several experiments on the proposed methods are designed and carried out. The experimental results show that these methods are feasible and effective.
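
The distance-based step in item 1 can be illustrated with a minimal sketch. The abstract does not state which distance measure, vocabulary, or domains the thesis uses, so everything here is an assumption for illustration only: vectors are plain relative word frequencies, the distance is cosine distance, and the names `term_probabilities`, `cosine_distance`, `classify_form`, and the two toy domains are hypothetical.

```python
import math
from collections import Counter


def term_probabilities(words):
    """Turn a list of words into a {word: relative frequency} vector."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def cosine_distance(u, v):
    """Cosine distance between two sparse vectors stored as dicts.

    Assumed metric: the abstract only says "distance".
    """
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def classify_form(form_words, domain_vectors):
    """Return the domain whose vector is closest to the form vector."""
    form_vector = term_probabilities(form_words)
    return min(
        domain_vectors,
        key=lambda d: cosine_distance(form_vector, domain_vectors[d]),
    )


# Toy domain vocabularies (hypothetical, not taken from the thesis).
DOMAIN_WORDS = {
    "books": ["title", "author", "isbn", "publisher", "title", "author"],
    "flights": ["departure", "arrival", "airline", "date", "passengers"],
}
DOMAIN_VECTORS = {d: term_probabilities(ws) for d, ws in DOMAIN_WORDS.items()}

if __name__ == "__main__":
    print(classify_form(["author", "title", "keyword"], DOMAIN_VECTORS))
```

In this sketch a form is labeled with the single nearest domain. The fuzzy-set normalization and the probabilistic weighting described in the abstract are not modeled.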
Keywords/Search Tags: data source discovery, form classification, probe query, chart model