Research On Deep Web Sources Classification Based On The Form Features

Posted on:2013-04-26

Degree:Master

Type:Thesis

Country:China

Candidate:G W Zhu

Full Text:PDF

GTID:2248330377958501

Subject:Computer application technology

Abstract/Summary:

The Deep Web holds a vast and rapidly growing amount of high-quality information. However, because its sources are distributed, heterogeneous, and autonomous, users face great challenges in finding the information they are interested in quickly and efficiently. Deep Web data sources are nevertheless organized by real-world domains, which provides a foundation for addressing this challenge. Research on organizing Deep Web sources is therefore of great significance.

In this paper, we collected a large number of Deep Web data sources from four domains (Airfares, Books, Automobiles, and Real Estate) using a Web directory (e.g., the UIUC Web Integration Repository), an automatic extraction tool developed by our research group, and search engines (e.g., google.com.hk). Based on more than 200 of these sources, we make four observations. First, "topic terms" appearing in the title tag distinguish Deep Web data sources better than the other words in the title tag. Specifically, for the overwhelming majority of query interfaces, the source code contains some words between the <title> and </title> tags, and some of these words tend to appear only in a certain domain; furthermore, in any query interface, the topic terms characterize, to some extent, the theme of that interface, that is, its domain. Second, different interfaces from the same domain often share many similar form attributes, whereas interfaces from different domains share few form attributes, or none at all. Third, the aggregate schema vocabulary of sources in the same domain tends to converge to a relatively small size (about 60 attributes) as the number of sources grows. Finally, most Deep Web data sources are structured sources.

Inspired by these observations, we propose a novel classification method for Deep Web sources and an improved similarity measure for query interfaces, in order to classify large numbers of Deep Web sources automatically. The method makes full use of form features (e.g., theme information and form attributes) and combines ideas from k-means clustering and classification. In addition, we present a strategy based on query interface labels to reduce the influence of choosing the initial centers randomly. Experimental results on real Deep Web sources indicate that our method is effective and feasible, with both overall accuracy and recall above 97%.
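
The abstract does not give the improved similarity measure or the exact clustering procedure, so the following Python sketch is only an illustration of the general idea, not the thesis's method. It assumes that form features are the words inside the title tag plus the `name` attributes of form fields, that similarity is an equally weighted sum of two Jaccard scores, and that each domain starts from one labeled query interface instead of a random center. All names (`InterfaceParser`, `similarity`, `cluster`, `title_weight`) and the sample pages are hypothetical.

```python
from html.parser import HTMLParser
import re


class InterfaceParser(HTMLParser):
    """Collect title words and form field names from one query interface page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_terms = set()
        self.form_attrs = set()

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag in ("input", "select", "textarea"):
            name = dict(attrs).get("name")
            if name:
                self.form_attrs.add(name.lower())

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_terms.update(re.findall(r"[a-z]+", data.lower()))


def features(html):
    """Return (title terms, form attribute names) for one page."""
    parser = InterfaceParser()
    parser.feed(html)
    return parser.title_terms, parser.form_attrs


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 0.0


def similarity(x, y, title_weight=0.5):
    # Assumed weighting; the thesis's improved measure is not given in the abstract.
    return (title_weight * jaccard(x[0], y[0])
            + (1 - title_weight) * jaccard(x[1], y[1]))


def cluster(interfaces, seeds, max_iter=10):
    """k-means-style loop; seeds maps each domain to the index of a labeled interface."""
    members = {domain: [idx] for domain, idx in seeds.items()}
    labels = None
    for _ in range(max_iter):
        new_labels = []
        for x in interfaces:
            # Score each domain by mean similarity to its current members.
            scores = {d: sum(similarity(x, interfaces[j]) for j in m) / len(m)
                      for d, m in members.items()}
            new_labels.append(max(scores, key=scores.get))
        for domain, idx in seeds.items():
            new_labels[idx] = domain  # labeled interfaces keep their domain
        if new_labels == labels:
            break
        labels = new_labels
        members = {d: [i for i, lab in enumerate(labels) if lab == d] for d in seeds}
    return labels


# Made-up pages for illustration only.
pages = [
    '<title>Book Search</title><form><input name="author"><input name="title"><input name="isbn"></form>',
    '<title>Find Books Online</title><form><input name="author"><input name="isbn"><input name="publisher"></form>',
    '<title>Cheap Airfares</title><form><input name="from"><input name="to"><input name="depart"></form>',
    '<title>Airfares Search</title><form><input name="from"><input name="to"><input name="return"></form>',
]
labels = cluster([features(p) for p in pages], seeds={"Books": 0, "Airfares": 2})
print(labels)
```

Because sets of terms have no arithmetic mean, this sketch scores a domain by the average similarity to its current members rather than computing a centroid; that substitution is a simplification made here, not something stated in the source.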

Keywords/Search Tags:

Deep web, Automatic classification of sources, Form features, Query interface label

Related items

1	Deep Web Sources Classification And Query Interface Schema Extraction Based On Ontology
2	A Study Of Automatic Form-Filling Based On CNN And BiLSTM-CRF
3	Automatic Classification And Identification Of Deep Web Data Source Using Multi-classifier
4	SEEDEEP: A system for exploring and querying deep web data sources
5	Research On Source Discovery And Query Results Extraction Of Deep Web
6	A framework for transparently accessing deep web sources
7	Integrating Deep Web data sources
8	Research On Deep Web Sources Classification Leveraging World Knowledge Inference
9	Research On Deep Web Source Discovery And Classification
10	Research On The Deep Web Data Sources Classification