Research On Deep Web Sources Classification Based On The Form Features

Posted on: 2013-04-26; Degree: Master; Type: Thesis
Country: China; Candidate: G W Zhu; Full Text: PDF
GTID: 2248330377958501; Subject: Computer application technology
Abstract/Summary:
The Deep Web holds a vast and rapidly growing amount of high-quality information. However, because its sources are distributed, heterogeneous, and autonomous, users face great difficulty in quickly and efficiently obtaining the information they are interested in. Deep Web data sources are nevertheless organized by real-world domains, which provides a foundation for addressing this challenge. Research on organizing Deep Web sources is therefore of great significance.

In this thesis, we collected a large number of Deep Web data sources from four domains (Airfares, Books, Automobiles, and Real Estate) using Web directories (e.g., the UIUC Web Integration Repository), an automatic extraction tool developed by our research group, and search engines (e.g., google.com.hk). Based on more than 200 of these sources, we make four observations. First, "topic terms" appearing in the title tag distinguish Deep Web data sources better than other words in the title tag. Specifically, for the overwhelming majority of query interfaces, the source code contains some words between the <title> and </title> tags, and some of these words often appear only in a certain domain; moreover, in any query interface, the topic terms characterize, to some extent, the theme of that interface, that is, its domain. Second, different interfaces from the same domain often have many similar form attributes, whereas interfaces from different domains share few form attributes, or none at all. Third, the aggregate schema vocabulary of sources in the same domain tends to converge to a relatively small size (about 60 attributes) as the number of sources grows. Lastly, most Deep Web data sources are structured sources.

Inspired by these observations, we propose a novel classification method for Deep Web sources and an improved similarity measure for query interfaces to automatically classify large numbers of Deep Web sources. The approach makes full use of form features (e.g., theme information and form attributes) and combines the ideas of k-means clustering and classification. In addition, we present a query interface labeling strategy to reduce the influence of choosing initial centers randomly. Experimental results on real Deep Web sources indicate that our method is effective and feasible, with both overall accuracy and recall above 97%.
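
The abstract does not spell out the similarity measure or the clustering procedure, so the following Python sketch is only one plausible reading of the idea, not the thesis's actual method. It assumes that form features are the words inside the title tag plus the `name` values of visible form fields, that similarity is a weighted Jaccard overlap (the weight `w_title` is hypothetical), and that labeled query interfaces act as fixed seeds in place of random initial centers. Because the features are sets, average similarity to cluster members stands in for the distance to a k-means centroid. All function and class names here are illustrative.

```python
import re
from html.parser import HTMLParser


class FormFeatureParser(HTMLParser):
    """Collect title words and visible form-field names from one page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_terms = set()
        self.attributes = set()

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag in ("input", "select", "textarea"):
            fields = dict(attrs)
            field_type = (fields.get("type") or "").lower()
            # Assumption: hidden and submit controls carry no schema information.
            if fields.get("name") and field_type not in ("hidden", "submit"):
                self.attributes.add(fields["name"].lower())

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_terms.update(re.findall(r"[a-z]+", data.lower()))


def extract(html):
    """Return (title_terms, form_attributes) for one query interface."""
    parser = FormFeatureParser()
    parser.feed(html)
    return parser.title_terms, parser.attributes


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(x, y, w_title=0.5):
    """Weighted Jaccard over title terms and form attributes (hypothetical weights)."""
    return w_title * jaccard(x[0], y[0]) + (1 - w_title) * jaccard(x[1], y[1])


def avg_similarity(feat, group):
    """Average similarity of feat to the other members of a cluster."""
    others = [m for m in group if m is not feat]
    return sum(similarity(feat, m) for m in others) / len(others)


def classify(seeds, unlabeled, max_iter=10):
    """Seeded, k-means-style assignment.

    seeds: {domain: [features, ...]} with at least one labeled interface per domain.
    unlabeled: list of features. Returns one domain label per unlabeled item.
    """
    labels = [None] * len(unlabeled)
    for _ in range(max_iter):
        # Rebuild clusters from the fixed seeds plus the current assignments.
        members = {domain: list(feats) for domain, feats in seeds.items()}
        for feat, label in zip(unlabeled, labels):
            if label is not None:
                members[label].append(feat)
        new_labels = [
            max(members, key=lambda d: avg_similarity(feat, members[d]))
            for feat in unlabeled
        ]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
    return labels
```

Calling `classify({"Airfares": [extract(page_a)], "Books": [extract(page_b)]}, [extract(page_c)])` returns a list with one domain name per unlabeled page; `page_a`, `page_b`, and `page_c` are placeholder HTML strings.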
Keywords/Search Tags: Deep web, Automatic classification of sources, Form features, Query interface label