
Study On Data Sources Discovery And Selection On Deep Web

Posted on: 2009-12-31
Degree: Master
Type: Thesis
Country: China
Candidate: M F Li
Full Text: PDF
GTID: 2178360308478306
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the number of data sources on the Deep Web is growing quickly. However, these data sources can only be reached through dynamic query responses. Traditional search engines such as Google and Baidu can hardly index or search them, so they are not fully utilized. Research on Deep Web search engines that can meet users' broad demands has therefore become a primary focus of information research. Because of the characteristics of the Deep Web, however, integrating its data sources is technically very difficult.

To discover and integrate these Deep Web data sources, we first analyzed the state of the art on the Deep Web, proposed a data integration framework for the Deep Web, analyzed its four main mechanisms (repository construction, query processing, query transformation and result integration), and described the difficulties of Deep Web integration. Second, we described the Deep Web crawler architecture: after analyzing interface styles and form processing mechanisms, we adopted a four-level data source discovery model and presented DeepRunner, a domain-based form crawler architecture, together with the algorithm DOER for acquiring data sources within one domain. Third, we examined the attribute distribution of the Deep Web and proposed an attribute-based dominant pattern growth algorithm for top-k data source selection, then improved it by incorporating the co-occurrence of attributes, which further raised precision and recall. Finally, we described a query translation and result integration mechanism.

Experimental results demonstrated the feasibility of DeepRunner for acquiring Deep Web data sources within one domain. Experiments on large amounts of data showed the advantages of the domain-based Deep Web discovery algorithm DOER and validated the effectiveness of the attribute-based dominant pattern growth algorithm and the co-occurrence combined approach. These two algorithms performed much better than the traditional top-k data source selection strategy, especially for large-scale data source integration.
Keywords/Search Tags:Deep Web, domain, data sources discovery, data sources selection, Top-k, attribute based dominant pattern growth algorithm, co-occurrence