Study On Key Techniques Of Web Database Integration In The Deep Web

Posted on: 2010-12-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T Z Nie
Full Text: PDF
GTID: 1228330371450144
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of Internet technology, the amount of data on the Web is increasing quickly. Web databases are an important resource of the deep Web and contain a great deal of Web-accessible data. Most of this data consists of structured, domain-specific records, so it can provide high-quality data sets for scientific research and commercial applications. In the deep Web, Web databases are often heterogeneous, distributed, dynamic and autonomous. Web database integration provides an effective solution for unified access to the data in these databases. As an emerging research field, Web database integration still has a number of challenging open issues. This dissertation reviews existing research on Web database integration and discusses an integration framework. It then studies several key techniques of Web database integration, including Web database schema extraction, Web database classification, search result record extraction and annotation, and cleansing of the integrated data. Novel and effective approaches are proposed for the problems in these techniques. The main contributions are:

(1) A framework for Web database integration based on a meta search model is proposed. The framework gives users unified access to the data in different Web databases. The meta search model not only lets the framework access fresh data transparently, but also reduces the execution cost of the system. Based on an analysis of the key techniques of Web database integration, the framework includes a Web database search module that works off-line and a query processing module that works on-line. The former discovers domain-specific Web databases on the Internet, then extracts their schemas and classifies them into categories.
The latter responds to user queries by extracting and annotating the search result records of different Web databases and integrating them.

(2) An instance-based approach to search result schema extraction is proposed. Schema information is important for Web database integration. For a given Web database, the search interface schema determines its function, while the search result schema presents its content. However, most existing work focuses on search interfaces and seldom studies search result schemas. This dissertation first provides a label-based approach to identify attributes in search interfaces. A two-phase method of result schema extraction is then proposed, in which instance-based queries are divided into approximate queries and accurate queries that probe the Web database separately. The approach searches for the query keywords in the DOM tree of a result page and identifies schema attributes based on the characteristic that records appear consecutively in result pages. Finally, attribute co-occurrence is used to match more schema attributes and to improve the precision and recall of the approach.

(3) A content-oriented classification approach for Web databases is proposed. Existing domain-oriented classification does not adequately meet the requirements of Web database selection. This dissertation therefore first partitions the records of a domain into multiple subject categories and uses sample instances to probe each Web database. Then, based on the number of result records, a matrix is built to reflect the matching relationship between Web databases and subject categories. The result of content-oriented classification provides more accurate data sources for Web database selection.

(4) Efficient techniques for search result record (SRR) extraction are proposed. First, an approach for crawling search result pages based on URL matching is proposed.
It avoids semantic matching over large amounts of Web page content, which yields accurate result pages and keeps Web data extraction efficient. The paths of the schema attributes identified during schema extraction are used to locate SRRs in result pages, so that records are extracted and annotated at the same time. Moreover, based on the attribute paths, wrappers are generated for each Web database to improve the efficiency of extracting SRRs from successive result pages.

(5) An approach to cleaning the integrated data is proposed. First, to address data quality problems in the integrated data, a novel method is provided to reinforce records based on functional dependencies. Combined with entity identification techniques, this method is effective for repairing incomplete, inaccurate and incorrect values of record attributes. Second, an incremental data integration algorithm is proposed, in which the integration order of the record sets is determined by the results of their quality evaluation. It improves both the data quality and the efficiency of integration.

(6) A domain-specific prototype system, DDW Search, is designed and implemented based on the key techniques of the Web database integration framework proposed in this dissertation. Users can submit requests through the global interface of the system and browse uniform search results from multiple Web databases.

In summary, this dissertation studies the Web database integration framework and its key techniques, and proposes several novel solutions to the research issues involved. A series of experiments and analyses demonstrates that these approaches are effective and efficient for Web database integration. We hope the approaches and techniques in this dissertation contribute to research in the field of Web database integration.
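
The matrix described in contribution (3) can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the `probe` callback, the toy result counts, the row normalization and the `best_category` helper are all assumptions made for illustration. The abstract only states that sample instances probe each Web database and that the number of result records fills a matrix relating databases to subject categories.

```python
from typing import Callable, Dict, List

Matrix = Dict[str, Dict[str, float]]


def build_match_matrix(
    databases: List[str],
    category_samples: Dict[str, List[str]],
    probe: Callable[[str, str], int],
) -> Matrix:
    """Rows are Web databases, columns are subject categories.

    Each cell is the total number of result records the database
    returned for the sample queries of that category.
    `probe(db, query)` is a hypothetical callback that submits one
    query and returns the result count.
    """
    matrix: Matrix = {}
    for db in databases:
        matrix[db] = {
            category: float(sum(probe(db, q) for q in samples))
            for category, samples in category_samples.items()
        }
    return matrix


def normalize_rows(matrix: Matrix) -> Matrix:
    """Assumed post-processing: turn counts into per-database shares."""
    normalized: Matrix = {}
    for db, row in matrix.items():
        total = sum(row.values())
        normalized[db] = {c: (n / total if total else 0.0) for c, n in row.items()}
    return normalized


def best_category(row: Dict[str, float]) -> str:
    """Pick the subject category with the largest share for one database."""
    return max(row, key=row.get)


# Toy data standing in for real probing results (invented for this sketch).
FAKE_COUNTS = {
    ("db_a", "query optimization"): 12,
    ("db_a", "routing protocol"): 0,
    ("db_b", "query optimization"): 3,
    ("db_b", "routing protocol"): 9,
}


def fake_probe(db: str, query: str) -> int:
    return FAKE_COUNTS.get((db, query), 0)


category_samples = {
    "databases": ["query optimization"],
    "networks": ["routing protocol"],
}

matrix = build_match_matrix(["db_a", "db_b"], category_samples, fake_probe)
shares = normalize_rows(matrix)
for db, row in shares.items():
    print(db, best_category(row), row)
```

In a real system `probe` would submit the query through the database's search interface and read the reported number of results; a database selection step could then rank databases by their share in the category a user query belongs to.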
Keywords/Search Tags:Web database, Deep Web, Web database integration, meta search, search result schema, Web database classification