
Research On Deep Web Query Interface Discovery And Pattern Extraction

Posted on: 2013-12-09
Degree: Master
Type: Thesis
Country: China
Candidate: H Chen
GTID: 2208330371975573
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the amount of information on the Web has grown explosively. However, a large part of this information is hidden in online databases. Because traditional search engines mostly follow web page links to find information, users can reach this hidden information, known as the "Deep Web", only by submitting queries through a query interface, not through a traditional search engine. In recent years, because Deep Web information is larger in scale, higher in quality and more widely applicable, the Deep Web has gradually attracted extensive attention from domestic and foreign experts and scholars and has become a hot research field in information retrieval.

Due to the heterogeneous and dynamic nature of Deep Web data, making use of this information is very challenging and involves a number of problems to be solved: query interface discovery, schema extraction, interface integration, query transformation, result extraction and so on. A lot of research has already addressed these problems and produced a series of results, but many deficiencies remain and improvements are needed. This thesis focuses on the following problems in Deep Web data integration: the high dimensionality of the feature space in domain-specific interface discovery and schema extraction, the neglect of traditional search engine forms, the low accuracy of schema extraction, and the heavy reliance on human intervention, which prevents fully automatic operation. The main research results are as follows:

(1) A method is proposed for finding query interfaces in a specific domain by building a multi-layer classification model. Query interface discovery is the first problem that must be solved in Deep Web data integration. First, the pages returned by a focused crawler are filtered to check whether they contain a form, and non-query interfaces are deleted. A classification model is then built with one classifier per concern: query interface classification, domain classification, and traditional search engine classification. Different page features are assigned to each layer of the multi-layer classifier for training, which reduces the dimensionality of the feature space and increases classification accuracy. Furthermore, traditional search engine forms are filtered out of the query forms, leaving the pages that contain query interfaces for the specific domain. Experiments show that this method reduces manual intervention, effectively eliminates noise, and improves the classification results.

(2) A method is proposed for extracting the schema of Deep Web query interfaces based on spatial clustering. The schema of a query interface is made up of many attributes, but HTML does not clearly define how such a schema should be expressed, so it is very difficult to understand the schema information of a query interface correctly. Starting from the needs of query interface integration and conversion, the thesis applies spatial clustering to extract a hierarchical attribute tree. The clustering mainly uses the spatial relationships among the interface attributes (adjacency, alignment and position) to discover the intrinsic relationships among them. The ATTACH and ATTACHONE algorithms are then used to attach labels to the nodes of the attribute tree, completing the schema extraction. Experiments show that the spatial clustering algorithm achieves a higher recognition rate for attribute order in query interfaces and gives good results in query interface schema extraction.
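The layered filtering in contribution (1) can be illustrated with a small sketch. It is not the thesis's implementation: each layer here is a hand-written rule standing in for a trained classifier, and the feature names, thresholds, the `classify` function and the domain vocabulary are all assumptions made for illustration. What it does show is the idea that each layer looks only at its own small feature subset, so no single classifier has to handle the full feature space.

```python
from html.parser import HTMLParser


class FormFeatures(HTMLParser):
    """Collect a few simple form features from an HTML page (illustrative set)."""

    def __init__(self):
        super().__init__()
        self.forms = 0
        self.text_inputs = 0
        self.selects = 0
        self.password_inputs = 0
        self.words = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "form":
            self.forms += 1
        elif tag == "select":
            self.selects += 1
        elif tag == "input":
            # An <input> without a type attribute defaults to a text field.
            kind = (attributes.get("type") or "text").lower()
            if kind in ("text", "search"):
                self.text_inputs += 1
            elif kind == "password":
                self.password_inputs += 1

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def classify(html, domain_vocab):
    """Run a page through the layers in order; stop at the first rejection."""
    f = FormFeatures()
    f.feed(html)
    # Layer 0: pages without any form cannot hold a query interface.
    if f.forms == 0:
        return "no form"
    # Layer 1 (stand-in rule): login-style forms are not query interfaces.
    if f.password_inputs > 0 or f.text_inputs + f.selects == 0:
        return "non-query form"
    # Layer 2 (stand-in rule): require at least two domain terms on the page.
    if len(domain_vocab & set(f.words)) < 2:
        return "off-domain"
    # Layer 3 (stand-in rule): a lone keyword box looks like a search engine.
    if f.text_inputs == 1 and f.selects == 0:
        return "search engine"
    return "domain query interface"


if __name__ == "__main__":
    vocab = {"title", "author", "isbn", "publisher"}  # hypothetical book domain
    page = (
        "<form><label>Title</label><input type='text' name='t'>"
        "<label>Author</label><input type='text' name='a'>"
        "<select name='fmt'><option>Hardcover</option></select></form>"
    )
    print(classify(page, vocab))
```

In a real system each rule would be replaced by a classifier trained on the features relevant to that layer; the cascade structure and early rejection would stay the same.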
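The spatial reasoning in contribution (2) can be sketched in the same hedged way. The code below groups form fields that are aligned on the same row and horizontally adjacent, then attaches the nearest label to each group, giving a two-level attribute tree. The `Box` type, the tolerances and the nearest-label heuristic are assumptions for illustration; in particular, `attach_label` is a simple stand-in and does not reproduce the ATTACH or ATTACHONE algorithms, whose details the abstract does not give.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A rendered form element with its bounding box (hypothetical layout data)."""
    name: str
    kind: str  # "label" or "field"
    x: float
    y: float
    w: float
    h: float


def aligned(a, b, tol=5.0):
    """Two boxes are top-aligned if their y coordinates differ by at most tol."""
    return abs(a.y - b.y) <= tol


def gap(a, b):
    """Horizontal gap between two boxes, measured left edge to right edge."""
    left, right = sorted((a, b), key=lambda e: e.x)
    return right.x - (left.x + left.w)


def cluster_rows(fields, tol=5.0, max_gap=20.0):
    """Group fields that are aligned on one row and horizontally adjacent."""
    groups = []
    for f in sorted(fields, key=lambda e: (e.y, e.x)):
        last = groups[-1][-1] if groups else None
        if last is not None and aligned(last, f, tol) and gap(last, f) <= max_gap:
            groups[-1].append(f)
        else:
            groups.append([f])
    return groups


def attach_label(group, labels, tol=5.0):
    """Prefer the closest label to the left on the same row, else the closest above."""
    first = group[0]
    left = [l for l in labels if aligned(l, first, tol) and l.x + l.w <= first.x]
    if left:
        return max(left, key=lambda l: l.x).name
    above = [l for l in labels
             if l.y + l.h <= first.y and abs(l.x - first.x) <= tol]
    return max(above, key=lambda l: l.y).name if above else None


def extract_schema(boxes):
    """Return a two-level attribute tree as (label, [field names]) pairs."""
    labels = [b for b in boxes if b.kind == "label"]
    fields = [b for b in boxes if b.kind == "field"]
    return [(attach_label(g, labels), [f.name for f in g])
            for g in cluster_rows(fields)]


if __name__ == "__main__":
    boxes = [
        Box("Title", "label", 10, 10, 50, 20),
        Box("title", "field", 70, 10, 150, 20),
        Box("Price", "label", 10, 50, 50, 20),
        Box("price_min", "field", 70, 50, 60, 20),
        Box("price_max", "field", 140, 50, 60, 20),
    ]
    for label, names in extract_schema(boxes):
        print(label, names)
```

Here the two price fields end up under one node because they share a row and sit close together, which is the kind of intrinsic relationship that adjacency and alignment are meant to reveal. The sketch assumes fields on one row share nearly the same y coordinate; a fuller version would nest groups recursively to build deeper trees.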
Keywords/Search Tags: Deep Web, Multi-Layer Classifiers, Interface Discovery, Schema Extraction