
Research On Deep Web Search Interface And Search Result Extraction

Posted on: 2011-06-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H B Zhang
Full Text: PDF
GTID: 1118330332472842
Subject: Computer Science and Technology
Abstract/Summary:
As the Internet develops rapidly, a large number of online Web databases have become accessible. The information stored in these databases is called the Deep Web. It is produced dynamically in response to queries submitted through a search interface, so traditional search engines cannot index it. To let users access Deep Web information conveniently, Deep Web data integration has become an urgent problem in information retrieval.

Understanding Deep Web search interfaces is a crucial problem in Deep Web data integration. Based on an analysis of the research status of Deep Web data integration, this paper addresses several crucial problems related to Deep Web search interfaces, including the proposal of a Deep Web Domain Model, Deep Web search interface discovery and schema extraction, and search result extraction and annotation. The main contributions and innovations include:
●This paper proposes a Deep Web Domain Model based on research into Deep Web search interfaces. The Deep Web Domain Model contains all the information of the Deep Web interfaces that belong to the same domain. This paper analyzes the feasibility of the Deep Web Domain Model theoretically and gives methods for constructing and storing it. The Domain Model can be applied to many problems in Deep Web data integration and is intended to markedly improve the integration system.
●This paper proposes a Post-Query-based approach to Deep Web search interface discovery called PostClassifier. PostClassifier first filters interfaces using the rules produced by the Pre-Query approach, in order to reduce the resources consumed by query submission. It then uses the Domain Model to judge the domain of the interface and to fill in the query keywords. Finally, PostClassifier identifies the interface's type by analyzing the query results returned by different kinds of interfaces.
●This paper proposes an approach to interface schema extraction that, for the first time, deals with labels and elements separately. First, we construct a label tree for the interface; in this step we find the corresponding node in the Domain Model for each label and handle repeated and missing labels. Next, we find the matching label for each element. We use the label's corresponding node in the Domain Model to match the element, so that more information is available for finding an element's label. If missing labels have matching elements, they are also handled correctly when the results of the previous two steps are merged into the final interface schema.
●This paper proposes an approach to search result extraction and annotation called EaSd. EaSd uses the VIPS (Vision-based Page Segmentation) representation of the HTML page. Because query keywords tend to appear in the query results, EaSd uses them to discover each record and then the record block. EaSd aligns the data units of all records to find their common patterns or features, which helps annotation. Annotating with both the Domain Model and the local interface schema addresses the problems of an inadequate local interface schema and inconsistent labels. We combine several annotation methods to improve recall and precision. Experimental results show that EaSd can discover and annotate most records.
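As an illustration only, the keyword heuristic behind EaSd's record discovery can be sketched as follows. This is a minimal sketch, not the dissertation's implementation: the `Block` class, the function names, and the sample data are all assumptions made for this example, and plain strings stand in for the visual blocks that VIPS would produce.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for one visual block of a result page."""
    parent: str  # identifier of the enclosing block (assumed field)
    text: str    # visible text of the block


def find_record_blocks(blocks, keywords):
    """Return the blocks whose text contains at least one query keyword."""
    lowered = [k.lower() for k in keywords]
    return [b for b in blocks if any(k in b.text.lower() for k in lowered)]


def find_record_region(blocks, keywords):
    """Return the parent holding the most keyword-bearing blocks.

    Returns None when no block contains any keyword.
    """
    counts = {}
    for b in find_record_blocks(blocks, keywords):
        counts[b.parent] = counts.get(b.parent, 0) + 1
    if not counts:
        return None
    return max(counts, key=counts.get)


# Invented sample page: two result records plus navigation and footer.
blocks = [
    Block("nav", "Home | Search | Help"),
    Block("results", "Deep Web Crawling, 2008, $35.00"),
    Block("results", "Mining the Deep Web, 2010, $42.50"),
    Block("footer", "Copyright notice"),
]
region = find_record_region(blocks, ["deep web"])
print(region)
```

Blocks that contain the submitted keyword are treated as candidate records, and the parent that holds most of them is taken as the candidate record block. Aligning data units and annotating them, as the abstract describes, would be separate later steps.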
Keywords/Search Tags: Deep Web search interface, domain model search, interface discovery, search interface schema extraction, search result extraction and annotation