
Deep Web Interface Discovery And Extraction Research Based On Rules

Posted on: 2011-09-13
Degree: Master
Type: Thesis
Country: China
Candidate: L H Yang
GTID: 2178360308954098
Subject: Computer application technology
Abstract/Summary:
To make effective use of the abundant, high-quality information in the deep web, building a deep web data integration system has become an urgent need. Search interface discovery and interface extraction are prerequisites for such a system and play an important role in it. This thesis presents a rule-based method for interface discovery and extraction, consisting of two parts: deep web interface discovery and interface extraction.

For deep web interface discovery, the main improvements over traditional methods are as follows:
(1) The set of result pages is obtained differently. In this thesis, pages are gathered through the internal crawling mechanism of a search engine combined with domain knowledge, which avoids the low efficiency of a self-designed crawler and locates pages quickly.
(2) The judgement method is different. First, filtering rules are applied to discard irrelevant pages. Second, domain knowledge is used to confirm the candidate pages. Third, a decision rule is applied to the candidate pages to identify search interfaces. Unlike the original approaches, the method need not analyze every page in sequence, which saves time and makes search interface discovery more efficient. In addition, the rules find more search interfaces and distinguish them more accurately from the superficially similar interfaces of search engines.

As for interface extraction, features are usually extracted during interface discovery, with the name, type, and value of the control elements and word frequency serving as the features. However, semantic text is rarely taken into consideration.
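The three-step judgement used for interface discovery (filter, confirm with domain knowledge, then decide) might be sketched as follows. This is only an illustrative sketch: the abstract does not state the actual rules, so the concrete filtering rule, the keyword-based domain check, and the decision rule below are all assumptions, as are the names `FormCollector`, `passes_filter`, `is_domain_candidate`, `is_search_interface`, and `discover`.

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Collect, for each <form> on a page, the types of its controls."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one list of control types per form
        self._current = None   # controls of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = []
        elif self._current is not None and tag in ("input", "select", "textarea"):
            if tag == "input":
                # An <input> without a type attribute defaults to "text".
                ctype = (attrs.get("type") or "text").lower()
            else:
                ctype = tag
            self._current.append(ctype)

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def passes_filter(forms):
    # Assumed filtering rule: a page without any form is irrelevant.
    return len(forms) > 0


def is_domain_candidate(html, domain_terms):
    # Assumed domain check: the page mentions at least one domain term.
    text = html.lower()
    return any(term.lower() in text for term in domain_terms)


def is_search_interface(controls):
    # Assumed decision rule: no password field (login form), at least one
    # query-like control, and some way to submit the form.
    if "password" in controls:
        return False
    has_query = any(c in ("text", "search", "select") for c in controls)
    has_submit = any(c in ("submit", "image", "button") for c in controls)
    return has_query and has_submit


def discover(html, domain_terms):
    """Return True if the page appears to contain a domain search interface."""
    parser = FormCollector()
    parser.feed(html)
    if not passes_filter(parser.forms):            # step 1: filter
        return False
    if not is_domain_candidate(html, domain_terms):  # step 2: confirm
        return False
    return any(is_search_interface(f) for f in parser.forms)  # step 3: decide
```

Because each step returns early, pages rejected by the cheap filter never reach the later checks, which is the kind of saving the abstract attributes to not analyzing every page in sequence.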
The rule-based interface extraction method presented in this thesis instead analyzes the interface independently and makes full use of the interface's attributes, namely its text labels and control elements. It extracts interfaces more effectively, and the resulting interface description is easy for users to understand. Compared with traditional methods, it defines a new interface expression. First, to ease interface integration and query translation, both the character set (charset) and the classification of the interface are taken into account and added to the interface expression. Second, the method makes the best use of the semantic features of the text literals, drawing on domain knowledge and on the relationship between semantics and position. This enables complex schema matching and resolves extraction failures caused by incomplete semantics. In short, the method can meet different application requirements.
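As a rough illustration of what such an interface expression could look like, the sketch below records the charset, the interface classification, and the labelled controls. The abstract does not give the actual schema, so every field name, the `attach_labels` helper, and the fallback vocabulary are hypothetical. Labels are attached by position (the nearest preceding text), and a small domain vocabulary supplies a label when no text is available, loosely mirroring the use of position and domain knowledge when semantics are incomplete.

```python
from dataclasses import dataclass, field


@dataclass
class Control:
    name: str        # value of the control's name attribute
    type: str        # e.g. "text", "select"
    label: str = ""  # text label associated with the control


@dataclass
class InterfaceExpression:
    # All field names here are illustrative assumptions.
    url: str
    charset: str     # page encoding, useful for query translation
    category: str    # classification of the interface (its domain)
    controls: list = field(default_factory=list)


def attach_labels(tokens, domain_vocab):
    """Pair each control with a label.

    tokens: page items in document order, either ("text", string) or
            ("control", name, type).
    domain_vocab: maps a keyword to a domain concept; used only when a
            control has no preceding text label.
    """
    controls = []
    pending = ""  # most recent text not yet claimed by a control
    for tok in tokens:
        if tok[0] == "text":
            pending = tok[1].strip().rstrip(":")
            continue
        _, name, ctype = tok
        label = pending
        if not label:
            # Fallback: infer the label from the control name.
            label = next(
                (concept for kw, concept in domain_vocab.items()
                 if kw in name.lower()),
                "",
            )
        controls.append(Control(name, ctype, label))
        pending = ""  # a text label is consumed by at most one control
    return controls


tokens = [
    ("text", "Title:"),
    ("control", "q_title", "text"),
    ("control", "auth", "text"),  # no visible label on the page
]
expression = InterfaceExpression(
    url="http://example.com/search",
    charset="utf-8",
    category="books",
    controls=attach_labels(tokens, {"auth": "Author"}),
)
```

Here the first control takes its label from the preceding text, while the second, which has no visible label, is labelled through the vocabulary lookup.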
Keywords/Search Tags:Search interface, Search engine, Domain ontology, Interface expression