
Deep Web Interface Discovery Based On Domain Knowledge

Posted on: 2010-06-16
Degree: Master
Type: Thesis
Country: China
Candidate: Z L Yao
GTID: 2178360302461986
Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technology, web databases are attracting more and more attention. According to the "depth" of its information, the Web can be divided into the Surface Web and the Deep Web. Deep Web content is domain-specific and of higher quality than Surface Web content. To make effective use of the abundant information in the Deep Web, building a Deep Web integration system has become an urgent need, and interface discovery is the prerequisite for such a system. Deep Web interface discovery must address four problems: (1) find websites that may contain Deep Web interfaces for the domain; (2) identify the true Deep Web interfaces on those websites; (3) assess coverage, that is, the share of all Deep Web interfaces in the domain that the discovered interfaces account for; (4) extract the attribute information of each Deep Web interface. For the first problem, we use a search engine to find as many interfaces as possible with few query words, and experiments show that the approach is effective. The difficulty lies in choosing which query words to submit. In this paper, we present a method for determining query words using domain knowledge and a search engine.
The method has three aspects: (1) Compute the popularity of candidate words. Domain knowledge is used to decide whether a feature word is a high-profile word; the candidate word and the feature word are combined into a term that is submitted to the search engine; a target site counts as a match if it exactly matches the combined term, and the more sites match, the higher the visibility of the term. (2) Filter the returned web pages using domain knowledge of their URLs. This quickly removes sites unrelated to the domain, and the more sites remain, the more Deep Web interfaces of the domain can be expected. Counting the remaining sites for each word and sorting the words in decreasing order gives a rule: words ranked near the front find more of the domain's Deep Web interfaces than words ranked further back, so as many domain interfaces as possible can be reached with as few submitted query words as possible. (3) Determine the procedure for selecting query words. For the second problem, the paper presents an identification method based on SVM and reports its precision and recall in experiments. For the third problem, we present an evaluation method based on websites that have already been integrated, which has a certain reference value. For the fourth problem, the paper gives an extraction method based on the DOM tree and regular expressions; experiments show that it is simple and accurately extracts the required information.
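The URL-filtering and ranking step in aspect (2) can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the names `is_domain_relevant` and `rank_query_words`, the substring-based URL test, the sample URLs, and the domain term set are all hypothetical, and search results are assumed to be already available as lists of URLs per query word.

```python
from typing import Dict, List, Set, Tuple
from urllib.parse import urlparse


def is_domain_relevant(url: str, domain_terms: Set[str]) -> bool:
    """Hypothetical URL filter: keep a URL whose host or path contains a domain term."""
    parsed = urlparse(url.lower())
    text = parsed.netloc + parsed.path
    return any(term in text for term in domain_terms)


def rank_query_words(
    results: Dict[str, List[str]], domain_terms: Set[str]
) -> List[Tuple[str, int]]:
    """Count the distinct sites that survive the URL filter for each query word,
    then sort the words by that count in decreasing order."""
    ranked = []
    for word, urls in results.items():
        # Count sites (hosts) rather than pages, so one site is not counted twice.
        hosts = {
            urlparse(u.lower()).netloc
            for u in urls
            if is_domain_relevant(u, domain_terms)
        }
        ranked.append((word, len(hosts)))
    # Ties are broken alphabetically to keep the order deterministic.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


# Made-up search results for two candidate query words in a "books" domain.
sample_results = {
    "paperback": [
        "http://www.examplebooks.com/search",
        "http://news.example.org/story",
        "http://bookshop.example.net/find",
    ],
    "author": [
        "http://www.examplebooks.com/advanced",
        "http://blog.example.org/about",
    ],
}
book_terms = {"book", "library"}

ranking = rank_query_words(sample_results, book_terms)
print(ranking)
```

Words at the front of `ranking` would be submitted first, in line with the rule that front-ranked words lead to more of the domain's interfaces. A substring test on the URL is a deliberately crude stand-in for whatever domain knowledge the filter actually uses.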
Keywords/Search Tags:Deep Web Interface, Domain Ontology, Search Engine, SVM