
Research On Discovering Domain-Specific Deep Web Entries Based On Focused Crawling And Ontology

Posted on: 2010-11-04    Degree: Master    Type: Thesis
Country: China    Candidate: B Y Song    Full Text: PDF
GTID: 2178360272496270    Subject: Computer software and theory
Abstract/Summary:
The Deep Web contains an enormous amount of information, several hundred times more than the surface web, and it continues to grow rapidly. This information is high-quality and domain-specific, among other merits, but deep web sources are sparsely distributed across the network. Traditional search engines cannot index deep web content; at the same time, the web keeps growing and the network environment becomes more complex by the day, so obtaining domain-specific deep web data sources is increasingly difficult. There is an urgent need for an effective strategy for discovering deep web data sources.

To address this issue, this paper proposes a scheme for discovering domain-specific deep web entries. The work divides into three parts: the first part concerns focused crawling; the second identifies deep web entries; the third uses a domain ontology to judge the relevance of a deep web entry to the subject. The paper then illustrates how these three parts are combined to build an online domain-specific deep web entry crawling system.

The first part uses a focused crawler to fetch domain-related web pages. The focused crawler relies on two kinds of evidence, links and page content, to guide crawling. To predict the subject relevance of the page a link points to, the only reliable information is the link's anchor text and the surrounding link context. By extracting domain information from related link contexts, we build a subject characteristic word set. Compared with domain information obtained from a large-scale classified catalog, this word set is more flexible: it can be constructed dynamically according to the results of online crawling. Once the subject characteristic word set is available, we can construct the standard characteristic vector of the subject.
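The link-relevance test built on this standard vector can be sketched as follows. This is a minimal illustration, assuming simple term-frequency vectors over the subject characteristic word set; the vocabulary and threshold here are hypothetical, not taken from the thesis:

```python
import math
from collections import Counter

def tf_vector(words, vocabulary):
    """Term-frequency vector of `words` over a fixed vocabulary."""
    counts = Counter(w.lower() for w in words)
    return [counts[w] for w in vocabulary]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Illustrative subject characteristic word set (a book-selling domain).
VOCABULARY = ["book", "author", "title", "isbn", "publisher"]
STANDARD_VECTOR = tf_vector(
    ["book", "book", "author", "title", "isbn", "publisher"], VOCABULARY)

def link_is_relevant(context_words, threshold=0.3):
    """Download the link's target page only if its anchor text and
    surrounding context are similar enough to the standard vector."""
    candidate = tf_vector(context_words, VOCABULARY)
    return cosine_similarity(candidate, STANDARD_VECTOR) >= threshold
```

A candidate link whose context shares enough subject terms passes the threshold and its target page is downloaded; all others are skipped without fetching.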
We produce the characteristic vector of each candidate link's context, compute its similarity to the standard vector to judge whether it is related to the subject, and, if so, download the page the link points to. The downloaded page is then analyzed as text. This paper compares several classic classification algorithms, but because the dimensionality of a web page's characteristic vector is often very high, the computational cost is huge. To address this, we propose a method that uses WordNet synonym sets to replace the words in the text; when a hypernym's weight is greater than that of a concept, the hyponym concept is replaced by its hypernym. The result is a concept characteristic vector rather than a characteristic word vector, achieving dimensionality reduction. Classifying the reduced vectors improves both the efficiency and the precision of classification.

The second part judges whether a web page is a deep web entry. A deep web entry is generally an HTML form; this paper takes that assumption as a premise and processes only deep web entries made up of forms. The deep web form is a special kind of form: it is a subclass of query forms, which in turn are a subclass of fillable forms, so the deep web entries we discover form the smallest subclass. By analyzing a form, we can identify characteristics in its structural and textual information and use them to judge whether the form is a deep web entry. This paper summarizes the following rules: (1) identify the form according to the lexical features of the query interface.
Pages containing login or registration forms do not belong to query forms; they are removed according to words such as "username", "password", "login", and "register" that appear in registration or login forms. (2) Determine whether a form is a query interface form according to vocabulary such as "search" and "go". (3) Select forms according to the size of the query interface schema: quantitative analysis of query interfaces yields the heuristic rule that the average schema size of a query interface is greater than or equal to 3, so forms whose schema size is smaller than or equal to 2 are removed. A form that passes all the heuristic rules above is taken to be a deep web entry. Mistakes are certainly possible, but weighing overall efficiency against implementation complexity, this paper adopts the heuristic-rule method.

The third part analyzes the subject relevance of the crawled deep web forms. Because the information in a deep web form is very limited, the most valuable and intuitive information is the text label corresponding to each form control, so to analyze the subject relevance of a form we must first extract its labels. Form analysis shows that in most cases a label is located to the left of or above its control. Based on this observation, this paper uses a preorder traversal of the DOM tree, reading all nodes into a queue; as nodes are popped, labels and controls appear in sequence, and each label is matched to a control according to the order in which they appear. After obtaining the labels, the next step is to determine the subject relevance of the deep web entry, for which we use a manually constructed domain ontology collecting the most basic and common concepts and relations of the domain.
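The heuristic form filter of part two can be sketched as follows. This is a minimal illustration, assuming a candidate form is represented by its visible words (labels and buttons) and a list of field names standing in for the schema; the keyword sets mirror the rules above:

```python
# Rule 1 and Rule 2 keyword sets, taken from the heuristics above.
LOGIN_WORDS = {"username", "password", "login", "register"}
QUERY_WORDS = {"search", "go"}

def is_deep_web_entry(form_words, fields):
    """Apply the three heuristic rules to a candidate HTML form.

    form_words: words visible in the form (labels, button captions).
    fields:     names of the form's input fields (its schema).
    """
    words = {w.lower() for w in form_words}
    # Rule 1: discard login/registration forms.
    if words & LOGIN_WORDS:
        return False
    # Rule 2: a query interface carries words like "search" or "go".
    if not words & QUERY_WORDS:
        return False
    # Rule 3: query interfaces have schema size >= 3; discard size <= 2.
    if len(fields) <= 2:
        return False
    return True
```

As the text notes, a rule filter like this can misclassify individual forms, but it is cheap to apply during online crawling.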
The matching algorithm between the domain ontology and the labels has two branches. It first checks whether a label matches an ontology concept or a concept attribute directly; the concepts include not only all top-level concepts but also their sub-concepts, and the attributes include all object properties and data properties. If a matching concept exists, that concept is returned; we call this situation a 1:1 match. If there is no matching concept but there is a matching attribute, we call it a 1:m match. As long as a label matches successfully, it denotes a domain concept, which means the form is a domain-specific deep web entry.

Online practical verification confirms the effectiveness of the domain-specific deep web entry crawling system. After crawling, a certain number of entries were collected and the harvest ratio was good, so the method proposed in this paper can benefit further work on discovering domain-specific deep web entries.
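The two-branch matching can be sketched as follows. This is a simple illustration, assuming the ontology is flattened into concept names mapped to their attribute names; the toy ontology is hypothetical, and a real implementation would also normalize labels and traverse sub-concepts:

```python
# Illustrative toy ontology for a book-selling domain:
# concept name -> attribute (property) names.
ONTOLOGY = {
    "book": ["title", "author", "isbn", "publisher"],
    "price": ["currency", "amount"],
}

def match_label(label):
    """Return (kind, concept) for a form label, or None.

    kind is "1:1" when the label names a concept directly,
    "1:m" when it names an attribute of some concept.
    """
    key = label.strip().lower()
    if key in ONTOLOGY:                      # branch 1: 1:1 concept match
        return ("1:1", key)
    for concept, attributes in ONTOLOGY.items():
        if key in attributes:                # branch 2: 1:m attribute match
            return ("1:m", concept)
    return None

def is_domain_specific(labels):
    """A form is a domain-specific entry if any of its labels matches."""
    return any(match_label(l) is not None for l in labels)
```

Any successful match, on either branch, is enough to classify the form as a domain-specific deep web entry.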
Keywords: focused crawling, ontology, deep web entry