Research On Identifying Domain-Specific Deep Web Entries

Posted on: 2009-02-04
Degree: Master
Type: Thesis
Country: China
Candidate: T C Liu
Full Text: PDF
GTID: 2178360242481297
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet and changing information needs, general-purpose search engines that serve all users can no longer satisfy more specialized queries. Users need to search for domain-specific information, and focused crawling has arisen in response. Focused crawling collects information with traditional crawlers, which follow the text hyperlinks between pages and therefore cannot reach information on the Deep Web. To retrieve the numerous resources on the Deep Web and to search for domain-specific information, focused crawling for the Deep Web has become a hot topic in the search field.

To crawl the domain-specific Deep Web, we must first find its entries. The paper constructs an identifying machine that recognizes domain-specific Deep Web entries and guides crawling. The identifying machine works in three steps.

First, we judge whether a form is a Deep Web entry. Based on observation of forms, the paper proposes a set of heuristic rules as the judging criteria. A form is discarded if it meets one or more of the following four conditions:
1. The form contains no [...] or similar submit element.
2. The form contains an element that cannot be filled in, such as [...].
3. The form contains fewer than three labels.
4. The form is an entry to a known search engine.
Forms that pass this screening are treated as entries to the Deep Web. This is, of course, not fully accurate. If a form that passes the screening is not a Deep Web entry, then after we fill it in with the given theme knowledge and submit it, either no page is returned or the returned page is irrelevant to the theme.
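As a rough illustration, the four discard rules of the first step could be coded as below. This is a minimal sketch under stated assumptions, not the paper's implementation: the `Form` record, the control types counted as submit elements, the control types treated as unfillable, and the list of known search engine hosts are all hypothetical values chosen here, because the source does not give them.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse

# All three sets are assumptions for illustration; the source does not list them.
SUBMIT_TYPES = {"submit", "image"}
UNFILLABLE_TYPES = {"file", "password"}
KNOWN_SEARCH_HOSTS = {"www.google.com", "www.bing.com"}


@dataclass
class Form:
    """Hypothetical model of an extracted HTML form."""
    action: str                                   # absolute URL the form submits to
    control_types: list[str]                      # e.g. ["text", "select", "submit"]
    labels: list[str] = field(default_factory=list)


def is_candidate_entry(form: Form) -> bool:
    """Return True if the form survives all four discard rules."""
    # Rule 1: discard forms without a submit element.
    if not any(t in SUBMIT_TYPES for t in form.control_types):
        return False
    # Rule 2: discard forms containing an element that cannot be filled in.
    if any(t in UNFILLABLE_TYPES for t in form.control_types):
        return False
    # Rule 3: discard forms with fewer than three labels.
    if len(form.labels) < 3:
        return False
    # Rule 4: discard forms that lead to a known search engine.
    host = urlparse(form.action).hostname or ""
    if host in KNOWN_SEARCH_HOSTS:
        return False
    return True
```

A form that passes this filter is only a candidate; as the text notes, misjudged forms are expected to be caught later, when the submitted form returns no page or an off-theme page.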
Such misjudged forms are discarded in the later steps, so they have essentially no adverse effect on the final result.

Second, we fill in the Deep Web entry forms obtained in step 1 with the given theme knowledge, then submit them to obtain the result pages. The paper builds a theme knowledge tree by reorganizing the domain-specific theme knowledge, then matches the form element labels against the tree to fill in the form elements. The procedure consists of the following steps:
1. Extract forms from web pages and model them.
2. Fill in each element of the form with the given theme knowledge.
3. Submit the filled-in form to obtain the result page.

Among these steps, extracting the form labels and matching them with the theme knowledge are both the most important and the most difficult. The theme knowledge is usually given as feature vectors of keyword roots, a representation that is poorly suited to matching labels with theme knowledge. The theme knowledge therefore needs to be classified and organized into a theme knowledge tree. Nodes on the same level of the tree are ordered by the frequency of their roots, with the more frequently used ones on the left. That is, when the tree is traversed, the more frequently used roots on a given level are visited before the less frequently used ones. Once the tree is built, matching labels with theme knowledge becomes easier.

The accuracy of form label extraction directly affects the accuracy of form filling. Labels appear to users in different positions, but whether a label lies to the left of a form control or above it, in the DOM tree built from the HTML source all form labels are text nodes, and the form controls are element nodes named "INPUT" or "SELECT".
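The sibling ordering of the theme knowledge tree described above can be sketched as follows. This is only an illustration: the `ThemeNode` class, the pre-order traversal, and the case-insensitive substring test used to match a label against a root are assumptions made here, since the source does not specify how the tree is stored or how matching is performed.

```python
from dataclasses import dataclass, field


@dataclass
class ThemeNode:
    """Hypothetical node of a theme knowledge tree."""
    root: str                                     # keyword root
    freq: int                                     # how often the root is used
    children: list["ThemeNode"] = field(default_factory=list)

    def add(self, child: "ThemeNode") -> "ThemeNode":
        self.children.append(child)
        # Keep siblings ordered so the most frequent root is leftmost.
        self.children.sort(key=lambda n: n.freq, reverse=True)
        return child


def preorder(node: ThemeNode):
    """Visit a node, then its children from left to right."""
    yield node
    for child in node.children:
        yield from preorder(child)


def match_label(tree: ThemeNode, label: str):
    """Return the first visited node whose root occurs in the label.

    Substring matching is an assumption; the source does not define it.
    """
    text = label.lower()
    for node in preorder(tree):
        if node.root and node.root.lower() in text:
            return node
    return None
```

Because siblings are kept sorted by frequency, a traversal tries the more common roots on each level first, which is the property the text attributes to the tree.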
When the DOM tree is traversed, regardless of whether a form label appears to the left of or above its form control, the text node for the label always precedes the element node for the control. So although form labels appear in different positions, they can all be handled in the same way. The paper extracts the labels of form elements with the following algorithm:
1. Traverse the DOM tree and put the encountered "INPUT" nodes, "SELECT" nodes, and label nodes into a queue in order.
2. Initialize an empty buffer to store node data.
3. Take nodes from the queue in order until the first label node is reached, then store its value in the buffer.
4. Take the next node from the queue. If it is an element node and the buffer is not empty, the content of the buffer is the label of that element node; if it is a label node, replace the value in the buffer with its value.
5. Repeat step 4 until the last element of the queue has been handled.

Finally, a web page classifier analyzes the result page obtained in step 2 to judge whether the Deep Web entry is relevant to the given theme. The classifier is in essence a binary text classification procedure, which consists of three steps:
1. Extract the text features of the page.
2. Generate the document vector of the page.
3. Make a yes-or-no judgment using the text classification algorithm.

Experiments in the paper verified the identifying machine for domain-specific Deep Web entries. The experimental statistics show that the accuracy of the identifying machine exceeds 80%. The proposed method can therefore be applied to the web to identify domain-specific Deep Web entries.

In addition, the paper builds, on top of the basic focused crawling framework, a focused crawling framework that can identify domain-specific Deep Web entries.
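The five-step label extraction algorithm above can be sketched with Python's standard `html.parser`. This is a minimal sketch, not the paper's code. It makes a few assumptions that the source does not state: controls are identified by their `name` attribute, text inside a SELECT (option text) is not treated as a label, and only the HTML of a single form is passed in.

```python
from html.parser import HTMLParser


class FormNodeCollector(HTMLParser):
    """Step 1: queue INPUT/SELECT nodes and text nodes in document order."""

    def __init__(self):
        super().__init__()
        self.queue = []          # entries are (kind, value) pairs
        self._select_depth = 0   # > 0 while inside a SELECT element

    def handle_starttag(self, tag, attrs):
        if tag in ("input", "select"):
            # Assumption: a control is identified by its name attribute.
            name = dict(attrs).get("name") or tag
            self.queue.append(("control", name))
        if tag == "select":
            self._select_depth += 1

    def handle_endtag(self, tag):
        if tag == "select" and self._select_depth > 0:
            self._select_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Assumption: option text inside a SELECT is not a label.
        if text and self._select_depth == 0:
            self.queue.append(("label", text))


def extract_labels(form_html: str) -> dict:
    """Steps 2 to 5: map each control name to the closest preceding label."""
    collector = FormNodeCollector()
    collector.feed(form_html)
    collector.close()

    labels = {}
    buffer = None                      # step 2: empty buffer
    for kind, value in collector.queue:
        if kind == "label":
            buffer = value             # steps 3 and 4: store or replace the label
        elif buffer is not None:
            labels[value] = buffer     # step 4: buffered label belongs to this control
    return labels
```

Note that, as in the algorithm as stated, the buffer is not cleared after a match, so two consecutive controls with no text between them receive the same label, and controls that appear before the first label receive none.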
The paper also specifies the crawling procedure of this focused crawling framework for Deep Web entries, together with the handling of certain details and the reasons behind it. The framework design and the procedure design show that the identifying machine constructed in the paper can be applied to a focused crawling framework to identify and collect entries to the domain-specific Deep Web.
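The three steps of the binary page classifier in the final identification step (extract text features, build a document vector, make a yes-or-no judgment) can be illustrated as below. The source does not name the text classification algorithm, so this sketch swaps in a simple stand-in: term-frequency vectors compared with a theme vector by cosine similarity against a threshold. The tokenizer, the theme vector, and the threshold value of 0.2 are all assumptions made here.

```python
import math
import re
from collections import Counter


def extract_features(text: str) -> list:
    """Step 1: extract text features (here: lowercase alphabetic tokens)."""
    return re.findall(r"[a-z]+", text.lower())


def to_vector(tokens: list) -> Counter:
    """Step 2: build a term-frequency document vector."""
    return Counter(tokens)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def is_relevant(page_text: str, theme_vector: Counter, threshold: float = 0.2) -> bool:
    """Step 3: yes-or-no judgment; the threshold is an illustrative assumption."""
    page_vector = to_vector(extract_features(page_text))
    return cosine(page_vector, theme_vector) >= threshold
```

Any trained binary text classifier could take the place of the threshold test; the point is only that each result page is reduced to a vector and then accepted or rejected for the given theme.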
Keywords/Search Tags: Domain-Specific