
Using Form Classifier To Identify Domain-Specific Deep Web Entries

Posted on: 2008-11-03
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Xu
Full Text: PDF
GTID: 2178360212997019
Subject: Computer software and theory
Abstract/Summary:
Deep Web sources store their content in searchable databases that produce results dynamically only in response to a direct request. Traditional search engines build their indexes by crawling "surface" Web pages, but they cannot "see" or retrieve content in the Deep Web. Yet the Deep Web is estimated to be 500 times larger than the known "surface" World Wide Web: hundreds of billions of highly valuable documents lie hidden in searchable databases that traditional search engines cannot retrieve. Researching and mining the Deep Web is therefore important for improving the precision and coverage of search engines. Discovering entry points to the Deep Web is the most fundamental step in crawling it: a crawler must first find a set of entry points before it can proceed with the other steps. Very little work has been done in this area.

To identify Deep Web entries, this thesis designs a form classifier that recognizes domain-specific Deep Web entries. The classifier works in two steps: first, it identifies Deep Web entries; second, it judges whether an entry is relevant to a given topic. We then describe how the form classifier is designed.

The first step: identifying Deep Web entries.

The vast majority of Deep Web portals on the Web are HTML forms, so we ignore other kinds of entries such as Java GUIs. Large numbers of HTML forms are easy to find by crawling the Web, but the forms are varied. We distinguish four nested classes of HTML forms: all forms, textual forms, fillable forms, and Deep Web forms. The smallest subclass, the Deep Web forms, points to Deep Web resources.

In this thesis we extract features based on this classification and on observation of forms. Fortunately, HTML forms have a complex structure that can be exploited to obtain a rich set of features.
We extract form features from three areas: form control types, form control counts, and the parameters of form controls.

(1) Features can be generated from the types of controls present in the HTML form, for example the existence of text controls and select controls, because different controls serve different functions. A form used for querying contains at least one text control or one select control. A password control, by contrast, protects user information; since roughly ninety-five percent of the Deep Web is publicly accessible information not subject to fees or subscriptions, a Deep Web form rarely contains a password control.

(2) Features can be generated from the number of controls, for example distinguishing a form with a single text control from one with multiple text controls. Most simple search interfaces have only one text control, while advanced Deep Web search interfaces offer more options and therefore more controls. Not all control types are considered in this thesis: the number of submit controls, for instance, has no influence on whether a form is a Deep Web form.

(3) Features can also be generated from particular parameters found in the HTML form, for example the name and value attributes of an input control. Some forms collect user information rather than accept queries; users fill in their email addresses in such forms, and we can detect this when the name or value attribute of a control is "email".

After extracting features, we ran the same experiment with different classification schemes: C4.5 decision trees, support vector machines (SVM), naive Bayes, and K-nearest neighbor (KNN). The precision and confusion matrix of C4.5 gave it an advantage over the other algorithms, so we use C4.5 decision trees for this part, achieving an accuracy of 97.5062% on the training set.
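The three structural feature areas above can be sketched in code. This is an illustrative approximation only, not the thesis's implementation: the exact feature set, control-type grouping, and "email" heuristic are assumptions for demonstration.

```python
from html.parser import HTMLParser

# Hypothetical sketch of structural feature extraction from an HTML form:
# counts of control types plus a flag for "email"-like parameters.
class FormFeatureParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.features = {
            "n_text": 0, "n_select": 0, "n_password": 0,
            "n_checkbox": 0, "n_submit": 0, "has_email_param": False,
        }

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "select":
            self.features["n_select"] += 1
        elif tag == "input":
            kind = a.get("type", "text").lower()
            if kind in ("text", "search"):
                self.features["n_text"] += 1
            elif kind == "password":
                self.features["n_password"] += 1
            elif kind == "checkbox":
                self.features["n_checkbox"] += 1
            elif kind == "submit":
                self.features["n_submit"] += 1
        # Parameter-based feature: "email" appearing in name or value
        # suggests an information-submission form, not a query form.
        if "email" in (a.get("name", "") + a.get("value", "")).lower():
            self.features["has_email_param"] = True


def extract_features(form_html: str) -> dict:
    parser = FormFeatureParser()
    parser.feed(form_html)
    return parser.features


form = ('<form><input type="text" name="query">'
        '<select name="class"></select>'
        '<input type="submit" value="Search"></form>')
print(extract_features(form))
```

A feature vector like this (one text control, one select, no password control) is the kind of input the trained C4.5 tree would classify as a likely Deep Web entry.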
The second step: judging whether a Deep Web entry is relevant to the given topic.

The Deep Web is large and its coverage is broad. To overcome the limitations of traditional search engines and satisfy user demands, this thesis integrates focused crawling technology with the Deep Web: the form classifier decides whether a Deep Web entry is relevant to the given topic.

The only clue available for judging the relevance of a Deep Web entry to a topic is the text in the form. The most important text in an HTML form is the labels mapped to the form fields, but HTML provides no explicit mechanism that dictates how forms must be written. For example, a text t may be the label of a control in one form but an option inside a select in another; although such an option is not a label, it is still relevant to the topic, so options in select controls matter as well. In addition, some text controls have default content, and buttons carry words that tell the user their function. Web pages also commonly use images: for user agents that cannot display images, the alt attribute specifies alternate text to serve as content when the element cannot be rendered normally. Since the form classifier cannot understand what an image means, the words in the alt attribute provide valuable information.

After extracting the text from the form, we normalize it by applying a stemming algorithm, removing stop words, and computing word frequencies. In addition to general stop words, this thesis collects form-specific words such as "submit", "search", and "query", which contribute nothing to classification. Finally, we represent the text as TF-IDF weighted vectors. The dimension of the vector is comparatively low because we extract words only from the form.
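The normalization and weighting steps above can be sketched as follows. The tokenizer, the stop-word list (which includes the form-specific words mentioned above), and the TF-IDF formula are illustrative assumptions; the thesis's actual stemming algorithm and word lists are not reproduced here.

```python
import math
import re
from collections import Counter

# Assumed stop-word list: general stop words plus form-specific words
# such as "submit", "search", "query" that carry no topical signal.
STOP_WORDS = {"the", "a", "an", "of", "and", "submit", "search", "query"}

def tokenize(text: str) -> list:
    """Lowercase, split into alphabetic tokens, drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def tfidf_vectors(form_texts: list) -> list:
    """Represent each form's text as a sparse TF-IDF weighted vector."""
    docs = [Counter(tokenize(t)) for t in form_texts]
    n = len(docs)
    # Document frequency: in how many form texts each term appears.
    df = Counter()
    for d in docs:
        df.update(d.keys())
    vectors = []
    for d in docs:
        total = sum(d.values())
        vectors.append({
            term: (tf / total) * math.log(n / df[term])
            for term, tf in d.items()
        })
    return vectors

texts = ["search flights airfare ticket", "book hotel room rates"]
vecs = tfidf_vectors(texts)
```

Because only words from the form itself are used, the resulting vectors stay sparse and low-dimensional, which is what makes them usable directly by the SVM in the next step.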
The dimension is only a few thousand, so we can use the vector directly. We again ran the same experiment with different classification schemes: C4.5 decision trees, support vector machines (SVM), naive Bayes, and K-nearest neighbor (KNN). The precision of SVM was higher than that of naive Bayes and KNN, so we use SVM for this part, achieving an accuracy of more than 90% on the training set, the highest being 98.4252%.

The two parts together compose the form classifier that identifies domain-specific entry points to the Deep Web. Given a form, we extract features from its structure and judge whether it is a Deep Web entry using the trained C4.5 decision tree; this discards most forms and keeps only entries that point to the Deep Web. We then judge whether the Deep Web form is relevant to the given topic using the trained SVM. Although the second step is computationally more expensive, few forms survive the first step. We achieved an accuracy of 94.5455% on the test set.

To apply the form classifier on the Web, this thesis builds a crawling framework for domain-specific Deep Web entry points on top of the basic focused crawling framework. Applying the form classifier within this framework, we collected a number of Deep Web entries for airfare; the harvest ratio reaches 80%, showing that the form classifier can be applied on the Web.
Keywords/Search Tags: Domain-Specific