
Research On Form-Based Hidden Web

Posted on: 2009-05-12
Degree: Master
Type: Thesis
Country: China
Candidate: R Xu
Full Text: PDF
GTID: 2178360242994195
Subject: Computer software and theory
Abstract/Summary:
Most search engines retrieve only the publicly indexable web (PIW), which is reached by following hyperlinks. As the web has developed, however, more and more information is stored in the back-end databases of websites. This data can be retrieved only through HTML forms, and the pages that return it are called hidden web pages. To help people obtain the important data held in web databases, we present a system that can search hidden web pages. This paper presents its architecture and discusses the key technologies.

First, the advantages and disadvantages of common search engines are analyzed, and common search engines are compared with hidden web search engines. A crawling strategy suited to the hidden web is given, which uses a link classifier and a text classifier to achieve focused crawling. In addition, based on the specific characteristics of forms, new stopping criteria are introduced that are very effective in guiding the crawler away from excessive speculative work within a single site.

The paper then simulates the process by which a user accesses the hidden web. First, each form is converted into a representation that a program can understand; that is, the form is modeled. Second, the useful forms are extracted using heuristic rules and a form classifier. Finally, the form labels and the context of the form are extracted, and the results are used to fill in the forms automatically and find the hidden web pages.

We make full use of the structural and textual information of forms. The classifier combines label classification with classification of the form's accompanying context. We use the Centroid, KNN, and SVM algorithms; the experiments show that SVM performs best. The experiments also verify the effectiveness of form classification and form extraction.
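The form-modeling and heuristic-filtering steps described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the names `FormModel`, `FormExtractor`, `extract_forms`, and `looks_searchable`, and the two heuristic rules (reject forms with a password field, require at least one free-text or selection field), are assumptions chosen only for illustration. The trained form classifier (Centroid, KNN, or SVM) that the abstract mentions is not shown.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class FormModel:
    """Program-friendly model of one HTML form (hypothetical structure)."""
    action: str
    method: str
    fields: list = field(default_factory=list)  # (tag, type, name) tuples


class FormExtractor(HTMLParser):
    """Collects every <form> on a page together with its input controls."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            method = (a.get("method") or "get").lower()
            self._current = FormModel(a.get("action") or "", method)
        elif self._current is not None and tag in ("input", "select", "textarea"):
            # <input> defaults to type "text"; other controls use their tag name.
            ftype = (a.get("type") or "text").lower() if tag == "input" else tag
            self._current.fields.append((tag, ftype, a.get("name") or ""))

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def extract_forms(html):
    """Parse an HTML string and return a list of FormModel objects."""
    parser = FormExtractor()
    parser.feed(html)
    parser.close()
    return parser.forms


def looks_searchable(form):
    """Assumed heuristic: no password field, at least one query-like field."""
    types = [ftype for _, ftype, _ in form.fields]
    if "password" in types:
        return False  # most likely a login or registration form
    return any(t in ("text", "search", "select", "textarea") for t in types)
```

For example, given a page containing a search form with one text field and a login form with a password field, `extract_forms` returns two models, and `looks_searchable` keeps the first and rejects the second. In a fuller pipeline, forms that pass such rules would then go to the trained form classifier.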
Keywords/Search Tags:Page text classify, Hidden web, Web information extract, HTML form, name value table