
Discovering Deep-web Sources and Extracting Content using Automated Query Generation

Posted on: 2012-02-18
Degree: M.E
Type: Thesis
University: The Cooper Union for the Advancement of Science and Art
Candidate: Shrestha, Subodh
Full Text: PDF
GTID: 2468390011961224
Subject: Engineering
Abstract/Summary:
A significant amount of information on the web today resides in structured databases that are accessible only through search forms and their query interfaces. Such sources are collectively referred to as the Hidden Web or Deep Web. Regular web crawlers cannot retrieve information from these sources because they reach content only by following links, and no links lead to the information these sources contain. Extracting information from deep web sources therefore requires interacting with the search forms and submitting queries. Because most crawlers, and consequently the search engines built on them, cannot reach this information, there is a need for an infrastructure that can mine the deep web and enrich the information available to users.

In this study, I tackle two problems involved in retrieving information from deep web sources. The first is discovering deep web sources automatically. Given the large number of web sites in existence, it is impractical for a human to sift through page after page to determine which of them are deep web sources. I present an automated method of discovering deep web sources using crawlers equipped with the ability to classify sources as either deep-web or not deep-web. The second problem is extracting information from the deep web sources. I propose an automated method of querying a hidden web source that does not require prior knowledge of the domain to which the source belongs. The method is thus not limited to a particular domain and is generally applicable to any field of interest to the user. Also proposed are a response evaluation metric that aims to identify distinct pages and a query selection metric that aims to extract novel information from the source through further queries on the interface.
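
To make the first problem concrete, a crawler could flag a page as a candidate deep web source when it contains a search-style form. The Python sketch below is only an illustrative assumption, not the classifier used in the thesis: the names `FormScanner` and `looks_like_deep_web_source` are hypothetical, and the rule (a form with a free-text or selection control and no password field) is a simple heuristic chosen for the example.

```python
from html.parser import HTMLParser


class FormScanner(HTMLParser):
    """Collects, for each <form> on a page, the types of its input controls."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one list of control types per completed form
        self._current = None  # controls of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = []
        elif self._current is not None:
            if tag == "input":
                # An <input> without a type attribute defaults to "text".
                self._current.append((attrs.get("type") or "text").lower())
            elif tag in ("select", "textarea"):
                self._current.append(tag)

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def looks_like_deep_web_source(html):
    """Heuristic (assumed, not from the thesis): True if the page has a
    form that accepts a query and does not look like a login form."""
    scanner = FormScanner()
    scanner.feed(html)
    scanner.close()
    query_controls = ("text", "search", "select", "textarea")
    for controls in scanner.forms:
        has_query_field = any(c in query_controls for c in controls)
        is_login = "password" in controls
        if has_query_field and not is_login:
            return True
    return False
```

A crawler built along these lines would call `looks_like_deep_web_source` on each fetched page and queue the pages that return `True` for query submission. A real classifier would likely need further signals, such as form labels or the structure of the result pages.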
Keywords/Search Tags:Web, Information, Query, Using, Automated, Discovering