Discovering Deep-web Sources and Extracting Content using Automated Query Generation

Posted on:2012-02-18

Degree:M.E

Type:Thesis

University:The Cooper Union for the Advancement of Science and Art

Candidate:Shrestha, Subodh

GTID:2468390011961224

Subject:Engineering

Abstract/Summary:

A significant amount of information on the web today resides in structured databases that are accessible only through the query interfaces of their search forms. Such sources are collectively referred to as the Hidden Web or Deep Web. Regular web crawlers cannot retrieve information from these sources because crawlers reach content only by following links, and no links lead to the information these sources contain. Extracting information from deep web sources therefore requires interacting with the search forms and submitting queries. Because most crawlers, and consequently the search engines built on them, cannot reach information in deep web sources, there is a need for an infrastructure that can mine the deep web and enrich the information available to users.

In this study, I tackle two problems involved in retrieving information from deep web sources. The first is discovering deep web sources automatically. Given the very large number of web sites, it is impractical for a human to sift through page after page to determine which of them are deep web sources. I present an automated method of discovering deep web sources using crawlers equipped with the ability to classify sources as either deep-web or not deep-web. The second problem is extracting information from the deep web sources. I propose an automated method of querying a hidden web source that does not require prior knowledge of the domain to which the source belongs. The method is thus not limited to a particular domain and is generally applicable to any field in which the user is interested. Also proposed are a response evaluation metric that aims to identify distinct pages and a query selection metric that aims to extract novel information from the source through further queries on the interface.
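The abstract does not spell out the classifier or the two metrics, so the following is only a minimal illustrative sketch of the general ideas, not the method used in the thesis. It assumes that a page counts as a candidate deep-web source when it contains a form with a free-text field and no password field, that two response pages are "distinct" when the Jaccard similarity of their word sets falls below a threshold, and that the next query is the most frequent not-yet-issued term found in the distinct pages. All function names, the 0.9 threshold, and the tokenization rule are hypothetical choices made for this example.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class FormScanner(HTMLParser):
    """Collects, for each <form> on a page, the types of its input controls."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one list of control types per form
        self._current = None   # controls of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = []
            self.forms.append(self._current)
        elif self._current is not None:
            if tag == "input":
                # An <input> without a type attribute defaults to a text field.
                self._current.append((attrs.get("type") or "text").lower())
            elif tag in ("select", "textarea"):
                self._current.append(tag)

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def looks_like_deep_web(html):
    """Assumed heuristic: a query form has a text field and no password field."""
    scanner = FormScanner()
    scanner.feed(html)
    for controls in scanner.forms:
        has_query_field = any(c in ("text", "search") for c in controls)
        is_login = "password" in controls
        if has_query_field and not is_login:
            return True
    return False


def tokens(text):
    """Lower-cased set of alphabetic words of three or more letters."""
    return set(re.findall(r"[a-z]{3,}", text.lower()))


def is_distinct(page_tokens, seen, threshold=0.9):
    """Assumed response metric: distinct if Jaccard similarity to every
    previously seen page is below the threshold."""
    for old in seen:
        union = page_tokens | old
        if union and len(page_tokens & old) / len(union) >= threshold:
            return False
    return True


def next_query(distinct_pages, issued):
    """Assumed selection metric: most frequent term not yet issued,
    with ties broken alphabetically; None when no candidate remains."""
    counts = Counter(
        t for page in distinct_pages for t in page if t not in issued
    )
    if not counts:
        return None
    return min(counts, key=lambda t: (-counts[t], t))
```

In a crawl loop built on these assumptions, each fetched page would be passed to `looks_like_deep_web`; for pages that qualify, every response to a submitted query would be tokenized, kept only if `is_distinct` returns true, and then fed to `next_query` to choose the following term. Because candidate terms come from the responses themselves, the loop needs no domain-specific vocabulary.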

Keywords/Search Tags:

Web, Information, Query, Using, Automated, Discovering

Related items

1	Automated Query Reformulation Approach For Document Search In Software Engineering
2	Automated Geometry Theorems Proving And Discovering Based On Point Geometry
3	Study On Discovering The Relationships And Semantic Query Among Data Resources In DataSpace
4	Research On Discovering Frequent Episodes In Event Sequences
5	Research On Performance Testing And Analysis Of XML Query Engines
6	Information Retrieval System Based On Document Query
7	Research On Technologies Of Application Protocol Signature Discovering
8	Discovering Relations Between Web Tables
9	Research On Query-directed Multi-mode Automatic Summarization
10	Research And Implementation Of Methods Of Query Transformation For Information Integration