Design and implementation of an intelligent Web crawler for corporate data scraping

Posted on:2008-09-09

Degree:M.S.

Type:Thesis

University:University of Nebraska at Omaha

Candidate:Qin, Xinfeng

GTID:2448390005453115

Subject:Information Science

Abstract/Summary:

The Internet has been a remarkable repository of public corporate data since its formative years. Yet the current giant of Internet search, Google Inc., does not provide a satisfactory way to locate this data. For example, a user who types "Juergen Dormann IBM" into the Google search form, expecting to locate biographical information about this IBM executive, will be disappointed because none of the result pages relates to the intent of the search [22]. Although the popularity of a Web resource is core to Google's PageRank algorithm, popularity does not always equate to relevance.

Therefore, the primary goal of this thesis is to design and implement an integrated tool that can intelligently acquire (i.e., search for and extract) data from the World Wide Web (WWW) and store it in a format that interested consumers can easily retrieve. Since none of the current Web crawlers is well suited to this research objective, we have implemented a novel Web crawler based on the semantic relevance of keywords encountered in biographies of corporate personnel, as described by Dasgupta (2005) in [10]. For now, the crawler is named BioBot, reflecting its focus on the discovery and retrieval of executive biographies. It accomplishes Website scraping tasks much more precisely by using the Websites' internal search engines for semantic discovery instead of blind scraping, the Term Frequency and Inverse Document Frequency (TF*IDF) algorithm for semantic evaluation, and an Adaptive Bayesian Classifier based self-learning keyword set for semantic category definition. In addition, by using the Xerces 2.0 XML parser [35] with a defined DTD, only the plain text is saved into XML files, with uniform markup and tags for efficient search.

A Web interface is also provided for quick searching of the XML-formatted data that BioBot produces from the scraped corporate information. Simulated experiments show that BioBot can scrape higher-quality Web pages by using trusted Websites' internal search engines.

Keywords/Search Tags:

Data, Web, Corporate, Search, Crawler

Related items

1	Internet Crawler Research And Implementation
2	Research On The Key Technology And Implementation Of The Focused Crawler Based On HITS And Shark-Search
3	Design And Implementation Of Search Engine System Based On The Incremental Crawler
4	Research On The Topic Crawler Algorithm Based On Vector Space Model
5	Research Of Intranet Information Supervision System Based On Net Crawler And Full-text Search Engine
6	Research On The Key Technology And Implementation Of The Focused Crawler Based On Shark-Search And OTIE Adaptive Algorithm
7	Research And Application Of Focusing Crawler Which Faced Vertical Search Engine
8	The Research And Implementation Of Topical Web Crawler Based On Improved Shark-Search Algorithm
9	Research On APK Crawler With Automatic Pagination Detection And Search Results Extraction
10	Research On Search Strategy And Key Techniques Of Focused Crawler