Design and implementation of an intelligent Web crawler for corporate data scraping

Posted on: 2008-09-09
Degree: M.S.
Type: Thesis
University: University of Nebraska at Omaha
Candidate: Qin, Xinfeng
Full Text: PDF
GTID: 2448390005453115
Subject: Information Science
Abstract/Summary:
The Internet has been a remarkable repository of public corporate data since its formative years. The current giant of Internet search, Google Inc., does not provide a satisfactory way to retrieve such data. For example, a user who types "Juergen Dormann IBM" into the Google search form, expecting to locate the biography of this IBM executive, will be disappointed because none of the result pages relates to the intent of the search [22]. Although the popularity of a Web resource is central to Google's PageRank algorithm, popularity does not always equate to relevance.

Therefore, the primary goal of this thesis is to design and implement an integrated tool that can intelligently acquire (i.e., search and extract) data from the World Wide Web (WWW) and store it in a format that interested consumers can easily retrieve. Since none of the current Web crawlers is well suited to this research objective, we have implemented a novel Web crawler based on the semantic relevance of keywords encountered in biographies of corporate personnel, as described by Dasgupta (2005) [10]. For now, the crawler is named BioBot, reflecting its focus on the discovery and retrieval of executive biographies. It performs Website scraping tasks more precisely by combining three techniques: the Websites' internal search engines for semantic discovery (instead of blind scraping), the Term Frequency and Inverse Document Frequency (TF*IDF) algorithm for semantic evaluation, and a self-learning keyword set based on an Adaptive Bayesian Classifier for semantic category definition. In addition, using the Xerces 2.0 XML parser [35] with a defined DTD, only the plain text is saved into XML files, with uniform markup and tags for efficient search.

A Web interface is also provided for quick searching of the XML-formatted corporate data that BioBot scrapes from Websites. Simulated experiments show that BioBot can scrape higher-quality Web pages by using trusted Websites' internal search engines.
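
The abstract names TF*IDF as the semantic evaluation step but does not state which weighting variant BioBot uses. The following is a minimal sketch, assuming a length-normalized term frequency and a plain logarithmic inverse document frequency; the function names `tokenize`, `tf_idf_scores`, and `relevance` are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter

# Characters stripped from the edges of each token (an assumption of this sketch).
_PUNCT = ".,;:()\"'"


def tokenize(text):
    """Hypothetical tokenizer: lowercase, split on whitespace, trim punctuation."""
    tokens = (word.strip(_PUNCT).lower() for word in text.split())
    return [t for t in tokens if t]


def tf_idf_scores(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: the number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return scores


def relevance(doc_scores, keywords):
    """Sum the TF*IDF weights of the given keywords within one document."""
    return sum(doc_scores.get(k.lower(), 0.0) for k in keywords)


if __name__ == "__main__":
    pages = [
        "The chairman joined the board in 1999",
        "The product ships with a warranty",
    ]
    biography_keywords = ["chairman", "board"]  # illustrative keyword set
    for page, page_scores in zip(pages, tf_idf_scores(pages)):
        print(round(relevance(page_scores, biography_keywords), 4), page)
```

Under this weighting, a term that occurs in every page (such as "the" in the sample pages) receives a weight of zero, so only terms that distinguish a page contribute to its relevance score.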
Keywords/Search Tags: Data, Web, Corporate, Search, Crawler