Research And Implementation Of Web Crawler For URL-Specified Crawling Of Ajax-Based Web Applications

Posted on: 2014-02-13
Degree: Master
Type: Thesis
Country: China
Candidate: F F Liu
Full Text: PDF
GTID: 2248330398472264
Subject: Computer Science and Technology

Abstract/Summary:
Along with the emergence of Web 2.0, a new kind of web application called Rich Internet Applications (RIAs) has emerged, offering high interactivity and a rich user experience; blogs and Twitter are examples. A technology that has recently gained a prominent position under the umbrella of Web 2.0 is AJAX, which has become widespread in web application development. AJAX combines JavaScript and dynamic DOM manipulation with asynchronous server communication to achieve a high level of user interactivity, speed and usability. At the same time, it changes the traditional model of web applications and shatters the metaphor of web 'pages' with unique URLs, on which traditional web crawlers are based. This change seriously impairs the ability of existing crawlers to crawl these applications. Current search engines either ignore AJAX applications or produce false negatives.

This paper designs and implements a novel technique for crawling AJAX-based applications through a URL-specified web crawler, aiming to address the problem of crawling AJAX content generated by client-side JavaScript. First, we discuss traditional web crawlers and analyze the key challenges involved in crawling AJAX-based applications. For illustration, a real-world public AJAX application is used as an example to explain the difficulties and scenarios of crawling AJAX. Second, we present the terminology used in this paper, model AJAX web sites, pages and events, and describe the overall workflow and architecture of our system. Finally, we present a detailed discussion of our crawling algorithm and implement a URL-specified web crawler using the concepts and algorithms discussed in this paper.

We separate traditional crawling into two independent modules: a hyperlink extractor and a web crawler. The hyperlink extractor extracts hyperlinks from web sites and stores them in the URL repository. Our approach is based on a WebKit browser, in which we open the AJAX application, exercise client-side JavaScript code, identify clickable elements and fire events on those elements. The web crawler only downloads pages whose URLs are in the URL repository, which means the scope of crawling is determined entirely by the URL repository; that is why we call it "URL-specified" crawling.

We have performed a number of empirical studies to systematically analyze the overall performance of our approach. We evaluate precision (percentage of dynamic content recovered), accuracy (percentage of correct states) and performance. The results show that the precision is 100%. In the case of non-flip crawling, the average crawling rate is 52.03 kb/s. These results indicate that our system is effective and can crawl AJAX applications accurately. The system is highly flexible and scalable, and we believe the crawling technique can be used, for instance, in building vertical search engines and in open source intelligence gathering.
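The two-module split described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis implementation: the names `UrlRepository`, `extract_hyperlinks`, `crawl`, `discover` and `fetch` are all hypothetical, and the browser-driven step (opening the application in WebKit and firing events on clickable elements) is replaced by a caller-supplied `discover` function. The sketch only shows the property that the crawl scope is fixed by the contents of the URL repository.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Set


class UrlRepository:
    """Hypothetical in-memory URL repository shared by both modules."""

    def __init__(self) -> None:
        self._urls: List[str] = []   # keeps insertion order
        self._seen: Set[str] = set() # fast duplicate check

    def add(self, url: str) -> None:
        # Store each URL once, in the order it was first discovered.
        if url not in self._seen:
            self._seen.add(url)
            self._urls.append(url)

    def __contains__(self, url: str) -> bool:
        return url in self._seen

    def __iter__(self) -> Iterator[str]:
        # Iterate over a copy so callers cannot disturb the repository.
        return iter(list(self._urls))


def extract_hyperlinks(site: str,
                       discover: Callable[[str], Iterable[str]],
                       repo: UrlRepository) -> None:
    """Module 1: collect hyperlinks for a site and store them.

    `discover` is an assumption standing in for the browser-based step
    (load the page, run JavaScript, fire events, read resulting links).
    """
    for url in discover(site):
        repo.add(url)


def crawl(repo: UrlRepository, fetch: Callable[[str], str]) -> Dict[str, str]:
    """Module 2: download only the URLs present in the repository."""
    return {url: fetch(url) for url in repo}


if __name__ == "__main__":
    # Stubbed discovery and download functions; no network access.
    links = {
        "http://example.com/app": [
            "http://example.com/app#!item=1",
            "http://example.com/app#!item=2",
            "http://example.com/app#!item=1",  # duplicate, stored once
        ]
    }
    repo = UrlRepository()
    extract_hyperlinks("http://example.com/app", lambda s: links[s], repo)
    pages = crawl(repo, lambda u: "<html>content of " + u + "</html>")
    for url in pages:
        print(url)
```

Because the crawler iterates over the repository and nothing else, restricting or widening the crawl is a matter of changing what the extractor stores, which is the sense in which the crawl is "URL-specified".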
Keywords/Search Tags: AJAX, JavaScript, Web crawler, Data collection, URL-specified