Research And Implementation On Web Page Crawling And Analyzing Techniques For AJAX Script Network

Posted on:2013-12-25

Degree:Master

Type:Thesis

Country:China

Candidate:Y Zhang

Full Text:PDF

GTID:2298330467474707

Subject:Computer application technology

Abstract/Summary:

With the advent of the Web 2.0 era, dynamic websites that use AJAX asynchronous transfer have gradually become a mainstream form of presentation on the Internet. Although the technology enables asynchronous communication with the server and improves the user experience, it breaks the website architecture based on static pages. As a result, traditional web crawlers cannot reach all the pages of a dynamic site, and a large amount of data hidden on the server cannot be obtained or further analyzed and used, which wastes resources.

Against this background, this thesis starts by analyzing the working principle and main features of AJAX technology, as well as the core techniques of web page analysis, and then builds a data access model for dynamic web pages. It further proposes a web page crawling and analysis method based on analyzing dynamic scripts. By analyzing the structure and content of dynamic pages, the method identifies third-party frameworks, classifies similar sites, determines the set of tags that carry page events, and fills in page forms automatically. It then uses Watij to simulate user actions and embeds Selenium as the script parser to execute the corresponding scripts in sequence. The method adopts a breadth-first crawling strategy and combines DOM tree similarity judgment with listening on the XMLHttpRequest object to identify new page states; it controls state transitions based on the state flow graph to acquire dynamic page data. With the addition of a path repository and a local cache, the method effectively reduces the number of page reloads and better handles data updates on the server.

Following the working principle and processing flow of the method, the thesis designs and implements a prototype system for dynamic web data acquisition. Experiments show that the proposed algorithm can analyze dynamic web pages effectively and extract data from them. After optimization, the algorithm greatly reduces execution time and significantly outperforms other methods of the same type without affecting the accuracy of the data obtained. AjaxCrawler, the prototype system for dynamic web data acquisition implemented in this thesis, can be applied to various large websites on the Internet and satisfies users' basic requirement of obtaining data from dynamic web pages.
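
The breadth-first exploration of page states described in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: the thesis drives a real browser through Watij and Selenium and also listens on the XMLHttpRequest object, whereas the sketch below replaces the browser with a hypothetical `fire_event` callback and approximates DOM tree similarity by comparing tag sequences with Python's `difflib`. The function names, the 0.9 threshold, and the toy pages are all assumptions made for illustration.

```python
from collections import deque
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects opening tag names as a crude fingerprint of the DOM."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def dom_similarity(html_a, html_b):
    """Ratio in [0, 1] comparing the tag sequences of two documents.

    Assumption: a stand-in for the thesis's DOM tree similarity judgment.
    """
    return SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b)).ratio()


def crawl_states(initial_dom, events, fire_event, threshold=0.9):
    """Breadth-first exploration of page states.

    fire_event(dom, event) is a hypothetical callback standing in for the
    browser driver; it returns the DOM after the event, or None if the
    event does nothing in that state.
    Returns (states, edges), where edges holds (source, event, target)
    index triples that together form the state flow graph.
    """
    states = [initial_dom]
    edges = []
    queue = deque([0])
    while queue:
        src = queue.popleft()
        for event in events:
            new_dom = fire_event(states[src], event)
            if new_dom is None:
                continue
            # Reuse a known state if the new DOM is similar enough to it.
            target = next(
                (i for i, known in enumerate(states)
                 if dom_similarity(known, new_dom) >= threshold),
                None,
            )
            if target is None:
                states.append(new_dom)
                target = len(states) - 1
                queue.append(target)
            edges.append((src, event, target))
    return states, edges


# Toy "site" (hypothetical): three page states reached by clicking links.
HOME = "<html><body><a id='news'>News</a><a id='about'>About</a></body></html>"
NEWS = ("<html><body><ul><li>a</li><li>b</li><li>c</li></ul>"
        "<a id='home'>Home</a></body></html>")
ABOUT = ("<html><body><div><p>About us</p><p>Contact</p><span>x</span></div>"
         "<a id='home'>Home</a></body></html>")

TRANSITIONS = {
    (HOME, "news"): NEWS,
    (HOME, "about"): ABOUT,
    (NEWS, "home"): HOME,
    (ABOUT, "home"): HOME,
}


def fake_fire_event(dom, event):
    """Looks up the next DOM in the toy transition table."""
    return TRANSITIONS.get((dom, event))


states, edges = crawl_states(HOME, ["news", "about", "home"], fake_fire_event)
print(len(states), "states,", len(edges), "transitions")
```

Returning to the home page from either sub-page is recognized as an existing state rather than a new one, which is what keeps the exploration from looping. A real crawler would additionally need the path repository and local cache mentioned above to avoid reloading pages when replaying event sequences.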

Keywords/Search Tags:

AJAX, Web crawler, dynamic scripts, webpage structure parsing, webpage content acquisition

Related items

1	Research On Key Technology Of Collaborative Design Of Large Antenna Structure
2	Research And Implement Of Distributed Crawler System Supporting AJAX
3	Design And Implementation Of Webpage Tampering Monitoring System
4	Research And Implementation On Theme Web Crawler Of Supporting Ajax
5	Design And Implementation Of A Web Crawler System Supported AJAX
6	Research And Implementation Of Web Crawler For URL-Specified Crawling Of Ajax-Based Web Applications
7	Design And Implementation Of A Web Crawler Friendly To Ajax
8	A Web Crawler Supporting AJAX
9	Design And Implementation Of The Dynamic Crawler System Based On State Transition
10	Design And Implementation Of An Ajax Supported Deep Web Crawler System