
Research And Implementation On Web Page Crawling And Analyzing Techniques For AJAX Script Network

Posted on: 2013-12-25
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhang
Full Text: PDF
GTID: 2298330467474707
Subject: Computer application technology
Abstract/Summary:
With the advent of the Web 2.0 era, dynamic websites that use AJAX asynchronous transfer have gradually become a mainstream form of expression on the Internet. Although the technology enables asynchronous communication with the server and improves the user experience, it breaks the website architecture based on static pages. As a result, traditional web crawlers cannot reach all the pages of a dynamic site, and much of the data hidden on the server side cannot be obtained, analyzed, or used, which wastes resources.

Against this background, this thesis starts by analyzing the working principle and main features of AJAX technology, as well as the core techniques of web page analysis, and then builds a data access model for dynamic web pages. The thesis further proposes a web page crawling and analyzing method based on the analysis of dynamic scripts. By analyzing the structure and content of dynamic pages, the method identifies third-party frameworks, classifies similar sites, determines the set of tags that carry page events, and fills in page forms automatically. It then uses Watij to simulate user actions and embeds Selenium as the script parser to execute the corresponding scripts in sequence. The method adopts a breadth-first crawling strategy and combines DOM tree similarity judgment with listening on the XMLHttpRequest object to identify new page states. It controls state transitions based on a state flow graph to acquire dynamic page data. With the addition of a path repository and a local cache, the method effectively reduces the number of page reloads and handles server-side data updates better.

Following the working principle and processing flow of the method, the thesis designs and implements a prototype system for dynamic web data acquisition.

Experiments show that the proposed algorithm can analyze dynamic web pages effectively and extract data from them. After optimization, the algorithm greatly reduces execution time and performs significantly better than other methods of the same type, without affecting the accuracy of data acquisition. AjaxCrawler, the prototype system for dynamic web data acquisition implemented in this thesis, can be applied to various large websites on the real Internet and satisfies users' basic requirements for obtaining data from dynamic web pages.
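
The breadth-first exploration of page states described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's AjaxCrawler implementation: the browser automation (Watij, Selenium) and the XMLHttpRequest listening are left out, the DOM comparison is a crude tag-sequence ratio chosen here for brevity, and the names `events_of`, `fire`, `threshold`, and `TOY_SITE` are hypothetical stand-ins introduced only for this example.

```python
from collections import deque
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Collects opening tags in document order as a crude DOM fingerprint."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def dom_similarity(html_a, html_b):
    """Return a ratio in [0, 1] comparing the tag sequences of two pages."""
    sequences = []
    for html in (html_a, html_b):
        parser = TagSequence()
        parser.feed(html)
        sequences.append(parser.tags)
    return SequenceMatcher(None, sequences[0], sequences[1]).ratio()


def crawl_states(initial_html, events_of, fire, threshold=0.9):
    """Explore page states breadth-first and build a state flow graph.

    events_of(html) -> iterable of event identifiers available in that state
    fire(html, event) -> HTML of the state reached by triggering the event
    Returns (states, edges); each edge is (from_index, event, to_index).
    """
    states = [initial_html]
    edges = []
    queue = deque([0])
    while queue:
        current = queue.popleft()
        for event in events_of(states[current]):
            new_html = fire(states[current], event)
            # Treat the result as a known state if its DOM is similar enough.
            target = next(
                (i for i, known in enumerate(states)
                 if dom_similarity(known, new_html) >= threshold),
                None,
            )
            if target is None:
                states.append(new_html)
                target = len(states) - 1
                queue.append(target)
            edges.append((current, event, target))
    return states, edges


# Hypothetical two-state site: an event on one page leads to the other.
HOME = "<div><a>home</a></div>"
LIST = "<div><a>home</a><ul><li>x</li><li>y</li></ul></div>"
TOY_SITE = {HOME: {"click:more": LIST}, LIST: {"click:home": HOME}}

states, edges = crawl_states(
    HOME,
    events_of=lambda html: list(TOY_SITE.get(html, {})),
    fire=lambda html, event: TOY_SITE[html][event],
)
```

In this sketch the similarity check keeps the crawl from looping: returning to the home page yields a DOM that matches an already known state, so it is recorded as an edge in the graph and not queued again. A real crawler would replace `fire` with actual event execution in a browser and would need a more robust tree comparison.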
Keywords/Search Tags: AJAX, Web crawler, dynamic scripts, webpage structure parsing, webpage content acquisition