
Design And Implementation Of Multi-Source Download System Based On Network Crawling Technology

Posted on: 2012-03-27
Degree: Master
Type: Thesis
Country: China
Candidate: R Li
GTID: 2178330335460735
Subject: Computer Science and Technology
Abstract/Summary:
With the development of the Internet and the improvement of living standards, more and more people download resources from the Internet. Today, however, downloading a resource requires a complicated procedure and is quite inefficient. In addition, many download tools are full of advertisements, and improper operation may crash the user's computer.

To address these problems, this paper designs and implements easy-to-use software that integrates search, storage and display, and download. It not only provides a number of downloadable URLs for a resource but also improves download speed.

This paper introduces web crawler technology and the Hypertext Transfer Protocol (HTTP), and extends the traditional web crawler. A traditional web crawler can capture only static URLs; it cannot capture the dynamic URLs hidden in the deep web. This reduces download efficiency and does not provide enough URLs for multi-source download.

By executing JavaScript, the crawler in this paper can resolve dynamic URLs in the deep web. Rhino is used to execute the JavaScript, but it has two drawbacks: first, Rhino cannot simulate the browser's built-in objects; second, Rhino cannot resolve properties and methods that are added to these built-in objects dynamically. Adding support for DOM operations lets Rhino simulate the browser's built-in objects, and modifying the methods of the built-in objects lets Rhino resolve the dynamically added properties and methods. Experiments show that with these improvements Rhino can resolve more pages.

The main task of the storage and display module is to store and display the downloadable URLs. URLs are divided into groups according to fixed rules: only URLs with the same file type and the same file size are placed in one group. The module uses a timer as its refresh mechanism.

The download module uses multi-source download technology.
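The grouping rule of the storage and display module (same file type and same file size) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the `Candidate` record, the `file_type` and `group_candidates` helpers, and the choice to derive the file type from the URL's extension are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from pathlib import PurePosixPath
from urllib.parse import urlparse


@dataclass(frozen=True)
class Candidate:
    """A crawled download URL plus its reported size (hypothetical record)."""
    url: str
    size: int  # bytes, e.g. as reported by a Content-Length header


def file_type(url: str) -> str:
    """Return the lowercased extension of the URL path, or '' if there is none.

    Assumption: the file type is inferred from the extension only.
    """
    return PurePosixPath(urlparse(url).path).suffix.lower()


def group_candidates(candidates):
    """Group URLs so that each group shares one file type and one file size."""
    groups = defaultdict(list)
    for c in candidates:
        groups[(file_type(c.url), c.size)].append(c.url)
    return dict(groups)
```

Each key of the returned dictionary is a `(file type, size)` pair, so two URLs land in the same group only when both values match. A match on type and size is only a first filter and does not prove that the URLs serve the same file.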
Users get the grouped URLs from the storage and display module. After the user clicks on the download area, the software makes a precise check of the selected URLs. Only URLs that truly point to the same file are marked as source addresses for multi-source download. To decide this, the software downloads a fragment at the same position from each URL and calculates the MD5 value of each fragment. The software then downloads only from the URLs whose MD5 values are the same.
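The fragment check described above can be sketched in Python as follows. This is a hedged illustration, not the thesis's code: the function names, the use of an HTTP `Range` request to fetch the fragment, and the rule of keeping the largest set of URLs that agree are assumptions, since the abstract does not specify them.

```python
import hashlib
import urllib.request
from collections import defaultdict


def fetch_range(url, start, length):
    """Fetch `length` bytes starting at byte offset `start` via an HTTP Range request."""
    headers = {"Range": f"bytes={start}-{start + length - 1}"}
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=10) as resp:
        if resp.status != 206:
            # The server ignored the Range header, so the offset cannot be trusted.
            raise ValueError("server did not return partial content")
        return resp.read(length)


def fragment_digest(url, start, length, fetch=fetch_range):
    """Return the MD5 hex digest of the fragment at the given position."""
    return hashlib.md5(fetch(url, start, length)).hexdigest()


def select_sources(urls, start, length, fetch=fetch_range):
    """Return the URLs whose fragments at the same position share one MD5 value.

    Assumption: if several digests occur, the largest agreeing set is kept.
    """
    by_digest = defaultdict(list)
    for url in urls:
        try:
            by_digest[fragment_digest(url, start, length, fetch)].append(url)
        except (OSError, ValueError):
            continue  # skip sources that are unreachable or lack Range support
    return max(by_digest.values(), key=len, default=[])
```

The `fetch` parameter allows the network call to be replaced, for example by a stub in tests. Comparing a single fragment can still miss differences elsewhere in the file, so a real tool might sample several positions.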
Keywords/Search Tags: network crawling, HTTP, multi-source downloading, Rhino