Design And Implementation Of Web Crawler For Given Page

Posted on: 2013-06-13
Degree: Master
Type: Thesis
Country: China
Candidate: H Ma
Full Text: PDF
GTID: 2248330395459626
Subject: Software engineering
Abstract/Summary:
The rapid growth of data on the World Wide Web has made the web the largest global information base. Faced with such a semi-structured database of huge volume and varied structure, users find it very difficult to locate the information they really want within a short time. The result is a familiar problem: although a great deal of information is available, useful information remains scarce. In addition, the information needed by users in different fields varies. Personalized information collection techniques emerged to solve this problem, and a web crawler for specific web pages is one means of realizing them.

This paper analyzes in depth the drawbacks of massive network resources and of general-purpose search engine technology, and the inconvenience they cause users. The necessity and urgency of developing this system are stated on the basis of current international developments in the technique. The workflow is introduced through the system structure, and the two major modules, web accessing and content capturing, are then analyzed briefly. For the web accessing module, three general web search strategies and their advantages and disadvantages are explained; for the content capturing module, the relevant difficulties and key technical points are introduced. Following the principles that should be observed in system design, the system structure, consisting of an application layer, a business logic layer, and a data layer, is given in graphic form. The information collecting, accessing, and saving modules were all completed after detailed design. The key parts of the system are also presented: the crawling strategy, link analysis, and the algorithmic realization of information extraction. Finally, the database was designed. The crawler system can evaluate URLs, judge a URL's domain name, and restore incomplete URLs (restoring the network protocol, the host name, and the file name of the current page on the server).
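The URL handling described above (validating a URL, checking its domain, and restoring an incomplete URL from the current page) could be sketched as follows. This is only an illustration using Python's standard `urllib.parse`, not the thesis's own implementation; the function names `restore_url`, `same_domain`, and `is_crawlable` are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def restore_url(link, page_url):
    """Resolve a possibly incomplete link against the page it was found on.

    Missing parts (protocol, host name, directory of the current page)
    are filled in from page_url.
    """
    return urljoin(page_url, link)


def same_domain(url, seed_url):
    """Return True when both URLs have the same host name."""
    return urlparse(url).hostname == urlparse(seed_url).hostname


def is_crawlable(url):
    """Accept only HTTP(S) URLs that carry a host name (an assumed rule)."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.hostname)
```

For example, `restore_url("b.html", "http://example.com/a/index.html")` resolves the bare file name against the directory of the current page.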
The crawler can select the best-priority crawling strategy to collect information, analyze the collected web information (based on the HTML tree structure), capture and analyze related BBS comments, and save and provide them to the user. Finally, a friendly graphical user interface was designed to support human-machine interaction.

The correctness and effectiveness of this crawler prototype system have been verified by experiments and tests. Real examples show that the crawling results and final storage of the system can be reviewed effectively. The prototype system can efficiently obtain related information from specific pages and present it to users.
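A priority-driven (best-first) crawl order, as mentioned above, is commonly built on a priority queue. The sketch below assumes that each URL already has a numeric relevance score; how the thesis computes priority is not stated in this abstract, so the `Frontier` class and its scoring interface are hypothetical.

```python
import heapq


class Frontier:
    """Best-first URL frontier: the highest-scoring unseen URL is popped first."""

    def __init__(self):
        self._heap = []      # entries are (-score, insertion order, url)
        self._seen = set()   # URLs that were ever queued, to avoid revisits
        self._counter = 0    # tie-breaker that keeps insertion order stable

    def push(self, url, score):
        """Queue url with the given score; return False if already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1
        return True

    def pop(self):
        """Remove and return the URL with the highest score."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

A crawler loop would pop a URL, fetch and parse the page, score the extracted links, and push them back onto the frontier until it is empty or a page limit is reached.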
Keywords/Search Tags: Web Crawler, Crawling Strategy, Link Analysis, Information Extraction