Design And Implementation Of Web Crawler For Given Page

Posted on:2013-06-13

Degree:Master

Type:Thesis

Country:China

Candidate:H Ma

Full Text:PDF

GTID:2248330395459626

Subject:Software engineering

Abstract/Summary:

The rapid growth of data on the World Wide Web has made the web the largest global information base. Faced with such a semi-structured database, with its huge volume of data and varied structures, users find it very difficult to locate the information they really want within a short time. This gives rise to a paradox: although there is a great deal of information, useful information is still lacking. Moreover, the information needed by users in different fields varies. Personalized information collection techniques emerged to solve this problem, and a web crawler for specific web pages is one means of realizing them.

This thesis analyzes in depth the drawbacks of massive web resources and of general-purpose search engine techniques, and the inconvenience they cause users. The necessity and urgency of developing this system are stated in light of the current international development of the technique. The workflow is introduced by way of the system structure, and the two major modules, web access and content capture, are then analyzed briefly. For the web access module, three general web search strategies and their advantages and disadvantages are explained; for the content capture module, the relevant difficulties and key technical points are introduced. Following the principles of system design, the system structure, comprising an application layer, a business logic layer, and a data layer, is presented graphically. The information collection, access, and storage modules were all completed after the detailed design. The key parts of the system are also described: the crawling strategy, link analysis, and the algorithmic implementation of information extraction. Finally, the database was designed. The crawler system can evaluate URLs, determine a URL's domain name, and recover incomplete URLs (restoring the network protocol, the host name, and the file name of the current page on the server).
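The URL handling described above (evaluating a URL, checking its domain, and restoring an incomplete URL) can be illustrated with a minimal Python sketch. This is not the thesis's implementation; it relies on the standard `urllib.parse` module, and the function names `resolve_url`, `is_valid_http_url`, and `same_domain`, as well as the `example.com` addresses, are hypothetical names chosen for illustration.

```python
from urllib.parse import urljoin, urlsplit


def resolve_url(base_url: str, href: str) -> str:
    """Restore a possibly incomplete link to an absolute URL.

    Missing parts (protocol, host name, current page path) are taken
    from the page on which the link was found.
    """
    return urljoin(base_url, href)


def is_valid_http_url(url: str) -> bool:
    """Accept only absolute http/https URLs that name a host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def same_domain(url: str, base_url: str) -> bool:
    """Check whether two URLs point at the same host name."""
    return urlsplit(url).hostname == urlsplit(base_url).hostname


if __name__ == "__main__":
    base = "http://example.com/forum/thread.html"  # assumed current page
    for href in ("page2.html", "/index.html", "//example.com/a", "?page=2"):
        absolute = resolve_url(base, href)
        print(href, "->", absolute, same_domain(absolute, base))
```

Relative file names, host-relative paths, protocol-relative links, and query-only links are each completed from the current page's URL, which corresponds to the three restoration cases listed in the abstract.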
It can adopt a best-first (best-priority) crawling strategy to collect information, analyze the collected web content based on the HTML tree structure, capture and analyze the related BBS comments, and save them and provide them to the user. Finally, a friendly graphical user interface was designed to support human-machine interaction.

The correctness and effectiveness of this crawler prototype system have been verified by experiments and tests. Real examples illustrate the system's crawling results and their final storage. The prototype can efficiently obtain related information from specific pages and present it to users.
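A best-first crawling strategy is commonly built around a priority queue of unvisited URLs. The following Python sketch shows one way such a frontier might look, assuming each URL is given a numeric relevance score by some external function; the class name `BestFirstFrontier` and the scores used are hypothetical, and the abstract does not state how the thesis computes priorities.

```python
import heapq
import itertools


class BestFirstFrontier:
    """Queue of URLs to crawl, ordered by descending relevance score."""

    def __init__(self):
        self._heap = []                    # entries: (-score, order, url)
        self._seen = set()                 # URLs already queued or crawled
        self._order = itertools.count()    # tie-breaker: insertion order

    def add(self, url: str, score: float) -> bool:
        """Queue a URL once; return False if it was already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the best first.
        heapq.heappush(self._heap, (-score, next(self._order), url))
        return True

    def pop(self) -> str:
        """Return the highest-scoring URL that is still waiting."""
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    frontier = BestFirstFrontier()
    frontier.add("http://example.com/a", 0.2)   # scores are assumed values
    frontier.add("http://example.com/b", 0.9)
    frontier.add("http://example.com/c", 0.5)
    while len(frontier):
        print(frontier.pop())
```

Keeping the set of seen URLs inside the frontier prevents the same page from being queued twice, and the insertion counter makes URLs with equal scores come out in the order they were discovered.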

Keywords/Search Tags:

Web Crawler, Crawling Strategy, Link Analysis, Information Extraction

Related items

1	Spider Crawling On Mobile Search Research And Implementation Strategy
2	Research On Topic Crawler Of Combining Content With Link Structure
3	Research On Intelligent Web Advertising Crawler System
4	Design And Implemention Of Focused Crawler To Application Store
5	Research And Application Of Web Crawling Algorithm Based On Semantic Analysis
6	Web Information Crawling Applied In Fabric Textile Public Service Platform
7	Research And Implementation Of Injection Molding Information Based On Web Crawler
8	Vertical Search Engine For Crawling The Web Page Design And Implementation
9	Research On The Search Strategy Of Web Spider Based On Specific Topic
10	Research And Implementation Of Web Information Automatically Crawling In Vertical Search