Internet Crawler Research And Implementation

Posted on: 2011-03-13  Degree: Master  Type: Thesis
Country: China  Candidate: D Y Dong  Full Text: PDF
GTID: 2178360305454652  Subject: Software engineering
Abstract/Summary:
With the continuing development of the information society, the efficiency of searching for information on the Internet has become very important. Since 1993, the Web has grown from a few thousand pages to more than 2 billion today. In this era of information explosion, Internet resources are all-encompassing, and so far there is no precise definition of what counts as a resource on the Internet. In short, people urgently need a tool that lets them find the information they want on the Internet systematically and quickly. Driven by this demand, search engines, as tools for quickly retrieving data from the Internet, have developed rapidly.

The main purpose of this work is to study search engine technology, chiefly crawler technology: to analyze the operating principles and workflow of Google, currently the largest Internet search engine; to implement a crawler system and use it to analyze the many problems encountered when building a crawler; to analyze in depth the two common crawler search strategies; and finally to explain the components of the crawler in detail.

The first part of this thesis covers Internet resources, their development, and their current state, and sums up the importance of search engines in people's lives. The second part first introduces the definition, history, and classification of search engines, then briefly describes how search engines work and their working steps, and finally leads to the well-known Internet search engine Google, with a brief description of its system, structure, and working principles. The third chapter describes and analyzes two crawler search strategies: depth-first search and breadth-first search.
The fourth chapter, the focus of this thesis, describes the detailed design of the crawler.

An Internet search engine is a tool that people use to find the information and data they want on the Internet. It has become indispensable in daily work, study, life, and entertainment. It makes information on the Internet easy to find, greatly improves people's ability to search the Internet, and lowers the cost of retrieving information from it. In other words, the search engine is a successful integration of modern computer technology, Internet technology, and traditional indexing theory, and an inevitable product of the Internet's spread into everyday life. A search engine relies mainly on its front-end crawler to gather information from pages across the Internet, and stores an index of that information in a database.

Search engines can generally be divided into three categories: full-text search engines, directory search engines, and meta search engines. A full-text search engine crawls the Internet to collect web page information (mainly text), saves it to a database, and builds an index. A directory search engine collects and collates site information manually, builds its own database, and presents site links organized by directory category. A meta search engine, on receiving a user's query, submits it to several other search engines at the same time and returns the combined results to the user.

The work of a search engine can be divided into three steps: crawling pages, processing pages, and providing the search service. Page crawling is carried out by the crawler implemented in this thesis, which is responsible for fetching web pages from the Internet. Page processing is also completed by the crawler and mainly consists of processing and classifying the crawled pages.
The search service works as follows: after a user submits search keywords, the search engine looks in its index database for pages that match them. In addition to the page URL and page title, it shows a short summary drawn from the page content to help the user judge the result.

Google's system is not particularly complex. It has two main features:
1. It uses the link structure of web pages to calculate a rank value for each page, known as PageRank, which serves as a page-level quality measure.
2. It uses hyperlinks to improve its search results.

Like a general search engine, Google is functionally divided into three parts: page crawling, indexing into the database, and user queries. Page crawling is the crawler's job; this part is implemented by the URL server, crawler, storage, parser, and URL parser components. Indexing is responsible for analyzing the content fetched by the crawler, generating index entries for the documents, and storing them in a database; this part is implemented by the indexer and the classifier. The user query function analyzes the query expression entered by the user, looks for matching content in the database that has been built, and returns the results to the user in a defined order.

One of Google's most important features is its page rank calculation: Google holds that the importance of a web page is determined by the number and the importance of the pages that link to it.
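The idea that a page's importance depends on the number and importance of the pages linking to it can be sketched with the commonly published iterative form of PageRank. This is a minimal illustration only: the abstract gives no formula, so the damping factor of 0.85, the iteration count, the function name `pagerank`, and the four-page toy graph are all assumptions, not the thesis's or Google's actual implementation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate a rank value for every page.

    links maps each page to the list of pages it links to.
    damping and iterations are assumed values for this sketch.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with rank spread evenly over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: "home" is linked from every other page,
# while nothing links to "orphan".
toy_links = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home", "about"],
    "orphan": ["home"],
}
ranks = pagerank(toy_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this toy graph the page with the most incoming links ends up with the highest rank and the page with no incoming links with the lowest, which matches the intuition stated above.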
Google's search process is as follows: it first crawls the Internet, downloads web pages, and builds its own local index database. It then analyzes the user's query, searches the database for the list of documents that match it, calculates the page rank value of each document, and returns the documents to the user ordered by page rank from high to low.

Crawlers generally use one of two search strategies: depth-first search or breadth-first search. A recent study showed that the breadth-first strategy performs better than the depth-first strategy.

The crawler's design is divided into a page download module and a page analysis module. The crawler is implemented non-recursively, and the search strategy is determined mainly by the queue of pages waiting to be downloaded. Page download is handled mainly by the HTTP class; requests are set up by the Set_Request function defined in this class, which is mainly responsible for requesting connections to sites. The HTTP class mainly handles web pages and HTTP headers, and is an important site-related component. Its main concerns are how to connect to a site and how to read the headers and content of a page. The page analysis module is implemented by the Page class, which mainly parses cookies and HTML content and extracts images, links, forms, and other content. The Crawler class is the core class of the crawler: it is the brain of the whole system, organizes the work of the other classes, and defines the crawling process. After the crawler has downloaded and parsed a page, the results are stored in a database; this system uses MD5 hashing to protect the stored data.
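The non-recursive design, in which the download queue decides the search strategy, can be sketched as follows. This is an illustration under stated assumptions, not the thesis's code: `TOY_SITE`, `crawl`, and `fingerprint` are hypothetical names, and an in-memory link graph stands in for real HTTP downloads. The abstract does not say how MD5 is applied, so `fingerprint` shows only one common use, a digest of stored page content.

```python
from collections import deque
import hashlib

# Hypothetical in-memory site: each path maps to the links found on that page.
TOY_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/a2"],
    "/b": ["/b1"],
    "/a1": [],
    "/a2": ["/"],
    "/b1": [],
}


def crawl(seed, get_links, strategy="breadth"):
    """Visit pages without recursion; the frontier decides the order.

    Taking URLs from the front of the frontier gives breadth-first order,
    taking them from the back gives depth-first order.
    """
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        url = frontier.popleft() if strategy == "breadth" else frontier.pop()
        order.append(url)
        for link in get_links(url):
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                frontier.append(link)
    return order


def fingerprint(text):
    """Return the MD5 digest of a piece of text as a hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def toy_links(url):
    return TOY_SITE.get(url, [])


bfs_order = crawl("/", toy_links, "breadth")
dfs_order = crawl("/", toy_links, "depth")
print(bfs_order)  # ['/', '/a', '/b', '/a1', '/a2', '/b1']
print(dfs_order)  # ['/', '/b', '/b1', '/a', '/a2', '/a1']
```

Only the line that takes the next URL from the frontier differs between the two strategies, which is why a queue-based design can switch between them without changing the rest of the crawler.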
The crawler system works with multiple threads, which improves crawling speed and efficiency and lets it fetch pages more effectively.

The crawling system implemented in this thesis runs on Linux and was written in VC++. It adopts the breadth-first search strategy with www.jlu.edu.cn as the seed node and uses a multi-threaded working mechanism. It works essentially as intended: it crawled the Jilin University campus network and fetched almost all of the pages on the university's sites, realizing the basic design and construction of a crawler system.
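A multi-threaded crawl over a shared frontier can be sketched as below. This is a minimal illustration, not the thesis's VC++ implementation: the names `crawl_threaded`, `fetch_links`, and `TOY_SITE`, the worker count of 4, and the in-memory link graph that replaces real page downloads are all assumptions. With several workers, pages are taken from the queue roughly in breadth-first order, but the exact order is not guaranteed.

```python
import queue
import threading

# Hypothetical in-memory site standing in for real page downloads.
TOY_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/a2"],
    "/b": ["/b1"],
    "/a1": [],
    "/a2": ["/"],
    "/b1": [],
}


def fetch_links(url):
    """Pretend to download a page and return the links found on it."""
    return TOY_SITE.get(url, [])


def crawl_threaded(seed, get_links, num_workers=4):
    """Crawl from seed using worker threads that share one FIFO frontier."""
    frontier = queue.Queue()
    seen = {seed}
    crawled = []
    lock = threading.Lock()  # guards seen and crawled

    def worker():
        while True:
            url = frontier.get()
            if url is None:  # sentinel: no more work
                frontier.task_done()
                return
            try:
                links = get_links(url)
                with lock:
                    crawled.append(url)
                    for link in links:
                        if link not in seen:
                            seen.add(link)
                            frontier.put(link)
            finally:
                # New links are queued before this page is marked done,
                # so join() cannot return while work remains.
                frontier.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()

    frontier.put(seed)
    frontier.join()  # wait until every queued page has been processed
    for _ in threads:
        frontier.put(None)  # tell each worker to stop
    for thread in threads:
        thread.join()
    return crawled


pages = crawl_threaded("/", fetch_links)
print(sorted(pages))
```

The lock around the shared `seen` set is what keeps two workers from queuing the same page twice; in a real crawler the download itself would run outside the lock so that slow network requests overlap.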
Keywords/Search Tags:Internet resource, Search engine, Crawler, Crawler search strategy, Multithreading