
Design And Realization Of Web Crawler With Hyperlink-based Algorithm

Posted on: 2008-11-01
Degree: Master
Type: Thesis
Country: China
Candidate: J T Zhu
Full Text: PDF
GTID: 2178360212495794
Subject: Computer application technology
Abstract/Summary:
In recent years, with the popularization of personal computers and networks, the Internet has influenced people's lifestyle more and more. The way people access information has evolved from manual lookup, to the computer, and now to the Internet. The greatest advantage of the Internet is the sharing of massive amounts of information: the amount of information doubles roughly every 8 months, and Web pages are now counted in the billions. The search engine, as a main application of modern information-access technology, is necessary for finding the information we need on the vast Internet.

The search engine is the second core Internet technology after the portal. With the popularization of the Internet and the explosive growth of information on it, people pay more and more attention to search engines. A search engine is an information retrieval tool used to search such Web resources as websites, Web pages, newsgroups, images and sound. It is in fact a special WWW server, a website that provides information retrieval services on the Internet. Different from other websites, a search engine collects WWW information manually or automatically, classifies it according to themes, builds an index, puts the indexed content into an index database, and returns the matching resources to users according to the query grammar. Facing the profuse resources of the Internet, the search engine provides an entrance for every user surfing online; all users can get where they want through a search engine.

The crawler-based search engine is the most common among all kinds of search engines. All its work is done automatically by programs, with little manual intervention. It searches Web pages on the Internet with a crawler and automatically puts the pages it finds into a local index database, from which users can rapidly get up-to-date information. If a Web page of a website is updated, the search engine finds the change quickly, updates the local index database, and reflects it in the results returned to users' queries. The advantages of the crawler-based search engine are high automation, low maintenance costs, and more emphasis on technical innovation and improvement, so it is more suitable for research than other kinds of search engines and has become a research focus.

The crawler is undoubtedly the most important factor in the efficiency of a crawler-based search engine; its performance directly influences the overall performance and processing speed. Four questions should be considered when designing an excellent crawler:
1) Which Web pages should be downloaded?
2) How should Web pages be updated?
3) How can the burden on the websites be reduced?
4) How can the crawling process be parallelized?
Solutions to these problems are proposed in Section 2.4.1 of this thesis.

Choosing a good crawling strategy is undoubtedly very important for solving the problems above. There are two main crawling strategies: depth-first and breadth-first. Research shows that breadth-first is better, so most Web crawlers use breadth-first or an improved variant of it. The breadth-first strategy is introduced in Section 3.1, and an improved breadth-first strategy, the hyperlink-based crawling algorithm, is proposed in Section 3.2. Its main idea is, on the basis of the breadth-first strategy, to calculate for every URL in the crawling queue the number of already-crawled Web pages that contain that URL; the crawler then chooses the URL with the largest count, as sketched below.
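The selection step of the hyperlink-based algorithm can be illustrated with a short Java fragment. This is only a minimal sketch of the idea described above; the class and method names (HyperlinkBasedSelector, nextUrl) are invented for illustration and do not come from the thesis code, and ties are simply broken by queue order.

    import java.util.*;

    // Minimal illustrative sketch (not the thesis's actual code) of the
    // hyperlink-based selection step: count, for every candidate URL in the
    // crawling queue, how many already-crawled pages contain it, and pick
    // the candidate with the largest count.
    public class HyperlinkBasedSelector {

        /**
         * @param frontier     URLs waiting to be crawled, in breadth-first order
         * @param crawledLinks the outgoing-link set of every page already crawled
         * @return the URL contained in the largest number of crawled pages
         */
        public static String nextUrl(List<String> frontier,
                                     Collection<Set<String>> crawledLinks) {
            String best = null;
            int bestPoints = -1;
            for (String candidate : frontier) {
                int points = 0;
                for (Set<String> links : crawledLinks) {
                    if (links.contains(candidate)) {
                        points++;              // one more crawled page refers to this URL
                    }
                }
                if (points > bestPoints) {     // keep the URL with the largest count
                    bestPoints = points;
                    best = candidate;
                }
            }
            return best;
        }
    }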
An improved hyperlink-based crawling algorithm is proposed in Section 3.3. Its main idea is that internal links are more important than external links. According to this idea, the URLs are first classified into two queues: an internal-link queue and an external-link queue. The crawler chooses URLs from the internal queue first, and within the same queue it chooses the URL with the largest count.

The design and realization of a Web crawler with the hyperlink-based algorithm is the main work of this thesis. The following functions should be achieved when designing the crawler:
1) Get the Web page at the designated URL;
2) Analyze the downloaded Web page and put the new URLs into the queue;
3) Reorder the URLs in the queue according to the hyperlink-based algorithm;
4) Check whether the ending conditions are satisfied; if not, get a new URL from the queue and restart from step 1.

The following modules are designed to realize the functions above:
1) Page-downloading module: the content of a page must be fetched before it can be analyzed and operated on. The HTTP class and its derived class HTTPSocket are designed and implemented; they communicate with the server through a socket and download the designated Web page.
2) Page-analyzing module: the crawling system does not need a complicated parser, because simple text processing is sufficient for finding the hyperlinks and their context. In this module, the Web page downloaded by the page-downloading module is analyzed with the HTTPParser class: the sets of images, links and forms are extracted and stored in an HTTPPage object, each link in HTTPPage is turned into an HTTPTag, and its attribute set is stored in the form of name/value pairs.
3) Queue-operating module: the queue is simulated by operating on a database. Four fields are designed for every Web page in the database: URL, Status, Points and Type. The URL field stores the URL of the Web page. The Status field stores the status of the Web page: "running" for pages being crawled, "waiting" for pages yet to be crawled, and "error" for pages that failed. The Points field stores the number of Web pages that contain the URL. The Type field stores whether the Web page is internal or external.

The queue-operating module works as follows (a rough sketch of this logic is given at the end of this abstract). First, the seed URL is put into the database with Status set to "waiting", Type to internal and Points to 1. Second, among the waiting internal pages, the one with the largest Points value is set to "running", downloaded, and analyzed for its set of hyperlinks. Third, every extracted hyperlink is looked up in the database: if it already exists, its Points value is increased by one; otherwise it is stored with Status "waiting" and Points 1. These steps are repeated until the ending conditions are satisfied.

The crawler is implemented in the Java programming language with multithreading. The functions above are basically implemented; owing to the limited time, there is still some redundancy and insufficiency in the code.

Research on search engines in China has just started. The issue is discussed only roughly in this thesis, in the hope of offering a clue for future studies.
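As a companion to the queue-operating module described above, the following Java fragment sketches the database-simulated queue in memory. It is only an illustrative sketch under stated assumptions: an in-memory map stands in for the database table, and the class and member names (UrlQueue, Record, addSeed, addHyperlink, next) are invented for illustration rather than taken from the thesis code.

    import java.util.*;

    // Illustrative sketch (not the thesis's actual code) of the queue-operating
    // module: an in-memory map stands in for the database table with the four
    // fields URL, Status, Points and Type.
    public class UrlQueue {

        enum Status { WAITING, RUNNING, ERROR }
        enum Type { INTERNAL, EXTERNAL }

        static class Record {
            Status status = Status.WAITING;
            Type type;
            int points = 1;                        // number of crawled pages containing the URL
            Record(Type type) { this.type = type; }
        }

        private final Map<String, Record> table = new LinkedHashMap<>();

        // Seed URL: Status "waiting", Type internal, Points 1.
        public void addSeed(String url) {
            table.put(url, new Record(Type.INTERNAL));
        }

        // A hyperlink found on a crawled page: add one point if it already
        // exists, otherwise store it with Status "waiting" and Points 1.
        public void addHyperlink(String url, Type type) {
            Record r = table.get(url);
            if (r != null) {
                r.points++;
            } else {
                table.put(url, new Record(type));
            }
        }

        // Pick the next URL to crawl: among waiting entries, prefer internal
        // links, and within the same type take the largest Points; mark it "running".
        public String next() {
            String bestUrl = null;
            Record best = null;
            for (Map.Entry<String, Record> e : table.entrySet()) {
                Record r = e.getValue();
                if (r.status != Status.WAITING) continue;
                if (best == null
                        || (r.type == Type.INTERNAL && best.type == Type.EXTERNAL)
                        || (r.type == best.type && r.points > best.points)) {
                    bestUrl = e.getKey();
                    best = r;
                }
            }
            if (best != null) best.status = Status.RUNNING;
            return bestUrl;
        }
    }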
Keywords/Search Tags: Hyperlink-based