Research And Realization Of WAN's Distributional Homepage Information Acquisition System Based On

Posted on: 2009-02-18
Degree: Master
Type: Thesis
Country: China
Candidate: Z Liu
GTID: 2178360272476619
Subject: Computer software and theory
Abstract/Summary:
In recent years, with the rapid development of the Internet, the volume of information on the Web has grown enormously, and the demand for that information has grown with it. In such a vast sea of information, finding what one needs is like looking for a needle in a haystack. Search engine technology emerged to solve this problem, and it poses a great challenge to the web crawler, the component responsible for collecting information. Given this practical need, research on acquiring information through the Internet and computer technology is a focus now and for some time to come. A stand-alone web crawler often cannot meet the needs of all kinds of users, and LAN-based distributed crawling also has certain shortcomings at the present stage. This thesis therefore studies and implements a WAN-based distributed web page information acquisition system.

Web crawlers fall into two main types: single-machine and multi-machine. The crawling speed of a single-machine crawler is quite limited; at the present scale of the Internet, it can no longer collect the entire Web within a useful time frame. A multi-machine crawler system runs many tasks across many machines, which raises the overall efficiency of the system and scales well. It is the inevitable trend of development.
By region, multi-machine information acquisition divides into two kinds: LAN-based and WAN-based. In most cases, however, a LAN-based distributed crawler reaches the Internet through a limited number of public IP addresses. The crawler opens many simultaneous TCP connections to a Web server from the same public IP address and downloads a large amount of data in a very short time, which places a heavy burden on the server; out of security concerns, the server will then refuse the crawler's requests. Moreover, a LAN-based distributed crawler runs a large number of crawling threads in parallel, all of which must resolve the domain names in URLs through a few DNS servers, or even a single one; this places a heavy load on the DNS server and may leave it unable to respond. A LAN-based distributed crawler also generates very heavy outbound traffic from the LAN to the public network, so egress bandwidth becomes a bottleneck as well. In view of these problems, this thesis researches and implements a WAN-based distributed website page information acquisition system. The work targets the actual capture of page information published by websites on the Internet. It studies how to design and develop the system with a C/C++ compiler and network communication interfaces in the Windows environment, so that the system satisfies the capture requirements as far as possible.

The most important parts of the research on a network information acquisition system (web crawler) are the design of its architecture and the solution of its key technical problems.
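One common way to ease the DNS load described above is for crawl threads to share a resolution cache. This is a separate mitigation, not the WAN-based distribution that the thesis adopts. The following is a minimal Python sketch under stated assumptions: the `DnsCache` class, its `ttl_seconds` parameter, and the injectable `resolver` and `clock` arguments are hypothetical names chosen for illustration and do not come from the thesis system, which was developed in C/C++.

```python
import socket
import threading
import time
from typing import Callable, Dict, Tuple


class DnsCache:
    """Hypothetical shared DNS cache for crawl threads (illustrative only)."""

    def __init__(
        self,
        resolver: Callable[[str], str] = socket.gethostbyname,
        ttl_seconds: float = 300.0,  # assumed lifetime of a cached answer
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolver = resolver
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        # host -> (address, expiry time on the injected clock)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def resolve(self, host: str) -> str:
        """Return a cached address, querying the resolver only on a miss."""
        # The lock is held during the lookup, so concurrent threads never
        # send duplicate queries for the same host; lookups are serialized.
        with self._lock:
            now = self._clock()
            cached = self._entries.get(host)
            if cached is not None and cached[1] > now:
                return cached[0]
            address = self._resolver(host)
            self._entries[host] = (address, now + self._ttl)
            return address
```

With such a cache, many threads crawling pages of the same site would trigger one DNS query per host per TTL period instead of one query per URL.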
Drawing on existing techniques and experience, the body of this thesis describes the structural design of a WAN-based distributed web crawler, including its hardware architecture and the partitioning of its software modules. On the hardware side, one PC serves as the control node and N PCs serve as crawl nodes, all connected over the WAN. The software side comprises the design of the control node software and the crawl node software.

The thesis then analyzes solutions to the key technical problems of a distributed crawler, such as how the distributed nodes cooperate and how tasks are assigned. It proposes practical algorithms for these problems, shows how to build a robust, scalable, and configurable distributed crawler, and analyzes this crawler in detail. For task allocation in the distributed system in particular, it uses an effect prediction method guided by network performance indices. Finally, the crawler is tested, including a single-machine crawling test and an application of the crawler, namely a capture test on the page information of school websites. The actual running results of several distributed task assignment methods are also compared, and the comparison shows that the effect prediction method guided by network performance indices is the best task scheduling method.
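The abstract does not give the details of the task allocation method. As a rough illustration of the general idea only, a control node might split a batch of URLs among crawl nodes in proportion to a per-node performance score. In the Python sketch below, the function `allocate_urls`, the `node_scores` mapping, and the proportional (largest-remainder) split are all assumptions made for illustration; they are not the thesis's algorithm, and how a score would be predicted from network measurements is left open.

```python
from typing import Dict, List


def allocate_urls(urls: List[str], node_scores: Dict[str, float]) -> Dict[str, List[str]]:
    """Split urls among nodes in proportion to their (hypothetical) scores."""
    if not node_scores:
        raise ValueError("at least one crawl node is required")
    if any(score < 0 for score in node_scores.values()):
        raise ValueError("scores must be non-negative")
    total = sum(node_scores.values())
    if total <= 0:
        raise ValueError("at least one score must be positive")

    names = sorted(node_scores)  # fixed order keeps the result deterministic
    exact = {n: len(urls) * node_scores[n] / total for n in names}
    quotas = {n: int(exact[n]) for n in names}  # round each share down

    # Hand the URLs lost to rounding to the nodes with the largest remainders.
    leftover = len(urls) - sum(quotas.values())
    by_remainder = sorted(names, key=lambda n: exact[n] - quotas[n], reverse=True)
    for n in by_remainder[:leftover]:
        quotas[n] += 1

    # Cut the URL list into consecutive slices of the computed sizes.
    assignment: Dict[str, List[str]] = {}
    start = 0
    for n in names:
        assignment[n] = urls[start:start + quotas[n]]
        start += quotas[n]
    return assignment
```

Under these assumptions, a node whose score is twice that of another receives roughly twice as many URLs, so better-connected nodes do more of the crawling. Every URL is assigned to exactly one node.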
Keywords/Search Tags: Web Crawler, WAN, Distributed System