Research And Implementation For Web Spider Based On Web Data Mining

Posted on: 2008-03-23
Degree: Master
Type: Thesis
Country: China
Candidate: J J Zhan
Full Text: PDF
GTID: 2178360242979318
Subject: Computer software and theory
Abstract/Summary:
Spider (crawler) programming is a key part of a search engine and a convenient, effective way to collect information from the WWW. Drawing on Web data mining techniques and the overall requirements of a search engine framework, the main work of this thesis is to implement an Internet spider that traverses the Web and stores page data in a local database, laying a foundation for an intelligent search engine.

The main contents are as follows. First, the thesis analyzes the principles of search engines and implements the first step of a search engine's work: fetching page data from the Internet. Second, it describes the technologies used, including the HTTP protocol, regular expressions, multithreading, and ADO.NET. Building on existing network spider techniques, it analyzes and designs a new spider system. Using a breadth-first search (BFS) strategy combined with multithreading, it implements algorithms for crawling web pages on internal and external networks and for analyzing their content.

The innovations are twofold. First, regular expressions are applied to fetched Web content so that URLs can be extracted quickly and efficiently; this supports the crawling and content-analysis algorithms, and the resulting data is compressed with zlib before being stored in the local database. Second, to increase speed, a special strategy handles faulty URLs: the server's response time determines whether a page is fetched, and a URL that times out is placed in an error queue to be processed by a dedicated error-handling thread.

The thesis then analyzes the experimental results obtained on a campus network and the data stored in the database. These validate the feasibility of the spider and show that the system's intended objectives have been achieved. Finally, the thesis summarizes the whole system and presents future work.
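
The pipeline described in the abstract (BFS traversal, regular-expression URL extraction, an error queue for URLs that time out, and zlib compression before storage) can be illustrated with a minimal sketch. This is not the thesis's implementation: it is written in Python rather than the ADO.NET environment the abstract mentions, it is single-threaded, `sqlite3` stands in for the local database, and the names `extract_links`, `http_fetch`, `crawl`, and the `pages` table are hypothetical.

```python
import re
import sqlite3
import zlib
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

# Simplified href pattern (an assumption); real-world HTML needs a more
# tolerant parser. The character class stops at '#', dropping fragments.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\'#]+)', re.IGNORECASE)


def extract_links(base_url, html):
    """Return absolute http(s) URLs found in html, in order, without duplicates."""
    seen, links = set(), []
    for raw in HREF_RE.findall(html):
        url = urljoin(base_url, raw.strip())
        if url.startswith(("http://", "https://")) and url not in seen:
            seen.add(url)
            links.append(url)
    return links


def http_fetch(url, timeout=5.0):
    """Fetch a page; raises OSError (which includes timeouts) on failure."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(seed, fetch=http_fetch, max_pages=50, db_path=":memory:"):
    """Breadth-first crawl from seed; returns (db connection, error queue)."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body BLOB)")
    frontier = deque([seed])  # FIFO queue gives breadth-first order
    visited = {seed}
    errors = deque()          # URLs that timed out or failed, kept for later retry
    stored = 0
    while frontier and stored < max_pages:
        url = frontier.popleft()
        try:
            html = fetch(url)
        except OSError:
            errors.append(url)  # defer the URL instead of stalling the crawl
            continue
        db.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?)",
            (url, zlib.compress(html.encode("utf-8"))),
        )
        stored += 1
        for link in extract_links(url, html):
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    db.commit()
    return db, errors
```

In this sketch the `fetch` parameter is injectable so the crawl logic can be exercised without network access. A multithreaded version along the lines the abstract describes would presumably share the frontier and error queue among worker threads, with one thread draining `errors`; that part is omitted here.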
Keywords/Search Tags: web mining, network spider, search engine