
Design And Implementation Of Web Crawler For Personalized Recommendation System

Posted on: 2015-07-27
Degree: Master
Type: Thesis
Country: China
Candidate: L H Yang
GTID: 2298330467950352
Subject: Computer technology
Abstract/Summary:
Internet technology has developed rapidly in recent years, and browsing the Web has become the public's main way of retrieving information. Internet content is characterized by wide variety, rapid spread, and large volume. Given these characteristics, fetching information promptly and accurately, so as to better serve the recommendation system of an education cloud, has become a significant problem. This thesis designs and implements a Web crawler for such a recommendation system. Using information extraction and web page processing techniques, the system provides more accurate classification, more comprehensive data, and more timely updates for Internet search services. The concrete work is as follows:
1. The development of web crawlers is outlined first, and then the architecture of the web crawler is presented. The distribution characteristics of topic pages on the Web are also described.
2. The search policy is studied. The topical relevance of each URL is predicted from the string features of the URL, its anchor text, and its parent page. As far as possible, URLs are crawled in descending order of predicted relevance, so that pages more closely related to the topic are downloaded first.
3. Web page parsing techniques are implemented. The main parts include HTML parsing, URL extraction, page de-noising, and main body extraction. In the implemented system, garbled-character (encoding) problems in the page download process are resolved. Noise is eliminated and the completeness of web page extraction is improved by combining link analysis with statistical methods.
4. Finally, the web crawler system is implemented. Its effectiveness and soundness are verified through repeated software testing.
The Web crawler presented in this thesis can be used in the personalized recommendation system of an education cloud project. It can recommend literature and relevant materials rapidly by acquiring and analyzing articles from all academic fields, thereby improving research efficiency.
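The search policy in item 2 can be illustrated with a minimal Python sketch of a best-first crawl frontier. It is not the thesis's implementation: the topic terms, the three weights, the term-overlap scoring, and the names `predict_relevance` and `Frontier` are all assumptions chosen for illustration. It only shows how evidence from the URL string, the anchor text, and the parent page might be combined into one score that orders the crawl.

```python
import heapq
import re
from itertools import count

# Hypothetical topic vocabulary; the thesis does not list its actual terms.
TOPIC_TERMS = {"crawler", "recommendation", "education"}

# Illustrative weights for the three evidence sources named in the abstract.
W_URL, W_ANCHOR, W_PARENT = 0.3, 0.4, 0.3


def term_overlap(text, terms=TOPIC_TERMS):
    """Fraction of topic terms that occur among the lowercase tokens of text."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(tokens & terms) / len(terms)


def predict_relevance(url, anchor_text, parent_score):
    """Weighted mix of URL-string, anchor-text, and parent-page evidence."""
    return (W_URL * term_overlap(url)
            + W_ANCHOR * term_overlap(anchor_text)
            + W_PARENT * parent_score)


class Frontier:
    """Priority queue that yields the highest-scoring unseen URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._tie = count()  # keeps insertion order among equal scores

    def push(self, url, score):
        if url in self._seen:
            return  # each URL is queued at most once
        self._seen.add(url)
        # heapq is a min-heap, so the score is stored negated.
        heapq.heappush(self._heap, (-score, next(self._tie), url))

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

In a crawl loop, each link extracted from a downloaded page would be scored with `predict_relevance`, using the parent page's own relevance as `parent_score`, and pushed onto the frontier. The next page to download is whatever `pop` returns.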
Keywords/Search Tags: Web crawler, Search policy, Main body extraction, Encoding conversion
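The "Encoding conversion" keyword relates to the garbled-character problem that the abstract says arises during page download. The abstract does not describe how the thesis solves it, so the Python sketch below is only one plausible approach: try the charset from the HTTP header, then a charset declared near the top of the page, then a fixed fallback list. The function name `decode_page`, the 2048-byte scan window, and the fallback order are assumptions.

```python
import re

# Matches charset declarations such as <meta charset="gbk"> or
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
META_CHARSET = re.compile(rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9_\-]+)", re.I)

# Fallback order is an assumption suited to Chinese-language pages.
FALLBACKS = ("utf-8", "gb18030")


def decode_page(raw, header_charset=None):
    """Return (text, encoding) for raw page bytes.

    Tries the HTTP header charset, then a charset declared in the first
    2048 bytes of the page, then the fallbacks. Never raises on bad bytes.
    """
    candidates = []
    if header_charset:
        candidates.append(header_charset)
    match = META_CHARSET.search(raw[:2048])
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates.extend(FALLBACKS)

    for name in candidates:
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset label; try the next one
    # Last resort: keep the page, marking undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8"
```

Decoding to Unicode once, at download time, lets later stages such as HTML parsing and main body extraction work on text rather than on bytes in mixed encodings.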