Design And Implementation Of Distributed Online Book Crawler System

Posted on:2017-01-29
Degree:Master
Type:Thesis
Country:China
Candidate:Y F Guo
GTID:2308330485457909
Subject:Software engineering
Abstract/Summary:
In the Internet era, electronic products have come to influence every aspect of life. Paper books of all kinds are being replaced by e-books in a variety of formats, which readers can read on a smartphone or tablet. Selecting and monitoring the content of these books has therefore become very important.

This thesis designs and implements a distributed online book crawler system. The system is built on the Scrapy framework and uses Redis as the URL queue to support distributed expansion, which greatly improves crawling efficiency. Redis is configured for high availability, which in turn improves the availability of the system. The system also supports incremental crawling, so that the newest books are captured.

The requirements analysis was based on an analysis of the business and of the users' characteristics. The system consists of a scheduler module, a crawler module, a pipeline module and a monitoring module, all of which the author designed and implemented:

(1) The scheduler module provides URL scheduling and URL filtering. It uses SHA1 to detect duplicate URLs and assigns each URL a priority in the URL queue. URLs are then distributed to the crawlers in priority order.
(2) The crawler module downloads the page at each URL, parses it and extracts its content. It finds new URL links and sends them to the URL cache queue. It also picks out the page's images, documents and detailed book information and sends them to the pipeline for the next step.
(3) The pipeline module converts each item to its standard format, then stores the pictures, documents and detailed information in separate storage units.
(4) The monitoring module monitors the status of each crawler: the number of URLs each crawler has crawled and the status of each crawler device.

Functional verification demonstrates that the new Scrapy crawler system meets the requirements of a distributed online book crawler system and can crawl online book websites efficiently in a distributed manner.
The project is currently still in the testing phase. The next phase will monitor the content of books and book illustrations, with the aim of providing a healthier reading environment for readers.
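
The scheduler module's SHA1-based duplicate detection and priority-ordered URL queue could be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: an in-memory set and heap stand in for the Redis-backed queue, and the class name UrlScheduler, its method names, and the example URLs are all hypothetical.

```python
import hashlib
import heapq
import itertools


class UrlScheduler:
    """Hypothetical in-memory stand-in for the Redis-backed URL queue."""

    def __init__(self):
        self._seen = set()                 # SHA-1 fingerprints of URLs already queued
        self._heap = []                    # entries: (-priority, sequence, url)
        self._counter = itertools.count()  # tie-breaker: first in, first out

    @staticmethod
    def fingerprint(url):
        # SHA-1 hex digest of the URL, used as a fixed-size duplicate key.
        return hashlib.sha1(url.strip().encode("utf-8")).hexdigest()

    def push(self, url, priority=0):
        """Queue a URL; return False if it was already seen."""
        fp = self.fingerprint(url)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        # heapq is a min-heap, so negate priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))
        return True

    def pop(self):
        """Return the highest-priority URL, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


scheduler = UrlScheduler()
scheduler.push("http://example.com/book/1", priority=1)
scheduler.push("http://example.com/list", priority=5)
scheduler.push("http://example.com/book/1", priority=9)  # duplicate, ignored
```

In a distributed deployment the seen-set and the queue would have to live in a store shared by all crawlers; Redis sets and sorted sets are one plausible fit, though the abstract does not say which Redis structures the system actually uses.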
Keywords/Search Tags:Scrapy, Distributed, MongoDB