
Design And Implementation Of Web Crawler System Based On Scrapy Framework

Posted on: 2020-06-05 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Sun | Full Text: PDF
GTID: 2428330575488718 | Subject: Software engineering
Abstract/Summary:
Baidu is a large corporation that focuses on search engines and artificial intelligence. Initially its main business was search; later the business expanded in all directions, and the company now employs tens of thousands of R&D staff. The author's department is the Quality Engineering Center, with about 400 people. The author's team works on Internet public-opinion and language-corpus projects and is responsible for the research and development of crawler technology.

The crawler system described here is developed on the Python Scrapy framework. It specializes in capturing public-opinion data from the social platforms MaiMai and Sina Weibo, according to the team's requirements. It involves key technologies such as distributed crawling, the Bloom filter algorithm, scheduling policy, anti-crawling countermeasures, Redis buffering, a proxy pool service, a cookie pool service, and login simulation, and it improves crawling efficiency dramatically.

The system comprises six parts: the middleware module, the entity pipeline module, the crawler business module, the scheduler module, the proxy pool module, and the cookie pool module. The middleware module includes User-Agent (UA), proxy, cookie, and retry middlewares. The entity pipeline module includes a MaiMai pipeline and a Sina Weibo pipeline and provides five major functions: field information definition, data cleaning, data de-duplication, formatting, and storage to the database. The crawler business module covers login simulation, request initiation, response parsing, data-object creation, and URL extraction with Request construction. The scheduler module realizes two functions: request de-duplication and a shared scheduling queue. In addition, the proxy pool and cookie pool modules were designed and developed to maintain those services for the corresponding middlewares. (Minimal, hedged sketches of several of these modules follow this abstract.)

The author completed the entire process of requirement analysis, design, and development for this system. Testing shows that crawling efficiency improved significantly and that the expected large-scale, distributed, and stable crawling has been achieved. Some aspects still leave room for improvement and are left as future work.
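As an illustration of the middleware module, here is a minimal sketch of a UA-rotating middleware and a proxy middleware that draws from a proxy pool service. The proxy-pool endpoint URL, the User-Agent list, and all names are illustrative assumptions, not taken from the thesis.

import random

import requests


USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]


class RandomUserAgentMiddleware:
    """Attach a random User-Agent header to every outgoing request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)


class ProxyPoolMiddleware:
    """Route each request through a proxy fetched from a proxy-pool service.

    The endpoint below is a hypothetical local service; the blocking
    requests.get call is kept deliberately simple for illustration.
    """

    PROXY_POOL_URL = "http://127.0.0.1:5555/random"  # assumed endpoint

    def process_request(self, request, spider):
        proxy = requests.get(self.PROXY_POOL_URL, timeout=3).text.strip()
        request.meta["proxy"] = f"http://{proxy}"

Both classes would be enabled through the DOWNLOADER_MIDDLEWARES setting; a cookie middleware that pulls a logged-in session from the cookie pool service would follow the same process_request pattern.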
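The entity pipeline's five duties (field definition, cleaning, de-duplication, formatting, storage) might look like the following sketch. MongoDB as the store, the field names, and the MD5-based in-process de-duplication are assumptions for illustration; the thesis does not specify them.

import hashlib

import pymongo
import scrapy
from scrapy.exceptions import DropItem


class WeiboItem(scrapy.Item):
    # Field information definition (illustrative fields only)
    author = scrapy.Field()
    content = scrapy.Field()
    posted_at = scrapy.Field()


class WeiboPipeline:
    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.posts = self.client["opinion"]["weibo"]
        self.seen = set()  # simple in-process de-duplication stand-in

    def process_item(self, item, spider):
        item["content"] = (item.get("content") or "").strip()  # cleaning
        digest = hashlib.md5(item["content"].encode("utf-8")).hexdigest()
        if digest in self.seen:                                # de-duplication
            raise DropItem("duplicate post")
        self.seen.add(digest)
        self.posts.insert_one(dict(item))                      # format + store
        return item

    def close_spider(self, spider):
        self.client.close()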
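The crawler business module's flow (initiate requests, parse responses, create data objects, extract URLs, and construct follow-up Requests) can be sketched as a Scrapy spider. The entry URL and CSS selectors are invented placeholders; a real Weibo crawl would additionally need the login simulation and cookie handling described above.

import scrapy


class WeiboSpider(scrapy.Spider):
    name = "weibo"
    start_urls = ["https://weibo.com/hot"]  # assumed entry point

    def parse(self, response):
        # Response parsing and data-object creation
        for post in response.css("div.card"):
            yield {
                "author": post.css("a.name::text").get(),
                "content": post.css("p.txt::text").get(),
            }
        # URL extraction & Request construction
        for href in response.css("a.next::attr(href)").getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)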
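For the scheduler module, the thesis pairs Bloom-filter request de-duplication with a Redis-backed shared scheduling queue, in the spirit of scrapy-redis. Below is one hedged way a Redis-backed Bloom filter could be written; the bit-array size, hash count, and key name are assumptions.

import hashlib

import redis


class RedisBloomFilter:
    """Probabilistic seen-set shared by all crawl nodes via one Redis bitmap."""

    def __init__(self, client, key="crawler:bloom", bits=1 << 25, hashes=6):
        self.client = client
        self.key = key
        self.bits = bits
        self.hashes = hashes

    def _offsets(self, value):
        # Derive several bit positions from salted MD5 digests.
        for salt in range(self.hashes):
            digest = hashlib.md5(f"{salt}:{value}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.bits

    def seen(self, url):
        """Return True if url was probably seen before; otherwise record it."""
        offsets = list(self._offsets(url))
        if all(self.client.getbit(self.key, off) for off in offsets):
            return True
        for off in offsets:
            self.client.setbit(self.key, off, 1)
        return False


# Usage: the filter is shared across nodes because the bitmap lives in Redis.
bf = RedisBloomFilter(redis.Redis(host="localhost", port=6379))
if not bf.seen("https://weibo.com/u/12345"):
    print("new URL; push it onto the shared scheduling queue")

Because every node consults the same Redis bitmap, a URL claimed by one crawler is never re-scheduled by another, which is what enables the distributed crawling the abstract describes.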
Keywords/Search Tags: Crawler system, Anti-crawling strategy, Big data, Scrapy