
Design And Implementation Of Web Crawler System Based On Scrapy Framework

Posted on: 2020-06-05 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Sun | Full Text: PDF
GTID: 2428330575488718 | Subject: Software engineering
Abstract/Summary:
Baidu is a large corporation that focuses on search engines and artificial intelligence. Initially its main business was search; later the business expanded in all directions, and the company now employs tens of thousands of R&D staff. The author's department is the Quality Engineering Center, with about 400 people. The author's team works on Internet public-opinion and language-corpus projects and is responsible for the research and development of crawler technology.

The crawler system described here is developed on the Python Scrapy framework. It specializes in capturing public-opinion data from the social platforms MaiMai and Sina Weibo, according to the team's requirements. It involves key technologies such as distributed crawling, the Bloom filter algorithm, scheduling policy, anti-crawling countermeasures, Redis buffering, a proxy pool service, a cookie pool service, and login simulation, and it improves crawling efficiency dramatically.

The system comprises six parts: the middleware module, the entity pipeline module, the crawler business module, the scheduler module, the proxy pool module, and the cookie pool module. The middleware module includes User-Agent (UA), proxy, cookie, and retry middlewares. The entity pipeline module includes a MaiMai pipeline and a Sina Weibo pipeline and provides five major functions: field information definition, data cleaning, data de-duplication, formatting, and storage to the database. The crawler business module covers login simulation, request initiation, response parsing, data-object creation, and URL extraction with Request construction. The scheduler module realizes two functions: request de-duplication and a shared scheduling queue. In addition, the proxy pool and cookie pool modules were designed and developed to maintain those services for the corresponding middlewares. (Minimal, hedged sketches of several of these modules follow this abstract.)

The author completed the entire process of requirement analysis, design, and development for this system. Testing shows that crawling efficiency improved significantly and that the expected large-scale, distributed, and stable crawling has been achieved. Some aspects still leave room for improvement and are left as future work.
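As an illustration of the middleware module, here is a minimal sketch of a UA-rotating middleware and a proxy middleware that draws from a proxy pool service. The proxy-pool endpoint URL, the User-Agent list, and all names are illustrative assumptions, not taken from the thesis.

import random

import requests


USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]


class RandomUserAgentMiddleware:
    """Attach a random User-Agent header to every outgoing request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)


class ProxyPoolMiddleware:
    """Route each request through a proxy fetched from a proxy-pool service.

    The endpoint below is a hypothetical local service; the blocking
    requests.get call is kept deliberately simple for illustration.
    """

    PROXY_POOL_URL = "http://127.0.0.1:5555/random"  # assumed endpoint

    def process_request(self, request, spider):
        proxy = requests.get(self.PROXY_POOL_URL, timeout=3).text.strip()
        request.meta["proxy"] = f"http://{proxy}"

Both classes would be enabled through the DOWNLOADER_MIDDLEWARES setting; a cookie middleware that pulls a logged-in session from the cookie pool service would follow the same process_request pattern.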
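The entity pipeline's five duties (field definition, cleaning, de-duplication, formatting, storage) might look like the following sketch. MongoDB as the store, the field names, and the MD5-based in-process de-duplication are assumptions for illustration; the thesis does not specify them.

import hashlib

import pymongo
import scrapy
from scrapy.exceptions import DropItem


class WeiboItem(scrapy.Item):
    # Field information definition (illustrative fields only)
    author = scrapy.Field()
    content = scrapy.Field()
    posted_at = scrapy.Field()


class WeiboPipeline:
    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.posts = self.client["opinion"]["weibo"]
        self.seen = set()  # simple in-process de-duplication stand-in

    def process_item(self, item, spider):
        item["content"] = (item.get("content") or "").strip()  # cleaning
        digest = hashlib.md5(item["content"].encode("utf-8")).hexdigest()
        if digest in self.seen:                                # de-duplication
            raise DropItem("duplicate post")
        self.seen.add(digest)
        self.posts.insert_one(dict(item))                      # format + store
        return item

    def close_spider(self, spider):
        self.client.close()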
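The crawler business module's flow (initiate requests, parse responses, create data objects, extract URLs, and construct follow-up Requests) can be sketched as a Scrapy spider. The entry URL and CSS selectors are invented placeholders; a real Weibo crawl would additionally need the login simulation and cookie handling described above.

import scrapy


class WeiboSpider(scrapy.Spider):
    name = "weibo"
    start_urls = ["https://weibo.com/hot"]  # assumed entry point

    def parse(self, response):
        # Response parsing and data-object creation
        for post in response.css("div.card"):
            yield {
                "author": post.css("a.name::text").get(),
                "content": post.css("p.txt::text").get(),
            }
        # URL extraction & Request construction
        for href in response.css("a.next::attr(href)").getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)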
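For the scheduler module, the thesis pairs Bloom-filter request de-duplication with a Redis-backed shared scheduling queue, in the spirit of scrapy-redis. Below is one hedged way a Redis-backed Bloom filter could be written; the bit-array size, hash count, and key name are assumptions.

import hashlib

import redis


class RedisBloomFilter:
    """Probabilistic seen-set shared by all crawl nodes via one Redis bitmap."""

    def __init__(self, client, key="crawler:bloom", bits=1 << 25, hashes=6):
        self.client = client
        self.key = key
        self.bits = bits
        self.hashes = hashes

    def _offsets(self, value):
        # Derive several bit positions from salted MD5 digests.
        for salt in range(self.hashes):
            digest = hashlib.md5(f"{salt}:{value}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.bits

    def seen(self, url):
        """Return True if url was probably seen before; otherwise record it."""
        offsets = list(self._offsets(url))
        if all(self.client.getbit(self.key, off) for off in offsets):
            return True
        for off in offsets:
            self.client.setbit(self.key, off, 1)
        return False


# Usage: the filter is shared across nodes because the bitmap lives in Redis.
bf = RedisBloomFilter(redis.Redis(host="localhost", port=6379))
if not bf.seen("https://weibo.com/u/12345"):
    print("new URL; push it onto the shared scheduling queue")

Because every node consults the same Redis bitmap, a URL claimed by one crawler is never re-scheduled by another, which is what enables the distributed crawling the abstract describes.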
Keywords/Search Tags: Crawler system, Anti-crawling strategy, Big data, Scrapy