| With the rapid development of the Internet, mining and applying massive data is becoming a new trend. According to the International Data Corporation, the amount of data generated worldwide reached 1.82 ZB in 2011. Data in the life sciences is also growing rapidly: the rapid spread of gene sequencing and protein sequencing technology in particular has produced a large body of biomedical data. Drug design, drug screening, and clinical trials are further sources of massive data, and human health data in the life sciences has reached a surprising volume. However, medical researchers and medical workers face shortcomings in how they use the medical literature and cannot get the full benefit from it. This paper studies the basic principles of web crawlers and their classification and analysis algorithms. To deal with anti-crawler measures, the distributed crawler framework Scrapy and dynamic web crawling technology are introduced. Based on these studies, the authors propose a distributed Scrapy-Redis-Selenium+PhantomJS crawler framework to implement a PubMed web crawler system. The system mainly extracts the titles and abstracts of the literature on a related subject. For the user's convenience, the system uses the Qt framework to design the crawler's user interface. Finally, the paper summarizes the work and proposes directions for further optimization. In short, this paper focuses on the design and implementation of a distributed crawler for biomedical data. The system solves the problem of supporting dynamic web pages and also improves the speed of information collection. It thus provides a technical means for distributed crawling of PubMed web pages and can obtain relevant medical literature data more efficiently. |