
Design and Implementation of a Distributed Information Acquisition and Distribution System Based on Python

Posted on: 2020-09-26 | Degree: Master | Type: Thesis
Country: China | Candidate: D Wu | Full Text: PDF
GTID: 2518306107970209 | Subject: Master of Engineering
Abstract/Summary:
In recent years, with the rapid development of network technology, the Internet has become an integral part of daily life. When we are hungry, we think of Meituan; when we want to travel, we think of DiDi; when we need to stay away from home on business, we think of Qunar ("Where to Go"). The network has become the mainstream channel for information exchange, and acquiring network information quickly, effectively, and accurately has become an urgent problem. At present, both the state and industry strongly support the collection, analysis, and release of big data. However, relying on manual collection alone is both time-consuming and costly.

Against this background, this thesis studies the working principles of information collection technology, several commonly used crawler frameworks, and collection algorithms, and analyzes the structural characteristics of information websites. Based on the characteristics of the collection targets and by integrating two algorithms, four types of collection programs are designed. On top of the Scrapy framework, middleware is used to implement dynamic browser identities and proxy pools. MySQL and cloud-platform virtualization are used to deploy a highly reliable, parallel distributed collection cluster that improves collection efficiency. PyQt5 is used to implement a cross-platform information publishing program, and the Selenium automation tool is used to handle website login, website queries, and manual data collection. Flask is used to develop the data acquisition management platform and a large-screen display. In addition, to unify the publishing format, the system includes a data cleaning module that provides data cleaning, format conversion, object removal and addition, and other functions.

This thesis designs and implements a Python-based distributed information collection and release system that collects network information and releases it after classification, which greatly reduces the workload of information practitioners in related industries and provides technical support for faster, better, and more convenient access to released information. The system has been running stably for one year, has grown from its initial collection targets to hundreds of domestic and foreign websites, and has captured 3.95 million data records.
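
The middleware-based rotation of browser identities and proxies mentioned in the abstract might look roughly like the sketch below. It is not the thesis's code. It assumes that "browser identity" means the User-Agent header, the class name and the two pools are hypothetical placeholders, and it follows the shape of a Scrapy downloader middleware (a `process_request(request, spider)` method that sets a header and `request.meta["proxy"]`) without importing Scrapy.

```python
import random

# Hypothetical pools; the thesis does not list its actual values.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
]
PROXIES = [
    "http://127.0.0.1:8001",
    "http://127.0.0.1:8002",
]


class RotatingIdentityMiddleware:
    """Downloader-middleware-style class (hypothetical name).

    Picks a random User-Agent and a random proxy for every outgoing request.
    """

    def __init__(self, user_agents=USER_AGENTS, proxies=PROXIES, rng=None):
        self.user_agents = list(user_agents)
        self.proxies = list(proxies)
        # An injectable random source keeps the behaviour reproducible in tests.
        self.rng = rng or random.Random()

    def process_request(self, request, spider=None):
        # Scrapy calls process_request(request, spider) on each enabled
        # downloader middleware before the request is downloaded.
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        # Scrapy's built-in HttpProxyMiddleware reads the proxy URL
        # from request.meta["proxy"].
        request.meta["proxy"] = self.rng.choice(self.proxies)
        # Returning None lets the request continue through the chain.
        return None
```

In a Scrapy project, a class like this would be enabled through the `DOWNLOADER_MIDDLEWARES` setting. A production pool would more plausibly be loaded from a database or a proxy service than hard-coded as it is here.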
Keywords/Search Tags: Information Acquisition and Publication, Python, Intelligent Acquisition Algorithms, Information Classification Algorithms, Selenium, Flask, PyQt5