
Design And Implementation Of Distributed Crawler System Based On Docker Cluster

Posted on: 2021-07-19
Degree: Master
Type: Thesis
Country: China
Candidate: J S Min
Full Text: PDF
GTID: 2518306569490604
Subject: Master of Engineering
Abstract/Summary:
With information technologies such as mobile computing, cloud computing, and IoT entering the fast lane, the internet has entered an era of explosive growth in information and data. While users enjoy the convenience brought by the latest technologies, how to capture target information from massive amounts of data has become a primary problem to solve. Current classic crawler systems, however, are difficult and inefficient to deploy, struggle to achieve clustering and balanced use of computing resources, and can only complete limited capture tasks on their own. They are also usually low in generalization ability and customizability. Moreover, engineers rarely have access to transparent and effective experience in the distributed transformation of crawlers, which restricts the further development and large-scale application of crawler technology.

In response to the market's demand for high-speed retrieval of target information from massive data, this thesis undertakes the research, design, and implementation of a distributed crawler system, committed to realizing the distributed transformation of crawlers effectively and to overcoming the shortcomings of current systems. To reduce the difficulty of large-scale deployment and cluster operation, the thesis builds on Docker, a lightweight virtualization technology that allows one-time release and multiple deployments, and studies how to combine Docker with crawler applications and improve their design. To solve the inefficiency caused by current systems' lack of load balancing, simulated annealing is chosen as the scheduling optimization algorithm and improved by combining it with the firefly algorithm to balance tasks across the system. Furthermore, to improve on current systems' poor generalization ability and low customizability, the thesis studies the application of deep learning methods to rule generation and realizes adaptive matching of extraction rules with a deep model.

After this research on systemic improvements, the design and implementation of the proposed distributed crawler system are presented. First, the functional and non-functional requirements of the system are designed, analyzed, and managed in an engineering manner. The outline design and detailed design then follow in turn: the former covers the overall system architecture and overall process, the division of functional modules, and the design of interface types, while the latter includes the system architecture design, system class design, and system timing design based on genetic thinking. After implementing the proposed system, we deployed it in a real-world setting and selected enterprise websites as test cases to evaluate its ability to retrieve data from massive data sets, which fully verified the overall goal of the system.

In summary, this thesis provides a Docker-based distributed crawler system with high availability and high efficiency for highly concurrent retrieval of massive data. It can be deployed across different Docker containers with portability, scalability, and maintainability, and it can train and invoke deep learning models to automatically optimize template rules and improve data integrity, making it a more comprehensive solution than current products.
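The task-balancing idea described above, stripped to its simulated-annealing core, can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the firefly-algorithm hybridization is omitted, task costs are abstract numbers, and the `schedule` function and its load-spread objective (max node load minus min node load) are constructs invented for this example.

```python
import math
import random

def schedule(tasks, num_nodes, iters=5000, t0=1.0, cooling=0.999):
    """Assign task costs to crawler nodes via simulated annealing,
    minimizing the load spread (max node load - min node load)."""
    assign = [random.randrange(num_nodes) for _ in tasks]

    def spread(a):
        loads = [0.0] * num_nodes
        for cost, node in zip(tasks, a):
            loads[node] += cost
        return max(loads) - min(loads)

    best, best_cost = assign[:], spread(assign)
    cur_cost, t = best_cost, t0
    for _ in range(iters):
        # Propose moving one random task to a random node.
        i = random.randrange(len(tasks))
        old = assign[i]
        assign[i] = random.randrange(num_nodes)
        new_cost = spread(assign)
        # Metropolis criterion: always accept improvements; accept
        # worse moves with probability exp(-delta / temperature).
        if new_cost <= cur_cost or random.random() < math.exp((cur_cost - new_cost) / t):
            cur_cost = new_cost
            if new_cost < best_cost:
                best, best_cost = assign[:], new_cost
        else:
            assign[i] = old  # reject: undo the move
        t *= cooling  # geometric cooling schedule
    return best, best_cost
```

The temperature-dependent acceptance of worse moves is what lets the scheduler escape locally balanced but globally poor assignments; the thesis's firefly improvement would further guide the search, which this sketch does not attempt.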
Keywords/Search Tags: web crawler, distributed system, microservices, deep learning, simulated annealing