
Design And Implementation Of Distributed Crawler System Based On Docker Cluster

Posted on: 2021-07-19
Degree: Master
Type: Thesis
Country: China
Candidate: J S Min
Full Text: PDF
GTID: 2518306569490604
Subject: Master of Engineering
Abstract/Summary:
With information technologies such as mobile computing, cloud computing, and IoT entering the fast lane, the internet has entered an era of explosive growth in information and data. While users enjoy the convenience brought by the latest technologies, how to capture target information from massive amounts of data has become a primary problem to solve. Current classic crawler systems, however, are difficult and inefficient to deploy, struggle to achieve clustering and balanced use of computing resources, and can only complete limited capture tasks on their own. They are also usually low in generalization ability and customizability. Moreover, engineers rarely have access to transparent and effective experience in the distributed transformation of crawlers, which restricts the further development and large-scale application of crawler technology.

In response to the market's demand for high-speed retrieval of target information from massive data, this thesis undertakes the research, design, and implementation of a distributed crawler system, committed to realizing the distributed transformation of crawlers effectively and to overcoming the shortcomings of current systems. To reduce the difficulty of large-scale deployment and cluster operation, the thesis builds on Docker, a lightweight virtualization technology that allows one-time release and multiple deployments, and studies how to combine Docker with crawler applications and improve their design. To solve the inefficiency caused by current systems' lack of load balancing, simulated annealing is chosen as the scheduling optimization algorithm and improved by combining it with the firefly algorithm to balance tasks across the system. Furthermore, to improve on current systems' poor generalization ability and low customizability, the thesis studies the application of deep learning methods to rule generation and realizes adaptive matching of extraction rules with a deep model.

After this research on systemic improvements, the design and implementation of the proposed distributed crawler system are presented. First, the functional and non-functional requirements of the system are designed, analyzed, and managed in an engineering manner. The outline design and detailed design then follow in turn: the former covers the overall system architecture and overall process, the division of functional modules, and the design of interface types, while the latter includes the system architecture design, system class design, and system timing design based on genetic thinking. After implementing the proposed system, we deployed it in a real-world setting and selected enterprise websites as test cases to evaluate its ability to retrieve data from massive data sets, which fully verified the overall goal of the system.

In summary, this thesis provides a Docker-based distributed crawler system with high availability and high efficiency for highly concurrent retrieval of massive data. It can be deployed across different Docker containers with portability, scalability, and maintainability, and it can train and invoke deep learning models to automatically optimize template rules and improve data integrity, making it a more comprehensive solution than current products.
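The task-balancing idea described above, stripped to its simulated-annealing core, can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the firefly-algorithm hybridization is omitted, task costs are abstract numbers, and the `schedule` function and its load-spread objective (max node load minus min node load) are constructs invented for this example.

```python
import math
import random

def schedule(tasks, num_nodes, iters=5000, t0=1.0, cooling=0.999):
    """Assign task costs to crawler nodes via simulated annealing,
    minimizing the load spread (max node load - min node load)."""
    assign = [random.randrange(num_nodes) for _ in tasks]

    def spread(a):
        loads = [0.0] * num_nodes
        for cost, node in zip(tasks, a):
            loads[node] += cost
        return max(loads) - min(loads)

    best, best_cost = assign[:], spread(assign)
    cur_cost, t = best_cost, t0
    for _ in range(iters):
        # Propose moving one random task to a random node.
        i = random.randrange(len(tasks))
        old = assign[i]
        assign[i] = random.randrange(num_nodes)
        new_cost = spread(assign)
        # Metropolis criterion: always accept improvements; accept
        # worse moves with probability exp(-delta / temperature).
        if new_cost <= cur_cost or random.random() < math.exp((cur_cost - new_cost) / t):
            cur_cost = new_cost
            if new_cost < best_cost:
                best, best_cost = assign[:], new_cost
        else:
            assign[i] = old  # reject: undo the move
        t *= cooling  # geometric cooling schedule
    return best, best_cost
```

The temperature-dependent acceptance of worse moves is what lets the scheduler escape locally balanced but globally poor assignments; the thesis's firefly improvement would further guide the search, which this sketch does not attempt.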
Keywords/Search Tags: web crawler, distributed system, microservices, deep learning, simulated annealing