
Research And Implementation Of Distributed Internet Information Crawling System For Cyber Security

Posted on: 2021-03-25
Degree: Master
Type: Thesis
Country: China
Candidate: S S Li
GTID: 2518306308968929
Subject: Electronics and Communications Engineering
Abstract/Summary:
With the rapid development of the Internet, cyber security has gradually become a topic of public concern, as it bears directly on the public's personal privacy and information security. Articles and information related to cyber security are therefore extremely valuable. Cyber security is a highly specialized field. Network resources related to it are mainly distributed across professional websites, forums, and the technical sections of websites that carry many types of information. Because these resources are scattered across the network, the public cannot follow recent cyber security topics in a timely and accurate manner. This thesis designs and implements a distributed Internet crawling system for the field of cyber security. The system efficiently crawls massive amounts of web page text from the Internet, extracts the body text from each page, and finally selects the text that belongs to the field of cyber security. The main contents of this thesis are as follows:

A Redis list serves as the consumption queue for crawl tasks, forming a distributed crawling system in which multiple crawler nodes consume the same queue. The thesis designs a deduplication strategy for URLs to be crawled, which avoids repeated crawling and improves the crawling efficiency of the system.

Message queues decouple the parsing module from the crawler module, which reduces coupling within the system. In addition, the thesis designs a task scheduling strategy that coordinates the running speed of the two modules, makes fuller use of the system's network and hardware resources, and improves crawling efficiency.

The thesis designs text extraction algorithms for complex and diverse web page structures, together with a practical data cleaning scheme that selects web page text belonging to the field of cyber security.

The thesis surveys professional websites related to cyber security and writes dedicated crawlers for them. These crawlers run daily to obtain the text of web pages published by these websites, and the collected text can be used to build a cyber security text data set. In addition, the thesis uses machine learning algorithms to design and implement a binary text classification model for cyber security. This model selects text belonging to the field of cyber security from the massive web page text crawled by the distributed crawling system.
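The shared Redis list and the URL deduplication step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's actual implementation: the abstract does not say how deduplication is done, so a set of SHA-1 URL fingerprints is assumed here, and the key names `crawl:queue` and `crawl:seen` and the helpers `fingerprint`, `enqueue`, and `next_task` are hypothetical. To keep the example runnable without a server, `InMemoryStore` stands in for a Redis client and mimics only the `sadd`, `lpush`, and `rpop` commands.

```python
import hashlib
from collections import deque
from urllib.parse import urldefrag


class InMemoryStore:
    """Stand-in for a Redis client; mimics only sadd, lpush and rpop."""

    def __init__(self):
        self.sets = {}
        self.lists = {}

    def sadd(self, key, member):
        # Like Redis SADD: returns 1 if the member was new, 0 otherwise.
        members = self.sets.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1

    def lpush(self, key, value):
        # Like Redis LPUSH: pushes to the head, returns the new length.
        queue = self.lists.setdefault(key, deque())
        queue.appendleft(value)
        return len(queue)

    def rpop(self, key):
        # Like Redis RPOP: pops from the tail, or returns None if empty.
        queue = self.lists.get(key)
        return queue.pop() if queue else None


QUEUE_KEY = "crawl:queue"  # hypothetical key for the shared task list
SEEN_KEY = "crawl:seen"    # hypothetical key for the fingerprint set


def fingerprint(url):
    """Hash a URL after dropping its fragment (assumed normalization)."""
    normalized, _fragment = urldefrag(url.strip())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def enqueue(store, url):
    """Queue a URL only if its fingerprint has not been seen before."""
    if store.sadd(SEEN_KEY, fingerprint(url)) == 0:
        return False  # duplicate: skip to avoid repeated crawling
    store.lpush(QUEUE_KEY, url)
    return True


def next_task(store):
    """Called by each crawler node to take the oldest queued URL."""
    return store.rpop(QUEUE_KEY)


store = InMemoryStore()
for url in ["https://example.com/a",
            "https://example.com/a#top",   # same page, different fragment
            "https://example.com/b"]:
    print(url, "queued" if enqueue(store, url) else "skipped")
print(next_task(store))
```

Because every node pushes to the head and pops from the tail of the same list, the queue behaves first-in, first-out, and the single `sadd` call both checks and records a fingerprint. With a real server, a redis-py client created with `decode_responses=True` could be passed in place of `InMemoryStore`, since it exposes methods with the same names.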
Keywords/Search Tags: cyber security, distributed crawling system, web page text, queue