Research And Implementation Of Distributed Internet Information Crawling System For Cyber Security

Posted on:2021-03-25

Degree:Master

Type:Thesis

Country:China

Candidate:S S Li

GTID:2518306308968929

Subject:Electronics and Communications Engineering

Abstract/Summary:

With the rapid development of the Internet, cyber security has become a topic of public concern, bearing directly on the public’s personal privacy and information security. Articles and information about cyber security are therefore extremely valuable. Cyber security is a highly specialized field, and the related network resources are mainly found on professional websites, in forums, and in the technical sections of general-purpose websites. Because these resources are scattered across the network, the public cannot follow recent cyber security topics in a timely and accurate manner. This thesis designs and implements a distributed Internet crawling system for the field of cyber security. The system efficiently crawls massive amounts of web page text from the Internet, extracts the body text from each page, and finally keeps only the text that belongs to the field of cyber security.

The main contents of this thesis are as follows:

A Redis list serves as the consumption queue for crawl tasks, so that multiple crawler nodes consume the same queue and together form a distributed crawling system. An effective deduplication strategy is designed for the URLs to be crawled; it avoids repeated crawling and improves the crawling efficiency of the system.

Message queues decouple the parsing module from the crawler module, which reduces the coupling of the system. In addition, a task scheduling strategy coordinates the running speed of the two modules, improving both the utilization of network and hardware resources and the crawling efficiency of the system.

Text extraction algorithms are designed for complex and diverse web page structures, together with a practical data cleaning solution that selects the web page text belonging to the field of cyber security.

The thesis surveys professional websites related to cyber security and implements crawlers for them. These crawlers run daily to obtain the web page text that the sites publish, and this text can be used to build a cyber security text data set. In addition, machine learning algorithms are used to design and implement a binary text classification model for cyber security. The model selects the text belonging to the field of cyber security from the massive web page text crawled by the distributed crawling system.
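
The shared task queue and URL deduplication described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the key names, the fingerprinting scheme (SHA-1 of the URL with its fragment removed), and the helper names are all assumptions made for illustration. The functions only call `lpush`, `rpop`, and `sadd`, which a redis-py client also provides; a small in-memory stand-in is used here so the example runs without a Redis server.

```python
import hashlib
from collections import deque
from urllib.parse import urldefrag

QUEUE_KEY = "crawl:queue"  # hypothetical key for the shared task list
SEEN_KEY = "crawl:seen"    # hypothetical key for the set of URL fingerprints


class InMemoryRedis:
    """Stand-in that mimics the three Redis commands used below (illustration only)."""

    def __init__(self):
        self.lists = {}
        self.sets = {}

    def lpush(self, name, *values):
        queue = self.lists.setdefault(name, deque())
        for value in values:
            queue.appendleft(value)
        return len(queue)

    def rpop(self, name):
        queue = self.lists.get(name)
        return queue.pop() if queue else None

    def sadd(self, name, *values):
        members = self.sets.setdefault(name, set())
        added = 0
        for value in values:
            if value not in members:
                members.add(value)
                added += 1
        return added  # number of members that were newly added


def fingerprint(url):
    """Assumed scheme: hash the URL after stripping whitespace and the #fragment."""
    normalized = urldefrag(url.strip()).url
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def enqueue(client, url):
    """Queue the URL only if its fingerprint has not been recorded before."""
    if client.sadd(SEEN_KEY, fingerprint(url)) == 1:
        client.lpush(QUEUE_KEY, url)
        return True
    return False


def worker(client, node_name, max_tasks):
    """One crawler node: take up to max_tasks URLs from the shared queue."""
    taken = []
    for _ in range(max_tasks):
        url = client.rpop(QUEUE_KEY)  # a real client returns bytes unless decode_responses=True
        if url is None:
            break
        taken.append((node_name, url))  # a real node would fetch the page here
    return taken


client = InMemoryRedis()
results = [enqueue(client, u) for u in (
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a#top",  # same page as the first URL, so it is skipped
)]
first = worker(client, "node-1", 1)
second = worker(client, "node-2", 5)
print(results, first, second)
```

Pushing on the left and popping on the right gives first-in, first-out order, and because the queue lives in one shared store, each URL is handed to exactly one node. A production system would more likely use a blocking pop and a more careful URL normalization; those choices are outside what the abstract states.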

Keywords/Search Tags:

cyber security, distributed crawling system, web page text, queue

Related items

1	Research On Network Reptiles In Distributed Parallel Environment
2	Research On Customized Web Information Crawling And Pushing Techniques
3	Vertical Search Engine For Crawling The Web Page Design And Implementation
4	Key Technology Research On Web Forums Crawling And Hot Topic Detection
5	Research On The Focused Crawling Combining Synthetic Web-Page Information And Domain Ontology
6	Distributed Web Crawler System
7	Research And Application On Web Crawling And Text Mining Technology
8	Design And Implementation Of Text Error Correction System Based On Text Extraction From Distributed Video Stream
9	Research On Efficient Web Information Crawling Strategy
10	Research On Index Management And File Pretreatment Of Distributed Full-text Retrieval System