
Research On Abnormal Website Detection Method Based On Natural Language Processing And Integrated Learning

Posted on: 2020-10-01  Degree: Master  Type: Thesis
Country: China  Candidate: Q L An
GTID: 2518306305985679  Subject: Computer technology
Abstract/Summary:
Information technology is now closely tied to people's daily lives. Its rapid spread brings convenience, but it also introduces security risks that cannot be ignored. Keeping the network environment safe and protecting people from network threats is an important challenge for network security today, and an urgent problem in people's online lives.

This thesis summarizes the research background and the current state of research on the subject in China and abroad, and systematically describes the main threats to network security, the forms they generally take, and their attack modes. Based on the form of the sample data, natural language processing is used to convert the samples from natural-language form into numerical form, which provides a good data basis for feature learning and opens up more possibilities for data mining. When the negative samples are extremely imbalanced, the thesis takes the unsafe types as the benchmark, combines sampling with a sample generation algorithm to construct reasonable subsets, and proposes an improved ensemble learning algorithm that achieves accurate detection of abnormal network behavior.

The original network security samples are collected by combining the uniform resource locator (URL) with the page script, which effectively and comprehensively covers the text features related to security. Feature engineering based on natural language processing then performs word segmentation, vectorization, and feature extraction on the collected text, with the samples serving as the basis of feature analysis, and documents are abstracted from their features by a topic analysis method. To address the imbalance between positive and negative samples across the whole data set, down-sampling is combined with the SMOTE sample generation algorithm: samples are drawn at random with replacement, a small number of synthetic samples are generated for the minority classes, and several subsets with a relatively balanced class ratio are assembled, so that the minority samples are not poorly learned during training. For the way samples are fed to the final classification model in the ensemble (integrated) model, the thesis proposes a Re-Bagging strategy that adds samples of the corresponding class following the descending order of class proportion. This improves the reliability of the results produced by the base models in the overall structure, raises the likelihood that samples are classified correctly, and reduces the false alarm rate.
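The balanced-subset step described in the abstract (down-sampling the majority class with replacement, plus small-scale SMOTE generation for the minority class) might look roughly like the sketch below. It is not the thesis's implementation: the function names `smote_like` and `balanced_subsets`, the label convention (1 = unsafe minority, 0 = normal majority), the neighbour count, and the toy data are all assumptions made for illustration, and the interpolation is a simplified SMOTE-style step using only the Python standard library.

```python
import math
import random


def smote_like(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (SMOTE-style)."""
    if rng is None:
        rng = random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balanced_subsets(majority, minority, n_subsets=3, n_synthetic=2, seed=0):
    """Build several (X, y) subsets with a 1:1 class ratio.

    Assumed labels: 0 = majority (normal), 1 = minority (unsafe).
    """
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        # Small-scale generation: add a few synthetic minority samples.
        minority_part = minority + smote_like(minority, n_synthetic, rng=rng)
        # Down-sample the majority class, drawing with replacement,
        # to the size of the (augmented) minority class.
        majority_part = rng.choices(majority, k=len(minority_part))
        X = majority_part + minority_part
        y = [0] * len(majority_part) + [1] * len(minority_part)
        subsets.append((X, y))
    return subsets


# Hypothetical toy data: 50 majority points and 4 minority points in 2-D.
majority = [[float(i), float(i % 5)] for i in range(50)]
minority = [[100.0, 100.0], [101.0, 99.0], [102.0, 101.0], [99.0, 102.0]]
subsets = balanced_subsets(majority, minority)
```

Each subset could then be used to train one base model of an ensemble. Using the minority class size as the reference for every subset follows the abstract's statement that the unsafe types are taken as the benchmark; the exact ratios and generation counts used in the thesis are not given in the abstract.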
Keywords/Search Tags:Network security, Anomaly detection, Feature engineering, Data imbalance, Integrated learning