Imbalance Malicious Text Detection Based On Ensemble Learning

Posted on:2020-09-13

Degree:Master

Type:Thesis

Country:China

Candidate:Y J He

GTID:2428330596475089

Subject:Computer Science and Technology

Abstract/Summary:

As computer technology becomes ever more entwined with daily life, natural language processing (NLP) has come to serve as a bridge between computers and human language. NLP narrows the gap between machine processing and sequential human language, and it can replace or assist people in everyday language tasks such as text classification, language translation, and part-of-speech tagging, where it has achieved excellent results. However, the massive volume of text on the Internet contains a good deal of malicious text, and its quantity is heavily imbalanced relative to normal text. Manual recognition alone is impractical, so malicious text detection techniques are needed to analyze and classify the nature of a text. Because the data distribution in malicious text detection is imbalanced, this thesis studies the task from the perspective of imbalanced data classification; in essence, it is an imbalanced text classification problem. Existing text classification methods do not transfer well to disorganized, imbalanced text datasets, which greatly limits their use in practice. The main goal of this thesis is therefore to train a model with high accuracy and good robustness on an imbalanced text dataset, so that it detects malicious text reliably and distinguishes it from normal text.

First, the thesis reviews the background and significance of malicious text detection and surveys the state of research on imbalanced text classification and malicious text detection, both domestically and internationally; it then analyzes the principle and impact of data imbalance. It also examines common text classification methods that can address data imbalance, with an in-depth focus on ensemble learning, and analyzes the principles, advantages, and disadvantages of ensemble methods based on Boosting and Bagging. In the experimental part, a series of comparative experiments is carried out on these two traditional ensemble methods, and the results are analyzed and compared.

Finally, to address data imbalance in malicious text detection, the thesis proposes a new ensemble learning method, NESTEN, which combines the advantages of Boosting and Bagging, and describes its principle in detail. A set of rigorous comparative experiments is designed between NESTEN and the traditional ensemble methods. The results show that NESTEN outperforms the traditional ensemble methods, which verifies that it is effective for malicious text classification under imbalanced data conditions.
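
The abstract does not specify how NESTEN is built, so the following is not the author's method. It is only a minimal sketch of one generic way Bagging and Boosting can be combined for imbalanced data: each bag keeps all minority samples plus an equally sized random draw from the majority class, and a small AdaBoost of decision stumps is trained per bag. All function names, the two numeric features, and the synthetic toy data (deliberately separable so the result is easy to check) are assumptions made for illustration.

```python
import math
import random

def train_stump(X, y, w):
    """Pick the (error, feature, threshold, sign) stump with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for sign in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (sign if row[f] >= t else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    return best

def stump_predict(stump, row):
    _, f, t, sign = stump
    return sign if row[f] >= t else -sign

def adaboost(X, y, rounds=5):
    """Boosting step: reweight samples so later stumps focus on earlier mistakes."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # clamp to keep log() finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def boost_score(model, row):
    return sum(alpha * stump_predict(stump, row) for alpha, stump in model)

def balanced_bagging_of_boosters(X, y, n_bags=5, rounds=5, seed=0):
    """Bagging step: every bag is a class-balanced subset with its own booster."""
    rng = random.Random(seed)
    pos = [i for i, yi in enumerate(y) if yi == 1]
    neg = [i for i, yi in enumerate(y) if yi == -1]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    bags = []
    for _ in range(n_bags):
        idx = minority + rng.sample(majority, len(minority))
        bags.append(adaboost([X[i] for i in idx], [y[i] for i in idx], rounds))
    return bags

def predict(bags, row):
    """Sum the boosted scores of all bags; +1 = malicious, -1 = normal."""
    return 1 if sum(boost_score(m, row) for m in bags) >= 0 else -1

# Synthetic toy data (an assumption): 200 "normal" rows vs. 10 "malicious" rows.
data_rng = random.Random(1)
X = [[data_rng.uniform(0, 1), data_rng.uniform(0, 1)] for _ in range(200)]
X += [[data_rng.uniform(2, 3), data_rng.uniform(2, 3)] for _ in range(10)]
y = [-1] * 200 + [1] * 10

bags = balanced_bagging_of_boosters(X, y)
minority_recall = sum(predict(bags, r) == 1 for r, l in zip(X, y) if l == 1) / 10
print("minority recall on training data:", minority_recall)
```

In a real text setting the rows would be feature vectors derived from documents (for example term weights or embeddings), and evaluation would use held-out data with minority-class recall or F1 rather than accuracy alone, since a classifier that labels everything as normal already scores high accuracy on imbalanced data.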

Keywords/Search Tags:

NLP, text classification, ensemble learning, imbalanced data, deep learning

Related items

1	Hybrid Ensemble Learning For Imbalanced Data
2	Classification Algorithms For Class Imbalance Data
3	Multi-target Sensitive Text Detection Based On Imbalanced Data
4	Research On Imbalanced Data Classification Methods Based On Resampling And Ensemble Learning
5	Text Matching Based On Ensemble Learning And Deep Learning
6	Research On Ensemble Learning Algorithm For High-Dimensional Data Classification
7	Two-class Imbalanced Data Classification Based On Diverse Data Generation And Ensemble Learning
8	Research And Application Of Imbalance Data Classification Based On SVM
9	Research On Text Classification Algorithm Based On Deep Ensemble Learning Of BERT, GCN And GA
10	Research And Application Of Ensemble Learning Based On Combined Resampling Methods