
Research On Spam Detection Based On Heterogeneous Ensemble Learning

Posted on: 2020-04-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Liu
GTID: 2428330599460276
Subject: Computer Science and Technology
Abstract/Summary:
Because online shoppers cannot examine goods in person, they can only learn about a product from the information on the e-commerce platform, and they increasingly rely on customer reviews. Many merchants have found that positive reviews can bring large returns, while negative reviews can cost competitors money or even put them out of business, so review "spam" has always existed. To prevent vicious competition among sellers, keep trading on e-commerce platforms fair, and protect consumers' rights and interests, spam detection has long been an active research topic. This thesis studies spam detection in depth. The main work covers the following aspects. First, because the Word2vec model does not capture word-pair information, a Bigram-Word2vec model is proposed. The model first uses a bigram model to identify word pairs in English text; the processed text is then fed into the Word2vec model to train the word vectors. Second, the quality of the word vectors trained by the Bigram-Word2vec model varies with the number of word pairs. To further optimize the model, word vectors are trained with multiple sets of values in order to find the best word vectors. Third, to address the reliance on a single machine learning model in traditional spam detection, this thesis applies heterogeneous ensemble learning to the task. Combining multiple heterogeneous models raises two problems: hard voting can end in a tie, and it is unclear how the weights in soft voting should be set. Two solutions are proposed: two-class weighted hard voting and weighted soft voting. Finally, several text feature extraction methods are used to extract features from Amazon datasets, and multiple models are then combined to classify the text. To explain the unsatisfactory classification results, the concept of the “repetition rate of words” is proposed. The proposed method was also verified on the data set.
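The two voting schemes named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how the weights are chosen, so the sketch assumes each model's weight is a score such as its validation accuracy, and the function names, labels, and numbers are all hypothetical.

```python
from collections import defaultdict


def weighted_hard_vote(labels, weights):
    """Each model casts one vote for a label; votes count by model weight.

    labels  -- predicted class per model, e.g. ["spam", "ham", ...]
    weights -- one non-negative weight per model (assumed here to be
               something like validation accuracy; an assumption, not
               the thesis's stated rule)
    """
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight
    # With unequal weights, an even split of raw votes no longer ties.
    return max(totals, key=totals.get)


def weighted_soft_vote(probabilities, weights):
    """Average the per-class probabilities of all models, using weights.

    probabilities -- one dict per model mapping label -> probability
    weights       -- one non-negative weight per model (same assumption)
    """
    weight_sum = sum(weights)
    totals = defaultdict(float)
    for probs, weight in zip(probabilities, weights):
        for label, p in probs.items():
            totals[label] += weight * p / weight_sum
    return max(totals, key=totals.get)


# Four models split 2-2, so plain majority voting would tie.
hard = weighted_hard_vote(
    ["spam", "spam", "ham", "ham"], [0.9, 0.6, 0.7, 0.7]
)

# Three models; the most heavily weighted one is confident in "spam".
model_probs = [
    {"spam": 0.80, "ham": 0.20},
    {"spam": 0.35, "ham": 0.65},
    {"spam": 0.30, "ham": 0.70},
]
soft = weighted_soft_vote(model_probs, [0.9, 0.5, 0.5])
print(hard, soft)
```

In this toy example the weights break the 2-2 tie in the hard vote, and in the soft vote they shift the decision toward the model with the largest weight; with equal weights the same probabilities would favor "ham".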
Keywords/Search Tags: machine learning, heterogeneous ensemble learning, voting, spam detection, Word2vec