Identification Algorithm Design and Module Construction for Marketing Microblog Users

Posted on: 2022-08-20
Degree: Master
Type: Thesis
Country: China
Candidate: Z L Yang
GTID: 2518306563963309
Subject: Electronics and Communications Engineering
Abstract/Summary:
With the rise of online social networks (OSNs) in recent years, the Weibo platform has matured, and the influx of users has brought malicious marketing users with it. These users spread product information through the social network and induce other users to buy for profit. Their behavior seriously pollutes Weibo's social environment and degrades the user experience. Identifying malicious marketing users among Weibo's huge user base remains an open problem in the field of malicious user identification. Current research in this area mostly uses ensemble learning algorithms, with models trained on manually selected features, and none of these methods balances the diversity and the accuracy of the base classifiers combined by Stacking well. In addition, the actual situation on Weibo shows that malicious marketing users can be subdivided into marketing content producers and agent-based marketing users. The characteristics of the latter are close to those of normal users, so current machine learning algorithms, working from features alone, can adequately identify only the former and are not effective for the latter. This kind of research also suffers from skewed sample distribution, which biases the model towards the majority class, whereas the task should aim to recall as many minority class samples as possible.

The main work of the thesis includes:

(1) A base classifier fusion model, FI-Ranked-EMCS, based on classification accuracy and principal component feature ranking is proposed. It uses the entropy value to measure the classification accuracy on individual samples. The principal component feature set of each machine learning model is selected by feature importance assessment, and the Spearman correlation coefficient is combined with a pairwise measure to assess how much the principal component feature sets differ among models, thus ensuring the diversity of the combined base classifiers. The misclassification cost is incorporated into the information gain of decision tree node splitting to construct the cost-sensitive decision tree CSDT, from which the contrast patterns (CP) of majority class and minority class samples are extracted to construct the contrast pattern classifier CP-CSDT, which serves as the meta-learner to deal with the skewed sample distribution.

(2) A marketing relationship network RN is constructed through three kinds of relationship links: user-user, user-blog post, and blog post-blog post. The network associates the two types of malicious marketing users, and the adjacency matrix turns the practical problem into a mathematical model. The malicious marketing users identified by the Stacking model serve as known labels in the relationship graph, and a semi-supervised approach predicts the unknown labels, so that the agent-based marketing users are identified.

(3) Both types of models are built and validated. On the experimental data, the Stacking model that uses the new fusion strategy reaches a correct rate (accuracy) of 86.3% and a recall of 57.6%, both better than those of other random combination schemes. When the training data are artificially adjusted to weaken the distribution of certain features in the sample, the base classifier fusion model that also accounts for differences in principal component features achieves better recall than a model whose fusion strategy uses classification accuracy alone. The Stacking model using the CP-CSDT algorithm as the meta-learner achieves an accuracy of 47.8% and a recall of 68.7%, both better than those obtained with other algorithms as the meta-learner. The CP-CSDT algorithm is designed to handle data imbalance: when the proportion of positive and negative samples in the training set is varied, it shows better model stability on extremely unbalanced datasets. Adjusting the proportion of known labels for different marketing organizations in the relationship network shows that its performance depends on the known-label samples, but that it can recall more positive class samples without excessive loss of precision and without disproportionately affecting the recognition of negative class samples.
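
As an illustration of the diversity measure in contribution (1), the sketch below compares the feature-importance rankings of base classifiers with the Spearman correlation coefficient. It is a minimal example under stated assumptions, not the thesis's FI-Ranked-EMCS procedure: the model names, the importance values, and the `diversity` score (one minus the correlation) are all hypothetical, and the pairwise measure mentioned in the abstract is not modeled.

```python
def ranks(values):
    """Rank values from 1 (largest) to n. Assumes no ties, for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(a, b):
    """Spearman rank correlation of two equal-length lists without ties."""
    n = len(a)
    ra, rb = ranks(a), ranks(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Hypothetical feature importances for three base classifiers
# (one value per feature, same feature order in every list).
importances = {
    "model_a": [0.40, 0.30, 0.20, 0.10],
    "model_b": [0.35, 0.32, 0.18, 0.15],
    "model_c": [0.05, 0.15, 0.30, 0.50],
}


def diversity(m1, m2):
    """Assumed diversity score: 0 for identical rankings, 2 for reversed ones."""
    return 1 - spearman(importances[m1], importances[m2])


if __name__ == "__main__":
    print(diversity("model_a", "model_b"))  # same ranking, low diversity
    print(diversity("model_a", "model_c"))  # reversed ranking, high diversity
```

Under this assumed score, a pair of models that rank features the same way adds little diversity to the ensemble, while a pair with opposite rankings adds the most.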
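
The abstract states that misclassification cost is incorporated into the information gain used for node splitting, but it does not give the formula. The sketch below shows one common way to do this, weighting each class count by its misclassification cost before computing entropy. This weighting scheme, the function names, and the cost values are assumptions for illustration, not the CSDT definition from the thesis.

```python
import math


def weighted_entropy(n_pos, n_neg, cost_pos, cost_neg):
    """Entropy of a node after scaling each class count by its cost."""
    w_pos = n_pos * cost_pos
    w_neg = n_neg * cost_neg
    total = w_pos + w_neg
    if total == 0:
        return 0.0
    h = 0.0
    for w in (w_pos, w_neg):
        if w > 0:
            p = w / total
            h -= p * math.log2(p)
    return h


def cost_sensitive_gain(parent, children, cost_pos, cost_neg):
    """Information gain of a split; parent and children are (n_pos, n_neg)."""
    def weight(counts):
        return counts[0] * cost_pos + counts[1] * cost_neg

    parent_w = weight(parent)
    gain = weighted_entropy(parent[0], parent[1], cost_pos, cost_neg)
    for child in children:
        share = weight(child) / parent_w
        gain -= share * weighted_entropy(child[0], child[1], cost_pos, cost_neg)
    return gain


if __name__ == "__main__":
    parent = (10, 90)               # 10 minority (positive), 90 majority
    split = [(10, 0), (0, 90)]      # a split that isolates the minority class
    print(cost_sensitive_gain(parent, split, 1, 1))  # equal costs
    print(cost_sensitive_gain(parent, split, 9, 1))  # minority errors cost 9x
```

With equal costs the imbalanced parent node already looks fairly pure, so even a perfect split earns a modest gain. Raising the cost of missing a minority sample makes the same split score higher, which is the intended effect on a skewed dataset.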
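
Contribution (2) can be pictured with a small label-propagation sketch over an adjacency matrix. The abstract only says that a semi-supervised approach predicts the unknown labels, so the specific algorithm here (iterative neighbor averaging with clamped known labels), the toy graph, and the `propagate` function are assumptions, not the thesis's method.

```python
def propagate(adjacency, known, iterations=50):
    """Spread known labels over a graph by repeated neighbor averaging.

    adjacency: square list of lists of non-negative edge weights.
    known: dict mapping node index to a fixed score
           (1.0 = marketing user, 0.0 = normal user).
    Returns one score in [0, 1] per node; known nodes keep their score.
    """
    n = len(adjacency)
    scores = [known.get(i, 0.5) for i in range(n)]  # 0.5 = undecided
    for _ in range(iterations):
        updated = scores[:]
        for i in range(n):
            if i in known:
                continue  # known labels stay clamped
            total = sum(adjacency[i])
            if total > 0:
                updated[i] = sum(
                    w * s for w, s in zip(adjacency[i], scores)
                ) / total
        scores = updated
    return scores


# Toy chain graph: 0 - 1 - 2 - 3 - 4 (hypothetical data).
# Node 0 is a marketing user found by the Stacking model; node 4 is normal.
adjacency = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
known = {0: 1.0, 4: 0.0}

if __name__ == "__main__":
    for node, score in enumerate(propagate(adjacency, known)):
        print(node, round(score, 3))
```

Node 1, which is linked directly to the known marketing user, ends with a score above 0.5, while node 3, next to the known normal user, ends below it. In the setting described by the abstract, the edges would come from the user-user, user-blog post, and blog post-blog post links rather than from a hand-written chain.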
Keywords/Search Tags:Malicious marketing user of Weibo, MCS Strategy, Sample skewed distribution, The graph of relationship network, Ensemble learning