Research On Spam Recognition Based On Microblog

Posted on: 2020-05-26  Degree: Master  Type: Thesis
Country: China  Candidate: R Liu  Full Text: PDF
GTID: 2428330599451304  Subject: Engineering
Abstract/Summary:
Natural language processing has long been a key research topic. Identifying useless spam in Chinese short texts is important for both user experience and platform maintenance. This thesis analyzes Chinese text processing methods and natural language processing methods, and then carries out the following work. First, it improves the classification effect by improving the input layer and output layer of the classifier, and verifies the effectiveness of these changes through experiments. Second, it proposes a multi-feature fusion method for calculating text similarity, derived from summarizing the regularities of spam. Finally, the two methods are combined to design a spam filtering system. The main contributions of this thesis are as follows:

(1) For content identification, we improve the classification effect by changing how the input and output layers work. On the input side of the classification algorithm, we implement a vectorization model for Chinese semantics. In this model, we first obtain a matrix of the arcs describing the relationships between words through Yamada's dependency parsing method, and then decompose this matrix to obtain the vector of the text. On the output side, the pooling layer and the fully connected layer of the CNN are replaced by chunk-max pooling and hierarchical softmax, respectively.

(2) For account identification, we summarize two characteristics of spam accounts: an abnormal number of messages sent by the account and a high similarity among their texts. On this basis, we propose a recognition method based on the characteristics of suspicious users. The method first checks whether the message volume is abnormal within a window of a set size. If it is abnormal, the multi-feature model is used to calculate the similarity of the messages sent in the abnormal time period; if the similarity exceeds the threshold, the messages are classified as spam.

(3) We design a spam filtering system based on the above two algorithms. When an account sends a message, the platform automatically obtains the information source and the message content. The identification method based on the characteristics of suspicious users is applied to the information source, and the improved CNN classifier is applied to the message content. For classification, we compare our approach with common classification and identification methods on two measures: classification accuracy and model training time. Text similarity is compared by calculating the cosine of the included angle. The experimental results show that the proposed algorithm achieves better recognition performance for spam.
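
The account-level check described in the abstract (a message-count window followed by a cosine similarity test) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the multi-feature similarity model is not specified here, so character bigram counts stand in for it, and the function names, the 60-second window, the count limit, and the 0.8 threshold are all hypothetical values chosen for illustration.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Bag of character bigrams.

    Assumption: a simple stand-in for the thesis's multi-feature model,
    which the abstract does not describe in detail.
    """
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def flag_spam(messages, window=60.0, max_count=3, sim_threshold=0.8):
    """Return the indices of messages flagged as likely spam.

    messages: list of (timestamp_in_seconds, text) pairs for one account,
    sorted by timestamp. All default parameter values are hypothetical.
    """
    flagged = set()
    start = 0
    for end in range(len(messages)):
        # Shrink the window until it spans at most `window` seconds.
        while messages[end][0] - messages[start][0] > window:
            start += 1
        # Step 1: is the message volume in this window abnormal?
        if end - start + 1 <= max_count:
            continue
        # Step 2: within an abnormal window, compare texts pairwise.
        indices = range(start, end + 1)
        vectors = {i: char_bigrams(messages[i][1]) for i in indices}
        for i in indices:
            for j in indices:
                if i < j and cosine_similarity(vectors[i], vectors[j]) >= sim_threshold:
                    flagged.update((i, j))
    return flagged
```

Running the similarity test only inside windows with abnormal volume keeps the pairwise comparison, which is quadratic in the number of messages, away from the normal traffic of ordinary accounts.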
Keywords/Search Tags: Convolutional Neural Network, Text Categorization, Dependency Parsing Analysis, Spam, Text Similarity