| Network pyramid schemes have become a serious problem that hinders social development. With the advent of the big data era, the explosive growth of online data has given pyramid selling a new channel of transmission, and network pyramid selling has emerged as a result. Its rapid spread, concealed modes of transmission, disseminators scattered across the country and even abroad, and the difficulty of obtaining electronic evidence all give offenders fertile ground, and they pose considerable challenges for the departments responsible for regulation and enforcement. Many network pyramid scheme sites spread pyramid scheme information over the network in the form of text. How to mine massive volumes of text effectively and determine which texts belong to network pyramid schemes has become an urgent problem. In classifying network pyramid scheme text, the diversity of text features generates a large amount of noisy data, so the training text cannot fit the distribution of the entire feature space well, and traditional classification algorithms are not reliable enough to classify and identify such text accurately. In addition, the format of network pyramid scheme text is disordered, so the quality of text preprocessing directly affects the classification results. Feature selection also affects classification accuracy, yet some obvious features can hardly represent the characteristics of network pyramid scheme text. This study proposes a joint topic model, Paragraph Vector Latent Dirichlet Allocation (PV_LDA), based on the characteristics of high yield, high rebate, hierarchical salary, and topic diversity found in such text. The model uses the paragraph as the minimum processing unit to generate the topic distribution matrices of "high-interest rate" and "hierarchical salary" from network pyramid scheme text. Gibbs
sampling is used to derive the "pyramid scheme" topic distribution matrix represented by the two features, which is then passed to the classifier for classification. Building on these core techniques, the research covers the following three points: (1) After the text is preprocessed with NLPIR, a topic model based on LDA summarizes the characteristic information in the text through clustering. Gibbs sampling and iterative calculation of the model parameters are used to obtain the topic distribution matrix from network pyramid scheme text. (2) Fusing the two topic features that represent network pyramid schemes, so as to improve the generalization ability of the topic distribution matrix, is a key focus of this research. The method uses the Hadamard product and introduces a joint residual vector to merge the two classes of topic distribution matrices, which makes the resulting joint topic distribution matrix more representative of network pyramid scheme text. (3) This study comprehensively considers classifier performance indicators and selects a text classifier with high accuracy and fast processing speed through cross-experiment comparison. The experiments show that the topic model proposed in this paper captures the characteristics of network pyramid schemes more reasonably, and that the model's generalization ability is preserved while the effect of topic mining is taken into account. |
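
The fusion step in point (2) can be illustrated with a minimal sketch. The abstract does not define the joint residual vector or say how the fused matrix is normalized, so the code below makes two labeled assumptions: the residual is a small non-negative per-topic vector added to the Hadamard product, and each fused row is renormalized to sum to 1. The function name `fuse_topic_matrices` and the example values are hypothetical and are not taken from the paper.

```python
def fuse_topic_matrices(theta_a, theta_b, residual):
    """Fuse two document-topic matrices row by row.

    theta_a, theta_b: lists of rows, one row per document, one value per topic.
    residual: one non-negative value per topic (ASSUMED form of the
              "joint residual vector"; the abstract does not define it).
    """
    if len(theta_a) != len(theta_b):
        raise ValueError("matrices must have the same number of rows")
    fused = []
    for row_a, row_b in zip(theta_a, theta_b):
        if not (len(row_a) == len(row_b) == len(residual)):
            raise ValueError("rows and residual must have the same length")
        # Element-wise (Hadamard) product, shifted by the residual vector.
        row = [a * b + r for a, b, r in zip(row_a, row_b, residual)]
        total = sum(row)
        if total <= 0:
            raise ValueError("fused row has no probability mass")
        # ASSUMPTION: renormalize so each row is again a distribution.
        fused.append([value / total for value in row])
    return fused


# Hypothetical topic distributions for two documents over three topics.
theta_rate = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]    # "high-interest rate" feature
theta_salary = [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]  # "hierarchical salary" feature
residual = [0.01, 0.01, 0.01]

joint = fuse_topic_matrices(theta_rate, theta_salary, residual)
for row in joint:
    print([round(value, 3) for value in row])
```

Under these assumptions, a topic keeps a high weight in the joint matrix only when both feature matrices assign it a high weight, and the residual keeps any topic from collapsing to exactly zero when one of the two factors is zero.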