Research On Text Labeling Method For Wechat Public Accounts

Posted on: 2019-08-11
Degree: Master
Type: Thesis
Country: China
Candidate: L D Deng
GTID: 2428330545465540
Subject: Computer technology
Abstract/Summary:
As a widely used channel for acquiring information, Wechat public accounts publish texts that cover all walks of life. Labeling these texts sensibly helps users locate articles of interest quickly, and it supports user behavior analysis and the construction of user portraits, so it has important application value. At present, however, there is no relevant research on labeling the texts of Wechat public accounts. For this reason, this paper studies the classification of Wechat public account texts.

(1) This paper proposes a topic-word embedding model. It addresses three problems of traditional text representations: high data dimensionality, the lack of relationship information between words, and the inability to distinguish polysemous words. First, LDA is used to assign a topic to each text. Then each topic is fed into Skip-gram training as a pseudo-word, together with the words in its context, to obtain a vector for each word and a vector for each topic. Finally, the word vectors are concatenated to form the vector of the text.

(2) A combined semi-supervised SVM with an annealing algorithm is used to select parameters automatically, which addresses the large number of parameters that must be set in the training stage of semi-supervised methods. In the initial stage, the implementation only validates the internal supervised solver's parameter C on the (usually very limited) labeled set, while the other parameter is kept fixed. In the iterative stage, C* is handled by a standard annealing sequence and is restricted to a small finite set of possible values. The whole process therefore requires only a few parameters to be set manually.

(3) A clustering method is used to cluster the large pool of unlabeled data, and the unlabeled samples used for training are selected from the clusters in proportion. This addresses the mismatch between the distributions of labeled and unlabeled data. A semi-supervised classification algorithm that adds unlabeled data at random may not generalize to the data as a whole, whereas selectively adding training data lets the classifier achieve good results even when the sample distribution is not uniform.

(4) The paper establishes a knowledge base of public account categories. The types of articles published by the same Wechat public account are relatively fixed, so the source account is a useful reference for labeling. During the training stage of the semi-supervised classifier, the knowledge base helps decide whether an unlabeled sample may join the training set. In the classification stage, it helps check the classification result to determine whether manual labeling is needed.

(5) A method for tagging the category of Wechat public account texts, based on the knowledge base and semi-supervised learning, is proposed.

The experimental results show that the proposed method not only improves the accuracy of article annotation for Wechat public accounts, but also reduces the number of manual interventions.
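As an illustration of points (3) and (4), the following is a minimal Python sketch, not the thesis's implementation. It assumes cluster ids have already been produced by some clustering step (the abstract does not name the algorithm) and that the knowledge base is a plain mapping from an account name to its known categories. The function names `select_by_cluster` and `kb_agrees`, the `budget` parameter, and the account name in the usage lines are all hypothetical.

```python
import random
from collections import defaultdict


def select_by_cluster(cluster_ids, budget, seed=0):
    """Pick roughly `budget` sample indices so that each cluster is
    represented in proportion to its share of the unlabeled pool.

    cluster_ids[i] is the (assumed, precomputed) cluster of sample i.
    Every non-empty cluster contributes at least one sample, so the
    result can slightly exceed `budget` when clusters are very small.
    """
    rng = random.Random(seed)
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    total = len(cluster_ids)
    chosen = []
    for cid in sorted(members):
        idxs = members[cid]
        quota = min(len(idxs), max(1, round(budget * len(idxs) / total)))
        chosen.extend(rng.sample(idxs, quota))
    return sorted(chosen)


def kb_agrees(knowledge_base, account, predicted_label):
    """Return True when the predicted label is one of the categories
    recorded for the source account (hypothetical knowledge-base layout)."""
    return predicted_label in knowledge_base.get(account, set())


# Usage: 100 unlabeled samples in three clusters of size 60, 30 and 10.
cluster_ids = [0] * 60 + [1] * 30 + [2] * 10
picked = select_by_cluster(cluster_ids, budget=10)

# Hypothetical knowledge base: account name -> set of known categories.
kb = {"example_account": {"sports"}}
accept = kb_agrees(kb, "example_account", "sports")
```

In this sketch, a sample whose predicted label disagrees with the knowledge base would simply be kept out of the training set or routed to manual labeling; how the thesis combines the two signals in detail is not stated in the abstract.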
Keywords/Search Tags:Labeling, Text Classification, Wechat Public Accounts, Topic Word Embeddings, Semi-Supervised