Research On Identification Method Of Uncivilized Weibo Post Based On Semi-Supervised Learning Model

Posted on: 2019-08-27
Degree: Master
Type: Thesis
Country: China
Candidate: X L Jia
Full Text: PDF
GTID: 2428330548467494
Subject: Software engineering
Abstract/Summary:
In the information age, social networking applications provide a platform for people to share information of all kinds. Every day, millions of people log on to Tencent Weibo and share their opinions. To keep online language healthy, civilized and orderly, the use and spread of bad language must be controlled as far as possible. Monitoring and flagging such language helps steer online public opinion and supports the healthy development of network language. At present, research on text recognition and sentiment analysis in China and abroad centers on two approaches: affective computing based on semantic dictionaries and sentiment classification based on machine learning. However, few scholars have systematically studied the monitoring of bad network language. This paper therefore constructs a vocabulary-based sentiment dictionary and explores a method for identifying uncivilized Weibo posts using a semi-supervised learning model, the transductive support vector machine (TSVM).

First, an uncivilized network language dictionary is constructed from multi-source data. Drawing on various references, the paper manually collects bad language dictionaries from different data sources and proposes a method for extending them automatically based on pointwise mutual information (PMI), which helps to identify new bad online words accurately. The bad language dictionaries cover six aspects, including uncivilized microblogging vocabulary, politically sensitive vocabulary, uncivilized alphabet abbreviations, uncivilized digital homophones and uncivilized compound homophones. The paper also extracts a basic uncivilized emotional dictionary from Weibo texts, which divides bad Weibo vocabulary into six categories: basic emotion words, degree adverbs, negative words, network vocabulary, expression words and relational conjunctions. Network words are collected mainly through Internet search, and the expression dictionary consists mainly of the emoticons provided by the Sina Weibo platform.

Second, uncivilized Weibo posts are recognized automatically with semi-supervised learning. The paper proposes a bad-text recognition model built on the TSVM. Because the TSVM is easily affected by the "local maximum" problem, a deterministic annealing strategy is introduced to further improve classification accuracy. The paper also builds a training set of 1,100 texts and runs 10 tests on it. In each test, 10 labeled texts are selected at random, and the number of unlabeled training samples is increased from 100 to 1,100. The experiments show that semi-supervised learning outperforms supervised learning. The correlation coefficient between TSVM accuracy and model probability is 0.9798.
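The abstract does not say how the PMI-based dictionary extension is computed. As a rough illustration only, the Python sketch below scores each candidate word by its pointwise mutual information with known seed words, counting co-occurrence at the level of whole posts. The function names (`pmi_scores`, `extend_dictionary`), the `threshold` and `min_count` parameters, and the toy English tokens are all assumptions made for this example, not details taken from the thesis.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for every token, the number of posts that contain it."""
    freq = Counter()
    for tokens in docs:
        freq.update(set(tokens))
    return freq


def pmi_scores(docs, seeds):
    """Score each non-seed token by its highest PMI with any seed word.

    docs  : list of token lists, one list per post (assumed pre-segmented)
    seeds : set of words already in the bad-language dictionary
    PMI(w, s) = log2( p(w, s) / (p(w) * p(s)) ), with probabilities
    estimated from post-level document frequencies.
    """
    n_docs = len(docs)
    doc_freq = document_frequencies(docs)
    pair_freq = Counter()  # posts containing both a candidate and a seed
    for tokens in docs:
        present = set(tokens)
        seeds_here = present & seeds
        for cand in present - seeds:
            for seed in seeds_here:
                pair_freq[(cand, seed)] += 1

    scores = {}
    for (cand, seed), joint in pair_freq.items():
        pmi = math.log2(joint * n_docs / (doc_freq[cand] * doc_freq[seed]))
        scores[cand] = max(scores.get(cand, float("-inf")), pmi)
    return scores


def extend_dictionary(docs, seeds, threshold=1.0, min_count=2):
    """Return candidates whose PMI with a seed reaches the threshold.

    min_count filters out rare tokens, whose PMI estimates are unreliable.
    Both cut-offs are illustrative values, not taken from the thesis.
    """
    doc_freq = document_frequencies(docs)
    scores = pmi_scores(docs, seeds)
    return {
        word
        for word, score in scores.items()
        if score >= threshold and doc_freq[word] >= min_count
    }


# Toy usage with made-up tokens standing in for segmented Weibo posts.
posts = [
    ["you", "idiot", "moron"],
    ["idiot", "moron", "go"],
    ["nice", "day", "you"],
    ["nice", "weather", "go"],
]
new_words = extend_dictionary(posts, {"idiot"}, threshold=0.5)
print(sorted(new_words))
```

In this sketch a candidate is kept if it is strongly associated with at least one seed word. Averaging PMI over all seeds, or counting co-occurrence within a sliding window instead of a whole post, would be equally plausible design choices.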
Keywords/Search Tags:Semi-supervised learning, Transduction support vector machine, Text recognition, Bad language dictionary