Research on a Classification Method for Uncivilized Posts on Baidu Tieba

Posted on: 2019-11-13
Degree: Master
Type: Thesis
Country: China
Candidate: Z Z Liu
Full Text: PDF
GTID: 2428330548967234
Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, there are many platforms on which people can interact freely. Because the Internet is difficult to control, uncivilized language has become widespread. It not only disrupts normal, healthy communication but also has a serious impact on social culture and national quality, and it conflicts with the socialist core values. To build a healthy and civilized online atmosphere, it is necessary to strengthen supervision of uncivilized language on the Internet, and recognizing uncivilized language is the first step toward such supervision. This paper takes posts on Baidu Tieba as its data and classifies them with a support vector machine. The main work of the paper consists of the following two parts.

Firstly, because corpora of uncivilized language are currently lacking, a large number of posts were downloaded from Baidu Tieba. Posts that are meaningless for this research were eliminated according to defined rules, and the texts containing uncivilized language were marked by manual annotation, so as to build a corpus of uncivilized language based on Baidu Tieba.

Secondly, this paper studies a method for automatically identifying uncivilized posts on Baidu Tieba. It uses the support vector machine as the classification model and selects feature items according to the chi-square statistic. After the Baidu Tieba corpus is constructed, the data is converted into the format required by the classification model, and the model is trained. Because the data set of uncivilized Baidu Tieba posts is small, this paper uses the support vector machine, which is designed for learning from a limited number of samples: the model is trained on the limited quantified text data, which improves classification accuracy and avoids the need for the very large (in principle infinite) number of samples required by earlier methods. The implementation uses LibSVM, which is simple and easy to use. Representing the segmented text with the vector space model produces a high-dimensional space and increases the amount of computation, so the chi-square statistic is used for dimensionality reduction: the statistic is computed for every word, and some words are chosen as candidates. Finally, the N features selected for each category are combined to obtain the final feature items. All texts are then output in a quantified format based on the final feature items, and the classification model is trained and used for prediction.
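The feature-selection and formatting steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names (`chi_square`, `select_features`, `to_libsvm_line`), the use of raw term counts as feature values, and the tie-breaking by alphabetical order are all assumptions made for illustration. The chi-square score uses the standard 2x2 contingency form, and the output line follows the sparse `label index:value` text layout that LibSVM reads.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for one (term, class) pair.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, top_n):
    """Return the union of the top_n highest-scoring terms of every class.

    docs is a list of token lists (already word-segmented);
    labels holds one class label per document.
    """
    n_docs = len(docs)
    class_sizes = Counter(labels)
    doc_freq = Counter()                           # documents containing each term
    class_doc_freq = {cls: Counter() for cls in class_sizes}
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            class_doc_freq[label][term] += 1

    selected = set()
    for cls, n_cls in class_sizes.items():
        scores = {}
        for term, total in doc_freq.items():
            a = class_doc_freq[cls][term]
            b = total - a
            c = n_cls - a
            d = n_docs - n_cls - b
            scores[term] = chi_square(a, b, c, d)
        # Assumption: ties are broken alphabetically to keep the result stable.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        selected.update(ranked[:top_n])
    return sorted(selected)


def to_libsvm_line(tokens, label, features):
    """Format one document as a sparse 'label index:value ...' line.

    Assumption: the feature value is the raw term count; the thesis does not
    state which weighting it uses. Indices are 1-based and ascending.
    """
    index = {term: i + 1 for i, term in enumerate(features)}
    counts = Counter(t for t in tokens if t in index)
    pairs = sorted((index[t], n) for t, n in counts.items())
    return " ".join([str(label)] + [f"{i}:{v}" for i, v in pairs])
```

In use, `select_features` would be called once on the annotated corpus, and every post would then be passed through `to_libsvm_line` to produce the training and prediction files. With only two classes the chi-square score of a term is the same for both classes, so the per-class top-N lists overlap heavily and the combined feature set can be much smaller than 2N.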
Keywords/Search Tags: uncivilized language, text classification, support vector machine, chi-square statistic