Research On A Classification Method For Uncivilized Posts On Baidu Tieba

Posted on:2019-11-13

Degree:Master

Type:Thesis

Country:China

Candidate:Z Z Liu

GTID:2428330548967234

Subject:Computer technology

Abstract/Summary:

With the rapid development of the Internet, there are many platforms on which people can interact freely. Because the Internet is difficult to control, uncivilized language has become widespread. It not only disrupts normal, healthy communication but also has a serious impact on social culture and national character, and it conflicts with the socialist core values. To build a healthy and civilized online environment, supervision of uncivilized language on the Internet must be strengthened, and the first step in such supervision is recognizing uncivilized language. This paper takes posts on Baidu Tieba as its data and classifies them with a support vector machine (SVM).

The main work of the paper consists of two parts. First, to address the lack of corpora for uncivilized language, a large number of posts were downloaded from Baidu Tieba. Rules were defined to filter out posts that are meaningless for this research, and posts containing uncivilized language were marked by manual annotation, yielding a corpus of uncivilized language based on Baidu Tieba. Second, the paper studies a method for automatically identifying uncivilized posts on Baidu Tieba, using a support vector machine as the classification model and selecting feature items according to the chi-squared statistic. After the Baidu Tieba corpus is built, the data is converted into the format required by the classification model, and the model is trained.

Because the data set of uncivilized Baidu Tieba posts is small, the paper uses the support vector machine model, which is well suited to learning from a limited number of samples. The model is trained on the quantified representation of the limited training texts, which improves classification accuracy and avoids the need for the very large (in principle infinite) number of samples that earlier methods required. The paper uses LibSVM, which is simple and easy to use. Representing the segmented text with the vector space model produces a high-dimensional feature space and increases the amount of computation, so the chi-squared statistic is used for dimensionality reduction: the statistic is computed for every word, and a subset of words is chosen as candidate features. Finally, the N features selected for each category are combined to obtain the final feature items. The paper outputs all texts in quantified form based on the final feature items, then trains the classification model and uses it for prediction.
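The feature selection and formatting steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the abstract does not give the exact chi-squared formula, the feature weighting, or the value of N, so the standard 2x2 contingency form of the statistic, raw term counts as feature values, and the function names `chi_square`, `select_features`, and `to_libsvm_line` are all assumptions. Word segmentation is taken as already done, and the English tokens are placeholder data.

```python
from collections import Counter
from typing import List, Sequence


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-squared statistic for a term/category 2x2 contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or category is degenerate; no evidence either way
    return n * (a * d - b * c) ** 2 / denom


def select_features(docs: Sequence[Sequence[str]],
                    labels: Sequence[int],
                    top_n: int) -> List[str]:
    """Keep the top_n terms per category by chi-squared score, then merge them."""
    doc_sets = [set(doc) for doc in docs]
    vocab = sorted(set().union(*doc_sets))
    selected = set()
    for cat in sorted(set(labels)):
        scores = []
        for term in vocab:
            a = sum(1 for s, y in zip(doc_sets, labels) if term in s and y == cat)
            b = sum(1 for s, y in zip(doc_sets, labels) if term in s and y != cat)
            c = sum(1 for s, y in zip(doc_sets, labels) if term not in s and y == cat)
            d = len(docs) - a - b - c
            scores.append((chi_square(a, b, c, d), term))
        # Highest score first; ties broken alphabetically for determinism.
        scores.sort(key=lambda pair: (-pair[0], pair[1]))
        selected.update(term for _, term in scores[:top_n])
    return sorted(selected)


def to_libsvm_line(doc: Sequence[str], label: int, features: Sequence[str]) -> str:
    """Format one document as '<label> <index>:<value> ...' (1-based, ascending indices)."""
    counts = Counter(doc)
    parts = [str(label)]
    for index, term in enumerate(features, start=1):
        if counts[term]:
            parts.append(f"{index}:{counts[term]}")
    return " ".join(parts)


# Placeholder, pre-segmented posts: label 1 = uncivilized, -1 = civilized (assumed encoding).
docs = [["you", "idiot"], ["idiot", "shut", "up"],
        ["nice", "weather"], ["nice", "post", "thanks"]]
labels = [1, 1, -1, -1]

features = select_features(docs, labels, top_n=2)
lines = [to_libsvm_line(doc, label, features) for doc, label in zip(docs, labels)]
```

Each string in `lines` follows the sparse text format that LibSVM reads, so writing them to a file, one per line, would give a training file for its command-line tools. In practice a weighting such as TF-IDF might replace the raw counts used here.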

Keywords/Search Tags:

uncivilized language, text classification, support vector machine, chi-square statistics

Related items

1	Research On BERT-based Uncivilized Language Detection Method
2	Research On Filtering Method For Uncivilized Text Based On Deep Learning
3	Research On Identification Method Of Uncivilized Weibo Post Based On Semi-Supervised Learning Model
4	The Study Of Text Classification Based On Support Vector Machine
5	Research On Text Classification Method Based On Support Vector Machine
6	Research On Network Uncivilized Text Classification Methods Based On Semi-supervised Learning Models
7	Text Classification Research Based On Support Vector Machine
8	Research On Text Classification Algorithm Based On Support Vector Machine And Neural Network
9	Research On Text Classification System Based On Support Vector Machine
10	Massive Text Classification Parallelization Technology Based On Support Vector Machine