
Research On Classification Of Network Public Opinion Text Based On

Posted on: 2016-08-10
Degree: Master
Type: Thesis
Country: China
Candidate: S Zhang
Full Text: PDF
GTID: 2208330464461552
Subject: Library science
Abstract/Summary:
With the continued growth of microblogs and forums, more and more people express their views and opinions online, and network events occur frequently. Network public opinion has gradually become the main carrier of social public opinion. Faced with the massive volume and varied forms of content on the Internet, however, the relevant departments cannot collect and classify network public opinion promptly. There is therefore a practical need to classify network public opinion automatically.

Starting from the relevant concepts of network public opinion, this thesis introduces the characteristics and types of network public opinion data and analyzes the topical characteristics of network public opinion in depth. Building on a study of the textual characteristics and regularities of network public opinion, it reviews current topic classification techniques, including the vector space model, feature selection, network text classification methods, evaluation metrics, and weight calculation methods. It emphasizes feature selection algorithms such as mutual information, information gain, the CHI statistic, and cross entropy, and text classification algorithms such as the Bayesian algorithm, k-nearest neighbors, and support vector machines.

Drawing on public opinion events of recent years, the thesis constructs a risk classification scheme for network public opinion with eight categories: national safety, government ruling, social stability, financial economics, daily life, resource environment, spiritual civilization, and non-risk. Because domestic corpora, and public opinion corpora in particular, are extremely scarce, a network public opinion corpus had to be built manually in order to test the accuracy of the classification algorithm. After processing and sorting Tianya posts from March 2012, a network public opinion text database based on the Tianya Forum was formed.

The thesis studies support vector machine (SVM) algorithms in depth. After surveying current multi-class SVM algorithms, it proposes an improved SVM classification algorithm based on BT-SVM. It describes the algorithm by analyzing and comparing several widely studied multi-class SVM algorithms in terms of their merits, demerits, and performance, and focuses on a binary tree generation algorithm that uses the Mahalanobis distance as the measure of between-class distance.

To verify the efficiency and accuracy of the improved classification algorithm, the thesis designs and implements a network public opinion text classification system in detail. The system consists of five modules: preprocessing, feature selection, weight calculation, text classification, and evaluation. Word segmentation in the preprocessing module uses the ICTCLAS system of the Chinese Academy of Sciences. The feature selection (dimension reduction) module implements mutual information, information gain, the CHI statistic, weight of evidence, cross entropy, and other feature selection methods, which improve the precision with which feature terms are selected. The weight calculation module implements three methods: TF*IDF, TF*IG, and TF*IDF*IG. The text classification module implements the SVM algorithm. The result evaluation module evaluates the classification results by precision and recall, with each category evaluated separately.
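The per-category evaluation described above can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function name, the toy labels, and the convention of returning 0.0 when a category has no predictions or no manual labels are all assumptions made for illustration.

```python
from collections import Counter

def per_category_precision_recall(manual_labels, predicted_labels):
    """Return {category: (precision, recall)} from two parallel label lists.

    manual_labels are the hand-assigned categories (treated as ground truth);
    predicted_labels are the classifier's outputs. Hypothetical helper.
    """
    if len(manual_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    manual_count = Counter(manual_labels)
    predicted_count = Counter(predicted_labels)
    # A true positive is a text whose predicted category matches the manual one.
    true_pos = Counter(
        m for m, p in zip(manual_labels, predicted_labels) if m == p
    )
    scores = {}
    for category in set(manual_count) | set(predicted_count):
        tp = true_pos[category]
        # Assumption: report 0.0 when the denominator is empty.
        precision = tp / predicted_count[category] if predicted_count[category] else 0.0
        recall = tp / manual_count[category] if manual_count[category] else 0.0
        scores[category] = (precision, recall)
    return scores

# Toy data using three of the eight category names from the abstract.
manual = ["social stability", "daily life", "non-risk", "daily life"]
predicted = ["social stability", "non-risk", "non-risk", "daily life"]
scores = per_category_precision_recall(manual, predicted)
```

Here precision for a category is the share of texts assigned to it that were manually labeled with it, and recall is the share of manually labeled texts that the classifier recovered.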
The result evaluation module displays the classification results and compares them with the manual classification results.

Finally, the network public opinion classification system is tested in terms of precision and recall, using the Tianya Forum text corpus as the sample collection. On the manually classified samples, the test result reaches as high as 94.88%. The experiments validate the effectiveness and feasibility of the feature selection, weight calculation, and SVM classification algorithms. By varying the feature selection and weight calculation methods and comparing the resulting precision and recall, the optimal combination is identified: the global method for feature selection, expected cross entropy as the feature selection algorithm, and TF*IDF*IG for weight calculation.
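As a rough illustration of the TF*IDF*IG weighting named above, the Python sketch below multiplies a term's frequency in a document by its inverse document frequency and by its information gain over the categories. The abstract does not give the exact formulas, so the textbook definitions used here (length-normalized TF, IDF as log(N/df), presence-based information gain), the function names, and the toy documents are all assumptions.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(term, docs, labels):
    """IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t), based on term presence."""
    with_term = Counter(l for d, l in zip(docs, labels) if term in d)
    without_term = Counter(l for d, l in zip(docs, labels) if term not in d)
    n = len(docs)
    n_with = sum(with_term.values())
    n_without = n - n_with
    prior = entropy(Counter(labels).values())
    conditional = ((n_with / n) * entropy(with_term.values())
                   + (n_without / n) * entropy(without_term.values()))
    return prior - conditional

def tf_idf_ig(term, doc, docs, labels):
    """Weight of `term` in `doc` as TF * IDF * IG (assumed formulation)."""
    df = sum(1 for d in docs if term in d)
    if df == 0 or not doc:
        return 0.0  # term unseen in the corpus, or empty document
    tf = doc.count(term) / len(doc)   # length-normalized term frequency
    idf = math.log(len(docs) / df)    # natural log; the base only rescales weights
    return tf * idf * information_gain(term, docs, labels)

# Toy corpus: each document is a list of already-segmented tokens.
docs = [["flood", "relief", "city"], ["flood", "warning"],
        ["stock", "market"], ["market", "price", "price"]]
labels = ["social stability", "social stability",
          "financial economics", "financial economics"]
weight = tf_idf_ig("flood", docs[0], docs, labels)
```

In this toy corpus "flood" separates the two categories perfectly, so its information gain is 1 bit and the IG factor leaves its TF*IDF value unchanged, while a less discriminative term such as "city" is scaled down.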
Keywords/Search Tags: Support Vector Machines, Text classification, Network public opinion, Text corpus of Tianya forum