
Data Mining Of Text Information In Online Public Opinion Review With R

Posted on: 2018-09-20
Degree: Master
Type: Thesis
Country: China
Candidate: J L Lv
Full Text: PDF
GTID: 2348330533465252
Subject: Probability theory and mathematical statistics
Abstract/Summary:
The development and application of the internet give people a convenient medium for obtaining and exchanging information, and have turned the internet into a source of massive and varied data. News clients let users access news easily and, at the same time, express their own views and spread information. This can also become a breeding ground for rumors, which may aggravate social conflicts, disturb social order and, in serious cases, lead to illegal or criminal acts. The views expressed by internet users are the raw material of online public opinion; the data are complex and large in volume. Studying them statistically with data mining methods is therefore of both theoretical and practical importance.

In this dissertation, micro-blog users' comments on telecommunications fraud are taken as an example, and data mining of Chinese text is studied with the R language, which offers powerful plotting and data analysis functions and a rich set of extension packages. Through a series of unsupervised and supervised learning techniques, a regression model and a classification model for the micro-blog users' comments are established.

Firstly, the collected micro-blog comments are fully segmented into words. A Chinese text corpus is constructed and cleaned: Chinese stop words and punctuation are removed, and a threshold is set to reduce the sparsity of the corpus. From the cleaned data, a document-term matrix is constructed, which serves as the basis for the subsequent analysis.

Then, the fields of the text data that record the posting time and the number of likes are separated out, and a basic analysis of the time series plot is carried out. Using the document-term matrix, a regression analysis of the number of likes is performed with three models: a multiple linear regression model, a decision tree model and a random forest model. The three models are compared, and the best one is selected by means of a user-defined function. The best model is then applied to predict the number of likes, so as to improve the accuracy of the tendency analysis of users' comments.

Finally, the document-term matrix is used to group the text data by unsupervised learning. Eight topics are identified from the results of the cluster analysis and the mixed topic model. Classifiers are then built by applying the support vector machine, random forest and maximum entropy learning methods to the labeled document-term matrix. The unlabeled documents are classified with these models, and a tendency analysis of the micro-blog users' comments is given.
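The corpus cleaning and matrix construction described above can be illustrated with a minimal sketch. It is written in Python purely for illustration; the thesis itself works in R, and nothing here reproduces its code. The function name `build_dtm`, the parameter `max_sparsity`, the sample tokens and the stop-word list are all hypothetical, and the documents are assumed to be already segmented into words.

```python
from collections import Counter
import string

def build_dtm(docs, stop_words, max_sparsity=0.9):
    """Build a document-term count matrix from pre-segmented documents.

    docs: list of token lists (segmentation is assumed to be done already).
    Returns (terms, matrix), where matrix[i][j] is the count of terms[j]
    in document i.
    """
    # ASCII punctuation plus a few common Chinese punctuation marks
    punct = set(string.punctuation) | set("，。！？；：、（）")
    # Remove stop words and punctuation tokens
    cleaned = [[t for t in doc if t not in stop_words and t not in punct]
               for doc in docs]
    counts = [Counter(doc) for doc in cleaned]
    n_docs = len(docs)
    # Number of documents in which each term appears at least once
    doc_freq = Counter(term for c in counts for term in c)
    # Keep a term only if the share of documents lacking it is below the threshold
    terms = sorted(t for t, df in doc_freq.items()
                   if 1 - df / n_docs < max_sparsity)
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix

# Hypothetical, already-segmented sample comments
docs = [
    ["电信", "诈骗", "的", "危害", "，", "很", "大"],
    ["警惕", "电信", "诈骗", "！"],
    ["诈骗", "的", "手段", "很", "多"],
]
stop_words = {"的", "很"}
terms, matrix = build_dtm(docs, stop_words, max_sparsity=0.5)
```

With the threshold at 0.5, only terms that occur in more than half of the documents survive, so the matrix keeps the two frequent terms and drops those that appear in a single comment. A lower threshold gives a smaller, denser matrix at the cost of discarding rare words.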
Keywords/Search Tags:Internet public opinions, collection and processing of text data, document clustering and classification, analysis of network comment