Research On Topic Feature Extraction And Text Classification In Social Internet Community

Posted on: 2019-04-28
Degree: Master
Type: Thesis
Country: China
Candidate: X G Dang
Full Text: PDF
GTID: 2348330542498186
Subject: Computer technology
Abstract/Summary:
Social internet communities have become an important source of influence on online public opinion. Mining their text effectively and accurately helps with the supervision of online public opinion. Most information in these communities exists as text, and the content is fragmented. The large volume of discussion around hot topics can easily lead to data imbalance. These characteristics reduce the validity and accuracy of text data mining. This paper collects a corpus with the characteristics of social internet communities for model training. It extracts features from community text data using the fuzzy k-means clustering algorithm and designs a classification system for social internet communities. To address data imbalance, the paper proposes a multi-class classification algorithm for online public opinion text based on random forests and cost-sensitive learning. The algorithm uses Naive Bayes to construct the cost matrix and selects decision tree nodes using a Gini index that incorporates misclassification cost. To verify its effect, the algorithm is compared with two representative improved algorithms for imbalanced data: an SVM algorithm based on SMOTE oversampling at the data level, and a continuous AdaBoost algorithm based on Bayesian statistical inference at the sample level. The comparison uses accuracy, recall and F-measure. In the comparative experiment, the classifier improves performance on minority classes by 8%, which addresses the data imbalance problem to some extent. The algorithm improves the performance of the classification model on imbalanced data while maintaining the overall performance of the model. Feature extraction and text classification of internet community information can integrate fragmented information by subject, enabling regulators of Internet public opinion to grasp the dynamics of public opinion.
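The abstract does not give the exact form of the Gini index with misclassification cost, so the following is only a minimal sketch of one common formulation, in which the impurity of a node is the sum of cost[i][j] * p_i * p_j over all pairs of distinct classes. The function names and the hand-set cost matrices are hypothetical; the paper builds its cost matrix with Naive Bayes, which is not reproduced here.

```python
from collections import Counter


def cost_sensitive_gini(labels, cost):
    """Cost-weighted Gini impurity of one node (illustrative formulation).

    cost[i][j] is the assumed cost of confusing true class i with class j.
    With every off-diagonal cost equal to 1 this reduces to the ordinary
    Gini impurity 1 - sum(p_i ** 2).
    """
    n = len(labels)
    if n == 0:
        return 0.0
    probs = {c: k / n for c, k in Counter(labels).items()}
    return sum(
        cost[i][j] * probs[i] * probs[j]
        for i in probs
        for j in probs
        if i != j
    )


def split_score(left, right, cost):
    """Size-weighted impurity of a candidate split; lower is better."""
    n = len(left) + len(right)
    return (
        len(left) / n * cost_sensitive_gini(left, cost)
        + len(right) / n * cost_sensitive_gini(right, cost)
    )


# Hypothetical cost matrices for a two-class problem where class 1 is the minority.
uniform_cost = {0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}}
minority_cost = {0: {0: 0, 1: 1}, 1: {0: 5, 1: 0}}  # missing class 1 costs 5x

mixed_node = [0, 0, 1, 1]
plain = cost_sensitive_gini(mixed_node, uniform_cost)
weighted = cost_sensitive_gini(mixed_node, minority_cost)
```

Under this formulation, a node that mixes in minority-class samples is penalized more heavily when misclassifying the minority class is expensive, so tree growth favors splits that isolate those samples.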
Keywords/Search Tags:E-community, text data, fragmented, imbalance, feature extraction, text classification