| To collect information from statistical survey opinions more effectively, applying topic mining models has both practical significance and theoretical value. Based on an overview of the current state of domestic and international research on topic extraction models, this thesis studies statistical topic mining methods for survey opinions using a social public survey opinion dataset. First, the research team collected the survey opinion data, obtaining a total of 46,962 records covering 15 aspects. The dataset is processed by data cleaning, Chinese word segmentation, and stop-word removal. Given the characteristics of the survey opinion data, we built a user dictionary covering area names, street names, and store names to further filter the dataset. TF-IDF and BOW feature extraction methods are then used to build two separate corpora. Second, to select the model with the strongest topic extraction ability, this thesis compares five topic models (LDA, DTM, NVDM-GSM, ETM, and W-LDA) on the TF-IDF and BOW corpora separately, using survey opinion data from 2012 to 2018. In a comprehensive comparison on two indicators, topic diversity and topic coherence, the W-LDA model gives the best results. The prior of W-LDA is therefore modified, with logistic normal and Gaussian distributions used as priors to form the WTM-LN and WTM-Gaussian models. The empirical results show that the modified models improve effectiveness further, with the BOW-based WTM-LN model performing best. Finally, four machine learning classification models (fastText, K-Nearest Neighbor, Naive Bayes, and XGBoost) are trained and tested on the labeled data from 2012 to 2018. Using precision, recall, and F1 score as evaluation metrics, we select fastText as the optimal text classification model and use it to classify the unlabeled 2019-2020 dataset. The WTM-LN model is then used for deep topic mining within each category. The optimal number of topics in each category is determined using normalized metrics of topic coherence, topic diversity, and model loss, in order to obtain valid information and provide targeted advice and suggestions to relevant government departments. |