Text Mining Analysis of Food Safety Reviews Based on SMOTE and LightGBM

Posted on: 2021-05-21
Degree: Master
Type: Thesis
Country: China
Candidate: S Z Sun
Full Text: PDF
GTID: 2428330647459586
Subject: Applied statistics
Abstract/Summary:
The rise of O2O (online-to-offline) stores has caused widespread concern about food safety. This thesis collects online catering reviews, mines the text through feature engineering and empirical analysis, and establishes a mechanism for recognizing food safety reviews, so that takeaway platforms can supervise the food safety of stores online. Feature engineering covers text preprocessing, feature visualization, and word vector representation. Text preprocessing compares the word segmentation results of Jieba and SnowNLP, and word cloud analysis and semantic network analysis are performed on the two sets of review texts: those that involve food safety and those that do not. Word2vec is used to vectorize the text features. The empirical analysis covers model training and model optimization: three basic classifiers (LR, SVM, MNB) and four ensemble classifiers (RF, XGBoost, LightGBM, CatBoost) are constructed, and grid search is used to find the optimal parameters. To improve recall, a dual classifier (double filter), oversampling, and undersampling are used to optimize the model. Because the two classifiers have different data distributions, the dual classifier does not improve recall, but all four sampling techniques do. Among the original seven models, MNB has the highest recall but the lowest accuracy and precision, with accuracy even below 0.5. LightGBM has the highest accuracy and precision, but its recall is weaker. In addition, oversampling sacrifices less accuracy and precision in exchange for higher recall. In balancing the indicators, the sampling methods rank as follows: SMOTE oversampling is best, ADASYN oversampling is second, ClusterCentroids undersampling is third, and NearMiss is worst. Based on a comparison of the F1 values of the models, LightGBM with SMOTE oversampling is recommended as the final prediction model.
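The oversampling idea behind SMOTE can be illustrated with a minimal sketch. This is not the thesis's implementation. It only shows the core step of creating a synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbours. The function name `smote`, its parameters, and the toy feature values are assumptions made for illustration; in practice a library implementation such as the `SMOTE` class in imbalanced-learn would normally be used ahead of a classifier such as LightGBM.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority neighbours.

    Illustrative sketch only: `minority` is a list of equal-length
    feature vectors belonging to the rare class.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Random position along the segment from base to neighbour
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy 2-D feature vectors for the rare class (hypothetical values).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region the minority class already occupies. The undersampling methods named in the abstract instead discard or condense majority samples.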
Keywords/Search Tags:Food safety, LightGBM, CatBoost, Oversampling, Undersampling