
Research On The Technology Of E-commerce Product Quality Risk Assessment Based On Data Mining

Posted on: 2017-01-01
Degree: Master
Type: Thesis
Country: China
Candidate: M Zou
Full Text: PDF
GTID: 2308330482980659
Subject: Computer technology
Abstract/Summary:
With the rapid development of network and information technology, the Internet has become an unavoidable part of everyday life and has changed the way people live. However, because e-commerce is virtual and cross-regional, some products purchased by consumers may carry quality risks. To address this problem, and building on previous studies, this thesis proposes a risk evaluation model based on user reviews from e-commerce platforms. The model accurately identifies risks, ranks their severity, and then releases the corresponding early-warning information so that regulators can respond quickly. The main contents of the thesis are as follows:

First, the data processing facilities of the R language are used to handle null values and outliers, and the ansj jar is used on Spark to segment Chinese text and remove stop words, yielding the preprocessed training data. Next, noise is added to the attributes of the Random Forest's out-of-bag (OOB) samples, and the resulting errors are ranked to order the features by importance. This performs feature selection for e-commerce product quality risk evaluation and is implemented on the Spark computing framework.

For the risk assessment itself, an improved Naïve Bayes algorithm, parallelized on Spark, is proposed to build the risk model. The Naïve Bayes algorithm assumes that features are independent, but in practice they are closely related. The improved algorithm therefore uses the correlation between each feature and the label, computed with MinHash, to weight the conditional probabilities in Naïve Bayes, and the algorithm is parallelized on Spark.

Experiments were run on a large-scale Spark cluster. On UCI data sets, the improved Bayesian algorithm based on Spark outperforms Naïve Bayes and the serial version of the algorithm in accuracy, recall, and time complexity.
As the volume of experimental data increases, the efficiency of the serial algorithm declines, whereas in the Spark distributed environment efficiency improves significantly. The Spark-based parallel algorithm therefore scales better and is better suited to large-scale data. Experiments also show that, when applied to user comments from an e-commerce platform, the model can accurately identify product quality risks and issue risk early warnings. The thesis thus puts forward a new model of risk supervision.
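The abstract does not give the weighting formula, so the following is only a rough, single-machine sketch of how MinHash-based feature weights might enter a Naïve Bayes classifier. Everything here is an assumption for illustration, not the thesis's method: features are binary, each feature's weight is the MinHash-estimated Jaccard similarity between the set of rows where the feature is present and the set of rows of its best-matching class, and the weight multiplies the feature's log conditional probability. The names `minhash_signature`, `estimate_jaccard`, `train`, and `predict` are hypothetical, and no Spark parallelization is shown.

```python
import math
import random

PRIME = 2_147_483_647  # modulus for the (a * x + b) % PRIME hash family


def minhash_signature(items, seeds):
    """Return one minimum hash value per (a, b) seed pair for a set of ints."""
    if not items:
        return [PRIME] * len(seeds)  # sentinel: larger than any real hash value
    return [min((a * x + b) % PRIME for x in items) for a, b in seeds]


def estimate_jaccard(sig_a, sig_b):
    """The share of matching signature positions estimates Jaccard similarity."""
    return sum(1 for u, v in zip(sig_a, sig_b) if u == v) / len(sig_a)


def train(X, y, num_hashes=128, seed=0):
    """Fit a weighted Bernoulli Naive Bayes model on binary features (assumed form)."""
    rng = random.Random(seed)
    seeds = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
             for _ in range(num_hashes)]
    n, d = len(X), len(X[0])
    classes = sorted(set(y))
    class_rows = {c: {i for i in range(n) if y[i] == c} for c in classes}
    class_sigs = {c: minhash_signature(rows, seeds)
                  for c, rows in class_rows.items()}

    # Assumed weighting: similarity between the rows where feature j is present
    # and the rows of the class it matches best.
    weights = []
    for j in range(d):
        feature_rows = {i for i in range(n) if X[i][j] == 1}
        sig = minhash_signature(feature_rows, seeds)
        weights.append(max(estimate_jaccard(sig, class_sigs[c]) for c in classes))

    priors = {c: len(class_rows[c]) / n for c in classes}
    # Laplace-smoothed estimate of P(x_j = 1 | class).
    cond = {c: [(sum(X[i][j] for i in class_rows[c]) + 1) / (len(class_rows[c]) + 2)
                for j in range(d)]
            for c in classes}
    return {"classes": classes, "weights": weights, "priors": priors, "cond": cond}


def predict(model, x):
    """Return the class maximizing log P(c) + sum_j w_j * log P(x_j | c)."""
    best_class, best_score = None, -math.inf
    for c in model["classes"]:
        score = math.log(model["priors"][c])
        for j, value in enumerate(x):
            p = model["cond"][c][j]
            score += model["weights"][j] * math.log(p if value == 1 else 1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Toy data: feature 0 coincides with the label, feature 1 is mostly noise.
X = [[1, 0], [1, 1], [0, 0], [0, 1], [1, 0], [0, 1]]
y = [1, 1, 0, 0, 1, 0]
model = train(X, y)
print(model["weights"])
print(predict(model, [1, 1]))
```

In this toy data the first feature receives the maximum weight because its row set is identical to that of one class, so it dominates the decision. A weight of 1 for every feature would recover ordinary Bernoulli Naïve Bayes.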
Keywords/Search Tags:Data Mining, Naive Bayesian, Random Forest, MinHash, Spark