| With the continuous development of China's economy and the gradual penetration of the "Internet+" economic form into the second-hand car market, consumers can now trade second-hand cars online. The application of big data also makes the second-hand car information on trading platforms complex and diverse. It is therefore important to identify, from these mixed characteristics, the factors that most affect second-hand car prices and to use them to build an accurate prediction model. Traditional second-hand car price assessment relies on the assessor's long-term understanding of the whole market and on accumulated experience. In recent years, many scholars have applied machine learning algorithms to second-hand car trading and built price prediction models that predict the transaction price between buyers and sellers from a data mining perspective. This lowers the professional threshold for second-hand car price assessment and makes pricing considerably more scientific. The data in this study were crawled with Python from the second-hand car detail pages on the Autohome platform. After the data were preprocessed, a two-layer voting method was used for feature selection. The first layer applied five feature selection methods: the variance selection method, the correlation coefficient method, the maximum information coefficient method, the norm-based Lasso selection method, and the tree-model-based GBDT selection method. The second layer voted again on the selection results of the first layer and finally generated three new feature subsets. In addition, principal component dimension reduction was used as a control: the 12 leading principal components, whose cumulative variance contribution rate reaches 80%, were selected to generate a control dataset.
In this thesis, a support vector model, a random forest model, and an XGBoost model were built on the datasets obtained from feature voting selection, and grid search was used for parameter optimization. On the five datasets produced by the first layer of voting, the prediction error of the support vector model decreased as the number of features increased, whereas the random forest and XGBoost models depended less on the number of features and performed best on datasets T4 and T1, respectively. On the datasets produced by the second layer of voting, all three models performed best on U3. Among them, the evaluation metrics of the support vector model changed the most: relative to the average level, MSE decreased by 37.58%, MAE decreased by 19.32%, and R² increased by 3.99%. Compared with the traditional principal component dimension reduction method, the feature datasets generated by the two-layer voting method proposed in this thesis have a significant advantage in reducing price prediction error: the mean squared error is reduced by 25.93% and the mean absolute error by 16.79%. |
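The voting step described above can be illustrated with a minimal sketch. The abstract does not spell out the exact voting rule, so everything here is an assumption made for illustration: the helper `vote_features`, the feature names, the per-method selections, and the vote thresholds (3, 4 and 5) are all hypothetical. The sketch only shows the general idea of keeping features that enough first-layer methods agree on.

```python
from collections import Counter


def vote_features(selections, min_votes):
    """Keep features chosen by at least `min_votes` of the selection methods.

    `selections` maps a method name to the features that method selected.
    This is an assumed voting rule, not the rule used in the thesis.
    """
    votes = Counter(
        feature for chosen in selections.values() for feature in set(chosen)
    )
    return sorted(f for f, n in votes.items() if n >= min_votes)


# Hypothetical first-layer results: one feature subset per selection method.
first_layer = {
    "variance": ["mileage", "age", "displacement", "brand"],
    "correlation": ["mileage", "age", "new_car_price"],
    "mic": ["mileage", "age", "new_car_price", "brand"],
    "lasso": ["age", "new_car_price", "displacement"],
    "gbdt": ["mileage", "age", "new_car_price", "brand"],
}

# Second layer (assumed): three vote thresholds give three candidate subsets.
second_layer = {k: vote_features(first_layer, k) for k in (3, 4, 5)}

for k, subset in second_layer.items():
    print(f"at least {k} votes: {subset}")
```

A stricter threshold yields a smaller subset, so under this assumed rule the three thresholds trade off feature count against agreement between the selection methods.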