| PM2.5 particulates, the main hazardous component of smog, not only seriously threaten human health and damage the natural environment, but also have a major impact on China’s economic development. Scientific and accurate prediction of PM2.5 helps environmental protection departments formulate preventive and remedial measures, provides a scientific basis for government policy, and reduces harm to human health. This paper reviews and analyzes the research progress and prediction methods for PM2.5. On this basis, combining machine learning theory with statistical forecasting methods, a new PM2.5 concentration prediction model (the RFP model) based on the random forest algorithm is established to predict the daily average PM2.5 concentration. The main work of this paper is as follows: (1) Xi’an, an area with high PM2.5 concentrations, is selected as the research object. Using the Python language and the Scrapy framework, a crawler with five functional modules was designed to automatically collect historical data for Xi’an from multiple websites, covering October 28, 2013 to January 31, 2018. The data include air pollutant concentrations (PM2.5, PM10, SO2, NO2, CO, O3) and meteorological conditions (temperature, dew point, humidity, sea level pressure, visibility, wind speed, wind direction, wind force, weather conditions). Newton interpolation, the 3σ criterion, before-and-after average correction, one-hot encoding and other techniques are used to extensively pre-process the raw data, thereby improving the quality of the PM2.5 experimental data. On this basis, a high-quality training data set designed specifically for PM2.5 concentration prediction research was constructed. (2) Using statistical theory, the magnitude and direction of the correlation between PM2.5 and its influencing factors are qualitatively analyzed and displayed by means of correlation coefficients (including analysis of variance) and visualization. Exploratory analysis shows that the season (spring, summer, autumn, winter) and the air pollutant concentrations and meteorological conditions of the preceding 3 days affect the PM2.5 concentration on a given day. The correlation analysis provides the data and feature basis for building the model, and also provides a reference and theoretical basis for studying the formation, sources, and influencing factors of PM2.5. (3) Preliminary feature selection was performed with the correlation coefficient method, a filter method. A total of 34 highly relevant features were selected to establish the RFP-M1 and RFP-M2 models, respectively. Using the random forest method, a wrapper method, the features were further screened down to 17 and the RFP-M3 model was established. The RFP-M4 model was then built by optimizing the parameter combination with grid search and cross-validation. The performance of the four models was analyzed and compared. Finally, the RFP model is compared with the BP-NN (Back Propagation Neural Network) model in terms of principle and method, and its prediction results are compared with those of other algorithms, including linear regression (LR), decision tree (DT), and support vector machine (SVM). The experimental results show that the proposed RFP model not only predicts the PM2.5 concentration effectively, but also improves operating efficiency without affecting prediction accuracy, with a run time of only 2.1% of that of the BP-NN model. |
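
The pre-processing steps named in item (1) can be illustrated with a minimal sketch. It is not the implementation used in this paper: the function names are hypothetical, the sample readings are made up for illustration and are not taken from the Xi’an data set, and "before and after average correction" is assumed here to mean replacing a flagged value with the mean of its nearest valid neighbours on either side.

```python
from statistics import mean, pstdev


def three_sigma_outliers(values):
    """Return the indices of values more than 3 standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) > 3 * sigma]


def neighbor_mean_correction(values, bad_indices):
    """Replace each flagged value with the mean of the nearest unflagged
    values before and after it (assumed reading of the correction step)."""
    bad = set(bad_indices)
    corrected = list(values)
    for i in bad:
        before = next((values[j] for j in range(i - 1, -1, -1) if j not in bad), None)
        after = next((values[j] for j in range(i + 1, len(values)) if j not in bad), None)
        neighbors = [v for v in (before, after) if v is not None]
        if neighbors:  # leave the value unchanged if no valid neighbour exists
            corrected[i] = mean(neighbors)
    return corrected


def one_hot(labels):
    """Encode categorical labels as 0/1 vectors; categories are sorted alphabetically."""
    categories = sorted(set(labels))
    return categories, [[1 if label == c else 0 for c in categories] for label in labels]


# Illustrative (made-up) daily PM2.5 readings with one implausible spike at index 12.
pm25 = [60, 62, 58, 61, 59, 63, 57, 60, 62, 58,
        61, 59, 900, 60, 61, 59, 62, 58, 60, 61]

flagged = three_sigma_outliers(pm25)
cleaned = neighbor_mean_correction(pm25, flagged)
categories, encoded = one_hot(["sunny", "rain", "sunny"])
```

Note that the 3σ criterion only becomes meaningful with enough samples: with the population standard deviation, a single value in a series of n points can deviate by at most √(n−1) standard deviations, so no point can be flagged when n is 10 or fewer.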