Application Of Machine Learning In PM2.5 Concentration Prediction

Posted on: 2024-02-27
Degree: Master
Type: Thesis
Country: China
Candidate: H B Zhang
GTID: 2531307106486164
Subject: Applied statistics
Abstract/Summary:
As global industrialization matures, factories of all sizes can be seen everywhere. With the continuing growth in urban motor vehicles, air pollution has become increasingly prominent and difficult to address. PM2.5 concentrations above the standard, the "invisible killer" of human health, have long been a problem for both the government and the public. PM2.5 is fine particulate matter suspended in the air, and it is one of the culprits of air pollution. Studies have shown that long-term exposure to an environment where the PM2.5 concentration exceeds the standard increases the risk of multiple diseases. In addition, the haze caused by PM2.5 is also very harmful. In 2013, the central and eastern regions of China experienced a very large outbreak of haze. Since then, China has waged a long-running "war on haze". Fortunately, years of anti-haze campaigns have gradually achieved results, but haze still returns from time to time. In March 2023, a new round of sandstorms reached Beijing again, and the air quality in Beijing reached the serious pollution level. The road to controlling haze therefore remains long and difficult, and effectively predicting the PM2.5 concentration in advance makes early warning possible.

This thesis uses air pollution and meteorological data from the Beijing Agricultural Exhibition Museum monitoring point, covering March 2013 to February 2017, to analyze the factors influencing the PM2.5 concentration in the area and to build prediction models. The data are first preprocessed, including handling missing values and one-hot encoding the categorical variables, and a descriptive statistical analysis is then performed on the processed data. According to the variable boxplots, the six air pollutants (PM2.5, PM10, SO2, NO2, CO, O3) are generally at a low level but have many outliers, indicating that there is air pollution in this area of Beijing. A large proportion of the rainfall data is 0, indicating that the region has little rainfall. By drawing variable trend charts and analyzing the variables, it is found that PM2.5 has a high positive correlation with PM10 and CO and a moderate correlation with SO2 and NO2, while its correlation with O3 and the four meteorological factors (temperature, atmospheric pressure, dew-point temperature, rainfall) is weak. Considering that there is some correlation between several meteorological factors and the remaining pollutants, all features are used to build the models.

After the data are standardized, the training set and test set are split at a ratio of 2:1, and three single machine learning models are established: an SVR model, a Random Forest model, and an XGBoost model. The optimal parameter combination for each model is determined by tuning, and the tuned models are used for the empirical analysis. In the empirical analysis, a Stacking fusion model is built in conjunction with a multiple linear regression (MLR) model: the training-set and test-set predictions of the three single models are stacked into a new training set and a new test set. Comparing the evaluation metrics (MAE, RMSE, R-squared) of the three single models shows that the SVR model performs best, with an R-squared of 0.954. The Random Forest and XGBoost models perform similarly. The Stacking fusion model falls within the range of the three single models, and the overall performance is not much improved.

Comparison charts of the predicted and actual values on the test set for the four models show that each model has more large prediction errors in the early stage of the test set prediction, especially the three single machine learning models. The test set predictions are divided proportionally into two sections, and MAE and RMSE are calculated for each section. The evaluation metrics for the front section are significantly better than those for the latter section. The XGBoost model is representative, with differences of 0.46 and 0.75, so machine learning models may not be ideal for long-term forecasting (more than one year) of the PM2.5 concentration. Finally, this thesis cuts the original test set into two sections and uses one half for prediction. The MAE and RMSE of each model decline, and, except for the SVR model, the R-squared of the other three models also decreases: by 0.28, 0.41, and 0.44 for the Random Forest, XGBoost, and Stacking models, respectively. This suggests that the performance of these three models depends on the scale of the data, while the SVR model can handle data of different sizes flexibly. In addition, although the Stacking fusion model does not improve on the SVR model, it does improve on the other two models. The prediction performance could therefore be improved further by improving the selection of base learners for the Stacking fusion model.
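
The evaluation described above (MAE, RMSE, and R-squared, plus a comparison of the front and latter sections of the test set predictions) can be sketched in plain Python. This is only an illustration under stated assumptions, not the thesis's actual code: the function names are hypothetical, the sample values are made up, and the 0.5 split fraction is assumed because the abstract does not state the proportion used.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def evaluate_by_segment(y_true, y_pred, split_fraction=0.5):
    """Compute MAE and RMSE separately for the front and latter sections.

    split_fraction is an assumed parameter; the abstract does not say
    where the test set predictions were divided.
    """
    cut = int(len(y_true) * split_fraction)
    segments = {
        "front": (y_true[:cut], y_pred[:cut]),
        "latter": (y_true[cut:], y_pred[cut:]),
    }
    return {
        name: {"MAE": mae(t, p), "RMSE": rmse(t, p)}
        for name, (t, p) in segments.items()
    }


# Made-up values for illustration only (not data from the thesis).
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 45.0]

scores = evaluate_by_segment(y_true, y_pred)
overall_r2 = r_squared(y_true, y_pred)
```

Comparing `scores["front"]` with `scores["latter"]` for each model gives the kind of per-section difference the abstract reports. A gap between the two sections suggests that prediction quality changes over the forecast horizon.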
Keywords/Search Tags: PM2.5, Ensemble learning, Stacking fusion model, Regression prediction