| As cities grow and living standards rise, pollutant emissions keep increasing, environmental pollution problems have multiplied, and air quality has become a matter of public concern. Accurate prediction of the air quality index (AQI) is a key prerequisite for addressing air pollution. However, the AQI varies non-linearly and depends on many factors. Previous research on air quality prediction has usually left redundant features untreated and has rarely considered how the data itself affects the prediction model. This paper starts from the data and takes both the relevance and the redundancy of features into account. The main contributions are as follows. First, meteorological data and air quality data are joined by the longitude and latitude of each monitoring station to form a complete data set. The relevance and redundancy of all features are then combined into a new feature screening index, and a feature extraction algorithm based on embedded redundancy is proposed. Second, to address the uncertainty of AQI prediction, we improve NGBoost by combining it with data classification and propose a data classification method based on the AQI distribution map. The method trains a separate NGBoost model on each class of labeled data, predicts each sample with the model of the class it belongs to, and then aggregates the results. Third, a parallel air quality prediction model is built on Spark, and a parallel algorithm for the model is designed. Because the NGBoost models are independent of one another, they can be computed in parallel on distributed nodes. Moreover, the base learner of NGBoost is a decision tree, which further increases parallelism, alleviates the weakening of NGBoost's generalization ability on large data volumes, and improves computational efficiency. Finally, experiments on pseudo-distributed nodes verify the effectiveness of the proposed model and algorithm, and the results are compared with those of other algorithms and analyzed. |
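
The abstract does not give the formula for the feature screening index, so the following is only a minimal sketch of one way relevance and redundancy could be combined. It assumes a greedy "relevance minus mean redundancy" score based on absolute Pearson correlation, in the spirit of mRMR-style selection; the function names `pearson` and `screen_features`, the feature names, and the toy values are all hypothetical and are not taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def screen_features(features, target, k):
    """Greedily select up to k feature names.

    Assumed score (not the paper's index): |corr(feature, target)| minus the
    mean |corr(feature, already selected feature)|.
    features: dict mapping feature name -> list of values
    target:   list of AQI values
    """
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    selected = []
    remaining = set(features)
    while remaining and len(selected) < k:
        best_name, best_score = None, None
        for name in sorted(remaining):  # sorted for deterministic tie-breaking
            if selected:
                redundancy = sum(
                    abs(pearson(features[name], features[s])) for s in selected
                ) / len(selected)
            else:
                redundancy = 0.0
            score = relevance[name] - redundancy
            if best_score is None or score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
        remaining.remove(best_name)
    return selected


# Toy data: "pm25_dup" is almost a copy of "pm25", so it is relevant but redundant.
aqi = [1, 2, 3, 4, 5, 6]
toy_features = {
    "pm25": [1, 2, 3, 4, 5, 7],
    "pm25_dup": [1, 2, 3, 4, 5, 8],
    "wind": [3, 1, 2, 3, 1, 2],
}
chosen = screen_features(toy_features, aqi, k=2)
```

On this toy data the near-duplicate column is passed over in the second round even though it correlates strongly with the target, which is the behaviour a redundancy-aware screening index is meant to produce.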