Ambient air is vital to the earth's ecological environment, and the well-being of humans and other living creatures is closely tied to air quality. Serious air pollution can aggravate global warming, which does great harm to the circulation of the earth's ecosystem. This study therefore focuses on the air quality level, which is intuitive and easy to understand, and constructs machine-learning-based classification models to predict it. Machine learning methods make fuller use of the scientific value of historical air data, which can improve the dynamic monitoring of air quality so that the public is informed in time. Moreover, efficient air quality classification prediction models help provide reasonable suggestions to relevant environmental protection organizations, which is of theoretical significance and practical value for intelligent environmental protection and the sustainable development of the ecological environment.

In this paper, the concentrations of the six main air pollutants specified by the national standard (SO2, CO, NO2, O3, PM10, PM2.5) are used as the model inputs, and the air quality level, which is divided into six levels, is classified and predicted. Building on the original random forest algorithm, this study constructs air quality classification prediction models based on an improved random forest that takes the distribution imbalance and label noise of air data into account. The main conclusions are as follows:

(1) Analysis of air samples from the China Environmental Monitoring Center shows that the distribution of air samples across levels is unbalanced. Analysis of the automatic air monitoring process shows that data generation and transmission over the sensor network may introduce noise into the air data.

(2) This study analyzes and verifies that the original random forest classification algorithm is affected by unbalanced distribution and label noise. The more unbalanced the air data are, and the more label noise they contain, the more the performance of the original random forest classification models degrades.

(3) A random forest algorithm based on stratified resampling is proposed. The classification model based on this algorithm is shown to handle unbalanced classification problems well, and in particular to improve the recognition rate of minority-class samples.

(4) A random forest algorithm based on label correction is proposed. The experimental results show that the classification model based on this algorithm is more robust to label noise, and that it significantly increases classification accuracy on tasks with various kinds of label noise.

(5) An air quality classification model based on the improved random forest is proposed. This model alleviates the negative impact of unbalanced distribution and label noise on air quality classification tasks and significantly improves prediction accuracy for minority-class air samples. In addition, the model achieves satisfactory prediction results even when the training set contains a high proportion of label noise.
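
Conclusion (3) names stratified resampling, but this abstract does not give its details. As a minimal sketch, assuming it means drawing each tree's bootstrap sample class by class so that every air quality level contributes equally, the idea could look like the following in Python. The function name `stratified_bootstrap`, the `per_class` parameter, and the toy label counts are illustrative assumptions, not taken from the study.

```python
import random
from collections import Counter, defaultdict


def stratified_bootstrap(labels, per_class=None, rng=None):
    """Return sample indices drawn with replacement, the same number per class.

    Hypothetical helper: the study's actual resampling scheme may differ.
    """
    rng = rng or random.Random()

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Default: draw as many samples per class as the largest class holds.
    if per_class is None:
        per_class = max(len(ids) for ids in by_class.values())

    # Sample each class separately, with replacement, then mix the result.
    sample = []
    for label in sorted(by_class):
        sample.extend(rng.choices(by_class[label], k=per_class))
    rng.shuffle(sample)
    return sample


# Toy, made-up label distribution over six air quality levels (1 = best).
labels = [1] * 50 + [2] * 30 + [3] * 12 + [4] * 5 + [5] * 2 + [6] * 1
indices = stratified_bootstrap(labels, rng=random.Random(0))
counts = Counter(labels[i] for i in indices)
print(counts)
```

In a forest built this way, each tree would be trained on its own index sample. With the toy labels above, every level contributes 50 draws, so the single level-6 sample is repeated rather than drowned out by the majority levels.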