| In recent years, air passenger traffic has increased dramatically on both domestic and international flights. As demand for air transportation grows, flight delays have risen along with passenger numbers, posing great challenges for airline staff. How to accurately predict flight delays, reduce them, and limit the losses they cause has therefore become a matter of public concern. Numerous techniques exist for predicting flight delays; the mainstream approach is to analyze and model flight data with machine learning methods so that airlines and passengers can prepare in advance. Taking the 2021 flight data from the US Bureau of Transportation Statistics as an example, this paper applies four sampling schemes to the data, builds an ensemble learning prediction model, and analyzes the factors affecting flight delay. The results also offer a reference for the flight delay problem in China. First, the flight data are preprocessed: feature variables with many missing values or high correlation are removed, and the labels are binarized. The data set is large, with more than 6.3 million flight records over 12 months, so stratified sampling by month is applied to the original data set to reduce its size while preserving data quality. Because the data set is imbalanced, four resampling schemes are then applied, namely two SMOTE-based oversampling methods, undersampling, and mixed sampling, to provide data for the subsequent modeling. Second, random forest, CatBoost, and LightGBM models are each built on the four sampled data sets. A single model, however, has limitations. To improve prediction accuracy, this paper introduces a Stacking model that fuses multiple submodels and combines their respective strengths. The first-layer base learners are the random forest model (Bagging) and the CatBoost model (Boosting), and the second-layer learner is the LightGBM model, which exploits the differing sensitivity of the individual models to different feature variables. Finally, accuracy, recall, and AUC are used to analyze and compare the 16 models built above. The evaluation shows that the Stacking fusion model trained on SMOTE-sampled data achieves the best predictive performance compared with the single models. |
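
The two resampling ideas named in the abstract, stratified sampling by month and SMOTE-style oversampling of the minority class, can be illustrated with a minimal pure-Python sketch. This is not the paper's implementation: the helper names `stratified_sample_by_month` and `smote_like_oversample`, the `month` key, the sampling fraction, the neighbour count `k`, and the toy data are all assumptions made for illustration. In practice a library such as imbalanced-learn would likely be used instead.

```python
import random
from collections import defaultdict


def stratified_sample_by_month(records, fraction, seed=0):
    """Draw the same fraction of rows from every month (hypothetical helper)."""
    rng = random.Random(seed)
    by_month = defaultdict(list)
    for row in records:
        by_month[row["month"]].append(row)
    sample = []
    for month in sorted(by_month):
        rows = by_month[month]
        k = max(1, round(len(rows) * fraction))  # keep at least one row per month
        sample.extend(rng.sample(rows, k))
    return sample


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority point
    and one of its k nearest minority neighbours (the core idea behind SMOTE).
    Assumes at least two minority points, each a tuple of numeric features."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x among the other minority points
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nn = rng.choice(neighbours)
        u = rng.random()  # position along the segment from x to nn
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nn)))
    return synthetic


# Toy data (made up): 100 flights per month, then a 10% stratified sample.
flights = [{"month": m, "id": i} for m in range(1, 13) for i in range(100)]
subset = stratified_sample_by_month(flights, fraction=0.1)
print(len(subset))

# Toy minority-class feature vectors (made up), oversampled with 5 new points.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.4)]
print(smote_like_oversample(minority, n_new=5))
```

Sampling the same fraction within each month keeps the monthly distribution of the original data, and each synthetic point lies on a line segment between two existing minority samples, so oversampling adds new delayed-flight examples without simply duplicating rows.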