| With the rapid development of the civil aviation industry, the growing number of flights has led to a sharp increase in flight delays. Flight delays not only seriously disrupt passengers' travel but also directly affect the normal order of airports and the normal operation of airlines, causing substantial economic losses; in the long run they damage the reputation of the aviation industry and hinder its development. It is therefore important to explore flight data and select a good model to predict flight delays in advance. Machine learning is a current research hotspot. In this paper, several machine learning algorithms are applied to flight delay data, and the best model is identified by building and comparing models, in order to classify flight delays correctly, give all parties advance warning of delays, and reduce the economic losses and adverse effects that delays cause. The paper first cleans and preprocesses the original data, chooses to analyze it from the perspective of 'flight landing delay', and divides flights into two outcomes, delayed and punctual, according to delay duration. It then conducts an exploratory analysis of the correlation between the main variables in the data and flight delays. Finally, five machine learning algorithms, namely logistic regression, a CART-based decision tree, random forest, Extreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), are used to analyze the importance of different variables to delays and to classify and predict flight delays. Prediction performance is evaluated using precision, accuracy, recall, F1 score, and AUC. A comparison of these evaluation metrics shows that all algorithms perform well except the CART-based decision tree. The CART-based decision tree produces only two nodes, and
its performance is not ideal. Although logistic regression is simple, it also performs well. The three ensemble learning algorithms, random forest, XGBoost, and LightGBM, overcome the poor performance of the single decision tree. In flight delay prediction, misclassifying a delayed flight as punctual is costly, so this paper judges classification performance mainly by precision combined with AUC. Although LightGBM is inferior in accuracy and recall, it performs best in precision and AUC, and it is therefore selected as the best prediction model. When classifying flight delays, it is necessary to compare several machine learning algorithms, choose the evaluation metrics that best fit the task requirements, and identify the key factors affecting flight delays. In practice, all parties should actively prepare for flight delays and take effective measures to minimize their adverse effects. |
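
The five evaluation metrics named in the abstract can be illustrated with a minimal sketch in plain Python. This is not the paper's code: the function names, the toy labels and scores, the 0.5 threshold, and the choice of "punctual" as the positive class (label 1) are all assumptions made for illustration. Under that labeling, a delayed flight predicted as punctual is a false positive, which is the error that precision penalizes; if "delayed" were the positive class instead, the same error would be a false negative and would show up in recall.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, accuracy, and F1 from hard predictions."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(y_true)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


def auc_score(y_true, scores, positive=1):
    """AUC as the probability that a random positive example is scored
    higher than a random negative one (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = punctual, 0 = delayed.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.7, 0.1]  # predicted P(punctual)
y_pred = [1 if s >= 0.5 else 0 for s in scores]     # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = auc_score(y_true, scores)
```

Because AUC is computed from the ranking of scores rather than from thresholded labels, a model can lead on AUC and precision while trailing on accuracy and recall, which is the pattern the abstract reports for LightGBM.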