Flight Delay Analysis Based On Unbalanced Data Sets

Posted on:2024-06-01

Degree:Master

Type:Thesis

Country:China

Candidate:M H Liu

GTID:2542306914497434

Subject:Applied Statistics

Abstract/Summary:

In recent years, air passenger traffic has grown sharply on both domestic and international routes. As demand for air transport rises, flight delays have increased along with passenger numbers, posing major challenges for airline staff. How to predict flight delays accurately, reduce them, and limit the losses they cause has therefore become a matter of public concern. Many techniques exist for predicting flight delays; the mainstream approach is to analyze and model flight data with machine learning so that airlines and passengers can prepare in advance. Taking the 2021 flight data from the US Bureau of Transportation Statistics as an example, this thesis applies four sampling treatments to the data, builds ensemble learning prediction models, and analyzes the factors that affect flight delays. The results also offer a reference for the flight delay problem in China.

First, the flight data are preprocessed: feature variables with many missing values or high correlation are removed, and the labels are binarized. The data set is large, with more than 6.3 million flight records across 12 months, so the original data set is reduced by stratified sampling on month, which shrinks it in a reasonable way while preserving data quality. Because the data set is unbalanced, resampling methods from the oversampling, undersampling, and mixed-sampling families, including SMOTE, are applied to provide data for the subsequent modeling.

Second, random forest, CatBoost, and LightGBM models are each built on the four sampled data sets. A single model has limitations, however, so to improve prediction accuracy the thesis introduces a Stacking model that fuses multiple submodels and combines their respective strengths. The first-layer base learners are a random forest (a Bagging method) and CatBoost (a Boosting method), and the second-layer learner is LightGBM, so that the different sensitivities of the individual models to different feature variables are fully exploited. Finally, accuracy, recall, and AUC are used to analyze and compare the 16 resulting models. The evaluation shows that the Stacking fusion model trained on SMOTE-sampled data achieves the best predictive performance compared with the single models.
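The abstract gives no implementation details, so the following is only a minimal sketch of two of the steps it names: stratified sampling by month and basic SMOTE oversampling. It uses only the Python standard library. The function names `stratified_sample` and `smote`, the toy data, and the parameter defaults are illustrative assumptions, not the thesis's code; a real study would more likely rely on an established library implementation.

```python
import math
import random
from collections import defaultdict


def stratified_sample(rows, key, frac, seed=0):
    """Draw the same fraction of rows from every stratum (e.g. each month)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        n = max(1, round(len(group) * frac))  # keep at least one row per stratum
        sample.extend(rng.sample(group, n))
    return sample


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (basic SMOTE)."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("SMOTE needs at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Brute-force neighbour search; acceptable for a toy example only.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy data (hypothetical): 100 records per month, identified by (month, index).
rows = [(month, i) for month in range(1, 13) for i in range(100)]
subset = stratified_sample(rows, key=lambda r: r[0], frac=0.1)

# Toy minority class (hypothetical): four points at the corners of a unit square.
delayed = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
extra = smote(delayed, n_new=6, k=2)
```

Because every stratum contributes the same fraction, the monthly proportions of the full data set carry over to the subset. The synthetic SMOTE points always lie on line segments between existing minority points, which is why they stay inside the region the minority class already occupies.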

Keywords/Search Tags:

flight delays, SMOTE sampling, random forest, LightGBM, stacking model

Related items

1	Prediction And Analysis Of Used Car Transaction Prices
2	A Flexible Decision Method For Slot Allocation Considering The Impact Of Flight Delays
3	Empirical Analysis Of Machine Learning Classification Algorithm To Flight Delay Data
4	A Study Of Flight Delay Classification Problem Based On Several Machine Learning Methods
5	Research On Structural Fault Diagnosis And Class Classification Of Photovoltaic Power Station Based On Random Forest
6	Design And Implementation Of Equipment Fault Detection Method Based On Random Forest And LightGBM
7	Research On Flight Delays Prediction Methods Based On Machine Learning
8	Detection Of Electricity Theft In Smart Grid Based On K-SMOTE And Improved Random Forest
9	Early Warning Of Grid Tower Failure Under Typhoon Disaster Based On Improved Random Forest
10	Research On Optimization Of Flight Delay Prediction Based On Operation Data