Application of Machine Learning Strategies to Improve the Prediction of Changes in the Airline Network Topology
Predictive modeling allows us to analyze historical patterns to forecast future events. When the data available for this analysis is imbalanced or skewed, many challenges arise. The lack of sensitivity towards the class with less data available hinders the sought-after predictive capabilities of the model. These imbalanced datasets are found across many different fields, including medical imaging, insurance claims and financial fraud. The objective of this thesis is to identify the challenges of applying machine learning to imbalanced transportation data with only one independent variable, and the means to assess that application.
Airlines undergo a decision-making process on air route addition or deletion in order to adjust the services offered with respect to demand and cost, amongst other criteria. This process greatly affects the topology of the network, and results in a continuously evolving Air Traffic Network (ATN). Organizations like the Federal Aviation Administration (FAA) are interested in the network transformation and the influence airlines have as stakeholders. For this reason, they attempt to model the criteria used by airlines to modify routes. The goal is to predict trends and dependencies observed in the network evolution by understanding the relation between the number of passengers per flight leg (the single independent variable) and the airline’s decision to keep or eliminate that route (the dependent variable). Research to date has used optimization-based methods and machine learning algorithms to model airlines’ decision-making process on air route addition and deletion, but these studies demonstrate less than 50% accuracy.
In particular, two machine learning (ML) algorithms are examined: Sparse Gaussian Classification (SGC) and Deep Neural Networks (DNN). SGC is the extension of Gaussian Process Classification models to large datasets. These models use Gaussian Processes (GPs), which have been shown to perform well in binary classification problems. DNN uses multiple layers of probabilities between the input and output layers. It is one of the most popular ML algorithms currently in use, so the results obtained using SGC were compared with those of the DNN model.
At first glance, these two models appear to perform equally well, both giving a high accuracy of 97.77%. However, post-processing the results using a simple Bayes classifier and using the appropriate metrics for measuring the performance of models trained with imbalanced datasets reveals otherwise. Both SGC and DNN produced predictions with 1% precision and 20% recall, an F1 score of 0.02, and an AUC (Area Under the Curve) of 0.38 and 0.31, respectively. The low F1 score indicates that the classifier is not performing accurately, and the AUC value confirms the inability of the models to differentiate between the classes. This is probably due to the existing interaction and competition of the airlines in the market, which is not captured by the models. Interestingly enough, the behavior of both models is very different across the range of threshold values. The SGC model captured the low confidence in these results more effectively. To validate the model, a stratified K-fold cross-validation was run.
The future application of Gaussian Processes in model-building for decision-making will depend on a clear understanding of their limitations and of the imbalanced datasets used in the process, which is the central purpose of this thesis. Future steps in this investigation include further analysis of the training data as well as the exploration of variable-optimization algorithms. The tuning process of the SGC model could be improved by utilizing optimal hyperparameters and inducing inputs.
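The gap between the accuracy figure and the imbalance-aware metrics reported above can be illustrated with a short Python sketch. It is not the code used in the thesis. The function names and the route counts in the second half are hypothetical, chosen only to show how a classifier that never predicts the minority class can still score a high accuracy. The first half simply combines the reported precision (1%) and recall (20%) into their harmonic mean.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Precision and recall as reported in the abstract.
reported_f1 = f1_score(0.01, 0.20)
print(f"F1 from reported precision/recall: {reported_f1:.2f}")

# Hypothetical counts (an assumption, not thesis data): 10,000 routes,
# of which 100 are deleted. A trivial model predicts "keep" every time.
tp, fp, fn, tn = 0, 0, 100, 9_900
baseline_accuracy = accuracy(tp, fp, fn, tn)
baseline_recall = recall(tp, fn)
print(f"Always-keep baseline: accuracy={baseline_accuracy:.2f}, "
      f"recall={baseline_recall:.2f}")
```

In the hypothetical baseline the accuracy is high even though no deleted route is ever detected, which is why precision, recall, the F1 score and AUC are the more informative measures for this kind of data.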