A STUDY ON THE IMPACT OF PREPROCESSING STEPS ON MACHINE LEARNING MODEL FAIRNESS
The success of machine learning techniques in widespread applications has taught us that with respect to accuracy, the more data, the better the model. However, for fairness, data quality is perhaps more important than quantity. Existing studies have considered the impact of data preprocessing on the accuracy of ML model tasks. However, the impact of preprocessing on the fairness of the downstream model has neither been studied nor well understood. Throughout this thesis, we conduct a systematic study of how data quality issues and data preprocessing steps impact model fairness. Our study evaluates several preprocessing techniques for several machine learning models trained over datasets with different characteristics and evaluated using several fairness metrics. It examines different data preparation techniques, such as changing categories into numbers, filling in missing information, and smoothing out unusual data points. The study measures fairness using standards that check if the model treats all groups equally, predicts outcomes fairly, and gives similar chances to everyone. By testing these methods on various types of data, the thesis identifies which combinations of techniques can make the models both accurate and fair.The empirical analysis demonstrated that preprocessing steps like one-hot encoding, imputation of missing values, and outlier treatment significantly influence fairness metrics. Specifically, models preprocessed with median imputation and robust scaling exhibited the most balanced performance across fairness and accuracy metrics, suggesting a potential best practice guideline for equitable ML model preparation. Thus, this work sheds light on the importance of data preparation in ML and emphasizes the need for careful handling of data to support fair and ethical use of ML in society.
History
Degree Type
- Master of Science
Department
- Computer and Information Technology
Campus location
- West Lafayette