OPTIMIZING MACHINE LEARNING PIPELINES FOR MODEL PERFORMANCE
Data pipelines are core components of machine learning systems: they move data through successive stages and apply transformations that improve data quality for model training, which in turn improves model performance and efficiency. As data volumes grow, however, optimizing these pipelines becomes increasingly complex, which can degrade performance and raise the cost of finding the optimal pipeline. Data-centric systems, which are trained on historical data, are found across many sectors, including finance, education, marketing, and healthcare. After training, these systems must be monitored and continuously tested to ensure that they perform well on new incoming data. When a system fails on new incoming data, debugging is needed to find the data point that is causing the failure. Finding the optimal pipeline for the new data can also be daunting. In this research, we address these challenges by proposing an approach that uses the GRASP method to find a new pipeline and a data profile to find the cause of the disconnect between the pipeline and the data.
Degree Type
- Master of Science
Department
- Computer and Information Technology
Campus location
- West Lafayette