Purdue University Graduate School

OPTIMIZING MACHINE LEARNING PIPELINES FOR MODEL PERFORMANCE

thesis
posted on 2024-12-10, 16:11 authored by Tejendra Pratap Singh

Data pipelines are core components of machine learning systems: they move data through successive stages and apply transformations that improve data quality for model training, which in turn improves model performance and efficiency. As data volumes grow, however, optimizing these pipelines becomes increasingly complex, which can degrade performance and raise the cost of finding the optimal pipeline. Data-centric systems, which are trained on historical data, are found across sectors including finance, education, marketing, and healthcare. After training, these systems must be monitored and continuously tested to ensure that they perform well on new incoming data. When a system fails on new incoming data, debugging is needed to find the data point that is causing the failure. Finding the optimal pipeline for the new data can also be daunting. In this research, we address these challenges by proposing an approach that uses the GRASP method to find a new pipeline and a data profile to find the cause of the disconnect between the pipeline and the data.

Degree Type

  • Master of Science

Department

  • Computer and Information Technology

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

Romila Pradhan

Additional Committee Member 2

John Springer

Additional Committee Member 3

Tianyi Li
