Purdue University Graduate School
Fairness-aware Synthetic Data Generation

thesis
posted on 2025-06-27, 15:08 authored by Jinwon Sohn

Generative modeling is one of the most popular topics across various research areas. As a way to uncover the underlying data-generating process, it aims to produce synthetic data that mimics real data but is generated by a model. Such data has proven effective in diverse applications, such as preserving privacy or addressing the scarcity of training samples for large-scale AI models. At the same time, the community has paid increasing attention to the ethical aspects of generative models and their applications, including fairness, safety, and explainability. In particular, many empirical studies have shown that generative models can be discriminatory toward certain subpopulations due to reasons such as historically biased training data or imbalanced group proportions. This motivates the development of fairness-aware generative modeling or, in other words, synthetic data generation.

The first chapter of this dissertation discusses the training stability of generative adversarial modeling by examining the gradient variance of the employed neural networks. It proposes using an input distribution created by convex interpolation during training, which theoretically reduces gradient variance and contributes to stable training. Furthermore, a slight modification of the interpolation scheme enables fair data generation while maintaining training stability. The second chapter introduces a novel fairness-aware supervised learning framework. It serves as a foundation for the third chapter, which considers the use of synthetic data for supervised learning as a downstream application. The proposed framework tackles the common requirement of discrete sensitive attributes in the existing literature by introducing a new fairness penalty. The proposed penalty approximates the divergence between the joint distribution and the product of marginals and, thus, does not necessitate such discreteness. The final chapter focuses on fairness-aware synthetic data generation for fair supervised learning. It formulates a min-max bilevel optimization problem using the fairness penalty from the second chapter, aiming to convert original data into essentially fair synthetic data. The chapter provides a theoretical justification that the generated synthetic data ensures fairness across arbitrary downstream supervised learning models.
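
As a rough illustration of the convex interpolation mentioned for the first chapter, the sketch below mixes two batches of samples with random weights. It is only a minimal sketch under stated assumptions: the abstract does not say which samples are interpolated or how the weights are drawn, so pairing real with generated samples, drawing one Uniform(0, 1) weight per sample, and the names `convex_interpolate`, `real`, and `fake` are all hypothetical choices, not the dissertation's method.

```python
import numpy as np


def convex_interpolate(x_a, x_b, rng):
    """Mix two equally shaped batches row by row.

    Returns alpha * x_a + (1 - alpha) * x_b, where one weight alpha is drawn
    per row from Uniform(0, 1). The weight distribution is an assumption made
    for illustration only.
    """
    if x_a.shape != x_b.shape:
        raise ValueError("both batches must have the same shape")
    # Shape (batch, 1, ..., 1) so each row's weight broadcasts over its features.
    alpha_shape = (x_a.shape[0],) + (1,) * (x_a.ndim - 1)
    alpha = rng.uniform(0.0, 1.0, size=alpha_shape)
    return alpha * x_a + (1.0 - alpha) * x_b, alpha


rng = np.random.default_rng(0)
real = rng.normal(loc=2.0, size=(4, 3))   # stand-in for a batch of real samples
fake = rng.normal(loc=-2.0, size=(4, 3))  # stand-in for a batch of generated samples
mixed, alpha = convex_interpolate(real, fake, rng)
```

Each row of `mixed` lies on the line segment between the corresponding rows of `real` and `fake`, so the mixed batch is drawn from a distribution that sits between the two input distributions.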

Degree Type

  • Doctor of Philosophy

Department

  • Statistics

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

Qifan Song

Additional Committee Member 2

Faming Liang

Additional Committee Member 3

Vinayak Rao

Additional Committee Member 4

Will Wei Sun
