Purdue University Graduate School

Expeditious Causal Inference for Big Observational Data

thesis
posted on 2022-07-28, 13:24, authored by Yumin Zhang

This dissertation addresses two significant challenges in the causal inference workflow for Big Observational Data. The first is designing Big Observational Data with high-dimensional and heterogeneous covariates. The second is performing uncertainty quantification for estimates of causal estimands that are obtained by applying black box machine learning algorithms to the designed Big Observational Data. The methodologies developed to address these challenges are applied to the design and analysis of Big Observational Data from a large public university in the United States.

Distributed Design

A fundamental issue in causal inference for Big Observational Data is confounding due to covariate imbalances between treatment groups. This can be addressed by designing the study prior to analysis. The design ensures that subjects in different treatment groups who have comparable covariates are subclassified or matched together. Analyzing such a designed study helps to reduce biases arising from the confounding of covariates with treatment. Existing design methods, developed for traditional observational studies involving a single designer, can yield unsatisfactory designs with suboptimal covariate balance for Big Observational Data because they cannot accommodate the massive dimensionality, heterogeneity, and volume of the Big Data. We propose a new framework for the distributed design of Big Observational Data amongst collaborative designers. Our framework first assigns subsets of the high-dimensional and heterogeneous covariates to multiple designers. The designers then summarize their covariates into lower-dimensional quantities, share their summaries with the others, and design the study in parallel based on their assigned covariates and the summaries they receive. The final design is selected by comparing balance measures for all covariates across the candidate designs and choosing the best candidate. We perform simulation studies and analyze datasets from the 2016 Atlantic Causal Inference Conference Data Challenge to demonstrate the flexibility and power of our framework for constructing designs with good covariate balance from Big Observational Data.
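
The workflow described above (assign covariates, summarize, share, design in parallel, select by balance) can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. The simulated data, the split of six covariates among three designers, the one-dimensional linear summary, the subclassification into five quantile strata, and the balance measure (largest absolute standardized mean difference) are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math
import random
from statistics import mean, pvariance

def simulate(n=2000, p=6, seed=0):
    """Toy data (assumption): the first three covariates drive treatment."""
    rng = random.Random(seed)
    X, T = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        prob = 1.0 / (1.0 + math.exp(-0.5 * sum(x[:3])))
        X.append(x)
        T.append(1 if rng.random() < prob else 0)
    return X, T

def mean_diff(v, T):
    """Treated-minus-control difference in means of one variable."""
    treated = [a for a, t in zip(v, T) if t == 1]
    control = [a for a, t in zip(v, T) if t == 0]
    return mean(treated) - mean(control)

def linear_score(columns, T):
    """Hypothetical summary: sum_j w_j * x_j with w_j = mean difference / variance."""
    weights = [mean_diff(c, T) / pvariance(c) for c in columns]
    return [sum(w * c[i] for w, c in zip(weights, columns)) for i in range(len(T))]

def subclassify(score, k=5):
    """Assign each unit to one of k equal-sized strata by rank of its score."""
    order = sorted(range(len(score)), key=score.__getitem__)
    strata = [0] * len(score)
    for rank, i in enumerate(order):
        strata[i] = rank * k // len(score)
    return strata

def worst_imbalance(cols, T, strata):
    """Largest absolute standardized mean difference over ALL covariates."""
    n = len(T)
    worst = 0.0
    for c in cols:
        diff = 0.0
        for s in set(strata):
            idx = [i for i in range(n) if strata[i] == s]
            ts = [T[i] for i in idx]
            if 0 < sum(ts) < len(ts):  # skip strata lacking a treatment group
                diff += len(idx) / n * mean_diff([c[i] for i in idx], ts)
        worst = max(worst, abs(diff) / math.sqrt(pvariance(c)))
    return worst

X, T = simulate()
cols = [list(c) for c in zip(*X)]
assignment = [[0, 1], [2, 3], [4, 5]]  # hypothetical covariate split
summaries = [linear_score([cols[j] for j in own], T) for own in assignment]

candidates = []
for d, own in enumerate(assignment):
    # Each designer uses its own raw covariates plus the others' summaries.
    received = [s for e, s in enumerate(summaries) if e != d]
    features = [cols[j] for j in own] + received
    candidates.append(subclassify(linear_score(features, T)))

balances = [worst_imbalance(cols, T, c) for c in candidates]
best = min(range(len(candidates)), key=balances.__getitem__)
raw = worst_imbalance(cols, T, [0] * len(T))  # imbalance with no design
print(f"undesigned: {raw:.3f}; candidates: {[round(b, 3) for b in balances]}; best: designer {best}")
```

Each candidate design is judged on all covariates, including those its designer never saw in raw form, which mirrors the selection rule stated in the abstract.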

Designed Bootstrap

The combination of modern machine learning algorithms with the nonparametric bootstrap can enable effective predictions and inferences on Big Observational Data. An increasingly prominent and critical objective in such analyses is to draw causal inferences from the Big Observational Data. A fundamental step in addressing this objective is to design the observational study prior to the application of machine learning algorithms. However, applying the traditional nonparametric bootstrap to Big Observational Data requires excessive computational effort. This is because every bootstrap sample would need to be re-designed under the traditional approach, which can be computationally prohibitive in practice. We propose a design-based bootstrap for deriving causal inferences with reduced bias from the application of machine learning algorithms on Big Observational Data. Our bootstrap procedure operates by resampling from the original designed observational study. It eliminates the additional, costly design step that the standard nonparametric bootstrap performs on each bootstrap sample. We demonstrate the computational efficiency of this procedure compared to the traditional nonparametric bootstrap, and its equivalence in terms of confidence interval coverage rates for the average treatment effects, by means of simulation studies and a real-life case study.
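
The idea of resampling from the designed study, rather than re-designing every bootstrap sample, can be illustrated with a minimal Python sketch. It is not the dissertation's procedure. It assumes the design step has already produced matched pairs, resamples whole pairs with replacement, and uses the mean of within-pair outcome differences as a simple stand-in for a machine learning estimator of the average treatment effect. The function names, the toy data, and the percentile interval are all hypothetical choices.

```python
import random
from statistics import mean

def designed_bootstrap_ci(pairs, estimator, n_boot=1000, alpha=0.05, seed=0):
    """Percentile interval from resampling matched pairs of a designed study.

    The design (matching) is done once, before this function is called;
    no bootstrap sample is re-matched.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = rng.choices(pairs, k=len(pairs))  # resample whole pairs
        estimates.append(estimator(sample))
    estimates.sort()
    k = int(n_boot * alpha / 2)
    return estimates[k], estimates[n_boot - 1 - k]

def mean_pair_difference(sample):
    """Stand-in estimator (assumption): mean treated-minus-control difference."""
    return mean(yt - yc for yt, yc in sample)

# Toy designed study (assumption): 300 matched pairs, true effect 2.0.
rng = random.Random(1)
pairs = []
for _ in range(300):
    shared = rng.gauss(0.0, 1.0)  # covariate effect shared within a pair
    y_treated = 2.0 + shared + rng.gauss(0.0, 1.0)
    y_control = shared + rng.gauss(0.0, 1.0)
    pairs.append((y_treated, y_control))

point = mean_pair_difference(pairs)
lo, hi = designed_bootstrap_ci(pairs, mean_pair_difference)
print(f"estimate {point:.3f}, 95% interval ({lo:.3f}, {hi:.3f})")
```

Under the traditional approach, the matching step would sit inside the loop and run once per bootstrap sample; here it runs once in total, which is where the computational savings described above would come from.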

Case Study

We apply the distributed design and designed bootstrap methodologies in a case study involving institutional data from a large public university in the United States. The institutional data contain comprehensive information about the undergraduate students in the university, ranging from their academic records to on-campus activities. We study the causal effects of undergraduate students’ attempted course load on their academic performance based on a selection of covariates from these data. Ultimately, our real-life case study demonstrates how our methodologies enable researchers to effectively use straightforward design procedures to obtain valid causal inferences with reduced computational effort from the application of machine learning algorithms on Big Observational Data.


Funding

Purdue University ITaP Explanatory Modeling Project Grant

Degree Type

  • Doctor of Philosophy

Department

  • Statistics

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

Arman Sabbaghi

Additional Committee Member 2

Bruce A. Craig

Additional Committee Member 3

Raghu Pasupathy

Additional Committee Member 4

Vinayak A. P. Rao