Handling Complexity via Statistical Methods
Phenomena arising in complex systems are characteristically dynamic, multidimensional, and nonlinear. Their traits can be captured through data generating mechanisms (DGMs) that explain the interactions among the systems’ components. Measurement is fundamental to advancing science, and handling complexity requires a departure from linear thinking. Simplifying the measurement of complex and heterogeneous data in statistical methodology can compromise accuracy. In particular, conventional statistical methods make assumptions about the DGM that are rarely met in the real world, which can make inference inaccurate. We posit that causal inference for complex-systems phenomena requires, at a minimum, incorporating subject-matter knowledge and using dynamic metrics in statistical methods to improve its accuracy.
This thesis addresses two separate topics on handling complexity in data and in data generating mechanisms: the evaluation of bundled nutrition interventions and the modeling of atmospheric data.
First, when a public health problem requires multiple ways to address its contributing factors, bundling the approaches can be cost-effective. Scaling up bundled interventions geographically requires a hierarchical implementation structure, with central coordination and supervision of multiple sites and of the staff delivering the bundled intervention. The experimental design to evaluate such an intervention becomes complex because it must accommodate the multiple intervention components and the hierarchical implementation structure. The components of a bundled intervention may affect targeted outcomes additively or synergistically. However, noncompliance and protocol deviations can impede this potential impact and introduce data complexities. We identify several statistical considerations and recommendations for the implementation and evaluation of bundled interventions.
The simple aggregate metrics used in cluster randomized controlled trials do not use all available information, and findings are prone to the ecological fallacy, in which inference at the aggregate level may not hold at the disaggregate level. Further, implementation heterogeneity reduces statistical power and consequently the accuracy of inference from a conventional comparison with a control arm. Intention-to-treat analysis can therefore be inadequate for bundled interventions. We developed novel process-driven, disaggregated participation metrics to examine the mechanisms of impact of the Agriculture to Nutrition (ATONU) bundled intervention (ClinicalTrials.gov Identifier: NCT03152227). Logistic and beta-logistic hierarchical models were used to characterize these metrics, and generalized mixed models were employed to identify determinants of the study outcome, dietary diversity for women of reproductive age. Mediation analysis was applied to explore how the intervention affects the outcome through the process metrics. The determinants of greater participation should be targeted to improve the implementation of future bundled interventions.
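The ecological fallacy mentioned above can be illustrated with a small synthetic example. This is a minimal sketch, not the ATONU data or analysis: the cluster layout, the variable names `x` and `y`, and all numbers are invented for illustration. The data are constructed so that cluster-level means show a positive association while the association within every cluster is negative.

```python
import numpy as np

# Hypothetical data: 5 clusters, 5 individuals per cluster (all values invented).
offsets = np.array([-0.4, -0.2, 0.0, 0.2, 0.4])
clusters = np.arange(5)

# Individual-level exposure x and outcome y, one row per cluster.
# Cluster means rise together, but within a cluster y falls as x rises.
x = clusters[:, None] + offsets[None, :]
y = 2.0 * clusters[:, None] - offsets[None, :]

# Aggregate analysis: regress cluster means of y on cluster means of x.
aggregate_slope = np.polyfit(x.mean(axis=1), y.mean(axis=1), 1)[0]

# Disaggregate analysis: fit a separate slope within each cluster.
within_slopes = np.array([np.polyfit(x[c], y[c], 1)[0] for c in clusters])

print("slope from cluster means:", round(aggregate_slope, 3))
print("slopes within clusters:  ", np.round(within_slopes, 3))
```

In this constructed case the aggregate slope and the within-cluster slopes have opposite signs, which is why a conclusion drawn from cluster-level summaries need not hold for individuals.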
Second, observed atmospheric records are often prohibitively short, and typically only one record is available for study. Classical nonlinear time series models applied to explain the nonlinear DGM reproduce some statistical properties of the phenomena being investigated, but do not reflect their physical properties. The data’s complex dependence structure invalidates inference from classical time series models, which rely on strong statistical assumptions rarely met in real atmospheric and climate data. The subsampling method may yield valid statistical inference. Atmospheric records, however, are typically too short to satisfy the asymptotic conditions for the method’s validity, which necessitates enhancing subsampling with approximating models (those sharing statistical properties with the series under study).
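As a rough illustration of the basic subsampling idea referred to above, the sketch below builds a confidence interval for the mean of a dependent series from overlapping blocks. It is a generic textbook-style version under stated assumptions, not the enhanced procedure developed in the thesis: the function name `subsampling_ci`, the square-root convergence rate, the block length, and the AR(1) test series are all choices made here for illustration, and no approximating model is used.

```python
import numpy as np

def subsampling_ci(x, block_len, stat=np.mean, alpha=0.05):
    """Subsampling confidence interval for a statistic of a stationary series.

    Assumes the statistic converges at rate sqrt(n), which is appropriate
    for the mean of a weakly dependent series (an assumption of this sketch).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    if not 1 < block_len < n:
        raise ValueError("block_len must satisfy 1 < block_len < len(x)")

    theta_n = stat(x)

    # Evaluate the statistic on every overlapping block of consecutive values.
    blocks = np.lib.stride_tricks.sliding_window_view(x, block_len)
    theta_b = np.array([stat(b) for b in blocks])

    # Rescaled, recentered block statistics approximate the sampling
    # distribution of sqrt(n) * (theta_n - theta).
    roots = np.sqrt(block_len) * (theta_b - theta_n)
    q_lo, q_hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])

    return theta_n - q_hi / np.sqrt(n), theta_n - q_lo / np.sqrt(n)

# Usage on a simulated AR(1) series (illustrative only, not atmospheric data).
rng = np.random.default_rng(0)
noise = rng.normal(size=500)
series = np.empty(500)
series[0] = noise[0]
for t in range(1, 500):
    series[t] = 0.6 * series[t - 1] + noise[t]

lower, upper = subsampling_ci(series, block_len=25)
print("95% subsampling CI for the mean:", (round(lower, 3), round(upper, 3)))
```

The interval's validity rests on the block length growing with the record while remaining small relative to it, which is exactly the condition that short records strain.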
Gyrostat models (G-models) are physically sound low-order models derived from the governing equations of atmospheric dynamics, and thus retain some of their fundamental statistical and physical properties. We have demonstrated statistically that using G-models as approximating models in place of traditional time series models results in more precise subsampling confidence intervals with improved coverage probabilities. Future work will explore other types of G-models as approximating models for inference on atmospheric data. We will also adopt this technique for inference on phenomena in astrostatistics and pharmacokinetics.
- Doctor of Philosophy
- West Lafayette