Scalable Constraint-Based Causal Discovery with Multiple Imputation for Incomplete Data

Master's thesis defense: Frederik Fabricius-Bjerre

Title: Scalable Constraint-Based Causal Discovery with Multiple Imputation for Incomplete Data

Abstract: Constraint-based causal discovery methods, such as the PC-algorithm, use conditional independence tests to discover the causal structure of the underlying data-generating mechanism. Multiple imputation provides a remedy for missing data and has been shown to be useful in combination with the PC-algorithm [1]. Using multiple imputation rather than single imputation yields more conservative conditional independence tests for the PC-algorithm. However, the impact of the number of imputations on the PC-algorithm has not been investigated thoroughly. This thesis examines the integration of multiple imputation for missing data with the PC-algorithm as well as the temporal PC-algorithm, assessing both computational efficiency and quality of structure estimation. Furthermore, GPU parallelization of the (temporal) PC-algorithm with multiple imputation for Gaussian data is implemented as a computational enhancement to mitigate the substantial runtime of the combined procedure, extending the cuPC implementation [2]. We also establish an upper bound on the number of conditional independence tests conducted by the temporal PC-algorithm. Comprehensive simulation studies evaluate how varying the number of imputations and using different degrees-of-freedom approximations affect the quality of causal structure estimation for linear Gaussian structural causal models with varying structures. The results demonstrate that the GPU implementation substantially improves runtime, making it feasible to perform a large number of imputations, with computational time potentially scaling only sub-linearly. Crucially, the results also indicate that increasing the number of imputations, even up to m = 1000, yields higher structure estimation quality for the PC-algorithm and the temporal PC-algorithm in the presence of incomplete data.

Supervisors: Niels Richard Hansen
             Anne Helby Petersen, SUND
External examiner: Søren Wengel Mogensen, Lund University