Variable Selection with Model-X Knockoffs

Master's thesis defense by Marie Holst Mørch

Title: Variable Selection with Model-X Knockoffs

Abstract: Contemporary scientific studies often involve identifying, among many candidate variables, the relevant explanatory variables that influence a response. In this thesis, we consider the recently introduced Model-X knockoffs variable selection procedure for this purpose. As long as the distribution of the covariates is completely known, the Model-X knockoffs procedure allows us to select a model with a guaranteed bound on the false discovery rate (FDR) in finite samples, with no model assumptions on the conditional distribution of the response. The method is also applicable in settings where the number of predictors may exceed the number of samples. The procedure operates by constructing a knockoff copy of each original variable; these copies serve as negative controls that ensure the procedure does not select too many irrelevant features. We examine the theoretical foundation of the procedure and establish the main result: rigorous control of the false discovery rate in this broad setting. Furthermore, we examine the robustness and applicability of the procedure through simulations, including the setting where the covariate distribution is estimated from data, where we provide experimental evidence that the procedure is robust to errors in the covariate distribution. In the search for theoretical guarantees of power and FDR when the covariate distribution is unknown, we examine the idea of data splitting and the RANK procedure. Finally, we apply the Model-X procedure to a real RNA-seq dataset from multiple myeloma patients.
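
As a rough illustration of how knockoff copies act as negative controls, the selection step can be sketched in Python. This is a minimal sketch, not the thesis's implementation. It assumes that a vector W of feature statistics has already been computed, where a large positive W[j] means the original variable j looked more important than its knockoff, and it applies the knockoff+ threshold from the knockoff literature. The function names, the variable names and the example numbers are hypothetical.

```python
import numpy as np


def knockoff_plus_threshold(W, q):
    """Smallest t > 0 whose estimated false discovery proportion is <= q.

    Assumption: W[j] > 0 favours the original variable j and W[j] < 0
    favours its knockoff copy. Returns np.inf if no threshold qualifies.
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the distinct nonzero magnitudes of W.
    candidates = np.sort(np.unique(np.abs(W[W != 0])))
    for t in candidates:
        # Negative statistics estimate how many nulls sit above t;
        # the "+1" is the knockoff+ correction.
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf


def knockoff_select(W, q):
    """Indices of the variables whose statistic reaches the threshold."""
    tau = knockoff_plus_threshold(W, q)
    return np.flatnonzero(np.asarray(W, dtype=float) >= tau)


# Hypothetical statistics for ten candidate variables.
W_example = [5.0, 4.0, 3.0, 2.5, -1.0, 2.0, 1.5, 0.5, -0.2, 6.0]
selected = knockoff_select(W_example, q=0.2)
```

The sketch covers only the thresholding step; constructing valid knockoff copies and computing W depend on the covariate distribution and are not shown.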

Supervisor: Niels Richard Hansen
External examiner: Asger Hobolth