Stabilizing variable selection and regression

Research output: Contribution to journal; Journal article; Research; peer-review

Standard

Stabilizing variable selection and regression. / Pfister, Niklas; Williams, Evan G.; Peters, Jonas; Aebersold, Ruedi; Bühlmann, Peter.

In: Annals of Applied Statistics, Vol. 15, No. 3, 2021, p. 1220-1246.

Harvard

Pfister, N, Williams, EG, Peters, J, Aebersold, R & Bühlmann, P 2021, 'Stabilizing variable selection and regression', Annals of Applied Statistics, vol. 15, no. 3, pp. 1220-1246. https://doi.org/10.1214/21-AOAS1487

APA

Pfister, N., Williams, E. G., Peters, J., Aebersold, R., & Bühlmann, P. (2021). Stabilizing variable selection and regression. Annals of Applied Statistics, 15(3), 1220-1246. https://doi.org/10.1214/21-AOAS1487

Vancouver

Pfister N, Williams EG, Peters J, Aebersold R, Bühlmann P. Stabilizing variable selection and regression. Annals of Applied Statistics. 2021;15(3):1220-1246. https://doi.org/10.1214/21-AOAS1487

Author

Pfister, Niklas ; Williams, Evan G. ; Peters, Jonas ; Aebersold, Ruedi ; Bühlmann, Peter. / Stabilizing variable selection and regression. In: Annals of Applied Statistics. 2021 ; Vol. 15, No. 3. pp. 1220-1246.

Bibtex

@article{987ee313b0734fd9bf4e10cbdb1152f0,
title = "Stabilizing variable selection and regression",
abstract = "We consider regression in which one predicts a response Y with a set of predictors X across different experiments or environments. This is a common setup in many data-driven scientific fields, and we argue that statistical inference can benefit from an analysis that takes into account the distributional changes across environments. In particular, it is useful to distinguish between stable and unstable predictors, that is, predictors which have a fixed or a changing functional dependence on the response, respectively. We introduce stabilized regression which explicitly enforces stability and thus improves generalization performance to previously unseen environments. Our work is motivated by an application in systems biology. Using multiomic data, we demonstrate how hypothesis generation about gene function can benefit from stabilized regression. We believe that a similar line of arguments for exploiting heterogeneity in data can be powerful for many other applications as well. We draw a theoretical connection between multi-environment regression and causal models which allows to graphically characterize stable vs. unstable functional dependence on the response. Formally, we introduce the notion of a stable blanket which is a subset of the predictors that lies between the direct causal predictors and the Markov blanket. We prove that this set is optimal in the sense that a regression based on these predictors minimizes the mean squared prediction error, given that the resulting regression generalizes to unseen new environments.",
keywords = "Causality, Multiomic data, Regression, Variable selection",
author = "Niklas Pfister and Williams, {Evan G.} and Jonas Peters and Ruedi Aebersold and Peter B{\"u}hlmann",
year = "2021",
doi = "10.1214/21-AOAS1487",
language = "English",
volume = "15",
pages = "1220--1246",
journal = "Annals of Applied Statistics",
issn = "1932-6157",
publisher = "Institute of Mathematical Statistics",
number = "3",

}

RIS

TY - JOUR

T1 - Stabilizing variable selection and regression

AU - Pfister, Niklas

AU - Williams, Evan G.

AU - Peters, Jonas

AU - Aebersold, Ruedi

AU - Bühlmann, Peter

PY - 2021

Y1 - 2021

N2 - We consider regression in which one predicts a response Y with a set of predictors X across different experiments or environments. This is a common setup in many data-driven scientific fields, and we argue that statistical inference can benefit from an analysis that takes into account the distributional changes across environments. In particular, it is useful to distinguish between stable and unstable predictors, that is, predictors which have a fixed or a changing functional dependence on the response, respectively. We introduce stabilized regression which explicitly enforces stability and thus improves generalization performance to previously unseen environments. Our work is motivated by an application in systems biology. Using multiomic data, we demonstrate how hypothesis generation about gene function can benefit from stabilized regression. We believe that a similar line of arguments for exploiting heterogeneity in data can be powerful for many other applications as well. We draw a theoretical connection between multi-environment regression and causal models which allows to graphically characterize stable vs. unstable functional dependence on the response. Formally, we introduce the notion of a stable blanket which is a subset of the predictors that lies between the direct causal predictors and the Markov blanket. We prove that this set is optimal in the sense that a regression based on these predictors minimizes the mean squared prediction error, given that the resulting regression generalizes to unseen new environments.

AB - We consider regression in which one predicts a response Y with a set of predictors X across different experiments or environments. This is a common setup in many data-driven scientific fields, and we argue that statistical inference can benefit from an analysis that takes into account the distributional changes across environments. In particular, it is useful to distinguish between stable and unstable predictors, that is, predictors which have a fixed or a changing functional dependence on the response, respectively. We introduce stabilized regression which explicitly enforces stability and thus improves generalization performance to previously unseen environments. Our work is motivated by an application in systems biology. Using multiomic data, we demonstrate how hypothesis generation about gene function can benefit from stabilized regression. We believe that a similar line of arguments for exploiting heterogeneity in data can be powerful for many other applications as well. We draw a theoretical connection between multi-environment regression and causal models which allows to graphically characterize stable vs. unstable functional dependence on the response. Formally, we introduce the notion of a stable blanket which is a subset of the predictors that lies between the direct causal predictors and the Markov blanket. We prove that this set is optimal in the sense that a regression based on these predictors minimizes the mean squared prediction error, given that the resulting regression generalizes to unseen new environments.

KW - Causality

KW - Multiomic data

KW - Regression

KW - Variable selection

UR - http://www.scopus.com/inward/record.url?scp=85115004572&partnerID=8YFLogxK

U2 - 10.1214/21-AOAS1487

DO - 10.1214/21-AOAS1487

M3 - Journal article

AN - SCOPUS:85115004572

VL - 15

SP - 1220

EP - 1246

JO - Annals of Applied Statistics

JF - Annals of Applied Statistics

SN - 1932-6157

IS - 3

ER -
