Inference for feature selection using the Lasso with high-dimensional data

Publication: Other › Other contribution › Research

Standard

Inference for feature selection using the Lasso with high-dimensional data. / Brink-Jensen, Kasper; Ekstrøm, Claus Thorn.

20 pp. 2014. (arXiv.org: Statistics).


Harvard

Brink-Jensen, K & Ekstrøm, CT 2014, Inference for feature selection using the Lasso with high-dimensional data. <http://arxiv.org/pdf/1403.4296v1.pdf>

APA

Brink-Jensen, K., & Ekstrøm, C. T. (2014, March 19). Inference for feature selection using the Lasso with high-dimensional data. arXiv.org: Statistics. http://arxiv.org/pdf/1403.4296v1.pdf

Vancouver

Brink-Jensen K, Ekstrøm CT. Inference for feature selection using the Lasso with high-dimensional data. 2014. 20 p.

Author

Brink-Jensen, Kasper ; Ekstrøm, Claus Thorn. / Inference for feature selection using the Lasso with high-dimensional data. 2014. 20 pp. (arXiv.org: Statistics).

Bibtex

@misc{d5b6b7058d5a45799fc1ba5cd53c23be,
title = "Inference for feature selection using the Lasso with high-dimensional data",
abstract = "Penalized regression models such as the Lasso have proved useful for variable selection in many fields - especially for situations with high-dimensional data where the number of predictors far exceeds the number of observations. These methods identify and rank variables of importance but do not generally provide any inference for the selected variables. Thus, the variables selected might be the {"}most important{"} but need not be significant. We propose a significance test for the selection found by the Lasso. We introduce a procedure that computes inference and p-values for features chosen by the Lasso. This method rephrases the null hypothesis and uses a randomization approach which ensures that the error rate is controlled even for small samples. We demonstrate the ability of the algorithm to compute $p$-values of the expected magnitude with simulated data using a multitude of scenarios that involve various effect strengths and correlation between predictors. The algorithm is also applied to a prostate cancer dataset that has been analyzed in recent papers on the subject. The proposed method is found to provide a powerful way to make inference for feature selection even for small samples and when the number of predictors is several orders of magnitude larger than the number of observations. The algorithm is implemented in the MESS package in R and is freely available.",
keywords = "stat.ME",
author = "Kasper Brink-Jensen and Ekstr{\o}m, {Claus Thorn}",
year = "2014",
month = mar,
day = "19",
language = "English",
series = "arXiv.org: Statistics",
publisher = "Cornell University Library",
type = "Other",
}

RIS

TY - GEN

T1 - Inference for feature selection using the Lasso with high-dimensional data

AU - Brink-Jensen, Kasper

AU - Ekstrøm, Claus Thorn

PY - 2014/3/19

Y1 - 2014/3/19

N2 - Penalized regression models such as the Lasso have proved useful for variable selection in many fields - especially for situations with high-dimensional data where the number of predictors far exceeds the number of observations. These methods identify and rank variables of importance but do not generally provide any inference for the selected variables. Thus, the variables selected might be the "most important" but need not be significant. We propose a significance test for the selection found by the Lasso. We introduce a procedure that computes inference and p-values for features chosen by the Lasso. This method rephrases the null hypothesis and uses a randomization approach which ensures that the error rate is controlled even for small samples. We demonstrate the ability of the algorithm to compute $p$-values of the expected magnitude with simulated data using a multitude of scenarios that involve various effect strengths and correlation between predictors. The algorithm is also applied to a prostate cancer dataset that has been analyzed in recent papers on the subject. The proposed method is found to provide a powerful way to make inference for feature selection even for small samples and when the number of predictors is several orders of magnitude larger than the number of observations. The algorithm is implemented in the MESS package in R and is freely available.

AB - Penalized regression models such as the Lasso have proved useful for variable selection in many fields - especially for situations with high-dimensional data where the number of predictors far exceeds the number of observations. These methods identify and rank variables of importance but do not generally provide any inference for the selected variables. Thus, the variables selected might be the "most important" but need not be significant. We propose a significance test for the selection found by the Lasso. We introduce a procedure that computes inference and p-values for features chosen by the Lasso. This method rephrases the null hypothesis and uses a randomization approach which ensures that the error rate is controlled even for small samples. We demonstrate the ability of the algorithm to compute $p$-values of the expected magnitude with simulated data using a multitude of scenarios that involve various effect strengths and correlation between predictors. The algorithm is also applied to a prostate cancer dataset that has been analyzed in recent papers on the subject. The proposed method is found to provide a powerful way to make inference for feature selection even for small samples and when the number of predictors is several orders of magnitude larger than the number of observations. The algorithm is implemented in the MESS package in R and is freely available.

KW - stat.ME

M3 - Other contribution

T3 - arXiv.org: Statistics

ER -
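
The abstract describes a randomization approach for attaching p-values to Lasso-selected features; the authors' reference implementation is in the MESS package for R. As a rough illustration of the general idea only (a generic permutation test, not the paper's exact procedure; the data, penalty value, and permutation count below are all invented for the example), a minimal Python sketch with scikit-learn might look like:

```python
# Illustrative permutation test for Lasso-selected features.
# NOTE: a generic sketch of the randomization idea, NOT the exact
# procedure of Brink-Jensen & Ekstrøm (2014); their implementation
# is in the MESS package for R.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Simulated high-dimensional data: p >> n, first 3 predictors matter.
n, p = 50, 200
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 2.0
y = X @ beta + rng.standard_normal(n)

alpha = 0.1  # penalty; in practice chosen e.g. by cross-validation
fit = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
selected = np.flatnonzero(fit.coef_)

# Null distribution: refit the Lasso on permuted responses and record
# the largest absolute coefficient in each refit.
B = 200
null_max = np.empty(B)
for b in range(B):
    y_perm = rng.permutation(y)
    null_max[b] = np.max(
        np.abs(Lasso(alpha=alpha, max_iter=10_000).fit(X, y_perm).coef_)
    )

# p-value per selected feature: fraction of permutations whose largest
# null coefficient matches or exceeds the observed one (+1 correction
# keeps the test valid for finite B).
pvals = {
    j: (1 + np.sum(null_max >= abs(fit.coef_[j]))) / (B + 1)
    for j in selected
}
for j, pv in sorted(pvals.items()):
    print(f"feature {j}: coef = {fit.coef_[j]:+.3f}, p ≈ {pv:.3f}")
```

Note that permuting the response preserves its marginal variance (signal included), which tends to make this naive null conservative; handling that, and controlling the error rate for small samples, is exactly where the paper's rephrased null hypothesis comes in.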

ID: 138917217