High dimensional multiclass classification with applications to cancer diagnosis

Research output: Book/Report › Ph.D. thesis › Research

  • Martin Vincent
Probabilistic classifiers are introduced and it is shown that the only regular linear probabilistic classifier with convex risk is multinomial regression. Penalized empirical risk minimization is introduced and used to construct supervised learning methods for probabilistic classifiers. A sparse group lasso penalized approach to high dimensional multinomial classification is presented. On several real data examples this approach is found to clearly outperform multinomial lasso, both in error rate and in the number of features included in the model. An efficient coordinate descent algorithm is developed and its convergence is established. This algorithm is implemented in the msgl R package.
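As a rough illustration of the penalty named above, the following minimal Python sketch evaluates a sparse group lasso penalty for a multinomial coefficient matrix. It is not taken from the thesis or from the msgl package. The function name, the layout `beta[k][j]` (class k, feature j), the choice of one group per feature, and the unweighted form of the penalty are all assumptions made for illustration. Here `alpha = 1` gives a pure lasso penalty and `alpha = 0` a pure group lasso penalty.

```python
import math


def sparse_group_lasso_penalty(beta, lam, alpha):
    """Evaluate lam * ((1 - alpha) * sum_j ||beta[:, j]||_2 + alpha * ||beta||_1).

    beta  : list of rows, beta[k][j] = coefficient of feature j for class k
            (hypothetical layout, one group per feature across all classes)
    lam   : overall penalty strength, lam >= 0
    alpha : mixing parameter in [0, 1]; 1 = lasso, 0 = group lasso
    """
    if lam < 0 or not 0.0 <= alpha <= 1.0:
        raise ValueError("need lam >= 0 and 0 <= alpha <= 1")

    n_features = len(beta[0])

    # Lasso part: sum of absolute values of all coefficients.
    l1_term = sum(abs(b) for row in beta for b in row)

    # Group part: Euclidean norm of each feature's coefficients across classes.
    group_term = sum(
        math.sqrt(sum(row[j] ** 2 for row in beta)) for j in range(n_features)
    )

    return lam * ((1.0 - alpha) * group_term + alpha * l1_term)


# Two classes, two features; the second feature is excluded from the model.
example_beta = [[3.0, 0.0],
                [4.0, 0.0]]
mixed_penalty = sparse_group_lasso_penalty(example_beta, lam=1.0, alpha=0.5)
```

The group term is what lets a feature drop out of the model for all classes at once, while the lasso term additionally allows sparsity within a retained feature's coefficients.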
Examples of high dimensional multiclass problems are studied, in particular examples of multiclass classification based on gene expression measurements. One such example is the clinically important problem of identifying the primary tumor site of liver metastases; this particular problem is studied in detail. In order to adjust for the liver contamination found in biopsies of metastases, a computational contamination model is developed. The contamination model is formulated in a domain adaptation framework, and a simulation based domain adaptation strategy is presented. It is shown that the presented computational contamination approach drastically improves the primary tumor site classification of liver contaminated biopsies of metastases. A final classifier for identification of the primary tumor site is developed. This classifier is validated on an independent validation set consisting of liver biopsies of metastases with varying tumor content.
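To make the idea of simulation based contamination more concrete, the sketch below mixes a tumor expression profile with a liver profile at a chosen tumor content and uses this to simulate contaminated training samples. The abstract does not state the form of the contamination model, so the linear mixing on the expression scale, the function names, and the uniform sampling of tumor content are all assumptions for illustration and should not be read as the method of the thesis.

```python
import random


def contaminate(tumor_profile, liver_profile, tumor_content):
    """Mix a tumor profile with a liver profile (assumed linear mixing).

    tumor_content is the fraction of the sample that is tumor tissue, in [0, 1].
    """
    if not 0.0 <= tumor_content <= 1.0:
        raise ValueError("tumor_content must lie in [0, 1]")
    if len(tumor_profile) != len(liver_profile):
        raise ValueError("profiles must have the same number of features")
    return [
        tumor_content * t + (1.0 - tumor_content) * l
        for t, l in zip(tumor_profile, liver_profile)
    ]


def simulate_contaminated_set(tumor_profiles, liver_profile, rng,
                              min_content=0.1, max_content=1.0):
    """Return one simulated contaminated sample per tumor profile.

    Each sample gets a tumor content drawn uniformly from
    [min_content, max_content] (a hypothetical choice of distribution).
    """
    simulated = []
    for profile in tumor_profiles:
        content = rng.uniform(min_content, max_content)
        simulated.append((contaminate(profile, liver_profile, content), content))
    return simulated


# Toy data: two primary tumor profiles and one liver profile, three genes each.
rng = random.Random(0)
tumors = [[10.0, 0.0, 5.0], [2.0, 8.0, 1.0]]
liver = [0.0, 20.0, 5.0]
training_samples = simulate_contaminated_set(tumors, liver, rng)
```

A classifier trained on such simulated samples sees data closer to contaminated biopsies than one trained on pure primary tumor profiles, which is the general motivation behind a domain adaptation strategy of this kind.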
Original language: English
Publisher: Department of Mathematical Sciences, Faculty of Science, University of Copenhagen
Publication status: Published - 2013

ID: 97016368