University of Copenhagen


BGC-course:

Statistical Learning and bioinformatics


News and information

17-10-2011 Final assignment
Data for final assignment

The hand-in deadline is Thursday, November 10, 2011, by email.
05-10-2011 The program for next week is now available.
19-9-2011 Information on rooms for the first week has been included in the program below.

Auditoriums 5 and 6 are located at HCØ, Universitetsparken 5 (see map below), in the basement under the E-building (the mathematics building).

Room D317 is in the D-building at HCØ, and A110 is on the second floor of the main building of HCØ.
16-9-2011 The first session, Monday morning from 9.15 to 12.00, will take place at

Copenhagen Biocenter
University of Copenhagen
Ole Maaløes Vej 5
DK-2200 Copenhagen N

in room 4032. Additional directions and the rooms for the rest of the week will be given there.
02-9-2011 The program for the first week is available. This period is unfortunately very busy for the University, so the teaching will move between a number of rooms. A guide on how to find the room for the first Monday session will appear here later, and a detailed room guide will be handed out at that session.
02-8-2011 Welcome. The course webpage is available.

Time and Place

The teaching period is September 5 to November 4. Lectures and exercises in Copenhagen take place in week 38 (September 19 to September 23) and week 41 (October 10 to October 14). The remaining weeks are for preparation, homework exercises and individual projects.

The lectures and exercises will take place at

HCØ
University of Copenhagen
Universitetsparken 5
DK-2100 Copenhagen Ø

Course Description

Remark: The primary literature for the course is the book:

Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd ed., 2009. Note that this book is freely available as a pdf-file from the webpage linked to above.

The main topics of this course are models and methods suitable for analyzing high-dimensional data, where there are typically many features compared to replications. This situation is typical in bioinformatics and is exemplified by gene expression data, where we analyze experiments with thousands of parallel measurements and few replications.

The course focuses on supervised learning, where typical approaches to high-dimensional data analysis involve flexible models combined with shrinkage or regularization, such as ridge or lasso regression, perhaps combined with basis expansion techniques such as spline regression and smoothing splines. Non-generative models such as classification and regression trees are also useful for prediction purposes.
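
The shrinkage idea mentioned above can be illustrated with a minimal sketch of ridge regression in base R. It is not taken from the course material: the simulated data, the penalty values and the helper name `ridge_coef` are assumptions made for illustration only. The data have more features than replications, so ordinary least squares has no unique solution, while the penalized problem does.

```r
set.seed(1)
n <- 20    # few replications
p <- 50    # many features

X <- scale(matrix(rnorm(n * p), n, p))   # centered and standardized columns
beta <- c(rep(2, 5), rep(0, p - 5))      # only the first 5 features matter
y <- drop(X %*% beta + rnorm(n))
y <- y - mean(y)                         # centered response, so no intercept

# Ridge estimate: solve (X'X + lambda * I) b = X'y
# (ridge_coef is a hypothetical helper, not a function from any package)
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

b_small <- ridge_coef(X, y, lambda = 0.1)
b_large <- ridge_coef(X, y, lambda = 100)

# A larger penalty shrinks the coefficient vector towards zero
c(small = sum(b_small^2), large = sum(b_large^2))
```

In practice one would probably use a package such as glmnet, which also covers the lasso, and choose the penalty by cross-validation rather than fixing it by hand.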

In the course we start with linear methods for regression and classification and move on to more advanced topics including
  • Basis expansions, splines and regularization
  • Kernel methods
  • Additive models and generalized additive models
  • Trees
  • Ensemble methods
  • Boosting
Important components of the data analysis are model selection, optimization of tuning parameters and model assessment. The course will show how to use information criteria such as AIC and BIC, as well as cross-validation, for these purposes.
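
As a rough illustration of these tools, the sketch below compares polynomial regression models of increasing degree using AIC, BIC and 5-fold cross-validation in base R. The simulated data, the candidate degrees and the number of folds are assumptions chosen for the example, not part of the course material.

```r
set.seed(2)
n <- 100
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)

degrees <- 1:8   # candidate model complexities (the tuning parameter)

# Information criteria computed from fits to the full data set
fits <- lapply(degrees, function(d) lm(y ~ poly(x, d), data = dat))
aic <- sapply(fits, AIC)
bic <- sapply(fits, BIC)

# K-fold cross-validation estimate of the squared prediction error
K <- 5
fold <- sample(rep(1:K, length.out = n))   # random fold assignment
cv_error <- sapply(degrees, function(d) {
  mean(sapply(1:K, function(k) {
    fit <- lm(y ~ poly(x, d), data = dat[fold != k, ])
    pred <- predict(fit, newdata = dat[fold == k, ])
    mean((dat$y[fold == k] - pred)^2)
  }))
})

data.frame(degree = degrees, AIC = aic, BIC = bic, CV = cv_error)
```

For each criterion, the degree with the smallest value would be the selected model. The three criteria need not agree, and BIC penalizes model size more heavily than AIC for this sample size.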

Access to good statistical software is paramount. Therefore we will illustrate the use of the models throughout the course with methods implemented in R, and the course will train the participants in using R and Bioconductor software for the analysis of genomic data.

Credit

Participants who pass the final project will receive a certificate of participation.

ECTS-credits: 7.5

Prerequisites

The participants are expected to know the theory for the multivariate normal distribution, ordinary multiple regression and linear normal models, and in particular the linear algebra associated with these models. Participants also need to be comfortable with random variables, probability measures, expectations and conditional expectations. The course will by no means focus on a formal, measure-theoretic approach, but the book makes use of, e.g., expectations and conditional expectations and their computational rules.

Participants also need some prior experience with R and an interest in practical applications to biological questions. You need to know the fundamental data structures, such as vectors, lists and data frames, and the fundamental functions, such as lm for linear models. It is probably also necessary to know how to produce graphics. Participants are expected to bring their own laptop for the exercises. We require that all participants install the latest version of R and the latest version of Bioconductor prior to the course (the specific releases will be announced on this web page when settled).

For the course we will use R version 2.13.1, and here is a list of some additional packages that you might want to install right away.
R-packages
rgl (3d-plotting)
e1071 (support vector machines)
ggplot2 (plotting)
lars (lasso regression)
glmnet (elastic net)
leaps (subset selection)
stepPlr (L2-penalized logistic regression)
rda (regularized discriminant analysis)
mboost (boosting)
mgcv (gam and model selection)
boot (bootstrapping and cross-validation)
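
One possible way to check which of the listed packages are still missing from a local installation is sketched below. The helper name `install_missing` is hypothetical, and the sketch assumes that the packages can be obtained from a configured CRAN mirror, which is not guaranteed for every package and R version.

```r
# Packages listed above for the course
pkgs <- c("rgl", "e1071", "ggplot2", "lars", "glmnet", "leaps",
          "stepPlr", "rda", "mboost", "mgcv", "boot")

# Which of them are not yet installed in the current library paths?
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
missing_pkgs

# Install only what is missing. The function is defined but not called here,
# since installation needs network access.
# (install_missing is a hypothetical helper name, not part of R.)
install_missing <- function(p = missing_pkgs) {
  if (length(p) > 0) install.packages(p)
  invisible(p)
}
```

Calling `install_missing()` would then attempt to install the missing packages.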

Program

September 5 - 16: Preparation at home.

  • Read Chapter 1 and sections 2.1-2.4. Skim sections 2.5-2.9.
  • Read about matrices and matrix decompositions. A good starting point is the Wikipedia entry on matrix decompositions. Positive definite matrices are, in particular, important and you should read about the Cholesky decomposition. The QR-decomposition and the singular value decomposition are also important.
  • Consult the R manual and R help pages on the use of "lm" for linear regression. You should be comfortable with the use of formula as well as design matrix (model matrix) specification of linear models and how you get from one to the other. See also the help pages for "model.matrix" and "formula". A useful book reference is Statistical Models in S, eds J. M. Chambers and T. J. Hastie.
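
As a small self-check for the preparation items above, the following sketch shows how a formula corresponds to a design matrix, and how the least squares estimate can be computed from that matrix using the QR-decomposition, the Cholesky decomposition and the singular value decomposition. The simulated data and variable names are assumptions made for illustration.

```r
set.seed(3)
dat <- data.frame(
  x = rnorm(30),
  g = factor(rep(c("a", "b", "c"), each = 10))
)
dat$y <- 1 + 2 * dat$x + c(0, 1, -1)[dat$g] + rnorm(30, sd = 0.5)

# Formula specification of the linear model
fit <- lm(y ~ x + g, data = dat)

# The design (model) matrix built from the same formula;
# the factor g is expanded into dummy columns by treatment contrasts
X <- model.matrix(y ~ x + g, data = dat)
colnames(X)

# Least squares via the QR-decomposition
b_qr <- qr.coef(qr(X), dat$y)

# Least squares via the Cholesky decomposition of the positive definite X'X
R <- chol(crossprod(X))   # X'X = R'R with R upper triangular
b_chol <- backsolve(R, forwardsolve(t(R), crossprod(X, dat$y)))

# Least squares via the singular value decomposition X = U D V'
s <- svd(X)
b_svd <- s$v %*% (crossprod(s$u, dat$y) / s$d)

# All four columns agree up to numerical error
cbind(lm = coef(fit), qr = b_qr, chol = drop(b_chol), svd = drop(b_svd))
```

Passing the design matrix directly, as in `lm(y ~ X - 1, data = dat)`, should give the same coefficients as the formula specification.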


Below you find information on which sections in the book we cover and when. There will also be a number of practical exercises, which will be made available during the course. They will consist mostly of small R exercises for training the use of R on various problems. Usually you will be given approximately 30-45 minutes to solve an exercise on your own computer. Solutions will be provided.

Registration

To register for the course, send an email to Niels Richard Hansen. The number of participants is limited to 20 students. In case of overbooking, students from the universities participating in the BGC-network will be given priority.

Miscellaneous

Material

Primary literature for the course is

The Elements of Statistical Learning.
Data Mining, Inference, and Prediction
2nd ed.


See also the web page for the book The Elements of Statistical Learning for links to data, R resources, errata, etc.

For additional reading we recommend the books:

Bioinformatics and Computational Biology Solutions
Using R and Bioconductor

Bioconductor Case Studies

Directions and accommodation

Please find information on directions and accommodation on our website. Note that we cannot offer financial support to participants.



Lecturer

Niels Richard Hansen
Department of Mathematical Sciences