University of Copenhagen


BMC-course:

Statistical Learning and bioinformatics


News and information

18-6-2009 It has come to my attention that using ggplot2 0.8.2 resulted in incorrect colors on some of the plots. I could claim that this was a "feature", but it is fairer to say it was actually a bug; at least, installing 0.8.3 makes the "feature" disappear. Since I explicitly wrote that we used 0.8.2, it is technically correct to say that your programs did not produce the correct colors under the stated requirements. However, if you are wondering what on earth I meant when I wrote that your colors are messed up, because on your computer everything works fine, just forget it. The culprit was a badly designed default method for choosing the ordering of the colors, which has been changed in 0.8.3.
11-6-2009 I have emailed you the comments for the first assignment. A possible solution is available. There is also a version where I don't remove the outlier.
1-6-2009 Second assignment has been uploaded.
1-5-2009 First assignment has been uploaded.
27-4-2009 There was an unfortunate interchange of (X^TX)^{-1} and C^TC in the proof of the Gauss-Markov theorem in the handout (page 4). This has been corrected, together with a few other misprints, in the version you can download.
27-4-2009 Lectures Monday 9-12 and Wednesday take place in Auditorium 4, HCØ. Other classes take place in classroom N037, Computer Science (Datalogisk Institut, just outside the main entrance of HCØ).
24-4-2009 The first lecture, Monday April 27, starts at 9.15 in Auditorium 4 -- see the map. This is in the HCØ building. You will then be given directions for the remaining lectures and exercises. Information on R requirements is provided below. An eduroam wireless network is available, provided your laptop has been set up for it at your home university. I have also provided a link to the theoretical exercises, but you will get a handout of them on Monday.
7-4-2009 A program for the preparation and a more detailed schedule for the first week of teaching is now available.

Time and Place

The course takes place from April 20 to June 19, 2009. There are lectures and exercises in week 18 (April 27 - May 1) and week 22 (May 25 - May 31) at the University of Copenhagen. For the remaining weeks there are preparation, homework exercises and individual projects.

The lectures and exercises will take place at
Auditorium 4, HCØ
University of Copenhagen
Universitetsparken 5
DK-2200 Copenhagen Ø

Course Description

Remark: The primary literature for the course is the book:

Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd ed., 2009.

The book has just come out in a second edition, which we will use. Since I have not yet had the book in my hands, the precise content of the course may be subject to minor changes depending on the exact nature of the revisions in this second edition.

The main topic of this course is models and methods suitable for analyzing high-dimensional data, where there are typically many features compared to the number of replications. This situation is common in bioinformatics and is exemplified by gene expression data, where we analyze experiments with thousands of parallel measurements and few replications.

The course focuses on supervised learning, where typical approaches to high-dimensional data analysis involve flexible models combined with shrinkage or regularization algorithms, such as ridge or Lasso regression, perhaps combined with basis expansion techniques such as spline regression and smoothing splines. Non-generative models such as classification and regression trees are also useful for prediction purposes.
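To give a first taste of what shrinkage does, here is a minimal base-R sketch of ridge regression on simulated data. The `ridge` helper and the toy data are invented for illustration; in the course itself the lars and glmnet packages will be used for this kind of fitting.

```r
# Ridge regression by hand: minimize ||y - X b||^2 + lambda ||b||^2.
# Illustrative sketch only; the course uses packages such as glmnet.
set.seed(1)
n <- 20; p <- 5
X <- matrix(rnorm(n * p), n, p)
beta <- c(2, -1, 0, 0, 0)
y <- X %*% beta + rnorm(n)

ridge <- function(X, y, lambda) {
  # Closed-form ridge solution: (X'X + lambda I)^{-1} X'y
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

b0  <- ridge(X, y, 0)    # lambda = 0 gives ordinary least squares
b10 <- ridge(X, y, 10)   # positive lambda shrinks the coefficients

sum(b10^2) < sum(b0^2)   # TRUE: the ridge solution has smaller norm
```

The point of the sketch is that regularization trades a little bias for a large reduction in variance, which is exactly what pays off when p is large relative to n.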

In the course we start with linear methods for regression and classification and move on to four more advanced topics:
  • Basis expansions, splines and regularization
  • Kernel methods
  • Additive models and generalized additive models
  • Trees
An important component of the data analysis is model selection and the optimization of tuning parameters. The course will show how to use AIC and BIC, as well as cross-validation and bootstrapping, for these purposes.
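The cross-validation idea can be illustrated with a hand-rolled leave-one-out loop in base R; the boot package listed below provides cv.glm for doing this properly. The data and the `loocv` helper here are made up for illustration.

```r
# Leave-one-out cross-validation for a linear model, by hand (base R).
set.seed(2)
x <- runif(50)
y <- 1 + 2 * x + rnorm(50, sd = 0.3)
d <- data.frame(x = x, y = y)

loocv <- function(formula, data) {
  n <- nrow(data)
  errs <- vapply(seq_len(n), function(i) {
    fit <- lm(formula, data = data[-i, ])               # fit without observation i
    (data$y[i] - predict(fit, newdata = data[i, ]))^2   # squared prediction error
  }, numeric(1))
  mean(errs)
}

cv1 <- loocv(y ~ x, d)           # simple linear fit
cv5 <- loocv(y ~ poly(x, 5), d)  # flexible 5th-degree polynomial

# AIC offers an alternative, cheaper model comparison:
AIC(lm(y ~ x, data = d))
AIC(lm(y ~ poly(x, 5), data = d))
```

Both criteria estimate out-of-sample prediction error, which is what tuning-parameter selection should target rather than the training error.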

Access to good statistical software is paramount. Therefore we will illustrate the use of the models throughout the course with methods implemented in R, and the course will train the participants in using R and Bioconductor software for the analysis of expression data.

Credit

Participants who pass the final project will receive a certificate of participation.

ECTS-credits: 7.5

Prerequisites

The participants are expected to know the theory of the multivariate normal distribution, ordinary multiple regression and linear normal models, and in particular the linear algebra associated with these models. Participants also need to be confident with random variables, probability measures, expectations and conditional expectations. Though the course will by no means focus on a formal, measure-theoretic approach, the book does use e.g. expectations and conditional expectations and their computational rules.

Participants also need some prior experience with R and an interest in practical applications to biological questions. You need to know the fundamental data structures, such as vectors, lists and data frames, and the fundamental functions, such as lm for linear models; it is probably also necessary to know how to produce graphics. The participants are also expected to bring their own laptop for the exercises. We require that all participants install the latest versions of R and Bioconductor prior to the course (the exact releases will be announced on this web page when settled).
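As a small self-check of the R fluency expected, the following base-R sketch fits the same linear model via the formula interface and via its design matrix; the toy data are invented for illustration.

```r
# The same linear model specified via a formula and via its design matrix.
d <- data.frame(y = c(1.1, 1.9, 3.2, 3.9),
                x = c(1, 2, 3, 4))

fit_formula <- lm(y ~ x, data = d)    # formula interface
X <- model.matrix(~ x, data = d)      # design (model) matrix: intercept and x
fit_matrix <- lm.fit(X, d$y)          # fit directly from the matrix

coef(fit_formula)                     # identical coefficients either way
fit_matrix$coefficients
```

If moving between these two specifications feels unfamiliar, the help pages for "lm", "model.matrix" and "formula" mentioned in the program below are a good place to catch up.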

As it looks right now, R version 2.9.0 will be used with the additional packages below. We will not use any particular Bioconductor packages this week. Additional packages may be added for the last part of the week and for the second week.
R-packages
rgl 0.84 (3d plotting)
e1071 1.5-19 (support vector machines)
ggplot2 0.8.2 (plotting)
lars 0.9-7 (lasso regression)
glmnet 1.1-3 (elastic net)
leaps 2.8 (subset selection)
stepPlr 0.9-1 (L2-penalized logistic regression)
rda 1.0.2 (regularized discriminant analysis)
mboost 1.1-1 (boosting)
mgcv 1.5-2 (gam and model selection)
boot 1.2-36 (bootstrapping and cross-validation)

Program


April 20 - 24: Preparation at home.

  • Read Chapter 1 and sections 2.1-2.4. Skim the sections 2.5-2.9
  • Read about matrices and matrix decompositions. A good starting point is the Wikipedia entry on matrix decompositions. Positive definite matrices are in particular important and you should read about the Cholesky decomposition. The QR-decomposition and the singular value decomposition are also important.
  • Consult the R manual and R help pages on the use of "lm" for linear regression. You should be comfortable with the use of formula as well as design matrix (model matrix) specification of linear models and how you get from one to the other. See also the help pages for "model.matrix" and "formula". A useful book reference is Statistical Models in S, eds J. M. Chambers and T. J. Hastie.
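The three decompositions mentioned above can all be tried out directly in base R. The matrices below are arbitrary examples chosen for this illustration.

```r
# Cholesky, QR and singular value decomposition in base R.
A <- matrix(c(4, 2, 0,
              2, 5, 1,
              0, 1, 3), 3, 3)              # symmetric positive definite

R <- chol(A)                               # Cholesky: A = R'R, R upper triangular
all.equal(crossprod(R), A)                 # TRUE

M <- matrix(c(1, 2, 3, 4,
              0, 1, 0, 1,
              2, 0, 1, 1), 4, 3)           # an arbitrary 4 x 3 matrix

qd <- qr(M)                                # QR-decomposition: M = QR
all.equal(qr.Q(qd) %*% qr.R(qd), M)        # TRUE

s <- svd(M)                                # singular value decomposition: M = U D V'
all.equal(s$u %*% diag(s$d) %*% t(s$v), M) # TRUE
```

These factorizations underlie most of the numerics of least squares (lm uses QR internally), so it is worth being comfortable with them before the course starts.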


Below you find information on which sections of the book we cover and when. A number of practical exercises are also announced; they will be made available during the course. They consist mostly of small R exercises training the use of R on various practical problems. Usually you will be given approximately 30-45 minutes to solve each exercise on your own computer. Solutions will be provided.

May 4 - May 22: Home preparation and first compulsory assignment:
Classification of individuals based on "genetic fingerprint".
DATA
Hand-in deadline for the assignment is May 22.

June 1 - June 30: Final project. Hand-in deadline is July 1, 2009.
Microarray classification
DATA

*The two Fridays are reserved for "individual work", which means that you can work on some of the exercises from the week. I will be available for questions during the day.

Registration

To register for the course, send an email to Niels Richard Hansen. Deadline for registration is April 1. The number of participants is limited to 28 students. In case of overbooking, students from the universities participating in the BMC network will be given priority.

Miscellaneous

Material

Primary literature for the course is

The Elements of Statistical Learning.
Data Mining, Inference, and Prediction
2nd ed.


See also the web page for the book The Elements of Statistical Learning for links to data, R resources, errata, etc.

During the course you will work through the following theoretical and practical exercises.
Theo.1 Principal components
Theo.2 Ridge regression
Theo.3 Reproducing kernel Hilbert spaces
Theo.4 Penalized logistic regression
Theo.5 Support vector machines
Theo.6 Linear smoothers and cross-validation
Download Theoretical Exercises


Prac.1 Distribution of regression parameters
Prac.2 Linear discriminant analysis
Prac.3 Logistic regression
Prac.4 Ridge and Lasso regression
Prac.5 Logistic regression and basis expansions
Prac.6 Cross-validation and generalized additive models
Prac.7 Microarray data
GolubXTrain
GolubYTrain
GolubXTest
GolubYTest
Prac.8 Trees and boosting

The above theoretical exercises will be made available from the course start.
Additional exercises from the book will be pointed out along the way.

For additional reading we recommend the books:

Bioinformatics and Computational Biology Solutions
Using R and Bioconductor

Bioconductor Case Studies

Directions and accommodation

Please find information on directions and accommodation on our website. Note that we are unable to offer financial support to participants.



Lecturer

Niels Richard Hansen
Department of Mathematical Sciences