Annual Meeting in the statistics network

January 7-8, 2010

Comwell hotel, Holte

How to get to the hotel by public transportation: Go to "Holte station", e.g. by S-train, and take bus 334 to "Kongevejen/Vasevej". Comwell hotel is located 200 meters further down Kongevejen.

The meeting takes place in the room Newton, which is located in the building to the left of the main building when you arrive at the hotel.
Program:
Thursday 7/1

9:00 - 9:20 Coffee with bread, juice and fruit
9:20 - 10:40 Survival analysis
10:40 - 11:00 Break
11:00 - 12:20 Bioinformatics
 
12:30 - 14:00 Lunch
 
14:00 - 14:40 Statistical computing
14:40 - 15:20 Functional data and image analysis
15:20 - 15:40 Coffee with cake, juice and fruit
15:40 - 16:25 Survey talk
The statistical evaluation of prediction models
Thomas A. Gerds
16:25 - 16:45 Break
16:45 - 17:30 Survey talk
Models and bioinformatics
Ole Winther and Albin Sandelin

18:30 - Dinner
 

Friday 8/1

9:15 - 10:15 Survey talk
Statistical modeling with stochastic differential equations
Susanne Ditlevsen
10:15 - 10:40 Break
10:40 - 12:00 Dynamical stochastic models
 
12:00 - 13:30 Lunch
 
13:30 - 14:30 Invited talk
Estimation of trees, forests, and other decomposable graphs
Steffen Lauritzen
14:30 - 15:30 Coffee with cake, juice and fruit

As last year, we have five sessions with scientific presentations from the participants in the statistics network, each session organized within one of the five main themes of the network. In addition, this year we have three invited survey talks from network participants and, last but not least, a special invited talk by Steffen Lauritzen.

Session speakers and titles will be available from November 6, 2009. Abstracts as soon as possible thereafter.

Talks and Abstracts

Thursday 7/1

9.20-10.40 Survival analysis

9.20-10.00 Christian Pipper - Analysis of cluster variation in fully specified marginal models for data-grouped survival data

An additive hazards model may be used to quantify the effect of genetic and environmental predictors on flowering of sugar beet plants recorded as data-grouped time-to-event data. Estimated predictor effects have an intuitive interpretation rooted in the underlying time dynamics of the flowering process. However, agricultural experiments are often designed using several plots, each consisting of multiple plants that are subsequently monitored. In this paper we consider an additive hazards model with an additional plot structure induced by latent shared frailty variables. This approach enables us to derive a method to assess the quality of predictors in terms of how much plot variation they explain. We apply the method to a large dataset exploring flowering of sugar beet to assess the importance of the genetic predictor biotype.

10.00-10.40 Kyle Raymond - Estimating Cause Specific Hazards with Missing Covariates

Currently there is a great deal of interest in assessing haplotype effects for large-scale association studies. This interest is driven by a desire to unravel the genetic basis of complex human diseases and traits. However, typically only the individual's unphased genotype (the combination of the individual's two homologous haplotypes) is known. When assessing haplotype effects for survival data based on unphased genotype information, it is crucial to account for possible competing risks. Flanders et al. (2005) showed via simulation that haplotype effects can be severely biased when competing risks are ignored. The estimation of haplotype effects in the presence of competing risks has not previously been considered. In this talk I will discuss some of the challenges and possible strategies for assessing haplotype effects and other quantities of interest, such as cumulative incidence functions, in the presence of competing risks.

11.00-12.20 Bioinformatics

11.00-11.40 Brian Parker - Modelling structural RNAs and type I error control in whole-genome cluster analysis.

The discovery of families of cis-regulatory structures in messenger RNAs is a key step in understanding post-transcriptional gene regulation. Such RNA structures can be probabilistically modelled using a formalism known as stochastic context-free grammars. The issues involved in using this form of model to discover families of structural RNAs in a genome-wide analysis are discussed, including type I error control in approximations to Kullback-Leibler divergence, and highly-connected subgraph analysis, a form of density-based cluster analysis suitable for noisy datasets with a distinguished background set.

11.40-12.20 Jessica Kasza - Some Aspects of the Estimation of Bayesian Networks using a Score-based Approach

The estimation of Bayesian networks given high-dimensional data sets, with more variables than there are observations, has been the focus of much recent research. Such structures can be particularly useful in the estimation of genetic regulatory networks given gene expression data, providing information about the conditional dependence relationships between the expression levels of genes. In this talk, which will be divided into two parts, some aspects of the estimation of Bayesian networks using a score-based approach are discussed. In the first part of the talk, the estimation of Bayesian networks when the available data set does not consist of independent and identically distributed samples will be considered. Two approaches are presented and compared. In the second part of the talk, methods for the inclusion of prior knowledge about the underlying network structure are discussed. In particular, the inclusion of the prior assumption of a sparse underlying network is investigated.

14.00-14.40 Statistical computing

14.00-14.40 Peter Dalgaard - What we wish people knew more about when working with R

As every established scholar knows, the ignorance of younger people can appear to be absolutely bottomless. On second thought, you usually have to forgive them for not knowing what they were never taught. However, it does mean that we have work to do to bring e.g. PhD students to a level where they can contribute productively to R packages, and do so at a reasonable quality level.

This talk tries to establish a catalogue of items that would be essential in an introductory curriculum in statistical and scientific computing. This could include basic computer science topics, notably the theory of programming languages and object orientation, numerical analysis, and the practical toolchains involved in software development.

14.40-15.20 Functional data and image analysis

14.40-15.20 Helle Sørensen - Quantification of symmetry for functional data with application to equine lameness classification

This is a study on symmetry for (repeated) bi-phased data signals. In particular, we are interested in quantifying the deviation between the two parts of the signal. We derive three symmetry scores using functional data techniques such as smoothing and registration. The scores are applied to acceleration signals from a study on equine gait. The scores turn out to be strongly associated with lameness, and we investigate their applicability for lameness detection.

Joint work with Anders Tolver Jensen

15.40-16.25 Survey Talk

Thomas A. Gerds - The statistical evaluation of prediction models

Risk prediction models can be used to assess the current status (diagnosis) and the future status (prediction) of future patients. Risk prediction models consist of a database and a set of rules and parameter estimates that determine the prediction for a new patient based on the database. In a given application, different statistical approaches lead to different risk prediction models. For example, logistic regression and random forests are conceptually different methods that yield competing risk prediction models. The Brier score and the receiver operating characteristic (ROC) are suitable metrics for assessing and comparing the performance of risk prediction models. Suitable resampling strategies are discussed to compare risk prediction models that are derived from non-nested statistical procedures with different degrees of complexity.
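
As a small illustration of one of the metrics named in this abstract, the Brier score is the mean squared difference between predicted probabilities and observed binary outcomes, with lower values indicating better predictions. The sketch below is not taken from the talk; the function name `brier_score` and the toy numbers are assumptions made only for illustration.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted risks and 0/1 outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Toy data (hypothetical): observed status for four patients.
outcomes = [1, 0, 1, 0]

# Predicted risks from two competing (hypothetical) models.
model_a = [0.9, 0.2, 0.7, 0.1]   # informative predictions
model_b = [0.5, 0.5, 0.5, 0.5]   # uninformative constant prediction

score_a = brier_score(model_a, outcomes)
score_b = brier_score(model_b, outcomes)
print(score_a < score_b)  # the informative model has the lower (better) score
```

In practice the score would be computed on data not used for fitting, for example within the resampling strategies the abstract mentions.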

16.45-17.30 Survey Talk

Ole Winther and Albin Sandelin - Models and bioinformatics

Genomics today is a data-driven field, dominated by new sequencing-based methods where the post-analysis is computationally demanding. More importantly, there is little consensus about the methods of choice from a statistical and computer science perspective, which reflects that the field is young and immature. In this two-part talk we will first give a brief survey of what the genomics field looks like today and what challenges experimentalists and bioinformaticians are facing, and then show both open statistical/modeling problems and possible solutions to some of these. Particular examples along the way include how to extrapolate how many new unique tags will be observed in a new experiment, how to assess "tissue specificity" with tag data, and how to cluster tags on the genome in a meaningful way.

Friday 8/1

9.15-10.15 Survey Talk

Susanne Ditlevsen - Statistical modeling with stochastic differential equations

Continuous time processes are often modeled as a system of ordinary differential equations. These models assume that the observed dynamics are driven exclusively by internal, deterministic mechanisms. However, many observed systems are exposed to influences that are not completely understood or not feasible to model explicitly, which conveniently can be included in the models as stochastic influences or noise. This leads to stochastic differential equation models.

In this talk I will present a series of examples where these models have been applied, and discuss what we can learn from this type of model. Statistical problems will also be discussed.

10.40-12.00 Dynamical stochastic models

10.40-11.20 Patric Jahn - Modeling Membrane Potentials in Motoneurons by time-inhomogeneous Diffusion Leaky Integrate-and-Fire Models

A commonly used model for membrane potentials in neurons is the diffusion leaky integrate-and-fire model, where the membrane potential (X_t)_{t ≥ 0} is assumed to be a solution of a time-homogeneous SDE with linear drift

dX_t = (a - X_t/τ) dt + σ(X_t) dB_t,

where (B_t)_{t ≥ 0} is a standard Brownian motion and σ(⋅) is the diffusion coefficient. However, real data very often contain time-inhomogeneous patterns. Moreover, we can observe from data that the time constant τ decreases when neuronal activity increases. Further, σ²(⋅) turns out to be a linear function of X_t, which leads to the Feller neuronal model. The aim is to model the cycling behavior of membrane potentials in motoneurons from an active network during mechanical stimulation and to take a varying τ and a linear σ²(⋅) into account. In a first step we use nonparametric methods in the data analysis, which help in applying further regression methods to fit the model to data.

Joint work with Susanne Ditlevsen, Rune W. Berg and Jørn Hounsgaard
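
To make the SDE in this abstract concrete, the following is a minimal simulation sketch using the Euler-Maruyama scheme. It is not the authors' method: it covers only the time-homogeneous equation, it omits the firing threshold and reset, and the function names, the parameter values and the truncation of the Feller-type coefficient at zero are all assumptions chosen for illustration.

```python
import math
import random


def simulate_path(a, tau, sigma, x0, dt, n_steps, seed=0):
    """Euler-Maruyama path of dX = (a - X/tau) dt + sigma(X) dB."""
    rng = random.Random(seed)
    x = x0
    path = [x0]
    for _ in range(n_steps):
        # Brownian increment over a step of length dt: N(0, dt)
        db = rng.gauss(0.0, math.sqrt(dt))
        x = x + (a - x / tau) * dt + sigma(x) * db
        path.append(x)
    return path


def feller_sigma(x, s=0.5):
    # Feller-type coefficient: sigma^2(x) = s^2 * x, a linear function of x.
    # Truncating at 0 keeps the square root defined (an assumption of this sketch).
    return s * math.sqrt(max(x, 0.0))


# Hypothetical parameter values, chosen only to produce a path.
path = simulate_path(a=2.0, tau=5.0, sigma=feller_sigma,
                     x0=1.0, dt=0.01, n_steps=1000)

# Without noise the path relaxes toward the equilibrium a * tau.
noiseless = simulate_path(a=2.0, tau=5.0, sigma=lambda x: 0.0,
                          x0=1.0, dt=0.01, n_steps=5000)
```

A time-inhomogeneous variant, as discussed in the talk, would let `a`, `tau` or the diffusion coefficient depend on time; that extension is not attempted here.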

11.20-12.00 Niels Richard Hansen - Multivariate point process models

A one-dimensional point process is a model of the occurrences of a particular event in either time or one spatial dimension. The spiking times of a single neuron in the brain are an example of the former, and the binding positions of a transcription factor on the genome are an example of the latter. If we record the occurrences of multiple events, e.g. several different neurons or different transcription factors, we talk about a multivariate point process model.

In this talk I will focus on the modeling of multivariate point processes via intensities. The objective is to unravel the dependence structures in the multivariate distribution of event occurrences.

I will show some of our recent developments of statistical methods in the framework of multivariate point process models. I will in particular focus on statistical inference of interaction terms in the generalized linear point process models that we work with.

Joint work with Lisbeth Carstensen.

13.30-14.30 Invited talk

Steffen Lauritzen - Estimation of trees, forests, and other decomposable graphs

In the last decades there has been renewed interest in estimating unknown dependence structure among large numbers of variables. Most methods have a heuristic and somewhat ad hoc character, as most principled methods tend to be too complex and need heuristic or other modifications to become practically feasible.

An exception is the case when the structure is estimated as a tree, in which case penalized or ordinary maximum likelihood methods as well as fully Bayesian procedures have a complete and elegant theoretical and computational solution.

In the lecture I shall describe and discuss these solutions and their properties, as well as discuss possibilities for extending to the surprisingly more complex case of forests, or even general decomposable graphs.