Time and Place
The workshop takes place Monday, August 24 and Tuesday, August 25,
University of Copenhagen
Ole Maaløes Vej 5
DK-2200 Copenhagen Ø
The purpose of this workshop is to bring together people who work on or are
interested in high-throughput sequencing technologies and in particular their
use in relation to gene expression analysis. The focus of the workshop is on the
The workshop will have a few invited speakers combined with contributed talks and
posters from the participants.
Speakers and Program
List of confirmed speakers:
- Mark Robinson, WEHI Bioinformatics, Australia
- Peter 't Hoen, Leiden University Medical Center, NL
- Margaret Taub, UC Berkeley, USA
- Kasper Daniel Hansen, UC Berkeley, USA
- Oleg Mayba, UC Berkeley, USA
- Jakob Hedegaard, University of Aarhus, DK
- Mathilde Nielsen, University of Aarhus, DK
- Anders Tolver Jensen, University of Copenhagen, DK
- Anne-Mette K. Hein, CLC bio, DK
- Hanni Willenbrock, Exiqon, DK
- Irina-Ioana Mohorianu, University of East Anglia, UK
*Lunch can be purchased in the cafeteria in the same building as the Lundbeck Auditorium.
| Anders Tolver Jensen |
What is a good statistical model for a sample of count data?
By embedding small molecules into pieces of DNA it is possible to apply
principles from biological evolution to develop new technologies for
screening experiments in the drug discovery industry. The YoctoReactor
technology makes it possible to extract a target-specific subsample of a large
library of small molecules where the subsample is supposed to consist
mainly of molecules with binding preference for a given target molecule.
The results of these experiments are available in the form of a count
for each molecule in the initial library and thus resemble data from a
SAGE library. From a statistical point of view an important task is to
identify molecules that bind with high preference to the target
compared to control samples. This is equivalent to identifying
differentially expressed genes based on data from different groups of
In this talk we focus on the importance of using the right model for
the counts as the performance of the resulting procedure for candidate
drug selection may be very bad if conclusions are based on a wrong null
model. We illustrate some tools for validating some of the most popular
models for count data from SAGE libraries. Finally, we introduce a
statistical model where we try to model directly some of the individual
steps in the generation of count data from sequencing technologies.
Simulations and empirical SAGE data are used to study the properties of
(With Ib Michael Skovgaard, LIFE, KU and the Danish
biotechnology company Vipergen)
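One simple check in the spirit of the model validation mentioned above is the dispersion index (sample variance divided by sample mean), which is close to 1 when counts follow a Poisson model. The sketch below is only an illustration and is not taken from the talk; the function name `dispersion_index` and the example counts are made up.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Return sample variance / sample mean (about 1 under a Poisson model)."""
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion index is undefined when all counts are zero")
    return variance(counts) / m


# Made-up replicate counts for a single molecule (or tag); not real data.
replicate_counts = [3, 5, 4, 30, 2, 41]
print(dispersion_index(replicate_counts))
```

A value far above 1 suggests overdispersion relative to the Poisson model, in which case a Poisson null model would likely be too optimistic.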
| Anne-Mette K. Hein |
Building a comprehensive and user-friendly solution for NGS data analysis - challenges and solutions
This talk is a general introduction to the CLCbio workbenches and solutions.
CLCbio is a bioinformatics software company based in Århus, Denmark. Having been around for some years,
we launched the first version of our Genomics workbench in the summer of 2008. The goal of CLCbio's solutions and
workbenches is to provide a stable and platform-independent framework for the analysis of NGS data that is
user-friendly and allows efficient visualization of results. The current version of the workbench seamlessly handles
data from different platforms and protocols (including color-space data and paired-end reads) and supports a number
of NGS analysis types, including reference assembly, de-novo assembly, SNP detection, dip-detection, ChIP-seq and RNA-seq.
The workbench is an open framework: a developer kit is provided through which researchers can develop their own
plug-ins, which can then be loaded in and used through the graphical user interface of the workbench.
| Anne-Mette K. Hein |
Analyzing sequencing-based gene expression data in the CLCbio Genomics workbench.
This talk is a follow-up to talk 1 and focuses on CLCbio solutions to sequencing-based expression
analysis. The analysis of NGS transcriptomics data is a field in its infancy and we will give a presentation
of the tools that are currently available for the analysis of this type of data in the workbenches.
| Hanni Willenbrock |
Is sequencing (really) the golden standard for gene expression studies?
Recently, next-generation sequencing has been introduced as a promising, new platform for assessing the copy number of transcripts, while the existing microarray technology is considered less reliable for absolute, quantitative expression measurements. Nonetheless, so far, data from the two technologies have only been compared based on biological data, leading to the conclusion that, although they are somewhat correlated, expression values differ significantly. Here, we use synthetic RNA samples, resembling human microRNA samples, to find that microarray expression measures actually correlate better with sample RNA content than expression measures obtained from sequencing data. In addition, microarrays appear highly sensitive and perform equivalently to next-generation sequencing in terms of reproducibility and relative ratio quantification.
| Irina-Ioana Mohorianu |
Analysis of small-RNA expression profiles in fruit ripening
Small (s)RNAs are 20-25 nucleotide long non-coding RNAs that act as guides for the highly sequence-specific regulatory mechanism known as RNA silencing in eukaryotes. Plants in particular have been known to produce a highly complex and diverse population of regulatory sRNAs, which are involved in many different processes, such as transcriptional and post-transcriptional regulation of gene expression levels, genome maintenance and defence against pathogens. Our aim is to understand the involvement of sRNAs in Tomato fruit ripening by analysing time series of high-throughput sRNA sequencing data. The analysis involves normalisation and noise filtering of the raw data, identifying sRNAs that exhibit differential expression at various stages in the fruit ripening process and clustering of sRNA expression profiles. We have found many sRNAs, including members of the well known sub-class of microRNAs, to exhibit significant expression changes in key stages of fruit development, indicating a strong involvement of RNA silencing in this important step of a plant's life cycle.
(With Simon Moxon (School of Computing Sciences), Jing Runchun (School of Biological Sciences), Tamas Dalmay (School of Biological Sciences), Vincent
Moulton (School of Computing Sciences) and Frank Schwach (School of Computing Sciences),
University of East Anglia, Norwich, NR4 7TJ, UK)
| Jakob Hedegaard |
High-throughput DNA sequencing technologies at Aarhus University
Researchers at the Department of Genetics and Biotechnology, Faculty of Agricultural Sciences, Aarhus University use high-throughput DNA sequencing technologies, among other tools, for their studies within animal genetics and systems biology. Currently used applications are whole and targeted genome sequencing of pig and cattle as well as mRNA and small RNA transcriptome sequencing in pigs. Future applications include epigenome sequencing in pigs and mRNA and small RNA transcriptome sequencing in cattle. The setup of the Roche 454 FLX/Titanium instrument and the two Illumina Genome Analyzer II instruments is presented together with examples of projects, with an emphasis on the structure of the analytical pipeline.
| Kasper Daniel Hansen |
| Margaret Taub |
Methods for Allocating Ambiguous Short-Reads
With the rise in prominence of biological research using new short-read DNA sequencing technologies comes the need for new techniques for aligning and assigning these reads to their genomic location of origin. Until now, methods for allocating reads which align with equal or similar fidelity to multiple genomic locations have not been model-based, and have tended to ignore potentially informative data. Here, I will demonstrate that existing methods for assigning ambiguous reads can produce biased results. I will also present a new method for allocating ambiguous reads to the genome, developed within a framework of statistical modeling, which shows promise in alleviating these biases, both in simulated and real data.
(With Terry Speed, UC Berkeley Dept of Statistics and Doron Lipson, Helicos Biosciences.)
| Mark Robinson |
Normalization for RNA-seq data
The fine detail provided by sequencing-based transcriptome surveys suggests that RNA-seq is likely to become the platform of choice for interrogating steady state RNA. In order to discover biologically important changes in expression, we show that normalization continues to be an essential step in the analysis. We outline a simple and effective method for performing normalization and show dramatically improved results for inferring differential expression in simulated and publicly available data sets.
| Mathilde Nielsen |
MicroRNA identity and abundance in porcine skeletal muscles
MicroRNAs (miRNA) are short single-stranded RNA molecules that regulate gene expression posttranscriptionally by binding to complementary sequences in the 3' untranslated region of target mRNAs. MiRNAs participate in the regulation of myogenesis and identification of the complete set of miRNAs expressed in muscles is likely to significantly increase our understanding of muscle growth and development. To determine the identity and abundance of miRNA in porcine skeletal muscle we applied a deep sequencing approach based on Illumina's Genome Analyzer System. This allowed us to identify the sequences and relative expression levels of 212 annotated miRNA genes, thereby providing a thorough account of the miRNA transcriptome in porcine muscle tissue. The expression levels displayed a very large range as reflected by the number of sequence reads, which varied from single counts for rare miRNAs to several million reads for the most abundant miRNAs. Moreover, we identified numerous examples of mature miRNAs that were derived from opposite sides of the same predicted precursor stem-loop structures, and also observed length and sequence heterogeneity at the 5' and 3' ends.
| Peter 't Hoen |
Statistical methods for evaluation of differential gene expression from high-throughput sequencing data
High-throughput sequencing of cDNA libraries has become an attractive alternative to microarray-based expression profiling. From a statistical point of view, the most important difference between digital gene expression data and microarray data is that the measurements are counts. This makes the statistical models used for microarray data inappropriate. Models traditionally applied to SAGE (serial analysis of gene expression) data may also be inadequate, given the much higher sequencing depth and the availability of biological and technical replicates. We evaluated and compared several existing methods for analysis of digital gene expression data, including traditional tests and methods involving Bayesian statistics. Furthermore, we propose a new hierarchical model. In this model, the reads are modeled as Poisson distributed, the log(intensities) are described by a multivariate normal distribution, and the between-library variance follows an inverse-gamma distribution. We demonstrate that the model provides a good fit to a data set based on Illumina sequence libraries from mouse hippocampus. We show how principal component analysis (PCA) can be derived from the parameter estimates and that it is better at identifying clusters of libraries than PCA based on the raw data. We also provide a method to derive per-tag significance bounds for the difference between two groups and compare the results obtained with our method to those from alternative methods with respect to identifying genes that are up- or downregulated in transgenic mice.
(With Helene H. Thygesen, Department of Mathematics and Applied Statistics, Lancaster University, Lancaster, United Kingdom, and Renée X. de Menezes
Current affiliation: Department of Biostatistics, VU University Medical Center, Amsterdam, The Netherlands)
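The layered structure described in this abstract (Poisson reads, normally distributed log intensities, inverse-gamma between-library variance) can be illustrated with a small simulation. This is only a rough sketch under stated assumptions and not the authors' model or code: it handles a single tag, replaces the multivariate normal with a univariate normal, and all function names, parameters and values are hypothetical.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson variate (Knuth's method; normal approximation for large rates)."""
    if rate > 500:
        # exp(-rate) becomes numerically tiny, so fall back to a normal approximation.
        return max(0, round(rng.gauss(rate, math.sqrt(rate))))
    threshold = math.exp(-rate)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def simulate_tag(n_libraries, mean_log_intensity, sd_log_intensity, shape, scale, rng):
    """Simulate read counts for one tag across libraries (hypothetical parametrisation)."""
    # Tag-level log intensity: univariate stand-in for the multivariate normal.
    log_intensity = rng.gauss(mean_log_intensity, sd_log_intensity)
    # Between-library variance ~ inverse-gamma(shape, scale): reciprocal of a gamma draw
    # with the same shape and rate equal to `scale` (gammavariate takes a scale argument).
    between_library_variance = 1.0 / rng.gammavariate(shape, 1.0 / scale)
    counts = []
    for _ in range(n_libraries):
        # Library-specific log rate scatters around the tag-level log intensity.
        log_rate = rng.gauss(log_intensity, math.sqrt(between_library_variance))
        counts.append(sample_poisson(math.exp(log_rate), rng))
    return counts


rng = random.Random(1)
print(simulate_tag(4, 2.0, 1.0, 3.0, 0.5, rng))
```

Simulated counts of this kind could be used to see how between-library variance inflates the spread of counts beyond what a plain Poisson model would give.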
| Oleg Mayba |
ChIP-Seq: Methods and Challenges
In the past two years next generation sequencing technologies have been applied to a variety of biological experiments. One fast-growing area of application is the investigation of protein-DNA interactions, principally transcription factor binding and chromatin modifications.
We present the framework of this new ChIP-Seq technology that combines chromatin immunoprecipitation experiments with high-throughput sequencing, focusing mainly on applications to transcription factor binding site identification, and discuss the properties of the resulting data, including various biases that can affect the interpretation of the results. We summarize some of the current approaches and methods for analyzing such data, in particular the way they handle the non-uniform nature of background noise, and suggest some further possibilities for improvement.
Please fill out the registration form and submit it.
Deadline for registration is August 15.
The workshop is free, but we ask you to kindly register for the
The workshop is preceded by the course:
Statistical analysis of gene expression data with R and Bioconductor.
Please find information on directions and accommodation on our website. Note that we have no possibility to give financial support for