Workshop:

Gene expression based on sequencing technologies


Time and Place

The workshop takes place Monday, August 24 and Tuesday, August 25, 2009.

Lundbeck Auditorium
University of Copenhagen
Copenhagen Biocenter
Ole Maaløes Vej 5
DK-2200 Copenhagen Ø

Description

The purpose of this workshop is to bring together people who work on or are interested in high-throughput sequencing technologies and in particular their use in relation to gene expression analysis. The focus of the workshop is on the methodological questions.

The workshop will have a few invited speakers combined with contributed talks and posters from the participants.

Speakers and Program

List of confirmed speakers:

Mark Robinson, WEHI Bioinformatics, Australia
Peter 't Hoen, Leiden University Medical Center, NL
Margaret Taub, UC Berkeley, USA
Kasper Daniel Hansen, UC Berkeley, USA
Oleg Mayba, UC Berkeley, USA
Jakob Hedegaard, University of Aarhus, DK
Mathilde Nielsen, University of Aarhus, DK
Anders Tolver Jensen, University of Copenhagen, DK
Anne-Mette K. Hein, CLC bio, DK
Hanni Willenbrock, Exiqon, DK
Irina-Ioana Mohorianu, University of East Anglia, UK

Detailed program

Time	Monday 24/8	Tuesday 25/8
9.00-9.30	Registration and coffee	Coffee
9.30-10.15	Margaret Taub: Methods for Allocating Ambiguous Short-Reads	Anne-Mette K. Hein: Analyzing sequencing-based gene expression data in the CLCbio Genomics workbench
10.30-11.15	Anne-Mette K. Hein: Building a comprehensive and user-friendly solution for NGS data analysis - challenges and solutions	Jakob Hedegaard: High-throughput DNA sequencing technologies at Aarhus University
11.30-12.00	Hanni Willenbrock: Is sequencing (really) the golden standard for gene expression studies?	Mathilde Nielsen: MicroRNA identity and abundance in porcine skeletal muscles
12.00-13.00	Lunch*	Lunch*
13.00-13.45	Anders Tolver Jensen: What is a good statistical model for a sample of count data?	Irina-Ioana Mohorianu: Analysis of small-RNA expression profiles in fruit ripening
14.00-14.45	Oleg Mayba: ChIP-Seq: Methods and Challenges	Mark Robinson: Normalization for RNA-seq data
15.00-15.45	Peter 't Hoen: Statistical methods for evaluation of differential gene expression from high-throughput sequencing data	Kasper Daniel Hansen

*Lunch can be purchased in the cafeteria in the same building as the Lundbeck Auditorium.

Abstracts

Anders Tolver Jensen

What is a good statistical model for a sample of count data?

By embedding small molecules into pieces of DNA it is possible to apply principles from biological evolution to develop new technologies for screening experiments in the drug discovery industry. The YoctoReactor technology makes it possible to extract a target-specific subsample of a large library of small molecules, where the subsample is supposed to consist mainly of molecules with binding preference for a given target molecule. The results of these experiments are available in the form of a count for each molecule in the initial library and thus resemble data from a SAGE library. From a statistical point of view, an important task is to identify molecules that bind with high preference to the target compared to control samples. This is equivalent to identifying differentially expressed genes based on data from different groups of SAGE libraries.

In this talk we focus on the importance of using the right model for the counts, since the procedure for candidate drug selection may perform very poorly if conclusions are based on a wrong null model. We illustrate some tools for validating some of the most popular models for count data from SAGE libraries. Finally, we introduce a statistical model in which we try to model directly some of the individual steps in the generation of count data from sequencing technologies. Simulations and empirical SAGE data are used to study the properties of the model.

(With Ib Michael Skovgaard, LIFE, KU and the Danish biotechnology company Vipergen)
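
As a small illustration of what validating a count model can mean in practice, the sketch below computes the index of dispersion (sample variance divided by sample mean) for replicate counts. Under a Poisson model this ratio should be close to 1, so values well above 1 suggest overdispersion. This is a generic textbook diagnostic and not necessarily one of the tools presented in the talk; the function name and the example counts are hypothetical.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Return sample variance / sample mean; close to 1 under a Poisson model."""
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion index is undefined when all counts are zero")
    return variance(counts) / m


# Hypothetical replicate counts for one tag (or molecule) across five libraries.
replicate_counts = [12, 30, 7, 45, 21]
print(dispersion_index(replicate_counts))
```

A ratio far above 1, as for these made-up counts, would argue against a plain Poisson null model for that tag.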

Anne-Mette K. Hein

Building a comprehensive and user-friendly solution for NGS data analysis - challenges and solutions

This talk is a general introduction to the CLCbio workbenches and solutions. CLCbio is a bioinformatics software company based in Århus, Denmark. Having been around for some years, we launched the first version of our Genomics workbench in the summer of 2008. The goal of CLCbio's solutions and workbenches is to provide a stable and platform-independent framework for the analysis of NGS data that is user-friendly and allows efficient visualization of results. The current version of the workbench seamlessly handles data from different platforms and protocols (including color-space data and paired-end reads) and supports a number of NGS analysis types, including reference assembly, de-novo assembly, SNP detection, DIP detection, ChIP-seq and RNA-seq. The CLCbio workbench is an open framework: a developer kit is provided through which researchers can develop their own plug-ins, which can then be loaded and used through the graphical user interface of the workbench.

Anne-Mette K. Hein

Analyzing sequencing-based gene expression data in the CLCbio Genomics workbench.

This talk is a follow-up to the first talk, with a focus on CLCbio solutions for sequencing-based expression analysis. The analysis of NGS transcriptomics data is a field in its infancy, and we will present the tools that are currently available in the workbenches for analyzing this type of data.

Hanni Willenbrock

Is sequencing (really) the golden standard for gene expression studies?

Recently, next-generation sequencing has been introduced as a promising, new platform for assessing the copy number of transcripts, while the existing microarray technology is considered less reliable for absolute, quantitative expression measurements. Nonetheless, so far, data from the two technologies have only been compared based on biological data, leading to the conclusion that, although they are somewhat correlated, expression values differ significantly. Here, we use synthetic RNA samples, resembling human microRNA samples, to find that microarray expression measures actually correlate better with sample RNA content than expression measures obtained from sequencing data. In addition, microarrays appear highly sensitive and perform equivalently to next-generation sequencing in terms of reproducibility and relative ratio quantification.

Irina-Ioana Mohorianu

Analysis of small-RNA expression profiles in fruit ripening

Small (s)RNAs are 20-25 nucleotide long non-coding RNAs that act as guides for the highly sequence-specific regulatory mechanism known as RNA silencing in eukaryotes. Plants in particular are known to produce a highly complex and diverse population of regulatory sRNAs, which are involved in many different processes, such as transcriptional and post-transcriptional regulation of gene expression levels, genome maintenance and defence against pathogens. Our aim is to understand the involvement of sRNAs in tomato fruit ripening by analysing time series of high-throughput sRNA sequencing data. The analysis involves normalisation and noise filtering of the raw data, identification of sRNAs that exhibit differential expression at various stages in the fruit ripening process, and clustering of sRNA expression profiles. We have found many sRNAs, including members of the well-known sub-class of microRNAs, to exhibit significant expression changes in key stages of fruit development, indicating a strong involvement of RNA silencing in this important step of a plant's life cycle.

(With Simon Moxon, School of Computing Sciences; Jing Runchun, School of Biological Sciences; Tamas Dalmay, School of Biological Sciences; Vincent Moulton, School of Computing Sciences; Frank Schwach, School of Computing Sciences; all at the University of East Anglia, Norwich, NR4 7TJ, UK)

Jakob Hedegaard

High-throughput DNA sequencing technologies at Aarhus University

The researchers at the Department of Genetics and Biotechnology, Faculty of Agricultural Sciences, Aarhus University use, among other tools, high-throughput DNA sequencing technologies in their studies of animal genetics and systems biology. Current applications are whole and targeted genome sequencing of pig and cattle as well as mRNA and small RNA transcriptome sequencing in pigs. Future applications include epigenome sequencing in pigs and mRNA and small RNA transcriptome sequencing in cattle. The setup of the Roche 454 FLX/Titanium instrument and the two Illumina Genome Analyzer II instruments is presented, together with examples of projects, with an emphasis on the structure of the analytical pipeline.

Kasper Daniel Hansen

Margaret Taub

Methods for Allocating Ambiguous Short-Reads

With the rise in prominence of biological research using new short-read DNA sequencing technologies comes the need for new techniques for aligning and assigning these reads to their genomic location of origin. Until now, methods for allocating reads which align with equal or similar fidelity to multiple genomic locations have not been model-based, and have tended to ignore potentially informative data. Here, I will demonstrate that existing methods for assigning ambiguous reads can produce biased results. I will also present a new method for allocating ambiguous reads to the genome, developed within a framework of statistical modeling, which shows promise in alleviating these biases, both in simulated and real data. (With Terry Speed, UC Berkeley Dept of Statistics and Doron Lipson, Helicos Biosciences.)

Mark Robinson

Normalization for RNA-seq data

The fine detail provided by sequencing-based transcriptome surveys suggests that RNA-seq is likely to become the platform of choice for interrogating steady state RNA. In order to discover biologically important changes in expression, we show that normalization continues to be an essential step in the analysis. We outline a simple and effective method for performing normalization and show dramatically improved results for inferring differential expression in simulated and publicly available data sets.
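
For readers unfamiliar with the term, the sketch below shows the most basic form of normalization for count data: scaling each library by its total read count to obtain counts per million (CPM). This is only a baseline for context and is not the method outlined in the talk, which the abstract does not specify; the function name and the example counts are hypothetical.

```python
def counts_per_million(counts):
    """Scale raw counts from one library so that they sum to one million."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot normalize a library with zero total reads")
    return [c * 1_000_000 / total for c in counts]


# Hypothetical raw counts for three genes in one library.
print(counts_per_million([1, 1, 2]))  # → [250000.0, 250000.0, 500000.0]
```

Total-count scaling of this kind makes every library sum to the same value, so a few very highly expressed genes can shift the scaled values of all other genes in that library.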

Mathilde Nielsen

MicroRNA identity and abundance in porcine skeletal muscles

MicroRNAs (miRNA) are short single-stranded RNA molecules that regulate gene expression posttranscriptionally by binding to complementary sequences in the 3' untranslated region of target mRNAs. MiRNAs participate in the regulation of myogenesis and identification of the complete set of miRNAs expressed in muscles is likely to significantly increase our understanding of muscle growth and development. To determine the identity and abundance of miRNA in porcine skeletal muscle we applied a deep sequencing approach based on Illumina's Genome Analyzer System. This allowed us to identify the sequences and relative expression levels of 212 annotated miRNA genes, thereby providing a thorough account of the miRNA transcriptome in porcine muscle tissue. The expression levels displayed a very large range as reflected by the number of sequence reads, which varied from single counts for rare miRNAs to several million reads for the most abundant miRNAs. Moreover, we identified numerous examples of mature miRNAs that were derived from opposite sides of the same predicted precursor stem-loop structures, and also observed length and sequence heterogeneity at the 5' and 3' ends.

Peter 't Hoen

Statistical methods for evaluation of differential gene expression from high-throughput sequencing data

High-throughput sequencing of cDNA libraries has become an attractive alternative to microarray-based expression profiling. From a statistical point of view, the most important difference between digital gene expression data and microarray data is that the measurements are counts. This makes the statistical models used for microarray data inappropriate. Models traditionally applied to SAGE (serial analysis of gene expression) data may also be inadequate, given the much higher sequencing depth and the availability of biological and technical replicates. We evaluated and compared several existing methods for analysis of digital gene expression data, including traditional tests and methods involving Bayesian statistics. Furthermore, we propose a new hierarchical model. In this model, the reads are modeled as Poisson distributed, the log(intensities) are described by a multivariate normal distribution, and the between-library variance follows an inverse-gamma distribution. We demonstrate that the model provides a good fit to a data set based on Illumina sequence libraries from mouse hippocampus. We show how principal component analysis (PCA) can be derived from the parameter estimates and that it is better at identifying clusters of libraries than PCA based on the raw data. We also provide a method to derive per-tag significance bounds for the difference between two groups and compare the results obtained with our method to those from alternative methods with respect to identifying genes that are up- or downregulated in transgenic mice.

(With Helene H. Thygesen, Department of Mathematics and Applied Statistics, Lancaster University, Lancaster, United Kingdom, and Renée X. de Menezes, current affiliation: Department of Biostatistics, VU University Medical Center, Amsterdam, The Netherlands)
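
To make the layering in the abstract above concrete, the sketch below simulates counts for a single tag from a strongly simplified version of such a hierarchy. It is an illustration under stated assumptions, not the authors' model: the multivariate normal is reduced to independent univariate normals, the inverse-gamma draw is taken to be the variance of the log-intensity across libraries, and all function names and parameter values are hypothetical.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw a Poisson count by counting exponential arrivals within unit time."""
    count, elapsed = 0, rng.expovariate(rate)
    while elapsed <= 1.0:
        count += 1
        elapsed += rng.expovariate(rate)
    return count


def simulate_tag_counts(n_libraries, mean_log_intensity, ig_shape, ig_scale, rng):
    """Simulate counts for one tag across libraries (illustrative simplification)."""
    # Assumption: variance of the log-intensity ~ inverse-gamma(ig_shape, ig_scale),
    # drawn as the reciprocal of a gamma variate with scale 1 / ig_scale.
    variance = 1.0 / rng.gammavariate(ig_shape, 1.0 / ig_scale)
    counts = []
    for _ in range(n_libraries):
        # Assumption: independent normal log-intensities instead of a multivariate normal.
        log_intensity = rng.gauss(mean_log_intensity, math.sqrt(variance))
        # Observed reads are Poisson distributed given the intensity.
        counts.append(sample_poisson(math.exp(log_intensity), rng))
    return counts


rng = random.Random(2009)  # fixed seed so the run is reproducible
print(simulate_tag_counts(6, math.log(20), 5.0, 0.5, rng))
```

Because the Poisson rate itself varies between libraries, counts simulated this way are more variable than counts from a plain Poisson model with the same mean.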

Oleg Mayba

ChIP-Seq: Methods and Challenges

In the past two years, next-generation sequencing technologies have been applied to a variety of biological experiments. One fast-growing area of application is the investigation of protein-DNA interactions, principally transcription factor binding and chromatin modifications. We present the framework of this new ChIP-Seq technology, which combines chromatin immunoprecipitation experiments with high-throughput sequencing, focusing mainly on applications to transcription factor binding site identification, and discuss the properties of the resulting data, including various biases that can affect the interpretation of the results. We summarize some of the current approaches and methods for analyzing such data, in particular the way they handle the non-uniform nature of background noise, and suggest some further possibilities for improvement.

Registration

Please fill out the registration form and submit it.
Deadline for registration is August 15.

The workshop is free, but we kindly ask you to register.