genetic variation – Sander Timmer

Using DNA barcoding to genetically identify inbred Drosophila lines

By Sander Timmer | July 28, 2010

Working with a large collection of Drosophila melanogaster strains takes a large amount of manual work: first to keep the different strains growing stably, and second to generate enough sample material for experiments. As a result, strains are occasionally mislabeled or mixed up by mistake. These small mistakes of course negatively affect all downstream analyses.

Metagenomics projects use known polymorphic sites (ranging from microsatellites to SNPs) to identify which organisms are present in a sample. Now that all inbred strains within the Drosophila Genetic Reference Panel (DGRP) are fully sequenced, we can use the polymorphic sites (in our case SNPs) to identify each strain uniquely using the same barcoding idea. I developed several functions in R that help select regions to be targeted with cheap, old-fashioned RT-PCR, so that all strains can be identified uniquely without having to construct unique primers for each individual strain.

In our example we selected a region that, when sequenced, can identify 28 unique strains using the 31 SNPs inside this specific region. By selecting several of these regions you can identify every strain with high confidence.
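The counting step behind this idea can be sketched in a few lines. The original functions were written in R and are not shown here, so the Python below is only an illustration on made-up toy data (not the 28-strain region described above); the function names `uniquely_identified` and `combine_regions` and the input layout are assumptions.

```python
from collections import Counter


def uniquely_identified(genotypes):
    """Return strains whose SNP pattern in a region is shared with no other strain.

    genotypes: dict mapping strain name -> string of alleles, one character per SNP.
    """
    pattern_counts = Counter(genotypes.values())
    return sorted(s for s, pattern in genotypes.items() if pattern_counts[pattern] == 1)


def combine_regions(*regions):
    """Concatenate the SNP patterns of several regions into one barcode per strain."""
    strains = regions[0].keys()
    return {s: "".join(region[s] for region in regions) for s in strains}


# Toy data (made up): four strains, three SNPs in the first candidate region.
region_1 = {
    "strain_A": "ACG",
    "strain_B": "ACT",
    "strain_C": "GCG",
    "strain_D": "ACT",  # same pattern as strain_B, so neither is unique here
}
# A second toy region with a single SNP that separates strain_B from strain_D.
region_2 = {"strain_A": "T", "strain_B": "T", "strain_C": "T", "strain_D": "C"}

print(uniquely_identified(region_1))
print(uniquely_identified(combine_regions(region_1, region_2)))
```

With one region only two of the four toy strains are resolved; adding the second region resolves all four, which is the reason for combining several regions.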

De-novo sequence assembly of high coverage genomes

By Sander Timmer | June 28, 2010

As part of my PhD studies I worked on some individuals from the Drosophila Genetic Reference Panel (DGRP) that were sequenced at high coverage. For two strains (RAL 375 and 852), sequence data with an average coverage of 25X is available, which makes things like de-novo sequence assembly possible. Using Velvet we constructed several de-novo assemblies for both individuals (N50 of 40k and 50k) and studied these assemblies to check for sequence divergence and potential large structural variation.
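For readers unfamiliar with the N50 statistic quoted above: it is the contig length at which the contigs of that length or longer together cover at least half of the total assembly. A minimal sketch of the calculation, on made-up contig lengths rather than the actual assemblies:

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    Contigs are sorted from longest to shortest; the N50 is the length of the
    contig at which the running total first reaches half of the assembly size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


# Toy contig lengths (made up), total 32, so half of the assembly is 16.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```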

We took the contigs constructed by Velvet for each of the individuals and aligned them against different reference genomes of Drosophila and some closely related species using Exonerate. Plotting contig length against Exonerate score (as a measure of the sequence similarity of the contigs to the reference genomes) makes clear that the contigs of our de-novo assemblies show the largest sequence similarity to the Drosophila melanogaster reference genome, which was of course expected. This positive result strengthens our belief that sequence assemblers like Velvet can be used to reconstruct unknown genomes when sufficient sequence coverage is available.

One contig shows a more interesting result that we have not yet been able to explain sufficiently. This contig has a much lower Exonerate score than expected from its length (the Exonerate score normally scales linearly with contig length). This could indicate either that something biologically relevant is happening in this contig or that Velvet made a mistake in assembling it. With the current data and coverage for that area we cannot distinguish between these explanations.
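Because the score is expected to scale roughly linearly with contig length, one simple way to spot such a contig is to compare each contig's score per base with the typical value. The sketch below is not the analysis used for the assemblies; the function name, the cut-off of half the median, and the toy numbers are all assumptions made for illustration.

```python
from statistics import median


def low_scoring_contigs(contigs, fraction=0.5):
    """Flag contigs whose alignment score per base is far below the typical value.

    contigs: dict mapping contig name -> (length, alignment score).
    fraction: a contig is flagged when its score per base is below
              fraction * median score per base (an arbitrary illustrative cut-off).
    """
    per_base = {name: score / length for name, (length, score) in contigs.items()}
    typical = median(per_base.values())
    return sorted(name for name, ratio in per_base.items() if ratio < fraction * typical)


# Toy data (made up): three contigs on the linear trend and one well below it.
toy_contigs = {
    "contig_1": (1000, 4800),
    "contig_2": (5000, 24500),
    "contig_3": (20000, 97000),
    "contig_4": (40000, 60000),  # long contig with an unexpectedly low score
}
print(low_scoring_contigs(toy_contigs))
```

A flagged contig is only a candidate for closer inspection; the ratio alone cannot tell a real biological difference from an assembly error.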

Evolutionary indel constraints in intronic regions studied within the 1000 Genomes Project

By Sander Timmer | December 20, 2009

Minor master summer internship at EMBL – European Bioinformatics Institute, Cambridge, United Kingdom under supervision of Dr. Paul Flicek, Laura Clarke and Dr. Zamin Iqbal.

The 1000 Genomes Project is a path-breaking international experiment that involves DNA sequencing on a hitherto unimaginable scale, with the overall goal of generating a baseline understanding of the variation in the human genome. In this study we try to detect and understand trends in indel rate in the human genome. For this project, new functionality was added to the Ensembl API to classify the genetic effect caused by indel variation.

To understand the indel rate in the human genome, we studied the location and rate of indels. The positions of all indels were mapped onto the different chromosomes. Genes with and without indels are distributed similarly along the chromosomes (same distribution of distance to the chromosome start/end). If we look at genes with introns with expected indels, we see an overrepresentation on the X chromosome.

Indels were classified into 16 different types of genetic effect. In our family trio sample, most indels are located inside intronic regions. Gene enrichment analysis shows that genes with a low indel rate (intron length / number of indels) tend to be involved in RNA binding, which could be evidence for indel constraints in genes with this function.

In addition, I worked for the 1000 Genomes data center, automating sanity checks on the large datasets that had to be mirrored between the different sequencing centers and data centers.