Category: My Projects

Using DNA barcoding to genetically identify inbred Drosophila lines

Working with a large collection of Drosophila melanogaster strains takes a large amount of manual work: first to keep the different strains growing stably, and second to generate enough sample material for experiments. As a result, strains are occasionally mislabeled or mixed up by mistake. These small mistakes negatively affect all downstream analyses.

Metagenomics projects use known polymorphic sites (ranging from microsatellites to SNPs) to identify which organisms are present in a sample. Now that all inbred strains in the Drosophila Genetic Reference Panel (DGRP) are fully sequenced, we can use the polymorphic sites (in our case SNPs) to identify each strain uniquely using the same barcoding idea. I developed several functions in R that help select regions to target with cheap, old-fashioned RT-PCR, so that all strains can be identified uniquely without having to construct unique primers for each individual strain.

In one example, we selected a region that, when sequenced, can identify 28 unique strains using the 31 SNPs inside this specific region. By selecting several of these regions you can identify every strain with high confidence.
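The core of the barcoding idea can be sketched in a few lines. This is a minimal Python illustration, not the R functions mentioned above; the function name `uniquely_identified`, the strain names and the alleles are all made up for the example. It counts which strains carry a SNP pattern in a candidate region that no other strain shares.

```python
from collections import Counter


def uniquely_identified(genotypes):
    """Return the strains whose SNP pattern in the region is shared with no other strain.

    genotypes: dict mapping strain name -> string of alleles at the region's SNPs.
    """
    pattern_counts = Counter(genotypes.values())
    return sorted(s for s, pattern in genotypes.items() if pattern_counts[pattern] == 1)


# Toy data (hypothetical): four strains, three SNPs in the candidate region.
region = {
    "strain_A": "ACG",
    "strain_B": "ACT",
    "strain_C": "GCG",
    "strain_D": "ACT",  # same pattern as strain_B, so neither is unique
}

print(uniquely_identified(region))
```

Strains that share a pattern in one region (here strain_B and strain_D) would have to be separated by a second region, which is why several regions are combined.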

 

De-novo sequence assembly of high coverage genomes

As part of my PhD studies I worked on some individuals from the Drosophila Genetic Reference Panel (DGRP) that were sequenced with high coverage. For two strains (RAL 375 and 852) sequence data with an average coverage of 25X is available, which makes things like de-novo sequence assembly possible. Using Velvet we constructed several de-novo assemblies for both individuals (N50 of 40k and 50k) and studied these assemblies to check for sequence divergence and potential large structural variation.
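For readers unfamiliar with the N50 statistic quoted above, here is a minimal Python sketch of how it is usually computed: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The function name and the contig lengths are illustrative only and are not taken from the actual assemblies.

```python
def n50(lengths):
    """Length L such that contigs of length >= L together cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


# Made-up contig lengths, in bases.
print(n50([80, 70, 50, 40, 30, 20]))
```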

We took the contigs constructed by Velvet for each of the individuals and aligned them against different reference genomes of Drosophila and some closely related species using Exonerate. By plotting contig length versus Exonerate score (as a measure of sequence similarity between the contigs and the reference genomes) it becomes clear that the contigs of our de-novo assemblies show the largest sequence similarity with the Drosophila melanogaster reference genome, which was of course expected. This is a positive result that strengthens our belief that sequence assemblers like Velvet can be used to reconstruct unknown genomes when sufficient sequence coverage is available.

One contig shows a more interesting result that we have not been able to explain sufficiently. This contig has a much lower Exonerate score than expected based on its length (normally the Exonerate score scales linearly with contig length). This could indicate that something biologically relevant is happening in this specific contig, or that Velvet made a mistake there. With the current data and coverage for that area we cannot decide between these explanations.
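One simple way to flag such a contig automatically, assuming the score really does scale roughly linearly with length, is to compare each contig's score per base with the median over all contigs. The sketch below is in Python and is not the analysis used here; the function name, the 0.5 cutoff and the numbers are arbitrary assumptions for illustration.

```python
from statistics import median


def low_scoring_contigs(contigs, fraction=0.5):
    """Flag contigs whose alignment score per base is below fraction * median.

    contigs: dict mapping contig name -> (length, alignment score).
    fraction: arbitrary cutoff chosen for this illustration.
    """
    per_base = {name: score / length for name, (length, score) in contigs.items()}
    cutoff = fraction * median(per_base.values())
    return sorted(name for name, value in per_base.items() if value < cutoff)


# Made-up contigs: three follow the linear trend, one scores far too low.
contigs = {
    "contig_1": (1000, 4800),
    "contig_2": (2000, 9700),
    "contig_3": (5000, 24500),
    "contig_4": (4000, 6000),
}

print(low_scoring_contigs(contigs))
```

A flagged contig is only a candidate for closer inspection; the score alone cannot tell a biological signal from an assembly error.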

Evolutionary indel constraints in intronic regions studied within the 1000 Genomes Project

Minor master summer internship at the EMBL-European Bioinformatics Institute, Cambridge, United Kingdom, under supervision of Dr. Paul Flicek, Laura Clarke and Dr. Zamin Iqbal.

The 1000 Genomes Project is a path-breaking international experiment that involves DNA sequencing on a hitherto unimaginable scale, with the overall goal of generating a baseline understanding of the variation in the human genome. In this study we try to detect and understand trends in indel rate in the human genome. For this project, new functionality was added to the Ensembl API to classify the genetic effect caused by indel variation.

To understand the indel rate in the human genome, we studied the location and rate of indels. The positions of all indels were mapped onto the different chromosomes. Genes with and without indels are distributed equally along the chromosomes (same distribution of distances to the chromosome start and end). If we look at genes with introns with expected indels, we see an overrepresentation on the X chromosome.

Indels were classified into 16 different types of genetic effects. In our family trio samples most indels are located inside intronic regions. Gene enrichment analysis shows that genes with a low indel rate (intron length / number of indels) tend to be involved in RNA binding, which could be evidence for indel constraints in genes with this function.

In addition, I worked for the 1000 Genomes data centre, where I automated sanity checks on the large datasets that had to be mirrored between the different sequencing centres and data centres.

 

The influence of protein dynamics on structural alignments

Minor project at the Centre for Integrative Bioinformatics, Vrije Universiteit, Amsterdam, The Netherlands, under supervision of Prof Heringa, Dr. Feenstra and Dr. Pirovano.

Current structural alignment methods don’t consider structural variance. By modelling protein dynamics on HOMSTRAD homologous proteins (with GROMACS), a new set of structurally variant protein structures was created that correspond to different “normal” states of a protein structure. DaliLite structural alignments were made and scored using sum-of-pairs. By comparing the sum-of-pairs scores obtained with DaliLite and the RMSD divergence scores from our simulations, we gained insight into how well structural alignment tools can deal with different sorts of normal structural variance.
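The two measures mentioned above can be sketched as follows. This Python example is an illustration under stated assumptions, not the code used in the project: the sum-of-pairs score is taken in one common form (the fraction of residue pairs in a reference alignment that a test alignment reproduces), the RMSD assumes the two structures are already superimposed, and all function names, sequences and coordinates are made up.

```python
import math


def aligned_pairs(row_a, row_b, gap="-"):
    """Set of (i, j) residue index pairs placed in the same alignment column."""
    pairs = set()
    i = j = 0
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


def sum_of_pairs(test, reference):
    """Fraction of aligned residue pairs in the reference that the test alignment reproduces."""
    ref_pairs = aligned_pairs(*reference)
    if not ref_pairs:
        raise ValueError("reference alignment has no aligned pairs")
    return len(aligned_pairs(*test) & ref_pairs) / len(ref_pairs)


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long, already superimposed coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum((a - b) ** 2 for pa, pb in zip(coords_a, coords_b) for a, b in zip(pa, pb))
    return math.sqrt(squared / len(coords_a))


# Made-up pairwise alignments, given as (row for protein 1, row for protein 2).
reference = ("ACDE", "ACDE")
test = ("AC-DE", "ACD-E")
print(sum_of_pairs(test, reference))

# Made-up coordinates for two atoms in two snapshots of a structure.
print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 2)]))
```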

Part of this work has been published in: The meaning of alignment: lessons from structural diversity

THINGS – THis Is Not GeneSeeker

Bachelor internship under supervision of Prof Leunissen (WUR), Herman van den Berg (Vertis) and Koen Cuelenaere (Dalicon).

Created an Oracle RDF-based PubMed text-mining system for finding relations between “health”-related MeSH terms and food items, using genes as the link between them.

Most of the work consisted of working with the latest Oracle 10g Release 2 techniques, which implemented the first version of an RDF-based database. With an RDF database you can load all your data without first defining a fixed structure (as you would normally have to do in a relational database like MySQL). This way the database can quickly find new overlaps between data across multiple layers of information: because this kind of database has neither root nor leaves, all data can be related to each other and queried that way.
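The idea of relating everything through subject-predicate-object triples can be shown with a toy example. This Python sketch uses plain tuples rather than Oracle's RDF support, and every identifier in it (the predicates `mentions_gene` and `associated_with`, and the food, gene and MeSH placeholders) is hypothetical.

```python
def match(triples, s=None, p=None, o=None):
    """Return all triples matching the given subject/predicate/object (None = wildcard)."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def foods_linked_to(triples, mesh_term):
    """Foods connected to a MeSH term through at least one shared gene."""
    genes = {s for s, _, _ in match(triples, p="associated_with", o=mesh_term)}
    return sorted({s for s, _, o in match(triples, p="mentions_gene") if o in genes})


# Placeholder triples; none of these relations are real findings.
triples = [
    ("food:F1", "mentions_gene", "gene:G1"),
    ("food:F2", "mentions_gene", "gene:G2"),
    ("gene:G1", "associated_with", "mesh:T1"),
    ("gene:G1", "associated_with", "mesh:T2"),
    ("gene:G2", "associated_with", "mesh:T2"),
]

print(foods_linked_to(triples, "mesh:T1"))
print(foods_linked_to(triples, "mesh:T2"))
```

New kinds of relations can be added as extra triples without changing any table definition, which is the flexibility described above.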

 

Peritonitis in CAPD patients – a proteomics analysis

Bachelor thesis at the Nijmegen Proteomics Facility, Laboratorium voor Kindergeneeskunde en Neurologie, Radboud University Nijmegen Medical Centre, Nijmegen, the Netherlands, under supervision of Dr L. van den Heuvel and W. Pluk.

Built a central web-based data store for proteomics data that can handle the first steps of analysis automatically: normalisation, ratio conversion, emPAI calculations, and mapping of the identified proteins/peptides onto KEGG and GO ontologies for a global overview of sample content.
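As an illustration of one of these steps, the emPAI value of a protein is commonly defined as 10 raised to the ratio of observed to observable peptides, minus 1. The Python sketch below shows that formula and the usual conversion to a percentage of the summed emPAI values; it is not the code of the system described here, and the function names and peptide counts are made up.

```python
def empai(observed_peptides, observable_peptides):
    """emPAI = 10 ** (observed / observable) - 1."""
    if observable_peptides <= 0:
        raise ValueError("observable_peptides must be positive")
    return 10 ** (observed_peptides / observable_peptides) - 1


def molar_percentages(empai_by_protein):
    """Express each protein's emPAI as a percentage of the summed emPAI values."""
    total = sum(empai_by_protein.values())
    if total == 0:
        raise ValueError("all emPAI values are zero")
    return {name: 100 * value / total for name, value in empai_by_protein.items()}


# Made-up peptide counts: (observed, observable) per protein.
counts = {"protein_1": (5, 10), "protein_2": (2, 20), "protein_3": (0, 8)}
scores = {name: empai(obs, possible) for name, (obs, possible) in counts.items()}
shares = molar_percentages(scores)
print(scores)
print(shares)
```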

The system was then used for a research project in which several samples of CAPD patients with and without peritonitis were analysed and compared on protein abundance. There was not enough statistical power in our 13 samples to identify peritonitis-specific proteins or peptides, as the normal experimental and biological noise was larger than the intra-individual signals.