Category: My Projects

Digital Transformation of Services at Tetra Pak

Digital transformation is driving changes in behaviour and business models globally. I’ve had the pleasure of working with Tetra Pak Services and seeing first-hand how data science and connected devices are shaping our future.

 

The video below shows a demo (from Hannover Messe 2017) of what our solution looks like.

Using Microsoft Azure Blob Storage from within Python

When working with cloud-born applications it is sometimes nice to work without any local files. In my case I was building a Python pipeline to preprocess data before doing some Machine Learning with it. My Python code lives in a Jupyter notebook hosted by the Azure Machine Learning Studio.

As my data lives in Azure Blob Storage (the fast and cheap generic storage in the Microsoft cloud for your files), I wanted to write some Python scripts that read from blob storage and write back to blob storage without any local temp files. As the official documentation is not very clear (at least I find some parts confusing), I will share some bits of Python code that work for me. Obviously this is all at your own risk: I cannot guarantee that this solution will be stable, nor that it is the only or best way to do this.

 

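#connect to your storage account
#note: BlobService belongs to the legacy azure-storage package; newer releases of the SDK expose a different API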
from azure.storage import BlobService
blob_service = BlobService(account_name='YourAccountName', account_key='YourKey')

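#list all blobs in your container whose names start with 'input_'
#results come back in batches, so keep requesting with the marker until there is no next batch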

blobs = []
marker = None
while True:
    batch = blob_service.list_blobs('YourContainer', marker=marker, prefix='input_')
    blobs.extend(batch)
    if not batch.next_marker:
        break
    marker = batch.next_marker
for blob in blobs:
    print(blob.name)

#read the blob file as a text file
#I just read the first blob from the previous list

data = blob_service.get_blob_to_text('YourContainer', blobs[0].name).split("\n")
print("Number of lines in CSV " + str(len(data)))

#do your stuff
#I want to filter out some lines of my CSV and only keep those containing 'abc' or 'def' (the match is case-sensitive)

matchers = ['abc', 'def']
matching = [s for s in data if any(xs in s for xs in matchers)]
print("Number of matching lines " + str(len(matching)))

#write your text directly back to blob storage

blob_service.put_block_blob_from_text(
    'YourContainer',
    'YourOutputFile.csv',
    '\n'.join(matching),  # re-insert the newlines removed by split("\n")
    x_ms_blob_content_type='text/csv'
)
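
The filter step above can also be pulled out into a small function that works purely on strings, which makes it easy to try out without touching blob storage at all. What follows is only a minimal sketch: the names filter_lines and filter_blob are made up for illustration, and filter_blob assumes the newer azure-storage-blob (v12) package and a connection string, neither of which is used in the snippet above. I have not run the second function against a real storage account.

```python
def filter_lines(text, matchers):
    """Keep only the lines of `text` that contain at least one matcher."""
    kept = [line for line in text.splitlines()
            if any(m in line for m in matchers)]
    # Re-join with newlines so the output is still one record per line.
    return "\n".join(kept)


def filter_blob(connection_string, container_name, source_blob, target_blob, matchers):
    """Read a blob, filter it in memory and write the result back (no temp files).

    Assumption: the azure-storage-blob v12 package is installed. It is imported
    here so that filter_lines() stays usable without the Azure SDK.
    """
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string(connection_string)
    container = service.get_container_client(container_name)

    # download_blob(...).readall() returns bytes, so decode before filtering.
    text = container.download_blob(source_blob).readall().decode("utf-8")
    filtered = filter_lines(text, matchers)

    # overwrite=True replaces the target blob if it already exists.
    container.upload_blob(target_blob, filtered.encode("utf-8"), overwrite=True)
```

Calling filter_lines(some_csv_text, ['abc', 'def']) gives back a single string with the matching lines only, which can then be handed to whichever upload call your SDK version provides.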

PhD thesis: Understanding the epigenome using system genetics

My PhD thesis has been published online and can be found at the University of Cambridge repository: https://www.repository.cam.ac.uk/handle/1810/246693

Understanding the epigenome using system genetics

Genetics has been successful in associating DNA sequence variants to both dichotomous and continuous traits in a variety of organisms, from plant and farm animal studies to human disease. With the advent of high-throughput genotyping, there has been an almost routine generation of genome-wide association studies (GWAS) between human disease traits and genomic regions. Despite this success, a particular frustration is that the majority of associated loci are in non-coding regions of the genome and thus interpretation is hard.

 
To improve characterisation of non-coding regions, molecular assays can be used as a phenotype, and subsequently be used to explain how genetics alter molecular mechanisms. In this thesis, the interplay of three molecular assays that are involved in regulating gene expression is studied. On 60 individuals, several assays are performed: FAIRE-chip, CTCF-seq, RNA-seq and DNA-seq.

 
In the first part, the discovery and characteristics of FAIRE-QTLs are presented. The identified FAIRE-QTLs show strong overlap with other molecular QTLs, histone modifications, and transcription factors.

 
The second part consists of the integration of genome-wide molecular assays in a human population to reconstruct the human epigenome. Each of the molecular assays is associated with each of the other assays to discover phenotype-to-phenotype correlations. Furthermore, QTL data are used to dissect the causality for these phenotype-to-phenotype correlations in a system genetic manner.

 
The third part presents a comprehensive view of CTCF binding on the X chromosome, and its implications for X-chromosome inactivation. A novel X chromosome-wide CTCF effect is observed. Using the gender of each of the cell lines, observations are made about which CTCF sites are dosage-compensated, active on both chromosomes, or are only bound in females.

Quantitative Genetics of CTCF Binding Reveal Local Sequence Effects and Different Modes of X-Chromosome Association

My first paper has come out, featuring work done during my PhD. The results regarding the binding of CTCF on the X chromosome are my work.

CTCF binding can be regulated by genetic variations

We have systematically measured the effect of normal genetic variation present in a human population on the binding of a specific chromatin protein (CTCF) to DNA by measuring its binding in 51 human cell lines. We observed a large number of changes in protein binding that we can confidently attribute to genetic effects. The corresponding genetic changes are often clustered around the binding motif for CTCF, but only a minority are actually within the motif.

The X chromosome and CTCF

Unexpectedly, we also find that at most binding sites on the X chromosome, CTCF binding occurs equally on both the X chromosomes in females at the same level as on the single X chromosome in males. This finding suggests that in general, CTCF binding is not subject to global dosage compensation, the process which equalizes gene expression levels from the two female X chromosomes and the single male X.

A. Plot of the metric to distinguish single-active from both-active sites, across the X chromosome for a variety of molecular assays (mRNA, ncRNA, DNase I and CTCF, coloured according to the key). B. A smooth density of the distribution of the dosage compensation fit for the 4 molecular assay types, with CTCF split into the 3 classifications (single active, both active and female specific). C. Allele-specific signal of heterozygote sites on the X chromosome from the 13 clonal female lines in the sample. The both-active sites show balanced allele-specificity, whereas the single-active sites show strong single allele CTCF binding. D. Box plot of the gender-specific behaviour of the DNase I assay at the major classes of X chromosome CTCF sites. DNase I data was collected in a different laboratory on different cell lines [17]. The both-active class shows a strong gender split, consistent with females having around double the signal, whereas the single-active sites show no gender change. doi:10.1371/journal.pgen.1004798.g005

Full paper

PLOS Genetics