Author: Sander Timmer

PhD student in computational genetics at Cambridge University and EMBL-European Bioinformatics Institute
The Data Science industry jobs

The Data Science industry jobs

With the hype of data science as a field there is a massive demand in all kind of different skills. The following infographics, made by DataCamp, shows some way of looking at it different roles/jobs that are out there.

Data science jobs
I decided to share some #Python code that I wrote today in which I enable to work directly from within Python on data stored at +Microsoft Azure #blob #storage. Maybe useful also for other people

I decided to share some #Python code that I wrote today in which I enable to work directly from within Python on data stored at +Microsoft Azure #blob #storage. Maybe useful also for other people

http://www.sandertimmer.nl/2015/08/using-microsoft-azure-blob-storage-from-within-python/



Using Microsoft Azure Blob Storage from within Python | Sander Timmer
When working with cloud-born applications it is sometimes nice to work with any local files. In my case I was working on building some Python pipeline to preprocess data before doing some Machine Learning with it. Actually, my Python code is living in a Jupyter notebook hosted by the Azure …

Check this out on Google+ 1 0

Using Microsoft Azure Blob Storage from within Python

Using Microsoft Azure Blob Storage from within Python

When working with cloud-born applications it is sometimes nice to work with any local files. In my case I was working on building some Python pipeline to preprocess data before doing some Machine Learning with it. Actually, my Python code is living in a Jupyter notebook hosted by the Azure Machine Learning Studio.

As my data is living in Azure Blob Storage (this is the fast and cheap generic storage in the Microsoft cloud for your files) I wanted to write some Python scripts that would read from blob storage and write back to blob storage without having any local temp files. As the official documentation is not very clear (at least I find some parts confusing) I will share some bits of Python code that is working for me. Obviously this is all at your own risk and I cannot guarantee this solution will be stable nor that it will be the only or best way to do this.

 

#connect to your storage account
from azure.storage import BlobService
blob_service = BlobService(account_name='YourAccountName', account_key='YourKey')

#list all CSV files in your storage account

blobs = []
marker = None
while True:
    batch = blob_service.list_blobs('YourContainer', marker=marker, prefix='input_')
    blobs.extend(batch)
    if not batch.next_marker:
        break
    marker = batch.next_marker
for blob in blobs:
    print(blob.name)

#read the blob file as a text file
#I just read in the first from the pervious list

data = blob_service.get_blob_to_text('rockt', blobs[0].name).split("\n")
print("Number of lines in CSV " + str(len(data)))

#do your stuff
#I want to filter out some lines of my CSV and only keep those having ABC or DEF in them

matchers = ['abc', 'def']
matching = [s for s in data if any(xs in s for xs in matchers)]
print("Number of lines in CSV " + str(len(matching)))

#write your text directly back to blob storage

blob_service.put_block_blob_from_text(
    'YourContainer',
    'YourOutputFile.csv',
    ''.join(matching),
    x_ms_blob_content_type='text'
)