Project Description

Dr Denis Bauer


Team Leader, Transformational Bioinformatics
CSIRO Health Program, Sydney

VariantSpark: applying Spark-based machine learning methods to genomic information

Bioinformatics methods, models and applications to disease

Tuesday 5 July 2016

Genomic information is increasingly being used for medical research, creating a need for efficient analysis methods that can cope with thousands of individuals and millions of variants. To meet this need, we developed VariantSpark, a Hadoop/Spark framework that uses the machine learning library MLlib to parallelise population-scale bioinformatics tasks. VariantSpark offers an interface to the standard Variant Call Format (VCF) and seamless genome-wide sampling of variants, and provides a pipeline for visualising results.
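To make the VCF interface concrete, the following is a minimal sketch of how genotype calls in VCF text could be turned into one numeric vector per individual, which is the kind of input a machine learning library expects. It is not VariantSpark's code: the function names `genotype_to_count` and `vcf_to_vectors` are hypothetical, the encoding (number of non-reference alleles per variant) is an assumed convention, and missing calls are treated as reference for simplicity.

```python
def genotype_to_count(gt):
    """Count non-reference alleles in a VCF GT value such as '0/1' or '1|1'."""
    alleles = gt.replace("|", "/").split("/")
    # Assumption: missing alleles ('.') are counted as reference.
    return sum(1 for a in alleles if a not in ("0", "."))


def vcf_to_vectors(lines):
    """Return {sample_name: [alt-allele count per variant]} from VCF text lines."""
    samples, vectors = [], {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Sample names start at the tenth column of the header line.
            samples = fields[9:]
            vectors = {s: [] for s in samples}
            continue
        # The FORMAT column (index 8) names the per-sample subfields; locate GT.
        gt_index = fields[8].split(":").index("GT")
        for sample, call in zip(samples, fields[9:]):
            vectors[sample].append(genotype_to_count(call.split(":")[gt_index]))
    return vectors
```

Each individual ends up as a vector with one entry per variant, so a dataset of thousands of individuals with millions of variants becomes a very wide matrix, which is what motivates a distributed framework.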

To demonstrate the capabilities of VariantSpark, we cluster more than 3,000 individuals with 80 million variants each to determine the population structure in the dataset. VariantSpark is 80% faster than three alternatives: ADAM, the Spark-based genome clustering approach developed by the Global Alliance for Genomics and Health; the comparable implementation using Hadoop/Mahout; and Admixture, a commonly used tool for determining individual ancestries. It is over 90% faster than traditional implementations using R and Python. These benefits in speed, resource consumption and scalability enable VariantSpark to open up the use of advanced, efficient machine learning algorithms on genomic data.
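As a single-machine illustration of the clustering task, here is a minimal sketch of k-means applied to genotype vectors (one vector of alternate-allele counts per individual). The abstract does not name the clustering algorithm, so k-means is an assumption here, and this plain-Python version only stands in for a distributed MLlib implementation. The function names and the deterministic seeding with the first k points are choices made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each individual to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Four hypothetical individuals genotyped at four variants.
genotypes = [[0, 0, 1, 0], [2, 2, 1, 2], [0, 1, 0, 0], [2, 1, 2, 2]]
labels, centroids = kmeans(genotypes, k=2)
```

In this toy input the first and third individuals fall into one cluster and the second and fourth into the other. At population scale the assignment step dominates the cost, and it is also the step that parallelises naturally across individuals.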

In this talk I will give a short introduction to Hadoop and Spark and describe other approaches, such as the ADAM framework, before presenting our solution, VariantSpark.

Dr. Denis Bauer leads the Transformational Bioinformatics team in CSIRO’s eHealth program. Her expertise is in high-throughput genomic data analysis, computational genome engineering, and Spark/Hadoop and high-performance compute systems. She has a PhD in Bioinformatics and completed postdoctoral training in machine learning and in human genetics. Her collaborators include Prof. Simon Foote on mammalian susceptibility to infectious diseases, Prof. Ian Blair on molecular mechanisms of motor neuron disease, and Prof. Rodney Scott on obesity-driven cancer. She has 23 peer-reviewed publications (9 first author, 4 senior author), including three in journals with an impact factor above 8 (e.g. Nat Genet.), and an H-index of 9. To date she has attracted more than AU$25 million in funding.