Project Description

Dr Arash Bayat


Random forest and its application to genome-wide association studies

Data science and machine learning for bioinformatics

Thursday 4 July 2019

Arash is a researcher in the Transformational Bioinformatics team at CSIRO. He completed his bachelor's and master's degrees in computer engineering and moved towards bioinformatics during his PhD studies at the University of New South Wales. His current research interest is using machine learning and cloud infrastructure to process big genomic data.

A genome-wide association study (GWAS) computes the association power of SNPs with the phenotype of interest. Traditional GWAS tends to assess each SNP independently of the others when measuring association power. However, some SNPs interact with each other to produce a phenotypic response (epistasis). Capturing such epistatic interactions is a computational challenge. Random Forest is a machine learning approach that can help overcome this difficulty. This talk describes the strengths and weaknesses of using Random Forest for this purpose.
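
To make the epistasis point concrete, here is a minimal sketch in plain Python. It is not taken from the talk: the toy cohort, the XOR-style phenotype, and the helper names (`marginal_effect`, `joint_means`) are all assumptions made for illustration. It shows that when a phenotype depends only on the interaction of two SNPs, each SNP looks unassociated on its own, while splitting on both SNPs together (as the leaves of a two-level decision tree would) separates the phenotype completely.

```python
from itertools import product

# Hypothetical toy cohort: two binary SNP indicators (0 = no minor allele,
# 1 = carries the minor allele) and a phenotype driven purely by their
# interaction (XOR). Each of the four genotype combinations appears 25 times.
samples = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2) for _ in range(25)]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def marginal_effect(snp_index):
    """Difference in mean phenotype between carriers and non-carriers of one SNP."""
    carriers = mean(s[2] for s in samples if s[snp_index] == 1)
    non_carriers = mean(s[2] for s in samples if s[snp_index] == 0)
    return carriers - non_carriers


def joint_means():
    """Mean phenotype for each combination of the two SNPs."""
    return {
        (a, b): mean(s[2] for s in samples if s[0] == a and s[1] == b)
        for a, b in product((0, 1), repeat=2)
    }


# Tested one at a time, neither SNP shows any effect.
print(marginal_effect(0), marginal_effect(1))
# Considered jointly, the genotype combination determines the phenotype.
print(joint_means())
```

Real genotype data is noisy, has three genotype states per SNP, and involves far more variants, so this only illustrates why a single-SNP test can miss an interaction that a tree-based model is able to represent.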

VariantSpark: A cloud-based machine learning approach for big genomic data

Getting started with bioinformatic software

Friday 5 July 2019

Genomic data is going to set a new record. It is estimated that the volume of genomic data will exceed that of all astronomy and YouTube data combined. This dramatic increase is mainly due to the falling cost of data production and the significant impact of genomic research on our lives. Neither traditional algorithms nor high-performance computers are capable of dealing with such a massive data load. Machine learning on cloud platforms seems to be an appropriate way to tackle this problem.

Machine learning is well suited to extracting valuable information from big data in reasonable time, especially when the traditional approach has exponential complexity. Yet its computational requirements are beyond the capabilities of commodity computers. Cloud platforms can provide adequate computational hardware to support our machine learning algorithms.

VariantSpark is cloud-based machine learning software that efficiently harnesses cloud resources to run machine learning algorithms on genomic data.