Project Description

Dr James Doecke


Feature selection for biological data: educated guesses or blind computation

Data science and machine learning for bioinformatics

Thursday 4 July 2019

Dr Doecke has been working as a biostatistician for approximately 12 years. After completing his PhD in statistical genetics at Griffith University, he started his career in biostatistics at the Queensland Institute for Medical Research in 2006. In 2008 he moved to CSIRO to further his career in statistical model development, specifically in feature selection and model prediction.

He has extensive experience in analysing data from studies on Alzheimer’s disease, cancer and inflammatory bowel disease. Dr Doecke has published over 70 journal papers, including manuscripts in prestigious journals such as Nature, Gut and Molecular Psychiatry. With over 2200 citations, his work has been instrumental in the identification of blood-based biomarkers in Alzheimer’s disease, and in biomarker research in general.

Dr Doecke is regularly invited to work in prestigious laboratories around the world. He has spent time at the Cambridge Institute for Medical Research in the UK, and at the MD Anderson Cancer Center, the number 1 cancer centre in the USA.

Currently he leads a team of biostatisticians at CSIRO, and is the technical lead for all biomarkers and biostatistics arising from data collected within the Australian Imaging, Biomarkers and Lifestyle (AIBL) study of ageing. With a background in biostatistics, molecular biology and epidemiology, Dr Doecke applies both simple and complex statistical methodologies to real-world medical problems. He advocates the importance of broad knowledge in medical biology and biostatistics for answering some of the world’s most complex disease problems.

With the era of big data upon us, and machine learning technologies for assessing this data becoming more available and accessible, it is tempting to run all possible computations to assess each and every possible relationship. This can mean billions of relationships to assess, and the number grows further when there are multiple outcomes. Whilst it is now possible to run billions of computations across multiple CPUs, we still run into trouble when we want to take into account the multivariate nature of most applications. A standard covariance matrix of a big data set will take a very long time to compute. Even with the massive resources available to compute relationships amongst this big data, many complex methodologies cannot be run at such a large scale. One alternative is to construct a biological and statistical design within your data prior to analysis. This talk will describe one such design to assess three genomic platforms of data (SNP, mRNA expression and DNA methylation) with a view to understanding some of the complex relationships commonly found within disease phenotypes.
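
To give a rough sense of the scale described in the abstract, the short sketch below counts pairwise relationships and estimates the memory a dense covariance matrix would need. The feature counts are hypothetical round numbers chosen for illustration only, not figures from the talk, and the function names are likewise illustrative.

```python
from math import comb


def pairwise_count(n_features: int) -> int:
    """Number of distinct feature pairs, i.e. n choose 2."""
    return comb(n_features, 2)


def covariance_bytes(n_features: int, bytes_per_value: int = 8) -> int:
    """Memory for a dense n x n covariance matrix of 64-bit floats."""
    return n_features * n_features * bytes_per_value


if __name__ == "__main__":
    # Hypothetical feature counts (assumptions, not from the talk).
    for n in (10_000, 100_000, 1_000_000):
        pairs = pairwise_count(n)
        gigabytes = covariance_bytes(n) / 1e9
        print(f"{n:>9,} features: {pairs:,} pairs, "
              f"about {gigabytes:,.1f} GB for a dense covariance matrix")
```

With 100,000 features the pair count already approaches five billion, and each additional outcome multiplies the number of tests again, which is the motivation for restricting the search with a prior biological and statistical design.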