Project Description

Prof. Geoff McLachlan

The University of Queensland

A need for caution in clustering and classifying high-dimensional data

Data science and machine learning for bioinformatics

Tuesday 3 July 2018

Geoff McLachlan has made seminal contributions in statistics, particularly in machine learning. He has written over 270 research articles, which have received over 40,000 citations. He has written six monographs on discriminant analysis (McLachlan, 1992), mixture models (McLachlan and Basford, 1988; McLachlan and Peel, 2000), the EM algorithm (McLachlan and Krishnan, 1997, 2008), and the analysis of gene expression data (McLachlan, Do and Ambroise, 2004). He is a fellow of the Australian Academy of Science, the American Statistical Association, and the Royal Statistical Society.

In this talk, we illustrate the caution that needs to be exercised in the supervised and unsupervised classification of big data sets in bioinformatics. Issues that arise include whether a clustering obtained after much searching of the data represents a genuine or a spurious grouping, and the optimistic bias of commonly used error-rate estimates for discriminant rules formed by extensive selection from many feature variables.