Project Description

Dr Arun Konagurthu

Monash University

Statistical compression of protein folding patterns and inference of recurrent substructural themes

Data science and machine learning for bioinformatics

Thursday 4 July 2019

Dr Arun Konagurthu is a Senior Lecturer and currently the Director of undergraduate Bachelor of Computer Science studies at Monash University’s Faculty of Information Technology. He held the Larkins Fellowship at this Faculty from 2010 to 2013. Prior to that, he held the Eberly College of Science Fellowship at Pennsylvania State University, working with Professor Arthur Lesk at the Huck Institutes of Genomics, Proteomics, and Bioinformatics. His research interests cover protein structural bioinformatics, statistical inductive inference, combinatorial optimization, graph theory and algorithms.

Computational analyses of the growing corpus of three-dimensional (3D) protein structures have revealed a limited set of recurrent substructural themes that serve as building blocks of protein architecture. Knowledge of the building blocks underlying the observed repertoire of protein folding patterns remains crucial to unraveling how protein 3D structures come about, how they function and how they evolve. Characterizing a comprehensive dictionary of such building blocks has been an unanswered computational challenge in protein structural studies. Using information-theoretic inference, we address this challenge and identify a comprehensive dictionary of 1,493 substructural ‘concepts’. Each concept represents a topologically conserved assembly of helices and strands that make contact. Any protein structure can be dissected into instances of concepts from this dictionary. We dissected the Worldwide Protein Data Bank and completely inventoried all concept instances. This inventory yields an unprecedented source of biological insights. These include: correlations between concepts and catalytic activities or binding sites, useful for rational drug design; local amino-acid sequence-structure correlations, useful for ab initio structure prediction methods; and information supporting the recognition and exploration of evolutionary relationships, useful for structural studies. This talk will mainly discuss the unsupervised method, based on the Minimum Message Length (MML) criterion, that we used to learn this comprehensive architectural concept dictionary: see