Project Description

A/Prof. Hanna Suominen

Australian National University

Introduction of machine learning for health data analytics

Data science and machine learning for bioinformatics

Thursday 4 July 2019

Associate Professor Hanna Suominen has over 15 years’ experience in longitudinal, multimodal data analytics for saving, structuring, and summarising data, and is bridging the gap between Computer Science (CS) and the health and social sciences. She received her MSc in applied mathematics (2005), her PhD in CS (2009), and an Adjunct Professorship in CS (2013) from the University of Turku, Finland. She joined the ANU and Data61 as the Team Leader of TAMPA (Theory and Applications of Multimodal Pattern Analysis) within the Machine Learning (ML) Group, after working at Data61/NICTA as Team Leader of Natural Language Processing (NLP) and Senior Researcher in ML. Hanna has over 100 publications with 60 co-authors from 10 countries, at institutions including Harvard, Karolinska Institutet, and Max Planck. Her work has been published in the most prestigious journals, cited over 1,200 times, and recognised with awards for best papers, ML/NLP methods, business plans, and teaching units. She has secured competitive grants with a total value of over $10-20 million in the past 2 years alone.

Information flow, defined as channels, contact, communication, or links to pertinent people, is critical in any data-intensive field but especially so in healthcare. For example, over 10% of preventable adverse events in healthcare are caused by failures in information flow. These failures are tangible in handover: regardless of good verbal communication, 65%-100% of information is lost after 3-5 shifts if notes are taken by hand or not taken at all. The goal of our studies was to make producing and using clinical documentation more efficient through machine learning-assisted information flow, and thereby to contribute to health and healthcare. We studied automated speech recognition (ASR) and text classification as ways to populate health records and perform hospital surveillance. ASR recognised up to 73% of 14,095 test words correctly. On 100 test documents, the classifier achieved an F1 of 81% in filtering out irrelevant text and up to 100% in filling out the form headings. At 75% precision, the surveillance system had 100% recall; that is, it did not miss a single sick patient. We also introduced Web apps to demonstrate the software design and released synthetic but realistic clinical datasets. The significance of this work hinges on opening our data, software, and evaluations to the research and development community for studying clinical documentation, ASR, and classification.
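
For readers less familiar with the evaluation measures quoted above, the following minimal Python sketch shows how precision, recall, and F1 are conventionally computed from counts of true positives, false positives, and false negatives. The function name and the counts are hypothetical, chosen only to reproduce a 75% precision, 100% recall scenario; they are not the data or the code from the studies.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts.

    tp: relevant items correctly flagged (true positives)
    fp: irrelevant items wrongly flagged (false positives)
    fn: relevant items that were missed (false negatives)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts (illustration only): 3 sick patients flagged correctly,
# 1 healthy patient flagged by mistake, no sick patient missed.
p, r, f1 = precision_recall_f1(tp=3, fp=1, fn=0)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.3f}")
```

With these illustrative counts, recall is 100% because no sick patient is missed, while precision is 75% because one of every four alerts is a false alarm.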