Project Description

Ms Kate Wathen-Dunn

Sugar Research Australia, Brisbane

A newbie’s guide to building a bioinformatics pipeline

Data science and machine learning for bioinformatics

Tuesday 3 July 2018

Kate is a Senior Technician at Sugar Research Australia, where she works on sugarcane transcriptomics, trait development and molecular characterisation. She has a background in plant biotechnology and molecular biology (BSc, Murdoch University), and bioinformatics (MBioinf, University of Queensland). Her current research focuses on the transcriptomics underpinning Yellow Canopy Syndrome of sugarcane, and on the search for a biomarker to use as a diagnostic for the syndrome.

Since 2012, sugarcane crops in Queensland have been affected by Yellow Canopy Syndrome (YCS), which is characterised by a specific pattern of leaf yellowing and an abnormal accumulation of sucrose and starch in the leaf. Currently, little is known about the cause and transmission of YCS, or about the molecular mechanisms that are affected in the sugarcane plants. To answer these questions using changes in gene expression, we collected and sequenced RNA from both healthy and symptomatic plant leaves over several sampling trips. With no reference genome or transcriptome to map our data to, and limited computing resources, we assembled 3 variety-specific reference contig sets de novo from the reads. Differential expression analysis of these sets gave us essentially 3 different answers. What we needed was a reference that could be built from, and used for, all the sugarcane varieties. Unfortunately, this did not exist.

So, I set out to build one, using a total of 5.5 billion reads from 70 RNA samples across 3 sugarcane varieties; 36 of the samples had symptoms of YCS and 34 appeared healthy. As a bioinformatics newbie myself, I investigated which tools and approaches would best build the reference. With the large amount of RNAseq data available to me, this turned out to be more of a challenge than I had anticipated. No single technique or existing pipeline was going to be appropriate for my data, and I definitely needed better computing resources! Taking aspects of the best approaches in the literature, and through trial and error, I built a bioinformatics pipeline to take my data from RNAseq to an annotated transcriptome that I could then use in my research. This talk covers some of the challenges faced, decisions made and lessons learned along the way.