Understanding genetic regulation in the era of massively parallel sequencing: how many sites should a transcription factor bind? – Jacques van-Helden

The turn from the 20th to the 21st century was marked by a drastic change in the scale at which biologists study regulatory networks. When I was a PhD student, a typical thesis would consist of analysing the regulation of one particular gene by one or a few transcription factors. By the end of the 1990s, microarray technology enabled monitoring the expression of all the genes of an organism in a single experiment (transcriptome arrays). A few years later came the ChIP-on-chip method, which combined chromatin immunoprecipitation (ChIP) with microarrays (chips) to report supposedly exhaustive lists of direct target genes for a transcription factor of interest. Five years ago, the microarray step of ChIP-on-chip was replaced by massively parallel sequencing, giving rise to the ChIP-seq approach, which directly characterises the chromosomal regions bound by transcription factors under different conditions and in different tissues. Massive sequencing is also currently used to characterise the landscapes of nucleosome occupancy, chromatin accessibility and histone methylation, providing a genome-scale view of the epigenetic modifications that modulate genetic regulation.
The massive amounts of data generated by the aforementioned technologies called for the development of novel bioinformatics approaches: to extract regulatory motifs from clusters of co-expressed genes, to predict transcription factor binding sites from target regions pulled down by ChIP-on-chip and ChIP-seq, to analyse how transcription factors interact on so-called cis-regulatory modules (CRMs) to drive complex patterns of expression, and to understand the interlacing of genetic and epigenetic regulation.
Beyond the drastic improvements of “wet lab” technologies and “in silico” analytic approaches, high-throughput profiling of expression, binding and chromatin accessibility raises fundamental questions regarding the specificity, robustness and evolution of regulatory mechanisms. Transcriptome analyses have revealed hundreds of genes involved in the cellular response to various signals, developmental stages, cellular states, etc. ChIP-seq experiments return several thousand binding locations (peaks) for transcription factors previously considered “specific”. This contrasts with the classical models in which regulation shapes the body plan and the function of organs by fine-tuning the expression of a very specific subset of genes under each condition. How can we reconcile the idea of robust regulatory networks with the apparent noisiness of binding and transcription profiles?
Although I cannot provide a clear-cut answer to this fundamental question, I will try to address a very fragmentary aspect of it: how many sites should “reasonably” be bound by a given transcription factor under a given condition (cell type, developmental stage …)? ChIP-seq experiments typically return millions of reads, which are analysed with peak-calling programs to produce a set of “peaks” (more precisely, “regions enriched in reads”) considered as putative binding regions for the immunoprecipitated factor. However, the biologist is confronted with a choice among a dozen peak-calling software tools, and each tool has a series of options that strongly affect the result. Depending on the chosen tool and parameters, the same set of reads can produce anywhere from a few tens to several tens of thousands of peaks. How many of these regions are really bound by the protein? A reliable assessment would require a gold standard, i.e. an exhaustive list of the sites bound by some factor, but since establishing such a list would require us to use the very method that we are attempting to evaluate, there is an intrinsic problem of circularity. I will propose some approaches to circumvent this problem by using indirect evaluators to compare the respective qualities of the peak collections produced by different programs and parameters.

