The D2 statistic is defined as the number of k-mer matches, for some pre-specified length k, between two sequences of letters. It is a potentially useful statistic for measuring similarity of DNA or protein sequences in cases where long alignments may not be appropriate. It is rarely used directly in its basic form, and a number of related statistics, with names such as D2z, D2*, D2S and N2, have also been devised. These related statistics are motivated by the need to give an optimal signal of sequence similarity above the noise of word-occurrence variability in each of the sequences, and are often designed to target specific applications.
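As an illustration of the basic definition, the following minimal sketch counts k-mer matches between two sequences by summing, over every word, the product of its occurrence counts in each sequence. The function names `kmer_counts` and `d2` are hypothetical and chosen for this example only; overlapping occurrences are counted, which is assumed here to be the intended convention.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every (overlapping) word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Number of k-mer matches between seq_a and seq_b.

    Each pair (position in seq_a, position in seq_b) whose k-mers are
    identical contributes one match, so the total equals the sum over
    words of count_a[word] * count_b[word].
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Counter returns 0 for words absent from seq_b.
    return sum(n * counts_b[word] for word, n in counts_a.items())
```

For example, `d2("AAAA", "AAA", 2)` pairs the three occurrences of `AA` in the first sequence with the two in the second, giving six matches.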
There are two approaches to designing and using word-count statistics. The most common approach is to build into the statistic allowances for the statistical properties of word counts, based on parameters such as sequence lengths and letter distributions, and to use the statistic directly as a measure of similarity. Alternatively, one can take an approach more closely aligned with classical statistical hypothesis testing, in which one approximates the distribution of a relatively simple form of the statistic under an assumed null hypothesis and measures similarity in terms of a calculated P-value.
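To make the hypothesis-testing approach concrete, the sketch below estimates a P-value for D2 by Monte Carlo simulation: both sequences are shuffled repeatedly, and the P-value is the fraction of shuffles whose D2 is at least the observed value. This is a stand-in for illustration only, not an analytical approximation of the null distribution. Shuffling is assumed here as a simple proxy for a null hypothesis of independent letters with the observed composition, and the names `d2` and `empirical_p_value`, the default of 1000 shuffles, and the fixed seed are all hypothetical choices.

```python
import random
from collections import Counter


def d2(seq_a, seq_b, k):
    """Number of k-mer matches between two sequences."""
    counts_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    counts_b = Counter(seq_b[i:i + k] for i in range(len(seq_b) - k + 1))
    return sum(n * counts_b[word] for word, n in counts_a.items())


def empirical_p_value(seq_a, seq_b, k, n_shuffles=1000, seed=0):
    """Estimate P(D2 >= observed) under a letter-shuffling null."""
    rng = random.Random(seed)
    observed = d2(seq_a, seq_b, k)
    letters_a, letters_b = list(seq_a), list(seq_b)
    hits = 0
    for _ in range(n_shuffles):
        # Shuffling destroys word structure but keeps letter composition.
        rng.shuffle(letters_a)
        rng.shuffle(letters_b)
        if d2("".join(letters_a), "".join(letters_b), k) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)
```

Simulation of this kind becomes expensive when very small P-values are needed, since the smallest value it can report is 1 / (n_shuffles + 1); this is one reason to approximate the null distribution instead.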
In this talk I will review advances in the second of these two approaches: how well we can calculate P-values for D2 and its close relatives under null hypotheses of sequences composed of independent letters or of sequences with a Markovian dependence. I will also present results of applying the method to the identification of cis-regulatory modules, and a comparison of the theoretical D2 distribution for Markov sequences up to fifth order with empirical distributions of D2 from the human genome.