DistancE Calculation using Alignment-Free methods in PYthon
© 2006 by Michael Höhl
This package is distributed under the terms of the GNU General Public
The package decaf+py contains Python
modules and programs to facilitate the calculation of alignment-free
distances. In particular, this software implements pattern-based
To run the programs contained in decaf+py, you need to have a recent
version of Python installed, preferable version 2.4 or higher.
However, some modules will work under older versions of Python.
One unique feature of decaf+py is its implementation of pattern-based
distance calculation. Additionally, several word-based methods are
implemented. These implementations rely on patterns or k-mers (see
extract-words.sh) being stored in Teiresias format, to be read in by
TeiresiasPatterns.py. To enable this functionality, you need to
obtain Teiresias from IBM.
Alternatively, you could extract k-mers using other software,
e.g. compseq from the EMBOSS suite (available from http://emboss.sourceforge.net/),
and write a script that converts the output to Teiresias format.
The following software activates specific subsets of functionality in
decaf+py. If you do not need this functionality, you are not required
to install this software.
To enable pattern-based ML distance calculation, you need protdist
from the Phylip package. However, other programs it contains such as
fitch and neighbor are probably of interest to you as well.
To enable (fast) computation of the ACS distance, you need the Python
SuffixTree package for efficient string matching.
To enable computation of distances based on Poisson probabilities, you
need GSL and a Python wrapper around it for rng.poisson_pdf in Word.py.
Speeding up running times
The following software is not required. However, if this optional
Python package is present, it will accelerate most distance
calculations (only compute-acs-distance.py does not benefit from it).
A few distance methods rely on additional data besides the unaligned
sequences or k-mers occurring in them. These data are mutation data
and equilibrium frequencies of amino acids. Mutation data are needed
for distance calculation from similarity matrices and are used in
compute-pattern-distance.py and compute-word-wm-distance.py.
Equilibrium frequencies are contained in e.g. jones.dat from the PAML
distribution and used in the standardized Euclidean distance and
distances from Poisson probabilities. If you do not have appropriate
equilibrium frequencies, it is possible to run these programs by
specifying the alphabet size, assuming equal frequencies. For your
convenience, BLOSUM62 is already included in the archive; additional
data sources are as follows.
Possibly the easiest way to install this software is to simply copy
all *.py and *.sh files into a directory that appears in your $PATH
variable, and you are done.
Contents of the archive
* NB: itertoos2.py is taken from the Python Library Reference, Section
"5.16.3 Recipes" and resides in the file
Doc/html/lib/itertools-recipes.html under the Python directory.
How to use the programs
All programs are to be used from the command-line and come with a help
screen that is shown automatically when you invoke them without any
parameter. They require at least a set of (unaligned) sequences in
Fasta format to work on; additionally, most programs also need
patterns or k-mers/words in Teiresias format. All distance
calculations will output a lower diagonal distance matrix in Phylip
format; this can be read in by fitch and neighbor (with option "L")
for tree reconstruction.
Assume that we have a set of sequences in a file named seqs.fa; this
file is in Fasta format. In order to compute the Lempel-Ziv distance,
we type on the command-line
and it displays the (lower triangular) distance matrix. We can tell
it to write the matrix to a file, say distmatrix, by typing
- compute-lz-distance.py --fasta seqs.fa
By default, compute-lz-distance.py uses a distance variant published
under the name d1_star2 (d1**). We can specify a different one,
- compute-lz-distance.py --fasta seqs.fa --outfile distmatrix
Most options can be abbreviated by a single dash and letter. Our
previous command now becomes
- compute-lz-distance.py --fasta seqs.fa --outfile distmatrix --distance d_star
If we want to see the range of available options, we simply call the
program without any options, and a help screen will show up as
- compute-lz-distance.py -f seqs.fa -o distmatrix -d d_star
usage: compute-lz-distance.py [options]
-h, --help show this help message and exit
-f FILE, --fasta=FILE
read sequence data from FILE
-o FILE, --outfile=FILE
write output to FILE
-d NAME, --distance=NAME
choose from d, d_star, d1, d1_star, d1_star2 (default)
Similarly, to compute pattern-based distances, we proceed as follows.
Assume that we already ran Teiresias and have patterns residing in a
file called patterns.thr. We can then calculate distances using the
maximum likelihood variant.
A faster variant is based on distances calculated using a similarity matrix.
- compute-pattern-distance.py --fasta seqs.fa --patterns patterns.thr --outfile distmatrix --protdist
- compute-pattern-distance.py --fasta seqs.fa --patterns patterns.thr --outfile distmatrix --matrix BLOSUM62
Finally, assume that we ran extract-words.sh and have a file called
patterns.thr that now contains the extracted k-mers. We calculate the
standardized Euclidean distance as follows.
Furthermore, we can change the distance calculation formalism to use a
normalized variant, as is sometimes done, as opposed to the original
- compute-word-std-distance.py --fasta seqs.fa --patterns patterns.thr --outfile distmatrix --equilibrium jones.dat
We can measure the effect of equilibrium frequencies by comparatively
analyzing the distances assuming equal frequencies, specifying the
- compute-word-std-distance.py --fasta seqs.fa --patterns patterns.thr --outfile distmatrix --equilibrium jones.dat --distance euclid_norm
- compute-word-std-distance.py --fasta seqs.fa --patterns patterns.thr --outfile distmatrix --alphabet 20
What functionality do they provide?
If you are looking for an implementation of the pattern-based
- compute-pattern-distance.py computes variants of pattern-based
distances, of interest are variants that use maximum likelihood
and similarity matrices.
- pattern-filter-majority.py filters patterns according to majority
consensus and consistency criteria;
- the resulting patterns can then be used for distance calculation.
Most of the program names should be self-explanatory:
- compute-acs-distance.py computes a distance based on the average
common substring length
- compute-lz-distance.py computes various
distances based on the Lempel-Ziv complexity
- compute-word-composition-distance.py computes the composition
- compute-word-std-distance.py computes the standardized
- compute-word-wm-distance.py computes the W-metric
compute-word-distance.py allows computation of a number of word-based
distances, amongst them
This versatility is achieved by specifying the type of data to be used
independently from the distance calculation formalism.
- the (squared) Euclidean distance,
- a distance based on the fraction of common k-mer counts,
- and a distance based on probabilities of common k-mer counts under
a Poisson model.
Similarly, two programs offer many variants of published methods,
simply by combining elements of various methods:
Finally, compute-word-mix-distance.py computes a mixed distance based
on probabilities of words under a Poisson model, combining additive
and multiplicative distance calculation formalisms; this is inspired
by van Helden's mixed metric.
How to use the modules
If you would like to use the modules for your own Python programs,
have a look at the supplied programs. This should give you an idea of
how the modules are meant to be used programmatically, and should get
you started. Also, in case you would like to extend the functionality
of the modules, you may find the comments helpful, especially since
they contain assumptions underlying some of the code.
The following papers introduce methods that are, amongst others,
implemented in this software:
- Blaisdell B: A measure of the similarity of sets of sequences not
requiring sequence alignment. Proc. Natl Acad. Sci. U.S.A., 1986,
- Blaisdell B: Effectiveness of measures requiring and not requiring
prior sequence alignment for estimating the dissimilarity of natural
sequences. J. Mol. Evol., 1989, 29(6):526-537.
- Burstein D, Ulitsky I, Tuller T, Chor B: Information theoretic
approaches to whole genome phylogenies. in Proceedings of the Ninth
Annual International Conference on Research in Computational
Molecular Biology (RECOMB 2005). 2005, 283-295, Cambridge, MA.
- Edgar R: Local homology recognition and distance measures in linear
time using compressed amino acid alphabets. Bioinformatics, 2004,
- Gentleman J, Mullin R: The distribution of the frequency of occurrence
of nucleotide subsequences, based on their overlap capability.
Biometrics, 1989, 45(1), 35-52.
- Hao B, Qi J: Prokaryote phylogeny without sequence alignment: from
avoidance signature to composition distance. J. Bioinf. and
Computat. Biol., 2004, 2:1-19.
- Höhl M, Rigoutsos I, Ragan M: Pattern-based phylogenetic distance
estimation and tree reconstruction. arXiv:q-bio.QM/0605002, 2006.
- Lempel A, Ziv J: On the complexity of finite sequences. IEEE
Trans. Inform. Theory, 1976, IT-22:75-81.
- Otu H, Sayood K: A new sequence distance measure for phylogenetic tree
reconstruction. Bioinformatics, 2003, 19(16):2122-2130.
- Qi J, Wang B, Hao B: Whole proteome prokaryote phylogeny without
sequence alignment: a k-string composition approach. J. Mol. Evol.,
- Taylor W, Jones D: Deriving an amino acid distance matrix.
J. Theor. Biol., 1993, 164:65-83.
- Ulitsky I, Burstein D, Tuller T, Chor B: The average common substring
approach to phylogenomic reconstruction. J. Computat. Biol., 2006,
- Van Helden J: Metrics for comparing regulatory sequences on the basis
of pattern counts. Bioinformatics, 2004, 20(3):399-406.
- Vinga S, Gouveia-Oliveira R, Almeida J: Comparative evaluation of word
composition distances for the recognition of SCOP relationships.
Bioinformatics, 2004, 20(2):206-215.
- Wu T, Burke J, Davison D: A measure of DNA sequence dissimilarity
based on the Mahalanobis distance between frequencies of words.
Biometrics, 1997, 53(4):1431-1439.
- 2006-05-02 - Updated references
- 2006-04-27 - Initial release
How to contact me
I can be reached under these email addresses
Michael Höhl, April 2006