decaf+py

DistancE Calculation using Alignment-Free methods in PYthon

© 2006 by Michael Höhl

This package is distributed under the terms of the GNU General Public License.

Overview

The package decaf+py contains Python modules and programs to facilitate the calculation of alignment-free distances. In particular, this software implements pattern-based distance calculation.

Software requirements

To run the programs contained in decaf+py, you need to have a recent version of Python installed, preferably version 2.4 or higher; some modules will also work under older versions of Python. One unique feature of decaf+py is its implementation of pattern-based distance calculation. Additionally, several word-based methods are implemented. These implementations rely on patterns or k-mers (see extract-words.sh) being stored in Teiresias format, to be read in by TeiresiasPatterns.py. To enable this functionality, you need to obtain Teiresias from IBM. Alternatively, you can extract k-mers using other software, e.g. compseq from the EMBOSS suite (available from http://emboss.sourceforge.net/), and write a script that converts its output to Teiresias format.

Additional requirements

The following software activates specific subsets of functionality in decaf+py. If you do not need this functionality, you are not required to install this software.

To enable pattern-based ML distance calculation, you need protdist from the Phylip package. Other programs in that package, such as fitch and neighbor, are probably of interest to you as well.

To enable (fast) computation of the ACS distance, you need the Python SuffixTree package for efficient string matching. To enable computation of distances based on Poisson probabilities, you need GSL and a Python wrapper around it for rng.poisson_pdf in Word.py.
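To illustrate what the ACS computation involves, here is a naive, quadratic-time sketch of the average common substring statistic (our illustration, not the package's actual code); the SuffixTree package is what lets compute-acs-distance.py do this matching efficiently:

```python
def avg_common_substring(x, y):
    """For each start position i in x, find the length of the longest
    substring of x starting at i that also occurs in y; return the mean."""
    total = 0
    for i in range(len(x)):
        k = 0
        # extend the match while x[i:i+k+1] still occurs somewhere in y
        while i + k < len(x) and x[i:i + k + 1] in y:
            k += 1
        total += k
    return total / float(len(x))
```

The ACS distance of Ulitsky et al. is then obtained from this statistic by a logarithmic normalization and symmetrization; see reference 12 for the exact formula.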

Speeding up running times

The following software is not required. However, if this optional Python package is present, it will accelerate most distance calculations (only compute-acs-distance.py does not benefit from it).

Data sources

A few distance methods rely on additional data besides the unaligned sequences or the k-mers occurring in them: mutation data and equilibrium frequencies of amino acids. Mutation data are needed for distance calculation from similarity matrices and are used in compute-pattern-distance.py and compute-word-wm-distance.py. Equilibrium frequencies are contained in e.g. jones.dat from the PAML distribution and are used in the standardized Euclidean distance and in distances from Poisson probabilities. If you do not have appropriate equilibrium frequencies, it is possible to run these programs by specifying the alphabet size instead, assuming equal frequencies. For your convenience, BLOSUM62 is already included in the archive; additional data sources are as follows.

Installation

The easiest way to install this software is to copy all *.py and *.sh files into a directory that appears in your $PATH variable; no further setup is required.

Contents of the archive

Non-Python files

README.txt
LICENSE.txt
BLOSUM62
extract-words.sh

Modules

itertools2.py (*)
ACS.py
Align.py
BlastMatrix.py
Distance.py
DistMatrixFactory.py
DistMatrix.py
DistMatrixUtils.py
FastaFile.py
LempelZiv.py
OutFile.py
Paml.py
PatternDist.py
PatternFilter.py
PatternIter.py
PatternSelect.py
Phylip.py
ResiduePairs.py
Seqs.py
SeqUtils.py
TeiresiasPatterns.py
Wm.py
Word.py

Programs

compute-acs-distance.py
compute-lz-distance.py
compute-pattern-distance.py
compute-word-composition-distance.py
compute-word-composition-many-distances.py
compute-word-distance.py
compute-word-many-distances.py
compute-word-mix-distance.py
compute-word-std-distance.py
compute-word-wm-distance.py
pattern-filter-majority.py

* NB: itertools2.py is taken from the Python Library Reference, Section "5.16.3 Recipes", and resides in the file Doc/html/lib/itertools-recipes.html under the Python directory.

How to use the programs

All programs are to be used from the command-line and come with a help screen that is shown automatically when you invoke them without any parameter. They require at least a set of (unaligned) sequences in Fasta format to work on; additionally, most programs also need patterns or k-mers/words in Teiresias format. All distance calculations will output a lower diagonal distance matrix in Phylip format; this can be read in by fitch and neighbor (with option "L") for tree reconstruction.
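As a sketch of what that output looks like (illustrative code, not the package's actual implementation): a lower diagonal Phylip matrix starts with the number of taxa, followed by one row per taxon holding its name padded to 10 characters and the distances to all preceding taxa:

```python
def phylip_lower(names, dist):
    """Format a lower diagonal distance matrix in Phylip format;
    dist[i][j] holds the distance between taxa i and j (only j < i is read)."""
    lines = ["%5d" % len(names)]
    for i, name in enumerate(names):
        # name field of width 10, then distances to taxa 0 .. i-1
        row = [name.ljust(10)] + ["%.6f" % dist[i][j] for j in range(i)]
        lines.append(" ".join(row))
    return "\n".join(lines) + "\n"
```

For example, three taxa with pairwise distances 0.1, 0.2 and 0.3 yield a four-line file of the kind that fitch and neighbor (with option "L") accept.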

Example sessions

Assume that we have a set of sequences in a file named seqs.fa; this file is in Fasta format. To compute the Lempel-Ziv distance, we invoke compute-lz-distance.py on the command-line, passing seqs.fa via the --fasta option, and it displays the (lower triangular) distance matrix. We can tell it to write the matrix to a file, say distmatrix, by adding the --outfile option. By default, compute-lz-distance.py uses a distance variant published under the name d1_star2 (d1**); we can specify a different one, e.g. d_star, with the --distance option. Most options can be abbreviated by a single dash and letter, e.g. -f for --fasta. If we want to see the range of available options, we simply call the program without any options, and a help screen will show up as reproduced below.
usage: compute-lz-distance.py [options]

options:
  -h, --help            show this help message and exit

  Mandatory Options:
    -f FILE, --fasta=FILE
                        read sequence data from FILE

  Additional Options:
    -o FILE, --outfile=FILE
                        write output to FILE

  Distance Options:
    -d NAME, --distance=NAME
                        choose from d, d_star, d1, d1_star, d1_star2 (default)
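All of these variants are built from the LZ76 sequence complexity c(s), the number of phrases in an exhaustive production history of s (reference 8); the distances combine c(x), c(y) and the complexities of the concatenations xy and yx in different normalizations (reference 9). A naive sketch of c(s), as we understand it (not the package's implementation):

```python
def lz_complexity(s):
    """LZ76 complexity: the number of phrases in an exhaustive history,
    where each phrase grows while it can be copied from earlier data."""
    n = len(s)
    c = 0
    start = 0
    while start < n:
        end = start + 1
        # grow the current phrase while s[start:end] occurs in s[:end-1],
        # i.e. while it is reproducible from previously seen data
        while end <= n and s[start:end] in s[:end - 1]:
            end += 1
        c += 1
        start = end
    return c
```

For example, "0001101001000101" parses into the six phrases 0, 001, 10, 100, 1000, 101, so its complexity is 6.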

Similarly, to compute pattern-based distances, we proceed as follows. Assume that we already ran Teiresias and have patterns residing in a file called patterns.thr. We can then run compute-pattern-distance.py on the sequences and patterns to calculate distances using the maximum likelihood variant (this requires protdist; see "Additional requirements" above).

A faster variant is based on distances calculated using a similarity matrix.

Finally, assume that we ran extract-words.sh and have a file called patterns.thr that now contains the extracted k-mers. We can then calculate the standardized Euclidean distance with compute-word-std-distance.py.

Furthermore, we can change the distance calculation formalism to use a normalized variant, as is sometimes done, as opposed to the original squared variant. We can also measure the effect of equilibrium frequencies by comparatively analyzing the distances assuming equal frequencies, specifying only the alphabet size.
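To make the word-composition calculations concrete, the following sketch (our illustration; the actual programs read Teiresias-format k-mers and support further options) counts overlapping k-mers and compares two sequences via the squared Euclidean distance between their frequency vectors; the standardized variant additionally divides each term by a variance derived from the equilibrium frequencies (Vinga et al., reference 14):

```python
def word_freqs(seq, k):
    """Relative frequencies of all overlapping k-mers in seq."""
    counts = {}
    total = len(seq) - k + 1
    for i in range(total):
        w = seq[i:i + k]
        counts[w] = counts.get(w, 0) + 1
    return dict((w, c / float(total)) for w, c in counts.items())

def squared_euclidean(seq1, seq2, k):
    """Squared Euclidean distance between k-mer frequency vectors."""
    f1, f2 = word_freqs(seq1, k), word_freqs(seq2, k)
    return sum((f1.get(w, 0.0) - f2.get(w, 0.0)) ** 2
               for w in set(f1) | set(f2))
```

Identical sequences give a distance of 0; words absent from one sequence simply contribute their full squared frequency from the other.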

What functionality do they provide?

If you are looking for an implementation of the pattern-based distance, compute-pattern-distance.py is the program to use.

Most of the program names should be self-explanatory:

compute-word-distance.py allows computation of a number of word-based distances.

This versatility is achieved by specifying the type of data to be used independently from the distance calculation formalism.

Similarly, the two programs compute-word-many-distances.py and compute-word-composition-many-distances.py offer many variants of published methods, simply by combining elements of various methods.

Finally, compute-word-mix-distance.py computes a mixed distance based on probabilities of words under a Poisson model, combining additive and multiplicative distance calculation formalisms; this is inspired by van Helden's mixed metric.
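As background for the Poisson-based programs: the probability of seeing m occurrences of a word in a sequence is modelled as Poisson, with mean equal to the word's expected number of occurrences. A minimal sketch, assuming independent positions and equilibrium character frequencies (the package itself obtains the pdf from GSL's rng.poisson_pdf):

```python
import math

def poisson_pmf(m, lam):
    """P(X = m) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** m / math.factorial(m)

def expected_count(word, n, freqs):
    """Expected number of occurrences of word in a length-n sequence,
    assuming independent positions with character frequencies freqs."""
    p = 1.0
    for ch in word:
        p *= freqs[ch]
    return (n - len(word) + 1) * p
```

Distances can then sum such probabilities across words (additive formalism) or multiply them (multiplicative formalism); the mixed distance combines both, in the spirit of van Helden's mixed metric (reference 13).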

How to use the modules

If you would like to use the modules for your own Python programs, have a look at the supplied programs. This should give you an idea of how the modules are meant to be used programmatically, and should get you started. Also, in case you would like to extend the functionality of the modules, you may find the comments helpful, especially since they contain assumptions underlying some of the code.

References

The following papers introduce methods that are, amongst others, implemented in this software:
  1. Blaisdell B: A measure of the similarity of sets of sequences not requiring sequence alignment. Proc. Natl Acad. Sci. U.S.A., 1986, 83(14):5155-5159.

  2. Blaisdell B: Effectiveness of measures requiring and not requiring prior sequence alignment for estimating the dissimilarity of natural sequences. J. Mol. Evol., 1989, 29(6):526-537.

  3. Burstein D, Ulitsky I, Tuller T, Chor B: Information theoretic approaches to whole genome phylogenies. in Proceedings of the Ninth Annual International Conference on Research in Computational Molecular Biology (RECOMB 2005). 2005, 283-295, Cambridge, MA.

  4. Edgar R: Local homology recognition and distance measures in linear time using compressed amino acid alphabets. Nucleic Acids Res., 2004, 32:380-385.

  5. Gentleman J, Mullin R: The distribution of the frequency of occurrence of nucleotide subsequences, based on their overlap capability. Biometrics, 1989, 45(1), 35-52.

  6. Hao B, Qi J: Prokaryote phylogeny without sequence alignment: from avoidance signature to composition distance. J. Bioinf. and Computat. Biol., 2004, 2:1-19.

  7. Höhl M, Rigoutsos I, Ragan M: Pattern-based phylogenetic distance estimation and tree reconstruction. arXiv:q-bio.QM/0605002, 2006.

  8. Lempel A, Ziv J: On the complexity of finite sequences. IEEE Trans. Inform. Theory, 1976, IT-22:75-81.

  9. Otu H, Sayood K: A new sequence distance measure for phylogenetic tree reconstruction. Bioinformatics, 2003, 19(16):2122-2130.

  10. Qi J, Wang B, Hao B: Whole proteome prokaryote phylogeny without sequence alignment: a k-string composition approach. J. Mol. Evol., 2004, 58:1-11.

  11. Taylor W, Jones D: Deriving an amino acid distance matrix. J. Theor. Biol., 1993, 164:65-83.

  12. Ulitsky I, Burstein D, Tuller T, Chor B: The average common substring approach to phylogenomic reconstruction. J. Computat. Biol., 2006, 13(2):336-350.

  13. Van Helden J: Metrics for comparing regulatory sequences on the basis of pattern counts. Bioinformatics, 2004, 20(3):399-406.

  14. Vinga S, Gouveia-Oliveira R, Almeida J: Comparative evaluation of word composition distances for the recognition of SCOP relationships. Bioinformatics, 2004, 20(2):206-215.

  15. Wu T, Burke J, Davison D: A measure of DNA sequence dissimilarity based on the Mahalanobis distance between frequencies of words. Biometrics, 1997, 53(4):1431-1439.

Version history

How to contact me

I can be reached at these email addresses

Have fun,
Michael Höhl, April 2006