This package is distributed under the terms of the GNU General Public License.

- Python (available from http://www.python.org/)

- Teiresias (available from http://cbcsrv.watson.ibm.com/Tspd.html)

To enable pattern-based maximum likelihood (ML) distance calculation, you need protdist from the Phylip package. Other programs in that package, such as fitch and neighbor, are probably of interest to you as well.

- Phylip 3.6 (available from http://evolution.genetics.washington.edu/phylip.html)

- SuffixTree-0.7 (available from http://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees/)

- pygsl (available from http://pygsl.sourceforge.net/)
- GSL - GNU Scientific Library (available from http://www.gnu.org/software/gsl/)

- Psyco (available from http://psyco.sourceforge.net/)

- Blast matrices (available from ftp://ftp.ncbi.nih.gov/blast/matrices/)
- PAML (available from http://abacus.gene.ucl.ac.uk/software/paml.html)

`compute-lz-distance.py --fasta seqs.fa`

`compute-lz-distance.py --fasta seqs.fa --outfile distmatrix`

`compute-lz-distance.py --fasta seqs.fa --outfile distmatrix --distance d_star`

`compute-lz-distance.py -f seqs.fa -o distmatrix -d d_star`

`compute-lz-distance.py`

    usage: compute-lz-distance.py [options]

    options:
      -h, --help            show this help message and exit

      Mandatory Options:
        -f FILE, --fasta=FILE
                            read sequence data from FILE

      Additional Options:
        -o FILE, --outfile=FILE
                            write output to FILE

      Distance Options:
        -d NAME, --distance=NAME
                            choose from d, d_star, d1, d1_star, d1_star2
                            (default)
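The help text has the shape produced by Python's standard `optparse` module with option groups. As a rough illustration only, the sketch below rebuilds an equivalent parser from that help text. The function name `build_parser` and the `dest` names are hypothetical, and nothing here is taken from the actual script.

```python
from optparse import OptionGroup, OptionParser

# Distance names as listed in the help text; d1_star2 is shown there as the default.
DISTANCES = ("d", "d_star", "d1", "d1_star", "d1_star2")


def build_parser():
    """Hypothetical reconstruction of the command-line interface."""
    parser = OptionParser(usage="usage: %prog [options]")

    mandatory = OptionGroup(parser, "Mandatory Options")
    mandatory.add_option("-f", "--fasta", dest="fasta", metavar="FILE",
                         help="read sequence data from FILE")
    parser.add_option_group(mandatory)

    additional = OptionGroup(parser, "Additional Options")
    additional.add_option("-o", "--outfile", dest="outfile", metavar="FILE",
                          help="write output to FILE")
    parser.add_option_group(additional)

    distance = OptionGroup(parser, "Distance Options")
    distance.add_option("-d", "--distance", dest="distance", metavar="NAME",
                        type="choice", choices=DISTANCES, default="d1_star2",
                        help="choose from " + ", ".join(DISTANCES))
    parser.add_option_group(distance)
    return parser


if __name__ == "__main__":
    options, _ = build_parser().parse_args()
    print(options.fasta, options.outfile, options.distance)
```

With a parser like this, the short and long spellings (`-d d_star` and `--distance d_star`) are interchangeable, which matches the example invocations above.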

Similarly, to compute pattern-based distances, we proceed as follows. Assume that we already ran Teiresias and have patterns residing in a file called patterns.thr. We can then calculate distances using the maximum likelihood variant.

`compute-pattern-distance.py --fasta seqs.fa --patterns patterns.thr --outfile distmatrix --protdist`

`compute-pattern-distance.py --fasta seqs.fa --patterns patterns.thr --outfile distmatrix --matrix BLOSUM62`

Finally, assume that we ran extract-words.sh and have a file called patterns.thr that now contains the extracted k-mers. We calculate the standardized Euclidean distance as follows.

`compute-word-std-distance.py --fasta seqs.fa --patterns patterns.thr --outfile distmatrix --equilibrium jones.dat`

`compute-word-std-distance.py --fasta seqs.fa --patterns patterns.thr --outfile distmatrix --equilibrium jones.dat --distance euclid_norm`

`compute-word-std-distance.py --fasta seqs.fa --patterns patterns.thr --outfile distmatrix --alphabet 20`

- compute-pattern-distance.py computes variants of pattern-based distances; of particular interest are the variants that use maximum likelihood and similarity matrices.
- pattern-filter-majority.py filters patterns according to majority consensus and consistency criteria; the resulting patterns can then be used for distance calculation.

Most of the program names should be self-explanatory:

- compute-acs-distance.py computes a distance based on the average common substring length
- compute-lz-distance.py computes various distances based on the Lempel-Ziv complexity
- compute-word-composition-distance.py computes the composition distance
- compute-word-std-distance.py computes the standardized Euclidean distance
- compute-word-wm-distance.py computes the W-metric
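To give a feel for what a Lempel-Ziv complexity distance measures, here is a minimal, unoptimized sketch. It counts the components of an exhaustive (LZ76-style) parsing and combines the counts as max{c(SQ) - c(S), c(QS) - c(Q)}, a form commonly used in the literature on LZ-based sequence distances. Both function names are hypothetical, and whether compute-lz-distance.py computes its `d` variant exactly this way is an assumption.

```python
def lz_complexity(s):
    """Number of components in the exhaustive (LZ76-style) parsing of s."""
    i, components, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Grow the candidate while it can still be copied from the text that
        # precedes its last character (overlapping copies are allowed).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        components += 1
        i += length
    return components


def lz_distance(s, q):
    """Hypothetical helper: max{c(sq) - c(s), c(qs) - c(q)}."""
    return max(lz_complexity(s + q) - lz_complexity(s),
               lz_complexity(q + s) - lz_complexity(q))


if __name__ == "__main__":
    print(lz_complexity("0001101001000101"))
    print(lz_distance("ACGTACGT", "ACGTTTGA"))
```

The idea is that appending Q to S adds few new components when Q can largely be copied from S, so similar sequences get small distances. The substring search here is deliberately naive and will be slow on long sequences.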

compute-word-distance.py computes a number of word-based distances, among them:

- the (squared) Euclidean distance,
- a distance based on the fraction of common k-mer counts,
- and a distance based on probabilities of common k-mer counts under a Poisson model.
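As an illustration of the first item, the following sketch counts overlapping k-mers in two sequences and sums the squared differences of the counts. The helper names `kmer_counts` and `squared_euclidean` are hypothetical; the actual program may normalize counts or handle the alphabet differently.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count all overlapping words of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def squared_euclidean(seq_a, seq_b, k):
    """Sum of squared count differences over every k-mer seen in either sequence."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # A Counter returns 0 for missing keys, so k-mers unique to one
    # sequence contribute their full squared count.
    return sum((counts_a[w] - counts_b[w]) ** 2
               for w in counts_a.keys() | counts_b.keys())


if __name__ == "__main__":
    print(squared_euclidean("ACGT", "ACGA", 2))
```

Because only word counts are compared, no alignment of the two sequences is required.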

Similarly, two programs offer many variants of published methods, simply by combining elements of various methods:

- compute-word-composition-many-distances.py
- compute-word-many-distances.py

Finally, compute-word-mix-distance.py computes a mixed distance based on probabilities of words under a Poisson model, combining additive and multiplicative distance calculation formalisms; this is inspired by van Helden's mixed metric.

- **2006-05-02** - Updated references
- **2006-04-27** - Initial release

Have fun,

Michael Höhl, April 2006