This package is distributed under the terms of the GNU General Public License.

- Python (available from http://www.python.org/)

- Teiresias (available from http://cbcsrv.watson.ibm.com/Tspd.html)

To enable pattern-based maximum likelihood (ML) distance calculation, you need protdist from the Phylip package. The package's other programs, such as fitch and neighbor, will probably be of interest to you as well.

- Phylip 3.6 (available from http://evolution.genetics.washington.edu/phylip.html)

- SuffixTree-0.7 (available from http://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees/)

- pygsl (available from http://pygsl.sourceforge.net/)
- GSL - GNU Scientific Library (available from http://www.gnu.org/software/gsl/)

- Psyco (available from http://psyco.sourceforge.net/)

- Blast matrices (available from ftp://ftp.ncbi.nih.gov/blast/matrices/)
- PAML (available from http://abacus.gene.ucl.ac.uk/software/paml.html)

LICENSE.txt

BLOSUM62

extract-words.sh

ACS.py

Align.py

BlastMatrix.py

Distance.py

DistMatrixFactory.py

DistMatrix.py

DistMatrixUtils.py

FastaFile.py

LempelZiv.py

OutFile.py

Paml.py

PatternDist.py

PatternFilter.py

PatternIter.py

PatternSelect.py

Phylip.py

ResiduePairs.py

Seqs.py

SeqUtils.py

TeiresiasPatterns.py

Wm.py

Word.py

compute-lz-distance.py

compute-pattern-distance.py

compute-word-composition-distance.py

compute-word-composition-many-distances.py

compute-word-distance.py

compute-word-many-distances.py

compute-word-mix-distance.py

compute-word-std-distance.py

compute-word-wm-distance.py

pattern-filter-majority.py

* NB: itertools2.py is taken from the Python Library Reference, Section "5.16.3 Recipes", which resides in the file Doc/html/lib/itertools-recipes.html under the Python directory.

`compute-lz-distance.py --fasta seqs.fa`

`compute-lz-distance.py --fasta seqs.fa --outfile distmatrix`

`compute-lz-distance.py --fasta seqs.fa --outfile distmatrix --distance d_star`

`compute-lz-distance.py -f seqs.fa -o distmatrix -d d_star`

`compute-lz-distance.py`

    usage: compute-lz-distance.py [options]

    options:
      -h, --help            show this help message and exit

      Mandatory Options:
        -f FILE, --fasta=FILE
                            read sequence data from FILE

      Additional Options:
        -o FILE, --outfile=FILE
                            write output to FILE

      Distance Options:
        -d NAME, --distance=NAME
                            choose from d, d_star, d1, d1_star, d1_star2 (default)
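The help text above implies that the long and short option spellings are interchangeable and that d1_star2 is used when no distance is named. The following is a minimal sketch of a parser with that behaviour, written with `argparse`. It is an illustration only and not the package's actual parser; the function name `build_parser` is hypothetical.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the option groups shown in the help text."""
    parser = argparse.ArgumentParser(prog="compute-lz-distance.py")

    mandatory = parser.add_argument_group("Mandatory Options")
    mandatory.add_argument("-f", "--fasta", metavar="FILE", required=True,
                           help="read sequence data from FILE")

    additional = parser.add_argument_group("Additional Options")
    additional.add_argument("-o", "--outfile", metavar="FILE",
                            help="write output to FILE")

    distance = parser.add_argument_group("Distance Options")
    distance.add_argument("-d", "--distance", metavar="NAME",
                          choices=["d", "d_star", "d1", "d1_star", "d1_star2"],
                          default="d1_star2",
                          help="choose from d, d_star, d1, d1_star, d1_star2 (default)")
    return parser


# The long and the short spelling parse to the same settings.
long_form = build_parser().parse_args(
    ["--fasta", "seqs.fa", "--outfile", "distmatrix", "--distance", "d_star"])
short_form = build_parser().parse_args(
    ["-f", "seqs.fa", "-o", "distmatrix", "-d", "d_star"])

# Without -d/--distance the default named in the help text applies.
default_form = build_parser().parse_args(["-f", "seqs.fa"])
```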

Similarly, to compute pattern-based distances, we proceed as follows. Assume that we already ran Teiresias and have patterns residing in a file called patterns.thr. We can then calculate distances using the maximum likelihood variant.

`compute-pattern-distance.py --fasta seqs.fa --patterns patterns.thr --outfile distmatrix --protdist`

`compute-pattern-distance.py --fasta seqs.fa --patterns patterns.thr --outfile distmatrix --matrix BLOSUM62`

Finally, assume that we ran extract-words.sh and have a file called patterns.thr that now contains the extracted k-mers. We calculate the standardized Euclidean distance as follows.

`compute-word-std-distance.py --fasta seqs.fa --patterns patterns.thr --outfile distmatrix --equilibrium jones.dat`

`compute-word-std-distance.py --fasta seqs.fa --patterns patterns.thr --outfile distmatrix --equilibrium jones.dat --distance euclid_norm`

`compute-word-std-distance.py --fasta seqs.fa --patterns patterns.thr --outfile distmatrix --alphabet 20`

- compute-pattern-distance.py computes variants of pattern-based distances; of particular interest are the variants that use maximum likelihood and similarity matrices.
- pattern-filter-majority.py filters patterns according to majority consensus and consistency criteria; the resulting patterns can then be used for distance calculation.

Most of the program names should be self-explanatory:

- compute-acs-distance.py computes a distance based on the average common substring length
- compute-lz-distance.py computes various distances based on the Lempel-Ziv complexity
- compute-word-composition-distance.py computes the composition distance
- compute-word-std-distance.py computes the standardized Euclidean distance
- compute-word-wm-distance.py computes the W-metric
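To make the Lempel-Ziv entry more concrete, here is a minimal sketch of the complexity measure of Lempel and Ziv (1976) and of two of the distances described by Otu and Sayood (2003). It is not the code in LempelZiv.py; the function names are hypothetical, and the assumption that these two formulas correspond to the `d` and `d_star` options is ours.

```python
def lz_complexity(seq):
    """Count the components of the exhaustive history of seq.

    Each component is the shortest substring starting at the current
    position that cannot be copied from the text preceding its last
    character; the final component may be an incomplete (copyable) one.
    """
    count, start, n = 0, 0, len(seq)
    while start < n:
        length = 1
        # Extend while the candidate already occurs in the text seen so far
        # (the earlier occurrence may overlap the candidate itself).
        while (start + length <= n
               and seq[start:start + length] in seq[:start + length - 1]):
            length += 1
        count += 1
        start += length
    return count


def lz_distance(s, q):
    """max{c(SQ) - c(S), c(QS) - c(Q)}; assumed to correspond to 'd'."""
    return max(lz_complexity(s + q) - lz_complexity(s),
               lz_complexity(q + s) - lz_complexity(q))


def lz_distance_normalized(s, q):
    """lz_distance scaled by max{c(S), c(Q)}; assumed to correspond to 'd_star'.

    Both sequences must be non-empty.
    """
    return lz_distance(s, q) / max(lz_complexity(s), lz_complexity(q))


example = "aacgtaccattg"
print(lz_complexity(example))          # 7 components: a|ac|g|t|acc|at|tg
print(lz_distance(example, example))   # 1
```

The quadratic substring search keeps the sketch short; a suffix tree would be the usual choice for long sequences.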

compute-word-distance.py allows computation of a number of word-based distances, amongst them

- the (squared) Euclidean distance,
- a distance based on the fraction of common k-mer counts,
- and a distance based on probabilities of common k-mer counts under a Poisson model.
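As an illustration of the first item, the squared Euclidean distance can be sketched as follows: count the overlapping k-mers of each sequence and sum the squared differences of the counts over all words. This is a simplified sketch, not the code in Word.py; the function names are hypothetical, and any normalisation or restriction to a pattern file that the program may apply is left out.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count all overlapping words of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def squared_euclidean(counts_a, counts_b):
    """Sum of squared count differences over every word seen in either sequence."""
    words = set(counts_a) | set(counts_b)
    # A Counter returns 0 for words it has not seen.
    return sum((counts_a[w] - counts_b[w]) ** 2 for w in words)


a = kmer_counts("ABAB", 2)   # AB: 2, BA: 1
b = kmer_counts("ABBA", 2)   # AB: 1, BB: 1, BA: 1
print(squared_euclidean(a, b))   # 2
```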

Similarly, two programs offer many variants of published methods, simply by combining elements of various methods:

- compute-word-composition-many-distances.py
- compute-word-many-distances.py

Finally, compute-word-mix-distance.py computes a mixed distance based on probabilities of words under a Poisson model, combining additive and multiplicative distance calculation formalisms; this is inspired by van Helden's mixed metric.

- Blaisdell B: *A measure of the similarity of sets of sequences not requiring sequence alignment.* Proc. Natl Acad. Sci. U.S.A., 1986, 83(14):5155-5159.
- Blaisdell B: *Effectiveness of measures requiring and not requiring prior sequence alignment for estimating the dissimilarity of natural sequences.* J. Mol. Evol., 1989, 29(6):526-537.
- Burstein D, Ulitsky I, Tuller T, Chor B: *Information theoretic approaches to whole genome phylogenies.* In Proceedings of the Ninth Annual International Conference on Research in Computational Molecular Biology (RECOMB 2005). 2005, 283-295, Cambridge, MA.
- Edgar R: *Local homology recognition and distance measures in linear time using compressed amino acid alphabets.* Nucleic Acids Res., 2004, 32:380-385.
- Gentleman J, Mullin R: *The distribution of the frequency of occurrence of nucleotide subsequences, based on their overlap capability.* Biometrics, 1989, 45(1):35-52.
- Hao B, Qi J: *Prokaryote phylogeny without sequence alignment: from avoidance signature to composition distance.* J. Bioinf. and Computat. Biol., 2004, 2:1-19.
- Höhl M, Rigoutsos I, Ragan M: *Pattern-based phylogenetic distance estimation and tree reconstruction.* arXiv:q-bio.QM/0605002, 2006.
- Lempel A, Ziv J: *On the complexity of finite sequences.* IEEE Trans. Inform. Theory, 1976, IT-22:75-81.
- Otu H, Sayood K: *A new sequence distance measure for phylogenetic tree reconstruction.* Bioinformatics, 2003, 19(16):2122-2130.
- Qi J, Wang B, Hao B: *Whole proteome prokaryote phylogeny without sequence alignment: a k-string composition approach.* J. Mol. Evol., 2004, 58:1-11.
- Taylor W, Jones D: *Deriving an amino acid distance matrix.* J. Theor. Biol., 1993, 164:65-83.
- Ulitsky I, Burstein D, Tuller T, Chor B: *The average common substring approach to phylogenomic reconstruction.* J. Computat. Biol., 2006, 13(2):336-350.
- Van Helden J: *Metrics for comparing regulatory sequences on the basis of pattern counts.* Bioinformatics, 2004, 20(3):399-406.
- Vinga S, Gouveia-Oliveira R, Almeida J: *Comparative evaluation of word composition distances for the recognition of SCOP relationships.* Bioinformatics, 2004, 20(2):206-215.
- Wu T, Burke J, Davison D: *A measure of DNA sequence dissimilarity based on the Mahalanobis distance between frequencies of words.* Biometrics, 1997, 53(4):1431-1439.

**2006-05-02** - Updated references

**2006-04-27** - Initial release

Have fun,

Michael Höhl, April 2006