Synthetic eight-taxon & putative orthologs datasets
Overview
This website presents data used in the manuscript
Is multiple sequence alignment required for accurate inference of phylogeny?
by Michael Höhl and Mark A. Ragan (submitted, 2006)
Data organization
The data is organized hierarchically into directories as follows:
- synthetic contains the synthetic eight-taxon datasets
  - (trees contains the reference trees, used in all three datasets)
  - control contains control data (1000 amino acids, no among-site rate variation, ASRV)
  - asrv contains ASRV data (1000 amino acids, ASRV with alpha=0.5)
  - short contains short-sequences data (300 amino acids, no ASRV)
  - each dataset contains 7 reference sets, labeled set1 to set7
  - each reference set contains
    - ref-tree/*/outtree
      the reference trees (soft links to synthetic/trees/*/)
    - ref-fa/*/seqs.fa
      the original reference sequences (AA) in fasta format
    - ref-fa-chem/*/seqs.fa
      sequences encoded using alphabet CE
  - each directory contains 100 elements, labeled exp001 to exp100
- orthologs contains the putative orthologs dataset with four reference sets
  - f-s contains few taxa, short deep phylogenetic branches (50 elements)
  - f-l contains few taxa, long deep phylogenetic branches (52 elements)
  - m-s contains many taxa, short deep phylogenetic branches (80 elements)
  - m-l contains many taxa, long deep phylogenetic branches (38 elements)
  - each reference set contains
    - ref-tree/*/outtree
      the reference trees
    - ref-tree-pp/*/outtree
      the reference trees, collapsed at various posterior probability (PP) thresholds
    - ref-bipart/*/biparts
      the deep phylogenetic branch measured by DPB
    - ref-gb/*/seqs.fa
      sequences after GBLOCKS treatment, which yielded reference trees using MrBayes
    - ref-fa/*/seqs.fa
      the original reference sequences (AA) in fasta format
    - ref-fa-chem/*/seqs.fa
      sequences encoded using alphabet CE
  - each directory contains a number of elements (different for each reference set), starting with exp001
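The layout above can be traversed with a few lines of Python. The following is a minimal sketch, not part of the distributed data: the function name list_element_files and the example path synthetic/control/set1 are assumptions about how an unpacked download might look, based on the directory names listed above.

```python
from pathlib import Path


def list_element_files(reference_set_dir, subdir="ref-fa", filename="seqs.fa"):
    """Return the per-element files of one reference set, sorted by path.

    Matches <reference_set_dir>/<subdir>/*/<filename>, e.g. ref-fa/*/seqs.fa.
    The function name and defaults are illustrative assumptions.
    """
    base = Path(reference_set_dir) / subdir
    if not base.is_dir():
        # Missing directory (e.g. archive not unpacked here): nothing to list.
        return []
    return sorted(base.glob(f"*/{filename}"))


if __name__ == "__main__":
    # Hypothetical location of an unpacked download; adjust to your setup.
    for path in list_element_files("synthetic/control/set1"):
        print(path.parent.name, path)
```

Passing subdir="ref-tree" and filename="outtree" would list the reference trees of the same reference set in the same way.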
Format of bipart-file
The bipart-file (ref-bipart/*/biparts) is a tab-delimited file
consisting of one line (the single deep phylogenetic branch measured
by DPB) with three parts
dot-star partition1 partition2
dot-star is a bipartition of the reference tree in dot-star
format (using characters '.' and '*').
partition1 and partition2 are comma-separated lists
(each enclosed by '[' and ']') of sequence
identifiers ID (used in fasta file headers: '>ID 1').
The characters in dot-star refer to the combined IDs from
partition1 and partition2 in ascending numerical
order. Dots refer to partition1, stars refer to
partition2. Example:
.**. [1, 52] [2, 42]
The combined and ordered IDs are [1, 2, 42, 52]. Now we map
these IDs to dot-star format. Since 1 is in
partition1, the first character is '.'; 2
is in partition2, so the corresponding character is
'*'; the same holds for 42; 52 is again in
partition1, so the last character is '.'. Hence the resulting string is '.**.'.
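The mapping described above can be sketched in Python as follows. This is an illustrative reading of the format, not the code used to produce the files: the function names are hypothetical, and the sketch assumes that the sequence identifiers are integers, as the ascending numerical ordering suggests.

```python
def parse_id_list(field):
    """Parse a bracketed, comma-separated ID list such as '[1, 52]'."""
    text = field.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError(f"expected a list enclosed in '[' and ']': {field!r}")
    # Assumption: IDs are integers (they are sorted numerically).
    return [int(tok) for tok in text[1:-1].split(",") if tok.strip()]


def dot_star(partition1, partition2):
    """Build the dot-star string: '.' for partition1 IDs, '*' for partition2 IDs."""
    p1, p2 = set(partition1), set(partition2)
    if p1 & p2:
        raise ValueError("an ID occurs in both partitions")
    ordered = sorted(p1 | p2)  # combined IDs in ascending numerical order
    return "".join("." if seq_id in p1 else "*" for seq_id in ordered)


def parse_bipart_line(line):
    """Split one tab-delimited bipart line and check that it is self-consistent."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated parts, got {len(fields)}")
    pattern, field1, field2 = fields
    partition1 = parse_id_list(field1)
    partition2 = parse_id_list(field2)
    if dot_star(partition1, partition2) != pattern:
        raise ValueError("dot-star string does not match the two partitions")
    return pattern, partition1, partition2


if __name__ == "__main__":
    # The example line from the text above, with tabs between the three parts.
    print(parse_bipart_line(".**.\t[1, 52]\t[2, 42]"))
```

Recomputing the dot-star string from the two partitions and comparing it with the first field is a cheap consistency check when reading the files.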
Downloads
(bzip2-compressed tar-balls)
How to contact me
I can be reached at these email addresses
Michael Höhl, 4 May 2006