Lateral genetic transfer datasets from the phylogenetic pipeline project
Robert G. Beiko, Timothy J. Harlow, and Mark A. Ragan
These are the datasets that underpin the analysis of lateral genetic transfer reported in "Highways of gene sharing in prokaryotes" (Beiko et al., 2005).
The information below is contained in tab-separated files that were exported from our local Oracle database.
Where appropriate, brief column descriptions are given for each file.
BLASTP hits for all 144 genomes
The following file can be used to convert the internal pipeline IDs (which run from 1 to 422,971) into various accessions and reference IDs used by GenBank. Because these genomes were downloaded in 2003, some of the annotated proteins may no longer exist in the database, which is why NCBI_REFSEQ_GI_ORIG (2003 information) and NCBI_REFSEQ_GI_CURRENT (2005 information) sometimes differ.
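As a rough illustration of how the ID mapping file might be used, the Python sketch below reads a tab-separated export and lists the proteins whose GI number differs between the 2003 and 2005 columns. Only NCBI_REFSEQ_GI_ORIG and NCBI_REFSEQ_GI_CURRENT are named in the description above; the PIPELINE_ID column name, the presence of a header row, and the example rows are assumptions made for illustration and may not match the actual file.

```python
import csv
import io


def find_gi_discrepancies(handle, id_column="PIPELINE_ID"):
    """Return (pipeline_id, orig_gi, current_gi) for rows whose GI differs.

    Assumes a tab-separated file with a header row. The id_column name is
    hypothetical; the two GI column names come from the file description.
    An empty current GI is reported as None.
    """
    changed = []
    for row in csv.DictReader(handle, delimiter="\t"):
        orig = (row.get("NCBI_REFSEQ_GI_ORIG") or "").strip()
        current = (row.get("NCBI_REFSEQ_GI_CURRENT") or "").strip()
        if orig != current:
            changed.append((row[id_column], orig, current or None))
    return changed


# Made-up rows, for illustration only (not real GI numbers).
example = io.StringIO(
    "PIPELINE_ID\tNCBI_REFSEQ_GI_ORIG\tNCBI_REFSEQ_GI_CURRENT\n"
    "1\t1000001\t1000001\n"
    "2\t1000002\t\n"
    "3\t1000003\t2000003\n"
)
discrepancies = find_gi_discrepancies(example)
```

To run this against a downloaded file, pass an open file handle (for example, `open(path, newline="")`) in place of the in-memory example, and adjust `id_column` to whatever the export actually calls the pipeline ID.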
Normalized BLASTP hits
Maximally Representative Clusters (MRCs)
Based on the normalized BLASTP hits above, the hybrid clustering method described in Harlow et al. (2004) was used to build Markov clusters (analogous to homologous sets of proteins) and hybrid clusters (putative orthologs from within Markov clusters). The following file contains the 22,437 sets of putative orthologs we recovered, with each set containing between 4 and 144 sequences.
The MRCs above were aligned using several different methods, and the winning alignment was chosen using WOOF, the validation method described in Beiko, Chan, and Ragan (2005). The alignments below are sorted by algorithm; you can also download just the winners.
T-COFFEE consensus alignments
T-COFFEE progressive alignments
The following file contains the winning alignments, both before and after trimming with GBLOCKS. The filename for each alignment identifies the algorithm that produced the winning alignment, but does not preclude the possibility that other algorithms produced the same alignment. A complete list of algorithms that produced winning alignments (including ties) for each protein data set is available on request.
The collection of concordant and discordant bipartitions is available below. The file "MRCNodeTests.txt" contains a list of every MRC that was subjected to phylogenetic analysis. Each bipartition of the inferred Bayesian tree that had a posterior probability greater than 0.01 is listed on a separate line. Each line gives the IDs of the bipartitions that were addressed (see the PNAS paper for this definition), "C" or "D" to indicate agreement or disagreement with the reference supertree, the posterior probability of that bipartition, and the split of taxa induced by the bipartition. The file "GenomePipe_GenomeName.txt" associates our internal genome IDs with the names of the prokaryotes that were used in the study.
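The line layout described for "MRCNodeTests.txt" could be parsed along the following lines. This is a minimal sketch under several assumptions that are not confirmed by the description: that the four fields are tab-separated and appear in the order listed, that the "C"/"D" flag abbreviates concordant/discordant, and that the split can be kept as raw text. The example lines, the split notation, and the 0.95 cutoff are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class BipartitionRecord:
    bipartition_ids: str  # IDs of the bipartitions addressed, kept as raw text
    status: str           # "C" or "D" (assumed: concordant / discordant)
    posterior: float      # posterior probability of the bipartition
    split: str            # split of taxa, kept as raw text


def parse_node_test_line(line):
    """Parse one line, assuming tab-separated fields in the described order."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 fields, got {len(fields)}")
    status = fields[1].strip()
    if status not in ("C", "D"):
        raise ValueError(f"unexpected status flag: {status!r}")
    # Any extra tabs are treated as part of the split description.
    return BipartitionRecord(fields[0], status, float(fields[2]), "\t".join(fields[3:]))


def strongly_supported(records, threshold=0.95):
    """Keep records at or above an illustrative posterior probability cutoff."""
    return [r for r in records if r.posterior >= threshold]


# Made-up lines, for illustration only; the real split notation may differ.
example_lines = [
    "12\tC\t0.98\t1,2|3,4,5\n",
    "13\tD\t0.40\t1,3|2,4,5\n",
    "14\tD\t0.02\t1,4|2,3,5\n",
]
records = [parse_node_test_line(line) for line in example_lines]
confident = strongly_supported(records)
```

Keeping the ID and split fields as raw strings avoids committing to a notation the description does not specify; they can be parsed further once the actual file has been inspected.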
Datasets used in Beiko, Keith, Harlow, and Ragan paper
These 70 alignments are in Nexus format, with protein IDs derived from the NCBI RefSeq database.
Citing LGT datasets
The appropriate citation depends on which data are being used. The methods are described in several papers:
Hybrid clustering:
Harlow, T.J., Gogarten, J.P., and Ragan, M.A. (2004). A hybrid clustering approach to recognition of protein families in 114 microbial genomes. BMC Bioinformatics 5:45.
Alignment and validation:
Beiko, R.G., Chan, C.X., and Ragan, M.A. (2005). A word-oriented approach to alignment validation. Bioinformatics 21: 2230-2239.
Benchmarking of Bayesian runs:
Beiko, R.G., Keith, J.M., Harlow, T.J., and Ragan, M.A. Searching for convergence in phylogenetic Markov chain Monte Carlo. Systematic Biology, accepted April 2006.
Edit path inference:
Beiko, R.G. and Hamilton, N. (2006). Phylogenetic identification of lateral genetic transfer events. BMC Evol. Biol. 6: 15.
Beiko, R.G., Harlow, T.J., and Ragan, M.A. (2005). Highways of gene sharing in prokaryotes. Proc. Natl. Acad. Sci. USA 102:14332-14337.
Contact Rob Beiko
© 2005-2006 Robert G. Beiko