Download LGTNet and unpack the zip file to a folder of your choice. Then run the command test.bat (Windows) or ./test.sh (Linux) to ensure that everything is working properly. Compare the program output to the example output provided here. The software runs on all platforms that support the Java Runtime Environment (JRE) 1.6.x or higher (tested only on Windows 7 and Linux 3.8.13). To verify that Java is installed, run java -version. The console output should be similar to this example:
java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
Java HotSpot(TM) Client VM (build 17.1-b03, mixed mode, sharing)

If Java is not installed, download the Java runtime environment from http://java.sun.com/products/archive and run the installer.
LGTNet is software for inferring networks of lateral genetic transfer (LGT) from sequence data. In contrast to traditional techniques based on multiple sequence alignments and phylogenetic trees, LGTNet is an alignment-free and tree-free method. As a consequence, LGTNet is very fast (more than 1000 times faster than a phylogenetic approach) but cannot infer when an LGT event happened. It only predicts which species have exchanged genetic material at some point in time and indicates approximately which sequence regions were involved.
LGTNet is invoked from the command line using the following format:
lgtnet.(bat|sh) -i <fasta_file|folder> [-o <output_file|stdout> -n <substring_length> -w <window-size> -m <high|medium|low> -p <profiles_file>]
lgtnet.bat -i test/test.fa
lgtnet.bat -i test/test.fa -o test/network.tsv
./lgtnet.sh -i test/test.fa -o test/network.tsv -n 7 -w 20
./lgtnet.sh -i test/test.fa -o test/network.tsv -p test/profiles.tsv
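When LGTNet is called from a script, the command line can be assembled programmatically. The following is a minimal Python sketch that uses only the options documented above; the function names build_lgtnet_command and run_lgtnet are hypothetical helpers for illustration and are not part of LGTNet.

```python
import subprocess

VALID_MODES = ("high", "medium", "low")


def build_lgtnet_command(script, fasta, output=None, substring_length=None,
                         window_size=None, mode=None, profiles=None):
    """Assemble an LGTNet argument list from the documented options."""
    cmd = [script, "-i", fasta]  # -i is the only required option
    if output is not None:
        cmd += ["-o", output]
    if substring_length is not None:
        cmd += ["-n", str(substring_length)]
    if window_size is not None:
        cmd += ["-w", str(window_size)]
    if mode is not None:
        if mode not in VALID_MODES:
            raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
        cmd += ["-m", mode]
    if profiles is not None:
        cmd += ["-p", profiles]
    return cmd


def run_lgtnet(cmd):
    """Run the command; subprocess raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmd = build_lgtnet_command("./lgtnet.sh", "test/test.fa",
                               output="test/network.tsv",
                               substring_length=7, window_size=20)
    print(" ".join(cmd))
    # prints: ./lgtnet.sh -i test/test.fa -o test/network.tsv -n 7 -w 20
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces. Call run_lgtnet(cmd) from the LGTNet folder to execute the assembled command.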
-i <fasta_file|folder> specifies the input data and is required. Input data can be provided either as a path to a single FASTA file (e.g. -i data/genomes.fa) containing sequence data in multi FASTA format or as a path to a folder (e.g. -i data/sequences) that contains multiple individual FASTA files with sequence data. Note that sequence data must not contain ambiguity characters.
LGTNet extracts sequence identifiers from the FASTA header of each sequence by taking all characters up to the first comma, whitespace or pipe symbol. Here are some examples of FASTA headers and the resulting sequence identifiers:
HEADER                                     SEQUENCE ID
>NC_010498.1                            => NC_010498.1
>NC_010498.1|Escherichia coli SMS-3-5   => NC_010498.1
>NC_010498.1, Escherichia coli SMS-3-5  => NC_010498.1
>NC_010498.1 Escherichia coli SMS-3-5   => NC_010498.1

You should ensure that the extracted sequence identifiers are unique, since they are used in the output files described below.
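To check identifiers before running LGTNet, the extraction rule described above can be reproduced in a few lines of Python. This is a sketch of the documented rule, not LGTNet's own code; the function names extract_sequence_id and duplicate_ids are hypothetical.

```python
import re
from collections import Counter

# The identifier ends at the first comma, pipe symbol or whitespace character.
_DELIMITER = re.compile(r"[,|\s]")


def extract_sequence_id(header):
    """Return the sequence identifier of a FASTA header line."""
    text = header.rstrip("\r\n")
    if text.startswith(">"):
        text = text[1:]
    return _DELIMITER.split(text, maxsplit=1)[0]


def duplicate_ids(headers):
    """Return the identifiers that occur more than once, sorted."""
    counts = Counter(extract_sequence_id(h) for h in headers)
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)


if __name__ == "__main__":
    headers = [
        ">NC_010498.1",
        ">NC_010498.1|Escherichia coli SMS-3-5",
        ">NC_010498.1, Escherichia coli SMS-3-5",
        ">NC_010498.1 Escherichia coli SMS-3-5",
    ]
    for h in headers:
        print(h, "=>", extract_sequence_id(h))
    print("duplicates:", duplicate_ids(headers))
```

All four example headers yield NC_010498.1, so duplicate_ids reports that identifier as non-unique if they appear in the same input.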
-o <output_file|stdout> specifies the output destination for the inferred network and is optional. The network is written either to standard output (-o stdout) or to a file (e.g. -o results/output.tsv). If no output destination is specified, standard output is used. The network is written as a symmetric weight matrix in tab-separated values (TSV) format:
SE001  0.000  0.776  0.807  0.772  0.805  0.778  0.766  0.770  0.624  0.756
SE002  0.776  0.000  0.794  0.704  0.776  0.712  0.688  0.794  0.800  0.758
SE003  0.807  0.794  0.000  0.789  0.822  0.799  0.781  0.716  0.797  0.666
SE004  0.772  0.704  0.789  0.000  0.759  0.673  0.568  0.782  0.794  0.763
SE005  0.805  0.776  0.822  0.759  0.000  0.767  0.763  0.828  0.829  0.819
SE006  0.778  0.712  0.799  0.673  0.767  0.000  0.655  0.787  0.797  0.759
SE007  0.766  0.688  0.781  0.568  0.763  0.655  0.000  0.775  0.800  0.752
SE008  0.770  0.794  0.716  0.782  0.828  0.787  0.775  0.000  0.783  0.595
SE009  0.624  0.800  0.797  0.794  0.829  0.797  0.800  0.783  0.000  0.761
SE010  0.756  0.758  0.666  0.763  0.819  0.759  0.752  0.595  0.761  0.000

The first column contains the sequence identifiers, and the remaining columns follow the same order as the rows. Each matrix cell contains the confidence score for an LGT prediction, ranging from zero to one. For instance, there is fairly high confidence for an interaction between SE001 and SE003, since the score is 0.807. On the other hand, the score for an interaction between sequences SE008 and SE010 is only 0.595.
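The weight matrix is easy to post-process. The Python sketch below reads the matrix and ranks sequence pairs by score. It assumes that the file has no header row and that the identifier and the scores on each row are separated by tabs; read_network and ranked_pairs are hypothetical helper names, and the three rows used here are an excerpt of the example matrix.

```python
def read_network(lines):
    """Parse the matrix into a list of identifiers and a list of score rows."""
    ids, matrix = [], []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if fields == [""]:
            continue  # skip blank lines
        ids.append(fields[0])
        matrix.append([float(v) for v in fields[1:] if v])
    return ids, matrix


def ranked_pairs(ids, matrix):
    """Return (score, id_a, id_b) for each unordered pair, highest score first."""
    pairs = []
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):  # upper triangle only; the matrix is symmetric
            pairs.append((matrix[i][j], ids[i], ids[j]))
    return sorted(pairs, reverse=True)


if __name__ == "__main__":
    # Excerpt (first three sequences) of the example matrix.
    excerpt = [
        "SE001\t0.000\t0.776\t0.807\n",
        "SE002\t0.776\t0.000\t0.794\n",
        "SE003\t0.807\t0.794\t0.000\n",
    ]
    ids, matrix = read_network(excerpt)
    for score, a, b in ranked_pairs(ids, matrix):
        print(f"{a}\t{b}\t{score:.3f}")
```

To process a real result, pass an open file instead of the excerpt, for example read_network(open("test/network.tsv")).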
-n <substring_length> specifies the length of the substrings. This parameter is optional and default values of 21 and 7 are chosen for DNA and amino acid sequences, respectively. LGTNet determines the sequence type automatically.
-w <window-size> specifies window length (or bin size). This parameter is optional and default values of 60 and 20 are chosen for DNA and amino acid sequences, respectively.
-m <high|medium|low> specifies one of three different computation modes.
The default mode is high.
-m high: high speed, high memory consumption.
-m medium: medium speed, medium memory consumption.
-m low: low speed, low memory consumption.
-p <profiles_file> specifies a file to write profile data to. By default, no profile data are written. The profile file contains the histogram data for each pair of sequences (see example below). All values are tab-separated. The first column shows the IDs of the compared sequences, and the integers that follow are the frequencies of substring matches for each histogram bin. Together with the specified window size (= bin size), these data can be used to identify sequence regions that are likely to be involved in LGT.
SE003-SE009  28  12  18  26   0  22  32  10  15   4  ...
SE005-SE009  40  12  16  16   8  16  20   8   0   0  ...
SE002-SE005  14  26  14  30   8  26  20  18   0   0  ...
SE002-SE010  28  26  14  32  40  26  26  14  26   0  ...
SE001-SE002  28  20  18  28  40  18  20  20  12   6  ...
SE001-SE007  28  34  18  28  40  26  20  36  10   0  ...
...
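One way to turn a profile into candidate regions is sketched below in Python. It rests on two assumptions that are not confirmed by this documentation: that bin k covers sequence positions k*w to (k+1)*w - 1 for window size w, and that bins with many substring matches are the interesting ones. The threshold min_matches and the helper names read_profiles and candidate_regions are hypothetical, and the counts are the first ten values shown for SE003-SE009 (the row is truncated in the example).

```python
def read_profiles(lines):
    """Map each pair label (e.g. 'SE003-SE009') to its list of bin counts."""
    profiles = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if fields == [""]:
            continue  # skip blank lines
        profiles[fields[0]] = [int(v) for v in fields[1:] if v]
    return profiles


def candidate_regions(counts, window_size, min_matches):
    """Return (start, end) ranges, end exclusive, for bins at or above the threshold.

    Assumption: bin k covers positions [k * window_size, (k + 1) * window_size).
    """
    return [(k * window_size, (k + 1) * window_size)
            for k, count in enumerate(counts)
            if count >= min_matches]


if __name__ == "__main__":
    line = "SE003-SE009\t28\t12\t18\t26\t0\t22\t32\t10\t15\t4\n"
    profiles = read_profiles([line])
    regions = candidate_regions(profiles["SE003-SE009"], window_size=20, min_matches=26)
    print(regions)  # [(0, 20), (60, 80), (120, 140)]
```

The pair label is kept as a single string rather than split on the hyphen, because sequence identifiers may themselves contain hyphens.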
Here is an example of the LGTNet output when running the test example via test.bat or ./test.sh. Computation should finish within a few seconds, and the test folder should then contain two new files: network.tsv and profiles.tsv.
LGTNet Version: 1.00
loading ...
parameters -i test/test.fa -o test/network.tsv -p test/profiles.tsv
sequences read 10
alphabet AA
substring length 7
window size 20
mode high
processing.............................................
writing ...
finished.
Running test.bat is the same as running lgtnet.bat with the following parameter settings:
lgtnet.bat -i test/test.fa -o test/network.tsv -p test/profiles.tsv
LGTNet is written in Scala (Version 2.10), and runs under the Java Runtime Environment 1.6 or later. The source code is available upon request.
In case of an error, LGTNet writes a log file, error.log, with detailed information about the location and cause of the problem.
LGTNet does not permit ambiguity characters in sequences. If you encounter an error message such as java.lang.Exception: Invalid letter in sequence: 'X' Alphabet=AA it means that your sequence data contain ambiguity characters that need to be removed.
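Sequences can be screened for such characters before running LGTNet. The Python sketch below assumes the unambiguous alphabets are ACGT for DNA and the 20 standard amino acid letters for proteins, and it ignores letter case; the exact alphabets and case handling used by LGTNet are not stated here, so treat both as assumptions. The names invalid_letters and check_fasta are hypothetical.

```python
# Assumed alphabets; LGTNet's actual definitions may differ.
DNA_ALPHABET = frozenset("ACGT")
AA_ALPHABET = frozenset("ACDEFGHIKLMNPQRSTVWY")


def invalid_letters(sequence, alphabet):
    """Return the sorted letters of a sequence that are not in the alphabet."""
    return sorted(set(sequence.upper()) - alphabet)


def check_fasta(lines, alphabet):
    """Map each FASTA header to the invalid letters found in its sequence."""
    problems = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
        else:
            bad = invalid_letters(line, alphabet)
            if bad:
                problems.setdefault(header, set()).update(bad)
    return {h: sorted(letters) for h, letters in problems.items()}


if __name__ == "__main__":
    fasta = [">seq1", "MKTAYIAK", ">seq2", "MKXLLB"]
    print(check_fasta(fasta, AA_ALPHABET))  # {'seq2': ['B', 'X']}
```

Sequences reported by check_fasta need to be cleaned or removed before they are passed to LGTNet.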
An out-of-memory error occurs when LGTNet (or more precisely the Java Runtime Environment) runs out of memory. In this case, replace the option -Xmx600M in the lgtnet.bat file with a higher value (e.g. -Xmx2000M means 2000 megabytes and -Xmx12G means 12 gigabytes) or run LGTNet in low memory consumption mode with -m low.
If the amount of memory requested (using -Xmx) is too large, the Java Runtime Environment reports the error "Could not create the Java virtual machine". How much is too large depends on the amount of main memory available in your computer.
The computation time grows quadratically with the number of species and linearly with the sequence length. If your computer has multiple cores, run LGTNet with the option -m high to accelerate computation. On a single-core machine, the setting -m medium is equally fast or even faster.
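The scaling behaviour can be used for a rough runtime estimate relative to a run you have already timed. The Python sketch below assumes that the quadratic term comes from comparing every pair of sequences, which is an interpretation rather than a documented fact, and it ignores constant overheads and the effect of the -m mode. The function names are hypothetical.

```python
def pair_count(n_sequences):
    """Number of unordered sequence pairs: n * (n - 1) / 2."""
    return n_sequences * (n_sequences - 1) // 2


def relative_runtime(n_sequences, seq_length, ref_sequences, ref_length):
    """Estimated runtime as a multiple of a reference run.

    Scales with the number of pairs and linearly with the sequence length.
    """
    pair_factor = pair_count(n_sequences) / pair_count(ref_sequences)
    length_factor = seq_length / ref_length
    return pair_factor * length_factor


if __name__ == "__main__":
    print(pair_count(10))    # 45
    print(pair_count(100))   # 4950
    # 100 sequences of length 2000 versus a reference run of 10 sequences of length 1000
    print(relative_runtime(100, 2000, 10, 1000))  # 220.0
```

Under these assumptions, increasing the input tenfold in sequence count and doubling the sequence length makes the run roughly 220 times longer.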
Version 1.00 (15.07.13): First public version
S. Maetschke, L. McIntyre, C. Chan, M. Ragan
Fast inference of lateral genetic transfer networks.