LGTNet

Version 1.00

installation

Download LGTNet and unpack the zip file to a folder of your choice. Then run test.bat (Windows) or ./test.sh (Linux) to check that everything works, and compare the program output to the example output in the example section below. LGTNet runs on all platforms that support the Java Runtime Environment (JRE) 1.6 or higher, but has been tested only on Windows 7 and Linux 3.8.13. To verify that Java is installed, run java -version. The console output should be similar to this example:

java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
Java HotSpot(TM) Client VM (build 17.1-b03, mixed mode, sharing)

If Java is not installed, download the Java Runtime Environment from http://java.sun.com/products/archive and run the installer.
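
The version check can also be scripted. The following is a minimal sketch in Python, not part of LGTNet: the function names are hypothetical, and it assumes that the version string appears in quotes after the word "version", as in the example output above. It handles both the legacy numbering ("1.6.0_22" is Java 6) and the newer numbering ("17.0.2" is Java 17).

```python
import re

def java_feature_version(version_output):
    """Return the Java feature version (6 for "1.6.0_22", 17 for "17.0.2"),
    or None if no version string is found."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    second = match.group(2)
    # Legacy scheme: "1.6.0_22" means Java 6; newer scheme: "17.0.2" means Java 17.
    if first == 1 and second is not None:
        return int(second)
    return first

def meets_requirement(version_output, minimum=6):
    """True if the reported Java version is at least `minimum` (6 = JRE 1.6)."""
    feature = java_feature_version(version_output)
    return feature is not None and feature >= minimum

sample = 'java version "1.6.0_22"\nJava(TM) SE Runtime Environment (build 1.6.0_22-b04)'
print(java_feature_version(sample))  # prints 6
```

Note that java -version writes to standard error, so a script that captures the output itself should read the error stream rather than standard output.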

introduction

LGTNet is a software tool for inferring networks of lateral genetic transfer (LGT) from sequence data. In contrast to traditional techniques based on multiple sequence alignments and phylogenetic trees, LGTNet is an alignment-free and tree-free method. As a consequence, LGTNet is very fast (more than 1000 times faster than a phylogenetic approach) but cannot infer when an LGT event happened. It only predicts which species have exchanged genetic material at some point in time and indicates approximately which sequence regions were involved.

usage

LGTNet is invoked from the command line using the following format:

lgtnet.(bat|sh) -i <fasta_file|folder> [-o <output_file|stdout> -n <substring_length> -w <window-size> -m <high|medium|low> -p <profiles_file>]

examples

lgtnet.bat -i test/test.fa
lgtnet.bat -i test/test.fa -o test/network.tsv
./lgtnet.sh -i test/test.fa -o test/network.tsv -n 7 -w 20
./lgtnet.sh -i test/test.fa -o test/network.tsv -p test/profiles.tsv
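
For scripted runs, the command line can be assembled programmatically. The sketch below is an illustration only: build_lgtnet_command and run_lgtnet are hypothetical helper names, and the default launcher path ./lgtnet.sh assumes the script is started from the LGTNet folder on Linux. It uses only the flags shown in the usage line above.

```python
import subprocess

def build_lgtnet_command(input_path, output=None, substring_length=None,
                         window_size=None, mode=None, profiles=None,
                         launcher="./lgtnet.sh"):
    """Assemble the argument list for an LGTNet run (hypothetical helper)."""
    if mode is not None and mode not in ("high", "medium", "low"):
        raise ValueError("mode must be 'high', 'medium' or 'low'")
    command = [launcher, "-i", input_path]
    optional = [("-o", output), ("-n", substring_length),
                ("-w", window_size), ("-m", mode), ("-p", profiles)]
    for flag, value in optional:
        if value is not None:  # omitted options fall back to LGTNet's defaults
            command += [flag, str(value)]
    return command

def run_lgtnet(*args, **kwargs):
    """Run LGTNet and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_lgtnet_command(*args, **kwargs), check=True)

print(" ".join(build_lgtnet_command("test/test.fa", output="test/network.tsv",
                                    substring_length=7, window_size=20)))
```

The printed command corresponds to the third example above. On Windows, pass launcher="lgtnet.bat" instead.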

options

-i <fasta_file|folder> specifies the input data and is required. Input data can be provided either as a path to a single file containing sequence data in multi-FASTA format (e.g. -i data/genomes.fa) or as a path to a folder containing multiple individual FASTA files (e.g. -i data/sequences). Note that sequence data must not contain ambiguity characters.

LGTNet extracts the sequence identifier from the FASTA header of each sequence by taking all characters up to the first comma, whitespace or pipe symbol. Here are some examples of FASTA headers and the resulting sequence identifiers:

HEADER                                        SEQUENCE ID
>NC_010498.1                             =>   NC_010498.1 
>NC_010498.1|Escherichia coli SMS-3-5    =>   NC_010498.1  
>NC_010498.1, Escherichia coli SMS-3-5   =>   NC_010498.1 
>NC_010498.1 Escherichia coli SMS-3-5    =>   NC_010498.1 

You should ensure that the extracted sequence identifiers are unique, since they are used in the output files described below.
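
The extraction rule and the uniqueness check can be reproduced before running LGTNet. The following Python sketch mimics the rule as described above; it is not LGTNet's own code, and the function names are hypothetical.

```python
import re
from collections import Counter

def extract_sequence_id(header):
    """Return all characters up to the first comma, whitespace or pipe symbol."""
    text = header[1:] if header.startswith(">") else header
    return re.split(r"[,\s|]", text, maxsplit=1)[0]

def duplicate_ids(headers):
    """Return the sorted list of identifiers that occur more than once."""
    counts = Counter(extract_sequence_id(h) for h in headers)
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)

headers = [
    ">NC_010498.1",
    ">NC_010498.1|Escherichia coli SMS-3-5",
    ">NC_010498.1, Escherichia coli SMS-3-5",
    ">NC_010498.1 Escherichia coli SMS-3-5",
]
for h in headers:
    print(extract_sequence_id(h))
```

All four headers yield NC_010498.1, so duplicate_ids(headers) reports that identifier as non-unique; an input file with these four headers would need to be renamed before use.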

-o <output_file|stdout> specifies the output destination for the inferred network and is optional. The network is written either to standard output (-o stdout) or to a file (e.g. -o results/output.tsv). If no output destination is specified, standard output is used. The network is written as a symmetric weight matrix in tab-separated values (TSV) format:

SE001 0.000 0.776 0.807 0.772 0.805 0.778 0.766 0.770 0.624 0.756
SE002 0.776 0.000 0.794 0.704 0.776 0.712 0.688 0.794 0.800 0.758
SE003 0.807 0.794 0.000 0.789 0.822 0.799 0.781 0.716 0.797 0.666
SE004 0.772 0.704 0.789 0.000 0.759 0.673 0.568 0.782 0.794 0.763
SE005 0.805 0.776 0.822 0.759 0.000 0.767 0.763 0.828 0.829 0.819
SE006 0.778 0.712 0.799 0.673 0.767 0.000 0.655 0.787 0.797 0.759
SE007 0.766 0.688 0.781 0.568 0.763 0.655 0.000 0.775 0.800 0.752
SE008 0.770 0.794 0.716 0.782 0.828 0.787 0.775 0.000 0.783 0.595
SE009 0.624 0.800 0.797 0.794 0.829 0.797 0.800 0.783 0.000 0.761
SE010 0.756 0.758 0.666 0.763 0.819 0.759 0.752 0.595 0.761 0.000

The first column contains the sequence identifiers, and the matrix columns follow the same order as the rows. Each matrix cell contains the confidence score for an LGT prediction, ranging from zero to one. For instance, there is fairly high confidence for an interaction between SE001 and SE003, since the score is 0.807. On the other hand, the score for an interaction between sequences SE008 and SE010 is only 0.595.
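
To work with the network in a script, the matrix can be read into a dictionary of pairwise scores. This is a minimal sketch with hypothetical function names; it splits on any whitespace, which covers tab-separated files because the identifiers themselves contain no whitespace. The sample data is the upper-left 3x3 excerpt of the matrix above.

```python
def parse_network(text):
    """Parse the weight matrix into {(id_a, id_b): score} for each unordered pair."""
    ids, rows = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        ids.append(fields[0])
        rows.append([float(v) for v in fields[1:]])
    scores = {}
    for i, row in enumerate(rows):
        if len(row) != len(ids):
            raise ValueError("matrix is not square at row " + ids[i])
        # The matrix is symmetric, so the upper triangle is sufficient.
        for j in range(i + 1, len(ids)):
            scores[(ids[i], ids[j])] = row[j]
    return scores

def top_pairs(scores, k=3):
    """Return the k pairs with the highest confidence scores."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]

sample = (
    "SE001\t0.000\t0.776\t0.807\n"
    "SE002\t0.776\t0.000\t0.794\n"
    "SE003\t0.807\t0.794\t0.000\n"
)
for pair, score in top_pairs(parse_network(sample)):
    print(pair[0], pair[1], score)
```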

-n <substring_length> specifies the length of the substrings. This parameter is optional and default values of 21 and 7 are chosen for DNA and amino acid sequences, respectively. LGTNet determines the sequence type automatically.

-w <window-size> specifies the window length (or bin size). This parameter is optional and default values of 60 and 20 are chosen for DNA and amino acid sequences, respectively.

-m <high|medium|low> specifies one of three computation modes. The default mode is high.
-m high: high speed, high memory consumption.
-m medium: medium speed, medium memory consumption.
-m low: low speed, low memory consumption.

-p <profiles_file> specifies a file to write profile data to. By default, no profile data are written. The profile file contains the histogram data for each pair of sequences (see example below). All values are tab-separated. The first column shows the identifiers of the two compared sequences, and the integers that follow are the frequencies of substring matches in each histogram bin. Together with the specified window size (= bin size), these data can be used to identify sequence regions that are likely to be involved in LGT.

SE003-SE009 28  12  18  26  0   22  32  10  15  4 ...
SE005-SE009 40  12  16  16  8   16  20  8   0   0 ...
SE002-SE005 14  26  14  30  8   26  20  18  0   0 ...
SE002-SE010 28  26  14  32  40  26  26  14  26  0 ...
SE001-SE002 28  20  18  28  40  18  20  20  12  6 ...
SE001-SE007 28  34  18  28  40  26  20  36  10  0 ...
...
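
The profile data can be turned into candidate regions by locating the bins with the most substring matches. The sketch below is an illustration under stated assumptions, not a documented part of LGTNet: it assumes that bin i covers the zero-based sequence positions i * w to (i + 1) * w - 1 for window size w, and it leaves open which of the two sequences the positions refer to. The function names are hypothetical.

```python
def parse_profile_line(line):
    """Split one profile line into the pair label and its list of bin counts."""
    fields = line.split()
    return fields[0], [int(v) for v in fields[1:]]

def peak_windows(counts, window_size, top=3):
    """Return (start, end, count) for the `top` bins with the most matches.

    Assumption: bin i spans zero-based positions
    i * window_size .. (i + 1) * window_size - 1.
    """
    ranked = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)[:top]
    return [(i * window_size, (i + 1) * window_size - 1, counts[i])
            for i in sorted(ranked)]

# First ten bins of the SE003-SE009 line shown above.
line = "SE003-SE009\t28\t12\t18\t26\t0\t22\t32\t10\t15\t4"
pair, counts = parse_profile_line(line)
for start, end, count in peak_windows(counts, window_size=20):
    print(pair, start, end, count)
```

With a window size of 20 (the default for amino acid sequences), the three strongest bins in this excerpt are bins 0, 3 and 6.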

example

Here is an example of the LGTNet output when running the test example via test.bat or ./test.sh. Computation should finish within a few seconds, and the test folder should then contain two new files: network.tsv and profiles.tsv.

LGTNet Version: 1.00
loading ...
parameters        -i test/test.fa -o test/network.tsv -p test/profiles.tsv
sequences read    10
alphabet          AA
substring length  7
window size       20
mode              high
processing.............................................
writing ...
finished.

Running test.bat is the same as running lgtnet.bat with the following parameter settings:

lgtnet.bat -i test/test.fa -o test/network.tsv -p test/profiles.tsv

software

LGTNet is written in Scala (Version 2.10), and runs under the Java Runtime Environment 1.6 or later. The source code is available upon request.

frequently asked questions

Logging of errors

In case of an error, LGTNet writes a log file, error.log, with detailed information about the location and cause of the problem.

Invalid letter in sequence

LGTNet does not permit ambiguity characters in sequences. If you encounter an error message such as java.lang.Exception: Invalid letter in sequence: 'X' Alphabet=AA it means that your sequence data contain ambiguity characters that need to be removed.
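
Offending sequences can be located before running LGTNet. The following Python sketch reports, per sequence, the letters that fall outside an allowed alphabet. The alphabets used here (A, C, G, T for DNA and the 20 standard amino acids) and the case-insensitive comparison are assumptions, since the exact alphabets accepted by LGTNet are not listed in this document; the function names are hypothetical.

```python
# Assumed alphabets; LGTNet's exact accepted letters are not documented here.
DNA = set("ACGT")
AA = set("ACDEFGHIKLMNPQRSTVWY")

def read_fasta(text):
    """Yield (header, sequence) pairs from multi-FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def invalid_letters(text, alphabet):
    """Map each header to the sorted letters of its sequence not in `alphabet`."""
    report = {}
    for header, sequence in read_fasta(text):
        bad = sorted(set(sequence.upper()) - alphabet)
        if bad:
            report[header] = bad
    return report

example = ">s1\nMKVLA\n>s2\nMKXLA\nBZ\n"
print(invalid_letters(example, AA))
```

In this example only s2 is reported, with the ambiguity characters B, X and Z.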

Out of memory

This error message occurs when LGTNet (or more precisely the Java Runtime Environment) runs out of memory. In this case, replace the option -Xmx600M in the lgtnet.bat file with a higher value (e.g. -Xmx2000M for 2000 megabytes or -Xmx12G for 12 gigabytes), or run LGTNet in low memory consumption mode with -m low.

Could not create the Java virtual machine

If the amount of memory requested (using -Xmx) is too large, the Java Runtime Environment reports the error "Could not create the Java virtual machine". How much is too large depends on the amount of main memory available on your computer.

Computation takes too long

The computation time grows quadratically with the number of species and linearly with the sequence length. If your computer has multiple cores, run LGTNet with the option -m high to accelerate computation. On a single-core machine, the setting -m medium is equally fast or even faster.
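
The scaling behaviour allows a rough estimate of how a larger data set compares to one that has already been timed. The sketch below assumes that the quadratic term corresponds to the number of sequence pairs, n(n-1)/2, which matches the symmetric all-pairs output matrix; it ignores constant overheads and differences between computation modes, and the function names are hypothetical.

```python
def pair_count(n):
    """Number of unordered sequence pairs among n sequences."""
    return n * (n - 1) // 2

def relative_cost(n, length, base_n, base_length):
    """Estimated run time relative to a reference run with base_n sequences
    of length base_length (quadratic in n via pairs, linear in length)."""
    return (pair_count(n) / pair_count(base_n)) * (length / base_length)

print(pair_count(10))                         # prints 45
print(relative_cost(100, 1000, 10, 1000))     # prints 110.0
```

For example, going from 10 to 100 sequences of the same length increases the number of pairs from 45 to 4950, so the run would be expected to take roughly 110 times as long.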

history


version   date       description
1.00      15.07.13   First public version

contact

name         email
Mark Ragan   m.ragan@uq.edu.au

references

S. Maetschke, L. McIntyre, C. Chan, M. Ragan
Fast inference of lateral genetic transfer networks.