go2ppi

Version 1.06

installation

Download go2ppi from here and unpack the zip file to a folder of your choice. Then run the command go2ppi.bat test/test.cfg to check that everything works properly, and compare the program output with the example output shown in the example section below. Also have a look at the configuration file go2ppi.cfg, which describes a more realistic application. Use go2ppi.cfg as a template for your own prediction projects.

The software runs on all platforms that support the Java Runtime Environment (JRE) 1.6.x or higher. It has been tested only on Windows 7 but should work under Unix and Mac as well. To test whether Java is installed, run java -version. The console output should be similar to this:

java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
Java HotSpot(TM) Client VM (build 17.1-b03, mixed mode, sharing)

If Java is not installed, download the Java runtime environment from http://java.sun.com/products/archive and run the installer.
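
The version check can also be scripted. The following is a minimal sketch in Python, assuming that the first line of the java -version output has the form shown above. The helper names parse_java_version and java_is_supported are made up for this illustration and are not part of go2ppi.

```python
import re

def parse_java_version(first_line):
    """Extract (major, minor) from a line like: java version "1.6.0_22"."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', first_line)
    if match is None:
        raise ValueError("unrecognized version line: " + first_line)
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    return major, minor

def java_is_supported(first_line, minimum=(1, 6)):
    """True if the reported version is at least the required JRE 1.6."""
    return parse_java_version(first_line) >= minimum

print(java_is_supported('java version "1.6.0_22"'))
```

Note that java -version writes its output to standard error, so a script that captures the output has to read that stream rather than standard output.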

introduction

go2ppi is a predictor for protein-protein interactions (PPI) based on Gene Ontology (GO) annotations. The predictor reads a list of protein identifiers with their corresponding GO annotations. It generates a list of protein pairs, each with a confidence value that indicates whether the two proteins are likely to interact (for details see the paper). Using the predictor involves three phases, training, test and prediction, which are described in the following.

training

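During the training phase the predictor learns from a given interaction network, the Gene Ontology, and proteins annotated with GO terms to infer interactions between proteins. Required input data are the interaction network, the protein annotations file (.anno), and the Gene Ontology file (e.g. go.obo).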

When the training is finished the go2ppi program writes a model file (.mdl) that contains the generated predictor, which is required for the subsequent test or prediction phases. A typical console output of the training phase is shown below:

TRAINING
  predictor: RF
  loading network: data/SC.net
  annotating network: data/SC_uniprot.anno
  loading ontology: data/go.obo 
  extracting sub-ontologies: BP,CC
  creating training data
  training predictor
  saving predictor: data/SC.mdl
  finished.

test

The test phase is optional. It performs a cross-validation test to estimate the prediction accuracy of the predictor on new data. Prediction accuracy is reported as AUC (Area under the ROC curve). A value of 0.5 means random guessing while an AUC of 1.0 indicates a perfect predictor. Typical AUCs for this kind of problem range from 0.7 to 0.9.
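
To make the AUC values more concrete, here is a minimal sketch in Python that computes the AUC as the fraction of (interacting, non-interacting) pairs of examples in which the interacting one receives the higher score. This illustrates the measure only; it is not necessarily how go2ppi computes it, and the function name auc and the example scores are chosen for illustration.

```python
def auc(scores, labels):
    """Area under the ROC curve as the fraction of correctly ranked pairs.

    labels: 1 for interacting pairs, 0 for non-interacting pairs.
    Ties between a positive and a negative score count as half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

print(auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # perfect ranking: 1.0
print(auc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))  # uninformative scores: 0.5
```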

The testing phase requires the same input data as the training phase. A typical console output of the testing phase is shown below:

TESTING
  loading predictor: data/SC.mdl
  loading network: data/SC.net
  annotating network: data/SC_uniprot.anno
  loading ontology: data/go.obo 
  extracting sub-ontologies: BP,CC
  creating test data
  testing predictor
  5 fold x-validation AUC = 0.907
  finished.

prediction

Prediction is the application of the predictor to new data. All proteins for which predictions are to be generated have to be listed in the annotations file (.anno) used for training the predictor. Furthermore, no predictions are possible for proteins without GO annotations, and self-interactions cannot be predicted.

In addition to the files required for training, an input file listing the proteins to predict interactions for is needed. This input file must have the same format (but not necessarily the same content) as the annotations file: a protein accession number followed by a list of GO terms. Here is an example:

   P32837	GO:0000329,GO:0016021,GO:0015495,GO:0015489
   Q12044	GO:0005737,GO:0005515
   ...
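
For readers who want to check or generate such files with a script, the following Python sketch reads lines in the format shown above. It assumes that the accession number and the comma-separated GO term list are separated by whitespace, as in the example; the function name parse_annotations is hypothetical and not part of go2ppi.

```python
def parse_annotations(lines):
    """Map each protein accession to its list of GO terms."""
    annotations = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or incomplete lines
        accession, terms = fields
        annotations[accession] = [t for t in terms.split(",") if t]
    return annotations

example = [
    "   P32837\tGO:0000329,GO:0016021,GO:0015495,GO:0015489",
    "   Q12044\tGO:0005737,GO:0005515",
]
for accession, terms in parse_annotations(example).items():
    print(accession, len(terms))
```

Lines without a GO term list are skipped in this sketch, which mirrors the restriction that no predictions are possible for proteins without GO annotations.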

The prediction phase generates an output file that contains protein pairs, given by their accession numbers, and a score in the interval 0..1 that indicates the confidence/strength of the interaction. A score of 0 means that no interaction is predicted, and 1 indicates a highly confident prediction of an interaction. A short example of an output file follows:

   P53012 P36018 0.520
   P53012 P32259 0.530
   P53012 P11746 0.130
   P53012 P40465 0.170
   P53012 P17536 0.250
   P53012 P38265 0.250
   ...

Note that the predictions file can get very large (several GB), since it contains the confidence scores for all possible pairings of proteins (excluding self-pairings). To reduce the size, a threshold can be specified so that only interactions with a confidence score greater than or equal to the threshold are reported.
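
The effect of the threshold can be reproduced on an existing predictions file. The Python sketch below assumes the three-column format of the example output (two accession numbers and a score, separated by whitespace); the function name filter_predictions is chosen for illustration and is not part of go2ppi.

```python
def filter_predictions(lines, threshold):
    """Keep (protein_a, protein_b, score) triples with score >= threshold."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        protein_a, protein_b, score = fields[0], fields[1], float(fields[2])
        if score >= threshold:
            kept.append((protein_a, protein_b, score))
    return kept

predictions = [
    "   P53012 P36018 0.520",
    "   P53012 P32259 0.530",
    "   P53012 P11746 0.130",
    "   P53012 P40465 0.170",
]
for triple in filter_predictions(predictions, 0.5):
    print(triple)
```

With a threshold of 0.5 only the first two pairs of this example are kept.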

A typical console output of the prediction phase is shown below:

PREDICTING
  loading predictor: data/SC.mdl
  loading ontology: data/go.obo 
  extracting sub-ontologies: BP,CC
  loading inputs: data/SC.anno
  predicting...
  0.5 %
  ...
  100.0 %
  finished. 

Depending on the number of proteins to infer interactions for, the prediction phase can take a very long time (up to years). The computation time grows quadratically with the number of proteins within the annotation file.
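
To see what quadratic growth means in practice, the number of pairs can be computed directly. The sketch below assumes that each unordered pair of distinct proteins is scored once, i.e. n(n-1)/2 pairs for n proteins. Whether go2ppi scores ordered or unordered pairs is not stated here, but the growth is quadratic in either case. The function name count_pairs is chosen for illustration.

```python
def count_pairs(n):
    """Number of unordered pairs of distinct proteins among n proteins."""
    return n * (n - 1) // 2

for n in (1000, 2000, 20000):
    print(n, count_pairs(n))
```

Doubling the number of proteins from 1000 to 2000 raises the number of pairs from 499500 to 1999000, roughly a factor of four.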

To shorten the overall computation time, the prediction phase can be distributed over multiple computers or cluster nodes by using filter files. A filter file directly lists the protein pairs to compute predictions for. Using a different filter file on each node allows independent, parallel predictions of interactions, so the overall computation time is essentially divided by the number of available nodes.
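
The idea behind this partitioning can be sketched in a few lines of Python. This is an illustration only: it is not the way go2ppi generates its filter files, the on-disk format of a filter file is not described here, and the function name split_pairs is hypothetical. The sketch assumes unordered pairs without self-pairings and deals them out round-robin so that every chunk receives about the same amount of work.

```python
from itertools import combinations

def split_pairs(accessions, parts):
    """Distribute all unordered protein pairs round-robin over `parts` chunks."""
    chunks = [[] for _ in range(parts)]
    for index, pair in enumerate(combinations(accessions, 2)):
        chunks[index % parts].append(pair)
    return chunks

for chunk in split_pairs(["P53012", "P36018", "P32259", "P11746"], 3):
    print(chunk)
```

Each pair appears in exactly one chunk, so the chunks can be processed independently and their results combined afterwards.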

To generate filter files, use the gen_filter.bat tool. It reads an annotation file (.anno) and creates the given number of filter files (.fil) in the specified output folder. See the following example:

gen_filter.bat test\test.anno 5 test\filters

This will create 5 filter files (filter1.fil ... filter5.fil) in the folder test\filters. Within the configuration file (.cfg) a filter file is enabled and specified via the filter parameter. Here is an example of a configuration file with filtering:

mode         = PREDICT
annotations  = test/test.anno
ontology     = data/go.obo
model        = test/test.mdl
network      = test/test.ppi
inputs       = test/test.anno
filter       = test/filters/filter1.fil
predictions  = test/test.pred
self-test    = YES
sub-ontology = BP,CC,MF
predictor    = NB
threshold    = 0.9
folds        = 5
runs         = 1

Note that filter files are used only during the prediction phase and therefore cannot be used to distribute or accelerate the training or test phase. However, the training and test phases are typically much faster and do not pose a problem. When the prediction is performed in a distributed fashion, the overall prediction result is obtained simply by combining the output files of the individual computers or nodes.
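
Combining the per-node results could look like the following Python sketch. It assumes that every node writes its own predictions file in the three-column format shown earlier (two accession numbers and a score); the function name merge_predictions is hypothetical, and sorting by score is a convenience added here, not something go2ppi is stated to do.

```python
def merge_predictions(per_node_lines):
    """Combine prediction lines from several nodes, highest score first."""
    merged = []
    for lines in per_node_lines:
        for line in lines:
            fields = line.split()
            if len(fields) != 3:
                continue  # skip blank or malformed lines
            merged.append((fields[0], fields[1], float(fields[2])))
    # sorted() is stable, so equal scores keep their original order
    return sorted(merged, key=lambda triple: triple[2], reverse=True)

node1 = ["P53012 P36018 0.520", "P53012 P11746 0.130"]
node2 = ["P53012 P32259 0.530"]
for triple in merge_predictions([node1, node2]):
    print(triple)
```

In practice the line lists would be read from the predictions files produced on the individual nodes.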

usage

go2ppi is invoked from the command line using the following format:

go2ppi.bat configuration_file [parameter_overrides]

configuration_file is a configuration file (.cfg) that contains all parameter settings (e.g. file paths) required to run go2ppi. See the go2ppi.cfg configuration file for details. Input and output files are specified within the configuration file. In its simplest form go2ppi can be started as follows:

go2ppi.bat go2ppi.cfg

[parameter_overrides] is an optional list of parameter settings that override settings within the configuration file. Here is an example that overrides the parameters predictor and sub-ontology:

go2ppi.bat go2ppi.cfg predictor=NB sub-ontology=BP,CC

This makes it possible to keep a configuration file with standard settings and to run the software with different case-specific settings from the command line, without modifying the configuration file.
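
The override semantics can be illustrated with a short Python sketch. It is not go2ppi's actual parser: it only assumes the "key = value" layout of the configuration file and the "key=value" form of the command-line overrides shown above, and the function names parse_config and apply_overrides are made up for this example.

```python
def parse_config(lines):
    """Read 'key = value' lines into a dict, trimming surrounding whitespace."""
    settings = {}
    for line in lines:
        if "=" not in line:
            continue  # ignore lines that are not settings
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings

def apply_overrides(settings, overrides):
    """Return a copy of settings in which 'key=value' overrides take precedence."""
    merged = dict(settings)
    for item in overrides:
        key, value = item.split("=", 1)
        merged[key.strip()] = value.strip()
    return merged

config = parse_config([
    "predictor    = RF",
    "sub-ontology = BP,CC,MF",
    "threshold    = 0.9",
])
print(apply_overrides(config, ["predictor=NB", "sub-ontology=BP,CC"]))
```

Settings that are not overridden, such as threshold in this example, keep the value from the configuration file.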

example

Here is an example output of go2ppi when running the test example via go2ppi.bat test\test.cfg. Computation should finish within a few minutes, and the test folder should then contain two new files: test.mdl and test.pred.

go2ppi
Version 1.05
config file: test\test.cfg
SETTINGS
  predictor       = 'NB'
  network         = 'test/test.ppi'
  self-test       = 'YES'
  runs            = '1'
  threshold       = '0.9'
  model           = 'test/test.mdl'
  sub-ontology    = 'BP,CC,MF'
  mode            = 'TRAIN,TEST,PREDICT'
  annotations     = 'test/test.anno'
  inputs          = 'test/test.anno'
  ontology        = 'data/go.obo'
  folds           = '5'
  predictions     = 'test/test.pred'
TRAINING
  predictor: NB
  loading network: test/test.ppi
  annotating network: test/test.anno
  loading ontology: data/go.obo
  extracting sub-ontologies: BP,CC,MF
  creating training data
  training predictor
  saving predictor: test/test.mdl
  performing self-test: YES
  self-test AUC = 0.980
  finished.
TESTING
  loading predictor: test/test.mdl
  loading network: test/test.ppi
  annotating network: test/test.anno
  loading ontology: data/go.obo
  extracting sub-ontologies: BP,CC,MF
  creating test data
  testing predictor
  5 fold x-validation AUC = 0.896
  finished.
PREDICTING
  loading predictor: test/test.mdl
  loading ontology: data/go.obo
  extracting sub-ontologies: BP,CC,MF
  loading inputs: test/test.anno
  predicting...
  100.0 %
finished.

To run your own experiment, call go2ppi.bat go2ppi.cfg with a configuration file (go2ppi.cfg) that points to your data.

software

go2ppi is written in Scala (Version 2.9), and runs under the Java Runtime Environment 1.6 or later. The source code is available upon request. Note that go2ppi is part of a rather large library and modifications might be challenging to implement.

known problems

java.lang.StackOverflowError

For large Gene Ontologies or PPI data sets the software might report a java.lang.StackOverflowError when reading the model file of a predictor (e.g. in TEST or PREDICT mode). In this case, increase the stack size of the Java runtime environment by changing the -Xss option in go2ppi.bat, for instance from -Xss100M to -Xss200M. Note that an error will occur if the values for stack (-Xss) or heap (-Xmx) size are too large; see the next problem.

Could not create the Java virtual machine

If the values for stack (-Xss) or heap (-Xmx) size specified in go2ppi.bat are too large, the Java runtime environment will report the error "Could not create the Java virtual machine". Which values are too large depends on the amount of main memory available on your computer.

Prediction file gets too large

Set the threshold for reported predictions to a higher value.

Prediction takes too long

The time to predict interactions grows quadratically with the number of proteins, since predictions for all possible pairs (excluding self-pairings) need to be computed. See filter files as a method to speed up the prediction phase by distributing the computation on multiple computers or cluster nodes. Note that filter files will not allow you to accelerate the training or test phase.

history


version  date      description
1.06     21.06.12  Memory consumption of gen_filters reduced (dramatically). Error with multiple parameters for go2ppi.bat fixed. Shell scripts added.
1.05     08.06.12  Filter files to parallelize go2ppi introduced and gen_filters.bat provided. go.obo updated.
1.02     22.11.11  Whitespace trimmed from config values.
1.01     02.08.11  Regulatory links filtered out of Gene Ontology graph.
1.00     16.07.11  First public version.

contact

name              email
Stefan Maetschke  s.maetschke@uq.edu.au
Mark Ragan        m.ragan@uq.edu.au

references

Stefan R. Maetschke, Martin Simonsen, Melissa J. Davis, Mark A. Ragan
Gene Ontology driven inference of protein-protein interactions using inducers
Bioinformatics. 2011

Supplementary material