Download go2ppi from here and unpack the zip file to a folder of your choice. Then run the command go2ppi.bat test/test.cfg to ensure that everything is working properly, and compare the program output to the example output provided here. Also have a look at the configuration file go2ppi.cfg, which describes a more realistic application, and use it as a template for your own prediction projects.
The software runs on all platforms that support the Java Runtime Environment (JRE) 1.6.x or higher (tested only on Windows 7, but it should work under Unix and Mac as well). To test whether Java is installed, run java -version. The console output should be similar to this:
    java version "1.6.0_22"
    Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
    Java HotSpot(TM) Client VM (build 17.1-b03, mixed mode, sharing)

If Java is not installed, download the Java runtime environment from http://java.sun.com/products/archive and run the installer.
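The version check above can also be done programmatically. The following is a minimal sketch, assuming the banner has the usual `version "x.y..."` form shown above; the function names are hypothetical and not part of go2ppi. Note that `java -version` writes its banner to the error stream, so a script capturing it has to read stderr.

```python
import re

def parse_java_version(banner):
    """Extract (major, minor) from a `java -version` banner line."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        raise ValueError("no version string found in banner")
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    return major, minor

def meets_requirement(banner, required=(1, 6)):
    """True if the reported version is at least `required` (default: JRE 1.6)."""
    return parse_java_version(banner) >= required

if __name__ == "__main__":
    # Banner copied from the example output above.
    print(meets_requirement('java version "1.6.0_22"'))
```

Newer Java releases report versions such as "17.0.2", which this comparison also treats as newer than 1.6.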
go2ppi is a predictor for protein-protein interactions (PPI) based on Gene Ontology (GO) annotations. The predictor reads a list of protein identifiers with their corresponding GO annotations and generates a list of protein pairs, each with a confidence value that indicates how likely the two proteins are to interact (for details see the paper). Using the predictor involves three phases (training, testing, and prediction), which are described in the following.
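During the training phase, the predictor learns from a given interaction network, the Gene Ontology, and proteins annotated with GO terms to infer interactions between proteins. Required input data are an annotations file and a network file (edge list), shown by example below, together with the Gene Ontology: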
    P32837 GO:0000329,GO:0016021,GO:0015495,GO:0015489
    Q12044 GO:0005737,GO:0005515
    ...

Note: The annotation file must contain all proteins used during training, test or prediction and the same file must be used for all three phases! Annotation files can be generated with the provided id2go.bat tool. It takes a file containing protein accession numbers (e.g. a list or a network), retrieves GO annotations from Uniprot (internet connection is required) and outputs an annotation file in the format described above.
    P11972 P38853
    P33299 P22141
    P20676 Q02821
    ...

Note: The edge list may contain additional columns that are ignored. Important: All protein accession numbers must be contained within the annotations file.
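Because every accession number in the network must also appear in the annotations file, it can be useful to check this before training. The sketch below assumes that both files use whitespace-separated columns as in the examples above; the function names are hypothetical and this is not how go2ppi itself reads the files.

```python
def parse_annotations(lines):
    """Map accession number -> set of GO terms (one protein per line)."""
    annotations = {}
    for line in lines:
        parts = line.split()
        # Skip blank lines, "..." placeholders and lines without GO terms.
        if len(parts) < 2 or not parts[1].startswith("GO:"):
            continue
        annotations[parts[0]] = set(parts[1].split(","))
    return annotations

def parse_edges(lines):
    """Read the first two columns of each line as one interaction."""
    edges = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            edges.append((parts[0], parts[1]))  # additional columns are ignored
    return edges

def missing_proteins(edges, annotations):
    """Accession numbers used in the network but absent from the annotations."""
    return sorted({p for edge in edges for p in edge if p not in annotations})

if __name__ == "__main__":
    anno = parse_annotations([
        "P32837 GO:0000329,GO:0016021,GO:0015495,GO:0015489",
        "Q12044 GO:0005737,GO:0005515",
    ])
    net = parse_edges(["P32837 Q12044 0.8", "P32837 P11972"])
    print(missing_proteins(net, anno))  # ['P11972']
```

An empty result means the network and annotations are consistent with the requirement stated above.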
A typical console output of the training phase is shown below:

    TRAINING
    predictor: RF
    loading network: data/SC.net
    annotating network: data/SC_uniprot.anno
    loading ontology: data/go.obo
    extracting sub-ontologies: BP,CC
    creating training data
    training predictor
    saving predictor: data/SC.mdl
    finished.
The test phase is optional. It performs a cross-validation test to estimate the prediction accuracy of the predictor on new data. Prediction accuracy is reported as AUC (Area under the ROC curve). A value of 0.5 means random guessing while an AUC of 1.0 indicates a perfect predictor. Typical AUCs for this kind of problem range from 0.7 to 0.9.
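To make the AUC values more concrete: the AUC equals the probability that a randomly chosen interacting pair receives a higher score than a randomly chosen non-interacting pair, with ties counted as one half. The following sketch computes it directly from that definition. It is an illustration only, and the function name and sample scores are made up; it does not reproduce how go2ppi runs its cross-validation.

```python
def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) score pairs ranked correctly."""
    if not scores_pos or not scores_neg:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count half
    return wins / (len(scores_pos) * len(scores_neg))

if __name__ == "__main__":
    print(auc([0.9, 0.8], [0.2, 0.1]))  # 1.0, perfect separation
    print(auc([0.5, 0.5], [0.5, 0.5]))  # 0.5, no better than guessing
    print(auc([0.9, 0.3], [0.5, 0.1]))  # 0.75
```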
The testing phase requires the same input data as the training phase. A typical console output of the testing phase is shown below:
    TESTING
    loading predictor: data/SC.mdl
    loading network: data/SC.net
    annotating network: data/SC_uniprot.anno
    loading ontology: data/go.obo
    extracting sub-ontologies: BP,CC
    creating test data
    testing predictor
    5 fold x-validation AUC = 0.907
    finished.
Prediction is the application of the predictor to new data. All proteins for which predictions are to be generated must be listed in the annotations file (.anno) that was used to train the predictor. No predictions are possible for proteins without GO annotations, and self-interactions cannot be predicted.
In addition to the files required for training, an input file listing the proteins to predict interactions for is needed. This input file must have the same format as the annotations file (protein accession numbers, each followed by a list of GO terms), but not necessarily the same content. Here is an example:
    P32837 GO:0000329,GO:0016021,GO:0015495,GO:0015489
    Q12044 GO:0005737,GO:0005515
    ...
The prediction phase generates an output file that contains protein pairs, given by their accession numbers, and a score in the interval 0..1 that indicates the confidence/strength of the interaction: 0 means there is no interaction and 1 indicates a highly confident prediction of an interaction. A short example of an output file follows:
    P53012 P36018 0.520
    P53012 P32259 0.530
    P53012 P11746 0.130
    P53012 P40465 0.170
    P53012 P17536 0.250
    P53012 P38265 0.250
    ...
Note that the predictions file can get very large (several GB), since it contains the confidence scores for all possible pairings of proteins (excluding self-pairings). To reduce its size, a threshold can be specified so that only interactions with a confidence score greater than or equal to the threshold are reported.
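The threshold is applied by go2ppi itself while writing the predictions. If an unfiltered predictions file already exists, the same cut can be applied afterwards. The sketch below assumes the three-column, whitespace-separated format shown in the example output; the function name is hypothetical.

```python
def filter_predictions(lines, threshold):
    """Yield (protein_a, protein_b, score) for every line with score >= threshold."""
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank lines and "..." placeholders
        try:
            score = float(parts[2])
        except ValueError:
            continue  # third column is not a number
        if score >= threshold:
            yield parts[0], parts[1], score

if __name__ == "__main__":
    sample = [
        "P53012 P36018 0.520",
        "P53012 P32259 0.530",
        "P53012 P11746 0.130",
    ]
    for a, b, score in filter_predictions(sample, 0.5):
        print(a, b, score)
```

Because the function is a generator, it can be fed an open file object line by line, so a multi-GB predictions file never has to fit into memory.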
A typical console output of the prediction phase is shown below:
    PREDICTING
    loading predictor: data/SC.mdl
    loading ontology: data/go.obo
    extracting sub-ontologies: BP,CC
    loading inputs: data/SC.anno
    predicting... 0.5 % ... 100.0 %
    finished.
Depending on the number of proteins to infer interactions for, the prediction phase can take a very long time (up to years). The computation time grows quadratically with the number of proteins within the annotation file.
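The quadratic growth follows from the number of pairings: for n proteins there are n(n-1)/2 unordered pairs without self-pairings. The sketch below uses that count to give a rough runtime estimate. It assumes that each unordered pair is scored once, and the throughput figure is a pure placeholder, not a measured go2ppi value; the function names are hypothetical.

```python
def pair_count(n):
    """Number of unordered protein pairs, excluding self-pairings."""
    return n * (n - 1) // 2

def estimated_hours(n, pairs_per_second):
    """Rough runtime estimate for a given (assumed) prediction throughput."""
    return pair_count(n) / pairs_per_second / 3600.0

if __name__ == "__main__":
    for n in (1000, 2000, 4000):
        # Doubling the number of proteins roughly quadruples the work.
        print(n, pair_count(n))
    # Placeholder throughput of 1000 pairs per second, for illustration only.
    print(estimated_hours(6000, 1000))
```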
To shorten the overall computation time, the prediction phase can be distributed over multiple computers or cluster nodes by using filter files. A filter file directly lists the protein pairs to compute predictions for. Using a different filter file on each node allows independent, parallel prediction of interactions, so the overall computation time is essentially divided by the number of available nodes.
To generate filter files, use the gen_filter.bat tool. It reads an annotation file (.anno) and creates the given number of filter files (.fil) in the specified output folder. See the following example:
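The idea behind distributing the work can be sketched as splitting the set of all protein pairs into disjoint, roughly equal parts, one per node. The code below is an illustration of that idea only: the round-robin scheme and the function name are assumptions, and it says nothing about the actual format of go2ppi filter files or how the go2ppi tooling creates them.

```python
from itertools import combinations

def partition_pairs(proteins, parts):
    """Distribute all unordered protein pairs round-robin over `parts` buckets."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    buckets = [[] for _ in range(parts)]
    for index, pair in enumerate(combinations(sorted(proteins), 2)):
        buckets[index % parts].append(pair)
    return buckets

if __name__ == "__main__":
    for node, bucket in enumerate(partition_pairs(["A", "B", "C", "D"], 2), start=1):
        print(node, bucket)
```

Every pair lands in exactly one bucket, so the nodes can work independently and their results can be combined afterwards without overlap.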
    gen_filter.bat test\test.anno 5 test\filters

This will create 5 filter files (filter1.fil ... filter5.fil) in the folder test\filters. Within the configuration file (.cfg), a filter file is enabled and specified via the filter parameter. Here is an example of a configuration file with filtering:
    mode = PREDICT
    annotations = test/test.anno
    ontology = data/go.obo
    model = test/test.mdl
    network = test/test.ppi
    inputs = test/test.anno
    filter = test/filters/filter1.fil
    predictions = test/test.pred
    self-test = YES
    sub-ontology = BP,CC,MF
    predictor = NB
    threshold = 0.9
    folds = 5
    runs = 1
Note that filter files are used only during the prediction phase and therefore cannot be used to distribute or accelerate the training or test phase. However, the training and test phases are typically much faster and do not pose a problem. When the prediction is performed in a distributed fashion, the overall prediction result is generated simply by aggregating the output files of the individual computers or nodes.
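Aggregating the per-node results amounts to concatenating the individual predictions files. A minimal sketch follows, assuming each node wrote the three-column format of the predictions file; sorting by score is an optional convenience added here, and the function name is hypothetical.

```python
def merge_predictions(per_node_lines, sort_by_score=True):
    """Combine prediction lines from several nodes into one list of tuples."""
    merged = []
    for lines in per_node_lines:
        for line in lines:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            merged.append((parts[0], parts[1], float(parts[2])))
    if sort_by_score:
        # Highest-confidence interactions first.
        merged.sort(key=lambda row: row[2], reverse=True)
    return merged

if __name__ == "__main__":
    node1 = ["P53012 P36018 0.520"]
    node2 = ["P53012 P32259 0.530", ""]
    for row in merge_predictions([node1, node2]):
        print(*row)
```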
go2ppi is invoked from the command line using the following format:
    go2ppi.bat configuration_file [parameter_overrides]

configuration_file is a configuration file (.cfg) that contains all parameter settings (e.g. file paths) required to run go2ppi. See the go2ppi.cfg configuration file for details. Input and output files are specified within the configuration file. In its simplest form go2ppi can be started as follows:
    go2ppi.bat go2ppi.cfg

[parameter_overrides] is an optional list of parameter settings that override settings in the configuration file. Here is an example that overrides the parameters predictor and sub-ontology:
    go2ppi.bat go2ppi.cfg predictor=NB sub-ontology=BP,CC

This makes it possible to use a configuration file with standard settings and still run the software with case-specific settings from the command line, without modifying the configuration file.
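The override mechanism can be pictured as a dictionary of settings read from the configuration file, on top of which the command-line assignments are applied. The sketch below assumes that every relevant configuration line has the form key = value; it is not go2ppi's own parser, and the function names are hypothetical.

```python
def parse_config(lines):
    """Read 'key = value' lines into a dict, trimming surrounding whitespace."""
    settings = {}
    for line in lines:
        if "=" not in line:
            continue  # ignore lines that are not assignments
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings

def apply_overrides(settings, args):
    """Return a copy of `settings` with command-line 'key=value' overrides applied."""
    merged = dict(settings)
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep:
            raise ValueError("override must have the form key=value: " + arg)
        merged[key.strip()] = value.strip()
    return merged

if __name__ == "__main__":
    config = parse_config([
        "predictor = RF",
        "sub-ontology = BP,CC,MF",
        "threshold = 0.9",
    ])
    final = apply_overrides(config, ["predictor=NB", "sub-ontology=BP,CC"])
    print(final)
```

Settings that are not overridden, such as threshold in this example, keep the value from the configuration file.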
Here is an example of the go2ppi output when running the test example via go2ppi.bat test\test.cfg. Computation should finish within a few minutes, and the test folder should then contain two new files: test.mdl and test.pred.
    go2ppi Version 1.05
    config file: test\test.cfg
    SETTINGS
    predictor = 'NB'
    network = 'test/test.ppi'
    self-test = 'YES'
    runs = '1'
    threshold = '0.9'
    model = 'test/test.mdl'
    sub-ontology = 'BP,CC,MF'
    mode = 'TRAIN,TEST,PREDICT'
    annotations = 'test/test.anno'
    inputs = 'test/test.anno'
    ontology = 'data/go.obo'
    folds = '5'
    predictions = 'test/test.pred'
    TRAINING
    predictor: NB
    loading network: test/test.ppi
    annotating network: test/test.anno
    loading ontology: data/go.obo
    extracting sub-ontologies: BP,CC,MF
    creating training data
    training predictor
    saving predictor: test/test.mdl
    performing self-test: YES
    self-test AUC = 0.980
    finished.
    TESTING
    loading predictor: test/test.mdl
    loading network: test/test.ppi
    annotating network: test/test.anno
    loading ontology: data/go.obo
    extracting sub-ontologies: BP,CC,MF
    creating test data
    testing predictor
    5 fold x-validation AUC = 0.896
    finished.
    PREDICTING
    loading predictor: test/test.mdl
    loading ontology: data/go.obo
    extracting sub-ontologies: BP,CC,MF
    loading inputs: test/test.anno
    predicting... 100.0 %
    finished.
To run your own experiment call go2ppi.bat go2ppi.cfg with a configuration file (go2ppi.cfg) that links to your data.
go2ppi is written in Scala (Version 2.9), and runs under the Java Runtime Environment 1.6 or later. The source code is available upon request. Note that go2ppi is part of a rather large library and modifications might be challenging to implement.
For large Gene Ontologies or PPI data sets the software might report a java.lang.StackOverflowError when reading the model file of a predictor (e.g. TEST or PREDICT mode). In this case the stack size for the Java runtime environment needs to be increased by changing the option -Xss in go2ppi.bat. For instance, increase from -Xss100M to -Xss200M. Note that an error will occur if the values for stack (-Xss) or heap (-Xmx) size are too large. See below.
If the values for stack (-Xss) or heap (-Xmx) size specified in go2ppi.bat are too large, the Java runtime environment will report the error "Could not create the Java virtual machine". What counts as too large depends on the amount of main memory available on your computer.
If the predictions file becomes too large, set the threshold for reported predictions to a higher value.
The time to predict interactions grows quadratically with the number of proteins, since predictions for all possible pairs (excluding self-pairings) need to be computed. See filter files as a method to speed up the prediction phase by distributing the computation on multiple computers or cluster nodes. Note that filter files will not allow you to accelerate the training or test phase.
|Version|Date|Changes|
|1.06|21.06.12|Memory consumption of gen_filters reduced (dramatically). Error with multiple parameters for go2ppi.bat fixed. Shell scripts added.|
|1.05|08.06.12|Filter files to parallelize go2ppi introduced and gen_filters.bat provided. go.obo updated.|
|1.02|22.11.11|Whitespaces from config values trimmed.|
|1.01|02.08.11|Regulatory links filtered out of Gene Ontology graph.|
|1.00|16.07.11|First public version.|
Stefan R. Maetschke, Martin Simonsen, Melissa J. Davis, Mark A. Ragan
Gene Ontology driven inference of protein-protein interactions using inducers