Genetic Algorithm Neural Networks for Regulatory Region Identification
Robert G. Beiko and Robert L. Charlebois
Overview
GANN is a machine learning method designed with the complexities of transcriptional regulation in mind.
The key principle is that regulatory regions are composed of features such as consensus strings, characterized binding sites, and DNA structural properties. GANN identifies these features in a set of sequences, and then identifies combinations of features that can differentiate between the positive set (sequences with known or putative regulatory function) and the negative set (sequences with no regulatory function). Once these features have been identified, they can be used to classify new sequences of unknown function.
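As a rough illustration of what a single sequence feature could look like, the sketch below counts matches to a consensus string in a DNA sequence. It is not taken from the GANN source: the function names `iupacMatches` and `countConsensusHits`, and the choice of IUPAC ambiguity codes for writing the consensus, are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if the DNA base `base` (A, C, G, or T) is permitted by the
// IUPAC nucleotide code `code` (for example R = A or G, N = any base).
// Unknown codes never match.
bool iupacMatches(char code, char base) {
    std::string allowed;
    switch (code) {
        case 'A': allowed = "A"; break;
        case 'C': allowed = "C"; break;
        case 'G': allowed = "G"; break;
        case 'T': allowed = "T"; break;
        case 'R': allowed = "AG"; break;
        case 'Y': allowed = "CT"; break;
        case 'S': allowed = "CG"; break;
        case 'W': allowed = "AT"; break;
        case 'K': allowed = "GT"; break;
        case 'M': allowed = "AC"; break;
        case 'B': allowed = "CGT"; break;
        case 'D': allowed = "AGT"; break;
        case 'H': allowed = "ACT"; break;
        case 'V': allowed = "ACG"; break;
        case 'N': allowed = "ACGT"; break;
        default: return false;
    }
    return allowed.find(base) != std::string::npos;
}

// Counts the (possibly overlapping) positions in `sequence` at which the
// hypothetical consensus string `consensus` matches.
std::size_t countConsensusHits(const std::string& sequence,
                               const std::string& consensus) {
    assert(!consensus.empty());
    if (sequence.size() < consensus.size()) return 0;
    std::size_t hits = 0;
    for (std::size_t start = 0; start + consensus.size() <= sequence.size(); ++start) {
        bool match = true;
        for (std::size_t i = 0; i < consensus.size() && match; ++i)
            match = iupacMatches(consensus[i], sequence[start + i]);
        if (match) ++hits;
    }
    return hits;
}
```

For example, `countConsensusHits("GGTATAATGC", "TATAAT")` returns 1. A count like this could serve as one input value among many describing a sequence.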
- Artificial Neural Networks are used for pattern detection because they can model complex interactions between input variables (i.e., the features). This can be important when the positive set contains different types of regulatory regions that must all be classified.
- The number of sequence encodings that can be generated is practically infinite, and even a reasonable number (a few hundred) is too many to present to the neural network at once. The Outer Genetic Algorithm (OGA) was designed to test different subsets from the pool of available representations, and to generate new subsets using evolutionary operations.
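The subset search performed by the OGA can be pictured with a minimal sketch like the one below. It is not the GANN implementation: the names, the selection scheme (keep the better half), the crossover and mutation operators, and the `fitness` callback (a stand-in for training and scoring a neural network on the chosen features) are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <random>
#include <vector>

using Subset = std::vector<std::size_t>;               // indices into the feature pool
using Fitness = std::function<double(const Subset&)>;  // hypothetical scoring callback

// Draws k distinct feature indices from a pool of poolSize features.
Subset randomSubset(std::size_t poolSize, std::size_t k, std::mt19937& rng) {
    Subset all(poolSize);
    std::iota(all.begin(), all.end(), std::size_t{0});
    std::shuffle(all.begin(), all.end(), rng);
    all.resize(std::min(k, poolSize));
    return all;
}

// Mutation: overwrite one position with a feature not already in the subset.
void mutate(Subset& s, std::size_t poolSize, std::mt19937& rng) {
    if (s.empty() || s.size() >= poolSize) return;  // nothing new to swap in
    std::uniform_int_distribution<std::size_t> pickPos(0, s.size() - 1);
    std::uniform_int_distribution<std::size_t> pickFeature(0, poolSize - 1);
    std::size_t candidate = pickFeature(rng);
    while (std::find(s.begin(), s.end(), candidate) != s.end())
        candidate = pickFeature(rng);
    s[pickPos(rng)] = candidate;
}

// Crossover: the child draws its features from the union of both parents.
Subset crossover(const Subset& a, const Subset& b, std::mt19937& rng) {
    Subset merged(a);
    for (std::size_t f : b)
        if (std::find(merged.begin(), merged.end(), f) == merged.end())
            merged.push_back(f);
    std::shuffle(merged.begin(), merged.end(), rng);
    merged.resize(a.size());
    return merged;
}

// Evolves a population of feature subsets and returns the best one found.
// The better half of each generation survives unchanged, so the best score
// never decreases. A real fitness function would be expensive, so a
// practical version would cache scores instead of re-evaluating in the sort.
Subset evolveSubset(std::size_t poolSize, std::size_t k,
                    std::size_t populationSize, std::size_t generations,
                    const Fitness& fitness, unsigned seed) {
    assert(populationSize > 0);
    std::mt19937 rng(seed);
    std::vector<Subset> population;
    for (std::size_t i = 0; i < populationSize; ++i)
        population.push_back(randomSubset(poolSize, k, rng));

    auto byFitness = [&fitness](const Subset& x, const Subset& y) {
        return fitness(x) > fitness(y);  // best first
    };
    const std::size_t survivors = std::max<std::size_t>(1, populationSize / 2);
    std::uniform_int_distribution<std::size_t> pickParent(0, survivors - 1);

    for (std::size_t g = 0; g < generations; ++g) {
        std::sort(population.begin(), population.end(), byFitness);
        for (std::size_t i = survivors; i < populationSize; ++i) {
            const Subset& mother = population[pickParent(rng)];
            const Subset& father = population[pickParent(rng)];
            Subset child = crossover(mother, father, rng);
            mutate(child, poolSize, rng);
            population[i] = child;
        }
    }
    std::sort(population.begin(), population.end(), byFitness);
    return population.front();
}
```

A caller would supply its own scoring function, for instance one that trains a network on the selected features and returns its classification accuracy on held-out sequences. Keeping the subset size fixed at `k` is a simplification chosen for this sketch.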
Implementation
The GANN suite is a set of Perl scripts and C++ programs that extract genomic sequences of interest, extract the desired sequence features, and identify useful combinations of these features with the core machine learning algorithm. The modular design of the suite allows the input of tabular data from outside sources, and analysis of observed sequence properties with more traditional statistical analysis methods.
GANN is currently at Version 2.0; there are many more features that I would like to implement, but no time frame has been set for these changes. Requests for modifications and bug reports are welcome.
Alternatively, since the source code is released under the GPL and available for download and inspection (and is hopefully not too inscrutable), you can always implement changes yourself :^>
Documentation
GANN 2.0 flowchart (.pdf)
The GANN 2.0 Manual (txt) / (doc) / (PDF)
Download GANN
Win32
Win32 executables + Perl scripts
Source code
Source code for Win32 and UNIX
Each of the 4 C++ programs has its own makefile; simply type 'make' in the appropriate directory to generate the executable.
The .mcp files included are project files for Metrowerks CodeWarrior for Windows; if you have CodeWarrior, open these to compile the source code.
Unfortunately, due to differences in C++ string stream libraries, the current implementation of GANN will not compile properly on Mac OS X.
Citing GANN
The main citation for GANN is:
Beiko, R.G. and Charlebois, R.L. (2005). GANN: genetic algorithm neural networks for the detection of conserved combinations of features in DNA. BMC Bioinformatics 6: 36.
Contact
You can E-mail me your comments at r.beiko@imb.uq.edu.au
© 2004-2006 Robert G. Beiko