- GAME is a software tool that uses a genetic algorithm to find motifs
(statistically over-represented patterns) in a set of related DNA
sequences. For example, the motifs could be transcription factor
binding sites in a set of co-regulated genes.
- GAME is implemented in Java. The
Java source code is available upon request.
- Usage: java -jar motif.jar -i
Inputfile -minw minWidth -expw expectedWidth -maxw maxWidth -o Outputfile
(options)
Inputfile must contain DNA sequences in FASTA format. Here is a sample
input file in FASTA format, consisting of 20 DNA sequences, each
implanted with one 16-bp motif site.
Options:
-run <number of times to run GAME to find motifs (default 10)>
     GAME cannot guarantee that a single run finds the optimal motifs.
     More runs make it less likely that GAME misses the optimal motifs,
     at the cost of more time. In theory, with a large enough number of
     runs, GAME's search becomes equivalent to an exhaustive search and
     the optima are guaranteed. It is a tradeoff between accuracy and time.
-n   <number of top motifs to report (default 3)>
-d   1 [to examine only the forward strand (default 2, examine both the
     forward strand and its reverse complement)]
-a   <frequency of nucleotide A in the background (default estimated
     from Inputfile)>
-t   <frequency of nucleotide T in the background (default estimated
     from Inputfile)>
-c   <frequency of nucleotide C in the background (default estimated
     from Inputfile)>
-g   <frequency of nucleotide G in the background (default estimated
     from Inputfile)>
-v   1 [to report the search results for all lengths in the output file
     (default 0, report only the results for the optimal length)]
-m   <symbol that is prohibited in the motif sites (default '#')>
-x   <overlap length allowed between two different motif sites
     (default 0)>
-cap <maximum number of sequences GAME can search (default 1024)>
-pop <population size in the genetic algorithm search (default 500)>
-gen <maximum number of generations in the genetic algorithm search
     (default 3000)>
-mut <mutation rate in the genetic algorithm search (default 0.001)>
-con <number of consecutive generations without improvement after which
     the genetic algorithm search is considered converged (default 50)>
Note: -a, -t, -c, -g must all be provided together and be positive;
otherwise they are ignored. If the four values do not sum to 1, they
are normalized so that they do.
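The background-frequency rule in the note above can be illustrated with a short Python sketch. This is not GAME's source code: the function names read_fasta, estimate_background and resolve_background are hypothetical, and counting only the letters A, C, G, T when estimating the default from the input is an assumption about how GAME behaves.

```python
from collections import Counter

BASES = "ACGT"

def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def estimate_background(records):
    """Estimate base frequencies from the input (assumed: only A/C/G/T counted)."""
    counts = Counter()
    for _, seq in records:
        counts.update(b for b in seq if b in BASES)
    total = sum(counts[b] for b in BASES)
    if total == 0:
        raise ValueError("no A/C/G/T characters found in the input")
    return {b: counts[b] / total for b in BASES}

def resolve_background(user, records):
    """Apply the documented rule: use -a/-t/-c/-g only if all four are
    given and positive, normalizing them to sum to 1; otherwise fall
    back to frequencies estimated from the input file."""
    if user and all(user.get(b) is not None and user[b] > 0 for b in BASES):
        total = sum(user[b] for b in BASES)
        return {b: user[b] / total for b in BASES}
    return estimate_background(records)
```

For instance, resolve_background({"A": 3, "T": 3, "C": 2, "G": 2}, records) yields 0.3, 0.3, 0.2, 0.2, while a dictionary that supplies only two of the four values is ignored and the estimate from the input is used instead.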
- Outputfile: For each length, GAME searches for and returns
the best motif(s) it can find, each with a motif score. The motif with
the highest motif score is the optimal motif, and its length is the
optimal length, based on a Bayesian model. The information reported for
each motif includes: motif width, motif sites, motif score, relative entropy,
and a degenerate representation of the motif (considering all bases with >=
20% abundance, written in IUPAC codes);
and, for each motif site, the sequence id and the starting position of the site.
Here is a sample output file.
Paste the aligned sequences listed below each motif into WebLogo
to visualize the motif as a sequence logo.
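As a rough illustration of two of the reported quantities, the Python sketch below derives a degenerate IUPAC representation (keeping every base with at least 20% abundance in a column) and a relative entropy from a set of aligned motif sites. It is a sketch under stated assumptions, not GAME's implementation: the function names are hypothetical, and the relative entropy shown (sum over columns and bases of f * log2(f / q), with no pseudocounts) is one common definition that may differ from the one GAME uses.

```python
import math

BASES = "ACGT"

# Standard IUPAC nucleotide codes, keyed by the set of bases they stand for.
IUPAC = {frozenset(k): v for k, v in {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}.items()}

def column_frequencies(sites):
    """Per-column base frequencies of equally long aligned sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all motif sites must have the same width")
    freqs = []
    for i in range(width):
        column = [s[i].upper() for s in sites]
        freqs.append({b: column.count(b) / len(column) for b in BASES})
    return freqs

def degenerate_consensus(sites, threshold=0.20):
    """IUPAC string built from all bases with abundance >= threshold."""
    letters = []
    for f in column_frequencies(sites):
        kept = frozenset(b for b in BASES if f[b] >= threshold)
        letters.append(IUPAC.get(kept, "N"))
    return "".join(letters)

def relative_entropy(sites, background):
    """Assumed definition: sum of f * log2(f / q) over columns and bases."""
    total = 0.0
    for f in column_frequencies(sites):
        for b in BASES:
            if f[b] > 0:
                total += f[b] * math.log2(f[b] / background[b])
    return total
```

With the four sites ACGT, ACGA, ATGT and AAGT, the second column contains A, C and T at 25% or more each, so it is written as H, and the fourth column (T and A) becomes W.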
- Example: You have collected a set of co-expressed
genes. You suspect they may be co-regulated by 2
transcription factors (TFs), i.e., the upstream sequences of these
genes may all contain the binding sites of the two TFs. So you collect
the upstream sequences of these genes and store them in the file
SampleInput. From your biological domain knowledge,
you know that the width of the binding sites is very likely to be
16. You are not 100% sure about this, or you think the binding
site width varies between genes, but you know for sure that
the width is at least 13 and at most 20.
You want to run GAME 10 times, report the top 2 motifs, and have GAME store
the output in the file SampleOutput. So you run
GAME as:
java -jar motif.jar -i SampleInput -minw 13 -expw 16 -maxw 20 -o SampleOutput -n 2 -run 10
Another example:
java -jar motif.jar -i SampleInput -minw 13 -expw 16 -maxw 20 -o SampleOutput -n 2 -run 10 -d 1 -v 1 -x 2 -m N -a 0.3 -t 0.3 -c 0.2 -g 0.2
This searches only the forward strand, reports the search results for
all lengths, allows a 2-bp overlap between two different motif sites,
prohibits 'N' in the motif sites, and specifies the background A, T, C, G
frequencies.
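If GAME is to be launched from a script, the command lines in the example can be assembled programmatically. The Python helper below is a hypothetical convenience wrapper, not part of GAME: the name build_game_command is invented here, the jar is assumed to be motif.jar in the working directory, and the check that minw <= expw <= maxw is an assumed sanity check rather than a documented requirement. It only uses options listed on this page.

```python
def build_game_command(input_file, output_file, minw, expw, maxw,
                       jar="motif.jar", **options):
    """Return the argument list for one GAME invocation.

    Extra keyword arguments are passed through as '-name value' pairs,
    e.g. n=2, run=10, d=1 become '-n 2 -run 10 -d 1'.
    """
    # Assumed sanity check; GAME's own validation is not documented here.
    if not (minw <= expw <= maxw):
        raise ValueError("expected minw <= expw <= maxw")
    cmd = ["java", "-jar", jar,
           "-i", input_file,
           "-minw", str(minw), "-expw", str(expw), "-maxw", str(maxw),
           "-o", output_file]
    for name, value in options.items():
        cmd += ["-" + name, str(value)]
    return cmd
```

The resulting list can be handed to Python's standard subprocess.run(cmd, check=True); build_game_command("SampleInput", "SampleOutput", 13, 16, 20, n=2, run=10) reproduces the first command in the example.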
Email: zhiwei@njit.edu and stjensen@wharton.upenn.edu
Last updated 6/9/2008