GAME
Genetic Algorithm based Motif Elicitation



Overview

  • GAME is software that uses a genetic algorithm to find motifs (statistically over-represented patterns) in a set of related DNA sequences. For example, the motifs could be transcription factor binding sites in a set of co-regulated genes.

Download

  • GAME is implemented in Java. The Java source code is available upon request.

Documentation

  • Usage: java -jar motif.jar -i Inputfile -minw minWidth -expw expectedWidth -maxw maxWidth -o Outputfile (options)
    Inputfile must contain DNA sequences in FASTA format. Here is a sample input file in FASTA format, consisting of 20 DNA sequences, each implanted with one 16 bp motif site.
    Options:
     

    -run
      <number of times to run GAME to find motifs (default 10)> GAME cannot guarantee that a single run finds the optimal motifs. The more runs GAME performs, the less likely it is to miss the optimal motifs, at the cost of more time. In theory, when the number of runs is large enough, the GAME search becomes equivalent to an exhaustive search and the optima are guaranteed. The choice is a tradeoff between accuracy and time.
     
    -n
      <number of top motifs to report (default 3)>
     
    -d
      1 [to examine only the forward strand (default 2, examine both the forward strand and its reverse complement)]
     
    -a
      <frequency of Nucleotide A in the background (default estimated from Inputfile)>
     
    -t
    <frequency of Nucleotide T in the background (default estimated from Inputfile)>
     
    -c
    <frequency of Nucleotide C in the background (default estimated from Inputfile)>
     
    -g
    <frequency of Nucleotide G in the background (default estimated from Inputfile)>
     
    -v
      1 [to report the search results for all lengths in the output file (default 0, report only the results for the optimal length)]
     
    -m
      <symbol that is prohibited in the motif sites (default '#')>
     
    -x
      <overlap length allowed between two different motif sites (default 0)>
     
    -cap
    <maximum number of sequences GAME can search (default 1024)>
     
    -pop
    <population size in genetic algorithm search (default 500)>
     
    -gen
    <maximum evolution generation in genetic algorithm search (default 3000)>
     
    -mut
    <mutation rate in genetic algorithm search (default 0.001)>
     
    -con
      <number of consecutive generations without improvement after which the genetic algorithm search is considered converged (default 50)>
    Note: -a, -t, -c, and -g must all be provided together and must all be positive; otherwise they are ignored. If the four values do not sum to 1, they are normalized so that they do.
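
    The note above on background frequencies can be illustrated with a short Python sketch. This is not GAME's source code: the function names are hypothetical, and estimating the background by simple base counting is an assumption about what "estimated from Inputfile" might mean.

```python
def normalize_background(a=None, t=None, c=None, g=None):
    """Mimic the rule in the note: return normalized A/T/C/G frequencies,
    or None when the values would be ignored (any missing or non-positive)."""
    freqs = {"A": a, "T": t, "C": c, "G": g}
    if any(v is None or v <= 0 for v in freqs.values()):
        return None  # the note says such input is ignored
    total = sum(freqs.values())
    # Rescale so that the four frequencies sum to 1.
    return {base: v / total for base, v in freqs.items()}


def estimate_background(sequences):
    """Assumed fallback: estimate frequencies by counting bases in the input.
    How GAME actually estimates the background is not documented here."""
    counts = {base: 0 for base in "ATCG"}
    for seq in sequences:
        for ch in seq.upper():
            if ch in counts:  # skip symbols such as 'N' or '#'
                counts[ch] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no A/T/C/G characters found")
    return {base: n / total for base, n in counts.items()}


if __name__ == "__main__":
    # Values that do not sum to 1 are rescaled.
    print(normalize_background(a=3, t=3, c=2, g=2))
    # A non-positive value causes all four to be ignored.
    print(normalize_background(a=0.3, t=0.3, c=0.2, g=-0.2))
```

    Under these assumptions, passing -a 3 -t 3 -c 2 -g 2 would be treated the same as -a 0.3 -t 0.3 -c 0.2 -g 0.2.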

  • Outputfile: For each length, GAME searches for and returns the best motif(s) it can find, each with a motif score. Based on a Bayesian model, the motif with the highest motif score is the optimal motif, and its length is the optimal length. For each best motif, the output includes the motif width, motif sites, motif score, relative entropy, and a degenerate representation of the motif (IUPAC codes covering all bases with >= 20% abundance), followed by each motif site with its sequence id and starting position. Here is a sample output file. To visualize a motif as a sequence logo, paste the aligned sequences listed below it into WebLogo.
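
    As an illustration of two of the reported quantities, the following Python sketch derives a degenerate IUPAC representation (keeping bases with >= 20% abundance in each column) and a relative entropy from a set of aligned motif sites. It is a sketch under assumptions, not GAME's implementation: the function names are hypothetical, and the relative entropy formula used here (sum over columns and bases of f * log2(f / background), without pseudocounts) is one common definition that may differ from the one GAME reports.

```python
import math
from collections import Counter

# Standard IUPAC nucleotide codes, keyed by the set of bases they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G",
    frozenset("T"): "T", frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W", frozenset("GT"): "K",
    frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def column_frequencies(sites):
    """Return one {base: frequency} dict per column of the aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all motif sites must have the same width")
    columns = []
    for i in range(width):
        counts = Counter(site[i].upper() for site in sites)
        columns.append({base: counts[base] / len(sites) for base in "ACGT"})
    return columns


def degenerate_consensus(sites, threshold=0.20):
    """IUPAC string that keeps every base with abundance >= threshold."""
    letters = []
    for column in column_frequencies(sites):
        kept = frozenset(b for b, f in column.items() if f >= threshold)
        letters.append(IUPAC.get(kept, "N"))  # "N" if no base qualifies
    return "".join(letters)


def relative_entropy(sites, background):
    """Assumed definition: sum of f * log2(f / background) over all columns."""
    total = 0.0
    for column in column_frequencies(sites):
        for base, f in column.items():
            if f > 0:
                total += f * math.log2(f / background[base])
    return total


if __name__ == "__main__":
    sites = ["ACGT", "ACGA", "ATGT", "ACGT", "GCGT"]
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(degenerate_consensus(sites))
    print(relative_entropy(sites, uniform))
```

    The aligned sites in the usage example are made up for illustration and do not come from the sample output file.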
  • Example: You have collected a set of co-expressed genes and suspect that they are co-regulated by 2 transcription factors (TFs), i.e., the upstream sequences of these genes may all contain the binding sites of both TFs. You collect the upstream sequences of these genes and store them in the file SampleInput. From your biological domain knowledge, the width of the binding sites is very likely 16. You are not 100% sure of this, or you think the width varies between genes, but you are certain that it is at least 13 and at most 20. You want to run GAME 10 times and store the output in the file SampleOutput. So you run GAME as:
    java -jar motif.jar -i SampleInput -minw 13 -expw 16 -maxw 20 -o SampleOutput -n 2 -run 10
    Another example:
    java -jar motif.jar -i SampleInput -minw 13 -expw 16 -maxw 20 -o SampleOutput -n 2 -run 10 -d 1 -v 1 -x 2 -m N -a 0.3 -t 0.3 -c 0.2 -g 0.2
    This searches only the forward strand, reports the search results for all lengths, allows a 2 bp overlap between two different motif sites, prohibits 'N' in the motif sites, and specifies the background A, T, C, G frequencies.
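
    When GAME is run from a script, the command line from the example can be assembled programmatically. The Python sketch below uses only the flags documented above. The helper names build_game_command and run_game are hypothetical, and the check that minw <= expw <= maxw is an assumed sanity check, not a documented requirement.

```python
import subprocess


def build_game_command(input_file, min_w, exp_w, max_w, output_file,
                       jar="motif.jar", **options):
    """Assemble the argument list for one GAME invocation.

    Keyword options map directly to flags, e.g. n=2 becomes "-n 2".
    """
    # Assumed sanity check; the documentation does not state this rule.
    if not (min_w <= exp_w <= max_w):
        raise ValueError("expected minw <= expw <= maxw")
    cmd = [
        "java", "-jar", jar,
        "-i", input_file,
        "-minw", str(min_w),
        "-expw", str(exp_w),
        "-maxw", str(max_w),
        "-o", output_file,
    ]
    for flag, value in options.items():
        cmd += [f"-{flag}", str(value)]
    return cmd


def run_game(cmd):
    """Run GAME; requires Java and motif.jar to be available locally."""
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmd = build_game_command("SampleInput", 13, 16, 20, "SampleOutput",
                             n=2, run=10)
    # Print the command instead of running it, so the sketch works
    # on machines where GAME is not installed.
    print(" ".join(cmd))
```

    Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.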

Reference


Contact

Email: zhiwei@njit.edu and stjensen@wharton.upenn.edu

Last updated 6/9/2008