SDISCOVER: finding active motifs in a set of protein or DNA
sequences
Jason Wang
Department of Computer Science
New Jersey Institute of Technology
wangj@njit.edu
Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
Gung-Wei Chirn
Novartis Pharmaceuticals
Introduction
We describe a method for discovering active
(or frequently occurring) motifs
in a set of protein or DNA sequences.
SDISCOVERY takes a set of protein or DNA sequences and produces a
collection of active motifs in the set.
A second program, SSORT, is also described. It
sorts the output from SDISCOVERY by motif length
and deletes every substring motif that has the same occurrence number
as its superstring motif.
Installation
The programs are written in the C programming language.
They run on a Sun SPARC workstation under the Sun operating system
(SunOS) version 4.1.2.
In the links below,
we have posted the source code of the software we developed
and the steps to compile and run it.
- Click here to download sdiscovery.c -- the
software.
- Here is the sample input file, in FASTA format,
used as input to sdiscovery:
SAMPLE.
By default, the result is written to the file data.out.
- Compile the sdiscovery.c file with the following command.
gcc -o sdiscovery sdiscovery.c -lm
- You can run the program by typing
sdiscovery
- Click here to download ssort.c program.
- Compile the ssort.c file with the following command.
gcc -o ssort ssort.c -lgen
- Suppose the output of sdiscovery is in data.out. Type
ssort data.out > sorted.output
Motifs in sorted.output are sorted by length,
and a substring motif is eliminated
if it has the same occurrence number as
its superstring motif.
Input
Input file format: FASTA format; see file SAMPLE.
Note the following items concerning sequences.
- Each new sequence is started with a ">" on a new line
in column one.
- Text appearing after a ">" in column one
is considered a sequence name and is disregarded in
the discovery process.
- In each sequence,
there is one space after every 10 characters for readability.
- The process to discover active motifs may take a while.
Please be patient.
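The input rules above can be sketched in C as follows. This is a minimal illustration, not code taken from sdiscovery.c: the function name parse_fasta and its in-memory interface are hypothetical, and only the limits (200 sequences, 5000 characters) come from the sdiscovery prompt. It treats every line that starts with ">" as a sequence name to be skipped, and drops the readability spaces and line breaks inside each sequence.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define MAX_SEQS 200    /* limit quoted by the sdiscovery prompt */
#define MAX_LEN  5000   /* limit quoted by the sdiscovery prompt */

/* Hypothetical helper (not from sdiscovery.c): parse FASTA text held in
 * memory. A '>' in column one starts a new sequence; the rest of that
 * line is a name and is ignored. Whitespace inside sequence lines is
 * dropped. Returns the number of sequences stored, or -1 if max_seqs or
 * MAX_LEN would be exceeded. */
int parse_fasta(const char *text, char seqs[][MAX_LEN + 1], int max_seqs)
{
    int n = 0;      /* number of sequences started so far */
    int len = 0;    /* length of the current sequence */
    const char *p = text;

    assert(text != NULL && seqs != NULL);

    while (*p != '\0') {
        /* p always points at the first character of a line here */
        if (*p == '>') {
            if (n == max_seqs)
                return -1;
            seqs[n][0] = '\0';
            n++;
            len = 0;
            while (*p != '\0' && *p != '\n')    /* skip the name line */
                p++;
        } else {
            for (; *p != '\0' && *p != '\n'; p++) {
                /* ignore spaces, and any text before the first '>' */
                if (n == 0 || isspace((unsigned char)*p))
                    continue;
                if (len == MAX_LEN)
                    return -1;
                seqs[n - 1][len++] = *p;
                seqs[n - 1][len] = '\0';
            }
        }
        if (*p == '\n')
            p++;
    }
    return n;
}
```

A caller would declare a buffer such as static char seqs[MAX_SEQS][MAX_LEN + 1], load the file contents into a string, and pass both to parse_fasta; the return value corresponds to the "sequences found" count that sdiscovery reports.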
Below is an example of FASTA-format input, taken from the SAMPLE
file.
>FA10_BOVIN COAGULATION FACTOR X PRECURSOR (EC 3.4.21.6) (STUART FACTOR).
MAGLLHLVLL STALGGLLRP AGSVFLPRDQ AHRVLQRARR ANSFLEEVKQ
GNLERECLEE
PHVTRFKDTY FVTGIVSWGE GCARKGKFGV YTKVSNFLKW IDKIMKARAG
AAGSRGHSEA
PATWTVPPPL PL
>OSTC_HUMAN OSTEOCALCIN PRECURSOR (GAMMA-CARBOXYGLUTAMIC ACID-CONTAINING PRO
MRALTLLALL ALAALCIAGQ AGAKPSGAES
SKGAAFVSKQ EGSEVVKRPR RYLYQWLGAP
VPYPDPLEPR REVCELNPDC DELADHIGFQ
EAYRRFYGPV
>THRB_RAT PROTHROMBIN PRECURSOR (EC 3.4.21.5).
RIGKHSRTRY ERNVEKISML EKIYIHPRYN WRENLDRDIA LLKLKKPVPF SDYIHPVCLP
TDNMFCAGFK VNDTKRGDAC EGDSGGPFVM KSPYNHRWYQ MGIVSWGEGC DRNGKYGFYT
HVFRLKRWMQ KVIDQHR
Output
When you run the sdiscovery program,
you will see the following prompts at the command line.
To accept the default value of a parameter,
press "enter" at its prompt.
% Enter the file name of sequences
(an example file can be found in
file SAMPLE;
maximum number of sequences in the file is 200;
maximum length of sequences is 5000) [SAMPLE]: SAMPLE
===> 3 sequences found in file SAMPLE
% Enter the form of interesting motifs 1 or 2
(1 means *X*; 2 means *X*Y*) [1] ?
% Enter the minimum length of
interesting motifs
(default is 10) [10] ?
% Enter the minimum occurrence number
for interesting motifs
(the occurrence number of an interesting motif
refers to the number of sequences in which
the motif approximately occurs; default is 2) [2] ?
% Enter the number of
mutations allowed in searching
for similar motifs (default is 1; maximum
number is 10) [1] ?
% Where the result should be stored (enter the file
name) [data.out] ?
8 motifs found
350 motifs checked
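The two motif forms offered at the prompt can be illustrated with a short C sketch. It assumes, since this page does not define the notation, that each "*" stands for an arbitrary (possibly empty) run of characters, so *X* means "X occurs somewhere" and *X*Y* means "X occurs and Y occurs somewhere after it". The function names are hypothetical, and the sketch handles exact matches only (zero mutations); it is not the matching code used by sdiscovery.

```c
#include <assert.h>
#include <string.h>

/* Form 1, *X*: segment x occurs somewhere in seq (exact match only). */
int matches_form1(const char *seq, const char *x)
{
    assert(seq != NULL && x != NULL);
    return strstr(seq, x) != NULL;
}

/* Form 2, *X*Y*: x occurs in seq and y occurs after the end of x.
 * Using the leftmost occurrence of x leaves the most room for y, so it
 * is sufficient to test that occurrence alone. */
int matches_form2(const char *seq, const char *x, const char *y)
{
    const char *p;

    assert(seq != NULL && x != NULL && y != NULL);
    p = strstr(seq, x);
    if (p == NULL)
        return 0;
    return strstr(p + strlen(x), y) != NULL;
}
```

Under these assumptions, the fragment FVTGIVSWGEGCARKGKFGV from the SAMPLE file matches *GIVSW*KGKF* but not *KGKF*GIVSW*, because the order of the two segments matters.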
A sample output is shown below (the data.out file)
after using the input file SAMPLE.
Minimum length = 10
Minimum occurrence number = 2
Number of mutations allowed = 1
Total number of sequences = 3
Input file name = SAMPLE
Occurrence number | Motif         |
                  |               |
2                 | *MGIVSWGEGC*  |
2                 | *GIVSWGEGCA*  |
2                 | *GIVSWGEGCAR* |
2                 | *GIVSWGEGCD*  |
2                 | *GIVSWGEGCDR* |
2                 | *TGIVSWGEGC*  |
2                 | *IVSWGEGCAR*  |
2                 | *IVSWGEGCDR*  |
8 motifs found
350 motifs checked
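The occurrence numbers in this output count the sequences in which a motif approximately occurs. The C sketch below shows one way such a count could be computed. It assumes that a "mutation" is a single substitution, insertion, or deletion; this page does not define the term, so that is an assumption, and the functions best_distance and occurrence_number are hypothetical names, not routines from sdiscovery.c. The distance is computed with the standard approximate substring matching recurrence (edit distance with a free starting position in the sequence).

```c
#include <assert.h>
#include <string.h>

#define MAX_MOTIF 64    /* hypothetical bound on motif length for this sketch */

/* Smallest edit distance between motif and any substring of seq.
 * col[i] holds the distance between the first i motif characters and the
 * best substring ending at the current position of seq; col[0] is always
 * 0, which lets a match start anywhere in seq. */
int best_distance(const char *motif, const char *seq)
{
    int m = (int)strlen(motif);
    int col[MAX_MOTIF + 1];
    int i, best;
    const char *s;

    assert(m <= MAX_MOTIF);
    for (i = 0; i <= m; i++)
        col[i] = i;                 /* motif prefix against empty text */
    best = col[m];

    for (s = seq; *s != '\0'; s++) {
        int diag = col[0];          /* value for (i-1, previous position) */
        col[0] = 0;
        for (i = 1; i <= m; i++) {
            int prev = col[i];                      /* (i, previous position) */
            int v = diag + (motif[i - 1] != *s);    /* match or substitution */
            if (col[i - 1] + 1 < v)
                v = col[i - 1] + 1;                 /* motif char unmatched */
            if (prev + 1 < v)
                v = prev + 1;                       /* extra char in seq */
            diag = prev;
            col[i] = v;
        }
        if (col[m] < best)
            best = col[m];
    }
    return best;
}

/* Number of sequences in which motif occurs within `mutations` edits. */
int occurrence_number(const char *motif, const char *const seqs[],
                      int nseqs, int mutations)
{
    int i, count = 0;

    for (i = 0; i < nseqs; i++)
        if (best_distance(motif, seqs[i]) <= mutations)
            count++;
    return count;
}
```

For example, GIVSWGEGCAR occurs exactly in FA10_BOVIN and with one substitution (A replaced by D) in THRB_RAT, so under these assumptions its occurrence number with one mutation allowed is 2, in agreement with the table.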
After sdiscovery has written its output to data.out,
we sort the result with the ssort program.
Running ssort data.out > sorted.output
puts the sorted output in sorted.output.
Below is the sorted output.
Minimum length = 10
Minimum occurrence number = 2
Number of mutations allowed = 1
Total number of sequences = 3
Input file name = SAMPLE
Occurrence number | Motif         |
                  |               |
2                 | *GIVSWGEGCDR* |
2                 | *GIVSWGEGCAR* |
2                 | *TGIVSWGEGC*  |
2                 | *MGIVSWGEGC*  |
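The reduction from eight motifs to four can be reproduced with a short C sketch of the sort-and-prune step. This is an illustration under stated assumptions, not the code of ssort.c: the Motif struct and the function sort_and_prune are hypothetical, motifs are stored without the surrounding "*", and the order among motifs of equal length is left unspecified because qsort is not a stable sort.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record for one line of sdiscovery output. */
typedef struct {
    const char *text;   /* motif without the surrounding '*' */
    int occurrence;     /* occurrence number */
} Motif;

/* qsort comparator: longer motifs first. */
static int by_length_desc(const void *a, const void *b)
{
    size_t la = strlen(((const Motif *)a)->text);
    size_t lb = strlen(((const Motif *)b)->text);
    return (la < lb) - (la > lb);
}

/* Sort motifs longest first, then compact the array in place, dropping
 * each motif that is a substring of a longer kept motif with the same
 * occurrence number. Returns the number of motifs kept. */
int sort_and_prune(Motif *motifs, int n)
{
    int i, j, kept = 0;

    if (n <= 0)
        return 0;
    assert(motifs != NULL);
    qsort(motifs, (size_t)n, sizeof motifs[0], by_length_desc);

    for (i = 0; i < n; i++) {
        int redundant = 0;
        for (j = 0; j < kept && !redundant; j++) {
            if (motifs[j].occurrence == motifs[i].occurrence &&
                strlen(motifs[j].text) > strlen(motifs[i].text) &&
                strstr(motifs[j].text, motifs[i].text) != NULL)
                redundant = 1;
        }
        if (!redundant)
            motifs[kept++] = motifs[i];
    }
    return kept;
}
```

Checking each motif only against the motifs already kept is enough here: if a motif was dropped in favour of a longer one with the same occurrence number, any substring of the dropped motif is also a substring of that longer motif. Applied to the eight motifs in data.out, the sketch keeps the two motifs of length 11 followed by TGIVSWGEGC and MGIVSWGEGC, the same four motifs listed in sorted.output.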
Citation
Jason T. L. Wang, Thomas G. Marr, Dennis Shasha,
Bruce A. Shapiro and Gung-Wei Chirn,
"Discovering Active Motifs in Sets of Related
Protein Sequences and Using Them for Classification,"
Nucleic Acids Research, Vol. 22, No. 14, Aug. 1994, pp. 2769-2775.
Download Issues
Some browsers open
the PDF file,
the Web page manuals, and the programs
in the browser window instead of starting a download.
If this happens,
try right-clicking on the link and choosing
an option named "Save Target As..." or similar.
If a separate window pops up, click "File" on the top menu bar
of the window and then click "Save As" to save the file.