SDISCOVERY

SDISCOVERY: Finding Active Motifs in a Set of Protein or DNA Sequences

Jason Wang
Department of Computer Science
New Jersey Institute of Technology
wangj@njit.edu

Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University

Gung-Wei Chirn
Novartis Pharmaceuticals

Introduction

We describe a method for discovering active (that is, frequently occurring) motifs in a set of protein or DNA sequences. SDISCOVERY takes such a set as input and produces the collection of active motifs found in it. A companion program, SSORT, sorts the output of SDISCOVERY by motif length and deletes every substring motif that has the same occurrence number as its superstring motif.

Installation

The programs are written in the C programming language. They run on a Sun SPARC workstation under the Sun operating system (SunOS) version 4.1.2.
The links below provide the source code of the software we developed and the steps to compile and run it.

Click here to download sdiscovery.c, the source code of the program.
Here is the input file, in FASTA format, used as input to sdiscovery: SAMPLE.
The result is written to the file data.out.
Compile sdiscovery.c with the following command.
gcc -o sdiscovery sdiscovery.c -lm
You can run the program by typing
sdiscovery
Click here to download ssort.c program.
Compile the ssort.c file with the following command.
gcc -o ssort ssort.c -lgen
Suppose the output of sdiscovery is in data.out. Type
ssort data.out > sorted.output
The motifs in sorted.output are sorted by length, and a substring motif is eliminated if it has the same occurrence number as its superstring motif.

Input

Input file format: FASTA format; see file SAMPLE.
Note the following items concerning sequences.

Each new sequence starts with a ">" in column one of a new line.
Text appearing after a ">" in column one is treated as the sequence name and is disregarded in the discovery process.
In each sequence, there is one space after every 10 characters for readability.
Discovering active motifs may take a while. Please be patient.
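
The formatting rules above can be illustrated with a short reader. This is a minimal sketch in Python, not the parsing code in sdiscovery.c; the function name read_fasta and the two-sequence example are hypothetical and are not taken from the SAMPLE file.

```python
from typing import Dict, List, Optional


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {name: sequence}, dropping readability spaces."""
    parts: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        if line.startswith(">"):
            # A ">" in column one starts a new sequence; the rest is its name.
            current = line[1:].strip()
            parts[current] = []
        elif current is not None:
            # Remove the spaces inserted after every 10 characters.
            parts[current].append("".join(line.split()))
    return {name: "".join(chunks) for name, chunks in parts.items()}


# Hypothetical input for illustration only (not the SAMPLE file).
example = """>seq1 made-up name
MGIVSWGEGC ARTGIV
>seq2 made-up name
TGIVSWGEGC DR
"""
parsed = read_fasta(example)
print(parsed)
```

The names are kept only as dictionary keys; as noted above, they play no part in the discovery process itself.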

Below is an example of FASTA format input which is in the SAMPLE file.

Output

When you run the sdiscovery program, you will see the following lines at the command prompt. To use all default parameter values, just press "Enter" on the keyboard.

A sample output (the data.out file), produced from the input file SAMPLE, is shown below.

Occurrence number	Motif

2	MGIVSWGEGC
2	GIVSWGEGCA
2	GIVSWGEGCAR
2	GIVSWGEGCD
2	GIVSWGEGCDR
2	TGIVSWGEGC
2	IVSWGEGCAR
2	IVSWGEGCDR
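
To make the two columns concrete, here is a minimal Python sketch of how a table of this shape could be produced. It rests on assumptions: the occurrence number is taken to be the number of input sequences that contain the motif, and only exact matches are counted. It is a brute-force illustration, not the algorithm implemented in sdiscovery.c; the function count_exact_motifs, its parameters, and the three toy sequences are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def count_exact_motifs(seqs: List[str], min_len: int, max_len: int,
                       min_occur: int) -> Dict[str, int]:
    """Return {motif: number of sequences containing it} for exact substrings."""
    counts: Counter = Counter()
    for seq in seqs:
        seen = set()
        for length in range(min_len, max_len + 1):
            for start in range(len(seq) - length + 1):
                seen.add(seq[start:start + length])
        # Each sequence contributes at most 1 to a motif's occurrence number.
        counts.update(seen)
    return {m: n for m, n in counts.items() if n >= min_occur}


# Toy sequences for illustration only (assumed data, not the SAMPLE file).
toy = ["AAGIVSW", "GIVSWCC", "TTTT"]
motifs = count_exact_motifs(toy, min_len=4, max_len=5, min_occur=2)
for motif, occurrence in sorted(motifs.items()):
    print(f"{occurrence}\t{motif}")
```

Enumerating every substring is only practical for small inputs, but it shows why overlapping motifs such as GIVS, IVSW, and GIVSW all appear with the same occurrence number, just as the overlapping motifs do in the table above.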

After running sdiscovery and obtaining its output in data.out, we sort the result with the ssort program.
Running ssort data.out > sorted.output writes the sorted output to sorted.output. Below is the sorted output.

Occurrence number	Motif

2	GIVSWGEGCDR
2	GIVSWGEGCAR
2	TGIVSWGEGC
2	MGIVSWGEGC
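
The sort-and-eliminate step can be sketched in a few lines of Python. This is not the code in ssort.c: the function name sort_and_prune is hypothetical, and the reverse-alphabetical ordering among motifs of equal length is an assumption chosen only because it reproduces the order shown above. The input list is the data.out sample from this page.

```python
from typing import List, Tuple

Motif = Tuple[int, str]  # (occurrence number, motif)


def sort_and_prune(motifs: List[Motif]) -> List[Motif]:
    """Sort longest first, then drop substrings with an equal occurrence number."""
    # Assumed tie-break: reverse alphabetical order among equal lengths.
    ordered = sorted(motifs, key=lambda m: m[1], reverse=True)
    # Python's sort is stable, so the tie-break order survives this pass.
    ordered.sort(key=lambda m: len(m[1]), reverse=True)
    kept: List[Motif] = []
    for occ, motif in ordered:
        # Longer motifs were handled first, so any superstring is already in kept.
        redundant = any(occ == k_occ and motif in k_motif
                        for k_occ, k_motif in kept)
        if not redundant:
            kept.append((occ, motif))
    return kept


# The eight motifs from the data.out sample on this page.
data_out = [
    (2, "MGIVSWGEGC"), (2, "GIVSWGEGCA"), (2, "GIVSWGEGCAR"),
    (2, "GIVSWGEGCD"), (2, "GIVSWGEGCDR"), (2, "TGIVSWGEGC"),
    (2, "IVSWGEGCAR"), (2, "IVSWGEGCDR"),
]
result = sort_and_prune(data_out)
for occ, motif in result:
    print(f"{occ}\t{motif}")
```

Applied to the eight sample motifs, the sketch keeps the same four motifs listed above: for example, GIVSWGEGCA is dropped because it is a substring of GIVSWGEGCAR and both have occurrence number 2.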

Citation

Jason T. L. Wang, Thomas G. Marr, Dennis Shasha, Bruce A. Shapiro and Gung-Wei Chirn, "Discovering Active Motifs in Sets of Related Protein Sequences and Using Them for Classification," Nucleic Acids Research, Vol. 22, No. 14, Aug. 1994, pp. 2769-2775.

Download Issues

Some browsers open the PDF files, Web page manuals, and programs instead of starting a download. If this happens, try right-clicking the link and choosing "Save Target As..." or a similarly named option. If a separate window pops up, click "File" in the menu bar at the top of the window and then click "Save As" to save the file.
