University Computing Systems


HPC - Sun Grid Engine


Introduction to Sun Grid Engine
Batch and Parallel Queues
Using Sun Grid Engine
Example Scripts
Userstat
Getting Additional Help

About Sun Grid Engine

Sun Grid Engine (SGE) is an advanced job scheduler which schedules jobs to run in a cluster environment. The main purpose of a job scheduler is to utilize system resources in the most efficient way possible.

SGE is written and distributed by Sun Microsystems under the Sun Industry Standards Source License, and is available free of charge.

The home page for SGE is located at: http://gridengine.sunsource.net

Batch and Parallel Queues

Queue setup (hydra) :

node000-node019 batch only jobs
node020-node039 batch or parallel jobs
node040-node065 parallel only jobs

Therefore, batch jobs (those without a request for a parallel environment in the submit script) use nodes 000-019 exclusively and share nodes 020-039 with parallel jobs. Parallel jobs (those that request a parallel environment) use nodes 040-065 exclusively and share nodes 020-039 with batch jobs.
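
The node ranges above can be restated as a small lookup. This is only a sketch: the function name node_job_types is hypothetical and is not part of SGE. It simply encodes the hydra ranges listed above.

```shell
#!/bin/sh
# Hypothetical helper (not an SGE command): report which job types
# a hydra node accepts, based on the node ranges listed above.
node_job_types() {
    # Strip the "node" prefix, then any leading zeros.
    n=${1#node}
    while [ "${#n}" -gt 1 ] && [ "${n#0}" != "$n" ]; do
        n=${n#0}
    done
    if [ "$n" -le 19 ]; then
        echo "batch"
    elif [ "$n" -le 39 ]; then
        echo "batch parallel"
    elif [ "$n" -le 65 ]; then
        echo "parallel"
    else
        echo "unknown"
    fi
}

node_job_types node025
```

For node025 the function reports both job types, since that node sits in the shared range.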


Using Sun Grid Engine

The process of submitting jobs to SGE is done using a script.
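
As an illustration, a batch submit script might look like the sketch below. The job name and the command it runs are placeholders, not taken from the site's own example scripts. The options shown (-N, -S, -cwd, -j) are standard qsub options embedded in the script on lines beginning with "#$".

```shell
#!/bin/bash
# Lines beginning with "#$" are read by qsub as submit options.
# Job name (hypothetical):
#$ -N example_job
# Shell used to run the job:
#$ -S /bin/bash
# Run the job from the directory where qsub was invoked:
#$ -cwd
# Merge standard error into standard output:
#$ -j y

# The actual work; a real job would start its program here.
job_host=$(uname -n)
echo "Job running on $job_host"
```

Because this script requests no parallel environment, it would be treated as a batch job. A parallel job would also request a parallel environment with the -pe option. The environment names are defined per site and are not listed here.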

The "q" commands are located in /opt/sge/bin/<dir>.
cappl : dir = glinux
hydra : dir = lx24-amd64
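
To avoid typing the full path each time, the directory can be added to PATH. The following is a minimal sketch. The helper name sge_bin_dir is hypothetical, and it covers only the two clusters whose directories are listed above.

```shell
#!/bin/sh
# Hypothetical helper: map a cluster name to its SGE binary directory,
# using the two values listed above.
sge_bin_dir() {
    case "$1" in
        cappl) echo "/opt/sge/bin/glinux" ;;
        hydra) echo "/opt/sge/bin/lx24-amd64" ;;
        *)     return 1 ;;
    esac
}

# Put the hydra tools on PATH for the current shell session.
PATH="$(sge_bin_dir hydra):$PATH"
export PATH
```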

After the script has been customized to one's needs, it can be submitted to Grid Engine for execution. The qsub (/opt/sge/bin/lx24-amd64/qsub on hydra) command is used to submit jobs to the job queue :

qsub sge_script

After running the above command, users will see a message similar to the following:

your job 132 ("IMB-MPI1") has been submitted

In the above example, the number "132" represents the SGE job number and "IMB-MPI1" is the name of the job that is being submitted to the queue.
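
When a wrapper script needs the job number for later commands, it can be extracted from that message. The sketch below assumes the message has exactly the form shown above. The helper name job_id_from_message is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: pull the job number out of a qsub message
# of the form: your job 132 ("IMB-MPI1") has been submitted
job_id_from_message() {
    echo "$1" | sed -n 's/^[Yy]our job \([0-9][0-9]*\) .*/\1/p'
}

msg='your job 132 ("IMB-MPI1") has been submitted'
job_id=$(job_id_from_message "$msg")
echo "$job_id"
```

In practice the message would come from qsub itself, for example job_id=$(job_id_from_message "$(qsub sge_script)").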

The qstat (/opt/sge/bin/lx24-amd64/qstat on hydra) command can be used to display information about the job queues and the running jobs.

qstat
________________________________________________________________________________

job-ID  prior name       user         state submit/start at     queue        master  ja-task-ID
--------------------------------------------------------------------------------
   132     0 IMB-MPI1   guest23      r     05/02/2006 09:49:01 cl_name003.q MASTER         
   132     0 IMB-MPI1   guest23      r     05/02/2006 09:49:01 cl_name003.q SLAVE          
   132     0 IMB-MPI1   guest23      r     05/02/2006 09:49:01 cl_name005.q SLAVE          
   132     0 IMB-MPI1   guest23      r     05/02/2006 09:49:01 cl_name007.q SLAVE          
   132     0 IMB-MPI1   guest23      r     05/02/2006 09:49:01 cl_name009.q SLAVE          
   132     0 IMB-MPI1   guest23      r     05/02/2006 09:49:01 cl_name015.q SLAVE          

________________________________________________________________________________

where cl_name is :
cappl : cl_name = cappl
hydra : cl_name = node
kong : cl_name = node

The qstat output above can be hard to read for someone who just wants a quick look at the number of processors a job is running on and how long it has been running. The userstat command can be used instead of qstat. Note that the example below was run on the cluster cappl.njit.edu. On other clusters, such as hydra.njit.edu and kong.njit.edu, the "Host" column shows that cluster's node names, and the totals for CPUs, memory, and so on reflect that cluster. The userstat command displays information about specific jobs:
_________________________________________________________________________

BATCH QUEUE Total Jobs: 1   Active Jobs: 1
Job-ID  Prior   Name            User   State     Submit/Start       CPUs
 132     0    IMB-MPI1     guest23     r    05/02/2006 09:49:01     5 


HOSTS Total Nodes: 17  Down nodes: 0
Host          CPUs     Load     Memory     Memory Use    Swap    Swap Use
cluster         34     0.75      16.8G        1.1G       2.0G      160.0K
      cappl     2     0.01    1010.3M      103.5M       2.0G      160.0K
   cappl000     2     0.00    1010.3M       59.6M       0.0K        0.0K
   cappl001     2     0.00    1010.3M       59.4M       0.0K        0.0K
   cappl002     2     0.00    1010.3M       59.4M       0.0K        0.0K
   cappl003     2     0.11    1010.3M       71.2M       0.0K        0.0K
   cappl004     2     0.00    1010.3M       59.1M       0.0K        0.0K
   cappl005     2     0.14    1010.3M       70.6M       0.0K        0.0K
   cappl006     2     0.00    1010.3M       59.6M       0.0K        0.0K
   cappl007     2     0.13    1010.3M       70.6M       0.0K        0.0K
   cappl008     2     0.00    1010.3M       59.6M       0.0K        0.0K
   cappl009     2     0.09    1010.3M       70.2M       0.0K        0.0K
   cappl010     2     0.00    1010.3M       59.2M       0.0K        0.0K
________________________________________________________________________

For additional information on qstat and userstat, see their corresponding man pages.
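
Where only qstat is at hand, a per-job count can be derived from its listing. The sketch below counts the SLAVE rows for a job, which matches the CPUs value that userstat reports for job 132 in the examples above. Whether that holds in general is an assumption. The helper name count_job_slots is hypothetical, and the sample rows use the hydra node naming.

```shell
#!/bin/sh
# Hypothetical helper: count the SLAVE rows that belong to a job.
# Reads qstat-style output on stdin; $1 is the job number.
count_job_slots() {
    awk -v id="$1" '$1 == id && $NF == "SLAVE" { n++ } END { print n + 0 }'
}

# Sample rows in the format of the qstat listing shown earlier.
sample='   132     0 IMB-MPI1   guest23      r     05/02/2006 09:49:01 node003.q MASTER
   132     0 IMB-MPI1   guest23      r     05/02/2006 09:49:01 node003.q SLAVE
   132     0 IMB-MPI1   guest23      r     05/02/2006 09:49:01 node005.q SLAVE
   132     0 IMB-MPI1   guest23      r     05/02/2006 09:49:01 node007.q SLAVE
   132     0 IMB-MPI1   guest23      r     05/02/2006 09:49:01 node009.q SLAVE
   132     0 IMB-MPI1   guest23      r     05/02/2006 09:49:01 node015.q SLAVE'

rows=$(printf '%s\n' "$sample" | count_job_slots 132)
echo "$rows"
```

On a cluster the same function would read live output, as in qstat | count_job_slots 132.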

To remove a job from the queue, whether it is waiting or already running, use the qdel command with the job number.

qdel 132

The above command will print a message similar to the following:

guest23 has registered the job 132 for deletion

After running qdel, the job will no longer appear in the queue, since it has been removed.
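
A script can confirm the removal by checking whether the job number still appears in the qstat listing. This is a minimal sketch. The helper name job_in_queue is hypothetical, and the sample row stands in for real qstat output.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given job number appears in
# qstat-style output read from stdin, fail otherwise.
job_in_queue() {
    awk -v id="$1" '$1 == id { found = 1 } END { exit !found }'
}

# Sample row in the format of the qstat listing shown earlier.
listing='   132     0 IMB-MPI1   guest23      r     05/02/2006 09:49:01 node003.q MASTER'

if printf '%s\n' "$listing" | job_in_queue 132; then
    status="still queued"
else
    status="removed"
fi
echo "$status"
```

Against a live queue the check would be written as qstat | job_in_queue 132.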

Example Scripts

SGE Info, including example scripts

To view the URL above on hydra :

lynx /usr/share/doc/bsc-doc-1.0/sge/SGE.html

The page above gives example scripts for single jobs and parallel jobs.

Userstat

In general, if you cannot delete your job(s), it means there is a node error. One way to see this is to use the "userstat" ( /usr/userstat/bin/userstat ) command.

If you start userstat, you will see all your SGE jobs. The first indication that something is wrong is that userstat reports a down node (SGE has lost contact with it). If you move the cursor to the job number (down arrow) and press <Return>, you will see only the nodes being used by your jobs. Next, enter "n" to move to the lower nodes window. Scrolling down will show nodes that are down.

You can enter "h" to get a userstat help screen, or "man userstat".

Getting Additional Help

Additional help on using SGE can be found on the Web and in the manual pages for SGE. The following links may also be useful :

Grid Engine HOWTOs
Grid Engine Documents
Online Manual Pages