Office of Information Technologies

 

Using the HPCC Batch System

The HPCC environment consists of 140 processors in 44 systems using 3 architectures (SGI, Sun, IBM Linux). It is a mixture of Symmetric Multi-Processor (SMP) systems and distributed-memory Beowulf-type clusters.

The HPCC uses the Sun Grid Engine (SGE) Enterprise Edition batch queuing software on its SGI, Sun, and IBM Red Hat Linux environments. For more details see http://wwws.sun.com/software/gridware/sgeee53/index.html

Access to the SGE Software
Configuring an Account in SGE
Submitting a Job to SGE
Options to SGE
Some Examples of Submitting a Batch Job
Parallel SMP Example
Note on Running Parallel Gaussian Jobs
File Staging/Specific File Movement to Local Machine
Monitoring Batch Jobs
Canceling Batch Jobs
Parallel Jobs
Interactive Jobs
GUI Interface to SGE
Additional Documentation

Access to the SGE Software

All the SGE commands and environment setup are contained in the sge module. This module is loaded by default on all HPCC SGI, Sun & Linux machines.

Configuring an Account in SGE

SGE (Sun Grid Engine) allows for transparent use of the AFS filesystem used extensively at Notre Dame. SGE makes a copy of your token, and that copy is used for authentication on the machine on which your job runs. The token lifetime used for all batch jobs has been set to 720 hours, or 30 days. It is expected that no batch job will have a wall clock time longer than 720 hours.

If your account's startup scripts contain commands that are not appropriate for batch work (configuring the prompt, setting the delete character, etc.), you should add the line

if ( $?ENVIRONMENT != 0 ) exit 0

to your .cshrc file near the top, after the "source /usr/local/Startup/Cshrc" line. SGE sets the ENVIRONMENT variable in batch jobs, so this line causes the rest of the startup file to be skipped there. For example, your .cshrc might look like this:


# Source the System Cshrc
# DO NOT REMOVE THIS LINE
#
if ( -r /usr/local/Startup/Cshrc ) then
source /usr/local/Startup/Cshrc
endif

if ( $?ENVIRONMENT != 0 ) exit 0
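
If you would rather skip the rest of the file only for batch jobs, and not whenever ENVIRONMENT happens to be set for some other reason, a stricter form of the guard can be sketched as follows. This is a variant, not the line recommended above, and it assumes that SGE sets ENVIRONMENT to the value BATCH inside batch jobs. The prompt setting at the end is only a placeholder for your own interactive settings.

```csh
# Stricter variant of the guard (a sketch, not the recommended line above).
# Assumption: SGE sets ENVIRONMENT to BATCH inside batch jobs.
# The test is nested so that $ENVIRONMENT is only expanded when it is set.
if ( $?ENVIRONMENT ) then
    if ( "$ENVIRONMENT" == "BATCH" ) exit 0
endif

# Interactive-only settings go below this point, for example:
set prompt = "% "
```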

 

Submitting a Job to SGE

You submit jobs to the batch system with the qsub command. This is done by writing a script in the C Shell scripting language and then submitting this script to SGE. Such "batch scripts" usually contain two types of commands: those that specify options to SGE and those that execute the Unix commands necessary to run your job.

Note: Bourne Shell and Korn Shell scripts may work, but it is recommended that new scripts be written in Csh, since it is the same as your interactive environment. Csh scripts must be used when submitting batch jobs to the Linux environment.

Options to SGE

The following options can be given to qsub on the command line, or preceded with #$ in batch scripts. If an option is specified on both the command line and as a directive in the batch script, the command line value takes precedence. Additional information on qsub can be found by typing "man qsub".

  • -M afsid@nd.edu Specify an address where SGE should send email about your job.
  • -m abe Tell SGE to send email to the specified address if the job is aborted, begins, or ends.
  • -pe smp ## Tell SGE how many CPUs your job will need (see the section on Parallel Jobs below). The number of CPUs is required for parallel jobs, and jobs that use more CPUs than they request will be killed. The default is one CPU, so for a serial job simply omit this option. The maximum number of CPUs that can be requested during normal use is 16 (see the section on dedicated jobs below if you need more than 16 CPUs).

Note: Jobs requesting a large number of CPUs might spend a long time waiting in the queue; it may be more practical to request fewer CPUs (4-8) and start running sooner than to wait for a large number of processors to become idle.

  • -r y or n Tell SGE if your job is "rerunnable". Rerunnable jobs will be restarted from the beginning if they are killed for some reason. Most jobs are rerunnable, but a few programs like Gaussian and qchem are not and should specify -r n.

  • -l arch=irix6 or solaris64 or glinux Request resources of this architecture type to run your job on. irix6 specifies the SGI architecture, solaris64 the Sun Solaris 64-bit architecture, and glinux the Linux architecture. Note that all OIT machines in the HPCC and in the Sun clusters are running the 64-bit version of the Solaris Operating System. Specifying an architecture is unnecessary if you're running an application which is available on all of the architectures, for example Gaussian. You would, however, want to specify an architecture if, for example, your program was written and compiled on a Sun machine. To run that job you'd need to specify -l arch=solaris64 so that it runs on a machine which uses the Sun architecture.


Note: By default the batch queueing system will run the job on the fastest system which meets the requirements that you specify. Thus if you specify that the job needs to run on the Sun architecture (solaris64), you may end up waiting for jobs to finish on that architecture before your job starts, even though an SGI or Linux system was idle and could have run your job immediately. Only request an architecture type if it is necessary.

Similarly, in general you should not request a specific machine queue (e.g. sun2.q) for your job to run on. Either omit the architecture (if possible) or just specify an architecture type as mentioned above.
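
As a sketch of how the architecture request fits together with the other options, a batch script for a program that was compiled on a Sun machine might look like the one below. The program name and path (~/bin/my_solaris_program) are hypothetical, and afsid@nd.edu stands in for your own address.

```csh
#!/bin/csh
# Sketch only: ~/bin/my_solaris_program is a hypothetical program that
# was compiled on a Sun machine, so the job must run on solaris64.
#$ -l arch=solaris64
#$ -M afsid@nd.edu
#$ -m abe
#$ -r y

~/bin/my_solaris_program
```

Because the script names only an architecture and not a specific queue, SGE is still free to pick any suitable Sun system.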

 

Some Examples of Submitting a Batch Job

Here we're submitting a batch script named gaussian.job, which runs the application Gaussian 98 using the file testDFT.com for input. It is assumed that testDFT.com and gaussian.job are both in the directory from which you submit the job. It is a single-processor job. The job is not rerunnable, and mail will be sent to rich@nd.edu if the job aborts or ends.

The file gaussian.job looks like

#!/bin/csh
g98 < testDFT.com


To run the job on the command line you type

qsub -M rich@nd.edu -m ae -r n gaussian.job


Now here's the same example using directives in the Csh batch script gaussian.job instead of typing them on the command line.

The file gaussian.job now looks like

#!/bin/csh
#$ -M rich@nd.edu
#$ -m ae
#$ -r n

g98 < testDFT.com


Now to run the job on the command line you simply type:


qsub gaussian.job
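
As noted under Options to SGE, an option given on the command line takes precedence over the same option given as a directive in the script. A minimal illustration, assuming gaussian.job still contains the "#$ -m ae" directive shown above:

```csh
# The command-line -m abe overrides the script's "#$ -m ae" directive,
# so mail is also sent when the job begins. The -M and -r directives
# in the script still apply, since they are not given here.
qsub -m abe gaussian.job
```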

  
                      

Parallel SMP Example

In this example we're again submitting a batch script named gaussian.job to SGE to run a Gaussian job. However this time we're requesting that the job use four CPUs. The job is not rerunnable and mail will be sent to rich@nd.edu if the job aborts or ends.

Note: You must also tell Gaussian the number of processors to use. For example, at the top of the Gaussian input file testDFT.com you would have the line %NProc=4, which should match the number of processors you specify with -pe smp ##.

The file gaussian.job now looks like

#!/bin/csh
#$ -pe smp 4
#$ -M rich@nd.edu
#$ -m ae
#$ -r n

g98 < testDFT.com

Now to run the job on the command line you simply type

qsub gaussian.job

 

Monitoring Batch Jobs

Jobs can be monitored using the qstat command. Some useful forms:

qstat without arguments will print the status of all jobs in the queue. The output shows the following:

  • The job ID number
  • Priority of job
  • Name of job
  • ID of user who submitted job
  • State of the job: States can be
    • t(ransferring)
    • r(unning)
  • Submit or start time and date of the job
  • If running - the queue in which the job is running
  • The function of the running job (MASTER or SLAVE)
  • The job array task ID

qstat -f provides a full listing that shows summary information for each queue along with the jobs in it. For each queue the information printed consists of:

  • the queue name
  • the queue type: Types or combinations of types can be
    • B(atch)
    • P(arallel)
  • The number of used and available job slots
  • The load average on the queue host
  • The architecture of the queue host
  • The state of the queue - Queue states or combinations of states can be
    • a(larm)
    • A(larm)
    • s(uspended)
    • d(isable)
    • D(isable)
    • E(rror)

qstat -j [job_list] prints the reason for not being scheduled, either for all pending jobs or for the jobs contained in job_list. Additional information can be obtained by typing "man qstat".
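
Put together, a typical monitoring session might use the three forms as sketched below. The job ID 1234 is hypothetical; use the ID that qsub reported when you submitted your job.

```csh
# 1234 is a hypothetical job ID, shown for illustration only.
qstat            # status of all jobs in the queue
qstat -f         # full listing, queue by queue
qstat -j 1234    # why job 1234 has not been scheduled yet
```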

Canceling Batch Jobs

Jobs can be cancelled or killed using the qdel command. The most common form is `qdel JobID` which kills the job that matches the Job ID.

Parallel Jobs

Using the -pe smp directive as detailed above is equivalent to asking permission to use multiple CPUs in your batch job. To actually run on multiple CPUs, you also need to tell the application itself how many CPUs to use. Some common methods of doing this are outlined below:

  • MPI programs use the -np option to the mpirun command to specify the number of CPUs to run on. The MPI programs are part of the mpt module.
  • OpenMP programs use the MP_SET_NUMTHREADS environment variable to control the number of CPUs the program runs on.
  • Other programs may have specific flags or methods of starting parallel execution (for instance the %NProc= "link 0 command" in Gaussian 98).
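
The first two methods can be sketched in a batch script as follows. The program names and paths (~/bin/my_mpi_program and ~/bin/my_openmp_program) are hypothetical, and afsid@nd.edu stands in for your own address. The point is only that the CPU count given to the application matches the count requested from SGE.

```csh
#!/bin/csh
# Sketch only: the program names below are hypothetical.
#$ -pe smp 4
#$ -M afsid@nd.edu
#$ -m ae

# MPI: pass mpirun the same CPU count that was requested with -pe smp.
mpirun -np 4 ~/bin/my_mpi_program

# OpenMP alternative: set the thread count instead of calling mpirun.
# setenv MP_SET_NUMTHREADS 4
# ~/bin/my_openmp_program
```

If you change the number after -pe smp, remember to change the -np value (or the MP_SET_NUMTHREADS value) to match, since jobs that use more CPUs than they request will be killed.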

Interactive Jobs

Currently interactive support to the HPCC compute nodes is not configured or planned.

A common use of interactive jobs was to look at the output of a running batch job. For an alternate way of doing this, see Synchronizing Open Files in AFS.

GUI Interface to SGE

Some users find it easier to work with a graphical user interface than with the command line programs listed above. SGE provides an extensive GUI tool, qmon, which supports all of the operations listed above through a series of menus and windows. To start it, type qmon on an SGE submit host.

Additional Documentation

Most users will probably be able to create and run batch jobs with the information provided on this web page. For those needing in-depth knowledge, the complete documentation is provided in PDF format in the following manuals.

Sun Grid Engine 5.3 and Sun Grid Engine, Enterprise Edition 5.3 Reference Manual

Sun Grid Engine, Enterprise Edition 5.3 Administration and User's Guide

page modified 12/12/02

 

 

 

 

   

Copyright © 2003, Office of Information Technologies (OIT),
P.O. Box 539, University of Notre Dame, Notre Dame, IN 46556

Page last modified on January 8, 2003