Using the HPCC Batch System
The HPCC environment consists of 140 processors in 44 systems
using 3 architectures (SGI, Sun, IBM Linux). This environment
is a mixture of Symmetric Multi-Processor (SMP) systems and
distributed-memory Beowulf-type clusters.
The HPCC uses the Sun Grid Engine (SGE) Enterprise Edition
batch queuing software on its SGI, Sun, and IBM Red Hat Linux
systems. For more details see http://wwws.sun.com/software/gridware/sgeee53/index.html
Access to the SGE Software
Configuring an Account in SGE
Submitting a Job to SGE
Options to SGE
Some Examples of Submitting a Batch Job
Parallel SMP Example
Note on Running Parallel Gaussian Jobs
File Staging/Specific File Movement to Local Machine
Monitoring Batch Jobs
Canceling Batch Jobs
Parallel Jobs
Interactive Jobs
GUI Interface to SGE
Additional Documentation
Access to the SGE Software
All the SGE commands and environment setup are contained in the
sge module. This module is loaded by default on all HPCC SGI,
Sun, and Linux machines.
Configuring an Account in SGE
SGE (Sun Grid Engine) allows for transparent use of the AFS
filesystem used extensively at Notre Dame. SGE makes a copy
of your token and uses it for authentication on the machine
on which your job runs. The token lifetime for all batch
jobs has been set to 720 hours (30 days). It is expected
that no batch job will have a wall clock time longer than
720 hours.
If your account's startup scripts contain commands that are not
appropriate for batch work (configuring the prompt, setting the
delete character, etc.), you should add the line
if ( $?ENVIRONMENT != 0 ) exit 0
to your .cshrc file near the top, after the line
"source /usr/local/Startup/Cshrc". SGE sets the ENVIRONMENT
variable in batch jobs, so in a batch job this test causes the
rest of the startup file to be skipped.
For example, your .cshrc might look like this:
#
# Source the System Cshrc
# DO NOT REMOVE THIS LINE
#
if ( -r /usr/local/Startup/Cshrc ) then
source /usr/local/Startup/Cshrc
endif
if ( $?ENVIRONMENT != 0 ) exit 0
Submitting a Job to SGE
You can begin submitting jobs to the batch system using the
qsub command. This is done by writing a script in the C Shell
scripting language and then submitting this script to SGE. Such
"batch scripts" usually contain two types of commands: those
that specify options to SGE and those that execute the Unix
commands necessary to run your job.
Note: While Bourne Shell and Korn Shell scripts may work, it is
recommended that new scripts be written in Csh, since it is the
same as your interactive environment. Csh scripts must be used
for submitting batch jobs to the Linux environment.
Options to SGE
The following options can be given to qsub on the command line,
or preceded with #$ in batch scripts. If an option is specified
both on the command line and as a directive in the batch script,
the command line value takes precedence. Additional information
on qsub can be found by typing "man qsub".
- -M afsid@nd.edu  Specify an address where SGE should send
email about your job.
- -m abe  Tell SGE to send email to the specified address when
the job is aborted (a), begins (b), or ends (e).
- -pe smp ##  Tell SGE how many CPUs your job will need (see
the section on Parallel Jobs below). For parallel jobs the
number of CPUs is required. Jobs that use more CPUs than
they request will be killed. The default is one CPU, so
if you're running a serial job just omit this option. The
maximum number of CPUs that can be requested during normal
use is 16 (see the section on dedicated jobs below if you
need more than 16 CPUs).
Note: Jobs requesting a large number of CPUs might spend a long
time waiting in the queue. It may be more practical to request
fewer CPUs (4-8) and start running sooner than to wait for a
large number of processors to become idle.
- -r y or n  Tell SGE if your job is "rerunnable".
Rerunnable jobs will be restarted from the beginning if
they are killed for some reason. Most jobs are rerunnable,
but a few programs like Gaussian and qchem are not and should
specify -r n.
- -l arch=irix6 or solaris64 or glinux  Request resources of
this architecture type to run your job on. irix6 specifies the
SGI architecture, solaris64 the Sun Solaris 64-bit architecture,
and glinux the Linux architecture. Note that all OIT machines in
the HPCC and in the Sun clusters are running the 64-bit version
of the Solaris Operating System. Specifying an architecture is
unnecessary if you're running an application which is available
on all of the architectures, for example Gaussian. You would,
however, want to specify an architecture if, for example, your
program was written and compiled on a Sun machine. To run that
job you'd need to specify -l arch=solaris64 so that it runs on
a machine which uses the Sun architecture.
Note: By default the batch queueing system will run the job on
the fastest system which meets the requirements that you specify.
Thus if you specify that the job needs to run on the Sun
architecture (solaris64), you may end up waiting for jobs to
finish on that architecture before your job starts, even though
an SGI or Linux system was idle and could have run your job
immediately. Only request an architecture type if it is
necessary.
Similarly, in general you should not specify a particular
machine queue (e.g. sun2.q) for your job to run on. Either
omit the architecture (if possible) or just specify an
architecture type as mentioned above.
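As an illustration of the architecture request, here is a minimal sketch of a batch script for a program compiled on a Sun machine. The program name my_sun_program and its location in your home directory are hypothetical; the directives are the ones described above.

```csh
#!/bin/csh
# Request a machine of the Sun architecture (only because the
# hypothetical program below was compiled on a Sun machine).
#$ -l arch=solaris64
# Send mail to this address if the job aborts or ends.
#$ -M afsid@nd.edu
#$ -m ae
# The job may be restarted from the beginning if it is killed.
#$ -r y
# my_sun_program is a hypothetical executable name.
$HOME/my_sun_program
```

Leaving out the -l arch line lets the batch system pick any idle machine, which is preferable whenever the program is available on every architecture.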
Some Examples of Submitting a Batch Job
Here we're submitting a batch script named gaussian.job which
runs the application Gaussian 98, using the file testDFT.com for
input. It is assumed that the file testDFT.com and the batch
script gaussian.job are in the directory from which you submit
this job. It is a single processor job. The job is not rerunnable,
and mail will be sent to rich@nd.edu if the job aborts or ends.
The file gaussian.job looks like:
#!/bin/csh
g98 < testDFT.com
To run the job on the command line you type
qsub -M rich@nd.edu -m ae -r n gaussian.job
Now here's the same example using directives in a Csh batch
script gaussian.job instead of typing them on the command line.
The file gaussian.job now looks like:
#!/bin/csh
#$ -M rich@nd.edu
#$ -m ae
#$ -r n
g98 < testDFT.com
Now to run the job on the command line you simply type:
qsub gaussian.job
Parallel SMP Example
In this example we're again submitting a batch script named
gaussian.job to SGE to run a Gaussian job. However, this time
we're requesting that the job use four CPUs. The job is not
rerunnable, and mail will be sent to rich@nd.edu if the job
aborts or ends.
Note: You must also tell Gaussian the number of processors to
use. For example, at the top of the Gaussian input file
testDFT.com you would have the line
%NProc=4
This should match the number of processors which you specify
with -pe smp ##.
The file gaussian.job now looks like:
#!/bin/csh
#$ -pe smp 4
#$ -M rich@nd.edu
#$ -m ae
#$ -r n
g98 < testDFT.com
Now to run the job on the command line you simply type
qsub gaussian.job
Monitoring Batch Jobs
Jobs can be monitored using the qstat command. Some useful forms:
qstat without arguments will print the status of all jobs in the
queue. The output shows the following:
- The job ID number
- Priority of job
- Name of job
- ID of user who submitted job
- State of the job
- Submit or start time and date of the job
- If running, the queue in which the job is running
- The function of the running job (MASTER or SLAVE)
- The job array task ID
qstat -f Job ID provides a full listing of the job that has the
listed Job ID (or all jobs if no Job ID is given). For each queue
the information printed consists of:
- The queue name
- The queue type
- The number of used and available job slots
- The load average on the queue host
- The architecture of the queue host
- The state of the queue. Queue states or combinations of states
can be:
  - a(larm)
  - A(larm)
  - s(uspended)
  - d(isabled)
  - D(isabled)
  - E(rror)
qstat -j [job_list] prints the reason for not being scheduled,
either for all pending jobs or for the jobs contained in job_list.
Additional information can be obtained by typing "man qstat".
Canceling Batch Jobs
Jobs can be cancelled or killed using the qdel command. The most
common form is `qdel JobID`, which kills the job that matches the
Job ID.
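A typical submit, monitor, and cancel session might look like the following sketch. The job ID 1234 is hypothetical; use the ID that qsub reports when the job is submitted.

```csh
# Submit the job; qsub reports the job ID that was assigned.
qsub gaussian.job
# List all jobs in the queue and their states.
qstat
# Ask why a job (hypothetical ID 1234) has not been scheduled yet.
qstat -j 1234
# Cancel the job if it is no longer needed.
qdel 1234
```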
Parallel Jobs
Using the -pe smp directive as detailed above is equivalent to
asking permission to use multiple CPUs in your batch job. To
actually run a job on multiple CPUs, you will also need to tell
the application how many CPUs to use. Some common methods of
doing this are outlined below:
- MPI programs use the -np option to the mpirun command to
specify the number of CPUs to run on. The MPI programs are part
of the mpt module.
- OpenMP programs use the MP_SET_NUMTHREADS environment variable
to control the number of CPUs the program runs on.
- Other programs may have specific flags or methods of starting
parallel execution (for instance the %NProc= "link 0 command"
in Gaussian 98).
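For instance, a batch script for an MPI program on four CPUs could be sketched as follows. The program name my_mpi_program and its location are hypothetical, and the script assumes mpirun is already on your path (through the mpt module mentioned above). The count given to -np matches the count given to -pe smp.

```csh
#!/bin/csh
# Ask SGE for four CPUs.
#$ -pe smp 4
# Send mail to this address if the job aborts or ends.
#$ -M afsid@nd.edu
#$ -m ae
#$ -r y
# Start four MPI processes, matching the four CPUs requested above.
# my_mpi_program is a hypothetical executable name.
mpirun -np 4 $HOME/my_mpi_program
```

An OpenMP job would follow the same pattern, replacing the mpirun line with a "setenv MP_SET_NUMTHREADS 4" line followed by the program itself.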
Interactive Jobs
Currently, interactive support for the HPCC compute nodes is not
configured or planned.
A common use of interactive jobs was for looking at the output
of a running batch job. For an alternative way of doing this,
see Synchronizing Open Files in AFS.
GUI Interface to SGE
Some users find it easier to work with a graphical user interface
than with the command line programs listed above. SGE provides
an extensive GUI tool, qmon, which supports all of the operations
listed above through a series of menus and windows. To start
qmon, type qmon on an SGE submit host.
Page modified 12/12/02