Using Condor at Notre Dame

These instructions will get you started using Condor at Notre Dame. You can learn a lot more about Condor in general at the Condor web site.

To start, add the Condor tools to your path:

setenv PATH /afs/nd.edu/user37/condor/software/bin:$PATH
setenv PATH /afs/nd.edu/user37/condor/software/sbin:$PATH

Next, log into a machine that has Condor installed. If your machine has a ~condor directory, then Condor is probably running there. If not, you can ask your system administrator to install Condor with these instructions. Alternatively, you can log into any of cclscratch00.cse.nd.edu through cclscratch03.cse.nd.edu to submit jobs. Mail dthain at cse.nd.edu to request a login on those machines.
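
The two checks described so far (are the Condor tools on your path, and does this machine have a ~condor directory) can be scripted. The following is a minimal Python sketch, assuming a Unix-like machine; the helper names missing_condor_tools and condor_home are made up for this illustration and are not part of Condor.

```python
import os
import shutil

# The four Condor commands used in these instructions.
CONDOR_TOOLS = ("condor_status", "condor_submit", "condor_q", "condor_rm")


def missing_condor_tools(tools=CONDOR_TOOLS):
    """Return the commands from `tools` that are not found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def condor_home():
    """Return the ~condor directory if it exists on this machine, else None."""
    expanded = os.path.expanduser("~condor")
    # expanduser returns the string unchanged when there is no "condor" account.
    if expanded == "~condor" or not os.path.isdir(expanded):
        return None
    return expanded


if __name__ == "__main__":
    missing = missing_condor_tools()
    if missing:
        print("Not on PATH:", ", ".join(missing))
    else:
        print("All Condor tools found on PATH")
    print("~condor directory:", condor_home() or "not found")
```

A missing ~condor directory does not prove that Condor is absent; it only suggests that you should use one of the submit machines named above.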

To see the machines available in the ND Condor pool, you can view the Condor status web page, or you can run the condor_status command:

condor_status

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime
 
vm1@hedwig.cs LINUX       INTEL  Owner      Idle       0.220   501  0+00:00:10
vm2@hedwig.cs LINUX       INTEL  Owner      Idle       0.000   501  0+00:00:11
wombat00.csel LINUX       INTEL  Owner      Idle       0.010   121  0+00:00:14
...

To submit a batch job to Condor, you must create a submission file and then run the condor_submit command. Try creating this sample submit file in /tmp/YOURNAME/test.submit. (Make sure that you really do put it in /tmp/YOURNAME/test.submit.)

universe = vanilla
executable = /bin/echo
arguments = hello condor
output = test.output
should_transfer_files = yes
when_to_transfer_output = on_exit
log = test.logfile
queue 
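
If you would rather generate the submit file than type it, a short script can create the directory and write the file in one step. This is only a sketch: the function name write_submit_file is hypothetical, and taking YOURNAME from the USER environment variable is an assumption about your login environment.

```python
import os
from pathlib import Path

# The sample submit file from these instructions, reproduced verbatim.
SUBMIT_TEXT = """\
universe = vanilla
executable = /bin/echo
arguments = hello condor
output = test.output
should_transfer_files = yes
when_to_transfer_output = on_exit
log = test.logfile
queue
"""


def write_submit_file(directory, name="test.submit"):
    """Create `directory` if needed, write the sample submit file, return its path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / name
    path.write_text(SUBMIT_TEXT)
    return path


if __name__ == "__main__":
    # Assumption: YOURNAME is your login name, taken here from $USER.
    user = os.environ.get("USER", "YOURNAME")
    print("Wrote", write_submit_file(Path("/tmp") / user))
```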

Now, to submit the job to Condor, execute:

cd /tmp/YOURNAME
condor_submit test.submit

Submitting job(s)...
1 job(s) submitted to cluster 2.
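
The cluster number printed by condor_submit is what you need later for condor_q and condor_rm. If you drive Condor from a script, you may want to pull it out of the output. The sketch below assumes the output looks exactly like the sample above, which may not hold for every Condor version; parse_submit_output is a hypothetical helper, not a Condor API.

```python
import re

# Matches the summary line shown above, e.g. "1 job(s) submitted to cluster 2."
SUBMITTED = re.compile(r"(\d+) job\(s\) submitted to cluster (\d+)\.")


def parse_submit_output(text):
    """Return (number_of_jobs, cluster_id) from condor_submit output, or None."""
    match = SUBMITTED.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = "Submitting job(s)...\n1 job(s) submitted to cluster 2.\n"
jobs, cluster = parse_submit_output(sample)
print(f"{jobs} job(s) in cluster {cluster}; the first job id is {cluster}.0")
```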

Once the job is submitted, you can use condor_q to look at the status of the jobs in your queue. If you run condor_q quickly enough, you will see your job idle:

condor_q

-- Submitter: hedwig.cse.nd.edu : <129.74.154.241:33593> : hedwig.cse.nd.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   2.0   dthain          8/26 17:21   0+00:00:00 I  0   0.0  echo hello condor

If you decide to cancel a job, use condor_rm and the job id:

condor_rm 2.0

Job 2.0 marked for removal.

Note about email: Despite what the Condor manual says, you will not receive email when a job completes. This feature has been disabled at Notre Dame due to our email security configuration.

You will almost certainly want to run many jobs at once via Condor, and you can easily modify your submit file to run a program with tens or hundreds of variations. Change the queue command to queue several jobs at once, and use the $(PROCESS) macro to vary the parameters by job number.

universe = vanilla
executable = /bin/echo
arguments = hello $(PROCESS)
output = test.output.$(PROCESS)
error = test.error.$(PROCESS)
should_transfer_files = yes
when_to_transfer_output = on_exit
log = test.logfile
queue 10
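
To see what this submit file asks for, it can help to expand the macro by hand. The Python sketch below imitates the substitution for the three lines that use $(PROCESS). It is an illustration of the idea, not Condor's own implementation, and it handles only the exact spelling $(PROCESS) used here.

```python
def expand_process(lines, count):
    """Return one dict per job, with $(PROCESS) replaced by the job number."""
    jobs = []
    for process in range(count):  # job numbers start at 0
        settings = {}
        for line in lines:
            key, sep, value = line.partition("=")
            if sep:  # skip lines that are not "key = value"
                settings[key.strip()] = value.strip().replace("$(PROCESS)", str(process))
        jobs.append(settings)
    return jobs


TEMPLATE = [
    "arguments = hello $(PROCESS)",
    "output = test.output.$(PROCESS)",
    "error = test.error.$(PROCESS)",
]

for job in expand_process(TEMPLATE, 3):
    print(job["arguments"], "->", job["output"])
```

With queue 10, the job numbers run from 0 through 9, so the last job runs "echo hello 9" and writes test.output.9.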

Now, when you run condor_submit, you should see something like this:

condor_submit test.submit

Submitting job(s)..........
10 job(s) submitted to cluster 9.

Note in this case that "cluster" means "a bunch of jobs", where each job is named 9.0, 9.1, 9.2, and so forth. In this next example, condor_q shows that cluster 9 is halfway complete, with job 9.5 currently running.

condor_q

-- Submitter: hedwig.cse.nd.edu : <129.74.154.241:33593> : hedwig.cse.nd.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   9.5   dthain          8/26 17:46   0+00:00:01 R  0   0.0  echo hello 5
   9.6   dthain          8/26 17:46   0+00:00:00 I  0   0.0  echo hello 6
   9.7   dthain          8/26 17:46   0+00:00:00 I  0   0.0  echo hello 7
   9.8   dthain          8/26 17:46   0+00:00:00 I  0   0.0  echo hello 8
   9.9   dthain          8/26 17:46   0+00:00:00 I  0   0.0  echo hello 9
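
With hundreds of jobs, reading the queue by eye gets tedious. One option is to tally the ST column. The sketch below works on the listing shown above and assumes that exact column layout (job id first, state in the sixth column); other Condor versions may print the queue differently. The name count_states is made up for this example.

```python
from collections import Counter

# The condor_q listing from these instructions.
SAMPLE = """\
-- Submitter: hedwig.cse.nd.edu : <129.74.154.241:33593> : hedwig.cse.nd.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   9.5   dthain          8/26 17:46   0+00:00:01 R  0   0.0  echo hello 5
   9.6   dthain          8/26 17:46   0+00:00:00 I  0   0.0  echo hello 6
   9.7   dthain          8/26 17:46   0+00:00:00 I  0   0.0  echo hello 7
   9.8   dthain          8/26 17:46   0+00:00:00 I  0   0.0  echo hello 8
   9.9   dthain          8/26 17:46   0+00:00:00 I  0   0.0  echo hello 9
"""


def count_states(condor_q_text):
    """Count jobs per state (R = running, I = idle) in a condor_q listing."""
    counts = Counter()
    for line in condor_q_text.splitlines():
        fields = line.split()
        # Job rows start with an id such as "9.5"; header lines do not.
        if len(fields) >= 6 and fields[0].replace(".", "", 1).isdigit():
            counts[fields[5]] += 1
    return counts


for state, number in sorted(count_states(SAMPLE).items()):
    print(state, number)
```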

Important note about AFS:

In the example above, the submit file and all of the job's details were stored in /tmp/YOURNAME on your local disk. Condor simply moved the necessary files back and forth in order to run your jobs. If instead you store your data files in AFS (i.e. your home directory), Condor cannot access them because it will not have your AFS Kerberos ticket.
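
A quick sanity check before submitting is to ask whether a directory actually lives in AFS. The sketch below assumes AFS is mounted under /afs, as the paths in these instructions suggest; is_in_afs is a hypothetical helper and only inspects the resolved path name.

```python
import os


def is_in_afs(path, afs_root="/afs"):
    """Return True if `path` resolves to a location under the AFS mount point."""
    # realpath follows symlinks, so a home directory linked into AFS is detected.
    resolved = os.path.realpath(os.path.expanduser(path))
    return resolved == afs_root or resolved.startswith(afs_root + os.sep)


for candidate in ("/tmp/YOURNAME", "/afs/nd.edu/user37/condor/software/bin"):
    where = "AFS" if is_in_afs(candidate) else "not AFS"
    print(candidate, "->", where)
```

Calling is_in_afs("~") tells you whether your home directory is affected.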

If you want Condor to be able to read any data out of AFS, you must change the ACLs on the necessary directories to allow any user to read the data. This is fine for non-sensitive data. Here's how:

fs setacl ~/my/data/directory system:anyuser read

If you want Condor to be able to write to AFS, you must change the ACLs to allow any user to write to that directory. This is a security risk and should not be done without careful thought.

There is much more to Condor. Please read the manual to learn more.

Users and administrators of Condor at Notre Dame are encouraged to subscribe to the condor-discuss mailing list to learn more.

Cooperative Computing Lab - CSE Department - Notre Dame