GUIDELINES FOR RUNNING LARGE JOBS ON THE DEPARTMENT OF MATHEMATICS SYSTEMS


1) Run long jobs at lower priority.

    If you are going to run a non-interactive job that takes more than 5
minutes to complete, make sure to "nice" the job.  This lowers the priority
of your job so that people needing interactive access will not notice it as
much.  For example, to run "job" in the background at nice 10 use:
"nice +10 job &".

    Time to job completion    Minimum nice value to use
    **********************    *************************
    5-10 minutes              4
    10-60 minutes             10
    1-2 hours                 16
    2-10 hours                19
    More than 10 hours        19, and send email to "requests@math.toronto.edu"

If you forget to "nice" a job when you start it, you can still use "renice"
once you know the PID (process identification number).  Use "ps -fu $USER" to
get a listing of your processes and their PIDs.  (If you are running a job
with multiple threads then use "ps -fLu $USER".)  Then you could use, for
example, "renice +19 <pid>" where "<pid>" is the PID of each long-running
process or thread.  There are manual pages for these commands: "man renice"
and "man ps" work, but you should use "man tcsh" for information about "nice"
(assuming you are using the default shell, tcsh, that we give to people).
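
A job written in Python can also lower its own priority from inside the
program.  The sketch below is only an illustration, not a tool we provide:
the function names "minimum_nice" and "lower_priority" are hypothetical, and
where the ranges in the table above overlap (exactly 10 minutes, for example)
it assumes the higher nice value is wanted.  "os.nice" is the standard
library call that adds an increment to the calling process's nice value.

```python
import os


def minimum_nice(expected_minutes):
    """Return the minimum nice value from the table above for a job
    expected to run for the given number of minutes.

    Assumption: a time that falls on a boundary between two rows
    (e.g. exactly 10 minutes) gets the higher nice value.
    Jobs over 10 hours also call for an email to requests@math.toronto.edu.
    """
    if expected_minutes < 5:
        return 0
    if expected_minutes < 10:
        return 4
    if expected_minutes < 60:
        return 10
    if expected_minutes < 120:
        return 16
    return 19


def lower_priority(target_nice):
    """Raise this process's nice value to at least target_nice."""
    current = os.nice(0)  # an increment of 0 just reports the current value
    if target_nice > current:
        return os.nice(target_nice - current)
    return current
```

Calling "lower_priority(minimum_nice(90))" at the top of a program expected
to run for about 90 minutes would request nice 16.  An ordinary user
generally cannot lower the nice value again afterwards.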


2) Run jobs consecutively, not concurrently, if possible.

    If you have several commands to run it is usually better to run them
consecutively, not concurrently.  For example, if you had to run "job1",
"job2", and "job3" then, on our compute server, sphere, use:

	sphere% (nice +19 job1 ; nice +19 job2 ; nice +19 job3) &

or, if you are running the commands on coxeter:

	coxeter% (nice +19 job1 ; nice +19 job2 ; nice +19 job3) &

to run the commands consecutively in the background.  The parentheses are
needed so that the "&" applies to the whole sequence rather than to "job3"
alone.  Note that sphere is a twelve-core machine, so at off hours, if not
much else is using the machine, up to ten or eleven jobs could be run
concurrently without interfering (much) with each other or other people.
Some software, such as MATLAB, will automatically use more processors for
some tasks (for example, for matrix inversion).
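
For longer lists of jobs, the same one-after-another pattern can be driven
from a small script.  The following is a minimal sketch under stated
assumptions, not a supported utility: the function name "run_consecutively"
is hypothetical, and each job is assumed to be given as a list of command
line arguments.  It uses only the standard "subprocess" and "os" modules.

```python
import os
import subprocess


def run_consecutively(commands, niceness=19):
    """Run each command to completion before starting the next.

    commands is a list of argument lists, e.g. [["./job1"], ["./job2"]].
    Returns the list of exit codes, in order.
    """
    exit_codes = []
    for argv in commands:
        # preexec_fn runs in the child just before the command starts,
        # so only the child's priority is lowered, not this script's.
        result = subprocess.run(argv, preexec_fn=lambda: os.nice(niceness))
        exit_codes.append(result.returncode)
    return exit_codes
```

A script that calls, say, "run_consecutively([["./job1"], ["./job2"],
["./job3"]])" can itself be started in the background, so that only one of
the jobs is ever running at a time.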

    If you feel you need to run jobs concurrently then send email to
"requests@math" to let us know the machine you intend to use, how many jobs
you wish to run, and why concurrent processing is necessary, so we can assess
the impact on other users.  Currently we do not see any problem with a user
using many of the processors.


3) Make sure you do not fill up all the disk space.

    You should always do a test on a small case for any job that will produce
output.  For example, a program that printed a line for each even number while
counting from 1 to 10^15 could fill up all the disk space.  The shell has a
facility for limiting the size of created files: type "limit filesize 10m"
before running your job to limit the size of created files to 10 megabytes,
for example.  On coxeter, file system usage quotas have been implemented to
help prevent the accidental consumption of the whole disk by a single user.
The files you see in your home directory on sphere are the same ones that are
on coxeter.  Soon we hope to have larger quotas on coxeter.
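
A program can also set the file size limit on itself instead of relying on
the shell.  As a rough sketch for a Python job (the function name
"limit_file_size" is hypothetical), the standard "resource" module exposes
the same per-process limit.  Once the limit is reached, further writes fail;
in Python this typically surfaces as an OSError.

```python
import resource


def limit_file_size(max_bytes):
    """Cap the size of any file this process (and its children) creates.

    Only the soft limit is changed, so it can be raised again later,
    up to the hard limit.  Returns the previous soft limit.
    """
    old_soft, hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    if hard != resource.RLIM_INFINITY:
        # The soft limit may never exceed the hard limit.
        max_bytes = min(max_bytes, hard)
    resource.setrlimit(resource.RLIMIT_FSIZE, (max_bytes, hard))
    return old_soft
```

For example, "limit_file_size(10 * 1024 * 1024)" at the start of a program
caps each file it creates at 10 megabytes.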

    On sphere each user has a directory in /DO_NOT_BACK_UP/scratch which is NOT
BACKED UP but which has more space for temporary work.  For example, if your
login name is "janedoe" then you could run the command:

	cd /DO_NOT_BACK_UP/scratch/janedoe

before starting your programs and work inside that directory, which currently
has more than 1TB of disk space.  Since this space is not backed up, any
important results there should be copied or moved to your home directory on
sphere (which is the same as your home directory on coxeter).


4) Try to make your job restartable.

    The system may go down partway through your calculations, for example,
during the weekly reboot time.  If your job periodically writes out milestones
indicating how far it has gotten and/or the current state of the calculation,
then continuing the job from a later point after an interruption is simpler.
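
One common way to do this is to write a small checkpoint file at regular
intervals and to read it back at startup.  The sketch below illustrates the
idea on a toy calculation (a sum of squares); the function names, the JSON
file format, and the "stop_after" parameter, which only simulates an
interruption, are all assumptions made for the example.  The checkpoint is
written to a temporary file and then renamed, so that a crash in the middle
of a write leaves the previous checkpoint intact.

```python
import json
import os


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_i": 1, "total": 0}


def save_state(path, state):
    """Write the checkpoint to a temporary file, then rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def sum_of_squares(n, path, every=1000, stop_after=None):
    """Sum i*i for i = 1..n, checkpointing whenever next_i is a
    multiple of `every`.

    stop_after is only for demonstration: it simulates an interruption
    by returning None (without saving) after that many steps in this run.
    """
    state = load_state(path)
    steps = 0
    while state["next_i"] <= n:
        i = state["next_i"]
        state["total"] += i * i
        state["next_i"] = i + 1
        steps += 1
        if state["next_i"] % every == 0:
            save_state(path, state)
        if stop_after is not None and steps >= stop_after:
            return None
    save_state(path, state)
    return state["total"]
```

Running the function again with the same checkpoint path picks up from the
last saved milestone rather than from the beginning.  The checkpoint file
should be deleted before starting a different calculation.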


5) Memory size limits.

    All machines have a limited amount of main memory (RAM).  sphere can
support jobs of approximately 30GB in size.  If a job gets too large then
the system will slow down dramatically as swapping occurs.  Try to ensure
that your jobs do not use more memory than is necessary, but send an email
to "requests@math" if you need access to more RAM.  If a process gets too
large it may be terminated (by the system or by hand if necessary).
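
To get a feel for how much memory a job needs before running it at full
size, a program can report its own peak memory use at the end of a small
test run.  A minimal sketch for a Python job follows; the function name
"peak_memory_bytes" is hypothetical, and the unit conversion assumes the
usual convention that "ru_maxrss" is in kilobytes on Linux and in bytes on
macOS.

```python
import resource
import sys


def peak_memory_bytes():
    """Return the peak resident memory of this process so far, in bytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    if sys.platform == "darwin":
        return peak
    return peak * 1024
```

Printing "peak_memory_bytes() / 2**30" at the end of a test run gives the
peak in gigabytes, which can then be compared with the limit above.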


[Last update:  June 21, 2017]