GUIDELINES FOR RUNNING LARGE JOBS ON THE DEPARTMENT OF MATHEMATICS SYSTEMS

1) Run long jobs at lower priority.

If you are going to run a non-interactive job that takes more than 5 minutes to complete, you should "nice" the job. This lowers the priority of your job so that people needing interactive access will not notice it as much. For example, to run "job" in the background at nice 10 use: "nice +10 job &".

    Time to job completion     Minimum nice value to use
    **********************     *************************
    5-10 minutes               4
    10-60 minutes              10
    1-2 hours                  16
    2-10 hours                 19
    More than 10 hours         19, and send email to
                               "requests@math.toronto.edu"

If you forget to "nice" a job when you start it, you can still use "renice" once you know the PID (process identification number). Use "ps -fu $USER" to get a listing of your processes and their PIDs. (If you are running a job with multiple threads then use "ps -fLu $USER".) Then you could use, for example, "renice +19 PID" where "PID" is the PID of any long-running process or thread.

There are manual pages for these commands: "man renice" and "man ps" work, but you should use "man tcsh" for information about "nice" (assuming you are using the default shell, tcsh, that we give to people).

2) Run jobs consecutively, not concurrently, if possible.

If you have several commands to run it is usually better to run them consecutively, not concurrently. For example, if you had to run "job1", "job2", and "job3" then use:

    sphere% (nice +19 job1 ; nice +19 job2 ; nice +19 job3) &

if you are using our compute server, sphere, or:

    coxeter% (nice +19 job1 ; nice +19 job2 ; nice +19 job3) &

if you are running the commands on coxeter, to run the commands consecutively in the background. The parentheses group the three commands so that the whole sequence, not just "job3", runs in the background.

Note that sphere is a twelve-core machine, so at off hours, if not much else is using the machine, up to ten or eleven jobs could be run concurrently without interfering (much) with each other or with other people.
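
The priority table in section 1 and the advice in section 2 to run jobs one after another can also be followed from a small driver script instead of the shell. The following is only a sketch in Python, not something the systems require: the function names "min_nice" and "run_consecutively" are made up for illustration, the handling of run times that fall exactly on a table boundary is an assumption, and the three "jobs" are placeholders.

```python
import os
import subprocess
import sys


def min_nice(expected_minutes):
    """Minimum nice value from the table in section 1.

    Assumption: a run time exactly on a boundary (e.g. 10 minutes) takes
    the higher of the two values; the table itself does not say.
    """
    if expected_minutes < 5:
        return 0
    if expected_minutes < 10:
        return 4
    if expected_minutes < 60:
        return 10
    if expected_minutes < 120:
        return 16
    return 19  # over 10 hours: also email requests@math.toronto.edu


def run_consecutively(commands, niceness):
    """Run each command to completion before starting the next one."""
    exit_codes = []
    for argv in commands:
        # os.nice() raises the child's nice value just before it execs.
        done = subprocess.run(argv, preexec_fn=lambda: os.nice(niceness))
        exit_codes.append(done.returncode)
    return exit_codes


if __name__ == "__main__":
    # Hypothetical stand-ins for job1, job2 and job3.
    jobs = [[sys.executable, "-c", "print('job %d done')" % i] for i in (1, 2, 3)]
    print(run_consecutively(jobs, min_nice(90)))
```

Because each call to subprocess.run() waits for its command to finish, only one job is ever competing for a processor at a time. The script itself can be started in the background from the shell in the usual way.
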
Some software, such as MATLAB, will automatically use more processors for some tasks (for example, for matrix inversion).

If you feel you need to run jobs concurrently then send email to "requests@math" to let us know the machine you intend to use, how many jobs you wish to run, and why concurrent processing is necessary, so we can assess the impact on other users. Currently we do not see any problem with a user using many of the processors.

3) Make sure you do not fill up all the disk space.

You should always do a test on a small case for any job that will produce output. For example, a program that counts from 1 to 10^15 and prints a line for each even number could fill up all the disk space. The shell has a facility for limiting the size of created files. Type "limit f 10m" before running your job to limit the size of created files to 10 megabytes, for example.

On coxeter, file system quotas have been implemented to help prevent the accidental consumption of the whole disk by a single user. The files you see in your home directory on sphere are the same ones that are on coxeter. Soon we hope to have larger quotas on coxeter.

On sphere each user has a directory in /DO_NOT_BACK_UP/scratch which is NOT BACKED UP but which has more space for temporary work. For example, if your login name is "janedoe" then you could run the command:

    cd /DO_NOT_BACK_UP/scratch/janedoe

before starting your programs and work inside that directory, which currently has more than 1TB of disk space. Of course, since this space is not backed up, any important results that are there should be moved or copied to your home directory on sphere (which is the same as your home directory on coxeter).

4) Try to make your job restartable.

The system may go down part way through your calculations, for example, during the weekly reboot time.
If your job can write out milestones indicating how far it has gotten, and/or the current state of the calculation, then continuing the job from a later point after an interruption is simpler.

5) Memory size limits.

All machines have a limited amount of main memory (RAM). sphere can support jobs of approximately 30GB in size. If a job gets too large then the system will slow down dramatically as swapping occurs. Try to ensure that your jobs do not use more memory than is necessary, but send an email to "requests@math" if you need access to more RAM. If a process gets too large it may be terminated (by the system or by hand if necessary).

[Last update: June 21, 2017]
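
As a rough illustration of the advice in section 4, here is one way a job might record milestones so that it can be continued after an interruption. It is a minimal sketch in Python under stated assumptions: the checkpoint file name, the function names, and the toy calculation (a sum of squares) are all hypothetical, and the "stop_after" parameter exists only to simulate the system going down.

```python
import json
import os

CHECKPOINT = "checkpoint.json"  # hypothetical file name


def load_state(path=CHECKPOINT):
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def save_state(state, path=CHECKPOINT):
    """Write to a temporary file, then rename it into place, so that an
    interruption never leaves a half-written checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def run(n_steps, path=CHECKPOINT, every=1000, stop_after=None):
    """Sum the squares of 0..n_steps-1, saving a milestone every `every` steps.

    `stop_after` simulates an interruption, for demonstration only.
    """
    state = load_state(path)
    for step in range(state["next_step"], n_steps):
        if stop_after is not None and step >= stop_after:
            return None  # interrupted before finishing
        state["total"] += step * step
        state["next_step"] = step + 1
        if state["next_step"] % every == 0:
            save_state(state, path)
    save_state(state, path)
    return state["total"]
```

Calling run() again after an interruption picks up from the last saved milestone rather than from step 0. Any work done after that milestone is repeated, so the checkpoint interval is a trade-off between the cost of writing state and the amount of work that might be lost.
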