Long-running jobs

One of the most common reasons to use a computational server is to run processes that last for days, weeks, or even months. In this context, "long-running" typically means more than 8 hours, though this is somewhat system dependent. The following covers some tools for keeping your jobs running and some guidelines for being a thoughtful system user.

Overloading a shared server

The key here is "shared." OIT resources are provided for the entire campus. While we understand you're trying to get your work done, there are many other users trying to get theirs done as well. The rules differ somewhat between "stand-alone" servers and the HPC clusters, since clusters use a scheduler and a "fair-share" algorithm to balance usage when necessary.

The general rules of etiquette on stand-alone servers are:

Always 'nice' or 'renice' your long-running jobs. (See the section on nice and renice.)
Try not to use more than 1/2 the available cores on a server, if you can control it. Some multi-threaded applications don't let you control how many threads they use, so this is not always an option.
Also, don't monopolize multiple servers when others are waiting.
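
As a rough sketch of the half-the-cores guideline, the following assumes a Linux server with GNU coreutils (for nproc) and an application that honors the OMP_NUM_THREADS variable. That variable is an OpenMP convention; your application may use a different setting, or none at all. The program name my_app is hypothetical.

```shell
# Count the cores installed on this server.
# (--all ignores any OMP_NUM_THREADS value already set in the environment.)
total_cores=$(nproc --all)

# Use at most half of them, but never fewer than one.
max_threads=$(( total_cores / 2 ))
if [ "$max_threads" -lt 1 ]; then
    max_threads=1
fi

# Many OpenMP-based programs read this variable to cap their thread count.
export OMP_NUM_THREADS="$max_threads"
echo "Using $max_threads of $total_cores cores"

# Hypothetical long-running program, started at the lowest priority:
# nice -n 19 ./my_app
```

If your application has its own thread-count option, pass the computed value there instead.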

For HPC clusters, the rules are somewhat different:

Don't run long-running jobs on login nodes. Any job not run through the job scheduler is subject to immediate termination.
In your sbatch script, don't request tasks-per-node or task-per-process values that you can't use. Try to be as accurate as possible about what your jobs need. Over-requesting can keep the scheduler from assigning resources to other users, even though those resources are sitting idle, because your script indicates you intend to use them.
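
For illustration, here is a minimal sbatch script that requests only what it uses. The job name, resource values, time limit, and the program my_app are all placeholders, and your cluster may require other options (a partition or account, for example), so treat this as a sketch and check your cluster's documentation. The script reads the Slurm-provided SLURM_CPUS_PER_TASK variable so the thread count matches the request instead of being hard-coded separately.

```shell
#!/bin/bash
#SBATCH --job-name=example-job     # hypothetical job name
#SBATCH --nodes=1                  # one node is enough for a single multi-threaded task
#SBATCH --ntasks-per-node=1        # one task (process) on that node
#SBATCH --cpus-per-task=4          # request only the cores the program will actually use
#SBATCH --time=24:00:00            # placeholder wall-clock limit

# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given;
# fall back to 1 if the script is run outside the scheduler.
threads="${SLURM_CPUS_PER_TASK:-1}"
export OMP_NUM_THREADS="$threads"
echo "Running with $threads thread(s)"

# Hypothetical program launch:
# srun ./my_app
```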

Be Nice! nice and renice

If you're going to fire off a long-running job on a compute server or HPC cluster login node, it is both good etiquette and a requirement that you reduce the run priority of your job on a shared server. A large number of jobs can bog down a server to the point that it is hard to log into or even unresponsive.

When you start a long-running job, use the "nice" command to reduce its run priority. If you've already started a process, you can use 'renice' to change its run priority after it's started. You can only do this to processes you own, and you can only set them to lower run priorities (unless you're an admin).

To reduce a job's run priority, you increase its 'nice' value, which can seem counterintuitive. The range is 19 (lowest priority) to -20 (highest). For long-running jobs, we strongly encourage you to set the 'nice' level to 19. The following starts a job at nice level 19 (the -n flag takes the nice adjustment as its argument).

nice -n 19 large-job-name

The following sets process number 15751 to nice level 19 (lowest priority).

renice 19 -p 15751
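
One quick way to confirm what these commands do is sketched below. It assumes a Linux server with GNU coreutils (nice) and util-linux (renice), and uses "sleep 30" as a stand-in for a real job. It relies on the fact that nice, run with no command, prints the niceness it is running at.

```shell
# `nice` with no command prints the current niceness, so this shows
# the level a job started with `nice -n 19` would run at.
job_niceness=$(nice -n 19 nice)
echo "Niceness under 'nice -n 19': $job_niceness"

# Start a stand-in background job at a mild niceness of 5 ...
nice -n 5 sleep 30 &
job_pid=$!

# ... then lower its priority to the minimum after the fact.
# Nice values are capped at 19; renice reports the old and new priority.
renice -n 19 -p "$job_pid"

# Clean up the stand-in job.
kill "$job_pid"
wait "$job_pid" 2>/dev/null || true
```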

Lifewire has a short introduction: Example Uses of the Commands "nice" and "renice".

Terminal multiplexer

Users connect to our systems with terminal applications that provide command line interfaces (CLI) to the remote server. One of the biggest risks of using a remote system is that your local terminal session can drop, killing any processes that are tied to that session. We recommend using a terminal multiplexer such as 'screen' or 'tmux.'

What is a terminal multiplexer? It is a tool that creates virtual terminal sessions and allows you to switch easily between several programs in one terminal, detach them while they run in the background, and reattach them to a different terminal. A detached virtual terminal on a server will keep running even if your network connection drops or your laptop battery dies.

It's a very good practice to get in the habit of using these tools in the course of your work. There are a number of online resources for terminal multiplexers tmux and screen. (We find tmux easier to use.)

tmux

screen
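
A typical tmux workflow might look like the sketch below. It assumes tmux is installed on the server and skips itself otherwise; the session name "longjob" is hypothetical, and "sleep 60" stands in for a real job.

```shell
if command -v tmux > /dev/null 2>&1; then
    # Start a detached session that runs a stand-in command at low priority.
    tmux new-session -d -s longjob 'nice -n 19 sleep 60'

    # List running sessions; "longjob" should appear here.
    tmux list-sessions

    # To reattach later, even from a different login:
    #   tmux attach-session -t longjob
    # To detach again without stopping the job, press Ctrl-b then d.

    # End the demonstration session.
    tmux kill-session -t longjob
    status="ran"
else
    echo "tmux is not installed on this system"
    status="skipped"
fi
```

In everyday use you would usually start an interactive session (tmux new-session -s longjob), launch the job inside it, and detach.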

Application Checkpointing

While tmux and screen may allow your sessions to keep running if your terminal crashes or your network drops, they won't save you from a server crash or restart, an application failure, or hitting a job scheduler time limit. Application checkpointing is a feature written into software that allows your application to restart from a recent "known good state" rather than having to restart from the beginning. Typically, application data is written to files that capture the ongoing state of the application's progress. This can be a hugely valuable feature, but it is generally difficult to implement. Depending on the nature of your work, you may find it very important to have checkpointing capability in your application.
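
The idea can be sketched with a toy shell script. Everything here is illustrative: the checkpoint file name, the step count, and the run_job function are hypothetical, and a real application would save far more state than a single step counter.

```shell
# Hypothetical checkpoint file and workload size, chosen for illustration.
checkpoint_file="${TMPDIR:-/tmp}/example-job.checkpoint"
total_steps=10

run_job() {
    # Optional argument: stop after this step, to simulate an interruption.
    stop_after="${1:-$total_steps}"

    # Resume from the last completed step if a checkpoint exists.
    step=0
    if [ -f "$checkpoint_file" ]; then
        step=$(cat "$checkpoint_file")
    fi

    while [ "$step" -lt "$total_steps" ] && [ "$step" -lt "$stop_after" ]; do
        step=$(( step + 1 ))
        # ... one unit of real work would go here ...

        # Write to a temporary file and rename it, so a crash never
        # leaves a half-written checkpoint behind.
        echo "$step" > "$checkpoint_file.tmp"
        mv "$checkpoint_file.tmp" "$checkpoint_file"
    done
    echo "$step"
}

rm -f "$checkpoint_file"
first_run=$(run_job 4)    # interrupted after step 4
second_run=$(run_job)     # resumes at step 5 and finishes at step 10
echo "First run reached step $first_run; second run reached step $second_run"
rm -f "$checkpoint_file"
```

The second run does not repeat steps 1 through 4, which is the whole point: after a crash or a scheduler time limit, only the work since the last checkpoint is lost.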
