There are several tools available on our servers to monitor resource utilization. The most commonly used is top, which gives you a running snapshot of a number of different metrics on a server. In this article we highlight the important metrics to watch so that you share resources responsibly and don't run afoul of our policies.
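If you want a one-off reading instead of the full-screen interactive display, top can also run in batch mode. A minimal sketch, assuming a standard Linux (procps) top:

```shell
# Take a single snapshot (-n 1) in batch mode (-b), which prints plain
# text instead of taking over the terminal; trim to the summary area
# and the first few process rows.
top -b -n 1 | head -n 12
```

Batch mode is also handy for logging: redirect the output to a file to capture the server's state at a point in time.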
In the screenshot above, the "load average" (in lime green) is a summary of processor utilization. The three numbers are the average load over the last one, five, and fifteen minutes. A load of 3.18 indicates that the computer has enough tasks to keep an average of 3.18 cores/processors occupied. When this number exceeds the number of cores available on the server, jobs will start to slow down. At that point you should avoid submitting new jobs.
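You can check this comparison yourself without opening top. A minimal sketch, assuming a Linux server (the load figures live in /proc/loadavg, and nproc reports the core count):

```shell
# Read the 1-, 5-, and 15-minute load averages from /proc/loadavg.
read one five fifteen rest < /proc/loadavg
cores=$(nproc)
echo "1-min load: $one   cores: $cores"

# Compare the 1-minute load against the core count; if load exceeds
# the cores available, hold off on submitting new jobs.
awk -v l="$one" -v c="$cores" 'BEGIN { exit !(l > c) }' \
    && echo "server is saturated" \
    || echo "capacity available"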
Also indicated above, the "Mem:" line (in red) describes memory utilization, with the numbers in kilobytes. In this example, the server has 32GB of memory in total; at this moment approximately 10GB is used and 22GB is free. The free-memory figure is the one to watch. When free memory gets too low, the computer starts using virtual memory (see the "Swap:" line) and everything slows down. In this instance, virtual memory is barely used because there is ample real memory.
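The same memory and swap figures are available outside of top. A minimal sketch, assuming a Linux server with the standard free utility:

```shell
# free summarizes physical and swap memory; -g reports in gigabytes.
free -g

# The raw figures (in kilobytes) come from /proc/meminfo.
grep -E 'MemTotal|MemFree|SwapTotal|SwapFree' /proc/meminfo
```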
The image above displays many details about the "top" jobs currently running. Two important columns in this display are "RES" and "%CPU" (in blue). These show the amount of physical memory and the percentage of one processor each job is using, respectively. In this display, user john1395 is running three concurrent Stata jobs that are fully utilizing three cores, which violates the policy.
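The per-job figures in those columns can also be listed with ps, which is convenient for scripting. A sketch, assuming standard Linux ps (RSS in ps is the same resident-memory figure top calls RES, reported in kilobytes):

```shell
# List each process's owner, PID, CPU percentage, and resident memory,
# sorted so the heaviest CPU users appear first.
ps -eo user,pid,%cpu,rss,comm --sort=-%cpu | head -n 6
```

This makes it easy to spot at a glance whether one user is occupying several full cores.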
License to Kill (misbehaving jobs)
CLA-OIT system administrators do not typically watch servers for violations. Instead, we hear about overuse issues directly from researchers whose jobs cannot run because someone else is monopolizing a server or when a server becomes completely unresponsive. When jobs start behaving badly, we'll respond in one of the following ways:
Lower job priority. Each job on the server has a priority, called the "nice" level, shown in the NI column (in fuchsia) of top's job table. The lower the number, the higher the priority. When we demote a job's priority, it must wait until higher-priority jobs are finished. This is the simplest and least destructive way to keep a single person from blocking others' work: their extra jobs go to the back of the queue and only finish when nothing else is happening.
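You can lower the priority of your own jobs before we have to. A sketch using the standard renice command, with a throwaway sleep process standing in for a long-running analysis job:

```shell
# Start a throwaway background job (a stand-in for a real analysis).
sleep 300 &
pid=$!

# Demote it to the lowest priority: nice 19. Unprivileged users may
# raise the nice value (lower the priority) of their own processes,
# but only administrators can lower it again.
renice -n 19 -p "$pid"

# Confirm the new nice level in the NI column.
ps -o pid,ni,comm -p "$pid"

kill "$pid"   # clean up the demo job
```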
Stop job. Sometimes more drastic action is necessary, for instance when a job tries to consume more memory than is available. This can crash an entire server, and lowering priority will not solve the problem because the job still holds on to its memory. Stopping a job allows a researcher to resume it later without losing data. View the run/stop status of a job in the "S" column of the top display (to the left of the "%CPU" column). An "R" in this field denotes a running job, "S" a sleeping job, and "T" a stopped job. You can restart your own stopped jobs, or we can do it for you. Note: if we restart your stopped job, talk to us so we can avoid repeating the problem.
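Stopping and resuming is done with signals. A sketch, again using a throwaway sleep process in place of a real job:

```shell
# Start a throwaway background job to demonstrate on.
sleep 300 &
pid=$!

# SIGSTOP freezes the job; its state becomes "T" (stopped).
kill -STOP "$pid"
ps -o pid,stat,comm -p "$pid"

# SIGCONT resumes it right where it left off; the state returns to
# "S" (sleeping) or "R" (running).
kill -CONT "$pid"
ps -o pid,stat,comm -p "$pid"

kill "$pid"   # clean up the demo job
```

This is also what happens behind the scenes when you press Ctrl-Z in an interactive shell and later type fg or bg.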
Kill job. A last resort, the kill job option is irreversible and almost always causes loss of data. We will only kill jobs that are obvious "runaways," meaning crashed or unresponsive, but still consuming resources. Learn how to kill jobs yourself by checking out the documentation for the kill command -- type "man kill" at a command prompt.
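If you do need to kill one of your own runaway jobs, send the polite signal first and escalate only if the job ignores it. A sketch with a throwaway sleep process:

```shell
# Start a throwaway background job to kill.
sleep 300 &
pid=$!

# kill with no options sends SIGTERM, which lets a well-behaved
# program shut down cleanly.
kill "$pid"
wait "$pid" 2>/dev/null
echo "exit status: $?"   # 128 + 15 (SIGTERM) = 143

# Only if a runaway job survives SIGTERM, force it with SIGKILL,
# which cannot be caught or ignored:  kill -9 "$pid"
```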
As always, send us an email if you'd like to learn more about managing your jobs or have questions about our policies.