Monitor CPU Use on DigitalOcean Droplets


Keeping an eye on your server's performance is important, and DigitalOcean provides Droplet graphs with up-to-the-minute visualizations of how your server is performing over time. In this guide, we will review the graphs that are available by default, as well as additional graphs available by installing the DigitalOcean Agent, a small utility that gathers information about memory, disk utilization, and top consumers of CPU and memory on the system.

We'll describe how to use two common Linux utilities, uptime and top, to learn about your CPU load and utilization, and how to set DigitalOcean Alert Policies to notify you about significant changes related to a Droplet's CPU.

You will need the two utilities discussed, uptime and top. They're available as part of the default installation of most Linux distributions.

CPU Load vs. CPU Utilization

CPU load is the length of the queue of scheduled tasks, including the ones being processed. Tasks can switch in a matter of milliseconds, so a single snapshot of the load is less useful than the average of several readings taken over a period of time. For this reason, the load value is often provided as a load average.

The CPU load tells us about how much demand there is for CPU time. High demand can lead to contention for CPU time and degraded performance.
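As a rough rule of thumb, the load average is interpreted relative to the number of CPUs available. This sketch, which assumes a Linux system where /proc/loadavg is available, prints both values side by side:

```shell
# Read the 1-minute load average from /proc/loadavg and compare it to
# the number of CPUs; a sustained load above the CPU count suggests
# tasks are queuing up and waiting for processor time.
cores=$(nproc --all)
load=$(cut -d ' ' -f1 /proc/loadavg)
echo "1-minute load average: ${load} across ${cores} CPU(s)"
```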

CPU Utilization, on the other hand, tells us how busy the CPUs are, without any awareness of how many processes are waiting. Monitoring utilization can show trends over time, highlight spikes, and help identify unwanted activity. On a single-processor system, the total capacity is always 100%. On a multiprocessor system, the data can be displayed in two different ways: either the combined capacity of all the processors is counted as 100% regardless of the number of processors, which is known as normalized, or each processor is counted as 100%, so that, for example, a four-processor system at full capacity shows 400%.
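One way to see the raw data behind both views is /proc/stat on Linux, where the first cpu line aggregates all processors and the cpu0, cpu1, ... lines break the same counters out per processor:

```shell
# The first line ("cpu") sums time counters across all processors;
# the "cpu0", "cpu1", ... lines show the same counters per CPU.
grep '^cpu' /proc/stat
```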

Open your terminal (on many desktop Linux distributions, the shortcut is Ctrl+Alt+T) and run the following commands.

Display CPU Information

To find out how many processors the system has, use the nproc command with the --all option:

nproc --all
On most modern Linux distributions, we can also use the lscpu command, which displays not only the number of processors but also the architecture, model name, speed, and much more:

lscpu

Your output will look similar to the following:

. . . 

Architecture:          x86_64 
CPU op-mode(s):        32-bit, 64-bit 
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 61
Model name:            Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz
Stepping:              4
CPU MHz:               2182.946
CPU max MHz:           2700.0000
CPU min MHz:           500.0000
BogoMIPS:              4390.17
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              3072K
NUMA node0 CPU(s):     0-3
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap xsaveopt dtherm ida arat pln pts

Optimal CPU utilization varies depending on the kind of work the server is expected to do. Sustained high CPU usage comes at the price of less responsive interactivity with the system. It is often appropriate for computationally-intense applications and batch jobs to consistently run at or near full capacity. However, if the system is expected to serve web pages or provide responsive interactive sessions for services like SSH, then it may be desirable to have some idle processing power available.

Monitoring

uptime

The output of the uptime command shows:

  • the system time at the moment the command was run
  • how long the server has been running
  • how many users are currently connected to the machine
  • the CPU load averages for the past one, five, and fifteen minutes
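Running the command produces a single line; the exact values will differ on your system:

```shell
# Print the current time, uptime, user sessions, and load averages.
# The output will look something like:
#  14:08:15 up 22:54,  2 users,  load average: 2.00, 1.37, 0.63
uptime
```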

top

The output of the top command shows:

The Header Block

  • An uptime-style summary, beginning with the program name and the current time, which is updated with each data refresh, followed by how long the server has been running, the number of user sessions, and the CPU load averages over one, five, and fifteen minutes.

  • A summary of the state of tasks: the total number of processes, followed by how many of them are running, sleeping, stopped, or zombie.

  • Normalized CPU utilization figures, displayed as percentages (without the % symbol), that add up to 100%.

  • Memory and swap usage.

  • us, user: time running un-niced user processes
    This category refers to user processes that were started with no explicit scheduling priority.

  • sy, system: time running kernel processes
    When a process is doing its own work, it will appear in either the user figure described above or, if its priority was explicitly set using the nice command, in the nice figure that follows. When the kernel does work on a process's behalf, such as servicing a system call, that time is counted here instead.

  • ni, nice: time running niced user processes
    Like user, this refers to process tasks that do not involve the kernel. Unlike user, the scheduling priority for these tasks was set explicitly. The niceness level of a process is indicated in the fourth column of the process table, under the header NI. Processes with a niceness value between 1 and 20 that consume a lot of CPU time are generally not a problem, because tasks with normal or higher priority will be able to get processing power when they need it. However, if tasks with elevated niceness, between -1 and -20, are taking a disproportionate amount of CPU, they can easily affect the responsiveness of the system and warrant further investigation.
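To see the NI column in action, you can start a process with an explicit niceness and check how it is reported. This sketch uses sleep as a stand-in for a real workload:

```shell
# Launch a stand-in workload at niceness 10 (lower priority) in the
# background, then show its PID, niceness, and command name.
nice -n 10 sleep 30 &
ps -o pid,ni,comm -p $!
```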

  • id, idle: time spent in the kernel idle handler
    The percentage of time that the CPU was both available and idle.

  • wa, IO-wait: time waiting for I/O completion
    Shows when a processor has begun a read or write operation and is waiting for the I/O to complete.

  • hi: time spent servicing hardware interrupts
    The time spent on physical interrupts sent to the CPU from peripheral devices like disks and hardware network interfaces. When the hardware interrupt value is high, one of the peripheral devices may not be working properly.
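If the hardware interrupt figure looks high, /proc/interrupts on Linux shows how many interrupts each device has raised on each CPU since boot (the device list will vary by system):

```shell
# The header row lists the CPUs; each following row is one interrupt
# line with per-CPU counts and the device or handler that raised it.
head -n 10 /proc/interrupts
```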

  • si: time spent servicing software interrupts
    Software interrupts are sent by processes rather than physical devices. Unlike hardware interrupts, which occur at the CPU level, software interrupts occur at the kernel level. When software interrupts are consuming a lot of processing power, investigate the specific processes that are generating them.

  • st: time stolen from this VM by the hypervisor
    The "steal" value refers to how long a virtual CPU spends waiting for a physical CPU while the hypervisor is servicing itself or a different virtual CPU. Essentially, the amount of CPU use in this field indicates how much processing power your VM is ready to use, but which is not available to your application because it is being used by the physical host or another virtual machine. Generally, seeing a steal value of up to 10% for brief periods of time is not a cause for concern. Larger amounts of steal for longer periods of time may indicate that the physical server has more demand for CPU than it can support.

Information Table

Output
PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
9966 sammy     20   0    9528     96      0 R  99.7  0.0   0:40.53 stress
9967 sammy     20   0    9528     96      0 R  99.3  0.0   0:40.38 stress
7 root      20   0       0      0      0 S   0.3  0.0   0:01.86 rcu_sched
1431 root      10 -10    7772   4556   2448 S   0.3  0.1   0:22.45 iscsid
9968 root      20   0   42556   3604   3012 R   0.3  0.1   0:00.08 top
9977 root      20   0   96080   6556   5668 S   0.3  0.2   0:00.01 sshd
... 

The %CPU value is presented as a percentage, but it isn't normalized, so on this two-core system the values in the process table can add up to 200% when both processors are fully utilized.
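To illustrate, you can sum the non-normalized %CPU column reported by ps across all processes; on an N-processor machine running flat out, the total approaches N × 100:

```shell
# Sum the %CPU column for every process. Note that ps reports %CPU
# averaged over each process's lifetime, so the total is smoother
# than the instantaneous figures top displays.
ps -eo pcpu --no-headers | awk '{ total += $1 } END { printf "total %%CPU: %.1f\n", total }'
```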