Linux cgroups Explained: How Containers Get Their Limits

Intermediate 10 min read

What you'll learn

✓ What cgroups are and what they control
✓ The difference between v1 and v2
✓ How memory and CPU limits actually work
✓ How Docker and Kubernetes use cgroups
✓ How to inspect cgroups on a live system

Prerequisites

• Basic Linux command line

If namespaces give containers their separate view of the world, cgroups are what enforce their share of it. Every CPU limit you set on a Docker container, every memory cap in a Kubernetes pod, every “noisy neighbor” protection on a multi-tenant host comes down to cgroups doing their job in the kernel.

What and Why

Control groups, usually called cgroups, are a kernel feature that organizes processes into a hierarchy of groups and applies resource limits and accounting to each group. You can cap CPU, memory, block IO, PIDs, and several other resources per group, and you get per-group statistics for free.

Without cgroups, one runaway process can starve the whole machine. With them, you can arrange for the database container to keep at least two CPUs' worth of time even under contention and for the analytics job to never exceed 4 GB of RAM, regardless of what else is running.

Mental Model

A cgroup is a directory under /sys/fs/cgroup. Inside that directory are control files you write to (memory.max, cpu.max, cpu.weight) and stat files you read from (memory.current, cpu.stat). Processes belong to a cgroup; writing a PID into cgroup.procs moves it there.
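The directory-and-files model can be sketched as a tiny helper. This is a minimal illustration, assuming cgroup v2, root privileges, and a parent whose cgroup.subtree_control already enables the memory controller; the names cg_create and CG_ROOT are made up for this example.

```shell
# Sketch: create a cgroup, set a memory ceiling, and move a process into it.
# CG_ROOT and cg_create are illustrative names, not part of any tool.
CG_ROOT="${CG_ROOT:-/sys/fs/cgroup}"

cg_create() {
    # usage: cg_create NAME MEMORY_MAX_BYTES PID
    local name="$1" mem_max="$2" pid="$3"
    local dir="$CG_ROOT/$name"
    mkdir -p "$dir" || return 1                       # making the directory creates the cgroup
    echo "$mem_max" > "$dir/memory.max" || return 1   # writing a control file sets the limit
    echo "$pid" > "$dir/cgroup.procs" || return 1     # writing a PID moves that process in
    echo "$dir"
}
```

On a real host, running cg_create demo 268435456 $$ as root would place the current shell in a 256 MiB cgroup. If memory.max does not appear in the new directory, the memory controller is probably not enabled in the parent.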

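cgroups are hierarchical. A limit set on a parent applies to the combined usage of everything beneath it, so a child can tighten its own limit but can never exceed what its ancestors allow; a looser value written into a child's file has no effect beyond the parent's ceiling. This is how Kubernetes nests pod cgroups inside QoS-class cgroups inside the node-level cgroup.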
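Because every ancestor's limit also applies, the ceiling a process really faces is the tightest memory.max on the path from its cgroup up to the root. A rough sketch of that walk follows; effective_memory_max is a hypothetical helper name, and the second argument exists only so the function can be pointed at a directory other than /sys/fs/cgroup.

```shell
# Sketch: effective memory limit = smallest memory.max between a cgroup and the root.
effective_memory_max() {
    # usage: effective_memory_max CGROUP_DIR [ROOT_DIR]
    local dir="${1%/}" root="${2:-/sys/fs/cgroup}" best=max val
    root="${root%/}"
    while [ "$dir" != "$root" ] && [ -n "$dir" ] && [ "$dir" != "/" ] && [ "$dir" != "." ]; do
        if [ -r "$dir/memory.max" ]; then
            val=$(cat "$dir/memory.max")
            # "max" means no limit at this level, so it never tightens the result
            if [ "$val" != "max" ]; then
                if [ "$best" = "max" ] || [ "$val" -lt "$best" ]; then
                    best=$val
                fi
            fi
        fi
        dir=$(dirname "$dir")   # step up to the parent cgroup
    done
    echo "$best"
}
```

The root cgroup itself is skipped, since it carries no memory.max file.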

There are two versions: v1 (one hierarchy per controller, legacy) and v2 (one unified hierarchy, the future). Most modern distros default to v2.

Hands-on Example

Look at a running container’s cgroup:

docker run -d --name demo --memory 256m --cpus 1.5 nginx:1.27
cat /proc/$(docker inspect -f '{{.State.Pid}}' demo)/cgroup

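On cgroup v2 with Docker's systemd cgroup driver you will see a single line such as 0::/system.slice/docker-<id>.scope, which corresponds to the directory /sys/fs/cgroup/system.slice/docker-<id>.scope/. Inside that directory: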

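# Assumes the systemd cgroup driver; if your path differs, take it from the /proc/<pid>/cgroup output
cgdir=/sys/fs/cgroup/system.slice/docker-$(docker inspect -f '{{.Id}}' demo).scope
cat "$cgdir/memory.max"     # 268435456 (256 MiB)
cat "$cgdir/cpu.max"        # 150000 100000 (1.5 CPU)
cat "$cgdir/memory.current" # live usage in bytes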
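If you would rather not hard-code the layout, the directory can be derived from the /proc/<pid>/cgroup line itself. The helper below is a sketch that handles only the cgroup v2 entry (the line starting with 0::); cgroup_dir_from_proc is an illustrative name.

```shell
# Sketch: turn the v2 entry of a /proc/<pid>/cgroup file into a directory path.
cgroup_dir_from_proc() {
    # usage: cgroup_dir_from_proc PROC_CGROUP_FILE [MOUNT_POINT]
    local file="$1" mount="${2:-/sys/fs/cgroup}" rel
    # The v2 entry has hierarchy ID 0 and an empty controller list: "0::<path>"
    rel=$(sed -n 's/^0:://p' "$file" | head -n 1)
    [ -n "$rel" ] || return 1       # no v2 entry found (pure v1 host)
    echo "${mount%/}$rel"
}
```

For the demo container this could be called as cgroup_dir_from_proc /proc/$(docker inspect -f '{{.State.Pid}}' demo)/cgroup, run from the host rather than from inside the container.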

/sys/fs/cgroup/                        (root)
  cgroup.controllers, cgroup.subtree_control, ...
  user.slice/
    user-1000.slice/
      session-3.scope/      <- your login shell
  system.slice/
    docker.service/
    docker-abc123.scope/    <- the container
      memory.max = 268435456     (256 MiB)
      cpu.max    = 150000 100000 (1.5 CPU)
      cgroup.procs:
        12345
        12346
        12350

cgroup v2 hierarchy with Docker

When the container's usage reaches memory.max and the kernel cannot reclaim enough pages, the OOM killer runs scoped to that cgroup. The host stays healthy; the container loses a process, usually its largest memory consumer.

cpu.max works as a quota over a period, both expressed in microseconds. 150000 100000 means 150 ms of CPU time per 100 ms of wall time, summed across all CPUs, which is effectively 1.5 cores.
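The quota-over-period arithmetic is easy to check by hand. The sketch below converts the contents of cpu.max into a core count; cpu_max_to_cores is a hypothetical helper, and it treats the literal value max as unlimited.

```shell
# Sketch: convert "QUOTA PERIOD" (microseconds) from cpu.max into a core count.
cpu_max_to_cores() {
    # usage: cpu_max_to_cores "QUOTA PERIOD"
    echo "$1" | LC_ALL=C awk '{
        if ($1 == "max") { print "unlimited"; exit }   # no quota configured
        printf "%.2f\n", $1 / $2                       # quota divided by period
    }'
}
```

Calling cpu_max_to_cores "150000 100000" prints 1.50, matching the --cpus 1.5 flag used earlier. On a live system you could pass "$(cat "$cgdir/cpu.max")" instead.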

Common Pitfalls

Confusing limits with reservations. cpu.max is a ceiling. cpu.weight is a relative share that only matters under contention. Kubernetes requests map to weights, limits map to quotas — get them backwards and your scheduling decisions go sideways.
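To make the request/limit split concrete, here is a sketch of the arithmetic. The limit-to-quota step is plain proportion. The request-to-weight step assumes the linear shares-to-weight conversion described in Kubernetes' cgroup v2 enhancement proposal; treat the constants as an assumption, since runtimes may use a different curve. Both function names are hypothetical.

```shell
# Sketch: how a Kubernetes CPU limit and request could translate to cgroup v2 values.
k8s_cpu_limit_to_cpu_max() {
    # usage: k8s_cpu_limit_to_cpu_max MILLICORES [PERIOD_USEC]
    local milli="$1" period="${2:-100000}"
    echo "$(( milli * period / 1000 )) $period"    # quota and period, as written to cpu.max
}

k8s_cpu_request_to_weight() {
    # usage: k8s_cpu_request_to_weight MILLICORES
    local milli="$1" shares
    shares=$(( milli * 1024 / 1000 ))              # v1-style cpu.shares
    if [ "$shares" -lt 2 ]; then shares=2; fi      # cpu.shares has a floor of 2
    # Assumed linear mapping from shares [2, 262144] to weight [1, 10000]
    echo $(( 1 + (shares - 2) * 9999 / 262142 ))
}
```

Under these assumptions, a 1500m limit becomes 150000 100000 in cpu.max, and a 250m request becomes a cpu.weight of 10.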

OOM kills that look like crashes. When memory.max is hit, the kernel kills a process inside the cgroup, often without a userspace message. Look at dmesg or journalctl -k and you will see “Memory cgroup out of memory.” Always check kernel logs when a container disappears.

CPU throttling silently degrading latency. A container with cpu.max set can burn through its quota and be throttled for the remainder of the period, adding tens of milliseconds of latency. cat cpu.stat shows nr_throttled and throttled_usec. If those are nonzero and growing, raise the limit or reshape the workload.
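A single number is often easier to watch than raw counters. This sketch reports the percentage of enforcement periods in which the cgroup was throttled, using the nr_periods and nr_throttled fields of a v2 cpu.stat file; throttle_percent is an illustrative name.

```shell
# Sketch: share of CFS periods in which the cgroup hit its quota.
throttle_percent() {
    # usage: throttle_percent CPU_STAT_FILE
    LC_ALL=C awk '
        $1 == "nr_periods"   { periods = $2 }
        $1 == "nr_throttled" { throttled = $2 }
        END {
            if (periods > 0) printf "%.1f\n", 100 * throttled / periods
            else print "0.0"    # no quota set, or no periods elapsed yet
        }' "$1"
}
```

The counters are cumulative since the cgroup was created, so sample the value twice and compare if you care about recent behavior.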

Mixing v1 and v2. Older runtimes assume v1, newer ones prefer v2. On a host with the hybrid layout, the wrong runtime can silently ignore limits. Check stat -fc %T /sys/fs/cgroup: cgroup2fs means pure v2, while tmpfs means v1 or hybrid.
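The check can be wrapped in a small classifier. This sketch takes the filesystem type strings as arguments; it assumes that on a systemd hybrid layout the v2 tree is mounted at /sys/fs/cgroup/unified, and cgroup_mode is a made-up name.

```shell
# Sketch: classify the cgroup layout from filesystem type strings.
cgroup_mode() {
    # usage: cgroup_mode FSTYPE_OF_SYS_FS_CGROUP [FSTYPE_OF_UNIFIED_DIR]
    case "$1" in
        cgroup2fs) echo "v2" ;;
        tmpfs)
            # Assumption: hybrid hosts expose a cgroup2 mount under .../unified
            if [ "$2" = "cgroup2fs" ]; then echo "hybrid"; else echo "v1"; fi ;;
        *) echo "unknown" ;;
    esac
}
```

On a live host you might call it as cgroup_mode "$(stat -fc %T /sys/fs/cgroup)" "$(stat -fc %T /sys/fs/cgroup/unified 2>/dev/null)".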

Forgetting that PID limits exist. Fork bombs are still a thing. pids.max caps the number of tasks in a cgroup and is cheap insurance.
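To see how close a cgroup is to its task limit, compare pids.current with pids.max. The helper below is a sketch with a hypothetical name, pids_headroom; it prints the number of tasks that can still be created, or unlimited when no cap is set.

```shell
# Sketch: remaining task slots in a cgroup before pids.max is reached.
pids_headroom() {
    # usage: pids_headroom CGROUP_DIR
    local cur max
    cur=$(cat "$1/pids.current") || return 1
    max=$(cat "$1/pids.max") || return 1
    if [ "$max" = "max" ]; then
        echo "unlimited"            # no PID cap configured
    else
        echo $(( max - cur ))
    fi
}
```

With Docker, the cap itself is set through the --pids-limit flag on docker run.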

Practical Tips

Read cgroup.stat and the *.pressure files (cpu.pressure, memory.pressure, io.pressure). Pressure Stall Information (PSI) is the v2 feature that reports what share of time tasks in the cgroup were stalled waiting on CPU, memory, or IO. It is the most useful signal for “is this container starved?”
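Pressure files have a simple line format: a some line and, for most resources, a full line, each carrying avg10, avg60, avg300, and total fields. The sketch below pulls out the 10-second average; psi_avg10 is an illustrative helper name.

```shell
# Sketch: extract the avg10 value from a PSI file such as memory.pressure.
psi_avg10() {
    # usage: psi_avg10 PRESSURE_FILE [some|full]
    local kind="${2:-some}"
    awk -v kind="$kind" '$1 == kind {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^avg10=/) { sub(/^avg10=/, "", $i); print $i }
    }' "$1"
}
```

The value is a percentage of wall time. A some figure means at least one task was stalled; full means all non-idle tasks were stalled at once.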

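When running cgroup-aware tools inside a container, expose /sys/fs/cgroup read-only. Modern JVMs and recent Node and Go releases can read their limits from there to size thread pools and GC heaps; check your runtime version, since older releases size themselves from the host's totals.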

Use systemd-run --scope -p MemoryMax=512M your-command for ad-hoc isolation outside Docker. It is the easiest way to play with cgroups without a runtime in the way.

When debugging OOM kills, check memory.events. It counts oom, oom_kill, and high events — far more reliable than tailing dmesg.
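memory.events is a flat list of name and counter pairs, so a single counter is easy to extract. This is a minimal sketch; memory_event_count is a hypothetical helper that returns a non-zero status when the requested event name is not present.

```shell
# Sketch: read one counter (for example oom_kill) from a memory.events file.
memory_event_count() {
    # usage: memory_event_count MEMORY_EVENTS_FILE EVENT_NAME
    awk -v ev="$2" '
        $1 == ev { print $2; found = 1 }
        END { if (!found) exit 1 }' "$1"
}
```

For the demo container, memory_event_count "$cgdir/memory.events" oom_kill would show how many processes the kernel has killed in that cgroup so far.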

For multi-tenant hosts, set limits at the slice level (system.slice, user.slice) and let children inherit. This protects the host from any single tenant.

Wrap-up

cgroups are the boring, essential plumbing that makes containers a credible isolation primitive. Once you can find a container’s cgroup directory, read its limits, and check its pressure stats, the magic of Docker and Kubernetes becomes legible — they are mostly tools that write the right values into the right files. The next time a pod gets OOM-killed or a service mysteriously slows under load, you will know exactly where to look.