Go pprof Profiling Tutorial

Intermediate 9 min read

What you'll learn

✓ What profiles pprof captures and how it captures them
✓ Wiring net/http/pprof into a service
✓ Collecting CPU, heap, goroutine, and mutex profiles
✓ Reading flame graphs and top output
✓ Avoiding common misreads

Prerequisites

• A running Go service or program
• Comfort with the command line

When a Go service is slow, the temptation is to guess. pprof is the tool that turns guesses into evidence. It samples your program, attributes time and allocations to functions, and renders the result so the costly path is obvious. It is part of the standard library and takes about three lines to enable.

What pprof is and why

pprof is a profiler with several modes, most of them sampling-based. The CPU profile samples program counters at a regular interval and tells you where the CPU spent its time. The heap profile samples allocations and reports both the memory still in use and everything allocated since the program started. There are also goroutine, mutex, and block profiles.

The “why” is simple: intuition about where the time goes is often wrong. A profile tells you the actual hot function, which is often a json.Unmarshal or a regex compile in a loop rather than the algorithm you suspected.

Mental model

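Each profile is a set of stack samples that pprof aggregates into a call tree. CPU profiles sample at 100 Hz by default; if your function appears in 30 percent of samples, it accounts for roughly 30 percent of CPU time. Heap profiles are sampled as well, but by bytes allocated instead of clock ticks: by default the runtime records roughly one allocation for every 512 KiB allocated (runtime.MemProfileRate).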

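The flame graph is the most useful visualization. Width equals share of samples. Vertical position equals stack depth. pprof's web UI draws the root at the top with callees stacked beneath it, so look for the widest frame deep in the stack with little or nothing beneath it: that is your hot spot.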

Hands-on example

Enable the HTTP endpoint in your service.

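package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
    go func() {
        // Loopback only; log the error instead of silently dropping it.
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    // your real server on a different port
    select {} // placeholder that keeps this sketch alive
}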

Now collect a 30-second CPU profile under load. Quote the URL so the shell does not try to interpret the ? character.

go tool pprof -http=:8081 "http://localhost:6060/debug/pprof/profile?seconds=30"

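This opens a browser with flame graph, top list, source view, and graph. For a quick textual summary, omit the -http flag to get the interactive terminal prompt and run top. The flat column is time spent in the function itself; cum also includes time spent in the functions it calls.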

(pprof) top
Showing nodes accounting for 8.2s, 82% of 10s total
      flat   cum   function
     3.40s  3.40s  encoding/json.(*decodeState).object
     1.90s  4.20s  example.com/api.handleSearch
     1.10s  1.30s  runtime.mallocgc

For heap, swap the URL: /debug/pprof/heap. For goroutines: /debug/pprof/goroutine?debug=2 gives a readable text dump useful for diagnosing leaks.

Figure: pprof data flow from running process to flame graph

Common pitfalls

Profiling under no load is meaningless. The CPU profile will be empty or full of runtime idle. Generate realistic traffic before you sample, or you will conclude that your service spends all its time in scheduler and network-poller functions such as runtime.netpoll.

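Heap profiles show in-use allocations by default. If you want to see where allocation pressure comes from (and therefore GC cost), fetch /debug/pprof/allocs or switch the heap profile to the alloc_space view with -sample_index=alloc_space. The ?gc=1 parameter does something different: it runs a garbage collection before the heap sample is taken. People often profile heap, see almost nothing, and miss the churn entirely.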

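Inlined functions can vanish from the flame graph. Recent Go toolchains usually record inlining information, and pprof marks such frames as (inline), but when a function you expect is missing, rebuild with -gcflags="all=-l" to disable inlining temporarily, keeping in mind that this changes the performance you are measuring. Otherwise, read the flame graph with the understanding that the cost is attributed to the parent function.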

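Do not expose /debug/pprof on a public interface. Bind it to localhost or a Unix socket, or put it behind authentication. The blank import registers its handlers on http.DefaultServeMux, so any server that uses the default mux serves them too. The profile endpoints reveal internals, and collecting a CPU profile can briefly affect performance.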

Practical tips

For CPU work, focus on the widest box in the flame graph that you can actually change. The runtime functions at the bottom are usually symptoms, not causes (runtime.mallocgc, for example, points at allocation); the function in your code a level or two above them is the lever.

For allocations, go test -bench=. -benchmem -memprofile=mem.out and then go tool pprof mem.out is the fastest loop. You can compare two profiles with go tool pprof -diff_base=before.pb.gz after.pb.gz to confirm an optimization moved the needle.

Goroutine profiles diagnose leaks and deadlocks. If goroutine count grows over time, take a profile, then take another five minutes later, and diff the two. The growing stacks are your leak.

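Mutex and block profiles are off by default. Enable them with runtime.SetMutexProfileFraction(5) and runtime.SetBlockProfileRate(1) when you suspect contention. A mutex fraction of 5 reports about one in five contention events, which is cheap. A block rate of 1 records every blocking event, which can add noticeable overhead in a busy service, so consider a larger rate (the value is nanoseconds spent blocked per sample) or switch it off again after the investigation. The insight can be enormous.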

Wrap-up

pprof turns performance work from guesswork into a tight loop. Wire in net/http/pprof, generate realistic load, collect the relevant profile, and read the flame graph from the widest box down. Once you can ask “where is the time actually going?” with confidence, optimization stops being a stab in the dark.