AWS SQS Dead-Letter Queues: Catching Poison Messages

Intermediate 9 min read

What you'll learn

✓ What dead-letter queues are and why every SQS consumer needs one
✓ How maxReceiveCount and visibility timeout interact
✓ How to wire up a DLQ with the AWS CLI
✓ How to debug, replay, and clear out stuck messages
✓ Production tips for alarms and redrive policies

Prerequisites

• Basic familiarity with Amazon SQS queues and Lambda or worker consumers

A dead-letter queue (DLQ) is a secondary SQS queue that catches messages your primary consumer cannot process. Without one, a single corrupt message is redelivered over and over until the queue's retention period expires (4 days by default), wasting throughput, racking up Lambda bills, and hiding bugs behind silent retries. With one, you get a quarantine area you can inspect, alarm on, and replay.

What and Why

Standard SQS queues provide at-least-once delivery. When a consumer receives a message, its receive count increments and the message becomes invisible for the visibility timeout. If the consumer deletes it in time, life is good. If the consumer crashes, throws, or simply forgets to delete, the message reappears when the timeout lapses and is delivered again. With no DLQ, this loop ends only when the message ages out of the queue's retention period.

A DLQ flips the model. You set maxReceiveCount on the source queue’s redrive policy. Once a message has been received that many times without being deleted, SQS moves it to the DLQ automatically. Your consumer is now free to drain healthy messages while the bad ones wait for a human or a replay job.
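As a rough mental check, the redrive rule can be modelled in a few lines of shell. This is a toy simulation with no AWS calls; the function name simulate_redrive is made up for illustration, and it assumes a consumer that fails on every receive.

```shell
# Toy model of SQS redrive: no AWS calls, just the counting rule.
# Assumes the consumer fails (never deletes the message) on every receive.
simulate_redrive() {
  max_receive_count="$1"
  receive_count=0
  while [ "$receive_count" -lt "$max_receive_count" ]; do
    receive_count=$((receive_count + 1))
    echo "receive #$receive_count: consumer failed, message reappears after the visibility timeout"
  done
  # The next receive would exceed maxReceiveCount, so SQS moves the message instead.
  echo "receive count would exceed $max_receive_count: moved to DLQ"
}

simulate_redrive 3
```

With maxReceiveCount set to 3, the consumer gets three attempts and the message is quarantined instead of being handed out a fourth time.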

You want DLQs for: bad JSON payloads, references to deleted database rows, downstream services that return permanent 4xx errors, code paths with bugs that throw on edge cases, and messages whose processing exceeds Lambda’s 15-minute timeout.

Mental Model

Think of the DLQ as a hospital triage room. The main queue is the front door; messages keep coming. Your consumer is the receptionist. Most visitors are routed quickly. A few get stuck — they keep showing up because nobody can help them. After three tries, they get moved to triage so the queue keeps flowing.

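Two knobs matter most. maxReceiveCount is how patient you are: 3 to 5 is typical. The DLQ’s own retention is how long quarantine lasts: set it to 14 days (the maximum) because you want time to debug before evidence disappears. On standard queues, a message’s age is counted from when it was first enqueued in the source queue, not from when it landed in the DLQ, so the DLQ’s retention should always be longer than the source queue’s.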
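MessageRetentionPeriod is given in seconds, which makes the 14-day value easy to mistype. A small sketch of the arithmetic follows; the helper names days_to_seconds and set_dlq_retention_days are hypothetical, and the second one assumes a configured AWS CLI and an existing queue URL.

```shell
# MessageRetentionPeriod is expressed in seconds; SQS accepts 60 to 1209600.
# $1 = number of days.
days_to_seconds() {
  echo $(( $1 * 24 * 60 * 60 ))
}

# Hypothetical helper: change retention on an existing DLQ.
# $1 = DLQ URL, $2 = retention in days. Nothing runs until you call it.
set_dlq_retention_days() {
  aws sqs set-queue-attributes --queue-url "$1" \
    --attributes "MessageRetentionPeriod=$(days_to_seconds "$2")"
}

days_to_seconds 14   # prints 1209600
```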

Hands-on Example

Create two queues — orders and orders-dlq — and link them with a redrive policy.

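# Create the DLQ first, with the maximum 14-day retention (1209600 seconds).
aws sqs create-queue --queue-name orders-dlq \
  --attributes MessageRetentionPeriod=1209600

# Look up the DLQ's ARN; the redrive policy references the ARN, not the URL.
DLQ_URL=$(aws sqs get-queue-url --queue-name orders-dlq --query QueueUrl --output text)
DLQ_ARN=$(aws sqs get-queue-attributes --queue-url "$DLQ_URL" \
  --attribute-names QueueArn --query Attributes.QueueArn --output text)

# Create the source queue. RedrivePolicy is a JSON string nested inside the
# attributes JSON, hence the double layer of escaping.
aws sqs create-queue --queue-name orders \
  --attributes "{\"RedrivePolicy\":\"{\\\"deadLetterTargetArn\\\":\\\"$DLQ_ARN\\\",\\\"maxReceiveCount\\\":\\\"5\\\"}\",\"VisibilityTimeout\":\"60\"}"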
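The nested escaping in that last command is easy to get wrong by hand. One possible way to tame it is to generate the attributes JSON with printf; the helper names below are hypothetical, and attach_redrive_policy assumes the source queue already exists.

```shell
# Print the --attributes JSON for a redrive policy.
# $1 = DLQ ARN, $2 = maxReceiveCount. RedrivePolicy must itself be a JSON
# string, hence the escaped inner quotes in the output.
redrive_attributes_json() {
  printf '{"RedrivePolicy":"{\\"deadLetterTargetArn\\":\\"%s\\",\\"maxReceiveCount\\":\\"%s\\"}"}\n' "$1" "$2"
}

# Hypothetical helper: attach (or change) the policy on an existing queue.
# $1 = source queue URL, $2 = DLQ ARN, $3 = maxReceiveCount.
attach_redrive_policy() {
  aws sqs set-queue-attributes --queue-url "$1" \
    --attributes "$(redrive_attributes_json "$2" "$3")"
}
```

Calling redrive_attributes_json on its own prints the JSON, so you can eyeball it before anything is sent to AWS.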

Producer -> [orders queue] -> Consumer
               |               |
               |          (throws/no delete)
               |               |
               +-- receive #1..#5 increment
               |
               v (after 5 failed receives)
          [orders-dlq] -- alarm -> on-call

DLQ flow after maxReceiveCount is hit

Now publish a malformed message and watch a buggy consumer fail. After five attempts (about five visibility-timeout cycles), the message disappears from orders and shows up in orders-dlq. You can inspect it with aws sqs receive-message --queue-url ... --visibility-timeout 0.
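A slightly fuller version of that inspection command might look like the sketch below. The helper names peek_dlq and dlq_depth are hypothetical, the queue URL is whatever your account returns, and newer CLI releases also offer --message-system-attribute-names as the preferred spelling of --attribute-names on receive-message.

```shell
# Hypothetical helper: peek at up to 10 DLQ messages. $1 = DLQ URL.
# A visibility timeout of 0 makes each message immediately receivable again,
# so peeking does not hide it from a later replay.
peek_dlq() {
  aws sqs receive-message --queue-url "$1" \
    --max-number-of-messages 10 \
    --visibility-timeout 0 \
    --attribute-names ApproximateReceiveCount SentTimestamp \
    --query 'Messages[].{Id:MessageId,Receives:Attributes.ApproximateReceiveCount,Body:Body}' \
    --output json
}

# Hypothetical helper: approximate number of messages waiting in the DLQ.
dlq_depth() {
  aws sqs get-queue-attributes --queue-url "$1" \
    --attribute-names ApproximateNumberOfMessages \
    --query Attributes.ApproximateNumberOfMessages --output text
}
```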

To replay, use the SQS console’s “Start DLQ redrive” feature or write a small script that reads from the DLQ and sends back to the source. Always fix the bug first — otherwise the redrive just refills the DLQ.
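From the command line, a redrive can be started with the start-message-move-task command, assuming a reasonably recent AWS CLI. The sketch below wraps it in hypothetical helper functions; the rate limit of 50 messages per second is an arbitrary illustrative value, not a recommendation.

```shell
# Hypothetical helper: start a server-side redrive. $1 = DLQ ARN.
# With no --destination-arn, SQS sends each message back to the queue it
# originally came from.
start_redrive() {
  aws sqs start-message-move-task --source-arn "$1" \
    --max-number-of-messages-per-second 50
}

# Hypothetical helper: show the most recent move task for the same DLQ,
# including its status and how many messages it has moved so far.
redrive_status() {
  aws sqs list-message-move-tasks --source-arn "$1" --max-results 1
}
```

Throttling the move gives a freshly fixed consumer room to fail safely: if the bug is still there, you notice before the whole backlog has cycled back into the DLQ.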

Common Pitfalls

• maxReceiveCount of 1. One transient blip and the message dies. Use at least 3, usually 5.
• Visibility timeout shorter than processing time. The message comes back while still being processed, the receive count climbs, and it lands in the DLQ even though processing eventually succeeded. For Lambda consumers, AWS recommends a visibility timeout of at least six times the function's configured timeout, plus the batching window if you set one.
• DLQ retention left at the 4-day default. Bug reports often take longer than that. Set it to 14 days.
• No alarm on DLQ depth. A DLQ you never look at is just a silent failure log. Alarm on the ApproximateNumberOfMessagesVisible metric that SQS publishes to CloudWatch.
• Shared DLQ across unrelated queues. Mixing payloads makes replay risky. Use one DLQ per source queue.
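For the visibility-timeout pitfall, AWS's guidance for Lambda consumers is at least six times the function's configured timeout, plus the maximum batching window. The arithmetic can be sketched as follows; the function name is hypothetical and the inputs are illustrative.

```shell
# Recommended queue visibility timeout for a Lambda consumer, in seconds.
# $1 = function timeout in seconds
# $2 = maximum batching window in seconds (optional, defaults to 0)
recommended_visibility_timeout() {
  echo $(( 6 * $1 + ${2:-0} ))
}

recommended_visibility_timeout 30 5   # prints 185
```

A function with a 30-second timeout and a 5-second batching window therefore wants a visibility timeout of at least 185 seconds, well above the 60 seconds used in the hands-on example.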

Production Tips

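Set a CloudWatch alarm on ApproximateNumberOfMessagesVisible > 0 for the DLQ with a five-minute evaluation period, and page on it. For Lambda consumers, keep in mind that function-level OnFailure destinations and function DLQs apply only to asynchronous invocations, not to SQS event source mappings, so the redrive policy on the source queue is what actually catches failed messages. Enabling ReportBatchItemFailures on the event source mapping lets the function report which messages in a batch failed, so one poison message does not drag its healthy batch-mates toward the DLQ.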
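One way to create the depth alarm from the CLI is sketched below. The helper name, the alarm naming scheme, and the SNS topic are assumptions; the topic must already exist and be wired to your paging tool, and the runbook placeholder in the description should be replaced with a real link.

```shell
# Hypothetical helper: alarm whenever the DLQ holds any visible message.
# $1 = DLQ name, $2 = SNS topic ARN that pages on-call (assumed to exist).
create_dlq_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "$1-not-empty" \
    --alarm-description "Messages in $1. Runbook: <link to diagnose-and-replay steps>" \
    --namespace AWS/SQS \
    --metric-name ApproximateNumberOfMessagesVisible \
    --dimensions "Name=QueueName,Value=$1" \
    --statistic Maximum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 0 \
    --comparison-operator GreaterThanThreshold \
    --treat-missing-data notBreaching \
    --alarm-actions "$2"
}
```

Using the Maximum statistic over a 300-second period means a single poison message anywhere in the window trips the alarm, and treating missing data as not breaching keeps an idle queue quiet.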

Tag DLQs distinctly (role=dlq) so cost and ops dashboards can filter them. Store the redrive policy in Terraform or CDK so it survives queue re-creation. Finally, write a runbook and link it from the alarm description: every DLQ alarm should point to “how to diagnose and how to replay”.
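If the queue is not yet managed by infrastructure-as-code, the tag can be applied from the CLI as a stopgap. The helper names here are hypothetical; the role=dlq tag follows the convention suggested above.

```shell
# Hypothetical helper: mark a queue as a DLQ. $1 = queue URL.
tag_dlq() {
  aws sqs tag-queue --queue-url "$1" --tags role=dlq
}

# Hypothetical helper: confirm which tags a queue currently carries.
show_queue_tags() {
  aws sqs list-queue-tags --queue-url "$1"
}
```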

Wrap-up

Dead-letter queues turn invisible failures into actionable alerts. Configure one per source queue, set maxReceiveCount between 3 and 5, give the DLQ 14-day retention, and alarm on depth. Your future on-call self will thank you when the first poison message arrives.