A data pipeline runs on Temporal. Workflows orchestrate ingestion, transformation, and loading. Everything has been humming along for weeks. Then one morning, the dashboard shows stale data. The numbers haven't moved since last night.

The on-call engineer opens the Temporal UI. Workflows are stuck in "Running"; none of them are completing. The task queue shows zero pollers. No workers are picking up tasks.

The natural instinct: something is wrong with Temporal.

But is it?

Spoiler: this is not a complex outage. The root cause is almost embarrassingly simple. But that's exactly what makes it a good example. Most production incidents aren't exotic. They're straightforward problems hiding behind a layer of indirection. The interesting question isn't what broke; it's how fast you can find it.

The setup

The pipeline is straightforward. A Go worker registers a GreetingWorkflow on a task queue called greeting-queue. The workflow executes a single activity, returns a result, and completes. The worker runs as a Kubernetes Deployment in the same cluster as the Temporal server.

When everything works, the flow looks like this:

1. A workflow is started on the greeting-queue task queue

2. The worker picks it up within milliseconds

3. The activity executes and returns

4. The workflow completes

Temporal UI showing the greeting-queue task queue with 1 poller and a completed GreetingWorkflow

This is the healthy state. The worker is polling, workflows complete in under a second, and the dashboard updates in real time.

The silence

Now something changes. A new deployment rolls out. The dashboard goes stale. The engineer opens the Temporal UI and starts a workflow manually. It times out. The task is scheduled but never picked up.

Temporal UI showing a workflow that timed out, the task was scheduled but no worker picked it up

The task queue page still shows a poller. Temporal keeps reporting the last known worker for up to five minutes after it stops polling (the same window applies if you check with `temporal task-queue describe --task-queue greeting-queue`). Everything looks fine on the Temporal side. The server is working correctly: it scheduled the task and is patiently waiting for a worker to claim it.

This is the trap. Temporal is doing exactly what it should. The problem is elsewhere.

Digging into Kubernetes

The engineer switches to kubectl. The worker pod exists, but something is off:

$ kubectl get pods -n temporal -l app=greeting-worker
NAME                               READY   STATUS             RESTARTS      AGE
greeting-worker-5cd897fbcb-w5b58   0/1     CrashLoopBackOff   5 (30s ago)   3m

CrashLoopBackOff. The container is crashing on every start. Kubernetes keeps restarting it, but it fails every time. The backoff delay grows longer with each attempt (10s, 20s, 40s) and the worker never stays up long enough to register on the task queue.

The logs reveal the cause:

$ kubectl logs -l app=greeting-worker -n temporal
2026/04/07 08:15:42 Simulated startup crash (CRASH_ON_START=true)

An environment variable (CRASH_ON_START=true) was set in the latest deployment. The worker crashes immediately on startup, before it ever connects to Temporal. It never registers as a poller. From the Temporal side, the task queue is simply empty.

The fix is trivial once found: remove the bad environment variable (for example, `kubectl set env deployment/greeting-worker CRASH_ON_START-`), redeploy, and the worker comes back online. Workflows resume. The dashboard updates.

A crashlooping pod. That's it. Not a network partition, not a Temporal bug, not a resource limit, just a bad env var. The kind of thing any engineer would fix in seconds if they knew where to look.

But the investigation still took 10 minutes. Opening the Temporal UI, ruling out server issues, switching to kubectl, finding the right namespace, describing the pod, reading the logs. Each step requires context-switching between tools and mental models. For a trivial root cause, that's a lot of overhead.

What Annie sees

This is where it gets interesting. Not because the problem is hard, but because it doesn't have to feel hard.

With AnyShift connected to the cluster, the same incident plays out differently. The engineer notices stale data on the dashboard. Instead of jumping between tools, they ask Annie:

"Why is my pipeline broken?"

Annie already has the full picture. She has been watching the Kubernetes cluster continuously, tracking pod states, events, restarts, and resource usage. She correlates what she sees with the infrastructure graph.

AnyShift chat interface: Annie identifies the greeting-worker pod in CrashLoopBackOff and links it to the stalled Temporal task queue

Within seconds, Annie responds:

Root Cause: `greeting-worker` pods in CrashLoopBackOff

Your greeting-worker pods are crashing on startup. The container exits immediately with a fatal error related to a misconfigured environment variable. Since no healthy workers are running, no one is polling the greeting-queue task queue, so workflows are queuing up and timing out.

- greeting-worker-6f58c576df-tbhjf: crashing since 08:36 UTC

- greeting-worker-6f58c576df-2vvz2: crashing since 08:55 UTC

Recommended fix: check the environment variables on the greeting-worker Deployment. The latest rollout introduced a configuration change that is causing the container to exit before connecting to Temporal.

One question, one answer. No tab-switching. No guessing which namespace to check. No scrolling through kubectl describe output looking for the relevant line. Annie traced the symptom (stale data) back through the timed-out workflows, through the empty task queue, all the way to the crashing pods.

AnyShift resource graph showing the Temporal server, greeting-queue task queue, and greeting-worker pod. The pod node is highlighted in red indicating CrashLoopBackOff, with edges connecting it to the task queue and pending workflows

The takeaway

This was a simple incident. A crashlooping pod caused by a bad environment variable. Any engineer could have found it, and any engineer would have fixed it in seconds.

The point isn't that the problem was hard. It's that even simple problems take too long when the symptom and the cause live in different systems. Temporal tells you what is stuck. Kubernetes tells you why. Bridging the two means context-switching between UIs, CLIs, and mental models, all for a root cause that turns out to be one bad config line.

Annie eliminates that gap. She watches both sides continuously and connects symptoms to causes in seconds, so you can skip the investigation and go straight to the fix. The simpler the root cause, the more frustrating it is to spend 10 minutes hunting for it, and the more obvious the value of having someone who already knows the answer.


Want Annie investigating your infrastructure? Get started with AnyShift