TL;DR

Yubo's small SRE team supports 85 million users across 140 countries on a multi-region GCP and GKE stack producing 20 TB of logs per day. With Annie running investigations in parallel from Slack, peak-hour incidents now resolve in two messages while the on-call engineer stays on everything else. The harder case anchors the piece: a NATS backlog of over 15,000 queued messages where the Horizontal Pod Autoscaler held the wrong threshold and quietly never scaled out. Annie pinned the misconfigured rule from the graph and the team unblocked the workload before it became a page.

Two investigators, one incident

Two investigators, one incident: a yubo-api 500 spike at peak hour splits into two parallel investigation paths. Thomas checks recent yubo-api deploys, pulls app logs on the failing path, and reads MongoDB slow-query metrics. Annie correlates 500s across services, queries the graph for upstream changes, and checks NATS connection counts. Both paths converge on the root cause in two Slack messages and five minutes.

"I send her down one path while I take another."

Peak hours on Yubo sweep across the 140 countries the platform reaches, and they are when live video rooms see the most concurrent activity. They are also when a slow investigation is most expensive. The SRE team is small. Deeply experienced, but small. Pulling one engineer into a peak-hour investigation means the rest of production loses coverage.

Thomas Labarussias, Staff SRE/DevOps at Yubo, started by testing Annie against incidents he had already root-caused himself. Once she landed on the same answers, he shifted to running her in parallel.

The failure mode that hides

Yubo's NATS messaging layer accumulates a backlog of over 15,000 queued messages. The Horizontal Pod Autoscaler that should respond does not. Nothing has crashed. No service is throwing 500s. A graph that only watches application errors will not see this incident at all.

Diagnosing it takes three reads in one place: the HPA configuration, the actual queue depth, and the scaling decisions the controller did or did not make. Annie has all three in the same versioned graph. She analyzes the configuration against the observed backlog and pins the failure to the HPA rule that held the wrong threshold. The team acts on the analysis immediately.
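For illustration only, here is a minimal sketch of those three reads gathered in one script with the kubernetes Python client. The HPA name, namespace, and backlog source are placeholders rather than Yubo's actual setup, and the backlog count is passed in as if it had already been scraped from a NATS exporter.

    # Sketch of the three reads behind the diagnosis: HPA config, observed queue
    # depth, and the scaling decisions the controller did or did not make.
    # All names below are illustrative placeholders, not Yubo's real setup.
    from kubernetes import client, config


    def why_no_scale_out(hpa_name: str, namespace: str, observed_backlog: int) -> None:
        config.load_kube_config()
        autoscaling = client.AutoscalingV2Api()
        core = client.CoreV1Api()

        # Read 1: the HPA configuration -- which metric it watches and the
        # target it compares that metric against.
        hpa = autoscaling.read_namespaced_horizontal_pod_autoscaler(hpa_name, namespace)
        for metric in hpa.spec.metrics or []:
            if metric.type == "External":
                target = metric.external.target
                threshold = target.value or target.average_value
                print(f"HPA watches {metric.external.metric.name} with target {threshold}")

        # Read 2: the actual queue depth, taken here from the caller (for
        # example a pending-messages metric scraped from a NATS exporter).
        print(f"Observed NATS backlog: {observed_backlog} queued messages")

        # Read 3: what the controller decided. A threshold set above the real
        # backlog means desired stays equal to current and nothing scales out.
        print(f"Replicas: current={hpa.status.current_replicas} "
              f"desired={hpa.status.desired_replicas} max={hpa.spec.max_replicas}")
        events = core.list_namespaced_event(
            namespace,
            field_selector=(f"involvedObject.name={hpa_name},"
                            "involvedObject.kind=HorizontalPodAutoscaler"),
        )
        for ev in events.items:
            print(f"{ev.last_timestamp} {ev.reason}: {ev.message}")


    if __name__ == "__main__":
        # The 15,726 figure is the backlog from the incident above; the rest is invented.
        why_no_scale_out("nats-consumer-hpa", "messaging", observed_backlog=15_726)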

NATS backlog grew, HPA didn't react. Yubo's second documented incident: the HPA replica count stays flat as a dashed line while the NATS queued-message count rises along a curve to 15,726 messages. Annie's note: she read the HPA config against the backlog and pinned the threshold rule that blocked scale-out.

A separate incident, an API 500 spike during peak hours, follows the same shape with louder symptoms. The engineer on call tags Annie. She correlates the error pattern across services in the same time bucket, distinguishes the root failure from the cascade, and reports the answer. Two messages. Five minutes. The same investigation, manually, would have pulled the on-call engineer off everything else for the rest of the window.
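As a rough sketch of what that time-bucketed correlation looks like, and not a claim about Annie's actual implementation: bucket each service's 5xx counts by minute, find the first bucket where each service breaches a baseline, and treat the earliest breach as the candidate root. The service names and counts below are invented.

    # Sketch: separate the root failure from the cascade by checking which
    # service's 5xx count breached a baseline first within shared minute buckets.
    # Service names and counts are invented for illustration.
    from datetime import datetime, timedelta

    window_start = datetime(2025, 6, 1, 20, 0)       # a peak-hour window
    buckets = [window_start + timedelta(minutes=i) for i in range(6)]

    errors_per_minute = {
        "yubo-api":    [2, 3, 40, 180, 220, 210],    # loudest, but spikes late
        "profile-svc": [1, 35, 150, 160, 155, 150],  # breaches first -> candidate root
        "feed-svc":    [0, 1, 2, 90, 110, 100],      # downstream cascade
    }

    def first_breach(series, baseline=10):
        """Index of the first bucket where the count exceeds a simple static baseline."""
        return next((i for i, count in enumerate(series) if count > baseline), None)

    breaches = {svc: first_breach(series) for svc, series in errors_per_minute.items()}
    root = min((svc for svc, idx in breaches.items() if idx is not None),
               key=lambda svc: breaches[svc])

    for svc, idx in sorted(breaches.items(), key=lambda kv: (kv[1] is None, kv[1] or 0)):
        when = buckets[idx].strftime("%H:%M") if idx is not None else "never"
        print(f"{svc:12s} first breach: {when}")
    print(f"Likely root (earliest breach): {root}")

A static baseline is the simplest possible stand-in here; the point is only that ordering breaches within the same time buckets is what separates the root failure from the services that merely inherit it.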

"I send her down one path while I take another"

"Before Annie, an incident during peak hours pulled one of us off everything else until it was resolved. Now it is two messages in Slack and I am back to building. That changes what a small team can do."

Thomas Labarussias, Staff SRE/DevOps at Yubo

The graph holds the cross-system context. Annie holds a parallel investigation thread on it. Yubo's SRE team is small enough that one hour-long investigation pulls a meaningful share of its capacity off everything else. Running a second investigation on the side, in Slack, on the same source of truth, roughly doubles the team's effective coverage during the moments that matter.

Where this is going

28 engineers now have access to the channel where Annie operates. Yubo is expanding the scope into proactive reporting on the same graph (so investigations start before the page fires), GKE node pool upgrade tracking (so the upgrade-time blast radius is known in advance), and VictoriaLogs integration (so the same correlation pattern works on log-based incidents).

Where Yubo is taking this. A three-step roadmap: Today, incident response with Annie in Slack and 28 engineers in the channel. Next, proactive reporting with investigations starting before pages fire. Then, stack expansion covering GKE node pool upgrade tracking and VictoriaLogs.