How BeReal's SRE team triages a panic in code it didn't write

TL;DR

BeReal is a social app with 40+ million monthly active users, built around synchronized posting moments: users get a prompt, rush into the camera, and unlock the feed together. For a lean SRE team on GCP, that means incidents can surface during bursts, often in services the on-call engineer does not own. With Anyshift, the BeReal team can route a Go panic to the right domain owner in about 30 seconds. Anyshift does it by reading the stack trace against a maintained graph of BeReal's cluster, including pod history that would be expensive to reconstruct through live API calls.

Signal	BeReal reality	Anyshift payoff
Triage speed	Go panic in an unfamiliar service	Right owner in about 30 seconds
Ownership	One shared alert channel across domains	Crash becomes a domain routing decision
Context	GCP and Kubernetes relationships span services	Stack trace read against a maintained infrastructure graph
Pod scale	20-50k pod rotations over two days	Pod history stays queryable without a live API storm

From panic to routing decision

Diagram: Anyshift reads a Go panic against BeReal's maintained cluster graph and returns the likely domain owner

"A panic shows up with a huge trace, and I don't have the business context or the technical context. [Anyshift] tells me: you've got a cache miss in domain X. Thirty seconds, maybe a minute."
Thomas Lorreyte, SRE at BeReal

From there, the path is obvious: domain X has an owner. Thomas routes the panic there and moves on. The stack trace stops being a wall and becomes a routing decision. The detail that matters most is simple: on these crashes, Anyshift didn't miss. The trace-to-cause signal is clean enough that Anyshift's read holds up. Thomas called it the quintessential use case.
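To make "a stack trace becomes a routing decision" concrete, here is a minimal sketch in Python. It is not Anyshift's implementation: Anyshift reads the trace against its maintained graph, while this sketch stands in a static ownership map for that graph. The package paths, domain names, and the `OWNERS` table are all hypothetical. It walks the frames of a Go panic trace, skips runtime frames, and returns the owner of the first application package it recognizes.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical ownership map (a stand-in for a real graph):
# Go package prefix -> owning domain.
OWNERS = {
    "github.com/example/feed": "feed-domain",
    "github.com/example/feed/cache": "cache-domain",
}

# Hypothetical panic output in the usual Go layout: a symbol line,
# then a tab-indented file:line row.
SAMPLE_TRACE = "\n".join([
    "panic: assignment to entry in nil map",
    "",
    "goroutine 1 [running]:",
    "panic({0x4a1b20, 0x4d5c88})",
    "\t/usr/local/go/src/runtime/panic.go:884 +0x213",
    "github.com/example/feed/cache.(*Store).Put(0xc000010000, 0x1)",
    "\t/app/feed/cache/store.go:42 +0x1d",
    "github.com/example/feed/handler.Serve(0xc00001c0a8)",
    "\t/app/feed/handler/serve.go:17 +0x5a",
    "main.main()",
    "\t/app/main.go:10 +0x25",
])


def package_of(symbol: str) -> str:
    """Return the import path from a symbol such as 'a/b/pkg.(*T).Method'."""
    head, sep, tail = symbol.rpartition("/")
    return head + sep + tail.split(".", 1)[0]


def frame_packages(trace: str) -> List[str]:
    """List the package of each non-runtime frame, top of the stack first."""
    lines = trace.splitlines()
    packages = []
    for current, following in zip(lines, lines[1:]):
        # A frame is a symbol line followed by a tab-indented file:line row.
        if current.startswith(("\t", "created by ")) or not following.startswith("\t"):
            continue
        package = package_of(current.split("(", 1)[0])
        if package in ("panic", "runtime") or package.startswith("runtime/"):
            continue  # Go runtime frames say nothing about ownership
        packages.append(package)
    return packages


def owner_of(package: str, owners: Dict[str, str]) -> Optional[str]:
    """Longest-prefix match of a package path against the ownership map."""
    matches = [p for p in owners if package == p or package.startswith(p + "/")]
    return owners[max(matches, key=len)] if matches else None


def route(trace: str, owners: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (package, owner) for the first frame that has a known owner."""
    for package in frame_packages(trace):
        owner = owner_of(package, owners)
        if owner is not None:
            return package, owner
    return None


print(route(SAMPLE_TRACE, OWNERS))
```

The longest-prefix match is the design choice worth noting: a nested package such as the cache can belong to a different domain than the service around it, so the most specific entry wins.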

"Especially for someone like me, who's in it but not too deep, who watches the alerts but isn't the owner of what's happening."

Why live-API agents miss cross-service context

Calling this "AI explains your stack traces" undersells the hard part. Plenty of tools can explain a trace after the fact. Fast routing comes from the graph: Anyshift already knew how the crashing service related to everything around it before Thomas asked.

Wire a general-purpose agent to a few live connections, one on AWS, one on GCP, maybe Datadog, and it works up to a point. Ask a narrow question, it queries the API, it answers. The moment the question spans services, the moment it needs to know how the thing in one project relates to the thing in another, it runs out of room.

"A classic agent with a live connection to your AWS, your GCP, whatever, in a single context it's never going to assemble the whole graph. It'll do a few correlations. But your context explodes, it gets expensive, and it's never precise. [Anyshift] already has the entire graph of your cluster. It's just on another level."

A live-API agent re-queries the world every time you ask, and pays for it in tokens, latency, and precision. Anyshift maintains the graph instead, so it's already correlated and standing when a panic hits the channel, queryable on the spot. The speed comes from that standing context. There's no lookup to wait on in the first place.

Where Anyshift's graph earns its keep: pods at scale

The sharpest proof came from a pain BeReal already lives with. They turned off ArgoCD's pod-level checks, because at their scale running them continuously was simply too much.

That's the problem in one sentence. The most operationally interesting layer of a cluster is also the one most hostile to repeated live querying, so the tools that depend on live querying back away from it. We asked Thomas whether Anyshift would hit the same wall: at BeReal's traffic, could its scanning hammer the API hard enough to hurt?

It depends on what you scan, he said. Buckets, services, and deployments are stable object types, and querying them on the fly is fine. "Scanning buckets is fine, a hundred at most, worst case." Pods are a different animal. Over two days you can see twenty, forty, fifty thousand pod rotations. Ask a live API for historical pods, including terminated ones, and you're chasing tens of thousands of JSON objects every time. "With the pods it'd cough up a bit of blood."

Chart: a live-API agent re-querying the kube API server for historical pods climbs toward fifty thousand JSON objects over two days and throttles, while Anyshift reads the same pod history from the maintained graph at a flat cost and fetches live detail on a single pod only when asked

This is exactly what a maintained graph is for. Anyshift already holds the pod information in the graph, correlated and queryable at any moment, with no fresh API storm to pay for. When you need the last mile, the live detail on a specific pod, it fetches that on demand, on top of the graph rather than instead of it. You get the standing picture without new API load and the deep detail when you ask.
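The general pattern can be sketched in a few lines of Python. This is an illustration of the idea under our own assumptions, not a description of Anyshift's internals: the `PodHistory` class, the pod names, and the integer timestamps are hypothetical. The only Kubernetes detail it leans on is that watch streams report `ADDED`, `MODIFIED`, and `DELETED` events. The store records each event once and keeps terminated pods, so historical questions are answered locally instead of by re-listing from the API server.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PodRecord:
    name: str
    workload: str
    started_at: int
    ended_at: Optional[int] = None  # None while the pod is still running


class PodHistory:
    """Hypothetical in-memory pod index, fed by watch-style events."""

    def __init__(self) -> None:
        self._pods: Dict[str, PodRecord] = {}

    def apply(self, event_type: str, name: str, workload: str, timestamp: int) -> None:
        if event_type == "ADDED":
            self._pods.setdefault(name, PodRecord(name, workload, timestamp))
        elif event_type == "DELETED" and name in self._pods:
            # Keep the record and mark it ended, so history survives deletion.
            self._pods[name].ended_at = timestamp
        # MODIFIED events are ignored in this sketch.

    def terminated(self, workload: str) -> List[str]:
        """Names of pods of a workload that have already ended."""
        return sorted(
            p.name for p in self._pods.values()
            if p.workload == workload and p.ended_at is not None
        )

    def running_at(self, workload: str, timestamp: int) -> List[str]:
        """Names of pods of a workload that were running at a given time."""
        return sorted(
            p.name for p in self._pods.values()
            if p.workload == workload
            and p.started_at <= timestamp
            and (p.ended_at is None or timestamp < p.ended_at)
        )


history = PodHistory()
for event in [
    ("ADDED", "feed-api-abc", "feed-api", 100),
    ("ADDED", "feed-api-def", "feed-api", 160),
    ("DELETED", "feed-api-abc", "feed-api", 200),
]:
    history.apply(*event)

print(history.terminated("feed-api"))
print(history.running_at("feed-api", 150))
```

Each rotation costs one write when the event arrives. Questions about which pods existed at a given moment then cost no API calls, however many rotations have accumulated.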

Thomas landed on the logic himself. Re-querying the kube API server every time means throttling it, forcing more pods, and losing the history the moment etcd compacts anyway.

"You maintain your graph, and your graph is queryable at any moment. That's where it's above everything else."

Beyond crashes: BeReal alert fatigue

BeReal's shared alert channel creates another graph problem: alert fatigue. Some alarms are obvious if you know the stack. Old or cross-service alerts are not. When Anyshift marks an alert as a false positive and explains why, the engineer still checks it, but starts from a concrete signal instead of a blank page.

The same graph helps with audits. Anyshift maps what is actually running across GCP projects: services in use or abandoned, service accounts bound to workloads, and resources nobody has touched in the audit logs.

"For building it worked well, for security it worked well, for buckets it worked well. We looked at who was accessing which bucket with which service account. That worked. With a classic agent, as far as I know, you just can't do this, you don't have enough connections, you don't have enough context."

The value sits next to security: infrastructure reporting with a point of view, answering cross-service questions that disappear when a tool queries one resource at a time.
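A cross-service question like "who accesses which bucket with which service account" is a join across relationships, which is why it is hard to answer one resource at a time. The Python sketch below shows the shape of that join over a toy edge list. The node names, the relation labels (`runs_as`, `reads`, `writes`), and the `bucket_access` helper are all hypothetical, and real data would come from IAM bindings and audit logs rather than a hard-coded list.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (source, relation, target)

# Hypothetical graph edges for illustration only.
EDGES: List[Edge] = [
    ("workload:feed-api", "runs_as", "sa:feed-reader"),
    ("workload:thumbnailer", "runs_as", "sa:media-writer"),
    ("sa:feed-reader", "reads", "bucket:media"),
    ("sa:media-writer", "writes", "bucket:media"),
    ("sa:legacy-export", "reads", "bucket:archive"),
]


def bucket_access(edges: List[Edge]) -> List[Tuple[str, str, str, str]]:
    """Return sorted (bucket, service account, access, workload) rows."""
    workloads_by_sa: Dict[str, List[str]] = defaultdict(list)
    for source, relation, target in edges:
        if relation == "runs_as":
            workloads_by_sa[target].append(source)

    rows = []
    for source, relation, target in edges:
        if relation in ("reads", "writes") and target.startswith("bucket:"):
            # A service account with access but no workload is worth flagging.
            for workload in workloads_by_sa.get(source, ["(no workload bound)"]):
                rows.append((target, source, relation, workload))
    return sorted(rows)


for row in bucket_access(EDGES):
    print(row)
```

In this toy data, the service account that reads the archive bucket has no workload bound to it, which is the kind of finding an audit is looking for.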

How the BeReal team actually uses Anyshift

At BeReal, Thomas's entry point is Slack and scheduled reports. He never used Anyshift conversationally, never replied to it in the alert channel, and he had a clear reason.

"You're already deep in your own investigation. It's like two people working in parallel. I work on my side, I see [Anyshift] produced something, I've got my explanation, [Anyshift's] matches, it completes the picture. I'm not going to go back and forth with it."

That's one good way to use Anyshift, not the only one. Our customers don't agree on the surface. A few live entirely in Slack; the rest are split between the CLI and the web app, where they can pull artifacts, PDFs, diagrams. Thomas's Slack-and-reports habit sits at one end of that range, and we haven't found the surface everyone wants yet. Thomas would tell you the surface was never the point for him anyway.