Insights & Engineering

The Anyshift Blog.

Deep dives into Site Reliability Engineering, AI in production, and scaling infrastructure gracefully. Written by the team building the future of SRE.

Featured Article

AI in Production Series

How Much Context Does an AI Agent Really Need?

Unified production context makes AI agents cheaper, faster, and easier to scale. In our live benchmark, resolving relationships before inference cut model-token usage by 83.5%, while reducing response time and estimated cost without changing the task outcome.

Roxane Fischer

Jul 13, 2026 · 7 min read

Latest Articles

Product Series

The Anyshift Graph API

Every question your team asks about production now has an API. Rank a flapping monitor in 0.4 seconds, trace a bad deploy to its failures, and give your agents ground truth instead of another dashboard.

Roxane Fischer

Jul 15, 2026 · 4 min read

Ecosystem Series

Anyshift meets Harness: production-aware approvals

Harness runs the production release pipeline and manual approval gate for the checkout-api deployment. Anyshift adds the production impact: affected services, owners, recent changes, and the review decision before the approval waits for a human.

Stephane Jourdan

Jun 29, 2026 · 4 min read

Changelog Series

What shipped at Anyshift

A new VictoriaMetrics integration for Annie, a self-verification step that double-checks her findings, and one-click pull request fixes inside reports.

Anyshift

Jun 22, 2026

Ecosystem Series

Anyshift meets Dash0: write production-change context into service timelines

Dash0 shows the telemetry. Anyshift writes the upstream production-change context into Dash0 as OpenTelemetry service events for every affected service.

Stephane Jourdan

Jun 19, 2026 · 3 min read

Ecosystem Series

Anyshift meets Redis: Production context before AI agents act

Anyshift adds live production context to AI agents using Redis, then writes the enriched context back to Redis before an AI agent acts.

Stephane Jourdan

Jun 19, 2026 · 4 min read

Ecosystem Series

Anyshift meets ServiceNow: production context for incident workflows

ServiceNow manages incidents, changes, approvals, and tasks. Anyshift adds production context so those workflows can include cause, blast radius, and owner review.

Stephane Jourdan

Jun 18, 2026 · 3 min read

Ecosystem Series

Anyshift meets Postman: production-impact API checks before release gates run

Postman runs API collections with environments. Anyshift adds the live production context that tells those workflows which API paths, consumers, owners, and monitors actually matter before a release gate runs.

Stephane Jourdan

Jun 15, 2026 · 3 min read

Ecosystem Series

Prove Which Snyk Container CVEs Are Running in Production

Snyk Container identifies vulnerable image digests. Anyshift joins each digest to Kubernetes runtime, exposure, rollout history, owner, and remediation window.

Roxane Fischer

Jun 15, 2026 · 3 min read

Changelog Series

What shipped at Anyshift

Annie's root cause analyses now cite only verifiable evidence: logs, traces, metrics, and live infrastructure state, never knowledge-base articles or memory.

Anyshift

Jun 15, 2026

Product Series

Break our demo infrastructure on purpose and watch the root cause surface

Sever the link between a service and its database in our new Playground, and a change event lands. Seconds later the root cause comes back, traced against the topology graph instead of a wall of logs. A hands-on way to see what change-first root-cause analysis does, no signup.

Louis Fradin

Jun 12, 2026 · 2 min read

Ecosystem Series

Anyshift meets CrowdStrike: safer threat response with production context

CrowdStrike Falcon helps security teams decide what to do with suspicious domains, IPs, and files. Anyshift shows which services, owners, dependencies, and recent deploys are behind the signal before analysts detect, block, or escalate it.

Stephane Jourdan

Jun 10, 2026 · 4 min read

Ecosystem Series

Approve Confluent schema changes with production impact

Confluent validates and registers Kafka schema changes. Anyshift adds the production impact: affected services, owners, monitors, and skipped non-production paths.

Stephane Jourdan

Jun 10, 2026 · 4 min read

AI in Production Series

AI Context for Prod, Optimized by AIs in Prod

Annie (our AI SRE agent) had institutional memory from ACE, the agentic context-engineering loop that curates cheatsheets from past runs. It worked, but clients kept catching her trusting stale entries or missing answers buried in her own bloated context. So we added five things on top: (1) a fixed set of memory items always presented to the agent, (2) per-query retrieval over the rest of the memory store, (3) an agent-optimized index of that store, (4) the ability for the agent to query the store mid-run, and (5) tried-and-true memory freshness mechanisms. Production context, now optimized by the AI using it. Here's the reasoning and what a few weeks in production say.

Ghazi Felhi

Jun 10, 2026 · 10 min read

Ecosystem Series

Anyshift meets Coralogix: turning telemetry into reviewed production handoffs

Coralogix is where SREs investigate telemetry. Anyshift adds the production graph around a signal: affected service, owner, recent deploy, dependency evidence, and skip reasons, then writes the reviewed handoff into a Coralogix Custom Dashboard.

Stephane Jourdan

Jun 9, 2026 · 4 min read

AI in Production Series

How we turned on-call judgment into skills an AI agent can load

An AI agent in the incident channel can run kubectl and read a dashboard. What it can't do is judge whether the last deploy is the suspect or a red herring. We open-sourced the SRE skills that encode that judgment, runnable offline against fixtures with no credentials.

Louis Fradin

Jun 9, 2026 · 4 min read

Ecosystem Series

Anyshift meets MongoDB Atlas: production-aware alert settings

MongoDB Atlas can alert when a cluster nears its connection limit. Anyshift adds the pre-enable review: affected services, owners, monitors, recent changes, and non-production exclusions before paging starts.

Stephane Jourdan

Jun 9, 2026 · 4 min read

Ecosystem Series

Anyshift meets Snowflake: production context before agents act

Snowflake is where teams govern data, workloads, and AI workflows. Anyshift adds the live production graph those workflows need before they apply a fix, rerun a task, refresh a dynamic table, or trigger an agentic workflow.

Stephane Jourdan

Jun 8, 2026 · 4 min read

Ecosystem Series

Anyshift meets Databricks: checking production impact before a data pipeline rerun

Databricks gives teams the governed data and AI surface. Anyshift adds the live production context a Databricks workflow needs before it patches a data pipeline, reruns a backfill, or calls an agent tool.

Stephane Jourdan

Jun 8, 2026 · 4 min read

Ecosystem Series

Anyshift meets GitLab: production impact before merge

GitLab shows reviewers the diff, pipelines, and approvals. Anyshift adds the missing production layer: which live services use the changed code, who owns them, what can be skipped, and who should review before merge.

Stephane Jourdan

Jun 8, 2026 · 4 min read

Ecosystem Series

Anyshift meets Okta: production reachability for access changes

Okta is where teams manage identity, access, and policy. Anyshift adds production reachability enrichment around an access change: which services, cloud roles, Kubernetes workloads, monitors, and owners sit behind the group before Okta performs the assignment.

Stephane Jourdan

Jun 8, 2026 · 3 min read

Changelog Series

What shipped at Anyshift

Annie now opens pull requests across multiple repos in one request, instead of stopping at the first repository she inspects.

Anyshift

Jun 8, 2026

Ecosystem Series

Anyshift meets Elastic: debug with PR context already attached

Elastic gives teams the place to search, triage, and open Cases (Kibana investigation tickets) when an incident starts. For a PR that changes a shared authentication module, Anyshift adds what Elastic cannot infer from the PR alone: which production services depend on it, who owns them, identity hints, and evidence. So when a human or agent starts debugging, the context is already attached.

Stephane Jourdan

Jun 4, 2026 · 3 min read

Testimonials Series

How BeReal's SRE team triages a panic in code it didn't write

BeReal's synchronized posting ritual creates sharp traffic waves, not a smooth feed. With Anyshift, the team can route Go panics from unfamiliar services to the right owner in about 30 seconds because Anyshift reads each crash against a maintained infrastructure graph.

Anyshift

Jun 3, 2026

AI in Production Series

Private Equity's AI Transition Problem in 2026

Private equity firms are pushing AI across portfolios, but the hard part is no longer experimentation. It is turning sponsor-level AI pressure into governed, production-aware operating workflows that actually move EBITDA.

Roxane Fischer

Jun 2, 2026 · 8 min read

Ecosystem Series

Anyshift meets Splunk: reduce maintenance alert fatigue

Planned maintenance often creates alert noise. Anyshift finds the Splunk alerts affected by a change, pauses only those saved searches, and turns them back on when the window ends. Teams keep real alerts visible while expected noise stays out of the way.

Louis Fradin

Jun 1, 2026 · 3 min read

Changelog Series

What shipped at Anyshift

Annie can now open a real GitHub pull request from chat with the new Fix - Open PR button, plus a reworked memory layer for larger investigations.

Anyshift

Jun 1, 2026

Ecosystem Series

Anyshift meets Dynatrace: graph context for every deploy

A deployment event should carry the service, owner, and monitored entity it actually changed. Anyshift adds that production context to Dynatrace so on-call teams do not rebuild it from CI and infrastructure tabs.

Stephane Jourdan

May 28, 2026 · 3 min read

Ecosystem Series

Anyshift meets New Relic: track shared changes across services

Anyshift finds the production services affected by a shared change, then prepares a reviewed New Relic Change Tracking event for each service's timeline.

Stephane Jourdan

May 27, 2026 · 4 min read

Ecosystem Series

Anyshift meets Sentry: releases that follow impact

Sentry is where teams debug regressions. Anyshift makes sure the release context reaches every affected project, including downstream services that did not deploy.

Louis Fradin

May 27, 2026 · 3 min read

Ecosystem Series

Anyshift meets acli: PR impact, routed into Jira

A shared-code PR should not surprise downstream teams after merge. Anyshift finds the running services and owners affected by the change, then routes the advisory work into Jira before the review is over.

Stephane Jourdan

May 26, 2026 · 4 min read

Changelog Series

What shipped at Anyshift

Annie gets report personas, per-conversation effort levels, OpsGenie alerts, Sentry/Notion link auto-resolve, and a clearer left sidebar around Root Cause Analysis, Proactive, and Custom Reports.

Anyshift

May 25, 2026

Ecosystem Series

Anyshift meets pup: turning intent into audited Datadog runbooks

Datadog pup can mute monitors during maintenance, but teams still have to know which downstream services will be noisy. Anyshift CLI maps the affected services from production context, then prepares the Datadog downtime runbook with an audit trail.

Stephane Jourdan

May 22, 2026 · 4 min read

Ecosystem Series

Anyshift meets gcx: cloning Grafana observability with audited runbooks

Grafana shows the service you instrumented, but downstream services often miss the same dashboards and SLOs. Anyshift maps the dependency graph, finds the coverage gaps, and prepares the Grafana resources for gcx to apply after review.

Stephane Jourdan

May 21, 2026 · 4 min read

Changelog Series

What shipped at Anyshift

Linear and Notion join Annie's knowledge sources, annie-cli gains access-token authentication for headless CI use, and Annie won't recommend silencing alerts.

Anyshift

May 18, 2026

Infrastructure as Code Series

How to Detect Terraform Drift Across Multi-Cloud

A development RDS instance had its publicly_accessible flag flipped on a Friday afternoon. The team's drift-detection cadence was once per weekday, so 60+ hours passed before anyone caught it. Walkthrough of the audit-log subscription architecture that would have caught it in two minutes across AWS, GCP, and Azure, with every config block paste-able into your own account.

Louis Fradin

May 15, 2026 · 9 min read

Production Debugging Series

How to Trace a Production Incident Back to the Commit

Burned 25 minutes on a Friday-morning page before I realized the responsible commit was in another team's repo. This is the four-command sequence I now run when an alert lands and `git log` on my own service comes up empty, with the outputs at each step and where the search space gets cut.

Louis Fradin

May 15, 2026 · 8 min read

Product Series

Annie reads Linear now

Forty minutes paging through Linear to confirm a returning customer report was the same bug we'd half-shipped a fix for in February. The Linear integration went GA May 13, and Annie pulled both tickets, the linked PR, and the stalled action in twenty-three seconds.

Louis Fradin

May 13, 2026 · 2 min read

Product Series

Annie searches Notion now

Ten minutes to find a post-mortem already sitting in Notion. The Notion integration shipped May 12, and Annie picked the same page in eighteen seconds, root cause and open action items tagged.

Louis Fradin

May 12, 2026 · 2 min read

Changelog Series

What shipped at Anyshift

Datadog gains 50+ Bits AI capabilities; PagerDuty + Incident.io + Sentry join as sources; k8s-agent v0.3.2 brings on-demand graph reconciliation.

Anyshift

May 11, 2026

Product Series

How we now know which commit broke each Sentry error

Five Sentry tickets in one worker turned out to be one bug. The most recent error came from the very PR that had wired Sentry forwarding in. How a stack frame now leads to the offending commit, the deploy behind it, and the team that owns the failing path.

Louis Fradin

May 11, 2026 · 2 min read

Testimonials Series

How Yubo's SRE team runs parallel investigations during peak hours

Yubo's small SRE team supports 85M users across 140 countries on GCP and GKE, producing 20 TB of logs per day. With Annie running parallel investigations in Slack, peak-hour incidents now resolve in two messages.

Anyshift

May 6, 2026

Product Series

Anyshift is now available on AWS Marketplace

Anyshift is now on AWS Marketplace. Buy through your AWS account, bill against your AWS spend, and skip the standalone InfoSec review.

Anyshift

May 4, 2026

Changelog Series

What shipped at Anyshift

6 product areas shipped: Slack reports, MCP and CLI tools to drive Annie from your terminal, smarter automation rules, tighter AWS onboarding.

Anyshift

May 4, 2026

Events & Talks

Context Engineering for DevOps interactive conference

How AI agents learn your infrastructure. A walkthrough of the DevOpsCon Amsterdam 2026 talk: the gap between LLMs and production, structural context as a versioned graph, and ACE for self-improvement without labels.

Louis Fradin

Apr 23, 2026

Product Series

Report Templates: pre-built investigations, one click

Every Monday, the pod-stability review gets rebuilt from scratch. Same dashboards, same correlation work, same write-up. Two hours, gone. Report Templates turn the recurring investigations that platform and SRE teams run by hand into one click.

Louis Fradin

Apr 15, 2026 · 2 min read

Production Debugging Series

My Workers Stopped Polling: a K8s + Temporal Whodunit

Temporal workflows stuck in Running with zero pollers, and Temporal still reports a healthy task queue. The root cause lives one layer down: a CrashLoopBackOff in the Kubernetes worker pod, caused by a single bad environment variable. A walkthrough of debugging Temporal workers on Kubernetes the manual way (10 minutes), then with an infrastructure context layer that bridges the two systems (seconds).

Louis Fradin

Apr 8, 2026 · 6 min read

Product Series

Annie CLI

136 CloudWatch alarms vanish overnight. Annie cross-references Slack, the audit trail, and your infra graph in one query. Now it runs in your terminal.

Stephane Jourdan

Mar 16, 2026 · 3 min read

AI in Production Series

Top 10 AI SRE Tools in 2026: A Comprehensive Comparison

The 10 best AI SRE tools in 2026 compared by architecture, root cause analysis, remediation, and change awareness, from Anyshift's versioned graph to Resolve AI's autonomous agents.

Roxane Fischer

Mar 13, 2026 · 15 min read

AI in Production Series

Agentic Context Engineering in Production: How AI Agents Build Institutional Expertise

AI agents start every run from scratch. ACE (Agentic Context Engineering) gives them institutional memory that evolves through use, cutting root cause analysis time by 30%.

Ghazi Felhi

Mar 11, 2026 · 8 min read

Anyshift Engineering Series

Building a Temporal Infrastructure Knowledge Graph: A Year of Working with Neo4j at Scale

How Anyshift chose Neo4j for building a temporal infrastructure knowledge graph and lessons learned over a year of production use.

Stephane Jourdan

Feb 17, 2026 · 14 min read

AI in Production Series

Why AI-SRE Needs Topology, Not Just Telemetry

The limits of telemetry-only AI approaches to SRE and why topology is the missing piece.

Roxane Fischer

Jan 27, 2026 · 8 min read

Product Series

Anyshift Product Demo - Voodoo Testimonial

Watch the Anyshift product demo featuring a testimonial from Voodoo.

Anyshift

Nov 26, 2025

Product Series

Anyshift Interview - The AI That Repairs Breakdowns in 5 Minutes

An interview about the AI that repairs infrastructure breakdowns in 5 minutes.

Anyshift

Nov 26, 2025

Product Series

Anyshift Interview - How Silicon Valley Builds AI Products

An interview with Roxane Fischer on how Silicon Valley builds AI products.

Anyshift

Nov 26, 2025

Product Series

Anyshift Product Demo - Citrix Testimonial

Watch the Anyshift product demo featuring a testimonial from Citrix.

Anyshift

Oct 28, 2025

AWS Reference Series

Monitoring services: A Deep Dive in AWS Resources & Best Practices to Adopt

Master AWS monitoring with CloudWatch, CloudTrail, AWS Config, and X-Ray for comprehensive observability.

Mattias Fjellstrom

Oct 15, 2025 · 17 min read

AWS Reference Series

Database Services: A Deep Dive in AWS Resources & Best Practices to Adopt

A comprehensive guide to AWS database services including RDS, Aurora, DynamoDB, and ElastiCache.

Mattias Fjellstrom

Sep 3, 2025 · 19 min read

AWS Reference Series

AWS Lambda: A Deep Dive in AWS Resources & Best Practices to Adopt

Deep dive into AWS Lambda covering function configuration, IAM, VPC integration, and monitoring best practices.

Mattias Fjellstrom

Jun 23, 2025 · 16 min read

AWS Reference Series

S3 Buckets: A Deep Dive in AWS Resources & Best Practices to Adopt

Everything you need to know about Amazon S3 configuration, access control, encryption, and lifecycle management.

Mattias Fjellstrom

Jun 6, 2025 · 18 min read

AWS Reference Series

Secrets Manager: A Deep Dive in AWS Resources & Best Practices to Adopt

A guide to AWS Secrets Manager covering secret storage, rotation, access policies, and integration patterns.

Mattias Fjellstrom

May 7, 2025 · 12 min read

AWS Reference Series

Network Firewall: A Deep Dive in AWS Resources & Best Practices to Adopt

Learn about AWS Network Firewall architecture, rule groups, policies, and logging configurations.

Mattias Fjellstrom

Apr 7, 2025 · 17 min read

Tutorial Series

Terraform Versioning Explained

A video tutorial by Ned Bellavance explaining Terraform versioning best practices.

Ned Bellavance

Mar 24, 2025

AWS Reference Series

VPC networking: A Deep Dive in AWS Resources & Best Practices to Adopt

Master AWS VPC networking fundamentals including subnets, route tables, gateways, and peering configurations.

Mattias Fjellstrom

Mar 7, 2025 · 21 min read

Events & Talks

Anyshift at Civo Navigate SF: Key Takeaways

Key takeaways from the Civo Navigate SF conference.

Roxane Fischer

Mar 4, 2025 · 3 min read

AWS Reference Series

DNS: A Deep Dive in AWS Resources & Best Practices to Adopt

Explore Route 53 and DNS management in AWS, including hosted zones, record types, routing policies, and health checks.

Mattias Fjellstrom

Feb 25, 2025 · 16 min read

AWS Reference Series

Identity and Access Management (IAM): A Deep Dive in AWS Resources & Best Practices to Adopt

A comprehensive guide to AWS IAM covering users, groups, roles, policies, and best practices for secure access management.

Mattias Fjellstrom

Feb 3, 2025 · 18 min read

Events & Talks

Anyshift at DevOpsDays Dallas: 5 Key Reasons You're Struggling to Debug Your Infrastructure in Under an Hour

A talk recording from DevOpsDays Dallas about the challenges of debugging infrastructure quickly.

Roxane Fischer

Dec 12, 2024

Events & Talks

[Featured in Tessl] DevOps with AI: Identifying the impact zone, with Roxane Fischer

A featured interview on Tessl about DevOps with AI and identifying the impact zone.

Roxane Fischer

Dec 12, 2024

AI in Production Series

Navigating AI in your Infrastructure: Dos, Don'ts, and Why It Matters

GenAI is everywhere. But very often, the cool and exciting demos don't work the same way in production.

Roxane Fischer

Oct 15, 2024 · 7 min read

Production Debugging Series

Common Weak Points in Infrastructure Management: An In-Depth Guide

Managing infrastructure at scale is a complex endeavor that demands meticulous planning, robust tooling, and continuous adaptation.

Roxane Fischer

Sep 19, 2024 · 4 min read

Product Series

Anyshift.io at Awesome AI Dev Tools September

Anyshift's presentation at the Awesome AI Dev Tools event in September 2024.

Roxane Fischer

Sep 1, 2024

Production Debugging Series

5 Key Reasons You're Struggling to Debug Your Infrastructure in Under an Hour

Most infrastructure debugging sessions blow past the one-hour mark for the same five structural reasons: scattered visibility across cloud accounts, missing historical state, terraform plan output that hides downstream impact, runbooks that lag the live infrastructure, and post-merger environments that no one has fully mapped. A walkthrough of each, with concrete examples and what reduces the time.

Roxane Fischer

Jul 30, 2024 · 4 min read

Production Debugging Series

Top 3 Weak Points in Your Infrastructure and how to mitigate them

Three structural patterns recur in growing infrastructure orgs: single-repo bottlenecks where dozens of teams share one approval queue, ClickOps and dead IaC code that drift outside any state file, and module version fragmentation that quietly bypasses security patches. A walkthrough of each, with the practices that contain the blast radius.

Roxane Fischer

Jul 30, 2024 · 3 min read

Article Series

Deep Dives

Beginner

Exploring the intersection of AI and DevOps, covering best practices, insights, and practical applications for modern infrastructure teams.

Engineering

Technical deep dives into Anyshift's engineering decisions, architecture, and lessons learned.

Product

The latest product updates, feature launches, testimonials, and news from Anyshift.

Expert

A comprehensive series exploring AWS resources in depth, covering best practices and Terraform configurations for each service.

Advanced

Advanced tutorials and guides on maintaining reliable cloud infrastructure.

Meet Our Writers

The Contributors

Anyshift

Anyshift shares the latest product updates, feature launches, and news from the team.

Ghazi Felhi

AI Engineer

Ghazi Felhi is an AI Engineer at Anyshift with a PhD in Generative AI, specializing in Language Modeling. A published AI researcher, he brings a track record of productionizing innovative AI-based solutions to Anyshift, where he works on Annie, Anyshift's AI SRE.

Louis Fradin

DevRel & Backend Engineer

Louis Fradin is a DevRel and Backend Engineer at Anyshift, where he's helping build the AI context layer for production systems, giving teams the infrastructure graph they need so AI agents can actually understand what's running in prod.

His path to SRE started deep in the stack: four years writing Linux drivers and managing HPC infrastructure for the French Ministry of Armed Forces, followed by three and a half years at Ubisoft building and operating Kubernetes clusters at scale for game servers with Go, Temporal, Talos, and OpenTelemetry.

Today he bridges that engineering background with developer advocacy, pushing for better observability primitives and smarter AI tooling for the people keeping systems alive.

Mattias Fjellstrom

Cloud Architect | Author | HashiCorp Ambassador

Mattias is a cloud architect consultant working to help customers improve their cloud environments. He has extensive experience with both the AWS and Microsoft Azure platforms and holds professional-level certifications in both. He is also a HashiCorp Ambassador and an author of a book covering the Terraform Authoring and Operations Professional certification.

Ned Bellavance

HashiCorp Ambassador

Ned is an IT professional and educator with more than 20 years of experience in the field. He has been a helpdesk operator, systems administrator, cloud architect, and product manager. In 2019, Ned founded Ned in the Cloud LLC to work as an independent educator, creator, and consultant. Ned has been a Microsoft MVP since 2017 and a HashiCorp Ambassador since 2020.

Roxane Fischer

CEO & Co-Founder

With a passion for innovation and a deep understanding of cloud infrastructure, Roxane Fischer leads Anyshift.io with a vision to transform how companies manage and maintain their cloud environments. Her background as an ex-Lead Engineer and AI researcher gives her a unique ability to anticipate industry needs, driving Anyshift's growth by delivering solutions that prioritize efficiency, reliability, and long-term success.

Stephane Jourdan

CTO & Co-Founder

With over 20 years of experience in the infrastructure space, Stephane Jourdan is a true authority on building scalable, resilient systems. As the author of the Infrastructure-as-Code Cookbook and former Co-Founder & CTO at CloudSkiff (creators of driftctl, acquired by Snyk), his depth of knowledge in cloud architecture and automation is unmatched.
