The Challenge: When Manual Runbooks Meet Growing Scale
GoFundMe has built one of the world's most trusted crowdfunding platforms, helping millions raise funds for causes that matter. With two donations happening every second and over 190 million donors relying on the platform, the stakes for uptime are high.
The team operates a fully cloud-native, containerized stack on AWS alongside a constellation of SaaS services. They had the fundamentals in place: robust monitoring, strong observability tooling, and a well-oiled incident management process. When alerts fired, a third-party on-call system paged the responders based on service ownership. Incident management tools automatically spun up war rooms across Slack channels and phone bridges, enabling real-time cross-team coordination.
However, in practice, ownership maps drifted as the org evolved, making triage harder and handoffs more frequent when incidents spanned multiple services. Playbooks and service ownership were difficult to maintain, and signals were fragmented across tools. Every incident still leaned on human expertise, manual runbooks, and firefighting under pressure.
Jesse Sanchez, Director of Site Reliability Engineering at GoFundMe, explains: “As with most growing companies, you balance newer services with legacy ones. You’re modernizing while still maintaining older software and utilities. That mix creates complexity and technical debt that isn’t always straightforward to troubleshoot.”
“Even in a case when we have really good runbooks, it’s a manual process. You find the runbook and go through the one, 10, 20, 30 steps. Then rinse and repeat.”
At scale, these manual processes created predictable friction: gaps in instrumentation and logging, along with the limits of third-party integrations, left responders bouncing between logs, dashboards, and vendor status pages without clear signals.
Most critically, incident response relied heavily on tribal knowledge, with the load falling on a small group of senior engineers with deep context. They had the pattern recognition to navigate ambiguity, while newer or junior engineers took longer to reach root cause.
“It’s often the same people in incidents because they’ve built the muscle memory. We wanted the tooling to capture that.”
The Solution: Moving From Tribal Knowledge to Intelligent Automation
GoFundMe needed a solution to make the expertise of its best troubleshooters accessible to everyone, without the burden of relying on runbooks (when they existed) and pinging other engineers during an incident. Wild Moose closed that gap by codifying knowledge, automating checks, and delivering high-signal summaries within the team's existing workflow.
"We wanted something embedded in our workflows that shows where we are in an incident and provides guided next steps."
What set Wild Moose apart was its tight focus on learning the company’s own workflows for triage and cause analysis. It posts root-cause summaries and evidence directly to Slack, so coordination and diagnosis happen in the same place without context switching. The platform learns GoFundMe’s patterns, integrates with existing observability and on-call tooling, and surfaces concise, defensible assessments the moment an alert fires so engineers can move straight to the fix.
“Wild Moose lets us move from manual runbooks and tribal knowledge to intelligent automation. Playbooks run automatically and outputs are ready when we enter an incident.”
Wild Moose’s approach was to enhance human expertise, not replace it. The platform would capture GoFundMe's operational knowledge and provide intelligent guidance, with humans remaining in the loop for decision-making and execution.
Codifying Tribal Knowledge Into Workflows That Run Themselves
The implementation centered on capturing the team's existing workflows and transforming them into intelligent, automated playbooks. Wild Moose enabled the team to codify those workflows as versioned playbooks in Git, making them part of the normal development lifecycle with proper version control, review processes, and collaboration workflows.
“Wild Moose lets us codify run books so we can version them, put it into our Git repository and be able to use it in a normal developer lifecycle.”
The platform integrated seamlessly with GoFundMe's existing logging and observability vendors, pulling in the data needed to execute playbook logic automatically. When an alert fires, Wild Moose infers in real time from the incoming signals which playbooks to run, executes the relevant steps, and presents the outputs directly in the team's workflows.
The result isn’t a generic AI recommendation. It’s their domain knowledge, version-controlled and reviewable in Git, executed at machine speed and delivered where the team already collaborates. Over time, those playbooks capture tribal knowledge and make it reusable by any responder, not only the usual experts.
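To make the idea of a playbook as code concrete, here is a minimal sketch in Python. It is not Wild Moose's actual playbook format or API, which this article does not describe. The `Playbook` class, the alert fields, the service names, and the check functions are all hypothetical, and the simple label matching stands in for whatever inference the real platform performs. It only illustrates the general pattern: checks live in a reviewable file in Git, and an incoming alert selects and runs the matching ones.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical shapes: an alert is a flat dict of labels, and a step is a
# function that inspects the alert and returns a one-line finding.
Alert = Dict[str, str]
Step = Callable[[Alert], str]


@dataclass
class Playbook:
    """A runbook expressed as code so it can be versioned and reviewed."""
    name: str
    match: Dict[str, str]  # alert labels that must all be present and equal
    steps: List[Step] = field(default_factory=list)

    def applies_to(self, alert: Alert) -> bool:
        return all(alert.get(key) == value for key, value in self.match.items())


# Placeholder checks. Real steps would query logging or observability tools.
def check_recent_deploys(alert: Alert) -> str:
    return f"checked recent deploys for {alert['service']}"


def check_error_logs(alert: Alert) -> str:
    return f"scanned error logs for {alert['service']}"


# Hypothetical playbooks; the service and signal names are made up.
PLAYBOOKS = [
    Playbook(
        name="checkout-5xx",
        match={"service": "checkout", "signal": "http_5xx"},
        steps=[check_recent_deploys, check_error_logs],
    ),
    Playbook(
        name="db-latency",
        match={"signal": "db_latency"},
        steps=[check_error_logs],
    ),
]


def run_for_alert(alert: Alert, playbooks: List[Playbook] = PLAYBOOKS) -> List[str]:
    """Run every playbook whose labels match the alert and collect findings."""
    findings = []
    for playbook in playbooks:
        if playbook.applies_to(alert):
            for step in playbook.steps:
                findings.append(f"[{playbook.name}] {step(alert)}")
    return findings


if __name__ == "__main__":
    for line in run_for_alert({"service": "checkout", "signal": "http_5xx"}):
        print(line)
```

Because the playbooks are ordinary code, a change to a check goes through the same pull request and review flow as any other change, which is the property the team highlights above.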
The Results: Shorter Diagnosis and Broader Coverage
Instead of manually investigating incidents and production issues, engineers now receive automated outputs the moment an alert fires. This shift from reactive manual work to proactive automated guidance fundamentally changed the incident response experience.
GoFundMe uses Wild Moose's root-cause summaries to reliably point responders in the right direction.
Beyond faster resolution times, Wild Moose addresses the tribal knowledge problem that had long challenged the team. The platform learns from feedback and the engineers' behavior, capturing the expertise that previously lived only in the heads of experienced engineers.
"Wild Moose is a fast, low-friction integration with quick ROI. You don’t need months to see value, you’ll get positive results quickly."
Scaling Intelligent Automation: What's Next for GoFundMe
The partnership between GoFundMe and Wild Moose demonstrates a crucial insight for SRE leaders: the most impactful solutions enhance existing systems and your team. This involves converting manual operational knowledge into intelligent, automated guidance, which expands your team's expertise rather than replacing it.
