Solving production issues autonomously is incredibly complex.
Every engineer knows how critical it is to resolve production issues quickly, but this has historically been a challenge for AI. The problem isn’t just about the technical complexity—it’s about keeping up with the ever-changing landscape of production environments.
Why is it so difficult?
The Problem: Outdated Playbooks
The response we heard from every company we spoke to was the same: playbooks are outdated. This isn’t just a minor inconvenience; it’s a major issue. As production environments evolve rapidly, the playbooks designed to guide engineers through troubleshooting become obsolete almost as fast as they’re written.
For companies that are trying to move quickly, keeping these playbooks up to date is practically impossible.
A New Approach: AI Autonomy Without Compromises
When we realized that we couldn’t rely on outdated playbooks, we decided to explore new ways to make our AI agent truly autonomous. But as we tested different approaches, one thing became clear:
Letting AI make every decision in real time doesn’t work.
The debugging search space is massive. Investigating a single issue can involve an overwhelming number of possible checks, and having the AI choose among them on the fly made fully autonomous decision-making too slow and computationally expensive.
So, what’s the solution? The key lies in finding a balance between fully automated responses and human intervention.
What We’ve Learned About Building the Perfect On-Call Copilot
Here are the lessons we’ve learned on our journey:
✨ Slack Is Gold
A huge amount of troubleshooting happens in Slack—especially now that remote work is the norm. Slack conversations provide invaluable data: they show exactly what steps were taken, what checks were performed, and when.
By integrating Slack into our system, we’re able to capture this information without adding any extra work for teams. It allows us to continually learn from each incident, improving the system’s efficiency over time.
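As a rough illustration of the idea (not the actual implementation), the sketch below pulls the commands engineers pasted into an incident thread and orders them by time. It assumes messages arrive as plain dictionaries with `ts`, `user`, and `text` fields, and that checks were shared as inline or fenced code snippets; the `Step` class and `extract_steps` function are hypothetical names used only for this example.

```python
import re
from dataclasses import dataclass

# Matches fenced blocks (```...```) first, then inline code (`...`).
CODE_PATTERN = re.compile(r"```(.+?)```|`([^`\n]+)`", re.DOTALL)


@dataclass
class Step:
    """One troubleshooting step recovered from a thread (hypothetical type)."""
    ts: float      # when the message was posted
    user: str      # who posted it
    command: str   # the check or query that was shared


def extract_steps(messages):
    """Return the code snippets found in `messages`, oldest first.

    Assumes each message is a dict with "ts", "user", and "text" keys.
    Messages without code snippets are skipped.
    """
    steps = []
    for msg in messages:
        for fenced, inline in CODE_PATTERN.findall(msg.get("text", "")):
            command = (fenced or inline).strip()
            if command:
                steps.append(
                    Step(
                        ts=float(msg["ts"]),
                        user=msg.get("user", "unknown"),
                        command=command,
                    )
                )
    return sorted(steps, key=lambda step: step.ts)
```

Because the steps come straight from the conversation, nobody has to fill in a form or update a playbook for the system to learn what was checked and in what order.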
✨ Coverage Grows with Use
New issues don’t need to be a roadblock. If a team member runs a query for an issue we haven’t seen before, we add it to our automated responses for future use. This way the AI agent keeps improving and can handle a wider range of issues with each new incident.
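One way to picture this is a registry of checks keyed by issue type, which grows whenever an engineer runs something new. The sketch below is a simplified illustration under that assumption; `CheckRegistry`, its method names, and the issue-type labels are hypothetical and do not describe the real system.

```python
class CheckRegistry:
    """Maps an issue type to the checks that should run automatically."""

    def __init__(self):
        self._checks = {}

    def known_checks(self, issue_type):
        """Return a copy of the checks already registered for this issue type."""
        return list(self._checks.get(issue_type, []))

    def record_manual_query(self, issue_type, query):
        """Register a query an engineer ran by hand.

        Returns True if the query was new for this issue type, False if it
        was already covered.
        """
        checks = self._checks.setdefault(issue_type, [])
        if query in checks:
            return False
        checks.append(query)
        return True

    def investigate(self, issue_type, run):
        """Run every registered check with the caller-supplied `run` function.

        Returns a dict mapping each check to its result. An empty dict means
        the issue type is not covered yet and a human needs to step in.
        """
        return {query: run(query) for query in self.known_checks(issue_type)}
```

With this shape, the first occurrence of an issue is handled by a person, and their queries are recorded with `record_manual_query`. The next occurrence starts with those checks already run, which avoids asking the AI to search the whole space of possible checks in real time.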
✨ Every Company Has Repetitive, Time-Consuming Checks
While the specific checks may vary from one company to another, one thing is certain: every company has repetitive checks that waste engineers' time. Whether it’s reviewing logs, confirming configurations, or running basic diagnostics, these tasks are critical but time-consuming.
Automating these repetitive tasks frees engineers to focus on higher-value work, and it speeds up incident resolution.
The Results: Cutting Investigation Time by Over 40%
By combining these insights and finding the right balance between automation and human input, we’ve been able to cut investigation time by over 40% right from the start.
But it doesn’t stop there. This approach continues to evolve, with each new incident feeding data back into the system and driving even faster resolutions.
What’s Next?
As we continue to refine our on-call copilot, we’re working to nail that perfect balance between automation and human expertise. The next step is expanding our AI’s capabilities while ensuring that it remains fast, scalable, and cost-effective.
In the photo below, my team shows off the view we cherish as we keep working on this exciting, evolving solution.
Key Takeaways:
- Outdated playbooks are a major challenge in production issue resolution, but AI can help.
- Slack conversations are invaluable data sources for capturing real-time troubleshooting steps.
- By automating repetitive checks, we save time and increase incident resolution speed.
- Our approach has already reduced investigation time by over 40%, and we’re just getting started.
Curious to see how we can help your team cut investigation times and resolve issues faster?
Get in touch with us today, and let’s talk about how we can optimize your on-call process.