Agentic AI for Production Troubleshooting

A colleague of mine runs a design agency that also hosts the websites it builds. The web server is classic: monolithic and multi-tenant, and it had accumulated several dozen websites. It’s a dated architecture, but appropriate for a business that is primarily about visual design.

The server was experiencing unpredictable periods of downtime. This had been going on for months.

On a monolith, there’s no separation of concerns to help shrink the problem space. Analyzing the logs is like walking through an open-air market, a crowd of signals each offering an inviting narrative about the problem I need to solve.

It was a perfect problem for vibe engineering. What followed was an investigation that played out over a few sessions, as I refined how to apply an agentic approach to production troubleshooting.

First Attempt: Manual Log Analysis

To start, I wanted to keep Claude at arm’s length from the server—to be sure I knew what was going into the prompts. So, for the initial attempts, I downloaded log files and had Claude use a blend of custom scripts and direct inference to analyze them. The solutions lined up with my expectations. Yay!

But the downtime happened again.

In hindsight, this was clearly my bias at work. I chose logs that I assumed were related to the problem, and Claude made suggestions constrained by my biased framing and data samples. I needed to “expand the solution space” (i.e., curb my bias). I also wanted to speed things up, because downloading logs was slow.

Building the Observability Layer

I set up a couple of observability instances and started streaming logs and metrics into them. These acted as a sort of proxy between Claude and the server (a key pattern for safer AI use), and Claude could eventually query them via their APIs. I did other things for a few weeks while data accumulated.

The server failed again and I went to work. This time I had more data, and tools for interrogating it. So instead of saying to Claude, “here are some web access logs I downloaded. Tell me which website is causing issues,” I could ask more sophisticated questions: “What’s the baseline resource demand for this process?” “What logs show unusual behavior around these two downtimes?”

The Investigation Takes Off

That’s when things took off. The AI:

Queried LVE stats to check per-user resource consumption
Searched syslog for patterns around crash times
Examined cron job schedules for conflicts
Analyzed firewall logs for lock contention
Correlated timestamps across multiple data sources

We arrived at the answer: an iptables lock deadlock caused by CSF configuration issues and cron job timing. Nothing to do with user load at all.
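The timestamp-correlation step is the easiest part of that list to show in code. What follows is a minimal sketch, not the queries Claude actually ran: the function names, the five-minute window, and the sample events are all invented for illustration. The idea is to collect events from every source that occur shortly before each crash, then keep only the signals that show up before every crash.

```python
from collections import Counter
from datetime import datetime, timedelta


def events_near(crashes, events, window=timedelta(minutes=5)):
    """Return (crash, source, message) for events in the window before a crash."""
    hits = []
    for crash in crashes:
        for ts, source, message in events:
            # Keep only events that happened at most `window` before the crash.
            if timedelta(0) <= crash - ts <= window:
                hits.append((crash, source, message))
    return hits


def recurring_signals(hits, crash_count):
    """Keep (source, message) pairs that appear before every crash."""
    seen = Counter()
    # set() collapses repeats of the same message before the same crash.
    for _crash, source, message in set(hits):
        seen[(source, message)] += 1
    return sorted(key for key, n in seen.items() if n == crash_count)


# Hypothetical sample data: two crashes and events from three sources.
crashes = [datetime(2024, 1, 10, 3, 2), datetime(2024, 1, 17, 3, 1)]
events = [
    (datetime(2024, 1, 10, 3, 0), "cron", "job A started"),
    (datetime(2024, 1, 10, 3, 1), "firewall", "lock wait"),
    (datetime(2024, 1, 10, 2, 59), "web", "traffic spike"),
    (datetime(2024, 1, 17, 3, 0), "cron", "job A started"),
    (datetime(2024, 1, 17, 3, 0), "firewall", "lock wait"),
    (datetime(2024, 1, 17, 1, 0), "web", "traffic spike"),
]

hits = events_near(crashes, events)
recurring = recurring_signals(hits, len(crashes))
print(recurring)
```

In this made-up data, the traffic spike precedes only one of the two crashes, so it drops out, while the cron and firewall signals remain. A filter like this is useful because it demotes the inviting narratives that only fit one incident.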

I couldn’t have written the queries, parsed the logs, or correlated the timestamps myself. But I knew what questions to ask, what looked suspicious, and when to dig deeper. The AI provided the implementation. I provided the direction.

What Made This Work

The proxy pattern. Giving Claude access to observability APIs rather than direct server access meant I could audit what data it was seeing and what queries it was running. This is safer and more controllable than giving an AI shell access to production.
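As a rough illustration of the proxy pattern, here is a minimal sketch of a wrapper that sits between an agent and an observability backend. Everything in it is an assumption for the sake of example: the class name, the allowlist of sources, and the stand-in backend function are hypothetical, and a real setup would call the actual API of whatever observability tool is in use. The wrapper records every query before running it and refuses sources that are not on the allowlist.

```python
import json
from datetime import datetime, timezone


class AuditedQueryProxy:
    """Wraps a query backend: allows only listed sources and logs every call."""

    def __init__(self, backend, allowed_sources, audit_log):
        self._backend = backend              # callable(source, expression) -> result
        self._allowed = frozenset(allowed_sources)
        self._audit_log = audit_log          # any object with append(), e.g. a list

    def query(self, source, expression):
        allowed = source in self._allowed
        # Record the attempt first, so refused queries are auditable too.
        self._audit_log.append(json.dumps({
            "time": datetime.now(timezone.utc).isoformat(),
            "source": source,
            "expression": expression,
            "allowed": allowed,
        }))
        if not allowed:
            raise PermissionError(f"source not allowed: {source}")
        return self._backend(source, expression)


def fake_backend(source, expression):
    """Hypothetical stand-in for a real observability API client."""
    return [f"{source}: result for {expression!r}"]


audit_log = []
proxy = AuditedQueryProxy(fake_backend, {"syslog", "metrics"}, audit_log)

rows = proxy.query("syslog", "errors around 03:00")

try:
    proxy.query("shell", "rm -rf /tmp/cache")
    refused = False
except PermissionError:
    refused = True

print(rows, refused, len(audit_log))
```

The design choice that matters is that the audit entry is written before the permission check, so the log shows what the agent tried to do as well as what it was allowed to do.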

Accumulated data. Having weeks of metrics and logs meant we could establish baselines and spot anomalies. Real-time debugging is harder—you’re always chasing the moment.
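To make "baseline" and "anomaly" concrete, here is a minimal sketch under stated assumptions: the metric values are invented, the function name is hypothetical, and the median with median absolute deviation (MAD) is just one robust way to define a baseline, not necessarily what was used in this investigation.

```python
from statistics import median


def baseline_and_anomalies(samples, threshold=3.0):
    """Return (baseline, anomalies) for a list of (label, value) samples.

    The baseline is the median value. A sample is flagged when it deviates
    from the baseline by more than `threshold` times the MAD. If the MAD is
    zero, any value that differs from the baseline is flagged.
    """
    values = [value for _label, value in samples]
    base = median(values)
    mad = median(abs(value - base) for value in values)
    anomalies = [
        (label, value)
        for label, value in samples
        if abs(value - base) > threshold * mad
    ]
    return base, anomalies


# Hypothetical CPU usage (%) for one process, sampled at regular intervals.
samples = [
    ("t0", 20), ("t1", 22), ("t2", 21), ("t3", 19),
    ("t4", 23), ("t5", 20), ("t6", 95), ("t7", 21),
]

base, anomalies = baseline_and_anomalies(samples)
print(base, anomalies)
```

The median is used instead of the mean so that a single extreme sample does not drag the baseline toward itself. With only a handful of samples the result is fragile, which is why weeks of accumulated data matter.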

Better questions. The shift from “analyze these logs” to “what’s the baseline resource demand for this process?” changed everything. Specific, measurable questions get specific, useful answers.

My domain knowledge. I knew what LVE stats were. I knew that iptables lock contention was a thing. I knew that cron timing could cause conflicts. Without that background, I wouldn’t have known where to look or what looked suspicious.

The Takeaway

This is what agentic AI looks like in practice: not a magic oracle that solves problems, but a capable collaborator that can execute investigations at machine speed while you provide the direction and judgment.

The AI found the answer. But I found the path to the answer.

This post is a case study from Vibe Engineering in Practice.