This blog is written by Puneet Gupta
Have you ever been in this situation?
It's 3 AM. Production is down. Customers are hitting errors. Your phone won't stop ringing.
You grep through thousands of log lines across services, packet captures, and metrics dashboards. The clock is ticking. Every minute costs real money.
Finally, you find something:
ERROR: Database connection timeout
You've found an error. But is it the root cause or just a symptom? Why is the database timing out? Did something change? Is it the database, the network, or the application?
The answer is scattered across a dozen systems, and you don't have hours to piece it together manually.
Welcome to LogWatch
LogWatch doesn't just find errors — it tells you the complete story.
- Application logs: 5,247 connection timeouts to PostgreSQL over 60 minutes
- Network PCAP analysis: the same service shows 523,000 connection attempts — 100x more than any other client
- GitLab pipeline logs: a deployment 90 minutes ago introduced a retry loop with aggressive polling
Root cause: Not a database problem. A configuration regression caused exponential retry behavior, creating a self-inflicted DDoS.
The root cause wasn't hidden. It was just buried.
Traditional approach:
4–6 hours → manual log analysis → blaming the database team → checking deployments → tribal knowledge → eventually finding the config issue
LogWatch approach:
3 minutes → automated cross-system correlation → precise root cause → actionable fix
This is intelligence-driven log analysis — the opposite of throwing raw logs at an LLM.
At Palo Alto Networks, this wasn’t an edge case. It was daily life in Network Security (NetSec).
Our CI/CD pipelines generated massive volumes of logs across Jenkins and GitLab. A single failure meant engineers and SREs manually grepping through noise, trying to infer where the failure started — and why. Real answers required correlating data across multiple disconnected systems.
All the data existed, but none of it spoke the same language. The real problem wasn’t a lack of observability — it was a lack of intelligence.
Logs were telling us what happened, repeatedly. They weren’t telling us why.
The result was a system trapped in inefficiency.
The irony was painful.
We weren’t missing data — we were overloaded with it.
What we needed wasn’t more logs, dashboards, or alerts. We needed a solution that could preprocess and reason over logs before humans ever looked at them.
That insight led us toward an agent-based log intelligence approach:
Instead of engineers asking, “Where do I even start?”
The system should answer, “Here is the failure, here is the dependency chain, and here is the most likely root cause.”
That shift from raw logs to agent-ready intelligence is what transformed log data from a liability into an asset.
When large language models exploded in popularity, the knee-jerk reaction was obvious: dump the logs into GPT and get instant insights. Teams quickly discovered the harsh economics.
The fundamental flaw: treating all log data as equally important.
LogWatch solves the log explosion problem with a hybrid architecture that combines deterministic preprocessing with targeted LLM analysis. Instead of blindly feeding raw logs into expensive language models, LogWatch builds an intelligent preprocessing pipeline that converts noisy, unstructured logs into high-signal, structured intelligence.
The result is faster analysis, dramatically lower cost, and insights that are directly usable by troubleshooting agents, TSE workflows, and NetSec operations.
TSE Agent
Jenkins pipeline correlation using LogWatch
Regression Analysis: LogWatch
Raw Logs → Smart Preprocessing → Structured Intelligence → LLM Analysis → Actionable Results
Traditional approach
50,000 log lines → LLM API → high cost → generic summaries of mostly irrelevant data
LogWatch approach
50,000 log lines → ~50 meaningful patterns → low LLM cost → precise, actionable insights tied to real failures
LogWatch ensures that LLMs reason only over signal, never over raw noise.
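The economics can be sketched with back-of-envelope arithmetic. All token counts and prices below are illustrative assumptions for the sake of the example, not measured LogWatch figures:

```python
# Illustrative comparison of LLM input cost for raw logs versus
# preprocessed patterns. Every number here is an assumption.

TOKENS_PER_LOG_LINE = 30   # assumed average tokens per raw log line
TOKENS_PER_PATTERN = 40    # assumed tokens per extracted pattern (template + counts)
COST_PER_1K_TOKENS = 0.01  # assumed price in dollars per 1,000 input tokens

def llm_input_cost(units: int, tokens_per_unit: int) -> float:
    """Estimated LLM input cost in dollars for a given number of text units."""
    return units * tokens_per_unit / 1000 * COST_PER_1K_TOKENS

raw_cost = llm_input_cost(50_000, TOKENS_PER_LOG_LINE)  # raw: 50,000 log lines
pattern_cost = llm_input_cost(50, TOKENS_PER_PATTERN)   # preprocessed: ~50 patterns

print(f"raw logs:  ${raw_cost:.2f}")      # $15.00 under these assumptions
print(f"patterns:  ${pattern_cost:.4f}")  # $0.0200 under these assumptions
print(f"reduction: {raw_cost / pattern_cost:.0f}x")
```

Even with generous assumptions about pattern size, sending ~50 patterns instead of 50,000 lines cuts input cost by orders of magnitude.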
LogWatch’s core innovation lies in its multi-stage preprocessing pipeline, which combines rule-based efficiency with AI-driven flexibility.
Regular expressions (regex) are used deliberately for cost control, speed, and early signal extraction.
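As a sketch, regex-based signal extraction might look like the following. The log format and patterns are illustrative assumptions; real connectors would ship source-specific patterns for Jenkins, GitLab, PostgreSQL, and so on:

```python
import re
from collections import Counter

# Hypothetical log format used only for this example.
LOG_PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>DEBUG|INFO|WARN|ERROR|FATAL)\s+"
    r"(?P<msg>.*)$"
)

def extract_signal(lines):
    """Keep only WARN-and-above lines and count message occurrences."""
    counts = Counter()
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m and m.group("level") in {"WARN", "ERROR", "FATAL"}:
            counts[m.group("msg")] += 1
    return counts

logs = [
    "2025-01-01T03:00:01 INFO health check ok",
    "2025-01-01T03:00:02 ERROR Database connection timeout",
    "2025-01-01T03:00:03 ERROR Database connection timeout",
]
print(extract_signal(logs))
```

Because this stage is deterministic, it runs in milliseconds and never sends a token to an LLM.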
Regex preprocessing is augmented with zero-shot AI classification, allowing LogWatch to enrich log patterns without any training data.
This makes LogWatch adaptable across products, customers, and environments with zero retraining.
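Conceptually, zero-shot enrichment scores each log pattern against a set of candidate labels with no task-specific training. In the sketch below the scoring function is a trivial keyword stub standing in for a real zero-shot model (for example an NLI-based classifier or an LLM call); the labels and keywords are hypothetical:

```python
# Sketch of zero-shot label enrichment. The keyword lists below are stub
# "knowledge" for this example only; a real zero-shot model needs none.

CANDIDATE_LABELS = ["database", "network", "deployment", "authentication"]

LABEL_KEYWORDS = {
    "database": {"database", "postgresql", "sql", "timeout"},
    "network": {"connection", "dns", "packet", "socket"},
    "deployment": {"deploy", "pipeline", "rollout", "config"},
    "authentication": {"auth", "token", "login", "credential"},
}

def classify_zero_shot(pattern: str) -> str:
    """Return the best-matching label for a log pattern (stub implementation)."""
    tokens = set(pattern.lower().split())
    scores = {
        label: len(tokens & LABEL_KEYWORDS[label])
        for label in CANDIDATE_LABELS
    }
    return max(scores, key=scores.get)

print(classify_zero_shot("Database connection timeout to PostgreSQL"))
```

Swapping the stub for a real model changes only the scoring function; the enrichment interface stays the same.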
To avoid line-by-line analysis, LogWatch uses the Drain3 clustering algorithm to generate intelligent log templates.
Instead of analysing 50,000 individual log lines, LogWatch produces compact, high-signal patterns.
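The core idea behind such templates is that variable tokens are masked so similar lines collapse into one pattern. Drain3 does this with a parse tree and online clustering; the pure-Python sketch below is only a mask-based approximation of the concept:

```python
import re
from collections import Counter

def to_template(line: str) -> str:
    """Mask variable tokens (IPs, hex ids, numbers) so similar lines
    collapse into one template. A crude stand-in for Drain3's clustering."""
    line = re.sub(r"\b\d+\.\d+\.\d+\.\d+\b", "<*>", line)  # IP addresses
    line = re.sub(r"\b0x[0-9a-fA-F]+\b", "<*>", line)      # hex identifiers
    line = re.sub(r"\b\d+\b", "<*>", line)                 # plain numbers
    return line

def mine_templates(lines):
    """Count occurrences of each template instead of each raw line."""
    return Counter(to_template(l) for l in lines)

logs = [
    "Connection timeout to 10.0.0.5 after 3000 ms",
    "Connection timeout to 10.0.0.7 after 5000 ms",
    "Connection timeout to 10.0.0.9 after 3000 ms",
]
print(mine_templates(logs))
# All three lines collapse into "Connection timeout to <*> after <*> ms" x 3
```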
Modern incidents rarely live in a single system. LogWatch’s correlation engine is designed to reason across pipelines, observability platforms, security systems, network data, and internal tools.
Within a single connector, LogWatch identifies recurring patterns and anomalies in that source’s data. Across connectors, it correlates related events in time.
This enables accurate root-cause propagation across system boundaries.
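One simple form of cross-connector correlation is time-window matching: an anomaly in one system is linked to events from other systems that precede it within a window. The event shapes and window size below are illustrative assumptions, not the actual correlation engine:

```python
from datetime import datetime, timedelta

# Assumed window: look back up to two hours before the anomaly.
WINDOW = timedelta(minutes=120)

def correlate(anomaly, events):
    """Return events from any connector that precede the anomaly within WINDOW."""
    return [
        e for e in events
        if timedelta(0) <= anomaly["ts"] - e["ts"] <= WINDOW
    ]

anomaly = {"source": "app_logs", "ts": datetime(2025, 1, 1, 3, 0),
           "msg": "5,247 connection timeouts to PostgreSQL"}
events = [
    {"source": "gitlab", "ts": datetime(2025, 1, 1, 1, 30),
     "msg": "deployment introduced aggressive retry loop"},
    {"source": "gitlab", "ts": datetime(2024, 12, 30, 9, 0),
     "msg": "unrelated docs-only deployment"},
]

for e in correlate(anomaly, events):
    print(f"{e['source']}: {e['msg']}")
```

In this sketch, the deployment 90 minutes before the timeouts survives the filter while the older, unrelated one does not; a real engine would also weigh pattern similarity and dependency graphs, not time alone.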
LogWatch provides a connector SDK that allows users to contribute new integrations in a controlled and consistent way. To add a new connector, a user implements a single class that follows a well-defined interface. The goal is to keep user effort minimal while ensuring every connector can fully participate in preprocessing, correlation, and workflows.
A user-written connector is responsible only for the logic that is specific to its data source.
LogWatch SDK
The LogWatch platform abstracts away all common infrastructure concerns.
This separation allows users to focus only on what makes their data source unique, while LogWatch handles scale, consistency, and intelligence.
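Under assumed method names (the post does not show the real SDK interface), the single-class connector contract might look like:

```python
from abc import ABC, abstractmethod
from typing import Iterator

# Hypothetical shape of a LogWatch connector; method names are assumptions
# used to illustrate the "implement one class" contract.

class Connector(ABC):
    """Base class a user implements to plug a new data source into LogWatch."""

    @abstractmethod
    def fetch(self, since: str) -> Iterator[str]:
        """Yield raw log lines from the data source."""

    @abstractmethod
    def normalise(self, line: str) -> dict:
        """Map one raw line to the platform's common event schema."""

class JenkinsConnector(Connector):
    """Example connector: everything source-specific lives in this one class."""

    def fetch(self, since: str) -> Iterator[str]:
        # A real connector would call the Jenkins API here; stubbed for the sketch.
        yield "2025-01-01T03:00:02 ERROR build step failed"

    def normalise(self, line: str) -> dict:
        ts, level, msg = line.split(" ", 2)
        return {"ts": ts, "level": level, "source": "jenkins", "msg": msg}

c = JenkinsConnector()
events = [c.normalise(l) for l in c.fetch(since="2025-01-01")]
print(events[0]["level"])  # ERROR
```

Keeping fetching and normalisation as the only source-specific methods means the platform, not the connector author, owns scheduling, storage, preprocessing, and correlation.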
LogWatch is designed as an intelligence layer for agents:
Instead of asking engineers to read logs, LogWatch prepares the data so agents can reason, correlate, and act.
LogWatch has grown from an initial integration into a core driver of engineering efficiency. By automating high-toil tasks, the system has shifted the focus from reactive firefighting to proactive development.
The following table summarises the quantitative improvements across key engineering pillars:
| Engineering pillar | Before LogWatch | With LogWatch |
|---|---|---|
| Root cause analysis | 4–6 hours manually checking pipeline logs, observability data, and Jira tickets | Minutes, reducing downtime by 95%+ |
| Jira triage | 100% manual, slow, and error-prone | ~40% manual, saving 60% of engineering time |
| Automation | Manual or script-driven | Fully automated, eliminating human error |
LogWatch is currently embedded across three critical, high-impact areas.
The impact of LogWatch extends beyond individual ticket resolution:
- Mainline Health: By reducing build failure diagnostics from half a day to minutes, LogWatch ensures the main branch remains “green.”
- Commit Velocity: A healthy main branch eliminates developer “wait states,” directly increasing the frequency and reliability of code commits across all engineering pods.
Team
Puneet Gupta https://www.linkedin.com/in/puneetggupta/
Peter Kirubakaran N https://www.linkedin.com/in/peter-kirubakaran-n-a3225621/
Sughosh Divanji https://www.linkedin.com/in/sughosh-divanji-b0a1021b/
Kuldeep Saini https://www.linkedin.com/in/kuldeep-s-8ab5021a4/


