WildFire Droid: The Agentic AI Framework to simplify Cloud Security Product Operations

asrao · 09-03-2025


Co-author: Immanuel Edward Henrik (@iedwardhenr)

Introduction: The Unseen Battle in Cloud Security Operations

In the world of cloud-based cybersecurity products, operational teams are constantly fighting a battle with one hand tied behind their back. Despite having countless tools at their disposal — incident management platforms, log aggregators, deployment pipelines, data warehouses, documentation wikis, and communication channels — the workflows that connect them remain clunky and manual. Security professionals spend their days context-switching between disparate systems, copying data from one tool to paste into another, and mentally stitching together information scattered across a dozen interfaces.

The result? A critical bottleneck that leaves teams one step behind the threats they're trying to stop.

But what if the system could think for itself?

This is the core idea behind Agentic AI. It's a leap beyond simple automation or basic chatbots. Agentic AI enables a new class of intelligent systems composed of multiple specialized agents that can understand complex commands, reason about multi-step problems, and work together autonomously to solve them. It's not just about doing tasks faster; it's about having a system that can reason and respond with the strategic precision of a human expert.

We've built the WildFire Droid — an Agentic AI assistant that is redefining the operational efficiency of our WildFire cloud security teams. Built on the Google Agent Development Kit (ADK) and powered by a sophisticated multi-agent architecture, the WildFire Droid transforms multi-stage security workflows into seamless, conversational experiences.

This blog post explores its architecture, its growing arsenal of specialized agents, the key technical innovations that power it, and the impact it is having on our daily operations.

The "Agentic" Revolution: A Team of Specialists, Not a Single Bot

To grasp the innovation behind the WildFire Droid, it's important to understand the foundational concept of an Agentic AI. Unlike a traditional chatbot that follows a rigid script, an Agentic AI is a system composed of multiple specialized agents. Think of it as having a team of domain experts at your disposal, all managed by a single team lead.

In the case of WildFire Droid, this team is led by a central Root Agent (the "Droid Manager"). When you make a request, the Root Agent's first job is to understand your intent — your true goal. It then plans an orchestration strategy: should it deploy multiple agents in parallel, chain them sequentially, or use a hybrid approach? Once it determines the optimal path, it seamlessly delegates the task to the most appropriate sub-agent (or agents). After receiving their outputs, it synthesizes the results into a coherent, actionable response.

This architecture ensures that every request is handled by a true specialist, leading to greater accuracy, efficiency, and depth of response.

The 5-Phase Orchestration Workflow

What makes the Root Agent more than a simple router is its structured orchestration intelligence:

Planning — Decompose the request, map it to the right agents, and design the execution strategy.
Delegation — Deploy agents with clear instructions, in the optimal order, with full context.
Validation — Review agent outputs for completeness and cross-validate when possible.
Synthesis — Merge results, eliminate duplicates, and resolve contradictions.
Delivery — Present a comprehensive answer, cite agent contributions, and suggest next steps.

This means the Droid doesn't just forward your question — it thinks about the best way to answer it.
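
The five phases can be sketched in plain Python. This is a minimal illustration, not the Droid's actual code: keyword routing stands in for the Root Agent's LLM-based intent analysis, and every name here (`Plan`, `plan`, `delegate`, `handle`, the registry of agent functions) is a hypothetical stand-in rather than part of the Google ADK.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for a sub-agent: takes the request, returns text.
AgentFn = Callable[[str], str]


@dataclass
class Plan:
    agents: List[str]  # sub-agents to deploy, in order
    parallel: bool     # True when the tasks are treated as independent


def plan(request: str, registry: Dict[str, AgentFn]) -> Plan:
    """Planning: naive keyword routing stands in for LLM intent analysis."""
    chosen = [name for name in registry if name in request.lower()]
    # Assumption: any request touching several agents is safe to parallelize.
    return Plan(agents=chosen, parallel=len(chosen) > 1)


def delegate(request: str, p: Plan, registry: Dict[str, AgentFn]) -> Dict[str, str]:
    """Delegation: run the chosen agents in parallel or one after another."""
    if p.parallel:
        with ThreadPoolExecutor() as pool:
            futures = {name: pool.submit(registry[name], request) for name in p.agents}
            return {name: fut.result() for name, fut in futures.items()}
    return {name: registry[name](request) for name in p.agents}


def validate(outputs: Dict[str, str]) -> Dict[str, str]:
    """Validation: drop empty outputs (a real check would cross-validate)."""
    return {name: text for name, text in outputs.items() if text.strip()}


def synthesize(outputs: Dict[str, str]) -> List[str]:
    """Synthesis: merge results, keeping the first copy of duplicated text."""
    seen, merged = set(), []
    for name, text in outputs.items():
        if text not in seen:
            seen.add(text)
            merged.append(f"[{name}] {text}")  # cite the contributing agent
    return merged


def deliver(lines: List[str]) -> str:
    """Delivery: present one combined answer."""
    return "\n".join(lines) if lines else "No agent could answer this request."


def handle(request: str, registry: Dict[str, AgentFn]) -> str:
    p = plan(request, registry)
    return deliver(synthesize(validate(delegate(request, p, registry))))
```

Calling `handle("Check JIRA and Slack for the outage", registry)` with a registry containing `jira` and `slack` functions would run both and return one line per agent, each prefixed with the agent's name.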

The Architecture: How It All Fits Together

The Agent Hierarchy

Figure: WildFire Droid Architecture

Key Design Patterns

Orchestrated Multi-Agent Collaboration — The Root Agent doesn't just route requests. It plans orchestration strategies, deploys agents in parallel when tasks are independent, chains them sequentially when outputs inform inputs, and synthesizes results into coherent narratives.
Sequential Pipelines for Critical Operations — For high-stakes operations like API key management, a Sequential Agent enforces strict ordering: each step must complete successfully before the next begins. This ensures validation, execution, and auditing happen in the correct order with no shortcuts.
After-Tool Callbacks for Intelligent Processing — Multiple agents use callback functions to automatically process tool responses through an in-memory RAG pipeline. This is transparent to the agent — it simply receives optimally compressed, relevant content.
MCP Server Isolation — Each MCP server runs as a separate process, providing process isolation, independent lifecycle management, and clean separation of concerns.
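
The after-tool callback pattern can be illustrated with a small wrapper. This is a sketch under assumptions: the wrapper, the callback signature, and the `keep_relevant_lines` filter are hypothetical, and simple keyword matching stands in for the in-memory RAG pipeline. The point is only that the tool's caller receives the processed response without knowing a callback ran.

```python
from typing import Any, Callable, Dict, Optional

ToolFn = Callable[..., Dict[str, Any]]
# A callback returns a replacement response, or None to keep the original.
AfterToolCallback = Callable[[str, Dict[str, Any], Dict[str, Any]], Optional[Dict[str, Any]]]


def with_after_tool_callback(tool: ToolFn, callback: AfterToolCallback) -> ToolFn:
    """Wrap a tool so that every response passes through the callback."""
    def wrapped(**kwargs: Any) -> Dict[str, Any]:
        response = tool(**kwargs)
        name = getattr(tool, "__name__", "tool")
        replacement = callback(name, kwargs, response)
        return response if replacement is None else replacement
    return wrapped


def keep_relevant_lines(tool_name: str, args: Dict[str, Any],
                        response: Dict[str, Any], max_lines: int = 3) -> Optional[Dict[str, Any]]:
    """Hypothetical processor: keyword matching stands in for semantic retrieval."""
    lines = str(response.get("content", "")).splitlines()
    if len(lines) <= max_lines:
        return None  # small enough: pass the response through untouched
    terms = str(args.get("query", "")).lower().split()
    relevant = [ln for ln in lines if any(t in ln.lower() for t in terms)]
    return {**response, "content": "\n".join(relevant), "compressed": True}
```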

WildFire Droid's Core Capabilities

The WildFire Droid is designed to transform how teams work by turning complex, multi-system tasks into simple, conversational requests.

Total Workflow Automation: The Droid acts as a single point of contact for a wide range of services — from sample analysis and verdict management to deployment operations and incident investigation. It eliminates the constant cycle of context-switching and manual tasks.
Intelligent Data Analysis: With its BigQuery and Bigtable Agent (codenamed "Chronos"), the Droid turns complex databases into conversational tools. You can ask for detailed data analysis in plain language and get results that are easy to understand, complete with interactive visualizations.
Autonomous Root Cause Analysis: When incidents occur, the Droid doesn't just fetch data — it orchestrates a full forensic investigation across logs, metrics, alerts, and documentation to trace failures from symptom to source.
Secure & Compliant Operations: For sensitive tasks like managing API keys, the Droid enforces security through a sequential pipeline that requires JIRA ticket validation, explicit user confirmation, and managerial approval before any changes are made. Every action is audited.
Knowledge Synthesis: The Droid fuses information from internal documentation (Confluence), operational records (JIRA), team communications (Slack), and vector-based knowledge bases to provide comprehensive, cross-validated answers.
Interactive Visualizations: Query results and data analysis can be rendered as interactive line charts, bar charts, and pie charts directly in the chat interface, making insights immediately actionable.
GitOps Deployment Management: Full ArgoCD deployment operations — health monitoring, failure diagnosis, resource inspection, sync management — all from a single conversational interface.

A Deep Dive into WildFire Droid's Agent Arsenal

1. The WildFire Operational Agent: Mission Control

The Problem: Managing the WildFire malware analysis pipeline involves a wide range of operational tasks — submitting samples for analysis, checking and updating verdicts, fetching detailed threat reports, scheduling production maintenance windows, and triggering test automation suites. Each of these tasks traditionally requires navigating different systems, APIs, and interfaces.

The Solution: The WildFire Operational Agent is a unified operations specialist that handles the entire sample analysis lifecycle through a single conversational interface. It integrates with multiple backend services via a dedicated MCP server, providing:

Sample Submission — Submit files for analysis across any cloud region with configurable file types, submission types, and sample counts. The agent validates all parameters and confirms before execution.
Verdict Management — Fetch or update verdicts for any file hash. Supports benign, malware, grayware, phishing, and C2 classifications with force-override capabilities.
Threat Intelligence Reports — Fetch comprehensive analysis reports from the threat intelligence platform, summarizing key findings about static and dynamic analysis.
Maintenance Scheduling — Create and manage status pages for production maintenance windows across any cloud region.
Test Automation — Trigger sanity tests, smoke tests, and other test suites across multiple environments directly from the chat interface.

2. The BigData Agent ("Chronos"): Data Intelligence at Your Fingertips

The Problem: Cybersecurity products generate massive amounts of data stored across BigQuery and Bigtable in multiple cloud projects. Extracting meaningful insights traditionally requires proficiency in writing complex SQL queries and understanding the schema of dozens of tables. This creates a barrier for anyone who is not a data analyst and slows down critical investigations.

The Solution: Chronos is a data intelligence analyst that transforms natural language questions into optimized database queries. You can ask questions like "Show me all malware submissions from the last 30 days that originated from Europe" and the agent will automatically write and execute the correct query, interpret the results, and suggest visualizations for complex datasets.

Chronos understands the WildFire data topology — it knows which tables contain sample analysis data, submission logs, service endpoint configurations, and API usage metrics. It enforces mandatory partition filters for performance, protects sensitive data fields, and warns about long-running queries before executing them.
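
One way to picture the mandatory partition filter is a guard that runs before any generated query is executed. The sketch below is an assumption about how such a check could look, not Chronos's implementation: it is a naive textual test rather than a SQL parser, and the table and column names are hypothetical.

```python
import re


def has_partition_filter(sql: str, partition_column: str) -> bool:
    """Return True if the text after WHERE mentions the partition column."""
    where = re.search(r"\bwhere\b(.*)", sql, flags=re.IGNORECASE | re.DOTALL)
    if where is None:
        return False
    column = rf"\b{re.escape(partition_column)}\b"
    return re.search(column, where.group(1), flags=re.IGNORECASE) is not None


def require_partition_filter(sql: str, partition_column: str) -> str:
    """Refuse to hand a query to the database unless it filters on the partition."""
    if not has_partition_filter(sql, partition_column):
        raise ValueError(f"query must filter on partition column '{partition_column}'")
    return sql


# Hypothetical table and column names, for illustration only.
query = (
    "SELECT sha256, verdict FROM wildfire.submissions "
    "WHERE submit_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)"
)
require_partition_filter(query, "submit_date")  # passes silently
```

A production guard would more likely inspect the parsed query or rely on the table's own partition-filter requirement; the textual check only shows where the rule sits in the flow.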

3. The Root Cause Analysis Agent: The Autonomous Detective

The Problem: When a production alert fires at 3 AM, the on-call engineer faces a daunting task: understand the alert, recall the system architecture, find the right logs, trace the failure upstream through service dependencies, and identify the root cause — all while under pressure. This process typically involves opening the incident management platform, consulting documentation, querying multiple log systems, checking cloud platform metrics, and mentally tracing the error chain. Each step requires a different tool, different query syntax, and different mental models.

The Solution: The Root Cause Analysis (RCA) Agent is an autonomous investigation orchestrator that coordinates a team of sub-agents to perform deep forensic analysis, following a methodology inspired by how senior SREs actually debug production issues:

Phase 1: Parallel Context Gathering — The agent simultaneously fetches alert details from PagerDuty and system architecture documentation from the knowledge base. These are independent information sources, so they can be queried in parallel — saving valuable time.

Phase 2: Iterative Log Investigation — Rather than making a single log query and hoping for the best, the agent performs iterative investigation:

First query: Broad context around the failure timestamp
Second query: Error-focused filtering (exceptions, 5xx errors, timeouts)
Third query: Temporal expansion to understand what led to the failure
Fourth query: Pattern analysis with different keywords

Phase 3: Upstream Chain Tracing — This is where the real intelligence shines. When Component A's logs show "error calling Component B," the agent automatically queries Component B's logs. If B's logs show "error calling Component C," it traces further upstream. It continues until it finds a component failing internally with no upstream cause — the true root cause.

Phase 4: Synthesis — The agent constructs a complete causal chain with timestamps, log evidence, and a confidence level (High/Medium/Low) based on evidence quality.
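
The upstream tracing in Phase 3 is essentially a walk along service dependencies that stops at the first component with no upstream error. A minimal sketch, assuming logs are already fetched into a dictionary and that upstream failures appear in the hypothetical form "error calling <component>":

```python
import re
from typing import Dict, List, Optional

# Hypothetical log convention: an upstream failure is reported as
# "error calling <component>". Real logs would need per-service patterns.
UPSTREAM_PATTERN = re.compile(r"error calling ([\w-]+)", re.IGNORECASE)


def find_upstream(log_lines: List[str]) -> Optional[str]:
    """Return the first upstream component blamed in the logs, if any."""
    for line in log_lines:
        match = UPSTREAM_PATTERN.search(line)
        if match:
            return match.group(1)
    return None


def trace_root_cause(start: str, logs: Dict[str, List[str]], max_hops: int = 10) -> List[str]:
    """Follow "error calling X" references until a component fails on its own."""
    chain = [start]
    current = start
    for _ in range(max_hops):
        upstream = find_upstream(logs.get(current, []))
        if upstream is None or upstream in chain:
            break  # internal failure found, or a dependency cycle
        chain.append(upstream)
        current = upstream
    return chain  # the last element is the suspected root cause
```

With logs where `api` blames `queue` and `queue` blames `storage`, the function returns the chain `api`, `queue`, `storage`. The `max_hops` bound and the cycle check keep the walk finite even when services blame each other.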

The RCA Agent coordinates three specialized sub-agents:

PagerDuty Agent — Fetches incidents with intelligent time-window management, supports keyword search, and handles SLA-specific date logic.
Grafana Agent — Performs hypothesis-driven log forensics through Grafana Loki, constructing LogQL queries across multi-region infrastructure.
Google Cloud Agent — Queries GCP Cloud Logging, metrics, and traces using Google's official observability MCP servers, providing infrastructure-level evidence.

The result? An investigation that might take an engineer 30+ minutes can be completed in a single conversation.

4. The Hybrid RAG Agent (K-ARCH): The Knowledge Architect

The Problem: Teams need two types of information: specific, documented knowledge about internal operations, and contextual information scattered across communication channels and project management tools. Finding the right answer often requires searching multiple sources and synthesizing the results.

The Solution: The Hybrid RAG (Retrieval-Augmented Generation) Agent is a multi-source knowledge engine that strategically fuses content from:

Confluence Integration — Searches and retrieves documentation from the team's Confluence space via a dedicated MCP server, understanding page structure, labels, and cross-references.
Vector Knowledge Base — Performs semantic search against a vectorized knowledge base of documentation, understanding meaning and context rather than just keywords.
Slack Search — Searches team communication channels for operational context, recent discussions, and ground-truth information.
JIRA Integration — When documents reference JIRA tickets, the agent autonomously follows those references to fetch actual ticket details for complete context.

What makes this agent truly powerful is its multi-source synthesis capability:

Information Chaining — Findings from one source become input to query another. A Confluence document mentioning a JIRA ticket triggers an automatic ticket fetch.
Cross-Validation — Critical claims are verified across multiple independent sources. The agent states confidence levels based on source agreement.
Adaptive Retrieval — If initial queries fail, the agent reasons about why and adjusts strategy — different keywords, different source, broader scope.
Contradiction Resolution — When formal documentation contradicts recent Slack discussions, the agent flags the discrepancy and reasons about source authority and recency.
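
Information chaining can be sketched as "extract references from one source, then look them up in another". The example below is an illustration under assumptions: the regular expression follows the common PROJECT-123 shape of JIRA issue keys, and `fetch_ticket` is a hypothetical callable standing in for the JIRA integration.

```python
import re
from typing import Callable, Dict, List

# Common shape of a JIRA issue key: uppercase project prefix, dash, number.
# This naive pattern also matches look-alikes such as "UTF-8".
JIRA_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def extract_ticket_keys(text: str) -> List[str]:
    """Return the ticket keys mentioned in a document, in order, without repeats."""
    keys: List[str] = []
    for key in JIRA_KEY.findall(text):
        if key not in keys:
            keys.append(key)
    return keys


def chain_lookup(document_text: str, fetch_ticket: Callable[[str], dict]) -> Dict[str, dict]:
    """Use findings from one source (a document) as queries against another (tickets)."""
    return {key: fetch_ticket(key) for key in extract_ticket_keys(document_text)}
```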

5. The API Key Agent: Secure Lifecycle Management

The Problem: Managing API keys for a global cloud security platform is a sensitive, multi-step process that requires strict validation, authorization, and audit trails. Manual processes are error-prone and difficult to audit.

The Solution: The API Key Agent transforms this complex workflow into a secure, conversational process using a Sequential Pipeline architecture — a chain of specialized sub-agents that must complete in strict order:

JIRA Fetch Agent — Retrieves the associated JIRA ticket details.
Request Validation Agent — Validates that the request parameters match the ticket, checks approval status, and verifies authorization.
Execution Agent — Performs the actual API key creation or update.
Audit Logging Agent — Logs the complete operation details to a persistent audit database.

For read operations (fetching API key details), the agent provides instant clarity — highlighting issues like invalid keys, expired dates, or misconfigured fields. Every write operation requires a valid JIRA ticket, explicit user confirmation, and managerial approval. Safety is structural, not just suggested.
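
The strict ordering of the pipeline can be shown with a short sketch in which each step either succeeds or stops everything after it. The ticket store, field names, and step functions are hypothetical; only the fetch, validate, execute, audit order comes from the description above.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical in-memory ticket store, for illustration only.
TICKETS = {"WF-101": {"status": "Approved", "customer": "acme"}}


class StepFailed(Exception):
    """Raised by a step to halt the pipeline."""


@dataclass
class PipelineContext:
    request: Dict[str, Any]
    ticket: Dict[str, Any] = field(default_factory=dict)
    result: str = ""
    error: str = ""
    audit_log: List[str] = field(default_factory=list)


def fetch_ticket(ctx: PipelineContext) -> None:
    ticket = TICKETS.get(ctx.request["ticket_key"])
    if ticket is None:
        raise StepFailed("ticket not found")
    ctx.ticket = ticket


def validate_request(ctx: PipelineContext) -> None:
    if ctx.ticket["status"] != "Approved":
        raise StepFailed("ticket is not approved")
    if ctx.ticket["customer"] != ctx.request["customer"]:
        raise StepFailed("request does not match ticket")
    if not ctx.request.get("user_confirmed"):
        raise StepFailed("user confirmation missing")


def execute(ctx: PipelineContext) -> None:
    ctx.result = f"api key created for {ctx.request['customer']}"


def audit(ctx: PipelineContext) -> None:
    ctx.audit_log.append(f"{ctx.request['ticket_key']}: {ctx.result}")


def run_pipeline(ctx: PipelineContext, steps: List[Callable[[PipelineContext], None]]) -> bool:
    """Run steps in order; the first failure prevents every later step."""
    for step in steps:
        try:
            step(ctx)
        except StepFailed as exc:
            ctx.error = f"{step.__name__}: {exc}"
            return False
    return True


STEPS = [fetch_ticket, validate_request, execute, audit]
```

Because `execute` sits behind `validate_request` in the list, there is no code path that creates a key for an unapproved or unconfirmed request.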

6. The Slack Agent: Communication Intelligence

The Problem: Critical information is scattered across communication channels. On-call incidents, deployment updates, and product launches are buried under a constant stream of messages, making it nearly impossible to get a quick, accurate summary.

The Solution: The Slack Agent transforms communication channels into a source of structured operational intelligence:

Automated Communication — Posts formatted messages to team channels using proper Slack markup, with mandatory user confirmation before any message is sent.
Intelligent Analysis — When tasked with summarizing a specific period, the agent autonomously calculates date ranges, fetches message history, and produces structured reports covering:
- Production issues and their resolutions
- On-call incidents with team involvement and outcomes
- Deployment updates and associated issues
- New feature releases and their impact
- Team achievements and positive feedback

The agent also supports user identity resolution, threaded replies, and keyword-based message search across authorized channels.

7. The ArgoCD Agent: GitOps Deployment Operations

The Problem: Managing deployments across multiple environments and ArgoCD projects requires navigating complex Kubernetes resource hierarchies, understanding sync states, and diagnosing deployment failures — all through separate interfaces for each environment.

The Solution: The ArgoCD Agent is a deployment operations specialist that autonomously manages and monitors GitOps application deployments. It connects to multiple ArgoCD environments, navigates project hierarchies, and provides:

Health Monitoring — Lists applications with health and sync status across all environments.
Deployment Diagnosis — When something is unhealthy, it reasons through the failure chain: Health Degraded → Which resources? → What status? → Check events for crash reasons.
Resource Inspection — Retrieves live Kubernetes manifests, resource trees, managed resources, and application events.
Sync Management — Supports sync operations with dry-run previews and safe defaults.
Cross-Environment Comparison — Systematically iterates through environments to compare deployment states.

Safety is built in: sync previews use dry-run by default, auto-sync changes default to safe settings, and all modifications require explicit user confirmation.
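
The diagnosis chain (health degraded, which resources, what status, which events) can be sketched over a simplified application snapshot. The dictionary shape below is a hypothetical simplification and not the ArgoCD API response format; only the health status names such as "Healthy" and "Degraded" follow ArgoCD's vocabulary.

```python
from typing import Any, Dict, List


def diagnose(app: Dict[str, Any]) -> List[str]:
    """Walk from application health to unhealthy resources to their events."""
    if app["health"] == "Healthy":
        return []
    findings = []
    for resource in app["resources"]:
        if resource["health"] == "Healthy":
            continue
        # Collect event messages that mention this resource.
        reasons = [e["message"] for e in app["events"] if e["resource"] == resource["name"]]
        detail = "; ".join(reasons) or "no events found"
        findings.append(f"{resource['kind']}/{resource['name']} is {resource['health']}: {detail}")
    return findings


# Hypothetical snapshot, for illustration only.
snapshot = {
    "name": "wf-api",
    "health": "Degraded",
    "resources": [
        {"kind": "Deployment", "name": "wf-api", "health": "Degraded"},
        {"kind": "Service", "name": "wf-svc", "health": "Healthy"},
    ],
    "events": [{"resource": "wf-api", "message": "Back-off restarting failed container"}],
}
```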

8. The JIRA Agent: Project Management Intelligence

The Problem: Teams constantly need to interact with JIRA — viewing ticket details, creating issues, adding comments, and validating tickets for approval workflows.

The Solution: The JIRA Agent provides comprehensive JIRA lifecycle management through natural language. It can fetch detailed ticket information, create new issues (with project-specific field validation), add formatted comments, and validate tickets for approval workflows. All agent-created tickets are automatically labeled for traceability, and every write operation requires explicit user confirmation.

9. The Visualization Agent: Data Storytelling

The Problem: Raw data and query results are difficult to interpret quickly. Teams need visual representations to spot trends, compare metrics, and communicate findings effectively.

The Solution: The Visualization Agent is a data storytelling specialist that analyzes data semantics and selects the optimal chart type to communicate insights. It generates interactive charts — line charts for trends, pie charts for composition, and bar charts for comparisons — that render directly in the chat interface. The agent reasons about which visualization type best communicates the insight, considering data semantics, the primary insight hierarchy, and cognitive load. Multi-series support allows overlaying multiple data series on a single chart for richer analysis.
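
The mapping from data shape to chart type (trends to line, composition to pie, comparisons to bar) can be approximated by a small heuristic. This is only a rule-based stand-in for the agent's reasoning; the function name, the `part_of_whole` flag, and the slice limit are assumptions made for illustration.

```python
from datetime import date
from typing import Sequence


def choose_chart(x_values: Sequence[object], part_of_whole: bool = False,
                 max_pie_slices: int = 6) -> str:
    """Pick a chart type from the shape of the data."""
    # Dates on the x-axis suggest a trend over time (datetime is a date subclass).
    if x_values and all(isinstance(x, date) for x in x_values):
        return "line"
    # A handful of categories that sum to a whole reads well as a pie.
    if part_of_whole and len(x_values) <= max_pie_slices:
        return "pie"
    # Everything else is treated as a comparison across categories.
    return "bar"
```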

The Technical Innovations That Power It All

MCP Servers: A Modular Tool Architecture

A key architectural innovation is the use of MCP (Model Context Protocol) servers — lightweight, purpose-built tool servers that expose domain-specific capabilities to agents. The Droid runs four custom MCP servers:

WildFire Operations MCP — Sample submission, verdict management, maintenance scheduling, threat reports, and test automation — all with strongly-typed, validated parameters and enum constraints.
ArgoCD Operations MCP — A comprehensive GitOps management interface with 15+ tools covering environment management, application lifecycle, resource inspection, sync operations, and deployment history.
Confluence MCP — Structured access to the team's knowledge base with search, page retrieval, HTML-to-text parsing, and link extraction.
Grafana Loki MCP — Log forensics across multi-region Loki datasources with LogQL query execution, datasource discovery, and label exploration.

Additionally, the Droid integrates with Google's official GCP Observability MCP and gcloud MCP servers, demonstrating how the MCP ecosystem enables rapid capability expansion.

The MCP architecture provides strong typing, process isolation, standardized request/response patterns, and extensibility — new tools can be added without modifying any existing agent code.
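
Strongly-typed parameters with enum constraints might look like the following sketch. It uses only the standard library and does not show any MCP SDK; the class and field names are hypothetical, the verdict values are the five classifications listed earlier, and the assumption that file hashes are SHA-256 digests is ours.

```python
import re
from dataclasses import dataclass
from enum import Enum


class Verdict(str, Enum):
    BENIGN = "benign"
    MALWARE = "malware"
    GRAYWARE = "grayware"
    PHISHING = "phishing"
    C2 = "c2"


# Assumption: file hashes are SHA-256 digests written as 64 hex characters.
SHA256 = re.compile(r"[0-9a-fA-F]{64}")


@dataclass(frozen=True)
class UpdateVerdictParams:
    file_hash: str
    verdict: Verdict
    force: bool = False

    def __post_init__(self) -> None:
        if not SHA256.fullmatch(self.file_hash):
            raise ValueError("file_hash must be a 64-character hex SHA-256 digest")
        # Coerce plain strings such as "malware" into the enum.
        # Unknown values raise ValueError before any backend call is made.
        object.__setattr__(self, "verdict", Verdict(self.verdict))
```

Rejecting a bad value at the tool boundary means an agent that hallucinates a verdict such as "virus" gets an immediate, explicit error instead of a malformed backend request.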

In-Memory RAG: Intelligent Context Compression

When agents retrieve large volumes of data — thousands of log lines, extensive ArgoCD resource trees, or lengthy Confluence documents — the raw content can easily exceed what an LLM can effectively reason about. Simply truncating the data loses critical information. Sending it all wastes tokens and degrades reasoning quality.

The Droid solves this with an in-memory RAG pipeline that acts as an intelligent filter between tool responses and the LLM:

Content Extraction — Raw tool responses (often nested JSON) are parsed into clean text content.
Log Deduplication — For log data, the Drain3 algorithm identifies unique log patterns and collapses repeated entries. If 500 log lines follow the same pattern, they're reduced to a single representative line with a count — often reducing volume by 90%+.
Token Estimation — Content size is measured against configurable thresholds.
Vector Indexing — When content exceeds the threshold, it's chunked, embedded using Vertex AI embeddings, and indexed in an in-memory vector store.
Semantic Search — The most relevant chunks are retrieved based on semantic similarity to the user's original query.

This pipeline is used across multiple agents — each with tuned thresholds appropriate to their data volumes. The result: agents always work with the most relevant information, regardless of how much raw data their tools return.
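
The stages above can be sketched end to end with the standard library. Two substitutions are made and should be read as assumptions: digit masking stands in for Drain3 (a library implementing the Drain log-template mining algorithm), and bag-of-words cosine similarity stands in for Vertex AI embeddings and the vector store. The four-characters-per-token estimate, the thresholds, and all function names are also illustrative.

```python
import math
import re
from collections import Counter
from typing import List


def mask_variables(line: str) -> str:
    """Crude template extraction: treat every run of digits as a variable."""
    return re.sub(r"\d+", "<NUM>", line)


def deduplicate_logs(lines: List[str]) -> List[str]:
    """Collapse lines sharing a template into one representative with a count."""
    counts: Counter = Counter()
    first_seen = {}
    for line in lines:
        template = mask_variables(line)
        counts[template] += 1
        first_seen.setdefault(template, line)
    return [f"{first_seen[t]} (x{n})" if n > 1 else first_seen[t] for t, n in counts.items()]


def estimate_tokens(text: str) -> int:
    """Rough heuristic: about four characters per token."""
    return len(text) // 4


def chunk(lines: List[str], size: int) -> List[str]:
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def embed(text: str) -> Counter:
    """Bag-of-words vector, a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def compress(lines: List[str], query: str, token_threshold: int = 50,
             top_k: int = 1, chunk_size: int = 2) -> str:
    """Deduplicate, then retrieve only the most relevant chunks if still too large."""
    deduped = deduplicate_logs(lines)
    text = "\n".join(deduped)
    if estimate_tokens(text) <= token_threshold:
        return text  # small enough to pass to the LLM unchanged
    query_vec = embed(query)
    ranked = sorted(chunk(deduped, chunk_size),
                    key=lambda c: cosine(embed(c), query_vec), reverse=True)
    return "\n".join(ranked[:top_k])
```

In this sketch, 500 lines of the form "request N served in N ms" collapse to a single line with a count of 500, which mirrors the deduplication step described above.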

Enterprise-Grade Security

The WildFire Droid is built with enterprise security as a first-class concern:

SAML Authentication — All users authenticate through SSO via SAML 2.0, with session management and automatic redirect for unauthenticated requests.
Role-Based Access — Sensitive operations require JIRA-based approval workflows with designated approvers.
Audit Logging — All API key operations are logged to a persistent audit database with complete operation details.
Data Protection — Sensitive fields are never exposed in query results; only hashed identifiers are used.

Modern Chat Interface

The frontend is a Next.js application with a modern, responsive chat interface featuring:

Server-Sent Events (SSE) streaming for real-time agent responses
Interactive chart rendering (line, bar, and pie charts)
Animated markdown rendering with syntax highlighting
Session history with the ability to resume previous conversations
Document download capabilities for exported data
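
For readers unfamiliar with SSE, the wire format is simple: each event is a block of "field: value" lines terminated by a blank line. The sketch below shows how a backend could frame streamed agent text as SSE events; the event names "message" and "done" and the JSON payload shape are assumptions, not the Droid's actual protocol.

```python
import json
from typing import Iterable, Iterator, Optional


def sse_event(data: str, event: Optional[str] = None) -> str:
    """Frame one Server-Sent Event; multi-line data becomes several data lines."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    for part in data.splitlines() or [""]:
        lines.append(f"data: {part}")
    return "\n".join(lines) + "\n\n"  # a blank line ends the event


def stream_agent_response(chunks: Iterable[str]) -> Iterator[str]:
    """Yield each partial agent response as an event, then a final marker."""
    for text in chunks:
        yield sse_event(json.dumps({"text": text}), event="message")
    yield sse_event("{}", event="done")
```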

The Impact: From Manual Grind to AI-Powered Productivity

The deployment and performance of WildFire Droid have demonstrated a clear shift from fragmented, manual workflows to a streamlined, conversational approach. This is more than a technology demonstration — it's a fundamental change in how teams work. By turning multi-step processes into simple conversations, WildFire Droid empowers teams to:

Focus on What Matters — Analysts can dedicate more time to critical threat analysis rather than mundane administrative tasks.
Move with Speed and Accuracy — On-call engineers can investigate incidents, correlate evidence across multiple systems, and identify root causes faster and more reliably.
Democratize Data Access — Team members who aren't SQL experts can now query complex databases and get meaningful insights through natural language.
Ensure Compliance — Automated approval workflows and audit logging ensure every sensitive operation is documented and authorized.
Foster Innovation — This Agentic AI framework serves as a blueprint for building more intelligent, collaborative tools across the organization.

Future Roadmap

The WildFire Droid, in its current form, is a powerful leap forward. But it's also a foundational step in a much larger journey. Our roadmap is focused on transforming the Droid from a responsive assistant into a proactive partner:

Proactive & Predictive Intelligence — Developing capabilities that allow the Droid to anticipate user needs, automate routine tasks without being asked, and proactively flag potential issues based on real-time data analysis.
Broader Ecosystem Integration — Expanding the Droid's reach by integrating with more internal products and systems, enabling even more complex cross-platform workflows.
Learning from Investigations — Building a knowledge base from past RCA investigations to accelerate future ones.
Continuous Learning & Optimization — Building feedback loops where the Droid learns from every interaction, continuously fine-tuning its reasoning and tool use.
Expanded MCP Ecosystem — Integrating with more external MCP servers as the ecosystem grows.
Enhanced Customization — Providing tools for teams to build and personalize their own agent-driven workflows for unique challenges.

Conclusion

The WildFire Droid represents a significant paradigm shift in how we approach cybersecurity operations. By leveraging an Agentic AI framework built on the Google Agent Development Kit, we are fundamentally redefining the relationship between our teams and their tools.

With its growing arsenal of specialized agents — from autonomous root cause analysis to secure API key management, from natural language data analytics to GitOps deployment operations — the WildFire Droid is not just a tool. It's a strategic advantage that allows security professionals to focus on the complex, cognitive challenges of threat intelligence and response.

The evolution from a responsive assistant to an autonomous operations partner demonstrates something important about Agentic AI: the real value isn't in any single agent or tool. It's in the orchestration — the ability to coordinate multiple specialized agents, each with deep domain expertise, to solve problems that no single agent could handle alone.

The WildFire Droid points to a future where intelligent automation empowers us to stay ahead of the evolving threat landscape with greater speed and precision.
