- Access exclusive content
- Connect with peers
- Share your expertise
- Find support resources
Let's take a look at our existing architecture and the challenges we've encountered with it.
Our observability platform, Garuda, is deployed in a kubernetes cluster using Istio for service mesh and its ingress gateway for managing north-south traffic.
Tenant’s observability data from agents ends up on vector pipelines (via Istio’s ingress gateway), where transformations using vector remap language (VRL) happens, and they are further pushed to respective sinks; Grafana Loki for logs, Grafana Mimir for metrics. Traces are currently being pushed directly to the gRPC endpoint (open telemetry protocol), exposed by Grafana Tempo
However, there were several challenges with this setup:
Scaling hardware is not a solution; we cannot keep scaling the platform and agents to support synchronous ingestion. Our primary aim was to ensure that once valid data reaches our system, we try our best to ensure it reaches its intended destination successfully.
To address these challenges, it was decided to decouple the flow via a message broker system where the publishers simply push the received data to a queue, and consumers take care of consuming the data and sending it over to its relevant destination. Several factors were considered while building this and the following steps were taken:
By implementing these changes, and with the same amount of resources allocated for Mimir and Loki, event drops were reduced to zero in normal conditions and tenants observed no data loss post moving to this architecture. And if there is some outage / fluctuation in the destination, the payload simply gets piled up in NATS JetStream and is ingested as soon as the destination is back up. Here’s a snap of when there was a small fluctuation in our systems:
While there was a problem at ingestion due to restart of Loki ingesters, the messages got piled up at NATS and as soon as all Loki ingesters came up, all the pending messages were successfully consumed, thus helping us avoid any data loss during the period.
Additionally, there is more control over the pipeline in terms of handling different tenants, and reduced dependency on an external tool for adding any feature if required. Even the tenant side agents have reduced memory usage due to reduced in-memory pile up as the platform now has a much higher ingestion rate.
Overall, the observability ingestion pipeline is now more performant, resilient, and reliable than ever before.
With an ingestion at a rate of ~5.1 million metrics/s and ~420k log lines/s, here’s a snap of what kubectl top shows for a 3-node nats cluster.
Reference Metric: cortex_distributor_received_samples_total
In terms of handling throughput, an increase of 50% is observed for metric pushed to Mimir
Reference Metric: loki_distributor_lines_received_total
In terms of handling throughput, an increase of 225% is observed for log lines pushed to Loki.