This blog was written by Peter Kirubakaran N.
In the realm of SaaS, understanding and effectively implementing Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) is pivotal to improving the reliability and performance of services. In this blog post, we look at these key concepts and explore how Core Infrastructure and Platform Engineering (CIPE) aims to improve service reliability with its SLO as a Service (SLOaaS) offering, part of the Garuda platform.
In essence, an SLI is the metric being measured, an SLO is the target set for that SLI, and an SLA is the overall agreement, which includes SLOs and other terms defining the relationship between the service provider and the customer. These concepts are central to managing and improving the reliability and performance of services.
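To make the relationship between the three terms concrete, here is a minimal Python sketch. The event counts, the 99.9% and 99.5% targets, and all names (`ServiceLevel`, `availability_sli`, `evaluate`) are illustrative assumptions and are not part of Garuda or SLOaaS. It also assumes, as is common practice, that the contractual SLA is looser than the internal SLO.

```python
from dataclasses import dataclass


@dataclass
class ServiceLevel:
    slo_target: float  # internal objective, in percent
    sla_target: float  # contractual commitment, in percent (assumed looser than the SLO)


def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: percentage of events that succeeded."""
    if total_events == 0:
        return 100.0  # assumption: no traffic counts as fully available
    return 100.0 * good_events / total_events


def evaluate(sli: float, level: ServiceLevel) -> str:
    """Compare a measured SLI against the SLO and the SLA."""
    if sli < level.sla_target:
        return "sla_breached"
    if sli < level.slo_target:
        return "slo_missed"
    return "ok"


# Illustrative numbers only
level = ServiceLevel(slo_target=99.9, sla_target=99.5)
sli = availability_sli(good_events=998_000, total_events=1_000_000)
status = evaluate(sli, level)
print(f"SLI={sli:.2f}% status={status}")
```

Keeping the SLO stricter than the SLA in this sketch means a missed SLO acts as an early warning before the customer-facing agreement is at risk.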
The ubiquity of SLIs and SLI-based alerts can lead to alert fatigue, obscuring the alerts that genuinely impact customers. For instance, a Pod in a crash loop while 100+ other replicas remain active is very different from a database write replica in a crash loop that is impacting ingestion. The contrast shows why alerts should be aligned with customer impact.
SLOs come in various types, and identifying the right fit is key to getting good results. Let's look at some commonly used SLO types.
It is best to start by defining SLOs for a product's customer-facing workflows.
In the case of Garuda, or any observability platform, some of the customer-facing workflows will be:
There could be various moving components within these, but defining the SLOs for these components, and the overall SLA based on them, is a good place to start.
Now that we have covered the key terminology, how to identify SLOs, and the difficulties with SLOs, let's look at the SLO as a Service offering from CIPE, which aids in SLO creation while also helping to minimize the pitfalls of SLOs.
Define the identified metric as YAML or via the UI in Space Portal.
slo_name: "requests-availability"
objective: 99.9
description: "Common SLO based on availability for HTTP request"
labels:
  category: availability
sli:
  events:
    error_query: sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[{{.window}}]))
    total_query: sum(rate(http_request_duration_seconds_count{job="myservice"}[{{.window}}]))
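As a rough illustration of how the two queries in a definition like this could be combined, the Python sketch below fills in the `{{.window}}` placeholder and compares the resulting availability with the objective. How SLOaaS actually evaluates the definition is not described in this post, so the helper names, the `30d` window, and the sample query results are all assumptions.

```python
ERROR_QUERY = 'sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[{{.window}}]))'
TOTAL_QUERY = 'sum(rate(http_request_duration_seconds_count{job="myservice"}[{{.window}}]))'


def render(query: str, window: str) -> str:
    """Substitute the window placeholder (hypothetical helper)."""
    return query.replace("{{.window}}", window)


def error_ratio_expr(window: str) -> str:
    """Build a PromQL expression for the ratio of bad events to all events."""
    return f"({render(ERROR_QUERY, window)}) / ({render(TOTAL_QUERY, window)})"


def objective_met(error_rate: float, total_rate: float, objective: float) -> bool:
    """True when availability (in percent) is at or above the objective."""
    if total_rate == 0:
        return True  # assumption: no traffic does not violate the SLO
    availability = 100.0 * (1.0 - error_rate / total_rate)
    return availability >= objective


expr = error_ratio_expr("30d")  # assumed window, not taken from the post

# Sample values standing in for the results of the two queries
met = objective_met(error_rate=0.5, total_rate=1000.0, objective=99.9)
print(expr)
print("objective met:", met)
```

Expressing the SLI as a ratio of error events to total events keeps the definition independent of traffic volume, so the same objective applies at any request rate.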
Garuda’s SLO Dashboard
Automatic notification based on SLO
In conclusion, the implementation of SLOs through CIPE's SLOaaS is a powerful tool for tracking product stability and reducing alert fatigue. It not only provides a comprehensive view of service reliability but also equips decision-makers with the data needed to make informed choices regarding new feature introductions to the product. As service reliability becomes increasingly critical, embracing SLOaaS is a strategic move towards ensuring seamless and reliable user experiences.