Implementing SLOs-as-Code: A Case Study

This article is a preview of a talk by Stephan Lips for SLOconf 2023, on May 15-18, 2023. To watch this talk and many more like it, register for free at sloconf.com.

By managing service level objectives (SLOs) as code, we can co-locate SLO definitions and ownership with the product code and team. This supports horizontal scaling of SLO ownership while establishing a single source of truth, adding transparency, integrating with the code management process, and creating an audit trail for SLOs.

When you measure the reliability of hundreds or even thousands of services across an enterprise, ownership of SLOs shouldn’t reside with a single team. Several aspects make scaling SLO ownership horizontally across teams much more efficient: a standardized SLO management platform, process architecture, and automation across the enterprise. Managing SLOs-as-code alongside product code is a step toward 360° product ownership and enables us to automate SLO updates via continuous integration.

Goals and Architecture

The larger goal behind the SLOs-as-code approach is to move toward 360° product ownership. At Procore, teams own their product’s code and processes, including aspects like testing, performance, reliability, automated deployment pipelines and deployment configurations. Adding SLO ownership to a team’s portfolio can also help improve product reliability.

Implementing SLOs-as-code involves three architectural components, each discussed in detail in its own section. In summary, these components are:

  • SLO objects and definitions: Standardized SLOs and related resources facilitate centralized management and automation.
  • Automation: SLOs as code are our single and reproducible source of truth, which we automate via continuous integration.
  • A common platform: Nobl9 is the SLO management platform upon which SLOs, related resources, and automation are built. It aggregates the various data streams that power SLIs.

SLO Objects and Definitions

Procore’s Observability team has designed our SLO-as-code approach to scale with Procore’s growing number of teams and services. Choosing YAML as the source of truth gives Procore an approach that scales across the company through centralized automation. Following the examples put forth by openslo.com and embracing a ubiquitous language like YAML helps avoid adding the complexities of Terraform for development teams and is easier to embed in every team’s directories. We used a GitOps approach to infrastructure-as-code (IaC) to create and maintain our Nobl9 resources.

The Nobl9 resources can be defined as YAML configuration (config) files. Specifically, one can declaratively define a resource’s properties in the config file and have a tool read and process that into a live and running resource. It’s important to draw a distinction between the resource and its configuration, as we’ll be discussing both throughout this article. All resources, from projects (the primary grouping of resources in Nobl9) to SLOs, can be defined by YAML.

Figure 2. A simple visualization of Nobl9 object relationships. A Project is a top-level object. Projects contain Services, SLOs are attached to Services, and SLOs can trigger alerts via project-scoped Alert Policies. Finally, Role Bindings define who has access to the Project contents.
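As an illustration of what such config files can look like, the sketch below writes a Project definition and a Service definition to a local file. The resource names (`example-project`, `example-service`), the output path, and the descriptions are hypothetical placeholders, and the field layout is an assumption based on Nobl9’s published `n9/v1alpha` YAML format, so it should be checked against the current Nobl9 documentation before use.

```shell
#!/bin/bash
# Sketch only: write two illustrative Nobl9 resource definitions to one file.
# All names and the output path are hypothetical placeholders.
set -euo pipefail

CONFIG_FILE="${CONFIG_FILE:-./example-nobl9.yaml}"

cat > "$CONFIG_FILE" <<'YAML'
# Project: the top-level grouping of resources.
apiVersion: n9/v1alpha
kind: Project
metadata:
  name: example-project
  displayName: Example Project
spec:
  description: Hypothetical project used for illustration
---
# Service: belongs to the project named in metadata.project.
apiVersion: n9/v1alpha
kind: Service
metadata:
  name: example-service
  displayName: Example Service
  project: example-project
spec:
  description: Hypothetical service used for illustration
YAML

echo "Wrote $CONFIG_FILE"
```

The Service points at its Project by name, which mirrors the containment relationship described above. A file like this is the kind of input that `sloctl apply -f` turns into live resources.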

Procore adopted a hybrid approach to organizing our Nobl9 configuration so that the observability team can review systemic changes while teams still own changes to their service SLOs. A separate central repository is the source of truth for all other Nobl9 configurations, such as Project and Alert Policy definitions. As projects have a one-to-many relationship with their services, it would quickly become a game of “guess where” if the project configuration were defined within one of its service repositories. The central repository is owned by the observability team, which lets us manage permissions by reviewing the pull requests that product teams submit for their non-SLO Nobl9 resources. Once requests are merged, our automation applies the changes. The SLO definitions are co-located with the service’s code. Teams self-regulate these resources to maintain their SLOs internally.

Automation

Automating the creation and modification of Nobl9 resources is essential to Procore: it makes iterating on the deployment process quick and painless for our engineering teams. Automation removes the human error and potential complications that usually come with manually applying config files via CLI tools.

With that in mind, the observability team created a CI job/workflow that our engineering teams copy into their project repo during their SLOs-as-code onboarding. Procore’s CI job uses the Nobl9 sloctl docker image, so we don’t have to install the sloctl CLI tool in our CI containers. We configured this job to apply only the configs that have been added or updated, which helps to future-proof our pipeline as we scale the number of our SLO configs.

Figure 3: SLOs-as-code Automation Workflow

This requires the following steps in the workflow:

  1. An engineer creates or modifies a config and commits the changes to GitHub.
  2. The engineer opens a PR, gets reviews and merges their changes into the main branch.
  3. CI picks up the changes in GitHub and kicks off the following steps:
    1. Using git diff and regex pattern matching, we figure out which configs have been added or modified in the latest merge to main:

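#!/bin/bash
# List files added or changed by the latest merge (--diff-filter=d excludes deletions); "|| :" keeps the step from failing when nothing matches.
ADDED_OR_CHANGED_CONFIGS=$(git diff --name-only --diff-filter=d HEAD^ HEAD | { grep -E "^${NOBL9_RESOURCE_DIR}/.*\.ya?ml$" || :; })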

    2. We loop over the added/modified configs and run sloctl apply using the nobl9/sloctl docker image, which updates our resources in Nobl9:

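#!/bin/bash
# Apply each added or changed config, reading one path per line so every file gets its own sloctl call.
# Assumes the CI setup makes the file and the Nobl9 credentials available inside the container.
while IFS= read -r file
do
  [ -n "$file" ] || continue  # skip the empty line produced when nothing changed
  docker run nobl9/sloctl:v0.0.80 apply -f "$file"
done <<< "$ADDED_OR_CHANGED_CONFIGS"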

    3. We use the same workflow in our centralized IaC repo that manages our project, service, role-binding and alert policy configs.
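The filter-and-apply steps above can be tried locally without touching Nobl9. The sketch below is an illustration under stated assumptions, not Procore’s actual CI job: the function names `filter_configs` and `apply_configs`, the `DRY_RUN` switch, the default directory `nobl9`, and the sample paths are all hypothetical. With `DRY_RUN=1` (the default here) it only prints the `docker run` command it would otherwise execute.

```shell
#!/bin/bash
# Sketch only: function names, DRY_RUN, and the default directory are hypothetical.
NOBL9_RESOURCE_DIR="${NOBL9_RESOURCE_DIR:-nobl9}"
DRY_RUN="${DRY_RUN:-1}"

# Keep only .yml/.yaml paths under the resource directory (one path per line on stdin).
# "|| :" makes an empty result a success instead of an error.
filter_configs() {
  grep -E "^${NOBL9_RESOURCE_DIR}/.*\.ya?ml$" || :
}

# Apply each path read from stdin, or print the command when DRY_RUN=1.
apply_configs() {
  local file
  while IFS= read -r file; do
    [ -n "$file" ] || continue
    if [ "$DRY_RUN" = "1" ]; then
      echo "would run: docker run nobl9/sloctl:v0.0.80 apply -f $file"
    else
      docker run nobl9/sloctl:v0.0.80 apply -f "$file"
    fi
  done
}

# Example with hard-coded paths standing in for `git diff --name-only` output.
printf '%s\n' "nobl9/slos/latency.yaml" "nobl9/project.yml" "src/app.rb" "nobl9/README.md" \
  | filter_configs | apply_configs
```

With the sample paths, only the two YAML files under `nobl9/` are selected; the source file and the README are ignored. Keeping the filter separate from the apply step makes the pattern easy to check on its own before a real `sloctl apply` is wired in.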

Common Platform

Our platform requires another key feature for SLOs-as-code to work: we need to be able to consolidate the different observability data sources and telemetry streams that power SLIs. This platform needs to support standardized data formats to define SLOs and related entities, such as alert policies for different conditions and trends, while allowing for the management of these formats via automation. We evaluated three vendors that offered SLO solutions, and only Nobl9 met all our requirements, which extended beyond the SLOs-as-code topic of this article.

Observations

Adopting SLOs-as-code can be impeded by three particular barriers to entry: new tooling (Nobl9 platform), data and format (SLO configs and resources) and process. To drive adoption, we have designed a self-service model supported by an open-door consulting practice. While this is an ongoing learning process for all involved, we’re seeing early signs of success from an approach that uses:

  • Detailed, step-by-step instructions, including code snippets and templates.
  • Screencast tutorials supplementing the instructions.
  • Initial face-to-face, white-glove meetings to explain concepts and provide demos. In our experience, the personal interaction and conversation created considerable enthusiasm and goodwill among our teams. It proved to be a much better approach than just pointing them to documentation.
  • Short feedback cycles with onboarding teams to improve instructions.

Quite frequently, teams stated that their main challenges were not the technical aspects of SLOs-as-code discussed in this article, but how to design and improve on meaningful SLOs. We addressed this by authoring articles like Black Box SLIs and other internal resources for guidance. While posts can be a useful tool, we found that working with teams and discussing how their product impacts the user experience was more helpful for modeling their SLOs using observability data.

Last but not least, we’re integrating with our embedded SRE (ESRE) team to help onboard more services. When a product team engages with ESRE on their deployment strategy and configuration, ESRE also discusses onboarding to SLOs-as-code and basic SLOs, such as error rate or duration.

Devin Cunningham and Justin Hoang co-authored this text.