Best practices for Grafana SLOs


Because SLOs are still a relatively new practice, creating your first SLOs can feel overwhelming. To simplify things, this page provides some best practices for SLOs and SLO queries.

What is a good SLO?

A Service Level Objective (SLO) defines specific, measurable targets that represent the quality of service a provider delivers to its users. The best place to start is with the level of service your customers expect. Sometimes these expectations are written into formal service level agreements (SLAs) with customers, and sometimes they are implicit in customers’ expectations for a service.

Good SLOs are simple. Don’t use every metric you can track as an SLI; choose the ones that really matter to the consumers of your service. If you choose too many, it’ll make it hard to pay attention to the ones that matter.

A good SLO is attainable, not aspirational

Start with a realistic target. Unrealistic goals create unnecessary frustration which can then eclipse useful feedback from the SLO. Remember, this is meant to be achievable and it is meant to reflect the user experience. An SLO is not an OKR.

It’s also important to make your SLO simple and understandable. The most effective SLOs are the ones that are readable for all stakeholders.

Target services with good traffic

Too little traffic is insufficient for monitoring trends: in low-traffic environments, irregularities are reflected disproportionately and can cause noisy alerts. Conversely, too much traffic can mask customer-specific issues.

Team alignment

Teams should be the ones to create SLOs and SLIs, not managers. Your SLOs should give you feedback about your services and the customer experience with them, so it’s best for the team to work together to create them.

Embed SLO review in team rituals

As you work with SLOs, the information they provide can help guide decision-making because they add context and correlate patterns. This can help when there’s a need to balance reliability and feature velocity. Early on, it’s good practice for teams to review SLOs at regular intervals.

Iterate and adjust

Once SLO review is part of your team rituals, it’s important to iterate on the information you gather so you can make increasingly informed decisions.

As you learn more from your SLOs, you may find that your assumptions don’t reflect practical reality. In the early period of SLO implementation, you may discover a number of factors you hadn’t previously considered. If you have a lot of error budget left over, you can adjust your objectives accordingly. For example, a 99.9% objective over 30 days allows an error budget of roughly 43 minutes of downtime; if you barely spend any of it month after month, the objective may be too loose.

Alerts and labels

SLO alerts are different from typical data source alerts. Because SLO alerts tell you when a trend in your burn rate needs attention, it’s important to understand how to set up and balance fast-burn and slow-burn alerts so they keep you informed without inducing alert fatigue.
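One common way to structure this is the multiwindow, multi-burn-rate pattern from Google’s SRE Workbook: alert only when the error ratio over both a long and a short window exceeds a multiple of the error budget rate. Grafana SLOs generate burn-rate alert rules for you, so the PromQL below is only a sketch of the idea; it assumes a 99.9% availability target and hypothetical recording rules named job:slo_errors_per_request:ratio_rate1h and job:slo_errors_per_request:ratio_rate5m.

    # Fast burn (page): error ratio over the last 1h AND the last 5m both
    # exceed 14.4x the error budget rate (0.001 for a 99.9% target).
    # Rule names are hypothetical; substitute your own recording rules.
    job:slo_errors_per_request:ratio_rate1h > (14.4 * 0.001)
      and
    job:slo_errors_per_request:ratio_rate5m > (14.4 * 0.001)

    # Slow burn (ticket): the same shape with longer windows
    # (e.g., 6h and 30m) and a lower factor (e.g., 6x).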

Prioritize your alerts

Have your alerts routed first to designated individuals to validate your SLI. Send notifications to designated engineers through OnCall or your main escalation channel when fast-burn alerts fire so that the appropriate people can quickly respond to possible pressing issues. Send group notifications for slow-burn alerts to analyze and respond to as a team during normal working hours.

Use labels

Set up good label practices. Keep them limited to make them navigable and consumable for triage.

Grafana SLOs use two label types: SLO labels and Alert labels. SLO labels are for grouping and filtering SLOs. Alert labels are added to slow and fast burn alerts and are used to route notifications and add metadata to alerts.

Query tips and pitfalls

There are many approaches to configuring your SLO queries, and the right one ultimately depends on your needs. Just remember: if you don’t have metrics that represent your users’ experience, you need new metrics.

Keep queries simple

The best SLIs are based on Prometheus counter metrics (that is, monotonically increasing series) and use labels to encode each counted event as either a success or a failure (for example: requests_total{code="200"}). If your metrics don’t look like this, it’s usually better to reinstrument your service with well-suited metrics than to try to work around the issue with complex SLI query definitions.

Availability and latency are the most common SLOs to start with for request-driven services. For example (a rate-based form is sketched after this list):

  • Availability (non-5xx responses): requests_total{code!~"5.."} / requests_total
  • Latency (less than 1 second): requests_duration_seconds_bucket{code!~"5..", le="1.0"} / requests_duration_seconds_count{code!~"5.."}
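
These ratios are shorthand: in a real SLI, each side is a rate of the counter over a time window. A minimal sketch of the availability ratio written out in PromQL, using the same metric names and Grafana’s $__rate_interval variable for the rate window:

    # Fraction of requests that did not return a 5xx status code.
    sum(rate(requests_total{code!~"5.."}[$__rate_interval]))
    /
    sum(rate(requests_total[$__rate_interval]))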

Freshness is a common SLO for message queues or batch processes where you want to ensure that each item (perhaps after several retries) gets completed before the work request grows too stale.

  • Freshness (work spent less than 120 sec in queue): completed_duration_seconds_bucket{le="120"} / completed_duration_seconds_count
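
Written out the same way, freshness is a histogram-bucket ratio: items completed within the threshold over all completed items (again assuming the metric names above):

    # Fraction of completed items that spent 120s or less in the queue.
    sum(rate(completed_duration_seconds_bucket{le="120"}[$__rate_interval]))
    /
    sum(rate(completed_duration_seconds_count[$__rate_interval]))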

Advanced SLIs

Define advanced SLIs as a “success/total” ratio to get the best dashboards. The “Ratio” SLO type enforces this success/total style, but you get more dashboard features if you follow the same approach with your advanced SLOs.

  • Do: <success rate> / <total rate>
  • Avoid: 1 - (<failure rate> / <total rate>)

If you can’t reinstrument your metrics to encode success/failure with labels and you must work with failure_total and all_total counters, you can do (total - fail) / total. For example: `( sum by (…) (rate(all_total[$__rate_interval])) - sum by (…) (rate(failure_total[$__rate_interval])) ) / sum by (…) (rate(all_total[$__rate_interval]))`
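
For instance, grouped by a hypothetical cluster label, that expression expands to:

    # (total - failures) / total, computed per cluster.
    # "cluster" is an illustrative grouping label; use your own.
    (
        sum by (cluster) (rate(all_total[$__rate_interval]))
      -
        sum by (cluster) (rate(failure_total[$__rate_interval]))
    )
    /
    sum by (cluster) (rate(all_total[$__rate_interval]))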

Know your SLIs

There are many SLI types. A brief explanation of multidimensional and rollup SLIs follows.

Multidimensional SLI

A multidimensional SLI reports a ratio for each value of a given label, for example: sum by (cluster) (rate(<success>[5m])) / sum by (cluster) (rate(<total>[5m])). When you specify “group by” labels on the Ratio SLO type, it becomes a multidimensional SLI. A common use is to group by cluster and/or namespace. Multidimensional SLIs enable per-cluster alerting and support more flexible dashboards where you can include or exclude values for the chosen dimension labels (see Rollup SLI below).

Rollup SLI

A rollup SLI (or aggregated SLI) is a calculation over a multidimensional SLI in which the numerator and denominator are further aggregated before the final ratio is computed. When you select cluster=all on the dashboard of a multidimensional SLO that defined cluster as a group label, the dashboard calculates the aggregate ratio: the sum of all successes over the sum of all requests. This gives you alerting on each cluster and reporting on the overall rollup results.
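
In other words, a rollup simply drops the grouping before dividing. Using the placeholders from the multidimensional example above:

    # Multidimensional: one ratio per cluster.
    sum by (cluster) (rate(<success>[5m])) / sum by (cluster) (rate(<total>[5m]))

    # Rollup: aggregate numerator and denominator first, then divide.
    sum(rate(<success>[5m])) / sum(rate(<total>[5m]))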

Additional reference materials

Google provides very clear documentation on SLOs in their [SRE Book](https://sre.google/sre-book/service-level-objectives/). They also provide useful guides on SLO implementation and alerting on SLOs.
