Deploying Prometheus and Grafana on a Kubernetes Cluster

This article walks through deploying Prometheus, Grafana, and the supporting monitoring components (kube-state-metrics, node_exporter, Alertmanager) on a Kubernetes cluster. Corrections and suggestions are welcome.

1. Deploy nfs-client-provisioner

See: https://zhaoll.blog.csdn.net/article/details/128155767
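
Before moving on, it is worth confirming that the provisioner is running and that the StorageClass referenced by the PVCs below exists (the names here assume the setup from the linked article):

kubectl get storageclass master-nfs-storage
kubectl get pods -A | grep nfs-client-provisioner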

2. Deploy Prometheus
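
All of the objects below (except kube-state-metrics) are created in the devops namespace, so create it first if it does not already exist:

kubectl create namespace devops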

Create the PVCs

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data
  namespace: devops
  annotations:
    volume.beta.kubernetes.io/storage-provisioner: master-nfs-storage
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: master-nfs-storage
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-rules-data
  namespace: devops
  annotations:
    volume.beta.kubernetes.io/storage-provisioner: master-nfs-storage
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: master-nfs-storage
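
A quick way to apply and verify the two claims, assuming the manifest above is saved as prometheus-pvc.yaml (the file name is an assumption):

kubectl apply -f prometheus-pvc.yaml
kubectl -n devops get pvc prometheus-data prometheus-rules-data   # both should become Bound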

Create the RBAC resources

apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: devops
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
      - nodes/metrics
      - services
      - endpoints
      - pods
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - get
  - nonResourceURLs:
      - "/metrics"
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: devops

Create the Prometheus ConfigMap, which holds the configuration file

# Prometheus configuration format https://prometheus.io/docs/prometheus/latest/configuration/configuration/
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: devops
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  prometheus.yml: |
    global:
      scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

    rule_files:          # alerting rule files loaded by Prometheus
    - "rules/*.yml"

    scrape_configs:
    - job_name: prometheus    # monitor Prometheus itself
      static_configs:
      - targets:
        - localhost:9090

    - job_name: k8s-master   # master nodes to monitor (node_exporter targets)
      static_configs:
      - targets:
        - 11.0.1.3:9100
        - 11.0.1.4:9100
        - 11.0.1.5:9100


    - job_name: k8s-node   # worker nodes to monitor (node_exporter targets)
      static_configs:
      - targets:
        - 11.0.1.6:9100
        - 11.0.1.7:9100

    # kubernetes-apiservers discovery: expose the apiserver address and scrape its metrics
    - job_name: kubernetes-apiservers
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: default;kubernetes;https
        source_labels:
        - __meta_kubernetes_namespace
        - __meta_kubernetes_service_name
        - __meta_kubernetes_endpoint_port_name
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    - job_name: kubernetes-nodes-kubelet
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        # node discovery: automatically find and monitor every node in the cluster
    - job_name: kubernetes-nodes-cadvisor
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __metrics_path__
        replacement: /metrics/cadvisor
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        # discover Endpoints resources (the pods behind Services); use kubectl get ep to see which resources have endpoints
    - job_name: kubernetes-service-endpoints
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scrape
      - action: replace
        regex: (https?)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scheme
        target_label: __scheme__
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_service_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name

        # automatically discover Service resources
    - job_name: kubernetes-services
      kubernetes_sd_configs:
      - role: service
      metrics_path: /probe
      params:
        module:
        - http_2xx
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_probe
      - source_labels:
        - __address__
        target_label: __param_target
      - replacement: blackbox
        target_label: __address__
      - source_labels:
        - __param_target
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name

        # automatically discover Pod resources
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name
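
Assuming the manifest above is saved as prometheus-configmap.yaml, apply it; the embedded prometheus.yml can also be checked with promtool before deploying (the local file name prometheus.yml is an assumption):

kubectl apply -f prometheus-configmap.yaml
# optional syntax check of the scrape configuration
docker run --rm -v $(pwd):/cfg --entrypoint promtool prom/prometheus:v2.23.0 check config /cfg/prometheus.yml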

Create the Prometheus StatefulSet and Service

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
  namespace: devops
  labels:
    k8s-app: prometheus
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v2.2.1
spec:
  serviceName: "prometheus"
  replicas: 1
  podManagementPolicy: "Parallel"
  updateStrategy:
   type: "RollingUpdate"
  selector:
    matchLabels:
      k8s-app: prometheus
  template:
    metadata:
      labels:
        k8s-app: prometheus
    spec:
      priorityClassName: system-cluster-critical
      serviceAccountName: prometheus
      initContainers:
      - name: "init-chown-data"
        image: "busybox:latest"
        imagePullPolicy: "IfNotPresent"
        command: ["chown", "-R", "65534:65534", "/data"]
        volumeMounts:
        - name: prometheus-data
          mountPath: /data
          subPath: ""
      containers:
        - name: prometheus-server-configmap-reload
          image: "jimmidyson/configmap-reload:v0.1"
          imagePullPolicy: "IfNotPresent"
          args:
            - --volume-dir=/etc/config
            - --webhook-url=http://localhost:9090/-/reload
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
              readOnly: true

        - name: prometheus-server
          image: "prom/prometheus:v2.23.0"
          imagePullPolicy: "IfNotPresent"
          args:
            - --config.file=/etc/config/prometheus.yml
            - --storage.tsdb.path=/data
            - --web.console.libraries=/etc/prometheus/console_libraries
            - --web.console.templates=/etc/prometheus/consoles
            - --web.enable-lifecycle
          ports:
            - containerPort: 9090
          readinessProbe:
            httpGet:
              path: /-/ready
              port: 9090
            initialDelaySeconds: 30
            timeoutSeconds: 30
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: 9090
            initialDelaySeconds: 30
            timeoutSeconds: 30

          volumeMounts:
            - name: prometheus-rules
              mountPath: /etc/config/rules
            - name: config-volume
              mountPath: /etc/config
            - name: prometheus-data
              mountPath: /data
              subPath: ""
      terminationGracePeriodSeconds: 300
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config
        - name: prometheus-data
          persistentVolumeClaim:
            claimName: prometheus-data
        - name: prometheus-rules
          persistentVolumeClaim:
            claimName: prometheus-rules-data
---
kind: Service
apiVersion: v1
metadata:
  name: prometheus
  namespace: devops
  labels:
    kubernetes.io/name: "Prometheus"
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  type: NodePort
  ports:
    - name: http
      port: 9090
      protocol: TCP
      targetPort: 9090
      nodePort: 30090
  selector:
    k8s-app: prometheus
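
With the PVCs, RBAC, and ConfigMap in place, the StatefulSet and Service can be applied and checked (the manifest file name and <node-ip> below are placeholders):

kubectl apply -f prometheus-statefulset.yaml
kubectl -n devops get pods -l k8s-app=prometheus
# the web UI is published on every node through the NodePort defined above
curl -s http://<node-ip>:30090/-/ready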

3. Deploy kube-state-metrics

kube-state-metrics manifest:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
- apiGroups: [""]
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs: ["list", "watch"]
- apiGroups: ["extensions","apps"]
  resources:
  - daemonsets
  - deployments
  - replicasets
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources:
  - statefulsets
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources:
  - cronjobs
  - jobs
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources:
  - horizontalpodautoscalers
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kube-state-metrics-resizer
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
- apiGroups: [""]
  resources:
  - pods
  verbs: ["get"]
- apiGroups: ["extensions","apps"]
  resources:
  - deployments
  resourceNames: ["kube-state-metrics"]
  verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kube-state-metrics-resizer
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
---
# Config map for resource configuration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-state-metrics-config
  namespace: kube-system
  labels:
    k8s-app: kube-state-metrics
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
data:
  NannyConfiguration: |-
    apiVersion: nannyconfig/v1alpha1
    kind: NannyConfiguration
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    k8s-app: kube-state-metrics
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  selector:
    matchLabels:
      k8s-app: kube-state-metrics
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: kube-state-metrics
    spec:
      priorityClassName: system-cluster-critical
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: quay.io/coreos/kube-state-metrics:v1.8.0
        ports:
        - name: http-metrics
          containerPort: 8080
        - name: telemetry
          containerPort: 8081
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
      - name: addon-resizer
        image: registry.cn-hangzhou.aliyuncs.com/rookieops/addon-resizer:1.8.6
        resources:
          limits:
            cpu: 100m
            memory: 30Mi
          requests:
            cpu: 100m
            memory: 30Mi
        env:
          - name: MY_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: MY_POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
        volumeMounts:
          - name: config-volume
            mountPath: /etc/config
        command:
          - /pod_nanny
          - --config-dir=/etc/config
          - --container=kube-state-metrics
          - --cpu=100m
          - --extra-cpu=1m
          - --memory=100Mi
          - --extra-memory=2Mi
          - --threshold=5
          - --deployment=kube-state-metrics
      volumes:
        - name: config-volume
          configMap:
            name: kube-state-metrics-config
---
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "kube-state-metrics"
  annotations:
    prometheus.io/scrape: 'true'
spec:
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
    protocol: TCP
  - name: telemetry
    port: 8081
    targetPort: telemetry
    protocol: TCP
  selector:
    k8s-app: kube-state-metrics
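
Assuming the manifest is saved as kube-state-metrics.yaml, apply it and spot-check the metrics endpoint; because the Service is annotated with prometheus.io/scrape: "true", the kubernetes-service-endpoints job in the Prometheus config should discover it automatically:

kubectl apply -f kube-state-metrics.yaml
kubectl -n kube-system get deploy kube-state-metrics
# quick look at the exported metrics through a temporary port-forward
kubectl -n kube-system port-forward svc/kube-state-metrics 8080:8080 &
sleep 2
curl -s http://localhost:8080/metrics | head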

4. Deploy node_exporter

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: devops
  labels:
    app: node-exporter
spec:
  selector:
    matchLabels:
     app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
      - name: node-exporter
        image: quay.io/prometheus/node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9100
        securityContext:
          privileged: true
        args:
        - --path.procfs
        - /host/proc
        - --path.sysfs
        - /host/sys
        - --path.rootfs
        - /host/root
        - --collector.filesystem.ignored-mount-points
        - '"^/(sys|proc|dev|host|etc)($|/)"'
        - "--collector.processes"
        volumeMounts:
        - name: dev
          mountPath: /host/dev
        - name: proc
          mountPath: /host/proc
        - name: sys
          mountPath: /host/sys
        - name: rootfs
          mountPath: /host/root
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: "NoSchedule"
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: dev
          hostPath:
            path: /dev
        - name: sys
          hostPath:
            path: /sys
        - name: rootfs
          hostPath:
            path: /
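
Apply the DaemonSet and confirm an exporter pod is running on every node; since hostNetwork is enabled, metrics are served directly on each node's IP at port 9100 (the address below reuses one of the static targets from the Prometheus config):

kubectl apply -f node-exporter.yaml
kubectl -n devops get ds node-exporter
curl -s http://11.0.1.3:9100/metrics | head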

5. Deploy Grafana

Create the PVC

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-ui-data
  namespace: devops
  annotations:
    volume.beta.kubernetes.io/storage-provisioner: master-nfs-storage
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: master-nfs-storage

Create the RBAC resources

apiVersion: v1
kind: ServiceAccount
metadata:
  name: grafana
  namespace: devops
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: grafana
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
      - nodes/metrics
      - services
      - endpoints
      - pods
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - get
  - nonResourceURLs:
      - "/metrics"
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: grafana
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: grafana
  namespace: devops

Create the StatefulSet and Service:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: grafana-ui
  namespace: devops
spec:
  serviceName: "grafana"
  replicas: 1
  selector:
    matchLabels:
      app: grafana-ui
  template:
    metadata:
      labels:
        app: grafana-ui
    spec:
      containers:
      - name: grafana-ui
        image: grafana/grafana:latest
        imagePullPolicy: "IfNotPresent"
        ports:
          - containerPort: 3000
            protocol: TCP
        resources:
          limits:
            cpu: 100m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 256Mi
        volumeMounts:
          - name: grafana-ui-data
            mountPath: /var/lib/grafana
            subPath: ""
      securityContext:
        fsGroup: 472
        runAsUser: 472
      volumes:
        - name: grafana-ui-data
          persistentVolumeClaim:
            claimName: grafana-ui-data   # the PVC created above
---
apiVersion: v1
kind: Service
metadata:
  name: grafana-ui
  namespace: devops
spec:
  type: NodePort
  ports:
    - name: http
      port: 3000
      protocol: TCP
      targetPort: 3000
      nodePort: 31000
  selector:
    app: grafana-ui
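
After applying the manifests above (file name assumed), Grafana is reachable on any node at port 31000. The stock grafana/grafana image starts with the default admin/admin credentials; once logged in, add a Prometheus data source pointing at the in-cluster Service:

kubectl apply -f grafana.yaml
kubectl -n devops get pods -l app=grafana-ui
# data source URL to use inside Grafana:
#   http://prometheus.devops.svc.cluster.local:9090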

6. Deploy Alertmanager

Create the PVC

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: alertmanager-data
  namespace: devops
  annotations:
    volume.beta.kubernetes.io/storage-provisioner: master-nfs-storage
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: master-nfs-storage

Create the ConfigMap
Note: replace the WeCom (WeChat Work) settings below with your own

apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: devops
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  alertmanager.yml: |       # the Alertmanager configuration file, provided through this ConfigMap
    global:
      resolve_timeout: 1m
    receivers:
    - name: 'wechat'   # receiver name
      wechat_configs:
        - corp_id: 'wwa7c************b'  # WeCom corp ID (My Company -> Company Info)
          to_party: '2'        # department ID that receives alerts in WeCom
          agent_id: '1000002'  # ID of the application created in WeCom
          api_secret: '8sOqFAiyx22c-***********************-Cw'  # Secret of the WeCom application
          send_resolved: true
          message: '{{ template "wechat.default.message" . }}'
    route:
      group_interval: 1m
      group_by: ['env','instance','type','group','job','alertname']
      group_wait: 10s
      receiver: wechat    # send to the receiver named "wechat"
      repeat_interval: 1m
    templates:       # template file locations
      - '*.tmpl'
  # template file for the WeCom alert messages
  wechat.tmpl: |
    {{ define "wechat.default.message" }}
    {{- if gt (len .Alerts.Firing) 0 -}}
    {{- range $index, $alert := .Alerts -}}
    {{- if eq $index 0 }}
    ========= 监控报警 =========
    告警状态:{{   .Status }}
    告警级别:{{ .Labels.severity }}
    告警类型:{{ $alert.Labels.alertname }}
    故障主机: {{ $alert.Labels.instance }}
    告警主题: {{ $alert.Annotations.summary }}
    告警详情: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};
    触发阀值:{{ .Annotations.value }}
    故障时间: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    ========= = end =  =========
    {{- end }}
    {{- end }}
    {{- end }}
    {{- if gt (len .Alerts.Resolved) 0 -}}
    {{- range $index, $alert := .Alerts -}}
    {{- if eq $index 0 }}
    ========= 告警恢复 =========
    告警类型:{{ .Labels.alertname }}
    告警状态:{{   .Status }}
    告警主题: {{ $alert.Annotations.summary }}
    告警详情: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};
    故障时间: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    恢复时间: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{- if gt (len $alert.Labels.instance) 0 }}
    实例信息: {{ $alert.Labels.instance }}
    {{- end }}
    ========= = end =  =========
    {{- end }}
    {{- end }}
    {{- end }}
    {{- end }}
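
Assuming the manifest is saved as alertmanager-configmap.yaml (and the alertmanager.yml / wechat.tmpl content is also kept locally), apply it; the configuration syntax can optionally be checked with amtool, which ships inside the Alertmanager image:

kubectl apply -f alertmanager-configmap.yaml
# optional syntax check of alertmanager.yml
docker run --rm -v $(pwd):/cfg --entrypoint amtool prom/alertmanager:v0.14.0 check-config /cfg/alertmanager.yml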

Create the Deployment and Service

apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: devops
  labels:
    k8s-app: alertmanager
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v0.14.0
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: alertmanager
      version: v0.14.0
  template:
    metadata:
      labels:
        k8s-app: alertmanager
        version: v0.14.0
    spec:
      priorityClassName: system-cluster-critical
      containers:
        - name: prometheus-alertmanager
          image: "prom/alertmanager:v0.14.0"
          imagePullPolicy: "IfNotPresent"
          args:
            - --config.file=/etc/config/alertmanager.yml
            - --storage.path=/data
            - --web.external-url=/
          ports:
            - containerPort: 9093
          readinessProbe:
            httpGet:
              path: /#/status
              port: 9093
            initialDelaySeconds: 30
            timeoutSeconds: 30
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
            - name: alertmanager-data
              mountPath: "/data"
              subPath: ""
          resources:
            limits:
              cpu: 10m
              memory: 50Mi
            requests:
              cpu: 10m
              memory: 50Mi
        - name: prometheus-alertmanager-configmap-reload
          image: "jimmidyson/configmap-reload:v0.1"
          imagePullPolicy: "IfNotPresent"
          args:
            - --volume-dir=/etc/config
            - --webhook-url=http://localhost:9093/-/reload
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
              readOnly: true
          resources:
            limits:
              cpu: 10m
              memory: 10Mi
            requests:
              cpu: 10m
              memory: 10Mi
      volumes:
        - name: config-volume
          configMap:
            name: alertmanager-config
        - name: alertmanager-data
          persistentVolumeClaim:
            claimName: alertmanager-data
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: devops
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "Alertmanager"
spec:
  type: NodePort
  ports:
    - name: http
      port: 9093
      protocol: TCP
      targetPort: 9093
  selector:
    k8s-app: alertmanager
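
Apply the Deployment and Service (file name assumed) and note the NodePort assigned for the Alertmanager UI:

kubectl apply -f alertmanager.yaml
kubectl -n devops get svc alertmanager   # the NodePort is auto-assigned since none is set above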

Configure the alerting rules
On the NFS server, go into the directory backing the prometheus-rules-data PVC and create the following rule files:
node_health.yaml:

groups:
- name: node_health
  rules:
  - alert: HighMemoryUsage
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.9
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }}"
      description: "Instance {{ $labels.instance }} of job {{ $labels.job }} has High Memory."

  - alert: HighDiskUsage
    expr: node_filesystem_free_bytes{mountpoint='/'} / node_filesystem_size_bytes{mountpoint='/'} < 0.71
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }}"
      description: "Instance {{ $labels.instance }} of job {{ $labels.job }} has High Disk."

  - alert: NodeCPUUsage
    expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} CPU使用率过高"
      description: "{{ $labels.instance }} CPU使用大于60% (当前值: {{ $value }})"

node_status.yaml:

groups:
- name: 实例存活告警规则
  rules:
  - alert: 实例存活告警
    expr: up == 0
    for: 1m
    labels:
      user: prometheus
      severity: error
    annotations:
      summary: "Instance {{ $labels.instance }} is down"
      description: "Instance {{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minutes."
      value: "{{ $value }}"

Edit the Prometheus ConfigMap and append the following section at the end:

    alerting:         # where Prometheus sends alerts, i.e. the Alertmanager address
      alertmanagers:
      - static_configs:
        - targets:
          - alertmanager.devops.svc.cluster.local:9093  # the Alertmanager Service name and port (the Service above listens on 9093)
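
Re-apply the updated ConfigMap; the configmap-reload sidecar will notice the change and call the Prometheus reload endpoint, and because --web.enable-lifecycle is set the reload can also be triggered by hand (<node-ip> is a placeholder for any node's address):

kubectl apply -f prometheus-configmap.yaml
curl -X POST http://<node-ip>:30090/-/reload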

That completes the configuration.
