Prometheus

Istio Study Notes (5): Reworking the Prometheus Configuration

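Copyright notice: this is an original article from this site, published by 萌叔. When reposting, please credit 萌叔 | http://vearne.cc

1. Preface

Warning: this article is only my own summary. Other readers may find little of value in it.

To make a k8s + istio cluster genuinely production-ready, we need to consider the following classes of metrics:

- Node Metrics
- Container Resource Metrics
- Kubernetes API Server
- Etcd metrics
- Kube state-metrics

For a service that is about to move onto the cluster, we care most about:

- Resource metrics of the container the service runs in: CPU, memory, network traffic, disk usage, and so on.
- Business metrics: QPS, StatusCode, ErrorCode, request latency, cache usage, connection pools, and so on.

A large share of these are in fact standard metrics that every service should have. In addition, istio already intercepts each container's inbound and outbound traffic. Given these conditions, 萌叔 hopes to achieve the following.

Goals

1) Use Prometheus's automatic discovery to find and monitor Container Resource Metrics.
2) Use Prometheus's automatic discovery to find and monitor some of the metrics intercepted by envoy.
3) Use Prometheus's automatic discovery to find and monitor the Prometheus metrics exposed by the app.
4) For standard metrics, configure grafana automatically (generate the Dashboard and Graph).

Clearly, the metrics in 1) and 2) are all standard metrics.

2. Configuration

According to reference 4:

    spec:
      template:
        metadata:
          annotations:
            # Annotation values must be strings, so booleans and numbers are quoted.
            prometheus.io/scrape: "true"  # determines if a pod should be scraped. Set to true to enable scraping.
            prometheus.io/path: /metrics  # determines the path to scrape metrics at. Defaults to /metrics.
            prometheus.io/port: "80"      # determines the port to scrape metrics at. Defaults to 80.

A service that wants to expose its own metrics can do so by adding the annotations prometheus.io/scrape, prometheus.io/path and prometheus.io/port to its pod. In practice, however, I found that once istio has been injected, the values of these three annotations have already been changed to ...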
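To make the meaning of the three annotations concrete, here is a minimal Go sketch of how a scraper could turn them into a scrape URL, using the defaults quoted in the comments above (/metrics and 80). The function name scrapeTarget and the pod IP are hypothetical. Prometheus itself does not hard-code these annotations; they only take effect through relabel rules in the scrape configuration, so treat this as an illustration of the convention rather than of Prometheus internals.

```go
package main

import (
	"fmt"
	"net"
)

// scrapeTarget is a hypothetical helper: it reads the three
// prometheus.io/* annotations of a pod and returns the URL to scrape.
// ok is false when the pod has not opted in to scraping.
func scrapeTarget(podIP string, annotations map[string]string) (url string, ok bool) {
	if annotations["prometheus.io/scrape"] != "true" {
		return "", false
	}
	path := annotations["prometheus.io/path"]
	if path == "" {
		path = "/metrics" // documented default
	}
	port := annotations["prometheus.io/port"]
	if port == "" {
		port = "80" // documented default
	}
	return "http://" + net.JoinHostPort(podIP, port) + path, true
}

func main() {
	annotations := map[string]string{
		"prometheus.io/scrape": "true",
		"prometheus.io/port":   "8080",
	}
	if url, ok := scrapeTarget("10.0.0.7", annotations); ok {
		fmt.Println(url)
	}
}
```

Because the scraper only sees whatever values are on the pod at scrape time, any component that rewrites these annotations changes which endpoint gets scraped.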

Playing with Prometheus (6): Implementing a Custom Collector

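1. Preface

The Prometheus project and its community provide a rich set of exporters for monitoring common hardware, databases, message queues and other widely used software. Prometheus also offers four metric types that make it easy to write our own exporters:

- Counter: a cumulative metric, that is, a monotonically increasing counter. It is typically used for the number of requests served, tasks completed or errors.
- Gauge: an instantaneous state, such as memory usage at a given moment or the number of messages in a queue. Its value can go up as well as down.
- Histogram: typically used for top percentiles, such as the TP90 or TP99 of request latency.
- Summary: similar to Histogram.

Let us review the usual metric collection process in Prometheus.

1) Create the metric

    HTTPReqTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total number of HTTP requests made.",
    }, []string{"method", "path", "status"})

2) Register the metric with the DefaultRegisterer

    prometheus.MustRegister(
        HTTPReqTotal,
    )

3) Expose the metric and its values through the HTTP API

The caller of the Gather method can then expose the gathered metrics in some way. Usually, the metrics are served via HTTP on the /metrics endpoint. ...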
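The three steps can be illustrated without the client library. The sketch below uses only the Go standard library: labelledCounter is a hypothetical, much simplified stand-in for a CounterVec, and Expose renders its series in the style of the Prometheus text format. It is not how client_golang is implemented (there is no registry, no concurrency safety and only naive label escaping). It only shows what "one value per label combination" looks like once exposed.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// labelledCounter is a hypothetical, simplified stand-in for a CounterVec:
// one monotonically increasing value per combination of label values.
type labelledCounter struct {
	name   string
	labels []string
	values map[string]float64 // key: label values joined by "\x00"
}

func newLabelledCounter(name string, labels ...string) *labelledCounter {
	return &labelledCounter{name: name, labels: labels, values: map[string]float64{}}
}

// Inc adds 1 to the series identified by the given label values.
func (c *labelledCounter) Inc(labelValues ...string) {
	if len(labelValues) != len(c.labels) {
		panic("wrong number of label values")
	}
	c.values[strings.Join(labelValues, "\x00")]++
}

// Expose renders every series as text, sorted for stable output.
func (c *labelledCounter) Expose() string {
	keys := make([]string, 0, len(c.values))
	for k := range c.values {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	fmt.Fprintf(&b, "# TYPE %s counter\n", c.name)
	for _, k := range keys {
		vals := strings.Split(k, "\x00")
		pairs := make([]string, len(vals))
		for i, v := range vals {
			pairs[i] = fmt.Sprintf("%s=%q", c.labels[i], v)
		}
		fmt.Fprintf(&b, "%s{%s} %g\n", c.name, strings.Join(pairs, ","), c.values[k])
	}
	return b.String()
}

func main() {
	c := newLabelledCounter("http_requests_total", "method", "path", "status")
	c.Inc("GET", "/api/api1", "200")
	c.Inc("GET", "/api/api1", "200")
	c.Inc("GET", "/api/api2", "500")
	fmt.Print(c.Expose())
}
```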

Playing with Prometheus (5): A Toolkit for Monitoring Redis and MySQL (Application Layer)

1. Introduction

For highly available services, monitoring is usually very fine-grained. If you happen to use Prometheus and also need to monitor your Redis and MySQL connection pools at the application layer, this article should be useful to you.

- Redis client: go-redis/redis
- MySQL client: jinzhu/gorm

2. Sample code

    go get github.com/vearne/golib

main.go

    package main

    import (
        "log"
        "net/http"
        "time"

        "github.com/go-redis/redis"
        "github.com/jinzhu/gorm"
        _ "github.com/jinzhu/gorm/dialects/mysql"
        "github.com/prometheus/client_golang/prometheus/promhttp"
        "github.com/vearne/golib/metric"
    )

    func main() {
        // init redis
        client := redis.NewClient(&redis.Options{
            Addr:     "localhost:6379",
            PoolSize: 100,
        })
        // *** monitor the Redis connection pool ***
        metric.AddRedis(client, "car")

        // init mysql (3306 is MySQL's default port; 6379 is the Redis port)
        DSN := "test:xxxx@tcp(localhost:3306)/somebiz?charset=utf8&loc=Asia%2FShanghai&parseTime=true"
        mysqldb, err := gorm.Open("mysql", DSN)
        if err != nil {
            panic(err)
        }
        mysqldb.DB().SetMaxIdleConns(50)
        mysqldb.DB().SetMaxOpenConns(100)
        mysqldb.DB().SetConnMaxLifetime(5 * time.Minute)
        // *** monitor the MySQL connection pool ***
        metric.AddMySQL(mysqldb, "car")

        // generate some load on both pools
        for i := 0; i < 30; i++ {
            go func() {
                for {
                    client.Get("a").String()
                    time.Sleep(200 * time.Millisecond)
                    mysqldb.Exec("show tables")
                }
            }()
        }

        http.Handle("/metrics", promhttp.Handler())
        // Log before ListenAndServe: it blocks, and log.Fatal exits when it returns.
        log.Println("starting...")
        log.Fatal(http.ListenAndServe(":9090", nil))
    }

The two functions have the following signatures:

    func AddRedis(client RedisClient, role string)
    func AddMySQL(client *gorm.DB, role string)

role is only used to tell different Redis instances apart ...

Playing with Prometheus (4): Finding Abnormal Nodes

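1. Preface

A service often runs as dozens or even hundreds of instances. By using docker containers, or through deliberate effort, we keep the environments those instances run in identical. However, a docker container or virtual machine shares its physical machine with other containers or virtual machines, and the physical machine's hardware may have latent faults, so some instances end up behaving abnormally. (The program itself may also be the cause.) Finding the abnormal nodes therefore becomes very important.

2. Analysis, Charts and Alerting

Such nodes tend to show the following symptoms:

- Request timeouts
- Many request errors

2.1 Monitoring charts

On a monitoring chart, taking non-200 requests as an example, we can use topk to list the instances with the most failed requests:

    topk(3,
      sum(rate(http_requests_total{project="fake-service", run_mode="product", status!~"200|201|204"}[5m]))
      by (instance))

Figure 1: the 3 instances with the most non-200 HTTP status codes

Figure 1 shows that the instance drawn as the blue curve has a markedly higher number of non-200 HTTP requests than the other instances.

2.2 Configuring alerts

Reference 2 recommends detecting abnormal instances as follows: if, for some instance,

    metric value of the instance > mean over all instances + 2 * standard deviation over all instances

then an abnormal instance may exist.

DSL

    floor(max(
      sum(rate(http_requests_total{project="sdk-api", run_mode="product", status!~"200|201|204"}[5m])) by (instance)
    ))
    >
    avg(
      sum(rate(http_requests_total{project="sdk-api", run_mode="product", status!~"200|201|204"}[5m])) by (instance)
    )
    + 2 * stddev(
      sum(rate(http_requests_total{project="sdk-api", run_mode="product", status!~"200|201|204"}[5m])) by (instance)
    )

Postscript

A few days ago exactly this kind of failure happened in production. After the abnormal node was found and restarted, the service recovered and the alert cleared.

References

1. Standard deviation
2. Practical Anomaly Detection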

Playing with Prometheus (3): Data Storage

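1. Introduction

How to store a huge volume of monitoring data has always been a question that designers of monitoring systems must care about. OpenTSDB chose HBase; Open-Falcon chose RRD (round-robin database). Prometheus took a different path: borrowing from a Facebook paper (reference 2), it designed its own TSDB (time series database). This article tries to give a brief introduction to the 2 compression algorithms used in the TSDB.

2. A brief look at the TSDB file layout

Two points about Prometheus are worth mentioning:

1) Data is not pre-aggregated; all aggregation happens at query time. In my observation, once the queried time span exceeds 7d, queries become noticeably slow (3 to 5 seconds).
2) Prometheus stores all data on a single machine, so data can be lost. Recently produced data is kept in memory and historical data is written to disk. By default data is retained for 15 days. The retention period can be changed with storage.tsdb.retention.time:

    --storage.tsdb.retention.time=7d

File layout

    ├── 01D5X2A81S8FMS16S5Q1GWNQDE
    │   ├── chunks              // chunk data
    │   │   └── 000001
    │   ├── index               // index file
    │   ├── meta.json           // human-readable file
    │   └── tombstones
    ├── 01D5X2A83TVJD7FFGKPHED5VA1
    │   ├── chunks
    │   │   └── 000001
    │   ├── index
    │   ├── meta.json
    │   └── tombstones
    └── wal                     // everything under wal is write-ahead log, similar to the binlog in MySQL
        ├── 00000010
        ├── 00000011
        ├── 00000012
        ├── 00000013
        └── checkpoint.000009
            └── 00000000

A careful reader will notice that the modification times of the directories are increasing. Indeed, monitoring data is first partitioned by time into separate directories; the index file is then used to locate individual series, and the actual data lives in the chunks directory. (Addendum: the WAL directory holds the most recent data. Put simply, the WAL can be thought of as recent, temporary data, while the data in chunks is archived data.) ...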
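The time ordering of the block directories can also be read from their names. The sketch below assumes that the directory names are ULIDs, whose first 10 characters encode a millisecond timestamp in Crockford base32; that timestamp reflects when the block ID was generated, not the time range of the samples inside it. The function name ulidTime is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// crockford is the Crockford base32 alphabet used by ULIDs (no I, L, O, U).
const crockford = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

// ulidTime decodes the timestamp part (first 10 characters) of a
// 26-character ULID such as a TSDB block directory name.
func ulidTime(name string) (time.Time, error) {
	if len(name) != 26 {
		return time.Time{}, fmt.Errorf("%q is not 26 characters long", name)
	}
	var ms uint64
	for _, c := range name[:10] {
		idx := strings.IndexRune(crockford, c)
		if idx < 0 {
			return time.Time{}, fmt.Errorf("invalid character %q in %q", c, name)
		}
		ms = ms<<5 | uint64(idx) // each character carries 5 bits
	}
	return time.UnixMilli(int64(ms)), nil
}

func main() {
	for _, dir := range []string{"01D5X2A81S8FMS16S5Q1GWNQDE", "01D5X2A83TVJD7FFGKPHED5VA1"} {
		t, err := ulidTime(dir)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Println(dir, t.UTC().Format(time.RFC3339))
	}
}
```

Because the timestamp occupies the leading characters, sorting the directory names lexicographically also sorts the blocks by creation time.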

Playing with Prometheus (2): Computing Top Percentiles

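1. Preface

In my previous article, Playing with Prometheus (1): A First Example, I mentioned that Prometheus can be used to measure the TP90 request latency of a service. So what exactly does TP90 mean, and how is it calculated in Prometheus?

2. Concepts: Median / Top Percentile

2.1 Median

The median is a statistical term: sort the data by size to form a sequence, and the median is the value in the middle position of that sequence. It is denoted Me. When the number of values N is odd, the value in the middle position is the median; when N is even, the median is the average of the 2 values in the middle positions.

Consider an example. Suppose we have a group of requests with the following latencies (in milliseconds):

    [5, 8, 6, 50, 7, 10, 9, 11]

Sorted by latency in ascending order:

    [5, 6, 7, 8, 9, 10, 11, 50]

Here N = 8, so the median (using 0-based indexes) is (Array[3] + Array[4])/2 = (8 + 9)/2 = 8.5ms.

2.2 Top Percentile

Top Percentile is a statistic of the percentage distribution. TP50 means that 50% of requests are less than or equal to some value; TP90 means that 90% of requests are less than or equal to some value.

    [    5,   6,     7,   8,     9,  10,    11,   50]
    [12.5%, 25%, 37.5%, 50%, 62.5%, 75%, 87.5%, 100%]

In the example above TP50 = 8ms and TP100 = 50ms: 50% of requests complete within 8ms and 100% of requests complete within 50ms ...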
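Both definitions can be written down directly in Go. The sketch below computes them from raw samples; topPercentile uses the nearest-rank rule (the smallest sample such that at least p percent of samples are less than or equal to it), which reproduces the TP50 and TP100 values of the example. The function names are hypothetical, and this is not how Prometheus computes quantiles, since Prometheus works from histogram buckets rather than raw samples.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// sortedCopy returns an ascending copy so the caller's slice is untouched.
func sortedCopy(values []float64) []float64 {
	s := append([]float64(nil), values...)
	sort.Float64s(s)
	return s
}

// median returns the middle value, or the average of the two middle
// values when the number of samples is even.
func median(values []float64) float64 {
	s := sortedCopy(values)
	n := len(s)
	if n == 0 {
		return math.NaN()
	}
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

// topPercentile returns the smallest sample such that at least p percent
// of all samples are less than or equal to it (nearest-rank rule).
func topPercentile(values []float64, p float64) float64 {
	s := sortedCopy(values)
	n := len(s)
	if n == 0 || p <= 0 || p > 100 {
		return math.NaN()
	}
	rank := int(math.Ceil(p / 100 * float64(n)))
	return s[rank-1]
}

func main() {
	latencies := []float64{5, 8, 6, 50, 7, 10, 9, 11} // milliseconds
	fmt.Println("median:", median(latencies))
	fmt.Println("TP50:  ", topPercentile(latencies, 50))
	fmt.Println("TP90:  ", topPercentile(latencies, 90))
	fmt.Println("TP100: ", topPercentile(latencies, 100))
}
```

Note that for an even number of samples the median (8.5ms) and the nearest-rank TP50 (8ms) differ, because the median averages the two middle values while TP50 always returns an actual sample.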

Playing with Prometheus (1): A First Example

Preface

Over the years I have worked with quite a few monitoring systems: Nagios, Cacti, Zabbix and Open-falcon. This year I started using Prometheus at my new company. Some articles online call Prometheus a new-generation monitoring system, and I have long been curious about what exactly is new about it and what advantages it has over traditional monitoring systems. After using it for a while I feel I have gained some understanding, so let us get a feel for it through an example.

Figure: Prometheus architecture diagram

Use cases

Judging from how companies use it today, Prometheus is mainly used to monitor application services, especially docker-based ones. Host status (CPU usage, memory usage), network device monitoring and the like are still handled by traditional monitoring systems.

1. A simulated application service

Suppose we have a web service called fake_service.

fake_server.go

    package main

    import (
        "strconv"
        "strings"
        "time"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
        "gopkg.in/gin-gonic/gin.v1"
    )

    var (
        // HTTPReqDuration metric: http_request_duration_seconds
        HTTPReqDuration *prometheus.HistogramVec
        // HTTPReqTotal metric: http_requests_total
        HTTPReqTotal *prometheus.CounterVec
    )

    func init() {
        // Track request latency per endpoint.
        // A HistogramVec is a group of Histograms.
        HTTPReqDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "The HTTP request latencies in seconds.",
            Buckets: nil, // nil means the library's default buckets
        }, []string{"method", "path"}) // "method" and "path" are labels

        // Track the number of requests per endpoint.
        // A CounterVec is a group of Counters.
        HTTPReqTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests made.",
        }, []string{"method", "path", "status"}) // "method", "path" and "status" are labels

        prometheus.MustRegister(
            HTTPReqDuration,
            HTTPReqTotal,
        )
    }

    // parsePath strips trailing IDs: /api/epgInfo/1371648200 -> /api/epgInfo
    func parsePath(path string) string {
        itemList := strings.Split(path, "/")
        // Check the number of segments, not the string length;
        // otherwise a short path such as "/api" makes the slice below panic.
        if len(itemList) >= 4 {
            return strings.Join(itemList[0:3], "/")
        }
        return path
    }

    // Metric is the metrics middleware.
    func Metric() gin.HandlerFunc {
        return func(c *gin.Context) {
            tBegin := time.Now()
            c.Next()

            duration := float64(time.Since(tBegin)) / float64(time.Second)
            path := parsePath(c.Request.URL.Path)

            // Increment the request count.
            HTTPReqTotal.With(prometheus.Labels{
                "method": c.Request.Method,
                "path":   path,
                "status": strconv.Itoa(c.Writer.Status()),
            }).Inc()

            // Record how long this request took.
            HTTPReqDuration.With(prometheus.Labels{
                "method": c.Request.Method,
                "path":   path,
            }).Observe(duration)
        }
    }

    func DealAPI1(c *gin.Context) {
        time.Sleep(time.Microsecond * 10)
        c.Writer.Write([]byte("/api/api1"))
    }

    func DealAPI2(c *gin.Context) {
        time.Sleep(time.Microsecond * 20)
        c.Writer.Write([]byte("/api/api2"))
    }

    func main() {
        router := gin.Default()
        g := router.Group("/api")
        g.Use(Metric())
        g.GET("api1", DealAPI1)
        g.GET("api2", DealAPI2)

        // Expose the metrics to Prometheus.
        router.GET("/metrics", gin.WrapH(promhttp.Handler()))
        router.Run(":28181")
    }

Use fake_client.go to simulate requests from real users (full code). Usage: ...
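The path-normalisation step matters because every distinct value of the "path" label creates a new time series, so IDs embedded in URLs have to be collapsed before they are used as label values. Below is a standalone sketch of that idea that can be run without gin or the Prometheus client. The name normalizePath is hypothetical, and the guard is on the number of path segments so that short paths pass through unchanged.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizePath keeps only the first two path segments, so that
// /api/epgInfo/1371648200 and /api/epgInfo/42 share one label value.
func normalizePath(path string) string {
	items := strings.Split(path, "/") // "/a/b/c" -> ["", "a", "b", "c"]
	if len(items) >= 4 {
		return strings.Join(items[:3], "/")
	}
	return path
}

func main() {
	for _, p := range []string{"/api/epgInfo/1371648200", "/api/api1", "/api"} {
		fmt.Printf("%s -> %s\n", p, normalizePath(p))
	}
}
```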