Elasticsearch Lessons Learned (continuously updated)

Copyright notice: original article from this site, published by 萌叔
When reposting, please credit 萌叔 | https://vearne.cc

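Background:

ES has been in use at my company for more than three years, and the cluster has grown to over a hundred machines. We have picked up a fair amount of experience along the way, which I am summarizing here to share. My technical knowledge is limited, so please point out any mistakes.

Items:

I list the items below as questions and will keep adding to them.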

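1. Choosing the shard size

The number of shards in an ES index is fixed when the index is created, so choosing the shard count matters a great deal. In our experience a shard size of 10GB ~ 20GB works well, for the following reasons:
1) ES balances load by moving shards between nodes; if a shard is too large, moving it is very slow
2) Each shard is a Lucene instance, and each Lucene instance comes with its own group of Java threads, so the number of shards should not be too large either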
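The sizing rule above can be turned into a quick calculation. The following is only a minimal sketch: the function names and the 15GB default target (the midpoint of the 10GB ~ 20GB range) are assumptions made for illustration, not part of ES.

```python
import math


def suggest_shard_count(expected_index_gb, target_shard_gb=15.0):
    """Return a primary shard count that keeps each shard near target_shard_gb.

    Both names and the 15 GB default are illustrative assumptions.
    """
    if expected_index_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    # Round up so that the average shard does not grow past the target size.
    return max(1, math.ceil(expected_index_gb / target_shard_gb))


def shard_size_gb(expected_index_gb, shard_count):
    """Average size of one shard for a given shard count."""
    return expected_index_gb / shard_count


if __name__ == "__main__":
    for size_gb in (8, 120, 600):
        count = suggest_shard_count(size_gb)
        print(size_gb, "GB ->", count, "shards,",
              round(shard_size_gb(size_gb, count), 1), "GB each")
```

Because the shard count cannot be changed after the index is created, the input should be the size the index is expected to reach, not its size on day one.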

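2. Designing index names

If the data grows over time, you can split it into one index per month or per day.
The index names can be
index_201701, index_201702, index_201703
or
index_20170301, index_20170302, index_20170303
You can then give them a common alias such as index_2017 and query all of the indices through that alias.
ES indices can also be closed. A closed index takes up no memory and only consumes disk space.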
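As a rough illustration of the naming scheme, the sketch below generates monthly index names and builds a request body for the `_aliases` API that puts them all under one alias. The helper names are hypothetical; only the `actions` / `add` body layout comes from ES itself.

```python
import json


def monthly_index_names(prefix, year, first_month=1, last_month=12):
    """Build names such as index_201701 for each month in the range."""
    return ["%s_%04d%02d" % (prefix, year, month)
            for month in range(first_month, last_month + 1)]


def alias_actions(index_names, alias):
    """Request body for POST /_aliases that adds every index to one alias."""
    return {"actions": [{"add": {"index": name, "alias": alias}}
                        for name in index_names]}


if __name__ == "__main__":
    names = monthly_index_names("index", 2017, 1, 3)
    print(json.dumps(alias_actions(names, "index_2017"), indent=2))
```

The resulting body can be sent with `requests.post("http://localhost:9200/_aliases", json=body)`, assuming ES listens on localhost. An old index can be closed with `POST /index_201701/_close`.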

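3. SSD or spinning disk?

The speed of Elasticsearch depends on its indices, and most of the index data is stored as files on disk. If your data volume is large and individual queries or aggregations touch a lot of data, you should use SSDs. In our tests, when a query covers a large amount of data, ES on SSD is about 10 times faster than ES on spinning disks. The official guide also reports a 500-fold improvement in write throughput on SSDs just from using the correct I/O scheduler (quoted below).

For reference:
Record size: 1kB per record
Operating system: CentOS 6.7
Memory: 64GB
ES version: 2.3, heap 31GB
Throughput of a single ES data node:

Spinning disk	SSD
10,000 records/min	100,000 records/min

See reference [1].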

If you are using SSDs, make sure your OS I/O scheduler is configured correctly. When you write data to disk, the I/O scheduler decides when that data is actually sent to the disk. The default under most *nix distributions is a scheduler called cfq (Completely Fair Queuing).

This scheduler allocates time slices to each process, and then optimizes the delivery of these various queues to the disk. It is optimized for spinning media: the nature of rotating platters means it is more efficient to write data to disk based on physical layout.

This is inefficient for SSD, however, since there are no spinning platters involved. Instead, deadline or noop should be used instead. The deadline scheduler optimizes based on how long writes have been pending, while noop is just a simple FIFO queue.

This simple change can have dramatic impacts. We’ve seen a 500-fold improvement to write throughput just by using the correct scheduler.
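To see which scheduler a disk is currently using, the kernel exposes a line such as `noop deadline [cfq]` under `/sys/block/<device>/queue/scheduler`, with the active scheduler in brackets. The sketch below parses that line; the function names and the default device name `sda` are assumptions for illustration.

```python
import re


def active_scheduler(scheduler_text):
    """Return the scheduler shown in brackets, e.g. 'cfq' for 'noop deadline [cfq]'."""
    match = re.search(r"\[([^\]]+)\]", scheduler_text)
    return match.group(1) if match else None


def read_scheduler(device="sda"):
    """Read the active scheduler of one block device (device name is an assumption)."""
    path = "/sys/block/%s/queue/scheduler" % device
    with open(path) as handle:
        return active_scheduler(handle.read())


if __name__ == "__main__":
    # Runs against a sample string so it also works on machines without /sys.
    print(active_scheduler("noop deadline [cfq]"))
```

If the result on an SSD is cfq, the scheduler can be switched at runtime as root with `echo deadline > /sys/block/sda/queue/scheduler`.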

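4. Versions

Make sure the Java version is 1.8 or later. ES 5.x is considerably faster than earlier versions.

5. Sizing the heap of an ES instance

The official recommendation is to give ES half of the machine's memory as heap, and to keep the heap below 32GB (in practice about 31GB).
For a heap below 32GB, the JVM can use compressed 32-bit object pointers; a larger heap needs longer pointers. According to the official guide, a 31GB heap is roughly as effective as a 40GB heap.
On machines with a lot of memory, you can deploy several ES instances.

In practice, on a machine with 64GB of memory the ES heap can be set to about 31GB, and on a machine with 96GB of memory it can be set to 64GB.

To check whether pointer compression is still enabled for a given heap size: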

java -Xmx32766m -XX:+PrintFlagsFinal 2> /dev/null | grep UseCompressedOops

The command above shows whether the JVM enables compressed pointers when the maximum heap is set to 32766MB.

See reference [2] for details.
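When several heap sizes need to be tried, the check can be scripted. The sketch below parses the `UseCompressedOops` line from the `-XX:+PrintFlagsFinal` output; the function names are hypothetical, and `check_heap` assumes a `java` binary is on the PATH (it appends `-version` so the JVM exits cleanly after printing the flags).

```python
import re
import subprocess


def compressed_oops_enabled(flags_output):
    """Return True/False for the UseCompressedOops line, or None if it is absent."""
    match = re.search(r"\bUseCompressedOops\s*:?=\s*(true|false)", flags_output)
    if match is None:
        return None
    return match.group(1) == "true"


def check_heap(xmx_mb):
    """Ask the local JVM whether compressed pointers are on for a given -Xmx (in MB)."""
    result = subprocess.run(
        ["java", "-Xmx%dm" % xmx_mb, "-XX:+PrintFlagsFinal", "-version"],
        capture_output=True, text=True)
    return compressed_oops_enabled(result.stdout)


if __name__ == "__main__":
    sample = "     bool UseCompressedOops      := true      {lp64_product}"
    print(compressed_oops_enabled(sample))
```

Calling `check_heap` with values just below and just above 32GB shows where the local JVM switches compressed pointers off; the exact cutoff depends on the JVM.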

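6. Do not configure too many master-eligible machines

1) In ES, only master-eligible nodes can be elected master, and an instance must receive votes from more than half of the voters to be elected.
In our experience, the cluster becomes very unstable when too many machines take part in master election.
It is like representative government in human society: if every decision had to be made by the entire population, decision making would be very inefficient.
2) In addition, master-eligible ES instances should not store data or act as client nodes.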

By default a node is a master-eligible node and a data node, plus it can pre-process documents through ingest pipelines. This is very convenient for small clusters but, as the cluster grows, it becomes important to consider separating dedicated master-eligible nodes from dedicated data nodes.

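In practice, picking 3 instances in the cluster as master-eligible is enough, with a heap of 16GB each. They can share machines with other ES instances.
See reference [3].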
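The "more than half of the voters" rule is simple arithmetic, sketched below. The function names are made up for illustration. In the ES versions discussed here (2.x and 5.x) the resulting number is what goes into the `discovery.zen.minimum_master_nodes` setting.

```python
def quorum(master_eligible_count):
    """Smallest number of votes that is more than half of the voters."""
    if master_eligible_count < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_count // 2 + 1


def tolerated_failures(master_eligible_count):
    """How many master-eligible nodes can be lost while a master can still be elected."""
    return master_eligible_count - quorum(master_eligible_count)


if __name__ == "__main__":
    for count in (1, 3, 4, 5):
        print(count, "master-eligible ->", "quorum", quorum(count),
              ", tolerates", tolerated_failures(count), "failure(s)")
```

With 3 master-eligible instances the quorum is 2 and one instance can fail. A fourth instance raises the quorum to 3 without tolerating any additional failure, which is one reason a small odd number is the usual choice.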

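7. Problems caused by HugePage

Our cluster runs on CentOS 6. For a while we were importing a large batch of data, and the load on some nodes stood out sharply from the rest of the cluster, which dragged down overall throughput. It turned out that CentOS enables Transparent Huge Pages (THP) by default, which drove cpu_sys too high.
THP can be turned off with the following commands: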

echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag

Note: this setting is lost after a reboot.

See reference [4].
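Since the setting does not survive a reboot, it is worth verifying it on every node from time to time. The sketch below reads the current THP mode; the helper names are hypothetical. It checks the CentOS 6 path used above first and then `/sys/kernel/mm/transparent_hugepage/enabled`, which is where mainline kernels expose the same switch.

```python
import os

# Candidate locations: the CentOS 6 path from the commands above,
# followed by the path used by mainline kernels.
THP_PATHS = (
    "/sys/kernel/mm/redhat_transparent_hugepage/enabled",
    "/sys/kernel/mm/transparent_hugepage/enabled",
)


def thp_mode(text):
    """Return the bracketed value, e.g. 'always' for '[always] madvise never'."""
    start, end = text.find("["), text.find("]")
    if start == -1 or end < start:
        return None
    return text[start + 1:end]


def current_thp_mode(paths=THP_PATHS):
    """Return the THP mode from the first path that exists, or None."""
    for path in paths:
        if os.path.exists(path):
            with open(path) as handle:
                return thp_mode(handle.read())
    return None


if __name__ == "__main__":
    print("THP mode:", current_thp_mode())
```

A result other than never on an ES node means the echo commands need to be applied again, for example from a boot-time script.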

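8. Cancelling tasks

ES 2.3 and later provide an API for cancelling tasks.
Search tasks have a large impact on overall ES performance. Some large searches can run for more than 1 ~ 2 hours and hold up smaller searches.
Here is a script that cancels search tasks that have been running too long:
kill_long_task.py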

#!/usr/bin/python
# coding=utf8
#############################################
# Scan ES for tasks that have been running for more than 10 minutes
# and cancel them
#############################################
import requests
import os
import logging
from logging import Logger

log_conf = {
    "level": logging.INFO,
    "log_dir": "."
}

def init_logger(logger_name='all'):
    if logger_name not in Logger.manager.loggerDict:
        logger = logging.getLogger(logger_name)
        logger.setLevel(log_conf['level'])
        # file
        fmt = '%(asctime)s - %(process)s - %(levelname)s: - %(message)s'
        formatter = logging.Formatter(fmt)

        # all file
        log_file = os.path.join(log_conf['log_dir'], logger_name + '.log')
        file_handler = logging.FileHandler(log_file)
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)

        # error file
        log_file = os.path.join(log_conf['log_dir'], logger_name + '.error.log')
        file_handler = logging.FileHandler(log_file)
        file_handler.setFormatter(formatter)
        file_handler.setLevel(logging.ERROR)
        logger.addHandler(file_handler)

    logger = logging.getLogger(logger_name)
    return logger

logger = init_logger("all")


def main():
    logger.info('[start]kill long task')
    # 1. scan
    wait2cancel_set = set()
    url = "http://localhost:9200/_tasks?actions=*search&detailed"
    res = requests.get(url)
    dd = res.json()
    for value in dd['nodes'].values():
        for task_id, task_info in value['tasks'].items():
            # note: the running time is reported in nanoseconds
            run_secs = task_info['running_time_in_nanos']/1000/1000/1000
            # 10 min
            if run_secs > 60 * 10:
                wait2cancel_set.add(task_id)

    logger.info('wait2cancel_list:%s, count:%s', wait2cancel_set, len(wait2cancel_set))
    # 2. cancel
    for task_id in wait2cancel_set:
        # change the ES address to match your cluster
        url = "http://localhost:9200/_tasks/%s/_cancel" % (task_id)
        res = requests.post(url)
        logger.info("cancel task, task_id:%s, result:%s", task_id, res.text)

    logger.info('[end]kill long task')

if __name__ == '__main__':
    print('--------start-----------')
    main()
    print('--------end-----------')

See reference [5].

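9. Document counts from a terms aggregation are approximate

See reference [6].
A simple aggregation looks like this:

{
    "aggs": {
        "bucket_uid": {
            "terms": {
                "field": "uid",
                "size": 20,
                "shard_size": 50, # optional: each shard returns its top 50 terms
                "show_term_doc_count_error": true # optional: report, for each term, the worst-case upper bound on the doc count error
            }
        }
    },
    "size": 0
}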

The response looks like:

  ... ...
  "aggregations": {
    "bucket_uid": {
      "doc_count_error_upper_bound": 2583, # worst-case upper bound on the doc count error
      "sum_other_doc_count": 905568, # doc count of all uids outside the top 20
      "buckets": [
        {
          "key": 5772399388,
          "doc_count": 3873,
          "doc_count_error_upper_bound": 895
        },
   # 3873 documents were counted for uid 5772399388, but in the worst case the real count can differ from this figure by up to 895
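One way to use these error bounds is to flag the buckets whose count cannot be trusted. The sketch below is an assumption-laden example: the function name and the 10% threshold are made up, and it expects the `bucket_uid` part of a response that was requested with `show_term_doc_count_error` set to true.

```python
def risky_buckets(terms_agg, max_relative_error=0.1):
    """Return (key, doc_count, error_bound) for buckets whose bound is too large.

    terms_agg is the dict found under aggregations.bucket_uid in the response.
    """
    flagged = []
    for bucket in terms_agg.get("buckets", []):
        count = bucket["doc_count"]
        bound = bucket.get("doc_count_error_upper_bound", 0)
        # Compare the worst-case error with the reported count.
        if count > 0 and bound / count > max_relative_error:
            flagged.append((bucket["key"], count, bound))
    return flagged


if __name__ == "__main__":
    sample = {
        "doc_count_error_upper_bound": 2583,
        "sum_other_doc_count": 905568,
        "buckets": [
            {"key": 5772399388, "doc_count": 3873,
             "doc_count_error_upper_bound": 895},
        ],
    }
    print(risky_buckets(sample))
```

For the bucket shown above, 895 / 3873 is about 23%, so it is flagged. Raising `shard_size` is the usual way to tighten the bound, at the cost of more work per shard.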

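10. Turn off HeapDumpOnOutOfMemoryError

The JVM settings shipped with ES enable HeapDumpOnOutOfMemoryError by default.
When the heap overflows or the JVM runs out of memory, a dump file is generated automatically.
Explanation:
Under heavy load, ES may need more memory than the configured maximum heap. With this option on, the JVM locks its memory and writes the dump. While the dump is in progress, the node cannot respond to heartbeats from the rest of the cluster, so the other nodes consider it offline, the master removes it from the cluster, and that in turn triggers shard relocation. We therefore recommend turning this option off.

It can be turned off directly with jinfo: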

jinfo -flag -HeapDumpOnOutOfMemoryError  <pid>

References

1. https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html
2. https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html
3. https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html
4. Locating and resolving abnormal cpu sys usage on servers in a Hadoop cluster on CentOS 6 (CentOS6上Hadoop集群中服务器cpu sys态异常的定位与解决)
5. https://www.elastic.co/guide/en/elasticsearch/reference/5.3/tasks.html
6. Approximate doc counts in the terms aggregation
