UTF8 encoding is longer than the max length 32766

Copyright notice: this is an original article from this site, published by 萌叔.
When reposting, please credit 萌叔 | https://vearne.cc

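Background: while inserting data into ES (Elasticsearch), a colleague received the error shown after the mapping below.

The mapping is as follows: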

{
    "test": {
        "mappings": {
            "test_ignore32766": {
                "properties": {
                    "message": {
                        "type": "string",
                        "index": "not_analyzed"
                    }
                }
            }
        }
    }
}

{
    "error": "RemoteTransportException[[Pietro Maximoff][inet[/10.1.1.51:9300]][indices:data/write/index]]; nested: IllegalArgumentException[Document contains at least one immense term in field=\"message\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[-28, -72, -83, -27, -101, -67, -25, -69, -113, -26, -75, -114, -26, -83, -93, -27, -100, -88, -25, -69, -113, -27, -114, -122, -26, -106, -80, -28, -72, -128]...', original message: bytes can be at most 32766 in length; got 69345]; nested: MaxBytesLengthExceededException[bytes can be at most 32766 in length; got 69345]; ",
    "status": 400
}
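The prefix in the error is printed as signed Java bytes, which makes it hard to read. As a small aid, the following Python sketch converts that prefix back into text. The helper name `decode_signed_bytes` is made up for this illustration and is not part of ES.

```python
def decode_signed_bytes(values):
    """Convert signed Java-style bytes (-128..127) into a UTF-8 string."""
    # Masking with 0xFF maps each signed value to its unsigned byte (0..255).
    raw = bytes(v & 0xFF for v in values)
    return raw.decode("utf-8")

# Prefix copied from the error message above.
prefix = [-28, -72, -83, -27, -101, -67, -25, -69, -113, -26, -75, -114,
          -26, -83, -93, -27, -100, -88, -25, -69, -113, -27, -114, -122,
          -26, -106, -80, -28, -72, -128]

print(decode_signed_bytes(prefix))  # 中国经济正在经历新一
```

Each of these characters takes 3 bytes in UTF-8, which is why 30 bytes decode to only 10 characters.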

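The cause is this: the message field is set to not_analyzed, which means the field is not tokenized. The field value itself is still indexed, as a single term, so it can be searched with a term query: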

{
    "query":{
        "term":{
            "message":"Syntax error"
        }
    }
}
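To make the distinction concrete, here is a toy model in Python. It is not how ES or Lucene is implemented: the lowercase-and-split `analyze` function merely stands in for a real analyzer, and all names are hypothetical. It shows that a not_analyzed field contributes exactly one term (the whole value), so a term query matches only when the query string equals the entire field value.

```python
def analyze(text):
    """Stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def index_terms(value, analyzed):
    """Return the set of terms a field value would contribute to the index."""
    # analyzed: one term per token; not_analyzed: the whole value as one term
    return set(analyze(value)) if analyzed else {value}

def term_query(terms, query):
    """A term query is an exact lookup: the query string is not analyzed."""
    return query in terms

value = "Syntax error"
print(term_query(index_terms(value, analyzed=False), "Syntax error"))  # True
print(term_query(index_terms(value, analyzed=False), "Syntax"))        # False
print(term_query(index_terms(value, analyzed=True), "syntax"))         # True
```

Because the whole value becomes a single term, a very long value produces a very long term.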

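Indexing very long text as a single term is expensive, and the underlying Lucene library limits a term to 32766 bytes, so ES rejects any value whose UTF-8 encoding is longer than that.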
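A quick way to see whether a value would hit the limit is to measure its UTF-8 byte length rather than its character count. The following is a minimal Python sketch. The function names are hypothetical, and 32766 is the limit quoted in the error message.

```python
MAX_TERM_BYTES = 32766  # limit quoted in the error message

def utf8_len(text):
    """Length of the text in bytes once encoded as UTF-8."""
    return len(text.encode("utf-8"))

def exceeds_term_limit(text):
    """True if the value, indexed as a single term, would be too long."""
    return utf8_len(text) > MAX_TERM_BYTES

# 23115 Chinese characters at 3 bytes each give the 69345 bytes from the error.
# The real content of the original message is unknown; this is illustrative.
message = "中" * 23115
print(len(message), utf8_len(message), exceeds_term_limit(message))
# 23115 69345 True
```

So a message well under 32766 characters can still exceed the limit once it is encoded.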
There are two ways to fix this:
1) If the field does not need to be indexed at all, set the mapping to:

{
    "test": {
        "mappings": {
            "test_ignore32766": {
                "properties": {
                    "message": {
                        "type": "string",
                        "index": "no"
                    }
                }
            }
        }
    }
}

2) Use ignore_above, setting the mapping to:

{
    "test": {
        "mappings": {
            "test_ignore": {
                "properties": {
                    "message": {
                        "type": "string",
                        "index": "not_analyzed",
                        "ignore_above": 20
                    }
                }
            }
        }
    }
}

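With this mapping, if the message field is longer than 20 characters, the document itself is still inserted, but the message field is not indexed. Note that ignore_above counts characters, whereas the 32766 limit counts UTF-8 bytes, so in practice the value should be chosen to keep the encoded term under that limit (20 is used here only for demonstration).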
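The ES documentation notes that ignore_above counts characters while Lucene's limit counts bytes, so a value that guards against the error has to allow for the worst-case number of bytes per character. The sketch below is a rough Python model, with hypothetical helper names, of how such a value could be chosen and what the setting does. Python's `len` counts code points, which is only an approximation of how ES measures length.

```python
MAX_TERM_BYTES = 32766

def safe_ignore_above(max_bytes_per_char):
    """Largest character count that can never exceed the byte limit."""
    return MAX_TERM_BYTES // max_bytes_per_char

def is_indexed(value, ignore_above):
    """Model of ignore_above: values longer than the limit are not indexed."""
    return len(value) <= ignore_above

print(safe_ignore_above(3))       # 10922, for text at 3 bytes per character
print(safe_ignore_above(4))       # 8191, covers any UTF-8 character
print(is_indexed("中" * 20, 20))  # True: 20 characters, although 60 bytes
print(is_indexed("中" * 21, 20))  # False
```

For mostly Chinese text, a value around 10922 would keep every indexed term within the limit, while still letting longer documents be stored.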

