ES | 萌叔

How do you store relationships in Elasticsearch?

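Copyright notice: this is an original article published by 萌叔. When reposting, please credit 萌叔 | http://vearne.cc

1. Preface

I am writing this because I have seen people ask, in more than one chat group, how to store relationships in ES.

2. Answer

You may find posts online that recommend the Join datatype or the Nested data type, but neither is really how ES is meant to be used.

- Both the Join datatype and the Nested data type incur the overhead of multiple queries.
- With the Join datatype the data itself lives in different tables; in a distributed database this also adds the cost of pulling the data from different nodes and assembling it.

So what should you do? The answer is to store the relationship in a redundant wide table.

Example

Suppose the entities I need to store in ES are books, and each book carries author information, a title, and so on. There is clearly a relationship between the entities (book and author).

In a traditional relational database you would need to create 2 tables, one for authors and one for books. In a NoSQL database a single table (books) is enough, with docs shaped like this:

    {
      "name": "zhangsan",
      "publisher_identifier": "xxx-xxxx-xxx",
      "author": {
        "name": "jobs",
        "phone": "111111111"
      }
    }

The author information is stored together with the book as one of its attributes, all in a single doc.

This approach inevitably introduces data redundancy, but it trades space for time, so query speed is assured. Modern NoSQL databases mostly target storage and querying of massive data sets, so most of them are distributed. In that setting the overall design has to be simple enough to be easy to maintain and scale. The same approach applies equally well to HBase.

3. A few words some people may not want to hear

- An ES cluster is actually expensive to run. If you use it, do not be afraid of the cost; if it feels like burning money, do not use it.
- ES's built-in performance optimization is quite good. Most people do not need to think about tuning; if performance falls short, simply add hardware.
- Newer versions are much better than older ones in both performance and stability, so prefer a newer version.
- SSDs improve ES performance very noticeably (cheap is not necessarily bad, but good is never cheap).

4. References

1. Join datatype
2. Nested data type
3. The difference between wide tables and narrow tables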
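The wide-table idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `build_book_doc` and `author_name_query`, the second book ("lisi") and its identifier are hypothetical and do not come from the original post. The sketch builds the docs and a query body locally and does not talk to a cluster.

```python
import json


def build_book_doc(book, author):
    """Return one denormalized ("wide") doc: the author is copied into the book."""
    doc = dict(book)
    doc["author"] = dict(author)  # a redundant copy per book, space traded for time
    return doc


def author_name_query(name):
    """Build a query body that matches books by the embedded author's name."""
    return {"query": {"match": {"author.name": name}}}


author = {"name": "jobs", "phone": "111111111"}
books = [
    {"name": "zhangsan", "publisher_identifier": "xxx-xxxx-xxx"},
    {"name": "lisi", "publisher_identifier": "yyy-yyyy-yyy"},  # hypothetical second book
]

docs = [build_book_doc(book, author) for book in books]
print(json.dumps(docs[0], indent=2))
print(json.dumps(author_name_query("jobs")))
```

Each book doc carries its own copy of the author, so a single query on `author.name` finds the books without a second lookup. The price is that changing an author's phone number means rewriting every book doc of that author.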

An interesting phenomenon in ES scoring

1. Introduction

Our company has a simple internal search engine built on ES. A couple of days ago a tester asked me why the score changes when the same query is run several times against the same document. I verified it, and the problem is real. So how does it come about?

2. Example

Let's construct an example.

2.1 Create the index

    curl -XPUT -H "Content-Type: application/json" dev1:9200/test -d '
    {
      "settings": {
        "index.number_of_replicas": "2",
        "index.number_of_shards": "1"
      },
      "mappings": {
        "_default_": {
          "dynamic_templates": [],
          "properties": {
            "brand": {
              "type": "keyword"
            }
          }
        }
      }
    }'

2.2 Write data

First run, insert1.py:

    import requests

    # Write 500 docs with the same brand into the "car" type of index "test".
    for i in range(500):
        url = "http://dev1:9200/test/car/%d" % (i)
        res = requests.put(url, json={"brand": "buick", "age": i})
        print(i, res.status_code)

Second run, insert2.py:

...

ES internal tech talk

1. Introduction to ES

ElasticSearch is an open-source distributed search engine developed by Elastic.

- Open-source, distributed full-text search
- OLAP (used together with Kibana)
- RESTful API
- NoSQL database

1.1 Relationship to Lucene

    {
      "name": "uf_-1wJ",
      "cluster_name": "UT",
      "cluster_uuid": "xvGp84DyQVOSxLXP43oDpA",
      "version": {
        "number": "6.2.4",
        "build_hash": "ccec39f",
        "build_date": "2018-04-12T20:37:28.497551Z",
        "build_snapshot": false,
        "lucene_version": "7.2.1",
        "minimum_wire_compatibility_version": "5.6.0",
        "minimum_index_compatibility_version": "5.0.0"
      },
      "tagline": "You Know, for Search"
    }

In short, ES is a distributed version of Lucene, extended with additional APIs and online analytics.

1.2 Basic concepts

1.2.1 Index level

- index
- type
- doc
- term

1.2.2 Inverted index

- Forward order: doc -> term
- Inverted index: term -> doc

1.2.3 Cluster level

- Cluster
- Node
- Shard
- Segment

Node roles:

- Master-eligible node
- Data node
- Ingest node
- Tribe node
- Coordinating node

...
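The difference between the forward order (doc -> term) and the inverted index (term -> doc) can be shown with a toy example. This is a minimal sketch, not how Lucene stores its index: the function name `build_inverted_index`, the sample docs and the whitespace tokenizer are all assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Turn a forward mapping (doc id -> text) into term -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive analysis step: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


# Forward order: each doc lists its terms.
docs = {1: "buick car", 2: "red car", 3: "red buick"}

# Inverted order: each term lists the docs that contain it.
index = build_inverted_index(docs)
for term in sorted(index):
    print(term, sorted(index[term]))
```

Looking up a term is then a single dictionary access instead of a scan over every doc, which is the property that makes term-based search fast.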

Custom doc routing rules in Elasticsearch

1. Preface

A few days ago someone in a chat group asked whether ES can use a particular field as the routing value. ES does support custom doc routing, but the way it works is that you specify the doc's routing key when you write the doc.

2. Example

2.1 Create a new index

    PUT /my_index2
    {
      "settings": {
        "index": {
          "number_of_shards": 2,
          "number_of_replicas": 1
        }
      }
    }

The index has only 2 shards, shard 0 and shard 1.

2.2 Set the mapping

    PUT /my_index2/_mapping/student
    {
      "_routing": {
        "required": true
      },
      "properties": {
        "name": {
          "type": "keyword"
        },
        "age": {
          "type": "integer"
        }
      }
    }

2.3 Specify the doc routing

    PUT /my_index2/student/1?routing=key1
    {
      "name": "n1",
      "age": 10
    }

    PUT /my_index2/student/2?routing=key1
    {
      "name": "n2",
      "age": 10
    }

    PUT /my_index2/student/3?routing=key1
    {
      "name": "n3",
      "age": 10
    }

The 3 commands above place docs 1, 2 and 3 on the same shard.

shard 0 ...
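Why the same routing key always lands on the same shard can be sketched as follows. ES documents the rule as roughly `shard_num = hash(_routing) % num_primary_shards`. The sketch below is an assumption-laden illustration: it uses `zlib.crc32` as a stand-in hash, whereas ES uses its own Murmur3-based hash, so the shard numbers printed here will not necessarily match what a real cluster chooses. The function name `pick_shard` is hypothetical.

```python
import zlib


def pick_shard(routing_key, number_of_shards):
    """Map a routing key to a shard number.

    Stand-in hash: zlib.crc32. Real ES uses a Murmur3-based hash, so only
    the "same key -> same shard" property carries over, not the numbers.
    """
    hashed = zlib.crc32(routing_key.encode("utf-8"))
    return hashed % number_of_shards


NUMBER_OF_SHARDS = 2  # matches number_of_shards of my_index2

# Docs 1, 2 and 3 are all written with routing=key1.
placement = {doc_id: pick_shard("key1", NUMBER_OF_SHARDS) for doc_id in ("1", "2", "3")}
print(placement)
```

Because the shard is derived from the routing key rather than from the doc id, every doc written with `routing=key1` ends up on one shard, and a search that passes the same routing value only needs to visit that shard.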

UTF8 encoding is longer than the max length 32766

Background: a colleague received the error shown below while inserting data into ES.

The mapping is as follows:

    {
      "test": {
        "mappings": {
          "test_ignore32766": {
            "properties": {
              "message": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          }
        }
      }
    }

The error:

    {
      "error": "RemoteTransportException[[Pietro Maximoff][inet[/10.1.1.51:9300]][indices:data/write/index]]; nested: IllegalArgumentException[Document contains at least one immense term in field=\"message\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[-28, -72, -83, -27, -101, -67, -25, -69, -113, -26, -75, -114, -26, -83, -93, -27, -100, -88, -25, -69, -113, -27, -114, -122, -26, -106, -80, -28, -72, -128]...', original message: bytes can be at most 32766 in length; got 69345]; nested: MaxBytesLengthExceededException[bytes can be at most 32766 in length; got 69345]; ",
      "status": 400
    }

The cause is this: the message field is set to not_analyzed, which means the field is not tokenized, but the field itself is still indexed as a whole, so it can still be searched with a term query ...
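The limit in the error message is counted in UTF-8 bytes, not characters, which matters for Chinese text. The following minimal sketch shows a client-side check before indexing; the function names `utf8_len` and `fits_in_one_term` and the sample value are assumptions for illustration, and only the 32766 limit and the 69345 figure come from the error above.

```python
MAX_TERM_BYTES = 32766  # limit quoted in the error message


def utf8_len(value):
    """Length of the value in bytes once encoded as UTF-8."""
    return len(value.encode("utf-8"))


def fits_in_one_term(value):
    """True if a not_analyzed field could index the value as a single term."""
    return utf8_len(value) <= MAX_TERM_BYTES


# "中" takes 3 bytes in UTF-8, so 23115 characters already make 69345 bytes.
long_value = "中" * 23115

print(len(long_value), utf8_len(long_value), fits_in_one_term(long_value))
```

A value of about 23 thousand Chinese characters looks well under 32766 when counted in characters, yet it is more than twice the limit in bytes, which is consistent with the "got 69345" in the error.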

Elasticsearch lessons learned (continuously updated)

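Background: ES has been in use at my company for more than 3 years, and the cluster has grown to over a hundred machines. We have gathered a lot of experience along the way, which I summarize here to share. My technical level is limited, so please point out any mistakes.

Items: I list these as questions and will keep adding to them.

1. Shard sizing

The number of shards of an index is fixed when the index is created, so choosing the shard count is very important. From experience, a shard size of 10GB ~ 20GB is about right, for the following reasons:

1) ES balances load by moving shards around, and an overly large shard moves very slowly.
2) Each shard is effectively a Lucene instance, and each Lucene instance has its own set of Java threads, so the number of shards should not be too large either.

2. Index naming

If the data grows over time, you can split it into one index per month or per day. The indices can be named

index_201701, index_201702, index_201703

or

index_20170301, index_20170302, index_20170303

You can then give them the alias index_2017 and query all of these indices through that alias. In addition, an ES index can be closed; a closed index uses no memory and only consumes disk space.

3. SSD or spinning disk?

Elasticsearch's speed depends on its indices, and a large part of them is stored as files on disk. If your data volume is large and individual queries or aggregations touch a lot of data, you should use SSDs. In our tests, for queries over large amounts of data, ES on SSD was 10 times as fast as ES on spinning disks. The official guide reports up to a 500-fold improvement in write throughput when SSDs are configured correctly (see reference [1]).

Reference values:

Record size: 1 kB per record
Operating system: CentOS 6.7
Memory: 64 GB
ES version: 2.3, heap size 31 GB

Processing capacity of a single ES data node:

    Disk            Throughput
    Spinning disk   10,000/min
    SSD             100,000/min

See reference [1]:

If you are using SSDs, make sure your OS I/O scheduler is configured correctly. When you write data to disk, the I/O scheduler decides when that data is actually sent to the disk. The default under most *nix distributions is a scheduler called cfq (Completely Fair Queuing). ...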
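The naming scheme from item 2 (index naming) can be sketched in Python. This is an illustration under assumptions: the helper names `monthly_index_names`, `daily_index_name` and `alias_actions` are hypothetical. The returned dictionary follows the shape of the request body of the `_aliases` API (a list of `add` actions); actually sending it to a cluster is left out.

```python
from datetime import date


def monthly_index_names(prefix, year, months):
    """Names such as index_201701 for each month in `months`."""
    return ["%s_%04d%02d" % (prefix, year, month) for month in months]


def daily_index_name(prefix, day):
    """Name such as index_20170301 for a single day."""
    return "%s_%s" % (prefix, day.strftime("%Y%m%d"))


def alias_actions(index_names, alias):
    """Request body that adds every index in `index_names` to `alias`."""
    return {
        "actions": [
            {"add": {"index": name, "alias": alias}} for name in index_names
        ]
    }


names = monthly_index_names("index", 2017, range(1, 4))
body = alias_actions(names, "index_2017")

print(names)
print(daily_index_name("index", date(2017, 3, 1)))
print(body)
```

Deriving the name from the record's date keeps each index small enough to stay near the shard size suggested in item 1, and the alias hides the split from the code that queries the data.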