Database | 萌叔

happybase: does put() use batching by default?

Copyright notice: original article on this site, published by 萌叔. Please credit 萌叔 | http://vearne.cc when reposting.

Background: a while ago we replaced the put() calls we use to write data to HBase through happybase with batch(), and found that performance did not improve. Reading the code, I found that put() is itself implemented as a batch insert.

table.py

    def put(self, row, data, timestamp=None, wal=True):
        """Store data in the table.

        This method stores the data in the `data` argument for the row
        specified by `row`. The `data` argument is dictionary that maps columns
        to values. Column names must include a family and qualifier part, e.g.
        `cf:col`, though the qualifier part may be the empty string, e.g.
        `cf:`.

        Note that, in many situations, :py:meth:`batch()` is a more appropriate
        method to manipulate data.

        .. versionadded:: 0.7
           `wal` argument

        :param str row: the row key
        :param dict data: the data to store
        :param int timestamp: timestamp (optional)
        :param wal bool: whether to write to the WAL (optional)
        """
        with self.batch(timestamp=timestamp, wal=wal) as batch:
            batch.put(row, data)  # clearly a batch operation

batch.py ...
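One way to read the code above: each put() opens a batch that holds a single row, so wrapping every row in its own batch() call changes nothing. Any gain can only come from many rows sharing one batch. The sketch below illustrates that idea. The helper name `write_rows` and its parameters are hypothetical; `batch_size` is an existing argument of happybase's `Table.batch()`. Whether this helps in a particular setup is not something the excerpt establishes.

```python
def write_rows(table, rows, batch_size=1000):
    """Write many rows through ONE shared batch (hypothetical helper).

    `table` is assumed to be a happybase Table, for example:
        # table = happybase.Connection('hbase-host').table('mytable')
    `rows` is an iterable of (row_key, data_dict) pairs.
    """
    count = 0
    # With batch_size set, the batch is sent to the server every
    # `batch_size` mutations, and once more when the block exits.
    with table.batch(batch_size=batch_size) as batch:
        for row_key, data in rows:
            batch.put(row_key, data)
            count += 1
    return count
```

Compared with calling put() in a loop, the only difference is that the `with` block is entered once for all rows instead of once per row.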

Redis startup warnings and how to handle them

Background: a Redis machine in production reported "Can't save in background: fork: Cannot allocate memory", which caused the Redis service to stop. At the time the machine had 64 GB of memory, and Redis was using only a little over 40 GB.

As we know, when persistence is enabled, both bgsave in RDB mode and the rewrite of appendonly.aof in AOF mode cause Redis to fork a child process. But shouldn't the operating system's fork be copy-on-write? This made me take a fresh look at the log Redis prints at startup. First, here is that startup log:

    1610:M 12 Sep 07:46:20.524 # Server started, Redis version 3.0.1
    1610:M 12 Sep 07:46:20.524 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
    1610:M 12 Sep 07:46:20.524 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
    1610:M 12 Sep 07:46:20.525 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
    1610:M 12 Sep 07:46:20.525 * The server is now ready to accept connections on port 6379
    1610:M 12 Sep 07:57:21.819 * Background saving started by pid 1615
    1615:C 12 Sep 07:57:21.827 * DB saved on disk
    1615:C 12 Sep 07:57:21.827 * RDB: 4 MB of memory used by copy-on-write
    1610:M 12 Sep 07:57:21.925 * Background saving terminated with success

There are three warnings ...
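The three warnings in the log each correspond to a kernel setting that can be read from the filesystem on Linux. As a rough sketch (not part of Redis), the check below reads the same three paths named in the log and reports which ones would trigger a warning. The function names and the `root` parameter are hypothetical; `root` exists only so the check can be pointed at a test directory.

```python
import os


def read_text(path):
    """Return the stripped contents of a file, or None if it cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip()
    except (IOError, OSError):
        return None


def check_kernel_settings(root="/", backlog=511):
    """Report which of the three settings from the Redis log look problematic."""
    problems = []

    overcommit = read_text(os.path.join(root, "proc/sys/vm/overcommit_memory"))
    if overcommit == "0":
        problems.append("vm.overcommit_memory is 0")

    # The THP file lists all modes and brackets the active one,
    # e.g. "[always] madvise never".
    thp = read_text(os.path.join(root, "sys/kernel/mm/transparent_hugepage/enabled"))
    if thp is not None and "[never]" not in thp:
        problems.append("transparent huge pages are enabled")

    somaxconn = read_text(os.path.join(root, "proc/sys/net/core/somaxconn"))
    if somaxconn is not None and int(somaxconn) < backlog:
        problems.append("somaxconn %s is below the TCP backlog %d" % (somaxconn, backlog))

    return problems


if __name__ == "__main__":
    for problem in check_kernel_settings():
        print("WARNING:", problem)
```

The default `backlog=511` matches the value shown in the log; a Redis instance configured with a different `tcp-backlog` would need that value passed in.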

Database "pitfalls" I have run into

Foreword: a while ago I gave an internal talk at my company summarizing some of the problems I have run into while using various databases, and I am sharing it here as well. Note that I put "pitfalls" in quotation marks: some of them are simply characteristics of a particular database, and some are problems caused by our own mistakes. They do not necessarily mean that the database itself is at fault.

1. Business

1) Business scenarios

Unreasonable business design is always what hurts programmers the most. In a system I maintain there is a scenario in which a user downloads a full year or half a year of public-opinion data in one go. The data volume is large: a single task can reach several million records. Pushing several million records through any system in a short time is not easy, especially when there are many such tasks. The time span has since been capped at 3 months. This reminds me of how 12306 staggers the release times of train tickets. Optimizing at the business level always pays off immediately.

2) Field design

In one system I maintain, the same metric is stored under different field names in different tables, which has caused us enormous pain. I therefore recommend using the same field name for the same metric or entity everywhere it is expressed or stored; otherwise the conversions alone will wear you out later.

3) Denormalized table design

In big-data scenarios, do not let relational normal forms influence you too much. Where a data structure can be nested, nest it rather than flattening it. Take a Sina Weibo repost as an example. A repost contains:

    the author of this post
    the content of this post: text
    the original post: retweeted_status
    the content of the original post: retweeted_status.text
    the author of the original post: retweeted_status.user
    …

A single record contains the repost itself plus most of the content related to it. No joins are needed when using it, and it can be stored conveniently in a NoSQL database.

    {
      "created_at": "Tue May 31 17:46:55 +0800 2011",
      "id": 11488058246,
      "text": "求关注。",
      "source": "<a href=\"http://weibo.com\" rel=\"nofollow\">新浪微博</a>",
      "favorited": false,
      "truncated": false,
      "in_reply_to_status_id": "",
      "in_reply_to_user_id": "",
      "in_reply_to_screen_name": "",
      "geo": null,
      "mid": "5612814510546515491",
      "reposts_count": 8,
      "comments_count": 9,
      "annotations": [],
      "user": {
        "id": 1404376560,
        "screen_name": "zaku",
        "name": "zaku",
        "province": "11",
        "city": "5",
        "location": "北京朝阳区",
        "description": "人生五十年，乃如梦如幻；有生斯有死，壮士复何憾。",
        "url": "http://blog.sina.com.cn/zaku",
        "profile_image_url": "http://tp1.sinaimg.cn/1404376560/50/0/1",
        "domain": "zaku",
        "gender": "m",
        "followers_count": 1204,
        "friends_count": 447,
        "statuses_count": 2908,
        "favourites_count": 0,
        "created_at": "Fri Aug 28 00:00:00 +0800 2009",
        "following": false,
        "allow_all_act_msg": false,
        "remark": "",
        "geo_enabled": true,
        "verified": false,
        "allow_all_comment": true,
        "avatar_large": "http://tp1.sinaimg.cn/1404376560/180/0/1",
        "verified_reason": "",
        "follow_me": false,
        "online_status": 0,
        "bi_followers_count": 215
      },
      "retweeted_status": {
        "created_at": "Tue May 24 18:04:53 +0800 2011",
        "id": 11142488790,
        "text": "我的相机到了。",
        "source": "<a href=\"http://weibo.com\" rel=\"nofollow\">新浪微博</a>",
        "favorited": false,
        "truncated": false,
        "in_reply_to_status_id": "",
        "in_reply_to_user_id": "",
        "in_reply_to_screen_name": "",
        "geo": null,
        "mid": "5610221544300749636",
        "annotations": [],
        "reposts_count": 5,
        "comments_count": 8,
        "user": {
          "id": 1073880650,
          "screen_name": "檀木幻想",
          "name": "檀木幻想",
          "province": "11",
          "city": "5",
          "location": "北京朝阳区",
          "description": "请访问微博分析家。",
          "url": "http://www.weibo007.com/",
          "profile_image_url": "http://tp3.sinaimg.cn/1073880650/50/1285051202/1",
          "domain": "woodfantasy",
          "gender": "m",
          "followers_count": 723,
          "friends_count": 415,
          "statuses_count": 587,
          "favourites_count": 107,
          "created_at": "Sat Nov 14 00:00:00 +0800 2009",
          "following": true,
          "allow_all_act_msg": true,
          "remark": "",
          "geo_enabled": true,
          "verified": false,
          "allow_all_comment": true,
          "avatar_large": "http://tp3.sinaimg.cn/1073880650/180/1285051202/1",
          "verified_reason": "",
          "follow_me": true,
          "online_status": 0,
          "bi_followers_count": 199
        }
      }
    }

2. HBase

1) No indexes

HBase's biggest problem is that it cannot build indexes. There are two workarounds for building an index ...

Rate limiting in a distributed environment with Redis

Redis has built-in counters and supports an atomic increment-by-one, which makes it especially well suited to rate limiting in a distributed environment.

    # coding:utf-8
    import time
    from redis import StrictRedis


    class Counter(object):
        def __init__(self, redis_url):
            self.redis_client = StrictRedis.from_url(redis_url)

        def increment(self, key):
            t = int(time.time())
            sign = t // 60  # index of the current one-minute window
            redis_key = key + ':' + str(sign)
            counter = self.redis_client.incr(redis_key)
            # Note: setting the key's expiry does not need to be in the
            # same transaction as the atomic increment.
            self.redis_client.expire(redis_key, 300)  # expire the key after 300 seconds
            return counter


    if __name__ == '__main__':
        redis_url = 'redis://127.0.0.1:6379/0'
        c = Counter(redis_url)
        for i in range(100):
            time.sleep(0.2)
            x = c.increment('hello')
            if x > 50:
                print("over limit")
            print(x)

Rate limiting this way has one precondition: the clocks across the distributed environment must be kept as closely synchronized as possible. In the example above, the limit is 50 calls per minute.
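The same fixed-window idea can be wrapped into a function that returns an allow/deny decision directly. This is a minimal sketch, not the author's code: the function name `allow` and the parameters `limit`, `window`, `ttl` and `now` are hypothetical, and `client` is assumed to be any object with redis-py style `incr` and `expire` methods. Passing `now` explicitly makes the window boundaries easy to exercise without sleeping.

```python
import time


def allow(client, key, limit=50, window=60, ttl=300, now=None):
    """Return True if this call is within `limit` calls per `window` seconds."""
    now = int(time.time()) if now is None else int(now)
    # All calls in the same window share one counter key.
    redis_key = "%s:%d" % (key, now // window)
    count = client.incr(redis_key)
    # As in the original example, the expiry is refreshed on every call
    # and is not wrapped in a transaction with the increment.
    client.expire(redis_key, ttl)
    return count <= limit
```

A caller would typically build the client with `StrictRedis.from_url(...)` as in the example above and reject the request whenever `allow(...)` returns False.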

UTF8 encoding is longer than the max length 32766

Background: a colleague received the error shown below while inserting data into ES.

The mapping is as follows:

    {
      "test": {
        "mappings": {
          "test_ignore32766": {
            "properties": {
              "message": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          }
        }
      }
    }

The error response:

    {
      "error": "RemoteTransportException[[Pietro Maximoff][inet[/10.1.1.51:9300]][indices:data/write/index]]; nested: IllegalArgumentException[Document contains at least one immense term in field=\"message\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[-28, -72, -83, -27, -101, -67, -25, -69, -113, -26, -75, -114, -26, -83, -93, -27, -100, -88, -25, -69, -113, -27, -114, -122, -26, -106, -80, -28, -72, -128]...', original message: bytes can be at most 32766 in length; got 69345]; nested: MaxBytesLengthExceededException[bytes can be at most 32766 in length; got 69345]; ",
      "status": 400
    }

The cause is this: the message field is set to not_analyzed, which means the field is not tokenized, but the field value itself is still indexed as a single term. In other words, it can still be searched with a term query ...
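The limit in the error message is counted in UTF-8 bytes, not characters, so Chinese text (3 bytes per character for common characters) reaches it after roughly 10,922 characters. The sketch below, with hypothetical helper names, checks a value against the 32766-byte limit before it is sent, and decodes the signed-byte prefix that the error message prints so the offending document can be identified.

```python
MAX_TERM_BYTES = 32766  # the limit quoted in the error message


def utf8_length(text):
    """Length of `text` in UTF-8 bytes, which is what the limit applies to."""
    return len(text.encode("utf-8"))


def exceeds_term_limit(text, limit=MAX_TERM_BYTES):
    """True if `text` would be too long to index as a single term."""
    return utf8_length(text) > limit


def decode_signed_bytes(values):
    """Decode a list of Java-style signed bytes (-128..127) as UTF-8 text."""
    return bytes(bytearray(v & 0xFF for v in values)).decode("utf-8")


if __name__ == "__main__":
    prefix = [-28, -72, -83, -27, -101, -67, -25, -69, -113, -26, -75, -114,
              -26, -83, -93, -27, -100, -88, -25, -69, -113, -27, -114, -122,
              -26, -106, -80, -28, -72, -128]
    print(decode_signed_bytes(prefix))
```

Running the script prints the first ten characters of the oversized message from the error above, which is usually enough to find the document that triggered it.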

A first look at kingshard

Background: for quite a long time I had been looking for a distributed solution for MySQL without finding a particularly satisfying answer. A colleague recommended kingshard, so I decided to take a closer look.

1. Installation

Three machines in total:

    Machine     IP               Description
    Machine 1   192.168.122.1    kingshard
    Machine 2   192.168.122.3    MySQL instance (node1), master, no slave
    Machine 3   192.168.122.4    MySQL instance (node2), master, no slave

For installation, see the official documentation: https://github.com/flike/kingshard/blob/master/doc/KingDoc/kingshard_install_document.md

I installed the kingshard version from August 9, 2016. kingshard currently has no flag or configuration option for starting it as a background daemon, so the author recommends managing it with supervisor. Since I mainly wanted to observe its behavior, I started it directly in a terminal:

    ./kingshard -config=../etc/ks.yaml

Here is my configuration file, ks.yaml:

    # server listen addr
    addr : 0.0.0.0:9696

    # server user and password
    user : kingshard
    password : kingshard

    # if set log_path, the sql log will write into log_path/sql.log,the system log
    # will write into log_path/sys.log
    #log_path : /Users/flike/log

    # log level[debug|info|warn|error],default error
    log_level : debug

    # if set log_sql(on|off) off,the sql log will not output
    log_sql: on

    # only log the query that take more than slow_log_time ms
    #slow_log_time : 100

    # the path of blacklist sql file
    # all these sqls in the file will been forbidden by kingshard
    #blacklist_sql_file: /Users/flike/blacklist

    # only allow this ip list ip to connect kingshard
    #allow_ips: 127.0.0.1

    # the charset of kingshard, if you don't set this item
    # the default charset of kingshard is utf8.
    #proxy_charset: gbk

    # node is an agenda for real remote mysql server.
    nodes :
    - name : node1
      # default max conns for mysql server
      max_conns_limit : 32
      # all mysql in a node must have the same user and password
      user : kingshard
      password : kingshard
      # master represents a real mysql master server
      master : 192.168.122.3:3306
      # slave represents a real mysql slave server,and the number after '@' is
      # read load weight of this slave.
      #slave : 192.168.59.101:3307@2,192.168.59.101:3307@3
      down_after_noalive : 32
    - name : node2
      # default max conns for mysql server
      max_conns_limit : 32
      # all mysql in a node must have the same user and password
      user : kingshard
      password : kingshard
      # master represents a real mysql master server
      master : 192.168.122.4:3306
      # slave represents a real mysql slave server
      slave :
      # down mysql after N seconds noalive
      # 0 will no down
      down_after_noalive: 32

    # schema defines sharding rules, the db is the sharding table database.
    schema :
      db : kingshard
      nodes: [node1,node2]
      default: node1
      shard:
      - table: test_shard_day
        key: mtime  # the time column used for sharding
        type: date_day
        nodes: [node1,node2]
        date_range: [20160306-20160307,20160308-20160309]

Since I mainly use kingshard's time-based table sharding, only ... is configured here ...
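To make the `date_day` rule in the configuration more concrete, here is a small sketch of how a date could be mapped to a node. It is only an illustration of my reading of the config, not kingshard's code, and it rests on two assumptions that should be checked against the kingshard documentation: that the entries of `date_range` pair up one-to-one with the entries of `nodes`, and that each day gets its own sub-table named with a `_YYYYMMDD` suffix. The function names are hypothetical.

```python
from datetime import date, datetime


def parse_ranges(date_range):
    """Turn ['20160306-20160307', ...] into a list of (start, end) dates."""
    ranges = []
    for item in date_range:
        start, end = item.split("-")
        ranges.append((datetime.strptime(start, "%Y%m%d").date(),
                       datetime.strptime(end, "%Y%m%d").date()))
    return ranges


def route(day, table, nodes, date_range):
    """Return (node, sub_table) for `day`.

    ASSUMPTION: the i-th date range belongs to the i-th node, and the
    sub-table name is the table name plus '_YYYYMMDD'.
    """
    for node, (start, end) in zip(nodes, parse_ranges(date_range)):
        if start <= day <= end:
            return node, "%s_%s" % (table, day.strftime("%Y%m%d"))
    raise ValueError("no date range covers %s" % day)


if __name__ == "__main__":
    nodes = ["node1", "node2"]
    date_range = ["20160306-20160307", "20160308-20160309"]
    print(route(date(2016, 3, 8), "test_shard_day", nodes, date_range))
```

Under these assumptions, a row whose `mtime` falls outside every configured range has nowhere to go, which is why the sketch raises an error in that case.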

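A clone function for peewee objects

Background: I needed a clone of a peewee object, so I wrote a small helper.

    def clone(instance):
        # Create an empty object of the same model class.
        obj = instance.__class__()
        # print(instance._meta.fields)
        # peewee 2.x keeps field values in `_data` (peewee 3 renamed it to `__data__`).
        data = getattr(instance, "_data")
        for key in instance._meta.fields:
            # print(key)
            # Copy every field except the primary key.
            if key != 'id':
                setattr(obj, key, data[key])
        return obj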

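Testing Redis with a large number of top-level keys

Background: an article claimed that the number of top-level keys in Redis should not exceed 1 million. I very much doubt how that conclusion was reached. We know that all keys in a Redis DB are kept in a single hash table (HashMap). A larger number of keys does make migrating keys harder, and because of the hash table, a rehash may take more time; a small amount of memory may also be wasted. But does growing past 1 million keys really cause any other problems?

1. Test and verification

1.1 Test method

Use the string type and keep inserting new keys. To keep the keys as unique as possible and of uniform length, use the MD5 of an auto-incrementing variable i. After every 100,000 keys written, record the current memory usage and the time taken to insert those 100,000 keys.

    import hashlib
    import time

    import redis

    r = redis.Redis(host='localhost', port=6379, db=5)
    SIZE = 100000  # number of keys per measurement
    fp = open('result.txt', 'w')
    counter = 0
    for i in range(0, 100):
        t1 = time.time()
        for j in range(0, SIZE):
            counter += 1
            print('counter', counter)
            # The MD5 hex digest gives fixed-length, effectively unique keys.
            m = hashlib.md5()
            m.update(str(i * SIZE + j).encode('utf-8'))
            key = m.hexdigest()
            r.set(key, 1)
        t2 = time.time()
        margin = t2 - t1  # seconds spent inserting this batch of keys
        info = r.info()
        # print(info)
        ll = []
        ll.append(str((i + 1) * SIZE))
        ll.append(str(margin))
        ll.append(str(info['used_memory']))
        fp.write(','.join(ll) + '\n')
        fp.flush()
        print((i + 1) * SIZE, margin, info['used_memory'])
    fp.close()

1.2 Test data

The test wrote 10 million keys in total and took roughly half an hour.

Figure 1 ...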
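The script above writes one `keys,seconds,used_memory` line to result.txt per 100,000 keys. A short sketch for turning those lines into comparable numbers follows; the function name `summarize` and the output field names are hypothetical. Note that `bytes_per_key` here is simply `used_memory` divided by the key count, so it includes Redis's baseline memory and overstates the per-key cost slightly when there are few keys.

```python
def summarize(lines):
    """Compute insert rate and average memory per key for each result line."""
    rows = []
    prev_keys = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        keys_s, seconds_s, mem_s = line.split(",")
        keys, seconds, mem = int(keys_s), float(seconds_s), int(mem_s)
        batch = keys - prev_keys  # keys inserted since the previous line
        rows.append({
            "keys": keys,
            "keys_per_sec": batch / seconds,
            "bytes_per_key": mem / float(keys),
        })
        prev_keys = keys
    return rows


if __name__ == "__main__":
    with open("result.txt") as f:
        for row in summarize(f):
            print(row)
```

If the claim about a 1-million-key ceiling were true, one would expect `keys_per_sec` to drop or `bytes_per_key` to rise noticeably after the tenth line.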

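Elasticsearch: lessons learned (continuously updated)

Background: ES has been in use at my company for more than 3 years, and the cluster has grown to over a hundred machines. We have accumulated a fair amount of experience along the way, which I am summarizing here to share. My technical ability is limited, so please point out any mistakes.

Items: I list these as questions and will keep adding to them.

1. Sizing shards

The number of shards in ES is fixed when the index is created, so choosing it well matters a lot. In our experience a shard size of 10 GB to 20 GB works well. The reasons for this size are:

1) ES balances load by moving shards, and moving an oversized shard is very slow.
2) Each shard is a Lucene instance, and each Lucene instance comes with its own set of Java threads, so there should not be too many shards either.

2. Naming indexes

If the data grows over time, you can split indexes by month or by day. The index names can be

    index_201701, index_201702, index_201703

or

    index_20170301, index_20170302, index_20170303

You can then give them an alias such as index_2017 and query all of the indexes directly through that alias. In addition, ES indexes can be closed; a closed index uses no memory and only consumes disk space.

3. SSD or spinning disk?

Elasticsearch's speed depends on its indexes, and large amounts of index data are stored as files on disk. If your data volume is large and individual queries or aggregations touch a lot of data, you should use SSDs. In our tests, when a query touches a large amount of data, ES on SSDs was 10 times as fast as ES on spinning disks. According to the official documentation, correct configuration can bring up to a 500-fold improvement in write throughput on SSDs (see reference [1]).

Here are some reference figures:

    Record size         1 KB per record
    Operating system    CentOS 6.7
    Memory              64 GB
    ES version          2.3, with a 31 GB heap

Processing capacity of a single ES data node:

    Spinning disk    SSD
    10,000/min       100,000/min

See reference [1]:

    If you are using SSDs, make sure your OS I/O scheduler is configured correctly. When you write data to disk, the I/O scheduler decides when that data is actually sent to the disk. The default under most *nix distributions is a scheduler called cfq (Completely Fair Queuing). ...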
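Items 1 and 2 above come down to simple arithmetic and string formatting, sketched here. The function names and the 15 GB default (the midpoint of the 10 GB to 20 GB range suggested in item 1) are my own choices for illustration, not values from the post.

```python
import math
from datetime import date, timedelta


def shard_count(total_gb, target_shard_gb=15):
    """Number of primary shards so that each holds about `target_shard_gb`."""
    return max(1, int(math.ceil(total_gb / float(target_shard_gb))))


def daily_index_names(prefix, start, end):
    """Index names such as index_20170301 for each day from start to end."""
    names = []
    day = start
    while day <= end:
        names.append("%s_%s" % (prefix, day.strftime("%Y%m%d")))
        day += timedelta(days=1)
    return names


if __name__ == "__main__":
    print(shard_count(300))
    print(daily_index_names("index", date(2017, 3, 1), date(2017, 3, 3)))
```

Because the shard count cannot be changed after an index is created, the estimate should be based on the expected final size of each monthly or daily index rather than its size on the first day.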