vint–针对int型的压缩格式

版权声明 本站原创文章 由 萌叔 发表
转载请注明 萌叔 | https://vearne.cc

1. 引言

vint是lucene所采用的针对int型的压缩方法。
对于绝大多数语言，int型都占4个字节，不论这个数据是100、1000、还是1000,000。vint的理念是采用可变长的字节，来表示一个整数。数值较大的数，使用较多的字节来表示，数值较少的数，使用较少的字节来表示。

2. vint的格式定义

A variable-length format for positive integers is defined where the high-order bit of each byte indicates whether more bytes remain to be read. The low-order seven bits are appended as increasingly more significant bits in the resulting integer value. Thus values from zero to 127 may be stored in a single byte, values from 128 to 16,383 may be stored in two bytes, and so on.

每个字节仅使用第1至第7位(共7bits)，第8位作为标识，表明(为了得到这个整型)是否需要读取下一个字节。

从下面的例子我们可以看出
整数1 的2进制表示为 "00000001", 仅用1个字节
整数130 的2进制表示为 "10000010 00000001" 仅用2个字节

此处输入图片的描述

由于使用vint压缩格式以后，每个字节仅有7个有效位，因此它的最大占用字节数为 Math.ceil(32/7) = 5 需要5个字节，表示任意一个int型占用的字节数是1至5，以平均2.5来计算, 相比原有的4字节表示方法，节约空间37.5%

3. 实现

3.1 编码

编码代码可参考
lucene 3.x
package:org.apache.lucene.store
DataOutput.java

  public final void writeVInt(int i) throws IOException {
    while ((i & ~0x7F) != 0) {
      writeByte((byte)((i & 0x7F) | 0x80));
      i >>>= 7;
    }
    writeByte((byte)i);
  }

Negative numbers are supported, but should be avoided.

注意:这里代码注释按时我们，应避免对负数使用vint压缩格式

3.2 解码

解码代码可参考
lucene 3.x
package:org.apache.lucene.store
DataOutput.java

public int readVInt() throws IOException {
    byte b = readByte();
    if (b >= 0) return b;
    int i = b & 0x7F;
    b = readByte();
    i |= (b & 0x7F) << 7;
    if (b >= 0) return i;
    b = readByte();
    i |= (b & 0x7F) << 14;
    if (b >= 0) return i;
    b = readByte();
    i |= (b & 0x7F) << 21;
    if (b >= 0) return i;
    b = readByte();
    i |= (b & 0x0F) << 28;
    if ((b & 0xF0) == 0) return i;
    throw new IOException("Invalid vInt detected (too many bits)");
  }