Variable byte coding is a universal coding method for integer values that is commonly used to compress posting lists into files.
Its main advantage over other coding methods is the very low coding/decoding cost, yet producing good compression results.
The coding function is pretty simple:
- the significant bits of the number to be coded are split in groups of seven bits (starting from the least significant bits, filling with zeros the last group if it contains less than seven bits);
- the first N-1 groups are written in a byte, setting the last continuation bit to one;
- the last group is written in a byte with the continuation bit set to zero.
It is left to the reader to derive the decoding function (proof by omission).
Examples (continuation bit is shown in bold):
For the above list of numbers the variable byte coding saves 6 bytes with respect to the fixed size four-bytes coding.
decimal = 32-bits binary = variable byte coding
120 = 00000000 00000000 00000000 01111000 = 01111000
1563 = 00000000 00000000 00000110 00011011 = 10011011 00001100
45248 = 00000000 00000000 10110000 11000000 = 11000000 11100001 00000010
1273065 = 00000000 00010011 01101100 11101001 = 11101001 11011001 01001101
2154789658 = 10000000 01101111 01111011 00011010 = 10011010 1111011 10111101 10000011 00001000
The last value requires more than four bytes, but in typical applications, e.g. saving gaps-valued postings lists, you can expect that the distribution of numbers you are coding is well shifted toward small values.
The VariableByteCoding class implements methods for coding and decoding a long value to a stream, there is also a method that returns just the number of bytes required to code a value, and a method that fits a coded value into a given number of bytes (which obviously has to be equal of greater to the number bytes required to code the value) by properly filling the remaining bytes.
Enough, it is time to let the code speak by itself:
// Base - The toolbox of the non-project-specific things.
// Copyright (C) 2007-2008 Andrea Esuli
//
// This program is free software; you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation; either version 2 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
//
// Andrea Esuli (a_n_d_r_e_a ( a_t ) e_s_u_l_i . i_t )
using System;
using System.IO;
namespace Esuli.Base.IO
{
public class VariableByteCoding
{
public static long Read(Stream stream)
{
long value = 0;
long byteCode = 0;
int shift = 0;
do
{
byteCode = stream.ReadByte();
value += (byteCode & 127) << shift;
shift += 7;
}
while (byteCode > 127);
return value;
}
public static void Write(long value, Stream stream)
{
do
{
byte byteCode = (byte) (value & 127);
value = value >> 7;
if (value != 0)
{
byteCode |= 128;
stream.WriteByte(byteCode);
}
else
stream.WriteByte(byteCode);
}
while (value != 0);
}
public static int ByteSize(long value)
{
int size = 0;
do
{
++size;
value = value >> 7;
}
while(value != 0);
return size;
}
public static void Write(long value, int byteSize, Stream stream)
{
int realByteSize = ByteSize(value);
if (realByteSize > byteSize)
throw new Exception("The given value requires more bytes than specified to be encoded");
else if (realByteSize == byteSize)
Write(value, stream);
else
{
do
{
byte byteCode = (byte)(value & 127);
value = value >> 7;
byteCode |= 128;
stream.WriteByte(byteCode);
--byteSize;
}
while (value != 0);
while (byteSize > 1)
{
stream.WriteByte(128);
--byteSize;
}
stream.WriteByte(0);
}
}
}
}