Protocol Buffers: Encoding and Message Structure

Overview

How to read binary wire format for protocol buffer message: varint and zigzag encoding, wire type, key value pair structure, tips for better efficiency.

Goals

In this post, we try to lean and understand the binary wire format for protocol buffer messages, that is, how to understand what the protocol buffer message is by only reading and seeing the bytes of a protocol buffer object. I will give actual bytes of real protocol buffer message object and to explain how to understand those bytes. To achieve this, there following are the key concepts/techniques we need to learn and understand:

  • Varint: Endianness, Two’s complement, and Varint
  • ZigZag Encoding: Protocol Buffer’s more efficient way to encode negative values
  • Wire type
  • The key-value pair structure of a protocol buffer message object, how to encode the key using field number and wire_type

Base 128 Varints

Varints are a way to encode integer with variable-length: Smaller numbers take a smaller number of bytes.

1 0101100         ...        0 0000010
- --------        ...        - --------
msb  7-bit group             msb  7-bit group

So there are two components of a varint for each byte:

  1. MSB: Most Significant Bit.  If msb == 1, then it indicates that there are further bytes to come, the last byte’s msb is zero.
  2. 7-bit group: This is the 7-bit group component, that is, the lower 7 bits of each byte are used to store the two’s complement representation of the number in groups of 7 bits, and it is Little-endian format, i.e.,  Least Significant Group First.

So intuitively, when we see bytes of a varint, we should be able to find its decimal value by the following steps:

  1. Drop the msb from each byte: Do not be confused, drop does not mean the msb is useless, but the msb isquite  important to determine the length of a varint if multiple varints are concatenated together.
  2. Reverse groups of 7 bits because, varints store numbers with the least significant group first (little-endian).
  3. Concatenate the reversed 7-bit groups and this is the two’s complement of the decimal value it represents.
  4. Now it would be a bit tricky to deal with negative values, Protocol Buffer has two different encoding ways for negative values, for standard type like int32, it has to use all of its 10 bytes to treat the negative value as a large unsigned value, and for signed integers like sint32, it uses a way called ZigZag encoding which is the focus of next section. Here we follow the standard way.

Let’s first take a look at the positive value example like the following:

1010 1100 0000 0010

How do we figure out that this is actually equal to decimal value 300? So following the above steps, first we drop the msb from each byte,

1010 1100 0000 0010
→ 010 1100  000 0010

The step 2, we reverse the two groups of 7 bits, and then step 3 concatenate the bits and get the two’s complement, not the for this actual value part, the most significant bit is zero meaning it is a positive value, so we can directly calculate its decimal value as 300:

000 0010  010 1100
→  000 0010 ++ 010 1100
→  100101100
→  256 + 32 + 8 + 4 = 300

Negative Value is a bit more tricky: if we print out the byte array in Hex, we will always find the value is encoded as ten bytes long in the binary format no matter how bit the value is. For example, I have a message:

message Test {
  optional int32 a = 1;
}

And if I set the field a to be -6, and I print out the byte array, the output is:

08
fa
ff
ff
ff
ff
ff
ff
ff
ff
01

08 is the key which is the focus of following sections, but let’s focus on the actual data, the next ten bytes (Hex value). Why it is ten bytes? This is because it is hard to tell if the number is negative or not unless the 2′s complement number/value (the 7 bit groups) is sign extended to a multiple of 7 bits and ensure that the most significant bit is 1 for a negative number. More specifically, protocol buffer needs 10 bytes to encode regular/standard 64 bit integers (why? if 9 bytes, it only has 63 bits to encode the 2′s complement value, so we need one more byte the cover the complete 64 bits), and for a negative value, sign extend it to the most significant bit means we always get 10 bytes.

Now we can get back to the above ten bytes, and to see how we can get the decimal value -6. So first drop the msb, and we get

1111 1010 1111 1111 ... 1111 1111 0000 0001
→ 111 1010 111 1111 ... 111 1111 000 0001

then we reverse the 7-bit groups, and we know it is a negative value because it is ten bytes and the most significant bit (the msb is the bit 63 of the following groups) in the 7-bit groups is 1 (do not be confused with the msb of the original bytes)

111 1010 111 1111 ... 111 1111 000 0001
→ 000 0001 ++ 111 1111 ... ++ 111 1111 ++ 111 1010 (70 bits)
→ 1111... 1010
→ 1010 ^ 1111 + 0001 = 6

Note we apply an additional step to get the decimal value of a negative two’s complement one, so we know the absolute value of this negative value is 6, and we know the original value is -6.

The confusing part is bit 64 and above, they are all set to be zero (the reversed 7-bit groups has 70 bits in total and the most significant 6 bits are all zero), this is what protocol buffer does, but since it encodes 64 bit integers, we should look at the the bit 63 which is 1 and we know it is a negative number.

ZigZag Encoding

So as we already discussed above, there is a space efficiency concern about the varint for negative numbers. That is, when it comes to encoding negative numbers. If we use int32 or int64 as the type for a negative number, the resulting varint is always ten bytes long – it is, effectively, treated like a very large unsigned integer. Google protocol buffer provides a more efficient encoding called ZigZag encoding when we define the field types to be sint32 andsint64 instead of int32 and int64.

So for a number n, the ZigZag encoding is performed as:

(n << 1) ^ (n >> 31) if 32 bit and (n << 1) ^ (n >> 63) if 64 bit

  • (n « 1) is the logical shift (vacant bit positions are filled in with 0s)
  • (n » 32) is the arithmetic shift, so the difference is when the number is shift to right, arithmetic shift will fill in those vacant bit with the sign, i.e., 1 for negative, 0 for positive, so this is called signed shift as well

The above formula is a bit less intuitive, but I give another formula equivalently but not using bit operations, for a number n, ZigZag encode n to be unsigned as follows:

if n == 0:
	return 0
if n &gt; 0:
	return 2 * n 
if n &lt; 0:
	return 2 * |n| - 1

We can verify the correctness of the above formula by looking at examples:

By definition, ZigZag encoding maps signed integers to unsigned integers so that numbers with a small absolute value(for instance, -1) have a small varint encoded value too. It does this in a way that “zig-zags” back and forth through the positive and negative integers, so that -1 is encoded as 1, 1 is encoded as 2, -2 is encoded as 3, and so on, as we can see in the following table:

Signed Original Encoded As
-1 1
1 2
-2 3
2147483647 4294967294
-2147483648 4294967295

We can list a few more say -3 is mapping to 5 which is 2 * 3 – 1 while 3 is mapping to 6 which is 2 * 3 and -4 is mapping to 7 which is equal to 2 * 4 -1 and so on.

So tip 1: we should use signed integers like sint32 and sint64 if we know that a field could possibly be negative.

Now let’s take a look at the byte array of a real example and to see how we get back the original decimal value of that byte array. The message definition is changed to use sint32:

message Test {
  optional sint32 a = 1;
}

And if we have sint a = -6 we will get output of the byte array of this message object:

08
0b

08 is the key and is a combination of the field_number and wire_type which will be explained in the next section. So

0b

is the actual data. Now we have a question, by just looking at the bytes array data, we actually do not know what exactly value it is, we need to correlate it the by referencing the message type’s definition (i.e. the .proto file) to determine how to decode this byte:

  • If it is “standard” integers: int32 and int64, then it is a regular positive value
  • If it is signed integers: sint32 and sint64, then we need the further decode it by the formula above use 2* n or 2 * n -1

Note the above rule applies generally: say if we see a 10 bytes long binary format like the previous example, without referencing the message type definition, i.e., the .proto file, we will not be able to know whether it is a large positive value or it is actually a negative value with small absolute value, so eventually, to correctly parse the binary data, we need combined information to determine. This will be explained in details in the next section.

Now let’s get back to the example, we now know it is ZigZag encoding because it is signed integer in the .proto definition, so we first get

0b  =  11 decimal

and then we set up the formula:

2 * (-n) – 1 = 11,

so we get a conclusion that  n is  -6 which is exactly the decimal value we set in the code.

Wire Type and Key Value Pair Encoding

A protocol buffer message is a series of key-value pairs,  The binary version of a message just uses the field’s number and wire_type as the key – the wire type refers to the binary format data type and it only provides just enough information to find the length of the following value in the binary stream. The name and declared type for each field can only be determined on the decoding end by referencing the message type’s definition (i.e. the .proto file). 

  • Key: a varint with the value field_number « 3 wire_type
  • Value: this is the actual data. It could be ‘varint’, ‘fixed size, 32 bit or 64 bit’, ‘Length-delimited like string or embeded message and so on’

The available wire types are as follows:

Type  Meaning            Used For
0     Varint             int32, int64, uint32, uint64, sint32, sint64, bool, enum
1     64-bit             fixed64, sfixed64, double
2     Length-delimited   string, bytes, embedded messages, packed repeated fields
3     Start group        groups (deprecated)
4     End group          groups (deprecated)
5     32-bit             fixed32, sfixed32, float  

1) Let’s explains the key first. So we know there are 6 different wire_types available, and the group ones are deprecated. To encode this wire_type information, we need 3 bits because 3 bits are needed to encode number 5, this is why we first shift filed_number to left by 3 positions. Remember what we mentioned in previous post: Protocol Buffers: Project Structure, Message Format and Java API , for efficiency concern, it is better to reserve id number from 1 to 15 for those more frequently used field or repeated fields.** Why 1 to 15?** Now we can explain it: we use varint, for one byte, the msb is used to indicate whether there are coming bytes, the least significant 3 bits are used to encode the wire_type, so there are only 8 – 1 – 3 = 4 bits to encode the id_number within one byte, and the range is 1 to 15.

Now we can get back to explain the first byte of the examples we talked about:

08
0b

So far, in all the examples we use, the id_number is 1 and the type is integer, so the wire_type is 0, so for

08

We know the key is a varint too, and the highest bit zero indicate this is the only byte for the key, and we find the least 3 bits are 000 and the filed_number is 1, so we know the tag is 1. This is consistent with the .proto file definition.

Regarding with the enum type here, another tip is: enum seems to not work well with version compatibility, that is, if we use an enum in our protocol buffer message, then for later versions, if we want to either add or reduce the enum types, different version of protocol objects will find it difficult to understand each other. I think we have 2 options and I think the first option is better:

  • Do not use enum but use say string instead.
  • Make the first value in the enum something like UNKNOWN = 0. Then old programs reading a protobuf with an enum value they don’t recognize will see it as UNKNOWN and hopefully they can handle that reasonably, eg by skipping that element. This is less desirable because to determine the reasonable behavior when seeing UNKNOWN is difficult right? How to deal with it if it is UNKNOWN? In most cases, we probably need to know what exactly the data is when we see it.

2) Then let’s talk about the value of a message object. From the above discussion, we already know how to parse the key, after that, we should know how to determine the length of the value and then read the data properly, so the table indicates, there are two categories of value formats and let’s see how we parse them:

  1. varint: we already have a lot of discussion about it, see previous sections
  2. non-varint: this includes fixed size data (i.e. 64-bit and 32 bit) and length delimited (e.g., string, embedded message)

We take a closer look at non-varint here:

2.1) Non-varint Numbers

Non-varint numeric types are simple – double and fixed64 have wire type 1, which tells the parser to expect a fixed 64-bit lump of data; similarly float and fixed32 have wire type 5, which tells it to expect 32 bits. In both cases the values are stored in** little-endian byte order: **this is similar to the least significant groups first:

Little-Endian

2.2) Strings

A wire type of 2 (length-delimited) means that the value is a varint encoded length followed by the specified number of bytes of data.

message Test2 {
  required string b = 2;
}

Setting the value of b to “testing” gives us:

12 07 74 65 73 74 69 6e 67

The red bytes are the UTF8 of “testing”. The key here is 0×12 → tag = 2, type = 2. The length varint in the value is 7 and lo and behold, we find seven bytes following it – our string.

2.3) Embedded Messages

Here’s a message definition with an embedded message of our example type, Test1:

message Test3 {
  required Test1 c = 3;
}

And here’s the encoded version, again with the Test1′s a field set to 150:

1a 03 08 96 01

As we can see, the last three bytes are exactly the same as our first example (08 96 01), and they’re preceded by the number 3 – embedded messages are treated in exactly the same way as strings (wire type = 2).

One last thing to end this section,** how do we know the following bytes are actually strings or embedded message**: I haven’t checked out this detail yet, but as we mentioned, the declared type for each field and how decoding could be done can only be determined by referencing the message type’s definition of the .proto file, so in a different stage of the protocol buffer parsing, there must be a way to determine whether we recursively parse an embedded message or we directly read it as a string.

Version Compatibility

More specifically, the version compatibility problem is: When a message is encoded, the keys and values are concatenated into a byte stream. When the message is being decoded, the parser needs to be able to skip fields that it doesn’t recognize. This way, new fields can be added to a message without breaking old programs that do not know about them.

They way that protocol buffer message is encoded as a series of key-value pairs naturally resolves this problem: Because we have the key and the id_number, we can easily know which field should be set or not set without worrying about which version of the protocol buffer object, all we need to know is the id_number and thus one good tip to keep in mind this that never reuse the id_number:

Never reuse the tag numbers of fields within a message, after a release. And don’t even reuse a tag number of a deleted field for say a newly added field.

So far, I haven’t checked out the following topics, but I ll leave them as TODO:

    • Optional And Repeated Elements
    • Packed Repeated Fields
    • Field Order
  • proto2 VS proto3

Reference

https://developers.google.com/protocol-buffers/docs/encoding

Summary

In this poset, we discussed how to read binary wire format for protocol buffer message: varint and zigzag encoding, wire type, key value pair structure, tips for better efficiency.

Written on November 1, 2015