Protocol Buffers: Project Structure, Message Format and Java API

Overview

This post covers a typical Java Maven project structure with .proto files, how to define protocol buffer messages, and the Java API for writing and reading protocol buffer objects.

Recap Of Protocol Buffer Basics

Protocol buffers are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler.

Serializing (writing) and retrieving (reading) structured data is what protocol buffers 2 (proto2) is mainly about. Protocol buffers were invented at Google because the traditional approaches all have problems:

  • Java built-in serialization: comes with a host of well-known problems, described in Effective Java, and does not work well with C++ or Python
  • Ad-hoc encoding: e.g., encoding 4 ints as “12:3:-23:67”; works best for simple data, but is not suitable once the structured data becomes more complicated
  • Serializing to XML: attractive because a) XML is human readable and b) many XML parsing APIs already exist; however, XML is heavy, costing both space and performance

In comparison, protocol buffer exhibits the following advantages:

  • Lightweight
  • Structured and less error prone
  • Compatibility is designed in, so the same code can work with both old and new message formats

Note also that the official proto2 release supports only three languages: C++, Java and Python (proto3 supports more). You can see this in the protoc help output, if you followed my previous post, Protocol Buffers: Introduction and Installation, and installed protocol buffers on your system:

protoc -h
  ...
  --cpp_out=OUT_DIR           Generate C++ header and source.
  --java_out=OUT_DIR          Generate Java source file.
  --python_out=OUT_DIR        Generate Python source file

Third-party add-ons support other languages such as PHP and Ruby; see Third-Party Add-ons for Protocol Buffers for further reference. This post covers only the Java API; for other languages, please refer to their documentation. How messages are defined in the .proto file is the same regardless of language.

Project Structure and Protoc Command Line

I think the following Java Maven project structure is a good one to follow when integrating protocol buffers into an application:

pb-app
|-- pom.xml
|-- src
|   |-- main
|   |   |-- java
|   |   |   `-- com
|   |   |       `-- mycompany
|   |   |           `-- app
|   |   |               `-- App.java
|   |   `-- proto
|   |       `-- *.proto
|   `-- test
|       `-- java
|           `-- com
|               `-- mycompany
|                   `-- app
|                       `-- AppTest.java
`-- target

Follow these steps to set up a Java protocol buffer project using Maven and Eclipse:

  1. Either create a Maven project directly in Eclipse, or create it with the mvn command line and then import it into Eclipse. The project should have the same structure as above, except for the proto folder.
  2. Create a folder called ‘proto’ under src/main, that is, at the same level as the java folder, and put all the .proto files in it.
  3. Edit pom.xml to add the protocol buffer API (the com.google.protobuf:protobuf-java artifact) as a Maven dependency, so that the generated Java code can compile.
  4. Define your messages in the .proto files and use the protocol buffer compiler, protoc, to generate the Java source code. On the command line you specify the output folder, which should be the right place in the project structure. Your application code can then use the generated protocol buffer classes, and the project should compile and build successfully.

I will cover the command line in this section, and how to define messages and use the Java API in the next two sections.

A typical command line syntax would be the following:

protoc --java_out=$DST_DIR $SRC_DIR/*.proto

Only two things have to be specified on the command line. The first is --java_out, which says both where the generated code should go and, through its name, which language to generate; for C++ it would be --cpp_out. The second is the path to our .proto files. Another option we will probably use is the following:

-IPATH, --proto_path=PATH
           Specify the directory in which to search for imports. May be
           specified multiple times; directories will be searched in order. If
           not given, the current working directory is used.

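The -I option specifies the folders in which to look for imports. We use import (e.g., import "myproject/other_protos.proto";) when we want to use a field type that is already defined in another .proto file.

Also note that protoc is not clever about telling absolute and relative paths apart: each input .proto file has to sit under one of the --proto_path directories, and protoc compares the paths as plain text prefixes, so it does not recognise an absolute path and a relative path that point to the same place. Use one style consistently for both -I and the input files when generating the source code.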

Protocol Buffer Message Syntax (.proto file)

This section is about writing a correct .proto file that can be compiled using protoc. There are several keywords we can use:

  • package: avoids name collisions among the .proto files (NOT among the generated source files such as the Java files)
  • option: defines options, such as the Java outer class name
  • message: the core concept, defines a protocol buffer message
  • import: imports other .proto files; not covered here, check the official docs for help
  • service: defines an RPC service. Proto2 does not ship with an RPC network layer, so we have to implement the RPC framework ourselves or use gRPC. gRPC is recommended with proto3 rather than proto2, so proto2 still serves purely for serialization. This will not be covered either.

I will talk about the package and option keywords first and then give the details of how to define a message.

    // the package keyword:
    //
    // 1) always recommended, to avoid name collisions in the
    //    Protocol Buffers name space, i.e., among different .proto files
    // 2) used as the Java package name if no explicit java_package is defined
    package tutorial;


    // the option keyword:
    //
    // java_package sets the Java package name; if neither java_package nor
    // the package above is defined, the generated Java class file has no
    // package declaration
    option java_package = "com.example.tutorial";
    // java_outer_classname names the class that contains all of the classes
    // generated from this file; if not defined, the file name converted to
    // camel case is used as the outer class name
    option java_outer_classname = "AddressBookProtos";

Next, let’s see how to define a protocol buffer message. We need to follow this syntax:

// message syntax
message MessageName {
    modifier type var_name = id_marker [default = default_value];
}
// a concrete field (hypothetical name), for example:
//     optional int32 page_size = 2 [default = 10];

So a message consists of multiple fields, and each field has a modifier, a type, a name, an id marker and, optionally, a default value. These are:

1) modifier:
There are only three modifier values we can use:

  • required: a required field has to be set before the object can be used; otherwise building or parsing the message fails with an exception. However, nothing is forever, so be careful about marking a field as required, since it can cause compatibility issues later
  • optional: an optional field may be left unset; if we access a field that is not set, the default value of that field is returned
  • repeated: this kind of field works like a dynamic array (java.util.List). If it is not set, the Java getter returns an empty List, never null; in fact the generated Java accessors never return null for any field

2) type:
Many standard simple data types are available as field types, including bool, int32, float, double, and string; check the docs for the complete list. A field type can be either a simple data type like int32 or another message type.

Also note: say we define messages A and B in a test.proto file, and we define another message type AInner inside the definition of message A. Then outside A we have to write A.AInner to refer to AInner. For example, in the addressbook.proto file, if we want to use the PhoneNumber type in the AddressBook message, we need to write Person.PhoneNumber; if we just write PhoneNumber, we get a “PhoneNumber” is not defined error when trying to compile. From this we can also see that there is no public/private concept in .proto files.

3) var_name:
We should always use lowercase-with-underscores for field names in our .proto files; this ensures good naming practice in all the generated languages. We do not have to worry about the Java code: the names are automatically converted to camel case there.
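
To make the naming rule concrete, here is a minimal sketch of the conversion from a .proto field name to a Java accessor name. The class and method names are my own, and the rule is simplified (it only handles lowercase letters and underscores); it is not the code protoc itself runs.

```java
// Simplified sketch of the snake_case to camel-case mapping used for Java
// accessor names. FieldNameSketch is a hypothetical helper, not protoc code.
final class FieldNameSketch {

    // Capitalise the first letter and every letter that follows an underscore,
    // dropping the underscores themselves.
    static String toCamelCase(String fieldName) {
        StringBuilder sb = new StringBuilder();
        boolean upperNext = true;
        for (char c : fieldName.toCharArray()) {
            if (c == '_') {
                upperNext = true;
            } else if (upperNext) {
                sb.append(Character.toUpperCase(c));
                upperNext = false;
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Name of the getter we would expect for a singular field.
    static String getterName(String fieldName) {
        return "get" + toCamelCase(fieldName);
    }
}
```

With this sketch, a field declared as phone_number maps to getPhoneNumber, which matches the accessor style of the generated Java classes.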

4) id marker:
The id marker identifies the unique ‘tag’ this field uses in its binary encoding. There are some rules to keep in mind, and an optimization we can make, when choosing ids:

  • The smallest id is 1 and the largest is 2^29 - 1 (536,870,911). The ids 19000 through 19999 are reserved for the protocol buffer implementation and cannot be used as custom id markers
  • The id tag should not be changed once the message is in use. Tags from 1 to 15 take one byte to encode, including the field number and the field's wire type
  • Tags in the range 16 through 2047 take two bytes. As an optimization, reserve tags 1 through 15 for frequently occurring message elements, including ones we may add in the future. The reason is that the tag is written out every time a field occurs in an encoded message, so the fields that occur most often benefit most from a one-byte tag. Repeated fields are particularly good candidates, since every element re-encodes the tag.
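
The byte counts above follow from how the field key is encoded: the field number is shifted left by three bits, combined with a three-bit wire type, and written as a base-128 varint. The sketch below only reproduces that arithmetic so the 1 to 15 and 16 to 2047 boundaries can be checked; the class and method names are hypothetical and not part of the protobuf API.

```java
// Arithmetic sketch of field-number limits and key sizes.
// FieldTagSketch is a hypothetical helper, not part of the protobuf library.
final class FieldTagSketch {

    static final int MAX_FIELD_NUMBER = (1 << 29) - 1; // 536,870,911

    // True if n is inside the legal range and outside the reserved 19000-19999 block.
    static boolean isUsableFieldNumber(int n) {
        return n >= 1 && n <= MAX_FIELD_NUMBER && (n < 19000 || n > 19999);
    }

    // Number of bytes needed for the key (fieldNumber << 3 | wireType)
    // when written as a varint with 7 payload bits per byte.
    static int keySize(int fieldNumber, int wireType) {
        long key = ((long) fieldNumber << 3) | wireType;
        int bytes = 1;
        while (key >= 0x80) {
            key >>>= 7;
            bytes++;
        }
        return bytes;
    }
}
```

For any wire type from 0 to 7, keySize returns 1 for field numbers 1 through 15, 2 for 16 through 2047, and 3 starting at 2048, which is why the low numbers are worth saving for the busiest fields.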

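5) Default values: an optional field can specify a default value with [default = value]. If none is specified, a type-specific default is used: zero for numeric types, the empty string for strings, and false for bools.

So that is basically everything we need to know about how to write the .proto file. One last thing to remember: protocol buffers do not support class inheritance, so do not try to express anything like a class hierarchy in the .proto file.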

Protocol Buffer Java API

After defining messages in the .proto file, we can generate the Java source code. Since proto2 is designed just to serialize and retrieve structured data, the Java API is also mainly about how to read and write the defined protocol buffer objects.

1) The generated Java source code typically contains:

  • Builder: we should always use the builder to create an instance of a protocol buffer object
  • For each singular field, the generated class has hasField() and getField(), where Field stands for the camel-case field name. A repeated field has no hasField() method; instead, getFieldCount() tells us whether the list is empty
  • For a repeated field: getFieldList() returns a List, getFieldCount() returns the number of elements, and getField(int index) returns the element at the given index
  • Setters live in the Builder class and follow simple JavaBean style; use clearField() to unset a field
  • The generated Java message objects are immutable once constructed, just like a Java String
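
To show how these pieces fit together without needing protoc or the protobuf jar, here is a hand-written stand-in that mimics the shape of a generated message: an immutable object, a Builder with the setters, has/get accessors, and list-style accessors for a repeated field. Everything in it (PersonSketch, its fields, the IllegalStateException) is an assumption made for illustration; real generated classes are produced by protoc and extend protobuf base classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

// Hand-written stand-in that mimics the shape of a protoc-generated message.
// Assumption: this is NOT protoc output, only an illustration of the API style.
final class PersonSketch {
    private final String name;         // stands in for a required string field
    private final String email;        // stands in for an optional string field (null = unset)
    private final List<String> phones; // stands in for a repeated string field

    private PersonSketch(Builder b) {
        this.name = b.name;
        this.email = b.email;
        // Copy and wrap, so the built object can no longer be changed.
        this.phones = Collections.unmodifiableList(new ArrayList<>(b.phones));
    }

    static Builder newBuilder() { return new Builder(); }

    String getName() { return name; }
    boolean hasEmail() { return email != null; }
    // An unset optional field reads as its default value, never as null.
    String getEmail() { return email == null ? "" : email; }
    int getPhoneCount() { return phones.size(); }
    String getPhone(int index) { return phones.get(index); }
    List<String> getPhoneList() { return phones; }

    static final class Builder {
        private String name;
        private String email;
        private final List<String> phones = new ArrayList<>();

        // Null values are rejected, so getters never have null to return.
        Builder setName(String value) { name = Objects.requireNonNull(value); return this; }
        Builder setEmail(String value) { email = Objects.requireNonNull(value); return this; }
        Builder clearEmail() { email = null; return this; }
        Builder addPhone(String value) { phones.add(Objects.requireNonNull(value)); return this; }

        PersonSketch build() {
            // The real build() reports a missing required field with its own
            // exception type; this sketch uses IllegalStateException instead.
            if (name == null) {
                throw new IllegalStateException("required field 'name' is not set");
            }
            return new PersonSketch(this);
        }
    }
}
```

Usage follows the same chain one would write against a generated class, for example PersonSketch.newBuilder().setName("Ann").addPhone("555-0100").build(); afterwards hasEmail() is false and getEmail() returns the empty default.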

2) There are some Standard Message Methods: Each message and builder class also contains a number of other methods that let you check or manipulate the entire message, including:

  • isInitialized(): checks if all the required fields have been set.
  • toString(): returns a human-readable representation of the message, particularly useful for debugging.
  • mergeFrom(Message other): (builder only) merges the contents of other into this message, overwriting singular fields and concatenating repeated ones.
  • clear(): (builder only) clears all the fields back to the empty state.

These methods implement the Message and Message.Builder interfaces shared by all Java messages and builders. For more information, see the complete API documentation for Message.

3) The following are the APIs for parsing and serialization: each protocol buffer class has methods for writing and reading messages of your chosen type using the protocol buffer binary format. These include:

  • byte[] toByteArray();: serializes the message and returns a byte array containing its raw bytes.
  • static Person parseFrom(byte[] data);: parses a message from the given byte array.
  • void writeTo(OutputStream output);: serializes the message and writes it to an OutputStream.
  • static Person parseFrom(InputStream input);: reads and parses a message from an InputStream.

These are just a couple of the options provided for parsing and serialization. Again, see the Message API reference for a complete list.

4) How to add other rich features to the proto classes:

  • Protocol buffer classes are basically dumb data holders (like structs in C++).
  • We should use the decorator (wrapper) pattern: wrap the generated protocol buffer class in an application-specific class.
  • We should never add behaviour to the generated classes by inheriting from them. This will break internal mechanisms and is not good object-oriented practice anyway.
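
A minimal sketch of the wrapping idea, using composition instead of inheritance. PersonMessage below is a hypothetical stand-in for a generated class so that the example compiles on its own; in a real project the wrapper would hold the class that protoc generated. Contact and displayLabel() are likewise names invented for this illustration.

```java
// Hypothetical stand-in for a protoc-generated message class (plain data only).
final class PersonMessage {
    private final String name;
    private final String email;

    PersonMessage(String name, String email) {
        this.name = name;
        this.email = email;
    }

    String getName() { return name; }
    String getEmail() { return email; }
}

// Application-specific wrapper: it holds the message and adds behaviour
// next to it, without subclassing the generated type.
final class Contact {
    private final PersonMessage message;

    Contact(PersonMessage message) {
        this.message = message;
    }

    // Hand the underlying message back when it has to be serialized.
    PersonMessage toMessage() { return message; }

    // Richer behaviour that does not belong in a dumb data holder.
    String displayLabel() {
        if (message.getEmail().isEmpty()) {
            return message.getName();
        }
        return message.getName() + " <" + message.getEmail() + ">";
    }
}
```

The application works with Contact everywhere and only unwraps it with toMessage() at the serialization boundary, so regenerating the protocol buffer code never touches the added behaviour.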

5) Compatibility issue, Extending a Protocol Buffer

For new buffers to be backwards-compatible, and old buffers to be forward-compatible, the following rules have to be followed:

  • we must not change the tag numbers of any existing fields.
  • we must not add or delete any required fields.
  • we may delete optional or repeated fields.
  • we may add new optional or repeated fields, but we must use fresh tag numbers (i.e. tag numbers that were never used in this protocol buffer, not even by deleted fields).

What this looks like after the change:

  • To the old code, optional fields that were deleted will simply have their default value, and deleted repeated fields will be empty.
  • New code will also transparently read old messages.
  • New optional fields will not be present in old messages, so we need to either check explicitly whether they are set with the hasField() method, or provide a reasonable default value in our .proto file with [default = value] after the tag number.
  • Note also that if we add a new repeated field, our new code will not be able to tell whether it was left empty (by new code) or never set at all (by old code), since there is no hasField() method for repeated fields.

TODO

Still need to research more about the following topics:

  • Advanced usage of Reflection:
  • Protocol Buffer Encoding
  • Code style guide
  • Define service in proto2 like: service FooService {rpc GetSomething(FooRequest) returns (FooResponse); }

Reference:

https://developers.google.com/protocol-buffers/docs/javatutorial

Summary

This post covered a typical Java Maven project structure with .proto files, how to define protocol buffer messages, and the Java API for writing and reading protocol buffer objects.

Written on October 19, 2015