Learning Google Protocol Buffers with Python - Part 1

Posted on Oct 27, 2018 in Python Programming - Advanced Level by Amo Chen - 5 min read

The blog post "What I Learned from Quip on How to Build a Product on 8 Different Platforms with Only 13 Engineers" explains how Quip shipped its product on 8 different platforms with a team of only 13 engineers. It is well worth learning from.

A key concept from the post is "build once, use multiple times": avoid rebuilding the same component, and make components as reusable as possible. The post also reveals that Quip relies heavily on Google Protocol Buffers. Once a data structure is defined with Google Protocol Buffers, the code for reading and writing that structure can be generated automatically for many languages and platforms. The same definition also serves as a data exchange format between platforms, which cuts repeated development work and improves efficiency.

With such a handy tool, let’s learn Google Protocol Buffers using Python!

Environment for This Article

  • Python 3.6.5
  • Google Protocol Buffers 3.6.1
  • macOS 10.13.6

To install protobuf on macOS:

$ brew install protobuf

Three Steps of Google Protocol Buffers

Using Google Protocol Buffers is actually straightforward, requiring only three steps:

  1. Write a .proto file to define the data structures you need (also known as message types).
  2. Compile the .proto file using protoc to automatically generate code.
  3. Start using the code generated by protoc.

Writing a .proto File

There are currently two syntax versions for .proto files: proto2 and proto3. proto3 supports more programming languages, such as Go, Ruby, Objective-C, PHP, and C#, and it also defines a JSON mapping, so messages can easily be converted to and from JSON.

A .proto file therefore has to declare which syntax version it uses (this article uses proto3):

syntax = "proto3";

In addition to specifying the syntax version, you can also specify a package to avoid conflicts when message types have the same name:

package foo.bar;

However, the package declaration is ignored when a .proto file is compiled to Python, because Python modules are organized by their location in the file system. Changing the filename or path is usually enough to avoid name conflicts. If your application is written only in Python, you can therefore leave the package out. If several programming languages are involved, it is advisable to set one.

After specifying the syntax and package, you can formally define the necessary data structures, which Google Protocol Buffers refers to as message types.

Each message type begins with the message keyword, followed by the name of the message type, and the fields are defined within curly braces.

For example, let’s define a message type named User:

message User {
    int32 id = 1;      // user's id
    string name = 2;   /* nickname */
    string email = 3;
}

The structure above has three fields: id, name, and email. Their data types are int32, string, and string, respectively. The number at the end of each field is not a default value but a field number. Every field in a message type must have a field number that is unique within that message. The smallest field number is 1 and the largest is 536,870,911 (2 to the power of 29, minus 1), although a message rarely needs anywhere near that many fields. Numbers from 19,000 to 19,999 are reserved by Google Protocol Buffers and cannot be used.
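The field number rules above can be written down as a small check. This is only an illustrative sketch in plain Python: the helper name is_valid_field_number is made up for this article and is not part of protoc or the protobuf library, which enforce these rules themselves when compiling a .proto file.

```python
MAX_FIELD_NUMBER = 2**29 - 1          # 536,870,911
RESERVED_RANGE = range(19000, 20000)  # 19,000 through 19,999, reserved by Protocol Buffers


def is_valid_field_number(number: int) -> bool:
    """Return True if `number` may be used as a field number (hypothetical helper)."""
    if not 1 <= number <= MAX_FIELD_NUMBER:
        return False
    return number not in RESERVED_RANGE


# Print each candidate together with whether it is an allowed field number.
for candidate in (0, 1, 18999, 19000, 19999, 20000, MAX_FIELD_NUMBER, MAX_FIELD_NUMBER + 1):
    print(candidate, is_valid_field_number(candidate))
```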

If performance matters, it is recommended to reserve numbers 1–15 for the most frequently used fields: the field number is stored together with the field's wire type, and for numbers 1–15 this takes only 1 byte, while numbers 16 through 2,047 take 2 bytes.
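To see where the 1-byte limit comes from, here is a rough sketch of how a field's tag is laid out on the wire: the field number is shifted left by 3 bits, combined with a 3-bit wire type, and the result is stored as a varint (7 payload bits per byte). The function names are made up for illustration and are not the protobuf library's API.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a varint: 7 bits per byte, lowest bits first."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)


def tag_size(field_number: int, wire_type: int = 0) -> int:
    """Number of bytes taken by the tag (field number plus wire type) of one field."""
    return len(encode_varint((field_number << 3) | wire_type))


# Print the tag size in bytes for field numbers around the 1-byte and 2-byte limits.
for number in (1, 15, 16, 2047, 2048):
    print(number, tag_size(number))
```

Only 4 bits are left for the field number in a single tag byte, which is why 15 is the largest number that fits.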

The above example also demonstrates two ways to comment in Google Protocol Buffers:

  1. // comment
  2. /* comment */

At this point, we should have completed a .proto file for one message type (named user.proto in this article). The complete content is:

syntax = "proto3";

message User {
    int32 id = 1;      // user's id
    string name = 2;   /* nickname */
    string email = 3;
}

Compiling the .proto File with protoc

After completing the .proto file, compile it with the protoc command. To generate Python code, pass --python_out <destination directory> to choose the directory the generated files are written to:

$ mkdir protobufs # Create a directory to store the compiled Python code
$ protoc --python_out protobufs user.proto
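If you would rather run this step from a Python build script, the same command can be wrapped with subprocess. The sketch below is one possible way to do it; the function names build_protoc_command and compile_proto are made up for this article, and it assumes protoc is on your PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_protoc_command(proto_file: str, out_dir: str) -> list:
    """Return the protoc argument list for generating Python code (hypothetical helper)."""
    return ["protoc", f"--python_out={out_dir}", proto_file]


def compile_proto(proto_file: str, out_dir: str) -> bool:
    """Run protoc if it is installed; return False when protoc is not on PATH."""
    if shutil.which("protoc") is None:
        return False
    Path(out_dir).mkdir(parents=True, exist_ok=True)  # same as the mkdir step above
    subprocess.run(build_protoc_command(proto_file, out_dir), check=True)
    return True


# Show the command that would be executed for this article's example.
print(build_protoc_command("user.proto", "protobufs"))
```

Calling compile_proto("user.proto", "protobufs") should leave a user_pb2.py file in the protobufs directory, which is the module imported in the next section.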

Using the Code Generated by protoc

Once the Python code has been generated, you can import and use it:

$ python
>>> from protobufs.user_pb2 import User
>>> u = User()
>>> u.id = 1
>>> u.name = 'John'
>>> u.email = '[email protected]'

If you assign a value of the wrong type to a field, an error is raised. For example, giving id a string value:

>>> u.id = 'string'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'string' has type str, but expected one of: int, long

Setting a field that is not defined in the message type also results in an error:

>>> u.unknown_field = 'test'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: Assignment not allowed (no field "unknown_field" in protocol message object).

Once the fields are set, you can serialize the message into a binary byte string:

>>> output = u.SerializeToString()
>>> output
b'\x08\x01\x12\x04John\x1a\[email protected]'

One reason Google Protocol Buffers is efficient is that it serializes data into a compact binary format, which is also why it is often used together with Kafka in practice.
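To make the binary format less mysterious, the sketch below builds the bytes of a User message by hand in plain Python, without the protobuf library. It is a simplified illustration, not how the generated code is implemented: the function names are made up, it handles only non-negative ids, and the email address is a placeholder value chosen for this example.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a varint: 7 bits per byte, lowest bits first."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)


def encode_tag(field_number: int, wire_type: int) -> bytes:
    """A tag is the field number and the wire type packed into one varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_user(user_id: int, name: str, email: str) -> bytes:
    """Hand-encode the User message from user.proto (id = 1, name = 2, email = 3)."""
    out = bytearray()
    if user_id:  # proto3 leaves out fields that still hold their default value
        out += encode_tag(1, 0) + encode_varint(user_id)  # wire type 0: varint
    for field_number, text in ((2, name), (3, email)):
        if text:
            data = text.encode("utf-8")
            # wire type 2: length-delimited, so the byte length comes before the data
            out += encode_tag(field_number, 2) + encode_varint(len(data)) + data
    return bytes(out)


print(encode_user(1, "John", "john@example.com"))
# b'\x08\x01\x12\x04John\x1a\x10john@example.com'
```

Each field costs one tag byte plus its payload (and a length byte for strings), and no field names are stored, which is where the compactness comes from.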

To read data, use ParseFromString:

>>> user = User()
>>> user.ParseFromString(output)
24
>>> user.id
1
>>> user.name
'John'
>>> user.email
'[email protected]'

As the example shows, after creating an empty User() you call its ParseFromString method with the serialized bytes, and the fields of user are then populated from them. The 24 it returns is the number of bytes that were parsed.
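Reading works the same way in reverse. As a rough sketch of what a parser has to do, the plain-Python function below walks over serialized bytes and collects the raw values by field number. It is a simplified illustration with made-up function names: it handles only the two wire types that user.proto needs and knows nothing about field names or types, which in practice come from the generated code.

```python
def decode_varint(data: bytes, pos: int):
    """Read one varint starting at `pos`; return (value, position after the varint)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: this was the last byte
            return result, pos
        shift += 7


def decode_fields(data: bytes) -> dict:
    """Return {field_number: raw value} for varint and length-delimited fields."""
    fields = {}
    pos = 0
    while pos < len(data):
        tag, pos = decode_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:    # varint, e.g. int32
            value, pos = decode_varint(data, pos)
        elif wire_type == 2:  # length-delimited, e.g. string
            length, pos = decode_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} is not handled by this sketch")
        fields[field_number] = value
    return fields


print(decode_fields(b"\x08\x01\x12\x04John"))
# {1: 1, 2: b'John'}
```

The result only says that field 1 holds 1 and field 2 holds the bytes for "John". Mapping those numbers back to id and name is exactly the part that both sides get from sharing the same .proto file.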

Summary

That covers the most basic usage of Google Protocol Buffers.

It may look simple, but the payoff is real: once A has defined a .proto file, B can generate code from that same file, and both sides can then read and write the data exchanged through Google Protocol Buffers. This saves the cost of developing the same modules again for every application or platform and makes integration efficient.

In the next article, we will explain more important syntax and features in proto3.

References

https://developers.google.com/protocol-buffers/docs/proto3