Learning Google Protocol Buffers with Python - Part 1
Posted on Oct 27, 2018 in Python Programming - Advanced Level by Amo Chen ‐ 5 min read
This article is part of a series:
- Learning Google Protocol Buffers with Python - Part 1
- Learning Google Protocol Buffers with Python - Part 2
- Learning Google Protocol Buffers with Python - Part 3
The blog post What I Learned from Quip on How to Build a Product on 8 Different Platforms with Only 13 Engineers explains how Quip managed to build products for 8 different platforms with only a 13-member team. It’s definitely something worth learning from.
A key concept from the post is "build once, use multiple times": avoid building the same component repeatedly and make components more reusable. The article also reveals that Quip relies heavily on Google Protocol Buffers. Once a data structure is defined with Google Protocol Buffers, code for reading and writing that structure can be generated automatically for many languages and platforms. The same definition can also serve as a data exchange format between platforms, which cuts repeated development work and improves development efficiency.
With such a handy tool, let’s learn Google Protocol Buffers using Python!
Environment for This Article
- Python 3.6.5
- Google Protocol Buffers 3.6.1
- macOS 10.13.6
To install protobuf on macOS:
$ brew install protobuf
Three Steps of Google Protocol Buffers
Using Google Protocol Buffers is actually straightforward, requiring only three steps:
- Write a .proto file to define the data structures you need (also known as message types).
- Compile the .proto file using protoc to automatically generate code.
- Start using the code generated by protoc.
Writing a .proto File
Currently, there are two syntax versions for writing a .proto file: proto2 and proto3. proto3 supports more programming languages, such as Go, Ruby, Objective-C, PHP, and C#, and it also defines a JSON mapping, so messages can easily be converted to and from JSON.
Hence, a .proto file should state which syntax version it uses; if the declaration is omitted, protoc assumes proto2. We will use proto3 for this demo:
syntax = "proto3";
In addition to specifying the syntax version, you can also specify a package to avoid conflicts when message types have the same name:
package foo.bar;
However, the package declaration is ignored when compiling .proto files into Python, because Python modules correspond to their path in the file system. Changing the filename or path can usually avoid name conflicts. Therefore, if you’re building applications solely with Python, you can leave out the package. But if multiple programming languages are involved, it is advisable to set a package.
After specifying the syntax and package, you can define the data structures you need, which Google Protocol Buffers refers to as message types.
Each message type begins with the message keyword, followed by the name of the message type, and the fields are defined within curly braces.
For example, let’s define a message type named User:
message User {
  int32 id = 1; // user's id
  string name = 2; /* nickname */
  string email = 3;
}
In the structure above, there are three fields: id, name, and email. Their data types are int32, string, and string, respectively. The number at the end of each field is not a default value but a field number. Every field in a message type must have a field number, and it must be unique within that message. Field numbers start at 1 and go up to 536,870,911 (2^29 - 1), although a message rarely needs anywhere near that many fields. Numbers 19,000 through 19,999 are reserved by Google Protocol Buffers and cannot be used.
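The field number rules above can be sketched as a small check. This is only an illustration: is_valid_field_number is a hypothetical helper written for this article, not part of the protobuf library, and protoc performs its own validation when compiling a .proto file.

```python
# Limits taken from the rules described in the text above.
MIN_FIELD_NUMBER = 1
MAX_FIELD_NUMBER = 2**29 - 1          # 536,870,911
RESERVED_RANGE = range(19000, 20000)  # 19,000 through 19,999


def is_valid_field_number(number: int) -> bool:
    """Return True if `number` may be used as a field number.

    Hypothetical helper for illustration only.
    """
    if not MIN_FIELD_NUMBER <= number <= MAX_FIELD_NUMBER:
        return False
    return number not in RESERVED_RANGE


if __name__ == "__main__":
    for candidate in (0, 1, 15, 18999, 19000, 19999, 20000, 2**29 - 1, 2**29):
        print(candidate, is_valid_field_number(candidate))
```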
It’s worth mentioning that if performance matters, you should reserve field numbers 1–15 for the most frequently used fields. A field's number and wire type are encoded together in front of every value, and for numbers 1–15 that encoding takes only 1 byte, while numbers 16–2047 take 2 bytes.
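To see where the 1-byte limit comes from, here is a minimal sketch using only the Python standard library. It assumes the documented wire format, in which each field is preceded by a tag equal to (field_number << 3) | wire_type, stored as a base-128 varint. The function names encode_varint and tag_size are made up for this illustration.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low_bits)         # last byte: high bit clear
            return bytes(out)


def tag_size(field_number: int, wire_type: int = 0) -> int:
    """Number of bytes used by a field's tag (field number plus wire type)."""
    return len(encode_varint((field_number << 3) | wire_type))


if __name__ == "__main__":
    for number in (1, 15, 16, 2047, 2048):
        print(number, tag_size(number))
    # 1 1
    # 15 1
    # 16 2
    # 2047 2
    # 2048 3
```

Three bits of the tag go to the wire type, so only four bits of the first byte are left for the field number, which is why 15 is the largest number that fits in a single byte.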
The above example also demonstrates two ways to comment in Google Protocol Buffers:
// comment
/* comment */
At this point, we should have completed a .proto file for one message type (named user.proto in this article). The complete content is:
syntax = "proto3";
message User {
  int32 id = 1; // user's id
  string name = 2; /* nickname */
  string email = 3;
}
Compiling the .proto File with protoc
After completing the .proto file, use the protoc command to compile it. To generate code that can be used in Python, pass --python_out <destination directory> to tell protoc where to store the generated code:
$ mkdir protobufs # Create a directory to store the compiled Python code
$ protoc --python_out protobufs user.proto
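If you would rather run this step from Python (for example in a build script), the two commands above can be wrapped as in the sketch below. It assumes protoc is installed and on your PATH; build_protoc_command and compile_proto are hypothetical helper names, and the command line is exactly the one shown above.

```python
import shutil
import subprocess
from pathlib import Path


def build_protoc_command(proto_file: str, out_dir: str) -> list:
    """Build the same command line used above: protoc --python_out <dir> <file>."""
    return ["protoc", "--python_out", out_dir, proto_file]


def compile_proto(proto_file: str, out_dir: str) -> None:
    """Create the output directory and run protoc (hypothetical helper)."""
    if shutil.which("protoc") is None:
        raise RuntimeError("protoc was not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if protoc reports an error.
    subprocess.run(build_protoc_command(proto_file, out_dir), check=True)


if __name__ == "__main__":
    if shutil.which("protoc") and Path("user.proto").exists():
        compile_proto("user.proto", "protobufs")
    else:
        print("protoc or user.proto not available; nothing compiled")
```

For user.proto, protoc writes the generated module to protobufs/user_pb2.py.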
Using the Code Generated by protoc
Once the Python code is generated, you can import and use it:
$ python
>>> from protobufs.user_pb2 import User
>>> u = User()
>>> u.id = 1
>>> u.name = 'John'
>>> u.email = '[email protected]'
If you assign a value of the wrong type to a field, an error is raised. For example, giving id a string value:
>>> u.id = 'string'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'string' has type str, but expected one of: int, long
Trying to set a property not defined within the message type will also result in an error:
>>> u.unknown_field = 'test'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: Assignment not allowed (no field "unknown_field" in protocol message object).
Once the values are set, you can serialize the message to a byte string:
>>> output = u.SerializeToString()
>>> output
b'\x08\x01\x12\x04John\x1a\[email protected]'
One of the reasons Google Protocol Buffers is efficient is that it converts data into a compact binary format. For that reason, it is often used together with Kafka in practice.
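To make the binary format less mysterious, the sketch below rebuilds the first two fields of that output (id and name) by hand, using only the standard library. It is a simplified illustration of the documented wire format, not how the generated code is implemented: it handles only non-negative integers and strings, and the helper names are made up for this article.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)


def encode_int32_field(field_number: int, value: int) -> bytes:
    """Tag with wire type 0 (varint), followed by the value."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Tag with wire type 2 (length-delimited), then length, then UTF-8 bytes."""
    data = text.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(data)) + data


if __name__ == "__main__":
    payload = encode_int32_field(1, 1) + encode_string_field(2, "John")
    print(payload)  # b'\x08\x01\x12\x04John'
```

The bytes \x08 and \x12 are the tags for fields 1 and 2, \x01 is the id, and \x04 is the length of "John". Field names never appear in the output, which is part of why the format is compact.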
To read data, use ParseFromString:
>>> user = User()
>>> user.ParseFromString(output)
24
>>> user.id
1
>>> user.name
'John'
>>> user.email
'[email protected]'
From the example above, you can see that after instantiating User(), you can read a serialized message with the ParseFromString method, which sets the values on user.
Summary
The above is the simplest possible tutorial for Google Protocol Buffers.
It looks simple, but the payoff is real: once A defines a .proto file, B can generate code from the same file, so A and B can both read and write data exchanged via Google Protocol Buffers. This saves the cost of repeatedly developing the same modules across different applications and platforms and makes integration efficient.
In the next article, we will explain more important syntax and features in proto3.
References
https://developers.google.com/protocol-buffers/docs/proto3