Python Module Tutorial - dataclasses

Posted on  Feb 1, 2023  in  Python Programming - Intermediate Level  by  Amo Chen  ‐ 8 min read

Python’s dataclasses module, added in Python 3.7, is mainly used to define structured data in the form of classes.

The dataclasses module provides some convenient features to help automatically generate commonly used class methods such as __init__, __repr__, __eq__, etc., saving developers time in writing repetitive code.

Using dataclasses can make Python programs more concise and improve code readability.

Are you ready to show off your skills using dataclasses in your Python code?

Requirements

  • Python 3.10

Why do we need data classes?

We often represent data that has a fixed structure as classes. This improves readability and keeps the related operations together in the class, which makes the code easier to maintain.

A class whose main job is to store structured data is called a data class. You can think of it simply as a container for data.

For example, a plain Python class that represents a coordinate point may look like this:

class Point(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

As the example shows, the __init__ method does nothing more than assign each argument to an attribute of the instance. As the number of attributes or data classes grows, we end up repeating the same kind of code over and over.

class Point(object):
    def __init__(self, x, y, z, ...):
        self.x = x
        self.y = y
        self.z = z
        ...

In this case, you can try the Python dataclasses module. Apply the dataclass decorator and declare the class's fields with Python type annotations. For example, the previous Point class can be simplified as follows:

from dataclasses import dataclass


@dataclass
class Point:
    x: int
    y: int

Does it look much simpler after switching to using dataclass?

The usage is no different from using a normal class:

p = Point(x=1, y=2)
# is equivalent to
p = Point(1, 2)

What does the Python dataclass decorator do?

Consider the following class, which does not use dataclass:

class Point(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

Without dataclass, simply printing an instance of the class:

p1 = Point(1, 2)
print(p1)

gives the following result, which shows only the class name and the memory address of the instance:

<__main__.Point object at 0xffff8ffcca90>

In most cases this information is of little help when debugging. For data types, we care far more about the values stored than about the memory address.

To make the printed information clearer, you must implement the __repr__ method, for example:

class Point(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y
        
    def __repr__(self):
        return f'Point(x={self.x}, y={self.y})'

Now, printing an instance produces more meaningful output:

Point(x=1, y=2)

However, this also means that whenever we add an attribute, we have to modify both __init__ and __repr__, which is more work than a lazy programmer wants.

By using dataclass, __init__, __repr__ and __eq__ are generated by default, so developers just need to focus on defining attributes and data types. Therefore, there is no need to implement __init__ and __repr__:

@dataclass
class Point:
    x: int
    y: int


p1 = Point(1, 2)
print(p1)

The printout result is:

Point(x=1, y=2)
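The generated __eq__ is easy to overlook because nothing is printed for it. The following minimal sketch compares a hand-written class with a dataclass; PlainPoint is a hypothetical name used only for this illustration:

```python
from dataclasses import dataclass


class PlainPoint:
    """Hand-written class without __eq__ (hypothetical, for comparison only)."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


@dataclass
class Point:
    x: int
    y: int


# Without __eq__, two instances are equal only if they are the same object.
plain_equal = PlainPoint(1, 2) == PlainPoint(1, 2)

# The generated __eq__ compares the fields of two instances of the same class.
data_equal = Point(1, 2) == Point(1, 2)

print(plain_equal)  # False
print(data_equal)   # True
```

With a plain class you would have to write __eq__ yourself to get the second behavior.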

Are you feeling pleased with using dataclass now?

Setting default values for a dataclass

Setting default values for a dataclass is also very simple: assign the value in the class body, just like defining a class-level attribute.

The following example sets the default value of both the x and y coordinates of Point to 0:

@dataclass
class Point:
    x: int = 0
    y: int = 0
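As a small usage sketch, the defaults become default arguments of the generated __init__, so any field can be omitted or passed by keyword. The variable names below are chosen only for this illustration:

```python
from dataclasses import dataclass


@dataclass
class Point:
    x: int = 0
    y: int = 0


origin = Point()        # both defaults are used
on_x_axis = Point(5)    # y falls back to its default
on_y_axis = Point(y=7)  # x falls back to its default

print(origin)     # Point(x=0, y=0)
print(on_x_axis)  # Point(x=5, y=0)
print(on_y_axis)  # Point(x=0, y=7)
```

As with ordinary function parameters, fields without a default must be declared before fields that have one.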

The field function

When the default value is a mutable data type such as list or dict, the approach above cannot be used on its own, as the following example shows:

from typing import List


@dataclass
class Point:
    x: int = 0
    y: int = 0
    tags: List[str] = []

Running the code above raises a ValueError that tells you to provide the default value through default_factory instead:

ValueError: mutable default <class 'list'> for field tags is not allowed: use default_factory

To fix this error, use field():

from dataclasses import dataclass
from dataclasses import field
from typing import List


@dataclass
class Point:
    x: int = 0
    y: int = 0
    tags: List[str] = field(default_factory=list)

Why do we need to use field to fix it?

The reason is that a default value assigned in the class body is stored as a class attribute, which means it can be accessed without instantiating the class:

print(Point.x)  # the result is 0

For mutable data types such as list, dict, and set, this means that every instance shares the same object, so a change made through one instance also shows up in the others. The following example appends to the data attribute of t1, which causes t2’s data to change as well. The identity check shows that t1’s data and t2’s data are the very same object, which is something to pay special attention to when defining Python class attributes:

from typing import List


class TagList(object):
    data: List[str] = []


t1 = TagList()
t1.data.append('t1')
t2 = TagList()
print('t2.data =>', t2.data)
print('t1.data is t2.data =>', t1.data is t2.data)

The above execution results are as follows:

t2.data => ['t1']
t1.data is t2.data => True

To avoid the errors that shared mutable defaults can cause, dataclasses provides the field function, whose default_factory parameter creates a fresh value for each instance. The previous Point class can therefore be written as follows:

from dataclasses import dataclass
from dataclasses import field
from typing import List


@dataclass
class Point:
    x: int = 0
    y: int = 0
    tags: List[str] = field(default_factory=list)

If you modify the tags attribute of one instance, you will find that the tags of other instances no longer change:

p1 = Point()
p1.tags.append('p1')
p2 = Point()
print('p2.tags =', p2.tags)
print('p1.tags is p2.tags =', p1.tags is p2.tags)

The above example’s results are as follows:

p2.tags = []
p1.tags is p2.tags = False

The field function also accepts many other parameters. If you are interested, refer to the official documentation for details.
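As a brief illustration of two of those parameters, the sketch below uses repr=False to hide a field from the generated __repr__ and compare=False to leave a field out of the generated __eq__. The note field is a hypothetical addition made only for this example:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Point:
    x: int = 0
    y: int = 0
    # Excluded from equality checks, but still shown in the repr.
    tags: List[str] = field(default_factory=list, compare=False)
    # Hypothetical field: hidden from the repr and ignored by __eq__.
    note: str = field(default='', repr=False, compare=False)


p1 = Point(1, 2, tags=['a'], note='hidden')
p2 = Point(1, 2, tags=['b'])

print(p1)        # Point(x=1, y=2, tags=['a'])
print(p1 == p2)  # True, because only x and y are compared
```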

Post-init processing

The __init__() method that dataclass generates calls __post_init__(), if the class defines it, after the attribute values have been initialized.

This suits attributes that must be derived after initialization. For example, the distance between a point and the origin can only be calculated once the x and y coordinates are known, so __post_init__() is a natural place to compute it:

from dataclasses import dataclass
from dataclasses import field


@dataclass
class Point:
    x: int = 0
    y: int = 0
    dist_from_origin: float = field(init=False)

    def __post_init__(self):
        self.dist_from_origin = (self.x ** 2 + self.y ** 2) ** 0.5


p1 = Point(3, 4)
print(p1)

The result of the above example is as follows. We can see that the value of dist_from_origin was assigned by the __post_init__() method:

Point(x=3, y=4, dist_from_origin=5.0)

Frozen dataclass

If you want the dataclass to be read-only and prevent any code from modifying its values after instantiation, you can pass the frozen=True parameter to the dataclass decorator. For example:

from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    x: int = 0
    y: int = 0

When we try to modify its value:

p1 = Point(3, 4)
p1.x = 5
print(p1)

Python raises a FrozenInstanceError:

FrozenInstanceError: cannot assign to field 'x'
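When a modified value is needed, one option is to build a new instance rather than change the existing one. The sketch below catches the error and then uses dataclasses.replace to create a changed copy; the variable names are chosen only for this illustration:

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Point:
    x: int = 0
    y: int = 0


p1 = Point(3, 4)

try:
    p1.x = 5
except FrozenInstanceError as exc:
    # FrozenInstanceError is importable from the dataclasses module.
    error_message = str(exc)

# replace() returns a new instance and leaves the original untouched.
p2 = replace(p1, x=5)

print(p1)  # Point(x=3, y=4)
print(p2)  # Point(x=5, y=4)
```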

In addition, note that a frozen dataclass also blocks ordinary attribute assignment inside the __post_init__() method. Assigning to a field there raises a FrozenInstanceError, as shown in the following example:

from dataclasses import dataclass
from dataclasses import field


@dataclass(frozen=True)
class Point:
    x: int = 0
    y: int = 0
    dist_from_origin: float = field(init=False)

    def __post_init__(self):
        self.dist_from_origin = (self.x ** 2 + self.y ** 2) ** 0.5


p = Point(3, 4)

The execution result of the above example is:

FrozenInstanceError: cannot assign to field 'dist_from_origin'
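A frozen dataclass can still compute a derived field in __post_init__() if the assignment bypasses the frozen check by calling object.__setattr__() directly. The following is a minimal sketch of that workaround; use it sparingly, since it deliberately sidesteps the read-only protection:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Point:
    x: int = 0
    y: int = 0
    dist_from_origin: float = field(init=False)

    def __post_init__(self):
        # Ordinary assignment is blocked on a frozen instance, so call
        # object.__setattr__ directly to set the derived value once.
        distance = (self.x ** 2 + self.y ** 2) ** 0.5
        object.__setattr__(self, 'dist_from_origin', distance)


p = Point(3, 4)
print(p)  # Point(x=3, y=4, dist_from_origin=5.0)
```

After construction the instance is still read-only for normal code, so p.x = 5 continues to raise FrozenInstanceError.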

Using slots parameter to increase performance

By default, Python stores the attributes of each instance in a per-instance dictionary (__dict__), which costs some memory and access time. Python therefore provides the __slots__ class attribute, which lets developers explicitly declare which attributes instances of the class have, in order to save memory and make attribute access faster.

Since Python 3.10, the dataclass decorator accepts the slots=True parameter, which generates __slots__ from the declared fields:

from dataclasses import dataclass


@dataclass(slots=True)
class Point:
    x: int = 0
    y: int = 0

Conclusion

When used appropriately, dataclass can improve the readability and maintainability of Python code. If the usage requirements are relatively simple, then using dataclass is a lightweight option.

However, dataclass does not have the functionality to verify data types. If there is a need for data type verification and checking, it is recommended to use pydantic.
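To make the point about type checking concrete, the sketch below shows that the annotations are not enforced at runtime, followed by a hand-rolled check in __post_init__(). UncheckedPoint and CheckedPoint are hypothetical names, and the check assumes every annotation is a plain class such as int; it is not a substitute for a validation library:

```python
from dataclasses import dataclass, fields


@dataclass
class UncheckedPoint:
    x: int = 0
    y: int = 0


@dataclass
class CheckedPoint:
    x: int = 0
    y: int = 0

    def __post_init__(self):
        # Assumption: each annotation is a plain class (int here). Generic
        # types such as List[str] or string annotations need more work.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, f.type):
                raise TypeError(
                    f'{f.name} must be {f.type.__name__}, '
                    f'got {type(value).__name__}'
                )


loose = UncheckedPoint('a', 'b')  # accepted: annotations are not enforced
print(loose)  # UncheckedPoint(x='a', y='b')

try:
    CheckedPoint('a', 2)
    rejected = False
except TypeError as exc:
    rejected = True
    print(exc)  # x must be int, got str
```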

That’s all.

Happy Coding!

References

dataclasses — Data Classes — Python 3.11.1 documentation