Writing Python Like Rust: A Guide

Source丨51CTO Technology Stack (ID: blog51cto) Author丨kobzol Planning丨Qianshan Proofreading丨Yun Zhao

Several years ago, I started programming in Rust, which gradually changed the way I design programs in other programming languages, especially Python. Before I began using Rust, I typically wrote Python code in a very dynamic and loosely typed manner, without type hints, passing and returning dictionaries everywhere, and occasionally falling back to “stringly typed” interfaces. However, after experiencing the strictness of Rust’s type system and noticing all the problems it prevents “by construction”, I suddenly became quite anxious whenever I returned to Python and did not get the same guarantees.

It is important to clarify that the “guarantees” here do not refer to memory safety (Python itself is reasonably memory safe), but rather to “robustness”: the concept of designing APIs that are difficult or impossible to misuse, thus preventing undefined behavior and various errors. In Rust, misusing an interface often leads to a compilation error. In Python, you can still execute such an incorrect program, but if you use a type checker (like pyright) or an IDE with type analysis (like PyCharm), you get a similar level of quick feedback about potential issues.

Eventually, I began adopting some concepts from Rust in my Python programs. It can essentially be boiled down to two things—using type hints as much as possible and adhering to the principle of making illegal states unrepresentable. I try to do this for both programs that will be maintained for a while and oneshot utility scripts. Mainly because, in my experience, the latter often turns into the former 🙂 This approach has led to programs that are easier to understand and modify.

In this article, I will show several examples of such patterns applied to Python programs. This is not rocket science, but I still feel it may be useful to document them.

Note: This article contains many opinions on writing Python code. I do not want to add “in my opinion” to every sentence, so consider everything in this article as my perspective on the matter, rather than attempting to promote some universal truth 🙂 Furthermore, I am not saying that the ideas presented are all invented in Rust; they are certainly used in other languages as well.

01

Type Hinting

First and foremost, use type hints as much as possible, especially in function signatures and class attributes. When I read a function signature like this:

def find_item(records, check):

I have no idea what is going on from the signature itself. Is records a list, a dictionary, or a database connection? Is check a boolean or a function? What does this function return? What happens if it fails: does it raise an exception or return None? To find answers to these questions, I either have to read the function body (and often recursively read the bodies of other functions it calls, which is annoying) or read its documentation (if there is any). While documentation may contain useful information about what the function does, it should not be needed to answer the previous questions. Many of them can be answered through a built-in mechanism: type hints.

from typing import Callable, List, Optional

def find_item(
  records: List[Item],
  check: Callable[[Item], bool]
) -> Optional[Item]:
  ...

Did I spend more time writing the signature? Yes. Is that a problem? No, unless my coding is bottlenecked by the number of characters written per minute, which does not really happen. Writing types explicitly forces me to think about what interface the function actually provides and how to make it as strict as possible, so that callers find it hard to misuse. With the above signature, I have a good idea of how to use the function, what to pass as arguments, and what to expect it to return. Additionally, unlike documentation comments, which can easily become outdated when the code changes, the type checker will yell at me if I change the types and do not update the function’s callers. If I am interested in what an Item is, I can directly use Go to definition and immediately see what that type looks like.
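To make the annotated signature concrete, here is a minimal runnable sketch. The `Item` dataclass and its `name` and `price` fields are assumptions made up for illustration; the original does not define them.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Item:
    # Hypothetical fields, chosen only for illustration.
    name: str
    price: float


def find_item(
    records: List[Item],
    check: Callable[[Item], bool],
) -> Optional[Item]:
    """Return the first record accepted by `check`, or None if there is none."""
    for record in records:
        if check(record):
            return record
    return None


items = [Item("pen", 1.5), Item("book", 12.0)]
cheap = find_item(items, lambda item: item.price < 2.0)
missing = find_item(items, lambda item: item.price > 100.0)
```

Because the return type is `Optional[Item]`, a type checker will insist that callers handle the `None` case before touching `cheap.name`.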

I am not an absolutist in this regard; if I need five nested type hints to describe a single parameter, I usually give up and give it a simpler but less precise type. In my experience, this situation does not occur often. If it does occur, it might actually indicate that the code has issues—if your function parameters can be a number, a string tuple, or a dictionary mapping strings to integers, it might indicate that you want to refactor and simplify it.

02

Using Data Classes Instead of Tuples or Dictionaries

Using type hints is one thing, but that only describes what the function’s interface is. The second step is to make these interfaces as precise and “locked down” as possible. A typical example is returning multiple values (or a complex value) from a function. The lazy and quick way is to return a tuple:

def find_person(...) -> Tuple[str, str, int]:

Great, we know we are returning three values. What are they? Is the first string the person’s first name? Is the second string the last name? What is the number? Is it an age? A position in some list? A social security number? This kind of typing is opaque, and unless you look at the function body, you have no idea what is happening here.

The next “improvement” might be to return a dictionary:

def find_person(...) -> Dict[str, Any]:
    ...
    return {
        "name": ..., 
        "city": ..., 
        "age": ...
    }

Now we have some idea of what the individual returned attributes are, but we again have to examine the function body to find out. In a sense, the types have become worse, because now we do not even know the number and types of the individual attributes. Furthermore, when this function changes and the keys in the returned dictionary are renamed or removed, there is no easy way to find out with a type checker, so its callers usually have to be fixed through a very manual and annoying run-crash-modify loop.

The correct solution is to return a strongly typed object with named attributes, each with its own type. In Python, this means we have to create a class. I suspect that tuples and dictionaries are often used in these cases because it is much easier than defining a class (and naming it), creating a constructor with parameters, storing the parameters in fields, etc. Since Python 3.7 (and earlier with a backport package), there is a much faster solution: dataclasses.

import dataclasses


@dataclasses.dataclass
class City:
    name: str
    zip_code: int


@dataclasses.dataclass
class Person:
    name: str
    city: City
    age: int


def find_person(...) -> Person:
    ...

You still need to think of a name for the class you create, but other than that, it is as concise as possible, and you get type annotations for all properties.

With this data class, I have a clear description of what the function returns. When I call this function and work with the return value, the IDE’s autocomplete will show me the names and types of its properties. This may seem trivial, but it is a huge productivity boost for me. Additionally, when the code is refactored and the properties change, my IDE and type checker will yell at me and show me all the places that need to change without me having to run the program at all. For some simple refactorings (like renaming properties), the IDE can even make these changes for me. Furthermore, with clearly named types, I can build a vocabulary (Person, City) that can be shared with other functions and classes.
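A small sketch of what working with such a return value looks like. The helper `find_person_by_name`, its parameters, and the sample data are hypothetical, since the original leaves the real parameters of find_person elided.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class City:
    name: str
    zip_code: int


@dataclass
class Person:
    name: str
    city: City
    age: int


# Hypothetical lookup helper; the original elides the real parameters.
def find_person_by_name(people: List[Person], name: str) -> Optional[Person]:
    for person in people:
        if person.name == name:
            return person
    return None


people = [Person("Ada", City("London", 12345), 36)]
found = find_person_by_name(people, "Ada")

summary = ""
if found is not None:
    # Attribute names are checked: a typo such as `found.citty` would be
    # reported by a type checker before the program ever runs.
    summary = f"{found.name} ({found.age}) lives in {found.city.name}"
```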

03

Algebraic Data Types

One thing I probably miss most from Rust in most mainstream languages is algebraic data types (ADT). They are a very powerful tool for explicitly describing the shape of the data my code is working with. For example, when I work with packets in Rust, I can explicitly enumerate all the various packets I can receive and assign different data (fields) for each of them:

enum Packet {
  Header {
    protocol: Protocol,
    size: usize
  },
  Payload {
    data: Vec<u8>
  },
  Trailer {
    data: Vec<u8>,
    checksum: usize
  }
}

With pattern matching, I can react to the various variants, and the compiler checks that I have not missed any cases:

fn handle_packet(packet: Packet) {
  match packet {
    Packet::Header { protocol, size } => ...,
    Packet::Payload { data } |
    Packet::Trailer { data, .. } => println!("{data:?}")
  }
}

This is very valuable for ensuring that invalid states are unrepresentable and thus avoiding many runtime errors. ADTs are particularly useful in statically typed languages, where if you want to use a set of types uniformly, you need a shared “name” to refer to them. Without ADTs, this is often done with OOP interfaces and/or inheritance. When the set of types used is open-ended, interfaces and virtual methods have their place, but when the set of types is closed, and you want to ensure you handle all possible variants, ADTs and pattern matching are more appropriate.

In dynamically typed languages (like Python), there is actually no need to share a name for a set of types, mainly because you don’t have to name the types used in the program upfront. However, creating union types still proves useful, using something akin to ADTs:

import typing
from dataclasses import dataclass

@dataclass
class Header:
  protocol: Protocol
  size: int

@dataclass
class Payload:
  data: str

@dataclass
class Trailer:
  data: str
  checksum: int

Packet = typing.Union[Header, Payload, Trailer]
# or `Packet = Header | Payload | Trailer` since Python 3.10

Here, Packet defines a new type that can be either a header, payload, or trailer. Now, when I want to ensure that only these three classes are valid, I can use this type (name) throughout the rest of the program. Note that the classes do not have any additional explicit “tags”, so when we want to differentiate them, we have to use isinstance or pattern matching:

def handle_is_instance(packet: Packet):
    if isinstance(packet, Header):
        print(f"header {packet.protocol} {packet.size}")
    elif isinstance(packet, Payload):
        print(f"payload {packet.data}")
    elif isinstance(packet, Trailer):
        print(f"trailer {packet.checksum} {packet.data}")
    else:
        assert False


def handle_pattern_matching(packet: Packet):
    match packet:
        case Header(protocol, size): print(f"header {protocol} {size}")
        case Payload(data): print(f"payload {data}")
        case Trailer(data, checksum): print(f"trailer {checksum} {data}")
        case _: assert False

Unfortunately, here we have to (or more accurately, should) include the annoying assert False branch so that the function crashes when it receives unexpected data. In Rust, this would be a compile-time error.

Note: Several people on Reddit have reminded me that assert False is actually completely optimized out in optimized builds (python -O …). Therefore, directly raising an exception would be safer. There is also typing.assert_never from Python 3.11, which explicitly tells the type checker that falling into this branch should be a “compile-time” error.
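The note above can be sketched without requiring Python 3.11. The helper `unreachable` is a hypothetical stand-in for typing.assert_never, and `protocol` is simplified to a plain `str` so that the example is self-contained.

```python
from dataclasses import dataclass
from typing import NoReturn, Union


@dataclass
class Header:
    protocol: str  # simplified to `str` for this sketch
    size: int


@dataclass
class Payload:
    data: str


@dataclass
class Trailer:
    data: str
    checksum: int


Packet = Union[Header, Payload, Trailer]


def unreachable(value: NoReturn) -> NoReturn:
    # Hypothetical helper mirroring `typing.assert_never` (Python 3.11+).
    # Unlike `assert False`, this raise is not stripped by `python -O`.
    raise TypeError(f"unexpected value: {value!r}")


def describe(packet: Packet) -> str:
    if isinstance(packet, Header):
        return f"header {packet.protocol} {packet.size}"
    elif isinstance(packet, Payload):
        return f"payload {packet.data}"
    elif isinstance(packet, Trailer):
        return f"trailer {packet.checksum} {packet.data}"
    else:
        # A type checker reports an error here if a variant is left unhandled,
        # because `packet` would then not have been narrowed to `NoReturn`.
        unreachable(packet)
```

If a fourth class is later added to the union and describe is not updated, the type checker flags the call to `unreachable`, which is the closest Python gets to Rust's exhaustiveness check.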

A great property of union types is that they are defined outside of the classes that are part of the union. Thus, the classes do not know they are included in the union, which reduces coupling in the code. You can even create multiple different unions using the same types:

Packet = Header | Payload | Trailer
PacketWithData = Payload | Trailer

Union types are also very useful for automatic (de)serialization. Recently, I discovered a great serialization library called pyserde, which is based on the venerable Rust serde serialization framework. Among many other cool features, it is capable of utilizing type annotations to serialize and deserialize union types without any extra code:

import serde

...
Packet = Header | Payload | Trailer

@dataclass
class Data:
    packet: Packet

serialized = serde.to_dict(Data(packet=Trailer(data="foo", checksum=42)))
# {'packet': {'Trailer': {'data': 'foo', 'checksum': 42}}}

deserialized = serde.from_dict(Data, serialized)
# Data(packet=Trailer(data='foo', checksum=42))

You can even choose the serialization method of the union tags with serde. I have been looking for similar functionality because it is very useful for (de)serializing union types. However, implementing it in most other serialization libraries (like dataclasses_json or dacite) that I have tried is very annoying.

For instance, when using machine learning models, I store various types of neural networks (like classification or segmentation CNN models) in a single configuration file format using unions. I also find it useful to version data in different formats (in my case, the configuration files) as follows:

Config = ConfigV1 | ConfigV2 | ConfigV3

By deserializing Config, I can read all previous versions of the configuration format, thus maintaining backward compatibility.
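One way to use such a versioned union after loading is to upgrade every older version to the newest one in a single place. This is only a sketch: every field name and the default batch size are invented for illustration, since the original does not show what the configurations contain.

```python
from dataclasses import dataclass
from typing import Union


# All fields below are hypothetical; the original does not show them.
@dataclass
class ConfigV1:
    lr: float


@dataclass
class ConfigV2:
    learning_rate: float


@dataclass
class ConfigV3:
    learning_rate: float
    batch_size: int


Config = Union[ConfigV1, ConfigV2, ConfigV3]


def upgrade(config: Config) -> ConfigV3:
    """Convert any supported version to the latest one, step by step."""
    if isinstance(config, ConfigV1):
        config = ConfigV2(learning_rate=config.lr)
    if isinstance(config, ConfigV2):
        # Assumed default for the field that older versions lack.
        config = ConfigV3(learning_rate=config.learning_rate, batch_size=32)
    return config
```

The rest of the program then only ever deals with `ConfigV3`, while old files keep loading.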

04

Using Newtype

In Rust, it is common to define data types that add no new behavior but merely specify the domain and intended use of some otherwise very generic data type (like an integer). This pattern is called “newtype” and it can also be used in Python. Here’s a motivating example:

class Database:
  def get_car_id(self, brand: str) -> int: ...
  def get_driver_id(self, name: str) -> int: ...
  def get_ride_info(self, car_id: int, driver_id: int) -> RideInfo: ...


db = Database()
car_id = db.get_car_id("Mazda")
driver_id = db.get_driver_id("Stig")
info = db.get_ride_info(driver_id, car_id)

Did you spot the error?

……

……

The parameters in get_ride_info are swapped. There are no type errors because both car_id and driver_id are simple integers, so the types are correct even if the function call is semantically wrong.

We can solve this problem by defining separate types for different types of IDs using “NewType”:

from typing import NewType

# Define a new type called "CarId", which is internally an `int`
CarId = NewType("CarId", int)
# Ditto for "DriverId"
DriverId = NewType("DriverId", int)

class Database:
  def get_car_id(self, brand: str) -> CarId: ...
  def get_driver_id(self, name: str) -> DriverId: ...
  def get_ride_info(self, car_id: CarId, driver_id: DriverId) -> RideInfo: ...


db = Database()
car_id = db.get_car_id("Mazda")
driver_id = db.get_driver_id("Stig")
# Type error here -> DriverId used instead of CarId and vice-versa
info = db.get_ride_info(driver_id, car_id)

This is a very simple pattern that helps catch hard-to-find errors. It is especially useful, for example, if you are dealing with many different types of IDs (CarId vs DriverId) or certain metrics that should not be mixed together (Speed vs Length vs Temperature).
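A self-contained sketch of the same idea, showing that NewType exists only for the type checker. The in-memory tables `CARS` and `DRIVERS` and the function `describe_ride` are hypothetical stand-ins for the database above.

```python
from typing import Dict, NewType

CarId = NewType("CarId", int)
DriverId = NewType("DriverId", int)

# Hypothetical in-memory tables standing in for a real database.
CARS: Dict[str, CarId] = {"Mazda": CarId(1)}
DRIVERS: Dict[str, DriverId] = {"Stig": DriverId(7)}


def describe_ride(car_id: CarId, driver_id: DriverId) -> str:
    return f"car #{car_id} driven by driver #{driver_id}"


car_id = CARS["Mazda"]
driver_id = DRIVERS["Stig"]
ride = describe_ride(car_id, driver_id)

# At runtime a NewType value is just the underlying int.
is_plain_int = type(car_id) is int

# Both calls below run fine, but a type checker rejects them:
# describe_ride(driver_id, car_id)  # arguments swapped
# describe_ride(1, 7)               # plain ints are not IDs
```

Since the wrapper disappears at runtime, the pattern costs almost nothing; the only discipline required is wrapping raw integers at the boundary where they enter the program.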

05

Using Constructor Functions

One thing I really like about Rust is that it does not have constructors built into the language. Instead, people tend to use regular functions to create (ideally correctly initialized) instances of structures. In Python, there is no constructor overloading, so if you need to construct an object in multiple ways, someone will end up with an __init__ method that has a lot of parameters, which are used for initialization in different ways and cannot be used together.

Instead, I prefer to create “constructor” functions with explicit names, making it clear how to construct an object and from what data to construct it:

class Rectangle:
    @staticmethod
    def from_x1x2y1y2(x1: float, ...) -> "Rectangle": ...

    @staticmethod
    def from_tl_and_size(top: float, left: float, width: float, height: float) -> "Rectangle": ...

This makes constructing objects clearer and does not allow users of the class to pass invalid data when constructing the object (for example, by combining y1 and width).
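A runnable sketch of the two named constructors. The parameter order `x1, x2, y1, y2` is an assumption read off the function name, and storing the rectangle as left/top/width/height is just one possible internal representation.

```python
from dataclasses import dataclass


@dataclass
class Rectangle:
    left: float
    top: float
    width: float
    height: float

    @staticmethod
    def from_x1x2y1y2(x1: float, x2: float, y1: float, y2: float) -> "Rectangle":
        # Accept the corner coordinates in either order.
        return Rectangle(
            left=min(x1, x2),
            top=min(y1, y2),
            width=abs(x2 - x1),
            height=abs(y2 - y1),
        )

    @staticmethod
    def from_tl_and_size(top: float, left: float, width: float, height: float) -> "Rectangle":
        if width < 0 or height < 0:
            raise ValueError("width and height must be non-negative")
        return Rectangle(left=left, top=top, width=width, height=height)


a = Rectangle.from_x1x2y1y2(x1=10.0, x2=30.0, y1=5.0, y2=25.0)
b = Rectangle.from_tl_and_size(top=5.0, left=10.0, width=20.0, height=20.0)
```

Note that the dataclass-generated `__init__` is still callable directly in this sketch; Python has no private constructors, so the named functions are a convention rather than an enforced gate.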

06

Using Type System Invariants

Using the type system itself to encode invariants that could otherwise only be tracked at runtime is a very general and powerful concept. In Python (and other mainstream languages), I often see classes that are hairy balls of mutable state. One of the sources of this mess is code that tries to track the object’s invariants at runtime. It has to consider many scenarios that could theoretically happen because the type system does not make them impossible (“What if the client has been asked to disconnect, and now someone tries to send a message to it, but the socket is still connected?” and so on).

Client

This is a typical example:

class Client:
  """
  Rules:
  - Do not call `send_message` before calling `connect` and then `authenticate`.
  - Do not call `connect` or `authenticate` multiple times.
  - Do not call `close` without calling `connect`.
  - Do not call any method after calling `close`.
  """
  def __init__(self, address: str): ...

  def connect(self): ...
  def authenticate(self, password: str): ...
  def send_message(self, msg: str): ...
  def close(self): ...

…Easy, right? You just have to read the documentation carefully and make sure you never violate the rules above (to avoid undefined behavior or crashes). Another approach is to fill the class with various assertions that check all the mentioned rules at runtime, leading to messy code, missed edge cases, and slower feedback when errors occur (compile-time vs runtime). The core of the problem is that a client can exist in various (mutually exclusive) states, but instead of modeling these states separately, they are all merged into one type.

Let’s see if we can improve this by splitting the various states into separate types.

First, does it make sense to have a Client that is not connected to anything? Probably not. Such an unconnected client cannot do anything before you call connect anyway. So why allow such a state to exist? We can create a function that returns a connected client:

def connect(address: str) -> Optional["ConnectedClient"]:
  ...

class ConnectedClient:
  def authenticate(...): ...
  def send_message(...): ...
  def close(...): ...

If the function succeeds, it returns a client that upholds the “is connected” invariant, and on which you cannot call connect again to mess things up. If the connection fails, the function can raise an exception or return None or some explicit error.

A similar approach can be used for the authenticated state. We can introduce another type that maintains the invariants of the client being connected and authenticated:

class ConnectedClient:
  def authenticate(...) -> Optional["AuthenticatedClient"]: ...

class AuthenticatedClient:
  def send_message(...): ...
  def close(...): ...

Only once we truly have an instance of AuthenticatedClient can we start sending messages.

The last issue is the close method. In Rust (thanks to destructive move semantics), we can express the fact that once close is called, the client can no longer be used. In Python, this is not possible, so we have to use a workaround. One solution is to fall back to runtime tracking: introduce a boolean attribute in the client and assert in close and send_message that the client has not already been closed. Another is to remove the close method completely and only use the client as a context manager:

with connect(...) as client:
    client.send_message("foo")
# Here the client is closed

Without the close method available, you cannot accidentally close the client twice.
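Putting the pieces together, here is a minimal sketch of the typestate idea with no real networking. The `log` parameter, the hard-coded password check, and the address are all assumptions added so that the example is observable and self-contained.

```python
import contextlib
from typing import Iterator, List, Optional


class AuthenticatedClient:
    def __init__(self, address: str, log: List[str]):
        self._address = address
        self._log = log

    def send_message(self, msg: str) -> None:
        self._log.append(f"{self._address} <- {msg}")


class ConnectedClient:
    def __init__(self, address: str, log: List[str]):
        self._address = address
        self._log = log

    def authenticate(self, password: str) -> Optional[AuthenticatedClient]:
        # Hypothetical check standing in for a real handshake.
        if password != "secret":
            return None
        return AuthenticatedClient(self._address, self._log)


@contextlib.contextmanager
def connect(address: str, log: List[str]) -> Iterator[ConnectedClient]:
    log.append(f"connected to {address}")
    try:
        yield ConnectedClient(address, log)
    finally:
        # Closing happens exactly once, when the `with` block exits.
        log.append(f"closed {address}")


events: List[str] = []
with connect("example.invalid:1234", events) as client:
    authed = client.authenticate("secret")
    if authed is not None:
        authed.send_message("foo")
```

`ConnectedClient` has no send_message method at all, so sending before authenticating is reported by the type checker instead of being discovered at runtime.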

Strongly Typed Bounding Boxes

Object detection is a computer vision task I sometimes work on, where a program has to detect a set of bounding boxes in an image. Bounding boxes are basically glorified rectangles with some attached data, and they are ubiquitous when implementing object detection. One annoying thing about them is that sometimes they are normalized (the coordinates and sizes of the rectangle lie in the interval [0.0, 1.0]), but sometimes they are denormalized (the coordinates and sizes are bounded by the dimensions of the image they are attached to). When you send bounding boxes through many functions that handle data preprocessing or postprocessing, it is easy to mess this up, for example by normalizing a bounding box twice, which leads to errors that are very annoying to debug.

This has happened to me several times, so once I decided to solve the problem by splitting these two types of bbox into two different types:

@dataclass
class NormalizedBBox:
  left: float
  top: float
  width: float
  height: float


@dataclass
class DenormalizedBBox:
  left: float
  top: float
  width: float
  height: float

By separating these, normalized and denormalized bounding boxes are no longer easily mixed together, which primarily solves the problem. However, we can make some improvements to make the code more ergonomic:

  • Reduce duplication through composition or inheritance:

@dataclass
class BBoxBase:
  left: float
  top: float
  width: float
  height: float

# Composition
@dataclass
class NormalizedBBox:
  bbox: BBoxBase

@dataclass
class DenormalizedBBox:
  bbox: BBoxBase

Bbox = Union[NormalizedBBox, DenormalizedBBox]

# Inheritance
class NormalizedBBox(BBoxBase): ...
class DenormalizedBBox(BBoxBase): ...

  • Add runtime checks to ensure that normalized bounding boxes are indeed normalized:

# The subclass must itself be decorated; the __init__ inherited from
# BBoxBase was generated without a __post_init__ call.
@dataclass
class NormalizedBBox(BBoxBase):
  def __post_init__(self):
    assert 0.0 <= self.left <= 1.0
    ...

  • Add a method to convert between the two representations. In some places, we might want to know the explicit representation, but in other places, we want to use a generic interface (“any type of BBox”). In that case, we should be able to convert “any BBox” to one of the two representations:

class BBoxBase:
  def as_normalized(self, size: Size) -> "NormalizedBBox": ...
  def as_denormalized(self, size: Size) -> "DenormalizedBBox": ...

class NormalizedBBox(BBoxBase):
  def as_normalized(self, size: Size) -> "NormalizedBBox":
    return self
  def as_denormalized(self, size: Size) -> "DenormalizedBBox":
    return self.denormalize(size)

class DenormalizedBBox(BBoxBase):
  def as_normalized(self, size: Size) -> "NormalizedBBox":
    return self.normalize(size)
  def as_denormalized(self, size: Size) -> "DenormalizedBBox":
    return self

With this interface, I can have the best of both worlds—types separated for correctness and using a unified interface for ergonomics.
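The fragments above can be assembled into one runnable sketch using the inheritance variant. The `Size` class is hypothetical (the original never defines it), the conversions are written inline rather than through separate normalize/denormalize helpers, and the range check raises `ValueError` instead of asserting so that it survives `python -O`.

```python
from dataclasses import dataclass


@dataclass
class Size:
    # Hypothetical image size type; the original does not define `Size`.
    width: int
    height: int


@dataclass
class BBoxBase:
    left: float
    top: float
    width: float
    height: float

    def as_normalized(self, size: Size) -> "NormalizedBBox":
        raise NotImplementedError

    def as_denormalized(self, size: Size) -> "DenormalizedBBox":
        raise NotImplementedError


@dataclass
class NormalizedBBox(BBoxBase):
    def __post_init__(self) -> None:
        for value in (self.left, self.top, self.width, self.height):
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{value} is outside [0.0, 1.0]")

    def as_normalized(self, size: Size) -> "NormalizedBBox":
        return self

    def as_denormalized(self, size: Size) -> "DenormalizedBBox":
        return DenormalizedBBox(
            self.left * size.width, self.top * size.height,
            self.width * size.width, self.height * size.height,
        )


@dataclass
class DenormalizedBBox(BBoxBase):
    def as_normalized(self, size: Size) -> NormalizedBBox:
        return NormalizedBBox(
            self.left / size.width, self.top / size.height,
            self.width / size.width, self.height / size.height,
        )

    def as_denormalized(self, size: Size) -> "DenormalizedBBox":
        return self


image = Size(width=200, height=100)
pixels = DenormalizedBBox(left=50.0, top=25.0, width=100.0, height=50.0)
unit = pixels.as_normalized(image)
```

Calling `as_normalized` on a box that is already normalized returns it unchanged, so the double-normalization bug described earlier cannot happen through this interface.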

Note: If you want to add some shared methods to the parent/base class that return instances of the corresponding class/type, you can use typing.Self starting from Python 3.11:

class BBoxBase:
  def move(self, x: float, y: float) -> typing.Self: ...

class NormalizedBBox(BBoxBase):
  ...

bbox = NormalizedBBox(...)
# The type of `bbox2` is `NormalizedBBox`, not just `BBoxBase`
bbox2 = bbox.move(1, 2)

Safer Mutexes

Mutexes and locks in Rust are usually exposed through a very nice interface that has two advantages.

First, when you lock a mutex, you get back a guard object that automatically unlocks the mutex when the guard is destroyed, leveraging the venerable RAII mechanism:

{
  let guard = mutex.lock(); // locked here
  ...
} // automatically unlocked here

This means you won’t accidentally forget to unlock the mutex. Similar mechanisms are also commonly used in C++, although explicit lock/unlock interfaces without guard objects are also available for std::mutex, meaning they can still be misused.

Second, the data protected by the mutex is stored directly inside the mutex (struct). With this design, it is impossible to access the protected data without actually locking the mutex. You must lock the mutex first to get the guard and then use the guard itself to access the data:

use std::sync::Mutex;

let lock = Mutex::new(41); // Create a mutex that stores the data inside
let mut guard = lock.lock().unwrap(); // Acquire guard
*guard += 1; // Modify the data using the guard

This sharply contrasts with the common mutex APIs in mainstream languages (including Python), where the mutex and the data it protects are separate, making it easy to forget to actually lock the mutex before accessing the data:

from threading import Lock, Thread

mutex = Lock()

def thread_fn(data):
    # Acquire mutex. There is no link to the protected variable.
    mutex.acquire()
    data.append(1)
    mutex.release()


data = []
t = Thread(target=thread_fn, args=(data,))
t.start()

# Here we can access the data without locking the mutex.
data.append(2)  # Oops

While we cannot achieve exactly the same benefits in Python as we do in Rust, not everything is lost. Python locks implement the context manager interface, which means you can use them in a block with “with” to ensure they are automatically unlocked when the scope ends. With a little effort, we can go even further:

import contextlib
from threading import Lock
from typing import Generic, Iterator, TypeVar

T = TypeVar("T")

# Make the Mutex generic over the value it stores.
# In this way we can get proper typing from the `lock` method.
class Mutex(Generic[T]):
  # Store the protected value inside the mutex 
  def __init__(self, value: T):
    # Name it with two underscores to make it a bit harder to accidentally
    # access the value from the outside.
    self.__value = value
    self.__lock = Lock()

  # Provide a context manager `lock` method, which locks the mutex,
  # provides the protected value, and then unlocks the mutex when the
  # context manager ends.
  # The generator itself is annotated with what it yields; the decorator
  # turns it into a context manager of `T`.
  @contextlib.contextmanager
  def lock(self) -> Iterator[T]:
    self.__lock.acquire()
    try:
        yield self.__value
    finally:
        self.__lock.release()

# Create a mutex wrapping the data
mutex = Mutex([])

# Lock the mutex for the scope of the `with` block
with mutex.lock() as value:
  # value is typed as `list` here
  value.append(1)

With this design, you can only access the protected data after actually locking the mutex. Obviously, this is still Python, so you can still break the invariant, for example by storing another reference to the protected data outside the mutex. But unless you are being adversarial, this makes the mutex interface in Python much safer to use.
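To see the wrapper under contention, here is a sketch with several threads incrementing a shared counter. The `worker` function, the single-element list used as a mutable counter, and the thread and iteration counts are assumptions chosen for illustration.

```python
import contextlib
from threading import Lock, Thread
from typing import Generic, Iterator, List, TypeVar

T = TypeVar("T")


class Mutex(Generic[T]):
    def __init__(self, value: T):
        self.__value = value
        self.__lock = Lock()

    @contextlib.contextmanager
    def lock(self) -> Iterator[T]:
        # The inner `with` releases the lock even if the caller's block raises.
        with self.__lock:
            yield self.__value


def worker(shared: Mutex[List[int]]) -> None:
    for _ in range(1000):
        # The counter is only reachable inside the `with` block.
        with shared.lock() as counter:
            counter[0] += 1  # read-modify-write, so it needs the lock


shared: Mutex[List[int]] = Mutex([0])
threads = [Thread(target=worker, args=(shared,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

with shared.lock() as counter:
    total = counter[0]
```

Every access goes through `lock()`, so there is no code path in `worker` that could touch the counter without holding the mutex.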

In any case, I am sure I use more “robustness patterns” in my Python code, but these are all I can think of at the moment. If you have examples of similar ideas or any other comments, please let me know.

  1. To be fair, if you use some structured format (like reStructuredText), parameter types can also be described in documentation comments. In that case, the type checker might use them and warn you when there is a type mismatch. However, if you are using a type checker anyway, I think it is better to leverage the “native” mechanism for specifying types: type hints.

  2. aka discriminated/tagged unions, sum types, sealed classes, etc.

  3. Yes, there are other use cases for newtypes besides those described here, don’t yell at me anymore.

  4. This is called the typestate pattern.

  5. Unless you try hard, like manually calling the magic __exit__ method.

Original link:

https://kobzol.github.io/rust/python/2023/05/20/writing-python-like-its-rust.html
