Introduction
You rewrite a Python performance bottleneck in Rust, eagerly anticipating a speedup, only to find that profiling shows 60% of the time going to hidden data copies. The experience is both amusing and frustrating. In fact, when passing large amounts of data between Python and Rust, how you handle the boundary often matters more than the kernel algorithm.
This article will introduce 5 battle-tested PyO3/maturin patterns to help you achieve zero-copy or near-zero-copy data exchange. Whether you are dealing with arrays, images, or log data, these techniques will keep your cross-language pipeline efficient. The core principle is simple: borrow, move, or share—but never duplicate.
Pattern 1: Directly Borrowing NumPy Arrays (Zero-Copy)
When dealing with numerical data, use the numpy crate for PyO3 and accept a read-only view. You get shape and stride safety, dtype checks, and no extra copy when the data is contiguous.
How It Works
NumPy arrays expose their data pointer, shape, and strides. PyReadonlyArray* borrows that pointer and metadata from Python instead of copying the data. You compute in Rust while Python retains ownership.
Code Example: Summing an f64 Array (No Copy)
// Cargo.toml
// [dependencies]
// pyo3 = { version = "0.21", features = ["extension-module"] }
// numpy = "0.21"
use numpy::{PyReadonlyArray1, PyReadwriteArray1};
use pyo3::prelude::*;

#[pyfunction]
fn sum64(x: PyReadonlyArray1<f64>) -> PyResult<f64> {
    // Borrow the data as a slice without copying. A non-contiguous array
    // makes this return an error; it never copies silently
    let view = x.as_slice()?;
    Ok(view.iter().copied().sum())
}

#[pyfunction]
fn scale_inplace(mut x: PyReadwriteArray1<f32>, factor: f32) -> PyResult<()> {
    // PyReadwriteArray1 is an exclusive, runtime-checked borrow, so no
    // unsafe block or "caller won't alias" assumption is needed
    for v in x.as_slice_mut()? {
        // Modify the NumPy array directly, no copy
        *v *= factor;
    }
    Ok(())
}

#[pymodule]
fn fastlane(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(sum64, m)?)?;
    m.add_function(wrap_pyfunction!(scale_inplace, m)?)?;
    Ok(())
}
Python Side Call
import numpy as np
import fastlane
# Zero-copy operation
a = np.random.randn(1_000_000).astype(np.float64)
print(fastlane.sum64(a))
# In-place update
b = np.ones(8, dtype=np.float32)
fastlane.scale_inplace(b, 3.0)
Pro Tip: If the array is not contiguous, call np.ascontiguousarray once in Python and keep reusing that buffer.
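The contiguity rule can be seen without NumPy at all. The following is a minimal sketch that uses Python's built-in memoryview as a stand-in for an ndarray; ensure_contiguous is a hypothetical helper name chosen for illustration, not part of any library.

```python
def ensure_contiguous(view: memoryview) -> memoryview:
    """Return `view` unchanged if it is C-contiguous, else one packed copy."""
    if view.c_contiguous:
        return view  # borrow as-is, nothing is copied
    # Exactly one copy into a fresh contiguous buffer; keep and reuse the result.
    return memoryview(view.tobytes())


data = bytearray(range(10))
full = memoryview(data)
strided = full[::2]  # every second byte: a non-contiguous view

packed = ensure_contiguous(strided)  # copied once
print(full.c_contiguous, strided.c_contiguous, packed.c_contiguous)
```

The point of doing this once on the Python side is that every later call into Rust can borrow the packed buffer instead of paying for a copy each time.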
Pattern 2: Accepting memoryview for Raw Bytes and Images
For images, compressed data, or any binary blocks, avoid converting to bytes first (building a new bytes object from another buffer copies the data). Instead, pass a memoryview and borrow it in Rust through the buffer protocol. This is the clearest way to connect I/O pipelines (camera frames, Parquet pages, log blocks) without allocating memory.
Code Example: Borrowing memoryview
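// crc32fast must also be listed under [dependencies] in Cargo.toml
use pyo3::buffer::PyBuffer;
use pyo3::exceptions::PyBufferError;
use pyo3::prelude::*;

#[pyfunction]
fn crc32_mv(buf: PyBuffer<u8>) -> PyResult<u32> {
    // PyBuffer<u8> acquires the buffer-protocol export of the argument
    // (memoryview, bytearray, uint8 ndarray, mmap) and checks the item type.
    // Note: as_ptr() on the Python object would point at the object header,
    // not at the data, so it must not be used here
    if !buf.is_c_contiguous() {
        return Err(PyBufferError::new_err("expected a C-contiguous buffer"));
    }
    if buf.len_bytes() == 0 {
        return Ok(crc32fast::hash(&[]));
    }
    // Only borrow, no copy. SAFETY: `buf` keeps the export alive and the GIL
    // is held for the whole call, so no Python code can mutate the data
    let data = unsafe {
        std::slice::from_raw_parts(buf.buf_ptr() as *const u8, buf.len_bytes())
    };
    Ok(crc32fast::hash(data))
}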
Python Side Call
import numpy as np
import fastlane
# Read from file and use memoryview
buf = np.frombuffer(open("frame.bin", "rb").read(), dtype=np.uint8)
print(fastlane.crc32_mv(memoryview(buf)))
Why It Works: memoryview is protocol-first, with zero intermediate byte clones. It is also a convenient adapter for bytearray, numpy, and even mmap.
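To make "borrow, not copy" concrete, here is a small stdlib-only sketch. The frame contents are made up for illustration, and zlib.crc32 stands in for the Rust function, since it also accepts any bytes-like object.

```python
import zlib

frame = bytearray(b"header" + bytes(1024))  # pretend this is a camera frame
view = memoryview(frame)

payload = view[6:]         # slicing a memoryview borrows; no bytes are copied
copied = bytes(frame[6:])  # slicing the bytearray and calling bytes() copies

frame[6] = 0xFF            # mutate the original buffer in place

print(payload[0])          # the view sees the write: same memory
print(copied[0])           # the copy still holds the old value

# Any bytes-like consumer can read the view directly, without a bytes() detour.
checksum = zlib.crc32(payload)
```

Because the view and the bytearray share memory, the consumer always sees current data; the flip side is that the producer must not overwrite the buffer while a consumer is still reading it.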
Pattern 3: Moving Rust Memory to Return NumPy Arrays (No Back-Copy)
Sometimes Rust should own the allocation (e.g., large results or batch conversions). Use IntoPyArray to transfer ownership of a Vec<T> (or ndarray::Array) to Python/NumPy without copying. NumPy frees the memory when the array is garbage-collected, so the handoff is clean.
Code Example: Generating on Rust Side, Zero-Copy to Python
use numpy::{IntoPyArray, PyArray1};
use pyo3::prelude::*;
#[pyfunction]
fn linspace(n: usize, start: f64, stop: f64, py: Python<'_>) -> PyResult<Py<PyArray1<f64>>> {
let mut v = Vec::with_capacity(n);
// Guard n <= 1: dividing by zero would turn the single element into NaN
let step = if n > 1 { (stop - start) / (n - 1) as f64 } else { 0.0 };
// Generate data
for i in 0..n {
v.push(start + step * i as f64);
}
// Move Vec -> NumPy, no copy
Ok(v.into_pyarray_bound(py).unbind())
}
Python Side Call
import fastlane
# Allocated by Rust
x = fastlane.linspace(1_000_000, 0.0, 1.0)
Rule of Thumb: “Borrow in, move out” is the lowest friction mode for handling large numerical data.
Pattern 4: Reusing Buffers Across Calls (Arena + Views)
To be honest: even if cross-language copies are avoided, repeated allocations/releases will still hurt latency. The solution is a buffer registry (internal arena) that grows only once and slices into views for each call.
Architecture Sketch
Python np.ndarray ──borrows──> [Rust Kernel]
| \
v v
[Arena Buffer] [Small Temporary Stack]
|
└──> into_pyarray (move once)
Code Example (Simplified)
use numpy::{IntoPyArray, PyArray1};
use once_cell::sync::Lazy;
use parking_lot::Mutex;
use pyo3::buffer::PyBuffer;
use pyo3::exceptions::PyBufferError;
use pyo3::prelude::*;

// Global scratch arena. It is typed as f32 so its alignment is always valid
// for the data stored in it
static ARENA: Lazy<Mutex<Vec<f32>>> =
    Lazy::new(|| Mutex::new(Vec::with_capacity(1_000_000)));

#[pyfunction]
fn normalize_u8_inplace(buf: PyBuffer<u8>) -> PyResult<()> {
    // Borrow via the buffer protocol for generality (memoryview, bytearray, ndarray)
    if buf.readonly() || !buf.is_c_contiguous() {
        return Err(PyBufferError::new_err("expected a writable, C-contiguous buffer"));
    }
    if buf.len_bytes() == 0 {
        return Ok(());
    }
    // SAFETY: the export stays alive while `buf` lives and the GIL is held
    let data = unsafe {
        std::slice::from_raw_parts_mut(buf.buf_ptr() as *mut u8, buf.len_bytes())
    };
    // In-place modification
    for v in data {
        *v = (*v).saturating_sub(8);
    }
    Ok(())
}

#[pyfunction]
fn transform_with_scratch(py: Python<'_>, n: usize) -> PyResult<Py<PyArray1<f32>>> {
    let mut arena = ARENA.lock();
    // Grows only when a call needs more than any earlier call did
    if arena.len() < n {
        arena.resize(n, 0.0);
    }
    // Intermediate work happens in the arena
    let scratch = &mut arena[..n];
    for (i, v) in scratch.iter_mut().enumerate() {
        *v = (i as f32).sin();
    }
    // The arena must keep its allocation: handing it to NumPy through
    // Vec::from_raw_parts would free the memory twice, and forgetting the
    // guard would leave the mutex locked. Copy the final result out once
    // and move that Vec into NumPy instead
    let out = scratch.to_vec();
    Ok(out.into_pyarray_bound(py).unbind())
}
Production Practical Strategies
- Keep several typed arenas (e.g., Vec<f32>, Vec<u8>), sized to maximum demand
- Only call IntoPyArray on the final result; keep intermediate steps in the arena so they trigger no new allocations
- If lifetimes become complex, wrap the arena in a small Rust struct and expose it as a Python object
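The "grow once, hand out views" idea behind these strategies can be sketched in pure Python. ScratchArena and its grow_count attribute are hypothetical names for illustration; a real implementation would live on the Rust side as described above.

```python
class ScratchArena:
    """Hypothetical reusable scratch buffer: grows when needed, never shrinks."""

    def __init__(self, capacity: int = 0) -> None:
        self._buf = bytearray(capacity)
        self.grow_count = 0  # how many times a new allocation was needed

    def view(self, nbytes: int) -> memoryview:
        """Return a writable view of the first `nbytes` bytes of the arena."""
        if nbytes > len(self._buf):
            # Replace instead of resizing in place: a bytearray cannot be
            # resized while views handed out earlier are still alive.
            # Old contents are dropped, which is fine for scratch space.
            self._buf = bytearray(nbytes)
            self.grow_count += 1
        return memoryview(self._buf)[:nbytes]


arena = ScratchArena()
for size in (1024, 512, 1024, 4096, 2048):
    scratch = arena.view(size)
    scratch[0] = 1  # use the view as workspace for this call

print(arena.grow_count)  # only the 1024 and 4096 requests allocated
```

Tracking something like grow_count is also a cheap way to notice when the arena is undersized and every call is paying for a reallocation.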
Pattern 5: Professional Packaging with maturin (Keep Wheels Slim)
Even the fastest bridge still needs a smooth install. maturin builds manylinux/macOS wheels out of the box, and its build options help you keep them small and free of surprises.
Project Layout
yourpkg/
├─ src/lib.rs
├─ pyproject.toml # maturin PEP 517 backend
└─ Cargo.toml
pyproject.toml Configuration
[build-system]
requires = ["maturin>=1.5"]
build-backend = "maturin"
[project]
name = "yourpkg"
version = "0.1.0"
requires-python = ">=3.9"
[tool.maturin]
bindings = "pyo3"
strip = true
features = ["pyo3/extension-module"]
Development and Release
# Local editable build
maturin develop
# Build wheels
maturin build --release
# Publish
maturin publish
Wheel Optimization Suggestions
- Enable strip and LTO in Cargo.toml for smaller artifacts
- Avoid bundling large models; download them to a cache you control on first run
- Pin the versions of pyo3 and numpy you have tested (ABI mismatches are disguised copy traps)
Performance Monitoring Key Points
Metrics to Watch
- sys.getsizeof / array.nbytes comparison before and after calls: memory should not silently bloat
- perf_counter() timing of the Python call against Rust-internal timing: if the Python side dominates, you may be over-copying or over-allocating
- Cache hit rate of the arena/registry: if it resizes on every call, you are paying a tax
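One way to check for silent bloat on the Python side is the standard tracemalloc module. The sketch below is a rough illustration with a hypothetical helper name, peak_alloc_bytes. It only sees allocations made through Python's allocators, so memory allocated inside a Rust extension with the system allocator will not show up here.

```python
import tracemalloc


def peak_alloc_bytes(fn, *args):
    """Run fn(*args) and return the peak bytes Python allocated during the call."""
    tracemalloc.start()
    try:
        baseline, _ = tracemalloc.get_traced_memory()
        fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak - baseline


payload = bytearray(4 * 1024 * 1024)  # 4 MiB input buffer

copy_peak = peak_alloc_bytes(bytes, payload)       # bytes(payload) duplicates the data
view_peak = peak_alloc_bytes(memoryview, payload)  # memoryview(payload) only borrows

print(f"copy: {copy_peak} bytes, view: {view_peak} bytes")
```

A call that is supposed to borrow should show a peak far below the input size; a peak near or above the input size points at a hidden copy somewhere on the Python side of the boundary.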
Case Study: 4K Image Pipeline, Speeding Up 3x Without Changing Kernel Code
Before Optimization
Python loads PNG → bytes → PyO3 receives &[u8] → Rust decodes + processes → Python creates NumPy result. There are hidden copies during bytes creation and result assembly.
After Optimization
- Python uses mmap → passes memoryview to Rust
- Rust decodes, processes, and moves out Vec<f32> via IntoPyArray
- Added a scratch arena to reuse decoder workspace
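The Python half of the first step might look roughly like the sketch below. It is not the pipeline from the case study: checksum_file_zero_copy is a hypothetical name, and zlib.crc32 stands in for the Rust entry point that would receive the memoryview.

```python
import mmap
import os
import tempfile
import zlib


def checksum_file_zero_copy(path: str) -> int:
    """Map `path` into memory and hand a borrowed view to the consumer."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The view must be released before the map closes,
            # which the nested `with` guarantees.
            with memoryview(mm) as view:
                return zlib.crc32(view)  # stand-in for the Rust call


# Demo with a throwaway file standing in for an image on disk.
contents = b"\x89PNG" + bytes(4096)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(contents)
    path = tmp.name
try:
    result = checksum_file_zero_copy(path)
finally:
    os.remove(path)
```

With mmap the file contents are paged in by the operating system on demand, so no intermediate bytes object is built before the data reaches the consumer.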
Results
- Throughput improved by about 3 times, with less jitter
- Peak RSS decreased by about 25%
- Kernel code unchanged—only boundary optimizations
Common Pitfalls and Safe Defaults
- bytes vs memoryview: For inbound binary blocks, prefer memoryview
- Non-contiguous NumPy: Convert to a C-contiguous array once, then reuse it
- Mutable Borrows: Default to read-only views; mutate only when necessary and document aliasing assumptions
- Thread Handling: PyO3's allow_threads can release the GIL during heavy Rust work; just don't touch Python objects while it is released
- Lifetimes: Never return slices pointing to memory owned by Python; return owned NumPy arrays or plain Python scalars
Architecture Snapshot
[ Python ] ──(borrows)──> [ Rust via PyO3 ]
np.ndarray PyReadonlyArray / memoryview
| |
v v
(no copy) Compute on borrowed pointer
| |
└────(move Vec<T>)──> NumPy (IntoPyArray) (no back-copy)
Conclusion
Fast kernel loops are just the foundation. The real win comes from how you cross the language boundary: borrow NumPy arrays, accept memoryview, use IntoPyArray to move results, and reuse buffers to keep the allocator quiet. With maturin for clean packaging, you have an unobtrusive bridge.
If your boundaries still feel “sticky,” feel free to leave your dtype/shape in the comments, and I will suggest suitable zero-copy layouts.
Remember: fast code is not just about algorithms, but about how data flows. Master these 5 patterns to keep your Python-Rust boundary as close to zero-overhead as possible.
References
- Top 5 Python–Rust Bridges Without Copy Hell: https://medium.com/@ThinkingLoop/top-5-python-rust-bridges-without-copy-hell-36dfd687ee5f