How the Rust Borrow Checker Saved Our On-Call Nights

Introduction

Have you ever experienced the agony of being woken up at 3 AM by a production environment crash? Does the helplessness of staring at core dump files, trying to guess which thread freed memory still in use by another thread, drive you crazy?

This article shares a real case: how a payment processing service handling an average daily transaction volume of $2.4 million reduced production crashes from 23 times a month to zero—over a span of six months—by rewriting 47,000 lines of C++ code into Rust.

This is not a “Rust is great” sermon, but a detailed record of technical decisions, including real data, challenges faced, and why the borrow checker ultimately became our most trusted ally.

The Root of the Problem: C++ Let Ambiguity Creep into the Code

The Enemies We Faced

Microsoft has reported that about 70% of the security vulnerabilities it assigns a CVE to each year are memory safety issues, most of them in C and C++ code. Google reached a similar conclusion for high-severity bugs in the Chromium project. What we encountered was the reliability side of the same problem: we were not being attacked, we were tripping over ourselves.

Over seven years the code grew from 12,000 lines to 47,000, four developers shared the payment validation path, and the thread model was understood only in an “archaeological sense.” Mutexes were everywhere, but locks cannot resolve ambiguous lifetimes. A C++ compiler does not prove at compile time that you never access freed memory.

The Types of Errors We Continuously Battled

Use-after-free

Thread A freed PaymentContext*, and thread B dereferenced it 140 microseconds later. Crash.
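For contrast, here is a minimal sketch of how the same hand-off is usually expressed in safe Rust: both threads hold an Arc, so the value is freed only when the last holder lets go. The PaymentContext struct and its field are hypothetical stand-ins, since the real type is not shown here.

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical stand-in for the real PaymentContext; the field is illustrative.
struct PaymentContext {
    amount_cents: u64,
}

// Thread A hands a clone of the Arc to thread B, then drops its own handle.
// The context stays alive until thread B's handle is dropped as well.
fn read_after_owner_drops() -> u64 {
    let ctx = Arc::new(PaymentContext { amount_cents: 1_250 });

    let for_b = Arc::clone(&ctx);
    let thread_b = thread::spawn(move || {
        // `for_b` holds a reference count, so this read cannot dangle.
        for_b.amount_cents
    });

    drop(ctx); // Thread A is done; this only decrements the count.
    thread_b.join().expect("thread B panicked")
}
```

Handing thread B a plain `&PaymentContext` instead would be rejected at compile time, because `thread::spawn` requires everything the closure captures to be `'static`.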

Data Races

Concurrent writes to transaction_cache. We recorded 37 cases, and those are only the ones we caught. There may be more lurking.

Double-free

Deleted twice. Sometimes it crashes immediately, sometimes it’s a slow poison that manifests elsewhere after three requests.
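In Rust, a value has exactly one owner and its destructor runs once, when that owner goes out of scope. The sketch below counts destructor calls to make that visible; Buffer and consume are hypothetical names used only for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts how many times a Buffer's destructor has run.
static DROPS: AtomicUsize = AtomicUsize::new(0);

// Hypothetical resource type used only for this illustration.
struct Buffer {
    bytes: Vec<u8>,
}

impl Drop for Buffer {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}

// Takes ownership; the buffer is freed when `buf` goes out of scope here.
fn consume(buf: Buffer) -> usize {
    buf.bytes.len()
}

// Returns how many destructor calls happened for one buffer.
fn drops_after_move() -> usize {
    let before = DROPS.load(Ordering::SeqCst);
    let buf = Buffer { bytes: vec![0; 16] };
    let _len = consume(buf); // ownership moves into `consume`
    // drop(buf); // would not compile: use of moved value `buf`
    DROPS.load(Ordering::SeqCst) - before
}
```

The commented-out `drop(buf)` is the double-free attempt: once the value has moved, the compiler no longer lets the old name touch it.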

Null Pointer Dereference

A classic problem that still bites us twice a week.
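Safe Rust has no null references; a lookup that may find nothing returns an Option, and the caller has to handle the empty case before touching the value. A minimal sketch follows, with a hypothetical ValidationRule field and an illustrative fallback policy.

```rust
use std::collections::HashMap;

// Hypothetical rule type; the real ValidationRule fields are not shown in this article.
struct ValidationRule {
    max_amount_cents: u64,
}

// A lookup that may find nothing returns Option instead of a nullable pointer.
fn find_rule<'a>(
    rules: &'a HashMap<String, ValidationRule>,
    name: &str,
) -> Option<&'a ValidationRule> {
    rules.get(name)
}

// The caller must handle the missing case before it can read the rule.
fn limit_for(rules: &HashMap<String, ValidationRule>, name: &str) -> u64 {
    match find_rule(rules, name) {
        Some(rule) => rule.max_amount_cents,
        None => 0, // illustrative fallback: an unknown rule allows nothing
    }
}
```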

The Last Straw

On a Friday night in September 2023 at 11 PM: a batch of 14,300 transactions processed in 3.2 seconds, looking perfect. Then thread 23 read a freed ValidationRule. The entire service crashed, we rolled back all 14,300 transactions, and the CTO was woken up by the pager.

Monday’s discussion:

CTO: “How do we prevent this from happening?”

Me: “Add more protection around that path.”

Senior Engineer: “We already have three mutexes there. Adding more will affect performance, and we still don’t know where the race condition is.”

CTO: “What about Rust?”

Silence. Nervous laughter. Rewriting 47,000 lines of code sounds crazy. But we were spending about 60 hours a month on memory incidents. The cost of not switching was accumulating.

Experiment: A Service, Pure Rust

We chose the authentication middleware—4,200 lines of code, peaking at 12,000 requests per second. Not core business, but not trivial either. Enough to test honestly.

On November 14, 2023, the rewrite began. I expected the borrow checker to slow us down. It did. But it slowed us down in a way that felt like a seatbelt wrinkling a shirt—an inconvenience worth having.

The Borrow Checker: Picky, Annoying, Correct

C++ would compile such code:

// C++: Compiles, sometimes betrays you
std::shared_ptr<Session> session = get_session(user_id);
process_request(session);
// Somewhere else...
session.reset(); // Drops this owner's reference; a raw pointer or reference kept elsewhere can now dangle

Rust refuses:

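// Rust: rejects a move while a borrow is still in use
let session = get_session(user_id);
let borrowed = &session;   // Borrow starts here
drop(session);             // error[E0505]: cannot move out of `session` because it is borrowed
process_request(borrowed); // ...and the borrow is still used here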

Ownership is a rule, not a guideline. You can have multiple immutable borrows, or exactly one mutable borrow—but never both at the same time. If you try to be clever, the compiler will complain.
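A small sketch of that rule in code that does compile, to show where the line sits. The function names are made up for the example.

```rust
// Sums the values through a shared (immutable) borrow.
fn total(amounts: &[u64]) -> u64 {
    amounts.iter().sum()
}

// Appends through an exclusive (mutable) borrow.
fn add_fee(amounts: &mut Vec<u64>, fee: u64) {
    amounts.push(fee);
}

fn borrow_rules_demo() -> (u64, u64) {
    let mut amounts = vec![100, 250];

    // Any number of shared borrows may coexist.
    let a = &amounts;
    let b = &amounts;
    let before = total(a) + total(b); // 350 + 350

    // `a` and `b` are not used past this point, so an exclusive borrow is allowed.
    add_fee(&mut amounts, 30);
    // Reading `a` below this line would be rejected with error E0502.

    (before, total(&amounts))
}
```

The borrows end at their last use, not at the end of the block, which is why the mutable borrow is accepted here.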

In the early stages, it felt like nitpicking:

fn validate(payment: &mut Payment) {
    let amount = &payment.amount; // Immutable borrow of one field
    payment.set_status(Status::Pending); // `&mut self` method (hypothetical name): needs all of `payment`
    // error[E0502]: cannot borrow `*payment` as mutable because it is also borrowed as immutable
    // (A direct `payment.status = ...` would compile: borrows of distinct fields do not conflict.)
    println!("{}", amount);
}

Second week realization: every “unfair” error was preventing a bug we would have shipped in the past. In C++, that kind of code compiles and then hurts you under load. In Rust, it fails immediately, loudly, with an explanation of why.

Six Months, 47,000 Lines, Zero Crashes

We couldn’t freeze the whole world, so we did it in phases, ready with canary tests and rollback plans.

Phase One (December – January): Core Data Types

Transaction, Payment, User, ValidationRule. Twelve ambiguous ownership points turned into compile errors on day one. Crashes: 23 → 17 times/month (-26%), just because of types.

Phase Two (January – February): Business Logic

Validation pipeline, fraud checks, amount calculations. Our C++ fraud checker mutated shared state held in a map with no synchronization. The resulting races were rare and terrible to debug.

// C++: Shared state by feel
class FraudChecker {
    std::map<std::string, int> attempts;
    void check(const Payment& p) {
        attempts[p.user_id]++; // Sometimes safe, sometimes not
    }
};

Rust enforces explicit sharing:

use std::sync::{Arc, Mutex};
use std::collections::HashMap;

struct FraudChecker {
    attempts: Arc<Mutex<HashMap<String, i32>>>, // Ensures thread safety
}

impl FraudChecker {
    fn check(&self, p: &Payment) {
        let mut a = self.attempts.lock().unwrap(); // Acquire lock
        *a.entry(p.user_id.clone()).or_insert(0) += 1; // Safely update count
    }
}

More explicit, fewer surprises. (We later switched to sharded concurrent maps to avoid hotspot mutexes.) Crashes: 17 → 8 times/month.

Phase Three (February – March): Asynchronous Runtime and Database

Tokio replaced our maze of futures. Lifetimes across async boundaries were no longer black magic. Patterns that worked for us:

  • Arc<DashMap<…>> for hot reads/occasional writes
  • RwLock only when reads dominate in production
  • Use channels when we want ownership transfer instead of sharing
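The channel pattern in the last bullet can be sketched with the standard library alone. This uses std::sync::mpsc and OS threads rather than Tokio so it runs without extra crates, and the Transaction struct is a hypothetical placeholder.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical message type; the field is illustrative.
struct Transaction {
    amount_cents: u64,
}

// The producer moves each Transaction into the channel. After `send` it can no
// longer touch the value, so nothing is shared and nothing needs a lock.
fn sum_via_channel(amounts: Vec<u64>) -> u64 {
    let (tx, rx) = mpsc::channel::<Transaction>();

    let producer = thread::spawn(move || {
        for amount_cents in amounts {
            let t = Transaction { amount_cents };
            tx.send(t).expect("receiver hung up");
            // `t` has been moved; using it here would not compile.
        }
        // `tx` is dropped when the closure ends, which closes the channel.
    });

    // The consumer owns every Transaction it receives.
    let total: u64 = rx.iter().map(|t| t.amount_cents).sum();
    producer.join().expect("producer panicked");
    total
}
```

The same shape carries over to async channels; only the send and receive calls change.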

Crashes: 8 → 1 time/month.

Phase Four (April 2): Full Switch

100% Rust in production, rollback plan ready. Nothing dramatic happened—that’s the best outcome. Since then, zero memory-related incidents.
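For anyone checking the arithmetic, the phase-by-phase reductions quoted above can be reproduced with a few lines. This is only a worked calculation over the reported monthly counts; the final 1 → 0 step assumes the post-switch figure of zero memory-related crashes.

```rust
// Percentage drop from `before` to `after`; returns 0.0 when `before` is zero.
fn percent_reduction(before: u32, after: u32) -> f64 {
    if before == 0 {
        return 0.0;
    }
    (f64::from(before) - f64::from(after)) / f64::from(before) * 100.0
}

// Monthly crash counts reported at the end of each phase.
const CRASHES_PER_MONTH: [u32; 5] = [23, 17, 8, 1, 0];

// Reduction achieved by each phase relative to the phase before it.
fn phase_reductions() -> Vec<f64> {
    CRASHES_PER_MONTH
        .windows(2)
        .map(|w| percent_reduction(w[0], w[1]))
        .collect()
}
```

Rounded, the four phases come to roughly 26%, 53%, 87.5%, and 100%.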

Case Study: Refactoring the Fraud Checker

Let’s take a closer look at a specific example. The original C++ fraud checker had issues because it used std::map to track user attempts without proper synchronization in a multithreaded environment.

Problem Code (C++):

class FraudChecker {
private:
    std::map<std::string, int> attempts; // Shared state, unprotected
    
public:
    void check(const Payment& p) {
        // Multiple threads may modify the map simultaneously, leading to data races
        attempts[p.user_id]++;
        
        if (attempts[p.user_id] > THRESHOLD) {
            throw FraudException("Too many attempts");
        }
    }
};

This code may work fine under light load, but in high concurrency scenarios, it can lead to:

  1. Data races: multiple threads modifying the same map entry simultaneously
  2. Inaccurate counts: lost updates mean the recorded count may be lower than the true value
  3. Potential crashes: the internal structure of the map may be corrupted

Rust Solution (First Version):

use std::sync::{Arc, Mutex};
use std::collections::HashMap;

struct FraudChecker {
    // Arc: allows multiple owners
    // Mutex: ensures only one thread can access at a time
    attempts: Arc<Mutex<HashMap<String, i32>>>,
}

impl FraudChecker {
    fn new() -> Self {
        FraudChecker {
            attempts: Arc::new(Mutex::new(HashMap::new())),
        }
    }
    
    fn check(&self, p: &Payment) -> Result<(), FraudError> {
        // Acquire mutex lock for exclusive access
        let mut attempts = self.attempts.lock().unwrap();
        
        // Safely update count
        let count = attempts.entry(p.user_id.clone()).or_insert(0);
        *count += 1;
        
        if *count > THRESHOLD {
            return Err(FraudError::TooManyAttempts);
        }
        
        Ok(())
    }
}

This version guarantees thread safety, but under high concurrency, the single Mutex can become a bottleneck.

Optimized Version:

use dashmap::DashMap;
use std::sync::Arc;

struct FraudChecker {
    // DashMap: sharded concurrent HashMap
    // Multiple threads can access different shards simultaneously
    attempts: Arc<DashMap<String, i32>>,
}

impl FraudChecker {
    fn new() -> Self {
        FraudChecker {
            attempts: Arc::new(DashMap::new()),
        }
    }
    
    fn check(&self, p: &Payment) -> Result<(), FraudError> {
        // Atomically update count using entry API
        let mut entry = self.attempts.entry(p.user_id.clone()).or_insert(0);
        *entry += 1;
        
        if *entry > THRESHOLD {
            return Err(FraudError::TooManyAttempts);
        }
        
        Ok(())
    }
    
    // Asynchronous version used in the actual system
    async fn check_async(&self, p: &Payment) -> Result<(), FraudError> {
        let count = {
            let mut entry = self.attempts.entry(p.user_id.clone()).or_insert(0);
            *entry += 1;
            *entry // Copy value and release lock
        }; // entry is dropped here, lock is released
        
        if count > THRESHOLD {
            return Err(FraudError::TooManyAttempts);
        }
        
        Ok(())
    }
}

Key Improvements:

  1. Compile-time Guarantees: In safe Rust, the ownership rules and the Send/Sync traits rule out data races at compile time
  2. Performance Boost: The sharding strategy of DashMap reduces lock contention
  3. Clear Ownership: Arc clearly indicates shared ownership, while DashMap handles internal synchronization
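To make the sharding idea concrete without pulling in a crate, here is a toy version built from std types only. It is a sketch of the general technique, not DashMap's actual code, and ShardedCounter, hammer, and the shard count of 16 are all made up for the example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

// A toy sharded counter: each key hashes to one of N independently locked maps,
// so threads that touch different shards do not contend with each other.
struct ShardedCounter {
    shards: Vec<Mutex<HashMap<String, u32>>>,
}

impl ShardedCounter {
    fn new(shard_count: usize) -> Self {
        let shards = (0..shard_count.max(1))
            .map(|_| Mutex::new(HashMap::new()))
            .collect();
        ShardedCounter { shards }
    }

    // Picks the shard for a key from its hash.
    fn shard_for(&self, key: &str) -> &Mutex<HashMap<String, u32>> {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let index = (hasher.finish() as usize) % self.shards.len();
        &self.shards[index]
    }

    // Increments the counter for `key` while holding only that key's shard lock.
    fn increment(&self, key: &str) {
        let mut shard = self.shard_for(key).lock().expect("shard lock poisoned");
        *shard.entry(key.to_string()).or_insert(0) += 1;
    }

    // Reads the current count for `key`, or 0 if it has never been seen.
    fn get(&self, key: &str) -> u32 {
        let shard = self.shard_for(key).lock().expect("shard lock poisoned");
        shard.get(key).copied().unwrap_or(0)
    }
}

// Spawns `threads` workers that each bump the same key `per_thread` times.
fn hammer(threads: usize, per_thread: u32) -> u32 {
    let counter = ShardedCounter::new(16);
    std::thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..per_thread {
                    counter.increment("user-42");
                }
            });
        }
    });
    counter.get("user-42")
}
```

Hammering a single key is the worst case for sharding, since every thread lands on the same shard; the benefit shows up when traffic is spread across many users. The count stays exact either way.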

Counterpoints (Because Trade-offs Are Real)

Learning Curve

Rust humbled our senior C++ developers for about 6-8 weeks. Pair programming helped.

Our experience:

  • Week 1: Fighting with the borrow checker, feeling frustrated
  • Weeks 2-3: Starting to understand the ownership model
  • Weeks 4-6: Able to write idiomatic Rust code
  • Weeks 6-8: Starting to appreciate the compiler’s strictness

Build Times

4.2 minutes (C++) → 8.7 minutes (Rust). Annoying, but bearable with caching/incremental builds.

Our optimizations:

  • Using sccache for distributed compilation caching
  • Enabling incremental compilation
  • Reasonably partitioning crate boundaries

Ecosystem Gaps

Writing two FFI bindings for the payment SDK took about 2 weeks.

Asynchronous Boundaries

The borrow checker is stricter in async code. We rely on Arc<DashMap<…>>, use RwLock only where it is reasonable, and use channels to transfer ownership.

Importantly: every time Rust made something harder, it was because we were about to do something that would have been unsafe in C++. Purposeful friction.
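One habit that helps at async boundaries is keeping lock guards in the smallest possible scope: copy the value out, drop the guard, then do the slow work. The sketch below shows the habit with std::sync::RwLock and plain threads so it runs without an async runtime; the Limits type and the currency table are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

// Hypothetical read-mostly table of per-currency limits.
type Limits = Arc<RwLock<HashMap<String, u64>>>;

// Copies the value out inside a small scope so the read guard is released
// before any slow follow-up work (in async code, before an `.await`).
fn read_limit(limits: &Limits, currency: &str) -> Option<u64> {
    let value = {
        let guard = limits.read().expect("limits lock poisoned");
        guard.get(currency).copied()
    }; // read guard dropped here
    value
}

// Writers hold the exclusive lock only for the duration of the update.
fn set_limit(limits: &Limits, currency: &str, cents: u64) {
    let mut guard = limits.write().expect("limits lock poisoned");
    guard.insert(currency.to_string(), cents);
}

// Several readers run concurrently and all see the value written beforehand.
fn concurrent_reads(readers: usize) -> Vec<Option<u64>> {
    let limits: Limits = Arc::new(RwLock::new(HashMap::new()));
    set_limit(&limits, "USD", 500_000);

    let handles: Vec<_> = (0..readers)
        .map(|_| {
            let limits = Arc::clone(&limits);
            thread::spawn(move || read_limit(&limits, "USD"))
        })
        .collect();

    handles
        .into_iter()
        .map(|h| h.join().expect("reader panicked"))
        .collect()
}
```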

What the Borrow Checker Actually Taught

Now I write better C++. Ownership issues live in my mind: who owns this, who modifies it, when does it die, what happens across threads?

Rust forces architectural honesty; if ownership is ambiguous, it won’t compile. It sounds harsh, but it’s actually kind.

Concurrency is no longer a haunted house. In C++, we had “do not touch” comments around shaky locks. In safe Rust, if it compiles, the type system has ruled out data races (deadlocks and logic-level races are still on you). This confidence changes what you try: you parallelize without superstition.
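As an illustration of what "parallelize without superstition" can look like, here is a small sketch using scoped threads from the standard library. The function name and the chunking policy are assumptions made for the example, not code from our service.

```rust
use std::thread;

// Splits the slice into chunks and sums each chunk on its own thread.
// Scoped threads may borrow `amounts` because the scope joins them before it returns.
fn parallel_total(amounts: &[u64], workers: usize) -> u64 {
    let workers = workers.max(1);
    // Ceiling division, with a floor of 1 so `chunks` never receives 0.
    let chunk_len = ((amounts.len() + workers - 1) / workers).max(1);

    thread::scope(|s| {
        // Spawn every worker first so the chunks are summed concurrently.
        let handles: Vec<_> = amounts
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<u64>()))
            .collect();

        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .sum::<u64>()
    })
}
```

Each worker only reads its own chunk, and the compiler checks that claim: a closure that tried to mutate `amounts` from several threads would not compile.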

Should You Switch? A Simple Framework

Strong Signals for Switching

  1. Recurring memory errors in production (use-after-free, data races, double-free)
  2. Spending over 40 hours a month dealing with memory incidents
  3. Long-running processes with accumulating memory leaks
  4. Untrusted concurrency (and no plans to refactor)
  5. Sensitive domains (payments/health/credentials)

Possibly Better to Stay Put

  1. Short-lived tools
  2. Stable legacy code that is rarely touched
  3. No bandwidth to learn Rust right now
  4. Critical dependencies without Rust solutions (check crates.io first)

How to Mitigate Risks

  1. Start with an isolated 1-5K line service
  2. Establish baseline metrics before touching code
  3. Budget 2-3 times your initial estimate for learning time
  4. Pair Rust experts with C++ veterans
  5. Let compiler errors guide design—don’t fight them
  6. Canary test both versions for ≥2 weeks; keep rollback hot
  7. Document testing conditions (hardware, traffic patterns) to avoid benchmarking disputes

If your canary reduces memory issues by >50%, proceed. If less than 20%, maybe your C++ is already tight—or the team needs more Rust prep time.
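That rule of thumb is easy to encode so nobody argues about it after the canary. A minimal sketch, assuming you count memory issues over equal observation windows for both versions; the enum and function names are hypothetical, and the middle band has no rule in the text above, so it is labeled inconclusive here.

```rust
// Outcome of comparing memory-issue counts between the C++ baseline and the Rust canary.
#[derive(Debug, PartialEq)]
enum CanaryVerdict {
    Proceed,      // reduction above 50%
    Reassess,     // reduction below 20%: the C++ may already be tight, or the team needs more prep
    Inconclusive, // between 20% and 50%, or no baseline issues to compare against
}

fn canary_verdict(baseline_issues: u32, canary_issues: u32) -> CanaryVerdict {
    if baseline_issues == 0 {
        // Nothing to reduce, so the thresholds cannot be evaluated.
        return CanaryVerdict::Inconclusive;
    }
    let reduction = (f64::from(baseline_issues) - f64::from(canary_issues))
        / f64::from(baseline_issues);
    if reduction > 0.5 {
        CanaryVerdict::Proceed
    } else if reduction < 0.2 {
        CanaryVerdict::Reassess
    } else {
        CanaryVerdict::Inconclusive
    }
}
```

For example, going from 20 issues to 8 is a 60% reduction and clears the bar, while 20 to 18 is only 10% and does not.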

The Plain Truth

Famous vulnerabilities (Morris worm, Heartbleed, BLASTPASS) were all related to memory errors. We weren’t attacked—we were attacking our uptime.

Rust didn’t sprinkle magic dust on our code; it shifted whole classes of failures into compiler errors. Simple, mechanical, gloriously boring.

Since April 2:

  • Zero memory-related crashes
  • P99 latency reduced by 41%
  • On-call no longer makes your stomach tense

The borrow checker felt like an enemy for a month, then like the world’s most picky, always-correct teammate. I stopped arguing with it and started listening.

Do One Thing This Week

Run ASan or Valgrind on your busiest service for 24 hours and count the distinct memory issues it reports. If you see five or more, you’ve found the business case for Rust. In safe Rust, the borrow checker ends those specific classes of failure.

Conclusion

This migration from C++ to Rust was not a religious conversion but a data-driven engineering decision:

Core Gains

  1. Reliability: From 23 crashes a month to six months of zero crashes
  2. Performance: P99 latency reduced by 41%
  3. Development Efficiency: Freed from 60 hours a month of debugging memory incidents
  4. Team Confidence: Concurrent programming is no longer a minefield

Costs

  1. 6-8 weeks of learning curve
  2. Build times doubled (but can be optimized)
  3. Some ecosystem requires FFI bridging
  4. Initial development speed decreased (but quality improved)

Most Important Lesson

The borrow checker is not a barrier, but a mentor. It forces you to think honestly about ownership, lifetimes, and concurrency. Every compile error is preventing a runtime crash.

Teams Suitable for Switching: Teams with recurring memory issues, high concurrency needs, and critical business logic.

Teams Not in a Rush to Switch: Stable legacy systems, short-term tools, resource-limited teams.

Whether you switch to Rust or not, the mindset taught by the borrow checker—clarifying ownership, explicit sharing, compile-time guarantees—will help you write better code.

References

  1. The Borrow-Checker Playbook That Erased Our On-Call Nights: https://ritik-chopra28.medium.com/the-borrow-checker-playbook-that-erased-our-on-call-nights-7d7147235181
