Complete Guide to Data Analysis in Rust: From Basics to Practice

Introduction

Data analysis has become an indispensable part of modern software development. When we talk about data analysis, Python and R are often the preferred languages. However, have you considered using Rust for data analysis? Rust is known for its memory safety, high performance, and concurrency capabilities, which give it unique advantages when handling large-scale data.

This article walks through the core content of the open-source book “Data Analysis in Rust,” a hands-on guide to data analysis in Rust. Whether you are a Rust beginner or a developer looking to expand your data analysis skills, it offers a learning path and practical examples.

Why Choose Rust for Data Analysis?

Before diving into the technical details, let’s first understand why Rust is suitable for data analysis:

  1. Memory Safety: Rust’s ownership system rules out use-after-free bugs and data races in safe code at compile time.
  2. High Performance: Zero-cost abstractions make Rust’s runtime speed comparable to C/C++.
  3. Concurrency Capabilities: A safe concurrency programming model makes processing large datasets more efficient.
  4. Mature Ecosystem: Data processing libraries like Polars and Arrow are becoming increasingly robust.

Core Technology Stack Introduction

1. Polars: Rust’s Data Processing Powerhouse

Polars is the main data analysis library used in this book. It is based on the Apache Arrow memory model and provides a Pandas-like API with superior performance.

Core Features:

  • Lazy Evaluation (LazyFrame): Delayed execution until data is actually needed.
  • Query Optimization: Automatically optimizes execution plans.
  • Handling Datasets Larger than Memory: Supports ultra-large data through streaming processing.
  • Schema Error Detection: Captures type errors before processing data.
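The idea behind lazy evaluation can be illustrated without Polars at all. The sketch below uses plain Rust iterators, which are also lazy: building the pipeline does no work, and rows are only touched once the result is collected. The function name `lazy_pipeline_calls` and the counter are hypothetical, introduced purely for illustration; this is an analogy, not a description of how Polars implements LazyFrame.

```rust
use std::cell::Cell;

/// Builds a filter-and-map pipeline over `rows` and reports how many rows
/// had been inspected (a) right after building it and (b) after collecting.
fn lazy_pipeline_calls(rows: &[i64]) -> (usize, usize, Vec<i64>) {
    let inspected = Cell::new(0usize);

    // Building the pipeline runs nothing: iterator adapters are lazy.
    let pipeline = rows
        .iter()
        .filter(|&&v| {
            inspected.set(inspected.get() + 1);
            v >= 5
        })
        .map(|&v| v * 2);

    let before_collect = inspected.get();

    // Only now is the data actually traversed.
    let result: Vec<i64> = pipeline.collect();
    let after_collect = inspected.get();

    (before_collect, after_collect, result)
}
```

Calling it on four rows reports zero inspected rows before collecting and four afterwards, which is the same "describe first, execute later" pattern that a lazy query follows.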

Example Code:

use polars::prelude::*;

// Connect to CSV file (not loaded into memory)
let lf = LazyCsvReader::new(PlPath::from_str("./data/large/census.csv"))
    .with_has_header(true)
    .finish()
    .unwrap();

// Execute data filtering and aggregation
let result = lf
    .filter(col("region").eq(lit("E12000007"))) // Filter London region
    .filter(col("age_group").gt_eq(lit(5)))     // Age 45 and above
    .filter(col("income").is_not_null())        // Income is not null
    .group_by([col("region")])                  // Group by region
    .agg([col("income").mean()])                // Calculate average income
    .collect()                                   // Execute and collect results
    .unwrap();

println!("{}", result);

2. Data Read/Write: Support for Multiple Formats

CSV File Processing

Reading CSV:

use polars::prelude::*;

// Connect to LazyFrame (data not loaded into memory)
let lf = LazyCsvReader::new(PlPath::from_str("./data/large/census.csv"))
    .with_has_header(true)
    .finish()
    .unwrap();

// View the first 5 rows of data
println!("{}", lf.limit(5).collect().unwrap());

Writing CSV:

use polars::prelude::*;

// Read data
let lf = LazyCsvReader::new(PlPath::from_str("./data/csv/census_0.csv"))
    .with_has_header(true)
    .finish()
    .unwrap();

// Convert to DataFrame and write
let mut df = lf.collect().unwrap();
let mut file = std::fs::File::create("./data/output/census_0.csv").unwrap();
CsvWriter::new(&mut file).finish(&mut df).unwrap();

Parquet File Processing

Parquet is a columnar storage format that compresses well and lets queries read only the columns they need.

Reading Parquet:

use polars::prelude::*;

// Connect to Parquet file
let args = ScanArgsParquet::default();
let lf = LazyFrame::scan_parquet(
    PlPath::from_str("./data/large/census.parquet"),
    args.clone() // Clone so `args` can be reused for the partitioned scan below
).unwrap();

// Also supports partitioned Parquet files
let lf_partitioned = LazyFrame::scan_parquet(
    PlPath::from_str("./data/large/partitioned"),
    args
).unwrap();

Writing Partitioned Parquet:

use polars::prelude::*;

let mut df = lf.collect().unwrap();

// Write partitioned by region and age_group
write_partitioned_dataset(
    &mut df,
    PlPath::from_str("./data/output/partitioned/").as_ref(),
    vec!["region".into(), "age_group".into()],
    &ParquetWriteOptions::default(),
    None,
    4294967296, // Max 4GB per file
)
.unwrap();

3. Database Integration

PostgreSQL Connection

Using the ConnectorX library, SQL query results can be efficiently loaded directly into a Polars DataFrame:

use connectorx::prelude::*;
use std::convert::TryFrom;

// Connect to PostgreSQL
let source_conn = SourceConn::try_from(
    "postgresql://postgres:postgres@localhost:5432"
).unwrap();

// Execute query and convert to Polars DataFrame
let query = &[CXQuery::from(
    "SELECT * FROM census WHERE region = 'E12000007' AND age_group = 1"
)];

let df = get_arrow(&source_conn, None, query, None)
    .unwrap()
    .polars()
    .unwrap();

println!("{df}");

4. Cloud Storage Support

Polars natively supports AWS S3, Azure Blob Storage, and Google Cloud Storage:

use polars::prelude::*;

// Configure cloud storage options (example for AWS S3)
let cloud_options = cloud::CloudOptions::default().with_aws(vec![
    (cloud::AmazonS3ConfigKey::AccessKeyId, "your_access_key"),
    (cloud::AmazonS3ConfigKey::SecretAccessKey, "your_secret_key"),
    (cloud::AmazonS3ConfigKey::Region, "us-east-1"),
    (cloud::AmazonS3ConfigKey::Bucket, "your_bucket"),
    (cloud::AmazonS3ConfigKey::Endpoint, "https://s3.amazonaws.com"),
]);

// Read CSV file from S3
let lf = LazyCsvReader::new(PlPath::from_str("s3://bucket/data.csv"))
    .with_cloud_options(Some(cloud_options.clone()))
    .finish()
    .unwrap();

// Read Parquet file from S3
let args = ScanArgsParquet {
    cloud_options: Some(cloud_options.clone()),
    ..Default::default()
};
let lf_parquet = LazyFrame::scan_parquet(
    PlPath::from_str("s3://bucket/data.parquet"),
    args
).unwrap();

Data Transformation Operations

1. Data Filtering

Simple Filtering:

use polars::prelude::*;

let lf = LazyFrame::scan_parquet(
    PlPath::from_str("./data/large/partitioned"),
    ScanArgsParquet::default()
).unwrap();

// Multi-condition filtering
let lf_filtered = lf
    .filter(col("keep_type").eq(lit(1)))           // Permanent residents
    .filter(col("region").eq(lit("E12000007")))    // London region
    .filter(col("age_group").gt_eq(lit(5)))        // Age 45 and above
    .filter(col("income").is_not_null());          // Income is not null

Complex Filtering:

// Build complex filtering expression
let expr = col("region")
    .eq(lit("E12000001"))              // North East region
    .and(col("age_group").gt_eq(lit(6))) // and age 55 and above
    .or(
        col("region")
            .eq(lit("E12000002"))      // or North West region
            .and(col("age_group").lt(lit(6))) // and under 55
    );

let lf_complex = lf.filter(expr);
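To make the grouping of the chained `.and(...)` and `.or(...)` calls explicit, here is a plain Rust predicate that mirrors the same logic on a single row. The function name `keep_row` is hypothetical, and the sketch assumes `region` is a string and `age_group` an integer, as in the snippets above.

```rust
/// Plain-Rust equivalent of the combined filter expression:
/// (region == "E12000001" AND age_group >= 6)
///   OR (region == "E12000002" AND age_group < 6)
fn keep_row(region: &str, age_group: i32) -> bool {
    (region == "E12000001" && age_group >= 6)
        || (region == "E12000002" && age_group < 6)
}
```

Writing the condition this way first can help catch misplaced parentheses before translating it into an expression chain.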

List Filtering:

// Filter specific industries
let lf_industry = lf.filter(
    col("industry").is_in(
        lit(Series::from_iter(vec![2, 4, 6, 8])).implode(),
        false
    )
);

2. Column Selection and Creation

Selecting Columns:

use polars::prelude::*;

// Select specific columns
let lf = lf.select([
    col("age_group"),
    col("region"),
    col("income").alias("yearly_income"), // Rename
]);

// Use regex to select columns
let lf = lf.select([
    col("^age.*$"),  // All columns starting with age
    col("region"),
]);

// Exclude certain columns
let lf = lf.select([
    all().exclude_cols(["region", "income"]).as_expr()
]);

Creating New Columns:

// Create column from literal
let lf = lf.with_column(lit(5).alias("five"));

// Create column from calculation
let lf = lf.with_column(
    (col("income").cast(DataType::Float64) * lit(1.02))
        .alias("income_adjusted")
);

// Create column conditionally
let lf = lf.with_column(
    when(col("income").lt_eq(lit(30_000)))
        .then(lit("Low"))
        .when(col("income").lt_eq(lit(70_000)))
        .then(lit("Medium"))
        .otherwise(lit("High"))
        .alias("income_category")
);
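The `when`/`then`/`otherwise` chain behaves like an if/else-if ladder evaluated per row. A minimal plain Rust sketch of the same thresholds follows; the function name `income_category` is hypothetical, and the sketch ignores null incomes, which the expression version has to deal with separately.

```rust
/// Maps an income to a category using the same inclusive thresholds
/// as the conditional column above.
fn income_category(income: i64) -> &'static str {
    if income <= 30_000 {
        "Low"
    } else if income <= 70_000 {
        "Medium"
    } else {
        "High"
    }
}
```

Note that both boundaries are inclusive: exactly 30,000 is "Low" and exactly 70,000 is "Medium".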

3. Data Pivoting

Long to Wide Table (Pivot):

use polars::prelude::pivot::pivot_stable;

// Convert region column to column names
let df_wide = pivot_stable(
    &df,
    ["region"],              // Fields to convert to columns
    Some(["age_group"]),     // Row indices to keep
    Some(["mean_income"]),   // Value field
    false,
    None,
    None,
)
.unwrap();

Wide to Long Table (Unpivot):

// Convert multiple region columns back to one column
let df_long = df_wide.unpivot(
    ["North East", "North West", "London", /* Other regions */],
    ["age_group"],
)
.unwrap();
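What unpivoting does to the shape of a table can be shown on a tiny in-memory example. The sketch below is plain Rust with hypothetical names (`WideRow`, `unpivot`); it emits rows column by column, which is an assumption about ordering and may not match the row order a particular library produces.

```rust
/// One row of a wide table: an index value plus one value per named column.
struct WideRow {
    age_group: i32,
    values: Vec<f64>, // same order as the `columns` slice passed to `unpivot`
}

/// Turns every (row, column) cell into its own (age_group, column name, value) row.
fn unpivot(columns: &[&str], rows: &[WideRow]) -> Vec<(i32, String, f64)> {
    let mut long = Vec::new();
    for (col_idx, name) in columns.iter().enumerate() {
        for row in rows {
            long.push((row.age_group, name.to_string(), row.values[col_idx]));
        }
    }
    long
}
```

A wide table with 2 rows and 2 region columns becomes a long table with 4 rows; pivoting is the reverse operation.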

4. Data Joining

Vertical Concatenation:

use polars::prelude::*;

// Concatenate multiple LazyFrames
let lf_combined = concat(
    [lf1, lf2, lf3],
    UnionArgs::default(),
)
.unwrap();

Horizontal Joining:

// Left join (clone the LazyFrames so they can be reused in the joins below)
let lf_left = lf1
    .clone()
    .left_join(lf2.clone(), col("id"), col("id"))
    .left_join(lf3, col("id"), col("id"));

// Inner join
let lf_inner = lf1
    .clone()
    .inner_join(lf2.clone(), col("id"), col("id"));

// Full join
let lf_full = lf1.join(
    lf2,
    [col("id")],
    [col("id")],
    JoinArgs::new(JoinType::Full),
);
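The difference between the join types is easiest to see on a few rows. The following plain Rust sketch uses hypothetical helper names (`inner_join`, `left_join`) and assumes `id` is unique on the right-hand side; a real join produces one output row per matching pair.

```rust
use std::collections::HashMap;

/// Inner join on `id`: keeps only ids present on both sides.
fn inner_join(left: &[(u32, &str)], right: &[(u32, f64)]) -> Vec<(u32, String, f64)> {
    let lookup: HashMap<u32, f64> = right.iter().copied().collect();
    left.iter()
        .filter_map(|&(id, name)| lookup.get(&id).map(|&v| (id, name.to_string(), v)))
        .collect()
}

/// Left join on `id`: keeps every left row; missing matches become `None`.
fn left_join(left: &[(u32, &str)], right: &[(u32, f64)]) -> Vec<(u32, String, Option<f64>)> {
    let lookup: HashMap<u32, f64> = right.iter().copied().collect();
    left.iter()
        .map(|&(id, name)| (id, name.to_string(), lookup.get(&id).copied()))
        .collect()
}
```

A full join would additionally keep right-hand rows whose `id` never appears on the left, with the left-hand fields set to null.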

Statistical Analysis

1. Basic Statistics

use polars::prelude::*;

// Calculate multiple statistics
let stats = lf
    .select([
        len().alias("count"),
        col("income").mean().alias("mean"),
        col("income").median().alias("median"),
        col("income").min().alias("min"),
        col("income").max().alias("max"),
        col("income")
            .quantile(lit(0.25), QuantileMethod::Nearest)
            .alias("q25"),
        col("income")
            .quantile(lit(0.75), QuantileMethod::Nearest)
            .alias("q75"),
    ])
    .collect()
    .unwrap();

println!("{stats}");
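For readers who want to check what these aggregations compute, here is a small plain Rust sketch of mean, median, and a nearest-position quantile. The function names are hypothetical, the inputs are assumed to be non-empty and free of NaN, and the quantile rule shown (round `q * (n - 1)` to the nearest index) is one common definition that may differ from a library's "nearest" method in tie cases.

```rust
/// Arithmetic mean of a non-empty slice.
fn mean(values: &[f64]) -> f64 {
    values.iter().sum::<f64>() / values.len() as f64
}

/// Median: middle element, or the average of the two middle elements.
fn median(values: &[f64]) -> f64 {
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    }
}

/// Quantile by picking the sorted element nearest to position q * (n - 1).
fn quantile_nearest(values: &[f64], q: f64) -> f64 {
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let pos = q * (sorted.len() - 1) as f64;
    sorted[pos.round() as usize]
}
```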

2. Grouped Statistics

// Calculate average income by region
let by_region = lf
    .group_by([col("region")])
    .agg([
        col("income").mean().alias("mean_income"),
        col("income").count().alias("count"),
    ])
    .collect()
    .unwrap();

3. Weighted Statistics

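// Custom weighted quantile function.
// Note: the cumulative weights follow the current row order, so the frame
// should already be sorted by `col` (ascending) for the result to be a quantile.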
fn weighted_quantile(col: Expr, wt: Expr, percentile: Expr) -> Expr {
    col.sort_by(
        [(wt.clone().cast(DataType::Float64).cum_sum(false)
            / wt.clone().cast(DataType::Float64).sum()
            - percentile)
            .abs()],
        SortMultipleOptions::default(),
    )
    .first()
}

// Calculate weighted mean and weighted median
let weighted_stats = lf
    .select([
        ((col("income") * col("weight")).sum() / col("weight").sum())
            .alias("weighted_mean"),
        weighted_quantile(col("income"), col("weight"), lit(0.5))
            .alias("weighted_median"),
    ])
    .collect()
    .unwrap();
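As a cross-check for weighted statistics, the sketch below computes a weighted mean and a weighted median in plain Rust. The names are hypothetical, and the median uses a different but commonly used definition than the expression above: the smallest value whose cumulative weight reaches half of the total weight. Values are sorted first, since cumulative weights are only meaningful in value order.

```rust
/// Weighted mean: sum(value * weight) / sum(weight).
fn weighted_mean(values: &[f64], weights: &[f64]) -> f64 {
    let total: f64 = weights.iter().sum();
    values.iter().zip(weights).map(|(v, w)| v * w).sum::<f64>() / total
}

/// Smallest value whose cumulative weight reaches half the total weight.
fn weighted_median(values: &[f64], weights: &[f64]) -> f64 {
    let mut pairs: Vec<(f64, f64)> = values
        .iter()
        .copied()
        .zip(weights.iter().copied())
        .collect();
    // Sort by value so the cumulative weight is accumulated in value order.
    pairs.sort_by(|a, b| a.0.total_cmp(&b.0));

    let half = pairs.iter().map(|p| p.1).sum::<f64>() / 2.0;
    let mut cumulative = 0.0;
    for (value, weight) in &pairs {
        cumulative += weight;
        if cumulative >= half {
            return *value;
        }
    }
    f64::NAN // Only reached for empty input.
}
```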

4. Hypothesis Testing

Use the HypoRS library for statistical tests:

Chi-Square Test:

use hypors::chi_square::independence;

// Prepare contingency table data
let cols: Vec<Vec<f64>> = df
    .get_columns()
    .iter()
    .map(|c| {
        c.as_materialized_series()
            .to_float()
            .unwrap()
            .f64()
            .unwrap()
            .to_vec_null_aware()
            .left()
            .unwrap()
    })
    .collect();

// Perform chi-square independence test
let alpha = 0.05;
let result = independence(&cols, alpha).unwrap();

println!("Chi-Square Statistic: {}", result.test_statistic);
println!("P-Value: {}", result.p_value);
println!("Reject Null Hypothesis: {}", result.reject_null);
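To see what a chi-square independence statistic measures, it can be computed by hand for a small contingency table. The sketch below is plain Rust with a hypothetical function name; it computes only the Pearson statistic (no p-value) and assumes a rectangular table with no zero row or column totals. It is meant as a sanity check, not as a description of any library's internals.

```rust
/// Pearson chi-square statistic: sum over cells of (observed - expected)^2 / expected,
/// where expected = row_total * column_total / grand_total.
fn chi_square_statistic(table: &[Vec<f64>]) -> f64 {
    let row_totals: Vec<f64> = table.iter().map(|r| r.iter().sum::<f64>()).collect();
    let n_cols = table[0].len();
    let col_totals: Vec<f64> = (0..n_cols)
        .map(|j| table.iter().map(|r| r[j]).sum::<f64>())
        .collect();
    let grand_total: f64 = row_totals.iter().sum();

    let mut statistic = 0.0;
    for (i, row) in table.iter().enumerate() {
        for (j, &observed) in row.iter().enumerate() {
            let expected = row_totals[i] * col_totals[j] / grand_total;
            statistic += (observed - expected).powi(2) / expected;
        }
    }
    statistic
}
```

A table whose rows are exact multiples of each other yields a statistic of zero, meaning the counts carry no evidence of dependence.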

ANOVA:

use hypors::anova::anova;

// Perform one-way ANOVA
let result = anova(&cols, alpha).unwrap();

println!("F Statistic: {}", result.test_statistic);
println!("P-Value: {}", result.p_value);
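The F statistic of a one-way ANOVA is the ratio of between-group to within-group variance. A minimal plain Rust sketch, with a hypothetical function name and assuming at least two non-empty groups with some within-group variation, looks like this:

```rust
/// One-way ANOVA F statistic:
/// (SS_between / (k - 1)) / (SS_within / (N - k)) for k groups and N observations.
fn anova_f_statistic(groups: &[Vec<f64>]) -> f64 {
    let n_total: usize = groups.iter().map(|g| g.len()).sum();
    let grand_mean = groups.iter().flatten().sum::<f64>() / n_total as f64;

    let mut ss_between = 0.0;
    let mut ss_within = 0.0;
    for g in groups {
        let mean = g.iter().sum::<f64>() / g.len() as f64;
        ss_between += g.len() as f64 * (mean - grand_mean).powi(2);
        ss_within += g.iter().map(|x| (x - mean).powi(2)).sum::<f64>();
    }

    let df_between = (groups.len() - 1) as f64;
    let df_within = (n_total - groups.len()) as f64;
    (ss_between / df_between) / (ss_within / df_within)
}
```

For the groups {1, 2, 3}, {2, 3, 4}, and {3, 4, 5}, the between-group mean square is 3 and the within-group mean square is 1, giving F = 3.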

Data Visualization and Reporting

1. Excel Export

Use polars_excel_writer and rust_xlsxwriter to create Excel reports:

use polars_excel_writer::PolarsExcelWriter;
use rust_xlsxwriter::{Chart, ChartType, Workbook};

let mut excel_writer = PolarsExcelWriter::new();
let mut workbook = Workbook::new();

// Create worksheet and write data
let worksheet = workbook.add_worksheet().set_name("Data").unwrap();
excel_writer
    .write_dataframe_to_worksheet(&df, worksheet, 0, 0)
    .unwrap();

// Add chart
let mut chart = Chart::new(ChartType::Bar);
chart
    .add_series()
    .set_categories(("Data", 1, 0, 10, 0))
    .set_values(("Data", 1, 1, 10, 1));

worksheet.insert_chart(1, 3, &chart).unwrap();

// Save file
workbook.save("./output/report.xlsx").unwrap();

2. Plotting

Use Plotlars to create interactive charts:

Bar Chart:

use plotlars::{BarPlot, Orientation, Plot, Rgb, Text};

// Create bar chart
let html = BarPlot::builder()
    .data(&df)
    .labels("region")
    .values("income")
    .orientation(Orientation::Vertical)
    .group("sex")
    .colors(vec![Rgb(255, 127, 80), Rgb(64, 224, 208)])
    .plot_title(Text::from("Income Comparison by Region").font("Arial").size(18))
    .x_title(Text::from("Region").font("Arial").size(15))
    .y_title(Text::from("Average Income").font("Arial").size(15))
    .build()
    .to_html();

// Save as HTML
let mut file = std::fs::File::create("./output/bar_chart.html").unwrap();
std::io::Write::write_all(&mut file, html.as_bytes()).unwrap();

Line Chart:

use plotlars::LinePlot;

let html = LinePlot::builder()
    .data(&df)
    .x("hours_worked")
    .y("Female")
    .additional_lines(vec!["Male"])
    .plot_title("Relationship between Hours Worked and Income")
    .x_title("Hours Worked")
    .y_title("Average Income")
    .build()
    .to_html();

3. Markdown Report

Use Comrak to generate HTML reports:

use comrak::{markdown_to_html, Options};
use std::io::Write; // Needed for `write_all` below

let mut markdown = "# Data Analysis Report\n\n".to_string();
markdown.push_str("## Summary\n\n");
markdown.push_str("This report analyzes the UK census data...\n\n");

// Add data table
markdown.push_str("## Statistical Data\n\n");
markdown.push_str(&df.to_string());
markdown.push_str("\n\n");

// Add image
markdown.push_str("![Income Distribution Chart](chart.png)\n\n");

// Convert to HTML
let mut options = Options::default();
options.extension.table = true;
let html = markdown_to_html(&markdown, &options);

// Save report
let mut file = std::fs::File::create("./output/report.html").unwrap();
file.write_all(html.as_bytes()).unwrap();
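One caveat: `df.to_string()` yields a box-drawn text table, which a Markdown renderer will most likely treat as ordinary text, since the `table` extension applies to pipe-style tables. A small helper can render rows as a pipe table instead. The sketch below is plain Rust with a hypothetical name (`markdown_table`) and assumes the cell values have already been converted to strings and contain no `|` characters.

```rust
/// Renders a header and rows as a Markdown pipe table.
fn markdown_table(header: &[&str], rows: &[Vec<String>]) -> String {
    let mut out = String::new();
    // Header row, e.g. "| region | mean_income |"
    out.push_str(&format!("| {} |\n", header.join(" | ")));
    // Separator row, one "---" per column
    out.push_str(&format!("|{}|\n", vec!["---"; header.len()].join("|")));
    // Data rows
    for row in rows {
        out.push_str(&format!("| {} |\n", row.join(" | ")));
    }
    out
}
```

The returned string can be appended to the report with `push_str` in place of the raw table text.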

Performance Optimization Tips

1. Utilize Lazy Evaluation

// ❌ Inefficient: Materializes an intermediate DataFrame at every step
let df1 = lf.filter(...).collect().unwrap();
let df2 = df1.lazy().select(...).collect().unwrap();

// ✅ Efficient: Builds one query plan and executes it once
let df = lf
    .filter(...)
    .select(...)
    .collect()
    .unwrap();

2. Use Partitioned Data

Partitioned Parquet files can skip irrelevant data blocks:

// Partitioned writing
write_partitioned_dataset(
    &mut df,
    PlPath::from_str("./data/partitioned/").as_ref(),
    vec!["region".into(), "year".into()],
    &ParquetWriteOptions::default(),
    None,
    4294967296,
).unwrap();

// Partitioned reading (automatically skips irrelevant partitions)
let lf = LazyFrame::scan_parquet(
    PlPath::from_str("./data/partitioned"),
    ScanArgsParquet::default()
)
.unwrap()
.filter(col("region").eq(lit("E12000007"))) // Only read London data
.filter(col("year").eq(lit(2021)));          // Only read data from 2021
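Why partitioning lets a reader skip files can be sketched in a few lines. The example below assumes a Hive-style directory layout in which each partition column appears as a `key=value` path segment; the function names and the file name `part.parquet` are hypothetical. It only illustrates the principle of deciding from the path alone, not the actual pruning logic of any library.

```rust
/// Extracts `key=value` segments from a Hive-style partition path.
fn partition_values(path: &str) -> Vec<(&str, &str)> {
    path.split('/')
        .filter_map(|segment| segment.split_once('='))
        .collect()
}

/// True if the file can be skipped without opening it:
/// some partition key in the path contradicts an equality filter.
fn can_skip(path: &str, filters: &[(&str, &str)]) -> bool {
    let parts = partition_values(path);
    filters.iter().any(|(key, wanted)| {
        parts.iter().any(|(k, v)| k == key && v != wanted)
    })
}
```

With filters on `region` and `year`, every file under a directory for another region is rejected by inspecting its path, so its contents are never read.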

3. Choose Appropriate Data Types

// Convert strings to categorical types to save memory
let lf = lf.with_column(
    col("region").cast(DataType::Categorical(None, Default::default()))
);
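The memory saving from categorical types comes from dictionary encoding: each distinct string is stored once, and every row holds only a small integer code. A minimal plain Rust sketch of the idea follows; the function name `dictionary_encode` is hypothetical and the layout is simplified compared with what a columnar library actually stores.

```rust
use std::collections::HashMap;

/// Dictionary-encodes a string column: returns (distinct values, per-row codes).
fn dictionary_encode(values: &[&str]) -> (Vec<String>, Vec<u32>) {
    let mut index: HashMap<&str, u32> = HashMap::new();
    let mut categories: Vec<String> = Vec::new();
    let mut codes = Vec::with_capacity(values.len());

    for &value in values {
        // Assign the next free code the first time a value is seen.
        let code = *index.entry(value).or_insert_with(|| {
            categories.push(value.to_string());
            (categories.len() - 1) as u32
        });
        codes.push(code);
    }
    (categories, codes)
}
```

For a column with millions of rows but only a handful of regions, the row data shrinks to 4 bytes per row plus one copy of each region code string.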

Practical Case Study: UK Census Data Analysis

Let’s integrate the knowledge learned through a complete example:

use polars::prelude::*;
use plotlars::{BarPlot, Plot};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Read data
    let lf = LazyFrame::scan_parquet(
        PlPath::from_str("./data/census.parquet"),
        ScanArgsParquet::default()
    )?;
    
    // 2. Data cleaning
    let lf_clean = lf
        .filter(col("keep_type").eq(lit(1)))     // Permanent population
        .filter(col("income").is_not_null())     // Income records exist
        .filter(col("age_group").neq(lit(-8)));  // Valid age group
    
    // 3. Data transformation
    let lf_transformed = lf_clean
        .with_column(
            // Create income category
            when(col("income").lt_eq(lit(30000)))
                .then(lit("Low Income"))
                .when(col("income").lt_eq(lit(70000)))
                .then(lit("Medium Income"))
                .otherwise(lit("High Income"))
                .alias("income_level")
        )
        .with_column(
            // Replace region codes with names
            col("region").replace_strict(
                lit(Series::from_iter(vec!["E12000001", "E12000007"])),
                lit(Series::from_iter(vec!["North East", "London"])),
                None,
                Some(DataType::String),
            )
        );
    
    // 4. Grouped statistics
    let stats = lf_transformed
        .group_by([col("region"), col("income_level")])
        .agg([
            col("income").count().alias("Count"),
            col("income").mean().alias("Average Income"),
            col("income").median().alias("Median Income"),
        ])
        .sort(["region", "income_level"], Default::default())
        .collect()?;
    
    println!("Statistical Results:\n{}", stats);
    
    // 5. Visualization
    let df_chart = stats.clone();
    let html = BarPlot::builder()
        .data(&df_chart)
        .labels("region")
        .values("Average Income")
        .group("income_level")
        .plot_title("Distribution of Different Income Levels by Region")
        .build()
        .to_html();
    
    // 6. Save results
    let mut csv_file = std::fs::File::create("./output/analysis_result.csv")?;
    CsvWriter::new(&mut csv_file).finish(&mut stats.clone())?;
    
    let mut html_file = std::fs::File::create("./output/chart.html")?;
    std::io::Write::write_all(&mut html_file, html.as_bytes())?;
    
    Ok(())
}

Common Issues and Solutions

1. Version Compatibility Issues

Due to frequent updates of Polars and Arrow, different versions may be incompatible. Use the df_interchange library to resolve:

use df_interchange::Interchange;

// Convert from Polars 0.43 to Polars 0.46
let df_new = Interchange::from_polars_0_43(df_old)?
    .to_polars_0_46()?;

2. Memory Overflow

When handling large datasets, avoid calling .collect() on the full dataset; keep the query lazy and reduce the data first:

// ❌ May cause memory overflow
let df = large_lf.collect()?;

// ✅ Filter rows and select only the needed columns before collecting
let result = large_lf
    .filter(...)
    .select([col("id"), col("value")])  // Only select necessary columns
    .collect()?;
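The principle behind streaming is that many aggregates only need a small running state, not the whole dataset. The plain Rust sketch below computes a mean chunk by chunk; `RunningMean` and `mean_in_chunks` are hypothetical names, `chunk_size` must be greater than zero, and the in-memory slice stands in for data that would really arrive from disk in batches.

```rust
/// Running aggregate whose state is just a sum and a count.
#[derive(Default)]
struct RunningMean {
    sum: f64,
    count: u64,
}

impl RunningMean {
    /// Folds one chunk of values into the running state.
    fn update(&mut self, chunk: &[f64]) {
        self.sum += chunk.iter().sum::<f64>();
        self.count += chunk.len() as u64;
    }

    /// Returns the mean so far, or `None` if no values were seen.
    fn mean(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum / self.count as f64)
        }
    }
}

/// Processes `data` in chunks of `chunk_size` values (must be > 0).
fn mean_in_chunks(data: &[f64], chunk_size: usize) -> Option<f64> {
    let mut acc = RunningMean::default();
    for chunk in data.chunks(chunk_size) {
        acc.update(chunk);
    }
    acc.mean()
}
```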

3. Type Inference Failure

Give schema inference more rows to work with, or cast columns explicitly:

// Let the reader scan more rows when inferring the schema
let lf = LazyCsvReader::new(PlPath::from_str("data.csv"))
    .with_infer_schema_length(Some(10_000))  // Increase number of inference rows
    .finish()?;

// Or manually convert types
let lf = lf.with_column(
    col("amount").cast(DataType::Float64)
);
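Why a short inference window can guess the wrong type is easy to demonstrate. The plain Rust sketch below infers a column type from only the first `sample_rows` values; the names `Inferred` and `infer_type` are hypothetical and the rules are far simpler than a real CSV reader's.

```rust
#[derive(Debug, PartialEq)]
enum Inferred {
    Int,
    Float,
    Text,
}

/// Infers a column type by looking only at the first `sample_rows` values.
fn infer_type(values: &[&str], sample_rows: usize) -> Inferred {
    let mut result = Inferred::Int;
    for v in values.iter().take(sample_rows) {
        if v.parse::<i64>().is_ok() {
            continue; // Still compatible with the current guess.
        }
        if v.parse::<f64>().is_ok() {
            if result == Inferred::Int {
                result = Inferred::Float; // Widen from integer to float.
            }
        } else {
            return Inferred::Text; // Anything unparseable forces text.
        }
    }
    result
}
```

With the column `["1", "2", "3.5", "n/a"]`, sampling two rows infers an integer column, three rows a float column, and only four rows reveal that it must be text, which is why a larger inference window or an explicit cast avoids surprises later in the file.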

Best Practices Summary

  1. Prioritize using LazyFrame: Fully utilize query optimization.
  2. Use partitions wisely: Improve query efficiency for large datasets.
  3. Select appropriate file formats: Parquet is usually smaller and faster to query than CSV for analytical workloads.
  4. Avoid premature materialization: Delay .collect() calls as much as possible.
  5. Use type-safe operations: Leverage Rust’s type system to prevent errors.
  6. Handle NULL values: Use .is_null() and .is_not_null() to handle them explicitly.
  7. Monitor memory usage: Use streaming for extremely large datasets.
  8. Version management: Use df_interchange to handle version compatibility issues.

Conclusion

Although Rust is relatively young in the field of data analysis, its unique advantages make it a choice worth exploring. Through the Polars ecosystem introduced in this article, we can:

  • Efficiently handle large-scale datasets, including those larger than memory.
  • Safely perform concurrent data processing without worrying about data races.
  • Flexibly connect to various data sources (CSV, Parquet, databases, cloud storage).
  • Powerfully perform data transformations, statistical analyses, and visualizations.
  • Reliably generate professional reports (Excel, HTML, Markdown).

The open-source book “Data Analysis in Rust” provides a comprehensive learning path for Rust data analysis, from environment setup to practical applications, from basic operations to advanced techniques. Whether you want to enhance data processing performance or conduct data analysis with safety, Rust is a worthwhile investment.

As libraries like Polars and Arrow continue to mature, the Rust data analysis ecosystem will become even more complete. Now is the perfect time to start learning data analysis in Rust!

References

  1. Data Analysis in Rust: https://ericfecteau.ca/data/rust-data-analysis
