Armadillo: A Fast C++ Matrix Library

mlpack uses Armadillo matrices for linear algebra operations. Armadillo is a fast C++ matrix library that utilizes advanced template metaprogramming techniques to provide linear algebra operations as quickly as possible.

Detailed documentation about Armadillo can be found on its official website.

However, there are some details to note regarding the use of Armadillo in mlpack.

Getting Started with Armadillo

The syntax of Armadillo is simple and straightforward, designed to be easy to use and read. To give you an idea of what a linear algebra program using Armadillo looks like, here is a simple (contrived) program that performs some basic matrix operations.

// Create a 10x15 matrix with random values.
arma::mat m(10, 15, arma::fill::randu);
std::cout << "Size of m: " << m.n_rows << " x " << m.n_cols << "." << std::endl;

// Sum all elements in the matrix.
const double sumVal = arma::accu(m);
std::cout << "Sum of all elements: " << sumVal << "." << std::endl;

// Sum the elements of each column.
arma::rowvec sums = arma::sum(m, 0);
std::cout << "Sum of each column: " << sums;

// Add 1 to all elements.
m += 1;

// Subtract the column sums from each row.
m.each_row() -= sums;

// Print a single element.
std::cout << "m(3, 4) is: " << m(3, 4) << "." << std::endl;

For more information about Armadillo, please refer to the following resources:

  • Armadillo Documentation
  • Armadillo Example Programs
  • Armadillo/MATLAB Syntax Conversion Table

Representing Data in mlpack

Unlike NumPy and some other toolkits, Armadillo matrices store data in column-major format. This means that each column is stored contiguously in memory; that is, x(0, 0) is adjacent to x(1, 0) in memory.
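
To make the layout concrete, here is a minimal sketch that uses only the standard library rather than Armadillo itself. The helper names colMajorIndex and makeExample are hypothetical; the first computes where element (row, col) lives in a flat column-major buffer, which is the layout described above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Armadillo): offset of element (row, col)
// in a flat buffer that stores an nRows-row matrix column by column.
inline std::size_t colMajorIndex(std::size_t row, std::size_t col,
                                 std::size_t nRows)
{
  return row + col * nRows;
}

// Fill a 3 x 2 column-major buffer so each element holds 10 * row + col.
inline std::vector<double> makeExample()
{
  const std::size_t nRows = 3, nCols = 2;
  std::vector<double> buf(nRows * nCols);
  for (std::size_t c = 0; c < nCols; ++c)
    for (std::size_t r = 0; r < nRows; ++r)
      buf[colMajorIndex(r, c, nRows)] = 10.0 * r + c;
  return buf;
}
```

In this sketch, elements (0, 0) and (1, 0) sit at offsets 0 and 1, while (0, 1) sits nRows elements later.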

This means that for the vast majority of machine learning methods, it is faster to store observations (samples) as columns and dimensions (features) as rows, so that each observation is contiguous in memory. This is contrary to the approach taken in most standard machine learning textbooks! It also changes how linear algebra expressions are written; for example, the Gram matrix that is typically written as X^T X (with observations as rows) must be written as X X^T when observations are stored as columns.
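
The effect on linear algebra expressions can be checked on a tiny dataset. The sketch below uses plain standard-library loops instead of Armadillo, and the function name featureProduct is hypothetical. With observations stored as the columns of a dims x points column-major buffer X, it accumulates the dims x dims matrix X X^T, which corresponds to the Armadillo expression X * X.t().

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: given a column-major buffer holding a dims x points
// matrix X (one observation per column), return the dims x dims matrix
// X * X^T, also in column-major order.
inline std::vector<double> featureProduct(const std::vector<double>& x,
                                          std::size_t dims,
                                          std::size_t points)
{
  std::vector<double> out(dims * dims, 0.0);
  for (std::size_t p = 0; p < points; ++p)
  {
    // All features of observation p are contiguous in memory.
    const double* obs = &x[p * dims];
    for (std::size_t j = 0; j < dims; ++j)
      for (std::size_t i = 0; i < dims; ++i)
        out[i + j * dims] += obs[i] * obs[j];
  }
  return out;
}
```

Because each observation is read as one contiguous run, the inner loops never stride across memory, which is the practical reason for the columns-as-observations convention.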

In general, the following Armadillo types are commonly used internally in mlpack:

  • arma::mat: Dataset and general matrix
  • arma::Row<size_t>: Integer response data, such as labels for classification datasets
  • arma::rowvec: Floating-point response data, such as response values for regression datasets
  • arma::vec: General column vector
  • arma::sp_mat, arma::fmat: Alternative types for representing data; see “Alternative Matrix Types”

Loading Data

mlpack provides two simple functions for loading and saving data matrices in column-major format:

  • data::Load(filename, matrix, fatal=false, transpose=true, type=FileType::AutoDetect) (full documentation)
  • data::Save(filename, matrix, fatal=false, transpose=true, type=FileType::AutoDetect) (full documentation)

For example, consider the following CSV file:

$ cat data.csv
3,3,3,3,0
3,4,4,3,0
...

The following program will load the data, print relevant information, and save the modified dataset to disk.

// Load data from 'data.csv' into 'm'.
// Passing fatal=true makes a load failure throw an exception.
arma::mat m;
mlpack::data::Load("data.csv", m, true);

// Since mlpack uses column-major data,
// - Each column corresponds to a data point!
// - Each row corresponds to a dimension!
std::cout << "The matrix in 'data.csv' has: " << std::endl;
std::cout << " - " << m.n_cols << " points." << std::endl;
std::cout << " - " << m.n_rows << " dimensions." << std::endl;
std::cout << "The second point in the dataset: " << std::endl;
std::cout << m.col(1).t();

// Now modify the matrix and save it in another format (space-separated values).
m += 3;
mlpack::data::Save("data-mod.txt", m);

Although Armadillo does provide .load() and .save() member functions for matrices, the data::Load() and data::Save() functions offer additional flexibility and ensure that data is saved and loaded in column-major format.
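
To illustrate what loading with transpose=true implies, the following standard-library sketch parses CSV text so that each line (one point) becomes one column of a column-major buffer. It is an illustration only, not mlpack's implementation; the LoadedMatrix struct and loadCsvTransposed function are hypothetical names, and the parser assumes purely numeric input in which every line has the same number of fields.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container: a dense matrix stored column by column.
struct LoadedMatrix
{
  std::size_t nRows = 0;      // Dimensions (features).
  std::size_t nCols = 0;      // Points (observations).
  std::vector<double> values; // Column-major storage.
};

// Parse numeric CSV text; each input line becomes one column of the result.
// Assumes every non-empty line has the same number of fields.
inline LoadedMatrix loadCsvTransposed(const std::string& text)
{
  LoadedMatrix m;
  std::istringstream lines(text);
  std::string line;
  while (std::getline(lines, line))
  {
    if (line.empty())
      continue;
    std::istringstream fields(line);
    std::string field;
    std::size_t count = 0;
    while (std::getline(fields, field, ','))
    {
      // Appending keeps each point's values contiguous in memory.
      m.values.push_back(std::stod(field));
      ++count;
    }
    m.nRows = count;
    ++m.nCols;
  }
  return m;
}
```

Applied to the first two lines of the CSV file above, this sketch yields 5 rows (dimensions) and 2 columns (points).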

Loading and Using Categorical Data

Some mlpack techniques support mixed categorical data, where certain dimensions only take categorical values (e.g., 0, 1, 2). String data and other non-numeric data can be represented as categorical values, and mlpack supports loading mixed categorical data:

  • The data::DatasetInfo auxiliary class stores information about whether each dimension is numeric or categorical. (full documentation)
  • data::Load(filename, matrix, info, fatal=false, transpose=true) (full documentation)

For example, consider the following CSV file containing strings:

$ cat mixed_string_data.csv
3,"hello",3,"f",0
3,"goodbye",4,"f",0
...

The following program will load the data file, print information about the categorical dimensions, and prepare the data for mlpack algorithms that support mixed categorical data.

// Load data from 'mixed_string_data.csv' into 'm', populating the 'info' object.
// Passing fatal=true makes a load failure throw an exception.
arma::mat m;
mlpack::data::DatasetInfo info;
mlpack::data::Load("mixed_string_data.csv", m, info, true);

// Print information about the data.
std::cout << "The matrix in 'mixed_string_data.csv' has: " << std::endl;
std::cout << " - " << m.n_cols << " points." << std::endl;
std::cout << " - " << info.Dimensionality() << " dimensions." << std::endl;

// Print which dimensions are categorical.
for (size_t d = 0; d < info.Dimensionality(); ++d) {
  if (info.Type(d) == mlpack::data::Datatype::categorical) {
    std::cout << " - Dimension " << d << " is categorical with "
              << info.NumMappings(d) << " distinct categories." << std::endl;
  }
}

// Modify the third point to 4,"wonderful",1,"c",0.
// Note that we manually map string values; MapString() returns the category for the given value.
m(0, 2) = 4;
m(1, 2) = info.MapString<double>("wonderful", 1); // Create a new third category.
m(2, 2) = 1;
m(3, 2) = info.MapString<double>("c", 3); // Map "c" in dimension 3, the row being written.
m(4, 2) = 0;

// `m` can now be used with any mlpack algorithm that supports categorical data.
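
Conceptually, mapping strings to categories amounts to keeping one string-to-index table per dimension. The sketch below shows that idea with the standard library only. CategoryMapper and its members are hypothetical and are not mlpack's DatasetInfo implementation; the mapString method is only loosely modeled on the MapString() call used above.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical per-dimension string-to-category mapper.
class CategoryMapper
{
 public:
  explicit CategoryMapper(std::size_t dims) : maps(dims) { }

  // Return the category index for `value` in `dimension`, assigning the next
  // unused index the first time a value is seen.
  double mapString(const std::string& value, std::size_t dimension)
  {
    auto& table = maps.at(dimension);
    auto it = table.find(value);
    if (it == table.end())
      it = table.emplace(value, table.size()).first;
    return static_cast<double>(it->second);
  }

  // Number of distinct categories seen so far in `dimension`.
  std::size_t numMappings(std::size_t dimension) const
  {
    return maps.at(dimension).size();
  }

 private:
  std::vector<std::unordered_map<std::string, std::size_t>> maps;
};
```

In this sketch each dimension has its own table, so mapping a string in one dimension leaves every other dimension untouched; the dimension argument therefore has to match the row of the matrix being written.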

Not all mlpack methods support categorical data. Here is a list of methods that do support categorical data:

  • DecisionTree
  • DecisionTreeRegressor
  • RandomForest
  • HoeffdingTree

Alternative Matrix Types

The documentation for mlpack primarily focuses on the arma::mat, arma::vec, and arma::rowvec types, which are based on the double numeric type (i.e., 64-bit floating point). However, many mlpack algorithms and supporting tools accept alternative matrix types and element types:

  • Many methods (such as LogisticRegression) allow specifying the matrix type as a template parameter.
  • Some methods (such as DecisionTree) accept different matrix types during Train(), Classify(), or Predict() without requiring explicit template parameters.
  • In general, any matrix type that supports the Armadillo API can be used; this includes:
    • Single-precision floating-point matrices (arma::fmat, arma::frowvec, arma::fvec)
    • Sparse matrices (arma::sp_mat, arma::sp_fmat)
    • GPU matrices implemented via Bandicoot (coot::mat, coot::fmat); note that this support is experimental and still in development

Here is a simple example of training an AdaBoost model using single-precision floating-point data.

// 1000 random points, 10 dimensions, using 32-bit precision (float).
arma::fmat dataset(10, 1000, arma::fill::randu);

// Random labels for each point, a total of 5 categories.
arma::Row<size_t> labels = arma::randi<arma::Row<size_t>>(1000, arma::distr_param(0, 4));

// Train in the constructor, using floating-point data.
// The weak learner type is now a floating-point perceptron.
using PerceptronType = mlpack::Perceptron<
    mlpack::SimpleWeightUpdate,
    mlpack::ZeroInitialization,
    arma::fmat>;
mlpack::AdaBoost<PerceptronType, arma::fmat> ab(dataset, labels, 5);

// Create test data (500 points).
arma::fmat testDataset(10, 500, arma::fill::randu);
arma::Row<size_t> predictions;
ab.Classify(testDataset, predictions);

// Now `predictions` contains the predicted results for the test dataset.
// Print some information about the test predictions.
std::cout << arma::accu(predictions == 3) << " test points classified as class "
          << "3." << std::endl;

Here is a simple example of training a LogisticRegression model using sparse 32-bit floating-point data.

// Create random, sparse 100-dimensional data.
arma::sp_fmat dataset;
dataset.sprandu(100, 5000, 0.3);
arma::Row<size_t> labels = arma::randi<arma::Row<size_t>>(5000, arma::distr_param(0, 1));

// Train with L2 regularization penalty parameter 0.1.
mlpack::LogisticRegression<arma::sp_fmat> lr(dataset, labels, 0.1);

// Now classify a test point.
arma::sp_fvec point;
point.sprandu(100, 1, 0.3);
size_t prediction;
arma::fvec probabilitiesVec;
lr.Classify(point, prediction, probabilitiesVec);
std::cout << "Prediction for random test point: " << prediction << "." << std::endl;
std::cout << "Class probabilities for random test point: " << probabilitiesVec.t();

Adapting to Other Toolkits (Eigen, etc.)

In general, other C++ linear algebra toolkits can store data in a column-major representation, so the key to converting between toolkits is to access the underlying memory.

Copying Eigen Matrices to Armadillo Matrices.

// Note: This operation is only valid if the Eigen matrix is stored in column-major order.
// For more details see https://eigen.tuxfamily.org/dox/group__TopicStorageOrders.html
Eigen::MatrixXd m;
const size_t rows = 10;
const size_t cols = 20;
m.setRandom(rows, cols); // 10x20 random matrix.

// Copy to Armadillo matrix.
arma::mat mCopy(&m(0, 0), rows, cols);

Copying XTensor Matrices to Armadillo Matrices.

// Note: This operation will only work correctly if the layout_type of the XTensor matrix is column-major (i.e., xt::layout_type::column_major).
// For more details see https://xtensor.readthedocs.io/en/latest/container.html
const size_t rows = 10;
const size_t cols = 20;

// Create a 10 x 20 random matrix with values normally distributed.
// Note that we must ensure the matrix is laid out in column-major format.
xt::xarray<double, xt::layout_type::column_major> m = xt::random::randn<double>({ rows, cols });

// Copy to Armadillo matrix.
arma::mat mCopy(m.data(), rows, cols);
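
If the source container is row-major instead, its raw memory cannot be handed over directly; the elements have to be reordered first. A minimal standard-library sketch of that reordering follows. The function name rowMajorToColMajor is hypothetical, and the resulting buffer has the layout that a column-major constructor such as arma::mat(ptr, rows, cols) expects.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: reorder a rows x cols matrix from row-major storage
// into column-major storage.
inline std::vector<double> rowMajorToColMajor(const std::vector<double>& src,
                                              std::size_t rows,
                                              std::size_t cols)
{
  std::vector<double> dst(rows * cols);
  for (std::size_t r = 0; r < rows; ++r)
    for (std::size_t c = 0; c < cols; ++c)
      dst[r + c * rows] = src[r * cols + c];
  return dst;
}
```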

Creating an Armadillo Alias to an Eigen Matrix.

Note that changes to the Eigen matrix will be reflected in the Armadillo matrix and vice versa. If the Eigen matrix is destroyed or its memory is reallocated, the Armadillo matrix becomes invalid, so be careful! More details can be found in the Armadillo documentation on advanced constructors.

// Note: This operation is only valid if the Eigen matrix is stored in column-major order.
// For more details see https://eigen.tuxfamily.org/dox/group__TopicStorageOrders.html
Eigen::MatrixXd m;
const size_t rows = 10;
const size_t cols = 20;
m.setRandom(rows, cols); // 10x20 random matrix.

// Create an alias to the Eigen matrix in Armadillo. This avoids copying but may be dangerous:
// Be very careful not to delete the Eigen matrix while using the Armadillo matrix!
// See https://arma.sourceforge.net/docs.html#adv_constructors_mat
arma::mat mAlias(&m(0, 0), rows, cols, false, true);
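
The difference between copying and aliasing can be sketched without either library. In the standard-library example below, ColMajorView is a hypothetical non-owning view that plays the role of the aliasing Armadillo matrix, and a std::vector plays the role of the Eigen matrix that owns the memory.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical non-owning view over column-major memory owned elsewhere.
struct ColMajorView
{
  double* mem;
  std::size_t nRows;
  std::size_t nCols;

  double& operator()(std::size_t r, std::size_t c) const
  {
    return mem[r + c * nRows];
  }
};

// Write through a copy and through an alias of the same 2 x 2 buffer, then
// report what the owner sees at element (1, 0).
inline double aliasVersusCopy()
{
  std::vector<double> owner = {1.0, 2.0, 3.0, 4.0};

  std::vector<double> copy = owner;        // Independent storage.
  ColMajorView copyView{copy.data(), 2, 2};
  copyView(1, 0) = -1.0;                   // Owner is unaffected.

  ColMajorView alias{owner.data(), 2, 2};  // Shares the owner's memory.
  alias(1, 0) += 40.0;                     // Owner sees 2.0 + 40.0.

  // `alias` must not be used after `owner` is destroyed or reallocated.
  return owner[1];
}
```

The write through the copy never reaches the owner, while the write through the alias does; the price of skipping the copy is that the view is only valid for as long as the owning buffer is.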

Copying Armadillo Matrices to Eigen Matrices.

const size_t rows = 10;
const size_t cols = 20;
arma::mat m(rows, cols, arma::fill::randu); // 10x20 random matrix.

// Construct the Eigen matrix by mapping (and then copying) the Armadillo memory.
Eigen::MatrixXd eigenM(Eigen::Map<Eigen::MatrixXd>(m.memptr(), rows, cols));
