A Pure C Distributed Task Scheduling System, 100 Times Lighter than Ray
Hello everyone, the Laxcus distributed operating system features a distributed task scheduling framework implemented in pure C, named DTT (Distributed Task). Today, I would like to showcase it.
First, let’s look at the results (2 machines, zero configuration, done in seconds)
Machine A (192.168.1.100) — Task submission end, executed twice
// Run the example directly after compilation
# ./examples/simple-task
[Info] Message 192.168.1.101 1473192991 reply[0x2] errno:0
[Info] Message 192.168.1.101 1473192991 reply[0x8] errno:0
[Info] Message 192.168.1.101 1473192991 reply[0xa] errno:0
# ./examples/simple-task
[Info] Message 192.168.1.101 556692959 reply[0x2] errno:0
[Info] Message 192.168.1.101 556692959 reply[0x8] errno:0
[Info] Message 192.168.1.101 556692959 reply[0xa] errno:0
No task output? Correct! The client only logs the replies from 192.168.1.101. The work itself is offloaded to the remote machine, and nothing is executed locally!
Machine B (192.168.1.101) — Task execution end
// Start the server (just one process)
# ./dtts
[Info] Service dtt listening on port 15943...
task write_file call, args = (nil) ret = (nil) ...
task write_file call, args = (nil) ret = (nil) ...
Every time a task is received, the server appends a “hello” line to /tmp/hello.txt.
Still on Machine B, let’s check the result:
# cat /tmp/hello.txt
hello
hello
After two runs, the file has two additional lines, both written by the other machine!
The entire process requires no configuration files, no ZooKeeper, no Redis, no message queues, and no hardcoded IPs. It relies entirely on multicast for automatic discovery, so it is truly plug-and-play!
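To make the multicast idea concrete, here is a minimal sketch of how a client could prepare and send a discovery probe with plain POSIX sockets. It is not DTT's actual code: the group address, the reuse of port 15943 for discovery, the probe payload, and the function names are all assumptions made for illustration.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical values: DTT's real discovery group and port are not documented here. */
#define DISCOVERY_GROUP "239.255.0.1"
#define DISCOVERY_PORT  15943

/*
 * Open a UDP socket for sending a probe to an IPv4 multicast group.
 * Fills *dst with the destination address.
 * Returns the socket on success, -1 if the address is invalid,
 * is not a multicast address, or a socket call fails.
 */
int open_probe_socket(const char *group, uint16_t port, struct sockaddr_in *dst)
{
    assert(group != NULL && dst != NULL);

    memset(dst, 0, sizeof(*dst));
    dst->sin_family = AF_INET;
    dst->sin_port = htons(port);
    if (inet_pton(AF_INET, group, &dst->sin_addr) != 1)
        return -1;
    if (!IN_MULTICAST(ntohl(dst->sin_addr.s_addr)))
        return -1;                      /* outside 224.0.0.0/4 */

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    unsigned char ttl = 1;              /* keep probes on the local segment */
    if (setsockopt(fd, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, sizeof(ttl)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Send one probe datagram; returns the number of bytes sent or -1. */
ssize_t send_probe(int fd, const struct sockaddr_in *dst)
{
    static const char probe[] = "DISCOVER";   /* hypothetical payload */
    return sendto(fd, probe, sizeof(probe) - 1, 0,
                  (const struct sockaddr *)dst, sizeof(*dst));
}
```

In a scheme like this, each server would join the same group (for example with IP_ADD_MEMBERSHIP) and answer the probe by unicast, which is how the client learns server addresses without any configured IPs.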
The task code itself is this simple (the complete example is less than 30 lines)
#include <stdio.h>
#include <stdlib.h>   /* system() */
#include "dtt.h"

void task_write_file(dt_args_t *args, dt_ret_t *ret)
{
    // This code will execute on the remote machine!
    system("echo hello >> /tmp/hello.txt");
    printf("task write_file call, args = %p ret = %p ...\n", (void *)args, (void *)ret);
}

// Declare this as a remotely callable task
DT_DECLARE_TASK(task_write_file);

int main(void)
{
    // Initialization (must be at the very beginning of main)
    dt_task_init();

    // Create a task and throw it to any machine in the cluster for execution
    dt_task_id_t task_id = dt_task_create(task_write_file, NULL, NULL);

    // Wait for execution to complete
    dt_task_join(task_id);
    return 0;
}
Did you see that? Writing a distributed task is almost no different from writing an ordinary C function. The only extra line is that macro.
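Why would a declaration macro be needed at all? A function pointer means nothing to a process on another machine, so a macro of this kind typically records the task under a name that the receiving side can look up. The sketch below shows one possible way to do that with a GCC/Clang constructor function. Everything in it (MY_DECLARE_TASK, my_task_register, my_task_lookup, the table size) is hypothetical and is not how DT_DECLARE_TASK is known to be implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical task signature, simplified to untyped pointers. */
typedef void (*my_task_fn)(void *args, void *ret);

struct my_task_entry {
    const char *name;
    my_task_fn  fn;
};

#define MY_MAX_TASKS 64
static struct my_task_entry my_tasks[MY_MAX_TASKS];
static size_t my_task_count;

/* Record a task under its name; returns 0 on success, -1 if the table is full. */
int my_task_register(const char *name, my_task_fn fn)
{
    assert(name != NULL && fn != NULL);
    if (my_task_count >= MY_MAX_TASKS)
        return -1;
    my_tasks[my_task_count].name = name;
    my_tasks[my_task_count].fn = fn;
    my_task_count++;
    return 0;
}

/* Find a task by the name received over the wire; NULL if unknown. */
my_task_fn my_task_lookup(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < my_task_count; i++)
        if (strcmp(my_tasks[i].name, name) == 0)
            return my_tasks[i].fn;
    return NULL;
}

/* Register fn before main() runs (GCC/Clang constructor attribute). */
#define MY_DECLARE_TASK(fn)                                          \
    __attribute__((constructor)) static void my_register_##fn(void)  \
    {                                                                \
        my_task_register(#fn, fn);                                   \
    }

static int demo_calls;   /* counts how often the demo task has run */

static void demo_task(void *args, void *ret)
{
    (void)args;
    (void)ret;
    demo_calls++;
}
MY_DECLARE_TASK(demo_task)
```

With a table like this, the sender only has to transmit the string "demo_task", and the receiver calls my_task_lookup("demo_task") to get a function it can run locally. This assumes the task code is also present in the receiving process.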
How does it stay so lightweight?
- No dependencies: It relies only on the standard C library and POSIX interfaces.
- Multicast automatic discovery: Start the server anywhere on the local area network and clients discover it automatically, with no IPs to configure.
- Efficient RPC based on long-lived connections: A very simple, hand-written binary protocol that outperforms many large frameworks.
- Supports parameters and return values (not used in the current example, but full serialization support is already implemented).
- Supports task timeouts, retries, and load balancing (the least busy node executes first).
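As an illustration of what a "very simple binary protocol" can look like, here is a sketch of a fixed 12-byte frame header with big-endian fields, followed by a payload of the stated length. The layout, the field names, and the magic value are assumptions chosen for the example. DTT's real wire format is not described in this post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FRAME_MAGIC       0x4454u   /* hypothetical marker, ASCII "DT" */
#define FRAME_HEADER_SIZE 12

/* Hypothetical header: every field is sent in big-endian byte order. */
struct frame_header {
    uint16_t magic;        /* must equal FRAME_MAGIC */
    uint16_t type;         /* message type, e.g. request or reply */
    uint32_t msg_id;       /* matches a reply to its request */
    uint32_t payload_len;  /* number of payload bytes after the header */
};

static void put_u16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xff);
}

static void put_u32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

static uint16_t get_u16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t get_u32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Serialize h into out, which must hold FRAME_HEADER_SIZE bytes. */
void frame_encode(const struct frame_header *h, uint8_t *out)
{
    assert(h != NULL && out != NULL);
    put_u16(out, h->magic);
    put_u16(out + 2, h->type);
    put_u32(out + 4, h->msg_id);
    put_u32(out + 8, h->payload_len);
}

/* Parse a header; returns 0 on success, -1 if too short or the magic is wrong. */
int frame_decode(const uint8_t *in, size_t len, struct frame_header *h)
{
    assert(in != NULL && h != NULL);
    if (len < FRAME_HEADER_SIZE)
        return -1;
    h->magic = get_u16(in);
    if (h->magic != FRAME_MAGIC)
        return -1;
    h->type = get_u16(in + 2);
    h->msg_id = get_u32(in + 4);
    h->payload_len = get_u32(in + 8);
    return 0;
}
```

Writing the bytes explicitly, rather than sending the struct as it sits in memory, keeps the format independent of compiler padding and of the byte order of each machine.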
What can it actually do? Practical large-scale scenarios
Despite its small codebase, it addresses the most troublesome “last mile” scheduling problem in the AI/HPC field:
Lightweight pre/post-task scheduling for AI distributed training
- Before training: dozens/hundreds of machines uniformly pull data, decompress datasets, write partition tables, and pre-generate tokenizers.
- After training: uniformly merge checkpoints, upload to OSS, trigger validation tasks, and clean up the /tmp cache.
Compared to starting another Ray instance or a self-developed Python scheduler, DTT is written directly in C, requires only a few MB of memory, and does not compete with GPU processes for resources.
HPC cluster batch job preprocessing and cleanup
- Thousands of jobs submitted via Slurm/PBS need to execute 5-10 seconds of preparation/cleanup scripts on each compute node.
- Using DTT, a single command broadcasts this, which is over 10 times faster than writing shell + pdsh/ssh loops, and it naturally includes retries and load balancing.
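The "retries and load balancing" mentioned above can be pictured with a small sketch: pick the node with the fewest running tasks, and if sending fails, exclude that node and try the next least busy one. The struct, the function names, and the send callback are hypothetical and only illustrate the policy. Timeouts are left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct node {
    const char *addr;     /* node address, used here only as a label */
    unsigned    running;  /* tasks currently running on the node */
    int         excluded; /* set to 1 after a failed attempt */
};

/* Return the index of the non-excluded node with the fewest running tasks, or -1. */
int pick_least_busy(const struct node *nodes, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (nodes[i].excluded)
            continue;
        if (best < 0 || nodes[i].running < nodes[best].running)
            best = (int)i;
    }
    return best;
}

/* Callback that delivers a task to a node; returns 0 on success. */
typedef int (*send_fn)(const struct node *node, void *ctx);

/*
 * Try up to max_attempts nodes, least busy first.
 * Returns the index of the node that accepted the task, or -1.
 */
int dispatch_with_retry(struct node *nodes, size_t n, int max_attempts,
                        send_fn send, void *ctx)
{
    assert(nodes != NULL && send != NULL);
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        int i = pick_least_busy(nodes, n);
        if (i < 0)
            return -1;              /* no candidates left */
        if (send(&nodes[i], ctx) == 0) {
            nodes[i].running++;
            return i;
        }
        nodes[i].excluded = 1;      /* do not retry the same node */
    }
    return -1;
}

/* Demo callback: pretends the node whose address equals ctx is down. */
int send_unless_down(const struct node *node, void *ctx)
{
    return strcmp(node->addr, (const char *)ctx) == 0 ? -1 : 0;
}
```

For example, with three nodes running 3, 1, and 2 tasks, the second node is tried first. If it is down, the task falls through to the third node.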
Pre-alignment of AllReduce environments at scale (hundreds/thousands of nodes)
- Before starting training, ensure that NCCL_RING, CUDA_VISIBLE_DEVICES, and LD_LIBRARY_PATH are completely consistent across all 512 machines.
- One-click issue detection and repair tasks, with all nodes returning results within 3 seconds, much faster than manually logging in one by one.
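One way such a consistency check could work is for each node to hash the values of the relevant variables into a single fingerprint and return it, so the submitter only compares a few integers. The sketch below uses the 64-bit FNV-1a hash. The function names and the "<unset>" placeholder are assumptions for illustration and are not part of DTT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Fold a string into a running 64-bit FNV-1a hash. */
static uint64_t fnv1a(uint64_t h, const char *s)
{
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 0x100000001b3ULL;          /* FNV-1a 64-bit prime */
    }
    return h;
}

/* Hash "name=value\n" for every pair; a NULL value is hashed as "<unset>". */
uint64_t fingerprint_pairs(const char *const *names,
                           const char *const *values, size_t n)
{
    assert(names != NULL && values != NULL);
    uint64_t h = 0xcbf29ce484222325ULL; /* FNV-1a 64-bit offset basis */
    for (size_t i = 0; i < n; i++) {
        h = fnv1a(h, names[i]);
        h = fnv1a(h, "=");
        h = fnv1a(h, values[i] != NULL ? values[i] : "<unset>");
        h = fnv1a(h, "\n");
    }
    return h;
}

/* Fingerprint the current process environment for up to 16 variables. */
uint64_t env_fingerprint(const char *const *names, size_t n)
{
    const char *values[16];
    assert(names != NULL && n <= 16);
    for (size_t i = 0; i < n; i++)
        values[i] = getenv(names[i]);
    return fingerprint_pairs(names, values, n);
}

/* Return the index of the first node that differs from node 0, or -1 if all match. */
int first_mismatch(const uint64_t *fps, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (fps[i] != fps[0])
            return (int)i;
    return -1;
}
```

Each node would run env_fingerprint inside a task and return the result. The submitting side then calls first_mismatch on the collected values to find a node that needs repair.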
Zero-intrusion hot updates for inference service clusters
- Hundreds of inference machines need to synchronize model weights, configuration files, and dynamic libraries.
- Using DTT to broadcast rsync + reload tasks completes in seconds, without needing Kubernetes RollingUpdate or restarting processes.
Lights-out management for ultra-large bare-metal server clusters
- Thousands of physical machines, no K8s, only the most basic BMC + PXE.
- Using DTT as the “heart of the cluster”: batch execution of BMC commands, flashing BIOS, collecting sensor temperatures, and synchronizing time, which operations staff absolutely love.
In summary: For any scenario that requires “executing a piece of C/C++ code simultaneously on hundreds or thousands of bare-metal/virtual machines,” DTT is currently the lightest, fastest, and most stable solution on the market.
Ray is too heavy, Celery is too slow, Kubernetes Jobs are too complex, and pdsh lacks retries and result collection. DTT fills this gap as a small tool with “nuclear weapon-level” power.
Finally
Many people think of microservices, K8s, and ServiceMesh when it comes to distributed systems. But in many real scenarios, we just need to “throw this piece of C code to another machine to run,” and DTT was born to meet exactly this simplest of needs.
It may not have flashy features, but it delivers truly usable distributed task scheduling with the least amount of code.