For a while now, I've been working on a personal project that's been both a massive challenge and an incredible learning experience: a distributed, fault-tolerant database built from the ground up in Rust. I've been calling it SSK CASPAXOS DB, and I wanted to share some of the core ideas behind its design, the problems I tried to solve, and the cool features that have emerged from it.
No, it is not open source. I'll be using it in my upcoming projects, since I would have had to build much of this into them anyway; this DB is just a formalization of what I wanted.
At its heart, SSK is an attempt to build a Redis-like database, with the feel of an in-memory store, that doesn't compromise on data consistency when running across multiple nodes.
The Core Goal: Strong Consistency without a Leader
Many distributed databases rely on a leader-based consensus protocol like Raft. A single node is elected as the leader, all writes go through it, and it replicates them to followers. This works great, but I was fascinated by a different approach: leaderless consensus.
I chose to build SSK around CASPaxos, a variant of the classic Paxos algorithm. The "CAS" stands for Compare-and-Set, which is the core primitive. Instead of just writing a value, you propose a change based on what you think the current value is. The consensus protocol then ensures that your change is only applied if your assumption was correct.
This leaderless model means any node in the cluster can coordinate a write operation. It feels more symmetric and avoids the complexities of leader election and failover.
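To make that concrete, here's a minimal sketch of what the compare-and-set primitive looks like as code. The names are illustrative (this isn't SSK's actual API), and a plain HashMap stands in for the replicated register so the shape of a "change function" is visible without any of the networking:

```rust
use std::collections::HashMap;

// Hypothetical sketch, not SSK's real API: a "change" is a function from the
// current value (if any) to the proposed new value.
type Change = Box<dyn Fn(Option<Vec<u8>>) -> Vec<u8>>;

struct Proposer {
    store: HashMap<String, Vec<u8>>, // stand-in for the quorum's accepted values
}

impl Proposer {
    // In real CASPaxos this would run Prepare/Accept against a quorum; here it
    // just applies the change function to the current value of the register.
    fn propose(&mut self, key: &str, change: Change) -> Vec<u8> {
        let new = change(self.store.get(key).cloned());
        self.store.insert(key.to_string(), new.clone());
        new
    }
}

fn main() {
    let mut p = Proposer { store: HashMap::new() };

    // An unconditional SET ignores whatever was there before.
    p.propose("user:42", Box::new(|_| b"alice".to_vec()));

    // A compare-and-set only moves "alice" -> "bob"; otherwise it keeps the old value.
    let v = p.propose(
        "user:42",
        Box::new(|old| {
            if old.as_deref() == Some("alice".as_bytes()) {
                b"bob".to_vec()
            } else {
                old.unwrap_or_default()
            }
        }),
    );
    assert_eq!(v, b"bob");
}
```

The real propose obviously has to run the consensus round described below before a change is applied, but the change-function shape is the same.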
The Foundation: Rust, RocksDB, and RockSolid
The storage engine is built on RocksDB, the high-performance embedded database from Facebook. It's an incredible piece of engineering, but its raw API can be complex. To manage this, I'm using my own library, RockSolid, which provides a safer, transactional interface on top of RocksDB, with support for features like Column Families and TTLs.
This foundation gives SSK persistence and durability. Every committed write is atomically written to a changelog and the primary data store within a single RocksDB transaction, so the system can recover safely from a crash.
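To illustrate the idea (using the rocksdb crate directly rather than RockSolid, whose API isn't shown here), the pattern is roughly: put the changelog entry and the data mutation into one WriteBatch so RocksDB applies them together. The key names and layout below are made up for the example:

```rust
use rocksdb::{ColumnFamilyDescriptor, Options, WriteBatch, DB};

// Sketch only: one atomic unit containing both the applied value and the
// changelog entry that produced it, so a crash never leaves one without the other.
fn main() -> Result<(), rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.create_missing_column_families(true);

    let cfs = vec![
        ColumnFamilyDescriptor::new("data", Options::default()),
        ColumnFamilyDescriptor::new("changelog", Options::default()),
    ];
    let db = DB::open_cf_descriptors(&opts, "/tmp/ssk-sketch", cfs)?;

    let data_cf = db.cf_handle("data").expect("cf exists");
    let log_cf = db.cf_handle("changelog").expect("cf exists");

    // Both writes go into a single batch; RocksDB commits it atomically.
    let mut batch = WriteBatch::default();
    batch.put_cf(data_cf, b"user:42", b"alice");
    batch.put_cf(log_cf, b"000000000017", b"SET user:42 alice");
    db.write(batch)?; // all-or-nothing

    Ok(())
}
```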
More Than Just a Key-Value Store
The goal was never just to store simple keys and values. I wanted a rich data model inspired by the versatility of Redis. As of now, SSK supports a wide range of data types, each with its own set of atomic operations:
Standard Redis Data Types:
Strings: GET, SET, APPEND, INCRBY, STRLEN, MSET, etc.
Lists: A full suite of list commands like LPUSH, RPOP, LRANGE, LTRIM, and LMOVE.
Hashes: Field-value maps with commands like HSET, HGETALL, HINCRBY.
Sets: Unordered collections with powerful algebra commands like SADD, SINTERSTORE (intersection), SUNIONSTORE (union), and SDIFFSTORE (difference).
Sorted Sets (ZSets): Members sorted by score, with commands for ranking, range queries by score or lexicographical order (ZRANGE, ZRANGEBYSCORE, ZRANGEBYLEX), and set algebra.
Custom Data Structures:
Native Bitmaps: High-performance roaring bitmaps for tracking presence/absence in large sets (e.g., daily active users) with commands like BM_ADD, BM_CARD, and bitwise algebra (BM_AND, BM_OR); there's a small sketch of this mapping right after this list.
HyperLogLog: For efficient, approximate cardinality counting of massive datasets.
H3 Geospatial: An advanced geospatial index using Uber's H3 grid system (xs_h3). This allows for INCREDIBLY fast radius queries (GEOH3_RADIUS) and aggregations within hexagonal grid cells.
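To make the bitmap commands concrete, here's a rough illustration of how they map onto compressed roaring bitmaps via the roaring crate. The BM_* names in the comments are SSK's commands, but the code itself is just a sketch; serializing the bitmaps into RocksDB and routing mutations through consensus are omitted:

```rust
use roaring::RoaringBitmap;

fn main() {
    let mut monday = RoaringBitmap::new();
    let mut tuesday = RoaringBitmap::new();

    // BM_ADD day:mon <user_id>
    monday.insert(42);
    monday.insert(1001);
    // BM_ADD day:tue <user_id>
    tuesday.insert(42);
    tuesday.insert(77);

    // BM_CARD day:mon -> number of distinct users seen on Monday
    assert_eq!(monday.len(), 2);

    // BM_AND: users active on both days (intersection)
    let both = &monday & &tuesday;
    assert!(both.contains(42) && both.len() == 1);

    // BM_OR: users active on either day (union)
    let either = &monday | &tuesday;
    assert_eq!(either.len(), 3);
}
```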
Native Distributed Primitives:
Distributed Locks: A native, fault-tolerant locking primitive (ACQUIRE_LOCK, RELEASE_LOCK) built directly on top of the consensus layer.
High-Throughput Counters: A special counter type that uses RocksDB's MergeOperator to handle rapid, concurrent increments with minimal write contention.
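As a rough sketch of that counter idea (not SSK's actual code or encoding), here's what an associative merge operator looks like with the rocksdb crate: increments become cheap merge() calls, and RocksDB folds the deltas together lazily instead of doing a read-modify-write per increment:

```rust
use rocksdb::{MergeOperands, Options, DB};

// Illustrative encoding: counters are 8-byte little-endian i64 values, and
// every merge operand is one pending increment.
fn decode(v: &[u8]) -> i64 {
    i64::from_le_bytes(v.try_into().unwrap_or([0; 8]))
}

fn counter_merge(
    _key: &[u8],
    existing: Option<&[u8]>,
    operands: &MergeOperands,
) -> Option<Vec<u8>> {
    let mut total = existing.map(decode).unwrap_or(0);
    for op in operands.iter() {
        total += decode(op); // fold each pending increment into the total
    }
    Some(total.to_le_bytes().to_vec())
}

fn main() -> Result<(), rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.set_merge_operator_associative("counter", counter_merge);

    let db = DB::open(&opts, "/tmp/ssk-counter-sketch")?;

    // Concurrent INCRBY calls become merge operands; no per-key
    // read-modify-write contention on the hot path.
    db.merge(b"pageviews", 1i64.to_le_bytes())?;
    db.merge(b"pageviews", 5i64.to_le_bytes())?;

    let v = db.get(b"pageviews")?.expect("counter exists");
    assert_eq!(decode(&v), 6);
    Ok(())
}
```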
The "Secret Sauce": Fast and Flexible Consensus with CASPaxos
The real engine of SSK is its implementation of CASPaxos. I didn't want a system where one node's temporary failure could halt all writes, so a leaderless protocol was a natural fit. CASPaxos is elegant because its core primitive is not just "write," but "conditionally write."
The Two-Phase Protocol
At its core, any write operation in SSK follows a two-phase protocol:
Phase 1 (Prepare): The node coordinating the write (the proposer) picks a unique, timestamp-based ballot number. It sends a Prepare message with this ballot to a quorum of its peers. The peers promise not to accept any older-ballot writes and, crucially, reply with the value they last accepted (if any).
Phase 2 (Accept): The proposer gathers the promises. If any of them report a previously accepted value, it must build on the one with the highest ballot (this is key for correctness!); otherwise, it's free to proceed with its own proposed change. It then sends an Accept message to the quorum, asking them to durably commit the new value.
If a quorum of nodes accepts the value, the write is successful. This two-phase dance ensures that all nodes will eventually converge on the same value for a given key, even with concurrent writes and network partitions.
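Here's a compact sketch of the acceptor-side rules (illustrative types, not SSK's real structs). Ballots are modeled as (timestamp, node id) pairs so they're unique and totally ordered:

```rust
type Ballot = (u64, u32); // (timestamp, node id)

#[derive(Default)]
struct Acceptor {
    promised: Option<Ballot>,
    accepted: Option<(Ballot, Vec<u8>)>, // last accepted (ballot, value)
}

enum Reply {
    /// Phase 1 answer: the promise plus the last value this node accepted.
    Promise(Option<(Ballot, Vec<u8>)>),
    /// Phase 2 answer.
    Accepted,
    /// Either phase: a newer ballot has already been seen.
    Reject(Ballot),
}

impl Acceptor {
    fn prepare(&mut self, ballot: Ballot) -> Reply {
        let newest_seen = self.promised.max(self.accepted.as_ref().map(|(b, _)| *b));
        if Some(ballot) > newest_seen {
            self.promised = Some(ballot);
            Reply::Promise(self.accepted.clone())
        } else {
            Reply::Reject(newest_seen.unwrap_or_default())
        }
    }

    fn accept(&mut self, ballot: Ballot, value: Vec<u8>) -> Reply {
        if Some(ballot) >= self.promised {
            self.promised = Some(ballot);
            self.accepted = Some((ballot, value)); // durably persisted in the real system
            Reply::Accepted
        } else {
            Reply::Reject(self.promised.unwrap_or_default())
        }
    }
}

fn main() {
    let mut a = Acceptor::default();
    // Phase 1: a fresh ballot gets a promise; nothing has been accepted yet.
    assert!(matches!(a.prepare((1, 7)), Reply::Promise(None)));
    // Phase 2: the same ballot's value is accepted.
    assert!(matches!(a.accept((1, 7), b"alice".to_vec()), Reply::Accepted));
    // A stale proposer with an older ballot is told to back off.
    assert!(matches!(a.prepare((1, 3)), Reply::Reject(_)));
}
```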
Optimization 1: Piggybacking for Speed
Full two-phase consensus means two network round trips, which can feel slow for simple operations. One of the key optimizations in SSK's implementation is piggybacking. For operations that aren't conditional (like a simple SET, which doesn't care about the previous value), the proposer can be optimistic.
It combines the Prepare and Accept messages into a single network packet. When an acceptor receives this combined message, it can immediately promise and accept the new value in one go, responding with Accepted instead of Promise. If a quorum of nodes can do this, the entire consensus round completes in just one network round trip instead of two. If there's a conflict (a higher ballot was seen), the acceptor just rejects the piggybacked accept and falls back to the standard two-phase process. This makes uncontended writes nearly twice as fast.
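A sketch of the acceptor's fast-path handling (again with illustrative types, not SSK's real code): a fused Prepare+Accept either lands in one step or gets rejected, pushing the proposer back onto the slow path:

```rust
type Ballot = (u64, u32);

enum Reply {
    /// Fast path succeeded: promised and accepted in one step.
    Accepted,
    /// Conflict: the proposer must fall back to the full two-phase protocol.
    Reject(Ballot),
}

#[derive(Default)]
struct Acceptor {
    promised: Option<Ballot>,
    accepted: Option<(Ballot, Vec<u8>)>,
}

impl Acceptor {
    /// Handles a fused Prepare+Accept message for an unconditional write.
    fn handle_piggybacked(&mut self, ballot: Ballot, value: Vec<u8>) -> Reply {
        let newest_seen = self.promised.max(self.accepted.as_ref().map(|(b, _)| *b));
        if Some(ballot) > newest_seen {
            // Promise and accept in one go: the round finishes in a single
            // network round trip if a quorum does the same.
            self.promised = Some(ballot);
            self.accepted = Some((ballot, value));
            Reply::Accepted
        } else {
            // A newer ballot has been seen; reject so the proposer retries
            // with the standard Prepare/Accept dance.
            Reply::Reject(newest_seen.unwrap_or_default())
        }
    }
}

fn main() {
    let mut a = Acceptor::default();
    // An uncontended SET completes in one round trip.
    assert!(matches!(a.handle_piggybacked((2, 7), b"v1".to_vec()), Reply::Accepted));
    // A write carrying an older ballot is rejected and falls back to two phases.
    assert!(matches!(a.handle_piggybacked((1, 3), b"v2".to_vec()), Reply::Reject(_)));
}
```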
Optimization 2: Extending CASPaxos for Atomic Multi-Key Operations
This is the feature I'm most proud of. In many distributed systems, if you want to atomically change two different keys, you're out of luck.
Because SSK is (currently) a fully replicated system, I could extend the CASPaxos protocol with a "batch mode" that provides strong atomicity across multiple keys. When a command like SMOVE (which removes from one set and adds to another) or a user-defined BATCH of operations arrives, the proposer doesn't just run one Paxos round; it coordinates a multi-key transaction:
Multi-Key Prepare: It sends a PrepareBatch message for all keys involved in the transaction. This locks all keys against concurrent modification.
Speculative Execution: The proposer gathers the last-known values for all keys from the PromiseBatch replies. It then speculatively executes the entire transaction locally in-memory. If a CAS check fails or any other precondition is not met, the entire transaction is aborted before anything is committed.
Single-Point Commit: If the speculation succeeds, the proposer sends a single AcceptBatch message containing the entire sequence of changes to the quorum. The acceptors then apply this batch atomically to their storage engines.
This process ensures that a multi-key operation is a true all-or-nothing affair. It's incredibly powerful for maintaining application-level invariants without forcing developers to build complex retry logic on the client side.
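Here's an illustrative, heavily simplified version of the proposer-side flow. The PrepareBatch/AcceptBatch steps are reduced to comments and an in-memory map stands in for the quorum; the point is the shape of speculative execution and the all-or-nothing commit, not SSK's real code:

```rust
use std::collections::HashMap;

type Key = String;
type Value = Vec<u8>;

enum Op {
    /// Apply only if the key currently holds `expected`.
    CompareAndSet { key: Key, expected: Option<Value>, new: Value },
    /// Unconditional write.
    Set { key: Key, value: Value },
}

struct Proposer {
    /// Stand-in for the latest per-key values gathered from PromiseBatch replies.
    promised_state: HashMap<Key, Value>,
}

impl Proposer {
    fn run_batch(&mut self, ops: Vec<Op>) -> Result<(), String> {
        // 1. PrepareBatch: lock every key in the transaction (elided here);
        //    `promised_state` stands in for the values the quorum returned.

        // 2. Speculative execution: apply every op to a local copy first.
        let mut speculative = self.promised_state.clone();
        for op in &ops {
            match op {
                Op::CompareAndSet { key, expected, new } => {
                    if speculative.get(key) != expected.as_ref() {
                        // A single failed precondition aborts the whole batch
                        // before anything is sent to the acceptors.
                        return Err(format!("CAS failed on {key}"));
                    }
                    speculative.insert(key.clone(), new.clone());
                }
                Op::Set { key, value } => {
                    speculative.insert(key.clone(), value.clone());
                }
            }
        }

        // 3. Single-point commit: ship the whole result as one AcceptBatch and
        //    let acceptors apply it atomically (here: just swap local state).
        self.promised_state = speculative;
        Ok(())
    }
}

fn main() {
    let mut p = Proposer { promised_state: HashMap::new() };

    // An SMOVE-style transaction, simplified to two writes: both land or neither does.
    let smove = vec![
        Op::Set { key: "set:a".to_string(), value: b"{}".to_vec() },
        Op::Set { key: "set:b".to_string(), value: b"{x}".to_vec() },
    ];
    assert!(p.run_batch(smove).is_ok());

    // A failing CAS aborts the entire batch; "set:a" is left untouched.
    let bad = vec![Op::CompareAndSet {
        key: "set:a".to_string(),
        expected: Some(b"wrong".to_vec()),
        new: b"{y}".to_vec(),
    }];
    assert!(p.run_batch(bad).is_err());
    assert_eq!(p.promised_state.get("set:a"), Some(&b"{}".to_vec()));
}
```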
Failure is Not an Option: Node Recovery
Building a distributed database means thinking about failure from day one. A robust recovery mechanism is non-negotiable. When a new node joins the cluster or an old one restarts, it can't just start serving traffic; it first works through a recovery sequence:
Quorum Check: First, it waits until it can connect to a majority of its peers. This is a critical step to prevent split-brain scenarios where the cluster partitions and operates as two independent groups.
State Assessment: It then queries the other nodes to see how far behind its local state is. It compares its last known log entry ID with the latest on the cluster.
Catch-up vs. Bootstrap:
If it's only slightly behind, it enters "catch-up" mode. It reads the changelog from its peers and efficiently replays the operations it missed, applying them one by one to its local state.
If it's too far behind (or brand new), it triggers a "bootstrap." In this mode, it pulls a full RocksDB checkpoint (a consistent, point-in-time snapshot) from a healthy peer, completely wipes its own state, and installs the snapshot. After a quick restart to load the new database files, it comes online fully in sync.
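The decision itself boils down to something like the sketch below; the threshold and the names are made up for illustration, not SSK's real values:

```rust
/// How far behind we can be before replaying the changelog stops being
/// worthwhile and a full snapshot install is cheaper. Purely illustrative.
const MAX_CATCHUP_LAG: u64 = 10_000;

enum RecoveryPlan {
    /// Replay the changelog entries in (local_id, cluster_id] from a peer.
    CatchUp { from: u64, to: u64 },
    /// Wipe local state and install a RocksDB checkpoint from a healthy peer.
    Bootstrap,
}

fn plan_recovery(local_last_id: Option<u64>, cluster_last_id: u64) -> RecoveryPlan {
    match local_last_id {
        // Brand-new node: nothing to replay from, take a snapshot.
        None => RecoveryPlan::Bootstrap,
        Some(local) if cluster_last_id.saturating_sub(local) <= MAX_CATCHUP_LAG => {
            RecoveryPlan::CatchUp { from: local, to: cluster_last_id }
        }
        // Too far behind: replaying would take longer than shipping a checkpoint.
        _ => RecoveryPlan::Bootstrap,
    }
}

fn main() {
    assert!(matches!(plan_recovery(None, 500_000), RecoveryPlan::Bootstrap));
    assert!(matches!(
        plan_recovery(Some(499_900), 500_000),
        RecoveryPlan::CatchUp { from: 499_900, to: 500_000 }
    ));
    assert!(matches!(plan_recovery(Some(10), 500_000), RecoveryPlan::Bootstrap));
}
```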
It’s been a journey full of deadlocks, race conditions and lots of learning. Getting a new node to safely join an active cluster and perfectly synchronize its state without causing issues was one of the hardest parts, but seeing it work is incredibly rewarding.
This is still a work-in-progress, and as with any database, the next big question is always "how to scale?" Sharding the data while trying to maintain some of these atomic guarantees is the next mountain to climb. But for now, it's been a fun and deeply educational project in the world of distributed consensus and Rust.