Diennea engineering

HerdDB

A distributed database

HerdDB is a SQL distributed database implemented in Java. It has been designed to be embeddable in any Java Virtual Machine. It is optimized for fast "writes" and primary key read/update access patterns.

Scalability

HerdDB is designed to manage hundreds of tables. It is simple to add and remove hosts and to reconfigure tablespaces to easly distribute the load on multiple systems.

Resiliency

HerdDB leverages Apache Zookeeper and Apache Bookkeeper to build a fully replicated, shared-nothing architecture without any single point of failure.

Ease of use

At the low level HerdDB is very similar to a key-value NoSQL database. On top of that an SQL abstraction layer and JDBC Driver support enables every user to leverage existing known-how and port existing applications to HerdDB.

Why we designed a new distributed database ?

At Diennea we developed EmailSuccess, a powerfull MTA (Mail Transfer Agent), designed to deliver millions of email messages per hour to inboxes all around the world.

We have very particular requirements for the database layer of EmailSucces, we want a system which:

We have been using Apache HBase for long time for our internal business but HBase (and other Big-Data engines) do not satisfy our requirements.

So we designed a new Key-Value database which will be fast enough to handle the write load of the system and then we added an SQL Planner and a JDBC Driver.

We already have great experience of Apache BookKeeper and Apache ZooKeeper projects, as we use them to build sophisticated distributed services, for instance Majordodo (which is open source and you can find it on GitHub and Maven Central), and we decided to use BookKeeper as write-ahead transaction log and ZookKeeper for group membership and coordination..

HerdDB leverages Apache Calcite as SQL Planner, beeing able to support complex query plans, accessing data using optimized access plans.

HerdDB overview

From the API point of view you can see HerdDB as a traditional SQL-based database, so your are going to issue CREATE TABLE, SELECT, INSERT, JOIN…statements and the system will do what you expect.

But internally it works as a Key-Value engine, accessing to data by the Primary Key is as fast a possible, both for reads and for writes. In fact the primary user of HerdDB, EmailSucess, uses it to store the state of every single email message.

On the Key-Value core we added the ability to run scans and multi-row updates, aggregate functions and so on, this way you can use it like any other SQL database.

The main unit of work for an HerdDB cluster is the tablespace, a tablespace is a group of tables. In the context of a tablespace you can run transactions, joins and subqueries which span multiple tables.

In a cluster for each tablespace a leader node is designated by the administrator (with some kind of auto-healing and auto leader reassignment in case of failure) and all the transactions on its tables are run on that node. This system scales well by having many tablespaces and so the load can be spread among all the machines in the cluster.

Indexes are supported by using an implementation of the Block Range Index pattern (BRIN indexes), adapted to the way the HerdDB uses to store data.

The database can be accessed from outside the process by using TLS and authentication is performed using SASL with Kerberos.

Architecture

The write path

The most critical path for data access in HerdDB is the write path, in particular the INSERT and the UPDATE-by-PK data manipulation statements are the most important for us, together with the GET-by-PK.

The leader of the tablespace keeps in memory a data structure which holds all the PKs for a table in an hash table, and an in-memory buffer which contains all the dirty and/or recently accessed records.

When an INSERT reachs the the server the write is first logged to the log, then the map of valid PKs gets updated and the new record is stored in the memory buffer. If an UPDATE is issued on the same PK (and this is our primary use case) the update is directly performed in memory, without hitting “data pages” disks, we only write to the log in order to achieve durability. If a GET comes for the same PK we can read directly the “dirty” record from the buffer. After a configurable timeout or when the system is running out of memory a checkpoint is performed and buffers are flushed to disk, creating immutable data pages, so usually all the work is in memory, writes are performed serially on the transaction log and when flushing to disk complete data pages are written, without ever modifiing existing files. This kind of write pattern is very suitable of our use case: data files are always written or read entirely, leveraging the most of OS caches.

Replication and Apache BookKeeper

HerdDB leverages Apache BookKeeper ability to provide a distributed write ahead log, when a node is running as leader it writes each state change to BookKeeper, working as a replicated state machine. Some features:

For each tablespace you can add a virtually unlimited number of replicas, each ‘replica’ node will ‘tail’ the transaction log and replay each data manipulation activity to its local copy of the data. If a “new leader” comes in, BookKeeper will fence out the “old leader”, preventing any further write to the ledger, this way the old leader will not be able to carry on its activity and change its local state: this will guarantee that every node will converge to the same consistent view of the system.

Apache BookKeeper servers, called Bookies, can be run standalone but the preferred way is to run them inside the same JVM of the database, leveraging the ability to talk to the Bookie without passing from the network stack.