A short guide to LMDB
LMDB is a great embeddable key-value store that we use extensively for Kube.
Using it is not completely straightforward though, so here’s a short guide for future reference.
LMDB has a couple of unique properties:
- It’s embeddable. You open a file and read/write it in-process.
- The database is a memory-mapped file, so zero-copy reading of data is possible.
- It has good write performance and great read performance.
- It supports single-writer/multi-reader multi-process concurrency.
- It supports multiple named databases.
- The database file never shrinks: space freed by removed values is reused internally, but it is never returned to the filesystem.
- Sorted keys are supported, and you can search for partial keys (only prefix, not random substrings).
- Works on Linux, Mac OS and Windows.
High level concepts
Environment
To do anything with LMDB you first have to open an environment. The environment provides the necessary data structures to access a single database file, so there is always a 1:1 mapping between the database files you access and the environments you have opened. There should only ever be one environment open per database file per process. Usually there is no good reason to close an environment once it's open, so you can just leave it open until the process ends.
    MDB_env *env;
    if (const int rc = mdb_env_create(&env)) {
        // Error
    }
    // Limit large enough to accommodate all our named dbs. This only starts to matter if the number gets large,
    // otherwise it's just a bunch of extra entries in the main table.
    mdb_env_set_maxdbs(env, 50);
    // This is the maximum size of the db (but will not be used directly), so we make it large enough
    // that we hopefully never run into the limit.
    mdb_env_set_mapsize(env, (size_t)1048576 * (size_t)100000); // 1MB * 100000
    if (const int rc = mdb_env_open(env, "/path/to/database", MDB_NOTLS | MDB_RDONLY, 0664)) {
        // Error
    }
- MDB_NOTLS tells LMDB not to use thread-local storage, so reader slots are tied to the transaction object instead of the thread. As long as you manage locking yourself, this allows you to have multiple read-only transactions per thread.
- MDB_RDONLY opens the database in read-only mode.
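The map size arithmetic above is easy to get wrong, so here is a minimal sketch of it in isolation. It does not call LMDB; `mebibytesToBytes`, `kMapSize` and `fitsInSizeT` are hypothetical names introduced only for illustration. It shows that 1 MiB * 100000 does not fit into 32 bits, which is why the multiplication is done in a wide type. The final check assumes a 4 KiB OS page size, since lmdb.h asks for the map size to be a multiple of the page size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of LMDB): converts MiB to bytes using
// 64-bit arithmetic, so the multiplication cannot overflow a 32-bit int.
constexpr std::uint64_t mebibytesToBytes(std::uint64_t mebibytes)
{
    return mebibytes * 1024u * 1024u;
}

// The value used in the guide: 1 MiB * 100000 (roughly 97.7 GiB).
constexpr std::uint64_t kMapSize = mebibytesToBytes(100000);

static_assert(kMapSize == 104857600000ull, "1 MiB * 100000 in bytes");
// The result needs more than 32 bits, hence the size_t casts in the guide.
static_assert(kMapSize > UINT32_MAX, "does not fit into 32 bits");
// Assumption: 4 KiB OS pages. The map size should be a multiple of the page size.
static_assert(kMapSize % 4096 == 0, "multiple of an assumed 4 KiB page");

// Returns true if the value is representable as size_t on this platform
// (a map size this large is not representable on a 32-bit build).
constexpr bool fitsInSizeT(std::uint64_t bytes)
{
    return bytes <= SIZE_MAX;
}
```

On a 32-bit build a map size of this magnitude cannot be expressed at all, so a smaller limit would be needed there.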
Transaction
Any interaction with the database content, reading or writing, has to go through a transaction. A transaction always applies to the whole environment (representing the database file), and never to individual named databases. Transactions provide full ACID semantics.
    MDB_txn *parentTransaction = nullptr;
    MDB_txn *transaction;
    if (const int rc = mdb_txn_begin(env, parentTransaction, readOnly ? MDB_RDONLY : 0, &transaction)) {
        // Error
    }
    if (readOnly) {
        mdb_txn_abort(transaction);
    } else {
        mdb_txn_commit(transaction);
    }
- Transactions can be nested using the parentTransaction argument.
- MDB_RDONLY is useful because you can have multiple read-only transactions open, but only ever one write transaction.
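Since every transaction has to end in either a commit or an abort, one possible way to avoid leaking a transaction on an early return is a small RAII guard. The sketch below is an assumption about how one might structure this, not something LMDB provides: `TransactionGuard` is a hypothetical class, and the two callbacks stand in for calls to mdb_txn_commit and mdb_txn_abort on a transaction that has already been begun.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical RAII guard (not part of LMDB). The callbacks stand in for
// mdb_txn_commit() and mdb_txn_abort() on an already begun transaction.
class TransactionGuard
{
public:
    TransactionGuard(std::function<void()> commit, std::function<void()> abort)
        : mCommit(std::move(commit)), mAbort(std::move(abort))
    {
        assert(mCommit && mAbort); // both callbacks must be callable
    }

    // Non-copyable: exactly one guard owns the transaction.
    TransactionGuard(const TransactionGuard &) = delete;
    TransactionGuard &operator=(const TransactionGuard &) = delete;

    // Commit explicitly; afterwards the destructor does nothing.
    void commit()
    {
        if (!mFinished) {
            mFinished = true;
            mCommit();
        }
    }

    // Abort on scope exit unless commit() was called, e.g. on an early return.
    ~TransactionGuard()
    {
        if (!mFinished) {
            mAbort();
        }
    }

private:
    std::function<void()> mCommit;
    std::function<void()> mAbort;
    bool mFinished = false;
};
```

With real LMDB calls the callbacks would capture the `MDB_txn *`. For a read-only transaction you would simply never call `commit()`, unless the transaction opened a named database (see Caveats).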
Named Database
Named databases are similar to tables in relational databases in that they act as sub-databases of the overall database. Each named database is opened using a name and then identified by a DBI (DataBase Identifier). If named databases are not used, the database with the name “” is implicitly used.
Named databases should only be opened once per process, and the DBI should then be reused. Special care must be taken while opening databases because there must not be two concurrent transactions opening the same database (more on that in Caveats).
    MDB_dbi dbi;
    if (const int rc = mdb_dbi_open(transaction, "databasename", MDB_DUPSORT | MDB_CREATE, &dbi)) {
        // Error
    }
- Flags passed to mdb_dbi_open control properties such as how the database deals with duplicates or whether it uses string or integer keys. Once created, a database must always be opened with the same flags.
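To open each named database only once per process and reuse the DBI afterwards, a per-process cache is one option. The following is a minimal sketch under stated assumptions: `HandleCache`, `Handle` and `Opener` are hypothetical names, `Handle` stands in for MDB_dbi (an unsigned int in lmdb.h), and the opener callback stands in for code that calls mdb_dbi_open in its own transaction and commits it before returning.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical per-process cache (not part of LMDB). Handle stands in for
// MDB_dbi; Opener stands in for code that runs mdb_dbi_open() in its own
// transaction and commits that transaction before returning.
using Handle = unsigned int;
using Opener = std::function<Handle(const std::string &name)>;

class HandleCache
{
public:
    explicit HandleCache(Opener opener) : mOpener(std::move(opener))
    {
        assert(mOpener); // an opener callback is required
    }

    // Returns the cached handle, opening the database on first use only.
    // The mutex serializes callers, so the opener never runs concurrently.
    Handle get(const std::string &name)
    {
        std::lock_guard<std::mutex> lock(mMutex);
        auto it = mHandles.find(name);
        if (it == mHandles.end()) {
            it = mHandles.emplace(name, mOpener(name)).first;
        }
        return it->second;
    }

private:
    std::mutex mMutex;
    Opener mOpener;
    std::map<std::string, Handle> mHandles;
};
```

Holding the mutex while the opener runs is a deliberate choice here: it trades some concurrency for the guarantee that no two callers open a database at the same time.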
Cursors
Cursors allow you to look up a key in a database with sorted keys, and then iterate over the preceding or following keys.
    MDB_cursor *cursor;
    if (int rc = mdb_cursor_open(d->transaction, d->dbi, &cursor)) {
        // Error
    }
    // Initialize the key with the key we're looking for
    MDB_val key = {(size_t)someString.size(), (void *)someString.data()};
    MDB_val data;
    // Position the cursor at the first key greater than or equal to the given key.
    // On success, key and data hold the entry that was found.
    if (int rc = mdb_cursor_get(cursor, &key, &data, MDB_SET_RANGE)) {
        // No value found
        mdb_cursor_close(cursor);
        return 0;
    }
    // Position the cursor at the next entry
    mdb_cursor_get(cursor, &key, &data, MDB_NEXT);
    mdb_cursor_close(cursor);
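The typical use of this pattern is a prefix search: position the cursor at the first key that is greater than or equal to the prefix, then keep stepping forward until a key no longer starts with it. The sketch below shows only that loop logic and does not call LMDB. A `std::map` stands in for a database with sorted keys, `lower_bound` plays the role of MDB_SET_RANGE, and incrementing the iterator plays the role of MDB_NEXT. `SortedDb` and `keysWithPrefix` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Stand-in for a named database with sorted keys. std::map orders
// std::string keys lexicographically, which is assumed here to be close
// enough to LMDB's default key ordering for illustration.
using SortedDb = std::map<std::string, std::string>;

// Collects all keys that start with the given prefix.
std::vector<std::string> keysWithPrefix(const SortedDb &db, const std::string &prefix)
{
    std::vector<std::string> result;
    // lower_bound plays the role of MDB_SET_RANGE: first key >= prefix.
    for (auto it = db.lower_bound(prefix); it != db.end(); ++it) {
        // Keys are sorted, so the first non-matching key ends the scan.
        if (it->first.compare(0, prefix.size(), prefix) != 0) {
            break;
        }
        result.push_back(it->first); // ++it plays the role of MDB_NEXT
    }
    return result;
}
```

Because the keys are sorted, the scan touches only the matching entries plus one, which is why prefix lookups are cheap while arbitrary substring lookups are not possible this way.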
Caveats
Using LMDB in complex scenarios has plenty of pitfalls, and it's well worth investing some time in learning what you can and cannot do.
Here’s a list of things worth highlighting:
- Retrieved values AND keys point directly into mapped memory. That memory only remains valid until the transaction aborts or commits, or until the next update operation is executed within the transaction. In practice this means that whenever you hold on to a key or value, you must copy the memory.
- Opening named databases with mdb_dbi_open must fulfill the following:
  - There must never be two concurrent transactions using mdb_dbi_open.
  - A dbi created using mdb_dbi_open will only be valid in other transactions after the transaction used for mdb_dbi_open has been committed (you also need to commit read-only transactions in this case).
- An LMDB database is a memory-mapped file, and as such large chunks of it will be loaded into memory and show up as your program's memory usage. It is important to understand that this memory can be efficiently reclaimed by the OS if required.
- The database file will never shrink. Memory that is no longer used due to removed values, is internally kept track of and reused, but the file itself will never shrink. To shrink the database it would need to be copied.
- LMDB uses 4KB pages internally (that can be changed at compile time), so in the worst-case scenario, if all values are slightly over 2KB, the database file ends up about twice the size of the actual payload (plus some overhead for the keys and the B+Tree).
- No space is reused while a read-only transaction is active, so long running transactions will result in an ever-growing database. Use short-lived read-only transactions.
- Do not use LMDB databases on remote filesystems.
- LMDB does no internal bookkeeping of named databases, and you will have to ensure yourself that you open named databases with the same flags every time. This can be challenging when creating named databases dynamically (I’m maintaining the flags used for a particular named database in a separate named database).
- In order to run LMDB under Valgrind, the maximum mapsize must be smaller than half your available RAM.
- On Windows you will require a couple of patches from master that have not yet made it into a release, to avoid the database file immediately being the size of the maximum mapsize. It's perhaps better to just try master than my random cherry-picks 😉
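The first caveat, copying keys and values before the transaction ends, can be sketched as follows. To keep the example compilable without lmdb.h, `Val` is a hypothetical struct that mirrors the two fields of MDB_val (mv_size and mv_data), and a plain buffer stands in for the mapped memory. `copyValue` is likewise a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Mirrors the fields of LMDB's MDB_val so the sketch compiles without
// lmdb.h; with LMDB you would use MDB_val directly.
struct Val {
    std::size_t mv_size;
    void *mv_data;
};

// Copies the bytes out of the (mapped) memory into an owning std::string.
// The copy stays valid after the transaction ends; val.mv_data does not.
std::string copyValue(const Val &val)
{
    if (val.mv_size == 0 || val.mv_data == nullptr) {
        return std::string();
    }
    return std::string(static_cast<const char *>(val.mv_data), val.mv_size);
}
```

The copy uses the explicit size rather than relying on a terminating null byte, because stored values are arbitrary byte ranges and need not be null-terminated.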
And as a last tip: read the docs in lmdb.h closely. Ultimately everything is described in there; you just have to find the relevant sections.