Misplaced Pages

Lightning Memory-Mapped Database

Article snapshot taken from Wikipedia with creative commons attribution-sharealike license. Give it a read and then ask your questions in the chat. We can research this topic together.

Lightning Memory-Mapped Database ( LMDB ) is an embedded transactional database in the form of a key-value store . LMDB is written in C with API bindings for several programming languages . LMDB stores arbitrary key/data pairs as byte arrays, has a range-based search capability, supports multiple data items for a single key and has a special mode for appending records (MDB_APPEND) without checking for consistency. LMDB is not a relational database , it is strictly a key-value store like Berkeley DB and DBM.

#565434

52-414: LMDB may also be used concurrently in a multi-threaded or multi-processing environment, with read performance scaling linearly by design. LMDB databases may have only one writer at a time, however unlike many similar key-value databases, write transactions do not block readers, nor do readers block writers. LMDB is also unusual in that multiple applications on the same system may simultaneously open and use

104-481: A 12-part series of articles on his analysis of LMDB, beginning July 9, 2013. The conclusion was in the lines of "impressive codebase ... dearly needs some love", mainly because of too long methods and code duplication. This review, conducted by a .NET developer with no former experience of C , concluded on August 22, 2013, with "beyond my issues with the code, the implementation is really quite brilliant. The way LMDB manages to pack so much functionality by not doing things

156-456: A conversion must be done before moving a database from a 32-bit machine to a 64-bit machine, or between computers of differing endianness . LMDB employs multiversion concurrency control (MVCC) and allows multiple threads within multiple processes to coordinate simultaneous access to a database. Readers scale linearly by design. While write transactions are globally serialized via a mutex , read-only transactions operate in parallel, including in

208-618: A crash can lose data from the last not-yet-committed write transaction. Even with all asynchronous modes enabled, it is only an OS catastrophic failure or hardware power-loss event rather than merely an application crash that could potentially result in any data corruption. Two academic papers from the USENIX OSDI Symposium covered failure modes of DB engines (including LMDB) under a sudden power loss or system crash. The paper from Pillai et al., did not find any failure in LMDB that would occur in

260-592: A partial fix in a separate branch. In June 2013, Oracle changed the license of Berkeley DB (a related project) from the Sleepycat license to the Affero General Public License , thus restricting its use in a wide variety of applications. This caused the Debian project to exclude the library from 6.0 onwards. It was also criticized that this license is not friendly to commercial redistributors. The discussion

312-408: A preference for LMDB. There are wrappers for several programming languages, such as C++, Java, Python, Lua, Rust, Go, Ruby, Objective C, Javascript, C#, Perl, PHP, Tcl and Common Lisp. A complete list of wrappers may be found on the main web site. Howard Chu ported SQLite 3.7.7.1 to use LMDB instead of its original B-tree code, calling the result SQLightning. One cited insert test of 1000 records

364-485: A replacement was sought. The baroque API of LMDB was criticized though, forcing a lot of coding to get simple things done. However, its performance and reliability during testing was considerably better than the alternative back-end stores that were tried. An independent third-party software developer utilised the Python bindings to LMDB in a high-performance environment and published, on the technology news site Slashdot , how

416-422: A syncIdSet that identifies the list of entryUUID values of those entries. The present phase is differentiated from the delete phase as follows. Entries that present unchanged entries may only be returned in the present phase. Entries that delete entries may only be provided in the delete phase. In either phase, add entries (including adds that represent all current attributes of modified entries) can be returned. At

468-420: Is a resource-management technique used in programming to manage shared data efficiently. Instead of copying data right away when multiple programs use it, the same data is shared between programs until one tries to modify it. If no changes are made, no private copy is created, saving resources . A copy is only made when needed, ensuring each program has its own version when modifications occur. This technique

520-429: Is commonly applied to memory, files, and data structures. Copy-on-write finds its main use in operating systems , sharing the physical memory of computers running multiple processes , in the implementation of the fork() system call . Typically, the new process does not modify any memory and immediately executes a new process, replacing the address space entirely. It would waste processor time and memory to copy all of

572-568: Is hereafter referred to as "syncrepl". In addition to the base specification, an enhancement known as delta-syncrepl is also supported. Additional enhancements have been implemented to support multi-master replication . The basic synchronization operation is described in RFC 4533. The protocol is defined such that a persistent database of changes is not required. Rather the set of changes is implied via change sequence number (CSN) information stored in each entry and optimized via an optional session log which

SECTION 10

#1732855478566

624-399: Is known via their cookie. If the cookie is absent or indicates that the consumer is totally out of sync, then the provider will, in the refresh stage, send an add for each entry it has. In the ideal case, the refresh stage of the response contains only a delete phase with just a small set of adds (including those that represent the current result of modifies) and deletes that have occurred since

676-552: Is modular and many different backends are now available for interfacing to other technologies, not just traditional databases. Note: In older (1.x) releases, the terms "backend" and "database" were often used interchangeably. To be precise, a "backend" is a class of storage interface, and a "database" is an instance of a backend. The slapd server can use arbitrarily many backends at once, and can have arbitrarily many instances of each backend (i.e., arbitrarily many databases) active at once. Currently 17 different backends are provided in

728-465: Is particularly useful to track recent deletes. The model of operation is that a replication client (consumer) sends a "content synchronizing search" to a replication server (provider). The consumer can provide a cookie in this search (especially when it has been in sync with the provider previously). In the OpenLDAP implementation of the RFC 4533, this cookie includes the latest CSN that has been received from

780-560: Is quite impressive... I learned quite a lot from the project, and it has been frustrating, annoying and fascinating experience". Multiple other reviews cover LMDB in various languages including Chinese. OpenLDAP OpenLDAP is a free , open-source implementation of the Lightweight Directory Access Protocol (LDAP) developed by the OpenLDAP Project. It is released under its own BSD-style license called

832-463: Is removed now since the partnership with MySQL that led to its development was terminated by Oracle after Oracle acquired MySQL. back-bdb and back-hdb have been removed in favor of back-mdb since back-mdb is superior in all aspects of performance, reliability, and manageability. In practice, backends like -perl and -sock allow interfacing to any arbitrary programming language, thus providing limitless capabilities for customization and expansion. In effect

884-421: Is the default on macOS, *BSD, Android, and Windows. Default Linux builds of LMDB are, therefore, the only ones vulnerable to the problem discovered by the zhengmai researchers however, LMDB may simply be rebuilt by Linux users to utilise fsync instead. When provided with a corrupt database, such as one produced by fuzzing , LMDB may crash. LMDB's author considers the case unlikely to be concerning but has produced

936-410: Is thus able to intercept requests and trigger other actions on them before the backend receives them, and it can also likewise act on the backend's results before they reach the frontend. Overlays have complete access to the slapd internal APIs, and so can invoke anything the frontend or other backends could perform. Multiple overlays can be used at once, forming a stack of modules between the frontend and

988-433: Is used as the underlying mechanism in file systems like ZFS , Btrfs , ReFS , and Bcachefs , as well as in logical volume management and database servers such as Microsoft SQL Server . In traditional file systems, file changes overwrite the original data. With COW, when changes are made, a new version of the file is created while keeping the original intact. This approach enables features like snapshots , which capture

1040-729: The Linux kernel 's same-page merging feature. COW is also used in library , application , and system code. The string class provided by the C++ standard library was specifically designed to allow copy-on-write implementations in the initial C++98 standard, but not in the newer C++11 standard: In the PHP programming language, all types except references are implemented as copy-on-write. For example, strings and arrays are passed by reference, but when modified, they are duplicated if they have non-zero reference counts. This allows them to act as value types without

1092-574: The BDB based back-hdb. Many other OpenLDAP users have observed similar benefits. Since the initial benchmarking work done in 2012, multiple follow-on tests have been conducted with additional database engines for both in-memory and on-disk workloads characterizing the performance across multiple CPUs and record sizes. These tests show that LMDB performance is unmatched on all in-memory workloads and excels in all disk-bound read workloads and disk-bound write workloads using large record sizes. The benchmark driver code

SECTION 20

#1732855478566

1144-783: The LDAP reference source from the University of Michigan where a long-running project had supported development and evolution of the LDAP protocol until that project's final release in 1996. As of May 2015 , the OpenLDAP project has four core team members: Howard Chu (chief architect), Quanah Gibson-Mount, Hallvard Furuseth, and Kurt Zeilenga. There are numerous other important and active contributors including Ondrej Kuznik, Luke Howard, Ryan Tandy, and Gavin Henry. Past core team members include Pierangelo Masarati. OpenLDAP has four main components: Additionally,

1196-591: The OpenLDAP Project is home to a number of subprojects: Historically the OpenLDAP server (slapd, the Standalone LDAP Daemon) architecture was split between a frontend that handles network access and protocol processing, and a backend that deals strictly with data storage. This split design was a feature of the original University of Michigan code written in 1996 and carried on in all subsequent OpenLDAP releases. The original code included one main database backend and two experimental/demo backends. The architecture

1248-517: The OpenLDAP Public License. LDAP is a platform-independent protocol. Several common Linux distributions include OpenLDAP Software for LDAP support. The software also runs on BSD -variants, as well as AIX , Android , HP-UX , macOS , OpenVMS , Solaris , Microsoft Windows (NT and derivatives, e.g. 2000, XP, Vista, Windows 7, etc.), and z/OS . The OpenLDAP project was started in 1998 by Kurt Zeilenga. The project started by cloning

1300-466: The OpenLDAP distribution, and various third parties are known to maintain other backends independently. The standard backends are loosely organized into three different categories: Some backends available in older OpenLDAP releases have been retired from use, most notably back-ldbm which was inherited from the original UMich code, and back-tcl which was similar to back-perl and back-shell. Support for other backends will soon be withdrawn as well. back-ndb

1352-465: The OpenLDAP source repository in June 2011. The project was known as MDB until November 2012, after which it was renamed in order to avoid conflicts with existing software. Internally LMDB uses B+ tree data structures. The efficiency of its design and small footprint had the unintended side-effect of providing good write performance as well. LMDB has an API similar to Berkeley DB and dbm . LMDB treats

1404-518: The attributes that have changed). It is still built on the standard syncrepl specification, which always sends changes as complete entries. But in delta-syncrepl, the transmitted entries are actually sent from a log database, where each change in the main database is recorded as a log entry. The log entries are recorded using the LDAP Log Schema. Copy-on-write Copy-on-write ( COW ), also called implicit sharing or shadowing ,

1456-414: The backend. Overlays provide a simple means to augment the functionality of a database without requiring that an entirely new backend be written, and allow new functionalities to be added in compact, easily debuggable and maintainable modules. Since the introduction of the overlay feature in OpenLDAP 2.2 many new overlays have been contributed from the OpenLDAP community. Currently there are 25 overlays in

1508-760: The computer's memory as a single address space, shared across multiple processes or threads using shared memory with copy-on-write semantics (known historically as a single-level store ). Most former modern computing architectures had a 32-bit memory address space, imposing a hard limit of 4 GB on the size of any database that directly mapped into a single-level store. However, today's 64-bit processors now mostly implement 48-bit address spaces, giving access to 47-bit addresses or 128 TB of database size, making databases using shared memory useful once again in real-world applications. Specific noteworthy technical features of LMDB are: The file format of LMDB is, unlike that of Berkeley DB , architecture-dependent. This means that

1560-476: The copy-on-write data, although the allocation can be skipped if there is only one reference. The kernel then updates the page table with the new (writable) page, decrements the number of references, and performs the write. The new allocation ensures that a change in the memory of one process is not visible in another's. The copy-on-write technique can be extended to support efficient memory allocation by keeping one page of physical memory filled with zeros. When

1612-637: The core OpenLDAP distribution, with another 24 overlays in the user-contributed code section, and more awaiting approval for inclusion. Backends and overlays are the two most commonly used types of modules. Backends were typically built into the slapd binary, but they may also be built as dynamically loaded modules, and overlays are usually built as dynamic modules. In addition, slapd supports dynamic modules for implementing new LDAP syntaxes, matching rules, controls, and extended operations, as well as for implementing custom access control mechanisms and password hashing mechanisms. OpenLDAP also supports SLAPI,

Lightning Memory-Mapped Database - Misplaced Pages Continue

1664-476: The end of a present phase, each entry that the consumer has that was not identified in an add entry or present response during the present phase is implicitly no longer in the provider and thus must be deleted at the consumer so as to synchronize the consumer with the provider. Once the persist stage begins, the provider sends search results that indicate only the add, modify and delete of entries (no present unchanged entry indications) for those entries changed since

1716-520: The first benchmark results was therefore criticized as not stating limitations and for giving a "silver bullet impression" not adequate to address an engineers attitude (it has to be pointed out that the concerns raised however were later adequately addressed to the reviewer's satisfaction by the key developer behind LMDB.) The presentation did spark other database developers to dissect the code in-depth to understand how and why it works. Reviews run from brief to in-depth. Database developer Oren Eini wrote

1768-512: The library's relative simplicity and that no other similar key-value store database offers the same guarantees or overall performance, even though the authors explicitly state in presentations that LMDB is read-optimised not write-optimised. Additionally, as LMDB was primarily developed for use in OpenLDAP , its developers are focused mainly on the development and maintenance of OpenLDAP, not on LMDB per se. The developers limited time spent presenting

1820-433: The memory is allocated, all the pages returned refer to the page of zeros and are all marked copy-on-write. This way, physical memory is not allocated for the process until data is written, allowing processes to reserve more virtual memory than physical memory and use memory sparsely, at the risk of running out of virtual address space. The combined algorithm is similar to demand paging . Copy-on-write pages are also used in

1872-408: The old process's memory during the fork only to immediately discard the copy. Copy-on-write can be implemented efficiently using the page table by marking certain pages of memory as read-only and keeping a count of the number of references to the page. When data is written to these pages, the operating-system kernel intercepts the write attempt and allocates a new physical page, initialized with

1924-588: The performance problems of copying on assignment or making them immutable. In the Qt framework, many types are copy-on-write ("implicitly shared" in Qt's terms). Qt uses atomic compare-and-swap operations to increment or decrement the internal reference counter. Since the copies are cheap, Qt types can often be safely used by multiple threads without the need of locking mechanisms such as mutexes . The benefits of COW are thus valid in both single- and multithreaded systems. COW

1976-499: The plugin architecture used by Sun and Netscape/Fedora/Red Hat. In current releases, the SLAPI framework is implemented inside a slapd overlay. While many plugins written for Sun/Netscape/Fedora/Red Hat are compatible with OpenLDAP, very few members of the OpenLDAP community use SLAPI. The major (functional) releases of OpenLDAP Software include: OpenLDAP supports replication using Content Synchronization as specified in RFC 4533. This spec

2028-673: The presence of a write transaction. They are entirely wait free except for the first read-only transaction on a thread. Each thread reading from a database gains ownership of an element in a shared memory array, which it may update to indicate when it is within a transaction. Writers scan the array to determine the oldest database version the transaction must preserve without requiring direct synchronization with active readers. In 2011, Google published software that allowed users to generate micro-benchmarks comparing LevelDB 's performance to SQLite and Kyoto Cabinet in different scenarios. In 2012, Symas added support for LMDB and Berkeley DB and made

2080-463: The project's dependence on Berkeley DB . A specific goal was to replace the multiple layers of configuration and caching inherent to Berkeley DB's design with a single, automatically managed cache under the control of the host operating system . Development subsequently began, initially as a fork of a similar implementation from the OpenBSD ldapd project. The first publicly available version appeared in

2132-415: The provider (called the contextCSN). The provider then returns as search results (or, see optimization below, sync info replies) the present (unchanged entry only used in the present phase of the refresh stage) (no attributes), added, modified (represented in the refresh phase as an add with all current attributes), or deleted (no attributes) entries to put the consumer into a synchronized state based on what

Lightning Memory-Mapped Database - Misplaced Pages Continue

2184-437: The real-world file systems considered; the single failure identified by the study in LMDB only relates to hypothetical file systems. The Mai Zheng et al. paper claims to point out failures in LMDB, but the conclusion depends on whether fsync or fdatasync is utilised. Using fsync ameliorates the problem. The selection of fsync over fdatasync is a compile-time switch that is not the default behavior in current Linux builds of LMDB but

2236-436: The refresh stage completed. The persist stage continues indefinitely, meaning that search has no final "done" response. By contrast, in the refresh mode only a refresh stage occurs and such stage completes with a done response that also ends the present or delete phase (whichever phase was currently active). This protocol keeps a persistent database of write accesses (changes) and can represent each modify precisely (meaning only

2288-423: The refresh stage, two phases may occur: present and delete, where present always occurs before delete. The phases are delimited via a sync info response that specifies which phase is completed. The refresh and persist stages are also delimited via such sync info response. An optional optimization to more compactly represent a group of entries that are to be presented or deleted is to use a sync info response containing

2340-410: The same LMDB store, as a means to scale up performance. Also, LMDB does not require a transaction log (thereby increasing write performance by not needing to write data twice) because it maintains data integrity inherently by design. LMDB's design was first discussed in a 2009 post to the OpenLDAP developer mailing list, in the context of exploring solutions to the cache management difficulty caused by

2392-414: The slapd server becomes an RPC engine with a compact, well-defined and ubiquitous API. Ordinarily an LDAP request is received by the frontend, decoded, and then passed to a backend for processing. When the backend completes a request, it returns a result to the frontend, which then sends the result to the LDAP client. An overlay is a piece of code that can be inserted between the frontend and the backend. It

2444-592: The system managed to successfully sustain 200,000 simultaneous read, write and delete operations per second (a total of 600,000 database operations per second). An up-to-date list of applications using LMDB is maintained on the main web site. Many popular free software projects distribute or include support for LMDB, often as the primary or sole storage mechanism. LMDB makes novel use of well-known computer science techniques such as copy-on-write semantics and B+ trees to provide atomicity and reliability guarantees as well as performance that can be hard to accept, given

2496-504: The time the consumer last synchronized with the provider. However, due to limited session log state (also non-persistent) kept in the provider, a present phase may be required, particularly including the presentation of all unchanged entries as a means (inefficient) of implying what has been deleted in the provider since the consumer last synchronized. The search can be done in either refresh or refreshAndPersist mode, which implies what stages occur. The refresh stage always occurs first. During

2548-570: The updated benchmarking software publicly available. The resulting benchmarks showed that LMDB outperformed all other databases in read and batch write operations. SQLite with LMDB excelled in write operations, and particularly so on synchronous/transactional writes. The benchmarks showed the underlying filesystem as having a big influence on performance. JFS with an external journal performs well, especially compared to other modern systems like Btrfs and ZFS . Zimbra has tested back-mdb vs back-hdb performance in OpenLDAP, with LMDB clearly outperforming

2600-483: Was 20 times faster (than the original SQLite with its B-Tree implementation). LMDB is available as a backing store for other open source projects including Cyrus SASL, Heimdal Kerberos, and OpenDKIM. It is also available in some other NoSQL projects like MemcacheDB and Mapkeeper. LMDB was used to make the in-memory store Redis persist data on disk. The existing back-end in Redis showed pathological behaviour in rare cases, and

2652-473: Was sparked over whether the same licensing change could happen to LMDB. Author Howard Chu clarified that LMDB is part of the OpenLDAP project, which had its BSD-style license before he joined, and it will stay like it. No copyright is transferred to anybody by checking in, which would make a similar move like Oracle's impossible. The Berkeley DB license issue has caused major Linux distributions such as Debian to completely phase out their use of Berkeley DB, with

SECTION 50

#1732855478566

2704-423: Was subsequently published on GitHub and further expanded in database coverage. LMDB was designed to resist data loss in the face of system and application crashes. Its copy-on-write approach never overwrites currently-in-use data. Avoiding overwrites means the structure on disk/storage is always valid, so application or system crashes can never leave the database in a corrupted state. In its default mode, at worst,

#565434