/ 30 min read
Building an LSM Tree Storage Engine from Scratch
Preface
This article will guide you through understanding the Log-Structured Merge-Tree (LSM-Tree), including its core concepts and structure. By the end, you’ll be able to build your own storage engine based on LSM-Tree from scratch.
What is LSM-Tree?
Basic Concepts
The Log-Structured Merge-Tree (LSM-Tree) is a data structure optimized for high-throughput write operations. It is widely used in databases and storage systems, such as Cassandra, RocksDB, and LevelDB.
The key idea of LSM-Tree is to first write operations into an in-memory data structure (typically an ordered structure like a skip list or AVL tree). Later, these writes are batched and sequentially written to disk as SSTables, thereby minimizing random I/O.
Basic Structure
The LSM-Tree is divided into two main components:
- In-Memory Storage
- The core structure in memory is the Memtable.
- All write operations (e.g., set, delete) first go to the Memtable, which inserts these operations into an ordered data structure (e.g., an ordered tree in the diagram).
- Once the Memtable reaches a certain size threshold, it is flushed to disk as an SSTable (written sequentially).
- New write operations continue on a fresh Memtable.
- Disk Storage
- Disk storage involves WAL and SSTable files.
- The WAL (Write-Ahead Log) ensures that recent writes (stored in the Memtable but not yet persisted to disk) are not lost in case of a database crash. Every write to the Memtable is appended to the WAL. Upon restarting the database, entries from the WAL can be replayed in order to restore the Memtable to its pre-crash state.
- The SSTable (Sorted String Table) is a data storage format that holds a series of ordered key-value pairs.
- When the Memtable reaches its size threshold, it generates a new SSTable and persists it to disk. Since the Memtable relies on an in-memory ordered data structure, no additional sorting is required when constructing the SSTable.
- SSTables on disk are organized into multiple levels. Newly flushed SSTables are stored in Level 0. During subsequent compaction phases, SSTables in L0 are merged into Level 1 and higher levels.
- When the size of a level exceeds a threshold, an SSTable compaction process is triggered. During compaction, SSTables in the current level are merged into higher levels, producing larger, more ordered files. This reduces fragmentation and improves query efficiency.
Typically, the structure of an SSTable includes more than just a series of ordered key-value pairs (data blocks). It also contains an index block, metadata block, and other components. These details will be discussed in-depth during the implementation section.
Writing Data
Writing data involves adding a new key-value pair or updating an existing one. Updates overwrite old key-value pairs, which are later removed during the compaction process.
When data is written, it first goes to the Memtable, where the key-value pair is added to the in-memory ordered data structure. Simultaneously, the write operation is logged in the WAL and persisted to disk to prevent data loss in the event of a database crash.
The Memtable has a defined threshold (usually based on size). When the Memtable exceeds this threshold, it is switched to read-only mode and converted into a new SSTable, which is then persisted to Level 0 on disk.
Once the Memtable is flushed as an SSTable, the corresponding WAL file can be safely deleted. Subsequent write operations will proceed on a new Memtable (and a new WAL).
Deleting Data
In LSM-Tree, data is not immediately removed. Instead, deletions are handled using a mechanism called tombstones (similar to soft deletes). When a key-value pair is deleted, a new entry marked with a “tombstone” is written, indicating the deletion of the corresponding key-value pair. The actual removal occurs during the compaction process.
This tombstone-based deletion ensures the append-only property of LSM-Tree, avoiding random I/O and maintaining sequential writes to disk.
Querying Data
The process of querying data starts with a search in the Memtable. If the key-value pair is found, it is returned to the client. If a tombstone-marked key-value pair is found, it indicates that the requested data has been deleted, and this information is also returned. If the key is not found in the Memtable, the query proceeds to search the SSTables from Level 0 to Level N.
Since querying data may involve searching multiple SSTable files and can lead to random disk I/O, LSM-Tree is generally better suited for write-heavy workloads rather than read-intensive ones.
One common optimization for query performance is the use of a Bloom filter. A Bloom filter can quickly determine whether a key-value pair exists in a specific SSTable, reducing unnecessary disk I/O. Additionally, the sorted nature of SSTables enables efficient search algorithms, such as binary search, to be employed for faster lookups.
Data Compaction
Here, we introduce the Leveled Compaction Strategy, which is used by LevelDB and RocksDB.
Another common strategy is the Size-Tiered Compaction Strategy, where newer and smaller SSTables are successively merged into older and larger SSTables.
As previously mentioned, an SSTable stores a series of entries sorted by key. In the Leveled Compaction Strategy, SSTables are organized into multiple levels (Level 0 to Level N).
In Level 0, SSTables can have overlapping key ranges, as they are directly flushed from the Memtable. However, in Levels 1 through N, SSTables within the same level do not have overlapping key ranges, although key range overlaps are allowed between SSTables in different levels.
An illustrative (though not entirely accurate) example is shown below. In Level 0, the key ranges of the first and second SSTables overlap, while in Level 1 and Level 2, the SSTables within each level have disjoint key ranges. However, SSTables between different levels (e.g., Level 0 and Level 1, or Level 1 and Level 2) may have overlapping key ranges.
Let’s now explore how the Leveled Compaction Strategy maintains this organizational structure.
Since Level 0 is a special case, the compaction strategy discussion is divided into two parts:
- Level 0 to Level 1 Since Level 0 allows overlapping keys among SSTables, compaction begins by selecting an SSTable from Level 0, along with all other SSTables in Level 0 that have overlapping key ranges with it. Next, all SSTables in Level 1 with overlapping key ranges are selected. These selected SSTables are merged and compacted into a single new SSTable, which is then inserted into Level 1. All old SSTables involved in the compaction process are deleted.
- Level N to Level N+1 (N > 0) From Level 1 onwards, SSTables within the same level do not have overlapping key ranges. During compaction, an SSTable is selected from Level N, and all SSTables in Level N+1 with overlapping key ranges are also selected. These SSTables are merged and compacted into one or more new SSTables, which are inserted into Level N+1, while the old SSTables are deleted.
The primary difference between Level 0 to Level 1 compaction and Level N to Level N+1 (N > 0) compaction lies in the selection of SSTables at lower levels (Level 0 or Level N).
The multi-SSTable compaction and merging process is illustrated below. During the merge, only the latest value for each key is retained. If the latest value has a “tombstone” marker, the key is deleted. In the implementation, we use the k-way merge algorithm to perform this process.
It is important to note that the above description of the compaction process provides only a high-level overview. Many details need to be addressed during actual implementation.
For example, in LevelDB, when constructing new SSTables for Level N+1 during compaction, if the new SSTable overlaps with more than 10 SSTables in Level N+2, the process switches to constructing another SSTable. This limits the data size involved in a single compaction.
Implementation
Based on the overview of LSM-Tree above, I believe you now have a basic understanding of LSM-Tree and some ideas about its implementation. Next, we will build a storage engine based on LSM-Tree from scratch. Below, we will introduce only the core code; for the complete code, please refer to https://github.com/B1NARY-GR0UP/originium.
We will break down the implementation of LSM-Tree into the following core components and implement them one by one:
- Skip List
- WAL
- Memtable
- SSTable
- K-Way Merge
- Bloom Filter
- Leveled Compaction
Skip List
In the process of introducing data writing, we mentioned that the LSM-Tree first writes data into an in-memory ordered data structure. Some common ordered data structures and the time complexity of their operations are as follows:
Data Structure | Insert | Delete | Search | Traverse |
---|---|---|---|---|
Skip List | O(logn) | O(logn) | O(logn) | O(n) |
AVL Tree | O(logn) | O(logn) | O(logn) | O(n) |
Red-Black Tree | O(logn) | O(logn) | O(logn) | O(n) |
We chose the Skip List for two main reasons: it is simpler to implement and maintain (KISS principle), and the underlying linked list structure facilitates sequential traversal, making it easier to persist in-memory data to disk.
Core Struct
The complete implementation of the Skip List is available at https://github.com/B1NARY-GR0UP/originium/blob/main/pkg/skiplist/skiplist.go .
A Skip List consists of a base linked list and multiple levels of indices built on top of it. For large datasets, the index layers significantly shorten the search path.
In our implementation, the core structure of the Skip List is defined as follows:
- maxLevel: The maximum number of levels in the Skip List (the base linked list has one level).
- level: The current number of levels in the Skip List.
- p: The probability of a node being promoted to a higher level. For example, if
p = 0.5
, a linked list with 10 nodes at the base level will have approximately 5 nodes in the next level of indices. - rand: A random number generator used to compare against
p
. - size: The number of key-value pairs stored in the Skip List, used to determine if the Memtable exceeds its size threshold.
- head: The head node, which holds references to the first node in each level.
The structure of the elements stored in the Skip List is defined as follows:
-
types.Entry
represents a key-value pair in the storage engine, including the key, value, and a tombstone flag for deletion. -
next: Contains pointers to the next element at each level.
This structure may seem abstract, so let’s illustrate it with an example:
In this three-level Skip List, the next
pointers of the head node reference the first node at each level. Elements 3 and 6 store the next element for each of their levels.
For example, if we want to find the next node of element 19 at Level 2, we use e19.next[2-1]
.
Set
The LSM-Tree uses tombstones to perform deletions, so we don’t need a Delete
method in the skip list implementation. To delete an element, simply set the Tombstone
of the entry
to true
. Thus, the Set
method handles inserting new key-value pairs, updating existing ones, and deleting elements.
Let’s explore the implementation of the Set
method. By traversing the nodes in each level from the highest, the last element smaller than the key to be set is saved in the update
slice.
At the end of this traversal, curr
points to the last element smaller than the key to be set in the bottom-level linked list. So, we only need to check if the next element of curr
equals the key we want to set. If it matches, the element has already been inserted; we update the existing element and return.
If the element is not found, it is inserted as a new element. Using randomLevel
, we calculate the index level of this element. If it exceeds the current number of levels in the skip list, we add the head node to the update
slice and update s.level
to the new level count.
Next, construct the element to be inserted, and the next
pointers of each level are updated to complete the insertion.
Get
The skip list can perform fast search operations by relying on multiple layers of indexes. The nested for
loops in the implementation represent the index-based search operation. If the corresponding element is eventually found in the bottom-level linked list, it will be returned.
All
One reason we chose the skip list is its convenient sequential traversal, which is made possible by simply traversing the bottom-level linked list.
WAL
The complete implementation of WAL can be found at https://github.com/B1NARY-GR0UP/originium/blob/main/wal/wal.go.
As mentioned earlier, the purpose of WAL (Write-Ahead Logging) is to prevent data loss in the Memtable caused by database crashes. Therefore, WAL needs to record operations on the Memtable and recover the Memtable from the WAL file when the database restarts.
Core Struct
The core structure of WAL is as follows, where fd
stores the file descriptor of the WAL file:
Write
Since we need to record operations on the Memtable, this essentially involves writing each operation (Set
, Delete
) as an Entry
into the WAL. The definition of the Write
method is as follows:
When writing these entries
to the file, we need to standardize the WAL file format. The format we use here is length + data. First, we serialize the Entry
, then calculate the length of the serialized data, and finally write the length and serialized data sequentially into the WAL file.
The core code is as follows:
Read
Since we use the WAL file format length + data, during reading, we first read 8 bytes (int64
) to obtain the length of the data, and then read the data based on this length and deserialize it to retrieve the Entry
.
The core code is as follows:
Memtable
The complete implementation of Memtable can be found at https://github.com/B1NARY-GR0UP/originium/blob/main/memtable.go.
The Memtable is responsible for writing client operations into the skip list and recording them in the WAL. It can also recover data from the WAL when the database starts.
Core Struct
The core structure of the Memtable is as follows, which includes two main components skiplist
and wal
:
Set
When performing a Set
operation, both the skip list and the WAL need to be updated simultaneously.
Get
To retrieve a value, simply return the result of the skip list’s Get
operation.
Recover
Recovering the Memtable from the WAL file involves first reading the WAL file, then sequentially applying the Entry
records from the WAL file to the Memtable, and finally deleting the recovered WAL file.
Retrieve the list of WAL files:
Read the WAL and recover the Memtable:
SSTable
LevelDB SSTable
In the previous introduction, we only mentioned that “SSTable (Sorted String Table) is a data storage format that maintains a series of ordered key-value pairs.” Here, we will provide a more detailed explanation of the structure of SSTable.
In LevelDB, an SSTable consists of multiple blocks with different purposes. A schematic diagram is shown below:
- Data Block: Stores a sequence of ordered key-value pairs.
- Meta Block: Includes two types:
filter
andstats
. Thefilter
type stores data for Bloom filters, while thestats
type stores statistical information about the Data Blocks. - MetaIndex Block: Stores index information for the Meta Blocks.
- Index Block: Stores index information for the Data Blocks.
- Footer: Fixed in length, it stores the index information of the MetaIndex Block and Index Block, as well as a magic number.
The index information is essentially a pointer structure called a BlockHandle, which includes two attributes: offset
and size
, used to locate the corresponding Block.
Our SSTable
In our implementation of SSTable, we have simplified the LevelDB SSTable structure. A schematic diagram is shown below:
- Data Block: Stores a sequence of ordered key-value pairs.
- Meta Block: Stores some metadata for the SSTable.
- Index Block: Stores index information for the Data Blocks.
- Footer: Fixed in length, it stores the index information of the Meta Block and Index Block.
The complete implementation of SSTable can be found at https://github.com/B1NARY-GR0UP/originium/tree/main/sstable.
Data Block
The structure of the Data Block is defined as follows, storing an ordered sequence of entries.
We implemented three primary methods for the Data Block:
- Encode: Encodes the Data Block into binary data.
We use prefix compression to encode the key-value sequence. In the buffer, we sequentially write the length of the common prefix, the length of the suffix, the suffix itself, the length of the value, the value, and the “tombstone” marker.
Finally, we compress the data using s2.
S2 is a high-performance extension of the Snappy compression algorithm.
- Decode: Decodes binary data into a Data Block.
During decoding, the process is simply reversed. The full key-value pair is reconstructed using the prefix and suffix.
- Search: Locates key-value pairs using binary search.
Index Block
The structure of the Index Block is defined as follows. It stores the first and last key of each Data Block, along with the BlockHandle
of the corresponding Data Block.
Similarly, the Index Block implements three primary methods: Encode, Decode, and Search. The implementation ideas for the Encode and Decode methods are essentially the same, so we will focus on the Search method.
The Search method of the Data Block is designed to locate a specific key-value pair within the ordered key-value sequence stored in a single Data Block. In contrast, the Search method of the Index Block is used to locate the Data Block containing the given key within the entire SSTable.
Meta Block And Footer
The implementations of these two Blocks are quite straightforward, with both requiring only the Encode and Decode methods.
Build
After introducing all the Blocks in our SSTable, constructing an SSTable simply involves building each Block step by step based on the key-value pairs. Finally, the in-memory index and the encoded SSTable are returned.
K-Way Merge
The complete implementation of K-Way Merge is available at https://github.com/B1NARY-GR0UP/originium/tree/main/pkg/kway.
In the conceptual section, we illustrate the process of compressing and merging multiple SSTables through diagrams. This process is accomplished using the k-way merge algorithm.
The k-way merge algorithm is a method to merge k sorted sequences into a single sorted sequence, with a time complexity of O(knlogk).
One implementation of this algorithm uses a min-heap as an auxiliary structure:
- Insert the first element of each sequence into the heap.
- Pop the smallest value from the heap and add it to the result set. If the popped element’s sequence still has more elements, insert the next element from that sequence into the heap.
- Repeat this process until all elements from all sequences are merged.
Heap
The standard library provides a heap implementation in container/heap
. By implementing the heap.Interface
, we can build a min-heap.
- First, define the basic structure of the min-heap. A slice is used to store the elements. Each element includes not only an
Entry
but also anLI
to indicate which sorted sequence this element originates from.
- Implement the
sort.Interface
methods to sort elements in the heap. Special attention is needed for theLess
method: by comparing theLI
of the elements, we ensure that when elements have the same key, those from earlier sequences are ordered first. This facilitates deduplication when merging elements into the result set. This requirement also means that the sorted sequences should be arranged in order from oldest to newest when using the k-way merge algorithm.
- Finally, implement the
Push
andPop
methods.Push
appends an element to the end of the slice, whilePop
removes the last element from the slice.
Merge
The function definition of the Merge
method:
Follows the k-way merge algorithm process.
- First, insert the first element of each sorted sequence into the min-heap.
- Iteratively pop an element from the min-heap and add it to the result queue. If the popped element’s sequence still has more elements, insert the next element from that sequence into the heap. Here, a
map
is used instead of a result sequence. Themap
automatically handles deduplication, with newer keys always overwriting older ones.
Finally, traverse the map
to add elements to the result queue, removing any key-value pairs marked as “tombstones.” Since the map
is unordered, the result queue needs to be sorted before being returned.
Bloom Filter
The complete implementation of the Bloom Filter can be found at https://github.com/B1NARY-GR0UP/originium/blob/main/pkg/filter/filter.go.
A Bloom Filter is a data structure that efficiently checks whether an element is a member of a set.
- It uses a bit array and multiple hash functions.
- When adding an element, the element is hashed using multiple hash functions, which map it to different positions in the bit array, setting those positions to 1.
- During a query, if all positions mapped by the hash functions are 1, the element might exist. If any position is 0, the element definitely does not exist.
The time complexity for both insertion and query operations is O(k), where k is the number of hash functions. False positives may occur (the Bloom Filter predicts that an element is in the set, but it is not), but false negatives cannot occur (the Bloom Filter predicts that an element is not in the set, but it actually is).
True Positive (TP): The system predicts an event as “positive,” and it is indeed positive. False Positive (FP): The system predicts an event as “positive,” but it is actually negative. True Negative (TN): The system predicts an event as “negative,” and it is indeed negative. False Negative (FN): The system predicts an event as “negative,” but it is actually positive.
Core Struct
The core structure of a Bloom Filter includes the bit array (which can be optimized to use uint8
) and multiple hash functions.
New
The method to create a Bloom Filter instance accepts two parameters: n
(the expected number of elements) and p
(the desired false positive rate).
First, the parameters are validated. Then, the size of the bit array (m
) and the number of hash functions (k
) are calculated using specific formulas. Finally, the bit array and hash functions are initialized based on m
and k
.
Add
When adding an element, all hash functions are used to compute hash values for the key
. These values are then mapped to indices in the bit array, and the corresponding positions are set to true
.
Contains
To check if a key
is in the set, the hash functions compute hash values and map them to indices in the bit array. If any of these positions is not true
, the element is not in the set, and false
is returned.
Leveled Compaction
The complete implementation of Leveled Compaction can be found at https://github.com/B1NARY-GR0UP/originium/blob/main/level.go.
After implementing components like K-Way Merge and Bloom Filter, we can complete the final part of our implementation, the most crucial SSTable compaction process in the LSM-Tree storage engine. This implementation follows the Leveled Compaction Strategy described in the “Data Compaction” section.
In the Leveled Compaction Strategy, SSTables are organized into multiple levels (Level 0 - Level N). We need a structure to store this information and manage SSTables across different levels.
Thus, we implemented a structure called levelManager
. We use a []*list.List
to store SSTable information for each level, where the index of the slice corresponds to the level. Each element in the slice is a list.List
, a doubly-linked list that holds information about all SSTables in a specific level.
compactLN
The compactLN
method is responsible for Level N to Level N+1 (N > 0) compaction. It selects the oldest SSTable from LN and all SSTables from LN+1 that have overlapping key ranges with this SSTable.
The selected SSTables are processed in order from oldest to newest. Key-value pairs from their data blocks are added to a two-dimensional slice and then merged using the K-Way Merge algorithm.
With the merged key-value pairs, we construct a new Bloom Filter and SSTable. The relevant information about the new SSTable is appended to the end of Level N+1.
Finally, the old SSTables are deleted, and the newly constructed SSTable is written to disk.
The compactL0
method handles Level 0 to Level 1 compaction. Unlike compactLN
, it selects not only one SSTable from L0 but also all overlapping SSTables in L0. The rest of the process is identical to compactLN
.
search
The search
method locates the corresponding key-value pairs across all SSTables. It starts from L0 and iterates through each level up to LN. By leveraging the Bloom Filter and the ordered structure of SSTables, SSTables that do not contain the desired key-value pairs can be skipped efficiently.
DB
With this, we have implemented all core components of the LSM-Tree-based storage engine. By assembling these components as described in the LSM-Tree introduction, we can finalize the database interface.
-
Complete code: https://github.com/B1NARY-GR0UP/originium/blob/main/db.go
-
Documentation: https://github.com/B1NARY-GR0UP/originium?tab=readme-ov-file#usage
Summary
We began by understanding LSM-Tree, familiarizing ourselves with its core components and the process of handling client requests. Ultimately, we built our own LSM-Tree storage engine from scratch.
Of course, this implementation is just a prototype. A production-grade storage engine requires consideration of many more details. ORIGINIUM will continue to receive optimizations and improvements in the future. Hope this article and ORIGINIUM help deepen your understanding of LSM-Tree.
That concludes everything covered in this article. If there are any errors or questions, feel free to reach out via private message or leave a comment. Thank you!