building my own lsm tree in java
I have read about LSM trees for years. I have skimmed papers, looked at production code, and nodded along as people explained compaction strategies on conference stages. But reading about storage engines and building one are very different experiences. So I decided to stop consuming and start implementing.
This is not a side project for a weekend. I want to understand what it actually takes to approach LevelDB class performance. That means starting embarrassingly simple and earning every optimization.
what i am building
A persistent key value store in Java.
That is it.
No distributed layer. No replication. No transactions. No concurrency in the beginning. Just PUT and GET. If those two operations are not solid, everything else is noise.
I am following the same basic constraints as LevelDB:
- Sorted keys
- Byte array values
- Write optimized design
- Immutable SSTables on disk
- Compaction later
I am not trying to beat C++. I am trying to understand the shape of the problem.
phase 1: the naive lsm
The first version is intentionally unsophisticated.
In memory, I use a TreeMap. I briefly considered a HashMap, but that would force me to sort during every flush. I would rather pay O(log n) during writes than get hit with a sorting spike at the worst possible moment.
When the memtable crosses a size threshold, I flush it to disk as a new SSTable file.
The file format is simple:
- int key_length
- key bytes
- int value_length
- value bytes
Repeated in sorted order.
No blocks. No compression. No bloom filters. No fancy footer.
Just a sorted stream of records.
read path, deliberately painful
Reads follow the standard LSM order:
- Check the memtable
- If not found, scan SSTables from newest to oldest
Newer files win because newer writes override older ones.
Right now, the SSTable lookup is a linear scan.
Yes, that is slow.
I am keeping it that way on purpose. I want to feel what read amplification actually means instead of treating it like a bullet point in a slide deck. When I have 50 SSTables and a random GET touches half of them, I want that frustration. That is the signal telling me what to build next.
what i am not building yet
No WAL. If the process crashes, the memtable is gone. I am accepting that risk for now because durability changes the performance profile and adds complexity around fsync behavior.
No compaction. Which means the number of SSTables will grow without bound. That is fine for this stage. I want to measure how quickly read performance degrades as files accumulate.
No bloom filters. So negative lookups will hurt.
This version is supposed to expose weaknesses.
directory layout
The on disk structure is minimal:
/data /sstables 00001.sst 00002.sst
Each flush increments a monotonically increasing file number. Larger numbers mean newer data. That simple ordering defines precedence during reads.
Nothing clever. Just predictable.
benchmarking plan
I will benchmark under the same basic constraints as LevelDB:
- Fixed size keys
- Fixed size values for consistency
- Dataset larger than memory
- Random read workload after heavy writes
I care about:
- Write throughput
- Read latency
- Flush time
- Number of SSTables over time
I am not expecting miracles. I am expecting to discover bottlenecks I did not anticipate.
why i am doing this
It is easy to say "LSM trees convert random writes into sequential disk IO." It is harder to watch your own implementation stall because compaction is not implemented. It is easy to talk about read amplification. It is different to trace through ten files just to confirm a key does not exist.
I want that experience.
If this first version feels primitive, good. It should. Every feature I add next sparse indexes, WAL, compaction, bloom filters will be justified by a specific pain I can measure in this baseline.
This is version one. It will be slow in places. It will be incomplete. But it will be honest, and that is where I am starting.