RDBM - A RELIABLE DATABASE LIBRARY FOR UNIX


NEWS ABOUT RDBM

Version 0.9 has been released. I added the "reorg" command to the network layer. A Perl interface is in the works, and I am probably going to write an Apache user authentication module.

WHY NOT ANOTHER DATABASE LIBRARY?

Actually, I wrote this out of pure desperation. I used to use the Berkeley database library for a fairly important (to me) commercial project of mine, Paperboy. Unfortunately, it proved to be quite unreliable. It repeatedly corrupted the user database, so that it dumped core on access or traversal, or simply "forgot" a few thousand entries when traversing the database.

I'd love to use gdbm, which seems to be reasonably fast, and I've heard good things about it. Yet its GPL copyleft would have forced me to place Paperboy under the GPL, too. While I'm planning to do so, I'd like to make some money with it first. So gdbm is out.

Perl comes with a package called sdbm, but the accompanying readme states that it scales very badly.

What else is there? Some SQL database? Way too heavy for me. I don't like big and complex software. There are probably hundreds of bugs in all big software projects, including all the SQL databases I know. While there are lightweight SQL databases, too, e.g. mSQL, I don't like SQL per se. It adds another layer to my software. I like my software lean and small. The rdbm 0.1 distribution is 8k in size.

Yet, I found the perfect database library. It's cdb from D. J. Bernstein of qmail fame. Of all the software I have used in my life, cdb stands out because it is free, works out of the box, has not produced any hassles whatsoever and never failed me even under exceptional load or circumstances. It also combines the virtues I like most in software: it comes with sufficient documentation, it is small, and the code is obvious. There is only one problem: you can't write to it. It's a read-only hashing package.

While some misguided individuals use databases to store logs, all the database needs of mine and the companies I worked for so far have been mostly read-only access to small to medium-sized databases. Yet, updates should be fast and not pose any potential threat of corruption.

rdbm addresses these problems with a simple and elegant solution: keep the data in a cdb database and store the diffs in a separate log file. That log file is written to the disk synchronously, so no data can disappear, and it is written in text format (actually, it's the same format cdbmake accepts as input), so it can always be recovered with a text editor or a small Perl script.
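To make the log idea concrete, here is a minimal sketch in C of appending one record synchronously in cdbmake's input format, where each record is a line of the form "+klen,dlen:key->data". The names format_record and log_append are made up for this illustration and are not rdbm's actual API, and the sketch only covers storing a record. How rdbm represents deletions in its log is not shown here.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not rdbm's API): format one record in cdbmake
 * input format, "+klen,dlen:key->data\n".  Returns a malloc'd buffer
 * that the caller frees, and stores its length in *outlen. */
static char *format_record(const void *key, size_t klen,
                           const void *data, size_t dlen, size_t *outlen)
{
    char head[64];
    int hlen;
    size_t total;
    char *buf, *p;

    assert(outlen != NULL);
    hlen = snprintf(head, sizeof head, "+%zu,%zu:", klen, dlen);
    if (hlen < 0 || (size_t)hlen >= sizeof head)
        return NULL;

    /* header + key + "->" + data + newline */
    total = (size_t)hlen + klen + 2 + dlen + 1;
    buf = malloc(total);
    if (buf == NULL)
        return NULL;

    p = buf;
    memcpy(p, head, (size_t)hlen); p += hlen;
    memcpy(p, key, klen);          p += klen;
    memcpy(p, "->", 2);            p += 2;
    memcpy(p, data, dlen);         p += dlen;
    *p = '\n';

    *outlen = total;
    return buf;
}

/* Hypothetical helper (not rdbm's API): append one record to the log.
 * O_SYNC makes each write() return only after the data has reached
 * the disk; O_APPEND places it at the end of the file. */
static int log_append(const char *path, const void *key, size_t klen,
                      const void *data, size_t dlen)
{
    size_t len = 0, off = 0;
    int fd, rc = 0;
    char *rec = format_record(key, klen, data, dlen, &len);

    if (rec == NULL)
        return -1;
    fd = open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
    if (fd < 0) {
        free(rec);
        return -1;
    }
    while (off < len) {
        ssize_t n = write(fd, rec + off, len - off);
        if (n < 0) {
            rc = -1;
            break;
        }
        off += (size_t)n;
    }
    if (close(fd) < 0)
        rc = -1;
    free(rec);
    return rc;
}
```

Because the lengths are written in front of the key and the data, both may contain arbitrary bytes, and a log built this way can be fed to cdbmake once a final empty line is added to mark the end of input.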

Should the rdbm library or the machine crash, the cdb database should be unaffected because it is never altered in place, and the log has been written synchronously, so it should be there, too. The only thing that can corrupt an rdbm database is a disk malfunction like a bad sector, or an operating system malfunction. So if you keep backups and/or use disk mirroring or RAID 5 for the log files, you should be completely safe from harm.

The inner structure of rdbm is quite simple. I use a variant of chained hashing for the in-memory structure and I provide a dbm-like interface on top of it. Since the in-core structure serves as a cache, too, only one process using the rdbm library can access a database at a time. That's why I'm planning to add an IPC-based server process next that is used transparently via the same interface that rdbm already offers. The System V message queue IPC code is already partly there, but I haven't tried it yet. The rdbm code has been rudimentarily tested, even with Checker, and I believe it to be free of memory leaks and pointer problems, but I urge you to have a look at the source code yourself, as I'm not infallible.
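For readers unfamiliar with chained hashing, the following is a minimal sketch in C of such an in-memory table. It is not the rdbm source: the names overlay, overlay_put and overlay_get, the fixed bucket count and the restriction to NUL-terminated strings are all assumptions made for the illustration. The hash function is the one cdb uses for its on-disk tables; whether rdbm uses the same one in memory is likewise an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024   /* arbitrary size chosen for this sketch */

struct entry {
    char *key;
    char *val;
    struct entry *next;  /* next entry in the same bucket's chain */
};

struct overlay {
    struct entry *bucket[NBUCKETS];  /* must start out zeroed */
};

/* cdb's hash: start at 5381, then h = (h * 33) ^ c for each byte */
static unsigned long hash_str(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = ((h << 5) + h) ^ (unsigned char)*s++;
    return h;
}

static char *dup_str(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p != NULL)
        memcpy(p, s, n);
    return p;
}

/* Store key -> val, replacing an existing value for the same key.
 * Returns 0 on success, -1 if memory runs out. */
static int overlay_put(struct overlay *o, const char *key, const char *val)
{
    struct entry **slot, *e;

    assert(o != NULL && key != NULL && val != NULL);
    slot = &o->bucket[hash_str(key) % NBUCKETS];

    for (e = *slot; e != NULL; e = e->next) {
        if (strcmp(e->key, key) == 0) {
            char *copy = dup_str(val);
            if (copy == NULL)
                return -1;
            free(e->val);
            e->val = copy;
            return 0;
        }
    }

    e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->key = dup_str(key);
    e->val = dup_str(val);
    if (e->key == NULL || e->val == NULL) {
        free(e->key);
        free(e->val);
        free(e);
        return -1;
    }
    e->next = *slot;   /* push onto the front of the chain */
    *slot = e;
    return 0;
}

/* Return the stored value, or NULL if the key is not in the table. */
static const char *overlay_get(const struct overlay *o, const char *key)
{
    const struct entry *e;

    assert(o != NULL && key != NULL);
    for (e = o->bucket[hash_str(key) % NBUCKETS]; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return e->val;
    return NULL;
}
```

In a design like the one described above, a lookup would presumably consult such a table first and fall back to the cdb file only when the key is not found in memory.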

I like my solution because it is fast. The in-memory buffer holding the diffs from the log file serves as a cache, too, so it should be really fast in longer sessions. rdbm obviously gets slower when the log grows large, but I provide a "defragment" function that writes a brand new cdb database and flushes the log file to speed things up again. rdbm is meant for databases that are mostly read from. The only situations where I can conceive of a significant speed problem are a big batch insertion (because of the synchronous disk writes to the log, although it should still outperform other databases that cannot write linearly but have to scatter their accesses across the disk) and a cache that becomes too large and eats all the virtual memory. I plan to add failure routines to my IPC server that will simply cause it to call the defragment routine and restart itself.
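The final step of such a defragment run could look roughly like the sketch below. It assumes the new cdb file has already been written to a temporary path, which is also how cdbmake builds its output before renaming it into place. The function name reorg_commit and the exact order of steps are assumptions for the illustration, not a description of what rdbm actually does.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not rdbm's API): make a freshly built database
 * the current one and empty the log.  Returns 0 on success, -1 on error. */
static int reorg_commit(const char *tmp_path, const char *db_path,
                        const char *log_path)
{
    int fd, rc;

    assert(tmp_path != NULL && db_path != NULL && log_path != NULL);

    /* 1. Make sure the new database is on disk before it becomes visible. */
    fd = open(tmp_path, O_WRONLY);
    if (fd < 0)
        return -1;
    rc = fsync(fd);
    if (close(fd) < 0)
        rc = -1;
    if (rc < 0)
        return -1;

    /* 2. Atomically replace the old database with the new one. */
    if (rename(tmp_path, db_path) < 0)
        return -1;

    /* 3. Only now throw away the log, whose contents are in the new file. */
    fd = open(log_path, O_WRONLY | O_TRUNC);
    if (fd < 0)
        return -1;
    rc = fsync(fd);
    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

The order matters in this sketch: if the machine crashes between steps 2 and 3, the old log is still there and is simply replayed on top of a database that already contains its entries, which is harmless as long as replaying a store twice gives the same result.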

rdbm falls under the same copyright as cdb.

WHERE CAN I DOWNLOAD RDBM?

  1. http://www.math.fu-berlin.de/~leitner/rdb/rdbm-0.9.tar.gz

If you have read this far, please drop me a mail at leitner@math.fu-berlin.de and tell me what you think about rdbm!