The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Disk-based Read-only Binary Trees?

I need to be able to locate a record in an index as fast as possible (O(1), please).  The index may have over 100M records.  Worst of all, the record is identified by a single key, but the search is for a range of a keyspace.  e.g. if I'm storing integers, I need to be able to find all records whose key is between, say, 955,373,100,000 and 955,373,999,999.

A binary tree seems to lend itself naturally to the search but I haven't had any luck finding a disk-based one with any reasonable create performance.  They don't come even close to saturating disk I/O since they attempt to maintain a complex on-disk structure that supports updates/deletes.  Not necessary in this case, it's easier to just recreate the index.

I can roll my own implementation but I'd rather avoid re-discovering tree-flattening disk-seek minimizing techniques that someone based their thesis on.

Any hints?
Michael B
Friday, July 20, 2007
O(1) is not realistic.
Friday, July 20, 2007
Also, one usually does have to choose between Creation speed and Access speed.  Your specification is on Access speed, but your criticism is on Creation speed.

The nice thing about "roll-your-own" solutions to this is that YOU get to choose what and where to optimize.  The bad thing about "off the shelf" solutions here is that often you're stuck (as you seem to be) with the "vanilla" decisions the off-the-shelf developer made.
Friday, July 20, 2007
Oh!  And in SQL, this would be something like

Select * from MyTable
WHERE DesiredID > 1000000 AND
DesiredID < 100099

Or whatever.  In other words, would a nice MySQL database solve your issue?  PostgreSQL perhaps?
Friday, July 20, 2007
Michael B
Friday, July 20, 2007
Keyword ISAM

Friday, July 20, 2007
I'd be sorely tempted to do an in-memory, sorted index of on-disk record offsets, keyed by the field of interest.  Then you can do a binary search of the in-memory index to find the 'first value' record (or the record that would follow the first value if the first value isn't found).

This would give you an O(log2(n)) search for the first record, and O(1) for each following record.  And yes, it would be a very large array.  Still, 26 'searches' for a 100e6 sized array would be pretty quick.

Your other problem is, if anybody DID do something like this for a PhD thesis, they've probably already been hired by Oracle and the results of their research buried in Oracle's search algorithms.
Friday, July 20, 2007
Pardon me if I'm restating the obvious, but

1. O(1) you can manage by hashing, but hashing won't give you range search.

2. Tree search, O(log n) in databases uses B+ trees, not binary trees. B+ trees do have bookkeeping overheaded, but the advantage is minimizing i/o--you have a shallow tree with fat nodes. In the best case, your root node and the first level below that are cached in memory.

It seems to me that you might have a look at Gray & Reuter's _Transaction Processing_.
George Jansen Send private email
Friday, July 20, 2007
How much information is required in the base record(s)? You could have an in-memory database with keys associated with small remord objects. If more details are reqiored, you can then pull out the full datum on a slower, traditional database. A local cache could then help you manage the full records effectively.

In a distributed environment, you could use a remote caching layer like memcached. Few have an SQL-like interface available, so you'd need to do a multi-key request with a collection representing your key range.
Benjamin Manes Send private email
Friday, July 20, 2007
As a previous poster indicated, you want to use a B? tree.
Basically each node has n keys and n file offsets, where n is chosen so the record is large (like a sector or more). At the lowest level, the key values are sorted key values and the offsets point to the corresponding actual records. At each level above this, the offsets point to nodes at the next lower level, the keys are sorted and all records between key(k) and key(k+1) are found below the node pointed to by offset(k+1). Search is O(log base(n) of N).
    The problem is when you want to insert a new record and the lowest level node is full, you have to split it and push the split up the tree (the B+ tree code does this for you).
    If you insert records one at a time, the tree code is doing a lot of this splitting and pushing up.
    If you have a lot of records to start, IN SORTED ORDER, you can build the tree bottom up keeping current nodes for each level in the tree. When you write out a new record, you add its key and offset to the end of the current lowest level node. If the node is now full, write it out and store its key limits and offset in the next level etc. This takes about as long to build the initial tree as to write out the records. If the records are already sorted on disk, scan them and build the tree.
  As mentioned above, typically the top few levels are kept in memory for faster access.
Saturday, July 21, 2007
What about storing the tree in an array (using indices instead of pointers) and simply memory map that array to the disk?

The operating system takes care of caching issues.
Sunday, July 22, 2007
Try sqlite
Anon Send private email
Sunday, July 22, 2007
You don't need a trees if its read only and the records are fixed size. You just sort the records and do a binary search on them. It hard to make it faster than that.
Sunday, July 22, 2007
I'd go with AllanL5's suggestion too. Just store in key order the key and offset of the record in the database, and do a binary search to look items up. The code for such things is very simple, fits your needs fine, and searching should be extremely quick. (Assuming none of it ends up in the swap file, I would imagine that lookups will be limited solely by memory bandwidth.)

At 100M records, assuming 8 bytes per offset and 4 bytes per key, total space for the array is a little over 1GB. That's not a small array, but it's by no means impossible large, and should fit nicely into physical memory without causing big problems. If you find yourself running out of virtual address space (this is possible) then you can split the table up further and have each chunk separately. I can't see this slowing or complicating things significantly.
Sunday, July 22, 2007
Hmm, the integers you mention aren't 32 bits, so the table would be more like 1.5GB. Oops! Maybe you'd need a 64 bit PC, then. I still think the approach has merit though.
Sunday, July 22, 2007
Just use sqlite.  Please.
Seun Osewa Send private email
Sunday, July 22, 2007

A single-file database.  It uses some sort of b-tree internally.  Very fast and stable.  Used by Apple, Adobe, Google (I think), Mozilla, etc, etc.
Monday, July 23, 2007
The sorted array on disk approach is elegant, but isn't future-proof.  The entire data set needs to fit into RAM if you want the sort to finish for my definitions of fast.  100M records is workable with conventional hardware, but becomes infeasible at 1B.  Of course we can make two passes or split the dataset or do whatever other hack we need to do, but then it stops being so elegant.

I'm not writing it off, I just wanted to avoid thinking.

SQLite's a good lead.  I'll report back someday with the result. ;)
Michael B
Monday, July 30, 2007

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
Powered by FogBugz