The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

store large tree in RDBMS

My first thought is to do records the same way you would do links between structs in memory or folders, with columns like this:

id, parent id, sibling id, first child id, node data columns

Are there established techniques and theory about this kind of thing? You could probably study the costs of common operations, like inserting or removing a record involves updating the parent record or the previous sibling.

I did see a complex implementation using a first and last integer which allowed you to know if a node was a descendent of a node without doing multiple queries to trace up the parentage.
OP - RDBMS tree
Sunday, April 06, 2008
Oracle has supported a CONNECT BY clause for hierarchical queries since something like version 3, and still enhances it now and then. There are some examples of syntax and usage here:

Usually we store a unique id and a parent id to form the relationships. There are a number of performance enhancing techniques available if you need to go that little bit beyond normal -- for example you could store the table in a hash cluster based on the parent id, which would make it very fast to query from parent to child.
David Aldridge Send private email
Sunday, April 06, 2008
Actually there is an established technique for this ... you can store the items with a reference to their parent, but the better solution eliminates the need for recursive or iterative traversals of the hierarchy.

The key is that most hierarchies stored in databases are generally read many more times than they are written.  The "Modified Preorder Tree" requires a bit of processing when you load tree items into the database, but allow simple queries to return the entire hierarchy below a specified point.

The following link (especially page 2) provides a good description of both recursive/iterative techniques as well the modified preorder tree:

The Wikipedia page that describes the various ordering strategies for storing hierarchy in a database can be found here:
Steve Moyer Send private email
Sunday, April 06, 2008
Christopher Wells Send private email
Sunday, April 06, 2008
Joe Celko's SQL for Smarties book has some stuff on handling  trees:
Arethuza Send private email
Monday, April 07, 2008
You may find that the easiest technique is to use
a hack, namely varchars to store paths, where
m characters map to a node. For example if m=2,
you could have at depth 1 the nodes 'AA' and 'AB',
and at depth 2 the nodes 'AAAA' and 'AAAB'.

So the table

ID Path
== ====

represents the tree

  /  \
  N2  N5
 /  \
N3  N4

This scheme can be faster than recursive queries, and
will also work crippled databases that support them.
It is also easier to understand than some of the more
sophisticated schemes. The disadvantage is the extra
trouble and storage it will cost you to implement.
Object Hater
Monday, April 07, 2008
replace("that support them","that *don't* support them")
Object Hater
Monday, April 07, 2008

keep a reference to the parent, if it is -1 it is a root.

Loading is a breeze using recursion.

Monday, April 07, 2008
The two models you want to pay the closest attention to are the adjacency list model and the nested set model.

Links to both of those models have already been posted in other repsonses.

If you choose to follow the adjacency list model,  you need to keep Id and ParentId,  but you do not need to keep SiblingId and FirstChildId.  There are simple queries, explained in the previously reference articles to find all the siblings or all the children of any given entry.

Inserts and certain simple queries will be fast and easy using the adjacency list model. 

Tree traversal will be difficult.  You can either use recursive methods you program yourself,  or Oracle's CONNECT BY or similar operations of different DBMSes.

If you choose the nested sets model,  inserts will be more difficult,  but a broad range of queries will be very easy.  The question about how often the tree is modified is relevant here.

Other aspects of design can influence how often the tree gets updated.  As an example, consider the supervision hierarchy in a personnel database.  If you treat supervision as a relationship between two employees,  you have to modify the tree every time someone gets a job change.  But if you treat supervision as a relationship between JOBS,  not job holders,  you only have to update the tree when new job slots get created,  or when supervision relationships among existing jobs gets reorganized.

The difference could be large.
David Cressey Send private email
Tuesday, April 08, 2008
+1 for Joe Celko's book. Lots of useful stuff in there.

I have some experience with this stuff, including a hierarchical database in an RDBMS with many rows (hundreds of millions, and support for billions).

In the end, I just elected to store the parent ID in each node, as that keeps everything simple, at least in my application.

Oracle has supported connect by prior for many years, and now SQL Server has CTEs, so extracting the hierarchy is trivial.

I found the other methods mentioned above to be less than optimal in my case, as I would always need to support a "re-org" of the IDs, which would be very expensive to compute and would occur at indeterminate times.

The problem I had was that I couldn't determine the depth of the hierarchy ahead of time. If you can do this, then you can avoid the re-org issue.
Odysseus Send private email
Thursday, April 10, 2008

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
Powered by FogBugz