The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

modeling trees in a database

I have an upcoming project where we'll need to model some data that is organized in a tree.

Anyone have any suggested reading (books or online) of alternative ways to implement this?  I can think of a couple of options, but would like to do some research before I code myself into a corner and am stuck.
Jason Send private email
Tuesday, July 15, 2008
 
 
Depends on the database, MS SQL has hierarchy
types and I guess you might be able to roll
your own tree objects in Postgresql.

If your database doesn't support trees, my way
would be to store a path as a string, but you
will find the "correct" way by reading books
by Joe Celko.

Start with "SQL Trees for Smarties". He's got
some stuff online as well, try googling.
 
 http://www.google.co.uk/search?q=joe+celko+sql+trees
Object Hater
Wednesday, July 16, 2008
 
 
A very simple approach is to add a "parent" column to each row of the database, with a reserved value for the root of the tree. Store the PK of the parent row there, and then you can traverse up or down (with recursive queries) as needed. Unless you need other functionality, this should work.
BrotherBeal Send private email
Wednesday, July 16, 2008
 
 
There is no general "right" way to do trees in databases.

Recently, after having employed more complex data models I've been working with an intentionally very simple approach of storing the identity of each node's parent in the node. Now this has the distinct advantage of being really easy to work with, but for most cases it is rather slow - especially if you want to easily support multiple database engines and are therefore avoiding vendor specific features.

The approach I came up with was to keep the data model simple and get the required performance levels by application level caching. This has actually worked out pretty well and the overall complexity is, I suspect, rather lower than if I had worked with a more 'sophisticated' data model.
Arethuza
Wednesday, July 16, 2008
 
 
There are two main approaches:

1) Store ParentID in node, as BrotherBeal suggested (adjaceny list model)
2) Store LeftID/RightID on each node (nested set model)

Option 1) is easy to understand, and harder to query against, although CTEs in SQL Server help a lot (connect by prior in Oracle, I think).

Option 2) is harder to understand, but easier to code against, e.g. you can get all of a node's child nodes by selecting where child.ID between parent.LeftID and parent.RightID. The main disadvantage is that you may not be able to insert your nodes, in which case a re-org operation would be required (like an index page split), which is very expensive computationally.

I have done a lot of work with hierarchies in SQL Server and have always favoured option 1), as I can cope with the disadvantages, especially now that CTEs are available.

I don't like Option 2), as it is unpredictable whether you will be able to perform an insert and the re-org (or partial re-org) is non-trivial.

Check out this link for lots of good stuff on Hierarchies: http://troels.arvin.dk/db/rdbms/links/#hierarchical
Odysseus Send private email
Wednesday, July 16, 2008
 
 
Hi Jason,
in my eVrika application, I store the directory structure as a tree in the database. It might help you.

The source code is located here:
http://sourceforge.net/projects/evrika/
Yiannis Burkels Send private email
Wednesday, July 16, 2008
 
 
SQL Server 2008 has a new HeirarchyID data-type. I haven't used it yet, so can't comment. Just read an article in a magazine this morning on it.

Here's a couple of online articles:

http://technet.microsoft.com/en-us/library/bb677290(SQL.100).aspx

http://tsqldotnet.blogspot.com/2008/05/storing-hierarchical-data-heirarchyid.html
Sgt.Sausage
Wednesday, July 16, 2008
 
 
> A very simple approach is to add a "parent" column

You need a "siblingID" or "siblingIndex" column too if the children of a parent have a sort order.

> get the required performance levels by application level caching

So the suggestion is "don't try to do it in the database".

> an index page split is very expensive computationally

I don't know that it is *very* expensive; but the need for splitting is a direct a consequence of having records physically arranged according to their logical sequence (e.g. if the LeftID is the clustered index), the up-side of which is that the SELECT of any branch is even faster.

I'd guess that what really makes *any* representation or solution inevitably expensive is that the whole table (or at least an entire branch of the tree) must be locked for exclusive-access during an insert (musn't it?), if the siblings are sequenced ... because otherwise you might have two INSERTs adding two younger siblings to the same existing sibling (and they musn't both be the next-youngest sibling).

OP, another solution is that database engines now have some support for working with XML.
Christopher Wells Send private email
Wednesday, July 16, 2008
 
 
@Sgt.Sausage
Yes, the new hierarchyid type looks interesting, especially when you check out the indexing optimisations.

I haven't used it yet, but I'm investigating it for my next major project.
Odysseus Send private email
Wednesday, July 16, 2008
 
 
>> get the required performance levels by application level caching

> So the suggestion is "don't try to do it in the database".

Not at all - the approach I took was to make it work, make sure I had plenty of unit tests and then add application level caching to get the performance I wanted while checking correctness with the unit tests.

Turn the application level cache off and everything still works - it's just a bit slow.
Arethuza
Wednesday, July 16, 2008
 
 
> Arethuza

I see: thank you. That's quite good: your inserts are fast, because you're using a simple schema; and your selects, which are slow because the schema is simple, are cached.

I'm trying the nested set model: so, my selects are fast; the difficulty with inserting is the need to make a new gap in the tree, using a statement like

  UPDATE Folder SET
  LeftValue = CASE WHEN LeftValue >= 7
    THEN LeftValue + 2
    ELSE LeftValue END,
  RightValue = RightValue + 2
  WHERE (RightValue >= 7);

which affects (indexed columns of) every row beyond 7.
Christopher Wells Send private email
Wednesday, July 16, 2008
 
 
Thanks for all the feedback.

Our db will probably be MySQL, not SQL Server.  We will probably need to be able to handle lots of re-ordering of the tree, so that's something to consider.
Jason Send private email
Wednesday, July 16, 2008
 
 
+1 for Celko's _Trees and Hierarchies in SQL for Smarties_, which has some more (non-vendor-specific) considerations, solutions, and optimizations.
Christopher Wells Send private email
Wednesday, July 16, 2008
 
 
There is another approach using a FullPath text.  Not appropriate for every situation, but I find it fits more times then you would expect. 

I  built a data dictionary web application that you would connect to your SQL Server and it would show a tree of all of the databases on that server.  You would then drill down to the schemas in a database, and then down to the tables in a schema, and finally to the columns for a given table. 
http://www.digitaltools.com/InstantDataDictionary.aspx

None of the data is stored by the application – so no ParentId, etc.  It is all pulled from SQL Servers’ INFORMATION_SCHEMA view.  To get the idea if you run the following on your SQL Server:

SELECT *
FROM (
SELECT DISTINCT TABLE_CATALOG as FullPath, 1 AS Level
    FROM INFORMATION_SCHEMA.TABLES
UNION ALL
SELECT DISTINCT TABLE_CATALOG+'\'+TABLE_SCHEMA, 2
    FROM INFORMATION_SCHEMA.TABLES
UNION ALL
SELECT TABLE_CATALOG+'\'+TABLE_SCHEMA+'\'+TABLE_NAME, 3
    FROM INFORMATION_SCHEMA.TABLES
UNION ALL
SELECT TABLE_CATALOG+'\'+TABLE_SCHEMA+'\'+TABLE_NAME+'\'+COLUMN_NAME, 4
    FROM INFORMATION_SCHEMA.COLUMNS
) AS DATA
ORDER BY 1

The FullPath column would look something like:

FullPath
-----------------------------------------------------------------
ReleaseEngine
ReleaseEngine\dbo
ReleaseEngine\dbo\aspnet_Applications
ReleaseEngine\dbo\aspnet_Applications\ApplicationId
ReleaseEngine\dbo\aspnet_Applications\ApplicationName
ReleaseEngine\dbo\aspnet_Applications\Description
ReleaseEngine\dbo\aspnet_Applications\LoweredApplicationName
ReleaseEngine\dbo\aspnet_Membership
ReleaseEngine\dbo\aspnet_Membership\ApplicationId
ReleaseEngine\dbo\aspnet_Membership\Comment
ReleaseEngine\dbo\aspnet_Membership\CreateDate
ReleaseEngine\dbo\aspnet_Membership\Email

So the (simplified) SQL code to pull a branch would look something like:
declare @ParentPath varchar(50)
declare @Level int

set  @ParentPath = ' ReleaseEngine\dbo\aspnet_ Membership'
set @Level = len(@ParentPath)- len(replace(@ParentPath,'\','')) + 1

SELECT FullPath
FROM (…  ) AS DATA
WHERE FullPath LIKE @ParentPath + '\%'
AND Level = @Level
ORDER BY 1

Of course you would write both of these more efficiently, but you get the idea. 

Had a similar situation with a release manager project that would pull the full paths for files from Visual Source Safe.
(i.e. $/Databases/ReleaseManager/Stored Procs/dbo.GetOpenRequests.PRC)
  Again, it wasn’t workable to store the data in my own tables and track the parent / child relationships, so the tree was built by parsing the FullPath.

Also never had a performance problem with this way since you are usually binding to a branch “on-demand”.

If you are housing your own data then re-ordering branches would be a matter of updating the FullPath column and inserts would take care of themselves.
 
So depending on your data this is my alternative way of implementing this.
JBrooks Send private email
Wednesday, July 16, 2008
 
 
Daniel Henriquez Send private email
Wednesday, July 16, 2008
 
 
Suppose the tree structure in question was a family tree. Do any difference rules apply ?
IT guy
Thursday, July 17, 2008
 
 
> Suppose the tree structure in question was a family tree.

* There's no single sort-order between siblings; e.g. siblings might be sorted by age, or alphabetically

* Each person has two parents ... or do they?

* You may not need to worry about the run-time performance of insert/update/delete
Christopher Wells Send private email
Thursday, July 17, 2008
 
 
I'm in no way a database person, but I write a bunch of multi-threaded code. Since you'd need to update the data, it seems naively logical that some of the same tricks might apply.

So, having lowered expectations, I would suggest looking at a Skip List data structure. Updating tree structures concurrently is difficult to manage, so most real-world users just lock the entire tree during modifications. A Skip List is trivial to modify as only two nodes must be locked. Its also very fast to search, like trees, at O(log n).

I'd expect it not to be that difficult to implement this in a database, but probably not be the most common path travelled. I have met very few people know of that data structure simply becaues its not taught in their college courses (e.g. "Introduction to Algorithms" has one-sentance on it). Nir Shavit has a few Skip List based concurrent algorithms published that you could draw from.
Benjamin Manes Send private email
Friday, July 18, 2008
 
 
Benjamin -

A skip list is only tree-like in performance, but it lacks the intrinsic hierarchical representation of a tree. If the OP's simply using a tree for performance benefits (using an autogenerated ID column as the branching criterion, for example), then a skip list is a fine way to go (provided it's documented since most people haven't worked with them). Very interesting idea, though - I hadn't thought to use one!
BrotherBeal Send private email
Friday, July 18, 2008
 
 
hmm, right, I forgot about that the data had relationships. I tend to see trees used purely for ad-hoc searching, so it became an immediate connection after putting out enough fires. So of course, it makes no sense here.
Benjamin Manes Send private email
Friday, July 18, 2008
 
 
>> Suppose the tree structure in question was a
>> family tree. Do any difference rules apply ?

Family trees where people=nodes aren't trees in the mathematical sense.  Spouse nodes (since they come from other trees) behave like root nodes even though they're not intended to.

Worse yet, some families have cycles. ;)
Micah Tan Send private email
Thursday, July 24, 2008
 
 
> Worse yet, some families have cycles. ;)

Some people are their own ancestor?
Christopher Wells Send private email
Friday, July 25, 2008
 
 
>> Worse yet, some families have cycles. ;)

>Some people are their own ancestor?

Oops.  In a directed graph you would need that -- I was thinking about the case of incest in appendices in history books.  The relationship lines appear to form a closed path, but it's not a cycle.
Micah Tan Send private email
Tuesday, July 29, 2008
 
 
Google "nested sets sql" for one of the better general solutions. (Not the only one, but for many applications the most suitable).

I've also seen (and used) a model that contained a "depth" field in addition to "parent", which works if you're almost always loading the whole hierarchy into memory (or an entire level of children at a time). Trickier to shift nodes around the tree than with nested sets, but just as easy to extract a set of children.
Jason Truesdell Send private email
Thursday, July 31, 2008
 
 

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
 
Powered by FogBugz