The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Hierarchical database structures

Consider a content management system, which models documents that are contained in directories. The system will receive the path to each item as part of the request, such as "/section/subsection/document.html".  Now, the most natural primary key for each document is the document's path fragment and it's parent's key:

create table document (
  doc_path varchar(32) not null,
  doc_parent_path varchar(32) not null,
  primary key(doc_path, doc_parent_path)
);

alter table document
  add constraint fk_document_document
  foreign key (doc_parent_path)
  references document(doc_path);

Now, the first problem with this structure is that there's no way to represent the root of the tree because the parent path is part of the primary key, and thus cannot allow nulls. That problem could be avoided with something like this:

create table document (
  doc_id integer identity,
  doc_path varchar(32) not null,
  doc_parent_id integer null,
  primary key(doc_id)
);

alter table document
  add constraint fk_document_document
  foreign key (doc_parent_id)
  references document(doc_id);

alter table document
  add constraint ak_document_path
  unique (doc_path, doc_parent_id);

The problem here is that there's a limit on the number of documents that can ever be created because of the identity key.

There are two other problems for which I can see no obvious solution. First, I'd like to be able to find a document based on its path simply and efficiently, ideally with a single SQL query. Second, given a document's primary key, I'd like to determine its path simply and efficiently, again with a single SQL query if possible.

I can't be the only person who's tried to model this type of structure. How have you all handled it?
comp.lang.c refugee
Thursday, August 25, 2005
 
 
Try googling around (including usenet, not only pages) for "celko tree sql". Joe Celko is a SQL Guru who wrote an entire book on the subject, and has discussed his ideas all over internet.

Caveat: Trees and SQL are not exactly "a match made in heaven". Some vendors introduced proprietary SQL extensions to manage these kind of hierarchies (I only know about Oracle, but I suppose others did it), but even accounting for those, managing these kind of structures is doable, but does not feel "natural" in SQL.
Paolo Marino Send private email
Thursday, August 25, 2005
 
 
Have a look at this article:

Storing Hierarchical Data In A Database:

http://www.sitepoint.com/article/hierarchical-data-database/2
redeye Send private email
Thursday, August 25, 2005
 
 
If I was using a tree structure, I would be tending towards separating the nodes (documents) from the links (parent path) - two tables rather than one.

But do you really have a problem with the identity being a limit? Many databases use integers as surrogate keys - do you expect to reach the identity limit in your application?

However, I'm with Fabian Pascal on this one, no effort was put into SQL to handle trees. Until something tree-relevant is put into SQL, it's always going to be clunky. (e.g. the above sitepoint link basically does all the forced recursion on the tree build instead of traversal - conservation of crap, it all ends up somewhere)
Joel Goodwin Send private email
Thursday, August 25, 2005
 
 
"The problem here is that there's a limit on the number of documents that can ever be created because of the identity key."

You expect this to come up?  That's a lot of documents.

"I can't be the only person who's tried to model this type of structure. How have you all handled it?"

I've built a large web-based content management system; runs a few big sites around.  I used a structure similar to the one that you propose last -- but I don't recommend it.  Too much effort to locate paths from keys and vice-versa.  I haven't had a chance to refactor it but I've decided on a simpler model:

create table document (
  doc_id integer identity
  doc_path varchar(255) not null,
  doc_name varchar(32) not null
  primary key(doc_id), unique key(doc_path, doc_name)
);

That is, each document stores it's complete path from the root.  It makes finding the document and finding the path very quick.  You do have to enforce some structure yourself but that's extremely easy.  There's a lot of string manipulation to navigate the tree.

Queries are easy because you can use LIKE comparisons.  You can retrieve various levels of the tree in a single query and sort them in a reasonable order.
Almost H. Anonymous Send private email
Thursday, August 25, 2005
 
 
" "The problem here is that there's a limit on the number of documents that can ever be created because of the identity key."

You expect this to come up?  That's a lot of documents."

I think it's possible. Bear in mind that using an identity column doesn't just restrict the number of rows that can exist at one time. It restricts the number that can ever be created. That is, if there have been INT_MAX insertions I can insert no more, even if every row has been deleted.

I expect (well, hope) to be dealing with thousands of authors. It's not too farfetched to imagine that the limit could be hit, especially if every object associated with a page (images, etc) is represented. It might still take many years to hit the limit, but I don't want to impose a fixed lifespan on the system.

I'm willing to consider GUIDs, or some numeric ID system which re-uses deleted keys. In fact, I'm sure I'll need something like that in other parts of the system.

AHA, I have a couple of questions about your structure. First, how do you find a node's parent? Second, when the path of a node is changed, how do you ensure that the paths of its children are updated?
comp.lang.c refugee
Thursday, August 25, 2005
 
 
"That is, if there have been INT_MAX insertions I can insert no more, even if every row has been deleted."

That's one insert every second for 136 years -- seems a tad unlikely.  And it doesn't take much effort to reset the pointer value if it ever comes up.  This is pretty much an unreasonable problem.  You'll likely run into other lifetime issues decades before this value ever runs out.

"It's not too farfetched to imagine that the limit could be hit, especially if every object associated with a page (images, etc) is represented."

AHA, I have a couple of questions about your structure. First, how do you find a node's parent?

Take the doc_path and split it by the last '/'.  The first part is the parents doc_path and the second part is the parents doc_name.  I've found it's often desired to want the parent's parent or a certain level up from the root -- similar string splitting works for those cases as well.

"Second, when the path of a node is changed, how do you ensure that the paths of its children are updated?"

Using a LIKE clause in the UPDATE statement and some string manipulation to update all the child nodes.  You are replacing some prefix of doc_path with some new prefix -- it's pretty simple.  Rename would require two updates, one to rename the document and the second to move all the children to be under the new document name.
Almost H. Anonymous Send private email
Thursday, August 25, 2005
 
 
I faced a similar problem recently.
Firstly about the problem of the number of integers you can use a numeric column.
If you can structure your data like Joe Celko it is the best solution. In my case it was not possible.
So I was forced to use recursion. I did not believe it at first but it works.

Here is a UDF that will retrieve the full path from the document ID

CREATE FUNCTION F_GetFullPath (@ID int)
RETURNS varchar(2000) AS 
BEGIN
    Declare @fullPath varchar(2000)
    Declare @parentID int
    Declare @path varchar(50)
    
    select @parentID=parentID
    from document
    where id = @id
    
    if @parentID is null
    begin
        select @code = code from document where id = @ID
        set @fullPath = @code + \
    end
    else
    begin
        select @code = code from document where id = @ID
        set @fullPath = F_GetFullPath(@parentID) + @code + \
    end
END


Here is the sp that will retrieve the id from the full path. It relies on the use of the dynamic cursor that is 'auto-updated' as the temporary is polulated.

CREATE PROCEDURE GetIDFromPath
    @path varchar,
        //get full path
    @result int output
AS

    set nocount on
    
    declare @currID int
    Declare @temppath varchar
    
    Set @temppath = extract the top most dir from the path
    
    Select @currID = select id from document where path =@tempPath

    CREATE TABLE #children (id INT,path varchar)
    
    //find the children of the current id
    insert into #children select id,path from document where parentID = @currID

    declare currchildren cursor FORWARD_ONLY DYNAMIC READ_ONLY for select ID from #children
    open currchildren
    
    fetch next from currchildren into @currID
    while @@fetch_Status = 0 and @tempPath <> ''
    Begin
        
        Set @temppath = extract the next portion of the full path
        
        if exists(select * from #children where path = @temppath)
        begin
            insert into #children select id,path from document where parentID =@currID
            fetch next from currchildren into @currID,@tempPath
        end
        else
        begin
            fetch next from currchildren into @currID,@tempPath
        end
    end

    close currchildren
    deallocate currchildren

    Select @result = max(ID) from #children
    drop table #children


I was very afraid of the second sp but it seems to be going ok so far. You might have problems with the number of times the dbms allows recursion.
Anyway I hope it helps.It has worked for me (so far).
Panagiotis Grontas
Thursday, August 25, 2005
 
 
While I agree with the other poster about considering the risk of depleting the set of numerical IDs pretty far-fetched, I'd like to add that if you expect to have a "significant" number of nodes solutions based on explicit path representations and using LIKE to match substrings do risk to perform poorly.

Some other solution (either Celko's or roll-your-own recursion) will probably serve you better when the number of records grows.
Paolo Marino Send private email
Thursday, August 25, 2005
 
 
"based on explicit path representations and using LIKE to match substrings do risk to perform poorly."

It's a trade off -- improved access performance for slower update performance.  In my case, access is thousands (millions) times more common than updates. 

I currently use a recursive structure that people have mentioned above and it performs like a dog and isn't terribly flexible.  Getting path from id and id from path is easy enough but that's only a small fraction of the types of queries I need to do.  It's much better, in my opinion, to fiddle with text than to recurse into tables.

You don't have to use LIKE to match the substrings -- since you're mostly concerned with prefixes it's pretty easy just to use substrings.  Indexing will work well in your favour.
Almost H. Anonymous Send private email
Thursday, August 25, 2005
 
 
John Rusk Send private email
Thursday, August 25, 2005
 
 
Thanks for all the ideas, everyone. I'm sorry I haven't been more active in this thread. I've been unexpectedly very busy in the last day or so.

It looks like I've got a lot of reading and testing to do.
comp.lang.c refugee
Friday, August 26, 2005
 
 
>> Indexing will work well in your favour.

How does a database or which database will optimize a query with LIKE in it? I thought it would require a full table scan.
Anon
Wednesday, September 07, 2005
 
 
I looked at this. 3 ways:

Joe Celko's nested sets. Time expensive for inserts, but seems to only need Standard SQL.

Materialised Path - Path field with value like 1.2.1.3.2

Parent field.

Joe Celko's looks the best to me.

Thursday, September 08, 2005
 
 

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
 
Powered by FogBugz