The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Recursion in Databases

I've been asked to set up database for a conservation breeding program. I've picked up enough SQL to get started, but I'm stumped when it comes to generating a pedigree chart.
I need to be able to show the lineage of an animal, back 4 or 5 generations.

As a simplistic example, consider the following:

    id | father | mother
    ---+--------+-------
    1 |        |
    2 |        |
    3 |        |
    4 |        |
    5 |  1    |  2
    6 |  3    |  4
    7 |  5    |  6

Odd numbered-animals are male. 1, 2, 3, 4 are the original stock. 5 is the offspring of 1 and 2; 6 of 3 and 4; and 7 of 5 and 6.

I can get the paternal grandfather with something like:
 
  SELECT r.id, gramps.father AS grandfather
  FROM r JOIN r AS gramps
  ON r.father = gramps.id;

This gets exponentially more complex as we go back further , and try to include all parents.  Not to mention actual data about each animal (name, date of birth, owner...).

I'm really a programmer, so my inclination is to fire up a compiler and start coding. Any database folks out there care to offer alternative approaches?

Should I try creating one view with a full pedigree, or one view per generation, or one view per ancester (e.g., father's mother's mother's father)?

Can I create an SQL function to help perform this task inside the database?

Would I be better advised to pack it in, and switch to stamp collecting? At least stamps don't have ancestors.
Mike Send private email
Tuesday, June 20, 2006
 
 
Have you read this:

http://www.bookpool.com/sm/1558609202

"Joe Celko's Trees and Heirarchies in SQL for Smarties".

I'm pretty sure he covers the family tree example in there.
Chris Tavares Send private email
Tuesday, June 20, 2006
 
 
JW, Brooklyn NY
Tuesday, June 20, 2006
 
 
> Any database folks out there care to offer alternative approaches?

You're using the "adjacency" model, which stores (and therefore lets you process) one layer at a time; in contrast, the "nested set" model can work with a whole (multi-layer) branch. For details see for example _Hierarchical data in RDBMSs_: http://troels.arvin.dk/db/rdbms/links/

> Can I create an SQL function to help perform this task inside the database?

Yes, a (specific-database-implementation-dependent) "stored procedure".

> Would I be better advised to pack it in, and switch to stamp collecting?

An "object database", instead of a "relational database", is said to be better at chasing pointers ...
Christopher Wells Send private email
Tuesday, June 20, 2006
 
 
I think the real question here is, "What if an animal has more than two parents?" Remember: Zero, one or infinity...
Jesse Millikan Send private email
Tuesday, June 20, 2006
 
 
> I think the real question here is, "What if an animal
> has more than two parents?" Remember: Zero, one or
> infinity...

Yes, it's certainly important to guard against changes in the laws of nature when designing your application.  Good eye!
Old McDonald
Tuesday, June 20, 2006
 
 
More like... what if you don't know an animals parents?
anon
Tuesday, June 20, 2006
 
 
I agree, it's not a good idea to set that Mother and Father thing in stone.  What about cloned animals?
Kyralessa Send private email
Tuesday, June 20, 2006
 
 
For what it's worth, I looked at a nested set model for something similar and got scared off. There are several things about it that just don't sit right with me. I think that it is good in theory but not in practice.

1) It seems easy to mess up the data such that some relationships are lost. If there are holes in the sequence you can potentially never figure out how to get it back. One key tip off to this is that most examples of nested sets lock the table before doing any changes.

2) It seems like it would only be realistic for small amounts of data. In other words, it doesn't appear to scale well. Imagine inserting a new row into the table with 1Million rows. You would potentially have to adjust more than 50% of the rows to accomodate it. I personally feel that if you have a small amount of data anyway you'd be better off just doing any recursion on the client side and use a normal data model instead of the nested set model.

3) It doesn't support disconnected users well. If someone takes an image of the database and inserts rows on their own laptop, it may be impossible to merge them with changes made by other users. You have major concurrency issues in this scenario that may not be able to be fixed. It goes back to the requirement to lock the table in order to make any adjustments. If you aren't using a centralized database you are hosed.

4) Making simple data entry screens becomes very painful. People don't readily understand the model and have a hard time visualizing it. This makes creating a data entry screen difficult. The whole parent/sibling relationship is much more natural. Not to mention the fact that adding a new entry means changing just about every other row in order to make room.

It just doesn't seem worth the trouble. But that's just my opinion. As I said before I got scared off by the apparent complexity. I potentially could be completely off base and simply not understanding it.
anon
Tuesday, June 20, 2006
 
 
I agree that it's imperfect but I don't know a better way.

> 1) It seems easy to mess up the data ...

With the adjacency model too you can mess up the data, e.g. introduce a cyclic relationship.

> 2) ... scale ... Imagine inserting a new row

On the other hand you can get a whole branch with one select. Perhaps it's optimized for reading rather than for writing.

> 3) It doesn't support disconnected users well.

I don't know that the adjacency model supports them better. If it does, you can convert from the adjacency model to the nested set model and vice versa.

> 4) Making simple data entry screens becomes very painful.

I don't think you expect the user to type in all the values by hand.

> Not to mention the fact that adding a new entry means changing just about every other row in order to make room.

True, but the change is implemented by a single UPDATE statement (increasing the positions of all the right-most children by a constant).

> As I said before I got scared off by the apparent complexity. I potentially could be completely off base and simply not understanding it.

Two clear descriptions of it that I've read so far are http://www.sitepoint.com/article/hierarchical-data-database and http://mrnaz.com/static/articles/trees_in_sql_tutorial/

-----------

Anyway, apart from nested sets, here's another model, which caches the variable-length "lineage" of each record in an additional column: http://www.sqlteam.com/item.asp?ItemID=8866
Christopher Wells Send private email
Wednesday, June 21, 2006
 
 
In Oracle you'd just use a CONNECT BY clause to do this. If the depth of the hierarchy is known to be limited to some reasonable number I could see using a view that self-joins the table the required number of times to get the complete parentage.
David Aldridge Send private email
Wednesday, June 21, 2006
 
 
I am a known Object Database fan, so maybe I'm a little biased here, but I would go the ODB route here. If you have a class Animal with a method GetFamily(generations) then you easily implement it to any required level using recursion without ever having to know how many parents are involved at each level or if any of them are the same animal, which is more than feasible.
Paul Brown Send private email
Wednesday, June 21, 2006
 
 
There is an alternative to the nested set that strikes me as being potentially easier to work with, at least for writes, although I think you'll be forced to use the LIKE operator in your queries.

Since I don't remember the name of the model; let's call it 'node list' or 'concatenated ID'. For the beginning of a new 'tree', you create a node ID of, say, 1. The first child would then get a node ID of 1.1, the second child would be 1.2, a child of 1.2 would be 1.2.1, and so on. Another new 'tree' would start with 2 and then following generations would follow the same pattern. A relatively minor modification would have two 'node ID' fields to accomodate the dual parentage we normally associate with ancestry while still making it possible to represent parthegenesis, cloning, etc. by using only one of the node ID fields when appropriate.

Use of the LIKE operator makes it easy to find all descendents of any given parent, starting in any generation (WHERE nodeID LIKE '1.1.2').

Finding a specific ancestor is as easy as stripping the trailing node(s) (SUBSTRING()) and using the remaining string as the criteria.

Counting the dots (periods) makes it easy to find the generation number, although it wouldn't be tough to include a field that maintains a generation count as a static value.

Adding new entries is as simple as taking the max of the 'trailing node' and adding 1 (for a sibling entry) or simply appending a .1 (for the first child entry). This might be simplified somewhat by actually tracking the 'sibling number' in a separate field.

Nested set is 'read optimized' and this is 'write optimized' (that LIKE operator could cause grief if you end up with a lot of data).
Ron Porter Send private email
Wednesday, June 21, 2006
 
 
By the way, one nice thing about using an object database is that if you factor out your object CRUD methods separately, you can spend your time writing code and getting the object model right, and then when that's done, you can write a new set of CRUD methods with the same interface but this time have them use a relational database.  That way you can, for the most part, deal with code bugs and SQL bugs separately.

Something like db4o works well for this since it has an unlimited evaluation period; just use it while you write your code, then rewrite to use your SQL database when the code's ready.
Kyralessa Send private email
Wednesday, June 21, 2006
 
 
Kyralessa - I don't understand; having written the entire thing around an ODB, why would he want to rewrite it around SQL?

I find your lack of faith disturbing...
Paul Brown Send private email
Wednesday, June 21, 2006
 
 
Maybe you need a geneology program instead of a db app.
outside the box
Wednesday, June 21, 2006
 
 
I think the OP has some decent ideas, but needs to look at whether this is a primarily OLTP or OLAP database.

This question is more of an architecture question, which is design, which is really compromise.
D in PHX Send private email
Wednesday, June 21, 2006
 
 
There's a php lib for nested sets in PEAR. Nested sets are fun.
Matthias Winkelmann Send private email
Wednesday, June 21, 2006
 
 
I don't think the concatenated ID approach is very suitable here. First there's the multiple parent issue. But them it's dificult to predict the size of the keys. If there are many levels in the hierarchy and/or a lot of animals (that require large IDs) the keys will be too big to handle efficiently.
Axel Send private email
Thursday, June 22, 2006
 
 
Good point about the potential scalability of concatenated ID. My usage has been restricted to Bills of Material where 7 layers is pretty deep. I've actually never seen a BoM with more than 5 layers, but I've only been in manufacturing for a few years.
Ron Porter Send private email
Thursday, June 22, 2006
 
 
You could store the path to the offspring:

Original parents (male always on the left)
A,B,C,D
E = (A,B) male
F = (C,D) female
G = (E,F) = ((A,B),(C,D)) male
H = (C,B) female
I = (G,H) = ((EF),(C,B)) = (((AB),(CD)),(C,B))

Table GenePool like:

Name | Gender | Path
A    |  M    |
B    |  F    |
C    |  M    |
D    |  F    |
E    |  M    | (A,B)
F    |  F    | (C,D)
G    |  M    | ((A,B),(C,D))
H    |  F    | (C,B)
I    |  M    | (((AB),(CD)),(C,B))


Select all descendatns of $1
=============================
Select * from Genepool parent, Genepool desc where
instr(desc.path, parent.path) > 0 and parent.name = $1

Select 3rd level generation
=============================
Select * from genepool where instr(path, '(((') > 0

Select parents of I
====================
select * from genepool father, genepool mother, genepool desc where
father.gender = 'M' and mother.gender = 'F' and
concat('(', father.path, ',', mother.path, ')') = desc.path

This may get expensive, you may be better off with storing it's parents too.


Select ancestors of $2
=====================
select * from GenePool parents, Genepool.desc where instr(desc.path, parents.name) > 0 and desc.name = $2
Dino Send private email
Friday, June 23, 2006
 
 
One approach I've used before is just to use the standard adjacency method you describe, but to back it up with a separate two-column table/binary relation which forms the transitive closure of the adjacency relation.

You would then need to update this closure table periodically to keep it in sync. You could write fairly optimal code to update only the portions of the closure table necessary on each update to the main table, or, as I did, you could just regenerate the whole thing every night (in my case the data didn't change much, and having it a day out of sync for the selects wasn't a big deal)

Then to do things like finding all parents, you just join to the transitive closure table, and get them all in one query. If you want them in order too, you'll need an extra column on the transitive closure relation, storing the number of 'steps'.

Also: is there a reason you're using odd/even IDs to designate Male or Female? Sounds a little odd from a data modelling point of view. I would have a separate column for Male/Female. That column could still form part of the primary key, if you wanted it to.
Matt Send private email
Sunday, July 02, 2006
 
 
Don't feel bad about it though - SQL is known to be nororiously bad at handling these sorts of issues in an elegant matter! A better-designed relational query language  should make much lighter work of recursion - see

http://en.wikipedia.org/wiki/Tutorial_D

For an (admittedly slightly vapourware-ish, but interesting) example
Matt Send private email
Sunday, July 02, 2006
 
 

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
 
Powered by FogBugz