The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Modeling an equivalence relation in a database

Hi all,
What's the best way to do this?

Say I have is a list of people who are the same height (let's assume discrete height values, just for the example).  For example:

Adam | Bob
Bob | Chris
David | Eric
Chris | Frank

which means that Adam, Bob, Chris, and Frank are the same height, and David and Eric are the same height.  Now the trivial way to put this into a database is just two columns, each with a unique identifier:

p1    p2
Adam  Bob
Bob    Chris
David  Eric
Chris  Frank

But then you can't easily answer "Who are all the people who are the same height as Adam?"  You could do it, but it'd be a complicated query, right?

Another way is like so:

height person
1      Adam
1      Bob
1      Chris
2      David
2      Eric
1      Frank

That lets us see all the people that are the same as Adam.  But then what if I find out Eric = Frank?  I'd have to update all the rows again.

Is either way commonly used?  Is there another way I'm missing?

Sorry to repost so soon, especially since this is similar to my other question, but I'm not sure how else to figure this out.  Thanks!
Sedate Snail Send private email
Thursday, August 02, 2007
I must be missing something, but wouldn't you just store the height, so that you could return all people where height = Bob.height?
D. Lambert Send private email
Thursday, August 02, 2007
Yeah, that's true, except what if I don't know the height?  Like if all I have is this magic two-column list that someone gave me.
Sedate Snail Send private email
Thursday, August 02, 2007
"Yeah, that's true, except what if I don't know the height?  Like if all I have is this magic two-column list that someone gave me."

Then you have a bad design.  You have already mentioned the update problem.  Then, there is what if Frank grows an inch?


Gene Wirchenko
Gene Wirchenko Send private email
Thursday, August 02, 2007
Right.  I'm asking if anyone could clue me in to what the right design would be.  Let's assume for the sake of example that nobody ever grows.
Sedate Snail Send private email
Thursday, August 02, 2007
There is nothing common about the requirement so you can't say either solution is commonly used.  You'd have to know about the requirements to make a choice.

If you can get away with it the second choice is simpler and easier to work with it.  But if you had to be able  to add and delete equivalent pairs very frequently it might not work so well. 

A third choice that might be workable if your equivance classes aren't too big would be to record all equivalant pairs in the database with a flag saying whether the equivalence was data entered or deduced from data entered equivalencies.
Bill Hamaker Send private email
Thursday, August 02, 2007
Put the known height values in one table and then update the height value in the person table when you know it.



Now you can find all the people with a height or all the people who's height you don't know (assuming null for "don't know")
Just A Guy Send private email
Thursday, August 02, 2007
I'd use the second.... (Assuming this is not really about heights , where i would enter actual heights)....

create table height_equiv(name varchar2(40),equiv number);
create unique index on height_equiv(name);

To enter an equivalence of a=b i would

check if a has an equivalence no
check if b has an equivalence no

if none have one, i'd issue a new no and refer both to it.
if only one has one i'd refer the other to the one found.
if both have one, i would update all the rows refering to the largest to the smallest

in your example
Adam | Bob
Bob | Chris
David | Eric
Chris | Frank

step 1:
insert Adam,1
insert Bob,1

Step 2:
insert Chris,1

Step 3:
insert David,2
insert Eric,2

Step 4:
insert Frank,1

in case you get an "Eric = Frank" you would do a simple:
update height_equiv set equiv=1 where equiv=2;

how many equivalences do you process at once? how many do you process per hour? day? how many total equivalences will you store??

for less than "millions of records" it should be ok.
Totally Agreeing
Thursday, August 02, 2007
Okay, cool.  Yeah, I think I'll stick with the first way then (just saving all pairs in the database) because I will probably be entering equivalent pairs a lot and it preserves the most information.  I just wanted to make sure I wasn't doing anything monumentally dumb either way.

Thanks so much for the help.
Sedate Snail Send private email
Thursday, August 02, 2007
Whoops, missed your post, TotallyAgreeing...

Yeah, option 2 seems more elegant to me too.  However, I would like to know that I directly saw that Adam = Bob, whereas I just inferred that Adam = Chris.  And yeah, it might be a million or two records, but it won't be accessed often, and I don't think speed is an issue, especially when inserting data.

I'll see if I can think of a good way to do the lookup for the first design... if not, I'll just scrap the first design and go with the second, like you said.  Thanks!
Sedate Snail Send private email
Thursday, August 02, 2007
And yeah, it's not about heights. (surprise!)  It's about phrases in different languages.  But I didn't want to get into discussions about why "house" = "casa" = "Haus" was a dumb thing to say.
Sedate Snail Send private email
Thursday, August 02, 2007
Heh. (Sedate Snail and I discussed this using the dictionary example he just now referred to.)

I think you'll find that in general, databases are really a poor solution to this type of "association" or grouping problem.

If you did not have to use a database, I'd recommend implementing a bucket sort (that works on objects), and then simply serialize each object as needed.

Then you can create a 6' bucket and simply throw Adam, Bob, Chris and Frank into it. If Chris grows an inch, resort him, that is, move him into the 6'1" bucket.
Thursday, August 02, 2007
Oracle has a CONNECT BY query that will allow you to walk trees this way.

SELECT name1
FROM pairs
CONNECT BY name1 = PRIOR name2
START WITH name1 = 'adam';

Other databases may have something equivalent. As the other posters have said, really you're writing against SQL rather than with it.
George Jansen Send private email
Thursday, August 02, 2007
sorry for the word-related comment: for languages and words I don't think any of this will work.

A word could have multiple meanings in the same language, and could mean different things in different languages:

- the word "comment" in french means "how" ...
- many words have many different meanings for a simple word like go, in english look at
- words in different languages are rarely equivalent, look at a "paper" dictionary, or any good online dictionary.
Totally Agreeing
Thursday, August 02, 2007
Hmm.  Yeah, I guess you're right.  I just figured a database is the canonical way to store large amounts of data (a million or two records).  I don't have to use one; I could make the phrases objects, bucket sort them, and serialize them, but that raises two issues in my mind:
1. You need my code to get them back (not such a big deal, as you'd need my data layer code to make sense of that database, anyway)
2. Wouldn't I need to load all of the data into memory before dealing with it?  Like if I wanted to look up "the dog", I'd have to load a million entries, which is infeasible.

And as for the language issues: the reason I've been ignoring them the whole time is that I'm not saying that "the dog" is always the same as "el perro."  I'm saying "the dog" has been seen in some bilingual source translated as "el perro."  I'll deal with how to use that to actually translate later; I'm just trying to record it right now.
Sedate Snail Send private email
Thursday, August 02, 2007
Yeah, I was missing something.

So my first impression was that there was an actual attribute (height) that all of the people shared.

I'm now thinking this is more of an association thing.  When I see thesaurus listings, for example, I see relationships like this:

quick --> fast
fast --> starve
quick =/= starve

You want to establish associations between items, and there may be many-to-many associations, but just because you're associated to an item doesn't automatically mean that you're associated to all the items that item is associated with.

So back to your original example, you'd have data to support "Adam is related to Bob" (and vice-versa), and "Bob is related to Chris", and you would have to decide whether to traverse the tree to two layers, or three, or whatever.

It may be that this isn't what you want at all, but otherwise I'd be afraid that you'd just end up with everything being related to Kevin Bacon eventually, you know?
D. Lambert Send private email
Thursday, August 02, 2007
Right, but I think I'd still like to know it.  What I'm really saying is
"'rapido' has been translated somewhere as 'fast'"
and maybe I'd also have
"'fast' has been translated somewhere as 'schnell'"
so I'd like to know that 'rapido' might be related to 'schnell'.  (or it might not, who knows... but it's likely.) You're right about limiting it, though; I don't think I'd want to go more than one layer like that, because at each layer I go through, that's more uncertainty.  And how likely is it that I'll have, say,
(spanish phrase) -> (english phrase)
(english phrase) -> (french phrase)
(french phrase) -> (german phrase)
but not have spanish->german or anything more direct, you know?

So maybe just saving all the pairs is fine, and I could do some kind of complicated but not terribly complicated SELECT to find these elusive pairs.

Thanks again!
Sedate Snail Send private email
Thursday, August 02, 2007
If the equivalence is explicit, then I have a table of equivalency classes and a link table from there to the entities.

If the equivalence is implicit or emergent, I try not to encode that in the database.  In any case, if there does need to be a table in order to make the queries manageable (A ~ B and B ~ C and C ~ D, it's difficult to write arbitrary n-level joins in SQL Server to see that A ~ D also), such a table can be built and updated with a trigger or completely re-built on a schedule.
Cade Roux Send private email
Thursday, August 02, 2007
I think you are approaching this from the wrong angle. Most "translation" problems become more tractable if you have (or "invent") a common uber-language to act as a sort of rosetta-stone.

Conceptually, this would be analogue to having an actual height value in your original example.

So I'd go for something like:

1) A table of "meanings". Feel free to express them in your own language (English). Latin, Esperanto, Russian, Klingon. Just pick one which would work best for you.

TABLE: RosettaStone
MeaningID (PK, Sequence)
Meaning (Text)


8696,"This is a fine wine"
6749,"Is this the railway station?"

TABLE: Translations:
MeaningId (FK to RosettaStone)
TranslationId (PK, Sequence)
Language (ISO code?)
Translation (Text)

8696,26879,'IT',"Questo è un ottimo vino"
6749,76859,'IT',"E' la stazione dei treni, questa?"
6749,78694,'IT',"Questa qua è la stazione?"
6749,78443,'EN',"Is this the railway station?"
6749,89595,'FR',"C'est la gare, ici?"

(My French is poor, but I hope it suffices for the example).

This way you have:
* Italian version of the main meaning, including "synonyms".
* English version of the main meaning (it doesn't matter if it is the same as the "canonical" meaning row, you pay a little redundancy but you gain in simmetry for your operations). Assuming you use English (or any other "real" language that you want to store in the DB) you gain the side effect that you can specifiy "synonyms" in English while keeping the "Rosetta" table less cluttered.
* You can easily find:
- "all the way to ask if this is the station in Italian" (or whatever language).
- The french translation of the Italian "Questo è un ottimo vino", assuming it exists.
- all the meanings for which you don't have a Spanish translation.

and so on.
Paolo Marino Send private email
Friday, August 03, 2007
Cade Roux: A ~ B and B ~ C and C ~ D, it's difficult to write arbitrary n-level joins in SQL Server to see that A ~ D also

In SQL Server 2005, you could probably use a recursive common table expression to do this job [1].  The main problem would be handling the reflexivity of the ~ relation so that you could be sure that the result would be the same for both of these data sets:

(A, B), (B, C), (C, D), and
(A, B), (C, B), (C, D)

Henshaw Send private email
Friday, August 03, 2007

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
Powered by FogBugz