The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Database artificial keys ++

Ok, the previous thread seems to have disappeared into personal insults and suggestions of "don't design your database that way", so lets get back to the question...(at least this is what I thought the original poster was asking)

You have a table which looks something like this:

SentenceText          SentenceType
Hello                  h1
world                  h2
it's hot out there    p
in london              p
isn't it              p

Do you keep the SentenceType as h1, h2 and p (and whatever else they can be) or do you create another table of

SentenceTypeID        SentenceType
1                    h1
2                    h2
3                    p

where ID is the primary key (what I would call an artificial key), and there is a unique not null on the type.

My feeling would be to not bother with the second table IF you know the types are stable (i.e. don't change often or at all). If they change every day then yes, maybe a second table - depending on how big the first table is.

Another Contractor...
Thursday, July 20, 2006
Agreed, you wouldn't bother with that other table. UNLESS there was more info you were attaching to h1, p2, etc..
Thursday, July 20, 2006
What I was trying to say before I got sucked into the personal insults thing (my bad, I wouldn't tolerate it from my 7 year old) is that there are really 2 issues here. One, do you store the sentence types in a separate code table, and two, if you do, then do you use a new, artificial key.

The separate table gives you more flexibility, if that's something you want, but creating an artificial key is dubious, if the code value is really unique.
Steve Hirsch Send private email
Thursday, July 20, 2006
I'm a suspicious type myself, so I tend to not depend on "data being unique".  Thus, if I haven't generated the unique key myself (or had it autogenerated for me, I'm not THAT suspicious) then I don't use it as a 'unique key'.

IF you create the separate table, then YOU have control over how the various h1, h2 etc are mapped.  So, if you assign a unique key to each one, you can put in your 'foreign key' columns the value of the unique key YOU have control over.

When you do that, you also have control, should somebody generate a 'garbled' h1 ('h1a', say) which doesn't already exist in your table, to decide whether to add it, ask about it, or whatever.  If instead you're automatically using it as its own key (in effect), then there may be no way to detect when bogus 'tags' are seen.  They'll be happily added to your table, with resulting confusion.
Thursday, July 20, 2006
Allan has a point about garbled values - althought that could be cured with a separate table with a list of valid entries and a foreign key constraint - you still don't need the artificial key (OK, you get a silly table with only one column...but at least you don't need to join to it to see the actual values rather than the artificial)...

I also have a concern about indexing and artificial keys - eg you have text strings like h1,h2, p that you map to an artificial key 1,2,3. Now you know you are always searching/sorting on that field (so you always have to join to the key table) - and you often want to do like 'h%'. If you put a clustered index on the field in the data table and it has the text strings in that would definately help - if you put a clustered index on what is effectively a meaningless number then you (I think) lose the benefit of the index on all your sorting when you want to actually see the strings?

(Yes, you'd have a view to recombine and could index that but you've gained a layer of complexity to get back to the original table structure!)

Personally I tend to go with if the items are stable, and you don't need any other information about them then don't bother with another table. (eg the html tags would be stable because they are presumably defined by w3c, similarly ISO currency codes). IF you need more information about them eg GBP, Great Britain, Pounds, Pence then obviously you've got another table. Then look at how you will be searching/sorting your datatable to see whether you want the overhead of ALWAYS joining to your key table to get the currency code back to decide whether you need an artifical key!

Plus I think there are questions about speed of joins on int and character keys - <4 characters should be as fast as an int!

Just my 2p worth....(also it makes a really good interview question)
Another Contractor...
Thursday, July 20, 2006
Sticking with the question strictly, the answer comes from a rule that is used in data modeling, and it illustrates the difference between codes and identifiers.

Codes are symbols whose domain is finite--there can only be so many of them. Like the code that appears on your driver's license to describe your eye color. Identifiers are symbols whose domain is potentially infinite (like a list of Customers). A typical identifier is an auto-incrementing integer.

In data modeling and design, codes may serve as a key; but, any potentially infinite list ought to have an identifier.

Since the list of HTML tags (read: codes) is finite, they can be used as a key.
Thursday, July 20, 2006
How will your table hold something like:
<h1><i>Hello <u>world!</u></i></h1>

I think in trying to store XML in an RDBMS your
getting into the ORM problem space.

Honu Send private email
Thursday, July 20, 2006
I would use additional "key id"s only and only if the "key value" is a long string. By long I mean >15  ....

By thumb, I think as long as you're under 15 chars, joining/indexing performances shouldn't be so different from having a simple integer (I repeat: just as a thumb)
Indexing performance issues apart, I dont see other reasons to introduce surrogates like that

Now to Honu's question:
I guess that would be

SentenceText                  SentenceType
<i>Hello<u>world!</u></i>  h1
Hello <u>world!</u>          i
world!                            u

eerrrrrr....... looks weird, did I miss something? :)

of course! HTML is hierarchical.. We should extend the table a bit. Representing a node tree on a db is an academic classic:

SentenceId  SentenceText  SentenceType  FatherSentenceId
1                NULL              h1                  NULL
2                Hello              i                    1
3                world!            u                    2

-- SentenceId is the PK
-- SentenceText is nullable
-- SentenceType is not nullable and related to some "AllowedSentenceTypes" table whose PK is directly the id itself (or maybe some surrogate - see what I wrote in the beginning)
-- FatherSentenceId is nullable and related to SentenceId in the table itself
-- if FatherSentenceId==NULL it means the sentence is root
-- FatherSentenceId must be different from SentenceId  (a node cannot be the father of itself)

The tabbings will be horrible for sure, but I dont have other ways to express myself :)

bye !
Johncharles Send private email
Thursday, July 20, 2006
> Thoughts?

There seem to be 4 options:

1) Don't implement the DB's enforcing that the SentenceType is limited to any particular finite set of values (codes). Downside: obvious.

2) Use a CHECK constraint to enforce it. Downsides: some DBs (e.g. flavours of MySQL) ignore CHECK constraints, and the set of permitted values is less easy to query and/or change at run-time.

3) Use a FOREIGN KEY constraint on a second table which contains the set of permitted SentenceType values. Downside: a 2nd table, and keying on a string is slower than keying on an int.

4) Use a FOREIGN KEY constraint on a second table which maps the set of permitted SentenceType values to an integer key. Downside: a 2nd table, and an extra level of indirection (you're now storing a possibly-less-useful SentenceTypeId in the original table).
Christopher Wells Send private email
Thursday, July 20, 2006
Yep - those are the options as I see it...which do you prefer?

I like the answer from above about the difference between codes and identifiers - that does seem to make a nice distinction (thanks) - which is pretty much what I had come up with. The only caveat to that is that if you are writing a system where as soon as a new identifier appears in your raw data (eg a new customer name) you create a new record in your Customers table then just use the int ID of that new record in your data table then you aren't gaining anything over storing the customer name in your data table

(yes, I know in a real system you'd have to have a separate table for customer so that you could store customer details like address to be normalised - pretend this is a system where all you need for a customer is name)
Another Contractor...
Friday, July 21, 2006
Am I missing something?

If the set of types is finite, you can just make it an ENUM type (in either your database or your code).

This will both save processing time/protect against garbled values, AND give human readibility....
PHDude Send private email
Saturday, July 22, 2006
> you can just make it an ENUM type

In MySQL perhaps, I don't think in MS SQL.
Christopher Wells Send private email
Sunday, July 23, 2006
If not in MS SQL, then do it the other way -

define constants in your code

use an unsigned int in the database

of course, if you're going to be changing the elements alot you probably do want to use a foriegn-keyed table with integer keys...
PHDude Send private email
Monday, July 24, 2006
To be clear, that's what I meant by

> just make it an ENUM type (in either your
> database or *your code*).

A true enum might work too, depeding on implementation you might have a stable cast to an integer.
PHDude Send private email
Monday, July 24, 2006

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
Powered by FogBugz