The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Surrogate Keys in Databases

I'm reading Celko's new book Analytics & OLAP in SQL, and his rant on Surrogate Keys on page 30 & 31 leaves me with questions.

"A common programming method is to build your own 'quasi-surrogate keys' by using a proprietary autonumbering feature in a particular SQL engine.  This is a good sign that you are not using a SQL product meant for data warehousing.  The system should be doing this kind of thing under the covers for you."

I did a quick google on surrogate keys, and there seems to be a large number of people who advocate using an autonumbering primary key that is unrelated to the data.  This is what I have used, though I have worked on databases that use a GUID for the primary key.  Is Celko advocating using a GUID for the primary key?  How is this better than using a sequence?  As long as we won't have more than 2 billion rows, a 4 byte number should be fine, right?

What is an example of a SQL product that is meant for data warehousing, handling surrogate keys for you?
Nathan Send private email
Saturday, December 23, 2006
 
 
Maybe he means keys should always be based on attributes of your data and the database can do row ID assignments behind the scenes for efficiency. Generally the relational model is based on real attributes. You can easily imagine the database creating hidden integer keys and using them in joins and indexes.

In real life this means a lot of times you need to double your queries. Instead of using attributes as foreign keys you need to do a lookup first to get the auto assigned ID for later queries. If you have the tuple then you don't need to make this extra query.

Or he could mean something completely different.
son of parnas
Saturday, December 23, 2006
 
 
johnny
Saturday, December 23, 2006
 
 
Surrogate keys, also known as technical keys, may have their uses in certain circumstances, but there are times when they are more trouble than they are worth. Read my views at http://www.tonymarston.net/php-mysql/technicalkeys.html

There are two ways to use technical keys - (a) intelligently, and (b) indiscriminately. Those who choose option (b) are showing a distinct lack of (a).
Tony Marston Send private email
Sunday, December 24, 2006
 
 
Speaking from the Oracle data warehousing perspective, in particular in relation to Oracle's cost-based optimizer (CBO), I don't favour the use of surrogate keys in that environment.

This is almost entirely because nothing will cause an inefficienct query plan to be generated as well as hiding the cardinalities and data value distributions from the CBO.

If the CBO infers a small result set to be retrieved from the fact table and chooses to use nested loop joins back to dimension tables instead of hash joins then performance will stink. Using synthetic keys for partition key columns is also fraught with problems, though they're not fatal -- they simply mean that the optimizer has to use global (table) statistics for inferring cardinalities instead of partition or subpartition statistics.

So synthetic keys force the optimizer to use generic assumptions about the data that are often inappropriate and which often lead to poor performance.

Lastly there's a design issue with changing values of unique identifiers (which are one of the main reasons for using synthetic keys in OLTP systems). In a DW you often want to preserve the old value and link to the new, which means creating a new dimension table record instead of changing the current one. This allows real values to be used in the fact table.
David Aldridge Send private email
Tuesday, December 26, 2006
 
 

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
 
Powered by FogBugz