The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Database denormalisation - when, why and how?

I've been following the (sometimes heated) discussion on database people versus application people. Yes, crude divide I know, forgive me.

I'm a student of database theory. And I'm interested in why, when and how people 'denormalise'.

As far as I understand it the various stages of normalisation are quite discrete steps, each level leading to the next (apart from BCNF, which seems to be 3.5NF, ish). I'm wrestling with 4NF and 5NF at the moment.

So when people denormalise, what do they do: drop a level and write application code to eliminate the anomalies that the higher level is meant to prevent?

It would be good to see a concrete example.
Mike MacSween
Tuesday, January 04, 2005
 
 
How apt - I'm working myself up to redeveloping my work database and have come to a standstill on deciding whether denormalisation is a good thing or not. When I learnt relational design, 3NF was where it all stopped, but I suspect that attitudes have moved on, given that machine and software capabilities have shifted.
Chris Page
Tuesday, January 04, 2005
 
 
Ignore purity, think speed.

Take a look at an OLAP database - lots of big tables, hardly normalized, lots and lots of nulls.  And much faster than a small, tight, normalized version of the same.
i like i
Tuesday, January 04, 2005
 
 
So would you do a normalised version first and then denormalise, or just ignore the whole idea of normalisation from the start?

What's an OLAP - On Line Analytical Processing? Is that a Data Warehouse?
Mike MacSween
Tuesday, January 04, 2005
 
 
I suspect that most people don't consciously check their tables for conformity with all the various normal forms; I know I don't.  Once you have a feel for normalization, you tend to subconsciously incorporate it into your design process, and the only time you think much about it is when you're _de_normalizing.

The textbook example of denormalization is ZIP codes.  In theory, having city-state and ZIP code fields in the same table is redundant, because if you know the ZIP code, you know the city and state.  But since most people easily know their city and state anyway, there's no advantage to a lookup table that would justify the work involved.  Plus the redundancy can be helpful: having a city and state makes it easier for the post office to correct the address if the ZIP code is off by a digit.
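Schematically, the two designs might look something like this (a sketch only; table and column names are invented):

-- Normalized: city and state live in a ZIP lookup table
CREATE TABLE ZipCodes (
    ZipCode CHAR(5)     PRIMARY KEY,
    City    VARCHAR(50) NOT NULL,
    State   CHAR(2)     NOT NULL
);

CREATE TABLE Addresses (
    AddressID INTEGER PRIMARY KEY,
    Street    VARCHAR(80),
    ZipCode   CHAR(5) REFERENCES ZipCodes (ZipCode)
);

-- Denormalized alternative (would replace the Addresses table above):
-- city and state are carried redundantly on every row
CREATE TABLE Addresses (
    AddressID INTEGER PRIMARY KEY,
    Street    VARCHAR(80),
    City      VARCHAR(50),
    State     CHAR(2),
    ZipCode   CHAR(5)
);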

Another example: I'm working on a database of college courses which includes data about textbooks used.  Course offerings and textbooks are many-to-many, since one course offering can use any number of textbooks, and one textbook can be used in any number of course offerings.  Each textbook also has a publisher, ISBN, and so forth.

Now I could design the textbooks table so that the ISBN is the primary key, or at least a surrogate key.  And I could have a separate Publishers table with its own key linked to the Textbooks table.  But that would go beyond the purpose and requirements of the database.  Our goal isn't to track textbooks and publishers, but just to show what books were used in a course.  If a publisher ends up typed as "Books Limited" for one textbook and "Books Ltd." in another, it's not really important because we're not going to be searching on publisher.
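In outline, the schema I'm describing is something like this (names invented for illustration):

CREATE TABLE Courses (
    CourseID INTEGER PRIMARY KEY,
    Title    VARCHAR(100) NOT NULL
);

-- Publisher stays as free text: "Books Ltd." and "Books Limited"
-- can coexist, which is acceptable for these requirements
CREATE TABLE Textbooks (
    TextbookID INTEGER PRIMARY KEY,   -- surrogate key rather than ISBN
    ISBN       VARCHAR(13),
    Title      VARCHAR(100),
    Publisher  VARCHAR(100)
);

-- Many-to-many between course offerings and textbooks
CREATE TABLE CourseTextbooks (
    CourseID   INTEGER REFERENCES Courses (CourseID),
    TextbookID INTEGER REFERENCES Textbooks (TextbookID),
    PRIMARY KEY (CourseID, TextbookID)
);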

By the way, instead of going through all the various normal forms, domain-key normal form is the thing to strive for...but unfortunately, most articles and books I've seen fizzle out before discussing DK-NF.  Here's one that _does_ cover it:

http://www.gslis.utexas.edu/~l384k11w/normstep.html

I think the rule of thumb centers on ROI: What benefit do you get for the time you spend?  What are you really tracking?  What are your requirements?  If I denormalized courses, so that a given course could end up with variants of its title, I wouldn't be meeting my requirements.  But since keeping accurate publisher name data isn't part of the requirements, normalizing that data is not the best use of my time.
Kyralessa
Tuesday, January 04, 2005
 
 
'I suspect that most people don't consciously check their tables for conformity with all the various normal forms; I know I don't.  Once you have a feel for normalization, you tend to subconsciously incorporate it into your design process, and the only time you think much about it is when you're _de_normalizing.'

Sorry to be blunt. But that looks to me like old-fashioned sloppiness. I suspect that means that many people think they know what normalised means, but don't really. And haven't thought about it enough.
Mike MacSween
Tuesday, January 04, 2005
 
 
'The textbook example of denormalization is ZIP codes.  In theory, having city-state and ZIP code fields in the same table is redundant because if you know the ZIP code, you know the city and state.  But since most people easily know their city and state anyway, there's no advantage to creating a lookup table that would balance out the work involved.  Plus the redundancy can be helpful, since having a city and state makes it easier for the post office to correct if the ZIP code is off by a digit.'

Yes, in the US I guess that might apply. We don't have states here! It says a great deal that people so frequently choose US examples in discussions about databases.

I seem to remember reading somewhere that an American ZIP code can actually include 2 cities or states. But I might be wrong.
Mike MacSween
Tuesday, January 04, 2005
 
 
"But since keeping accurate publisher name data isn't part of the requirements, normalizing that data is not the best use of my time."

Fine, as long as your requirements are cast in stone. To use your example, I would bet money that at some point people will ask "it would be really useful if we could search by Publisher...?". Unless you really REALLY need the speed of denormalisation, it's dangerous and limiting in the long term. Databases designed well are not just for ease of querying now; they're for flexibility and ease of future maintenance and upgrading.

Obviously there are projects where it's going to be a write-once operation, and the db will never change. But I've never worked on one...
Andrew Cherry
Tuesday, January 04, 2005
 
 
(1)  No problem; I'm happy to forgive your bluntness since you're wrong anyhow.

(2)  Actually, it doesn't say a great deal; all it says is they live in the U.S.  As the teachers and schools in my database are all located in and around one midwestern city in the U.S., I don't think I'm going to put a lot of effort into internationalizing this application.  (See, there's that ROI thing again...)  But thanks _ever_ so much for your concern.

Since you're already so clever, I wonder why you bother to ask questions.
Kyralessa
Tuesday, January 04, 2005
 
 
'Now I could design the textbooks table so that the ISBN is the primary key, or at least a surrogate key.  And I could have a separate Publishers table with its own key linked to the Textbooks table.  But that would go beyond the purpose and requirements of the database.  Our goal isn't to track textbooks and publishers, but just to show what books were used in a course.  If a publisher ends up typed as "Books Limited" for one textbook and "Books Ltd." in another, it's not really important because we're not going to be searching on publisher.'

I don't think that's denormalised, but it might be. Just that you've got inconsistent data because you don't seem to have specified a domain for 'publisher', apart from 'some text'. So you haven't limited the set of values that the field publisher can contain. But if you've got a FD ISBN -> Publisher, it seems to be non-transitively dependent on the PK (so 3NF), which is single field (so 2NF). I'm ready to be corrected.

Does part of the ISBN identify the publisher? If it does there's a raft of horrible issues there!

Looking at that, that is one example where I personally would think very hard about splitting publishers off into a separate table. It's not part of your requirements now, but in 1, 2, 3 years your procurement section decides to try to force some discounts from publishers and wants to know how much of each publisher's books you use. That's a very easy-to-imagine scenario, isn't it? And with a separate publishers table it's a doddle. Without one, especially if the misspellings you've suggested are allowed, it's not.
Mike MacSween
Tuesday, January 04, 2005
 
 
Denormalization typically yields speed at the expense of less stable data.  And I always normalize to 3NF first before starting down this path.

After setting up a normalized database I will profile it for common usage scenarios.  It is possible that performance bottlenecks will appear.  Typically this will be on a join or series of joins that are expensive and extremely routine.  Of course I'll try to set up good indexes and query hints, but at some point I'll consider denormalizing the tables.  I'll essentially join the tables into one larger table.
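As a rough sketch (invented names), that step turns a routine join like this:

SELECT o.OrderID, o.OrderDate, c.CustomerName, c.Region
FROM   Orders o
JOIN   Customers c ON c.CustomerID = o.CustomerID;

into a single wide table:

-- Customer columns copied onto each order row; the application
-- (or triggers) must now keep the copies consistent
CREATE TABLE OrdersWide (
    OrderID      INTEGER PRIMARY KEY,
    OrderDate    DATE,
    CustomerID   INTEGER,
    CustomerName VARCHAR(80),   -- redundant copy
    Region       VARCHAR(40)    -- redundant copy
);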

This can require additional application code to keep things in check, but not always.  You're essentially trading data cleanliness and disk space for transaction speed.

As with all things, use caution.  You can really destabilize previously stable tables when you denormalize, and it can cause huge problems when trying to perform analysis at a later point.
Lou
Tuesday, January 04, 2005
 
 
'(1)  No problem; I'm happy to forgive your bluntness since you're wrong anyhow.

(2)  Actually, it doesn't say a great deal; all it says is they live in the U.S.  As the teachers and schools in my database are all located in and around one midwestern city in the U.S., I don't think I'm going to put a lot of effort into internationalizing this application.  (See, there's that ROI thing again...)  But thanks _ever_ so much for your concern.

Since you're already so clever, I wonder why you bother to ask questions.'

Is there really any need to be so sarcastic?

I think that people and addresses are among the commonest examples in database discussions (presumably because they are so often the subject of databases), but they are actually among the hardest things to model.
Mike MacSween
Tuesday, January 04, 2005
 
 
I usually denormalize for reporting. If reporting is not big enough to have its own tables. Don't sweat 4NF and 5NF. 3NF and BCNF are usually good enough for most projects you will be working on.

>So would you do a normalised version first then denormalise, or just ignore the whole idea of normalisation from the start.
Normalize first, then denormalize when appropriate. Take for example, line items on an invoice/bill of sale. If you alter the price for the customer on that invoice ("I'll let you have this one free because the last order was 3 weeks late, or you're such a good customer...or..."), you change the price for every one of those items, if the price hasn't been "disconnected" from the parts/items table.
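For example (invented names), the line-item price is captured at sale time and then stands on its own:

CREATE TABLE Items (
    ItemID       INTEGER PRIMARY KEY,
    Description  VARCHAR(80),
    CurrentPrice DECIMAL(10,2)    -- today's list price; changes over time
);

CREATE TABLE InvoiceLines (
    InvoiceID INTEGER,
    LineNo    INTEGER,
    ItemID    INTEGER REFERENCES Items (ItemID),
    Qty       INTEGER,
    UnitPrice DECIMAL(10,2),   -- copied from Items.CurrentPrice when the line
                               -- is created; later price changes or one-off
                               -- discounts never rewrite old invoices
    PRIMARY KEY (InvoiceID, LineNo)
);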

Some of your customers move. When you analyse sales data for the last year, should the sales reports reflect "where the customer was when the sales happened" or "where the customer is when this report was run"?

>What's an OLAP - On Line Analytical Processing?
Short answer is that it is a copy of your data which has been reengineered for reporting. "Let's look at sales by region, grouping by sex. Now let's look at sales by sex, grouping by region, no, let's look at region vs color..."
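The sort of query such a copy is rebuilt to answer quickly looks something like this (invented names; swap the grouping columns around at will):

SELECT Region, Sex, SUM(SaleAmount) AS TotalSales
FROM   SalesFact
GROUP  BY Region, Sex;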
 
>Is that a Data Warehouse?
No.

The ZIP code example is a good one. ZIP codes never cross state boundaries, so that state is a function of zip. The postal codes sometimes (rarely) change, and for the majority of cases, city/town is also a function of zip code.

If you do large scale mailing, you will need to also pay attention to ZIP+4 (pretty much each mailbox, or bank of mailboxes has its own 9 digit code) as well as the carrier route and delivery point (which stop on the mail route this zip+4 represents). To get the best discounts on postage, you will be validating and purifying your address list as well as presorting the mail for the post office.

When you are dealing with people typing stuff in, you will get typos all the time. Sometimes because people don't remember their own zip code. For that reason, the redundancy is very helpful for figuring out just what is going on.

Sample of some tables:
Addresses (CustomerNumber, TypeCode, AddrLine1, etc)
PhoneNumbers (CustomerNumber, TypeCode, AreaCode, Phone, etc)

If you want 3NF and nothing more, you would then have:
AddressTypeCodes (TypeCode, Definition, ShortVersion, etc)
PhoneTypeCodes (TypeCode, Definition, ShortVersion, etc)

If you want 4NF, you would replace ATC & PTC (and any others) with:
TypeCodes (Type, TypeCode, Definition, ShortVersion, etc)

Sample data for:
PhoneTypeCodes (1, "Main Phone", "main")
PhoneTypeCodes (2, "Fax", "fax")
PhoneTypeCodes (3, "Pager", "pager")
AddressTypeCodes (1, "Address", "main")
AddressTypeCodes (2, "Shipping Address", "ship to")
AddressTypeCodes (3, "Billing Address", "billing")

TypeCodes ("Phone", 1, "Main Phone", "main")
TypeCodes ("Phone", 2, "Fax", "fax")
TypeCodes ("Phone", 3, "Pager", "pager")
TypeCodes ("Address", 1, "Address", "main")
TypeCodes ("Address", 2, "Shipping Address", "ship to")
TypeCodes ("Address", 3, "Billing Address", "billing")

Helpful? Useless? WTF?
Peter
Tuesday, January 04, 2005
 
 
'I usually denormalize for reporting. If reporting is not big enough to have its own tables. Don't sweat 4NF and 5NF. 3NF and BCNF are usually good enough for most projects you will be working on.'

Sure. It looks to me like a lot of things that are in 3NF are in BCNF already, and maybe 4NF and 5NF. I'll keep trying to understand the higher forms.

'>So would you do a normalised version first then denormalise, or just ignore the whole idea of normalisation from the start.
Normalize first, then denormalize when appropriate. Take for example, line items on an invoice/bill of sale. If you alter the price for the customer on that invoice ("I'll let you have this one free because the last order was 3 weeks late, or you're such a good customer...or..."), you change the price for every one of those items, if the price hasn't been "disconnected" from the parts/items table.'

Actually I just assume that in a line item record the 'price' would be the price that the item was sold to the customer at. And that it wouldn't be connected to the products table anyway. Just derived from it initially. I don't think that's denormalised. It's an attribute of lineItem - 'the price we sold this product for on this occasion'.

'Some of your customers move. When you analyse sales data for the last year, should the sales reports reflect "where the customer was when the sales happened" or "where the customer is when this report was run"'

Sure, unless you're storing a history of addresses you can't do the first one, I agree.

'>What's an OLAP - On Line Analytical Processing?
Short answer is that it is a copy of your data which has been reengineered for reporting. "Let's look at sales by region, grouping by sex. Now let's look at sales by sex, grouping by region, no, let's look at region vs color..."'

I see. So probably fewer tables, more nulls, but fewer joins, so faster? And no need to allow for concurrency issues and suchlike. Even faster.

I'll try to understand the 4NF bit.
Mike MacSween
Tuesday, January 04, 2005
 
 
Saving the price with an invoice is a good example of denormalization, because you don't actually have to denormalize to save this information.  If you consider that the price of an item isn't just determined by the item itself, but also by the point in time, then you could break out price into its own table with fields for PK, item key, price, and start date/time of that price.

But that would require a join every time you looked up the current price of an item, and also every time you looked up a historical invoice.  Storing the price in the item table itself, and in each invoice when it's created, is a denormalization which eliminates those joins.  (Though whether the speed increase is significant is another question.)
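A minimal sketch of that fully normalized price-history design (invented names):

CREATE TABLE ItemPrices (
    ItemPriceID INTEGER PRIMARY KEY,
    ItemID      INTEGER,         -- references the items table
    Price       DECIMAL(10,2),
    StartDate   TIMESTAMP        -- price effective from this moment
);

-- Current price of item 42 = the latest non-future StartDate
SELECT Price
FROM   ItemPrices
WHERE  ItemID = 42
  AND  StartDate <= CURRENT_TIMESTAMP
ORDER  BY StartDate DESC
FETCH  FIRST 1 ROW ONLY;   -- standard SQL; some engines spell this LIMIT 1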
Kyralessa
Tuesday, January 04, 2005
 
 
Well, Kyralessa's example application where you wouldn't normalize seems to me to be a great example of how to not develop a robust application. By allowing this horrible structure the application is doomed to sit in isolation from the rest of the world -- the data gathered will be of no use to any system outside of the one described.

It may pass in the academic world, but this approach is a disaster commercially. I'm not saying that it doesn't happen; it's how large organizations like the US Navy end up with tens of thousands of systems they didn't even know about: http://www.washingtontechnology.com/news/16_21/cover/17789-1.html

This is stuff that goes even above the level of the DBA being picky about letting people create new tables -- it speaks directly to the strategy of information management, rather than the design of individual systems, but the end effect is that if you don't get the fundamentals right then you are not doing the right thing by your employer/client.

Now OLAP, decision support, and data warehousing are a completely different kettle of fish. Denormalization is acceptable there for the purpose of providing rapid answers to specific business questions, but this is a secondary repository for the data -- the integrity of it is guaranteed through the source system and ETL procedure.
David Aldridge
Tuesday, January 04, 2005
 
 
'Saving the price with an invoice is a good example of denormalization, because you don't actually have to denormalize to save this information.  If you consider that the price of an item isn't just determined by the item itself, but also by the point in time, then you could break out price into its own table with fields for PK, item key, price, and start date/time of that price.

But that would require a join every time you looked up the current price of an item, and also every time you looked up a historical invoice.  Storing the price in the item table itself, and in each invoice when it's created, is a denormalization which eliminates those joins.  (Though whether the speed increase is significant is another question.)'

I disagree, there's nothing denormalised about storing the price THAT THE ITEM WAS ACTUALLY SOLD FOR in the LineItems table. It's an attribute of LineItems. Whereas StandardCurrentPrice (or something) is an attribute of the Products table. As far as I can see the FDs all apply there.

So when a new line item is created the Current Price is grabbed from Products and entered as the Line Item Price. Unless the customer has a trade discount (retail less 5%) or a large-customer trade discount (retail less 10%). Or unless the sales rep simply wants to give this customer a lower price. If you want to force the rep to do that by applying a discount percentage, then you might drag that info out to a separate table.

I think CurrentPrice and PriceWeSoldItAt are different attributes of different entities.
Mike MacSween
Tuesday, January 04, 2005
 
 
Yes, I agree with Mike on that one. This is definitely the standard practice in the industry, and storing a "discount against standard price" or a "rebate against current price" is exactly the same as storing the sale price anyway.
David Aldridge
Tuesday, January 04, 2005
 
 
Mike,

I think you're looking for absolute rules to hold onto. There aren't any. There are plenty of things that get called rules of normalization, and it's important to know them, to have internalized them. But it's not important to use one set of rules for every database, or even to apply one set of rules evenly to every bit you model in any one database. What's important is to know _why_ the rules exist, to understand the benefits and drawbacks associated with each of the rules, and to know when it will be to your advantage to use each rule and when it will be to your advantage to break each rule.

Jeremy
JeremyNYC
Tuesday, January 04, 2005
 
 
ISBNs are unique to each version of a book. The hardback gets one, the paperback gets one, the hardback with a CD has another. The people who run the ISBN system sell the numbers in blocks. So if you are a small publisher, you will buy numbers 10 at a time, if you are a large one, you want a large number space. Just like UPCs.
Peter
Tuesday, January 04, 2005
 
 
'What's important is to know _why_ the rules exist, to understand the benefits and drawbacks associated with each of the rules, and to know when it will be to your advantage to use each rule and when it will be to your advantage to break each rule.'

What was the last rule you broke, how and why?
Mike MacSween
Tuesday, January 04, 2005
 
 
Mike,

You continue to challenge the folks who are trying to give you pointers, instead of trying to understand what we're saying. Slow down some.

I'm working on a database right now that is a reporting tool for data stored in a horrible old database. In the original database, there are all sorts of denormalized data structures and all sorts of cruddy data as a result. In the reporting tool I'm creating I've pushed everything into two tables, to make querying faster.

Trying to look at a situation like this rule by rule would be a long academic exercise. Understanding why I created these two tables is pretty straightforward: it's really fast. Understanding why the original database is designed the way it is may or may not be so simple, though my guess is that it's because the people who built the original didn't really know what they were doing.

Jeremy
JeremyNYC
Tuesday, January 04, 2005
 
 
Come to think of it, maybe I'll just store nothing but the ISBN, and look up the rest of the book data with a web service.
Kyralessa
Tuesday, January 04, 2005
 
 
Jeremy,

There you go -- denormalization in an OLTP system = horrible. Denormalization in a reporting system = not horrible.

The difference is that the reporting system is not the guarantor of data integrity, concurrency of data modification etc.

The problem is that the overwhelming majority of systems are OLTP, and denormalizing them is terrible for data integrity, concurrency, maintenance etc., right?
David Aldridge
Tuesday, January 04, 2005
 
 
'You continue to challenge the folks who are trying to give you pointers, instead of trying to understand what we're saying. Slow down some.'

Nope. The subject title and my original post are clear and honest. But I'll repeat:

I often hear that people decide to 'denormalise'. Given that the various normal forms have precise definitions, I am interested what normal form people denormalise back to, and why. You can't DEnormalise if you haven't normalised first, yes? Or at least if you claim to be denormalising when you don't know what NF you've normalised to in the first place then you haven't. You're just thinking on the hoof. Which may or may not give you a normalised structure, if you can do it 'subconsciously'.

All I read in previous discussions is something along the lines of 'all this relational theory is all very well but in the real world we can't do that so we denormalise'. I paraphrase. By the way, I utterly loathe the phrase 'the real world', but I wanted to use it before somebody else did.

Your example - you've got a rubbish database that you're forced to work with, so you're doing the best you can. I understand that; we all find ourselves in less than ideal positions. I suppose that's one example of when to denormalise. I was really talking about situations where you DO have control over the data schema.

Let me put it another way. If I started a thread entitled - 'All who don't normalise to 5NF are morons' there'd be hell on. But I've not seen many good examples in this thread of when it is a good idea to denormalise. I was asking for those.

OK, I won't 'challenge' any of those examples. Just let me know when it might be a good idea to 'denormalise'. And I'll decide for myself.

Pip pip
Mike MacSween
Tuesday, January 04, 2005
 
 
"There you go -- denormalization in an OLTP system = horrible. Denormalization in a reporting system = not horrible."

Dave,

Well, this system isn't big enough that I would think to call it an OLTP system, though it probably should be...

In any case, yeah, sort of--in general you'd want something that's being used for collecting data to be really normalized, but there will be cases even in a data collection system where it's just less of a pain in the ass for the user, the developer, or both, to denormalize things. Speed and accuracy have to be balanced against each other. The things we know as rules of normalization are tools to help us keep our data accurate. The thing we call experience is a tool we use to help us not get taken out into the alley and beaten by our clients.

Jeremy
JeremyNYC
Tuesday, January 04, 2005
 
 
ISBNs are occasionally re-used by confused publishers.  It would be in your best interests not to use it as a key field.

This is one of the painful lessons a previous employer learned while building an online bookstore.

In fact, exhortations against relying upon _any_ identifiers from the real world are usually part of DB Design 101.  ZIP codes, ISBNs, SSNs, and phone numbers are all really, really bad ideas for keys.
Art
Tuesday, January 04, 2005
 
 
>I am interested what normal form people denormalise back to, and why...
Reporting tables (processed and preserved reports) are usually kept in 2NF.

Why:
If the weekly reports take 8 hours of CPU time to run, they (should) only get run once. So if Manager252 wants to see what the numbers were for last week and compare them to the first week of December, they can just print out the pre-digested numbers (and it will take a couple of minutes on the printer, rather than having to preserve dead-tree editions all the time, or spending 16 hours of CPU time to get them these 2 reports).

Because the tables are de-normalised, they keep the status of the stuff being reported as it was at the time the report was made. Widget_43 might be obsolete and discontinued *today*. It wasn't when the reports were originally run.
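So a report snapshot table might look something like this (invented names):

CREATE TABLE WeeklySalesReport (
    WeekEnding  DATE,
    WidgetID    INTEGER,
    WidgetName  VARCHAR(80),    -- copied, not joined: frozen as of the run
    WidgetState VARCHAR(20),    -- e.g. 'active' then, even if obsolete today
    UnitsSold   INTEGER,
    Revenue     DECIMAL(12,2),
    PRIMARY KEY (WeekEnding, WidgetID)
);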

Mike, does this answer the question you were looking for?

And Art's warnings are correct. Never assume the idents out of your control are unique and/or never recycled.
Peter
Tuesday, January 04, 2005
 
 
>> The things we know as rules of normalization are tools to help us keep our data accurate. The thing we call experience is a tool we use to help us not get taken out into the alley and beaten by our clients. <<

Heh heh, I like it.

There is a point that I'd like to make here though, which is that the performance problems that many people associate with normalized databases are often easily overcome with RDBMS-specific features.

For example, a developer who wants to store in a CUSTOMER table the total value of all invoices for that customer probably doesn't know that storing the invoices in an Oracle hash cluster (based on customer_id) would allow them to retrieve this total at the additional cost of only a couple of database logical reads -- a small price to pay for data integrity and concurrency.
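Very roughly, and with invented names and sizes (a sketch, not a tuned example):

CREATE CLUSTER customer_invoices (customer_id NUMBER)
    SIZE 8192 HASHKEYS 100000;   -- one customer's invoices share blocks

CREATE TABLE invoices (
    invoice_id  NUMBER PRIMARY KEY,
    customer_id NUMBER,
    amount      NUMBER(12,2)
) CLUSTER customer_invoices (customer_id);

-- All of a customer's invoices now sit in one or two blocks, so this
-- total costs only a couple of logical reads
SELECT SUM(amount) FROM invoices WHERE customer_id = :cust;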

Apart from the reporting databases, it would take an extreme need to make me abandon normalization.
David Aldridge
Tuesday, January 04, 2005
 
 
I used to be really normalized. Then, we added a system to our app which lets Mere Mortals do SQL-like queries on our tables. We did this solely because a competitor offered it and customers like being able to ask random questions like 'show me all my clients who live in Quebec and own Nortel stock'. Since we implemented this app, virtually every design decision has been colored by needing to support this feature - which of course requires formerly horrifying-seeming levels of denormalization.

To sum up the guy who said normalization was good for OLTP but bad for reporting, I could not agree more.
NetFreak
Tuesday, January 04, 2005
 
 
'Reporting tables (processed and preserved reports) are usually kept in 2NF.

Why:
If the weekly reports take 8 hours of CPU time to run, they (should) only get run once. So if Manager252 wants to see what the numbers were for last week and compare them to the first week of December, they can just print out the pre-digested numbers (and it will take a couple of minutes on the printer, rather than having to preserve dead-tree editions all the time, or spending 16 hours of CPU time to get them these 2 reports).

Because the tables are de-normalised, they keep the status of the stuff being reported as it was at the time the report was made. Widget_43 might be obsolete and discontinued *today*. It wasn't when the reports were originally run.

Mike, does this answer the question you were looking for?'

Absolutely. Static, historical data that can run faster denormalised? That makes sense. And presumably if you ever wanted, for some not-as-yet-foreseen reason, to renormalise them, you could. No data has been lost in the denormalisation (probably the opposite).

'And Art's warnings are correct. Never assume the idents out of your control are unique and/or never recycled. '

Yup, learned that a long time ago.
Mike MacSween
Tuesday, January 04, 2005
 
 
Personally, Art, I always make last names the key.  After all, there's not that many people named Smith.  If there's more than one, I just make them Smith1, Smith2, Smith31415927, and so forth.  Unless it's somebody I'm mad at, and then I give them a new last name so they can't find themselves in the database.
Kyralessa
Tuesday, January 04, 2005
 
 
>> Personally, Art, I always make last names the key.  After all, there's not that many people named Smith.  If there's more than one, I just make them Smith1, Smith2, Smith31415927, and so forth.  Unless it's somebody I'm mad at, and then I give them a new last name so they can't find themselves in the database.<<

don't forget the beauty of mixing the case up ... "SMITH", "smith", "sMiTh" etc.
David Aldridge
Tuesday, January 04, 2005
 
 
I'm a VB programmer; what's "case"?
Kyralessa
Tuesday, January 04, 2005
 
 
"I often hear that people decide to 'denormalise'. Given that the various normal forms have precise definitions, I am interested what normal form people denormalise back to, and why. You can't DEnormalise if you haven't normalised first, yes?"

I've one piece of denormalization in my largest project.  I began normalized and spent a long time normalized.  Eventually the extra performance was needed and this was denormalized -- current data was copied into another table to prevent a join.  The performance improvement justified the change and I've minimized the danger as much as possible.

The situation is rare.  There are other places I could denormalize for performance but don't because the improvement isn't enough to justify the increase in problems.
Almost Anonymous
Tuesday, January 04, 2005
 
 
>> I'm a VB programmer; what's "case"? <<

It's the thing that VB programmers carry their "SQL for Dummies" book in as they are escorted from the building.

:D
David Aldridge
Tuesday, January 04, 2005
 
 
I read somewhere that the main purpose of the whole activity of constructing/building software is "managing complexity". So my answer to the when, why and how is: when the database model becomes too complex, because it is too complex, find a solution that is less complex.
My name is secret!
Tuesday, January 04, 2005
 
 
>>  I read somewhere that the main purpose of the whole activity of constructing/building software is "managing complexity". So my answer to the when, why and how is: when the database model becomes too complex, because it is too complex, find a solution that is less complex. <<

I think you'd have to be more clear about the form of complexity you are talking about. Denormalization is its own form of complexity, and it's a form that leads to maintenance problems, data integrity problems, and concurrency problems.
David Aldridge
Tuesday, January 04, 2005
 
 
"There are other places I could denormalize for performance but don't because the improvement isn't enough to justify the increase in problems."

Materialised views might be an option in this case, especially for reporting (i.e., read-only applications).
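For instance (an Oracle-flavoured sketch; refresh policy omitted):

-- Precomputed aggregate, stored like a table and refreshed on demand
CREATE MATERIALIZED VIEW customer_totals AS
SELECT customer_id, SUM(amount) AS total_invoiced
FROM   invoices
GROUP  BY customer_id;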
Art
Tuesday, January 04, 2005
 
 
>>I think you'd have to be more clear about the form of complexity you are talking about. Denormalization is its own form of complexity, and it's a form that leads to maintenance problems, data integrity problems, and concurrency problems.<<

Well, let's say you are doing a customer database, and you start to normalize your address table by having a separate table for the cities, because there are a limited number of city names. It does good things for data integrity... But now let's look at who is on the receiving end of the address book: Joe's Coffee, the small coffee shop with only one location, or Joe's Coffee, the small franchise in several cities. Now I can see the normalisation being good for Joe's Coffee the franchise, but I can't really see the need to make the customer database more complex for the single location when a string is enough.
My name is secret!
Tuesday, January 04, 2005
 
 
How does storing the name of a city in an address table make the table denormalised?
Mike MacSween
Tuesday, January 04, 2005
 
 
>How does storing the name of a city in an address table make the table denormalised?
I think the previous poster thinks that ZIP (postal) codes uniquely identify a city? And if he does think that zip codes make a candidate FK for a city table, I think he should step away from the keyboard with his hands in the air.

When you start getting a few thousand addresses in your tables, you will start to notice that the rules you think apply, really don't.

The table structure for a few dozens to few hundreds of addresses is going to be somewhat different than the structure you will need if you are sending tens of thousands of postal letters per month.
http://www.usps.com/webtools/
http://zip4.usps.com/zip4/welcome.jsp
Peter
Tuesday, January 04, 2005
 
 
>>How does storing the name of a city in an address table make the table denormalised?<<

Storing the city as a string in an address table doesn't follow second normal form (2NF). "A relational table is in second normal form (2NF) if it is in 1NF and every non-key column is fully dependent upon the primary key." To translate: with customers 1 and 2 from city AAA and customer 3 from city BBB, you get redundancy, two rows with AAA in them. The city column with the AAAs doesn't depend on the primary key "customer_id". Thus cities should have their own table with city_ids, and that should be "related" with the "customer_id".
My name is secret!
Tuesday, January 04, 2005
 
 
ref. Joe's Coffee -- the argument here seems to be that normalization only matters in non-trivial databases, where you actually have more than one data item to store, right? But even then there seem to only be two options in the case you identify -- hardcode the city name, so you can only have customers in one city, or just repeat the city name and allow spelling mistakes etc.. It doesn't seem to be a very compelling argument.
David Aldridge
Tuesday, January 04, 2005
 
 
Hmm. Since every state east of the Mississippi River has a town or city called "Lebanon" then you would have every single one of them represented by the same CityID?
Peter
Tuesday, January 04, 2005
 
 
I see. CustomerID is the primary key of this table then? Wouldn't it have been easier to store the address details in the Customer table?
Mike MacSween
Tuesday, January 04, 2005
 
 
>>I think the previous poster thinks that ZIP (postal) codes uniquely identify a city? And if he does think that zip codes make a candidate FK for a city table, I think he should step away from the keyboard with his hands in the air.

When you start getting a few thousand addresses in your tables, you will start to notice that the rules you think apply, really don't.

The table structure for a few dozens to few hundreds of addresses is going to be somewhat different than the structure you will need if you are sending tens of thousands of postal letters per month.
http://www.usps.com/webtools/
http://zip4.usps.com/zip4/welcome.jsp<<

If you refer to me as the previous poster: no, I don't think that postal codes make good candidates for FKs.

You are actually emphasising the point I'm trying to get across: the structure needed in the address table for a few dozen to a few hundred addresses is different from that needed for tens of thousands of letters, because in the tens-of-thousands case the increase in complexity is a good thing. Please do read my previous posts in this thread.
My name is secret!
Tuesday, January 04, 2005
 
 
>>ref. Joe's Coffee -- the argument here seems to be that normalization only matters in non-trivial databases, where you actually have more than one data item to store, right? But even then there seem to only be two options in the case you identify -- hardcode the city name, so you can only have customers in one city, or just repeat the city name and allow spelling mistakes etc.. It doesn't seem to be a very compelling argument.<<

Well, Joe's Coffee could also be a multinational business; then the city should also be denormalized to the address. To give it a non-trivial spin.
My name is secret!
Tuesday, January 04, 2005
 
 
Boy, am I ever learning here! So in my table of models, where I have columns ID, Name, Hair Colour, with values

1, Linda Lovely, Blonde
2, Georgina Gorgeous, Blonde
3, Mike BackEndOfABus, Green

Is that not in 2NF? Because Blonde is repeated, therefore redundant?
Mike MacSween
Tuesday, January 04, 2005
 
 
>>I see. CustomerID is the primary key of this table then? Wouldn't it have been easier to store the address details in the Customer table?<<

It depends on how much complexity you want in your application, looking at the end user of your application. If you want or need data integrity, with correct spelling of the city name, then you should have it in a separate table.
My name is secret!
Tuesday, January 04, 2005
 
 
How about ...

Table: Model
Model_ID
Model_Name

Table: Model_Hair_Colour
Model_ID
Date_From
Hair_Colour_Id

Table: Hair_Colour
Hair_Colour_ID
Hair_Colour_Name

:)
David Aldridge
Tuesday, January 04, 2005
 
 
>>Boy, am I ever learning here! So in my table of models, where I have columns ID, Name, Hair Colour, with values

1, Linda Lovely, Blonde
2, Georgina Gorgeous, Blonde
3, Mike BackEndOfABus, Green

Is that not in 2NF? Because Blonde is repeated, therefore redundant?<<

That is correct. If you have a lot of Blondes and Brunettes, then it could be stored in a separate table with a HairColourID, and the HairColourID becomes a FK in the Model table.
My name is secret!
Tuesday, January 04, 2005
 
 
'>>I see. CustomerID is the primary key of this table then? Wouldn't it have been easier to store the address details in the Customer table?<<

It depends on how much complexity you want in your application, looking at the end user of your application. If you want or need data integrity, with correct spelling of the city name, then you should have it in a separate table.'

I only ask because you seem to have CustomerID as the Primary Key of the Address table. So I can't see any advantage. That's just going to give you an extra one-to-one join, isn't it? Which will presumably slow you down. And after all, performance is one of the reasons given for DEnormalising.
Mike MacSween
Tuesday, January 04, 2005
 
 
>> CustomerID is the primary key of this table then? Wouldn't it have been easier to store the address details in the Customer table? <<

On a very large system, such as ones used by utility companies with customers that include just about every address within a region, you'd have an address table to store all known addresses (probably fed from an external source like USPS, or from whatever the UK post office is now called), then maintain a correlation table that stores the transient relationship between customers and addresses. Thus you can track what previous addresses a customer lived at, and what customers previously lived at an address.
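Sketched out (invented names; a Customers table is assumed):

CREATE TABLE Addresses (
    AddressID INTEGER PRIMARY KEY,
    AddrLine1 VARCHAR(80),
    City      VARCHAR(50),
    Postcode  VARCHAR(10)
);

-- Transient customer/address relationship, with history
CREATE TABLE CustomerAddresses (
    CustomerID INTEGER,   -- references Customers
    AddressID  INTEGER REFERENCES Addresses (AddressID),
    FromDate   DATE NOT NULL,
    ToDate     DATE,      -- NULL = still current
    PRIMARY KEY (CustomerID, AddressID, FromDate)
);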
David Aldridge
Tuesday, January 04, 2005
 
 
You said:

>>Boy, am I ever learning here! So in my table of models, where I have columns ID, Name, Hair Colour, with values

1, Linda Lovely, Blonde
2, Georgina Gorgeous, Blonde
3, Mike BackEndOfABus, Green

Is that not in 2NF? Because Blonde is repeated, therefore redundant?<<

That is correct. If you have a lot of Blondes and Brunettes, then it could be stored in a separate table with a HairColourID, and the HairColourID becomes a FK in the Model table.

And I want to know:

Whether or not that table is in 2NF. Like your address example with 2 AAA cities.
Mike MacSween
Tuesday, January 04, 2005
 
 
"I only ask because you seem to have CustomerID as the Primary Key of the Address table. So can't see any advantage. That's just going to give you an extra one to one join isn't it? Which will presumably slow you down. And after all performance is one of the reasons given for DEnormalising."

It is not only the slowdown that I'm talking about; an extra join also adds complexity: you have to handle the inserts and updates to a different table. In some cases, like when having few rows or very little redundancy (few repetitions), the extra hassle of handling a separate table could slow down the implementation of the application.
My name is secret!
Tuesday, January 04, 2005
 
 
David Aldridge said:

>> CustomerID is the primary key of this table then? Wouldn't it have been easier to store the address details in the Customer table? <<

On a very large system, such as ones used by utility companies with customers that include just about every address within a region, you'd have an address table to store all known addresses (probably fed from an external source like USPS, or from whatever the UK post office is now called), then maintain a correlation table that stores the transient relationship between customers and addresses. Thus you can track what previous addresses a customer lived at, and what customers previously lived at an address.'

I reply:

Sure. That's what I would do too. But that's not quite what Secret Name is saying, if you look carefully. The PK of the address table seems to be CustomerID. So you couldn't track customer addresses historically, and it looks to me like just a wasteful join, as I see only a one-to-one there that isn't needed, unless lots of customers just don't have addresses.
Mike MacSween
Tuesday, January 04, 2005
 
 
And I want to know:

Whether or not that table is in 2NF. Like your address example with 2 AAA cities.

It is not in 2NF because the Hair Colour repeats itself.
My name is secret!
Tuesday, January 04, 2005
 
 
My name is a secret said:

'"I only ask because you seem to have CustomerID as the Primary Key of the Address table. So can't see any advantage. That's just going to give you an extra one to one join isn't it? Which will presumably slow you down. And after all performance is one of the reasons given for DEnormalising."

It is not only the slowdown that I'm talking about; an extra join also adds complexity: you have to handle the inserts and updates to a different table. In some cases, like when having few rows or very little redundancy (few repetitions), the extra hassle of handling a separate table could slow down the implementation of the application.'

I say:

I think you've misunderstood me. I'm saying the opposite. You've described an ADDRESS table that has CustomerID as the primary key. So a customer can't have more than one address - CustomerID is the PK, therefore unique, yes? So why bother having a separate table for addresses? Unless you're going to have addresses with more than one customer. But I don't see how you can do that, as there doesn't seem to be any key for addresses apart from CustomerID. Why not make the address fields columns in the Customer table, and do away with the address table altogether? What advantage have you gained with a separate address table?
Mike MacSween
Tuesday, January 04, 2005
 
 
My Name Is A Secret said:

'And I want to know:

Whether or not that table is in 2NF. Like your address example with 2 AAA cities.

It is not in 2NF because the Hair Colour repeats itself.'

and I wonder if:

My table, People, with columns ID, FirstName, LastName and values:

1, David, Smith
2, John, Brown
3, David, Jones

is in 2NF
Mike MacSween
Tuesday, January 04, 2005
 
 
"I think you've misunderstood me. I'm saying the opposite. You've described an ADDRESS table that has CustomerID as the primary key. So a customer can't have more than one address - CustomerID is the PK, therefore unique, yes? So why bother having a separate table for addresses? Unless you're going to have addresses with more than one customer. But I don't see how you can do that, as there doesn't seem to be any key for addresses apart from CustomerID. Why not make the address fields columns in the Customer table, and do away with the address table altogether. What advantage have you gained with a seperate address table?"

Well, if more than one customer lives at one address. :) According to 2NF, to eliminate redundancy.
My name is secret!
Tuesday, January 04, 2005
 
 
You said:

'Well, if more than one customer lives at one address. :) According to 2NF, to eliminate redundancy.'

I ask:

How will you accommodate that if CustomerID is the primary key of the address table? Why do you have an address table that has CustomerID as its primary key?
Mike MacSween
Tuesday, January 04, 2005
 
 
"and I wonder if:

My table, People, with columns ID, FirstName, LastName and values:

1, David, Smith
2, John, Brown
3, David, Jones

is in 2NF"

And the two Davids? And then the children: Michael Smith, Lucy Brown and David Jones Jr?
My name is secret!
Tuesday, January 04, 2005
 
 
How will you accommodate that if CustomerID is the primary key of the address table? Why do you have an address table that has CustomerID as its primary key?

Ah, well, no, I don't have a CustomerID in the address table. A single customer could share the address with a spouse or child who is also a customer. I did a "join" in my head when I wrote that the CustomerID is in the address table.
My name is secret!
Tuesday, January 04, 2005
 
 
My point then, which you are stubbornly refusing to understand, presumably because it will show quite how poor your AAA example is, is this:

You are confusing repeated VALUES in a certain field with unnormalised relations. Whether or not I want to spin hair colour off into another table, as suggested by David Aldridge (using the obvious method of a junction table -- well, almost a junction table!), will depend on how I want to use that column. The issue of mistyped values, like cities, can actually be solved by declaring a check constraint for a particular attribute, which in this case might be: Blonde, Brunette, Redhead, Grey. It doesn't have to be a separate table, though that's often better.
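For example (illustrative only):

CREATE TABLE Models (
    ID         INTEGER PRIMARY KEY,
    Name       VARCHAR(80),
    HairColour VARCHAR(10)
        CHECK (HairColour IN ('Blonde', 'Brunette', 'Redhead', 'Grey'))
);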

The issue of whether a table is normalised or not is one of Functional Dependency. And this looks normalised to me, to 3NF and almost certainly beyond. The Primary Key, ID, determines one value of Name and one value of Hair Colour. I may be wrong; if I am I would like to hear somebody's opinion of why that table isn't in 3NF.

My second example, of a people table with 2 Davids in it was trying to get you to realise the point. But I'm not sure you have.
Mike MacSween
Tuesday, January 04, 2005
 
 
>So a customer can't have more than one address...
Yes you can, and you MUST MUST MUST accommodate such cases. Businesses often have a physical address: where you walk in the door; a shipping address: where the pallets of loot arrive; and a billing address: where the invoices go to get paid. If you cannot accommodate such cases, you're screwed. If the invoice shows up at the loading dock, it might never get to the accounting department.

>What advantage have you gained with a separate address table?

The answer will be a quote from something I posted earlier in this thread:

Sample of some tables:
Addresses (CustomerNumber, TypeCode, AddrLine1, etc)
PhoneNumbers (CustomerNumber, TypeCode, AreaCode, Phone, etc)

The PKs will be (CustomerNumber, TypeCode)

Sample data:
TypeCodes ("Address", 1, "Address", "main")
TypeCodes ("Address", 2, "Shipping Address", "ship to")
TypeCodes ("Address", 3, "Billing Address", "billing")
Addresses (34, 3, "PoBox 12".......)
Addresses (34, 2, "123 Main Street"....)
Addresses (34, 1, "114B N 200 street W"...)

If you deal with car dealerships, you will come across cases where the body shop is several blocks away from the new car lot. And invoices have to be sent to the parent office, which may be in a different state. Someone with a PO Box for their "main address" will need a different address when we ship them several pallets of loot.

Get your city and county names in the USA from:
http://www.census.gov/geo/www/fips/fips.html

Rest of world from:
http://earth-info.nga.mil/gns/html/index.html
Peter
Tuesday, January 04, 2005
 
 
Peter, I think you've missed my point. I agree with all you say, but if you look at Secret Name's post, he's talking about an address table which has the CustomerID as the PRIMARY KEY. So with that structure a customer can't have more than one address.

I know customers have many addresses, YOU know customers have many addresses, but Secret Name's address table won't allow that.
Mike MacSween
Tuesday, January 04, 2005
 
 
Mike, you're getting a bit more worked up than this conversation really warrants. (To the point where you're coming across as quite rude.)  If you want to argue normalisation, I suggest finding someone else who likes arguing it as much as you appear to, such as Fabian Pascal.

Tuesday, January 04, 2005
 
 
Actually not at all worked up. And if pointing out other people's lack of knowledge of something so fundamental looks like 'rude' then so be it. I don't intend to be rude.

Interesting, sad really, that so many who should know don't seem to. Still, if I think I know better than they do and they aren't interested, then I shall keep my wisdom to myself and let my clients profit from it :)

Yes, you're right, none of this matters really! Let's get back to the important stuff. OK folks, which is better Perl or Python?
Mike MacSween
Tuesday, January 04, 2005
 
 
>Actually not at all worked up. And if pointing out other people's lack of knowledge of something so fundamental looks like 'rude' then so be it. I don't intend to be rude.<

I'm almost offended by that. :)

>Interesting, sad really, that so many who should know don't seem to. Still, if I think I know better than they do and they aren't interested, then I shall keep my wisdom to myself and let my clients profit from it :)<

Now I'm offended, >:)
My name is secret!
Tuesday, January 04, 2005
 
 
Mike, when you ask a question and then start calling those who respond to it stupid, it's really not very polite.  One hesitates to get really nasty and use the t-word, but...
Kyralessa
Tuesday, January 04, 2005
 
 
When did I use the word stupid?
Mike MacSween
Tuesday, January 04, 2005
 
 
"Sorry to be blunt. But that looks to me like old fashioned sloppyness. I suspect that means that many people think they know what normalised means, but don't really. And haven't thought about it it enough."
Kyralessa
Tuesday, January 04, 2005
 
 
And the word stupid is where in that paragraph?
Mike MacSween
Tuesday, January 04, 2005
 
 
For an example of multiple addresses per entity, take the "Department of Defence Activity Address Code"

https://day2k1.daas.dla.mil/daasinq/dodaac.asp?cu=d

Type in FB7054 and you get the three address types for Bolling AFB -- billing, shipping, and something else I can't remember.
David Aldridge
Tuesday, January 04, 2005
 
 
You said:
"I shall keep my wisdom to myself"

I said:
Amen!
JeremyNYC
Wednesday, January 05, 2005
 
 
lol
i like i
Wednesday, January 05, 2005
 
 
You have to use equidistant letter spacing like in The Bible Code to find it.
Kyralessa
Wednesday, January 05, 2005
 
 
>>How does storing the name of a city in an address table make the table denormalised?<<

You don't store a name, you store a string. To a computer, "Hickville" and "HickVille" are not the same: let's say your particular application of the DB wants to fetch all addresses in "hickville"; the result would be different from "Hickville". Also, even with an index on the table, the query could be expensive. If there is a limited number of cities/towns, it could be a good thing to solve it with a separate table. In the Hickville case you are trading 9-18 bytes for 2 bytes of information in one row. With repetition of a city name as a string there is redundancy, thus the table is denormalised. With repetition of a string there is also a lack of integrity.

Wednesday, January 05, 2005
 
 
Hmmm, so the cases for denormalization are when the situation is so simple that it doesn't matter, or for reporting purposes?
David Aldridge
Wednesday, January 05, 2005
 
 
>>Hmmm, so the cases for denormalization are when the situation is so simple that it doesn't matter, or for reporting purposes?<<

It's about the end usage of the table: if the usage of the info is trivial in a non-trivial database, then the db model should reflect that.

Wednesday, January 05, 2005
 
 
I don't think I've ever seen a JOS thread generate this many posts.
NetFreak
Wednesday, January 05, 2005
 
 
David,

Actually, I don't think it's that simple, I think we've just been presenting examples like this because they're easier to illustrate, and they come up more often.

There are times when it actually _does_ make sense to store a calculated field, because the calculation is so complex that it takes a long time to run, and the data are often displayed in list view (requiring the calculation to be done many times), and while some data may be edited, the parts of the data that affect the calculated value are very rarely edited, and you're reasonably confident in your ability to keep the stored, calculated value in synch with the stored values from which it is derived.
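A minimal sketch of that trade (invented names; :changed_invoice is a bind placeholder): the derived value is stored, and recomputed only when one of its rarely edited inputs changes:

-- On the rare edit that affects the inputs, refresh the stored value
UPDATE Invoices
SET    CachedTotal = (SELECT SUM(l.Qty * l.UnitPrice)
                      FROM   InvoiceLines l
                      WHERE  l.InvoiceID = Invoices.InvoiceID)
WHERE  InvoiceID = :changed_invoice;

-- List views then read CachedTotal directly, with no recalculation
SELECT InvoiceID, CachedTotal FROM Invoices;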

In a situation like this, it is important to recognize that these things we call rules are merely guidelines, and that as a developer it's your job to weigh the risks of inaccurate data against the suffering of your users as their hair grows grey waiting for your data to display.

Jeremy

PS: When I grow up, I want to learn how to write paragraphs that are more than one sentence long.
JeremyNYC
Wednesday, January 05, 2005
 
 
83 posts on this thread.  My!

And they mostly seem on-topic.  My my!

I guess it is true what they say -- if you post something that lots of people feel passionate about, or know something about, you'll get lots of responses.

I propose if somebody wants to continue the Database Denormalization thread, they start another thread.
AllanL5
Wednesday, January 05, 2005
 
 
Yup, 83's a pretty good score. Maybe we could have a politics and religion thread to see if we can beat it. Mind you, for some people this is pretty close to a religion.

Jeremy, that's a good example -- storage of complex calculation results, and it reminds me of a similar situation from a previous job. We had utility meter readings for which the consumption since previous reading needed to be displayed. The developer chose to calculate these on-the-fly, which meant that each calculation had to be done many more times than it would have been if the calculation had been performed on insert/update/delete of reads.

Had the developer used some form of clustering table it would not have been nearly so painful to calculate on-the-fly -- the frequency of the reads meant that if you wanted to retrieve 20 of them for a single meter then you'd invariably have to read 20 blocks from the database, and a table clustered on meter id would have reduced this to 1 or 2.
David Aldridge Send private email
Wednesday, January 05, 2005
 
 
You can have what appear to be dependent values, strings such as City, and still ensure validation without normalising out the City and replacing it with CityID. 

If you can look up an address by post code or zip code then you can recover the City from that search without needing the user to enter it and misspell it.

My fundamental address design is intended to have multiple connections for different entities but a single physical entry for each address.  Each address has a unique id of its own; post codes and zip codes are not unique, and even in applications where the data is from a single city you can have duplicate (or worse in the case of the US, rezoned) zip codes.  Hence you need a separate and invisible AddressID.

Connecting to that Address table will be various master link tables to give you the many to many relations that are required.

A further extension to this model is where you need to keep address information temporally, i.e. you need a history of where someone, or some company, has been.  This is frequently true with Tax systems, for example.

In this case the master link table between the Client and the Address table has the begin and end dates of that address; an empty end date presumes it's current (whatever current really means).

The only time I'd squish these relations would be for the purposes of presentation, and then they'd simply be cursors, updateable if necessary.  Updateable cursors can mean that complicated relationships can be updated from flattened views.
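A bare-bones sketch of that shape, with invented names:

    -- Hypothetical: one physical row per address, shared via link tables.
    CREATE TABLE Address (
        AddressID INT PRIMARY KEY,   -- the separate, invisible surrogate key
        PostCode  VARCHAR(10),       -- not unique, possibly unknown at entry
        Line1     VARCHAR(100) NOT NULL
    );

    CREATE TABLE Client (
        ClientID INT PRIMARY KEY,
        Name     VARCHAR(100) NOT NULL
    );

    -- Master link table carrying the temporal information.
    CREATE TABLE ClientAddress (
        ClientID  INT  NOT NULL REFERENCES Client (ClientID),
        AddressID INT  NOT NULL REFERENCES Address (AddressID),
        BeginDate DATE NOT NULL,
        EndDate   DATE,              -- empty/NULL presumed current
        PRIMARY KEY (ClientID, AddressID, BeginDate)
    );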
Simon Lucy Send private email
Wednesday, January 05, 2005
 
 
Thanks to all who responded. Even those who thought I was calling them stupid, or that I was trolling. Neither of which is true.

I won't repeat it, but my first post meant exactly what it appears to mean.

The small number who think I was insulting them on a personal level just need to read the posts carefully. This is a discussion forum for software developers. And if I think somebody is wrong I'll say so. I hope others would do the same to me, and in fact I invited them to do so. I pushed for clear explanations of why a particular example is or isn't normalised. That seemed to be taken as criticism. But I would expect people with expertise to be willing to explain a technical point.

I accused people of being sloppy too. That's not a personal insult, it's a comment on development style.

An insult is when you say unpleasant things about somebody's mother or personal hygiene (well, unless it's true, then you're just being a good friend!).

When people 'take things personal' in on-line technical discussion forums such as this it often indicates that they're going on the defensive and are not actually as confident of their expertise as they would like to be.

Fabian Pascal was mentioned. Those who aren't familiar with him might like to look at www.dbdebunk.com. He often appears like a lone prophet shouting from the hilltops. He even looks grumpy in the photo! At least Chris Date manages a smile. But what he actually says makes a great deal of sense. I have one of his books. It's very well written, useful and not at all 'pure theory'. And the more I look around me the more I agree with him that the level of understanding of database fundamentals in the development profession is fairly low.

On a general level I suspect that some people might think (though I might be wrong) that database design is a bit like choice of programming tool. That it is to some extent a personal thing. Assembler and C are closer to the metal, but VB is quick and dirty. But you can get the job done with both. And you can be a bit normalised or a lot normalised. Both get the job done. I don't think it's like that. The WHOLE point of relational databases is that they conform to a set of 'rules'. The normal forms, each of which is separately defined and, at least the first 3, build on one another. And in any good text which describes the normal forms the consequences of NOT being in XNF can be clearly described in terms of update, insert or delete anomalies. Things which make the database wrong.

In a 4GL you can use a Select Case, or lots of nested If Then Elses, or maybe even a Do Until. They'll all get you there, and probably one of them will be the fastest, another the most elegant etc. Logical modelling (i.e. the relational modelling) isn't that. It's the first stage of the design. Up to that point, I would say, everything else is requirements gathering, or a version of it. Even the CDM (conceptual data model) is, sort of, part of requirements. Maybe. Feel free to disagree.

But once you do the logical model you're making a commitment about how this is going to be. And all the metrics I've read about the 'cost' of repairing design faults say that it is 100s of times more expensive to fix something that's wrong at this early stage than it is later.

The response to the subject line was interesting. I used the word DEnormalise. To me that implies that tables are normalised to, let's say, 3NF, then tests are run to see what effect that has on performance/maintainability or whatever. And then if necessary certain tables are DEnormalised to recover performance. Which will usually mean 2 or more tables combined into 1. One poster claimed to do that. But in general I get the impression that when people say they denormalise they aren't doing what I describe. They just aren't normalising in the first place. If people are going to denormalise then it has to be very clear. 'We went to 3NF, but the outer join on these 2 tables was very slow, so we combined them into one and accepted the NULLs'. Those were the sorts of examples I didn't get many of.
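To make that concrete, a hypothetical before-and-after (the tables are invented):

    -- Before: normalised, with the optional attributes split out.
    CREATE TABLE Employee (
        EmployeeID INT PRIMARY KEY,
        Name       VARCHAR(100) NOT NULL
    );

    CREATE TABLE EmployeeCar (
        EmployeeID INT PRIMARY KEY REFERENCES Employee (EmployeeID),
        CarReg     VARCHAR(10) NOT NULL
    );

    -- After: the outer join proved too slow, so the two tables were
    -- combined into one and the NULLs accepted.
    CREATE TABLE EmployeeFlat (
        EmployeeID INT PRIMARY KEY,
        Name       VARCHAR(100) NOT NULL,
        CarReg     VARCHAR(10)           -- NULL where there is no car
    );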

What I got more of was some rather general stuff. 'Sometimes it gets a bit too complex so you can't always fully normalise'. 'The application isn't really about storing that sort of information so I don't need to worry about normalising that bit'.

The idea runs like this, as I understand it: the more that data integrity and consistency, let's call it 'correctness', can be maintained by key constraints in the database itself, the less work there is for the application to maintain correctness. Application developers may THINK that the only way anybody will ever get at the data is through their application - but will that be true for all time? Probably not. And of course it only takes one mistake in the application, one place where the developers have not seen the opportunity for incorrect data to get into the database, and things start to go wrong. And the 'lowest level' way of ensuring data integrity at the database level, even better than triggers and stored procs, is key constraints. I.e. tables which conform to 3NF.
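For instance, a made-up fragment; the constraints themselves reject bad rows no matter which application does the writing:

    -- Hypothetical: integrity enforced by keys, not application code.
    CREATE TABLE Course (
        CourseID INT PRIMARY KEY
    );

    CREATE TABLE Enrolment (
        StudentID INT NOT NULL,
        CourseID  INT NOT NULL REFERENCES Course (CourseID),
        PRIMARY KEY (StudentID, CourseID)  -- no duplicate enrolments
    );

    -- Whatever tool issues this, the database itself rejects it when
    -- course 999 doesn't exist; no application check required.
    INSERT INTO Enrolment (StudentID, CourseID) VALUES (1, 999);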

It appears to me that many developers just aren't very interested in this. They get a bit of an idea about normalisation from a book which is primarily aimed at programmers, but has the obligatory chapter on databases. They get the gist of it and then carry on. I think developers should be interested. The majority of apps are database apps, and the database structure is the basis.

Of course it's understandable. There's no equivalent to 'Hello World' in database design. The first time you write 'Hello World' in C you think, Wow. The first time you do a loop that rips through a recordset and capitalises all the country names you think Wow. The first time you write a PHP script that shoves some data up on a Web site, or automates an email you think Wow. There isn't really an equivalent in database design. It's a bit boring really.

I've gained some insights here. Principally that one, maybe the main, place you would denormalise is with static, historical data, for reporting purposes. I can see the point of that. Complex calculations too, though David Aldridge suggests an alternative. And sometimes just because you're stuck with what somebody else has done previously. Which is valid, if not ideal.
Mike MacSween Send private email
Wednesday, January 05, 2005
 
 
"And you can be a bit normalised or a lot normalised. Both get the job done. I don't think it's like that. The WHOLE point of relational databases is that they conform to a set of 'rules'."

I disagree.  Sure there are rules, and at the most basic level the rules are clear.  But once you get into the case of whether or not you're going to store an address in the customer table or use another table, it becomes a choice.

Everything in programming is a trade off.  And that includes database design.  One could normalize a database to death with 4.5NF or beyond, but they are going to spend a lot of time doing that and the result will be more complexity for the application.  If we lived in a vacuum with infinite time and resources then everyone could just follow the rules.

"Application developers may THINK that the only way anybody will ever get at the data is through their application - but will that be true for all time?"

No.  So you have to know how far you want to take it right now.  If you built for every possible contingency or for every possible feature in the future you'd never finish anything.  And maybe, down the road, you can just redesign your database if needed.
Almost Anonymous Send private email
Wednesday, January 05, 2005
 
 
"It appears to me that many developers just aren't very interested in this. They get a bit of an idea about normalisation from a book which is primarily aimed at programmers, but has the obligatory chapter on databases. They get the gist of it and then carry on. I think developers should be interested."

Those many developers sure are lucky they have you around to set them straight, Mike.
Kyralessa Send private email
Wednesday, January 05, 2005
 
 
>> Those many developers sure are lucky they have you around to set them straight, Mike. <<

Pretty sorry material you're working with there, humor-wise. Shouldn't you be over at slashdot with the rest of the intellectuals?
David Aldridge Send private email
Wednesday, January 05, 2005
 
 
No, I prefer to sit here in isolation along with my doomed app with its inconsistent publisher names.
Kyralessa Send private email
Wednesday, January 05, 2005
 
 
Wow, I walked away from this thread yesterday thinking it was going to be moribund soon, and it just kept on going. :-)

Any time anyone takes the time to learn the core of some piece of technology that lots of people use at a superficial level, there's a chance that you'll form, if not outright resentment, a combative attitude towards those people who haven't bothered to learn the internals.

You'll see this with languages, structured programming, OO, compilers, databases, system calls, APIs, etc.

Most people are happy to treat databases as "tables containing objects".  I.e., there's no data modelling step, and their db is just a straight translation from their object model.  This works for them, and they never bother learning much in the way of normalisation.  They don't understand the differences between normal forms, even if they think they do.  Most working programmers who do any pseudo-RDBMS/SQL work at all fall into this camp.  They get a lot of mileage out of their databases, so there's no compelling argument for learning 6NF. (not that anyone has a justification for 6NF :-))

In conversations such as this, you have people who are annoyed at the ignorant masses who assume they're saying something they're not, and people who think they know what they're talking about and hate to be spoken down to.

I love it. ;>
Art Send private email
Wednesday, January 05, 2005
 
 
Minor facts that got a mention in this thread:

1) The ISBN is made up of a language code (1-2 characters), a mod 11 check digit at the end (0-9,X), a publisher code and a book-within-publisher number.

2) There are 4 or 5 places in the US where ZIP codes do not align to state borders. APOs were also changed.  These exceptions are very small and in the middle of nowhere out west.  It is simply easier to send the mail to a state post office that is a few hundred miles closer to them than to truck from their home state.  ZIP+4 always matches political units. 

The only reason for not normalizing addresses is that you get your mailing data from a trusted source that scrubs it first (USPS, Melissa Data, etc.) 

If your industry uses SAN (Standard Address Numbers) like the book trade, then you should use them and update them from a trusted source. 

3) Data warehouses are denormalized because they have been scrubbed and verified, and then are NEVER updated once the data is in place.  OLTP data is changed all the time and does not get scrubbed at all.  Two very different data stores.

4) Failure to normalize an OLTP database can kill people.  I did a "quicky consulting job" about that last year with a medical relief charity that violated 1NF and screwed up quantities.
Joe Celko Send private email
Wednesday, January 05, 2005
 
 
When I enter a ZIP code into any form, I expect the database to be able to figure out the rest. Making the city and state fields mandatory wouldn't help, as I would enter junk into them. Therefore, you have to have a (zip, city, state) lookup table in your database schema anyway.
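Something like this sketch (names invented):

    -- Hypothetical (zip, city, state) lookup table.
    CREATE TABLE ZipCode (
        Zip   CHAR(5)     PRIMARY KEY,
        City  VARCHAR(50) NOT NULL,
        State CHAR(2)     NOT NULL
    );

    -- The form captures only the zip; city and state are looked up.
    SELECT City, State
      FROM ZipCode
     WHERE Zip = '10001';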
Mikito Harakiri
Wednesday, January 05, 2005
 
 
All this talk about normalizing addresses is hilarious. I've never seen it done that way, even on a 7 figure system with 3 million accounts. There is simply a USERS and an ADDRESSES table.
unknown
Wednesday, January 05, 2005
 
 
>> All this talk about normalizing addresses is hilarious. I've never seen it done that way, even on a 7 figure system with 3 million accounts. There is simply a USERS and an ADDRESSES table. <<

Well, like Joe said -- you get your data from an authoritative source, and it's the source that has scrubbed it. Just like with a denormalized reporting database, the integrity is guaranteed elsewhere.
David Aldridge Send private email
Wednesday, January 05, 2005
 
 
'All this talk about normalizing addresses is hilarious. I've never seen it done that way, even on a 7 figure system with 3 million accounts. There is simply a USERS and an ADDRESSES table.'

Neither have I. I'm in the UK so this talk of ZIP codes and whether or not they align with political boundaries doesn't apply to me. Although it is interesting, to some extent. I'd like to see it working in practice, especially applied to international addresses.

There is a difference between a fully normalised set of tables, and the use of lookup tables to ensure consistency, which is obviously a lot weaker.

The app I'm currently working on includes UK Postcodes, but they aren't always known at data entry time. And even at requirements time it was explicitly stated that foreign addresses would need to be stored, initially from a small number of known countries, but potentially any country in the world.

If one takes the time to investigate addresses, especially international ones, it becomes clear that it is a distinctly non-trivial task to accommodate them all successfully. There are widely differing formats. Some countries, large ones at that, actually use address elements in REVERSE order:

Region
City
Area
Street Address.

Some countries have quite well formulated postal addresses (the USA seems to), some fairly well formulated addresses, like the UK, and in some countries it's anybody's guess.

Then try to account for differing addressing schemes in times of conflict and changing boundaries...

Oh yes, and armed services post, get that into your system. Here in the UK post to the military, all 3 branches, goes to BFPO. Which stands for British Forces Post Office (NOT Posted Overseas, like so many think). And there's BFPO numbers, e.g. BFPO 51. And there's no postcode for BFPO addresses, and BFPO itself (the organisation) definitely DOESN'T want the country name in the address, even if you know where the service personnel are, because it then goes to that country's civilian postal service and disappears. Or gets blown up if it's big and might be a bomb. And BFPO addresses DON'T give you a physical location, the format is:

Service No Rank Name
Unit/Regt
Operation Name
BFPO No

So to get back to the normalisation issue: the functional dependency postcode (BFPO no in this case) to City, or whatever, doesn't apply.

Several points arise from all this discussion and from me looking long and hard at addresses:

1. Addresses are a very difficult thing to arrange in a normalised data structure. So probably not the best thing to discuss in terms of normalisation or maybe a very good thing to discuss.

2. But probably possible. Clive Bolton, a regular at comp.database.ms-access, came up with a set of tables which looked promising. I'll try to find them.

3. An address isn't the same as a static physical location, as most people think of it. Probably. But that's application dependent. It MIGHT be a location. Or it might be a delivery point. But they often get mixed up.

4. The difficulties surrounding successful address modelling won't get better until people who post in international forums stop using examples that apply to only 4% of the world's population. Don't use ZIP codes or SSNs as examples unless you are willing to start a newsgroup called comp.databasetheory.americansonly. And puleeze, don't get all touchy and 'oh my god you're anti-american'. Time after time in EXACTLY this sort of discussion I've seen all sorts of suggestions which simply don't apply to many people involved in the discussion because they aren't in the USA. Using SSNs as a PK is a non-starter for me (whether it's a good idea or not) because we don't have them here.

Oh yes, and if a few of you would be good enough to post again I'll get an award from Joel for starting a thread with more than 100 posts in it. Thanks.
Mike MacSween
Thursday, January 06, 2005
 
 
>>1. Addresses are a very difficult thing to arrange in a normalised data structure. So probably not the best thing to discuss in terms of normalisation or maybe a very good thing to discuss.<<

I would think that is the best way to get the discussion going. An address is such a simple thing at a glance, but it becomes complex when you start to dig into it.

>>3. An address isn't the same as a static physical location, as most people think of it. Probably. But that's application dependent. It MIGHT be a location. Or it might be a delivery point. But they often get mixed up.<<

I agree, and the modelling of address tables is dependent on the application that uses them. It could be a location or a delivery point and the db modelling must reflect that!


>>4. The difficulties surrounding successful address modelling won't get better until people who post in international forums stop using examples that apply to only 4% of the world's population. Don't use ZIP codes or SSNs as examples unless you are willing to start a newsgroup called comp.databasetheory.americansonly. And puleeze, don't get all touchy and 'oh my god you're anti-american'. Time after time in EXACTLY this sort of discussion I've seen all sorts of suggestions which simply don't apply to many people involved in the discussion because they aren't in the USA. Using SSNs as a PK is a non-starter for me (whether it's a good idea or not) because we don't have them here.<<

I've got to point out that my nationality is Swedish.
Secret name
Thursday, January 06, 2005
 
 
"4. The difficulties surrounding successful address modelling won't get better until people who post in international forums stop using examples that apply to only 4% of the world's population."

Mike, an example is just an example.  A few quotes from Fowler's _Refactoring_ are illustrative here:

"A proper example of this requires a long chapter, so I'm showing this refactoring for a method that doesn't need it.  (Don't ask what the logic of this method is, I made it up as I went along.)"

"This seems almost too simple for an example, but, hey, at least it is quick to write:"

"I use the boring and unrealistic example of employee payment:"

"I again use the timresome and brainless example of employee payment:"

"I use the tedious and simplistic example of employee payment."


The purpose of an example is to illustrate a concept; to pick apart the example misses the point.  You might just as well blame bad error messages in production code on this:

http://www.fogcreek.com/FogBugz/NewIn3-0.html
Kyralessa Send private email
Thursday, January 06, 2005
 
 
Address format tends to be specific to the country. You *might* want to consider an address table per country, or for a group of countries when their formats match.

Take a look at Tokyo addresses if you want to see something very different. Can one table layout accommodate US, UK and Japanese addressing?

Mike MacSween, I would like to see those table designs.
Peter
Thursday, January 06, 2005
 
 
You could possibly have a single table for those if you had a parallel table for the position each element had in presenting it (possibly in entering it as well).  However, that doesn't cover situations when slightly different concepts of address elements exist: Japan could maybe be squeezed into a UK format, but I think it's more like the Taiwanese address, which has extra elements, and neither is like the German address.
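The single-table idea might look something like this sketch (element types and names invented):

    -- Hypothetical: one generic element table, plus a per-country
    -- format table giving each element's position when presented.
    CREATE TABLE AddressElement (
        AddressID   INT          NOT NULL,
        ElementType VARCHAR(20)  NOT NULL,  -- 'Street', 'City', 'Region'...
        ElementText VARCHAR(100) NOT NULL,
        PRIMARY KEY (AddressID, ElementType)
    );

    CREATE TABLE CountryFormat (
        CountryCode CHAR(2)     NOT NULL,
        ElementType VARCHAR(20) NOT NULL,
        Position    INT         NOT NULL,   -- order of presentation
        PRIMARY KEY (CountryCode, ElementType)
    );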

So, if you want to keep semantically structured international addresses (instead of just strings of stuff which make sense to the people that read them), you have to have different table structures for different countries' requirements.

From the last time I looked at this I have a feeling that there is an ISO (or maybe it was a UN/EDIFACT or ANSI) definition of a set of address structures for different countries.

Additionally, this may just be the 100th entry in the thread.
Simon Lucy Send private email
Thursday, January 06, 2005
 
 
"Use SSNs as a PK is a non-starter for me (whether it's a good idea or not) because we don't have them here."

My ethnocentric bad. :-)

An SSN, as you've guessed, is the US "Social Security Number".  It's a way for the government to tag us throughout the myriad systems and databases so they can monitor our movements, cash flow, associations, etc., with the eventual goal of arresting us and relocating us to forced-labor collectives.

I think it's roughly similar to your National Insurance Number or Canada's Social Insurance Number, but I'm not sure that it's quite as pervasive an identifier there as it is here.  Almost _any_ database with any financial information related to an individual in the US uses the SSN.

However, some people do not have SSNs or an equivalent (although that number is going down through some rather severe penalties), and some people have actually changed SSNs due to identity theft or other issues.

So, I could revise my earlier anti-SSN-key statement and say, "Don't use externally-controlled keys because you can't keep them from changing!" :-)
Art Send private email
Thursday, January 06, 2005
 
 
Art,

One shouldn't forget that SSNs aren't completely unique either, so using them as keys is a bad idea.
Almost Anonymous Send private email
Thursday, January 06, 2005
 
 
http://www.csl.sri.com/users/neumann/ssaforum.html

* 2i. Allocation of SSNs. There are various cases in which the same SSN has been allocated to or used by different people. In one case, two women with almost the same name and birthdays in the same month were given the same SSN, which was discovered only when one of them was dunned for back taxes on their combined income ([4], p. 196).

I've seen the actual names of these 2 women once, but can't find a better reference right now. I think it's a fairly well known case.
Mike MacSween
Thursday, January 06, 2005
 
 
It's well enough known that it's drilled into your head in school never to use SSNs for keys.  In fact, it's probably the main argument for informationless keys too.
Almost Anonymous Send private email
Thursday, January 06, 2005
 
 
Kyralessa said:

'Mike, an example is just an example.  A few quotes from Fowler's _Refactoring_ are illustrative here:'

I say:

Not sure who Fowler is. Over here, in the United Kingdom, this sceptered isle, this jewel set in a silver sea (oh for heaven's sake Mike!), the Fowler we know is the author of Modern English Usage. I think it used to be called The King's English. Then the king died, leaving only daughters, and the publishers realised the problem (client to developer - 'we will never ever ever want to change the title of this book'). S'about grammar and stuff innit? Like how some people get really really hung up on what they perceive as 'the rules' and consequently produce really ugly awkward difficult to use things by trying to stick to the rules. That would never ever happen in any field of work I was involved in. No siree, I'm not like that. As you can tell.

Yes, I know it was just an example. It's a bit of a peeve of mine, constant US-centric examples. But not that important I guess. And I suppose if you're an American (and correct me if I'm wrong, but I get the distinct impression that some of you actually are) it makes sense to use examples from what you see around you.

I do think that an 'example' of a programming thing, which I think is what this other Fowler seems to be talking about, is different to an 'example' in database design. In database design every 'example' is actually the thing itself. So each example is different and can't really 'stand' for anything else. Maybe.
Mike MacSween
Thursday, January 06, 2005
 
 
Ironically, I'm not American and I didn't go to an American university -- but they still used SSN as an example!
Almost Anonymous Send private email
Thursday, January 06, 2005
 
 
Mr MacSween, if you don't know who Fowler is in a software engineering forum context, then that would explain a lot of this.

Golly, I'm ashamed to be a Brit too.
i like i
Friday, January 07, 2005
 
 
I'm looking at the index for Fowler's _Refactoring_, the work that introduced him to the thread, and there are two pages (actually, two minor parts of two pages) on databases, and much of that is in re object databases.

It's completely acceptable that a senior data architect hasn't read it.

Have you read any Codd or Date?  They're much, much more relevant to this conversation than Fowler.
Art Send private email
Friday, January 07, 2005
 
 
Nope I hadn't heard of Fowler before. Just looked him up on Google. He seems to be an author on the subjects of OOP, UML and Enterprise Architectures.

Would you say his books on those subjects are worthwhile? If so, then when I'm looking to buy one I'll certainly consider one of his.

Seeing as we seem to stick to surnames here my bookshelf has Wiegers, McLure, Scambray, Kurtz, Brooks, Liberty, Prata, Gregory, Slater, Arnott, Lampel, Long, Anderson, McConnell, Minsai, Jennings, Pogue, Walther, McConnell, Davis, Lewis, Griver, Kriegel, Nielsen, Russel, Crawford, Gerend, Thayer, Sweeney, Liberty, Halvorson, Kasevich, Reid, Phan, Prague, Reardon, Meloni, Stephens, Plew, Hernandez, Shapiro, Boyce, Date, Pascal, Tittel, Getz, Gilbert, Litwin, Riordan, Castro, Watson, Leszynski, Foxall, Amo, Irwin, Roman, Baron, Chipman, Smith, Sussman, Simsion, Connell, Ullman, Converse, Park, Horton, Slovak, Burnham, Gifford, Blackburn, Dewson, Hanselman, Hatfield, Johnson, Liske, Martins, Matsik, Salguero, Shelby, Slager, Thomspon, Townsend, Turley, Watson, Duff, Kumar, Waters, Gianotto, Endrerud, Anton, Stephens, Solin. Oh, and Gates!

If a name appears twice then it's different people. Many of these are co-authors, and of course I have more than one book by some of the authors listed.

I'm sorry I hadn't heard of Fowler, but then I don't know much about the subjects he seems to be expert on (well apart from the course I just took in Smalltalk!).

My impression is that the cursory knowledge I have of OOP is matched by many programmers' cursory knowledge of database design. Let me see, OOP is all about Encapsulation, Polymorphism and Inheritance, yes? There, got it!

I wouldn't dare to spout off about OOP with such flimsy knowledge. I am not worthy. But as and when I need and want to learn more I will. I know it's important, but I don't have unlimited time.

However I have tried (and succeeded, in part, I hope) to gain a deep knowledge and understanding of relational database design.

Thanks for putting me onto Fowler. My computer book budget for this month is used up.
Mike MacSween
Friday, January 07, 2005
 
 
I love this article for the same reasons outlined in this thread:

http://www.apa.org/journals/psp/psp7761121.html

(Yes, I'm giving into temptation to keep adding to the end of this thread and making it longer. :-))
Art Send private email
Friday, January 07, 2005
 
 
Wow, that's a lot of books on your shelf.  You must be a really brilliant database guy.
Kyralessa Send private email
Friday, January 07, 2005
 
 
>>Wow, that's a lot of books on your shelf.  You must be a really brilliant database guy.<<

Sarcasm???

:) "managing complexity" from secret name's first post on the thread, comes from Code Complete(CC) written by Steve McConnell. McConnell happens to be on Mikes bookshelf. If Code Complete is on his shelf I think he should bother to read it. I find myself going back to the book for every project I'm involved in.
Paracelsus
Friday, January 07, 2005
 
 
Paracelsus, have you bought the second edition yet?  I'm wondering if it's worth replacing the first ed.
Art Send private email
Friday, January 07, 2005
 
 
The only copy I have right now is the second edition, the first edition belonged to my previous employer. I don't have all the differences in my head. But I have found a couple of updates that I think are worth the upgrade and a reread. I'm doing a complete reread right now of the whole book.
Paracelsus
Friday, January 07, 2005
 
 
I read the whole thread, or most of it. Even though I've done database stuff on and off, and in my spare time, for 8+ years, I guess I clearly don't know much about the pedantic arguments over normalization (notice the 'z').

First, there is no such thing as denormalization, since in most cases it appears that most databases lack proper "normalization", and secondly even if you "flatten" or condense tables into one for the purposes of enhancing performance, technically you are creating a new table that is simply not normalized.

Perhaps the term you meant was "intentionally not normalized" as this is different from "accidentally not normalized" or worse still "ignorantly not normalized".

If you go to the school yard with brass knuckles it is safe to assume you are looking to fight.  When you keep them in your pocket and bait people...

Just because the language is called "English" does not mean it is your language, the King is dead and so is his English.  Spanish will overtake English as the language of Commerce soon enough (unless Mandarin somehow eclipses it).

I like the idea of flattening tables for the purposes of reports. In the end I agree that constraints like time and budget limit the model and lead to decision making such as "well, we'll just write our app so that we can't mess this data up".

Human beings are flawed and the things we make are flawed as well, which is a good thing because it ensures jobs for programmers and database designers and a whole host of other people, so when you analyze (notice the 'z') a model it will inevitably have flaws.
Jay Lauffer Send private email
Saturday, January 08, 2005
 
 
'First, there is no such thing as denormalization, since in most cases it appears that most databases lack proper "normalization", and secondly even if you "flatten" or condense tables into one for the purposes of enhancing performance, technically you are creating a new table that is simply not normalized.

Perhaps the term you meant was "intentionally not normalized" as this is different from "accidentally not normalized" or worse still "ignorantly not normalized".'

Yes, you're right. I should have been more precise. I used 'denormalisation' because that's the term I've seen used most often. Perhaps these could describe the states of a table:

1. Normalised to xNF. Whether 'by accident', 'by gut feeling', 'by experience', 'by following the xNF rules' or whatever. I think it's better if the NF has been arrived at consciously. But we all know that we often arrive at normalised tables 'by accident'.

2. Not normalised at all, as anything else must be normalised to a certain NF.

3. Denormalised. The tables were normalised to xNF; performance/usability/cost or something else suffered badly, so the tables were DEnormalised back to (x-1)NF. So it's still normalised, just to a different normal form.

'Spanish will overtake English as the language of Commerce soon enough (unless Mandarin somehow eclipses it).'

Do you think it's worth learning Spanish?
Mike MacSween
Saturday, January 08, 2005
 
 
Jay,

Why "notice the z"?
Practical Geezer
Saturday, January 08, 2005
 
 
"Denormalized" is just a standard term, but I think it means something different from "non-normalized". It's always implied to me that the database has been designed to be normalized (usually 3NF) and then to cut down on the number of joins required for some popular queries, particular data elements have been "pushed" to a higher level (or maybe pushed to a lower level).

So for example in a hierarchy of A -< B -< C the unique identifier for A might be included in table C, or the number of C's might be stored in table B and/or A.
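In SQL terms, that might look like this made-up fragment:

    -- Hypothetical hierarchy A -< B -< C, denormalised as described.
    CREATE TABLE A (
        AID    INT PRIMARY KEY,
        CCount INT NOT NULL DEFAULT 0       -- count of C's pushed up to A
    );

    CREATE TABLE B (
        BID    INT PRIMARY KEY,
        AID    INT NOT NULL REFERENCES A (AID),
        CCount INT NOT NULL DEFAULT 0       -- count of C's pushed up to B
    );

    CREATE TABLE C (
        CID INT PRIMARY KEY,
        BID INT NOT NULL REFERENCES B (BID),
        AID INT NOT NULL REFERENCES A (AID) -- A's key pushed down to C,
                                            -- saving the join through B
    );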

It has never implied to me that the database was not normalized to at least 3NF in the first place -- my mind boggles at the thought that the initial design wouldn't have aimed for 3NF.

ps. I just noticed that I've been using "Z"s -- evidently I've been in the US for too long. Apparently though, the use of S's is a fairly recent UK English thing -- my mother tells me that she was taught Z's at school, and that would have been in about the 1940s/50s.
David Aldridge Send private email
Saturday, January 08, 2005
 
 

This topic is archived. No further replies will be accepted.
