The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Relational database design. pros and cons?

I'm familiar with entities, relationships. but just curious about archival process. since its on a normalized form I might be faced with problems with archival. When I need to remove old data, and a process to return it back again(in case they need the information again. am i wrong here?)
Do you have to design the database for future friendly?. To have an easy way of piping the data. As I can see the more its on a normalized form the harder it is to pipe or move the data. And does it make sense to have a separate data server for historical data?
Thanks.
j2e
Sunday, February 27, 2005
 
 
Why do you want to archive the data?
Craig Send private email
Sunday, February 27, 2005
 
 
One of the central concepts of most relational database design is that of NO HISTORY.  The Database, in its current state, does not maintain a history of transactions.

Now, there IS a 'transaction log' that can be used for various purposes -- including 'rolling back' the database to an earlier state.  But usually the transaction log is truncated to prevent it from taking up lots of space.

What typically happens is people who care take a 'snapshot' of the state of the database at particular times (using BulkCopy under Sybase).  Thus they keep the last three or so daily snapshots so they can restore a known-good state of the database.

But in general, "rolling back" to an earlier state of a relational database is not considered A Good Thing -- the value is in the current state only, and roll-backs are for disaster recovery.

I suppose you COULD implement a table in the database to record transactions -- but 'roll-back' using it could be very difficult, depending on the dependencies in your database.
AllanL5
Sunday, February 27, 2005
 
 
And now I see your issue is with archival of "old" data, not needed for "current" queries, but needed for some other purpose in the future.

I think a time-limited history database would be useful for this.  That would be like, all data from 2004 in one database, all data from 2003 in a second database, etc.  Then separate queries could be run against these databases should "legacy" answers be needed.

This can be done -- but again, you're kind of leaving the Relational Database as storage of current state of knowledge.  Just be aware of that.
AllanL5
Sunday, February 27, 2005
 
 
Hold on a minute (mind the pun).

Who says that relational databases only hold "current state of knowledge". At best, that is misleading.

Most relational databases I've seen have had to cater for history, and often do so by some kind of versioning scheme.

Take a look at any database used to support non-trvial business application and you will see that most keep full history.

Whether to archive off, and potentially re-load, old records is a different question. In general, it is not necessary, unless you have a vast amount of data.

Monday, February 28, 2005
 
 
Maybe you should consider a data warehouse for your old data. This would have the advantage of keeping your current relational data clean, and allowing easy access and details analysis of your historic data.
Jon Spokes Send private email
Monday, February 28, 2005
 
 
There's a difference in usage between transactional history as the database sees it and the transactional history that the user requires.

For instance, an accounting system stores invoices, for the current running of the accounts only those Invoices currently active are required but for the historical management of those accounts, those customers, you need to know what they did in the past.  Typically, at some period event transactions which are no longer current (paid, allocated and so on), are moved to an identically shaped table or set of tables.  If you need to get a view of all their transactions then a union of the two tables on whichever account gives you that.

On the other hand, from the database engine's point of view transactional history is what happened to this table and this row and even this field within this row.  That is what can be stored in a transactional log and theoretically rolled back.  A database engine's transactional management probably won't roll back much beyond the current event horizon and for that you have backups.

Object databases often hold every single change within the database itself and serialised until such time as that is thrown away and the database compacted.
Simon Lucy Send private email
Monday, February 28, 2005
 
 
To the OP,
  I have worked on several different relational systems that had to implement an archival and retrieval proccess.  I will admit neither was pretty.  To boil it down to base functionality.

  You need to determin which tables have the dated data you wish to archive.  With a well normalized database the data you want to archive will be spread acrossed multiple tables.  So now you need to take that dated table row and find all the associated rows you also want to archive.  For completeness of the archive database you will need to move all the associated rows from their tables.  The trick and ugly part as far as I have seen is before removing any row from any tables in the main database you must make sure that you are not going to leave any rows behind that are referencing them.  If you are you have to leave the assiciated rows until you finally archive off the last peice of data referencing it.

  This causes some data duplication in both databases.  You could not move the data over till you can actually remove all of it.  The problem with this solution is then you have incomplete data in the archive database with links between tables that may or may not be valid.  If you do move the associated tables data you will need to verify that is is not already there before adding it.

  The same safegards need to be taken with just switching the databases for restoring.

  For those wondering why do this.  Well in the meical insurance industry the fed mandates 24 or 27 months of data on your live system.  That is from membory so they may have lowered it since I worked the archival processes.  But the also mandate that 7 or 10 years of data must be recoverable from tape ect back to a simulare state as the live data.  Also remember datawharehouse data is not enough.  You need the full transactional information in all its detail.  For any decent size insurer this runs into the millions of rows generated per month.
Douglas Send private email
Monday, February 28, 2005
 
 
I have always had difficulty with this concept. If you 'archive' data then how do you get to it? You need to run your programs on it. This means you need all the codes and related lookup tables. So lets say you take a snapshot of the database at June 2004 and call that an archive. So now you want to go back and run your programs against the archive database. You could change your connection strings to point to the new database. But lets say you had a program upgrade in the meantime and the new program is looking for a new database field that doesn't exist in the old database. This means that when you take a snapshot of your database you also need to take a snapshot of the programs you used to run it. Otherwise you have to develop processes which determine what the data is and how you plan to retrieve it and have some method that will be able to recall the data regardless of the old database structure.
jb Send private email
Thursday, March 03, 2005
 
 
>I have always had difficulty with this concept. If you 'archive' data then how do you get to it? You need to run your programs on it. This means you need all the codes and related lookup tables.

Well, that is supposedly what OLAP, and all the cool data reporting and mining tools are supposed to do for you.

After you archive the data, then you don’t need your software anymore to run the reports!!!  (your software will have done its work to produce this data!).

The fact is that your software might involve 10 tables or more to product a simple screen etc. With a normalized system, just about everything from the location, SalesRep, StoreNum etc is likely going to be a related table. When you build the data warehouse, you ONLY store the resulting query data that is join of all those tables etc. And, often sometime the data will be pre-processed by some code.

The result is a nice flat file (not always) that lets you simply query the database by date, store number, sales of the item, and a few other things. There is usually no lookups, and all the data is there. Since the table is generally flat, it is VERY VERY VERY easy to report on the data. In fact, often just sending the results to Excel and using a Pivot table is MORE then enough. (using a pivot tables will let the end user slice and dice by location, by date, by city, by product…just about any question you have with this data will EASILY be answered by pivot tables).

The main problem here is that you do loose some information, and often it is the fine detail information that you loose.  So, you might know the daily sales for a product by store/location, but you would loose the salesman’s name. And, stuff like time of day of the sale. And, things like the tax amounts collected etc.  The key advantage here is that the performance of the reports is going to be VERY FAST (at least compared to the original data!!)  since you only are looking at summary data.
The real art of the data warehouse is how much detail do you capture. Douglas points out in his situation they actually need the transactions (yikes!!). This kind situation would mean that you would have to keep the old data on-line, but still l might opt to build a data warehouse for the management teams to do their reports on. The issue of keeping the actual transactions and backups of this data often is a separate issue from that of the data warehouse. Don’t confuse a data backup with a data warehouse that contains summarized data.

As for the details to keep, I recall a situation for my archive files (for a small reservation system I wrote), that I decided when you bring up a customer, and you want to see last years attendance, it would show the number of seats, and also the total amount for “options” booked. (drinks etc purchased). I figured this was more then enough information for the customers “old” data.

Well, it turned out that the users of the software actually did want to know more details about the invoice (number of drinks, what kind of drinks!!!, and details of the number of dinners etc).  Opps!...my archive does not capture that level of details that they want! I was VERY surprised that users of the system wanted more details then what I assumed was MORE then adequate (but, I was wrong!!).

Without gathering some VERY good requirements as to what the users of the software (and management) requite in terms of looking at past data is what the art is all about of making a good data warehouse .

As mentioned, this summary data also allows you to NOT have to use the original application to make sense of the data (and, with highly normalized systems today, often you MUST run the software that works with the data to make any sense of any of the data..and trying to write repots on that data is VERY difficult (not to mention the large amounts of processing here).


Of course now, we hear of OLAP cubes etc, where you put time, or date as the 3rd axis for reporting. And, this is just a issue of what kind of drill down you going to give the user who is reporting on this data…

If done right, you can off load the reporting system for managers from the actual production server to a data warehouse for the management teams to use.

Albert D.Kallal
Edmonton, Alberta Canada
Kallal@ msn.com
http://www.members.shaw.ca/AlbertKallal
Albert D. Kallal Send private email
Sunday, March 06, 2005
 
 

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
 
Powered by FogBugz