The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Archive Strategy

In our Inventory Control database we have tables that grow and grow and grow as the database gets used. Tables like StockResults, Deliveries, StockCount etc.

I want to archive the data into archive tables, which I can later moved to another database, or even another server. This way I'll have current tables and archived tables (eg. Deliveries and Deliveries_Archive).

Here are some considerations:

1. We will need to report on both the current tables and the archived tables.
2. data that is archived might need to be put back into its current table.

What I'm after is some advice on what strategy to use.

Is there a framework or design for doing this that would help me ?
Do you have an archive strategy that works for you ?
Do you have a clever naming convention that you use ?
Any good articles on this ?

Any help would appreciated.

Craig Send private email
Wednesday, May 03, 2006
You need to define "archive".  The two considerations you've cited contradict what we typically mean by archive.

Err...  specifically, when I archive something, I'm finished with it, I've created a READ_ONLY copy of it, and I've put it someplace where it is controlled and backed up. Essentially, whenever someone asks me to prove the data hasn't been tampered with, I have an established... chain of evidence.

I think what you're looking for is more akin to a mirror of your database. I forget what the specific terminology is (someone else will chime in) but you basically have one database for inserts and other expensive operations, and a second copy of that database optimized for reads and report generation.

But to answer your question, since...

1. We will need to report on both the current tables and the archived tables.

2. data that is archived might need to be put back into its current table.

There really should be no distinction between current and archive data. "Official" (visible) and "draft" (hidden) data, perhaps, but you can just add a column that indicates the current status of that data.
Wednesday, May 03, 2006
Thanks for your response.

Maybe "archive" is the wrong word. I just want to keep the database small for transactions, but need to keep all the data for reports. That's my dilema.

These transaction tables are filled as deliveries etc are entered. At the end of the week, the week is finalised and a new week starts. Last weeks data is still needed for reports, but is not added to or updated.

The database mirror you mentioned sounds like what I'm after. Where data in the transaction database is mirrored to the reporting database. After a week is finalised, we would want to delete that from the transaction database but keep it in the reporting database for reporting. Is this possible ? I guess that depends on the mirroring software.

If the data is mirrored across 2 databases, how do we keep the definitions up-to-date (eg. StockItemDefinitions) ? Would we replicate that between the databases ?
Craig Send private email
Wednesday, May 03, 2006
Craig. I think that you should consider a datawarehouse-like solution.
You could create a database specific for reporting, updated regularly with live record changes.
This allows you to segregate unchanging data to some other tables and run the reporting against the reporting database.

my 2 cents
Sevenoaks Send private email
Wednesday, May 03, 2006
We have done this before.  We had 2 databases called "online" and "batch" I think.  The "online" was the source DB since all interactive transactions went against it, then data was replicated (i.e., using SQL Server replication) in near-real-time to the "batch" DB to keep it in sync for reporting and batch jobs.
Brian Moeskau Send private email
Wednesday, May 03, 2006
We do this too.  I employ use table inheritance with table spaces.

This allows you to create indexes on "sections" of your data.  You could, for example, have:

stock_prices_master (no records stored here)

stock_prices_2005 inherits stock_prices_master

stock_prices_2006 inherits stock_prices_master

And each stock_prices_XXXX can have it's own table space.

Not sure if all RDBMS support table inheritance.  PostgreSQL does this rather well.

This has been talked on this forum here:
Wednesday, May 03, 2006
Thanks for all the advice. It seems that database mirroring (or something similiar) is what I should do.

So I'll a have Transaction database that mirrors with a Reporting database. They both have the same data but all the inserts, update etc happen on the Transaction database and all the reporting  happen on the Reporting database.

This leaves me with another question...

The Transaction database only needs the last weeks data to do its job. The Reporting database needs all the data for reporting (about 3 years of data). It seems a waste that we keep 3 years of data in the Transaction database. Even though all changes in the Transaction database should be reflected in the Reporting database, I was wondering if I flagged a row as Do_Not_Mirror and then deleted that row, it would NOT replicate that delete in the reporting database. That way the Transaction database would be small and efficient, and the Reporting database would have all the data needed for reports. I don't know if such a thing exists.

Is this a good idea ? Or am I wasting my time trying to reduce the size of the Transaction database ? I'm hoping there's a standard way of doind this. Or that someone has done something similiar.

Any advice would be welcome
Craig Send private email
Thursday, May 04, 2006
Just to add to the above posting...

I think mirroring might be the wrong option, as you can't access the mirror when mirroring.

What I should be asking about is Replication (probably transactional replication). But the question is still the same.
Craig Send private email
Thursday, May 04, 2006
Do not think of replication.
Find out a good etl tool and use it to load the reporting database and kill stale data from the transactional.
Replication is meant for keeping db's syncronized, you want to decouple the dbs instead!
Sevenoaks Send private email
Thursday, May 04, 2006

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
Powered by FogBugz