The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Data repository delivering data incrementally

Hi, I've recently been appointed to manage an existing project within our organisation that has been designated as the central repository for master data.  We are going to store things like our client parameters, currencies, ISO country codes, and a whole host of other specific information which at least 20 other systems within our organisation will consume. Some want to cache the data locally; others want near real-time access (within 5 minutes).  Along the way people just assumed that we'd be able to easily share all this glorious information with an extremely wide variety of systems (Windows-based, 3rd-party, and mainframe).

We currently have 20 hand-crafted connections to other systems using a variety of technologies including text files, COM, Web Services, etc.  This definitely won't scale in the longer term.

We are using SQL 2005 and have been looking at sharing the data in our tables with clients via Web Service calls that we host.  A requirement which sprang up recently was to provide systems not just snapshot facilities but incremental ones, giving each client a list of changes since it last consumed the data.  We have History tables which are reasonably reliable, but with our database structure we are going to have problems presenting incremental information to systems which need to join logically across up to 20 tables.
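The "changes since last consumed" requirement can be made concrete with a minimal sketch (in Python for illustration only; names and structure are hypothetical, not the poster's actual design): each change is appended to a log with a monotonically increasing sequence number, and each client passes back the highest sequence it has already seen.

```python
from dataclasses import dataclass

@dataclass
class Change:
    seq: int      # monotonically increasing position in the log
    table: str    # which base table changed
    key: int      # primary key of the affected row
    op: str       # 'insert' | 'update' | 'delete'

class ChangeLog:
    """Hypothetical change log; a real version would be a SQL table."""
    def __init__(self):
        self._log = []
        self._next_seq = 1

    def record(self, table, key, op):
        self._log.append(Change(self._next_seq, table, key, op))
        self._next_seq += 1

    def changes_since(self, last_seen_seq):
        # Everything the client has not yet consumed.
        return [c for c in self._log if c.seq > last_seen_seq]

log = ChangeLog()
log.record("currency", 1, "insert")
log.record("currency", 1, "update")
log.record("country", 44, "insert")
recent = log.changes_since(1)  # client last saw seq 1
```

The cursor (`last_seen_seq`) is the client's responsibility to persist, which keeps the server stateless per consumer.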

My current thinking is to identify 'subject areas' of data, snapshot each one (using views, or select statements across the subject area's tables) every 5 minutes, and compare against the last snapshot.  Pump the differences into a table which a Web Service could then query and present.  Snapshot users and real-time consumers could both use the snapshot tables and not hit our base tables.  For incremental data, the adds and deletes will be fairly easy to determine in a stored procedure by joining the tables together, but to be reliable the updates would appear to require scanning the entire snapshot and comparing record by record.  We might be able to improve on this by looking for changed row timestamps across all the tables involved and then comparing only those records.
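The snapshot-comparison step described above can be sketched as follows (Python for illustration; the real implementation would presumably live in T-SQL stored procedures). Each snapshot is modelled as a dict mapping a primary key to its row values:

```python
def diff_snapshots(previous, current):
    """Return (adds, deletes, updates) between two snapshots,
    each a dict of {primary_key: row_values}."""
    prev_keys, curr_keys = set(previous), set(current)
    # New keys are adds; vanished keys are deletes -- the "easy" cases.
    adds = {k: current[k] for k in curr_keys - prev_keys}
    deletes = {k: previous[k] for k in prev_keys - curr_keys}
    # Updates need the record-by-record comparison noted above:
    # keys present in both snapshots whose values differ.
    updates = {k: current[k] for k in curr_keys & prev_keys
               if current[k] != previous[k]}
    return adds, deletes, updates

# Hypothetical currency subject area, before and after a refresh:
old = {1: ("GBP", "Pound"), 2: ("USD", "Dollar")}
new = {2: ("USD", "US Dollar"), 3: ("EUR", "Euro")}
adds, deletes, updates = diff_snapshots(old, new)
```

This makes the poster's performance concern visible: the update check touches every row in both snapshots, whereas filtering on row timestamps first would shrink the comparison set.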

Looking at this from a design perspective, does anyone have any ideas on whether this will work reliably, and any areas to watch out for?  I do like the idea of the snapshot comparison, because it will be fairly easy for people to understand and debug, and it shields our consumers from base table changes.  We could also run multiple versions of 'subject areas' at the same time, so that not every application within our company has to upgrade when we do.  I can also foresee using the snapshot tables for some type of ad-hoc reporting facility.  My concern is around the performance of the incremental functionality.

Saturday, January 06, 2007
I do like the idea of snapshots as described or parsing SQL Server's transaction logs. However, I'm a bit concerned that what's been proposed is not necessarily what the customer wants. Furthermore, once the basic concept has been prototyped, getting everyone to use it will be a lot like herding cats.

The obvious example: some services such as security and telephony will want to receive status updates as soon as they're recorded, and others such as janitorial and payroll can wait until early the next morning. The five-minute interval idea means your system has to create 96 snapshots every eight-hour business day, and combine them as appropriate just to accommodate all these needs.

Off the top of my head, I think there are really three separate problems, each requiring their own solutions: downstream services that need to be notified upon each change (pushes rather than pulls), services that need to know only what changes have been made since the last request (pulls rather than five minute pushes), and services that need the full data set as of a certain point in time.
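The three problems this reply identifies can be expressed as three separate consumer contracts (a sketch only; the interface names and signatures are hypothetical):

```python
from typing import Any, Callable, Protocol

class PushFeed(Protocol):
    """Security/telephony style: notified on every change."""
    def subscribe(self, callback: Callable[[dict], None]) -> None: ...

class PullFeed(Protocol):
    """'What changed since my last request' -- client keeps a cursor."""
    def changes_since(self, cursor: Any) -> tuple[list, Any]: ...

class SnapshotFeed(Protocol):
    """Janitorial/payroll style: full data set as of a point in time."""
    def snapshot_as_of(self, timestamp: str) -> list: ...
```

Keeping the contracts separate means each downstream system implements only the delivery style it actually needs, rather than every consumer being forced through one five-minute snapshot pipeline.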

From a practical perspective, I'm not really sure that anyone wants to receive a list of changes every five minutes. If I were an application developer, I'd go a little crazy trying to make sure my program properly queried your web service every five minutes without fail (and dealing with all the possible outcomes before the next query is due).
Monday, January 08, 2007

This topic is archived. No further replies will be accepted.
