The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Implementing incremental save to a database

I'm composing a document using a win forms interface where each 'paragraph' in the document is a node in a linked list. This is in .Net 2.0.

Each 'paragraph' is really a class that holds attributes about the paragraph; which attributes are present varies from paragraph to paragraph.

I want to offer 'save any time' to a SQL Server database and am wondering about implementation approaches. The data in each node will be saved to 7 tables tied to a document id.

One approach is to delete all the items with a given document id from all the tables and then write them again each time a save is made. This will be executed using parameters and commandText.

My concern with this approach is the redundancy of much of the work, and also the possibility of exhausting identity keys if someone holds down the save key for a while. There are also constraints on the number of parameters that can be submitted and on the number of characters in a CommandText string, so in an ideal world only changes, additions, and deletions would be sent to the database.

Could you share your ideas on this type implementation together with any pros and cons?
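To make the brute-force option concrete, here is a minimal sketch of the delete-everything-then-re-insert save, wrapped in a single transaction so a failed save can't leave the document half-written. This uses Python with sqlite3 purely as a stand-in for the real C#/SQL Server code, and the table and column names are hypothetical:

```python
import sqlite3

# sqlite3 stands in for SQL Server here; "paragraphs" and its columns are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE paragraphs (id INTEGER PRIMARY KEY, doc_id INTEGER, "
    "position INTEGER, body TEXT)"
)

def save_document(conn, doc_id, paragraphs):
    """Brute-force save: delete every row for the document, then re-insert all of it."""
    with conn:  # one transaction: either the whole save lands or none of it does
        conn.execute("DELETE FROM paragraphs WHERE doc_id = ?", (doc_id,))
        conn.executemany(
            "INSERT INTO paragraphs (doc_id, position, body) VALUES (?, ?, ?)",
            [(doc_id, pos, body) for pos, body in enumerate(paragraphs)],
        )

save_document(conn, 1, ["first para", "second para"])
save_document(conn, 1, ["first para", "inserted", "second para"])
rows = conn.execute(
    "SELECT body FROM paragraphs WHERE doc_id = 1 ORDER BY position"
).fetchall()
print([r[0] for r in rows])  # → ['first para', 'inserted', 'second para']
```

Note that the identity column keeps climbing on every save, which is exactly the redundancy (and key-consumption) concern raised above.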

Tuesday, December 05, 2006
"exhausting identity keys"

Is this really possible?
Cade Roux Send private email
Tuesday, December 05, 2006
"exhausting identity keys"

I'd hope not. Oracle at least can store up to 99,999,999,999,999,999,999,999,999,999,999
David Aldridge Send private email
Tuesday, December 05, 2006
If someone keeps pressing the save button over and over, but nothing's changed, why would you need to save?
D In PHX Send private email
Tuesday, December 05, 2006
OK, so we can't exhaust primary keys, and the list will be marked as non-dirty after each save, so that won't be an issue.

Is the implicit suggestion here to delete and re-write each time the user saves and the dirty property is true?

Thanks for your input.
Tuesday, December 05, 2006
Delete all and re-add:
Pros:
- simple to code
Cons:
- potential performance issue if volume is high
- may lose some audit information, e.g. who changed what when

I'd do the delete/add unless there was a reason not to - figuring out the optimal updates can be hairy if the data structure is complex.
Mike S. Send private email
Tuesday, December 05, 2006
Why delete and re-add? Why not just do an update?
Tuesday, December 05, 2006
Because he has multiple rows spread across 7 tables that represent the data to be saved.  The save operation may consist of adds, updates and/or deletes to any or all of the tables.
Mike S. Send private email
Tuesday, December 05, 2006
Have you considered calculating a hash of the contents, either at the table level or for the whole "document"? If calculating a hash of the contents is not too complex, you can store it along with the saved info and write to the DB only when you detect an actual change.
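As a sketch of the hash idea: serialize the document deterministically, hash it, and compare against the digest stored at the last save. This is illustrative Python (the attribute names are made up), not the poster's actual code:

```python
import hashlib
import json

def document_hash(paragraphs):
    # Serialize deterministically (sorted keys), then hash; any attribute
    # change anywhere in the document alters the digest.
    payload = json.dumps(paragraphs, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def needs_save(paragraphs, last_saved_hash):
    return document_hash(paragraphs) != last_saved_hash

last_saved = document_hash([{"text": "hello", "style": "bold"}])

print(needs_save([{"text": "hello", "style": "bold"}], last_saved))   # False
print(needs_save([{"text": "hello!", "style": "bold"}], last_saved))  # True
```

Repeated presses of the save button then cost one hash comparison, not a round trip to the database.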

Tuesday, December 05, 2006
Oops. Didn't notice I forgot to sign the "hash" message above...
Paolo Marino Send private email
Tuesday, December 05, 2006
Doing DELETE / INSERTs when you really want to UPDATE can be a bad idea in some DBs due to locking issues. I don't know if MSSQL performs locking in the same way that MySQL does, but in MySQL at least this can cause deadlocks in high-concurrency environments.

The deadlock occurs between the Auto Increment Lock and the Next-Row lock in mysql.

The real lesson here, though, is that in most database implementations UPDATEs are cheaper than a DELETE / INSERT, especially if you only update dirty fields.

Do it right, you'll be glad you did later.
Andrew Murray Send private email
Tuesday, December 05, 2006
Here's a theory:

Every 'edit' is a transaction against the database. Create a queue that they get pushed into. Have a background thread reading items in the queue and sending them to the database. As long as the queue is empty, the database contents exactly match the user's view. "Saving" consists of flushing the queue to the database, which is a trivial no-op if the queue is empty.

Being transactional, if the user's computer dies horribly and the database isn't destroyed, only uncommitted edits will be lost. And, frankly, unless you're doing something crazy, that shouldn't ever be more than 1 or two keypresses.

The queue of "edits" ought to look remarkably similar to the undo/redo history system, as well.

The "pro" is always-on auto-saving and the heavy-lifting of keeping the database consistent is put onto the database, not reimplemented in own code.

The big "con" is the way some people use "edit and quit without saving" to delete a huge block of changes in one go, but that can be solved by allowing a "snapshot" of the database state to be taken.

> Because he has multiple rows spread across 7 tables that represent the data to be saved.  The save operation may consist of adds, updates and/or deletes to any or all of the tables.

Unless he genuinely can't manage basic SQL (let alone read a good book), that's not an issue. Honestly, you can easily put a bunch of inserts, updates, and deletes across many tables inside a single transaction.

Tuesday, December 05, 2006
Thanks for your input; it's been useful.

I implemented delete and insert yesterday and it works really well. I do feel quite bad about the redundancy, though. Just moving a 'paragraph' up generates a whole load of deletes and inserts when only 2 node values need to change.

Prompted by Andrew Murray, I will examine the possibility of loading all rows for the document id into a dataset and then walking the linked list: update any row that differs from its node's values, add new rows once the existing rows run out (setting the row status flag accordingly), and mark for deletion any rows left over after the walk.
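That walk-and-compare can be sketched as a small diff function. This is a simplified, hypothetical model (one table, rows keyed by position, each row reduced to a single body value), not the actual 7-table schema:

```python
def diff_rows(db_rows, nodes):
    """Compare rows loaded for a document against the in-memory node list.

    db_rows: {position: body} as currently stored; nodes: list of bodies in order.
    Returns (updates, inserts, deletes), each keyed by position.
    """
    updates, inserts = [], []
    for pos, body in enumerate(nodes):
        if pos not in db_rows:
            inserts.append((pos, body))        # ran out of existing rows
        elif db_rows[pos] != body:
            updates.append((pos, body))        # row differs from its node
    deletes = [pos for pos in db_rows if pos >= len(nodes)]  # leftover rows
    return updates, inserts, deletes

updates, inserts, deletes = diff_rows({0: "a", 1: "b", 2: "c"}, ["a", "B", "c", "d"])
print(updates, inserts, deletes)  # [(1, 'B')] [(3, 'd')] []
```

Only the returned lists need to be sent to the database, which directly addresses the parameter-count and CommandText-length limits mentioned in the original post.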

Not sure what the performance will be like but I won't know until I measure.

Thanks again.

Wednesday, December 06, 2006
I only chimed in here because, at the company I worked for, I had to refactor some code that worked in the DELETE / INSERT manner to correctly use UPDATEs.

The code needed on the client side *is* a little more complex, but if you skip the dirty-field optimization and focus on updating only dirty records, you should get 80% of the benefit for 20% of the work. I think you'll also find that if you implement this properly it won't be much more code.

I should also note that if this will all be deployed in a single-user environment, the DELETE / INSERT methodology will probably be fine. Even then, you'd probably be better off adding an isDeleted field to your db records, just setting that flag (and doing an INSERT for new rows), and periodically running a single transaction that erases all the flagged rows.
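The soft-delete idea can be sketched like this, again in Python with sqlite3 standing in for SQL Server and an invented schema: deletes only flip a flag, and a periodic purge physically removes the flagged rows in one transaction:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE paragraphs (id INTEGER PRIMARY KEY, doc_id INTEGER, "
    "body TEXT, is_deleted INTEGER DEFAULT 0)"
)

def soft_delete(conn, row_id):
    # A "delete" is just an UPDATE, avoiding DELETE-related locking entirely.
    conn.execute("UPDATE paragraphs SET is_deleted = 1 WHERE id = ?", (row_id,))

def purge(conn):
    # Run occasionally, in one transaction, to physically remove flagged rows.
    with conn:
        conn.execute("DELETE FROM paragraphs WHERE is_deleted = 1")

conn.execute("INSERT INTO paragraphs (doc_id, body) VALUES (1, 'keep')")
conn.execute("INSERT INTO paragraphs (doc_id, body) VALUES (1, 'drop')")
soft_delete(conn, 2)
live = conn.execute("SELECT body FROM paragraphs WHERE is_deleted = 0").fetchall()
purge(conn)
remaining = conn.execute("SELECT body FROM paragraphs").fetchall()
print(live, remaining)
```

Reads must filter on `is_deleted = 0`, which is the extra bookkeeping this approach trades for cheaper writes.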

And by high concurrency above, I don't really mean high concurrency. We were bitten in the ass by the INSERT / DELETE problems with only moderate concurrency.

The main reason I'm so ardent in my encouragement to do it properly is that if you already have it working with INSERT / DELETE, and have proper unit tests, getting everything working (even with only dirty-field updates) should really only take a few hours.

Andrew Murray Send private email
Wednesday, December 06, 2006
Oh, and if you're itching for more detail on database locking semantics, and why (at least in MySQL) this can cause problems, read:

Keep in mind that this is only a problem if you're using db-generated auto-increment locks. Where the deadlock occurs can be a bit tricky to see. It happens in MySQL because next-key locks are held for the lifetime of a transaction, while the auto-increment lock is different from every other lock in the system: it is held only for the lifetime of the statement that requires it and is released before the transaction is done.

So this can happen:

Transaction A is a Delete / Insert
Transaction B is an Insert into the same table

LNK is the next key lock
LAI is the auto increment lock

A begins its delete, and acquires a row lock on its row and the LNK.

B begins and acquires the LAI and blocks on a test of the LNK.

A blocks on a test of the LAI.

Deadlock. One of the two transactions is rolled back and retried.
Andrew Murray Send private email
Wednesday, December 06, 2006
Wow, if that is the case, MySQL's behavior is insane.  All that work just to prevent gaps in an opaque, artificial key?

PostgreSQL decided such a feat was rightly impossible and thus doesn't do gap-free sequential numbering.  In Postgres, if you roll back a transaction that uses a sequence, it will not roll back the sequence counter as well.  No rollback, no locking, no worries.
Cory R. King
Wednesday, December 06, 2006
By the way, if you go the UPDATE route in Postgres and these will be busy tables, make sure to adjust Postgres's knobs so it vacuums the tables more frequently.*  Otherwise, the physical size of your tables will grow pretty large:

* And MySQL's InnoDB has to vacuum too, so you might look into this in their docs.  Both use the MVCC model for concurrency control, which requires it.
Cory R. King
Wednesday, December 06, 2006

Yes, MySQL's implementation (under innodb) in this case seems pretty brain-dead. I like Postgres better for a lot of stuff as well.

But, as I said, there may very well be no issue like this in MSSQL - I've never used it.
Andrew Murray Send private email
Wednesday, December 06, 2006
Found this in the MSSQL "CREATE TABLE" docs:

"SQL Server does not enforce an order in which DEFAULT, IDENTITY, ROWGUIDCOL, or column constraints are specified in a column definition."

I'd wager that they don't lock either.  It looks like you can tweak it to ensure sequential numbering though, thus requiring a lock:

"...if an identity column exists for a table with frequent deletions, gaps can occur between identity values. If this is a concern, do not use the IDENTITY property. However, to ensure that no gaps have been created or to fill an existing gap, evaluate the existing identity values before explicitly entering one with SET IDENTITY_INSERT ON."
Cory R. King
Wednesday, December 06, 2006
Are the hardware constraints such that it's impractical to do the "saves" in memory and only write them to the database after a set time interval or when a specific event is fired?

I would also wonder why it's necessary to delete the record itself. Imagine this scenario - every time you insert a record, you also record a timestamp (or a foreign key linked to a timestamp). When you perform a select, you simply retrieve the most recent timestamp. The catch is that on some scheduled basis, you'll want to go through the system and delete the X oldest timestamped records associated with each user but fortunately the time interval and X can be expressed as configuration parameters.
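The insert-only versioning scheme described above can be sketched as follows. Python with sqlite3 again stands in for the real database; an integer counter stands in for the timestamp, and the table, column, and function names are invented for illustration:

```python
import itertools
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_versions (doc_id INTEGER, version INTEGER, body TEXT)")
tick = itertools.count(1)  # stands in for a timestamp

def save_version(conn, doc_id, body):
    # Saves never delete or update: every save is a fresh versioned INSERT.
    conn.execute("INSERT INTO doc_versions VALUES (?, ?, ?)",
                 (doc_id, next(tick), body))

def load_latest(conn, doc_id):
    row = conn.execute(
        "SELECT body FROM doc_versions WHERE doc_id = ? "
        "ORDER BY version DESC LIMIT 1", (doc_id,)).fetchone()
    return row[0] if row else None

def purge_old(conn, doc_id, keep=2):
    # Scheduled cleanup: drop everything older than the `keep` newest versions.
    cutoff = conn.execute(
        "SELECT version FROM doc_versions WHERE doc_id = ? "
        "ORDER BY version DESC LIMIT 1 OFFSET ?", (doc_id, keep - 1)).fetchone()
    if cutoff:
        conn.execute("DELETE FROM doc_versions WHERE doc_id = ? AND version < ?",
                     (doc_id, cutoff[0]))

for body in ["v1", "v2", "v3"]:
    save_version(conn, 1, body)
purge_old(conn, 1)
print(load_latest(conn, 1))  # v3
```

Both the cleanup interval and the number of versions to keep (`keep` here) map directly onto the configuration parameters mentioned above.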

I'm glad incremential is able to figure out how to do what he needs to do; it's just that the whole approach strikes me as a little risky - you'd have to do an awful lot of coding, error trapping, and testing before I'd feel comfortable with that many individual deletes.
Wednesday, December 06, 2006
I'm finding that some responses are missing the fact that deletes will happen. Also, I am expecting up to 30 nodes in my document, which can gain new additions and be edited, deleted, and reordered, all of which has to be reflected in the database each time there is a save.

The delete and rewrite approach means there's no faffing about with the nature of the change. It's a brute-force method that guarantees that once the save is carried out, the database is consistent with the document. No going back to sweep up afterward, no need to filter on status fields. Still, the approach just feels wrong, which is why I posted the question.

There's definitely a lot more code to do the checking, and I am uncertain what the performance will be like, or even whether I'll go with that approach anyway. For now I need to look into the alternative approach, at least to learn how it could work out.
Wednesday, December 06, 2006
If always doing "DELETE then INSERT" to avoid "faffing about" is actually an issue, then something probably should feel wrong with that approach.

For a tiny system (30 "nodes" can't be talking gigabytes of data) it may not actually matter. But still, following good database practice is rarely a bad idea.

> No going back to sweep up afterward

If you're doing proper transactions and still having to "sweep up" then you're still doing something wrong. A transaction isn't "committed" if there's work not yet completed.

Wednesday, December 06, 2006
To Cory and Andrew - you guys are right, MSSQL does not guarantee gapless sequences with identity columns.  If a transaction is rolled back, the allocated sequence number is lost unless one either uses DBCC CHECKIDENT to reseed the table or sets IDENTITY_INSERT ON to manually specify the value during an insert.
mssql auto increment
Friday, December 08, 2006
I noticed the original post specified that this was a .Net app.  Well, all the ugliness is already coded for you to make this work properly.  First, use the provided DataTable objects rather than rolling your own storage system.  The DataRows in a DataTable already have a RowState that tracks whether a row is unchanged, added, modified, or deleted.  When you call the Update method of a DataAdapter, it checks the RowState of every row and runs the appropriate INSERT, UPDATE, or DELETE command for that DataRow.

Alternately, you could implement a similar system in your own classes.  Or maybe a hybrid approach.  Keep a DataSet full of DataTables as a private member of the document object and keep links to DataRows in the subordinate objects.  Have the paragraph and other objects store their data in their respective DataRow and let the DataRow do the change tracking.  Then when you call save on the document, it will only issue the necessary SQL statements.
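For readers outside .Net, the change-tracking behavior being described can be imitated in a few lines. This is a deliberately minimal Python sketch of the idea behind ADO.NET's DataRow/RowState (the class and method names here are invented, not part of any library):

```python
class TrackedRow:
    """Minimal imitation of DataRow-style change tracking (hypothetical names)."""

    def __init__(self, values):
        self.values = dict(values)
        self.state = "Added"          # new rows start as pending inserts

    def accept_changes(self):
        self.state = "Unchanged"      # what happens after a successful save

    def __setitem__(self, key, value):
        # An edit to a clean row marks it Modified; an Added row stays Added.
        if self.values.get(key) != value and self.state == "Unchanged":
            self.state = "Modified"
        self.values[key] = value

    def delete(self):
        self.state = "Deleted"        # row is flagged, not removed, until save

row = TrackedRow({"body": "hello"})
row.accept_changes()
row["body"] = "hello world"
print(row.state)  # Modified
```

On save, the code walks the rows and issues an INSERT for each Added row, an UPDATE for each Modified row, and a DELETE for each Deleted row, which is exactly the per-row dispatch that DataAdapter.Update performs for you.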
JSmith Send private email
Monday, December 11, 2006
Interesting idea.

I'm using data adapters and tables, but in a different way. I had contemplated the hybrid approach but discounted it because I thought building the tables would take too long and that flagging deletes would be problematic. However, after implementing the update approach, even for up to 100 paragraphs the delays I'm seeing are typical for saving a document. I am now of the mind that table creation will have minimal impact on performance - a whole lot more code, though.

I will meditate on this hybrid approach and see if I can't implement it.

Tuesday, December 12, 2006

This topic is archived. No further replies will be accepted.
