The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

SQL INSERT vs UPDATE using exceptions

While suffering from a slow network connection and a requirement to use an Access MDB file as a database (0% chance of getting anything else, even if free), a program that updates data is stupidly slow.

The original method was to check each record to see if it was present (based on a key value): if it was there, UPDATE it, otherwise INSERT it. This took several hours (it's a big database).

Revised this to grab all the data locally, process it there, then upload only the values that have changed. For small data tables this *flies*, but there's a problem of data integrity if two people update things at the same time, and for larger tables it takes ages (it runs on a 256MB RAM machine - 0% chance of change there as well).

Currently I'm simply trying to INSERT everything, catching an exception if it fails with a key violation and *then* UPDATEing it. This works, but it feels *wrong* to use exceptions like this; there has got to be a better way of driving this.
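For reference, a minimal sketch of this insert-then-update-on-key-violation approach, using Python's sqlite3 as a stand-in for the Access/ODBC connection (the table and column names here are made up for illustration):

```python
import sqlite3

# sqlite3 stands in for the Access connection; 'parts' is an invented table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (part_no TEXT PRIMARY KEY, qty INTEGER)")

def upsert(conn, part_no, qty):
    try:
        conn.execute("INSERT INTO parts (part_no, qty) VALUES (?, ?)",
                     (part_no, qty))
    except sqlite3.IntegrityError:
        # Key violation: the row already exists, so update it instead.
        conn.execute("UPDATE parts SET qty = ? WHERE part_no = ?",
                     (qty, part_no))

upsert(conn, "A100", 5)   # new row: the INSERT succeeds
upsert(conn, "A100", 9)   # existing row: key violation, falls back to UPDATE
final_qty = conn.execute(
    "SELECT qty FROM parts WHERE part_no = 'A100'").fetchone()[0]
```

It works, but roughly half the rows pay for a failed INSERT plus the exception handling before the UPDATE even starts.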

Everything online looks at stored procedures, or recommends using a proper database system and processing everything on the server.

Problem is, my employer is of the type that won't spend a penny to save a pound. We have outsourced *all* IT support functions, so even installing and maintaining something like MySQL *myself* would cost an arm and a leg, and I can't install it on this machine due to the firewalls etc.

Claire Rand
Friday, December 02, 2005
Yup, your feelings of ickiness are right - it's not a great way to do things.

Assuming you do bulk updates (you insert/update a bunch of stuff at a time) you could try generating a simple-ish IN query to see if the key already exists (SELECT id FROM table WHERE id IN (1,3,8,12,19,22,30...)). The contents of the IN clause are derived from the IDs of the set of objects you're processing. If you are bulk-updating a lot of stuff, that could be a big query string, but you'll get back a list of those IDs that already exist in your table, just for the ones you are updating, in a single SQL statement.
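A minimal sketch of that existence check in Python, with sqlite3 standing in for the Access connection (the table name and IDs are invented for illustration):

```python
import sqlite3

# 'items' and its contents are made up; imagine the real million-row table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO items (id) VALUES (?)", [(1,), (8,), (22,)])

batch_ids = [1, 3, 8, 12, 19, 22, 30]          # IDs in the batch being pushed
placeholders = ",".join("?" * len(batch_ids))  # "?,?,?,..." for the IN clause
existing = {row[0] for row in conn.execute(
    f"SELECT id FROM items WHERE id IN ({placeholders})", batch_ids)}

# One round trip now tells us which rows to UPDATE and which to INSERT.
to_update = [i for i in batch_ids if i in existing]
to_insert = [i for i in batch_ids if i not in existing]
```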
.NET Guy
Friday, December 02, 2005
The core table has around 1 million rows; each 'update' typically pushes around 17k rows, of which around half will be updates and half will be new. Trouble is, there's no 'date' field to sort on, thus no easy way of knowing which will be new and which will not.

The idea of building a query to get back the duplicates could be good though. Thinking of a duplicate temp table: insert everything into that, then see what's in both and update those, and insert everything else. Hmm. At the moment the update takes around 5+ hours, so it gets run overnight; I'd like to get it down to being able to run during the day.

Which means keeping the processor load below 50% (which the network lag does nicely).

Guess this is the trouble with having an Access file over 400MB in size. Yuck.
Claire Rand
Friday, December 02, 2005
Claire, if you're on MySQL, why don't you use REPLACE?
Berislav Lopac
Friday, December 02, 2005
All things considered, I don't feel 5 hours is slow for this process; it's reasonable. If you want it faster, you will need to change some significant things. You can try to optimize the way the data is accessed: see if any query could benefit from extra indexes. I would think you would surely want indexes for tables in a 400MB MDB file on a 256MB machine. Beyond minor prodding and poking, you could try to switch to a database with support and an SLA provided by your IT department. No, I don't know how that squares with your boss's budget.
You may get a speedup by just doing the UPDATE first, and the INSERT only if the update says 'not found'. As in all things related to performance: test it, time it, experiment, profile.
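That update-first variant, sketched with sqlite3 standing in for Access (table and column names invented; a cursor's rowcount tells you whether the UPDATE matched anything):

```python
import sqlite3

# 'parts' is an invented example table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (part_no TEXT PRIMARY KEY, qty INTEGER)")

def upsert(conn, part_no, qty):
    cur = conn.execute("UPDATE parts SET qty = ? WHERE part_no = ?",
                       (qty, part_no))
    if cur.rowcount == 0:          # UPDATE matched no row: it's a new record
        conn.execute("INSERT INTO parts (part_no, qty) VALUES (?, ?)",
                     (part_no, qty))

upsert(conn, "A100", 5)            # nothing to update, falls through to INSERT
upsert(conn, "A100", 9)            # row exists: UPDATE only, no exception
row_count = conn.execute("SELECT COUNT(*) FROM parts").fetchone()[0]
final_qty = conn.execute("SELECT qty FROM parts").fetchone()[0]
```

If most of your batch is updates, this ordering avoids the exception path for the common case.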

Also, if it's okay the way it is now, you have regular backups, and the time (dataset) is not steadily increasing to unmanageable volumes, maybe you don't need to change a thing...  Maybe.  (Having a daily process that takes 26 hours is funny, if it happens to someone else.)
Friday, December 02, 2005
Is there any way to segment your data and split the 1M rows across 10 tables of 100K? You could UNION all the tables in a query to keep the application working, but the updates would deal with the individual, smaller tables.
Sevenoaks
Friday, December 02, 2005
Do you compact the MDB regularly?
Eric D. Burdo
Friday, December 02, 2005
When you sync up the systems can you keep a list of the keys locally?  Then, as the local user(s) are updating the data, you can make some reasonable guesses on what records should be updated vs inserted.

Although it would be out of date relatively quickly, it would provide *some* assistance when you go to sync next time.  And it should take a significant load off the main system.
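A sketch of that local key-cache idea in Python (all names and values invented for illustration):

```python
# Keys seen at the previous sync; in practice this would be persisted locally.
known_keys = {"A100", "B200"}

# The batch to push: (key, value) pairs, invented for the example.
batch = [("A100", 7), ("C300", 2), ("B200", 4), ("D400", 1)]

# Pre-sort into probable updates and probable inserts using the cache.
updates = [row for row in batch if row[0] in known_keys]
inserts = [row for row in batch if row[0] not in known_keys]

# The cache can be stale (someone else may have inserted C300 since the
# last sync), so the insert path still needs a fallback such as catching
# the key violation - but most rows now skip the exception entirely.
known_keys.update(key for key, _ in batch)   # refresh the cache for next time
```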
KC
Friday, December 02, 2005
I'm not understanding why you don't know ahead of time whether or not you need to do an insert or an update? If I personally create a new record then I do an insert. If I retrieved the original row from the database then I do an update. Having a simple flag that indicates whether or not a record is "new" is all that should be required. I would also assume that other users adding/changing records at the same time is a separate issue of concurrency that SHOULD be handled by catching exceptions. These two issues can and should be handled differently.

As a side note, I work on a system that behaves exactly the way you have described (tries to insert first then tries to update on failure). For what it's worth, this system is a royal pain in the a$$ for many reasons. The person who created our data layer deserves to be taken out back and shot. His "insert then update" scheme is really just a sloppy crutch to keep from having to manage row state internally. He saved a few hours of design work up front but we have paid for it over and over again as time has gone along. We ultimately have to save off row state in other ways and none of them have been consistently implemented.
Turtle Rustler
Friday, December 02, 2005
The temp table may be the way to go.  It's not the cleanest, but it should be pretty easy to try out and see if it helps.  Once the data is in the temp table, you can do an UPDATE WHERE EXISTS and INSERT WHERE NOT EXISTS, then drop the table.
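The staging-table pattern can be sketched end to end like this, with sqlite3 standing in for Access (table and column names invented):

```python
import sqlite3

# 'parts' is the live table, 'staging' holds the incoming batch; both invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parts   (part_no TEXT PRIMARY KEY, qty INTEGER);
    CREATE TABLE staging (part_no TEXT PRIMARY KEY, qty INTEGER);
    INSERT INTO parts   VALUES ('A', 1), ('B', 2);
    INSERT INTO staging VALUES ('B', 20), ('C', 30);
""")

# UPDATE WHERE EXISTS: rows present in both tables take the staged value.
conn.execute("""
    UPDATE parts
    SET qty = (SELECT s.qty FROM staging s WHERE s.part_no = parts.part_no)
    WHERE EXISTS (SELECT 1 FROM staging s WHERE s.part_no = parts.part_no)
""")

# INSERT WHERE NOT EXISTS: staged rows with no match become new rows.
conn.execute("""
    INSERT INTO parts
    SELECT s.part_no, s.qty FROM staging s
    WHERE NOT EXISTS (SELECT 1 FROM parts p WHERE p.part_no = s.part_no)
""")
conn.execute("DELETE FROM staging")

rows = conn.execute("SELECT part_no, qty FROM parts ORDER BY part_no").fetchall()
```

Two set-based statements replace 17k per-row attempts, which is usually where the big win is.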
a unique name lets you know it's me
Friday, December 02, 2005
How about inserting all the new rows into an empty table, then using SQL statements themselves to choose insert or update, based on WHERE and JOIN clauses from the new table to the old?

Off the top of my head:

insert into existing_things
  select * from new_things
  where key not in (select key from existing_things);

update existing_things
  set value = (select value from new_things
               where new_things.key = existing_things.key)
  where key in (select key from new_things);

delete from new_things;

But I wonder if MS-Access is capable of subselects.
captain damage
Friday, December 02, 2005
The insert or update, then handle exception approach is a relatively common one on databases not supporting some sort of merge statement.  The key to this approach is in trying to anticipate which operation will be more likely to succeed, and trying that method first, which can potentially avoid the slower exceptions.

This isn't always possible of course, but if something in the data that is being inserted/updated can be used to guess as to which operation to choose, then give that a shot, even if it means breaking up the data to be inserted/updated into different batches of some kind.
Scot
Friday, December 02, 2005
<<I'm not understanding why you don't know ahead of time whether or not you need to do an insert or an update?>>

Believe me, it is very often the case that you just do not know whether you are going to do an insert or an update until DBMS commit time.

In my 15+ years of experience with CRUD-type scenarios, CRUD is just not CRUD. The difference between Create and Update is decided at the very last moment, depending on whether you are running against a DB that has referential integrity built in and you use it, or whether YOU have to implement referential integrity in your code. Many similar cases come to mind; no time to list them all here...
Friday, December 02, 2005
Even with a million records and thousands of insert/updates, I'm not sure why this takes so long.  Is your table indexed on the key field?  Keep in mind that database table indexes work like book indexes: With a book index, you find all the occurrences of a word by looking in the index, whereas without one, the only way to find all the occurrences of a word is to search through the entire book.

If your key field isn't indexed, then your insert/update procedure is doing the same thing; it's searching through, on average, half a million rows for each existing row, and the whole million for each new row.

Maybe you already have the indexes and you know all this, but if not, have a look; you'll be amazed at the difference an index makes.
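One quick way to confirm whether a lookup actually uses an index, sketched with sqlite3 (the table and column names are invented; the same principle applies to an Access MDB, though the tooling differs):

```python
import sqlite3

# 'jobs' is an invented example table with no index to start with.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (job_no TEXT, status TEXT)")

def lookup_plan(conn):
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM jobs WHERE job_no = 'J1'").fetchall()
    return " ".join(r[3] for r in rows)        # the human-readable plan text

plan_before = lookup_plan(conn)                # full table scan
conn.execute("CREATE INDEX ix_jobs_job_no ON jobs (job_no)")
plan_after = lookup_plan(conn)                 # now mentions the index
```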
Kyralessa
Friday, December 02, 2005
This is a batch update program? My inclination would be to blast everything into a holding table, then run an UPDATE against the rows that already exist, followed by an INSERT of the ones that don't.
Or, you could just not worry about it. If your employer wished to move pianos using small sedans, would the company driver lose sleep over optimizing it?
George Jansen
Friday, December 02, 2005
This is one of the VERY reasons I like MySQL and SQLite...

"REPLACE INTO": find a record and replace it, or insert it if it isn't there (the search is by primary key)... LOVE IT. I REALLY wish SQL Server had it.
Michael Johnson
Saturday, December 03, 2005
Additionally, you can use old batch processing techniques:
1. Create a table BATCH_MYTABLE, identical to MYTABLE but with an additional USERID or session ID column.
2. Do a bunch of inserts.
3. Fire a stored procedure with your session ID.
4. The stored procedure does all the checking for you.

If the inserts aren't important right this moment, you can set up a job to run every few minutes to just process everything.
Michael Johnson
Saturday, December 03, 2005
Good ideas, I'll look into all this.

The reason for not knowing what needs to be inserted vs. updated is simple: I'm pulling information out of a huge (non-SQL) database and squirting it into Access so that it can be processed automatically. The trouble is I have to overlap the updates, since many records will keep changing for a week or two until the relevant jobs are finished; thus some records will already be present but may well have changed, while others will need inserting.

The temp table idea, followed by the two queries (one to update, the other to insert), is looking better every day.

And while the company doesn't really care how long this takes (as long as the wall charts look good), *I* am fed up with waiting while my desktop machine is hopeless from having no RAM left if I run this during the day. (Overnight works, but is a pain when the cleaners turn the machines off, because the nitwit in H&S has told them it saves enough power per machine to print 5,000 pages on our clapped-out laser printer.)
Claire Rand
Tuesday, December 06, 2005
Are you using dynamic SQL? Concatenation with '+' could hurt. If there is a lot of it, try using a StringBuilder.
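The point about string building, sketched in Python (the query and IDs are invented; `str.join` plays the role of a StringBuilder here):

```python
# Repeated '+=' concatenation copies the growing string on every append,
# which is quadratic overall; collecting the pieces and joining once is
# linear. The same reasoning applies to StringBuilder in .NET/Java.
ids = [1, 3, 8, 12, 19]
sql = "SELECT id FROM t WHERE id IN (" + ",".join(str(i) for i in ids) + ")"
```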
Tuesday, December 06, 2005
Maybe you can also build your "UPSERT" command like this:

  UPDATE ...
  INSERT ....

But I don't know whether Access supports @@ROWCOUNT or some equivalent...
Johncharles
Tuesday, December 06, 2005
Well, firstly, thanks for all the brill ideas :-)

I'd love to use MySQL. Hell, I had a demo running on my own machine at home (using mock-up data), and even though that machine was on dialup it was still *faster* than Access. There is zero chance of doing that here though, due to an outsourced IT department... Grrr.

So I'm stuck with Access and thus no stored procedures (sigh). The insert-or-update-on-exception code now works fully, and will do a full cycle of the reports in about an hour, as opposed to overnight.

The idea of storing row states locally has been tried; for some of the smaller datasets it works perfectly, and would scale very well *if* this machine had the memory to run it. Once you're in swap space, forget it. Pity, because it's very simple to implement.

Yes, most things search on a key: two tables have a single key (one numeric, one a 10-char string), another links on one 10-char string and a number, and the other on a string and two numbers, so the indexing isn't optimal but does work (I inherited this bit). A small range of extra tables, all with auto-assigned keys, are also updated, but they have never presented a problem.

The database is compacted every so often, but if anything this seems to slow things down.

I'm taking a somewhat different approach to future development: redesigning the database from the ground up to get rid of problems like these, paid for by enhancements to the analysis tools that can be added at the same time. Ho hum...

Anyway, I'm off to find some marshmallows and an oil terminal. (Joke.)
Claire Rand
Tuesday, December 13, 2005

This topic is archived. No further replies will be accepted.

Powered by FogBugz