The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Storing binary data in SQL Server

We are about to write a new module for our app suite that allows our users to store and view binary data - digital photos, scanned documents, .PDF files, Word docs, ... whatever.  Each doc will be able to be associated with one or more records in one or more tables.

We don't know yet the potential number of binary "records" to be stored but it could be in the tens or hundreds of thousands over time.

For those of you who have done this, I would appreciate your recommendations as to the best way to store the data. We support Microsoft SQL Server 2000 or 2005 - please don't recommend anything else.  Here are three storage schemes I've come up with so far:

1. Store the data in a "BinaryData" table in the same .MDF file as the rest of the database.  Pro's: it's right there.  Con's: it will surely increase the size of the .MDF tremendously, and we regularly download customer databases for analysis and I'm fearful of the time these downloads will start taking.

2. Store the binary data in a second .MDF file, and write our app so that a "lost link" or the imaging data file not even being there is handled gracefully.  This way if we need to download a database for non image-related analysis we can skip downloading the binary data .MDF.  Pro's: faster downloads.  Con's: I don't know.  What are the problems?

3. Store the binary data as individual DOS files on the server and access them there.  Pro's: no data tables to maintain.  Con's: some idiot deletes the directory; users can't access the server directory due to security settings by their IT department; the need to get sometimes-anal IT directors to allow access to servers; the list goes on.  I'm really not keen on this option - can someone convince me it is a good idea?

Any other suggestions?  Who has done one of the above or has a better idea?

Thanks in advance.
Karl Perry Send private email
Saturday, September 29, 2007
My recommendation would be to store the data in a separate MDF from the logic. This way upgrades to the app are simpler and less likely to involve trashing the data, or requiring upgrade threads that may take days to migrate.

I've worked with your #3 scenario and if you don't like it, then only the size argument may have any impact. At one previous employer (the IT dept of a consulting company, so it was both sides of Office Space), they had about 70k documents in this homegrown CMS system. The unfirable Crazy French Lady would rearrange files and the links in SQL server would get lost.

At a current side gig, the document storage uses file systems and tracks the paths in SQL Server. It is a litigation support CMS and typical cases have tens to hundreds of gigabytes of files (my outlook pst file is 1.5Gb, imagine a case involving a company full of such packrats). Are your clients likely to exceed file limits of SQL Server (other than 2k, or 2k5 express, I'm not sure if there are any practical limits to the full versions of 2k5).

At my day job, we have a product that involves 2 databases, one updated annually. The one updated annually can be distributed as a back up, but the other one has a pretty serious upgrade issue because it also contains client data when the object model changes even slightly (make any change to a binary serialized object's base class and it won't deserialize any more).
Peter Send private email
Sunday, September 30, 2007
I would go with #3, but rather than having the users access the files directly, have them go through a software layer that:
1. Controls concurrent access
2. Compresses the files for trips over the network
3. Maps the keys in the database to a path + filename
4. Optionally updates any indexes (text, metadata, etc)

This way you can let the IT department restrict access to your directory tree to just the account your server software is running under, and you don't have problems with users deleting files willy-nilly.

Note that you'll need a function to map your keys to a path that results in a wide dispersion across folder names.  NTFS does have a very high limit on the number of files in a directory, but what happens is that the filesystem spends a lot of time scanning directory entry records (last I heard, the entries were not hashed, and the code had to do linear searches, checking permissions on each one...)
xampl Send private email
Sunday, September 30, 2007
Xampl, why #3?  You gave some development hints but no reason why you think this is the best.

From my OP you should detect that I'm personally biased against #3 but if there are compelling reasons that it is best I'd like to be convinced of that.  So far your ideas seem like a lot of work to me:

- Convince anal-retentive IT Directors to allow a database access to directories;
- Even though the directories are secured from end users that doesn't mean some pea-brained newby backup operator won't delete the files "because they're over six months old and they don't look important to me";
- Have to develop an interface to manage the file/folder manipulation;
- Have to get around the linear directory search issue you brought up, so this could be slow since there will be lots of files to manage.

I don't mean to just throw away what you say, but as I said I'll need some convincing before I'll accept that #3 is a viable option.  Have you used this approach yourself?
Karl Perry Send private email
Sunday, September 30, 2007
Personally, I would opt for #2, but I am doing some work for a client who insists on #3.

I pointed out that in the case of a disaster, they may lose data. Their log shipping means the database is only ever a few minutes out of date, which is fine. However, they are only going to back up the external files (insurance documents in PDF format, in this case) overnight.

So, when they lose their machine room, they can restore SQL to within a couple of minutes of the point it happened, but the document files, only to the start of the business day. And, just to make life fun, the database will potentially have hundreds of dead links to files that were created, but not backed up in time.

Okay, so you could write I script to go through the database and identify the dead links, and get the documents re-created. However, if you have just had some catastrophe that has wiped out your machine room, you probably have enough on, getting back on track, without having to re-do lots of work that was previously stored on your shiny, expensive SAN.

The whole point of the SAN, with the redundant power, cooling, discs, controllers, etc, etc, was that you shouldn't find yourself in this exact situation.

I pointed all this out, but the IT guys are so confident in their infrastructure that they can't see it ever happening. They are probably right up to the point that they lose the building (gas explosion, etc). Then, they will be in trouble.
Entries of Confusion Send private email
Monday, October 01, 2007
I know you said not to suggest other systems, but SQL Server 2008 (should be released before end of year) has a cool feature to at least be aware of which allows you to store blobs on the file system but still interact with them as if they are in the DB. DB backups even backup these files.

So at least be aware that going forward you can get all the advantages of storing documents on the file system while still having the advantages of storing them in the DB.

This makes options #1 and #2 more compelling assuming you see SQL Server 2008 in your future.

Monday, October 01, 2007
Well I have seen a SAN go down do to hardware problems.  I believe the hardware problem caused the data to be corrupted so everything had to be restored from backup.  Its then that we found what was really being backed up....and what wasn't.
Ted Elliott Send private email
Monday, October 01, 2007
>> Have you used this approach yourself?  <<

We did at a previous job (which is where we discovered some of the aforementioned limitations of NTFS).

We had 18+ million files stored in a directory structure that was managed algorithmically.  Our initial fan-out used the last two places of the identity column key, plus the customer identifier, as our mapping function.  It turned out that 100 directories wasn't enough fan-out, so we went to the last 3 places, and last I heard, that was OK.  There are obviously other, better algorithms for doing this, but this was simple enough to work.

We had replication both of the database (log-shipping) and file system (a 3rd party product whose name I do not know) between our production datacenter and the backup datacenter.

We had experience with dead SANs as well -- we lost an array in the SAN due to multiple drive failures in a short amount of time, and had to restore from backup.  Advice: Buy drives from 3 different production lots when initially populating the SAN so that the MTBF curve doesn't bite you.

We also had issues with operator error -- one of the NOC staff hit the BIG RED SWITCH by accident one day {oops!}.  There's only so much you can do to prevent this, unfortunately.  You know the old saw about nature breeding idiots faster than you can guard against them...
xampl Send private email
Monday, October 01, 2007

For the pure and simple reason that file systems are _designed_ to hold files. Databases are designed to store structured data.

I'd assume the paging you'd get would decimate performance on the SQL Server because the files will be such variable lengths.
I don't know for sure, its worth doing some performance metrics to prove the theorycraft, but i'd definately recommend checking it out.

However as just an off the cuff opinion, it sounds to me like your using the wrong tool for the job due to a personal issue.

Obviously if you do decide to go for #3, make sure your folders are well structured.
Jax Send private email
Monday, October 01, 2007
"I know you said not to suggest other systems, but SQL Server 2008 ..."

Good point - our company committed to SQL Server long ago and I can't change that.  2008 will be in the picture, just like 2005 is since it has supplanted 2000.  So thanks for the info about its file handling capabilities.

Jax, would you amplify your comment? "... it sounds to me like your [sic] using the wrong tool for the job due to a personal issue."  What personal issue?  Is it my insistence on using SQL Server as the DB?  As explained above, I can't control that.
Karl Perry Send private email
Monday, October 01, 2007

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
Powered by FogBugz