The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Time Complexity of Databases

Hi,

I'm looking for any formal documentation regarding the time complexity (i.e. Big-O notation) of database inserts, queries and deletes with and without indexes.

I need this information in order to decide between the two following designs:

I have a database which contains two tables: Images and Specification. The former contains raw bytes, the latter contains the resolution of the image.

Design #1: Each image has its own specification instance
Design #2: Images share a single specification instance

Specification is immutable and I believe many images will theoretically end up having the same resolution. That said, it isn't clear to me whether it is more desirable to go with Design #1 or #2.

I believe #1 results in simpler code (object construction is simplified) and improved scalability (across different machines), whereas #2 results in reduced space requirements. Depending on the time complexity of database operations, #2 might end up being a lot more efficient as well.

I don't mind investigating specific databases first (say MySQL) if I can't find general information but I haven't been able to find this either. I am hoping I don't have to wade into the database source-code to find out this sort of information.

Any ideas?
Gili
Tuesday, September 18, 2007
 
 
I found the following MySQL-specific documentation:

http://dev.mysql.com/doc/refman/5.1/en/select-speed.html
http://dev.mysql.com/doc/refman/5.1/en/insert-speed.html
http://dev.mysql.com/doc/refman/5.1/en/update-speed.html
http://dev.mysql.com/doc/refman/5.1/en/delete-speed.html

but they are rather vague and do not provide Big-O notation for most of the operations.

I also wanted to clarify why I believe Design #2 could be faster than #1. Assuming there are one million images with five distinct kinds of specifications, design #1 would have you issue queries against one million specification rows instead of five. Now, I don't know the exact query speed of the database, but I am afraid I will be stuck with O(n) where n = number of images in design #1, as opposed to n = 5 in design #2. If the number of specifications grows in #2, it will do so far more slowly than in #1. That said, inserting new records in #2 is more complex than in #1, because I need to issue a query (a check for duplicates) before saving any new record.
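To make the comparison concrete, here is a minimal sketch of both designs using Python's sqlite3 (all table and column names here are my own invention, not from the original post; treat it as an illustration, not a benchmark):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Design #1: each image row carries its own resolution columns.
con.execute("CREATE TABLE images1 (id INTEGER PRIMARY KEY, width INT, height INT, data BLOB)")
con.execute("CREATE INDEX idx_images1_res ON images1 (width, height)")

# Design #2: resolutions are shared rows referenced by a foreign key.
con.execute("CREATE TABLE resolutions (id INTEGER PRIMARY KEY, width INT, height INT, UNIQUE (width, height))")
con.execute("CREATE TABLE images2 (id INTEGER PRIMARY KEY, resolution_id INT REFERENCES resolutions(id), data BLOB)")

# Populate both designs with the same three images, two sharing 1024x768.
for w, h in [(1024, 768), (1024, 768), (800, 600)]:
    con.execute("INSERT INTO images1 (width, height, data) VALUES (?, ?, x'00')", (w, h))
    con.execute("INSERT OR IGNORE INTO resolutions (width, height) VALUES (?, ?)", (w, h))
    rid = con.execute("SELECT id FROM resolutions WHERE width=? AND height=?", (w, h)).fetchone()[0]
    con.execute("INSERT INTO images2 (resolution_id, data) VALUES (?, x'00')", (rid,))

# Both designs answer "all images at 1024x768"; with the index in #1
# the lookup is an index seek, not a scan of every image row.
n1 = con.execute("SELECT COUNT(*) FROM images1 WHERE width=1024 AND height=768").fetchone()[0]
n2 = con.execute("""SELECT COUNT(*) FROM images2 JOIN resolutions r ON resolution_id = r.id
                    WHERE r.width=1024 AND r.height=768""").fetchone()[0]
print(n1, n2)  # -> 2 2
```

Note that once #1 has an index on (width, height), the query does not touch all one million rows; it only touches the rows that match.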
Gili
Tuesday, September 18, 2007
 
 
i would go for #2.

the property is immutable (no updates), and inserting into the specs table is *not* complex:

SET NOCOUNT ON
SET XACT_ABORT ON
BEGIN TRAN
IF NOT EXISTS (SELECT * FROM Specs WITH (UPDLOCK) WHERE Id='1024x768')
    INSERT INTO Specs (Id) VALUES ('1024x768')
COMMIT TRAN

is that so complex?
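For what it's worth, the same guarded insert can be sketched portably with Python's sqlite3, where INSERT OR IGNORE plays the role of the IF NOT EXISTS check above (the Specs table and Id value follow the T-SQL; everything else is illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Specs (Id TEXT PRIMARY KEY)")

def ensure_spec(con, spec_id):
    # INSERT OR IGNORE is a no-op when the row already exists, so the
    # check-then-insert collapses into one atomic statement.
    with con:  # runs the statement inside a transaction
        con.execute("INSERT OR IGNORE INTO Specs (Id) VALUES (?)", (spec_id,))

ensure_spec(con, "1024x768")
ensure_spec(con, "1024x768")  # second call changes nothing
count = con.execute("SELECT COUNT(*) FROM Specs").fetchone()[0]
print(count)  # -> 1
```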
Johncharles
Tuesday, September 18, 2007
 
 
> I'm looking for any formal documentation regarding the time complexity (i.e. Big-O notation) of database inserts, queries and deletes with and without indexes.

I can't imagine why you'd want to consider a database without indexes.

I don't know of any formal documentation, but failing that my intuition suggests that the complexity, given hashed indexes which can find any record in O(1) time, should be O(n), i.e. linear, for processing n records.
Christopher Wells
Tuesday, September 18, 2007
 
 
> given hashed indexes which can find any record in O(1) time

And given the database implementation's use of "pages" which means that inserting or deleting a record is also a constant-time operation.
Christopher Wells
Tuesday, September 18, 2007
 
 
+1 Chris.

Indexed Database (implying using a hash of the index mapping to a 'record offset') == O(1).

Un-indexed Database (implying the database must examine every record to find a match) == O(n).

The big time-waster in databases that I've run into is complex queries using inner and outer joins, which can return HUGE result sets which innocent programmers then have returned over a (relatively) slow network so their application can filter them further.

So looking at the "O(x)" level of a database is looking at the wrong place.

For your images -- I'd assume the 'resolution' of an image is a 'property' of the image.  So you'll mostly be searching for images, THEN wanting to know their resolution.  I assume you'd rarely want to grab a whole bunch of images just because they were the same resolution.

So I'd organize the database by image, and keep 'resolution' as a field of the image record.  Creating an entirely separate 'resolutions' table just because some HAPPEN to be 'common' doesn't buy you anything and may reduce performance.
AllanL5
Tuesday, September 18, 2007
 
 
> So I'd organize the database by image, and keep 'resolution' as a field of the image record.  Creating an entirely separate 'resolutions' table just because some HAPPEN to be 'common' doesn't buy you anything and may reduce performance.

Agree with Allan. At one point I was considering re-using the first_name property of users in my db. It would have saved some space (the user count was several million), but it turned out to be unnecessarily complex.
friday
Tuesday, September 18, 2007
 
 
What's the advantage of storing images in a database?  Why not just save them as files?
Jeff Zanooda
Tuesday, September 18, 2007
 
 
Let's rename Specification to Resolution for the sake of our discussion moving forward as it simplifies things :)

Some clarifications:

1) If I go with design #1 and I run the query "return all images with resolution X", then I believe that even with indexes the query complexity will be O(n), because the number of images with a given resolution will be within an order of magnitude or two of the total number of images. That is, I expect the top four most-used resolutions to be associated with 90% of all images, so querying for any one of those resolutions means iterating across roughly 25% of all images in the entire database.

2) The complexity of design #2 is that every time you want to save a new image to the database you now have to check whether its resolution already exists in the database, otherwise you have to create a new one. This means that every image save now requires a prior query. Are you guys saying that the resulting operation would simply cost 2*O(1)? (i.e. negligible overhead)

In the case of #2 I would scan the database once every night and remove any unused resolutions (i.e. ones that have no images associated with them). I believe this would be an expensive operation.

3) Does anyone know the disk-space or memory overheads of indexes?

4) Refactoring by Martin Fowler discusses using shared instances vs. new instances for associations. He writes that if your associated objects are mutable, you should use shared instances, so that if their values change you don't have to update all instances of that association. That is, if Person is associated with a mutable Department, you don't have to update all Persons one by one.

That said, he indirectly implies one should avoid using shared references for associations with immutable objects. The reason being that when you use shared references you have to use: FooFactory.getInstance(value) to create new instances as opposed to just invoking "new Foo(value)".

I believe my use-case is a hybrid. When I work with Image and Resolution in memory I always create new instances. Then, when DAO.save(image) is invoked it ensures shared references are used in the database layer. I believe this gives me the best of both worlds (though there is the added cost that if I save the same image to DB multiple times it will re-query to locate shared instances every time as opposed to just once at creation time).
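The save-then-share flow, plus the nightly orphan sweep from point 2, could look roughly like this in Python's sqlite3 (table and function names are my own invention, not any real DAO API):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE resolutions (id INTEGER PRIMARY KEY, value TEXT UNIQUE);
CREATE TABLE images (id INTEGER PRIMARY KEY, resolution_id INT REFERENCES resolutions(id), data BLOB);
""")

def save_image(con, resolution, data):
    # In memory, every image carries its own resolution value; only at save
    # time is it mapped onto a shared row (insert-if-absent, then look up).
    with con:
        con.execute("INSERT OR IGNORE INTO resolutions (value) VALUES (?)", (resolution,))
        rid = con.execute("SELECT id FROM resolutions WHERE value = ?", (resolution,)).fetchone()[0]
        con.execute("INSERT INTO images (resolution_id, data) VALUES (?, ?)", (rid, data))

def purge_orphan_resolutions(con):
    # The nightly sweep: delete resolution rows that no image references.
    with con:
        con.execute("""DELETE FROM resolutions
                       WHERE id NOT IN (SELECT DISTINCT resolution_id FROM images)""")

save_image(con, "1024x768", b"\x00")
save_image(con, "1024x768", b"\x01")  # re-uses the shared 1024x768 row
save_image(con, "800x600", b"\x02")
con.execute("DELETE FROM images WHERE resolution_id = (SELECT id FROM resolutions WHERE value='800x600')")
purge_orphan_resolutions(con)
rows = [r[0] for r in con.execute("SELECT value FROM resolutions ORDER BY value")]
print(rows)  # -> ['1024x768']
```

With unique constraints in place, the "prior query" is just an index probe, so the extra cost per save is a couple of O(1)-ish operations rather than anything proportional to table size.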
Gili
Wednesday, September 19, 2007
 
 
Jeff,

I keep getting that question, so you're not the only one thinking along those lines. I think there is value in accessing all of my data behind the transaction paradigm, and also in the comfort of knowing all my data resides in a single location: I back up the database and I'm done. The database will also enforce certain invariants for me; I won't end up with orphaned image files, or image metadata without a corresponding image file.

As a sidenote, in my experience most databases handle BLOB data types really badly, so you need to store them in their own dedicated table (at least this is the case in PostgreSQL and MySQL). In the former this is outright required; in the latter, the JDBC drivers load the entire BLOB into memory even if you don't access all of it. This is really bad, because the entire point of a BLOB is to access data that doesn't fit in memory to begin with :) Anyway, my images are rather small, so this isn't much of a problem for me. Still, I wish database vendors would do a better job supporting BLOBs.
Gili
Wednesday, September 19, 2007
 
 
You might want to go peruse the sections of the MySql manual dealing with storage engines and indexes. There's a ton of good stuff in there.

http://dev.mysql.com/doc/refman/5.0/en/storage-engines.html
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
http://dev.mysql.com/doc/refman/5.0/en/estimating-performance.html

Although the manual is, of course, written with MySql in mind, a lot of the concepts are generally applicable to all databases.
BenjiSmith
Wednesday, September 19, 2007
 
 
I see no benefit in splitting the resolution from the binary data at all -- normally you're only talking about a handful of bytes per row.  In SQL Server, at least, large binary data is not stored in the same page.  If you need to search efficiently on the resolution or other summary data, you would create a covering index on those columns, and the fact that a binary-data column is in the table makes no difference.
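The covering-index point can be demonstrated in miniature with sqlite3: the query plan shows the lookup is satisfied entirely from the index, never touching the base-table row holding the BLOB (schema names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, width INT, height INT, data BLOB)")
# Every index entry implicitly carries the rowid (here aliased by id), so an
# index on (width, height) covers a query that needs only width, height, and id.
con.execute("CREATE INDEX idx_images_res ON images (width, height)")

detail = " ".join(row[3] for row in con.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM images WHERE width=1024 AND height=768"))
print(detail)  # reports a SEARCH ... USING COVERING INDEX
```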
Cade Roux
Wednesday, September 19, 2007
 
 
Regarding returning BLOBs from the database, you shouldn't be selecting those columns in the first place unless you are prepared to use them.
Cade Roux
Wednesday, September 19, 2007
 
 
>> I see no benefit of splitting the resolution from the binary data at all - normally you would only be talking about a handful of bytes per row. <<

I agree.  You're talking about 10-15 bytes for storing resolution info, which are essentially a rounding error when you're storing large media files.

However, if you were storing some larger info, such as a full EXIF structure, you might try something else, but even then, there's only so much repeating data in there that you can pull out into another table (Camera brand, resolution/color depth, aperture, exposure time, and ??)
xampl
Wednesday, September 19, 2007
 
 
+1 for "Why are we talking about rounding error".  Disk space is cheap, programmer brainsweat is not; optimize accordingly.  You're talking about having to code up a daemon process (and debug it), and incurring extra programmer overhead for every additional query you write, forever, to save far, far less than one meg of disk space.  One meg of disk space runs, what, a fraction of a cent?
Patrick McKenzie (Bingo Card Creator)
Thursday, September 27, 2007
 
 

This topic is archived. No further replies will be accepted.
