A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.
I'm having a very hard time transitioning from the world of imperative programming (C++/Java) to the world of declarative programming (databases). So much so, that it has caused my programming to grind to a halt. My perfectionist brain simply won't let me continue until I figure this out :(
I've been trained for years to avoid race conditions through meticulous use of mutexes and I'm finding it very hard to accept the anomalies that come with the READ_COMMITTED transaction isolation level.
Is there a discussion forum full of database experts or some book that could help me out? I want to understand how to work around these anomalies and/or understand when it is okay to simply accept them. Every single resource I've read so far has skipped over these details. I am looking for a resource that addresses these issues straight on.
Well, lesson 1 would be that RDBMS's tend to work differently, particularly when it comes to transaction control and locking. Is there a particular platform that you're working with?
Tuesday, September 16, 2008
Jim Gray, who went missing at sea last year, wrote a good deal about concurrency matters; I remember a fairly short paper that covered quite a few items well. The book he wrote with Andreas Reuter, _Transaction Processing: Principles and Techniques_ is outstanding, but not small.
Tom Kyte writes very clearly about Oracle's handling of concurrency (multi-version concurrency control, MVCC).
I'm working with MySQL under JPA -- http://en.wikipedia.org/wiki/Java_Persistence_API -- but I'd prefer not to limit our discussion to a specific platform if possible.
The problem is that I keep on running into theoretical problems in my mind and I can't think up of solutions to them and it bothers me so much that I stop working on the actual problems at hand. In the past I tried refactoring my application focusing on the immediate problems at hand and every single time when I was "90% done" I would run into a single use-case that would fail under my design and I would have to go back and figure out all over again how I should handle database locks.
For example... I have the following tables:
I'll mention two use-cases I've got problems with:
1) Saving a new image:
An image references a Dimension. Now, I decided that I wanted to share a single Dimension amongst multiple images because the number of possible values is low.
I thought up of two possible implementations:
- Before saving an Image, I must first save its associated Dimension.
- Before saving a Dimension, I must first check for duplicates in the database. If the Dimension is new, I save it. If a duplicate already exists I modify the associated Image so it points to the existing instance instead.
Using READ_COMMITTED leads to the following bug:
- Two different transactions try saving the same dimension at the same time. T1 checks for duplicates, finds none. T2 checks for duplicates, finds none. Both save a new copy.
- I worked around this problem by adding a "unique" constraint on the table. The problem is that the transaction encountering the collision must be "replayed". Is there a better way?
- I could implement save(Dimension) so it always inserts a new instance (allowing duplicates) then have a background thread scanning the database every hour or so collapsing multiple instances.
- The problem is that a transaction could be reading an image at the same time its specification is collapsed. When the transaction goes to retrieve the associated specification it will find it has been deleted and it must be "replayed" as well.
2) Selecting a random image:
A client asks the server for a random image. I implemented this as follows:
- count the number of images
- select image at index X
The problem is that image X could get deleted between the two queries such that I end up with the image after it (this isn't so bad I guess) but I could end up with an index past the max (rare but it could happen). The only way I can think of fixing this is "replaying" the transaction.
Two things bother me in the above discussion:
1) If I don't replay the transaction, clients are liable to get back error messages even if they didn't do anything wrong. From their point of view their request is completely valid.
2) Replaying the transaction (to prevent clients from seeing errors which they aren't really responsible for) is a pain to implement. I mean, it's doable but it leads to uglier code.
3) Though I haven't run into a concrete use-case yet, I worries me deeply that I have no idea how to implement "if A and B are true then do C" atomically with READ_COMMITTED (where A, B, and C are queries that must be atomic with respect to one another).
I just thought of another use-case:
3) Scale images:
- Say there are 2 images with a dimension of 640x480 and for whatever reason I decide to scale all such images down to 320x240. So I update the image data and then change the shared dimension object to reflect the new dimensions.
- At the exact same time as the dimension row is being changed, another image is saved that has a dimension of 640x480.
- It is possible that the 3rd image will end up pointing to the old Dimension object without having its data re-scaled.
The only workaround that comes to mind is to either keep Dimension immutable (the two images would be made to point to a new object) or make Dimension non-shareable. I'm curious whether you can think of a third solution which solves this while keeping Dimension shared.
Just curious, but what's with the dimensions?
5 seperate images with the same dimensions are still 5 seperate images, and can continue to be 5 seperate images without violating key restraints.
Tuesday, September 16, 2008
Are you asking why Dimension is a separate table instead of showing up as columns in the Image table? I was under the impression one should use the normalized table form when possible. If you meant something different, I didn't understand your question.
I'm fairly simple-minded, so your example may have just sailed over my head, but as I understand transactions, if the entire operation is wrapped in a READ_COMMITTED transaction, then T2 would not even be allowed to read the tables until T1 has finished processing and committed its transaction. So I think the problem is moot unless you forced a READ_UNCOMMITED transaction.
Note: I am coming from a MS SQL Server environment, and maybe things are different. Also, I'm no expert DBA, so maybe I'm full of carp and just don't realize it.
fair to middlin,
According to http://en.wikipedia.org/wiki/Isolation_(computer_science)#READ_COMMITTED holds write-locks until the end of the transaction, but read-locks are released immediately.
In other words, the following bug is possible:
- T1 reads Dimension 640x480, doesn't save it since it already exists.
- T2 deletes Dimension 640x480, and commits.
- T1 tries to save an image associated with 640x480, and fails because the Dimension has been deleted.
Ok, I think I'm with you. In that case, is there a REPEATABLE READ isolation level available in MySQL. In MSSQL, it is defined as follows:
When it's used, the dirty reads and nonrepeatable reads cannot occur. It means that locks will be placed on all data that is used in a query, and another transactions cannot update the data.
If so, it sounds like that could help. No?
You are discovering the world of compromises built into databases to make them scale.
It is possible to apply locks to everything you read, so the issues you mention never occur. However, this blocks all other transactions, even queries, on the tables/ranges/indexes/rows that you lock, until your original transaction completes.
This means you potentially have problems where transactions block for too long, impacting performance, or even leading to deadlock conditions.
The solution is to use some kind of concurrency control, i.e. locking, generally either "Optimistic Locking", or "Pessimistic Locking".
With optimistic locking, you collect timestamps, so you can tell if another transaction has changed the underlying data. With pessimistic locking, you apply a physical lock flag to the data, while you do your changes, then other transactions know that they can't work with the data.
Both cases have advantages and disadvantages, and you will need to consider the use cases and user expectations for your application carefully before selecting a method to use.
In general, if it is unlikely that your users will update the same data at the same time, then optimistic locking might be best. OTOH, if they might often want to update the same data, pessimistic locking may be more appropriate.
In my case, I almost always use pessimistic locking, as I tend to work with complex business applications. For example, in an insurance underwriting system, it doesn't make sense to allow two users to work on the same policy at the same time, so you need to use explicit locks (pessimistic locking) in cases like this.
>> 5 seperate images with the same dimensions are still 5
>> seperate images, and can continue to be 5 seperate
>> images without violating key restraints.
> Are you asking why Dimension is a separate table instead
> of showing up as columns in the Image table? I was under
> the impression one should use the normalized table form
> when possible. If you meant something different, I
> didn't understand your question.
If your understanding of the rules of normalisation makes you have dimensions as a separate table, then your understanding of normalisation is seriously flawed.
An IMAGE is an entity, therefore it is a table.
A DIMENSION is a property of an image, therfore it is a table column.
If I have a database which deals with financial matters do I make AMOUNT a separate table just because it appears in lots of transactions? Or do I make it a column on the TRANSACTION table?
Wednesday, September 17, 2008
I agree with Tony. It looks like a misunderstanding of normalization is leading to an overcomplicated design. E.g., Would you create a First_Name table so that there is no more than one instance of the name "John" and all the people in the person table with that name have to be linked to the "John" record?
Wednesday, September 17, 2008
+1 on the general consensus -- 'dimension' doesn't really have any meaning on its own, you can't DO anything to a dimension, it's not worthy of being its own database entity.
For the general questions you have, though, assuming 2 tables whose relationship does warrant normalising them:
#1 (Save without duplicates, avoid requery): The 'collapsing rows' idea is not a good one. Trust me on this, just don't do it.
Really, if this is the situation you want, you have 3 approaches:
* Change your transaction level. What you are trying to do here is fundamentally something that requires SERIALIZABLE transaction isolation, so if you're not using that it just won't work. Note that you'll also likely need to tell your DB to make sure it takes the required locks on the SELECT, e.g. (in Transact-SQL; I'm a SQL Server guy. Adapt as appropriate):
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
SELECT * FROM Dimension WITH (UPDLOCK) WHERE ...
then if there's no result:
INSERT INTO Dimension ....
The reason that taking the lock on the select is important is that otherwise you're opening yourself up to deadlocks: two transactions take read locks on the select, then t1 does an INSERT and tries to take a write lock, which waits on t2's read lock; t2 then tries to escalate ITS lock and a deadlock ensues.
* SELECT the row, if it's missing INSERT the row you want, deal with a unique constraint violation if two accesses happen at the same time. Yes, this seems 'messy', and a lot of DB folk will stroke their beards and say 'tut tut' at this. But this may actually make the most sense. Imagine if 99% of the time the row is present, so no insert needs to be done; 0.75% of the time there is no concurrent access; 0.24% of the time the two concurrent accesses will be looking at different rows anyway. Then you're upping an isolation level and escalating locks early for the 0.1% of the time you're actually preventing a problem. Yes, it's a bit "icky"; but most people hit database scaling problems before any other kid of scaling problem; don't make life hard for yourself by complicating things. Understand your data access patterns and cater to the majority of cases, a little bit of 'ick' is OK sometimes.
* Use a special command in your DBMS to do this atomically. SQL Server 2008 introduces a MERGE command which does this kind of thing; MySQL apparently has INSERT ... ON DUPLICATE KEY UPDATE ... which at least seems conceptually similar.
#2 (choosing a random row): again, a couple of options:
* Use REPEATABLE READ. Easy -- this guarantees that if a reader which selects out all the rows to choose one at random will still have the row there when it comes back.
* Read all the data in-memory, then you don't have to hit the database again (again, icky -- but if it makes the most sense, do it)
* Select the random data out in another way. In SQL Server:
SELECT TOP 1 .. FROM ... ORDER BY newid()
in MySQL, I think it's something like
SELECT ... FROM ... ORDER BY rand() LIMIT 1
you have to admit that's a lot cleaner. There are possible performance implications here; again, know your data. Are you getting the DB to sort a bajillion rows on an unindexed column? If so, bad idea. Is it sorting 500 rows, and memory in the app is at a premium? If so, probably good idea. If it doesn't need to be random EACH AND EVERY TIME, then maybe you can cache the random values in the table and run a simple UPDATE ... SET randomSortOrder = rand() every minute/hour/day/month/whatever as appropriate.
Anyway, I've rambled a bit. But I hope that helped.
I guess I over-simplified my problem too much. Dimension is actually a table called Specification that contains: [width, height, color-depth, grayscale]
It stands in its own table because there are very few possible combinations of Specification and potentially millions of possible images. I see this as a way of reducing duplication. Also, it's quite possible that users will say "give ma an image with resolution X" and then I'd want to scan all images with a specific specificationId. This would be far more efficient than scanning a decomposed table.
I did some reading about 3NF and you're right that technically speaking it doesn't require this sort of composition (since specification does not introduce any transitive dependencies into the table). Are you still saying this is a bad idea? If so, why?
Thank you for the detailed answer. It really helps seeing you walk through the problems because it teaches me the mindset necessarily to solve these problems. Two items of interest:
- JPA (the ORM I am using) only allows me to manipulate transaction isolation on a per-connection basis (ugly!) so many of these solutions are pretty much off-limit to me. I am forced to use READ_COMMITTED transaction isolation if I want to avoid configuration headaches.
- Selecting a random image: unfortunately the use-case is more complicated. The client says "give me a random image of a car" and image categories are hierarchical. This means that I have to iterate the list of images under the image category (iterate the hierarchy). This pretty much eliminates the use of any database built-in functionality. Caching results is definitely a possibility but coming back to the original question: is it possible to solve this problem using READ_COMMITTED without replaying the transaction? My guess is no.
PS: You brought up a lot of good solutions! I'm just trying to figure out which ones will require the least amount of work under the ORM. Unfortunately it seems that ORMs create a lot of unforeseen difficulties :(
Wednesday, September 17, 2008
1) On the transaction front, you might look at using Spring's TransactionTemplate with the JpaTransactionManager. I've never used JPA but I have no reason to believe it wouldn't work. I don't know the gory details of how it works but I think you should be able to use this to start/participate in transactions at any isolation level relatively easily. If you're not using Spring already that's a big lump of a JAR to put in your codebase but it might be worth looking at anyway.
2) OK, that random image one is complicated, I'll grant you that. I gather MySQL doesn't have native support for hiearchies (like Oracle's CONNECT BY and SQL Server 2005's CTEs or 2008's new hierarchyid). If it's more suitable, you might want to look a way of storing a hierarchy OTHER than as a 'parentid' type relationship (which I assume you're using). Nested sets or the 'dotted path' type hierarchy (Vehicles = '1.', Cars = '1.1.', Blue Racing Cars with Yellow Spots from Finland and a Smiley Face Sticker on the Back = '188.8.131.52.184.108.40.206.') make these kinds of queries extremely nice; something like SELECT TOP 1 FROM item JOIN category .... WHERE category.hierarchy like '1.1.2.%' would find all racing cars without actually having to walk the hierarchy. Nested sets are even nicer, but have their own problems. Joe Celko has an excellent book on this topic.
The short answer to your question is no, with READ UNCOMMITTED there's nothing to stop someone deleting data from underneath you. However, if you do want to use different transaction levels while still keeping the connection at READ UNCOMMITTED, there are (probably) ways around this. For example, in SQL Server
SELECT somerows FROM sometable WITH (SERIALIZABLE)
forces SQL Server to take the locks it would take if the isolation level was serializable, on that table, and would have the same effect as setting the isolation level on the transaction. Still might not be the best idea but it should be possible, I think.
Disclaimer: I have no idea how MySQL implements things, including transaction control.
I echo the remarks that different RDBMSes implement things differently. The particular RDBMS that I cut my teeth on is no longer a major player. I sympathize with your desire to learn the theory of concurrency control in an RDBMS independent manner, but I fear that you're going to have to get specific to your RDBMS before you get down to something you can actually use when you write code.
I also echo the recommendation for Gray and Kyte as general reading on the topic.
The mainstream of the detailed responses to you has been in the direction of how to wrest concurrency control from your DBMS, and how to handle the responsibility that goes with that control.
My remarks go in a different direction. Another dsclaimer: YMMV.
The engineers who build an RDBMS put a lot of effort into concurrency control. They want to implement a strategy that represents a sweet spot between forbidding concurrent transactions entirely and permitting a single transaction to see a chaotically changing body of data. Forbidding concurrency entirely exagerrates bottlenecks, and slows down processing. Permitting a transaction to see chaotically changing data encourages applications that exhibit non repeatable behavior. Your reference to race conditions suggests to me that you already know all this.
As an aside, I had learned about race conditions in the study of operating systems before I ever ran across my first DBMS.
Many commercial DBMS products, like Oracle, DB2, and SQL server, have put enough effort into concurrency control so that the average application developer is better off to make use of what they provide instead of trying to outsmart the engineers who built the product. There are exceptions to this, and your case could be one of the exceptions. Also, it's possible that MySQL is not as trustworthy as I'm claiming the major industrial products to be.
One thing you ought to look into is virtual snapshotting as an alternative to synchronizing, when it's desired to give a read only transaction a consistent view of the entire database. The virtual snapshot looks at all data as of a point in time "between" two sets of transactions, which I'll call the "before transactions" and the "after transactions". Virtual snapshotting often goes by the name of MVCC.
In MVCC, the DBMS maintains multiple versions of some data, and takes on the responsibility of providing each transaction with the version that the transaction ought to see. This automatically becomes very efficient in the odd circumstance where all the transactions in progress are snapshot transactions. Actually, this circumstance is not all that odd, if there is a daily cycle of usual activity on the database.
If MySQL won't let you run a snapshot transaction, or doesn't work well in an environment where some transactions are snapshotted while others are synchronized, then my comment is beside the point.
Also, I apologize is my terminology is non standard. I'm using the terminology that's specific to the DBMS I learned on.
After I posted, I read some of the responses more carefully. I noticed that a lot of the responses focus on normalization. My comments once again depart from the mainstream.
On the one hand, data normalization is one of the most important concepts you can learn in terms of guidance towards good database design.
On the other hand, if your goal is to learn about concurrency and transaction control, overly focussing on normalization will distract you. The fundamentals of transaction processing remain the same even if the DBMS in question is not relational, and does not use SQL as the interface language.
The fundamentals of transaction control revolve around the ACID properties of transactions. If you don't understand ACID, you won't understand tranactions even after you learn everything there is to know about normalization. If you do understand ACID, you'll be able to deal with transaction control in a situation where the database design is less than fully normalized.
It sounds from some of your responses, as though that's the case here.
Most of ACID can be more easily understood if you already understand race conditions. Unfortunately, most of the best writings about ACID are written for the person who has never learned race conditions. This can result in your having to read through some long and difficult material, only to figure out later on that the author could have phrased the concept more succinctly using race conditions as a prerequisite concept.
Just in case there are any lurkers who don't know this, ACID stands for four desirable properties of a transaction, namely ATOMIC, CONSISTENT, ISOLATED, and DURABLE. A long article could be written on each of these, but it's best to learn about all four at once, at least to the first level.
"It stands in its own table because there are very few possible combinations of Specification and potentially millions of possible images. I see this as a way of reducing duplication. Also, it's quite possible that users will say "give ma an image with resolution X" and then I'd want to scan all images with a specific specificationId. This would be far more efficient than scanning a decomposed table."
In your example, wouldn't it be possible for there to be multiple specifications with resolution X? If so, you would need a join for that query.
You might find that simply indexing the columns that users might want to scan is simpler and faster. You should test this.
Thursday, September 18, 2008
I recommend "Database Design for Mere Mortals" and "SQL for Smarties" (latter by Celko) - I think you'd really like the smarties one. They are almost opposite in tone and approach, but both authors are great.
Friday, September 19, 2008
This topic is archived. No further replies will be accepted.Other recent topics
Powered by FogBugz