The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Avoiding Long 'IN' Subqueries

In one part of our application, the user sees an explorer-like tree control, where the folders as well as the items they contain can be 'checked' off.  Other features in the application show queries that are filtered based on the user's current selections from this tree. 

How can you avoid building up a potentially-huge "IN" subquery when trying to fetch records associated only with those items selected? 

Is it better to have a query batch that builds a temporary table, fills it with many short INSERT statements, and then joins against that?  That doesn't sound any better than an ugly IN. 

I'm worried about having a permanent table that stores a user's selections, because that table could end up containing a lot of records for users who no longer use the application.

What other options are there?

(Note that the hierarchy of folders is implemented as a self-referencing table.)
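For concreteness, here's a minimal sketch of the two alternatives being weighed -- a parameterized IN list versus a session-local temp table joined against the data. It uses Python's sqlite3 as a stand-in for the real database; the schema and names are hypothetical:

```python
import sqlite3

# Hypothetical schema: records tied to items the user can check off in the tree.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (id INTEGER PRIMARY KEY, item_id INTEGER, payload TEXT);
    INSERT INTO records VALUES (1, 10, 'a'), (2, 20, 'b'), (3, 30, 'c');
""")

selected = [10, 30]  # ids checked in the tree control

# Option 1: a parameterized IN -- fine for short lists, grows with the selection.
placeholders = ",".join("?" for _ in selected)
rows_in = conn.execute(
    f"SELECT payload FROM records WHERE item_id IN ({placeholders})", selected
).fetchall()

# Option 2: a session-local temp table joined against the data.
conn.execute("CREATE TEMP TABLE selection (item_id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO selection VALUES (?)", [(i,) for i in selected])
rows_join = conn.execute(
    "SELECT r.payload FROM records r JOIN selection s ON r.item_id = s.item_id"
).fetchall()
```

Both forms return the same rows; the difference is that the temp table keeps the statement text a fixed size no matter how many items are checked.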
foobar
Thursday, November 10, 2005
 
 
How about a view?
Berislav Lopac
Thursday, November 10, 2005
 
 
You need to benchmark. Sometimes nested IN subqueries are better than massive joins.

Remember, one Cartesian product can ruin your whole day.
sometimes unprepared
Thursday, November 10, 2005
 
 
Another option is to fetch the record each time the user selects it, and move the task of combining those entries to the client.

Obviously, this is not practical in every case, but it does work well when you're simply hiding or showing data and expecting the user to make sense of it.

A practical example would be a system monitor.  CPU activity represents one folder, memory usage a second, and hard disk activity a third.  You may want to display any combination of these three folders, but it's very rare to have the computer sit up and say "Foobar, I'm doing a lot of thrashing.  Better close a program or buy more memory."

If you were doing this over the web (through AJAX or something similar), the primary considerations would be a) how many concurrent users and b) how frequently are they going to click open folders?
Anonymous Coward
Thursday, November 10, 2005
 
 
I wouldn't necessarily fear the IN unless you've profiled it. If it really is a problem, your choices are heavily dependent on what your database is good at. I'm most familiar with Oracle, so I would tend to look at partitioning and/or CASTing the list of ids to a TABLE within the query so that it executes as a join.
Matt.
Thursday, November 10, 2005
 
 
If your db engine supports the UNION clause, you can build one query per item and UNION them into a single query, which avoids one arbitrarily long IN list.
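A rough sketch of that suggestion, again with sqlite3 standing in for the real engine (note that plain UNION also removes duplicates; UNION ALL, used here, does not, and most engines still cap total statement length, so this bounds rather than eliminates the growth):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (id INTEGER PRIMARY KEY, item_id INTEGER, payload TEXT);
    INSERT INTO records VALUES (1, 10, 'a'), (2, 20, 'b'), (3, 30, 'c');
""")

selected = [10, 30]

# One parameterized query per selected item, glued together with UNION ALL.
single = "SELECT payload FROM records WHERE item_id = ?"
sql = " UNION ALL ".join(single for _ in selected)
rows = conn.execute(sql, selected).fetchall()
```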
Sevenoaks
Friday, November 11, 2005
 
 
'IN' subqueries are all the rage these days... ;-)

http://www.thedailywtf.com/forums/50258/ShowPost.aspx
+=
Friday, November 11, 2005
 
 
The temporary table sounds like a winner to me. In an Oracle context it would reduce parse time, avoid cluttering the shared pool with many similar queries, and probably be easier on the optimizer. You could also use a "global temporary" table, where the entries go away at session end or at commit time. Databases less engineered for multiuser access might not be stashing parse trees in the shared pool, so the win there might be less.
George Jansen
Friday, November 11, 2005
 
 
Ditto what George said. A temporary table would probably be the most elegant and performant solution. If at all possible, batch the inserts using a stored proc that receives a delimited list of values as a parameter, to minimize round trips.
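The round-trip-minimizing idea might look like this. SQLite has no stored procedures, so a hypothetical Python function stands in for the proc: the whole list travels as a single delimited parameter, and the receiving side splits it and inserts in one batch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TEMP TABLE selection (item_id INTEGER PRIMARY KEY)")

def load_selection(conn, delimited_ids):
    """Stand-in for a stored proc: split one delimited parameter and
    populate the temp table in a single batch, not one trip per id."""
    ids = [(int(x),) for x in delimited_ids.split(",")]
    conn.executemany("INSERT INTO selection (item_id) VALUES (?)", ids)
    return conn.execute("SELECT COUNT(*) FROM selection").fetchone()[0]

count = load_selection(conn, "10,20,30")
```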
mikeymo
Friday, November 11, 2005
 
 
like... this?

http://www.thedailywtf.com/forums/50258/ShowPost.aspx

(~100 replies, maybe something useful in there?)
arg!
Friday, November 11, 2005
 
 
You described the problem you're trying to solve, but you haven't really said much about why you're concerned about long IN clauses.  If it's performance then measure.  If it's maintainability, stick with the IN clause.  If it's "elegance", maybe temporary tables.  If it's not wanting to show up on http://thedailywtf.com/forums/50258/ShowPost.aspx then use temporary tables.
Jonas Grumby
Friday, November 11, 2005
 
 
Are you having performance problems, or is this just about an IN clause being ugly?

If you think it's a performance problem, build both solutions, then time them both with a variety of settings.

When deciding about performance impacts, don't guess.  Test.
Dave
Friday, November 11, 2005
 
 
Dave,

My main concern is whether using an "IN" query would be bug-prone, in that someday someone will try to choose a million items from the tree (I'm exaggerating), and then SQL Server might consider the query to be a syntax error.  Performance is a lesser concern at this point.  It sounds like the most likely solution is to use a session-specific temporary table.  I know SQL Server offers both "temporary tables" and "table variables", and I can try it out either way.
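If the worry is purely that the statement might one day exceed some engine limit, one defensive pattern (independent of the temp-table route) is to cap the IN list and issue several bounded queries. A sketch, with a hypothetical chunk size and table name:

```python
def chunked_in_queries(ids, chunk_size=500):
    """Split a long id list into several bounded IN queries.
    The chunk size is a hypothetical limit; pick one well under
    whatever cap the target engine imposes."""
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        placeholders = ",".join("?" for _ in chunk)
        yield f"SELECT * FROM records WHERE item_id IN ({placeholders})", chunk

# 1200 ids become three statements of at most 500 parameters each.
queries = list(chunked_in_queries(list(range(1200)), chunk_size=500))
```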
Okey Dokey
Friday, November 11, 2005
 
 
I'm having trouble visualizing what you mean by "ugly IN" statements. Could you post an example?

Even multi-level INs will often beat a join. The rule of thumb is to think about IN when you are joining with a table, yet no elements from that table are SELECTed.

For instance:

SELECT One.a, Two.b FROM One, Two, Three
  WHERE One.a = Three.a AND Three.b = Two.b

Nothing from Three is returned by the SELECT.
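To make the equivalence concrete, here's that query in both forms, run against tiny made-up tables via sqlite3 -- the join form as posted, and a rewrite where Three moves into a correlated IN subquery since it never appears in the SELECT list:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE One   (a INTEGER);
    CREATE TABLE Two   (b INTEGER);
    CREATE TABLE Three (a INTEGER, b INTEGER);
    INSERT INTO One   VALUES (1), (2);
    INSERT INTO Two   VALUES (7), (8);
    INSERT INTO Three VALUES (1, 7), (2, 9);
""")

# The join form from the post: Three only links One to Two.
join_form = conn.execute("""
    SELECT One.a, Two.b FROM One, Two, Three
    WHERE One.a = Three.a AND Three.b = Two.b
""").fetchall()

# Equivalent IN form: Three disappears into a correlated subquery.
in_form = conn.execute("""
    SELECT One.a, Two.b FROM One, Two
    WHERE One.a IN (SELECT Three.a FROM Three WHERE Three.b = Two.b)
""").fetchall()
```

Both return the single pair that Three actually links.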
sometimes unprepared
Friday, November 11, 2005
 
 
We go the table route -- but it's *not* a temporary table. For a lot of our pick lists like this, the users want to be able to save the lists to use again when they rerun the report, or to have multiple different lists to run multiple instances of the report.

Basically, the dialog that allows the user to pick adds/deletes entries in our table, and we "tag" the entries so that the user can recall the list at any time, modify it, save it, copy it, etc.

In essence, our lists are first class entities in the database.

Then, for all reporting (Typically where these types of lists would be shoved into a huge "IN" clause) we give the user the option to run it up against one of these lists. For example, run the Receivables report for *these* (ListA) customers, but run the Profitability report for *those* (ListB) customers.

So, the lists are stored in a table, the user updates them "real time" -- just like any other screen that's bound to the database -- and then we run queries JOINed to these lists, so there's no need for the long "IN" statement.

==>I'm worried about having a permanent table that stores a user's selections, because that table could end up containing a lot of records for users that no longer use the application.

This is really a bogus argument for most apps, unless you've got hundreds of thousands of users. Most users will have a couple of dozen saved lists, and each of these lists might have a couple of dozen items. If you're using a database that can't handle that, then you've got other issues you're going to be dealing with.
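A minimal sketch of the saved-lists-as-first-class-entities idea, again via sqlite3 with an entirely hypothetical schema (table and column names invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: saved pick lists are ordinary rows a user can recall.
conn.executescript("""
    CREATE TABLE pick_lists      (id INTEGER PRIMARY KEY, user_id INTEGER, name TEXT);
    CREATE TABLE pick_list_items (list_id INTEGER, customer_id INTEGER);
    CREATE TABLE receivables     (customer_id INTEGER, amount REAL);

    INSERT INTO pick_lists VALUES (1, 42, 'ListA');
    INSERT INTO pick_list_items VALUES (1, 100), (1, 200);
    INSERT INTO receivables VALUES (100, 5.0), (200, 7.5), (300, 9.0);
""")

# The report JOINs to the saved list -- no IN clause, any list length.
total = conn.execute("""
    SELECT SUM(r.amount) FROM receivables r
    JOIN pick_list_items i ON i.customer_id = r.customer_id
    WHERE i.list_id = ?
""", (1,)).fetchall()
```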
Sgt.Sausage
Friday, November 11, 2005
 
 
Man, this is just like yesterday's Daily WTF.

Short form: if your application needs long IN clauses, you need to redesign your application.
Iago
Saturday, November 12, 2005
 
 
I did some tests a few months ago.  On SQL Server, I used a table variable, which I'd suggest is a better option than a temp table.  It was much faster - to create the table variable, populate it, and join it to the target table - than to use a massive "IN" clause.  The difference in speed was at least one order of magnitude, two if my memory serves correctly.  From memory I tested with 1000, 10,000 and 100,000 ids in the temp table.  The first two were fine (surprisingly good in fact) but I wouldn't recommend the last one ;-)  I terminated it before it completed so I can't tell you how long it would have taken.
John Rusk
Sunday, November 13, 2005
 
 
PS only a few of the IDs in my temp table were actually valid - the rest were just random data to make up the numbers for test purposes.  So, your mileage may vary in a realistic situation.  And, as another poster said, you might be able to sidestep the whole issue with another design.
John Rusk
Monday, November 14, 2005
 
 

This topic is archived. No further replies will be accepted.

 
Powered by FogBugz