A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.
In general when writing a multi-table SQL SELECT statement, is it best practice (or best performance) to reference the tables (in a FROM clause) in order from the table with the most rows to the table with the least rows, or vice versa? It would be interesting to get opinion on the best approach to use with different DB servers.
Thursday, September 22, 2005
My experience with select statements that join tables across different servers, is that you need to experiment. It is hard to come up with empirical rules when a select statement takes 45 minutes to run one way, and several hours the other way.
By "one way or the other" I mean something like:
serverA.db1.dbo.tableX join serverB.db2.dbo.tableY
serverB.db2.dbo.tableY join serverA.db1.dbo.tableX
Thursday, September 22, 2005
There's an article about MySQL that elaborates about joins:
I'd guess other databases might do it similarly
As far as I know there is no best "general practice" for this stuff. It all depends on what critera are set in the where clause, percentage of records in tables that will end up satisfy where criteria, what indexes are defined, and more.
There is an excellent book, _SQL Tuning_ that is devoted to setting out a (manual) diagram-based method for tuning sql statements, which helps determine optimal join order: http://www.amazon.com/exec/obidos/tg/detail/-/0596005733/qid=1127405560/sr=8-1/ref=pd_bbs_1/002-8156232-3043234?v=glance&s=books&n=507846
In general, though, best practice I know of is just to write something that works and don't fiddle with it unless it's too slow. And in most cases (but not all) where it's too slow you will probably improve performance more not by changing join order but by defining proper indexes.
I try to keep my queries fairly simple and haven't ever actual used the query optimization method described in _SQL Tuning_. I expect it's usefulness grows as queries involve more and more joins. For queries involving just a few joins you can probably get what you want easily by trial and error experimentation. Listing table with most rows first is probably not generally a route to best optimized query, but then again it's probably not a harmful practice to standardize on either.
Your DBMS' optimizer should handle ordering for you in the way that makes sense for it. Try to ensure that your joins can use indexes, or add indexes for new usage patterns.
For SQL Server, the recommendation used to be to load up the WHERE clause to give the query engine more choices for linking the tables so it could evaluate as many indexes as possible for use in the actual query execution. (this is from Sybase v11; YMMV on current versions)
Avoid at almost any cost linking on unindexed columns. I recently had to debug some dynamically created SQL that was doing a join on an unindexed table. In this situation, MS SQL Server scans the target table, creates an ordered (based on the join column) copy table, then scans the ordered copy table for qualifying rows. Not a big deal in normal situations for this program-the unindexed table has <10 rows normally and the primary data is <500 rows. The problem came when a client ran it against 140000 rows-runtime went to 20+ hours.
Adding an index on the problem table cut that in half. Resolving the SELECT values & using the resolved list in the submitted SQL (eliminating the join) cut runtime by 80%.
That sounded like a good experiment, so I took 2 MySQL tables in the same database and tried it out.
Table1 had 465000+ records
Table2 had 300 records
SELECT Table1.dateField,Table2.fullName FROM Table1 INNER JOIN Table2 ON Table1.record_id=Table2.nID WHERE Table1.dateField>'2005-01-01 00:00:00'
then I did...
SELECT Table1.dateField,Table2.fullName FROM Table2 INNER JOIN Table1 ON Table2.nID=Table1.record_id WHERE Table1.dateField>'2005-01-01 00:00:00'
Without a clock to judge by, however, they both seemed to take the same amount of time.
If you are trying to do this cross-server or between 2 databases, this probably doesn't help, but it was fun for me to try out, I'd never thought of trying to change my search speeds by reversing orders of joins.
Thursday, September 22, 2005
"Without a clock to judge by, however, they both seemed to take the same amount of time. . . . I'd never thought of trying to change my search speeds by reversing orders of joins."
The actual join order used internally by the db to run the query is hugely important, but most db's query optimizers will correctly reorder joins in simple queries to optimal order regardless of how you write them. So your joins may well have been processed in same order even when you changed the order in your SQL.
The book I mentioned in post above (_SQL Tuning_) talks about different ways of writing your query that (a) make it more likely that your db's optimizer will be able to reorder to the order you've manually determined is optimal, and (b) make it less likely that your db's optimizer will try to reoder to an order that you have determined is not optimal.
The author of the book says that his techniques of influencing query optimization prove especially useful as the complexity of the query grows, since in these cases his technique enables you to make a manual determination of proper join order that is likely to be better than the choice the db comes up with in its own cost-based optimization. He also talks specifically about Oracle optimization and possible selective use of Oracle's cost-based and the deprecated rule-based optimizers. Apparently one trick with the Oracle RBO is to list the tables in the from clause in exactly the reverse order of how you would prefer the joins to be ordered.
I assume there is plenty of db-specific web-based or book-based documentation on ways to influence query optimization in SQL Server, Oracle, and other popular db's.
OP, I never seen an RDBMS that would respect your order of tables by default. The optimizer will arrange them its way, unless you use an explicit hint to force it not to. And if you're asking questions like this, I would assume you're not yet competent enought to fiddle with an optimizer.
Friday, September 23, 2005
Under the old oracle RBO, it evaluated from the bottum up the tables in the from clause. The current optimizer (CBO) doesn't care (Assuming we are talking about the same level of sql statement, not inline views and scalar subqueries, for example).
This is going to, by definition, be completely database dependent. But also completely and easily testable.
As a side note, the order of evaluation of clauses, which is helpful to know, is (typically):
This topic is archived. No further replies will be accepted.Other recent topics
Powered by FogBugz