The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

database indexing strategies

I'm working for a client who periodically replicates live data (SQL Server) into a Reports database instance, and then allows both ad-hoc and scheduled reports to be run against the Reports database. It seems that the same indexes and what-have-you exist in both instances, although obviously the volumes on the Reporting instance are much larger.

Are there any specific strategies for getting the best performance out of the Reporting database - wrt indexes, for example - where we're just retrieving data? At what point should we be considering data warehousing approaches (if at all - there are no truly ad-hoc queries since these are typically supported by the apps on the live system).

Thanks in advance for any comments
Rich
poorhouse
Wednesday, February 23, 2005
 
 
Probably the best strategy is to profile the database and see where the performance problems are, if any.

There is no magic bullet here, no strategy that applies in all cases. You simply have to analyse your data and your SQL requests and add indexes where they will have a positive impact.
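In SQL Server, a cheap way to start that profiling is to turn on I/O and timing statistics for a representative report query, check for table scans, and re-measure after adding a candidate index. A rough sketch - the table, columns, and index names here are invented for illustration:

    -- Report I/O and timing figures for each statement in this session
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    -- Run a representative report query and note the logical reads
    SELECT CustomerID, SUM(Amount) AS Total
    FROM dbo.Orders
    WHERE OrderDate >= '20050101'
    GROUP BY CustomerID;

    -- If the plan shows a table scan, try a candidate index and re-measure
    CREATE INDEX IX_Orders_OrderDate ON dbo.Orders (OrderDate);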
88EA0708-3036-403D-A0B7-AD6592EA62DC
Wednesday, February 23, 2005
 
 
I suspected as much... (rolls up sleeves)

GUIDs as PKs; now there's a strategy for slowing things down :)

Rich
poorhouse
Wednesday, February 23, 2005
 
 
The tradeoff for indexes is the overhead of updating the indexes on inserts, updates, and deletes versus the speed provided for reading the data. On a transaction intensive database you want to minimize indexes to maximize the number of transactions that can be performed. On a reporting database you want to maximize indexes to make reads, and therefore reports, as fast as possible.

If the updates to the report database are done during off-hours, then there is no reason to limit your indexes, assuming the off-hours window is long enough to do the update. If the update is not done during off-hours, then you will need to test the performance limits imposed by the indexes.

If there are no constraints on the report database inputs, I would profile the queries to see that each one uses indexes rather than table scans.
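One way to do that check in SQL Server is to ask for the plan text instead of executing the query and look for seeks rather than scans. A rough sketch, with an invented query and table:

    -- Show the query plan instead of executing the statement
    SET SHOWPLAN_TEXT ON;
    GO

    SELECT OrderID, Amount
    FROM dbo.Orders
    WHERE CustomerID = 42;
    GO

    SET SHOWPLAN_TEXT OFF;
    GO
    -- In the output, "Index Seek" is good; "Table Scan" suggests a missing index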

As far as a true data warehouse goes, you need a thorough understanding of the concepts to make it worth the effort. That will only come from experience and education.
Phil C
Wednesday, February 23, 2005
 
 
"On a transaction intensive database you want to minimize indexes to maximize the number of transactions that can be performed."

The cost of indexes to updates/inserts/deletes is almost always GROSSLY overestimated. Secondly, the benefit of indexes to transaction intensive databases is grossly underestimated.

Extraordinarily few systems REALLY justify minimal indexing.

Wednesday, February 23, 2005
 
 
I disagree. Index pages need to be locked and written to the log when they are updated. They are typically more difficult to update than data pages because of the complications in rolling back. Rolling back an index page that has been split, or two pages that have been merged, is among the most problematic low-level operations a database engine has to perform. It takes time and resources. It may not show up in simple applications, but when you have multiple users using the db it becomes important. Since they are already using a reporting database I would say they are resource constrained and the overhead of index updating is important.
Phil C
Wednesday, February 23, 2005
 
 
Re-read my last post - I didn't deny that indexes can have an adverse impact on some operations; my point was that in most OLTP systems indexes are absurdly, almost comically, avoided without any real empirical basis, on the strength of some vague handwaving and abstract talk about the theoretical cost to an update statement.

Most OLTP systems, despite having a heavy insert load, still have a vastly heavier read load. Furthermore, a lack of indexes can seriously hinder update operations, since an UPDATE or DELETE still has to locate the rows it affects.
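A minimal illustration of that last point, with invented table and column names:

    -- Without an index on CustomerID, this UPDATE has to scan all of
    -- Orders just to locate the affected rows; with it, it can seek.
    CREATE INDEX IX_Orders_CustomerID ON dbo.Orders (CustomerID);

    UPDATE dbo.Orders
    SET    OrderStatus = 'OnHold'
    WHERE  CustomerID = 42;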

Wednesday, February 23, 2005
 
 
Re-read all of our posts. Which represent a reasonable discussion and which are GROSSLY, REALLY, comically, almost absurdly, vague hand waving and abstract talk?
Phil C
Wednesday, February 23, 2005
 
 
I made a general statement about indexes and OLTP systems. You replied with "I disagree...". I replied with a general statement about why people avoid indexes.

Seems fairly clear.

Wednesday, February 23, 2005
 
 
"Are there any specific strategies for getting the best performance out of the Reporting database - wrt indexes, for example - where we're just retrieving data? At what point should we be considering data warehousing approaches (if at all - there are no truly ad-hoc queries since these are typically supprted by the apps on the live system)."

Here's a relatively simple technique:

1) Choose a report that's running slowly, or that you'd like to run faster.

2) Find the SQL that creates the data that that report wants.

3) Make a copy of that SQL.  Change the SQL to output to a new table.

4) Run this query.  You now have a new table of everything that the report wanted.

5) Create a second copy of the report which does a SELECT * from this new table.

6) Your report (the second copy) should run at LIGHTNING SPEED NOW.

7) If the report has user-selectable criteria, such as 'FieldX' = <User Selectable Data>, then simply remove those criteria from the SQL in #3.  In other words, when you create the new table, INCLUDE ALL POSSIBLE CRITERIA.

8) If there are user-selectable criteria, simply add these back to the (new copy) report's SELECT * statement.  So instead of SELECT * FROM <NewTable>, you have SELECT * FROM <NewTable> WHERE 'FieldX' = <User Selectable Data>.

9) You now can place an INDEX on FieldX in the new table, but should only do so IF the report runs too slow for your tastes.  You could probably speed up a report which now takes 3 seconds so that it runs in under .1 second this way, but first test without any indexes, because it'll probably be plenty fast without them!  (A rough SQL sketch of these steps appears below.)

10) This advice would normally cost you $1000-$3000 for my time as a performance consultant!  But hey, I like you and I like this forum, so it's FREE, for today only!!!
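For what it's worth, steps 3 through 9 might look roughly like this in SQL Server syntax - the report query, tables, and the Region filter are invented stand-ins, not anyone's real schema:

    -- Steps 3-4: materialise the report's result set into a new table,
    -- with the user-selectable criteria left OUT of the WHERE clause
    SELECT o.CustomerID, c.Region, SUM(o.Amount) AS Total
    INTO   dbo.ReportSnapshot
    FROM   dbo.Orders o
    JOIN   dbo.Customers c ON c.CustomerID = o.CustomerID
    GROUP  BY o.CustomerID, c.Region;

    -- Steps 5 and 8: the report now just filters the pre-built table
    SELECT * FROM dbo.ReportSnapshot WHERE Region = 'North';  -- user-selected value

    -- Step 9: only if the filtered SELECT is still too slow
    CREATE INDEX IX_ReportSnapshot_Region ON dbo.ReportSnapshot (Region);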

On a serious note though, good luck!!!

"Almost all [performance!] programming can be
viewed as an exercise in caching" - Terje Mathiesen

Peter
Peter Sherman
Thursday, February 24, 2005
 
 
Thanks peeps, especially Peter.

This stuff is relatively new to me, but it looks like I've achieved a tenfold improvement in performance by tweaking the existing SQL (and I thought _I_ was an 'average' coder) and doing away with just the kind of intermediate table that Peter talked about. Ideally I would have kept it, but the tech problems were exacerbated by the usual permie passive/aggressive obstruction issues and in the end it was easier to tear them up.

Just my experience - reflecting on the 'what is wrong with programming?' thread where posters have been suggesting that most problems are people problems - poor skills and/or poor attitude.

Rich
poorhouse
Friday, February 25, 2005
 
 
Peter, ROFL.
anomalous
Friday, February 25, 2005
 
 
The one big disadvantage to the "intermediary table" I mentioned is that you have to repopulate it every time.  You could create a view, although performance with a view varies from DBMS to DBMS.

The big advantage to an "intermediary table" is that it contains all of the work that a SQL query performed, including JOIN conditions, aggregate fields, etc.  So it doesn't require the DBMS to do any work (which is key to performance) -- other than filtering out the different types of user-selectable criteria (if there are any), which can be optimized with the proper indices.

Is it the right approach?  It varies.  If you were able to tweak your SQL and not need the "intermediary table", then so much the better (the report data will be live every time -- which might be a big plus), but don't discard the technique entirely.  I've seen *BIG* data warehouses where the technique is used to deliver the data in a timely fashion.  But if you've managed to tweak your SQL, then hey, maybe that'll work better -- since you get live report data!
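As a point of comparison, the plain-view alternative mentioned above would look something like this (using the same invented report query as the earlier sketch) - the data stays live, but the joins and aggregation are re-run every time the report executes:

    CREATE VIEW dbo.ReportData AS
    SELECT o.CustomerID, c.Region, SUM(o.Amount) AS Total
    FROM   dbo.Orders o
    JOIN   dbo.Customers c ON c.CustomerID = o.CustomerID
    GROUP  BY o.CustomerID, c.Region;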
Peter Sherman
Friday, February 25, 2005
 
 
If you are going to go with the intermediary table idea, think about Indexed Views (SQL Server), in other words materialized views. That way, you don't have to run a periodic job to populate the table: the view (and its indexes on disk) will be updated when the base data changes, and it will perform better than a normal view.

People actually charge for advice like this :-)
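A minimal sketch of an indexed view in SQL Server, with invented table and column names. Note the restrictions: the view must be schema-bound, use two-part table names, include COUNT_BIG(*) when it has a GROUP BY, and the aggregated column is assumed here to be NOT NULL.

    CREATE VIEW dbo.SalesByProduct
    WITH SCHEMABINDING
    AS
    SELECT ProductID,
           SUM(Amount)  AS TotalAmount,   -- Amount assumed NOT NULL
           COUNT_BIG(*) AS RowCnt         -- required alongside GROUP BY
    FROM dbo.Sales
    GROUP BY ProductID;
    GO

    -- The unique clustered index is what materialises the view on disk
    CREATE UNIQUE CLUSTERED INDEX IX_SalesByProduct
        ON dbo.SalesByProduct (ProductID);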
anomalous
Friday, February 25, 2005
 
 
There is a good article on indexed views at
http://www.sqlteam.com/item.asp?ItemID=1015
Phil C
Saturday, February 26, 2005
 
 
The intermediary tables can generally be used automatically, if you have the right RDBMS. In Oracle they're called "materialized views", and the optimizer can invoke "query rewrite" to redirect queries from the large fact tables to the materialized view summary.

It's useful not only for summarizing very large data sets but also for handling complex joins that don't scale well beyond modest data volumes. For example, if you have an accountancy report that joins to a fact table on "general ledger account between x and y and date# between a and b and sale_type in (1,2,3)", then an MV can be a very efficient tool for getting better response times.
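In Oracle that might look roughly like the following; the fact table and columns are invented, and a fast (incremental) refresh would additionally need materialized view logs on the base table, so this sketch uses a complete refresh on demand:

    CREATE MATERIALIZED VIEW sales_summary_mv
      BUILD IMMEDIATE
      REFRESH COMPLETE ON DEMAND
      ENABLE QUERY REWRITE
    AS
    SELECT gl_account,
           sale_type,
           TRUNC(sale_date) AS sale_day,
           SUM(amount)      AS total_amount
    FROM   sales
    GROUP  BY gl_account, sale_type, TRUNC(sale_date);

    -- Query rewrite also has to be enabled for the session or instance
    ALTER SESSION SET query_rewrite_enabled = TRUE;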
David Aldridge
Saturday, February 26, 2005
 
 

This topic is archived. No further replies will be accepted.
