The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Clustered indices on SQL tables

How do you decide which index should be clustered? And why? I've seen 2 basically contradictory suggestions (sometimes in the same article)

1) put the index on the primary key which seems completely pointless if you've got an artificial identity PK...

2) put the index on something where there are some duplicates (eg you might put it on the area code of a telephone number)

Another Contractor...
Thursday, December 29, 2005
Putting the clustered index on the generated identity column is useful if the application makes queries based on it: "SELECT col1, col2, col3 FROM tbl_Customer WHERE CustomerId = 33". 

It's also useful if your queries contain a lot of joins on the identity column.
example Send private email
Thursday, December 29, 2005
I'm going with your #2.

I write a lot of systems for the insurance & claims processing industry. Typically, there's an Employer Group and a number of Employees. Of course, we index Employee on the EmployeeID, but we also have a clustered index on the GroupID -- such that all Employees for a given Group are clustered on the physical disk. There are a huge number of queries against the database that start out with "Show me all Employees for this Group that ...<some query criteria>"

Testing early on showed huge performance benefits on a heavily loaded system when we clustered by the GroupID for Employees. Quickly accessing all related employees (related by GroupID, clustered on GroupID) is a lot easier for the DB engine when it's clustered, rather than having to skip randomly around the disk to retrieve them all.

Similary, the Employer Group is clustered by State, again, because there's a large number of queries that start with, for example "For all Groups in Ohio, show me ... <some criteria>"

This will depend on the types and frequency of the queries that you run.

You are correct on #1, though -- it serves absolutely no purpose.
Thursday, December 29, 2005
It's not - it's always "select blah where CustomerName = whatever"...
Another Contractor...
Thursday, December 29, 2005
Clustered indices may have a negative effect on performance if you have regular queries which cross those cluster boundaries.  For instance, in Sgt. Sausage's example if there were queries made on the same set of data on some foreign key that had nothing to do with Group ID or Employee ID (getting their ethnicity, age or gende perhapsr) and they weren't referenced then you may end up with a poor record selection order.

So the moral is if records participate in clustering all queries must include that index, even if it doesn't matter.
Simon Lucy Send private email
Thursday, December 29, 2005
>>  It's not - it's always "select blah where CustomerName = whatever"... <<

I think you skipped over my "if" condition on that statement.

Besides, putting a clustered index on a character field?  Probably not good.  On a hashed version of the string... maybe.

Like Sgt Sausage & Simon said - It depends on the most commonly used access pattern by the application.  So know your users and how they interact with the system in order to optimize performance when choosing a column or columns to cluster by.
example Send private email
Thursday, December 29, 2005
I think you mean MS SQL Server Tables.  SQL stands for Structured Query Language and is a language not a product.  You are talking about a specific product.  Please don't help MS co-opt an industry standard TLA with a specific product by a vendor.
Jim Send private email
Thursday, December 29, 2005
It doesn't appear the poster was pandering to M$, just sloppy.  There are plenty of people out in the Real World and in these forums who refer to any SQL-using DBMS as 'the SQL server' not meaning 'the SQL Server' (your choice, M$ or Sybase). 

Besides, the discussion of where or when to use clustered indexes isn't M$ SQL Server-specific.  It's generated heated discussion at my current employer (M$ SQL Server) and prior employers (Sybase SQL Server, Oracle, DB2).
a former big-fiver Send private email
Thursday, December 29, 2005
(didn't skip over the if...didn't want to start on my other rant about people who insist on artificial autonum primary keys just to avoid having text in a field so that the only reason you have joins on the clustered index PK is because you've unnecessarily split the table in the first place...that's another one)...

Why do you think that a clustered index on a string is bad? I don't know enought about the internals of (MS) SQL Server to know how it would handle this? I may be out of date - this came from SQl Server 6.5 - but I think it uses a B-tree to find the first page on disc that contains the records that you want - then it can just read the pages sequentially until it stops being the records that you want - so it's just a question of how that is set up so that you can quickly find the record that you want from a string? 

Another Contractor....
Friday, December 30, 2005
Thanks for the reply.

Yes, it is an implementation detail of (MS) SQL Server.  By using a clustered index you're storing the key in the index pages (at least as recently as v2000, I haven't done much with v2005 yet).  If you use autonum values as your key, those are only 4 bytes, and so you can get more of them in an index page, reducing the number of pages that get cached in memory.  If you use a string as your key for a clustered index, the degenerate case is that you can only fit one in an index page because you went over 1/2 page capacity.  Even if you don't have the degenerate case, during inserts you're much more likely to need to split the page just because it's so much bigger than an int.  Using Unicode values (NCHAR, NVARCHAR) doubles your space problems

Another strike against them (which isn't so bad if you have an alternate key on them anyway...) is that you must do collation-specific string comparisons during inserts to determine where they go in the index structure.  Even assuming you've managed to fit 5 values in an index page, during index scans means you need to do an average of 3 comparisons to find which link to follow to a data (or another index!) page.  Comparing ints is obviously much faster.  :-)
example Send private email
Friday, December 30, 2005
Cheers! That makes a lot of sense...I wonder whether the problems that you've said would cancel out the benefits of clustering in the following case (simple example)

You've got a table of Employees (name, dob, departmentID), and you always want to find the employees in a given department. For whatever reason DepartmentID is an NVARCHAR(20). As I see it there are 2 options:

1) Keep the table as it is, cluster on DepartmentID which is what you are already searching on - but then the issues of string clustered keys come up.

2) Create a new table with numeric artificial deptIDNumbers, and use those in your Employee table as the clustered index. You gain on the clustered index being numeric - but lose on always having to do a join to search on DeptID (string)...

The obvious answer would be to try it and profile it and see what happens  (or potentially use an indexed view for case 2?), but I'd be interested to know what people would expect to be better! And what would people cluster on if the unique key was (say) first/middle/last name and that was what you were always searching on - but you also had an identity PK?
Another Contractor....
Friday, December 30, 2005
Am I right in thinking that what MS calls a "clustered index" is known in Oracle as an "Index-organized table"?
David Aldridge Send private email
Friday, December 30, 2005
That's likely.
What they call a "clustered index" is like a phone directory.
What they call "non-clustered indices" are like an end-of-book index.
Friday, December 30, 2005
From my perspective, you put the clustered index on a 'range' of data (whichever one you use the most).  You put the nonclustered index(es) on 'individual item' information.

Clusted Indexes are designed to pull ranges of data, like a between clause.  A clustered index physically sorts the data on the disk, and the last 'spot' on the index is the actual record on the disk.  The database (talking M$ Sql Server here) 'reads' a page of data off the disk, pulls what it wants of the page, then pulls the next page off the disk.  You want to optimize your indexes to read the minimum number of pages off the disk for your most popular or most expensive queries (and sometimes there is a tradeoff to be made there, becuase they are not always the same!). 

non-clustered indexes are a way creating pointers for easy access to individual records.  Its somewhat akin to a binary search tree.  A non-clustered index is sorted, but is a separate tree from the data on the disk, and the last 'spot' on the non-clustered index is a pointer to the actual location of the record, either on the disk, or on the clustered index (depending on your DBMS).

For example, lets assume you have an Employee table with EmployeeID, FirstName, Lastname, and HireDate.

EmployeeID is a surrogate key (read: IDENTITY column), and primary key.  Assuming for the sake of argument that LastName and FirstName are natural keys, so you put a unique constraint/index on them.

So, if most of your queries are 'Show me everybody that was hired in 2005 (e.g. between Jan 1 2005 and Dec 13, 2005)', it makes sense to cluster your data on hiredate, because that would mean minium page reads to get all the data you need.  Having the clusted index on EmployeeID could mean you read every page on the disk, because the information is scattered everywhere.  So you are essentially performing a table scan.

If most of your queries are 'Show me the employee with ID 8675309', a clustered index doesn't buy you anything, because its going to only read 1 page from the disk.  But if your query is 'show me all the employees with IDs between 1000 and 2000', a clustered index on EmployeeID makes sense.

A final note, a clustered index is your biggest performer.  A non-clustered index will always be slower than a clustered one, simply because a non-clused index has to tkae the extra step of hopping from the pointer to the actual record, while a clustered index hits the record directly.  If you don't know where your biggest cost is going to come from, leave out the clustered index initially.  Come back when you have more data.  Not having a clustered index at all doesn't hurt anything.  But having a clustered index in your back pocket if you need it can save you a bundle.
another anonymous coward
Friday, January 20, 2006

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
Powered by FogBugz