The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Sizing Databases

Hey all,

I was wondering how you planned database size.  In our environment it's always been an intuitive process (take your best guess and triple the space).  Microsoft provides a calculation for database space based on how large a row will be, factored in with how many rows will fit in a database page, and then from there calculating how many rows will be planned for the database. This calculation is somewhat viable, but it doesn't take into account indexes, fill factors and all the other moving targets associated with databases.

I'm also interested in how things like Fogbugz were designed to scale from the database size of things; because each client is different, there will be a different level of space used for each bug tracking database.

Thanks much -
David Seruyange Send private email
Monday, October 11, 2004
The best way I know to estimate database size is to build the tables, define the indexes, and load the data.  8-)

I start by picking out the few tables that will stand out from the rest as large.  For each, CREATE TABLE and load some sort of data; even if it is garbage data, just make reasonable guesses about the values.  Then extrapolate from there.  And usually I want at least enough headroom for a second copy of the largest table to be kept within the database.

One thing I learned the hard way: when you compare the price of disk storage to the cost of dba time to micro-manage space, it really doesn't pay to try to work without plenty of elbow room.  I insist on enough extra disk space to restore a second copy of my largest database without impacting anything.  This allows me to respond quickly to human error problems like forgetting half the WHERE clause on an UPDATE or DELETE.
Monday, October 11, 2004
Out of curiosity RH, have you ever used Triggers to limit the # of rows that people can delete?

I've often wondered about this because I know it's technically possible, but my OVERHEAD-o-meter seems to kick into paranoia mode.
David Seruyange Send private email
Monday, October 11, 2004
RH makes a good point regarding disk space costs vs. DBA time costs, but don't forget that backup systems can get very expensive once you get over about 100Gb. To be fair though, not many database get that big.

In general, you can work it all out, even the indexes, etc, if you have a good idea of the kind of data you will be dealing with (cardinality, length-variability, etc).

There is no point in considering the little tables, just concentrate on the biggest ones and ignore the others (they will fit into your margin of error, whcih you add at the end anyway).

For example, I did an exercise with a telecoms carrier billing database. This had to main tables, the raw call data from the switch and the processed rows, as well as a couple of hundred other tables. Given the main two tables were expected to have one billion rows in quite quickly (months), I ignored the other tables, even if they had a million rows in they wouldn't have much impact.

You can build the main tables, with indexes, foreign keys, etc and populate them with representative (i.e. not random) data. Then just see how much space it takes. In my case, I put 10 million rows of real data into the tables on a test server. Then it is a reasonable assumption that the data size will scale in a fairly linear fashion (if representative), so multiply by 100 (in my case) and add a comfort factor.

If you have the budget, get all the disk arrays now for the expected load. This is so much easier to manage than starting small and adding discs later. Any DBA can do it, but if you don't have to do it, why bother.
Tuesday, October 12, 2004
I've found that you need to worry about tables where the size of a row is large relative to a page. For example, in MS SQL Server with 8K pages, a table with large 3K rows will only fit 3 in a page (there is some page overhead) giving you a 33% increase in space over what you might otherwise calculate. If you rows are less than 1/2K, you probably won't notice much effect.

Remember that as systems evolve, they grow. So you may have rows that neatly pack into pages now, but space will explode down the road after someone adds a few more columns.

Fragmentation will always be an issue, both for actual data storage and indexes, free space, etc. Also, all of those VARCHARs will throw off calculations. So don't try to calculate things too closely, and allow extra space.
Dave Lathrop Send private email
Monday, October 18, 2004
I just wanted to add a small thought to the issue of how much size to allocate to you DB.  Having spent all of my programming carrer working for the end users adding on and tweeking database applications for one use or anouther is that all database engines have different overhead for the actual use of the data base and every one that I have work with experience an increase in delay for the actual database engine the closer a table gets to capacity and the database as a whole gets to its allocated limit the longer it takes to add,delete and change a row.  So I would recomend setting a safe empty cusion for operational reasons on top of the actual space you feel it will use.
  To be honest you are never going to be sure but if you take your top say 10 to 15 tables for a large robust system that are about 25% of your total row width.  Build those with a percentage of what you feel the volume will be in a given time frame. Remember they will grow until the max retention time before you start archiving off. So to start buy yourself a year or 2 and then plan to increment it up till your decided archive time.  But back to the calculation.  If 25% of my row width for 25% of my expected records takes X space then a safe bet is 16 * X for the space plus say an generic 25% added for operation efficientcy so 20 * X.  This should get alot of you transactional tables and not your static configuration and information talbes so yes your number will be high but of course the chance that the real world will not suprise you with more rows than you estimated is low to.
    And for goodness sakes if your making the product plan for and put inthe manual to do repaking of the database on a rugular basis and posible even provide the tool so the DBA wont forget that its needed.  The more deleted rows you expect the more often to repack.  Sorry just seen to many forget this and wonder at performance.  I have watched a process scream along till it got near the table limit to suddenly take 90% of the total run time for the final 10% of the records becasue someone set the table limit to those to be added plus 1.
Douglas Radenbaugh Send private email
Friday, October 22, 2004

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
Powered by FogBugz