A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.
I am thinking about developing a database application. I aim for 1000-10000 users, possibly more. All users add about 30 records each day. This information is only for the particular user, user A never needs the data that user B entered (think of a personal log). They can also modify their own records. There will be lots of reads and writes for this data, performance is very important.
To me, the most obvious solution seems to be to make a new table for every user. However, this goes against everything I have read about database design, so I would like to hear whether you think this is a dumb approach in this case, and what would be better. I think the approach of one user table and one journal table with a user-id foreign key gets large very quickly. I would be interested to learn how other applications handle this.
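For reference, the normalized design under discussion (one user table plus one journal table keyed by user id) can be sketched like this. SQLite is used purely for illustration, and all table and column names here are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    );
    CREATE TABLE journal (
        entry_id   INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL REFERENCES users(user_id),
        entered_at TEXT NOT NULL,
        body       TEXT NOT NULL
    );
    -- this index is what keeps per-user reads cheap as the table grows
    CREATE INDEX idx_journal_user ON journal(user_id);
""")

conn.execute("INSERT INTO users VALUES (1, 'alice')")
conn.execute(
    "INSERT INTO journal (user_id, entered_at, body) "
    "VALUES (1, '2005-04-30', 'first entry')"
)

# every query is scoped to one user, so user A never sees user B's rows
rows = conn.execute(
    "SELECT body FROM journal WHERE user_id = ?", (1,)
).fetchall()
```

One journal table serves every user; the `WHERE user_id = ?` predicate does the per-user separation that a table-per-user scheme would do with thousands of tables.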
Saturday, April 30, 2005
1) Why such a large margin of error for the number of users?
2) What back-end are you going to use?
3) Is there any reason why users can't just store their information locally if no-one else needs to look at it?
Whatever happens, though, creating a new table for every user is just plain dumb, especially if you anticipate 10,000 users. Depending on the back-end, how much data is involved (what does "lots of reads and writes" mean?), and what you mean by "performance is very important" (is a 1-second response time too slow? 5 seconds? 10 milliseconds?), there's no real reason why one user table and one journal table won't work.
However, unless there's a specific reason why all this data has to go into a central database, you're better off keeping it locally; if needs must (backups, etc.), run a batch job to roll updates into a master database on a regular basis, or have the app create a temporary local copy on startup and then roll updates back to the master when the app closes, at timed intervals, etc.
Saturday, April 30, 2005
I have to believe that with proper indexes, a small number of tables will perform much better than thousands of tables.
This question gets asked fairly often. If you were using simple files it might be better to have one per user, but a DBMS is very good at, well, managing a data base. It sounds like you're only talking about a few database hits per second, that shouldn't be a problem.
Saturday, April 30, 2005
If you want this project to fly, read a couple of books on database design, at least one oriented toward the DBMS you are going to actually use.
Saturday, April 30, 2005
Here you go:
Database Design for Mere Mortals
( http://www.amazon.com/exec/obidos/ASIN/0201752840/caseysoftware-20?creative=327641&camp=14573&link_code=as1 )
Yes, that is my associate link.
Saturday, April 30, 2005
Thanks for your replies. I actually have read some books on database design (that was a while ago though; it is a good idea to reread them), which is why I 'know' this is not the right approach. I thought that one big table would grow too large too quickly. I now understand that it is not too large yet, but I will do some performance tests to make sure (I realize that database performance is only a small part of what makes the app feel fast for the user). I am still interested in what one would do for a really large application. What do you do when you have 1,000,000 users that add 100 records a day?
The reason that all data has to be on the server is that it is a web based app. That's also why I don't know the amount of users yet. I am exploring the possibilities. I am still not sure about the backend, I am thinking about PostgreSQL.
Sunday, May 01, 2005
The scalability question is worth asking, but the example provides its own answer.
Given that there is a unique index for each row, it is no harder to retrieve a single user's records from 1,000,000 users than from ten. Certainly, if you had to read every record in order to discover the ones required, then 1,000,000 would make a significant difference, but that isn't necessary. With an index on the unique field the DBMS has far less data to search (the index file is very much narrower), and the index structure delivers the pointers to the actual data rapidly, probably in no more than one or two fetches from the index.
Then the data can be returned. If there is a great deal of data, it's more a matter of how that data is throttled to give reasonable performance across the physical network.
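The index seek described above can be observed directly in a query plan. A small sketch, using SQLite just for illustration (the table and index names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE journal (entry_id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)"
)
conn.execute("CREATE INDEX idx_user ON journal(user_id)")

# With the index in place, the planner performs an index search rather than
# a full-table scan; the plan's detail column names the index it will use.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM journal WHERE user_id = 42"
).fetchall()
```

The detail text of the plan row references `idx_user`, confirming that the lookup cost depends on the index depth, not on the total number of rows in the table.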
Sunday, May 01, 2005
"I am still interested what one would do for a really large application. What do you do when you have 1.000.000 users that add 100 records a day?"
Well, you sure as heck wouldn't want a database with a million tables ;^)
For vast amounts of data and throughput you should be thinking RAID arrays and clustering.
Sunday, May 01, 2005
"What do you do when you have 1.000.000 users that add 100 records a day?"
The same thing you'd do if you had one user adding 100,000,000 records a day -- try to forget the "number of users" aspect when designing the database structure. It only becomes important when setting up servers, etc. As a previous poster suggests, for that amount of data you should consider looking into a cluster; that's how Google does it, and I think it's fair to say there's not much they don't know about millions of users querying billions of records and requiring rapid response times...
Sunday, May 01, 2005
"The scaleability question is worth asking but the example provides its own answer. . . . Given that there is a unique index for each row it is no harder to provide a single user's records from 1 million users than it is from ten. "
In theory (almost), but in practice it would definitely be worth confirming this on a test database. Strictly speaking, it isn't true even in theory, since even using an index it's going to take a few more steps to locate records in a 50,000,000-row table holding 1,000,000 users' records than in a 5,000-row table holding 10 users' records. It shouldn't be much more, but even in theory it takes a little more CPU.
Indexes can affect performance significantly in very large tables. For example, your db/hardware combination may be able to insert 10k records/sec in a table without any indexes, whether that table has 100 or 100,000,000 records. But once you add an index to the table each of those inserts will have to update the appropriate index, and it can take far longer to update a 100,000,000 record index than a 100 record index.
To test this, all you need to do is create an app that inserts records with random ids (or a range of ids that matches what you expect in your app), start it off on an empty table with an index, and watch its performance over time. I guarantee that your inserts per second will decrease as the table grows. How much will depend on your db, but if you expect to have tables in the many millions of rows and you will be doing lots of inserts and updates, you should definitely check performance with a table size that approximates the largest size you expect it to grow to.
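The timing harness just described might look something like this sketch (SQLite for illustration; the batch size and the id range are arbitrary choices):

```python
import random
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE journal (entry_id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)"
)
conn.execute("CREATE INDEX idx_user ON journal(user_id)")

def insert_batch(n):
    """Insert n rows with random user ids and return the rows-per-second rate."""
    start = time.perf_counter()
    conn.executemany(
        "INSERT INTO journal (user_id, body) VALUES (?, ?)",
        ((random.randrange(10_000), "entry") for _ in range(n)),
    )
    conn.commit()
    return n / (time.perf_counter() - start)

# measure throughput repeatedly as the table (and its index) grows
for step in range(3):
    rate = insert_batch(10_000)
    print(f"after {(step + 1) * 10_000:,} rows: {rate:,.0f} inserts/sec")
```

Run against an on-disk database sized to your projected row count, the trend of the reported rate (not the absolute numbers, which depend entirely on hardware and db) is the thing to watch.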
SQL Select performance will also be slower on extremely large tables. Of course, it's still going to be blazing fast when it's indexed on the criterion field, but the difference could certainly affect scalability: if a single Select takes .01 seconds on an indexed 100-record table and .02 seconds on an indexed 100,000,000-record table, you can serve only half as many queries per second. As always, it's best to test on your own database before you commit one way or the other.
My approach? Pick whichever way you want to do it. As the app progresses you'll learn the advantages and the disadvantages of either method. You can always change (refactor) later once you realize that something works better in a different way.
The single table approach will probably prove easier in the short to mid-term, but without knowing all of the details, I couldn't give you a better answer.
Wishing you luck,
Yes, it is optimizing before there's any need. But it sounds like the projected use is simple enough that he can easily create a larger test table to see how the single-table method is going to work. If so, then it seems to me like this is an example of where it's okay to do premature optimization (assuming there's a certainty of the table growing to the size tested).
There's another possibility for addressing worries about scalability. If each user's data is completely independent of the other users' data, then there isn't even any need for their data to be in the same database. So just set up a webserver and database with the single-table method. Then, if the number of users or records becomes too great, make a copy of the app on another server and give the new users a different web address to go to. That way you scale up by adding new servers. I guess this method would work whether the single-table or multi-table method is used, but it seems like it takes some of the scalability worries away from the single-table method. And since the single-table method is properly normalized and almost certainly simpler to implement, that seems like the way to go.
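One minimal way to sketch that scale-out idea is a deterministic mapping from user to server. The server names below are hypothetical, and the scheme described above (assigning users to a server at signup) works just as well; the only requirement is that each user always lands on the same server, since no query ever spans two users' data:

```python
# Hypothetical shard map: any stable user-id -> server mapping works here,
# because each user's records live entirely on one server.
SERVERS = ["db1.example.com", "db2.example.com", "db3.example.com"]

def server_for(user_id: int) -> str:
    """Route a given user to the same server every time."""
    return SERVERS[user_id % len(SERVERS)]
```

Adding capacity then means adding a server and directing new signups to it, rather than redesigning the schema.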
The one-table-per-user approach does not eliminate the indexing problem; it just moves it. The DBMS will have to search the system catalog to find the table information before it can run the query. I bet the SQL query "compile" time will increase enough to offset the time saved by using small tables. Another possible problem is that you may get locks on the catalog table, blocking everything. Go with one table and use a stored procedure if necessary.
If you have one table and you lock the whole table each time you do an update, it is obviously going to slow things down. Use a database with optimistic locking, or with record-level locking.
Use stored procedures with precompiled SQL so that you're not constantly re-preparing the same SQL.
Do you have to worry about space used? If you have many updates to variable-sized strings, you can drive performance down by fitting fewer records into each storage block, and therefore use more space as well.
Multiple tables is crazy -- there are probably all sorts of things you could do to prevent performance problems, and indexing is the least of them.
How about (in Oracle, at least) creating the table as a hash cluster on user_id? Not only do you get fast index-less access to the user's data, but that data is also physically clustered in very few table data blocks -- don't forget that with a single table and intermittent inserts by each user, you are otherwise going to be reading a heck of a lot of data blocks in order to retrieve the rows. Worst case scenario there is one block per row, which would s-u-c-k.
Monday, May 02, 2005
Is multiple tables really so crazy? OP doesn't provide enough info, but I wonder if his 'one table per user' is really 'one table per account'. The difference would be that in the latter the data for one account is completely independent of any other account's data.
Before you laugh, consider something like Quickbooks online service. I don't know but I'm guessing they service many thousands of accounts, each with identical db structure (and with many tables). Do they throw the data for each account into the same database tables? Or do they create a new set of tables (i.e., a new database) for each account?
I'm guessing that Quickbooks gives each account its own database. If OP is similarly treating each user as a different account, is it really so obvious that the data for each user should be in the same table? Even the same database? Is the normalized way to have all users' data in a single table within the same database, or to put each user's data in a different database? Where do you draw the line between a situation where you put it all in the same database and one where you separate accounts into different databases?
Regarding the several comments that have said performance problems with multiple tables would equal or outweigh those of the single-table method, I'm not so sure. Maybe with a heavyweight like Oracle that's geared toward handling huge tables. But with something like MySQL or PostgreSQL? I'd create some sample databases (it's not hard) and do some testing. Call it preoptimization if you will. If we're talking about how to handle a simple single-table db structure here, I can't imagine it taking more than a couple of hours of work time (though automated generation of sample tables on the computer may take far longer), and I see no reason not to do it.
"Maybe with a heavyweight like Oracle that's geared toward handling huge tables. But with something like MySQL or PostgreSQL?"
MySQL loads a fixed amount of data for each table into memory. If you had 10,000 tables you'd probably exhaust all available memory in the server before you could execute a single query.
You're really going against the natural structure of the DBMS by using multiple tables vs. a single table with an index. They just simply aren't designed to handle it.
Giving each account its own database is a little different still, and might be a good idea depending on usage. Of course you still run into DBMS limits, but not nearly so badly. For something like Quickbooks online I'd be tempted to use something like SQLite to give each client their own database file.
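A rough sketch of that one-SQLite-file-per-account idea (the directory layout, file naming, and schema here are all made up for illustration):

```python
import sqlite3
from pathlib import Path

DATA_DIR = Path("accounts")  # hypothetical storage directory
DATA_DIR.mkdir(exist_ok=True)

def open_account_db(account_id: int) -> sqlite3.Connection:
    """Open (and lazily initialize) one SQLite file per account."""
    conn = sqlite3.connect(DATA_DIR / f"account_{account_id}.db")
    # No user_id column needed: the file itself is the per-account boundary.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS journal (
            entry_id   INTEGER PRIMARY KEY,
            entered_at TEXT NOT NULL,
            body       TEXT NOT NULL
        )
    """)
    return conn

conn = open_account_db(42)
conn.execute("INSERT INTO journal (entered_at, body) VALUES ('2005-05-02', 'hi')")
conn.commit()
```

Isolation between accounts comes for free, and each file stays small; the trade-off, raised later in this thread, is that schema changes must be applied to every file.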
"MySQL loads a fixed amount of data for each table into memory."
Good point. I know some (all?) dbs I've worked with also have buffers for each table. But those are only for each _open_ table, right? You wouldn't leave these tables open in a web app, I assume; you'd just be continuously opening and closing them. In any case I'm sure you're right that there's a large amount of inefficiency in the 'one table for each user' method.
Thanks again. I now realize the one table per user approach is indeed not smart, if anything it should be one database per user, or one database per certain number of users. I will do some simple tests to find out how fast PostgreSQL is in this situation. I also thought indexes are not a smart idea on tables with a lot of inserts/edits, so I guess I should just test it.
I had never heard of some of the things you mention, like hash clusters. I have developed databases before, but never had to do more optimization than creating a simple index. It looks like I have a lot to learn, but I do like that.
I also like Almost Anonymous' idea of making one SQLite db for each user. That sounds refreshingly simple. Do you see any disadvantages to that approach (when compared to the one-large-table approach)?
Monday, May 02, 2005
"I'm guessing that Quickbooks gives each account its own database. If OP is similarly treating each user as a different account, is it really so obvious that the data for each user should be in the same table?"
That is what I wondered. The Quickbooks comparison is a good one, I think (I don't know Quickbooks, but my program is also a program where users log data and want to do things with that data). Each user is indeed a different account, and the data of user A has nothing at all to do with the data of user B. There will absolutely never be any reason to have data of different users in one resultset, in fact it is vital that a user never gets to see records from another userid.
Monday, May 02, 2005
That situation is pretty common in most databases. I have tonnes of data that never appears in the same resultset because it is split between users.
You can think of the index as a means of dividing the data by row. It says that a row belongs to a particular user, and you just select those rows.
And you never know when you'll want all the data in one dataset. I've done a lot of mass conversions and other processing that is much easier with everything in one place.
You have only talked of 1 table.
1 table is a fairly minimal 'relational' DB.
If you are not joining with other tables, then maybe you don't need a relational DB at all. Directly working with a simple file yourself would likely be quicker than going through the generalized layers used to access the DB.
One of the benefits of the DB approach is the way it keeps some of the information in memory for quicker access rather than constantly reloading it.
If you have common queries and joins they can also be tuned to perform better.
Tell us more
Almost Anonymous, you are right of course; there is a possibility that I will want all the data together. But I think that would not be that difficult with a simple script if I give everyone their own database, and those queries will not have to be fast. I agree it would be more difficult if I gave everyone their own table, but I have totally abandoned that idea now.
Gorf: there are more tables, but those are merely lookup tables for the end user. I will edit them from time to time, but not that often. The user only edits the logbook table. I do think I need a database, but I now think something like SQLite might fit the bill.
Thursday, May 05, 2005
Hmmm...10,000 users, each with their own DB.
So when you decide you need to make schema changes to handle, say, new features or bug fixes, you're going to make the changes to 10,000 databases?
True, you could script it out, but now you're scripting the actual business logic change, and the implementation of that change across thousands of instances.
Let's say somewhere down the road you need to add a table, and maybe some DRI between the new table and your original one.
So you script out the addition of the table and the FK constraint, plus the script which creates these across thousands of DBs. Don't forget there may be instances of data that fail the constraint condition, which would make the creation script fail, so your process has to be robust enough to catch the error, report on it, and continue. So you run it, and 100 attempts fail. In the failing, you discover some data conditions you hadn't thought of, and are forced to redesign. So now you've got to script out an undo script for the change you made, apply it to the ones that didn't fail, create a new script, lather, rinse, repeat...
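The catch-report-continue loop described above might be sketched like this (SQLite for illustration; the added table is hypothetical, and a real migration would also need the undo path):

```python
import sqlite3
from pathlib import Path

def migrate(db_path: Path) -> None:
    """Apply one schema change; raises on failure so the caller can record it."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # transaction: the change is all-or-nothing per database
            conn.execute("""
                CREATE TABLE IF NOT EXISTS categories (
                    category_id INTEGER PRIMARY KEY,
                    name        TEXT NOT NULL UNIQUE
                )
            """)
    finally:
        conn.close()

def migrate_all(paths):
    """Run the migration everywhere; catch, record, and continue on errors."""
    failures = []
    for path in paths:
        try:
            migrate(path)
        except sqlite3.Error as exc:
            failures.append((path, str(exc)))
    return failures
```

Even this minimal version has to track which databases succeeded and which failed, which is exactly the bookkeeping burden the post is warning about; with one shared database there is one script and one outcome.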
Now let's talk about testing. Actually, I'll leave that part to your imagination.
Data isolation? SQL 'WHERE' clauses are exceedingly simple to use. Don't be afraid, they don't bite.
One db, one table, one change. Think about maintenance and extensibility. Work it out on paper before you code line one, and do the thought experiments. Look at your setup and think, "with this implementation, what would it take if I had to..."
p.s. In some quaint circles, "premature optimization" is still referred to as "design".
1 Database per user can be a perfectly valid method of working. But you have to balance the system, for instance load & performance against data security & maintainability.
Would you really want one database shared between all the users at an ISP for their email?
Lotus Notes is a good example of how DB files and software can be managed while still maintaining individual content in personal holdings.
OP, you seem to be focused on the input side, what is the plan for archiving and deleting?
"1 Database per user can be a perfectly valid method of working. But you have to balance the system, for instance load & performance against data security & maintainability.
Would you really want 1 database shared between all the users at an ISP for their email?"
True, 1 db/user *could* be a valid method of working, depending on the circumstance. Some instances where it might be appropriate:
* Each user's solution is significantly customized
* The db is being hosted for the user, which they expect to have access to.
* You are providing an implementation of a shrink-wrapped product (or the equivalent), which is designed to be implemented once per customer.
When you're providing a service for a disparate group of unrelated users, with no expectation that any user will actually touch the db except via your interface, it's really none of their concern exactly where the data resides. What's of interest to the customer is that you've accounted for their data's security and integrity. Moreover, it's in everyone's best interests that your implementation be maintainable, extensible, and scalable.
To answer the question, I would run, not walk, from an e-mail provider that intended to service more than a handful of users with one db per user. The nightmares this would make for in terms of maintenance/testing/analysis/etc., would virtually assure me that my service would never get upgraded, that errors would take an inordinate amount of time to correct, and opportunities for the provider to review their service would be outweighed by their costs.
I don't want any unauthorized eyes to view my banking transactions, and I certainly don't want my balances mixed up with anyone else's (unless their last name happens to be Rockefeller), but is there any reasonable expectation that a bank would keep a separate db for each account?
Of course, context is everything. If the OP would like to provide us with some, maybe we can make more useful suggestions.
This topic is archived. No further replies will be accepted.
Powered by FogBugz