A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.
I'm working on the architecture of a new system. Essentially, the data requirements are the same as a service like Google Analytics. There will be a steady flow of new data entering the system from numerous clients. This data will be small (about 1k in size) and rarely change. There could be as many as 1,000,000 of these data elements coming in per day. Once the data is in the system, a client can run simple analytics on it and export their raw data in CSV format.
Given that the system could capture 1GB of data a day, the database is going to get large fast. There is no requirement that the database for captured data needs to be the same as the one from which data is exported and/or analyzed.
I have no preconceptions of what database technology to use (although I am a MS SQL guy myself) and wouldn't be concerned about exploring solutions that fall outside of the typical realm of RDBMSs.
What are your thoughts on which database technology and deployment strategy would be best? Are there any reference architectures or case studies that are similar to this scenario? Any other thoughts?
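A quick back-of-envelope sizing sketch for the numbers in the question (pure arithmetic, no particular database assumed):

```python
# Back-of-envelope capacity estimate for the stated workload:
# ~1,000,000 records/day at ~1 KB each.

RECORDS_PER_DAY = 1_000_000
RECORD_SIZE_BYTES = 1024  # "about 1k in size"

daily_bytes = RECORDS_PER_DAY * RECORD_SIZE_BYTES
yearly_bytes = daily_bytes * 365

print(f"per day : {daily_bytes / 2**30:.2f} GiB")   # ~0.95 GiB/day
print(f"per year: {yearly_bytes / 2**40:.2f} TiB")  # ~0.34 TiB/year

# Average insert rate, assuming an even spread (real traffic is bursty):
print(f"avg inserts/sec: {RECORDS_PER_DAY / 86_400:.1f}")  # ~11.6/sec
```

So raw storage is on the order of a third of a terabyte a year before indexes and overhead, and the average write rate is modest; it's the aggregate query cost over that growing volume that dominates the design.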
We put this sort of volume of data through MySQL databases on Linux on 64-bit server-class hardware and it seems just fine.
We haven't yet run out of room, table size, or identity values.
There are upper limits on file sizes, but they're pretty huge these days.
Monday, February 12, 2007
Have you looked at Data Warehousing and OLAP? It sounds like the data you are storing is warehouse data (i.e. huge volumes of essentially immutable business event data), and that you want to provide an efficient, performant means for clients to analyze that data, which is exactly the problem OLAP was designed to solve.
I've used SQL Server for both tasks, and find it works quite well, especially for clients who are using Office tools.
I will warn you, as the volume of data in the database increases, any queries that need to perform aggregations such as SUM, COUNT or MAX will become slower and slower, to the point where your database server will fall over. OLAP handles this problem by pre-calculating aggregations and storing them in the cube, where all queries are performed.
Raw data still needs to be extracted from the warehouse though.
OLAP still needs to recognise and read the new data, of course, and most decent RDBMSs have their own summarisation technology that provides similar performance benefits -- materialized views and query rewrite in the case of Oracle.
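The pre-aggregation idea above can be shown in miniature, with SQLite standing in for the warehouse. The table and column names are invented for illustration; a real OLAP cube or materialized view would maintain the rollup for you rather than requiring a manual refresh:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events (day TEXT, client_id INTEGER, value REAL);
    -- Pre-computed rollup: one row per (day, client) instead of
    -- re-scanning every raw event for each SUM/COUNT query.
    CREATE TABLE events_daily (day TEXT, client_id INTEGER,
                               total REAL, n INTEGER);
""")
con.executemany("INSERT INTO events VALUES (?, ?, ?)",
                [("2007-02-12", 1, 2.0), ("2007-02-12", 1, 3.0),
                 ("2007-02-12", 2, 5.0)])

# Refresh the summary (a real system would do this incrementally,
# e.g. via cube processing or an indexed/materialized view).
con.execute("DELETE FROM events_daily")
con.execute("""INSERT INTO events_daily
               SELECT day, client_id, SUM(value), COUNT(*)
               FROM events GROUP BY day, client_id""")

# Analytic queries now hit the small rollup, not the raw table.
row = con.execute("""SELECT total, n FROM events_daily
                     WHERE day = '2007-02-12' AND client_id = 1""").fetchone()
print(row)  # (5.0, 2)
```

The rollup table grows with the number of distinct (day, client) pairs rather than with raw event volume, which is why aggregate queries stay fast as the warehouse grows.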
Monday, February 12, 2007
Whatever the initial data structure and hardware platform, I would try to imitate data warehouse design through strong normalisation: a single "big" table holding the numeric information, referencing smaller dimension tables where needed.
The ultimate goal is to reduce the big table's record size while keeping the side tables small. That allows quicker aggregation and/or selection, especially with well-designed indexing.
You may also need to establish the minimal "grain" of your data. For example, do you really care in which second a particular event took place? Or perhaps you're not that interested in single events at all, but in the result of a whole sequence (session) of events.
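The "big table plus small dimension tables" layout described above can be sketched like this, with SQLite used for brevity and all names invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Small dimension table: the repeated, descriptive strings.
    CREATE TABLE dim_event_type (
        id   INTEGER PRIMARY KEY,
        name TEXT UNIQUE
    );
    -- Big fact table: narrow, numeric rows at the chosen grain
    -- (here one row per event, timestamped to the second; coarsen
    -- the timestamp if that resolution isn't really needed).
    CREATE TABLE fact_events (
        ts            INTEGER,
        client_id     INTEGER,
        event_type_id INTEGER REFERENCES dim_event_type(id),
        value         REAL
    );
    CREATE INDEX ix_fact_client_ts ON fact_events (client_id, ts);
""")
con.execute("INSERT INTO dim_event_type (name) VALUES ('page_view')")
type_id = con.execute(
    "SELECT id FROM dim_event_type WHERE name = 'page_view'").fetchone()[0]
con.execute("INSERT INTO fact_events VALUES (1171238400, 1, ?, 1.0)",
            (type_id,))
n = con.execute("SELECT COUNT(*) FROM fact_events").fetchone()[0]
print(n)  # 1
```

Keeping the fact rows all-numeric means each of the million daily rows stays small, and the strings live once in the dimension table instead of a million times in the big one.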
Thanks for all of the input. I'm encouraged to hear that MySQL has handled this sort of load in the past as that is one of the database technologies that is on the short list. Regarding the need for an OLAP structure or star schema, I'm less concerned about that. Compared to the data aggregation needs, analysis is going to be pretty infrequent.
I'm considering some sort of structure whereby one database holds core data (clients, accounts, config info, etc.) and then there is a separate DB for each type of data collected (there will be numerous kinds). I could then host each data-type DB on its own server (or several on one server), as needed. I still haven't quite decided what to do if one of the data type databases grows beyond a manageable size - perhaps have multiple and make sure that one client stays on the same server. Any thoughts?
Thanks again for everyone's comments thus far.
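The idea of a separate database per data type, with each client's data pinned to one server, can be sketched as a simple routing function. The configuration and naming here are an invented illustration, not a recommendation:

```python
# Route each (data_type, client) to a fixed database/server so one
# client's data for a given type always lands in the same place.

# Invented illustrative config: servers available per data type.
SHARDS = {
    "pageviews": ["pv-db-1", "pv-db-2"],
    "clicks":    ["click-db-1"],
}

def shard_for(data_type: str, client_id: int) -> str:
    """Deterministic assignment: the same client always maps to the
    same server for a given data type."""
    servers = SHARDS[data_type]
    return servers[client_id % len(servers)]

print(shard_for("pageviews", 7))   # pv-db-2
print(shard_for("pageviews", 8))   # pv-db-1
print(shard_for("clicks", 7))      # click-db-1
```

One caveat with plain modulo: adding a server to a shard list reassigns existing clients. Persisting the client-to-server assignment in the core database (the one holding clients, accounts, and config) avoids that, and new servers then only take on new clients.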
Yeah, SQL Server 2000 & 2005 will handle many millions of rows per table without issue, provided each row isn't too huge, and things are appropriately indexed (which is a topic unto itself). You could, for instance, gather all your data into a single table, then move those records off into reporting tables (in a separate database perhaps) using DTS, transforming them along the way, then selecting from the prepared table with your reporting application.
Some of our tables are up to 30M+ rows, and very active, and we're doing fine -- we just move things out periodically, our apps select against the tables we prepare from the big ones, and all holds together just fine.
Without knowing more details, something like that ought to work for you. I can't speak for MySQL, but I know SQL 2K should be fine for something like that. Even one instance of it could work, depending on how much load you'll be throwing at it (in terms of the SELECTs, etc.).
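The "gather into one table, then move into reporting tables" pattern described above can be sketched generically. SQLite stands in for SQL Server, the names are invented, and DTS would do the transform-and-move across databases rather than within one:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ingest (id INTEGER PRIMARY KEY, client_id INTEGER,
                         value REAL);
    CREATE TABLE reporting (client_id INTEGER, total REAL, n INTEGER);
""")
con.executemany("INSERT INTO ingest (client_id, value) VALUES (?, ?)",
                [(1, 2.0), (1, 3.0), (2, 4.0)])

def move_batch(con):
    """Transform ingest rows into the reporting shape, then clear them,
    in one transaction so a crash can't lose or duplicate rows."""
    with con:  # sqlite3 connection context manager = one transaction
        con.execute("""INSERT INTO reporting
                       SELECT client_id, SUM(value), COUNT(*)
                       FROM ingest GROUP BY client_id""")
        con.execute("DELETE FROM ingest")

move_batch(con)
left = con.execute("SELECT COUNT(*) FROM ingest").fetchone()[0]
rows = con.execute("SELECT * FROM reporting ORDER BY client_id").fetchall()
print(left)  # 0
print(rows)  # [(1, 5.0, 2), (2, 4.0, 1)]
```

The ingest table stays small and write-optimized this way, while the reporting application only ever queries the prepared tables.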
Tuesday, February 13, 2007
This topic is archived. No further replies will be accepted.