The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Your experience w/ Column oriented database (MonetDB, C-Store)

RDBMS (row based databases) are great for INSERTing data; however, they are painfully slow to retreive (SELECT) data.

Does anyone have any experience using a Column Oriented Database system like MonetDB or C-Store.

If so, would you mind sharing your experience with us.

Thanks in advance.
Frank
Tuesday, March 06, 2007
 
 
'however, they are painfully slow to retreive (SELECT) data."

Really? That's news to me. But maybe my threshold of "pain" is much higher than yours. ;)
dood mcdoogle
Tuesday, March 06, 2007
 
 
RDBMS stands for Relational Database Management System, not row based database. Both of the column oriented databases you mention are RDBMSes.

Terminology confusion aside, almost the entire database-using world finds that conventional DBMSes select quickly enough. Before you decide that it's worth going with a relatively obscure DBMS, make sure the DBMS is really your bottleneck.
clcr
Tuesday, March 06, 2007
 
 
What exactly would make a database "column-oriented" rather than "row-oriented"?
Kyralessa Send private email
Tuesday, March 06, 2007
 
 
I've used Sybase IQ, which I believe may fit your definition of column-based.  As I recall it indexes every column, and the indexes _are_ the data.  It seemed to work well, when used for the kinds of tasks it is designed for (datamart/data warehouse; not OLTP).
John Rusk Send private email
Tuesday, March 06, 2007
 
 
For those of you who don't know - as wikipedia alludes to, the author is correct that column oriented databases are MUCH faster at reading data than writing data compared to Row-based databases.

http://en.wikipedia.org/wiki/Column-oriented_DBMS

Column based databases can provide a 10X increase in performance over row based for database management systems.

http://en.wikipedia.org/wiki/MonetDB

Renowned database designer Michael Stonebraker, creator of Ingres and Postgres, is starting a company called Vertica (http://www.vertica.com/) which is continuing the developed of the C-store (column based database) for commercial use.
Bryan
Tuesday, March 06, 2007
 
 
BUT... most of the products currently available for column oriented databases are some graduates student's thesis. Not a real commercial database. So why would you expect some beta level product to outperform a commercial or open source database that has been constantly tuned FOR performance for 20 years? And note that column oriented databases are not ALWAYS faster for queries. Just sometimes. Has the OP determined that his queries fit qualifications for the so-called huge performance gains he is expecting? Maybe, maybe not. Also, don't forget that Wikipedia and the likes can be edited by anyone. Including the groups advocating column oriented databases. So give us some real facts and show us the numbers. Otherwise, you're just making it all up.

I just take offense at someone coming in here and telling us that something is painfully slow without giving us anything to compare it to. Painfully slower than what? An Apple IIe accessing a floppy disk? Or the latest and greatest whoop-de-jour?
dood mcdoogle
Tuesday, March 06, 2007
 
 
For those of you moaning that column based DBMS are NOT faster, here are a few benchmarks.

http://monetdb.cwi.nl/projects/monetdb/SQL/Benchmark/TPCH/index.html

http://www.mit.edu/~dna/vldb.pdf

http://www.vertica.com/userlogin?pdf=osfa.pdf

Don't take offense when someone tell you row-based DBMS for reading data is slower than column based.  It's simply a by product of the fact that row-based DBMS were originally designed for massive INSERTS.

Do you really think OLAP cubes obtain the crazy performance they do by using row-based DBMS.  They all use column-based.

(And no, I do not work for any column based DBMS company).
Bryan
Tuesday, March 06, 2007
 
 
"I just take offense"

Wow.  I really have nothing to say.  But this is the Internet, so that means I should say something anyway.

d00d, you need to settle the heck down.  Do you also get upset if someone says that C++ is better than C#, or Java better than Perl?

You're way too personally invested in this stuff.

Personally, I take offense when some ignorant prat goes off on someone asking a simple question, when said prat doesn't have any idea what the question means in the first place.

This is fun, I see why you do it. ;)
Anon
Wednesday, March 07, 2007
 
 
I just read the wikipedia article to find out the differences. The difference in data ordering (all data for a single row stored contiguously vs. all data for a single column stored contiguously strikes me as a relatively simple transform. If the difference are as great as some people claim, then it's just a matter of time until a major vendor gives us the option to create dual data stores. One will be row-based, the other will be column-based, and some funky performance-based lazy-write will keep the two synchronized. The appropriate store will be automatically chosen (with hints available, of course) based on the specific nature of the task.

To be honest, I'm not sure what difference there is between the two. Ultimately the row-based system needs something like 'offset = fieldCount' to grab the next chunk of data for a column and 'offset = 1' to grab the next chunk of data for a row. In a column-based system, you need 'offset = fieldCount' to grab row data and 'offset = 1' to grab column data. I'm sure that there is really any difference between 'offset = 1' and 'offset = x', and it looks like both still need to do bookkeeping to track which 'set' (row or column) is current/next/previous/whatever.

And I've probably haven't got a clue what I'm talking about; not an unusual situation these days :)
Ron Porter Send private email
Wednesday, March 07, 2007
 
 
"Don't take offense when someone tell you row-based DBMS for reading data is slower than column based."

I don't take offense when someone tells me that column based is quicker than row based. I take offense when someone tells me that row based is "painfully slow". Clearly it is all relative.

"d00d, you need to settle the heck down.  Do you also get upset if someone says that C++ is better than C#, or Java better than Perl?"

No. I don't take offense when someone tells me that one thing is better than another. But I would take offense if someone told me that C# was "painfully slow" without providing any sort of backup to that claim or frame of reference. That is just trolling in my book you ignorant prat.

"This is fun, I see why you do it. ;) "

Yes it is.
dood mcdoogle
Wednesday, March 07, 2007
 
 
> To be honest, I'm not sure what difference there is between the two.

When you are doing tasks like data mining and you want to sum a column it's much faster to bring in pages of columns than it is to bring in pages of rows and then select the column.
son of parnas
Wednesday, March 07, 2007
 
 
inside the actual files, deep inside SQL Server or Oracle or DB2, do we know how the data is stored? something tells me it isn't in a text-like file with entries like the wikipedia article.
anomalous
Wednesday, March 07, 2007
 
 
DBMS storage is fairly well understood. The wikipedia article *is* a gross simplification, but it does serve to illustrate the point. A typical DBMS will store data in a B tree rather than a flat file. The B tree can be organized by columns or by rows, as described in the wikipedia article.
clcr
Wednesday, March 07, 2007
 
 
I work with kdb+ here in a Manhattan IBank.

kdb uses the column model and is blisteringly fast compared to conventional RDBMS solutions such as SQL Server, Oracle, UDB, etc.

For high volume and/or high speed data management and processing I haven't come across anything which can compare.

http://www.kx.com/

If you have specific questions, let me know.
JJ Send private email
Wednesday, March 07, 2007
 
 
dood mcdoogle >
"BUT... most of the products currently available for column oriented databases are some graduates student's thesis. Not a real commercial database. So why would you expect some beta level product to outperform a commercial or open source database that has been constantly tuned FOR performance for 20 years?"

I beg to differ...

http://www.kx.com

Check out kdb / kdb+

A commercial product which has been around for years and is used in many major investment banks around the world for high volume, high performance data processing and number crunching as well as automated realtime trading systems.

Disclaimer: I don't work for Kx.
I do, however, have experience in using their kdb line of products.
JJ Send private email
Wednesday, March 07, 2007
 
 
I'll second JJ regarding kdb.

There's nothing new about column-oriented RDBMSs ... the've been around forever and every APL programmer has accidentally written his own.

I think kdb is the most highly developed commercial-grade product, but because of its cost, it generally only makes sense for big financial companies.
Paul Mansour Send private email
Wednesday, March 07, 2007
 
 
Yup.. kdb+ is fast fast fast. It could be beaten though. My after hours project is a mostly compatible implementation.. after using it for some time I have some things I want to change. I have it feature complete. Lacking a decent plan to sell it but.
noop Send private email
Wednesday, March 07, 2007
 
 
This is a really interesting post. It never occured to me that you could design a DB this way.

<possibly stupid question alert>
How does indexing work? If you have 50 columns in your Orders table, and you have an index on, say, order#, does the index have to point to each of the 50 fields separately, since they are no longer contiguous?
</possibly stupid question alert>
Greg Send private email
Thursday, March 08, 2007
 
 
noop, I'm intrigued by your claim, care to share more info? You said you are using it; in what type of app?
Ben Bryant Send private email
Thursday, March 08, 2007
 
 
Greq,

I don't know about all implementations, but many have no indexing concept (inlcuding kdb I think).

In your example, if you want to find a find an order number, you simply look it up in the actual order column, which is just a big vector of order numbers, probably sitting in memory, or is easily memory-mapped, as all order number are stored contiguously on the disk.

In these types of RDBMSs the database size is no bigger than the actual data in the database. Nothing is indexed.
Paul Mansour Send private email
Thursday, March 08, 2007
 
 
Usually you know the order# and want to find the 20 other attributes related to that order# (customer name, quantity ordered etc.)

Usually you have millions of orders (too many to keep in memory) and need to be able to find one record directly.

This sounds like an interesting concept for some applications, but perhaps not for run-of-the-mill business transaction processing.

Or maybe I just don't get it.
Greg Send private email
Thursday, March 08, 2007
 
 
Greg,

You only need to keep or have the order# col in memory, and a million is not alot at all. Once you have found the particular order# you are looking for, say in the 500th position, you can then directly read the 500th item from each of the remaining columns of the order table and you have your record.

You are correct, however, that column-oriented RDBMSs have not traditionally been used for typical transaction processing. However there are many types of apps where it excels. For example, if a typical transaction is, say, updating 3 columns on 50,000 rows, then it quite effective.
Paul Mansour Send private email
Thursday, March 08, 2007
 
 
Well, in KDB/KDB+ realtime for example, everything is held in memory. Basically in big arrays.

So if you know the index of the Order# you want, then getting all the other info is O(1) because it's a simple array lookup.

It does use indexing (and enumeration of varchar columns to integers to make searches and lookups on character data much faster)
JJ Send private email
Thursday, March 08, 2007
 
 
Paul beat me to it :)
JJ Send private email
Thursday, March 08, 2007
 
 
ben "I'm intrigued by your claim, care to share more info? You said you are using it; in what type of app?"

I use it to do exactly what I use kdb for. I don't have all the licenses I need due to cost and the fact I just don't want to deal with the available consultants anymore. Its a clean room mostly compatible version that is as fast. I do analysis of terabyte data sets looking for tradeable patterns. Considing realtime ones but that would be politically awkward to put alongside the existing servers. Any particular technical questions?

Someone asked about indexing. kdb does have an indexing concept. Check the documentation. LOL. kdb users joke.
noop Send private email
Thursday, March 08, 2007
 
 
People are missing a few things. Not exactly trade secrets.

You often have wide tables where you only need to get a few of the fields of some row.

You can compress the data to get small space and better performance.

You can make oltp workloads go very fast as well. I don't want to give it all away but heres a big hint. Say you want to simulate a row layout as an extreme example. Why not use a table with a single column. An entry in that column can contain a list of the data for that row.
noop Send private email
Thursday, March 08, 2007
 
 
Just jopined followed the links shown here and im puzzled.
Whats q and kdb. Are they the same?
Also did noop say you say where dealing with terabyte data sets? Is this exchange data? How fast can you deal with these sets? Is this with kdb or another client?

Also dont understand the "kdb users joke"?

A.
Alan Send private email
Thursday, March 08, 2007
 
 
I've never seen kdb in action, though I have seen Hyperion (OLAP) which uses a column oriented database.

It's absolutely amazing how fast it pulls, sums, sorts and JOINS data.  It's serious in real-time.

It's hard to be impressed unless you have seen it in action.  Column oriented database can provide literally an order of magnitude performance boost over traditional row-based databases.
Dennis
Thursday, March 08, 2007
 
 
I forgot to mention, to the original author - no, sorry I have never used MonetDB or C-Store.
Dennis
Thursday, March 08, 2007
 
 
"dealing with terabyte data sets?" yes ticks and derived

"how fast can you deal with these sets?"

in memory a few gigabytes a second for simple analytics. message arrival rates up to 500000 a second (could be more). from disk I'm bottlenecked by the disk system. I think i can bet them when using multithreaded reading & writing.

"Is this with kdb or another client?"

both. shifting to all my own. the server processes sit and run precanned queries for now and no expansion there.

"Also dont understand the "kdb users joke"?"

the docs are cryptic at times and the bodyshop consultants shit. kx themselves are unpolished but good.
noop Send private email
Thursday, March 08, 2007
 
 
Hi noop.  I'm curious about your own replacement for kdb. Are you coding in k, or some other array language? Or directly in C?
Paul Mansour Send private email
Thursday, March 08, 2007
 
 
For what it's worth, Hyperion as Dennis mentioned, was recently purchased by Oracle for over $3 billion
Sammy
Thursday, March 08, 2007
 
 
"Are you coding in k, or some other array language? Or directly in C?"

In fairly portable C. Bit of linux specific stuff. Temptation is to write a version specific to one platform. linux or solaris on x86 perhaps. For 2.0 add a little machine code generator for scalar expressions. Other temptation is to try to commercialize but I'd rather take a one off check to take the system off my hands.
noop Send private email
Thursday, March 08, 2007
 
 
All,

Does anyone know of bench marking tests performed on the offerings suggested against one another and the more pupular offerings such as oracle? Specifically in relation to finance data?

If so can you point me towards them or tell me what the results and what OS was used?
What where the data set sizes or date ranges?

A.
Alan Send private email
Friday, March 09, 2007
 
 
Does anyone know of a MySQL engine which is column based?
Braden
Friday, March 09, 2007
 
 
In the crazy world of DSS and DW for Oracle we often use precalculated summaries (materialized views, for example) to get faster access to subsets of columns of a fact table, or even unaggregated (strictly redundant) sets that contain just the commonly queried columns. For very small columns subsets a bitmap or compressed btree index usually does very nicely.

I'd not be suprised to see that the benchmarks that most favour column orientation do not show an RDBMS making use of such methods.
David Aldridge Send private email
Sunday, March 11, 2007
 
 
You guys are missing the point.  Column databases read far less data when doing selects, and allow much higher data compression.  In a datawarehouse situation where the limitation is how quickly you can fetch X MB of data from disk, this architecture is faster because it fetch 1/10th the data.  It is not a matter of a commercial db being "optimized" because the limitation here is I/O.

As for experience with column dbs, currently they mostly exist only in research paper and proof-of-concept form so i don't think there is any experience to be had.
jay k
Tuesday, March 20, 2007
 
 
MonetDB is very interesting indeed.  However, just to get it to a point where one can issue simple queries against it is a royal pain (compilation, missing utilities, etc.)
falcon Send private email
Thursday, March 29, 2007
 
 

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
 
Powered by FogBugz