The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Aggregate functions on HUGE collection of objects

hi,

I need to implement grouping, ordering & average on a huge collection of Objects.
The List is so huge that I cant keep it in memory. So I have to split it in buckets or sublists of objects.
Do you know an algorithm that is capable of taking in buckets of objects and aggregating it? Also, kindly present the space & time complexity of your algorithm.

Implementation should preferably be in java, but Im open to options.

Thanks in advance !
Gary King Send private email
Tuesday, May 27, 2008
 
 
Gary, it sounds like your data should be in a database. Then managing it becomes almost effortless.

But maybe there are other constraints you haven't mentioned.
Steve McLeod Send private email
Tuesday, May 27, 2008
 
 
+1 for the database. That is what RDBMS is for, and the solution would be trivial.

Also, define "huge". I have run aggregates over a billion row  telecoms carrier billing database with no problems.

I guess the "other constraints" are that this homework assignment needs to be in my the end of the month ;-)
Odysseus Send private email
Tuesday, May 27, 2008
 
 
What is huge exactly?
Daniel_DL Send private email
Tuesday, May 27, 2008
 
 
+1 again for using a database - why code when you can buy?

Once upon a time this used to be a common problem because mainframes had memories measured in kilobytes. The way it would be done is to store everything in files and write all your processes to serially consume the files one record at a time, writing files serially as the output. Averaging is of course easy (read a record, add the value, divide at the end) and grouping not much more complex ( read a record, decide which group it belongs in, write it to the appropriate output  file). Sorting is a little more complex, but basically involves repeatedly merging files in a way that partially sorts them.

I'm astonished to discover that Wikipedia has no reference to this kind of sort algorithm - clearly it is now so pre-historic as to not even be worth mentioning.
DJ Clayworth
Tuesday, May 27, 2008
 
 
Another vote for RDBMS.  The queries are easy.

select identifier, sum(statistic), avg(statistic)
from stuff inner join otherstuff on matchcondition
group by granularity;

You get the idea.

If "huge" is on the order of a few hundred thousand entries,  you can use MS Access in place of an RDBMS. If it's on the order of a few hundred million,  you're better off with a stronger database tool.

Database design is not complex, but it is subtle.  Most of the tutorial material teaches you how to design with OLTP in mind.  You might be better off with an OLAP oriented design. 

If you use OLAP data cube and GUI tools along with or instead of RDBMS,  analyzing your data is as easy as playing a video game.
Walter Mitty Send private email
Tuesday, May 27, 2008
 
 
Averaging doesn't require any tricks, even when the data set doesn't fit in memory: run through the data set accumulating the value to be averaged and the counting the number of elements in the set. When you have run through the set, divide the accumulated value by the number of elements.

For "ordering" the data, which I take to mean you want to sort the data, you are looking for an "external sort" (google is your friend). One common external sort is the merge sort, also called the tape sort, which is taught in introductory computer science classes (again, google is your friend). Clearly, you will need to sort from disk, through memory, and back onto disk, which may make things a bit more complicated.

As for "grouping" it is not clear exactly what you mean by this term. Will the number of items in the grouped output be fewer than the number of items in the entire data set? Will the number of items in the grouped output fit into main memory, or will it also need to be stored in "buckets" on disk? You can probably do what you want by first sorting the data set and then running through the sorted data set sequentially.

All in all, it is probably better to simply replace your current storage mechanism with a DBMS, as others have said, which should implement all this stuff for you.
Jeffrey Dutky Send private email
Tuesday, May 27, 2008
 
 
> I'm astonished to discover that Wikipedia has no reference to this kind of sort algorithm - clearly it is now so pre-historic as to not even be worth mentioning.

http://en.wikipedia.org/wiki/External_sorting

It's not pre-historic at all, this kind of algorithm is surely still used in any modern DB system.

I'm writing a sample implementation right now as part of an advanced DB course I'm taking.
Sagi B Send private email
Tuesday, May 27, 2008
 
 
Thanks, Sagi.
DJ Clayworth
Tuesday, May 27, 2008
 
 
Vol 3 of Knuth's Art of Computer Programming had/has a large section on this category of sorting. I remember reading an iterview with him saying that while it was "hot stuff" back when the books were written (3 decades ago), very few folks use it (today) to the point where he'd rip it out completely if he got to rewrite that volume.

http://en.wikipedia.org/wiki/Merge_sort
Peter Send private email
Tuesday, May 27, 2008
 
 
If you're using .Net 3.5, you can build a class implementing IEnumerable, and hide all the implementation details of how you marshal and cache your objects inside that class.  The class just needs a method that yields the next item in the list whenever it's called; how it does this is up to you.

Then you can do just about any analysis you want on the data using Linq queries.  Linq is smart enough to not iterate through the whole list until the query executes, and it's smart enough to not keep a list of all of the objects in memory unless you've written a query that requires it.

My first choice would be to use a database, for all the reasons described here.  But if there are reasons a database isn't suitable, this approach will work.
Robert Rossney Send private email
Wednesday, May 28, 2008
 
 
Be ready to deal with the Impedance Mismatch if you go with a RDBMS.
MoffDub Send private email
Wednesday, June 04, 2008
 
 

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
 
Powered by FogBugz