The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

A Problem I Can't Get My Head Around

Let's say I have an object that contains a date and a value.

Now, let's say I have a collection of the above objects.

Now, let's say some of the objects contain duplicate dates, so:

Object1
Date: December 2007
Value: 1000

Object2
Date: January 2008
Value: 1000

Object3
Date: January 2008
Value: 2000

Object4
Date: December 2007
Value: 500

I want to sum all the values, by date. So December 2007 total would be 1500 and January 2008 would be 3000.

What is the best way to do this with a collection of approx 300 of the above objects.

I'm sorry this isn't really a Friday afternoon type question!

Any help much appreciated!
Matthew Kerry Send private email
Friday, February 01, 2008
 
 
For 300 objects, it doesn't matter how you solve the problem. It'll take milliseconds to do no matter how you approach it.

A more helpful answer will probably depend on your implementation language, but the basic categories are:
1. Sort the collection, then iterate through it, accumulating results until the date changes.
2. Extract all the unique dates, then query the collection for the objects with a particular date.
Mark Bessey Send private email
Friday, February 01, 2008
 
 
Couldn't you just iterate though the objects, check each Date as it comes up and populate a separate collection with the totals?  This may not be the most performant way, but it sure seems simple to me.

Some sudo code...

For Each obj In InitialCollection
    If NewCollection.Contains(obj.Date) Then
        NewCollection.Item(obj.Date).Value += obj.Value

    Else
        NewCollection.Add(obj.Date).Value = obj.Value
    End If
Next
Rich Send private email
Friday, February 01, 2008
 
 
+1 to Rich except use a dictionary or hashtable for the second collection.  The key is the date, the value is the current sum.  This is only if the hashtable/platform supports value equality on the Date type.
Ted
Friday, February 01, 2008
 
 
Mark hits the nail on the head. If there's only 300 objects the solution is nearly irrelevant. Use whatever makes the code easiest to maintain, assuming you don't have some WTF-ish code to extract the data from each object.

I would solve it using the hash table method. Normalize the date and use that as the key, with the value being the running total. Then it becomes trivial to iterate over the objects and update the running totals. That's probably easier for some languages than others. If it's not easy, you're using the wrong language :-)
An anonymous voice in the crowd
Friday, February 01, 2008
 
 
> If it's not easy, you're using the wrong language :-)

SELECT SUM(ObjectValue) FROM ObjectTable GROUP BY ObjectDate
Christopher Wells Send private email
Friday, February 01, 2008
 
 
Sheesh! Obviously you should serialize the collection to XML, put some HTTP around it, add some SOAP headers, create a web service, open up a socket, wait for an asynchronous response, send it to a secondary server in case one fails, compare the results, and reconstitute a result object.

What the hell are they teaching in schools these days, anyway?
Greg Send private email
Friday, February 01, 2008
 
 
@Greg, apparently they're teaching exactly that.
Lance Hampton Send private email
Friday, February 01, 2008
 
 
Depending on the language, it may be possible to define a "CompareTo" method on the objects, put them in an array or list, and sort them. The subtotals would be collected in the usual way passing over the set in sorted order.

I doubt it would be faster than using the Hashtable, but it might have other benefits, for example, the subtotals would created in order sorted by date.
Nutmeg Programmer
Saturday, February 02, 2008
 
 
I think the preferred way would have to be to create a map, hash, collection or whatever keyed by date. Then iterate over the list and accumulate the totals for each date found. Any other approach such as sorting and summing each subrange is too much low level work and more risk of mistakes.
Ian Boys Send private email
Saturday, February 02, 2008
 
 
sum={}
for item in collection:
    sum[item]=1+sum.get(item,0)
PeterR
Saturday, February 02, 2008
 
 
Using C# 3.0:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace GroupSumExperiment
{
    class TimeData
    {
        public DateTime Date { get; set; }
        public int Value { get; set; }
    };

    class Program
    {
        static void Main(string[] args)
        {
            List<TimeData> timeSeries = new List<TimeData> {
                new TimeData { Date = new DateTime(2007, 12, 1), Value = 1000 },
                new TimeData { Date = new DateTime(2008, 1, 1), Value = 1000 },
                new TimeData { Date = new DateTime(2008, 1, 1), Value = 2000 },
                new TimeData { Date = new DateTime(2007, 12, 1), Value = 500 }
            };

            var query = from d in timeSeries
                        group d by d.Date into dateGroups
                        select new { Date = dateGroups.Key, Total = dateGroups.Sum(td => td.Value) };

            foreach (var group in query)
            {
                Console.WriteLine("{0}: {1}", group.Date.ToString("yyyy/MM"), group.Total);
            }     
        }
    }
}
Chris Tavares Send private email
Sunday, February 03, 2008
 
 
Tasks like this are legion in my day job's programming domain.  We use Java.  Every time I get one I weep silently that I can't use Ruby, and then write the (overly long, obnoxious) code to solve it with Java collections.  Which, to be fair, were amazing when they were the best tool I had available.  (Think of how annoying it would be in C without the STL.  Not hard -- it isn't a hard problem, it is CS101 -- just annoying.)
Patrick McKenzie (Bingo Card Creator) Send private email
Monday, February 04, 2008
 
 
'My goal in offering Bingo Card Creator to the world is to in my own small way improve students' education by making fun and educational activities possible.'

EFL might be more apropos? :-)

Monday, February 04, 2008
 
 
Thanks for everyone's comments.

After some wrangling I convinced the DBA to change the query to group by just the month period and it seems to work a treat.

Linq also looks like an interesting development that we need to get to soon.

Monday, February 04, 2008
 
 
> What the hell are they teaching in schools these days, anyway?

You did want all of that done in C, making as much use of pointers as possible, right?

Monday, February 04, 2008
 
 
What happens if the input grows to 30,000? 300,000?  Why not choose a solution that scales. It's as easy as one that doesn't. 

Someone had the right solution above - one loop with a hash lookup each iteration. The running time is O(n), so it scales with any number if input.
DSCheerleader
Tuesday, February 05, 2008
 
 
If you're not using .Net 3.0, this post will be of no interest to you.

Because I was curious to see how smart Linq was when it does grouping, I extended Chris Tavares's example above to compare Linq's execution of such a query with a hand-coded "manual" version.  (See below.) 

I was surprised to find that Linq ran *faster* than my manual method did.  It occurred to me that performing ten million calls to ContainsKey might be the culprit and so I wrote a version of the manual method that doesn't use ContainsKey to check before trying to update the dictionary; instead, it catches the exception if the key's not in the dictionary and adds it.

It seems that given a comparatively small number of keys, catching the KeyNotFoundException is faster than avoiding it in the first place.  And the Linq query is very slightly slower than the manual method.

It's also interesting to see that Linq executes the query when it's referenced, not when it's initialized. 

There's something I'm seeing here that I don't understand, though:  when this program dumps the Linq results, there's a perceivable delay between each line written.  I think that the total delay is about equal to the difference in execution times between the Linq query and the optimized query.  I wonder what's up with that.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;

namespace GroupSumExperiment
{
    class TimeData
    {
        public TimeData(DateTime date, int value)
        {
            Date = date;
            Value = value;
        }
        public DateTime Date { get; set; }
        public int Value { get; set; }
    };

    class Program
    {
        static List<TimeData> TimeSeries = new List<TimeData>();

        static void Main(string[] args)
        {
            InitTimeSeries();
            LinqQuery();
            ManualQueryUsingTryCatch();
            ManualQueryUsingContainsKey();

            ConsoleKeyInfo k = Console.ReadKey();
        }

        static void InitTimeSeries()
        {
            List<DateTime> dates = new List<DateTime> {
                new DateTime(2008, 1, 1),
                new DateTime(2008, 2, 1),
                new DateTime(2008, 3, 1),
                new DateTime(2008, 4, 1),
                new DateTime(2008, 5, 1),
                new DateTime(2008, 6, 1),
                new DateTime(2008, 7, 1),
                new DateTime(2008, 8, 1),
                new DateTime(2008, 9, 1),
                new DateTime(2008, 10, 1),
                new DateTime(2008, 11, 1),
                new DateTime(2008, 12, 1)
            };
            Random r = new Random();

            for (long i = 0; i < 10000000; i++)
            {
                DateTime d = dates[r.Next(dates.Count)];
                int v = r.Next(1000);
                TimeSeries.Add(new TimeData(d, v));
            }
        }

        static void LinqQuery()
        {
            Stopwatch sw = new Stopwatch();

            Console.WriteLine("LinqQuery:");
            Console.WriteLine();

            sw.Start();

            var query = from d in TimeSeries
                        group d by d.Date into dateGroups
                        select new { Date = dateGroups.Key, Total = dateGroups.Sum(td => td.Value) };

            foreach (var group in query)
            {
                Console.WriteLine("{0}: {1}", group.Date.ToString("yyyy/MM"), group.Total);
            }

            sw.Stop();
            WriteElapsedTime(sw.Elapsed);
        }

        static void ManualQueryUsingTryCatch()
        {
            Stopwatch sw = new Stopwatch();

            Console.WriteLine("ManualQueryUsingTryCatch:");
            Console.WriteLine();

            sw.Start();

            Dictionary<DateTime, int> totals = new Dictionary<DateTime, int>();
            foreach (TimeData t in TimeSeries)
            {
                try
                {
                    totals[t.Date] += t.Value;
                }
                catch (KeyNotFoundException)
                {
                    totals.Add(t.Date, 0);
                    totals[t.Date] += t.Value;
                }
            }

            foreach (DateTime d in totals.Keys)
            {
                Console.WriteLine("{0}: {1}", d.ToString("yyyy/MM"), totals[d]);
            }

            sw.Stop();
            WriteElapsedTime(sw.Elapsed);
        }

        static void ManualQueryUsingContainsKey()
        {
            Stopwatch sw = new Stopwatch();

            Console.WriteLine("ManualQueryUsingContainsKey:");
            Console.WriteLine();

            sw.Start();

            Dictionary<DateTime, int> totals = new Dictionary<DateTime, int>();
            foreach (TimeData t in TimeSeries)
            {
                if (!totals.ContainsKey(t.Date))
                {
                    totals.Add(t.Date, 0);
                }
                totals[t.Date] += t.Value;
            }

            foreach (DateTime d in totals.Keys)
            {
                Console.WriteLine("{0}: {1}", d.ToString("yyyy/MM"), totals[d]);
            }

            sw.Stop();
            WriteElapsedTime(sw.Elapsed);
        }

        static void WriteElapsedTime(TimeSpan ts)
        {
            string elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}",
                ts.Hours, ts.Minutes, ts.Seconds,
                ts.Milliseconds / 10);
            Console.WriteLine("Elapsed time: " + elapsedTime);
            Console.WriteLine();
        }
    }
}
Robert Rossney Send private email
Wednesday, February 06, 2008
 
 
Interesting given the amount of literature on how exceptions can cause problems with performance ...
Nigel Sampson Send private email
Thursday, February 07, 2008
 
 
@Nigel Sampson
Exceptions can indeed cause huge problems but the magnitude is relative to call stack unwind that occurs until they are caught. If they are caught very close to the throwing place the penalty is minor, but if the catching occurs far away (call stack wise) the penalty grows at an exponential rate.
AZ
Thursday, February 07, 2008
 
 

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
 
Powered by FogBugz