* The Business of Software

A former community discussing the business of software, from the smallest shareware operation to Microsoft. A part of Joel on Software.

We're closed, folks!

Links:

» Business of Software FAQ
» The Business of Software Conference (held every fall, usually in Boston)
» Forum guidelines (Please read before posting!)

Moderators:

Andy Brice
Successful Software

Doug Nebeker ("Doug")

Jonathan Matthews
Creator of DeepTrawl, CloudTrawl, and LeapDoc

Nicholas Hebb
BreezeTree Software

Bob Walsh
Host of the Startup Success Podcast; author of The Web Startup Success Guide and Micro-ISV: From Vision to Reality

Patrick McKenzie
Bingo Card Creator

Analytics Data Model

As some of you might know, for my mISV project I'm currently working on a desktop analytics product, kind of like Google Analytics or Omniture, but for desktop software rather than for web apps. If you're interested, you can read more here: http://benjismith.net/index.php/2008/06/02/business-intelligence-for-desktop-software/ There's also a screenshot mockup, which I think will really clarify what I'm developing: http://benjismith.net/index.php/2008/06/13/analytics-gui-mockup/

I've just finished writing the majority of the server code, which exposes a web-services interface so that client applications can connect and report their anonymous usage statistics.

I also wrote a test simulation in which 25,000 unique clients connect to the server, reporting over a million sessions spread across a six-month timeframe. (I wanted to make sure there are no adverse performance consequences as the database grows. So far so good: no scaling issues yet.)

Now I'm starting to design the reporting implementation.

My goal is to provide business intelligence observations, like "80% of my users are on Windows, but the Mac OS users spend 40% more time using the application, taking advantage of more of the advanced features."

To start, I wrote a naive implementation that builds each query on-the-fly and simply executes SQL aggregate functions (like COUNT() and SUM()) with dynamically generated WHERE clauses, over the entire dataset.
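
For example, a typical generated query might look something like this (the table and column names here are only illustrative, not my real schema):

SELECT
  COUNT(*) as session_count
FROM session
JOIN environment ON session.environment_id = environment.id
WHERE environment.os_name = 'Windows XP'
  AND environment.ram_mb >= 1024
  AND session.start >= '2008-02-01';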

As expected, with a million records in the DB, the performance of those on-the-fly queries is unacceptable, each of them taking between 10 and 30 seconds to execute. I can only imagine the terrible performance if the database contained 100 million session records.

So I'm considering an implementation that will pre-calculate all the metrics (using periodic batch-processing) and store the results in a fast reporting table. Queries against the reporting table won't issue any aggregate functions, and the number of records will directly correspond to the number of reporting units in the current timeline (day-of-year, day-of-week, etc), so they'll execute *much* more quickly.
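
Roughly, the idea is a reporting table plus a periodic batch job, something like this (just a sketch; the names are placeholders):

CREATE TABLE report_sessions_per_day (
  report_date   DATE NOT NULL PRIMARY KEY,
  session_count INT NOT NULL
);

-- batch job, re-run every hour or so
REPLACE INTO report_sessions_per_day (report_date, session_count)
SELECT DATE(session.start), COUNT(*)
FROM session
GROUP BY DATE(session.start);

-- the report query then just reads pre-computed rows
SELECT report_date, session_count
FROM report_sessions_per_day
ORDER BY report_date ASC;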

But it's equally obvious to me that I can't possibly anticipate all the queries that the user will want to execute. There are just too many permutations (Windows 98 users with AMD processors, aggregated by day-of-week? Linux users with less than 1GB of RAM, segregated by geographic region and aggregated by week-of-year?)

So pre-computing everything is out of the question.

I'm considering implementing a "query history" table, so that the server learns over time which queries the user executes most frequently, and then schedules those queries for pre-calculation based on a hierarchy of priorities.
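
The history table itself would be pretty simple, something along these lines (just a sketch):

CREATE TABLE query_history (
  query_hash      CHAR(32) NOT NULL PRIMARY KEY,  -- e.g. MD5 of the normalized query text
  query_text      TEXT NOT NULL,
  execution_count INT NOT NULL DEFAULT 0,
  last_executed   DATETIME NOT NULL
);

-- bump the counter every time a report query runs
INSERT INTO query_history (query_hash, query_text, execution_count, last_executed)
VALUES (MD5('<normalized query text>'), '<normalized query text>', 1, NOW())
ON DUPLICATE KEY UPDATE execution_count = execution_count + 1, last_executed = NOW();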

But I'm thinking that there might be some intermediate representation, some data structure that allows me to perform the heavy-lifting of the report-generation only once, but still allows a flexible reporting mechanism to execute fast aggregates on the intermediate sums, applying the constraints at the last possible moment.

For the life of me, though, I can't think of an intermediate representation that preserves the breadth of possible constraints while providing sufficient aggregation to speed up the dynamically generated queries.

Keep in mind: if I aggregate the number of unique users per hour, I can't simply sum them to get the number of unique users per day, since many of the users will overlap from hour to hour.
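
In other words, the daily figure has to be computed from the raw sessions, something like this (device_id here stands in for whatever column identifies a unique client), rather than by summing hourly rollups:

SELECT
  DATE(session.start) as session_date,
  COUNT(DISTINCT session.device_id) as unique_users
FROM session
GROUP BY session_date
ORDER BY session_date ASC;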

Any suggestions?
BenjiSmith
Thursday, August 07, 2008
 
 
I guess you have a schema like this:

* 1 table
* Many columns (e.g. client, OS, CPU, etc.)
* Can't define a clustered index on every column

Just as an amateur, off-the-wall suggestion, how about a schema like this:

* 1 table of clients
* Other, separate tables for each attribute ... each table contains two columns, i.e. the attribute value (e.g. "OS"), plus the client ID as a foreign key into the clients table ... these separate tables each define a clustered index on whatever their attribute column is.

You could then do a query like:

select OStable.client from OStable join CPUtable on OStable.client = CPUtable.client where (OStable.OSvalue = 'Win98') and (CPUtable.CPUvalue = 'AMD')

Maybe that would be faster, because the columns referenced in the WHERE clause would be clustered indexes, so it wouldn't need to read every record in the table.
Christopher Wells
Thursday, August 07, 2008
 
 
I assume you have indices on all the relevant columns in your database? That plus caching can speed these queries up by several orders of magnitude.

I'd especially look at caching. Assuming you can't get around providing these detailed metrics, you have to execute the queries at some point. Instead of writing a complex framework and trying to anticipate everything (and then generating a lot of data that users will never look at), why not let them execute the queries they want and then cache the results so it's faster the second time?
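
Even a very simple cache would do, e.g. a table keyed on a hash of the generated SQL (just a sketch):

CREATE TABLE query_cache (
  query_hash   CHAR(32) NOT NULL PRIMARY KEY,  -- e.g. MD5 of the generated SQL
  result_data  MEDIUMBLOB NOT NULL,            -- serialized result set
  generated_at DATETIME NOT NULL
);

Before running an expensive query, look it up by hash; if a fresh enough row is there, render that instead.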

Other than that, I don't see a logical way to be more efficient as long as you use SQL. There might be a better way to do it using different data structures, but I don't have much experience with that. Why don't you ask the RescueTime guys how they're doing it? They were quite friendly and responded to questions when they started out; maybe that hasn't changed.

good luck,

matt
Matt Winkelmann
Thursday, August 07, 2008
 
 
And btw: I really like your mockup. Not sure if I want to look at a black user interface for long, but I'd feel like the supervillain of analytics using that. Muaharharhar :-)
Matt Winkelmann
Thursday, August 07, 2008
 
 
I think this is a "data warehouse" problem, so read about data warehouse and "star schemas": http://en.wikipedia.org/wiki/Star_schema
Christopher Wells
Thursday, August 07, 2008
 
 
Christopher Wells:
"I guess you have a schema like this..."

Actually, I have separate tables for 'device', 'environment', 'session', and 'event' (I'm simplifying a little bit, but you get the picture).

I've taken great care to minimize the rowlength of each record, and to make sure that variable-length columns are always in separate tables from the primary analytic tables. Any column that might occur in a WHERE clause has a BTREE index (incidentally, the current DAO implementation uses MySQL InnoDB tables, with foreign key constraints on all table relationships).

The 'session' table is the lynchpin of the analysis. Here's an example query, to count the number of sessions per day, with no constraints on the user's operating environment:

SELECT
  DATE(session.start) as session_date,
  COUNT(*) as session_count
FROM session
GROUP BY session_date
ORDER BY session_date ASC;

With barely more than a million records in the 'session' table, that query takes 10.5 seconds to execute, and it's the simplest of the queries I ever expect to see.

A slightly more complex query that accomplishes the same thing (but which assumes an environment with multiple software developers, producing multiple applications, each with multiple versions) looks like this:

SELECT
  DATE(session.start) as session_date,
  COUNT(*) as session_count
FROM session
JOIN app_version ON (
  session.app_version_id = app_version.id
) JOIN app ON (
  app_version.app_id = app.id
) JOIN developer ON (
  app.developer_id = developer.id
) WHERE (
  developer.id = 2
) GROUP BY session_date
ORDER BY session_date ASC;

That query takes about nine seconds to execute, even with all the join-logic in place (the developer with ID '2' only has about 14,000 sessions, rather than the full million). So I think the DB design is good, and I'm just hitting the natural performance wall of executing aggregate functions.
BenjiSmith
Thursday, August 07, 2008
 
 
I worry anytime I see a calculation in a query that must perform well. The DATE() call is a concern. I would consider adding a separate date field, maybe even as an integer like YYYYMMDD.
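
For example (just a sketch, using the session table from the earlier query):

ALTER TABLE session ADD COLUMN start_date_int INT NOT NULL;  -- e.g. 20080807
UPDATE session SET start_date_int = DATE_FORMAT(session.start, '%Y%m%d');
CREATE INDEX idx_session_start_date ON session (start_date_int);

-- the earlier report then groups on an indexed integer, with no per-row DATE() call
SELECT start_date_int, COUNT(*) as session_count
FROM session
GROUP BY start_date_int
ORDER BY start_date_int ASC;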
Mark Griebling
Thursday, August 07, 2008
 
 
Matt Winkelmann:
"why not let them execute the queries they want and then cache the results so it's faster the second time."

Absolutely.

That's been a part of my plan since the beginning. The GUI will always include text saying "Results generated 2 hours ago. Click here to refresh." for each chart.

It's okay to provide slightly stale data most of the time, with periodic automatic refreshing, and on-demand regeneration.

If I don't allow arbitrary constraints, it's easy to keep running totals on the server that are periodically refreshed.

Constraints against the timeline (e.g., constrain to the last 30 days) are easy to model, for queries like "users per day". But timeline constraints are much more difficult to pre-aggregate for "histogram" types of queries. For example, every time the user asks for the "Operating System of Unique Users" pie chart, for the last 30 days, the whole query needs to be executed from scratch. I can't think of a way to pre-aggregate that data.
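
Concretely, that pie chart comes down to something like this (os_name and device_id are placeholders for my real column names), and I don't see how to pre-aggregate the DISTINCT count across an arbitrary date range:

SELECT
  environment.os_name,
  COUNT(DISTINCT session.device_id) as unique_users
FROM session
JOIN environment ON session.environment_id = environment.id
WHERE session.start >= DATE_SUB(NOW(), INTERVAL 30 DAY)
GROUP BY environment.os_name;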

Matt Winkelmann:
"And btw: I really like your mockup. Not sure if I want to look at a black user interface for long, but I'd feel like the supervillain of anlytics using that."

Thanks! I'm really proud of it.

In a future version, I plan on letting the user change the basic theme (foreground, background colors, icons, etc). So people who dislike the black background could change it to something else.

But, for now, since I'm just a one-man show, non-functional features like that will have to wait.
BenjiSmith
Thursday, August 07, 2008
 
 
> that query takes 10.5 seconds to execute

I don't know; you might want to read http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html including the user comments.

Doesn't that seem like a lot, just to iterate a million records? I mean, if those records were cached and sorted in your own application's RAM structures, you wouldn't expect it to take so long, I think.
Christopher Wells
Thursday, August 07, 2008
 
 
Mark Griebling:
"I worry anytime I see a calculation in a query that must  perform well.  The DATE() call is a concern.  I would consider adding a seperate date field.  Maybe even as an integer as YYYYMMDD."

Good call.

I've worried about that myself, and I've tried to think of some alternate approaches. The 'start' and 'end' fields of the 'session' table are both DateTime fields, each of which consumes 8 bytes, which is pretty compact, considering all the different types of functions I can use on them.

The DATE function trims off the time, allowing aggregation on just the date portion. But I can also use the DAYOFWEEK, DAYOFMONTH, DAYOFYEAR, HOUR, MONTH, QUARTER, and WEEKOFYEAR functions to provide different types of aggregation in the resultant reports.

I could decompose each timestamp, replacing it with a collection of fields representing all the different date functions. It'd significantly increase the size of each record, which hurts my soul a little bit, but maybe it'd speed up those queries.
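
Something like this, in other words (column names are just a sketch):

ALTER TABLE session
  ADD COLUMN start_year    SMALLINT NOT NULL,
  ADD COLUMN start_quarter TINYINT NOT NULL,
  ADD COLUMN start_month   TINYINT NOT NULL,
  ADD COLUMN start_week    TINYINT NOT NULL,
  ADD COLUMN start_doy     SMALLINT NOT NULL,  -- day of year
  ADD COLUMN start_day     TINYINT NOT NULL,   -- day of month
  ADD COLUMN start_dow     TINYINT NOT NULL,   -- day of week
  ADD COLUMN start_hour    TINYINT NOT NULL;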

I also have an 'event' table, so that the client can report when a user executes certain features in the monitored application (e.g., "spellcheck", "print", "export to pdf", etc.). Each event record includes a timestamp, and those timestamps would need to be decomposed too. But maybe it'd be worth it. I'll have to write some tests.

Hmmmmmm...
BenjiSmith
Thursday, August 07, 2008
 
 
Just an idea: have you tried the tests with a different database?  You might find that, say, Oracle or SQL Server is a lot faster for the sorts of queries you're doing and the volume of data.
MB
Thursday, August 07, 2008
 
 
Not yet. I plan on porting the DAO layer for different DB's, but haven't gotten around to it yet.

Realistically, my target market (small software shops, perhaps not quite mISV's, but almost) will probably be hosting on a linux box, without a full Oracle license.

Although I might be able to get better performance from another DB, I need to get at least acceptable performance from MySQL, since it's likely to be the most convenient DB choice for the majority of my customers.
BenjiSmith
Thursday, August 07, 2008
 
 
Yes, decompose the date. You absolutely don't want to be doing date math on large tables.

This also gets you free aggregation:

SELECT
  session.start_yr,
  session.start_mo,
  session.start_dy,
  COUNT(*) as session_count
FROM session
WHERE
  start_yr = '2008'
  AND start_mo = '5';

This makes it harder to do real date math in the queries, but you don't want to do that anyway. If you want, for instance, a rolling 30-day snapshot at an arbitrary time in the past, you calculate the start and end dates and pass them explicitly to your query.

Date math in reporting queries is evil.
Drew Kime
Thursday, August 07, 2008
 
 
Oops, I accidentally deleted the "GROUP BY" clause. Without it that COUNT won't work.
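
With the GROUP BY restored, it would be:

SELECT
  session.start_yr,
  session.start_mo,
  session.start_dy,
  COUNT(*) as session_count
FROM session
WHERE
  start_yr = '2008'
  AND start_mo = '5'
GROUP BY
  session.start_yr,
  session.start_mo,
  session.start_dy;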
Drew Kime
Thursday, August 07, 2008
 
 
I think I'll give it a try.

Since I don't really need to maintain the minute & second, and since most fields can be represented with a single byte, it actually shouldn't be that bad. Instead of an 8-byte DateTime column, I'll probably have seven or eight fields consuming 10 - 12 bytes.

I'll let you know what kinds of results I see!
BenjiSmith
Thursday, August 07, 2008
 
 
I was once bitten by the date math problem on a completely different platform and situation. Turns out the native date math functions are fine if you use them a few hundred times. But if you're using them 1000s of times or more, you'll get noticeable delays. I ended up writing my own date object and datetime math functions and sped up my application by 50% or something ridiculous like that.

So my bet is on the date calculations being the culprit.
Oliver
Thursday, August 07, 2008
 
 
Star schema. It seems like almost all of your data will be inserted once and then reported on, so there's no need for an OLTP-type database (which is what you have if you have lots and lots of normalized tables, etc.).

Check out wikipedia .... and Google.  This makes building reports simple, and it makes life on your DB a lot nicer.
Anthony Presley
Thursday, August 07, 2008
 
 
In real life these problems are solved using automated pre-aggregation managers, namely materialized views or OLAP servers.

Both MS SQL and Oracle have materialized views. MS SQL comes with an OLAP server.

PostgreSQL may have something too, I'm not sure. With MySQL I suspect you are out of luck.
Object Hater
Friday, August 08, 2008
 
 
Okay, we're in business!

Decomposing the date fields and eliminating the DATE function took the query execution time down from 10 seconds to around 4 seconds.

On a lark, I also converted the storage engine from InnoDB to MyISAM, and now each query executes in about a half second.

I'm also in the process of denormalizing some of my tables, to create more of a star schema. Actually, it looks like I'll have at least three fact tables: 'session', 'environment', and 'event', which will be used under different circumstances, for different types of queries.

Thanks so much for your help and suggestions. It's made a big difference!
BenjiSmith
Friday, August 08, 2008
 
 
Great thread - I had never heard of star schemas:

http://en.wikipedia.org/wiki/Star_schema

Just the knowledge that such a thing exists could be really valuable if some day it saves me/someone from designing a normalized database when a completely different approach might be a lot better for the application...
MB
Friday, August 08, 2008
 
 
Good to hear that. I'll bet once you've got real volume in there, you'd have found the DATE() call taking exponentially longer.
Drew Kime
Saturday, August 09, 2008
 
 
I'd go for a separate data model for recording info and one for reporting. That allows you to optimise indexing for each use.

For reporting I'd just use a flat table with loads of columns, populated daily or hourly or whatever. It will end up big as there will be tons of repeated text descriptions.
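
For example, something like this (columns purely illustrative), re-populated by the batch load:

CREATE TABLE reporting_flat (
  report_date   DATE NOT NULL,
  os_name       VARCHAR(50) NOT NULL,
  app_name      VARCHAR(50) NOT NULL,
  app_version   VARCHAR(20) NOT NULL,
  country       VARCHAR(50) NOT NULL,
  session_count INT NOT NULL,
  total_minutes INT NOT NULL,
  INDEX (report_date),
  INDEX (os_name)
);

Reports are then just SUM()s and GROUP BYs over this one table, with no joins.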

That is effectively a de-normalised star schema (the fact table) which cuts out the join overhead. Most reporting should then just be aggregations.

Designing datasources for efficient reporting is quite different to designing for updates/storage.
OLAP fan
Saturday, August 09, 2008
 
 
That was my initial design as well, but the more I thought about it, the more I realized that I have no need for any OLTP data at all. The *only* use of the data is for reporting, so I'm just creating OLAP records.
BenjiSmith
Saturday, August 09, 2008
 
 
I am a bit late here, but I had to smile; this is exactly the sort of data that my company has been dealing with for a long time. We crunch market research data with our own inverted database tech that builds n-dimensional cubes of aggregated data on the fly.

You are effectively doing the same thing when you 'cross' variables like OS and, say, components used. Generally, with a few hundred thousand records, a cube with 2-3 dimensions is very fast (seconds). The cost doesn't tend to grow with the number of records, but it is sensitive to the number of dimensions and the number of elements within each dimension.
Grant Black
Monday, August 11, 2008
 
 

This topic is archived. No further replies will be accepted.
