The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Premature Optimization?

I need to randomly select a sample of certain "transactions" from a table that I already have. This table has a column that is a "Running Total" of all the amounts up to that point in the table (think receipts up to a certain date/transaction). The random number generated is between 0 and the total amount of the transactions. So if the random number is 1890, it finds the first receipt whose running total is $1890 or higher (if, say, the running total skips from 1800 to 2000). During the loop I check whether the current random number is larger than the last random number I generated, and if so I keep scanning from the previous cursor position (cursor = row number) instead of starting over from the top.

Dim currentRand, lastRand, cursor As Integer
Dim rand As New Random()

For i As Integer = 0 To 10
    currentRand = rand.Next(0, intTotal)
    ' Only rewind the cursor when the new random number is smaller than
    ' the last one; otherwise keep scanning forward from where it stopped.
    If currentRand < lastRand Then
        cursor = 0
    End If
    ...
    While Table.Rows(cursor).Item("RunningTotal") < currentRand
        cursor += 1
    End While

    'do stuff
    lastRand = currentRand
Next


Is this premature optimization on my part? It just occurred to me as I was writing the code; I haven't done any profiling that would tell me how much time is actually spent looping through that table. I just kind of assumed that this would be a safe optimization to make, and that it would save some time when someone loads several thousand transactions from which a large sample will be grabbed.

So what's your call? Worth it? Should I have waited? Feel free to unload; I'm trying to improve. BTW, I have taken care of boundary conditions, and the code has been modified slightly to make it easier to understand. I'd like criticism directed at the algorithm and the wisdom of implementing it early, more so than at my coding conventions. And yes, it is VB. Language snobs need not respond.

Thanks
tim
Wednesday, May 11, 2005
 
 
First, I have to say I'm not a SQL guru and I'm not familiar with VB.NET, but the concept is the same in any language.

You're reading data from every preceding row in the table.* If that's the way it must be done, then I wouldn't say your optimization is premature: you can see at once that this is likely to be a bottleneck.

A more SQL-like approach would be to let the database do the work for you (pseudo-code):

For i = 0 To 10:
    currentRand = GetRandomNumber()
    SELECT somefield FROM Table
        WHERE RunningTotal > currentRand
        ORDER BY RunningTotal
        LIMIT 1
    ' do stuff

I don't know how efficient that is (and I think MS SQL uses TOP 1 instead of LIMIT 1), but it is easier to understand. And, as a nice benefit, it's usually much faster to let the SQL engine find the data than to do row-by-row scanning, especially if you've indexed the column in the ORDER BY clause.

* When you say you're reading from a table, I'm going to assume it's actually the results of a query with an ORDER BY clause. Otherwise, any "order" that the table may be in is implementation-dependent/coincidental.
Nate Silva
Wednesday, May 11, 2005
 
 
Oops, that should've been:
WHERE RunningTotal >= currentRand
Nate Silva
Wednesday, May 11, 2005
 
 
Sorry to keep replying, but if you just need to grab 10 random rows for analysis:

SELECT *
FROM Table
ORDER BY RAND()
LIMIT 10
Nate Silva
Wednesday, May 11, 2005
 
 
Thanks for the input, and I'll keep that in mind if I have to do this directly w/ an SQL server of whatever kind. In this case I'm loading a .net DataSet from an excel file of values, selecting random records and doing some more calc's on them before giving the user results. But I do appreciate your input as I haven't had to do much w/ sql yet, and it's nice to know how I would do this in that case. Thanks.
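
For reference, here's a minimal sketch of how that Excel load might look through the OleDb provider. The workbook name, sheet name, and DataTable name are illustrative only, not from my actual code, and I'm assuming the sheet has a header row:

Imports System.Data
Imports System.Data.OleDb

Module ExcelLoadSketch
    Sub Main()
        ' Jet connection string for an .xls file whose first row holds column names.
        Dim connStr As String = _
            "Provider=Microsoft.Jet.OLEDB.4.0;" & _
            "Data Source=transactions.xls;" & _
            "Extended Properties=""Excel 8.0;HDR=Yes"""

        Dim conn As New OleDbConnection(connStr)
        Dim adapter As New OleDbDataAdapter("SELECT * FROM [Sheet1$]", conn)

        ' Fill opens and closes the connection on its own when it starts out closed.
        Dim ds As New DataSet()
        adapter.Fill(ds, "Transactions")

        ' ds.Tables("Transactions") now holds the rows that the sampling loop walks.
    End Sub
End Module

Since the adapter takes a SELECT statement, the WHERE/ORDER BY idea above could in principle be pushed into that query as well.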
tim
Wednesday, May 11, 2005
 
 
I'd actually approach the problem slightly differently if I were you.

Instead of just keeping track of the transactions in the database, I'd also keep track of the running total. It's easy to compute at the time of the insert and very costly to compute at query time. Then, when you write your query, all you have to do is search for the first instance where the running total exceeds your query amount.
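
Here's a minimal sketch of that idea applied to an in-memory DataTable, with the running total maintained as each row is added. The column names and sample amounts are placeholders of mine, not anything from the original post:

Imports System
Imports System.Data

Module RunningTotalSketch
    Sub Main()
        Dim table As New DataTable("Transactions")
        table.Columns.Add("Amount", GetType(Decimal))
        table.Columns.Add("RunningTotal", GetType(Decimal))

        ' Compute the running total once, at insert time.
        Dim total As Decimal = 0
        Dim amounts() As Decimal = New Decimal() {100D, 250D, 75D}
        For Each amount As Decimal In amounts
            total += amount
            table.Rows.Add(amount, total)
        Next

        ' At query time, finding the first row that covers a target amount
        ' is just a filtered, sorted Select.
        Dim target As Decimal = 180D
        Dim hits() As DataRow = _
            table.Select("RunningTotal >= " & target.ToString(), "RunningTotal ASC")
        If hits.Length > 0 Then
            Console.WriteLine("Picked amount: " & hits(0)("Amount").ToString())
        End If
    End Sub
End Module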
BenjiSmith
Wednesday, May 11, 2005
 
 
Generate all random numbers first, sort them, then find the rows you want in a single pass through the data.
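
A rough sketch of what that could look like against the in-memory table from the original post (the table and column names, and the count of ten samples, are assumptions on my part):

Imports System
Imports System.Data

Module SortedSampleSketch
    Sub PickSample(ByVal table As DataTable, ByVal intTotal As Integer)
        Dim rand As New Random()
        Dim targets(9) As Integer

        ' 1. Generate all of the random targets up front.
        For i As Integer = 0 To 9
            targets(i) = rand.Next(0, intTotal)
        Next

        ' 2. Sort them so the table only ever needs to be scanned forward.
        Array.Sort(targets)

        ' 3. One pass: the cursor never moves backwards.
        Dim cursor As Integer = 0
        For Each target As Integer In targets
            While cursor < table.Rows.Count - 1 AndAlso _
                  CInt(table.Rows(cursor)("RunningTotal")) < target
                cursor += 1
            End While
            ' "do stuff" with table.Rows(cursor) here
        Next
    End Sub
End Module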
TomH
Thursday, May 12, 2005
 
 
OP said "Thanks for the input, and I'll keep that in mind if I have to do this directly w/ an SQL server of whatever kind. In this case I'm loading a .net DataSet from an excel file of values, selecting random records and doing some more calc's on them before giving the user results."

I thought that using the data source providers in .NET allowed you to use common commands (in SQL) to access data in various formats, so you could in fact write SQL to select rows out of a plain text file or an Excel spreadsheet as well as a database.

In any case, the suggestion formed in SQL is: allocate each row a random number, sort by that random number column, and pick the first/last N results.

It could be faster still to generate your N random numbers, presort that small set, and traverse the result set only once to find each value.

Can you have a $0 transaction?
 
Or, instead of generating the random number from the running-total value in the cell (which, as you said, can jump), generate random row numbers, which are unique and match exactly.
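
A rough sketch of that row-number approach, assuming the data sits in a DataTable as in the original post. Note that this samples every row with equal probability, whereas the running-total scheme weights rows by their amounts, so it only fits if that's the kind of sample you want:

Imports System
Imports System.Data

Module RowNumberSketch
    Sub PickByRowNumber(ByVal table As DataTable)
        Dim rand As New Random()

        For i As Integer = 0 To 9
            ' A row index is unique and matches exactly, so no scanning is required.
            Dim row As DataRow = table.Rows(rand.Next(0, table.Rows.Count))
            ' "do stuff" with row here
        Next
    End Sub
End Module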
 
What sort of sample of the data are you looking for?
gorf
Friday, May 13, 2005
 
 

This topic is archived. No further replies will be accepted.
