The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Random with no repeats algorithm?

Specifically, I've decided to code a trivia game that asks the user a question and awards him a point if he supplies the right answer. Once the user has correctly answered the question, I'd like to avoid asking him the same question again. Once he's answered all of them, he's essentially "finished" the game, at least until I've added more questions and answers.

In practical terms, I expect a ballpark maximum of a million users and about five thousand questions. However, I see it potentially going up to about 25 million users and 10,000 questions after five years.

Initially, I'd planned to record the questions answered (by unique numerical id) for each user, randomly select a number and see if there's a match. If there is, randomly select another number and so on. The absurd scenario is that the user has answered 999 questions and there's only one question left - it may take an incredibly long time for the system to get around to that one unasked question.

Alternatively, I can give each user a list of questions, 1 through 1000 (or whatever the current number is) and randomly select elements from the array. As the user answers questions, the corresponding elements get subtracted. However, I'm worried that if I get up to a million users and the vast majority only answer a few questions before giving up, I'm occupying a lot of memory and disk space that I don't need to.

Is there a third option?

Do I need to a half and half approach where I start out using the first technique and switch to the second once the user has answered a certain number of questions?

Or should I not worry about efficiency at all and plan to scale up the hardware as I go along?
Tuesday, March 20, 2007
You're already storing the list of questions that each user has answered.

So the next question for each user is randomly selected from the subset which is the set of all questions minus the set of questions answered by this user ... but, you can create this subset just-in-time based on the sets which you already have stored (so this subset occupies no persistent storage).
Christopher Wells Send private email
Tuesday, March 20, 2007
I like that approach. It's true it occupies no storage, but I'm not sure calculating the... outer join(?) every time really buys me much. I take that back, it's an O(n) operation so it would be pretty efficient.
Tuesday, March 20, 2007
I've accomplished this in the past by simply sliding up to the next item when the random index hits an already used item.  I'm sure this isn't a sound technique if you need mathematically correct randomness since unused items one slot up from used items effectively get their probability of selection doubled (since there are now two slots for the one item).  But it works well if you just need the selection to be just perceived as random.  I have a feeling that a lot of music players use this sort of algorithm judging from their behavior.
SomeBody Send private email
Tuesday, March 20, 2007
Hmm...  interesting. And with a trivia style game, there's not much harm in making it a circular list. Worst case scenario, it's still O(n).
Tuesday, March 20, 2007
I think the other answers are better for your particular case, but in general what you are looking for is a "pseudo-random permutation".  The standard (cryptographic) way of doing this is to encrypt a counter with a secret key.
bmm6o Send private email
Tuesday, March 20, 2007
Why does the list need to be "random"? Does it matter if users answer questions in the same order?

Any list looks completely random to a single user. Unless two users compare the order of questions then no one will even know the difference. Basically what I'm getting at is that you could simply store an index to the current question for each user and questions are always answered in the same order but people don't really realize that. When a new user logs in you randomly pick the starting question and store it off. Then you simply increment the pointer. Since each user potentially starts on a different question it appears to be completely random.

That's just my two cents.
anony mouse
Tuesday, March 20, 2007
You could limit your searches by queuing up the next X questions for each user in a database table (since finding the first X entries may cost just about the same as finding the next entry).  If the queue has items for the user, skip the search and just grab the next one.  If the queue is empty, find X more items, set them up in your table, then try again.

This queuing up can also be run during slow times (crontabs at night), prepopulating users' info to keep the cache as full as possible.

It's an optimization -- use only if necessary.
Derek I
Wednesday, March 21, 2007
Do you care if every user's game is truly different from another's game.  Surely it would be enough to make the users **believe** it was.

What about pre-generating random sequences of question orders?  Let's say you generate 26 (or 100 or whatever) sequences of 1,000 items.

I'll illustrate with 5 items for an example

Sequence A:  1,2,3,4,5
Sequence B:  3,2,5,4,1
Sequence C:  4,5,2,3,1

You generate these ahead of time.

Now for each player, you don't store a new sequence, you just store 2 things -
(i) which sequence that player is using,
(ii) where in the sequence the player is.

Or you could treat the sequences as circular, and also (iii) store the starting pos in the sequence

Even if you have a lot of players, and even if they are in communication with each other, they'll never catch on to the sequences being a limited set of possibilities.
S. Tanna
Wednesday, March 21, 2007
create a bit string for each user, with a 0, fore questions that they haven't seen, and 1 for questions that they have seen.
EvilTeach Send private email
Wednesday, March 21, 2007
I would say keep it as simple as possible.  Don't get too carried away with pregenerating sequences or tracking sequences, because it will make it much more difficult to add questions in the future.  Also, you would force your users to stay on a single question until they got it right.  Instead, just pick a number of questions that need to be answered correctly and don't worry about duplicates.  You can now create different recognition levels and publish high scorers.  With 5,000 questions, it is doubtful anyone would notice duplicates.
Wayne M.
Wednesday, March 21, 2007
Haven't read the other responses, but you can generally just seed the random number generator and then keep track of what number you are on:

Depends on your language of course.
Wednesday, March 21, 2007
If you are willing to use an array of unsigned shorts (with length = number of questions) per user:
also keep first unanswered question index (initially 0)
initialize array to 1,2 ...N
to pick a random unanswered question
choose first_unanswered + rand(N-first_unanswered)
if question is answered, swap answeredquestion with first_unanswered) and increment first_unanswered
Wednesday, March 21, 2007
There's a simple trick to picking a random unanswered question efficiently:

int GetUnaskedQuestion(void)
int result = -1;
int chance = 1;
for (int i=0; i < NumberOfQuestions(); i++)
  if (NotAsked(i))
    if ((rand() % chance) == 0)
      result = i;

if (result != -1)
  // Found an unasked question
  // None left.
  // We could mark them all as unasked and try again
  // Or do nothing and return -1 to indicate failure

return result;

The way it works is that you pick the first unasked question 100% of the time. You then replace that with the second one 50% of the time (which is exactly what you want if there's only two questions), replace that random selection with the third question 33% of the time, etc.

The only thing to watch out for is that your random number generator needs a range much bigger than the number of questions in your list to get good randomness.

If you don't want to store 1 bit per question there's another way, but it isn't so random. You simply store an offset into the list and a step value. Initial offset and step value are random. The number of questions has to be a prime number. You then simply do: offset = (offset + step) % QuestionCount; to select the next random question, and you can detect when you get back to the start easily if need be. If you don't have a prime number of questions you can just pick a bigger prime number and repeat the step till you get an index in range.
Wednesday, March 21, 2007
Look at the entry for "Linear Feedback Shift Register" (LFSR) on Wikipedia.  ( This algorithm will give you exactly what you're looking for with minimal storage.  It's only major limitation is that you will be stuck with a fixed number of questions based on the size of the LFSR.
Wednesday, March 21, 2007
How is this different from a standard deck shuffling algorithm?
Wednesday, March 21, 2007
If you want to store as little state as possible, each user's username (or some hash of it or whatever) could be the seed into a function that returns a random permutation of a list.

I haven't really thought about it, but it might be tricky and do that and be able to add new questions. You could store the number of questions that existed at the time of the user's creation, and then move to the next set of questions once the user exhuasts all questions that existed when the user was created. That's a lame hack; I'm sure there' a good way to do it though.

This requires storing three indices per user, but disk space is really really cheap, so I like C. Well's algorithm better. Each user needs about 10k bits = 1.25KByytes of storage. Aren't HDs about $.20/GB now? 25e6 * (.20/1e9) * 1.25e4 = not much.
Wednesday, March 21, 2007
+1 for generating a fixed number of non-repeating sequences and remembering for each user which sequence they are using and where they are in it. For large numbers of users the storage use per-user is negligible. For a large enough number of sequences the users will never notice.

(They probably will notice if you have one sequence and start people at different place in it - they will notice that when they see one question the same question always follows it).
DJ Clayworth
Wednesday, March 21, 2007
I wrote an mp3 shuffler/player in delphi which just stored 3 or 4 integers to record its position in the playlist. It used a small period linear congruential generator (LCG) for the order within the playlist and then used another generator to make up new parameters once the playlist had run out. They're fairly easy to run backwards (which I needed for 1 track backward button).

LCG's are really easy.

global int seed, AParameter, CParameter, ModParameter;

function LCGrandom(maxreturn)
    seed = (seed * AParameter+CParaMeter) % ModParamter;
  until seed<=maxreturn
  return seed;

AParameter has to have the lowest 3 bits 101
CParameter has to have its lowest bit 1 (it's odd)
ModParameter is a power of 2 (and how long your sequence will be). Must be bigger than maxreturn.
setsquare Send private email
Wednesday, March 21, 2007
Here's how i'd do it....

In the user's table store a random seed, and how many questions he has seen.

For every question:
 for (i=0; i<numquestions;i++){
 questionok=!rowsreturned('select questionid from questions answered where userid=:userid and questionid=:questionid',userid,questionid);

p.s. this is written in a pseudo-language, similar to C/Java ...etc. :P
Thursday, March 22, 2007
Hmm... I like the general solution proposed by Adam and Delphite, but I think it can use some additional tweaking. Essentially, I'll need to associate the random result with a corresponding question. Consider this scenario...

1,000 questions. TheDavid has answered questions {3,48,72,134,863,998}.

Adam's question steps through 1, 2, 3, 4, 5, etc, etc, and while the probability of stopping on 5 is correct, if 999 is the last unasked question, we hit the worst case scenario. I think.

Delphite asks just for one random number from 0 to 993 which would be ideal, but we'd never get to question 999.

I'll look at LCGs and LSFRs over the weekend. I did decide that I do not want to rely on pregenerated sequences, so unfortunately that lets out the card shuffling algorithms. As Wayne suggested, I expect to hit a point where I'm adding questions on a daily basis while the game is live. Hmm... I just realized that requirement alone will force me to keep track of which questions have been answered as opposed to which questions have not been answered.
Thursday, March 22, 2007
I ran into this problem at one site and let them know about it.  I do not care if there are duplications with a quiz at *another* time.  It is within the same quiz that I am concerned about.

If you are asking ten questions at a time, pick ten different ones.  At another time, do the same.  If there is duplication between the two, it is probably not much of an issue.

Is it an issue?  If it is, then keep track of the questions asked.  If it is not, go back to keeping it simple.


Gene Wirchenko
Gene Wirchenko Send private email
Thursday, March 22, 2007
For my initial purposes, each user essentially gets one quiz and can leave and come back at any time. I had hoped at some point to be able to show a "top ten" list, either showing who's answered the most questions, or who's had the highest accuracy rate. If the site becomes popular, I'd like to award token prizes.

In such a scenario, I don't want someone to memorize the answers to the point where he can pad his score by answering the same questions over and over. The only thing that really changes from user to user is the sequence of questions. If he misses a question, he gets another chance later (it gets shuffled back into the deck). Once he's answered them all, he can't play anymore unless he creates a new account.
Thursday, March 22, 2007
"Once he's answered them all, he can't play anymore unless he creates a new account."

...or I add new questions to the deck.
Thursday, March 22, 2007
It's rather easier than most of these earlier responses if you don't need true randomness, but the appearance of randomness when two people compare the order of questions.

Step 1: make one master question list, and make it a circular list

Step 2: randomize the starting question for this user

Step 3: randomize how many questions to skip per question (I mean, ask this user every 5th question from his starting point and that one every 13th, or whatever). Let's call that the stepping_ factor number.

Step 4:: each time the user wraps around the end of the list, simply move up one more slot (second time through, do starting question +1, third time through, do SQ+2). This number, which I'll call slot_bump, is the same as the trips past the starting point if you start counting at 0, so you can just increment it every time you pass the element this user started at.

Step 5: When the user wraps and the slot_bump equals the stepping_factor, they're done.

There's no need to even keep track of which questions are asked -- the counter (stepping_factor) and the end-of-list adjustment (slot_bump) keep track of everything you really need for this user in relation to a fixed-length circular list.

This solution does, admittedly, not account for insertions to the list while the user is working through it. Perhaps you can add new questions to new lists, and change which list the user is using once they're done with one?
Christopher E. Stith Send private email
Friday, March 23, 2007

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
Powered by FogBugz