techInterview Discussion

Answers to technical interview questions.

Your host: Michael Pryor

counting set bits in an integer

Are there better ways to do this than the following:

char arr[256] = {0, 1, 1, 2, 1, 2, 2, 3, 1, ...........};

int count_set_bits(int num)
{
    unsigned int u = num;  /* avoid shifting a signed value */
    unsigned char byte0 = u & 0xff;
    unsigned char byte1 = (u >> 8) & 0xff;
    unsigned char byte2 = (u >> 16) & 0xff;
    unsigned char byte3 = (u >> 24) & 0xff;

    return arr[byte0] + arr[byte1] + arr[byte2] + arr[byte3];
}

I am aware of the version that uses a while loop and shifts,
but that could take up to 32 iterations in the worst case.

chrbohn Send private email
Wednesday, January 02, 2008
int num_bits_set(int num)
{
    int count = 0;
    while (num != 0) {
        num = num & (num - 1); /* clears the lowest set bit */
        count++;
    }
    return count;
}
dark horse
Wednesday, January 02, 2008
This takes O(k) iterations, where k is the number of bits set in the number.
dark horse
Wednesday, January 02, 2008
The one true method of counting bits comes from MIT Hakmem 169 (assuming 32-bit int):

inline UINT32 Count(UINT32 n)
{
    UINT32 c = n;
    c -= (n >> 1) & 033333333333;
    c -= (n >> 2) & 011111111111;
    c = (c + (c >> 3)) & 030707070707; /* >> binds looser than +, so parenthesize */
    return c % 63;
}

This can be optimized to 11 register-to-register operations, and has no conditional branches to mess up CPU pipelining.
Skorj Send private email
Thursday, January 03, 2008
@Dark Horse

That algorithm (Kernighan's?) is just awful!  It's slow, and it relies on a subtle math trick, so it's not immediately obvious why it works.

"Slow and obvious" and "fast and subtle" both have their places, but "slow and subtle" doesn't.

The "slow and obvious" solution:

UINT32 Count(UINT32 n)
{
    UINT32 c = 0;
    for ( ; n != 0; n >>= 1)
        c += n & 1;
    return c;
}
Skorj Send private email
Thursday, January 03, 2008
Yeah, but it's certainly faster than yours, and not so obvious either!
dark horse
Friday, January 04, 2008
Also, it's not machine dependent!
dark horse
Friday, January 04, 2008
My fast method is quite fast on modern processors (there's a similar approach that avoids the divide, but Intel and AMD both say that divide is fast where it pipelines well).  My obvious method works even on architectures other than two's complement.  ;)
Skorj Send private email
Friday, January 04, 2008
I did some quick and dirty testing of all the three functions, and MIT Hakmem 169 was consistently faster than the other two.

Running each function 2^32 times while counting the bits of i (where i is the loop control variable):

Q6600, VS2008, Release build
MIT Hakmem 169 1x
Kernighan 3x
Simple 5x

PIII 500 Mhz, GCC 4.1.1, -O2
MIT Hakmem 169 1x
Kernighan 3x
Simple 9x
chrbohn Send private email
Monday, January 07, 2008
If you lay out your lookup table in a 16x16 grid, you'll see both the first row and column are:

0 1 1 2 1 2 2 3 1 2 2 3 2 3 3 4

On top of that, if you find the number you're looking for in that grid, and add together the first numbers in its row and column, you'll get the number of bits set in that number.

The easiest way to get the row and column index is a simple divide and modulus. For example:

# say we’re looking for the number of bits in 243
x = 243 / 16 => 15 (int)
y = 243 % 16 => 3

# "a" is the lookup table from above
a[3] + a[15] = 2 + 4 = 6

# Are there six bits set in the number 243?
243 = 11110011 # Indeed there are

You only need 2^(N/2) entries in your lookup table for an N-bit number, so you can get away with 16 instead of 256. The fastest method is still the fancy one from above, but I figured this one out when looking for patterns like Joel mentioned in his article.

I wrote my own article about it a few months ago, which you can read here:

Most of the code's in Ruby, but it's easy to understand, and there's another pattern to the numbers I go into more detail on in the article.

Hope this helps.
Chris Doggett Send private email
Tuesday, January 08, 2008
A 16-byte lookup table would actually be pretty good.  The problem with lookup tables in general is that performance on modern processors is all about pipelining and multiple cache layers, and even a 256-byte lookup table is a problem in that regard (although a 256-byte lookup table is still pretty darn fast for most problems).

A 16-byte lookup table is small enough to stay in the upper cache layers and be fetched very fast.  I don't know whether 2 16-byte lookups would actually be faster than 1 256-byte lookup, but it would be an interesting experiment (on a busy system - on a system that's doing nothing else the 256-byte lookup probably wins as there's no cache contention).
Skorj Send private email
Tuesday, January 08, 2008

The method I mentioned would only need one lookup, since the first row and column contain the same numbers. No need to duplicate them, so you should be able to do it in 16 bytes.

By the same method, you could get all of the set bits in a 16-bit integer with 256 bytes, or a 32-bit int in 65536 bytes.

If you look at the other pattern I mention in my article, and you're clever enough, you could probably come up with some sort of quadtree algorithm and just calculate the additions you'd need to get there from:

0 1
1 2

I'm not that clever yet.
Chris Doggett Send private email
Tuesday, January 08, 2008
Tried fiddling around with this problem.

I came up with a solution that seems a bit more obvious than the HAKMEM one, even though it's maybe a bit slower. (What's the cost of %, anyway?)

long countBit(long n)
{
    n = (n & 0x55555555) + ((n >> 1) & 0x55555555);
    // possible pair value    (0-2): 00 01 10
    n = (n & 0x33333333) + ((n >> 2) & 0x33333333);
    // possible quadbit value (0-4): 0000 0001 0010 0011 0100
    n = (n & 0x0f0f0f0f) + ((n >> 4) & 0x0f0f0f0f);
    // possible byte value    (0-8): 0x00 to 0x08

    // No more masks needed here: the byte sums are too small to carry
    // into the neighboring byte, so just mask the final byte.

    n = n + (n >> 8);                 // 0x00 - 0x10 per byte pair
    n = (n + (n >> 16)) & 0x000000ff; // 0x00 to 0x20

    // I don't like the % operator; what's the average cycle count for %?

    // For a 64-bit int, extend the masks and add one more shift/add,
    // moving the last mask to the end (same again for 128 bits):
    // n = (n + (n >> 32)) & 0x7f;    // possible value (0-64)
    return n;
}

I also like that it runs in constant time, while the loop can take 0 to n iterations
and the table approach uses one more lookup per byte
(a lookup meaning a >>, a +, an &, and an index operation).

my 2 cents
claude Send private email
Friday, February 01, 2008

This topic is archived. No further replies will be accepted.
