The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Set intersection algorithm

Hi all,

We're optimizing a security function in a CMS that checks to see if the user requesting a page is a member of a group that has permission to view the page.

For example, we might have a page "Human Resources" that has view permission for group ids (23, 45, 58, 62).

User A is a member of security groups (12, 33, 45)
User B is a member of security groups (92, 12)

User A gets access to the page (via group 45).  User B is denied.

We're currently looping through each user array, doing a search in the page array for each element.  It works, but we're hoping we can do it a lot faster.


One approach we discussed, and some initial testing showed a lot of promise, was using bitwise operators. Each group is assigned a distinct single-bit value (001, 010, 100, etc.).  A user and a page each store the sum of their groups' bit values, and we simply do a bitwise AND on the two summed numbers.  If the result is greater than 0 we let the user in; if not, the user is denied.

This approach is very fast; however, we have a limit of 64 security groups due to the size of a bigint.  This isn't acceptable.
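
In code the check is a single AND per request. A rough Java sketch (the group-to-bit assignments below are made up for illustration):

public class BitwiseCheck {
    // Each of at most 64 security groups is assigned one bit in a long.
    static boolean canView(long userGroupBits, long pageGroupBits) {
        // A non-zero AND means the user and the page share at least one group.
        return (userGroupBits & pageGroupBits) != 0;
    }

    public static void main(String[] args) {
        long page  = (1L << 3) | (1L << 7);   // page viewable by groups 3 and 7
        long userA = (1L << 1) | (1L << 7);   // user A belongs to groups 1 and 7
        long userB = (1L << 2) | (1L << 5);   // user B belongs to groups 2 and 5
        System.out.println(canView(userA, page));  // true
        System.out.println(canView(userB, page));  // false
    }
}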

Does anyone know of an approach/algorithm with performance comparable to bitwise, but without a limit (or at least a much higher limit) to the total number of security groups?

Derek
Wednesday, February 20, 2008
 
 
Sort both sets, and iterate through them.  Increment the iterator that points to the lower number.  If the iterators ever point to the same value, that's the intersection you're interested in.  If you get all the way to the end, no intersection.  Work should be linear in the sum of the sizes of the sets.
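
In Java it's roughly this (a sketch; it assumes both arrays are already sorted):

public class SortedIntersect {
    // Returns true if the two sorted arrays share at least one value.
    static boolean intersects(int[] a, int[] b) {
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) return true;   // shared group found
            if (a[i] < b[j]) i++;            // advance whichever side is smaller
            else j++;
        }
        return false;                        // ran off one end: no overlap
    }

    public static void main(String[] args) {
        int[] page  = {23, 45, 58, 62};
        int[] userA = {12, 33, 45};
        int[] userB = {12, 92};              // (92, 12) sorted first
        System.out.println(intersects(userA, page));  // true, via group 45
        System.out.println(intersects(userB, page));  // false
    }
}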
Brian McKeever
Wednesday, February 20, 2008
 
 
I don't see why you have a limit of 64 bits (bigint).
Why not assign an array of bigints and iterate over the array? i.e. if no permission in bigint[0], try bigint[1], etc. How many bits do you need?
Just A Guy
Wednesday, February 20, 2008
 
 
Maybe I should clarify...

Say you need considerably more than 2^64 bits to represent the permissions.

So, make part of each 2^64 bits an index into an array of 64 bit integers (unsigned of course). Let's use the top 4 bits as an example.

So, each permission is set up to have two components: the top 4 bits to represent the offset into the array (0 -> 15) and the other 60 to represent permission bits.

Now you have 16 times 60 bits (960 bits) to play with.

The matching algorithm is simply to mask off the top 4 bits, shift them down into the index range (0-15), index to the appropriate 64-bit integer, and AND it with the remaining 60-bit mask to test for a match.
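
In Java terms it might look roughly like this (the packing details are illustrative, not a real schema):

public class IndexedMask {
    // A permission packs an array slot in its top 4 bits and a bit mask in its
    // low 60 bits; slots[] holds the user's 16 x 64-bit permission words.
    static boolean matches(long permission, long[] slots) {
        int index = (int) (permission >>> 60);         // top 4 bits -> slot 0..15
        long mask = permission & 0x0FFFFFFFFFFFFFFFL;  // low 60 bits -> bits to test
        return (slots[index] & mask) != 0;
    }

    public static void main(String[] args) {
        long[] userSlots = new long[16];
        userSlots[2] = 1L << 5;                        // user has bit 5 set in slot 2
        long perm = (2L << 60) | (1L << 5);            // same slot, same bit
        System.out.println(matches(perm, userSlots));  // true
    }
}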
Just A Guy
Wednesday, February 20, 2008
 
 
Gack! everywhere I said 2^64 should read just plain 64 bits. I'm going to grab a fresh coffee...
Just A Guy
Wednesday, February 20, 2008
 
 
First off, profile what you've got. Is it really a problem, or does it just smell slow? How large are the sets in question?

What you have now is O(N*M). Unless your sets are quite large, you won't beat that by enough to matter with anything that requires sorting. If the sets are already sorted you might as well try Brian's method.
clcr
Wednesday, February 20, 2008
 
 
Fastest is to use a bit array (http://en.wikipedia.org/wiki/Bit_array) to represent the groups.  Each group would correspond to 1 bit.

Then simply bitwise-AND the user's groups with the groups allowed to use the resource. If the result is zero, the user does not have permission to access the resource.

For example, with 8 groups, User is { 1, 4, 5 } -> 00110010.  Folder has permissions { 2, 3 } -> 00001100.  ANDing them together gives 00000000.  Thus the user does not have permission.

It's a bit of work to set up in C/C++.  But Java and C# have built-in classes that can do most of the work for you.  Note that you probably have to change how the permissions are stored.  If you convert them to the bit array every time you check permissions, you may as well use the sorted-search method posted above; it will probably have the same speed.
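
For example, with java.util.BitSet (a minimal sketch reusing the sample group ids from the original post):

import java.util.BitSet;

public class BitSetCheck {
    public static void main(String[] args) {
        BitSet page = new BitSet();
        page.set(23); page.set(45); page.set(58); page.set(62);  // groups that may view

        BitSet userA = new BitSet();
        userA.set(12); userA.set(33); userA.set(45);             // user A's groups

        // intersects() is the bitwise-AND test, with no 64-group ceiling.
        System.out.println(userA.intersects(page));              // true
    }
}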
Rohan Verghese
Wednesday, February 20, 2008
 
 
++ profile to be sure this is REALLY a bottleneck

If it IS a bottleneck, then you need to replace your brute force linear search with a better search method.

I would suggest storing the set values in a structure that supports rapid searching, either some kind of tree, or, better yet, a hash table. You'll lose out a bit on storage efficiency, but you'll go from O(n) with the linear search to either O(log(n)) for the tree or O(1) for the hash table.

Now, if you want to get the intersection of two sets, you simply iterate over the first set (of cardinality M) and check for the existence of each element in the other set (of cardinality N). We go from O(M*N) for the linear search to either O(M*log(N)) for the tree or O(M) for the hash table. For extra credit, iterate over the smaller set for maximum speed.
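
A minimal Java sketch of the hash-table version (names are illustrative; it just uses the built-in HashSet):

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HashIntersect {
    // True if the sets share at least one element; iterate the smaller one.
    static <T> boolean intersects(Set<T> a, Set<T> b) {
        Set<T> small = a.size() <= b.size() ? a : b;
        Set<T> large = (small == a) ? b : a;
        for (T x : small) {
            if (large.contains(x)) return true;   // expected O(1) lookup
        }
        return false;
    }

    public static void main(String[] args) {
        Set<Integer> page = new HashSet<>(List.of(23, 45, 58, 62));
        Set<Integer> user = new HashSet<>(List.of(12, 33, 45));
        System.out.println(intersects(user, page));  // true
    }
}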

(to all the folks who say that you don't need a computer science education to be a good programmer, this question is the counter-example)
Jeffrey Dutky
Wednesday, February 20, 2008
 
 
to Jeff D.

While I agree with your numbers for the lookup, I think you also have to take into account the time to build the permission and user sets, since no mention is made of whether the sets are persisted across sessions or even within a session.

If build times also have to be factored in, solutions like b*-trees become less attractive.
Just A Guy
Wednesday, February 20, 2008
 
 
Instead of doing the set intersection when you CHECK permissions, why not do it when you SET permissions? That way writing permissions becomes the slow operation and reading them becomes fast.

For instance, consider this simple membership hierarchy:

Group A
  User 1
  User 2

Say you want to apply a permission to Group A. You want all users in Group A to have that permission, right?

So, when you add the permission to Group A, find all members of Group A and apply the permission directly to those users AND indicate that those user permission values were inherited and not applied directly to the user.

This same algorithm must be executed whenever you modify a Group's membership list.

This can be represented using the following tables:

Group (Id, Name)
User (Id, Name, etc.)
GroupMembership (GroupId, UserId)

Then you have your permissions table:

MemberPermission
(MemberType, MemberId, PermissionId, PermissionValue)

where MemberType is 1 = User, 2 = Group and MemberId would then correspond to either a User.Id or a Group.Id depending on the value in MemberType.

Okay, so now you can add a 5th column into MemberPermission called 'Inherited' which indicates if the permission value was inherited from a group membership or applied directly to the member.

Now, to check permissions you don't have to do any set intersection; you just ask for the permissions of that user.
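
As a rough illustration only, here's the shape of it in Java with in-memory stand-ins for those tables (all names are made up, not the real schema):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class WriteTimePermissions {
    static Map<Integer, Set<Integer>> groupMembers    = new HashMap<>(); // groupId -> userIds
    static Map<Integer, Set<Integer>> userPermissions = new HashMap<>(); // userId -> permissionIds

    // Granting a permission to a group pushes it down to every member
    // (the "inherited" rows); the slow work happens here, at write time.
    static void grantToGroup(int groupId, int permissionId) {
        for (int userId : groupMembers.getOrDefault(groupId, Set.of())) {
            userPermissions.computeIfAbsent(userId, k -> new HashSet<>()).add(permissionId);
        }
    }

    // The read path is a single lookup, with no set intersection at all.
    static boolean canView(int userId, int permissionId) {
        return userPermissions.getOrDefault(userId, Set.of()).contains(permissionId);
    }

    public static void main(String[] args) {
        groupMembers.put(23, Set.of(1, 2));   // group 23 contains users 1 and 2
        grantToGroup(23, 100);                // grant permission 100 to group 23
        System.out.println(canView(1, 100));  // true (inherited via group 23)
        System.out.println(canView(3, 100));  // false
    }
}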
Wayne Bloss
Wednesday, February 20, 2008
 
 
Sorry, meant to say "This same algorithm must *also* be executed whenever you modify a Group's membership list."
Wayne Bloss
Wednesday, February 20, 2008
 
 
a)  You should NOT be doing this optimization yourself.

b)  If you were using a hash set implementation, such as the tested and proven one which almost certainly exists already on your development platform of choice, this would be an O(min(n,m)) operation instead of O(n*m), because you can check whether an element exists in a set in O(1).  This would also free you from the space constraints of bitwise.  The cost is a wee bit of extra memory.

c)  I am almost certain that this is not your bottleneck.  If you're running a CMS, you have a *full web stack* going -- add another 5,000 operations to the time it takes to access a web page, big deal, it won't move the needle at all.
Patrick McKenzie (Bingo Card Creator)
Wednesday, February 20, 2008
 
 
Thanks everyone for the suggestions.

We have done extensive profiling and this is a real hotspot for us currently.  It's not a big issue on page loads, but on navigation pages (think a http://www.dmoz.org/-style hierarchical view, where every link on the page needs to be checked for permissions) it adds up.

We're also doing the filtering in the database (via a stored procedure calling a function) which means that we can't cache the results in the app tier.  If we can move the security filtering into the app tier, we can cache the results from the database, giving us even better performance.

We're in the process of benchmarking/profiling different options to come up with the best approach.  I'm guessing that iterating through sorted arrays with some smart short-circuiting will work best, but I'll post back with the results.
Derek
Thursday, February 21, 2008
 
 
On modern hardware and software, your current algorithm is blazingly fast anyway, assuming we are talking about a few security groups, rather than thousands or hundreds of thousands.

Each user belongs to only a handful of groups, right? And each page is accessible to only a few groups? Then your current algorithm probably is so fast that the time taken is not even measurable in milliseconds.

I'd leave it as it is, and find out where your real bottlenecks are.
Steve McLeod
Thursday, February 21, 2008
 
 
If you are doing this in the database anyway,
can't you implement this with a join?

SELECT p.pageid FROM pagesets p, userpermissions u
WHERE p.pageid = u.pageid AND p.id = @mypage AND u.id = @me

(And you can use RDBMS features such as indexes
 or materialized views to speed it up)

Or have I misunderstood?
Object Hater
Thursday, February 21, 2008
 
 
If you sort the page permissions and then binary-search them for each user permission, you get O(M log N) for N page permissions and M user permissions.  You shouldn't need to sort the page permissions often, as they shouldn't be changing that much.
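
Roughly, in Java (a sketch; Arrays.binarySearch requires that the page permissions were sorted beforehand):

import java.util.Arrays;

public class BinarySearchCheck {
    // sortedPageGroups is sorted once up front; each user group costs one log-time probe.
    static boolean canView(int[] sortedPageGroups, int[] userGroups) {
        for (int g : userGroups) {
            if (Arrays.binarySearch(sortedPageGroups, g) >= 0) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        int[] page = {23, 45, 58, 62};  // already sorted
        System.out.println(canView(page, new int[]{12, 33, 45}));  // true
        System.out.println(canView(page, new int[]{92, 12}));      // false
    }
}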
Brian
Thursday, February 21, 2008
 
 
+1 for pushing it into the database.  This is basically why the database exists.
Brian
Thursday, February 21, 2008
 
 
Trade memory for running time.

Instead of making a list of IDs, put them in a hash table. Make it a perfect hash: an array where the index is the key and the value is an on/off flag indicating whether the ID is in the set. Use bits instead if you need to optimize memory usage.

Lookups will be O(1).
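
A small Java sketch of that idea (the array size is an arbitrary assumption; real group ids would have to fit it):

public class DirectIndex {
    // true at index g means group g may view the page; assumes small, non-negative group ids.
    static boolean[] pageGroups = new boolean[1024];

    static boolean canView(int[] userGroups) {
        for (int g : userGroups) {
            if (pageGroups[g]) return true;   // one O(1) array lookup per group
        }
        return false;
    }

    public static void main(String[] args) {
        pageGroups[23] = pageGroups[45] = pageGroups[58] = pageGroups[62] = true;
        System.out.println(canView(new int[]{12, 33, 45}));  // true
        System.out.println(canView(new int[]{92, 12}));      // false
    }
}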

good luck.
hashman
Thursday, February 21, 2008
 
 
> We're also doing the filtering in the database (via a
> stored procedure calling a function) which means that we
> can't cache the results in the app tier. 

That doesn't follow.  Assuming that the filtered results are associated with a given session, why wouldn't you be able to cache them?  How dynamic does this have to be?

> If we can move the security filtering into the app tier,
> we can cache the results from the database, giving us
> even better performance.

This also doesn't follow, as best I can tell.

A common design pattern for this is to execute a query at the point of signon that returns a list of all the objects that the user has access to.  The app server then caches this list in some fast data structure like a hashtable and uses it to enforce access rules from that point on.
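
A minimal Java sketch of that pattern (class and field names are invented for illustration):

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SessionAccessCache {
    private final Set<Integer> accessiblePageIds = new HashSet<>();

    // Built once at signon from the query that lists everything the user can see.
    public SessionAccessCache(Iterable<Integer> idsFromSignonQuery) {
        for (int id : idsFromSignonQuery) accessiblePageIds.add(id);
    }

    // Every later check, e.g. each link on a navigation page, is an O(1) lookup.
    public boolean canView(int pageId) {
        return accessiblePageIds.contains(pageId);
    }

    public static void main(String[] args) {
        SessionAccessCache cache = new SessionAccessCache(List.of(10, 42, 99));
        System.out.println(cache.canView(42));  // true
        System.out.println(cache.canView(7));   // false
    }
}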
Robert Rossney
Tuesday, February 26, 2008
 
 
Do a database join; databases are optimized for it, and you already paid for the feature whether or not you use it ;)
antonio vargas
Friday, February 29, 2008
 
 
Derek wrote:
"Does anyone know of an approach/algorithm with performance comparable to bitwise, but without a limit (or at least a much higher limit) to the total number of security groups?"

There's a whole family of probabilistic membership test algorithms, which give very fast results for arbitrarily large sets, but with some additional penalties.

Perhaps the most famous example is the Bloom filter: it's an algorithm that can tell you with O(1) performance whether something is *not* a member of an arbitrarily large set. The penalty is: if the filter determines something might be in the set, then you'll have to perform a regular membership test on top of it. But it's a very good, *very* fast first-pass filter. Check it out here:

  http://en.wikipedia.org/wiki/Bloom_filter
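
For illustration only, here's a toy Bloom filter in Java (the size and hash mixing are arbitrary, not tuned for real use):

import java.util.BitSet;

public class GroupBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    GroupBloomFilter(int sizeInBits, int hashCount) {
        this.bits = new BitSet(sizeInBits);
        this.size = sizeInBits;
        this.hashes = hashCount;
    }

    private int index(int value, int i) {
        // Derive several bit positions from one value by mixing in the hash number.
        int h = (value * 0x9E3779B1) ^ (i * 0x85EBCA6B);
        return Math.floorMod(h, size);
    }

    void add(int value) {
        for (int i = 0; i < hashes; i++) bits.set(index(value, i));
    }

    // false -> definitely not present; true -> maybe, so confirm with an exact test.
    boolean mightContain(int value) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(value, i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        GroupBloomFilter allowedGroups = new GroupBloomFilter(1024, 3);
        allowedGroups.add(23); allowedGroups.add(45); allowedGroups.add(58); allowedGroups.add(62);
        System.out.println(allowedGroups.mightContain(45));  // true (then verify exactly)
        System.out.println(allowedGroups.mightContain(92));  // almost certainly false
    }
}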

Hope this helps!
R
Sunday, March 16, 2008
 
 

This topic is archived. No further replies will be accepted.
