The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

static hashmap to minimize database calls

I'm refactoring some webapp code in a control class that currently looks up a user's real name based on their id. The problem is that the lookup is done one-off: every time the user name needs to be displayed in the UI, it makes a DB call.  I was thinking of modifying the control class to include a static hashmap, so the new pseudocode would be:

displayval = hashmap.get(userid)
if displayval == null
  do db lookup and add to hashmap

return displayval

My concern is that over time the size of this hashmap would grow to be very, very large.  I'm not sure of the exact number of users, so I guess what I'm looking for is some guidance around a ballpark number of users at which point this design would be ill-advised (1k, 10k, 50k, 100k)?
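Roughly, in Java, the pseudocode above would look like this (UserNameCache and dbLookup are made-up names; the real lookup is whatever DAO call the control class makes today):

```java
import java.util.HashMap;
import java.util.Map;

public class UserNameCache {
    // Static map shared by all instances of the control class.
    private static final Map<String, String> CACHE = new HashMap<>();

    // Hypothetical stand-in for the real database lookup.
    static String dbLookup(String userId) {
        return "Real Name for " + userId;
    }

    public static synchronized String displayName(String userId) {
        String displayVal = CACHE.get(userId);
        if (displayVal == null) {
            displayVal = dbLookup(userId);   // one DB hit per unseen user
            CACHE.put(userId, displayVal);   // cached for every later call
        }
        return displayVal;
    }
}
```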
I dont use my real name on the web
Tuesday, March 04, 2008
Not enough information to answer this problem -- I'd need to know your access patterns, total number of users, amount of memory available, and where the bottlenecks are in your setup. 

Incidentally, this hashmap doesn't have to grow at all -- resolve collisions by replacement, boom, done.  If Bob and Suzy both hash into bucket # 123, then when Bob logs in, I check bucket #123 for Bob's details.  It doesn't have Bob, it has Suzy's.  I purge Suzy's details, fetch Bob's from the DB, and cache them in bucket #123.  Next time Suzy logs in, the system will find her details are not cached, so Bob gets rotated out and Suzy gets rotated in.

Thrashing is extraordinarily unlikely with a well-designed hash function, and given that you have a database backing it, the impact of thrashing is minimal.

There are other caching schemes which are applicable here, obviously.
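To make the replace-on-collision idea concrete, here's a rough sketch of a fixed-size, direct-mapped cache (class and method names are illustrative, not from any library):

```java
import java.util.Objects;

// Each key hashes to exactly one slot; a collision simply overwrites
// whatever was there, so the map never grows past its bucket count.
public class DirectMappedCache<K, V> {
    private final Object[] keys;
    private final Object[] values;

    public DirectMappedCache(int buckets) {
        keys = new Object[buckets];
        values = new Object[buckets];
    }

    private int slot(K key) {
        return Math.floorMod(Objects.hashCode(key), keys.length);
    }

    @SuppressWarnings("unchecked")
    public V get(K key) {
        int i = slot(key);
        // A miss here (null) means either "never cached" or "rotated out";
        // either way, the caller falls back to the database.
        return key.equals(keys[i]) ? (V) values[i] : null;
    }

    public void put(K key, V value) {
        int i = slot(key);
        keys[i] = key;      // the previous occupant of this bucket is purged
        values[i] = value;
    }
}
```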
Patrick McKenzie (Bingo Card Creator) Send private email
Tuesday, March 04, 2008
It sounds like you want a cache.  Check to see if there's a cache for your platform that meets your needs; otherwise, you could certainly implement a basic cache with the specific features you need.
D. Lambert Send private email
Tuesday, March 04, 2008
To avoid boundless growth of the HashMap, you could periodically purge it of old entries.

You might consider extending the LinkedHashMap class to act as a basic cache. It's really really simple. Here's the code:

package my.cache;

import java.util.LinkedHashMap;
import java.util.Map;

public class CacheMap<K, V> extends LinkedHashMap<K, V> {

  private final int maxCapacity;

  public CacheMap(int maxCapacity) {
    this.maxCapacity = maxCapacity;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > maxCapacity;
  }
}

And...that's it! You now have a class that acts just like a HashMap, but it'll also automatically delete the oldest entry in the collection as soon as its size exceeds its maximum capacity.

Pretty cool, huh?
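Here's a quick demo of the eviction behavior (the class is inlined here so the example stands on its own):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Same idea as the CacheMap above, repeated so this snippet is self-contained.
class BoundedMap<K, V> extends LinkedHashMap<K, V> {
    private final int maxCapacity;
    BoundedMap(int maxCapacity) { this.maxCapacity = maxCapacity; }
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxCapacity;
    }
}

public class CacheMapDemo {
    public static void main(String[] args) {
        BoundedMap<String, String> cache = new BoundedMap<>(2);
        cache.put("1", "Alice");
        cache.put("2", "Bob");
        cache.put("3", "Carol");            // evicts "1", the eldest insertion
        System.out.println(cache.keySet()); // prints [2, 3]
    }
}
```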

Check it out here:
BenjiSmith Send private email
Wednesday, March 05, 2008
Benji, this is a great suggestion, exactly the type of thing I was looking for, which really brings me back to my original question about how large I should allow the hashmap to grow.  What kind of performance will I see as the size grows?  I'm guessing lookups and insertions are both O(1), as the javadocs seem to indicate?  What about memory? It would be HashMap<String, String>.  Approximately how much memory is each entry in the map going to take up?  Is there a good rule of thumb, or do I just need to test different sizes and measure the memory consumed?
I dont use my real name on the web
Wednesday, March 05, 2008
Most modern systems, even most modern embedded systems, that would be running Java, tend to have many megabytes of memory available, so I wouldn't be too concerned about the size of this hash table. If all the table will contain is the user's id and full name, we aren't talking about more than a couple tens of bytes per user. Any reasonable system should be able to handle tens of thousands of users without the slightest trouble. Even hundreds of thousands or millions of users should be a breeze.

As others have said, if the memory consumption DOES become an issue, you can always restrict the size of the hash table and use it as a simple cache with eviction, which is what your pseudocode shows anyway (since you consult the database if the user id isn't found in the hash table on the first try).

Whether or not your code will benefit from a cache is entirely dependent on the access pattern seen in practice (is a user id that has been accessed likely to be accessed again in the near future?). Not all systems show this kind of access pattern for all data. Consider the access pattern for employees' names in a payroll package: having seen one employee name is practically a guarantee that the name will NOT be seen again during this run (because you are running through the names sequentially and generating only one paycheck per name).
Jeffrey Dutky Send private email
Wednesday, March 05, 2008
I can't give you any hard numbers, but I can offer my thoughts and hope they are helpful.

First of all, do you know for real that this user lookup is your bottleneck? It's a waste to fix the wrong problem, right?

Next, I think it's helpful to know how many lookups you have today, and what you can tolerate. Let's say you have 100 lookups per second, and can only tolerate 70.

Based on these numbers, you can design a caching strategy. Maybe caching each lookup for 2 seconds would do? That's a simple solution which is probably supported by your framework, or easily available as free code somewhere out there.

Perhaps you need the purge-old-items strategy proposed in this thread. In that case, you may have to try several configurations until your lookups have declined to your acceptable level.
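A time-based cache along those lines can be quite small. A rough sketch, assuming all the names here are made up (your framework may already ship something equivalent):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A minimal time-to-live cache: entries are considered stale after
// ttlMillis and are re-fetched (from the DB) on the next access.
public class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAt;
        Entry(V value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
    }

    private final Map<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public TtlCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    public void put(K key, V value) {
        map.put(key, new Entry<>(value, System.currentTimeMillis() + ttlMillis));
    }

    public V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null || System.currentTimeMillis() >= e.expiresAt) {
            map.remove(key);   // expired: caller falls back to the DB
            return null;
        }
        return e.value;
    }
}
```

With a 2-second TTL, 100 identical lookups per second collapse to roughly one DB hit every 2 seconds per key.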
Thomas Eyde Send private email
Wednesday, March 05, 2008
ehcache works well for this in Java.
cbmc64 Send private email
Wednesday, March 05, 2008
You want your cache to be able to hold all simultaneously logged in (active) users without collision.  If you have 100 users logged in at once you want a cache of 125 to 200 depending on how good the hash is and how much memory you have.
Cash Man
Wednesday, March 05, 2008
Cash Man said exactly what I was going to say.

How long is your session? Are your users automatically logged out after 30 minutes? Maybe 60 minutes?

During your busiest day of the week, how many users are logged in, during the busiest 30 minutes (or 60 minutes) of the day?

Your cache should be slightly bigger than that high-water-mark of users.

(I should also mention that this *isn't* how I'd implement the username cache. If I was developing this application in Java, I'd probably have a User object somewhere. During the login authentication, I'd create a User object for the user and store it in a session object. But the username HashMap thing will work too, and if that's how you want to do it, I'm happy to help.)
BenjiSmith Send private email
Wednesday, March 05, 2008
+1 Cash Man

Actually, a LinkedHashMap in LRU mode using reader/writer locks beats ehcache. We've replaced ehcache that way in a few critical code paths (after profiling). Of course, you lose statistics, JMX, etc., but surprisingly ehcache really isn't doing anything special behind the scenes. So in areas where it actually is a bottleneck, the reader/writer approach works nicely.
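A rough sketch of that construction, using java.util.concurrent's ReentrantReadWriteLock (class and field names are made up). One gotcha worth flagging: with accessOrder=true, get() reorders the map's internal linked list, so even lookups must take the write lock; the read lock is only safe for operations that truly don't mutate, like size().

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LruCache<K, V> {
    private final LinkedHashMap<K, V> map;
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    public LruCache(final int maxCapacity) {
        // accessOrder=true: iteration order runs least- to most-recently used
        map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxCapacity;
            }
        };
    }

    public V get(K key) {
        lock.writeLock().lock();          // get() mutates access order
        try { return map.get(key); } finally { lock.writeLock().unlock(); }
    }

    public void put(K key, V value) {
        lock.writeLock().lock();
        try { map.put(key, value); } finally { lock.writeLock().unlock(); }
    }

    public int size() {
        lock.readLock().lock();           // size() is a true read
        try { return map.size(); } finally { lock.readLock().unlock(); }
    }
}
```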

You haven't mentioned how you plan on invalidating your cache when customer profiles change. If you have multiple servers, you'll need to broadcast the event. You can avoid that complexity if you introduce memcached and rely only on remote caching, which is still better than a database hit. The best is a multi-layered cache, but there is quite a bit of complexity if you decide to roll your own (I did).
Benjamin Manes Send private email
Wednesday, March 05, 2008

Might want to check it out.  Easily the best distributed cache out there, as it has fine-grained caching at the field level, as opposed to the object level, in addition to being transactional.

It also works decently well as a non-distributed cache.
SmartYoungAnonPoster Send private email
Thursday, March 06, 2008
In a project from many moons ago (real-time ticker plant), we used what we called a "non-deterministic hash".  In your context, this could be done as follows:

1.  create a hash with a relatively small # of buckets

2.  when you need to look up something, try the hash first

3.  if not found in hash, look up in db and then update the hash bucket with the results from db.

This tends to be a self-managing algorithm, where the most recent lookups stick in the hash, and older ones get replaced by newer ones over time.
Friday, March 07, 2008
Make sure this is really a problem you need to solve.  In the performance profile, how much time is this DB lookup taking?  If the whole process takes 1 second and the DB part of it takes 1/100th of a second, then the best the hash map can do is reduce the total time to 99/100 of a second (not much).  Make sure you are solving the correct problem and not adding more complexity unnecessarily.
JimK Send private email
Thursday, March 13, 2008

This topic is archived. No further replies will be accepted.

Powered by FogBugz