The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Implementing a caching layer

I've inherited a web application which reads and stores information in an XML database via a third party API.  The system is severely under-performing and we're looking at options to improve it.  Due to architectural and organisational constraints, we are mandated to use the existing data API.   

Since 95% of the data operations are reads, one option I'm considering is to introduce a caching layer. Is it worth rolling our own, or is this addressed by an existing open source product (the app is Java-based)?
Siohban Edwards
Friday, May 09, 2008
You should not roll your own unless you have a very good reason to. I did, but only far enough to provide an integration and coherency protocol, not the actual caching stores. That by itself is quite a bit of effort if you plan on scaling to thousands upon thousands of servers.

For local caching, Ehcache is quite good and popular. I've heard that Whirlycache is excellent at critical sections, but does suffer by needing a thread per cache for eviction handling. OSCache and JCS are common, but probably not as performant.

For a remote caching layer, memcached is very fast. It has the nice property of being an independent layer, making it more scalable and faster than deeply integrated solutions, but harder to manage if you prefer local/remote caches. Most adopters forgo the local cache, and thus the coherency protocol it would need, and find it fast enough. It is then very easy to use.

For integrated local/remote solutions, see JBoss Cache, Coherence (commercial), Ehcache/JCS, or Terracotta. The first three are primarily modelled on a mesh approach, where changes ripple through the nodes as all caches are updated. Terracotta takes a single management server approach, which allows quite a bit of magic at the expense of (at least perceived) reliability. Coherence supports a number of topologies, so you can pick the best one.

I personally like the cache-aside model with independent layers. You can have a remote cache farm, not share its memory with the JVMs, and simply broadcast invalidation messages rather than push partial changes. If you are moving a large legacy code base with an existing invalidation process, it's the easiest to bootstrap to as well. However, it takes the most work, as there are no solid open source implementations that I am aware of.
Benjamin Manes
Friday, May 09, 2008
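To make the cache-aside idea from the post above concrete, here is a minimal sketch in Java. It is not taken from any of the products mentioned in the thread. The class name `CacheAside` and the `loader` function are hypothetical, the in-memory map only stands in for whatever local or remote store you pick, and the loader is assumed to wrap the mandated third-party data API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Cache-aside sketch: the application, not the cache, talks to the backing store. */
final class CacheAside<K, V> {
    // Stand-in for the cache store; a real deployment might put Ehcache or memcached here.
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    // Hypothetical wrapper around the existing data API.
    private final Function<K, V> loader;

    CacheAside(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Read path: use the cached value, or load it through the data API on a miss and keep it. */
    V get(K key) {
        // computeIfAbsent calls the loader at most once per missing key.
        // If the loader returns null, nothing is cached.
        return cache.computeIfAbsent(key, loader);
    }

    /** Invalidation path: drop the entry when told the data changed; the next read reloads it. */
    void invalidate(K key) {
        cache.remove(key);
    }
}
```

In this arrangement a writer would update the data through the API and then broadcast the key, and each node would call `invalidate` when the message arrives. How the message travels (JMS, multicast, something else) is left open here.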
I agree with Benjamin. Don't write a cache yourself, because it is very tricky to get it right. I wrote about this here:

I recommend using Ehcache. It's a good all-round Java caching solution, open source, actively developed, created by people at Thoughtworks, in wide use, and easy to get started with.

Caching is rarely the best solution to a problem, but if you can't fix the performance problems directly, it's a pretty good solution.
Steve McLeod
Friday, May 09, 2008
First, you need to determine what the bottleneck is. 

In my experience, there are orders of magnitude differences between XML parsers -- for instance, Xerces/Xalan is 10-100 times slower than libxml2 (in C++, using DOM).  So, the XML parser could be something to look at.

However, you really need to measure first.
BillAtHRST
Friday, May 09, 2008
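As a rough illustration of "measure first", a small timing helper like the one below can be wrapped around each stage separately (the API call, the XML parse, the page render) to see where the time actually goes. The names `Timing` and `averageMillis` are made up for this sketch, and a profiler or a proper benchmark harness will give more trustworthy numbers than wall-clock averages.

```java
import java.util.function.Supplier;

final class Timing {
    /** Runs the call repeatedly and returns the mean wall-clock time per call in milliseconds. */
    static <T> double averageMillis(Supplier<T> call, int warmups, int runs) {
        // Untimed warm-up calls give the JIT a chance to compile the hot path.
        for (int i = 0; i < warmups; i++) {
            call.get();
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            call.get();
        }
        return (System.nanoTime() - start) / 1_000_000.0 / runs;
    }
}
```

For example, timing the raw fetch and the parse step as two separate suppliers would show whether a cache in front of the API, or a faster parser, is the better investment.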
+1 to measuring first.
Spend money & development time later.
xampl
Friday, May 09, 2008
You might also want to take a look at GigaSpaces Data Grid

Geva Perry
Geva Perry
Friday, May 09, 2008
I figure I could add something but Ben basically covered it all really well.

Ben, wouldn't you say that Memcached is an open source 'cache aside' caching solution?

Perhaps I don't see the distinction...
SmartYoungAnonPoster
Monday, May 12, 2008
We've had some success at work with memcached in a PHP application. I think a couple of important areas to start with, once you've settled on caching, are what your key generation strategy will be and which types of requests to cache.

The trick with keys is that if you're using something like MD5, SHA-1 or maybe Java hashcodes, you can end up with multiple cached copies of the same data because of some trivial difference (as defined by you) that causes the key to change. This is pretty similar to choosing primary keys and unique constraints in SQL.

As for caching requests, how granular do you want to make the cached items? If your requests are fairly homogeneous and rarely overlap, then you may want to store the whole thing under one key. Otherwise, you may want to consider breaking requests up over multiple keys so you can reuse the same items in multiple requests.

The last thing to investigate is whether you want to support preloading the cache, and pushing updates when the backend data changes.
Dana
Monday, May 12, 2008
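One way to avoid the duplicate-copy problem with keys described above is to canonicalize a request before it is hashed or used as a key. The sketch below is only an illustration: `CacheKeys` and `canonicalKey` are hypothetical names, and treating parameter order, name case, and surrounding whitespace as "trivial" is an assumption that each application has to decide for itself. Parameter values are assumed to be non-null.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class CacheKeys {
    /** Builds a key that ignores parameter order, name case, and surrounding whitespace. */
    static String canonicalKey(String resource, Map<String, String> params) {
        // TreeMap orders by parameter name, so {a=1, b=2} and {b=2, a=1} give the same key.
        Map<String, String> sorted = new TreeMap<>();
        for (Map.Entry<String, String> e : params.entrySet()) {
            sorted.put(e.getKey().trim().toLowerCase(Locale.ROOT), e.getValue().trim());
        }
        StringBuilder key = new StringBuilder(resource.trim().toLowerCase(Locale.ROOT));
        for (Map.Entry<String, String> e : sorted.entrySet()) {
            key.append('|').append(e.getKey()).append('=').append(e.getValue());
        }
        return key.toString();
    }
}
```

If the cache needs short fixed-length keys, the canonical string can then be hashed. The point is that normalization happens before hashing, so equivalent requests map to the same entry.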
Yep, it is! What I meant was that I prefer a cache-aside approach, whether it's local and/or remote. Generally a mesh approach is used for local+remote, as in the examples cited above, which I dislike. They tend not to scale as well, are so deeply integrated that they become inflexible, and tie you to a particular vendor.

I tend to look towards hardware for insight into scaling and performance issues. A cache-aside solution is prevalent, and the form we use is very similar to a NUMA architecture (where a service == a CPU group). A mesh approach is more like a shared memory architecture: a UMA with optional private memory. If you look at hardware these days, UMA is used for small/medium sized systems and NUMA for the large. In software it's becoming the same way, just that most of us aren't used to the "large" yet (e.g. Google, Amazon).
Benjamin Manes
Tuesday, May 13, 2008
I've tackled this exact problem in my own web applications. Mine are fairly small web apps that store all their data in really fast native XML databases. I use a combination of XPath, MSXML and JScript sorting to do fast saves and returns. This type of local XML caching is pretty darn fast if done correctly, and can be shown to beat SQL Server and MySQL in some nano speed tests. But as you may be finding, there are bottlenecks with large volumes of data. I developed my own indexing system, but the issues I had were with sorting and ordering. As the data grew, this became very slow, so I resorted to a caching scheme. So, yes, it can be done.

My solution to this type of cache is very clean: multi-dimensional sorted arrays stored in web application scoped variables. It was still slow until I found a type of JScript array algorithm that beat out most bubble sort scripts. Once I implemented this, my apps were able to fully cache multiple forms of data and serialize the data out as needed very quickly at any point in the app. The long and short of it is, YES, you can cache, but make sure you understand the serialization demands of your users, sorting speeds, and indexing, and try several performance tests to find the right combo. If you use a healthy balance of cached array/hashtable datasets along with local XML storage, and well-written regeneration logic that checks the cache and rebuilds behind the scenes as needed, you can have a very fast system.

This type of cache and XML system is very powerful, but even it has upper limits, and at that point you need to get creative. I'm exploring b-trees and other crazy systems right now to see what's possible, as I absolutely love native XML databases and believe they're the foundation for a new, more portable and flexible Web.
XML Database Web Technology
Thursday, May 15, 2008
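The post above describes its setup in JScript, and that code is not shown in the thread. For a Java app like the original poster's, the same general idea (keep a pre-sorted copy in application scope and rebuild it behind the scenes) might look roughly like the sketch below. `SortedSnapshotCache` and its `source` supplier are hypothetical names, and the supplier is assumed to wrap a query against the XML store.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

final class SortedSnapshotCache<T extends Comparable<T>> {
    // Hypothetical loader, e.g. a query against the XML database.
    private final Supplier<List<T>> source;
    private final AtomicReference<List<T>> snapshot =
            new AtomicReference<List<T>>(Collections.<T>emptyList());

    SortedSnapshotCache(Supplier<List<T>> source) {
        this.source = source;
        rebuild();
    }

    /** Readers always see a complete, already sorted, read-only list. */
    List<T> get() {
        return snapshot.get();
    }

    /** Loads and sorts a fresh copy off to the side, then swaps it in atomically. */
    void rebuild() {
        List<T> fresh = new ArrayList<>(source.get());
        Collections.sort(fresh);
        snapshot.set(Collections.unmodifiableList(fresh));
    }
}
```

Because the sort happens on a private copy, requests never pay for it and never see a half-built list. `rebuild()` could be run from a `ScheduledExecutorService` or triggered when the underlying data is known to have changed.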

This topic is archived. No further replies will be accepted.
