The Joel on Software Discussion Group (CLOSED)

A place to discuss Joel on Software. Now closed.

Web dev - How to deal with lots of inserts/updates?

When designing a web application for heavy load, there always comes a point where you need to separate the SELECT statements (run against replicated servers) from the INSERT/UPDATE/DELETE statements (run against the master server(s)) to distribute the load. The classic ratio is something like one I/U/D for every ten to a hundred SELECTs (YMMV).
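
The routing decision itself can be simple. Here is a deliberately naive Python sketch of read/write splitting; the connection strings are invented for illustration:

import random

# Hypothetical connection targets, for illustration only.
MASTER_DSN = "db-master:3306"
REPLICA_DSNS = ["db-replica-1:3306", "db-replica-2:3306"]

def pick_dsn(sql):
    """Send reads to a random replica, everything else to the master."""
    if sql.lstrip().upper().startswith("SELECT"):
        return random.choice(REPLICA_DSNS)
    return MASTER_DSN

A real router would also pin a session to the master right after a write, so the user doesn't read stale data from a lagging replica.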

But sometimes you also need lots of inserts and updates. For example, amazon.com needs to store every product page a customer sees in order to profile them.
So for each page the customer visits there is at least a SELECT to check whether they have already seen the page. If so, I guess they update a "last_seen_date" field, and if not, they insert a new row. Or something like that (try to insert; if it fails because of a duplicate key, update instead).
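
That insert-or-update pattern might look like this (a minimal sketch using Python's standard sqlite3 module; the table layout is invented for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE page_views (
    customer_id INTEGER,
    page_id INTEGER,
    last_seen_date TEXT DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (customer_id, page_id))""")

def record_view(customer_id, page_id):
    # Try the insert first; on a duplicate key, fall back to an update.
    try:
        conn.execute("INSERT INTO page_views (customer_id, page_id) VALUES (?, ?)",
                     (customer_id, page_id))
    except sqlite3.IntegrityError:
        conn.execute("""UPDATE page_views SET last_seen_date = CURRENT_TIMESTAMP
                        WHERE customer_id = ? AND page_id = ?""",
                     (customer_id, page_id))
    conn.commit()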

That's a lot of database traffic. So what kind of caching mechanism do you think they use? What do you use in such situations (keeping track of what a customer does, for example)?

Thanks in advance, and excuse my bad English; it's not my native language.
learner
Tuesday, May 12, 2009
 
 
Use separate databases. Don't put the customer tracking info in with the book information.

This serves several purposes:

1. Your customers represent a convenient vertical slicing of the data. You can use a different server for each group of customer IDs to store this data. Customer 679878 never needs to see the tracking info for customer 553453, so their data can live on separate servers (see the sketch after this list).

2. Following from 1, your customers will almost certainly be geographically distributed. And hence so can your customer tracking DBs... you can put the machines in different hosting centres.

3. Your customer tracking info isn't as important. You'd rather it didn't get lost, but it's not mission critical if it does. So you can have a different backup strategy for it.

4. You can even host it on a different DB engine then... say, one which is good at inserts but not so nifty at searching for books by title.

5. When you need to aggregate the data, do it behind the scenes with asynchronous snooping systems that produce the core aggregate data, which is then replicated into the main datastore. You don't actually need the page-view counts to tick up in hard real time, for example, as long as they eventually increment. You probably won't get an overnight batch-processing window in the modern 24x7 world, but you can do it with processes that use idle time to move the information around, or at least know how to stop what they're doing and free up DB time for the more critical queries.
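
A minimal sketch of the customer-ID slicing from point 1 (the host names are made up; real routing would come from configuration, and a production system would use consistent hashing so shards can be added later):

# Hypothetical shard hosts, for illustration only.
TRACKING_SHARDS = ["tracking-db-1", "tracking-db-2", "tracking-db-3"]

def shard_for_customer(customer_id):
    """Map a customer ID to the server holding that customer's tracking data."""
    return TRACKING_SHARDS[customer_id % len(TRACKING_SHARDS)]

# The two customers above land on different servers:
print(shard_for_customer(679878))  # tracking-db-1 (679878 % 3 == 0)
print(shard_for_customer(553453))  # tracking-db-2 (553453 % 3 == 1)
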
Katie Lucas
Tuesday, May 12, 2009
 
 
I would first do profiling and see how much the database caching mechanisms buy you. You might find that you won't get much of a win by doing your own caching ... or that your money/efforts are better spent elsewhere.

For example, it might be possible to redesign your tables/queries so that the caching mechanisms are better used (in other words, so that the data that you are going to fetch can be better predicted by the caching mechanism).

You might find that if you do your own caching, you now require more main memory on your application servers. That money might be better spent upgrading the memory on the database servers.
Larry Watanabe
Tuesday, May 12, 2009
 
 
Keep in mind that the web server doesn't need to do the update itself. It can call a stored procedure that does the select/insert/update. Doing something like this in a sp can really cut down on network traffic.
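
For instance, a single call can replace three round trips (a sketch assuming a DB-API driver such as pyodbc, a hypothetical "tracking" data source, and a hypothetical procedure named record_page_view):

import pyodbc  # any DB-API driver with stored procedure support works similarly

conn = pyodbc.connect("DSN=tracking")  # hypothetical data source

def record_view(customer_id, page_id):
    # One network round trip: the procedure does the select/insert/update
    # server-side, instead of the web server issuing three statements.
    cur = conn.cursor()
    cur.execute("{CALL record_page_view (?, ?)}", customer_id, page_id)
    conn.commit()
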
Mike Swaim
Tuesday, May 12, 2009
 
 
Make sure that you are doing work on the web server asynchronously wherever possible. If the web server thread does not need to know the result of an insert, update or delete, then don't issue that work on the thread; queue it somewhere else and let something else do the work. Blocking is the enemy here.

For example, in one place I worked, we needed to update statistics every time a user viewed a page. We designed a new schema to store this data.

From the web application, we started with the standard, synchronous approach - as you can imagine, as load increased, page response times began to slow measurably, and lock contention increased.

Updating statistics is a fire-and-forget operation and is ideal for asynchronous processing, so we implemented a queued update system - the web server thread would invoke a business component which would write a message to a queue, which would be picked up by a separate service and processed against the database.
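
The shape of it, as a minimal Python sketch: an in-process queue.Queue stands in for the durable queue (MSMQ in our case), and write_stats_to_db is a placeholder for the real database write.

import queue
import threading

def write_stats_to_db(customer_id, page_id):
    # Placeholder for the real database write.
    print(f"recording view of page {page_id} by customer {customer_id}")

stats_queue = queue.Queue()

def stats_worker():
    """The separate service: drains the queue and updates the database."""
    while True:
        item = stats_queue.get()
        try:
            write_stats_to_db(*item)
        finally:
            stats_queue.task_done()

threading.Thread(target=stats_worker, daemon=True).start()

# Web server thread: enqueue and return immediately; no blocking on the DB.
stats_queue.put((679878, 42))
stats_queue.join()  # only for this demo, to let the worker finish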

Over time we were able to make the update service pretty smart, and tailored to our particular needs - lock contention at the database level was much less of an issue, mainly because a single service was doing all the work, and the data was coming in fast enough to keep the business analysts happy. I won't say that MSMQ didn't give us some headaches along the way, but at least our customers didn't notice.

Incidentally, in .NET Enterprise Services (COM+) there is a feature called Queued Components which implements this pattern - it's worth looking into, but rolling your own solution is perfectly feasible as well.
Sam Piper
Tuesday, May 12, 2009
 
 

This topic is archived. No further replies will be accepted.
