The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

How much do I need to worry about web app scalability?

I'm in the process of thinking through a web application venture. Unfortunately I've never really had to give much thought to the processing power/ability to handle users for any of my sites. For this one I want to make sure it can be scaled relatively easily. I'm prepared to rework things if they aren't practical but I'd rather not box myself into a corner with my early design decisions.

So far I've only ever used PHP for writing simple web apps. I was thinking of trying a new setup like Python and TurboGears, but I don't know the impact these decisions would have on scalability.

A user would probably use the site multiple times a day, making maybe 10-20 page accesses each time. At what level would this become a problem for a normal web server running PHP or Python/TurboGears, for instance? i.e. how much do I have to worry about this problem? Does anyone have any experience with this?

Do you have to split the load between two web servers in order to continue to scale?

Are there any good references on setting up web application architectures?

Sorry for all the questions, I'm happy to read online references to find the answers but haven't managed to find anything useful yet.
Matt
Wednesday, April 25, 2007
 
 
Worry about it when it becomes a problem. There are more important things to think about first, like creating a great app that your users are going to love.

http://gettingreal.37signals.com/ch04_Scale_Later.php
John Topley
Wednesday, April 25, 2007
 
 
Interesting points. Are there any technologies that make this easier when it happens? For instance if I write my app using Python would I have to turn around and write it in C++ in order to go from 100,000 to 500,000 customers? Is that how it works?
Matt
Wednesday, April 25, 2007
 
 
Not really. *If* you're fortunate enough to have that problem, then profiling your application's performance may show that there are hot spots that could benefit from being rewritten using a high-performance language such as C. There's a good discussion around that here: http://www.loudthinking.com/arc/000598.html

Like all these things there's a trade-off involved. It's almost certainly not worth your while to write your app in C++ if Python is what you're productive in. You'll probably do a lousy job and never get to the stage where scaling is a problem.
John Topley
Wednesday, April 25, 2007
 
 
> Worry about it when it becomes a problem

Bleh. Know your profession and you can do both. Scalability is about architecture and architecture is something you can learn, know, and implement now with no impact on your wonderful app.

This entire drive-while-blindfolded cult is simply a way for people to free themselves from knowing what they are doing and from making decisions.
son of parnas
Wednesday, April 25, 2007
 
 
son of parnas: I understand where you are coming from, but unfortunately your comment didn't actually contain any useful recommendations.

John: I'm actually a C++ programmer but I've never written a webapp in it and in my mind it's probably easier to learn Python than to battle with the technicalities of webapps in C++.

It's definitely a case of ease of implementation versus performance. In order to get customers you need to have an app, favouring ease of implementation, but once you have customers you may need to scale it.
Matt
Wednesday, April 25, 2007
 
 
You will hit several scaling "milestones."  Some of them have high associated costs.  Design your code so you can drop in new parts without changing the rest.

This way, you can incrementally upgrade components on demand when the current solution is too slow.  It's not cost-effective to focus on performance before you have content.

Wednesday, April 25, 2007
 
 
That seems like a good idea "blank". Any thoughts on how to do it? Are there particular techniques that are used to separate different components? Do you mean a presentation/data layer split?

Thanks for all the comments so far.
Matt
Wednesday, April 25, 2007
 
 
John Topley gave a link above about how you can incorporate pre-compiled routines in your code when needed.

A good design will allow you to take most any routine in your code and replace it with a pre-compiled version.

Make sure that you can switch out databases easily.  Or change the data model (multiple DBs, load balancing in the access code, caching, etc). 

Good design can help.  Beyond that suggestion, I'm running into my lack of experience dealing with large web sites.

Wednesday, April 25, 2007
 
 
In theory it's simple. For instance:
* watch out for your database use, as it's one of the first bottlenecks;
* try to optimize static files like images, JavaScript and CSS; each extra connection is expensive, and browsers may retrieve files sequentially from the same server, so many small files add up; there are techniques like image sprites, combining multiple JavaScript and CSS files, and compressing (gzip) responses... study them when necessary;
* keep the session data as light as possible; when you do need it, store it in a database so it can be shared by all the servers behind your load balancer;
* cache database data if you really need to, but with carefully constructed database access you may be able to avoid it, or at least make it easy to add when necessary;
* cache View data in something like memcached, which is used by Slashdot and other popular sites and is even recommended for use with Rails, for instance (see the sketch after this list);
* beware of unoptimized use of ORMs;
* keep the database transactions as short as possible;
* use database connection pooling (or not...)
* beware of overuse of "select count(*) from a_table";
* use "limit" in your "select" to retrieve as few rows as necessary;
* beware of premature optimization, though, because ugly code sucks big time; :-)
* frameworks can hide inefficiencies; so what? :-)
* Nginx and Lighttpd are slightly faster than Apache;
* stability trumps performance; sometimes high profile sites are pretty much broken stability-wise, but they end up being useful by working 99% of the time; :-)
* at the very least, keep the raw database stuff in a dedicated module, with the functionality as hidden as possible from the outside; I know it's hard, though; :-)
* perhaps beware of inefficiencies in your Ajax connections (prefer JSON rather than XML, for instance);

And so on...
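
To make the caching and "limit" points a little more concrete, here's a rough sketch (not from any real app; the key name, the 600-second expiry and the two helper functions are just assumptions) using the python-memcached client:

import memcache

mc = memcache.Client(['127.0.0.1:11211'])  # assumes a memcached instance on localhost

def get_front_page_html():
    # Try the cache first; fall back to the expensive query + render.
    html = mc.get('front_page_html')
    if html is None:
        posts = load_latest_posts(limit=10)        # hypothetical DB helper (uses LIMIT internally)
        html = render_front_page(posts)            # hypothetical template rendering function
        mc.set('front_page_html', html, time=600)  # keep it for 10 minutes
    return html

The same get/compute/set pattern works for almost any expensive read.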
Joao Pedrosa
Wednesday, April 25, 2007
 
 
> but unfortunately your comment didn't actually contain
> any useful recommendations.

The first recommendation is that software is a profession and should be treated as such. Scaling is a skill that can be learned. There are numerous papers on the internet about how current sites succeed. You can learn from those.

A few:
1. http://www.danga.com/words/2005_mysqlcon/mysql-slides-2005.pdf
2. http://mysqluc.com/presentations/mysql05/benzinger_michael.pdf
3. http://labs.google.com/papers.html
4. http://www.mysqluc.com/presentations/mysql05/pattishall_dathan.pdf
5. http://glinden.blogspot.com/2006/05/early-amazon-end.html

Study. Learn. Start small and evolve. But that doesn't mean you have to toss rationality out the window.
son of parnas
Wednesday, April 25, 2007
 
 
Thanks guys, that's some really great information. I've got a bit of reading to do now!

Thanks again.

Wednesday, April 25, 2007
 
 
Most of the advice above is about optimization, not about scalability. They're two different things.

Optimization is "How do I make this app/call/library faster?"

Scalability is "How do I build this app so that I can improve total throughput by throwing more hardware at it?"

In the web world, this basically means that if you're slow, you can throw in another web server box without changing anything else, and the overall site will be faster.

The important things to think about here are where your contentions are, and where your state is. There is often a tradeoff between the two.

For example, suppose you're storing per-user session state; the contents of a shopping cart, for example. The absolute fastest thing you can do is store this state in memory on the web server.

However, this prevents you from scaling out by adding more web servers. Since the cart state is in memory on an existing box, that user must always go back to that exact machine for every subsequent request, or the cart contents won't be available.

So, you need to store session state somewhere that can be shared across servers (a database is the typical answer). Ironically, this will slow you down for the individual user: a database round trip will be required where none was before. However, your overall site performance will go up significantly, since you can now handle many more simultaneous users by adding more server boxes.
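
To make that concrete, here's a minimal sketch of keeping cart state in a shared database instead of in web-server memory, so any box in the farm can serve the next request. The table name, the columns and the connection helper are all made up, and the SQL assumes a MySQL-style DB-API driver:

import json

def save_cart(session_id, cart):
    conn = get_shared_db_connection()   # hypothetical helper returning a DB-API connection
    cur = conn.cursor()
    cur.execute("REPLACE INTO sessions (session_id, cart_json) VALUES (%s, %s)",
                (session_id, json.dumps(cart)))
    conn.commit()

def load_cart(session_id):
    conn = get_shared_db_connection()
    cur = conn.cursor()
    cur.execute("SELECT cart_json FROM sessions WHERE session_id = %s", (session_id,))
    row = cur.fetchone()
    return json.loads(row[0]) if row else {}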

Contentions are resources that multiple machines need to access or update simultaneously. If your web app updates a disk file, for example, every server will need to access that same file, and that requires locking. Locking will slow down all your other boxes. Similar things occur when you get locks on rows or tables in your database.

The best way to deal with contentions is to avoid them if possible with a "shared nothing" architecture. Of course, you always end up with a database behind you that all the web servers end up hitting. Luckily, databases have had many years to get good at handling concurrent access, and scaling databases is an art all its own.
Chris Tavares
Thursday, April 26, 2007
 
 
Thanks Chris.

So basically the model is to split out session information into a separate database so you can have that as the central machine and then scale the number of webserver boxes dealing with the requests. This would be the point at which database scalability comes in - I presume there are techniques for splitting databases across machines as well.

Are there any decent articles about this anywhere on the web?
Matt
Thursday, April 26, 2007
 
 
This article, which was linked here a while back, talks about several options for database scaling.

http://www.baselinemag.com/print_article2/0,1217,a=198614,00.asp
Adam
Thursday, April 26, 2007
 
 
> Optimization is "How do I make this app/call/library faster?"
> Scalability is "How do I build this app so that I can improve total throughput by throwing more hardware at it?"

I prefer to think of this as:

Optimization: How do I make one user's experience faster?

Scalability: How do I add (a lot) more users without degrading the overall user experience?

Like SoP says, this is a skill that can be learned, but you have to change how you think about stuff (warning: this can be painful).  Rather than thinking "How do I make this one routine as fast as possible?", think instead about "How do I make the entire checkout operation faster?"

As in most things, this involves a series of tradeoffs.  Generally, you're trading memory usage against disk access, and the fewer times you hit the disk, the better.
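
A tiny, purely illustrative example of that memory-for-disk trade (the file name and function are made up):

_price_table_cache = None

def get_price_table():
    # Read the file once and keep it in memory instead of hitting the disk
    # on every request: costs some RAM, saves repeated disk reads.
    global _price_table_cache
    if _price_table_cache is None:
        with open('prices.csv') as f:   # hypothetical data file
            _price_table_cache = f.read()
    return _price_table_cache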
xampl
Thursday, April 26, 2007
 
 
You could look at how existing sites handle scaling. LiveJournal has an open approach to a lot of things.

http://danga.com/words/2005_oscon/oscon-2005.pdf

I expect there is something more up to date than that, but I haven't found anything newer in my brief check.
Cymen
Thursday, April 26, 2007
 
 
I think the biggest roadblocks to scalability that you need to watch out for are:

1. crappy code - poor OOD, spaghetti code, etc.
2. lack of good documentation - specs, code comments, etc.
3. lack of good source control

A lot of projects start with somebody hacking something together, then building hacks upon the hacks, upon the hacks, upon... During that process, nobody comments the code or removes the "test code" from the source, and no specification docs are created. Then one day, you'll hit a point where you need all of that straightened out. That's when you have a scalability problem.
*myName
Thursday, April 26, 2007
 
 
I 100% agree with Chris Tavares.  Scalability is all about building a system that can be scaled up using only money.  Optimization is about scaling up by using talent.

The biggest thing to remember is that your only job is to "get out of the way of scalability".  Almost all web platforms today are reasonably scalable.  The only way to build a poorly scalable web app is to make a decision that hinders scalability.  For example: choosing an authentication architecture that precludes database connection pooling.  Or even worse: choosing an authentication architecture that precludes physically separating the presentation tier from the data tier (using Windows authentication to SQL Server in the pre-Active Directory days was an example of this).

You want to design to allow these techniques to work:
- Splitting the logical tiers into physical tiers
- Building a web farm for the presentation layer
- Clustering/farming the middle tier
- Partitioning the data tier (a rough sketch follows this list)
- Multithreading
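
As a sketch of the data-tier partitioning idea, the usual trick is to pick the shard from something stable like the user ID. Everything here (the shard list, the connection helper) is invented for illustration:

SHARDS = [
    {'host': 'db1.example.com', 'db': 'app_shard_0'},
    {'host': 'db2.example.com', 'db': 'app_shard_1'},
    {'host': 'db3.example.com', 'db': 'app_shard_2'},
]

def shard_for_user(user_id):
    # The same user always maps to the same shard, so each database only
    # holds a slice of the data; adding capacity means adding shards and
    # re-balancing rather than rewriting the application.
    return SHARDS[user_id % len(SHARDS)]

def connect_for_user(user_id):
    shard = shard_for_user(user_id)
    return open_db_connection(shard['host'], shard['db'])   # hypothetical connect helper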

A lot of optimization techniques actually prevent scalability.  In general, anything that positively affects both performance and scalability is already a best practice.  So, if you are already following best practices for your environment, any change that increases performance on the current hardware probably hinders scalability.
JSmith
Thursday, April 26, 2007
 
 
Find a way to stress test your app, so that you can see what happens if you have a large number of simultaneous users.

Likewise, a way to load up your database to see what happens when there is lots of data.
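
A cheap way to get started on that (just a sketch; the URL and the numbers are placeholders, and dedicated tools such as ApacheBench or siege do this job better):

import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = 'http://localhost:8000/some/typical/page'   # hypothetical page to hammer

def fetch(_):
    start = time.time()
    urllib.request.urlopen(URL).read()
    return time.time() - start

with ThreadPoolExecutor(max_workers=50) as pool:   # 50 simulated concurrent users
    timings = list(pool.map(fetch, range(1000)))   # 1000 requests in total

print('avg %.3fs, worst %.3fs' % (sum(timings) / len(timings), max(timings)))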
dot for this one
Friday, April 27, 2007
 
 
But that's testing for performance, not scalability.  The beauty of perfect (and unattainable) scalability is that no matter how many customers you get, you can simply use a portion of their money to buy the hardware necessary to support them.

With the worst type of poor scalability, there will be a brick wall where no matter what you do, it will be slow under that load.  Then you have to re-architect and do it under pressure.

So, you can have awful scalability in a high performance app.  Making it faster or measuring the workload capacity has little to do with scalability.  In order to actually measure scalability you have to try the same system on faster and more pieces of hardware and see where you eventually hit a point of diminishing or no returns.  This is really hard and expensive to do.
JSmith
Friday, April 27, 2007
 
 
Hey! I just found an interesting article:

* The top 10 presentations on scaling websites: twitter, Flickr, Bloglines, Vox and more. - http://poorbuthappy.com/ease/archives/2007/04/29/3616/the-top-10-presentation-on-scaling-websites-twitter-flickr-bloglines-vox-and-more
Joao Pedrosa
Sunday, April 29, 2007
 
 
Most sites nowadays are so well defined (they do BBS, they do support, they do bug tracking, they do e-commerce, etc.) that refactoring a quick and dirty prototype for performance is something you can hand off to an experienced website architect pretty easily. Especially one who can refactor it to scale.

If your site IS the application, it's solving a fairly unique problem and it's not just simple CRUD, you should probably hire an experienced developer early on, or mandate that your team learn a little about scalability before designing the software. You don't want a situation where the entire prototype has to be scrapped due to some major oversight. Most highly experienced web developers will do the web application prototype right the first time around, making refactoring for scalability a relative non-issue.

I agree with the previous comments: as long as you put some thought into it now, it should be easy to keep improving performance as your business grows, making it a continuous process of improvement.
Li-fan Chen
Monday, April 30, 2007
 
 
1. Source control, docs, clear code = maintainability.
2. Optimisation = efficiency and performance.

Both come at a cost, so if your resources are limited you may initially want to pay less attention to these two and more attention to minimal useful functionality.


Scalability = ability to increase system throughput and capacity solely by performing computer administration tasks (that is, no coding).

Web apps are inherently more scalable than client-server, so read up on web development first.

Secondly, search Google for "eBay architecture", "Google architecture", "Flickr architecture" (very useful) and such.

From my personal experience as a web developer for the last 7 years: don't worry about the code, worry about the data structure and storage being scalable.

The data source and structure are always the bottleneck in web apps. But don't worry too much initially; just keep the data source layer separate and the data structure tidy and portable.
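
A rough sketch of what keeping the data source layer separate can mean in practice (the module, table and function names are invented, and the SQL assumes a MySQL-style DB-API driver). The rest of the app imports this module and never writes SQL itself:

# userdata.py -- the only module that knows the schema or the SQL dialect

def get_user(conn, user_id):
    cur = conn.cursor()
    cur.execute("SELECT id, name, email FROM users WHERE id = %s", (user_id,))
    row = cur.fetchone()
    return {'id': row[0], 'name': row[1], 'email': row[2]} if row else None

def save_user(conn, user):
    cur = conn.cursor()
    cur.execute("UPDATE users SET name = %s, email = %s WHERE id = %s",
                (user['name'], user['email'], user['id']))
    conn.commit()

If you later switch databases, split tables across machines or add caching, only this module has to change.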

Chances are that you will have to re-write some code and that's no problem. You can deploy new code easily, while your app is live.

However, the more successful your web app is, the more data you will hold. If you have an inflexible data structure or haven't separated your data layer, a future migration to a more scalable solution might become seriously difficult (especially whilst staying "live"), but still not impossible.

Again, don't worry about scalability now. Worry about making your site useful to your users.
John Cribbon
Thursday, May 10, 2007
 
 
Additionally, employing technologies that none of the development team is proficient in is a very serious project risk. You really ought to consider it along with the risk of the web app becoming unscalable and the risk of it not being useful.

Having these three risks:
1) Risk of defining a product not enough people care about.

2) Risk of failing to deliver the defined product due to lack of experience with the particular technology.

3) Risk of the product not being scalable enough.

You can, hopefully, clearly see why I have ranked them in this particular order: the impact of not having a product that people want is much greater than the impact of having the product, albeit with some limitation that could be eliminated (remember, the "soft" in software means it can always be changed).

Concentrate on high-impact, high-probability risks early in the lifecycle and leave the lower-impact risks for later. The probability of number three becoming an issue depends entirely on delivering the right product first.


P.S. I am currently looking for a position of Jr. Software Project Manager anywhere in UK. I will leave an e-mail to anyone interested in seeing my CV. Cheers!
John Cribbon
Thursday, May 10, 2007
 
 

This topic is archived. No further replies will be accepted.
