The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Improving the speed of a CPU-bound application

I have a web application in mod_perl. The search functionality runs really slowly and I want to speed it up. There are three ways I know of that would make it run faster:

1. Rewrite the most called subroutine in C
2. Write the search functionality in a separate application and parallelize the search (yes, the problem is parallelizable)
3. Get faster hardware

But my boss said no more hardware changes, so I'm left with the first two options.

Does anyone know which way is better and why?
Or is there another way?

Thank you
badaiaqrandista
Thursday, September 01, 2005
 
 
Mod Perl is pretty darn fast, so it could well be that your algorithm isn't any good. Why not show it to us so we can see where you could improve it?

Also read these for some tips on how to profile your application:

http://www.perl.com/pub/a/2004/06/25/profiling.html
http://perl.apache.org/docs/1.0/guide/performance.html
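
If profiling looks daunting, the stock Devel::DProf that ships with Perl is a low-effort start (a sketch, assuming you can drive the search from a standalone script, hypothetically called search_test.pl):

  perl -d:DProf search_test.pl    # writes a tmon.out profile
  dprofpp                         # prints the most expensive subs

If I remember right, the mod_perl guide linked above also covers Apache::DProf for profiling inside the server itself.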
Sheldon Lavine
Thursday, September 01, 2005
 
 
Well, I've asked around on the mod_perl mailing list, and the conclusion of the discussion is that my application is CPU bound.

It is basically an online hotel booking system. The user enters the dates she wants to stay, how many guests will stay, and possibly a coupon number. Then the web app looks for available room and package combinations and calculates all the possible price structures with various constraints, like close-outs, discounts, extra adult prices, etc.

I did a stress test with production data and found that a request takes an average of 85 seconds to finish when the server is hit by 50 concurrent connections per second.

Unfortunately I can't provide the code because it is too big. Or do you want only the most-called subroutine?
badaiaqrandista
Thursday, September 01, 2005
 
 
Just the pseudocode will do.
Sheldon Lavine
Thursday, September 01, 2005
 
 
Perhaps you could precalculate most of the combinations, or leave out options that you know are not going to be optimal?
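
If the expensive calculations are pure functions of their inputs, even blunt caching with the standard Memoize module can implement that precalculation (a sketch; the subroutine name here is made up):

use Memoize;

# compute each argument combination once per process; requires the
# sub to be side-effect free and its arguments to stringify stably
memoize('calculate_price_for_day');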
Frederik Slijkerman
Thursday, September 01, 2005
 
 
Here's the pseudocode:

foreach (hotel in hotels) {
  foreach (room in hotel.rooms) {
      foreach (package in hotel.packages) {
        if ( is_available(
                room,
                package,
                from_date,
                to_date,
                number_of_guests,
                coupon_number) )
        {
            push room-package to available_list
        }
        else
        {
            push room-package to not_available_list
        }
      }
  }
}

function is_available(
  room,
  package,
  from_date,
  to_date,
  number_of_guests,
  coupon_number)
{
  if (room.max_guests < number_of_guests)
  {
      return false
  }

  if (package.min_guests > number_of_guests)
  {
      return false
  }

  ... other simple constraints ...

  if (!is_price_valid(
      room,
      package,
      from_date,
      to_date,
      number_of_guests,
      coupon_number) )
  {
      return false
  }

  if (!room.has_allotment_between(
      from_date,
      to_date) )
  {
      return false
  }

  return true
}

function is_price_valid(
  room,
  package,
  from_date,
  to_date,
  number_of_guests,
  coupon_number)
{
  declare array price_list
  declare array discount_list
  declare array extra_price_list

  for (i = from_date; i < to_date; i++)
  {
      price_list[i] = calculate_basic_price(
                        room,
                        package,
                        i)

      discount_list[i] = calculate_discount(
                            room,
                            package,
                            i,
                            coupon_number)
      price_list[i] -= discount_list[i]

      extra_price_list[i] = calculate_extra_price(
                              room,
                              package,
                              i,
                              number_of_guests)
      price_list[i] += extra_price_list[i]
  }

  ... check the computed price_list against the package's
      price constraints and return true or false ...
}

Hope that helps...
badaiaqrandista
Thursday, September 01, 2005
 
 
I suppose that you have a database.
How long do the queries take?
That's usually where the speed problem is.
You have to review your indexes, etc...

Regards,
Marcel
Marcel
Thursday, September 01, 2005
 
 
I just posted the pseudocode, but somehow it didn't show up. I'll try again tomorrow.

Anyway, I have done extensive profiling and optimization. I have cached most of the results of database queries (the latest database access profiling showed that only 1-2 seconds out of 85 are spent in the database). I have put indexes on as many hotspots as possible in the database. I preload all the modules into memory. And I have set MaxClients to 3, to allow more CPU time for each Apache child.

When hit by requests, top shows >90% CPU usage. So my conclusion is that my application is CPU bound.

To attack that, I know of only three choices, which are listed in the original post. What I'd like to know is: which one is the best way to pursue? Does anyone have experience with any of these?

Thank you...
badaiaqrandista
Thursday, September 01, 2005
 
 
badaiaqrandista,

You need to find out where the algorithm is spending its time before you try to optimize. If the native profilers don't work, or are too complicated (I have no idea...), then maybe you could instrument the code with timers. Something like:

timer {
  timer() {
    start_time = now();
  }
  ~timer() {
    stop_time = now();
    time_timer += stop_time - start_time;
  }
}

Put a static timer in every function, and report on the elapsed times when you're done.
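
In Perl you could fake the same accumulator with Time::HiRes (just a sketch; the timed() wrapper and its labels are made up):

use Time::HiRes qw(gettimeofday tv_interval);

our %elapsed;    # accumulated wall time, keyed by label

sub timed {
    my ($label, $code) = @_;
    my $t0 = [gettimeofday];
    my @result = $code->();
    $elapsed{$label} += tv_interval($t0);
    return @result;
}

# e.g.: my @ok = timed(is_price_valid => sub { is_price_valid(@args) });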

We do something similar to this in our (commercial) product, and it's great for getting performance data from the field...
Steve S
Thursday, September 01, 2005
 
 
Maybe I am reading this wrong, but your pseudocode seems to indicate that you are loading all the data into your script and then looping three deep through it. That would bring even the strongest hardware to its knees, no matter what language you use. An optimized database with proper queries will be orders of magnitude faster.
Jan Derk
Thursday, September 01, 2005
 
 
>  to indicate that you are loading all data into your script and then loop 3 deep through it

Might he want a stored procedure with well keyed tables?
son of parnas
Thursday, September 01, 2005
 
 
when you say:

foreach (hotel in hotels) {
  foreach (room in hotel.rooms) {
      foreach (package in hotel.packages) {

does this equate, in your app, to something like:

select * from hotels
  foreach hotel
    get id
    select * from rooms where hotel = hotelid
      foreach room
        get id
        select * from packages where room = roomid


Because if it does, therein lies your problem.
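
A single joined query fetches the same data in one round trip (a sketch; I'm guessing at the schema and assuming a DBI handle in $dbh):

my $rows = $dbh->selectall_arrayref(q{
    select h.id as hotel_id, r.id as room_id, p.id as package_id
    from hotels h
    join rooms r    on r.hotel_id = h.id
    join packages p on p.hotel_id = h.id
}, { Slice => {} });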
redeye
Thursday, September 01, 2005
 
 
+1 for profiling and moving the logic closer to the data (e.g. stored procedures or better WHERE clauses)
Jonas Grumby
Thursday, September 01, 2005
 
 
Suggesting a stored procedure distracts from the real problem, which is the highly inefficient code.

Option 1:
Gain 25% by moving the inefficient code into a stored procedure.

Option 2:
Gain 30% by rewriting the thing in C.

Option 3:
Gain 1000% by replacing the brute force loops with optimized queries.

Plus, we all know stored procedures are evil and should only be used as a last resort.
Jan Derk
Thursday, September 01, 2005
 
 
>  Suggesting a stored procedure distracts from the real problem

Don't think so. They are paging data in over the network for no purpose.
son of parnas
Thursday, September 01, 2005
 
 
> loading all data into your script and then loop 3 deep
> through it.

> An optimized database with proper queries will be orders of
> magnitude faster.

The problem is: how do I replace the conditions implemented by the is_available function with SQL queries?
badaiaqrandista
Friday, September 02, 2005
 
 
You could start with something like


select *
from rooms ro
where num_guests between min_guests and max_guests
  and not exists
    (select 1 from reservations re
     where re.room_number = ro.room_number
       and (proposed_first_day
              between re.first_day and re.last_day - 1
            or proposed_last_day
              between re.first_day + 1 and re.last_day))
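
Wired into DBI with placeholders, that might look like this (still a sketch; $dbh and the exact column names are assumed):

my $free = $dbh->selectall_arrayref(q{
    select *
    from rooms ro
    where ? between min_guests and max_guests
      and not exists
        (select 1 from reservations re
         where re.room_number = ro.room_number
           and (? between re.first_day and re.last_day - 1
                or ? between re.first_day + 1 and re.last_day))
}, { Slice => {} }, $number_of_guests, $from_date, $to_date);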
George Jansen
Friday, September 02, 2005
 
 
I think you need to rework your algorithm from the ground up.

Right now, you're performing a complete brute force search through your entire database just to find a single room package. Then, when the next person searches, you're starting from the beginning again, searching through every single hotel, room, and package in the whole world.

That's ridiculously inefficient.

If you wanted to find the definition of a particular word, would you open up the dictionary to page 1 and continue reading until you found your word? Of course not. You'd flip to approximately the right place in the dictionary, maybe within a letter or two of your goal, and then you'd flip around from there, honing in on the word you're looking for.

If you were searching for a particular topic in a technical book, you'd flip to the index first and look up some sort of search term there. The index might be only 20 pages long, while the book might be over a thousand pages in length. So the index would give you some clues for where to look through the text, and then you could go directly to those pages to try to find the information you're looking for.

You need to do the same kind of thing in your application. Use some sort of offline processing to produce an index appropriate to your application. Then use the index to do your first level of searching. After you've performed that search, use the results of the index lookup to narrow your search.

It might be the case that you can have a database do all the tricky indexing and searching for you, just using standard database indices and SQL queries. Or, depending upon the complexity of your business rules, you may have to build some application-specific index tables using your own logic.

Either way, the results will be far better than iterating sequentially through tens of thousands of combinatorial possibilities and checking the validity of every single one.
BenjiSmith
Friday, September 02, 2005
 
 
George Jansen, your query will be more efficient if you use a JOIN clause rather than using the WHERE clause to link the 're' and 'ro' tables.
BenjiSmith
Friday, September 02, 2005
 
 
BenjiSmith, so the idea is to redesign the data to make the algorithm simpler and faster.

Is there any link where I can learn more and get some ideas?

Thanks
badaiaqrandista
Friday, September 02, 2005
 
 
Think about different ways of modelling your data.

Rather than looking at an array of hotels, each of which contains an array of rooms, each of which contains an array of packages, model your data as...say...a calendar containing booking percentages for each hotel.

When your user searches for a package on September 19th, jump to the appropriate page in the calendar and start searching from there. Each page in the calendar should contain a list like this...

Holiday Inn on Green Street ... 26% booked
Motel 6 on Huntington Blvd ... 62% booked
Travelodge on 9th Street ... 70% booked
Hilton on Park Avenue ... 92% booked

Notice how the index is sorted according to the current bookings percentage. The most available hotel is first on the list. You're most likely to find empty rooms in this hotel. If you started looking at the Hilton, you'd have a harder time finding available rooms.

The nice thing about an index like this is that you can regenerate it offline in a batch process in the middle of the night (or whatever) and then you can use it all day the next day and it will still be reasonably timely.
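
In DBI terms the first-level lookup could be as small as this (a sketch; the summary table hotel_day_summary and its columns are invented for illustration):

my $hotels = $dbh->selectall_arrayref(q{
    select hotel_id, max(pct_booked) as peak_booked
    from hotel_day_summary
    where day between ? and ?
    group by hotel_id
    order by peak_booked            -- least-booked hotels first
}, { Slice => {} }, $from_date, $to_date);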

Of course, your business requirements might make it necessary to use a different indexing strategy. If you always offer people the least-booked hotel first, then you might also be offering the lowest quality hotel as the first choice, and that might be a bad decision.

And maybe a calendar isn't a good top-level model for your data either. Think carefully about how users tend to interact with the data, and devise a data model which lends itself to good, compact summaries conforming with the user's expectations.

But the point is this: find some way to summarize and catalogue the information in your database so that you do your first level of searching within that index. Once you've found a pretty good match in the index, then start searching in the full dataset from that point, until you've found what you need.

Kapish?
BenjiSmith
Friday, September 02, 2005
 
 
Your application isn't CPU bound. The particular loop you showed seems to be.

You can reduce the cost of the body of the loop.
You can reduce the number of times the loop iterates.
The basic principle is to do the least work possible in your loop (I know you know this).

The body seems pretty thin already, though I can't tell what is_price_valid really does. That inner loop over the dates, applying the discount and extra price to the base price for each day, seems expensive. The "lowest fruit" here seems to be to change the order of the body: most expensive discrimination last. It seems to me that has_allotment_between would cost less than is_price_valid, so run that first.

The "lowest fruit" here is to sort the data. sort hotel.rooms by maxguests, sort hotel.packages by min guests, access the lists by a index vector.
candidateRooms = hotel.rooms[ hotels.room.roomSizeIndex[ nGuests ] : ]. Depending on the data this could save you a bit.
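
For example, in Perl (a sketch; assuming each room is a hashref with a max_guests key):

# sort once (or keep the list sorted offline)
my @rooms_by_cap = sort { $a->{max_guests} <=> $b->{max_guests} }
                   @{ $hotel->{rooms} };

# lower-bound binary search: first index able to hold $n_guests
my ($lo, $hi) = (0, scalar @rooms_by_cap);
while ($lo < $hi) {
    my $mid = int(($lo + $hi) / 2);
    if ($rooms_by_cap[$mid]{max_guests} < $n_guests) { $lo = $mid + 1 }
    else                                             { $hi = $mid }
}

# everything from the cut point on is big enough
my @candidates = @rooms_by_cap[$lo .. $#rooms_by_cap];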

Now the "lowest fruit" approach will only get you so far. You really want to attack the "biggest" fruit.
B
Sunday, September 04, 2005
 
 
BenjiSmith,

Kapish!!! Actually, what the system should do is list all available room/package combinations that match the guest's query.

> Think carefully about how users tend to interact with
> the data, and devise a data model which lends itself to
> good, compact summaries conforming with the user's
> expectations.

Could you elaborate more on this? With an example, perhaps? It sounds like it has a deep meaning that I can't quite grasp.

> But the point is this: find some way to summarize and
> catalogue the information in your database so that you
> do your first level of searching within that index.

What do you mean by 'first level of searching'?

I'll try to work on that index/catalogue/offline data processing technique that you describe. I had the basic idea a while ago but I did not know how to proceed with it.

Does anyone know of an open source project that does massive database-based calculation? I could probably learn from it.

Otherwise, I'll just charge on with this idea and hack away...

Thanks...
badaiaqrandista
Sunday, September 04, 2005
 
 
> > Think carefully about how users tend to interact with
> > the data, and devise a data model which lends itself to
> > good, compact summaries conforming with the user's
> > expectations.
>
> Could you elaborate more on this? By example probably?
> It sounds have a deep meaning that I can't really
> understand.

When you have tens of thousands of potential searches to conduct for every query (every package, for every room, at every hotel in the entire universe), it's better if you can eliminate some of those potential searches as quickly as possible. So create some sort of data model that allows you to quickly discard irrelevant searches.

For example, when people search for hotels, they expect to narrow their search FIRST by location. So you should probably index all of your hotels by state and city. Then, when a user selects "Pittsburgh, PA", you don't have to search through all of the rooms in "New York, NY", because they can be eliminated from the get-go.

After first narrowing by city, I would probably narrow by some sort of calendar range.

Because that's what the user *expects*, you can build your data model around those expectations. There's no reason to search through a bunch of extra records when you know in advance that the user doesn't care about most of those records.
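
As a sketch (schema assumed), that first narrowing can happen entirely in the database, before any availability logic runs:

my $rooms = $dbh->selectall_arrayref(q{
    select r.*
    from hotels h
    join rooms r on r.hotel_id = h.id
    where h.state = ? and h.city = ?
}, { Slice => {} }, $state, $city);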
BenjiSmith
Tuesday, September 06, 2005
 
 
> > But the point is this: find some way to summarize and
> > catalogue the information in your database so that you
> > do your first level of searching within that index.
>
> What do you mean by 'first level of searching'?

The first search is a search in the index. When you find a relevant entry in the index, it points you to a location in the full dataset (or another, slightly larger index) at which you can continue searching.
BenjiSmith
Tuesday, September 06, 2005
 
 
By the way, I should mention that nearly every RDBMS provides significant indexing functionality. Chances are, if you have all of your records in a database, you can have the DB index your records for you, and then it will consult your indices transparently whenever you run a query.

There are, of course, many scenarios where you might want to build your own custom types of indices within your application. But your FIRST STEP is definitely to use the index features of your database engine. Only after you've profiled a DB index (and found it lacking) should you build your own.
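
For this thread's examples, that first step might be nothing more than (column names assumed from the pseudocode and queries above):

$dbh->do(q{create index idx_rooms_capacity
           on rooms (hotel_id, max_guests)});
$dbh->do(q{create index idx_reservations_room_dates
           on reservations (room_number, first_day, last_day)});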
BenjiSmith
Tuesday, September 06, 2005
 
 

This topic is archived. No further replies will be accepted.
