The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Multithreaded design in web application...

I have a PHP script that uses cURL to fetch a bunch of pages for comparing prices. The list of pages has grown to about 10 now, and predictably the performance has degraded; it takes longer for the results to appear.

I understand this is because all the pages are fetched one after the other, and that this situation is a prime example of how multithreading can improve performance.

Is there any threading library that I can use from PHP?

If it's not possible with PHP, would it be a good idea to look at JSP/servlets? Are they fit for my purpose?

Any hint/advice will be greatly appreciated.

Thanks in advance.
tss
Wednesday, May 03, 2006
 
 
If you are using PHP5, then see http://uk.php.net/curl_multi_exec
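
A minimal sketch of what that looks like, fetching several pages at once with the curl_multi_* functions (the URLs here are just placeholders):

<?php
// Placeholder list of price pages to fetch.
$urls = array(
    'http://example.com/prices-a',
    'http://example.com/prices-b',
    'http://example.com/prices-c',
);

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // don't let one slow site stall everything
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Drive all the transfers until every one has finished.
$running = 0;
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);  // wait for activity instead of busy-looping
} while ($running > 0);

// Collect the results and clean up.
$results = array();
foreach ($handles as $url => $ch) {
    $results[$url] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
?>

Because the transfers run concurrently, the total wall-clock time ends up roughly equal to the slowest single fetch rather than the sum of all of them.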

Otherwise, this is probably a good time to start learning Java. Rather than using JSP/servlets alone (which is *so* 1999), I recommend going straight in with JSF, which gives you a much nicer event-driven, GUI-style model.

There is a very good set of articles which will guide you through from (almost) first principles at http://www-128.ibm.com/developerworks/views/java/libraryview.jsp?topic_by=All+topics+and+related+products&sort_order=asc&lcl_sort_order=desc&search_by=nonbelievers%3A&search_flag=true&type_by=All+Types&show_abstract=true&start_no=1&sort_by=Date&end_no=100&show_all=false
Darren Hague
Wednesday, May 03, 2006
 
 
"I understand it's because all the pages are fetched one after the other and that this situation is a prime example of how mutithreading can improve performance."

You probably don't need multithreading to fetch multiple pages at once (although you might have to use fopen() rather than cURL to fetch the URLs). Open a connection to each remote page you want, then cycle through them, grabbing the data for each in round-robin fashion.
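
A rough sketch of that round-robin idea, using the http:// stream wrapper (requires allow_url_fopen; note that the fopen() calls themselves still happen one after another, it's the downloads that overlap):

<?php
$urls = array(
    'http://example.com/prices-a',  // placeholders
    'http://example.com/prices-b',
);

// Open a connection to each page and switch the streams to non-blocking mode.
$streams = array();
$results = array();
foreach ($urls as $url) {
    $fp = fopen($url, 'r');
    if ($fp) {
        stream_set_blocking($fp, 0);
        $streams[$url] = $fp;
        $results[$url] = '';
    }
}

// Cycle through the open streams, grabbing whatever data each one has ready.
while ($streams) {
    foreach ($streams as $url => $fp) {
        $chunk = fread($fp, 8192);
        if ($chunk !== false) {
            $results[$url] .= $chunk;
        }
        if (feof($fp)) {
            fclose($fp);
            unset($streams[$url]);
        }
    }
    usleep(10000); // short pause so the loop doesn't spin the CPU
}
?>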

Is there some reason you need to fetch a bunch of pages on each user request? If this is something you can do periodically, in the background, then you could run a PHP script from a cron job to fetch the pages into a database. When a user visits, you give them the cached data rather than fetching it on each request.
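
The cron side of that can be very small. A sketch, using a file cache instead of a database just to keep it short (the page list and cache directory are made up):

<?php
// cache_prices.php, run from cron (e.g. every 15 minutes: */15 * * * * php cache_prices.php)
$pages = array(
    'site_a' => 'http://example.com/prices-a',  // placeholders
    'site_b' => 'http://example.com/prices-b',
);

foreach ($pages as $name => $url) {
    $html = file_get_contents($url);
    if ($html !== false) {
        file_put_contents('/var/cache/prices/' . $name . '.html', $html);
    }
}
?>

The user-facing page then just reads the cached copies (and can show how old they are via filemtime()) instead of hitting the remote sites on every request.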

"Is there any threading library that I can use from PHP ?"

No.
Almost H. Anonymous
Wednesday, May 03, 2006
 
 
Following up on Anonymous' suggestion...

If you do go the cache route, you can create a Perl script (or something similar) for each page you want to fetch, and schedule cron jobs to update the pages as appropriate. For example, one page may only need to be updated every Tuesday, another every morning, and a third may need to be updated every five minutes.

If the web site makes a minor change to one page, you'll only need to update the script for that page.

It's a poor man's way of doing it, but it's the easiest way to parallelize the fetch routines (you basically let the operating system do it). The next best option is to switch to something like Java or C#.
TheDavid
Wednesday, May 03, 2006
 
 
Do you really need those pages fetched in real time? Is there no way to make a cron script or a daemon that would periodically scrape those pages and store the results locally?

If not, I suggest you don't drop PHP completely; instead, use another language to build a service that can retrieve the pages concurrently (Java is a good option). Then you simply call that service from PHP and display the result.
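
The PHP side of that split stays tiny. A sketch, where the local service URL and its parameters are invented for illustration:

<?php
// A hypothetical local service (written in Java or whatever you like) that fetches
// all the price pages concurrently and returns the combined result.
$route = 'LHR-JFK';        // placeholder search parameters
$date  = '2006-06-01';

$service = 'http://localhost:8080/fetch-prices'
         . '?route=' . urlencode($route)
         . '&date=' . urlencode($date);

$result = file_get_contents($service);
if ($result === false) {
    echo 'Sorry, the price lookup is unavailable right now.';
} else {
    echo $result;   // or feed it into the normal PHP presentation layer
}
?>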

When it comes to serving dynamic Web pages, *nothing* really beats PHP's simplicity, integration and ubiquity. It's a lot weaker in other areas, however, which is why we'll never see an SQL database in PHP. (Or will we? http://www.c-worker.ch/txtdbapi/index_eng.php )
Berislav Lopac
Wednesday, May 03, 2006
 
 
Thanks a lot for your responses.

Berislav and A.H.A., I cannot run a cron job and pre-cache the results: the prices I compare are for air tickets, and there are hundreds of routes to search, a variable number of passengers, and the fact that a visitor can search on any date(s).

Darren, thanks a lot for the links. I wanted to use JSP/servlets because I only need them to fetch the results; I plan to handle the presentation part with PHP. JSF would be overkill in this situation...
tss
Wednesday, May 03, 2006
 
 
It seems like overkill to add another platform (in this case Java) just to execute a simple background task.

The first problem, fetching the URLs in a way that isn't sequential, is pretty easy to solve without multithreading (and, in fact, multithreading wouldn't be any more efficient).

The second problem is that you might want to spawn this as a background task entirely separate from the user's browsing experience if the duration is too long for any visitor to wait for. This is easily accomplished in any language. The problem you'll eventually have is notifying the visitor when the processing is complete.
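
One low-tech way to handle that notification step is to give the visitor a small status page that keeps refreshing until the background job has written its results somewhere. A sketch, with the flag-file convention invented purely for illustration:

<?php
// results.php?job=123, shown to the visitor while the background fetch runs.
$job = (int) $_GET['job'];

// Placeholder convention: the background task writes /tmp/job-<id>.html when it finishes.
$file = '/tmp/job-' . $job . '.html';

if (!file_exists($file)) {
    // Not finished yet: ask the browser to check again in a couple of seconds.
    echo '<meta http-equiv="refresh" content="2">';
    echo '<p>Still fetching prices, please wait...</p>';
} else {
    echo file_get_contents($file);
}
?>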
Almost H. Anonymous
Wednesday, May 03, 2006
 
 
Can you have processes running on the box?

I would:

Take each web request and create one queued request for each data source. Queue the requests in a DB, and have 5 or 10 background processes running that can pull requests from the queue, handle the page fetch and parsing, and drop the results back into a return queue. You're done when you have the same number of returns as you had requests (or when you decide to just ignore the rest).

You can then run the queue/dequeue on multiple machines and spread out the load.

The technique is based on SEDA (staged event-driven architecture), which I have used successfully in production in the past.
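
A minimal sketch of one of those background workers, assuming a table fetch_queue(id, url, status, result) and the old mysql_* functions; the schema and credentials are made up:

<?php
// worker.php: one of the 5 or 10 background processes, started from the shell.
mysql_connect('localhost', 'user', 'pass');
mysql_select_db('prices');

while (true) {
    // Claim one pending request. A real version should claim it atomically
    // (e.g. an UPDATE keyed on this worker's id) so two workers can't grab the same row.
    $res = mysql_query("SELECT id, url FROM fetch_queue WHERE status = 'new' LIMIT 1");
    $row = $res ? mysql_fetch_assoc($res) : false;
    if (!$row) {
        sleep(1);   // nothing queued; wait and poll again
        continue;
    }

    mysql_query("UPDATE fetch_queue SET status = 'working' WHERE id = " . (int) $row['id']);

    // Fetch (and, in practice, parse) the page, then drop the result back into the queue.
    $html = file_get_contents($row['url']);
    mysql_query("UPDATE fetch_queue SET status = 'done', result = '"
        . mysql_real_escape_string($html) . "' WHERE id = " . (int) $row['id']);
}
?>

The front end inserts one row per data source and then polls until the number of 'done' rows matches the number it inserted.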
Michael Johnson
Wednesday, May 03, 2006
 
 
When the upper-level functions in PHP get too slow for something like this, consider moving lower. I faced similar issues with a page that summarized a number of other pages. I went to talking straight HTTP to the sites instead of using the higher-level functions. It sounds bad, but it gave more control over timeouts and was faster. I'll post a snippet if I can find it (I'm away from work).
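
Not the original snippet, but the idea looks roughly like this; HTTP/1.0 with Connection: close keeps the response simple to read, and the host and path are placeholders:

<?php
// Talk HTTP directly so the connect and read timeouts are under our control.
$host = 'example.com';
$path = '/prices';

$fp = fsockopen($host, 80, $errno, $errstr, 5);   // 5-second connect timeout
if (!$fp) {
    die("Connect failed: $errstr ($errno)");
}
stream_set_timeout($fp, 5);                        // 5-second read timeout

fwrite($fp, "GET $path HTTP/1.0\r\nHost: $host\r\nConnection: close\r\n\r\n");

$response = '';
while (!feof($fp)) {
    $response .= fread($fp, 8192);
}
fclose($fp);

// Everything after the first blank line is the body.
$parts = explode("\r\n\r\n", $response, 2);
$body  = isset($parts[1]) ? $parts[1] : '';
?>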
Cymen
Wednesday, May 03, 2006
 
 
Learning some Perl and using LWP::Parallel is also an option.
Egor
Saturday, May 06, 2006
 
 

This topic is archived. No further replies will be accepted.
