Speeding up performance of Python

Once upon a time we built a search server in C.  It was fast, but we needed to make radical changes to it and prototyped the new version in Python.

The Python version took only a few days to write compared to the C version which took nearly a month.

The Python version pegs the CPU.  The C version never did.  We've been throwing hardware at it, but we'd also like to explore off-the-shelf optimization tools.  I could easily spend too much time optimizing this by hand, but I'd rather throw money at it since I've got too much on my plate as it is.

Psyco is supposed to offer a good performance improvement in CPU-bound applications, but it only works on 32-bit platforms (we need to run on 64-bit since we've got a 4GB+ working set).

Can anyone recommend any commercial tools that do either instruction-stream peephole optimization or actual Python runtime optimization on 64-bit?
Michael B
Tuesday, December 26, 2006
 
 
I guess the standard answer would be "identify the bottlenecks" and "you can write a library in C if you need to". You've got the C code, after all.
Pakter
Tuesday, December 26, 2006
 
 
shedskin will compile your Python into C++. It's limited in scope, but perhaps you could fudge your code until it fits.

Pyrex is a language almost like Python. A lot of Python will compile as Pyrex and run at more or less the same speed; however, Pyrex accepts type annotations, which means that if you declare a var's type to be "int", it will be just as fast as C.
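
For example, a minimal Pyrex sketch (a made-up function; the typed argument, the cdef line and the integer for-loop are the only non-Python parts):

    def total(int n):
        # cdef gives the counters C types, so the loop compiles to a
        # plain C for-loop instead of Python object arithmetic.
        cdef int i, s
        s = 0
        for i from 0 <= i < n:
            s = s + i
        return s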

NumPy is a good option if your computation fits its model. Most programs don't fit it as written, but many can be rewritten that way.
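
For example, a scan-and-filter that would otherwise be a slow Python loop collapses into a couple of array operations (a rough sketch with random filler data):

    import numpy

    # One big array instead of millions of Python objects.
    prices = numpy.random.random(10 * 1000 * 1000)

    # The comparison and the selection both run as C loops inside NumPy.
    hits = prices[prices > 0.999]
    print("%d hits, max %.6f" % (len(hits), hits.max()))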

Tuesday, December 26, 2006
 
 
> I guess the standard answer would be "identify the bottlenecks" and "you can write a library in C if you need to". You've got the C code, after all.

Spending a few hundred bucks on a tool that does runtime "hot spot" profiling has better ROI than hand-rewriting stuff in C.  Assuming such a tool exists.
Michael B
Tuesday, December 26, 2006
 
 
But rewrite in C you must, eventually.  Python is interpreted (not run-time compiled like C# and Java); that's why it's so dog-slow.  Psyco is a run-time compiler, but not a very good one; it was still far slower than C# in my brief tests.

When Python fans talk about "Python programs" that run super-fast they really mean short Python scripts that call lots of hand-written C code.  That's the only way to get good performance out of Python.
Chris Nahr
Wednesday, December 27, 2006
 
 
If you have a more-than-4GB data set, you might get better performance and scalability by putting work into distributing the search over multiple boxes instead of increasing the performance of a single box. That way you wouldn't even need 64-bit procs.
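
A sketch of the idea - partition by key so each box only holds a slice that fits comfortably in a 32-bit address space (the host names and routing rule here are made up):

    # Hypothetical shard list; each box holds roughly a third of the data.
    SHARDS = ["search1:9000", "search2:9000", "search3:9000"]

    def shard_for(key):
        # Stable routing: the same key always lands on the same box.
        return SHARDS[hash(key) % len(SHARDS)]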
Andrew Murray
Wednesday, December 27, 2006
 
 
Have you tried hotshot yet? That's the profiler in the standard library.  Of course profiling can be slooooooooow in any language.  Also, some details on how you're storing this >4GB dataset would be appreciated.
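
For reference, a minimal hotshot session looks something like this (handle_request and query are placeholders for your own entry point):

    import hotshot, hotshot.stats

    prof = hotshot.Profile("search.prof")
    prof.runcall(handle_request, query)   # profile one representative request
    prof.close()

    stats = hotshot.stats.load("search.prof")
    stats.sort_stats("time", "calls")
    stats.print_stats(20)                 # the 20 hottest functions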

As someone else said, the standard response is "write the slow stuff in C".

You're not going to find any pro tools to do this, but at least a few of the Python core developers do contract work, so it may be worth hiring one of them for a couple of weeks or a month to take a look.
Grant
Wednesday, December 27, 2006
 
 
I'm trying to resist the urge to hand-optimize this.

Every other time I've dealt with something slow I've succumbed to the allure of hand-optimizing.  And why not?  It's a lot of fun and makes me feel like a total badass.  I'm concerned I'm just letting my ego get the better of me, and that there's a cheaper solution.

I always have this nagging feeling that time spent optimizing code is wasteful because of how quickly it depreciates, thanks to Moore's law.  Yes, I know there are exceptions and O(N^N) algorithms that no amount of hardware will help and maybe YOU are a REAL macho programmer that can rewrite faster than the boss can key in their credit card and blah blah blah.  Save it.

I really want a shrink wrapped solution.
Michael B
Wednesday, December 27, 2006
 
 
> shedskin will compile your Python into C++. It's limited in scope, but perhaps you could fudge your code until it fits.

If shedskin were more complete (i.e., supported the standard library) it'd probably be a perfect fit.

The other solutions seem more invasive than using shedskin, which makes them even less attractive.

Thanks for the suggestions.
Michael B
Wednesday, December 27, 2006
 
 
You might try IronPython ( http://www.codeplex.com/Wiki/View.aspx?ProjectName=IronPython ), which I believe compiles Python down to CLR byte code. There might also be a similar product/project for compiling Python to Java byte code. In that case you could leverage these mature runtime environments and hopefully see a pretty noticeable speed improvement (not sure whether either runtime can deal with 4GB+ memory, though).
r
Wednesday, December 27, 2006
 
 
There is one for Java too; it's called Jython: http://www.jython.org
r
Wednesday, December 27, 2006
 
 
My random guess is that whatever you are doing in Python isn't relinquishing control to the OS enough. I encountered something similar writing a server program in C++: in the absence of a good way of handling large numbers of connections, you're limited to whizzing through the list of connections doing a non-blocking read/write/select for each one. This never gives up time gracefully (you asked for non-blocking, and you got it!), so you end up with the CPU pegged.

Writing quality scalable server code that conforms to best practices is 0% of my day job so after careful consideration I put a Sleep(1) in exactly the right place and the CPU usage went from 99% to 1-2% (which I felt was still over the odds but good enough). No noticeable performance degradation, but of course there probably was some if you looked really really carefully.
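
In Python the same trick is usually spelled as a timeout on select rather than a sleep - a rough sketch, where handle() stands in for the real per-connection work:

    import select

    def serve(connections):
        while True:
            # Block for up to 10ms instead of spinning through non-blocking
            # reads; select returns immediately once any socket is ready.
            readable, _, _ = select.select(connections, [], [], 0.01)
            for conn in readable:
                handle(conn)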

In the absence of any specific information about the code and how it works this is the best random suggestion I can come up with. I hope that if it doesn't help it at least doesn't hinder.
Tom_
Wednesday, December 27, 2006
 
 
> My random guess is that whatever you are doing in Python isn't relinquishing control to the OS enough.

Pretty good shot in the dark with the sleep suggestion, but it really is CPU bound.

It has to search/sort gigabytes of data (all memory resident) to serve each request.  Theoretically, the CPU can, if you're working with assembly, scan hundreds of gigabytes of memory a second.  As soon as you add all of the overhead of super duper flexible Python it falls off by orders of magnitude.  It's fast for individual requests but scales pretty badly.

It's easy to add identical servers and distribute the load, but it strikes me as silly that there's no tool that will automatically figure out that all of the extra Python flexibility that's sapping cycles can be tossed out at runtime.
Michael B
Wednesday, December 27, 2006
 
 
"It has to search/sort gigabytes of data (all memory resident) to serve each request."

This sounds like you should spend some time implementing better search algorithms and indexing methods. Or does it already use the best methods appropriate for this specific task?
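
For instance, if requests probe the data by key, a sorted index built once at load time turns a scan of gigabytes into a few dozen comparisons - a minimal sketch (record_keys is hypothetical):

    import bisect

    keys = sorted(record_keys)   # build once at startup, not per request

    def find(k):
        # O(log n): ~33 comparisons for a billion keys, instead of a scan.
        i = bisect.bisect_left(keys, k)
        if i != len(keys) and keys[i] == k:
            return i
        return -1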
Secure
Thursday, December 28, 2006
 
 
No.

There is no shrinkwrap tool that will get you an order of magnitude increase in performance of python code.
Grant
Thursday, December 28, 2006
 
 
"It has to search/sort gigabytes of data (all memory resident) to serve each request."
If it is really doing this then Python will be as fast as anything else. It is pushing around pointers and looking up hash tables as fast as you are going to do it in C.

Quick profiling step: try it with 10%, 25% and 50% of the data size and see if you are hitting a bottleneck (due to Python, the OS or hardware) at some size - then look at splitting over multiple instances/machines.
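
Something like this would do, where load_dataset, sample_queries and handle_request stand in for your own code:

    import time

    for fraction in (0.10, 0.25, 0.50, 1.00):
        data = load_dataset(fraction)      # hypothetical loader
        start = time.time()
        for q in sample_queries:           # hypothetical fixed query set
            handle_request(data, q)
        elapsed = time.time() - start
        # If time grows much faster than data size, you've hit a wall.
        print("%3d%% of the data: %.2fs" % (fraction * 100, elapsed))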
Martin
Thursday, December 28, 2006
 
 
"If it is really doing this then Python will be as fast as anything else. It is pushing around pointers and looking up hash tables as fast as you are going to do it in C."

Yeah, ignoring that totally unimportant constant time factor of, oh, let's say 1,000 or so...
Chris Nahr
Friday, December 29, 2006
 
 
If you're implementing the sort in Python yourself - yes.
If you call the Python library sort on a list, it's done in C.
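
To illustrate the difference (just a sketch):

    records = [(3, "carol"), (1, "alice"), (2, "bob")]

    records.sort()                    # the whole comparison loop runs in C
    records.sort(key=lambda r: r[1])  # only the key extraction is Python;
                                      # the sorting itself is still C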
Martin
Friday, December 29, 2006
 
 
Sure, but that's what people have been saying in this thread.  I don't know Michael's present implementation, but if it's too slow, chances are it needs to be rewritten in C, or at least recoded to offload as much work as possible to library functions written in C.
Chris Nahr
Saturday, December 30, 2006
 
 
You can't scan hundreds of gigabytes of memory in a second using a normal CPU. A CPU runs at, generously, 3 gigahertz, and a 64-bit load moves 8 bytes, so you are looking at about 24 GB/second, assuming you do nothing but load data and that all the gigabytes were cached. The main memory bus is orders of magnitude slower. In fact, you can't even have hundreds of gigabytes of RAM without specialist hardware, so you probably hit the disk, which is orders of magnitude slower than the RAM bus. I think, realistically, you are looking at 1 GB/second tops, and for disk-bound datasets about 300 MB/second, though you might do better with striped disk arrays.
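
The back-of-the-envelope version as a script:

    clock_hz = 3.0e9         # a generous 3 GHz core
    bytes_per_load = 8       # one 64-bit load retired per cycle, best case
    peak_gb = clock_hz * bytes_per_load / 1e9
    print("theoretical peak: %.0f GB/s" % peak_gb)   # ~24, before memory stalls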
C won't remove hardware limits
Saturday, December 30, 2006
 
 
The last application I attempted to write in Python ran on a 4 CPU machine and pegged one of them constantly.  It turns out there's nothing you can do to fix the problem with the GIL mechanism ... I switched to Java.

Check out: http://www.oreillynet.com/onlamp/blog/2005/10/does_python_have_a_concurrency.html
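
A toy benchmark makes it visible - on a multi-CPU box the threaded run takes about as long as the sequential one, because only one thread can execute Python bytecode at a time:

    import threading, time

    def burn(n):
        while n:        # pure-Python CPU work; the GIL serializes it
            n -= 1

    N = 10 * 1000 * 1000

    start = time.time()
    burn(N); burn(N)
    print("one thread, twice: %.2fs" % (time.time() - start))

    start = time.time()
    a = threading.Thread(target=burn, args=(N,))
    b = threading.Thread(target=burn, args=(N,))
    a.start(); b.start(); a.join(); b.join()
    print("two threads:       %.2fs" % (time.time() - start))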
Steve Moyer
Wednesday, January 03, 2007
 
 
If yours is the same problem as http://discuss.joelonsoftware.com/default.asp?joel.3.422998.33 then I think you should run a profiler, to find out why it's slow, *before* you decide how to optimize it: e.g. before you decide that instruction peephole optimizations are worth investigating as a solution, as opposed to, say, algorithmic and/or I/O optimizations.
Christopher Wells
Thursday, January 04, 2007
 
 
Have you checked paging on your OS? It's likely the memory allocation for such a large chunk is suboptimal with the default memory allocator, and you are paying a heavy price for memory access.
son of parnas
Thursday, January 04, 2007
 
 
> The last application I attempted to write in Python ran on a 4 CPU machine and pegged one of them constantly.  It turns out there's nothing you can do to fix the problem with the GIL mechanism ... I switched to Java.

I can more or less do a 1:1 line-by-line conversion of the Python code to Java code in probably 5-6 hours.  Might not be a bad way to spend a lonely Friday night.

My Java experience from a data processing perspective is pretty light. Do Java apps span multiple CPUs effectively on Linux 64-bit?  Did they ever get around to making that "HotSpot" technology work? ;)
Michael B
Friday, January 05, 2007
 
 
Yes, Java will use all the CPUs!  Remember that Java WON'T use all the memory unless you increase the maximum heap size allowed within the JVM (it's the -Xmx parameter, e.g. -Xmx6144m for a 6GB heap).
Steve Moyer
Monday, January 08, 2007
 
 

This topic is archived. No further replies will be accepted.
