The Design of Software (CLOSED)

MapReduce

I'm between projects at the moment, so I thought I'd have some fun and implement my own Map-Reduce framework. I previously built a workflow engine and a distributed master-slave processing framework, so I'm building on top of those projects. I'm not a big fan of the Hadoop code base, but I'm using it and the Google papers as my guides. I'm hoping to build a better, cleaner framework than Hadoop.

Tonight I implemented the basics and ported the word count example for testing. Does anyone know of some good (and large) unit tests I could work with? And if you have experience with Hadoop or Google's implementation, I'd welcome advice on what you like or don't like about them.
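For readers who haven't seen it, the word count example can be sketched in a single process as below. This is only an illustration of the map, shuffle, and reduce steps, not the code from this framework or from Hadoop; the class name WordCountSketch and its methods are made up for the example, and splitting on non-word characters is an assumption about what counts as a word.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process sketch; these names are not from Hadoop
// or from the framework discussed in this thread.
class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) { // split can yield a leading empty token
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group every emitted value under its key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the counts collected for one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Driver: run the three phases sequentially over all input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("the quick fox", "The lazy dog the end")));
    }
}
```

In a distributed framework the map and reduce calls would run on separate workers and the shuffle would move data between them; here everything runs in one thread to keep the data flow visible.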

Thanks!
Benjamin Manes
Tuesday, October 09, 2007
 
 
Oh well, no help. If anyone cares, in two days I extended my frameworks to support map-reduce, and the result covers 90% of Hadoop (minus the distributed file system and work placement, since we lack that infrastructure). I haven't had a chance to work on it tonight, but it took 15 seconds to perform a word count on a 3.5-million-line document on a local server (it can toggle between thread workers, JMS workers, etc.).

It was pretty fun and educational. I'd highly recommend it.
Benjamin Manes
Thursday, October 11, 2007
 
 
What language is it implemented in?  It's unfortunate that Hadoop is in Java, because Java is so heavy.

It would be nice to have one where you could write the computations in Python. The framework could be in pure Python, or it could be written in C with callers using Python.
Andy
Thursday, October 11, 2007
 
 
It's in Java. :)

But Hadoop is an utter mess and reimplements everything possible (even serialization). It's a non-layered, one-big-package design that shows its roots: an attempt to clone whatever Google has published in its papers. Mine is a heck of a lot saner (workflow engine -> distributed processing task, of which Map-Reduce is a layered abstraction).

I've heard Python has pretty poor concurrency support, which is necessary so that the master can manage workers efficiently. I'd expect that you would have to use a pure C library with a Python API, which is how Google did it. Since Java 5 is pretty strong in this area, my code is lock-free and performs quite well at 10,000+ concurrent workers.
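As a rough illustration of what Java 5's java.util.concurrent makes possible, here is a minimal sketch of workers pulling tasks from a shared queue without explicit locks. It is not the framework's code: the class LockFreeDispatchSketch and its process method are hypothetical, and "tasks" are just integers to be summed. Only the shared queue and counter are non-blocking; thread start and join are ordinary JVM operations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: worker threads drain a shared task queue using
// compare-and-swap based structures instead of synchronized blocks.
class LockFreeDispatchSketch {

    // Sums all task values using workerCount threads and returns the total.
    static long process(List<Integer> tasks, int workerCount) {
        final Queue<Integer> pending = new ConcurrentLinkedQueue<>(tasks);
        final AtomicLong total = new AtomicLong();

        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < workerCount; i++) {
            Thread worker = new Thread(() -> {
                Integer task;
                // poll() returns null once the queue is empty, ending the worker.
                while ((task = pending.poll()) != null) {
                    total.addAndGet(task); // atomic update, no lock held
                }
            });
            workers.add(worker);
            worker.start();
        }

        // Wait for every worker to finish before reading the result.
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return total.get();
    }
}
```

Each task is handed to exactly one worker because poll() removes it atomically, so the final total does not depend on how the threads interleave.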

But oh well, it's not like I could ever open-source it, so what it can do is pretty much moot.
Benjamin Manes
Thursday, October 11, 2007
 
 

This topic is archived. No further replies will be accepted.
