The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Working with large sets of data (~800MB)

We have a floating-point number cruncher that needs to work on very large data sets.  Some of these can easily grow over time up to 800+MB.

So far the system uses a memory file until a threshold is met, and then it moves to disk I/O.  It works well, but still leaves me wondering if there is anything more I can do to speed things up.  The biggest worry is scalability, as more customers of the software are giving it larger datasets.

Any ideas on other ways to manage this?
Walter Williams Send private email
Monday, November 27, 2006
The cliche, knee jerk reaction is to provide more RAM and there's some truth to the statement. The collorary is that it really depends on what you're doing to the data itself and what kind of processor is available in the sense of whether it's a SIMD or an MISD processor. Or some other variant.

Basically once your data set reaches a certain size, dealing with it becomes for the most part, a hardware problem. You can delay crossing that threshold by optimizing the algorithm, reducing the operating system footprint, or in some cases, parallelizing the problem and utilizing a cluster.

As a frame of reference though, when I worked with satellite data (approximately 1GB raw, 600MB processed) ten years ago, simply upgrading from a workstation to a true rackmounted quad processor server saw processing time drop from about 2 hours 45 minutes to a mere 20 minutes. And that was without any optimization to take advantage of the other processors in the box.
Monday, November 27, 2006
Can this be split up into independant work units? If so, you can begin looking at horizontal scaling. Its fairly straghtforward to develop a controller/worker framework, especially if you already have a network messaging layer in place.
Monday, November 27, 2006
The first and largely obvious question is: do you need to manipulate the entire 800MB data set at once?

If we're talking about signal processing or linear algebra types of applications (an area that I deal with where large data sets are common), there are tons of algorithms that only deal with subsets of the data. A typical algorithm would be

- partition data
- load chunk 1 into memory
- process chunk 1
- process chunk 2
- combine chunks

There is usually some parameter you can tweak that gives you control over how big the chunks are (i.e. so you can optimize memory useage).

Secondly, make sure that wherever possible, you're avoiding making unnecessary copies of your data.
i.e. if data re-use isn't an issue don't do this (shifting an array, stupid example, but whatever):


instead write


How often do you need to retain the original data?

This may seem elementary, but there's no shortage of people I've met that miss this stuff.
Albert A. Borkowski
Monday, November 27, 2006
Well, it's going to depend completely on the algorithm, and from now on you're probably embarking on a small research project. I wish you the best of luck.

I'm not an expert, but I know someone who is, so here are a couple of pointers with food for thought.

Again best of luck
Peter Monsson Send private email
Monday, November 27, 2006
And the knee jerk reaction, in this case, is correct: they guy only has 800MB of data to work with (yes, he said it could get bigger, but he didn't say how much bigger, so I'm betting it's not more than a factor of 2). Any reasonable computer today can have AT LEAST 2GB of RAM, so we never need to worry about manually paging the data to disk (just rely on the virtual memory system to handle any paging). If it REALLY matters, get a 64-bit system and max out the memory (8 or 16 GB), again, no worries.

If, for some strange reason, your target system ABSOLUTELY can't be upgraded to a modern system with adequate memory, then the first order of business would be to increase the size of the virtual memory paging file so that the OS handles paging and your application can be written as if you had enough RAM for the task. If that can't be done (or if it is horribly slow) you can look at managing the paging process yourself, but only as a last resort.

Another option might be to compress the data in memory. At a minimum, all ranges of zero data should be eliminated (see "sparse matrices, implementation"). Next level of compression would be from identifying common values and referencing a common repository, but I'm not sure that will really help much. In any case, if your data isn't very sparse and doesn't show much repetition, you'll, eventually, be stuck paging to disk.

To summarize, first throw hardware at the problem. The speed of access to main memory is several orders of magnitude faster than disk access, so you get your biggest bang for you buck from buying lots of RAM. Having everything in RAM also makes your coding simpler, so you are less likely to have odd bugs to squash. Once your have exhausted the hardware route you should see if your OS is up to the task (most unix implementations should be, I don't know about Windows VM so I won't speculate). Only as a last resort should you roll your own solution.
Jeff Dutky Send private email
Monday, November 27, 2006
Look at memory mapped files.
It basically does what you currently do manually with loading data into memory then using disk but it uses the OS virtual memory system to do it for you - it is generally better than doing it yourself.
A similair API exists on windows / Unix - just the names are slightly different.
Martin Send private email
Monday, November 27, 2006
> Some of these can easily grow over time up to 800+MB.

These things are relative - one system I managed dealt with about 30MB a second ;-)

On a more constructive note, if performance does become a problem then it is time to become really familiar with a good profiling tool and the IO API of your OS. It is very difficult to generalize about these things - only to remember that optmization is something that has to be largely driven by experiment and recording of actual performance. Not by theorizing.
Arethuza Send private email
Tuesday, November 28, 2006
> then it moves to disk I/O

Working with a real-time server for example, it's common to say that you won't support its performance unless the customer installs enough RAM on the machine to support however much data/load the customer wants to put on it: becase there's a huge difference between RAM and disk performance.

By the way that would imply a 64-bit O/S if you have 8GB of data rather than 800 MB.

> Any ideas on other ways to manage this?

To improve the algorithm,

* Use a profiler
* Use a more efficient algorithm
* Multi-threaded on a multi-CPU machine
* Optimize the algorithm's data access/memory layout to optimize use of the CPUs' cache

If you're asking about how to implement it when you need disk access, I imagine that the key is to separate the I/O from the number crunching: because for example if you have a single number-crunching thread, then you don't want that thread to spend any part of its time blocked on a disk I/O.

I haven't used memory-mapped files, and I don't know the difference between ...

* A memory-mapped file
* RAM that's backed by the O/S page file

... but my worry would be that it blocks your number-crunching thread when the data doesn't fit in RAM: because when your thread touches a portion of the data that doesn't exist in RAM, then your thread is blocked (by the O/S) while that data is paged-in to memory from disk.

A concetual alternative is to have two threads which run concurrently ...

* A number crunching thread which operates on in-RAM data
* A disk I/O thread which pages disk to memory and vice-versa

... you might also do this with a single thread, if the thread is initiating *asynchronous* I/O calls (e.g using NT completion ports), which don't block it (i.e which leave it free to continue number-crunching on other in-RAM data while other I/O requests haven't completed yet).
Christopher Wells Send private email
Wednesday, November 29, 2006
If you're writing in C++ then take a look at boost::iostreams::mapped_file.  If you use STL then write a custom allocator that creates memory-mapped files instead of new/delete when necessary.

Memory-mapped files have two important advantages over "RAM backed by page file":

1. They are limited by the size of largest available continuous block of address space rather than virtual memory, and
2. You don't have to read data from files into memory before processing and write it back to files after.
Jeff Zanooda Send private email
Wednesday, November 29, 2006
Where memory mapped files work, they're fantastic.

When the size grows beyond the addressing limits of your compiler/hardware/OS limits, you can't do anything about it.  (i.e. if your maximum file size is 2GB determined by the OS, adding terrabytes of RAM won't help.)

And if you're doing sequential scanning, you want to load a chunk, and start the next chunk loading in the background while processing the first chunk, and so on and so forth. You don't want the OS's paging system keeping old chunks around for use while not pre-loading chunks you're about to access.  (OS-level paging is implementation defined - depending on your system that specific issue may or may not occur.)

It's one of those things you need to read up on in detail, and try several approaches to measure what actually works. Still, it is simple so worth trying at first to see if it is good enough.

Sunday, December 03, 2006

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
Powered by FogBugz