The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

c++ file caching algorithm

I have a program that accesses (read-only) a set of files on a regular basis.  It's a batch-type process that runs, opens the files, and then exits.  There are about 600 files that it accesses, and it opens and reads them in a particular order.  The program opens and closes a file for every single read operation.  The files are not on the local file system; they are on a SAN that we have to access remotely.  This behavior is slowing down the program tremendously.  I did not write any of this code, but it's my job to fix it.

My idea is to write some sort of 'file cache' that opens a file, reads it into memory and allows for read access.  This way we only perform a file open operation about 600 times as opposed to almost 100,000 times. 

I would be able to reuse a lot of code if I could treat a memory buffer like an ifstream.  I would be able to use all of the code that does the seek()s and read()s.  Is there anything in C++ that allows me to do this?  I was looking at the filebuf class, but I'm not sure if this is what I want.

I'm having some difficulty designing this algorithm.  Does anyone have any comments or suggestions?  Perhaps something like this already exists?

new c++ guy
Thursday, August 23, 2007
Memory-mapped files, but I don't know about accessing them over a network.
Are they changed by any other remote process? Do you have to check that the 'cached' copy is still valid?
There are RAM disk utilities for Windows; they aren't used very much since the Windows disk cache makes them redundant for local files.
You might be best off just taking a local snapshot and then letting the disk cache do this for you.
Martin
Thursday, August 23, 2007
Edit: sorry, memory mapping is the exact opposite of what you asked for: it lets you treat a file as a memory object!
Martin
Thursday, August 23, 2007
If you want to tie a buffer into iostreams, the thing to do is to create a custom stream buffer class that inherits from std::streambuf and use that to initialize an iostream. Documentation on how to do that is a little hard to come by, but you might try <>. Start reading at the heading "the C++ stream buffer".

In essence you'll want to create a stream buffer class that can be initialized with an array of data that you've already got handy, and will always fail to read more data.
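
A minimal sketch of the stream buffer described above, assuming the file's bytes are already in memory and outlive the buffer. The class name MemoryBuf is hypothetical. It leaves underflow() at its default, which reports end-of-file, so reads past the end of the array fail, and it adds seek support so existing seekg()/read() code keeps working.

```cpp
#include <cstddef>
#include <istream>
#include <streambuf>

// Hypothetical name: a read-only stream buffer over bytes already in memory.
// The caller must keep the underlying array alive while the buffer is in use.
class MemoryBuf : public std::streambuf {
public:
    MemoryBuf(const char* data, std::size_t size) {
        // setg() takes non-const pointers; this class never writes through them.
        char* p = const_cast<char*>(data);
        setg(p, p, p + size);
    }

protected:
    // underflow() is deliberately not overridden: the default returns EOF,
    // so the stream "always fails to read more data" once the array is used up.

    pos_type seekoff(off_type off, std::ios_base::seekdir dir,
                     std::ios_base::openmode which) override {
        if (!(which & std::ios_base::in)) return pos_type(off_type(-1));
        const off_type size = egptr() - eback();
        off_type base = 0;                                  // std::ios_base::beg
        if (dir == std::ios_base::cur) base = gptr() - eback();
        else if (dir == std::ios_base::end) base = size;
        const off_type target = base + off;
        if (target < 0 || target > size) return pos_type(off_type(-1));
        setg(eback(), eback() + target, egptr());
        return pos_type(target);
    }

    pos_type seekpos(pos_type pos, std::ios_base::openmode which) override {
        return seekoff(off_type(pos), std::ios_base::beg, which);
    }
};
```

To use it, construct a std::istream from a pointer to the buffer (std::istream in(&buf);) and pass that stream to the existing code wherever it accepts an std::istream&.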
Thursday, August 23, 2007
There's actually a standard set of classes which does what you want without having to implement a stream buffer(!): std::{,o,i}stringstream in <sstream>.
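
As a rough illustration of the <sstream> route: a std::string can hold arbitrary bytes, including '\0', so an istringstream built from one supports the same seekg()/read() calls as an ifstream. The helper names below are hypothetical, and note that this approach copies the data into the stream's own string.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: wrap raw bytes (which may contain '\0') in an input stream.
// The bytes are copied into the stream's internal string.
inline std::istringstream make_binary_stream(const char* data, std::size_t size) {
    return std::istringstream(std::string(data, size),
                              std::ios_base::in | std::ios_base::binary);
}

// Hypothetical helper: unformatted read of a 32-bit value at a byte offset,
// in the machine's native byte order. Returns false if the read falls short.
inline bool read_u32_at(std::istream& in, std::streamoff offset, std::uint32_t& value) {
    in.clear();                              // drop any eof/fail state from an earlier read
    in.seekg(offset, std::ios_base::beg);
    in.read(reinterpret_cast<char*>(&value), sizeof value);
    return static_cast<bool>(in);
}
```

Only the unformatted calls (read(), seekg(), tellg()) are used here; formatted extraction with operator>> is what assumes text.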

Some random docs on the web:
Mike Owens
Thursday, August 23, 2007
The files are essentially never updated.  We are running on linux accessing the files over an NFS mount.  The fileset is too large to store locally, so they reside on a SAN.  We cannot run our process on the machine that houses the files.  The files are binary, not text.
new c++ guy
Thursday, August 23, 2007
Memory mapped IO really sounds like the way to go, if the files will fit into memory.  MMIO does work in Linux across NFS.  It's been a while since I've done it, but at the time we figured that if it took X amount of time to access a file in RAM, it would take about 10X to access the same file on a local disk and 20X to access it via NFS.  So you're likely to save a lot of time.
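
For reference, a minimal sketch of mapping a file read-only with the POSIX calls available on Linux (open, fstat, mmap, munmap). The class name MappedFile is hypothetical, and error handling is reduced to throwing std::runtime_error.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical name: RAII wrapper around a read-only memory mapping of one file.
class MappedFile {
public:
    explicit MappedFile(const std::string& path) {
        const int fd = ::open(path.c_str(), O_RDONLY);
        if (fd < 0) throw std::runtime_error("open failed: " + path);

        struct stat st;
        if (::fstat(fd, &st) != 0) {
            ::close(fd);
            throw std::runtime_error("fstat failed: " + path);
        }
        size_ = static_cast<std::size_t>(st.st_size);

        if (size_ > 0) {  // mmap rejects zero-length mappings
            void* p = ::mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p == MAP_FAILED) {
                ::close(fd);
                throw std::runtime_error("mmap failed: " + path);
            }
            data_ = static_cast<const char*>(p);
        }
        ::close(fd);  // the mapping stays valid after the descriptor is closed
    }

    ~MappedFile() {
        if (data_) ::munmap(const_cast<char*>(data_), size_);
    }

    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;

    const char* data() const { return data_; }
    std::size_t size() const { return size_; }

private:
    const char* data_ = nullptr;
    std::size_t size_ = 0;
};
```

With this, each file is opened once, and every later "read" is a plain memory access into data(); pages are fetched on first touch rather than all up front.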

Out of curiosity, it sounds like you should be able to read the files in once (into some kind of data structures) and then use the data structures directly, rather than having to reconvert to input streams.

And the istringstream stuff mentioned above works best with text files, not as well with binary.
Michael G
Thursday, August 23, 2007
If the files are too big to fit on the local disk, will they really fit in local memory?
Chris Brooks
Thursday, August 23, 2007
Why don't you just read the files once into a collection of something light (a standard vector of blobs, for instance) and let the operating system take care of fitting it into available RAM?
Am I missing something in your situation that means you really need to have the streams?
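
A minimal sketch of the "read once into a collection of blobs" idea, assuming the files never change while the process runs (as stated earlier in the thread). The class name FileCache and its members are hypothetical.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical name: loads each file at most once and keeps its bytes in memory.
class FileCache {
public:
    // Returns the file's contents, reading from disk only on the first request.
    const std::vector<char>& get(const std::string& path) {
        auto it = blobs_.find(path);
        if (it != blobs_.end()) return it->second;  // cache hit: no file open

        std::ifstream in(path, std::ios::binary);
        if (!in) throw std::runtime_error("cannot open " + path);
        std::vector<char> bytes((std::istreambuf_iterator<char>(in)),
                                std::istreambuf_iterator<char>());
        ++opens_;
        // References to unordered_map elements stay valid across later inserts.
        return blobs_.emplace(path, std::move(bytes)).first->second;
    }

    // Number of files actually opened so far.
    std::size_t opens() const { return opens_; }

private:
    std::unordered_map<std::string, std::vector<char>> blobs_;
    std::size_t opens_ = 0;
};
```

With roughly 600 files, this brings the number of opens down to one per file, however many reads follow. Nothing is ever evicted, so it only suits a working set that fits in RAM plus swap.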
Deem
Friday, August 24, 2007

This topic is archived. No further replies will be accepted.
