The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Multithreading on multiple hard drives: does it make it faster?

Hi everyone,
I have a program which collects file info; currently it's single-threaded.

Does multithreading make it faster to collect file info on multiple hard drives (one thread per hard drive)? How about hard drives that share a data line (e.g. IDE primary and secondary)? Has anyone tried this before?

Thanks
Sz
Saturday, December 06, 2008
 
 
You might get a speed up, or you might not. It's very difficult to tell a priori and may be hardware-dependent.

The questions you should ask are (1) is this task worth a factor of 2 (or # of drives) speed up? and (2) how costly would it be from a software engineering standpoint to parallelize?
d
Saturday, December 06, 2008
 
 
If your target platform supports it, I would look at asynchronous I/O before I went after multi-threading.
Jeff Dutky
Saturday, December 06, 2008
 
 
The question is unanswerable without more data. You need to look at your current program and answer the question: "Where does it spend most of its time?"  Once you know that, you can reduce or eliminate that bottleneck.  If the disk is the major bottleneck, then most people use some sort of RAID or striping to get higher disk throughput.

For example:
Suppose you find that step X in your program is executed 1,000,000 times during a run.
Those 1,000,000 executions take a total of 10 minutes.
The total run time of the program is 11 minutes.

Then your objective would be to work on step X.  Working on anything else would be a waste of time: at best it would eliminate 1 minute out of 11 minutes of run time.  So you could:
A. Reduce the number of times step X is called.
and/or
B. Make step X more efficient (say 2x, by improving the algorithm or logic).
Jim
Sunday, December 07, 2008
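Jim's measurement advice can be sketched in a few lines of Python. This is illustrative only, not the OP's actual program: it walks a tree and times the per-file stat() calls (the I/O, Jim's "step X") separately from everything else.

```python
import os
import time

def scan(root):
    """Walk a directory tree collecting (path, size, mtime), timing the
    stat() calls separately from the rest of the work."""
    io_time = 0.0
    start = time.perf_counter()
    infos = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            t0 = time.perf_counter()
            try:
                st = os.stat(path)          # the per-file I/O step
            except OSError:
                continue                    # file vanished mid-scan
            io_time += time.perf_counter() - t0
            infos.append((path, st.st_size, st.st_mtime))
    total = time.perf_counter() - start
    print(f"{len(infos)} files; {io_time:.3f}s in stat() out of {total:.3f}s")
    return infos, io_time, total
```

If the stat() share of the total is small, parallelizing the disk access won't help much, per Jim's point.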
 
 
If you can queue up enough read requests, your OS has a chance to order the disk accesses in a more optimal way; pushing them down disk channels correctly, even ordering the disk operations by track number to minimise head seek time. Etc.

Blocking disk IO in threads is one way of doing this.

Posix async IO is (on UNIXes) usually implemented using threads anyway.

And yes, gather the data before optimising.
Katie Lucas
Sunday, December 07, 2008
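"Blocking disk IO in threads" as Katie describes can be sketched with a thread pool; the function name and the worker count of 8 are illustrative guesses, not anything from the thread. The point is only that several stat() requests are outstanding at once, so the OS can reorder them.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def stat_all(paths, workers=8):
    """Issue blocking stat() calls from a small thread pool so the OS
    always has several outstanding requests it can reorder or merge."""
    def info(path):
        try:
            st = os.stat(path)
            return (path, st.st_size, st.st_mtime)
        except OSError:
            return None                    # skip files that disappear
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [r for r in pool.map(info, paths) if r is not None]
```

Whether this beats a single-threaded loop is exactly the hardware-dependent question the thread is debating: measure it.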
 
 
@d
>> How costly would it be from a software engineering standpoint to parallelize?
My program actually already has two threads: one thread to collect file info, and the main thread to update the GUI. So I think it won't be a problem to parallelize.

@Jeff Dutky
>> If your target platform supports it
My target platform is Windows

>> I would look at asynchronous I/O before I went after multi-threading.
I think I would still need one thread per drive to wait for the I/O operation to complete.

@Jim
>> Where does it spend most of its time.
It spends most of its time on I/O operations, that is, reading file info and saving it to memory. This is repeated for X number of files.
Sz
Sunday, December 07, 2008
 
 
Do drives read files in parallel or in sequence?
Sz
Sunday, December 07, 2008
 
 
@Katie
>> even ordering the disk operations by track number to minimise head seek time

Do you know how to get a file's track number on Windows? I've been searching for it and could not find anything.
Sz
Sunday, December 07, 2008
 
 
If you want to use Windows asynchronous I/O, take a look at http://msdn.microsoft.com/en-us/library/aa365683(VS.85).aspx

To work out where a file is on disc, the defrag API is one option - http://msdn.microsoft.com/en-us/library/aa363911(VS.85).aspx
Adam
Monday, December 08, 2008
 
 
If it is spending most of its time on IO, then how much of the whole is that?  If you eliminated it entirely, how much faster would the application be?  For example, if IO takes 10 seconds in your benchmark and the total time spent on everything is 100 seconds, then the best you can hope for is a total time of 90 seconds (total time with IO reduced to 0).
jim
Monday, December 08, 2008
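jim's bound is Amdahl's law applied to I/O, and it is easy to put numbers on. The helper below is illustrative; the 75% figure is the one the OP gives later in the thread.

```python
def best_case_speedup(total, io_time, io_factor=float("inf")):
    """Amdahl-style bound: only the I/O portion speeds up by io_factor;
    the rest of the run time is untouched."""
    new_total = (total - io_time) + io_time / io_factor
    return total / new_total

# jim's numbers: 10s of I/O in a 100s run, I/O eliminated entirely:
print(best_case_speedup(100, 10))      # 100/90, about 1.11x
# the OP's later figure of 75% I/O, split across 2 drives:
print(best_case_speedup(100, 75, 2))   # 1.6x
```

So with 75% of the time in I/O, a perfect two-drive split caps out at 1.6x: worthwhile, but well short of 2x.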
 
 
Hard drives read a file serially: bits are read off the disk through the read head and reassembled into bytes, usually a sector (or a few sectors) at a time.

First, it has to read the directory entry for the file, which has key information like the date and file size and the location of the first sector of the file.  If that's all the information you need, you don't actually HAVE to read the file data itself.

It's not read in parallel.  Unix (and Linux for all I know) tends to 'save' data to be written back to the hard drive in memory buffers, and only actually 'write' it to the physical drive "when it has time to do so" -- which is why you need to "sync" your buffers before shutting down Unix.  Reading is slightly faster if all the file data is written in contiguous sectors, but even that is not mandatory.

If you really have multiple hard drives, in theory you could issue a 'read' on each one while multi-tasking, and each task could 'pend' waiting for its data to be ready.  Typically this is a form of "premature optimization", because it may not buy you much compared to the difficulty of implementing such a scheme.  I mean, how will your program KNOW which subdirectories are mounted on which physical hard drives?  And spawning multiple tasks (or even multiple threads) one for each physical drive has some overhead associated with that -- not to mention the Inter-Process Communication you'll need to implement so the multiple tasks/threads can collate their data somewhere.
AllanL5
Monday, December 08, 2008
 
 
@Adam
Thanks for the links

@Jim
>> If it is spending most of its time on IO then how much of the whole is that?
About 75% of total time

@Allan5
Thanks for the infos
>> how will your program KNOW which subdirectories are mounted on which physical hard drives?
The user selects folder(s) before the program searches for files. I know which physical hard drive is involved using the drive letter extracted from those folders, so I have found a solution for this.

Thanks everyone for the feedback and suggestions. I guess I just have to try it.
Sz
Monday, December 08, 2008
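The drive-letter grouping Sz describes can be sketched portably; the function names here are made up for illustration. Note that, as pointed out in the next reply, a drive letter identifies a partition, not necessarily a distinct physical disk.

```python
import os

def drive_of(path):
    """'C:\\data\\x' -> 'C:' on Windows.  os.path.splitdrive returns ''
    on POSIX, so fall back to the filesystem root there."""
    drive = os.path.splitdrive(os.path.abspath(path))[0]
    return drive.upper() if drive else "/"

def group_folders_by_drive(folders):
    """Bucket the user's selected folders by drive, so each bucket can
    get its own worker thread."""
    groups = {}
    for folder in folders:
        groups.setdefault(drive_of(folder), []).append(folder)
    return groups
```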
 
 
Okay, but it's quite possible for one physical drive to have several partitions.  Each partition will have its own drive letter, yet will still be on the same physical drive.

That's for Windows.  Unix is even more complex, since each physical drive is "mounted" to a "mount point", which looks to your application like simply another subdirectory name.

Still, these aren't show-stoppers, you might as well give it a try.
AllanL5
Tuesday, December 09, 2008
 
 
"And spawning multiple tasks (or even multiple threads) one for each physical drive has some overhead associated with that -- not to mention the Inter-Process Communication you'll need to implement so the multiple tasks/threads can collate their data somewhere."

Yes, but if you want high speed data streaming, this is the way to go.

We have a column-oriented data store product that stores each column in a separate file, with the option to store each file on a separate drive if needed. We used overlapped I/O with IOCP to read the raw data from the disk, a series of queues and threads to stage the data, and a final thread to assemble it into "records" before it's dumped to a socket. It's extremely fast. We can saturate a gigE channel with ease.

There are applications that need to refresh very large in-memory data sets. Our data store improved performance by orders of magnitude. That is, a task that took almost a day using an RDBMS takes 30 minutes with our data store.
anony
Tuesday, December 09, 2008
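The staged queues-and-threads design anony describes can be shown as a toy, with real disk reads and IOCP replaced by in-memory lists; every name here is invented for illustration. One reader thread per column "file" feeds a queue, and an assembler thread zips the column streams back into records.

```python
import queue
import threading

def pipeline(columns):
    """columns: dict mapping column name -> list of values.
    Returns the reassembled records, one dict per row."""
    staged = {name: queue.Queue() for name in columns}
    records = []

    def reader(name, values):
        for v in values:
            staged[name].put(v)
        staged[name].put(None)            # end-of-stream sentinel

    def assembler():
        # iter(q.get, None) pulls from a queue until the sentinel
        streams = [iter(staged[n].get, None) for n in columns]
        for row in zip(*streams):
            records.append(dict(zip(columns, row)))

    readers = [threading.Thread(target=reader, args=(n, vals))
               for n, vals in columns.items()]
    asm = threading.Thread(target=assembler)
    for t in readers + [asm]:
        t.start()
    for t in readers + [asm]:
        t.join()
    return records
```

In the real system each reader would be pulling from its own spindle, which is where the per-drive parallelism pays off.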
 
 
Very cool, "anony".  Sounds like you have a very special purpose, high-volume data slinging application.  Very nice.

I might add that Gig-E runs at 125 MBytes/second (probably a little less, given the overhead of ethernet, but still darn fast) while the parallel bus to your hard drives runs quite a bit faster.  It's still an impressive achievement, given the overheads associated with Unix/Linux or whatever other operating system you're using to access the disks.

My point above was that this is doable, but only worth the effort for a few applications -- clearly yours needed it, but I'm not sure about the OP.
AllanL5
Tuesday, December 09, 2008
 
 
AllanL5:

We load several hundred million 512-byte records into a memory cache spread across multiple blades (10 is typical - each blade runs Windows Server 2003 64-bit. 8 gigs ram/blade; 8 cores/blade is typical) that perform trillions of comparisons over a span of several days.

If the primary server goes down (usually for maintenance), we have to reload those several hundred million 512-byte records into the cache.

With the new data server, it's almost painless. When customers used an RDBMS, they would do almost anything to avoid the reload, as expected.
anony
Tuesday, December 09, 2008
 
 
@anony

It sounds like a system I learned about for fingerprint matching?

Data was stored in RAM or on special cards, and a full reload would take a full day.... the card data could survive a reboot...
Francesco
Wednesday, December 10, 2008
 
 
Sweet.  See, there ARE applications where an RDBMS is too slow.
AllanL5
Wednesday, December 10, 2008
 
 
Nothing like an exception to prove a rule.
So tired
Wednesday, December 10, 2008
 
 
Unless your files are really, really big, OS caching will defeat the purpose of this optimization.  Here are a few pointers to speed up IO:

1) Try compression and decompression to reduce IO size and time.  Typically it is faster to read compressed data and expand it in memory than to read a huge chunk of uncompressed data.
2) When writing to files, don't commit on every single write; do the commit after all the writes are done, and you will make full use of OS caching.
3) Check if your FS is NTFS; if so, you can turn off your own logging, because NTFS is itself journaled. Otherwise you will be writing to disk four times for every single write (log + main file, done by you and again by the OS).
4) Write and read sequentially, using FILE_FLAG_SEQUENTIAL_SCAN.  Random reads and writes are slower than reading a big contiguous chunk.
5) Don't have more than about 1,000 files in a folder.  This is roughly the upper limit, after which file open slows down drastically.  Use sub-folders to speed up CreateFile.

Other than these, having many threads competing for HDD access will just make the seek head go here and there, slowing down reads.  Route all requests for a HDD through a single thread.  Keep your data drives separate from the Windows drive and page-file drive, otherwise your thread will compete with Windows.

Above all, experiment and test and time to come to your own conclusions; what I have found out may not apply to your situation.
dd
Sunday, December 14, 2008
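Tip (1) above is easy to demonstrate with the standard zlib module; the sample record format is made up, and real file-info data may compress differently.

```python
import zlib

# Highly repetitive data compresses well, so reading the compressed
# bytes and expanding in memory moves far fewer bytes off the disk.
data = b"path;size;mtime;attrs\n" * 10_000
packed = zlib.compress(data, 6)
print(f"raw {len(data)} bytes -> compressed {len(packed)} bytes")
assert zlib.decompress(packed) == data
```

The trade is CPU time for disk time, which only wins when the disk is the bottleneck, as measured earlier in the thread.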
 
 
Francesco, not fingerprint data, but you're warm. ;-) They tried Oracle and SQL Server and it was too slow, so they hired me to write the high-speed data store, and now they are the ONLY company on the market that can reload data in minutes as opposed to hours (which adds up to a day or more in some cases).
anony
Monday, December 15, 2008
 
 
"...having many threads competing for HDD access will just make the seek head go here and there slowing down reads."

This is not really true. NCQ and other technologies perform optimization and arbitration to move the head in the most efficient manner.

"Route all requests of a HDD through a single thread."

No need to. See above.

"Keep your data drives separate from the windows drives and page file drives otherwise your thread will compete with windows."

Agreed. It can help a lot.
anony
Monday, December 15, 2008
 
 
Dr Known
Friday, December 26, 2008
 
 

This topic is archived. No further replies will be accepted.
