The Joel on Software Discussion Group (CLOSED)

A place to discuss Joel on Software. Now closed.


Reading large files in Perl (not loading into memory)

I'm trying to write a program in Perl to analyze a flat-file database. This file could be quite large (over a couple of gigabytes), and our systems don't have enough memory to read it in completely.

Luckily, I can linearly traverse this file, one line at a time. So, really, I wouldn't need to load anything into memory except for one line. Buuuuttt, I can't seem to figure out how to do this.

Does anyone with Perl experience have a link or information on a module or something? I've tried searching, but most Perl file examples either read the whole thing into an array (not possible) or use DB_File, which I'm trying right now on a 100 MB test database; 25 minutes into the run, it has yet to start processing. :-/

Any ideas/suggestions? Note: I know very little about Perl's file capabilities. Even a link with a little explanation would be great.

Thanks.
Anon
Monday, July 11, 2005
 
 
OK, simple pseudocode:

open database

while (read line is successful) {
  process line
}

close database

The only problem I can see is that you don't have a "read line" interface to the database.

Care to explain more about the problem?
Ian Boys Send private email
Monday, July 11, 2005
 
 
So I guess I didn't catch this until just now. Am I right about the following:


If you just iterate over the file handle, it only loads one line? That is, if you read from the file handle into a scalar, it returns just one line, like so:

$line = <FILE>;


Whereas, if you read the file handle into an array, it loads the entire file, like so:

@lines = <FILE>;


Is that correct?

Thanks.
Anon
Monday, July 11, 2005
 
 
Well, considering I just got it working, reallllly fast, I'd say, yeah, that was about right.


Aah, I love having the "aha" moments. :) Thanks for your help.
Anon
Monday, July 11, 2005
 
 
Yes, you are correct. The following code snippet is one of the most fundamental idioms in Perl:

open FH, "database.dat" or die;

while (my $line = <FH>) {
  # do something with $line
}

close FH;

It iterates over the file line by line and processes each line read.
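
For reference, here is the same idiom in the more modern style -- a lexical filehandle and a three-argument open -- as a minimal sketch, with the file name as a placeholder:

open my $fh, '<', 'database.dat'
    or die "can't open database.dat: $!";

while (my $line = <$fh>) {
    chomp $line;              # drop the trailing newline
    # do something with $line
}

close $fh;

Either way, only the current line is held in memory at a time, so the size of the file doesn't matter.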
Ian Boys Send private email
Monday, July 11, 2005
 
 
Better Yet:

while(<>) {
  #Do something with $_
  print $_;
}

  This will shift the file names you pass on the command line off @ARGV, open each in turn, and read from it a line at a time, populating $_ (the default variable) on each pass.

  If you don't want the trailing newline, don't forget to chomp()!
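
  For instance, a minimal sketch (with no argument, chomp trims $_ in place):

while (<>) {
  chomp;                # strip the trailing newline from $_
  print "got: $_\n";    # put the newline back explicitly
}

  The one-step form chomp(my $line = <STDIN>); also works for reading a single trimmed line.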

  Hey, on a related note, check out my one-hour introduction to Perl for programmers:

http://www.xndev.com/Speaking/PerlIntro01.ppt

Regards,
Matthew Heusser Send private email
Monday, July 11, 2005
 
 
Ugh. That's not "better yet". That's "worse yet".

As much as I like banging out quick scripts in perl, I *hate* perl's implicit variables.
BenjiSmith Send private email
Tuesday, July 12, 2005
 
 
I have to agree with "worse yet". After using perl for 8 years, I still think that the implicit variables are a bad idea.
nobody
Tuesday, July 12, 2005
 
 
Here's one that's slightly different from (and, in my opinion, superior to) the example above:

foreach my $line (<FH>) {
  # Do something with $line
}

Same basic idea, but I just like this idiom better.

Of course the "while" example above is slightly more efficient (not that you'll notice).
BenjiSmith Send private email
Tuesday, July 12, 2005
 
 
Aaargh! No!

The "foreach" example is bad. Don't do that unless you know what you are doing. (Remember that half the people posting here don't know what they are doing--are you one of them, dear reader?)

In the OP, the file was described as about 2 GB in size.

The foreach construct will try to read the whole file into memory before processing it. Do you have a spare 2 GB of RAM available for that?

Use the while construct. It's the Right Way To Do It.
Ian Boys Send private email
Tuesday, July 12, 2005
 
 
Actually the more correct while loop is

while ( defined( $input = <FH> ) ) {
  # do something
}

without the "defined" function a line containing a single zero or a blank line will evaluate to 'false' and cause the other while loop to terminate.

See page 18 of "Effective Perl Programming" by Joseph N. Hall and Randal L. Schwartz for the topic "Item 5: Remember that 0 and "" are false.". I recommend that book.
empty
Wednesday, July 13, 2005
 
 
I'm sorry, but you are incorrect about the while loop, though you are correct that 0 and "" are false.

The loop:

while ( defined( $input = <FH>) ) {
  # do something
}

is identical to the loop:

while ( $input = <FH> ) {
  # do something
}

in every respect.

You can refer to the perlop manpage for confirmation, and then try it for yourself if you don't believe it. (It's not magic; there is a simple and obvious reason why.)
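
One quick way to see it for yourself is B::Deparse, which has shipped with perl since 5.005. A sketch of the one-liner and its approximate output:

perl -MO=Deparse -e 'while ($input = <FH>) { print $input }'

while (defined($input = <FH>)) {
    print $input;
}

Perl quietly wraps the condition in defined() whenever a while condition consists of nothing but an assignment from a readline into a scalar, which is why the two loops compile to exactly the same thing.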

It seems unlikely the book would be in error with Randal Schwartz being one of the authors; perhaps you have misunderstood what it said?
Ian Boys Send private email
Wednesday, July 13, 2005
 
 
Look into Tie::File.

From the description:

"The file is not loaded into memory, so this will work even for gigantic files."

-Elliot
Elliot Send private email
Thursday, July 14, 2005
 
 
Anon,

Can I recommend Ruby? It can be easier to learn than Perl.

In Ruby, processing a large file one line at a time looks like this:

IO.foreach("testfile") {|x| print "GOT ", x }
Ged Byrne Send private email
Friday, July 15, 2005
 
 
No need for Ruby -- use Tie::File in Perl and it's a snap.

-Yehudah
Yehudah Send private email
Monday, July 18, 2005
 
 

This topic is archived. No further replies will be accepted.
