The Joel on Software Discussion Group (CLOSED)

A place to discuss Joel on Software. Now closed.

Reading large files in Perl (not loading into memory)

I'm trying to write a program in Perl to analyze a flat file database. This flat file database could be quite large (a couple of gigabytes or more), and our systems don't have enough memory to load the whole thing at once.

Luckily, I can linearly traverse this file, one line at a time. So, really, I wouldn't need to load anything into memory except for one line. Buuuuttt, I can't seem to figure out how to do this.

Does anyone with Perl experience have a link or information on a module or something? I've tried searching, but most Perl file examples either read the whole file into an array (not possible here) or use DB_File... which I'm trying right now on a 100 MB test database, and 25 minutes into the run it has yet to start processing. :-/

Any ideas/suggestions? Note: I know very little about Perl's file capabilities. Even a link with a little explanation would be great.

Monday, July 11, 2005
OK, simple pseudocode:

open database

while (read line is successful) {
  process line
}

close database

The only problem I can see is that you don't have a "read line" interface to the database.
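
If the file turns out not to be line-oriented, you can still stream it in fixed-size chunks. A minimal sketch, assuming (hypothetically) 512-byte records and a made-up file name:

open my $fh, '<:raw', 'database.dat' or die "Cannot open: $!";

my $record;
while (read($fh, $record, 512)) {  # read one 512-byte record per pass
  # process $record here
}

close $fh;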

Care to explain more about the problem?
Ian Boys
Monday, July 11, 2005
So I guess I didn't catch this until just now. Am I right about the following:

If you just iterate over the file handle, it only loads one line? That is, if you read from the file handle into a scalar, it returns just one line, like so:

$line = <FILE>;

Whereas, if you read from the file handle into an array, it loads the entire file, like so:

@lines = <FILE>;

Is that correct?

Monday, July 11, 2005
Well, considering I just got it working, reallllly fast, I'd say, yeah, that was about right.

Aah, I love having the "aha" moments. :) Thanks for your help.
Monday, July 11, 2005
Yes, you are correct. The following code snippet is one of the most fundamental idioms in Perl:

open FH, "database.dat" or die "Cannot open: $!";

while (my $line = <FH>) {
  # do something with $line
}

close FH;

It iterates over the file line by line and processes each line read.
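
The same idiom in the more modern spelling, with a lexical filehandle and three-argument open (a minimal sketch):

open my $fh, '<', 'database.dat' or die "Cannot open: $!";

while (my $line = <$fh>) {
  chomp $line;  # drop the trailing newline
  # do something with $line
}

close $fh;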
Ian Boys
Monday, July 11, 2005
Better Yet:

while (<>) {
  # Do something with $_
  print $_;
}

This will shift the file names you pass on the command line off @ARGV, open each one in turn, and read from it a line at a time, populating $_ (the default variable) with each pass.

If you don't want the trailing newline, don't forget to chomp()!
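
Save that as, say, process.pl (a made-up name) and run it over the database file directly:

perl process.pl database.dat

The diamond operator falls back to STDIN when no file names are given, so piping the file into the script works too.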

Hey, on a related note, check out my one-hour introduction to Perl for programmers:

Matthew Heusser
Monday, July 11, 2005
Ugh. That's not "better yet". That's "worse yet".

As much as I like banging out quick scripts in perl, I *hate* perl's implicit variables.
BenjiSmith
Tuesday, July 12, 2005
I have to agree with "worse yet". After using Perl for 8 years, I still think that the implicit variables are a bad idea.
Tuesday, July 12, 2005
Here's one that's slightly different from (and, in my opinion, superior to) the example above:

foreach my $line (<FH>) {
  # Do something with $line
}

Same basic idea, but I just like this idiom better.

Of course the "while" example above is slightly more efficient (not that you'll notice).
BenjiSmith
Tuesday, July 12, 2005
Aaargh! No!

The "foreach" example is bad. Don't do that unless you know what you are doing. (Remember that half the people posting here don't know what they are doing--are you one of them, dear reader?)

In the OP, the file was described as about 2 GB in size.

The foreach construct will try to read the whole file into memory before processing it. Do you have a spare 2 GB of RAM available for that?

Use the while construct. It's the Right Way To Do It.
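
The reason is Perl's context rules: in the foreach, <FH> is evaluated in list context, so every line is read before the loop body runs even once; in the while, it is evaluated in scalar context, one line per pass. In short:

my @all = <FH>;  # list context: the whole file in memory
my $one = <FH>;  # scalar context: a single line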
Ian Boys
Tuesday, July 12, 2005
Actually the more correct while loop is

while ( defined( $input = <FH> ) ) {
  # do something
}

Without the "defined" function, a line containing a single zero or a blank line will evaluate to false and cause the plain while loop to terminate early.

See page 18 of "Effective Perl Programming" by Joseph N. Hall and Randal L. Schwartz for the topic "Item 5: Remember that 0 and "" are false.". I recommend that book.
Wednesday, July 13, 2005
I'm sorry, but you are incorrect about the while loop, though you are correct that 0 and "" are false.

The loop:

while ( defined( $input = <FH> ) ) {
  # do something
}

is identical to the loop:

while ( $input = <FH> ) {
  # do something
}

in every respect.

You can refer to the perlop manpage for confirmation, and then try it for yourself if you don't believe it. (It's not magic; there is a simple and obvious reason why.)
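
One quick way to see it is to ask B::Deparse what Perl actually compiles the loop into (a sketch; exact output varies a little between Perl versions):

perl -MO=Deparse -e 'while ($input = <FH>) { print $input }'

which prints something like:

while (defined($input = <FH>)) {
    print $input;
}

Perl inserts the defined() test implicitly whenever the while condition is nothing but an assignment from a readline.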

It seems unlikely the book would be in error with Randal Schwartz being one of the authors; perhaps you have misunderstood what it said?
Ian Boys
Wednesday, July 13, 2005
Look into Tie::File.

From the description:

"The file is not loaded into memory, so this will work even for gigantic files."

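Usage is about this simple (a sketch; the file name is made up):

use Tie::File;

tie my @lines, 'Tie::File', 'database.dat'
    or die "Cannot tie file: $!";

for my $line (@lines) {
  # each element is fetched from disk on demand
}

untie @lines;

The array is just a window onto the file; lines are read (and cached in a bounded buffer) only as you touch them.
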
Elliot
Thursday, July 14, 2005

May I recommend Ruby? It can be easier to learn than Perl.

In Ruby, processing a large file one line at a time looks like this:

IO.foreach("testfile") {|x| print "GOT ", x }
Ged Byrne
Friday, July 15, 2005
No need for Ruby -- use Tie::File in Perl and it's a snap.

Yehudah
Monday, July 18, 2005

This topic is archived. No further replies will be accepted.
