The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Challenge--500K Search and Replaces on 120 MB text file

The subject says it all.  I've got a list of about 572,000 literal replacements I need to make in a 120 MB Unicode text file (I can do it in UTF-8 or any encoding that supports 2-byte characters).  I can break that list down into smaller bits, but obviously it's not practical if I end up with, say, 572 steps to run.  I'm now trying it with ten lists, each of about 60K S&R pairs.

I'm trying out a Python script (running for four days straight on an S&R list of 60,000 replaces with no end in sight) and PowerGREP (with a .pga containing 60K literal S&R pairs--it does about 12% per 24 hours of processing, meaning about 8 days for each 10% piece, or 80 days of total processing time).  There's got to be a better way.  Any suggestions?

I've spent weeks trawling the internet, trying different software options, and hunting for fast scripts.  Posting here and bothering my betters (which, in the only sense that matters to me right now, you clearly are) is a last resort, believe me.

I'm running Win XP SP2 on a Core2 Q6600 quad core CPU with 4GB RAM.

Thanks, in advance.
Peter Rivard Send private email
Tuesday, December 04, 2007
 
 
I'm really surprised it takes that long on a pretty good PC.

I guess it makes sense that a Python script would be slow, as this is pretty much outside of its comfort zone.

I would suggest (almost) getting down to the bare metal, and using C, but grep should be pretty much that.

How do the replacements work? I mean, how many occurrences of each replacement do you have? If it's one, the algorithm should be very different from when there are many.
Entries of Confusion Send private email
Tuesday, December 04, 2007
 
 
Have you come across Tim Bray's "wide finder" challenge?
It's at http://www.tbray.org/ongoing/When/200x/2007/10/30/WF-Results

It partially does what you want, though in several languages. Perhaps it could be expanded from there into a meaningful solution for you.
Joao Pedrosa Send private email
Tuesday, December 04, 2007
 
 
This sounds like the sort of problem that's most interesting if you don't really have to do it.

My initial thought would be to load the 120MB into RAM and process in place. This depends on the replace being the same size as the search, which it most probably won't be, which leads me onto thinking about constraints.

From your website I guess you're working in Japanese, which makes me think that avoiding UTF-8 might be a plan, as I suspect (totally without evidence) that the fancy string-searching algorithms might have trouble with a variable-length encoding. In any case, N characters will be faster than 2.5N, or whatever the Japanese char/UTF-8 ratio is.

If you can reorder the search list then you might be able to match a set of prefixes at once, or reorder the search list to make sure that runs of searches & replaces don't have overlapping characters, and build substring indexes into the target text that won't be invalidated by the replaces. However, I suspect that either you won't be able to reorder the list, or you will need to grow or shrink the text as you replace.

So assuming you can't reorder the searches, and the target will grow and shrink as replace size doesn't match the search size, I'd first investigate setting up a pair of data areas, and start with something like Boyer-Moore or KMP string searching, using 16-bit fixed-len chars. Load the text into the first data area, then search into that, and memcpy into the second as you either find something or hit the wall. This will allow you to grow the text, but because it's going to copy 120MB 500K times it's not going to be fast.

If this won't be fast enough (either because it's really slow, or you have to run this every week) then you would have to consider something with more moving parts. I'd think about chunking the text into sections, leaving enough room for the text to grow from replacements (hopefully the shrinkage will balance the growth), with a master list keeping track of the segments and their order.
Now you can run your fancy search on each segment (hopefully in parallel, not forgetting that matches could cross segment boundaries), and not have to copy 120MB each time. If the searches hit relatively few segments each time, it might be worth generating substring indexes for each segment, and only rebuilding them when a segment changes.

In any case, reading should start here: http://en.wikipedia.org/wiki/Category:Algorithms_on_strings
good luck
Jamie Anstice Send private email
Tuesday, December 04, 2007
 
 
Can you say anything about the 572,000 literal replacements: for example, are they non-overlapping?
Christopher Wells Send private email
Tuesday, December 04, 2007
 
 
Partition the problem--but not the 500K searches, the 120 MB text.

One reason that your job is so expensive is the size of the text. If needle and replacement have different sizes, then on average 60MB of text must be moved through memory for each replacement. The cost of that is huge (unless you use some fancy data structure).

Therefore, split the text into 12 parts of 10 MB. Then the average move is only 5 MB, which is 12 times better. Perhaps you should use even smaller fractions.

Also make sure that you search from the position after the last position where the needle was found. Otherwise searching gets more and more expensive near the end of the text.

Given the size of the problem, you might preprocess the text so that searching is simpler. Various things come to mind: conversion to upper case, removing all parts that are not touched anyway, etc.
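To illustrate the search-from-the-last-position point, here's a minimal Python sketch for a single needle (the function and variable names are made up for the example). It collects output pieces and joins them once at the end, so no text has to be shifted around for each individual hit:

def replace_one_needle(text, needle, replacement):
    # Collect untouched spans and replacements, then join once at the end.
    pieces = []
    pos = 0
    while True:
        hit = text.find(needle, pos)   # resume from the position after the last find
        if hit == -1:
            pieces.append(text[pos:])
            break
        pieces.append(text[pos:hit])
        pieces.append(replacement)
        pos = hit + len(needle)
    return ''.join(pieces)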
Karel Thönissen Send private email
Tuesday, December 04, 2007
 
 
> If needle and replacement have different sizes, then on average 60MB of text must be moved through memory for each replacement.

That'd be a naive implementation; it might be better to have two phases, one tokenizing and one writing: in the first phase, tokenize the input into a sequence of tokens, where each token either is or isn't equal to one of the needles; and in the second phase, write the tokens to the output file, replacing tokens-which-are-needles as they're written.

> 572,000 literal replacements

Imagine the implementation is like this: for each literal, remember whether the most recently-searched string fragment is or isn't a partial match.

class Literal
{
  string m_needle;
  string m_replace;
  int m_index; // how many leading characters of m_needle the recent input has matched
  public bool found(char inputChar)
  {
    if (m_needle[m_index] == inputChar)
    {
      if (++m_index == m_needle.size())
        return true;
      return false; // partial match so far, keep going
    }
    else
    {
      // mismatch: forget the partial match
      // (simplified: doesn't retry inputChar against the start of the needle)
      notFound();
      return false;
    }
  }
  public void notFound()
  {
    m_index = 0;
  }
}

Imagine we have 500K instances of the Literal class. Imagine the main loop looks like:

void main()
{
  while (!inputFile.eof())
  {
    process(inputFile.readChar());
  }
}

void process(char inputChar)
{
  foreach (Literal literal in allLiterals)
  {
    if (literal.found(inputChar))
    {
      //replace most recently-read chars
      Output.replaceWith(literal.m_replace);
      resetAllLiterals();
      return;
    }
  }
}

void resetAllLiterals()
{
  foreach (Literal literal in allLiterals)
  {
    literal.notFound();
  }
}

Order of magnitude estimate for this to run will be:

  1E8 (number of input characters)
x 5E7 (literals per input character)
x 1E2 clock cycles (cost of invoking the Literal.found method for each literal for each input character)
= 5E17 clock cycles total.

Which in real-time means:

  5E17 clock cycles
/ 1E9 clock cycles per second
/ 1E5 seconds per day
= 5E3 days (roughly 5,000 days).

I think this shows that searching for each distinct literal is too expensive ... doesn't it?

In which case, the only possible alternative that I see to searching for each distinct literal is to pre-combine the literals, for example, have a single parser instance which has millions of possible states; and for any <input state> plus <input character>, the result is an <output state> plus either the <original input character> or a <replacement string>.

The challenge then is to pre-combine the literals to create a tokenizer.
Christopher Wells Send private email
Tuesday, December 04, 2007
 
 
> 5E7 (literals per input character)

That's a mistake: it should be only 5E5, so the total is more like 50 days instead of 5,000 days. But even that is over-optimistic, because it assumes that everything operates at the speed of the CPU, whereas in fact the data won't all fit in the CPU cache and instead the system will be operating at (bound by) the speed of the RAM, which is slower.

FWIW I wish someone could show me an example of how to make a rule-of-thumb guesstimate based on the speed of RAM and the amount of RAM to be read from.
Christopher Wells Send private email
Tuesday, December 04, 2007
 
 
Christopher,

Tokenisation is a good idea anyway, if only to reduce the cache misses.
Karel Thönissen Send private email
Tuesday, December 04, 2007
 
 
More ways to avoid all the copying:

1. Find the first substring that matches any of the ones to be replaced. Copy the unmodified text up to that point, and then the replacement, to a new buffer. Repeat until you get through the whole thing.

This would work fine with any encoding, and you could probably persuade [f]lex to do the rapid simultaneous finding of strings to replace for you (there's a rough sketch of this idea at the end of this post).

2. Use more than 16-bits per character, I'd pick 32 to make it simple. Use the top two bits as flag bits. Top bit = to be replaced. Next bit = skip this.

You can then go through and find all the matches, and instead of inserting replacement text then and there just store in the first character the index of the replacement string (with the top bit set as a flag), and any subsequent characters are marked as to be skipped.

Then as a final pass over the data copy it to a new buffer and put the replacements in as you go.

This will work even if you don't search for all the strings simultaneously, or the search isn't linear.
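For what it's worth, a rough Python illustration of approach 1, using a combined regex alternation instead of flex (build_replacer and its arguments are just names for the sketch, and whether re copes gracefully with an alternation of 572,000 literals is another question). Longer needles are listed first so that overlapping keys prefer the longest match:

import re

def build_replacer(replacements):
    # replacements: dict mapping literal -> replacement string
    # Longer needles first, so "abcd" wins over "abc" when both could match.
    needles = sorted(replacements, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(n) for n in needles))
    # pattern.sub walks the text once, copying unmatched text through and
    # swapping in the replacement at each hit.
    return lambda text: pattern.sub(lambda m: replacements[m.group(0)], text)

Applied to the whole text in one call, that's a single pass, which is the same shape as the flex-style solution.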
Adam
Tuesday, December 04, 2007
 
 
> I wish someone could show me an example of how to make a rule-of-thumb guestimate based on the speed of RAM

Figures 3.12 and 3.32 of http://lwn.net/Articles/252125/ suggest 200 to 400 cycles per cache miss.

My estimate of "1E2 clock cycles (cost of invoking the Literal.found method for each literal for each input character)" is an over-estimate if the code is optimized and if each Literal instance can be accessed without a cache miss.

So, if a total expected time-to-run of some small number of days is acceptable, you might get that with a C-like implementation of the above pseudo-code where the memory structures used by the implementation are:

* a few MB to contain however many Literal instances will fit in cache.
* 120 MB of pre-loaded input data plus 120 MB of output data, in memory; you can use less memory (e.g. half the memory and twice the total number of disk reads and disk writes per run), but perhaps you should assume that any system I/O will flush the CPU's cache, so keep the total number of I/Os very small.

Run the whole operation several times: on each run, process however many Literal instances will all fit in cache simultaneously.
Christopher Wells Send private email
Tuesday, December 04, 2007
 
 
> if each Literal instance can be accessed without a cache miss

This requirement too can be relaxed: if it's true that a cache miss costs only a few hundred cycles, then it would be enough if *most* (e.g. more than 90% or 95%) of the accesses avoid a cache miss. So I'd suggest trying an implementation which:

* Implements the algorithm pseudo-coded above
* Is written in optimized C
* Runs once
* Locates all the Literal instances (and their inline data) one after the other in contiguous memory
* Ensures that the Input.readChar, Output.putChar, and Output.replaceString methods are mostly running in user-space RAM and minimize the number of system I/O calls.
Christopher Wells Send private email
Tuesday, December 04, 2007
 
 
Try this.

I haven't verified, but the idea is:

Get all searches of length 10 (or n, if we need to be general):

Cram the search keys into a dictionary (so they point at the replace value).

Create n iterators, each of which will search and replace chunks of length n. We use a different offset each time, so that we cover the whole string.

Link those iterators together, and fire away.


It looks like it would run in about an hour on my machine, but mine is just a little old macbook.

Code (please let the indents work):

import random
N=120000000
big_unicode_iter=(unicode(random.choice((0,1,2,3,4,5,6,7,8,9))) for i in xrange(N))

def s_r_iter(string_iter,replacements_dict,offset,n):
    # Takes an iterator of characters, skips `offset` characters, then checks
    # every following chunk of n characters against the replacements dict.
    for key in replacements_dict:
        assert len(key)==n
    # yield the top
    for i in range(offset):
        yield string_iter.next()
    # yield the replaced parts
    chunk_list=[]
    i=0
    for char in string_iter:
        chunk_list.append(char)
        i+=1
        if i==n:
            i=0
            chunk=''.join(chunk_list)
            chunk_list=[]
            if chunk in replacements_dict:
                print 'hit',chunk
                chunk = replacements_dict[chunk]
            for char in chunk:
                yield char
    # yield the tail
    for char in chunk_list:
        yield char


n=10
replacements_dict={}
for i in range(10000):
    key = u''.join([unicode(random.choice((0,1,2,3,4,5,6,7,8,9))) for i in range(n)])
    val = u''.join([unicode(random.choice((0,1,2,3,4,5,6,7,8,9))) for i in range(n)])
    replacements_dict[key]=val

result=big_unicode_iter
for offset in range(n):
    result=s_r_iter(result,replacements_dict,offset,n)

i=0
j=0
for c in result:
    i+=1
    j+=1
    if i==100000:
        print j
        i=0
PeterR
Tuesday, December 04, 2007
 
 
I've come to this party late, it seems.  Personally, I'd be very tempted to use Perl.

And in Perl I would load your 500K Search and Replace rules into a hash, keyed on the 'search' term.

Then pass through your 120 MByte text file once, for each 'token' checking the token against the 'hash' dictionary, and writing out the replacement string.

I've done similar things when I built a text file recording all the files on a hard disk -- typically this file was over 100 MBytes, and would have upwards of 20,000 directory entries and 130,000 file entries.  Doing an individual search for a file using 'grep' would take a while, and if I wanted to locate multiple file paths this became prohibitive without 'hashing' the keys.

I know the above approach does require a few assumptions.  One is that the text file is 'tokenizable' -- completely general regular expressions wouldn't work well.
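By way of illustration (in Python rather than Perl, assuming the file really is tokenizable, say on spaces; replace_tokens and table are invented names for the sketch), the per-token hash lookup amounts to:

def replace_tokens(text, table):
    # table: dict (hash) mapping search token -> replacement string;
    # tokens not in the table pass through unchanged.
    return ' '.join(table.get(token, token) for token in text.split(' '))

One dictionary probe per token, so it's a single pass over the file no matter how many rules there are.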
AllanL5
Tuesday, December 04, 2007
 
 
Yep. It takes about an hour. Of course, you're taking a 1000-times hit for using a 'slow' language like Python or Perl, but who cares when you have a hash table. Big problems need algorithms, not metal.
PeterR
Tuesday, December 04, 2007
 
 
+MAX_INT to AllanL5

This is the only way this is going to take anything less than forever to handle.

Only drawback would be if parsing out the tokens for replacement would be too burdensome.

If it's something like changing account numbers in a well defined file, then this is a perfect application for a dictionary.

And you have a framework to do this again.
frustrated
Tuesday, December 04, 2007
 
 
@PeterR,

Hardly 1000 times if you use Perl. Maybe x2 at worst, unless you can't do the token parsing with regular expressions.
frustrated
Tuesday, December 04, 2007
 
 
>>  start with something like Boyer-Moore or KMP string searching <<

Forgive me if I'm wrong, but my understanding is that Boyer-Moore makes the assumption that byte == char.

Is there a variant that works on Unicode strings?
(not sure if this is possible, due to the algorithm's "backwards" approach)
xampl Send private email
Tuesday, December 04, 2007
 
 
While Boyer Moore is terrific at finding a string, a Ternary Search Tree is a wicked fast way to match text against a dictionary of 500k choices.

http://www.javaworld.com/javaworld/jw-02-2001/jw-0216-ternary.html

Would you have to start down the tree at every character of the input - 120 million times - minus skipping over hits? Or can you tell where candidates begin and end, like whitespace or punctuation around them?
Stan James Send private email
Tuesday, December 04, 2007
 
 
Use flex.  It doesn't care whether the text is Unicode or not; you are just replacing byte patterns.  The flex program looks like this:

%%
"old1" {cout << "new1";}
"old2" {cout << "new2";}
...

.|\n {ECHO;}
%%

where old*/new* are your replacement strings (which might have escapes in them for the non-ASCII characters).

If there are overlaps (e.g. old1="abc" and old2="abcd") then you have to make sure the longer ones are first.

The flex build and compile with 500K rules could take an hour on your machine, but the actual run, after the code is compiled, will take at most a few minutes since it's a single linear pass over 120 MB.  You need a recent version of flex; older ones choked on large rule sets like this.
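Generating the flex source from the list is itself a small script. A rough Python sketch, assuming a file of old,new pairs (the file names are hypothetical, and real data would need escaping of any characters that are special in flex patterns or C string literals); it writes the rules longest-first and sends replacements to flex's yyout rather than using the cout shown above:

pairs = []
for line in open('pairs.csv'):
    if not line.strip():
        continue
    old, new = line.rstrip('\n').split(',', 1)
    pairs.append((old, new))
pairs.sort(key=lambda p: len(p[0]), reverse=True)   # longer needles first

rules = open('replace.l', 'w')
rules.write('%%\n')
for old, new in pairs:
    rules.write('"%s" {fputs("%s", yyout);}\n' % (old, new))
rules.write('.|\\n {ECHO;}\n')
rules.write('%%\n')
rules.close()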

Sometimes I cringe when I think of people using a scripting language for a task like this, while flex, which is both high-level (full regular expressions allowed) and compiled, is available.
Will Dowling
Tuesday, December 04, 2007
 
 
OK. Roughly, without any special theory, the algorithm:

1) Load 120 MB into memory
2) Load 500 K into memory
3) Search 120 MB against each "literal" to be replaced and generate/initialize
    record ("literal" position, "literal"/replacement index) for each find
4) Sort generated records (in list) by "literal" position
5) Write output file by writing original file until first pos. in the list,
    then write replacement string (by "literal"/replacement index),
    then skip "literal" length from the original file,
    then write original file until next pos. and so on...

I did some tests with a 130 MB file using Delphi 2007... On a 3GHz Pentium it should take less than 24 hours, and that's without using any additional tricks...
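A compact Python sketch of steps 4 and 5, assuming step 3 has already produced the (position, index) records (the names here are invented for the example):

def write_output(text, records, pairs):
    # records: list of (position, pair_index) tuples from the search pass
    # pairs:   list of ("literal", replacement) tuples
    pieces = []
    pos = 0
    for hit, idx in sorted(records):          # step 4: sort by position
        literal, replacement = pairs[idx]
        if hit < pos:
            continue                          # skip a hit swallowed by an earlier replacement
        pieces.append(text[pos:hit])          # step 5: copy original up to the hit,
        pieces.append(replacement)            # write the replacement,
        pos = hit + len(literal)              # then skip the literal's length
    pieces.append(text[pos:])
    return ''.join(pieces)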
shouldn't be that difficult
Tuesday, December 04, 2007
 
 
Boyer-Moore doesn't assume characters are bytes; it assumes characters are all the same length.  It'll work with UTF-16; it won't work with UTF-8.

A search dictionary is definitely the right solution to this problem. 

A hash table's an OK choice, but it becomes problematic if the number of distinct key lengths is large, because for each input position you need to do N lookups (where N is the number of distinct key lengths). 

A trie might work, though the fact that you're using Unicode makes me worry.  For a trie to be efficient, each node needs an array of pointers that you can look up the next character in, which is OK if you're working with letters only and the array can contain 26 elements, and quite a bit less OK if it needs to contain 2^16 elements.  (I'd bet that your implementation needs a much smaller space than this.)

A TST won't be as fast as a trie, but it'll be damn fast, and it's a lot less space-inefficient.  And unlike a trie you don't have to come up with a fast method for mapping characters onto array positions.
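To make the hash-table trade-off concrete, here's a sketch (in Python, with made-up names) that does one dictionary probe per distinct key length at each input position, trying the longest length first so the longest key wins when several could match:

def replace_with_dict(text, table):
    # table: dict mapping literal -> replacement; keys may have different lengths
    lengths = sorted(set(len(k) for k in table), reverse=True)
    pieces = []
    i = 0
    while i < len(text):
        for length in lengths:
            candidate = text[i:i + length]
            if candidate in table:
                pieces.append(table[candidate])
                i += length
                break
        else:
            pieces.append(text[i])   # no key starts here; copy the character through
            i += 1
    return ''.join(pieces)

The cost per input character is one slice plus one probe per distinct key length, which is exactly why a large spread of key lengths hurts.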
Robert Rossney Send private email
Tuesday, December 04, 2007
 
 
Oh, I neglected to mention that whatever type of dictionary you use (hash table, trie, TST, btree, whatever) you need to deal with disambiguating multiple hits.  If you're replacing A with BC and AB with DE, what do you do if the current input character is A and the next one is B?  That's not a problem if all your keys are the same length, of course.
Robert Rossney Send private email
Tuesday, December 04, 2007
 
 
... forgot to mention that it doesn't matter what format (Unicode, 1-byte, 4-byte, etc.)- searches and replacements are performed at binary level (byte by byte).
shouldn't be that difficult
Tuesday, December 04, 2007
 
 
> Forgive me if I'm wrong, but my understanding is that
> Boyer-Moore makes the assumption that byte == char.

I can't see it caring about the width of the char, as long as the width is fixed. Things might be a little trickier if a character is composed of more than one code point - assuming the needle and haystack are decomposed in the same manner, you'd have to check that the character following the match isn't a combining character.
Jamie Send private email
Tuesday, December 04, 2007
 
 
It seems to me like the first obvious step would be to pre-process the literal replacements somehow. For example, if a subset of the replacements is:
    steve -> Steve
    joe -> Joe
    john -> John
or whatever, you could group together the ones where the literal starts with "j", or the ones that start with "jo", and so on. And then only test the individuals if the input matches the group.

Or am I not getting the problem here? It doesn't seem like 572K replacements could be so unique that a single comparison can only eliminate a single replacement...
andy
Wednesday, December 05, 2007
 
 
I would isolate an unusual small substring of each of the 500K search strings and search for that.

Let's say we search for the uncommon substring "xawyr" in the text:

When the substring "xawyr" is found in the text,
check whether the larger part, substring(text, start, length), equals the full search string. If yes,
change it and continue.

SergeD
Serge Send private email
Wednesday, December 05, 2007
 
 
Wow--A lot of helpful pointers and things to think about.  Thank you all.  I'm going to take time to digest all your suggestions, probably starting with Stan James' TST solution and Will Dowling's flex suggestions.

Since some people asked about the problem, I've got a very large Japanese-English dictionary database (Eijiro).  Japanese is written using three different character sets--at the same time.  Two of them are alphabetical, and the third is the pictographs you can probably all picture. The dictionary was made for Japanese people, and I want to make it useful for non-Japanese.  Wherever there's a word containing pictographs, I want to add its pronunciation in brackets next to it.  The number one problem for Japanese learners is that there isn't a set pronunciation for each pictograph--each one can have anywhere from 2 to more than 20 different pronunciations, and there's no rule to help you figure out which pronunciation to use in any particular situation.  Even if you memorize all 6355 pictographic characters used in the database, a given two-character word might have 30 plausible pronunciations.  Another difficulty for this project is that Japanese doesn't put spaces between the words--there's no 100% reliable way to see where one word ends and the next begins. 

I've also got a file, painstakingly produced over several months, of 572,000 words containing pictographs and their pronunciations.  Right now, each entry looks like "word,word{pronunciation}" in a csv file to be run with the Python script by Carol Albright at http://www.visi.com/~chaugan/carol/Search_and_Replace_Python_Script.html .  I'm not asking for fixes to the script, as previous posters have shown me that's not the way to go--I'm just putting it in as a reference for anyone who wants to see it.  I did make a mistake in my first post.  Running this with 60K word,word{pronunciation} entries (instead of the whole 572K) took under a day of processing time.  But I'll have to do two different iterations on two different 120MB files, so the total time is still approaching 40 days of processing time.  Ideally, I'd like to do this every time a new edition of Eijiro is released, about every 4-6 weeks.

I can set these files up any which way, reorder them, change the encoding, break them into pieces, etc. (Thank you, PowerGREP). Currently I've got them in Unicode, with the word{pronunciation} file in reverse order by word length (longest words first). Each word in that file can occur in the main database anywhere from 1 to 5000 times. Average is probably about 15-20.

Again, thanks everyone.
Peter Rivard Send private email
Thursday, December 06, 2007
 
 
Peter,

One possibility might be to use the method you have, which is slow, but run off a big ramdisk. Cutting down the disk I/O would speed up the processing quite a bit.

This supposes you have a computer with several GB of RAM.

Since this is a one time thing, it might be easier than casting around for a faster method.

Except, of course, that almost nothing is a one time thing. :(
frustrated
Thursday, December 06, 2007
 
 
... and Peter,

Could you share how much total time (additional research, trials, configuration, runs, etc.) it took to accomplish that conversion, and which method you chose?
shouldn't be that difficult
Thursday, December 06, 2007
 
 
On a lighter note, if after these suggestions you still can't find a good solution, call this guy:

http://xkcd.com/208/
rubinelli
Thursday, December 06, 2007
 
 
Hi,

You might like to look at some of the Unix tools which were designed for this kind of stuff: sed and awk

You can get these by installing cygwin for Windows.

I have personally had good results with Perl in this kind of situation. Text processing is really Perl's forte. You need to steer clear of simply re-implementing a giant for loop though.

Not saying that you have to, just a few more tools for your arsenal.

-Andrew
Andrew Send private email
Monday, December 10, 2007
 
 
My vote is to put the search strings into a packed trie indexing the replacement strings and do it in one pass. Traverse the trie and the main text simultaneously. On a trie hit, emit the replacement. On a trie miss, emit unchanged text traversed since the last time the trie was restarted. In both cases start at the beginning of the trie again.

The problem of a Unicode trie having too many branches at any node can be solved in various ways - hashing, for example. Maybe it's not a trie any more if you do that, but the principle is the same.
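A rough Python sketch of that single pass (names invented), using nested dicts as the trie nodes, which is effectively the hashing variant just mentioned. It takes the longest match at each position, and on a miss it restarts only one character later, so matches that begin inside the text just traversed aren't lost:

END = object()   # marks a node that completes one of the search strings

def build_trie(table):
    # table: dict mapping search string -> replacement
    root = {}
    for literal, replacement in table.items():
        node = root
        for ch in literal:
            node = node.setdefault(ch, {})
        node[END] = replacement
    return root

def replace_with_trie(text, root):
    out = []
    i = 0
    while i < len(text):
        node = root
        j = i
        match = None                  # (end position, replacement) of longest hit so far
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if END in node:
                match = (j, node[END])
        if match:                     # trie hit: emit the replacement, restart after it
            out.append(match[1])
            i = match[0]
        else:                         # trie miss: emit one unchanged character, restart
            out.append(text[i])
            i += 1
    return ''.join(out)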
Graham Asher Send private email
Tuesday, December 11, 2007
 
 
I really do appreciate all the help, but I've a confession to make--I'm not a pro, by any means, and I find that I'm a few years of intense study and practice away from being in a position to begin to understand some of these answers.  I should have made that clear.  I'm diving into the flex stuff, which itself looks like a monthlong learning curve (starting with learning how to compile a program).  Learning is good, so I'm not complaining. 

shouldn't, I'm not sure what you were asking about.  How long to convert the format of the 572K item list once I had it?  That took a couple of minutes total with PowerGREP--simple text editing and reformatting is no problem, and I know my way around Regex, though probably not as well as most of you.  Folding the list into the macro I linked to also took only a few minutes.
Peter Rivard Send private email
Thursday, December 13, 2007
 
 
I just cracked open the book Foundations of F#: http://www.apress.com/book/view/1590597575.

In the first couple of pages, the author describes solving a very similar problem and claims it was WAY easier using functional programming (FP). His language of choice is F#. I think lisp is another popular choice.

Maybe you could go to the bookstore and crack that book open ... see if it speaks to you ;-)

Good luck!
Patrick Foley Send private email
Friday, December 14, 2007
 
 
What about splitting the file and doing parallel processing on several machines? If you could borrow 5 or 10 PCs you would finish in 20% or 10% of the time.

You could ask for volunteers and do this over the net.
I would do it Send private email
Monday, December 17, 2007
 
 
As far as I can see, the Python script works roughly like this:
for each entry in the change table, it goes over the whole text and performs the replace.
That lets it work when the result of an earlier replace should affect a later one, but in your case that isn't needed.
Hence the good news: your task could be completed about 100,000 times faster on the same computer. All you need is another script :)
IMil
Thursday, December 20, 2007
 
 

This topic is archived. No further replies will be accepted.
