The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Search and Replace For Arbitrary Needles

I've got an interesting Computer Sciencey-sounding problem---a bunch of individual pieces of data for which I believe there might be a non-trivial substring that is embedded in every one of them. I'd like to search-and-replace on them and insert a token instead, but I don't know what I'm searching for up front! Example:

1:  sdfkl90xzc##3x#fsdfsdfsd
2:  fsd##3x#cmv1312
3:  fsdkfsdklfjsd##3x#

All three of these contain the substring ##3x#, which is the largest common substring. So I'd want to get as a result

1:  sdfkl90xzc<token>fsdfsdfsd
2:  fsd<token>xcmv1312
3:  fsdkfsdklfjsd<token>

Again, I don't *know* that ##3x# is the common substring in this case, just that there either definetely is or is not a common substring at least X letters long.
How do I do this without just dumb brute force?
Robby Slaughter Send private email
Thursday, March 01, 2007
Hey... and then you could take those tokens and represent them using a Huffman tree!!!

Oh wait, that's gzip.

Look up the source code of gzip to find a good algorithm.  Or better yet, just use a library and zip the data.
JSmith Send private email
Thursday, March 01, 2007
I know how to build a Huffman tree, but I don't see how you intend to use it to solve the original poster's problem. Could you expand you answer a little bit?
Mark Jeffcoat Send private email
Thursday, March 01, 2007
Use dynamic programming?

Google for 'longest common subsequence';

Thursday, March 01, 2007
Is it possible to construct regular expressions dynamically and use them?

Say, you have 100 different strings in which you need to look for a substring. You can parse 20 of them and construct a regular expression which will look for a pattern nearer to the one you want. We can use the regex for the subsequent strings and have a variable which will keep track of the number of matches. The loop variable which will have the index of the whole lot of strings will also be available. When certain numbers of the index like 40 or 50 or 60 are reached, the other count can be checked and if it doesn't increase, the loop can be stopped and the process continued again.

I may be totally off here as I've no idea if I've understood your problem correctly and if this makes sense.
Senthilnathan N.S. Send private email
Thursday, March 01, 2007

The deflate compression algorithm used in gzip works by finding repeating patterns and storing them in a Huffman tree.  The difference between a good compression utility and a bad one is how efficiently they can find tokens.  So, well respected compression tools like gzip are a great source of code to find repeating blocks in an arbitrarily long block of data.

Further, I'm suggesting that the original poster is essentailly trying to compress data.  Unless this is an acedemic exercise to learn how it is done, then it will be much easier to simply run the data through an existing compression algorithm instead of writing a new one.
JSmith Send private email
Thursday, March 01, 2007
> How do I do this without just dumb brute force?

+1 to look it up in a reference and/or use an existing implementation.

But, to treat to as an interesting interview question to be answered from scratch without a reference: what assumptions can you make about the data?

* Do you need to find the 'best' solution, or is 'a good' solution good enough?

* Is it true that you're trying to optimize for processing time (i.e. that the solution should be as compute-efficient as possible)?

* Do you have more RAM than you have strings? How much more, exactly (what's the max number of strings, the max length of each string, and the minimum amount of available working memory)?

* Do you know anything about the distribution of letters within each string? Are the letters printable? Are they random? Is there anything that's semantically equivalent to a whitespace or word-separation that can help to tokenize each string?
Christopher Wells Send private email
Friday, March 02, 2007
Without having researched further into the topic, my out-of-thin-air-approach would be to build a list of hash values over any X-letter-substring for the strings, sort the hash/index list for any string by ascending hash value, and then compare the substrings for which the same hash value occurs in all strings, with additional care for collisions.

If you know X it will be fine, if you don't know X you can do this multiple times for all X, with the largest possible X first.

At least this should be better than brute force over the strings themselves. It requires additional memory, of course.
Friday, March 02, 2007
I'm thinking along the lines of ...

//location of letter in string
//is pointer-to-string plus offset within string
typedef pair<string*,string::size_type> Location;

//list of locations
typedef list<Location> Locations;

//locations within strings of each character
map<char,Locations> Map;
Map firstMap;

//pre-process strings to build a map of character locations
for each string
    for each character
        add character location to firstMap

//look for runs
foreach (Locations locations in firstMap)
    calculate_longest_run(locations, 1);

int calculate_longest_run(Locations locations, int depth)
    //recursion ends when no map element contains more than one Location (i.e when all locations are unique)
    if (locations.size() == 1)
        //there's only one instance of this so it can't match any other
        return depth;
    //first letter of all Locations is identical
    //sort them according to their second letter
    Map secondMap;
    foreach (Location location in Locations)
        char secondChar = *(++location);
        secondMap[secondChar] = location;
    //recurse to find the 3rd letter.
    int rc = 0;
    foreach (Locations secondLocations in secondMap)
        rc = max(rc,calculate_longest_run(secondLocations, depth+1));
    return rc;

I suspect it would improve performance (speed and memory) if the above calculate_longest_run were modified to be greedier: if it started by trying to build a map of double-character or quad-character look-aheads, and retry with a reduced look-ahead iff the greedier attempt found no matches (and I'm not sure what the heuristic might be to guess at how much look-ahead to attempt).
Christopher Wells Send private email
Friday, March 02, 2007
If it were greedy then it wouldn't have to be recursive.

The amount of look-ahead could be a binary search, e.g.:

* Try 1 (match)
* Try 2 (match)
* Try 4 (match)
* Try 8 (match)
* Try 16 (no match)
* Try 12 (etc)

Another optimization would be to pass-in the minium amount of look-ahead (e.g. if a previous calculate for a different initial character return "10", then we could pass "10" to all subsequent calculate_longest_run to say that if they can't find at least 10 then we're not interested).
Christopher Wells Send private email
Friday, March 02, 2007
Lots of great ideas here!

The anonymous poster who came up with the phrase "longest common substring" seems to be exactly the term I needed to Google. I just couldn't think of  way to say it.

But, I also think JSmith is right in that this is essentially a compression problem. (Though really it isn't. It's actually data mining.) I guess if I concatenated all of the pieces of data into a single "file", I might be able to find a piece of widely-available compression code to look for the largest repeated token. But then I would lose the original boundary points between data pieces.

Christopher Wells has some excellent questions. It's a real-world problem. A mediocre solution is fine, but a brute-force one will take until the heat death of the universe. I only need to do this process once.

The data pieces are between 1kb-50kb in length. I think the common substrings are no larger than 10kb. I probably have 50 pieces of data in a set. (Though I have thousands of sets to run this against). The data is HTML but the substring could be anything.

The code sample you provided doesn't feel like it would scale well to these numbers, even if I could be greedy with the longest run hunt. Am I missing something?

I like Secure's suggestion though. Say I built a table that had a list of all of the data pieces, every subset of a decent size, and the hash of each one. That would be a BIG table but databases are good at that sort of thing. Then I could sort by common hashes and largest size. I'd want to do some checking in case the hash was inaccurate.
As a bonus, I can select my hash method to correspond best to my data, since probably an extra space or two in the HTML shouldn't be significant.

The data might be large enough that using Christopher Wells' suggestion for a "binary search" could be helpful. I'd have to think about the queries some more to be sure.

Thanks all!
Robby Slaughter Send private email
Friday, March 02, 2007
Top of my head thoughts:
1. Longer common substrings must contain many shorter common substrings. Look for all common substrings of some minimal length. Adjacent ones can then be combined into longer strings.

2. To hash substrings, use a hash function that allows you to get the hash of the substring starting at n+1 from that starting at n with a few ops.e.g if hash of N items is
for(i = 0, hash = 0; i < n; i++) hash = hash*5 + string[i];
hash of next is 5*oldhash + newchar - (5^N)*char_going_out

3. scan first string storing all hashes with offset and string num. scan next string, only increment counts of hashes that match etc. When you have done all strings, go thru all hash matches to check for true matches.
Friday, March 02, 2007
> The data pieces are between 1kb-50kb in length. I probably have 50 pieces of data in a set.

> Am I missing something?

I don't know; try this: you have 2500KB per set.

Let's assume that the time it takes to add an element to some type of map is about 100 cycles, i.e. about 30 nsec (assume that the time to allocate heap for each new entry is negligeable because you're using custom allocators).

Creating the firstMap would therefore take (30 nsec * 2500K =) 0.1 seconds.

Creating all of the secondMaps instances also involves processing a total of 2500K characters, so it too might take 0.1 seconds.

If the search terminates after 10000 generations (your estimate) then the total search time (at 0.1 seconds per generation) will have been 1000 seconds (worst case). I'd expect the better-than-worst case to be *much* better than this: because in practice the time per generation should fall rapidly with successive generations, as you begin to exclude failures (i.e. unique, non-common strings) from subsequent generations.

Anyway it seems to me to be potentially a lot faster than "heat death of the universe".

A worry is RAM: with 2500KB per set and 2.5 GB RAM you can afford a processing overhead of 1000-to-1: even with a most memory-efficient style of map, 1000 generations of map would completely blow the RAM budget ... so my first version might try to count on having a better-than-worst case (in which the number of candidates falls with successive generations).

> That would be a BIG table but databases are good at that sort of thing.

I wouldn't! Here's my list of types of computer instruction, arranged from fastest to slowest:

1) CPU cache access
2) RAM access
3) Heap allocation
4) Kernel I/O
5) Disk I/O
6) Video I/O
7) Network I/O
8) Serial port I/O
9) Carrier pigeon I/O
10) Database I/O

[Just kidding about 8 and 9.]
Christopher Wells Send private email
Friday, March 02, 2007
>  like Secure's suggestion though. Say I built a table that had a list of all of the data pieces, every subset of a decent size, and the hash of each one.

Let's say for example that you have 2 strings, where each is 10000 bytes, and X is 100.

=> Each string contains 9900 substrings of length 100.

=> Calculating the hash of each substring then requires you to process (2 * 9900 * 100 =) 1,980,000 characters.

I'm not sure but maybe my idea is better, because it requires you process each character in the set no more than approximately once.
Christopher Wells Send private email
Saturday, March 03, 2007

you can optimize the hell out of my approach, as "expr" suggested, say:

- Use a rolling hash, thus any character is touched only once while building the hash map.

- Sort the strings by length.

- Build the initial hash map with the shortest string.

- Compare the second string against the map (hash value AND string). Mark anything found as "Found".

- Delete anything from the map that was not found.

- Repeat with all strings until either the map is empty or contains the final result(s).
Saturday, March 03, 2007
From the wikipedia page, it looks like you should be googling for a library implementation of a Generalized Suffix Tree.

Apparently this approach would be linear in the total length of all the strings you're comparing. I doubt you'll do any better than that.
Sunday, March 04, 2007
Looks like making a diff. Look for the Leveshtein-algorithm.
Karel Thönissen Send private email
Sunday, March 04, 2007
Without knowing more about what you're looking to accomplish, I'm going to recommend BLAST and Smith Waterman algorithms.
Monday, March 05, 2007

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
Powered by FogBugz