The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

fast compares

I am looking for an algorithm that will do fast compares.

Imagine I have a very large list of strings, say 5000. I want to see if I have any duplicates in the list. Assume you start with an unsorted list. You don't have to assume the type of the container holding the strings; it can be anything you want, although it is preferable if you stick with common container types.

What is the fastest way to accomplish this?
newbie
Tuesday, August 16, 2005
 
 
KMP string search may be helpful.
comp.lang.c refugee
Tuesday, August 16, 2005
 
 
I don't have to assume the type of container?  How about a C++ std::set?  Bang!  No duplicates.  :)

I'm not convinced KMP will help here (though it might).  I'd be tempted to hash the strings; your comparisons are going to be faster.
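
For illustration, a rough C++ sketch of the std::set idea (untested; assumes the strings live in a std::vector<std::string>):

  #include <set>
  #include <string>
  #include <vector>

  // Returns true if 'strings' contains at least one duplicate.
  bool hasDuplicate(const std::vector<std::string>& strings)
  {
      std::set<std::string> seen;
      for (std::vector<std::string>::const_iterator it = strings.begin();
           it != strings.end(); ++it) {
          // insert().second is false when the string was already present
          if (!seen.insert(*it).second)
              return true;
      }
      return false;
  }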
Chris in Edmonton
Tuesday, August 16, 2005
 
 
I also suspect (but cannot mathematically prove) that this problem is essentially the same as converting the list to sorted order.  You can, of course, save a bit of time by stopping at the first duplicate but perhaps you actually want to find all duplicates?
Chris in Edmonton
Tuesday, August 16, 2005
 
 
Sort the strings and then look for duplicates. You'll get n log n complexity, which is about the best you can get from a comparison-based approach.
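
In C++, for instance, a rough sketch of that (untested; assumes the strings are in a std::vector<std::string>):

  #include <algorithm>
  #include <string>
  #include <vector>

  // Sort first; any duplicates end up next to each other.
  bool hasDuplicateSorted(std::vector<std::string>& strings)
  {
      std::sort(strings.begin(), strings.end());              // O(n log n)
      return std::adjacent_find(strings.begin(), strings.end())
             != strings.end();                                // O(n) scan
  }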
szeryf
Tuesday, August 16, 2005
 
 
Agreed -- sort the strings then look for duplicates.

An alternative approach would be to scan the list once, generating a SECOND list with the count for each entry in the first list.  However, now you have to search the SECOND list for each entry in the first list.

If you sort the first list first, then you only need to make an entry in the SECOND list for each duplicate in the first list -- no searching of the second list is now needed, because you know the first list is sorted.

And you no longer need to keep every string around on the chance that the first string seen might match the last string seen -- once the list is sorted, duplicates are always adjacent.

And 5000 records is not a lot of records to sort.  I would still take this approach if you had a million records, though.
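
As a rough illustration of the counting idea (untested; it uses a std::map as the SECOND list, so no explicit searching of it is needed):

  #include <map>
  #include <string>
  #include <vector>

  // Count how often each string occurs; any count above 1 is a duplicate.
  std::map<std::string, int> countEntries(const std::vector<std::string>& list)
  {
      std::map<std::string, int> counts;
      for (std::vector<std::string>::const_iterator it = list.begin();
           it != list.end(); ++it)
          ++counts[*it];
      return counts;
  }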
AllanL5
Tuesday, August 16, 2005
 
 
A little more detail about what you're trying to accomplish would be helpful. The fast way to do this depends largely on how much preprocessing you can do on the data, and on what you know about the strings ahead of time.

For example, if all you get is a big list of basically random strings all at once, then sort and compare is probably the easiest option. Adding the strings to a hash table will find the first duplicate faster than sorting the list would, assuming a good hash function.

But the real potential speed improvement comes from knowing more about the data - are all the strings the same length? Then you can compare machine-word-sized chunks at a time to speed things up. If all the strings are the same for the first or last N characters, then you can ignore those characters.

Having said all that, 5000 strings isn't a very big list, depending on what computing resources you have available.
Using the command line:
  cat file.txt|sort|uniq >noduplicates.txt
on a file with 5000 duplicate strings in it (out of 12000) takes less than 50 ms on a 2.5GHz PowerPC machine. And almost all of that is overhead of the shell, I bet.

-Mark
Mark Bessey
Tuesday, August 16, 2005
 
 
The application is a bookmark manager type of thing. When you are about to insert a URL to a list of URLs, I want to be able to find out if:

1) The exact URL already exists
2) A similar(!) URL already exists

Plus, I want to be able to find out if there are duplicates within the list.

If I do this by comparing each URL to the one I am about to add, it will take increasingly longer to do. I know PCs are fast, but that's no excuse to be inefficient if efficient algorithms exist.

I know there are (free) applications to do what I am doing here, but it is a learning experience kinda thing.

Thanks for your replies.
newbie
Tuesday, August 16, 2005
 
 
Two ways I can think of...

Using a binary tree (or red-black tree, B-tree, whatever, if you are worried about performance and all that rot), insert the elements one at a time.  If you hit a duplicate, drop it on the floor.  If you need sorted output then do a tree traversal.

If sorted output isn't really necessary I'd go with something that uses a HASH table.  A little more work to get the data back out but it will probably work a little better.

Or as others have pointed out use a sort routine.  Maybe one that will automatically remove duplicates.  Merge sort might be handy there.
zekaric
Tuesday, August 16, 2005
 
 
5000 is pretty small if you are only going to do it once.
But wait, how big are these strings?
captain damage
Tuesday, August 16, 2005
 
 
That "Similar" requirement bugs me.  You need a better definition of "Similar".

In any event -- create a dictionary of existing URLs.  Before you insert, query the dictionary to see if the URL already exists.  I think this is what Windows does, using a 'File Table' approach.  Check IfExists(URL_Name) and if it returns TRUE, you know you already have it.

Now, I've been using this 'dictionary' term.  By that I mean a stored, probably hashed, list of URLs.  Hashed, because you want to know very quickly whether a particular URL is in there already, and a hash function can do that for you.

One alternative is to keep the dictionary sorted, then do a binary search against it.  I think that might scale better.  log2(5000) is about 12.3, so the worst case would be 13 lookups.  Should still be fast enough, though.
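
For the sorted-dictionary variant, a rough sketch (untested; assumes the URLs are kept sorted in a std::vector<std::string>, with IfExists as the hypothetical check named above):

  #include <algorithm>
  #include <string>
  #include <vector>

  // 'dict' must be kept sorted for binary search to work.
  bool IfExists(const std::vector<std::string>& dict, const std::string& url)
  {
      // About log2(5000), i.e. at most 13 comparisons for 5000 entries.
      return std::binary_search(dict.begin(), dict.end(), url);
  }

  void InsertIfNew(std::vector<std::string>& dict, const std::string& url)
  {
      std::vector<std::string>::iterator pos =
          std::lower_bound(dict.begin(), dict.end(), url);
      if (pos == dict.end() || *pos != url)
          dict.insert(pos, url);   // keeps the vector sorted
  }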
AllanL5
Tuesday, August 16, 2005
 
 
For *similarity* of strings, you're probably going to want to look at the Smith-Waterman algorithm. Most of the research on similar strings currently being done is in support of genetic research (how close are these strands of DNA? does this make the same, or a similar, protein?). Years ago, it would have been done in areas of "data cleansing" (is "department of chemical engineering" the same as "dept chem eng"?) or patient tracking (is this patient J. Doe the same John Doe who was treated over there?).

Grab the PDFs (upper right hand corner, second row) from:
http://citeseer.ist.psu.edu/monge97efficient.html
http://citeseer.ist.psu.edu/monge00adaptive.html
Peter
Tuesday, August 16, 2005
 
 
A hash table is O(n) on average (assuming a good hash implementation), O(n^2) worst case -- that _can_ happen if you have an adversary who can craft your input. The result is not ordered.

A good sort (heap sort, merge sort) is guaranteed O(n log n), which is probably the best guarantee you can get. Quicksort will have that on average, but an adversary can make your run-of-the-mill quicksort operate at O(n^2). Many quicksorts do that for all-equal input.

5000 is a small number. Anything you do, even the simplest O(n^2) bubble sort will work for you.

You have to define "similar" better. A common string similarity metric is based on "edit distance", sometimes called "Levenshtein distance". It may or may not suit your specific application. [ http://en.wikipedia.org/wiki/Edit_distance ]

There is no ordering that properly takes "similarity" into account - you have to better define what you want to do.
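
For reference, a rough sketch of the textbook dynamic-programming edit distance (untested):

  #include <algorithm>
  #include <string>
  #include <vector>

  // Levenshtein distance: the minimum number of single-character
  // insertions, deletions and substitutions needed to turn 'a' into 'b'.
  int editDistance(const std::string& a, const std::string& b)
  {
      std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
      for (std::string::size_type j = 0; j <= b.size(); ++j)
          prev[j] = (int)j;
      for (std::string::size_type i = 1; i <= a.size(); ++i) {
          cur[0] = (int)i;
          for (std::string::size_type j = 1; j <= b.size(); ++j) {
              int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
              cur[j] = std::min(std::min(prev[j] + 1,       // deletion
                                         cur[j - 1] + 1),   // insertion
                                prev[j - 1] + cost);        // substitution
          }
          prev = cur;
      }
      return prev[b.size()];
  }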
Ori Berger
Tuesday, August 16, 2005
 
 
+1 zekaric. The only reason you would sort is if what you mean by "similar" is that they start the same. Like someone said, you need to think about what you mean by similar. For your original question of seeing if a URL is already in the list, you need a hash table. You would only do a binary search if you go with sorting, but a hash table with a big enough table is faster anyway. And you may need to consider (a rough sketch follows):
1) do some normalizing of the URLs, depending on how you get them, to make sure they are all prefixed with a protocol, e.g. http://
2) www.microsoft.com might equal microsoft.com without the www
3) compute the hash on the lowercased URL so A9.com == a9.com
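
A rough sketch of that kind of normalization (untested; note it lowercases the whole URL per point 3, which may be too aggressive for case-sensitive paths):

  #include <algorithm>
  #include <cctype>
  #include <string>

  // Lowercase, add a default protocol if missing, strip a leading "www.".
  std::string normalizeUrl(std::string url)
  {
      std::transform(url.begin(), url.end(), url.begin(), ::tolower);
      if (url.find("://") == std::string::npos)
          url = "http://" + url;
      std::string::size_type host = url.find("://") + 3;
      if (url.compare(host, 4, "www.") == 0)
          url.erase(host, 4);
      return url;
  }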
Ben Bryant
Tuesday, August 16, 2005
 
 
If you only actually care about duplicates during an insert, you can do your search in O(n).  This is a different problem from what you initially specified.

Start off by assuming no duplicates.
Now, you are about to insert a new item.  Scan the entire collection (O(n)) to see if the proposed addition already exists.

Although you have to do this for each new element, the work done in each case is TRIVIAL.  I mean, you'd be fine if you had literally a million elements.  The time it has taken me to WRITE this is longer than the time you'll spend IN TOTAL checking EVERY addition in a bookmark manager for duplicates.
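
For instance (rough sketch, untested):

  #include <algorithm>
  #include <string>
  #include <vector>

  // One O(n) scan per insert: skip the new URL if it is already present.
  void addIfNew(std::vector<std::string>& urls, const std::string& url)
  {
      if (std::find(urls.begin(), urls.end(), url) == urls.end())
          urls.push_back(url);
  }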
Chris in Edmonton
Tuesday, August 16, 2005
 
 
Build a tree index (binary, PATRICIA, whatever you like) of your list, stuffing a reference to each element into its appropriate "bucket". This is going to take one list traversal. After which you can simply go over the index and see which buckets contain more than one element - these are your duplicates. As a side effect, your index will have the elements in sorted order.

But indeed, it is more efficient to have your strings in a data structure that inserts them at sorted positions, possibly with an index. This would guarantee a sorted, duplicate-free list, at a relatively low overhead that is spread over the list population phase rather than paid at retrieval.
ping?
Wednesday, August 17, 2005
 
 
Ori Berger: hash table access is still O(n log n), because you have to compute the hash and that isn't a constant-time operation. So if you have a very large list, it theoretically doesn't matter whether you sort it or hash it.
szeryf
Wednesday, August 17, 2005
 
 
Chris in Edmonton: your algorithm is O(n^2). The worst case is when there are no duplicates.
szeryf
Wednesday, August 17, 2005
 
 
Ori Berger: hash table access is still O(n log n)

How so? Accessing a hash table takes two steps:
1) Calculate the hash: O(1). Hashing doesn't get more difficult for larger tables.
2) Look up the elements with the same hash value. If you use a chained hash table this will take O(1) + 1/2 * O(#collisions); you can get slightly better results with double hashing and other tricks.

Result: O(1) * O(1) = O(1). Not O(n * log n), that would be even slower than a LINEAR SEARCH!

 Chris in Edmonton: your algorithm is O(n^2).

No it isn't. Because Chris searches linearly through the list to check for duplicates, we can assume the list isn't sorted. Therefore, new elements can be added to the end of the list. This takes n steps per insert, hence O(n).
General Protection Fault
Wednesday, August 17, 2005
 
 
PS: I ignored the fact that every once in a while you'll have to reallocate memory when you insert a new element. But if you double the size of your array every time you run out of space, this will not have an appreciable impact on performance.

If you create a linked list of large chunks of memory you don't have to move elements when you reallocate, so even this (minor) performance problem can be resolved.

I'm not convinced this problem (5000 elements) warrants such drastic performance measures.
General Protection Fault
Wednesday, August 17, 2005
 
 
Szeryf,

You may not be able to, but I can search a list of entries in O(n) time and tell you if a new item would be a duplicate of any of the existing items.  Yes, the worst case is still when there are no duplicates, but it only takes n string comparisons in that case.

I am assuming n means the number of string comparisons here, mind you, rather than the number of CHARACTER comparisons.  It makes little sense to count the number of character comparisons, for reasons I won't bother going into right now.
Chris in Edmonton
Wednesday, August 17, 2005
 
 
szeryf: You are wrong on all counts. Please review your computer science.
Ori Berger
Wednesday, August 17, 2005
 
 
GPF: Yes, I know that most literature says that computing a hash value is O(1), but this assumes that the maximum hash value fits into a machine word. It's really O(log N), where N is the hash table size (because you need log N bits to address an array of N elements), and because in practice log N < 32 (or even 64), this can almost always be safely ignored.

As I reread my previous posts I see they need some clarification: I was talking about the complexity of the whole algorithm for finding duplicates. I should have said: a hash table lookup is O(log N), but you do this n times, so the whole algorithm is O(n log N) (or O(n) if you assume log N is ignorable).

Now, about Chris's algorithm: of course, linear search is O(n) and inserting an element at the end of the list is O(1), but you have to do it n times! n*O(n) is O(n^2), simple.

The problem is to "check if a list of n elements contains duplicates", not "check if a list of n elements contains element x". You cannot just "Start off by assuming no duplicates" unless you start with an empty list. So you start with an empty list and insert n elements into it. For each element you do O(n) comparisons and O(1) inserts.
szeryf
Thursday, August 18, 2005
 
 
Ori Berger: do you have any more valuable comments you would like to add to the discussion?
szeryf
Thursday, August 18, 2005
 
 
szeryf,

Because there are only a few elements the word size is not a limiting factor. I do agree that taking CS literature on algorithms at face value can be a bad idea because of oversimplification. Page faults, multi-cpu/threading and other complexities are never taken into account either. But I digress.

Your algorithms try to solve the problem the OP described in his first post. However, a few posts later the OP gave additional info which made other solutions viable.

Your algorithm and Chris' solve two entirely different problems. Hence the difference in runtime complexity.

You -do- start with an empty list, and every time you add an URL to the list you check for duplicates first. Therefore you only have to compare the new URL you're adding to the list to the other elements.
General Protection Fault
Thursday, August 18, 2005
 
 
szeryf: Please reread what you wrote.

Paraphrase: You guys are all wrong, I can sell a dollar for 1.25 euros. Oh, I just reread what I wrote and see I need to clarify. You guys are generally right, and it's the other way around.
Ori Berger
Thursday, August 18, 2005
 
 
Ori Berger: Where did I say that anybody is wrong? Aren't you overreacting a little? Not every comment on your opinion means "you're wrong".

I admit that in my second post (the one referring to your post) I should have mentioned that I was talking about the complexity of the whole algorithm, not a single access, but it seemed obvious to me because _you_ were also talking about the whole algorithm. But this is a clarification, not "the other way around".
szeryf
Thursday, August 18, 2005
 
 
Read 'em all into a perl hash (or hash in the language of your choice) and then foreach and dump 'em into something else?

5000, <1K each, <5Meg, shouldn't be a problem ...
Matthew Heusser
Friday, August 19, 2005
 
 

This topic is archived. No further replies will be accepted.
