The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

database vs stringlist

Hey,

I have come across this problem, I have some data that I need to remove duplicates for. I use string list for that (loop through stringlist with each addition to check if the same record exists). However, this gets very slow after 10000 records. Would database be faster?
johng
Tuesday, May 13, 2008
 
 
Your solution is probably of order N2 (N times an N-order list lookup).

Use a container. 10,000 strings is not much to process in memory. It would help if you mentioned your language, but in C++ I'd do it with an std::set<std::string>, in .Net with some of their cool collections or dictionary or whatever.
Daniel_DL Send private email
Tuesday, May 13, 2008
 
 
In .NET, I'd use the string as a key in a Hashtable. Then it's a matter of only adding nonexisting strings, and finally retrieve all the keys.
Thomas Eyde Send private email
Tuesday, May 13, 2008
 
 
if you keep the string-list sorted, you can use a binary search of the list/array. that would be a lot faster. if you happen to be using sqlite or another database, then its easy enough to create a table to hold the rows or a hash of the row. personally i have pre-written sqlite code that i can use for this sort of thing.

-don
Don Dickinson Send private email
Tuesday, May 13, 2008
 
 
+1 hash table or binary search to stay at O(n) or O(log n)
Cade Roux Send private email
Tuesday, May 13, 2008
 
 
it is in delphi
johng
Tuesday, May 13, 2008
 
 
There is another very simple way to eliminate duplicates in Delphi using a TStringList which might work for you.  With your strings in a TStringList, create a second TStringList.  For the second list, set Sorted = true and Duplicates = dupIgnore. Assign your original string list to the newly created list and you are done.

I tested it in C++ Builder with a list of 2 million strings and it took about 3 minutes.
DaveW
Tuesday, May 13, 2008
 
 
Make sure you set the capacity of the list before adding items. A TStringList is kinda slow if it needs to grow while adding.

It would be faster to check wether a string is already in the collection upon adding, compared to filling the list and removing duplicates.

If you don't want to check for uniqueness upon adding, then set the string to empty (="") instead of deleting them from the collection. Adding and deleting from the collection is more time-consuming than changing a value.

HTH
Eddy Vluggen Send private email
Tuesday, May 13, 2008
 
 
In Java:

String[] allStrings = Your Array;
Set uniqueStrings = new HashSet();
for(int i = 0; i < allStrings.length; i++) {
  uniqueStrings.add(allStrings[i]);
}

If order is important, you can use LinkedHashSet.
Austinian
Wednesday, May 14, 2008
 
 
To me using a database is the most user
friendly way but is probably the wrong
way in this case. It is a deployment issue
and probably overkill if this is the only
thing you would use it for.

You could conceivably use a sorting
algorithm such as merge sort, that's how
a database deduplicates internally.

Or you might want to get yourself a
sorted container like a hashtable, .net
has one but I can't verify that Delphi
classic does. You might want to find a
3rd party component, according to google
there may even be some free ones around.

I don't know how easy it is to use COM
objects from Delphi, but maybe you can
get hold of the MFC or MS-scripting
hashtables that way, or maybe a C++ map,
but this seems to be overkill also.
Object Hater
Wednesday, May 14, 2008
 
 
"f you happen to be using sqlite"

Then don't, because it's a piece of crap.

Its code reimplements the printf function... sheer madness.
quant dev Send private email
Thursday, May 15, 2008
 
 
Sorry, but I once spent two days debugging a crashing svntrac which was backed by sqlite and didn't manage to find the source of the segmentation fault. But it was sqlite's fault after it processed the SQL commands in its own reimplementation of printf. Just horrible.
quant dev Send private email
Thursday, May 15, 2008
 
 

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
 
Powered by FogBugz