The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

database synchronization

I would like to synchronize client databases with each other by talking via a server.

Each client has a database. When they talk to the server, they upload what's new and the server sends back what's new with the other clients.

The database engine is SQLite.

I did some searching on google, but didn't find anything too useful. Do you guys have experience with this sort of thing?
coder
Saturday, September 15, 2007
 
 
This appears to be a complicated issue, so let me give you some more details.

Each client has a database that holds, let's say, names of cities. The users get a default set of cities when they install the client app. As they use the app, they add more city names of their choice.

At some point, each client syncs up with the server to send their current list of city names. The server collects all names in a master list and distributes this master list to all clients so that each client has all the available city names.

The only problem is that there could be duplicates or near duplicates  and I don't really know how to deal with this case. For example, someone enters Boston, and someone else enters Bostons. Aside from the typo, these two names are the same. How could I resolve this sorta thing?
coder
Saturday, September 15, 2007
 
 
I don't mean to sound facetious, but why don't you just ship a complete list of cities for your territories of interest in the first place ?

Then, you won't need to do any synchronisation. Actually, the synchronisation bit it easy, it is cleaning the multiple lists that is hard, as you have realised with your Boston example. That is why you should just ship a full list to everyone.

Apart from China, new cities seldom spring up, so data ageing shouldn't be a problem.
Entries of Confusion Send private email
Sunday, September 16, 2007
 
 
The "answer" to your question will involve a lot of reading, I'm afraid. The word you're looking for is "replication" and here are some older posts at JoS that come to mind:

http://discuss.joelonsoftware.com/default.asp?design.4.530464.3
http://discuss.joelonsoftware.com/default.asp?design.4.301193.14 which leads to:
http://www.lc.leidenuniv.nl/awcourse/oracle/server.920/a96567/repconflicts.htm#22336
http://discuss.joelonsoftware.com/default.asp?design.4.471271.15
Peter Send private email
Sunday, September 16, 2007
 
 
Entries of Confusion,

Thanks for your reply. "Names of cities" was just an analogy. In reality, there is no way to know in advance the list of all possibilities although, as you suggested, I could do my best to release a list that is as comprehensive as possible.
coder
Sunday, September 16, 2007
 
 
Peter,

I did do some research on replication . It looks like replication is just a direct copy of a dataset. My problem is beyond merely copying data.

For example, someone could enter "Washington" and someone else could enter "Washington DC". These are both the same city and replication will merely make sure every client has both of these names, but it does not handle the fact that they are the same city.

The only thing I can think of is do some kinda regex on the server and figure out which clients have which names and keep track of that. If the client says it has "Washington" and the server has previously seen "Washington DC", it will have to know that these two things are the same thing and act accordingly. That just sounds extremely messy though.

Perhaps I am chasing something that either can't be resolved or can't be resolved without getting very messy, but I wanted to ask you guys anyways. If you have any more ideas, I am all ears.
coder
Sunday, September 16, 2007
 
 
Pretty much any time you let the user enter ad-hoc data you are going to get a mess back. Just how many ways do you think the user can mispell the same words? In many cases it is possible to provide lists of choices but in many others it is not. There pretty much is no real cure for the case of duplicated entries beyond having a real person sit down on a regular basis and weed out the duplicates. Regex's will help you find some of them but they simply aren't smart enough to recognize that "Wshingtn" and "Washington" are supposed to be the same thing (if they even are).
dood mcdoogle
Sunday, September 16, 2007
 
 
On one of those links I pointed to NORA. This sounds somewhat closer to what you're now asking for.
Peter Send private email
Sunday, September 16, 2007
 
 
Oops. Hit enter too soon. I think you've picked a messy subject that isn't going to be easy to solve.

http://en.wikipedia.org/wiki/Washington_%28disambiguation%29

Humans resolve words via context. If you're going to put some sort of context resolver on the server, I think you're going to have a larger problem than you think you have. How do voice recognition software products handle this? You might want to start looking there.
Peter Send private email
Sunday, September 16, 2007
 
 
coder, I thought that might be the case ;-)
Entries of Confusion Send private email
Monday, September 17, 2007
 
 
A similar problem is possibly handled by sites which deal with tagging (like delicious). I could label something "fruit" and you can label it "fruits". We are talking about the same thing, but now we've created two tags which could easily be merged. The computer has no idea these tags cover the same content.

Do they deal with it or just let it go?
coder
Monday, September 17, 2007
 
 
That would be stemming.
http://en.wikipedia.org/wiki/Stemming
Peter Send private email
Monday, September 17, 2007
 
 
Coder, for the two examples you give (tagging & city names) a human's intervention will be required.  You pick what is "similar" as a threshold (first 5 chars the same, Soundex returns identity, etc.) and kick all the "same-sounding" items into an exception list.  A human looks at the list, decides which wins, makes a final change, and the synchronizer sends the winner back to everyone.

Something on the client end then has to reset all keys in the client's app to match the winner. This is the problem with synchronization: there is almost never just one table involved, typically there are many.  So there are GUID keys required, or you have to create something that changes the tempkeys created when "Washington" and "Washington DC" were created with the winning entry's key.

This is an extremely difficult task.  Why do you need to have different clients synch, anyway?
Karl Perry Send private email
Tuesday, September 18, 2007
 
 
==>The only thing I can think of is do some kinda regex on the server and figure out which clients have which names and keep track of that. If the client says it has "Washington" and the server has previously seen "Washington DC", it will have to know that these two things are the same thing and act accordingly. That just sounds extremely messy though.

You're dreaming the impossible dream if you want this handled in an automated fashion. It's gonna require an actual human being, and one with some intelligence, to figure out the intent of the folks doing the data entry.

I'd argue that it's beyond "messy". To handle it without human intervention will be darned near impossible.

Your Washington example is a pretty good example for my area too. I'm in southern Ohio and we've got a town relatively close ('bout 45 miles) called "Washington Courthouse". I've seen this listed, on various maps, as:

Washington Courthouse
Washington Court House
Washington CourtHouse
Washington CH
Washington C.H.
Washington C
Washington C.
Wash. CH
Wash. C.H.

I imagine there's quite a few places out there with even more variation.

I really don't see a way to automate intelligence. Call back in a few decades, we might be able to do it then but in the mean time, I think you'll need a RealHumanBeing(tm) to maintain this monstrosity.
Sgt.Sausage
Tuesday, September 18, 2007
 
 
I agree with those who say we aren't ready to automate intellegence but you can put in place a process or structure in which that intellegence is applied.

You have two questions deal with.  What is the cost benefit associated with cleaning up all these duplicate names.  If the cost benefit is high enough to justify the effort then process would we want to use to do it.

Who would decide that two names are the same?  Would this be performed by some central group or would each client make these decisions.  Would a decision by one client be allowed to automatically propagated to some other client? 

From a system design point of view your main focus would probably be in two areas.

1.  How would a user specify that two names are the same and what happens after they do that?

2.  Can you present a convenient list of probable name duplications that the use can use.
Bill Hamaker Send private email
Tuesday, September 18, 2007
 
 
SRD put at least a hundred man-years into solving this sort of problem before IBM purchased their company. NORA is the product and while originally aimed at CRM, it evolved into something aimed more at loss prevention and law enforcement. In a casino, it is very important to know if this new gambler lives with an existing casino employee. Or, for a retailer, if this new employee lives with someone you prosecuted for theft. Or in some cases, you just hired someone you had had convicted of stealing from you. That is a serious form of institutional amnesia.
http://jeffjonas.typepad.com/IEEE.Identity.Resolution.pdf
http://www.theregister.co.uk/2005/01/07/ibm_buys_srd/
http://www.pcworld.com/article/id,103692-page,1/article.html

Casino fraud can cost a casino millions of dollars. As a result, they'll spend a very large amount of money to solve/prevent this problem.

At a previous employer, we handled electronic prescriptions, insurance claims and lab reports; the CEO was interested in creating an anonymized longitudinal database that could let researchers study the efficacy of long term drug use, or whether certain treatments actually worked. So I became interested in NORA back around 1999 because that would have been needed to match the records in the various mostly incompatable systems involved. It became very clear that my department of 2 people, who were responsible for data replication throughout the company, couldn't take on the task of scrubbing the data without at least doubling in size. I'm proud that I did get automated city/state/zip data scrubbing set up, as well as something to automatically handle the area code splits that affected the several hundred thousand phone/fax numbers in the system.

http://jeffjonas.typepad.com/jeff_jonas/2007/04/to_know_semanti.html

What you're describing isn't something that a single person can implement over a weekend. I think it is on the scale of research that would be needed for a PhD. But it is something I find fascinating and want to be able to implement myself. I wish he wrote some books or papers on what he's done.
Peter Send private email
Tuesday, September 18, 2007
 
 
Thank you for all your great replies. I'll have to find a different way to resolve this.
coder
Thursday, September 20, 2007
 
 
Coder, what we're all saying is that problem entries, such as your similar city name example, will always require human intervention.

This kind of thing is done a lot.  You just have to have someone - one person - who's job it is to resolve the conflicts.  That may be two minutes a day or it might be a full-time job - the volume of data involved and the number of users will determine how much work is involved on a continuous basis.

It's not a simple task even with human conflict resolution, but it is doable.
Karl Perry Send private email
Thursday, September 20, 2007
 
 

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
 
Powered by FogBugz