The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Algorithm for Synchronizing Offline Data

We're using SQLite as an offline data store, and it doesn't have any built-in means of synchronizing with a server data store. So we're going to have to roll our own mechanism to do it.

I want to avoid using GUIDs for each record (like SQL Server does for replication), so in any syncable table in the offline data store we will simply have an integer column (rowstat) that can contain the following codes:

-1 = deleted (on server)
0 = normal
1 = new record offline
2 = updated offline
3 = deleted offline

(The corresponding tables on the server also have the rowstat field, but it will only ever contain the codes -1 or 0.)
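
For concreteness, a minimal sketch of what one of those tables might look like (the "customer" table and its other columns are just made-up examples):

import sqlite3

conn = sqlite3.connect("offline.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS customer (
        id      INTEGER PRIMARY KEY,
        name    TEXT,
        rowstat INTEGER NOT NULL DEFAULT 0  -- codes -1/0/1/2/3 as above
    )
""")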

The only problem left is how to get the changes made by other users while the client is offline. So, every time the client syncs with the server it takes the server-provided UTC date + time and saves it. Next time it wants to sync up, it can just grab any non-conflicting records that were created/changed after that date + time...

For conflict resolution, each syncable table has a "lastchange" column that holds the server's UTC date + time in it... Also, records created offline would have a negative int ID associated with them until they sync...
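
Roughly like this, as a sketch (server_query, apply_server_row and the "customer" table are hypothetical stand-ins for the transport and merge steps):

def pull_changes(server_query, last_sync, apply_server_row):
    # Grab any server rows created/changed after the saved sync time.
    for row in server_query(
            "SELECT * FROM customer WHERE lastchange > ?", (last_sync,)):
        apply_server_row(row)

def insert_offline(conn, name):
    # New offline records get the next negative ID until they sync up.
    low = conn.execute("SELECT MIN(id) FROM customer").fetchone()[0]
    next_id = min(low or 0, 0) - 1
    conn.execute("INSERT INTO customer (id, name, rowstat) VALUES (?, ?, 1)",
                 (next_id, name))
    return next_id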

I'm pretty sure that this is all the info that I need to track changes while in offline mode.

Is this enough to get the job done? I've had trouble finding reading material on the topic; most of what I've found deals with commercial solutions that don't apply...

Any comments or suggestions would be appreciated. Thanks.
J. Timberlake
Monday, March 26, 2007
 
 
You could do the Exchange Server/Active Directory model of assigning an Update Sequence Number every time a change is made.  It is more reliable than a date stamp because people sometimes mess with their clocks.  In a star topology, each time a client connects to the central server it simply asks for all changes with a USN greater than the last one it got from the server.  The client can also keep a local USN number for sending changes back to the server.  This model can also be extended to a mesh model by holding a list of USNs for each row.  One USN for each database that it syncs with.
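
A minimal sketch of that star-topology pull, assuming a hypothetical usn column and dict-like rows (server_query, apply_server_row and save_last_usn are stand-ins):

def pull_since_usn(server_query, last_usn, apply_server_row, save_last_usn):
    # Ask the central server for everything newer than the last USN this
    # client received, then remember the highest USN it saw for next time.
    rows = server_query("SELECT * FROM customer WHERE usn > ?", (last_usn,))
    for row in rows:
        apply_server_row(row)
    if rows:
        save_last_usn(max(row["usn"] for row in rows))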
JSmith
Monday, March 26, 2007
 
 
I was thinking about using a USN instead of date + time. I figured that if I always used the server's time there'd be no problem, but I didn't think about someone messing with the server time, so maybe that would be best.
J. Timberlake
Monday, March 26, 2007
 
 
> I want to avoid using GUIDs for each record

Why?
Christopher Wells
Monday, March 26, 2007
 
 
Well, I don't want to use GUIDs because I don't really see the need for them; they take up a lot more storage/index space, and lookups are slower.

I can track inserts, updates and deletes in offline mode with a single integer column. Auto-increment of entity table ID columns only happens in "connected" mode, on the server. In offline mode, entity ID columns are decremented, and they are only kept in the negative zone until the client syncs up...

Maybe if someone can poke a hole in that theory I would change my mind.
J. Timberlake
Tuesday, March 27, 2007
 
 
Question about the USN implementation: it gets incremented each time any client syncs up, right? That's the only way I can see it not bottlenecking.

This is the way that I see it working:

Server: USN = 1
Client A: Connect
Client A: UPDATE Customer SET ..., USN = Server.USN ...
Client B: Connect
Client B: Request Sync with Server (USN 0)
Server: USN = 2
Server: Send Diffed Data (rows where USN > 0)
Client B: Save USN 1 (for next sync request)
Client A: UPDATE Customer SET ..., USN = Server.USN ...
Client B: UPDATE Customer SET ..., USN = Server.USN ...
Client A: Request Sync with Server (USN 1)
Server: USN = 3
Server: Send Diffed Data (rows where USN > 1)
Client A: Save USN 2 (for next sync request)
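
The server side of that walkthrough might look something like this (a sketch; the counters table and the "customer" table are made up):

def handle_sync_request(conn, client_last_usn):
    # Snapshot the current USN for this client, then bump the counter so
    # that updates made during/after this sync land above everything we
    # send back (this matches the walkthrough above).
    current = conn.execute(
        "SELECT value FROM counters WHERE name = 'usn'").fetchone()[0]
    conn.execute("UPDATE counters SET value = value + 1 WHERE name = 'usn'")
    rows = conn.execute(
        "SELECT * FROM customer WHERE usn > ?", (client_last_usn,)).fetchall()
    return current, rows  # the client saves `current` for its next request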
J. Timberlake
Tuesday, March 27, 2007
 
 
One of the pattern books has a pattern for replication.

Two older books that have some ideas:
http://www.amazon.com/Data-Replication-Techniques-Distributed-Information/dp/0471157546/
http://www.amazon.com/SyncML-Synchronizing-Managing-Your-Mobile/dp/0130093696/

I think you don't have enough information to determine changes.

At a previous company, the group I ended up in charge of did data replication all over the place, all the time. Most of the algorithms in use were good for push-type replication, and I was introducing two-way replication. They were similar to what JSmith mentions.
Peter
Tuesday, March 27, 2007
 
 
Thanks Peter, those books look interesting.

"I think you don't have enough information to determine changes"

Care to comment on what's missing? The only thing I can see is the inability to track individual field changes.
J. Timberlake
Tuesday, March 27, 2007
 
 
"Maybe if someone can poke a hole in that theory I would change my mind."

Well, for one thing, simply allowing deletions makes your work a hell of a lot more complicated. For example, what happens if client A submits a #3 request (deleted offline) and five minutes later, client B submits a #2 request (updated offline)? I'm not saying it can't be done, I'm saying that's a hell of a lot of work just to avoid using GUIDs.

It's a difficult decision: you have to decide whether the time and effort spent coding and TESTING a robust solution is cheaper than the hardware penalty. I will point out that five years down the road, it's easier and cheaper to upgrade the hardware (add more RAM, switch to a new DB) than it is to go back and reoptimize code.

If you're really that worried about space and performance issues, then I would still use the concept behind GUIDs, but not actual GUIDs. That is, each document has a three-column primary key: an id unique to the document, a time stamp, and a current state. Furthermore, I would not permit actual deletions per se.

As a result, if I wanted to get the latest document, I just pass in the id, and fetch the most recent time stamp where the state is not "deleted". Every time a client synches, the client always inserts a new record.
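
A sketch of that versioned-row idea (all names made up; deletions are just tombstone versions):

import sqlite3

conn = sqlite3.connect("offline.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS document (
        doc_id INTEGER NOT NULL,  -- id shared by all versions of a document
        stamp  TEXT    NOT NULL,  -- server UTC time stamp (sortable text)
        state  TEXT    NOT NULL,  -- 'normal' or 'deleted'
        body   TEXT,
        PRIMARY KEY (doc_id, stamp)
    )
""")

def latest(conn, doc_id):
    # The most recent version whose state is not 'deleted'.
    return conn.execute(
        """SELECT * FROM document
           WHERE doc_id = ? AND state <> 'deleted'
           ORDER BY stamp DESC LIMIT 1""", (doc_id,)).fetchone()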

Whoops, I take that back. I just realized if GUIDs take up too much space, this always-insert philosophy will take up even more.
TheDavid
Tuesday, March 27, 2007
 
 
Whatever you use to track changes, be aware that synchronization has a basic problem when the data can be changed in more than one place. E.g. you have your repository R and two clients A and B. Now A and B both change the same piece of data and try to synchronize it back to R.

If you take the newest version and throw the other away, you have lost data for one of the clients. If you try any scheme to automatically handle these conflicts, e.g. merge them, you have to handle the file formats and you risk corrupting the documents, never mind how it could be done for database rows. If you do a silent fork, you end up with multiple documents.

The only safe solution is to take both versions and require a manual merge, or a manual decision about which is the definitive version. And things get worse when more clients are involved. Manual control may be possible for 2 or 5 documents, but no one will do it when 20 conflicts occur. Especially when you think about when the syncs are usually done: when you are already in a hurry to go. ;)
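
A sketch of that check, assuming each side still has the common base version to compare against (all names hypothetical):

def resolve(base, server_version, client_version, queue_for_manual_merge):
    # If only one side changed since the common base, the changed copy is
    # safe to take. If both sides changed and they differ, don't guess;
    # hand both versions to a human.
    if server_version == base:
        return client_version
    if client_version == base or client_version == server_version:
        return server_version
    return queue_for_manual_merge(server_version, client_version)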
Secure
Tuesday, March 27, 2007
 
 
I am not allowing actual deletions regardless of whether I use GUIDs or not. I take that back: Actual deletions are enabled only through an Administrative interface and should only be done by a technical person who understands backups.

Also, I have plans for conflict resolution. Basically, every time there is a conflict the synchronizer has to decide:

1) Server Wins
2) Client Wins
3) User Decides

When a client syncs up they can pick how they want to handle all conflicts (sketched below)... That's the first part.

The second part is that each important table has a corresponding log table that mirrors all of the changes that have taken place. So, things can be rolled back if need be.
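
The first part is simple enough to sketch (hypothetical names; ask_user would show the user both versions and let them pick):

SERVER_WINS, CLIENT_WINS, USER_DECIDES = 1, 2, 3

def resolve_conflict(policy, server_row, client_row, ask_user):
    # Apply the per-sync policy the client picked to a single conflict.
    if policy == SERVER_WINS:
        return server_row
    if policy == CLIENT_WINS:
        return client_row
    return ask_user(server_row, client_row)  # USER_DECIDES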
J. Timberlake
Tuesday, March 27, 2007
 
 
Here are two scenarios that you might not have considered; they were "gotchas" in the past.

Two users: Alice and Bob.

Scenario 1:
Monday:
Bob changes record_X.

Tuesday:
Alice synchs with master db.

Wednesday:
Bob synchs with master db.

Thursday:
Alice synchs with master db.
Does she see the change to record_X? After all, it changed *before* her last sync.

Scenario 2:
Alice has a slow connection
Bob has a fast connection

Bob changes record_X.

Alice starts synch at noon
Bob starts synch at 12:01 pm
Bob finishes synch at 12:02 pm
Alice finishes synch at 12:30 pm.

Did the process pick up the change to record_X?
If users make changes all the time, will slow connections ever complete synchronization?

Changes made in the gap between start/finish caused a couple of hard-to-find bugs in the replication.

The specs and stuff for SyncML appear to have moved here:
http://www.openmobilealliance.org/tech/affiliates/syncml/syncmlindex.html

I'd put some columns in the tables that indicate the source of the data. AddUser, ChangeUser, AddDate, ChangeDate, SyncDate, SyncSource.
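
For example, as a sketch (only the audit columns come from the suggestion above; the rest is made up):

import sqlite3

conn = sqlite3.connect("offline.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS customer (
        id         INTEGER PRIMARY KEY,
        name       TEXT,
        AddUser    TEXT,  -- who created the row
        ChangeUser TEXT,  -- who last changed it
        AddDate    TEXT,
        ChangeDate TEXT,
        SyncDate   TEXT,  -- when the row was last synchronized
        SyncSource TEXT   -- which replica/user the change came from
    )
""")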

IBM purchased this guy's company, and pretty much all the info about it disappeared from the web:
http://jeffjonas.typepad.com/jeff_jonas/2006/10/source_attribut.html
It was called NORA, non-obvious relationship analysis, and one of the things he came up with is a traceability feature where you could track data back to its source. IBM now sells it, as Source Attribution and Data Tethering, for about the price of a nice car. I haven't found much written about it, but for a replication scheme of the complexity that it appears you're developing, it is a bit of overkill.

Source attribution will be very important when working backwards, tracing data. There will most likely be some very tense, nasty meetings where the questions are of the nature of "who changed this record" or "how did this record get in here".
Peter
Tuesday, March 27, 2007
 
 
Peter, I think that with the USN method that scenario will be handled. If the USN is incremented before the sync starts, then any updates made during/after the sync will have the higher USN.

If a client doesn't get all of the updates that were made during a sync it's not a big deal. They'll just have to do it again or I could cycle it until there are no new updates...
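
That "cycle until there are no new updates" idea might look like this (a sketch; sync_once stands in for one request/response round and returns the new USN plus any changed rows):

def sync_until_stable(last_usn, sync_once, max_rounds=10):
    # Repeat full sync rounds until one comes back empty, so a slow client
    # eventually catches up instead of chasing the server forever.
    for _ in range(max_rounds):
        new_usn, rows = sync_once(last_usn)
        last_usn = new_usn
        if not rows:
            break  # caught up
    return last_usn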

Well, anyway it looks like I have a lot of testing to do, to say the least.

Thanks.
J. Timberlake
Tuesday, March 27, 2007
 
 
You might want to take a look at the Simple Sharing Extensions spec at http://msdn.microsoft.com/xml/rss/sse/.

While couched in terms of RSS feeds, the spec is really about synchronizing multiple changes to a single set of items, and gives explicit semantics for it. It's gotten some pretty good press, and it does work.

Might be worth checking out to see if you've at least covered the same ground.
Chris Tavares
Tuesday, March 27, 2007
 
 
Looks like Microsoft removed anything mentioning Simple Sharing Extensions. (Content Not Found, 404s, etc. are all I get.)
J. Timberlake
Tuesday, March 27, 2007
 
 
The links are redirecting correctly now.

Friday, March 30, 2007
 
 

This topic is archived. No further replies will be accepted.
