The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

The same file on two computers - which is newer?

I can't see any way of determining whether one file is newer than another, when there is less than 24 hours time difference between them and they are located on different computers.

Time zones and DST affect any decision you make. Also you don't know whether the OS maintains file times as UTC or Local time. Windows PCs can have file times differ by 1 hour or 2 seconds depending on the File System being used (NTFS, FAT32) and DST. You can't easily go back in time and find out when DST started and ended for a particular year.

Your thoughts.
Trying times! Send private email
Thursday, August 03, 2006
 
 
> Also you don't know whether the OS maintains file times as UTC or Local time.

It's UTC isn't it? The FILETIME structure is UTC.
Christopher Wells Send private email
Thursday, August 03, 2006
 
 
"OS maintains file times as UTC or Local time."

It doesn't matter.  The OS knows which and it knows its own timezone, so it'll give you the file time as UTC.
Almost H. Anonymous Send private email
Thursday, August 03, 2006
 
 
From MSDEV:
"The FILETIME structure is a 64-bit value representing the number of 100-nanosecond intervals since January 1, 1601."

No mention of UTC. Windows GetFileTime() doesn't mention UTC.

My understanding is Windows stores file times in local time, not UTC.

I'm trying to be OS agnostic here as well.
Trying times! Send private email
Thursday, August 03, 2006
 
 
FILETIME

"Contains a 64-bit value representing the number of 100-nanosecond intervals since January 1, 1601 (UTC)."

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sysinfo/base/filetime_str.asp

I'm not sure why your definition doesn't have that "(UTC)" on the end.
Almost H. Anonymous Send private email
Thursday, August 03, 2006
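
For anyone who wants to see what that definition means in practice, here is a minimal sketch of turning a raw FILETIME tick count into a UTC date. It assumes you already have the 64-bit value as an integer; the function names are made up for illustration and are not part of any Windows API.

```python
from datetime import datetime, timedelta, timezone

# FILETIME counts 100-nanosecond ticks from this instant (UTC).
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

# Ticks between 1601-01-01 and the Unix epoch, 1970-01-01 (both UTC).
TICKS_TO_UNIX_EPOCH = 116_444_736_000_000_000


def filetime_to_datetime(ticks: int) -> datetime:
    """Convert a FILETIME tick count to an aware UTC datetime.

    Python datetimes resolve to microseconds, so the final 100 ns digit
    of the tick count is dropped.
    """
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)


def filetime_to_unix_seconds(ticks: int) -> float:
    """Convert FILETIME ticks to seconds since 1970-01-01 UTC."""
    return (ticks - TICKS_TO_UNIX_EPOCH) / 10_000_000
```

Nothing in the conversion involves a time zone, which is the point: two machines reading the same tick count get the same instant regardless of their local settings.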
 
 
>It doesn't matter.  The OS knows which and it knows its own timezone, so it'll give you the file time as UTC.

If a file has a time stamp of 1 Mar 2003 01:00:00 and the OS stores this as local time, how do we know what the UTC time is for that file? Was DST in effect then or not?
Trying times! Send private email
Thursday, August 03, 2006
 
 
"If a file has a time stamp of 1 Mar 2003 01:00:00 and the OS stores this as local time, how do we know what the UTC time is for that file? Was DST in effect then or not?"

If I set the timestamp of a file to set it to 1 Mar 2003 01:00:00 PST what is the UTC timestamp of the file?

If you know the timezone you're in, you can calculate whether or not DST was in effect for any date.  There's a huge table of values in the operating system to handle converting *any* datetime to and from UTC.

"The FAT file system records times on disk in local time. GetFileTime retrieves cached UTC times from the FAT file system."
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sysinfo/base/file_times.asp

You do have to be careful but it's not as bad as you make it out to be.
Almost H. Anonymous Send private email
Thursday, August 03, 2006
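
To make the worry about local time concrete, here is a small sketch showing that a bare wall-clock reading maps to two different UTC instants depending on which offset you assume. The two Pacific offsets are written out by hand purely for illustration; a real program would leave the choice to the OS or a time zone database.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for US Pacific time, hard-coded for this illustration only.
PST = timezone(timedelta(hours=-8), "PST")  # standard time
PDT = timezone(timedelta(hours=-7), "PDT")  # daylight time


def local_to_utc(wall_clock: datetime, offset: timezone) -> datetime:
    """Attach a UTC offset to a naive local reading and convert it to UTC."""
    return wall_clock.replace(tzinfo=offset).astimezone(timezone.utc)


stamp = datetime(2003, 3, 1, 1, 0, 0)  # "1 Mar 2003 01:00:00", no zone attached

as_standard = local_to_utc(stamp, PST)  # 09:00 UTC the same day
as_daylight = local_to_utc(stamp, PDT)  # 08:00 UTC the same day
```

The one-hour gap between the two results is the same one-hour error mentioned at the top of the thread. It goes away once the stored value is UTC to begin with.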
 
 
Christopher Wells Send private email
Thursday, August 03, 2006
 
 
> There's a huge table of values in the operating system to handle converting *any* datetime to and from UTC.

The registry has TIME_ZONE_INFORMATION data for each time zone, but perhaps not for past years (for places where the DST algorithm might change from one year to another). For old files from some places you might want to find a (external) database of historical DST dates.
Christopher Wells Send private email
Thursday, August 03, 2006
 
 
Thanks folks.

Re. FILETIME I was looking at my VC6 MSDEV Docs, not the latest info on-line.

I'm using Boost::filesystem so I can write cross platform code. It uses stat() which I think is part of my problem. From what I can see stat() - struct _stat, returns local time, not UTC.

I'll give GetFileTime() a shot.

>If you know the timezone you're in, you can calculate whether or not DST was in effect for any date.  There's a huge table of values in the operating system to handle converting *any* datetime to and from UTC.

How do you access this? I assume it gets updated by the Windows code that corrects DST start/end dates as required.
Trying times! Send private email
Thursday, August 03, 2006
 
 
"How do you access this?"

You don't really have to.  It's all handled behind the scenes in any code that gives you a UTC or allows you to convert a UTC to/from a local time.

Christopher seems to be right about historical data but since you're specifically talking about files less than 24 hours old, that should be a non-issue.
Almost H. Anonymous Send private email
Thursday, August 03, 2006
 
 
>I'm using Boost::filesystem so I can write cross platform code. It uses stat() which I think is part of my problem. From what I can see stat() - struct _stat, returns local time, not UTC.

I rescind that. stat() is UTC as well. I assumed it was local time, which has caused my confusion. If I forget about FAT, I think I'll be good.
Trying times! Send private email
Thursday, August 03, 2006
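
A quick way to check this for yourself, sketched here in Python rather than through Boost: the modification time reported by stat() is a count of seconds since the Unix epoch, so it can be turned into a UTC date without consulting the local time zone. The helper name and the scratch file are only for illustration.

```python
import os
import tempfile
from datetime import datetime, timezone


def mtime_utc(path: str) -> datetime:
    """Return a file's last-modification time as an aware UTC datetime."""
    return datetime.fromtimestamp(os.stat(path).st_mtime, tz=timezone.utc)


# Demonstration: stamp a scratch file with a known instant and read it back.
with tempfile.NamedTemporaryFile(delete=False) as scratch:
    scratch_path = scratch.name

known = datetime(2006, 8, 3, 12, 0, 0, tzinfo=timezone.utc).timestamp()
os.utime(scratch_path, (known, known))  # (access time, modification time)
read_back = mtime_utc(scratch_path)
os.remove(scratch_path)
```

The value read back is the instant that was written, whatever the machine's time zone setting happens to be.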
 
 
Hmm, I think the real problem here goes deeper. If I have two timestamps from two different computers I can not really suppose that their clocks are _exactly_ sync'ed!

If the difference between the two timestamps is smaller than the (unknown!) difference of the two involved clocks, I have a problem. I can not tell which file is newer than the other.

The only solution to this would be a single time source for the computers involved. Clocks regularly updated via NTP may be sufficient (if sync'ed often enough), though.
Jürgen
Friday, August 04, 2006
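
One way to act on this is to refuse to pick a winner when the two stamps are closer together than the clocks can be trusted. A minimal sketch follows; the 2-second default is an assumption chosen to cover FAT's timestamp granularity, not a measured clock error, and the names are hypothetical.

```python
from enum import Enum


class Newer(Enum):
    LEFT = "left"
    RIGHT = "right"
    UNKNOWN = "unknown"


def which_is_newer(left_mtime: float, right_mtime: float,
                   max_skew: float = 2.0) -> Newer:
    """Compare two UTC timestamps (in seconds) taken on different machines.

    max_skew is the assumed worst-case disagreement between the two
    clocks plus timestamp granularity. It is a guess, not a measurement.
    """
    delta = left_mtime - right_mtime
    if abs(delta) <= max_skew:
        return Newer.UNKNOWN  # too close to call
    return Newer.LEFT if delta > 0 else Newer.RIGHT
```

Anything that comes back as UNKNOWN has to be settled some other way, for instance by comparing contents or asking the user.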
 
 
I'm shocked to find it took that long for someone to see the basic flaw, Juergen. Maybe living here in DE has turned me somewhat German and it takes something peculiar to the German mentality to see it.

Friday, August 04, 2006
 
 
> The only solution to this would be NTP

Or, you could create a file on the remote computer and then read back its file-created timestamp: that would tell you how well the two clocks are synchronized.
Christopher Wells Send private email
Friday, August 04, 2006
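
A rough sketch of that probe-file idea, assuming the remote machine is reachable as a mounted directory and that its file server stamps new files with its own clock. The function name is hypothetical, and the modification time of a freshly written file stands in for the creation time, which not every platform exposes.

```python
import os
import tempfile
import time


def estimate_clock_offset(shared_dir: str) -> float:
    """Estimate (remote clock - local clock) in seconds.

    shared_dir is assumed to live on the other machine, for example a
    mounted network share.
    """
    before = time.time()
    fd, probe = tempfile.mkstemp(dir=shared_dir)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(b"probe")
        after = time.time()
        remote_stamp = os.stat(probe).st_mtime
    finally:
        os.remove(probe)
    # Compare against the midpoint of the local interval in which the
    # write happened. The estimate is only as precise as that interval
    # and the filesystem's timestamp granularity.
    return remote_stamp - (before + after) / 2.0
```

Run against a local directory the result should be close to zero. Against a real share, network delay and coarse timestamps limit the estimate to roughly a second or two at best.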
 
 
If you can work it out, let us know...
Ewan McNab Send private email
Friday, August 04, 2006
 
 
would vector clocks be appropriate here?
Vince Send private email
Friday, August 04, 2006
 
 
Just use version numbers - if the files have a single source, the version number tells you who's got older releases regardless of local installation time.

Have a distributed transactional system - one computer holds the file's modification date on each PC, and the update process is a distributed transaction - not trivial, but it would work.

There's also the "don't care" option - really, why does it matter?  If two people are editing copies of a file and you want to merge, a good version control system will use a proper diff and not timestamp. Locally, each system can know if the file was modified since it was last downloaded, and if so then the changes need to be merged with the definitive copy.


Given that timestamps are problematic even in ideal circumstances, what is the *actual* problem being solved here, and is there an alternative option that doesn't require solving problems where the laws of physics make it essentially impossible to have both clocks and physical hard drive write operations synchronized perfectly?


> would vector clocks be appropriate here?

Actually, magic pixies are your only realistic option here. Without specialised hardware and software that removes the application from the home and typical office market, you simply can't make any reasonable guarantees about time. And if you can afford the effort of accounting for the speed of light then you probably won't be posting here asking how to deal with disk file timestamp issues - you'll be writing specialized textbooks that never get read by anyone without an advanced degree, most of whom will probably not follow all the math in the book anyway.

Sunday, August 06, 2006
 
 
'There's also the "don't care" option - really, why does it matter?  If two people are editing copies of a file and you want to merge, a good version control system will use a proper diff and not timestamp.'

And it will work perfectly with Word documents or photoshopped jpegs, of course...

'Locally, each system can know if the file was modified since it was last downloaded, and if so then the changes need to be merged with the definitive copy.'

Even doing it by hand won't work anymore when multiplied by the typical number of files in a working directory. "Don't care" is not really an option -- it is a sure way into confusion.
Secure
Monday, August 07, 2006
 
 
Version numbers are out of the question. I have no control over the files.

Merge and Diffs are also out of the question as is any transactional system.

I'm assuming that the PCs' clocks are set correctly (NTP/whatever).

This is a typical two way sync scenario where newer files will replace older files on either PC. I'm on top of it now.

Thanks all.
Trying times! Send private email
Monday, August 07, 2006
 
 
Well, provided the assumption is correct, you'll be fine. Of course, that assumption can only be correct within a certain degree of accuracy but if it's good enough for you, then it's ok.

Monday, August 07, 2006
 
 
> And it will work perfectly with Word documents or photoshopped jpegs, of course...

Only if you're not completely retarded. There's plenty of 'diff' type programs that work on binary files - there's nothing magical and sacred about choosing to use only 7 bits per byte (otherwise known as 'ascii text') that makes it especially suitable for detecting whether or not a file has changed.

Monday, August 07, 2006
 
 
On a different level, this is a fundamental problem in offering peer to peer services.  Consider this scenario.  I'm on the West Coast, my girlfriend is on the East Coast.  We're both listening to the radio, a really cool song comes on and we both decide to download it and save it to our respective hard drives.

However, I get my copy from one source, and she gets her copy from a different source, so the files are very slightly different; enough so that the checksums don't match.

Should my copy overwrite hers? Or should her copy overwrite mine?

If I were to design a "new solution", I wouldn't use the file creation date as a criterion for deciding which copy to keep because as this thread demonstrates, it's a lot of work for very little benefit.

I think probably the best, economically reasonable solution is that you keep the file with the longer length (duration or size). The corollary is that if the longer file is twice as long as the shorter file (or whatever your comfort level is), assume that it's garbage.

So in this scenario, if my song was 3:15 long and her song was 3:20 long, we would probably want to keep hers. (Dead air can be skipped over or trimmed, but truncated music can't be brought back.)

If the files are exactly the same length, and their checksums match, assume they're the same.  If they are the same length and their checksums don't match, flag them as special cases that need to be resolved by a human. (It's possible that one is a trojan.)
TheDavid
Monday, August 07, 2006
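
The rule described above can be sketched in a few lines. This version works on in-memory bytes to stay short; real files would be sized and hashed in chunks. The function name and return strings are made up, SHA-256 is simply a convenient checksum, and reading "assume that it's garbage" as "keep the shorter copy" is an interpretation, not something the post spells out.

```python
import hashlib


def choose_copy(mine: bytes, hers: bytes, garbage_ratio: float = 2.0) -> str:
    """Decide which of two copies to keep, using length and checksum only.

    Returns "same", "mine", "hers", or "ask a human".
    """
    if len(mine) == len(hers):
        same = hashlib.sha256(mine).digest() == hashlib.sha256(hers).digest()
        return "same" if same else "ask a human"

    mine_is_longer = len(mine) > len(hers)
    longer_len = max(len(mine), len(hers))
    shorter_len = min(len(mine), len(hers))

    # A copy at least garbage_ratio times the size of the other is
    # treated as suspect, so the shorter copy is kept instead.
    if longer_len >= garbage_ratio * shorter_len:
        return "hers" if mine_is_longer else "mine"
    return "mine" if mine_is_longer else "hers"
```

With a 195-byte copy against a 200-byte copy the longer one wins; with a 10-byte copy against a 30-byte copy the longer one is treated as garbage.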
 
 
File size can't be used in any meaningful way. Database files get compacted and free space removed, Word Docs get paragraphs deleted, Source code gets functions removed etc. etc.

I can't see any non-human way to resolve the issue you raised.
Trying times! Send private email
Tuesday, August 08, 2006
 
 
True, I should have added the warning that file size only works if logically speaking, you will never delete from that file. For example, it works with log files (where a program writes a status to a file but does nothing else with that file) and music files, but it does not work with word processing files or spreadsheet files.

I think there is a solution but it really depends on what kinds of files we're dealing with. I wish the original poster  wasn't so vague.
TheDavid
Tuesday, August 08, 2006
 
 
>I think there is a solution but it really depends on what kinds of files we're dealing with.

The types of files are completely unknown.

>I wish the original poster  wasn't so vague.

Think of a Backup app backing up files across two machines. If you want any further clarification let me know.
Trying times! Send private email
Tuesday, August 08, 2006
 
 
The notion of "which file is newer" is, imho, often the wrong question.

Normally, people want to know if two files are similar, or different - and even then, they may want to be able to merge the changes.

So, the question should be:
Given that two files are somewhat similar, how can I merge the two? Or, how should I allow the user to make an educated choice as to which file to use?

This is a much more difficult problem.
Send private email
Tuesday, August 08, 2006
 
 
Ok, the obvious question - which machine is the authoritative machine? If node A is the primary machine and node B is the backup or secondary machine, then node B is always overwritten with the contents of A.

Now, either A or B can decide which files need to be overwritten, but once you decide on a rule, that rule is always followed. If A decides, A always pushes to B. B never decides to pull stuff from A or push stuff back to A. Node B just sits there and silently accepts whatever A provides.

(Both machines cannot be authoritative at the same time, but they can trade roles, for example, node A is authoritative on Mondays, node B is authoritative on Tuesdays.)

Changing the subject slightly...

There are environments where Henry will write to node A and Steve will write to node B, and you want to REPLICATE the changes such that A has both Henry's and Steve's data. But such systems never "overwrite" data; they just maintain both copies and rely upon the user to determine which copy to keep and which, if any, to discard.

If this is really what you're after, then determining which file is newer is moot. Just save both copies.
TheDavid
Tuesday, August 08, 2006
 
 
>Ok, the obvious question - which machine is the authoritative machine?

I agree with all of this. But there are still issues. Let's say PC-A (authoritative) and PC-B are remote and some files you need to sync/backup are large (100's of MB or more). You need to know if the file on PC-A is different to the one on PC-B, as you don't want to go pushing 100's of MB data down the wire for no "very good" reason. The same holds true if the PCs are on a LAN.

If you can determine that the file on PC-A is newer than on PC-B that's one answer. Another is to determine if the files are different. Someone might suggest that if the file sizes differ then the files must be different, which is true. But the file size can be the same and the files can be different, so that is of no use. You can do MD5 comparisons, but to accomplish that you need to be able to run software on PC-B, which may not be possible or overly complicates what you are doing.

So we are back to square one. IMO comparing file last-write stamps remains the best way to handle this.
Trying times! Send private email
Tuesday, August 08, 2006
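
For reference, the size-then-checksum comparison mentioned above looks roughly like this. It is only a sketch: it assumes both paths are readable from one place, whereas in the remote case each machine would hash its own copy and only the digest would cross the wire, which is exactly the "run software on PC-B" requirement. The function names are hypothetical.

```python
import hashlib
import os


def file_digest(path: str, algorithm: str = "md5",
                chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are never loaded whole."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_differ(path_a: str, path_b: str) -> bool:
    """Return True if the two files have different contents."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return True  # different sizes always mean different contents
    # Same size proves nothing, so fall back to comparing digests.
    return file_digest(path_a) != file_digest(path_b)
```

The size check is free and catches many changes; the digest is needed only when the sizes match.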
 
 
"So we are back to square one. IMO comparing file last-write stamps remains the best way to handle this."

To detect whether a file has changed, you can build a database that stores the last-write stamp of each file as of the last sync. If the current stamp matches the stored one, you can be fairly sure the file has not changed since it was last handled.

For the problem of the two files, this gives you a partial solution. If one file has changed and the other has not, you definitely know which is newer.
Secure
Wednesday, August 09, 2006
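
A minimal sketch of that bookkeeping, assuming the sync tool keeps one record per file holding the stamp it last saw on each machine. The record layout, function name, and return strings are all hypothetical. Note that each stamp is compared only with an earlier stamp from the same machine, so the two clocks never need to agree.

```python
from typing import Dict, Tuple

# path -> (stamp seen on machine A, stamp seen on machine B) at the last sync
SyncRecord = Dict[str, Tuple[float, float]]


def sync_action(record: SyncRecord, path: str,
                mtime_a: float, mtime_b: float) -> str:
    """Decide what to do with one file by comparing each side to its own past.

    Returns "nothing", "copy A to B", "copy B to A", "conflict",
    or "unknown file".
    """
    if path not in record:
        return "unknown file"  # never synced before, nothing to compare with
    old_a, old_b = record[path]
    changed_a = mtime_a != old_a
    changed_b = mtime_b != old_b
    if changed_a and changed_b:
        return "conflict"  # both sides changed: the partial solution ends here
    if changed_a:
        return "copy A to B"
    if changed_b:
        return "copy B to A"
    return "nothing"
```

Only the "conflict" case still needs a cross-machine comparison or a human decision.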
 
 
I'm sort of punting on the practical aspects of this, but I just re-read Leslie Lamport's "Time, Clocks and the Ordering of Events in a Distributed System", available here: http://research.microsoft.com/users/lamport/pubs/pubs.html#time-clocks. It deals with the problem of event ordering (like file timestamps) across distributed machines, one of the keys to which is clock synchronization.  If someone really needs to keep two files in sync, dwelling on this paper wouldn't be a bad place to start.
Pat Morrison
Wednesday, August 09, 2006
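
For a taste of the paper, here is a small sketch of its logical-clock rule: every local event advances a counter, and every received message pushes the receiver's counter past the sender's. The class and method names are made up, and the sketch leaves out the paper's treatment of physical clocks entirely. It also presumes the machines exchange messages carrying these counters, which a plain file copy does not do.

```python
class LamportClock:
    """A per-machine logical clock: a counter that only moves forward."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Advance for a local event (for example, saving a file)."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Advance and return the value to attach to an outgoing message."""
        return self.tick()

    def receive(self, message_time: int) -> int:
        """Move past both the local counter and the sender's timestamp."""
        self.time = max(self.time, message_time) + 1
        return self.time


# Machine A records two local events, then sends a message to machine B.
a, b = LamportClock(), LamportClock()
a.tick()
a.tick()
stamp = a.send()               # A's counter is now 3
b_after = b.receive(stamp)     # B jumps from 0 to 4
```

The counters order events that are causally linked by messages, with no reference to wall-clock time.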
 
 
"You need to know if the file on PC-A is different to the one on PC-B, as you don't want to go pushing 100's of MB data down the wire for no "very good" reason."

True.

If we're talking backup as opposed to replication, then PC-A keeps track of whether it has sent a file to PC-B or not.

If we're talking replication as opposed to backup, then you want PC-B to have its own copy as well as PC-A's copy. (In this scenario, yes, you want to treat them as different files even if they are really identical.)

I suspect you're trying to handle a scenario where A asks B if it needs to push a new copy down to B.  In a well designed backup program, A already knows the answer.

Consider this example - I back up files to a DVD-RW disc. Ideally, my PC would know if I've backed up a file or if it has been changed since it was last backed up without having to prompt me to insert all of my DVDs so that it can verify that the file exists somewhere on a DVD.

Now granted, if you have PC-A in New York City and PC-B in Shanghai, yes, you'd be concerned with sending extra files down the wire. In practice though, it is too expensive to do two-way backup; we always either push (or ship on DVDs) the entire package to the destination and let them figure it out as opposed to PC-A and PC-B talking back and forth trying to figure out which files need to be copied.

Real world example - when I worked at the Jet Propulsion Laboratory (in California, USA), we would just assume the worst case scenario and FTP the entire daily satellite data dump (terabytes worth) from the ground stations in Australia as one package and then synch it locally. If we'd stopped to confirm each record with Australia, it would take even longer to get the job done.
TheDavid
Wednesday, August 09, 2006
 
 

This topic is archived. No further replies will be accepted.
