Reliable and Canonical Filenames Versus Java

Sometimes my programs might like to access a user-specified file, or even make a new one on demand, and then be able to get back to it the next day. In Java, unlike any other modern computer language I can think of offhand, this is not possible in the general case.

The desire to do such a thing will seem eccentric to applet writers. My concern, though, is with using Java as a general-purpose language.

Definitions

Reliable Filename

A filename is reliable if it can be used immediately to gain access to a unique file; it can be stored for later use; when retrieved, it will give access to the very same file, provided only that nothing relevant has changed in the file system.

The names foo and foo/bar, for instance, are not reliable filenames, because their meaning depends on the current working directory. By contrast, the Unix names /foo and /foo/bar, as well as the DOS names c:\foo and e:\foo\bar, are reliable.

Canonical Filename

A canonical filename is a reliable filename that is the output of a consistent procedure canonize that has the following characteristic: if name1 names some unique file F1, and name2 ditto F2, then canonize(name1) is the same as canonize(name2) if and only if F1 and F2 are the same thing.
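The "if and only if" in that definition can be written down as a check. The sketch below is only an illustration: it leans on java.nio.file, which was added in Java 7, long after this page was written, and it uses Path.toRealPath() as a stand-in for canonize and Files.isSameFile() as the judge of "the same thing". The names CanonizeProperty, canonize, consistent, and demo are made up for the example. I would expect it to fail for hard links, where two different real paths name one file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class CanonizeProperty {
    // Stand-in for "canonize": ask the file system for the real path.
    // Assumes the file exists; toRealPath() fails otherwise.
    static String canonize(Path name) {
        try {
            return name.toRealPath().toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True when the two names behave as the definition demands:
    // equal canonical forms exactly when they name the same file.
    static boolean consistent(Path name1, Path name2) {
        try {
            boolean sameName = canonize(name1).equals(canonize(name2));
            boolean sameFile = Files.isSameFile(name1, name2);
            return sameName == sameFile;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds two scratch files and checks both directions of the definition.
    static boolean demo() {
        try {
            Path dir = Files.createTempDirectory("canon");
            Path a = Files.createFile(dir.resolve("a.txt"));
            Path b = Files.createFile(dir.resolve("b.txt"));
            // Same file as a, spelled differently.
            Path alias = dir.resolve(".").resolve("a.txt");
            return consistent(a, alias)
                && consistent(a, b)
                && canonize(a).equals(canonize(alias))
                && !canonize(a).equals(canonize(b));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```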

We use plain English words, rather than = signs and == signs and .equals() methods, to distance the definition from any implementation details. Likewise, we avoid the word "identical", which used to mean "the very same entity" but now has no particular meaning that's generally accepted.

Other definitions

A "file" is some kind of object in the filesystem, such as a data file or a directory. Whether it can be a device file or some other special thing is not relevant.

A "filename" may be a character string, or it may be anything else with which you can unambiguously specify (in some particular context) a file to be accessed; for instance, a File object in Java.

It may seem (to understandably sensitive users of other operating systems) that "operating system" is here defined implicitly as "DOS or Unix", since no other system is discussed. This is quite wrong! In fact, if we had only those systems to worry about, we could grit our teeth and write cases to work around some of the problems. A lousy way to program, but workable. However, the point of a machine-independent language is to handle these problems without requiring the application programmer to know the details of all the possible systems, which of course include the Macintosh and VMS and who knows what else.

Doing it in Java

Obviously a reliable filename is the same as an absolute filename: take away all context but the filesystem itself, and the thing has a consistent meaning. Anyone except Javasoft can see that.

And in some cases, they are the same: the isAbsolute() method will return false for foo/bar, and getAbsolutePath() will return a reliable version of that name, perhaps /user/baz/foo/bar.

Similarly, in Unix, the name /foo/bar is absolute, and Java will tell you so, and getAbsolutePath() will apply the identity transformation to the name: it won't change it.

Regrettably, the ultimate machine-independent language was designed by Unix weenies, and therefore doesn't consider DOS worthy of the effort to make it work right. Plainly, the DOS filename \foo\bar is not reliable, or in any possible meaningful sense absolute. But Java happily classes it as absolute, and lets getAbsolutePath() do nothing to it. This behavior is not a stupid bug in some implementation: it's documented in a book published by Javasoft, with Gosling as co-author.
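A minimal sketch of the well-behaved case, using only File.isAbsolute(), File.getAbsolutePath(), and the user.dir system property. The class name AbsoluteDemo and the helper reliable are invented for the illustration; what gets printed depends on the directory the program is started from, so no output is shown.

```java
import java.io.File;

final class AbsoluteDemo {
    // Returns the name Java considers absolute for the given name:
    // unchanged if Java already classes it as absolute, otherwise
    // resolved against the working directory.
    static String reliable(String name) {
        File f = new File(name);
        return f.isAbsolute() ? f.getPath() : f.getAbsolutePath();
    }

    public static void main(String[] args) {
        String relative = "foo" + File.separator + "bar";
        System.out.println(new File(relative).isAbsolute()); // relative name
        System.out.println(reliable(relative));              // resolved name
        System.out.println(System.getProperty("user.dir"));  // what it was resolved against
    }
}
```

Note that reliable() is only as trustworthy as isAbsolute() itself, which is exactly the weak point on DOS.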

By the way, this is an unmistakable violation of the language spec (section 22.24.15. etc.), also co-authored by James Gosling. Nobody's perfect.

Likewise, the file c:foo is absolute in no way except Java's treatment of it.

To get these things right when working in the machine-dependent code for a particular Java implementation would be an hour's implementation work -- if you worked cautiously and had forgotten the details of the proper system calls, and had to take time looking them up. Otherwise it would be quicker. No, that's wrong. What system calls? If the first character is backslash or the second is colon and the third is not backslash, then grab some bytes from user.dir. You could take an hour to do that if you had a nice coffee break in the middle, provided the nearest Starbucks is several blocks away.
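The rule in that paragraph can be sketched as plain string handling, with no system calls. This is my reading of it, not anybody's shipped code: the class DosNames and its methods are hypothetical, the working directory is passed in as a string (in real use it would come from user.dir), UNC names such as \\server\share are ignored, and a name like d:foo on a drive other than the working one is rejected, because user.dir says nothing about the current directory of other drives.

```java
final class DosNames {
    static boolean isSep(char c) {
        return c == '\\' || c == '/';
    }

    // True for names that start with a drive letter and a colon.
    static boolean hasDrive(String n) {
        if (n.length() < 2 || n.charAt(1) != ':') return false;
        char c = n.charAt(0);
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    // Reliable means: drive letter, colon, then a separator.
    static boolean isReliable(String n) {
        return hasDrive(n) && n.length() >= 3 && isSep(n.charAt(2));
    }

    // Completes a DOS name using the working directory userDir,
    // which must itself be reliable (for example C:\work).
    static String makeReliable(String name, String userDir) {
        if (isReliable(name)) return name;
        if (!isReliable(userDir)) {
            throw new IllegalArgumentException("working directory is not reliable: " + userDir);
        }
        String drive = userDir.substring(0, 2);
        if (!name.isEmpty() && isSep(name.charAt(0))) {
            return drive + name;               // \foo\bar: borrow the drive only
        }
        String rest = name;
        if (hasDrive(name)) {                  // c:foo: drive given, directory missing
            if (!name.substring(0, 2).equalsIgnoreCase(drive)) {
                throw new IllegalArgumentException("no working directory known for " + name);
            }
            rest = name.substring(2);
        }
        boolean endsWithSep = isSep(userDir.charAt(userDir.length() - 1));
        return (endsWithSep ? userDir : userDir + "\\") + rest;
    }
}
```

For instance, makeReliable("\\foo\\bar", "C:\\work") borrows only the drive, while makeReliable("c:foo", "C:\\work") borrows the whole working directory.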

Fixing this would not require violating the almost-explicit provision of the spec that the operation of File is purely syntactic and may not dynamically call for information from the operating environment beyond that which was read at startup. You must understand that provision, lest you think that the getParent() method is anything other than purely syntactic, or maybe I should say lexical. What is the parent of /foo/bar/baz/..? According to Java, it's /foo/bar/baz. This is a useful answer, but one that calls for a warning to the user who might reasonably hope to get the actual parent directory when calling getParent(). Determining the actual parent is an interesting task. On DOS, you can get the answer syntactically: /foo. On Unix, symbolic links make matters more challenging. Which leads us more or less naturally to canonical names.
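The lexical behavior of getParent() is easy to see for yourself. A small sketch, with LexicalParent and parentOf as made-up names; it touches no file system, so none of these directories needs to exist.

```java
import java.io.File;

final class LexicalParent {
    // getParent() just strips the last name component from the string;
    // it does not interpret ".." or consult the file system.
    static String parentOf(String name) {
        return new File(name).getParent();
    }

    public static void main(String[] args) {
        // The last component here is "..", so stripping it leaves the
        // name of baz itself rather than the directory two levels up.
        System.out.println(parentOf("/foo/bar/baz/.."));
    }
}
```

On a DOS-style system the separators in the result come back as backslashes, but the point is the same: the ".." is dropped, not followed.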

Canonical names, of course, are harder than reliable ones. Under DOS you have a reasonable shot at making a canonical name from a merely reliable one, by converting everything to a consistent case and editing out . and .. names. Without links, either hard or symbolic, this ought to work right.
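Here is roughly what that DOS procedure might look like, as a sketch under stated assumptions: the input is already a reliable name of the form drive, colon, separator; there are no links and no UNC names; and lower case is chosen arbitrarily as the consistent case. DosCanonizer and canonizeDos are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;

final class DosCanonizer {
    // Canonizes a reliable DOS name such as C:\Foo\.\Bar\..\BAZ.
    static String canonizeDos(String reliable) {
        // One case, one separator.
        String lower = reliable.toLowerCase(Locale.ROOT).replace('/', '\\');
        String drive = lower.substring(0, 2);          // for example "c:"
        Deque<String> parts = new ArrayDeque<>();
        for (String part : lower.substring(3).split("\\\\")) {
            if (part.isEmpty() || part.equals(".")) {
                continue;                              // "." and doubled separators add nothing
            }
            if (part.equals("..")) {
                if (!parts.isEmpty()) {
                    parts.removeLast();                // step back up one level
                }                                      // ".." at the root stays at the root
            } else {
                parts.addLast(part);
            }
        }
        return drive + "\\" + String.join("\\", parts);
    }
}
```

With this sketch, C:\Foo\.\Bar\..\BAZ and c:\foo\baz both come out as c:\foo\baz, which is the property a canonical name is supposed to have.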

For Unix, the job may not even be possible in the general case. On any system at all, you have a reasonable shot at a canonical filename with the simple procedure

    cd name1; pwd

or equivalent. On Unix that will solve the problem for a lot of practical cases (and, yes, there are practical cases that need to make such a determination), but I can't say whether it will always work.

Now here's where Java really comes into its own: you can't do that with Java code. You've got user.dir to find the working directory from which Java was invoked, and that name really is a reliable one and probably a useful form of canonical name. But you can't change directories; and if you could, you still couldn't ask for the system's version of the name of the new directory.

We see now that there is one case in which we can get a reliable name that presumably is also canonical: the directory from which the program was executed. And there are also all its parent directories, all the way to the root. If you force the user to execute the program from within a particular directory, you're all set, so long as you aren't interested in any other directory.
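That one safe case can be enumerated with nothing but user.dir and getParent(). A sketch, with WorkingDirAncestors and ancestors as invented names; it assumes user.dir holds no "." or ".." components, since getParent() is purely lexical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

final class WorkingDirAncestors {
    // Lists dir and each of its parents, ending at the root,
    // whose getParent() is null.
    static List<String> ancestors(String dir) {
        List<String> names = new ArrayList<>();
        for (String d = dir; d != null; d = new File(d).getParent()) {
            names.add(d);
        }
        return names;
    }

    public static void main(String[] args) {
        // Every name printed is one the program can store and reuse.
        for (String name : ancestors(System.getProperty("user.dir"))) {
            System.out.println(name);
        }
    }
}
```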

If for some reason you want to let the user name some other directory, you would never let anyone type in a filename; that's old-fashioned, and implies that you could possibly write a command-line program and completely bypass the elegant design of AWT. And if you do allow a name to be typed in, you could insist that the user type an entire, reliable filename. Except, of course, that Java gives you no accurate and machine-independent way of distinguishing reliable from evanescent filenames to enforce that requirement.



Date last modified: February 4, 1999
Dan Drake's Home Page
Mail to dd@dandrake.com

Copyright (C) 1998 Daniel Drake. A royalty-free license to reproduce this document in whole or in part is hereby granted provided (i) all additions, omissions, and other changes are clearly marked; (ii) the work is not reproduced as, or as part of, a work for which payment is charged; (iii) this notice is reproduced without change. Quotations for critical or polemical purposes, with proper attribution, are permitted in any case, being obviously fair use.

Document: http://www.dandrake.com/java/reliable.html