COM.dandrake.wlv.PressMyButtons
and
COM.dandrake.wlv.WebLinkValid

This is a website maintenance tool in two forms: WebLinkValid, a class that does the work and has a command-line interface; and PressMyButtons, a GUI class that drives it and implements a wider set of options than the other does. Each of them will press all the buttons, or anyway click on all the links, that it finds on your Web page. They will also mirror a Web site.

Sorry about the repulsive names, but that's the Java standard (or was, when these were written and one was supposed to use COM in all caps).

Quick Start

Well, there really is no quick start in rapidly evolving multi-platform Java; but at least it is multi-platform. If you've got Swing classes installed, this should work. If not, you'll have to figure out how to install them first. And if you don't want to jump in with both feet, you can skip to the real document.

First, then, you download the program and put it on your classpath, however that's done in your system.

When you've got it installed, go to some directory in which you won't mind getting new stuff stored by the program. Execute the program by, for instance,
java COM.dandrake.wlv.PressMyButtons

You will get a dialog in the middle of your screen.

In the text entry area at the top, enter the full URL for some simple, convenient page, like your home page. Click on the Mirror button to enable the mirroring feature. Leave the rest alone, and click Go.

Two windows will appear, hogging much of your screen. The one on the left, with the label "Error listing", will display any problems found with the page, like dead pointers, #reference items with no anchor defined, and so on. Since your site naturally is error-free, this window will show nothing until it says "Finished" when the job is done.

The other new window, "Progress log", will show what's currently happening. These messages will change at my whim, but they're likely to show what's being loaded from the Internet at any moment when nothing seems to be happening.

When the job is done, or before then, look at your working directory. It will have a new subdirectory named after your Website host, and if you follow the directory structure, you'll find a copy of the page you called for. Any pages that it incorporates (pages and images, but not hyperlinked pages) will also be copied into their proper place in the directory structure; anything that's on a different host will be in a directory named after the host.

Now try mirroring that page plus everything that it directly links to on the same site. In the Depth field, enter 1. While you're up, click on the "Check all local pages" box, if you want it to check your whole site for bad pointers and the like. Then click Go. (A new pair of result windows will replace the old ones. By the way, it does no harm to close those windows at any time, but of course you'll lose the data.)

It won't load your home page a second time, because you already have a copy. But it will scan it for links, and load the things it points to, provided they're on the same site. Set Depth to 2, and it will go 2 links away; set Breadth to 1, and it will load from other sites that you link to, to the same total Depth.

Want to save the list of problems? Pull down the File menu on a result window, and browse for a file to save to. If you select an existing file, you'll get to decide whether to overwrite the file, append to it, or forget it.

What the Programs Are About

PressMyButtons has two functions, which it can execute independently or at the same time. It's pretty obvious why you might want a program that checks your site for links that don't link to anything. This type of function is available in some commercial web builders, but here's a standalone version for anyone who wants one. And the program has two other features that are not found in other programs I know of.

Validate a site

Mirror a site

PressMyButtons

This Java application runs the validation under a minimal GUI. Just execute it as
java COM.dandrake.wlv.PressMyButtons
and may we suggest a command file, or shell script, for that piece of JavaSoft-approved gibberish?

In the text area you can enter either a full URL or a simple filename with or without a directory path. Or you can press the "Or choose a file" button to bring up a file dialog.

Now, having aimed the program at something simple for a first test, hit the GO button. Two scrollable windows will pop up: one shows a string of progress messages so that you know what's going on, while the other shows all the errors that are found. When the job is finished, the Progress window displays "DONE", and the main window gets the GO button re-enabled. You can stop the scan by pushing the STOP button; the result windows will remain, with the progress window showing "STOPPED". You can terminate the whole program at any time with the Cancel button.

The operation of the program, as distinct from its user interface, is described a little more precisely in the section on WebLinkValid.

Using the result windows

Each of the two streams of results is in its own scrollable, resizable window. You can kill either or both, and the program will go merrily on. When you start a new operation with the GO button, the windows will be removed and replaced with new ones. They also go away, of course, when you terminate PressMyButtons.

Each result window has a File menu with two options: "Write..." and "Quit". The Write operation will write the current contents of the window into a text file, which you choose from a file dialog. The dialog operates in SAVE mode, which means that it expects a new filename, and you can't click to select an existing one from the list. I don't know whether this is a good choice or not. Anyway, if you do specify an existing file, you'll get an alert box with the choices "Append", "Overwrite", and "Cancel". Take your pick.

Scan all local HTML pages

When this box is checked, as it is by default, PressMyButtons will not be content with checking the specified file; it will follow the pointer to any local file and check that; and it will keep checking everything it finds until it runs out of local files.

A local file is one that has a simple directory/file reference rather than a full URL. For instance, foo or foo/bar or ../foo/bar#baz. If the option "Check absolute refs" is checked, it will also check pages pointed to by full URLs, so long as they appear to be on the same host. See Check absolute refs. The limitations are intended to keep the program from checking the validity of everything on the World-Wide Web; we leave that sort of thing to Alta Vista and Google.

If the box is not checked, then only the named file will be fully checked. Other files may still be scanned for the existence of names in ...#name references; see the next section.

Check all names

When this box is checked, as it is by default, PressMyButtons will read the full text of any file that is the object of a reference that gives a name; that is, any reference that ends in #someNameOrOther. It will log an error if someNameOrOther is not in the file.
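For the curious, here is a minimal sketch of what a name check might look like. It is not the program's actual code: the real thing uses the adc/parser tokenizer, while this stand-in uses a regular expression, and the class and method names (NameCheckSketch, definesName) are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of WebLinkValid.
final class NameCheckSketch {
    // Finds name="x", name='x' or name=x inside an <a ...> or <area ...> tag,
    // ignoring case. A regex is a crude stand-in for a real HTML tokenizer.
    private static final Pattern NAME_ATTR = Pattern.compile(
        "<(?:a|area)\\b[^>]*?\\bname\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
        Pattern.CASE_INSENSITIVE);

    /** Returns true if the HTML text defines the given anchor name. */
    static boolean definesName(String html, String name) {
        Matcher m = NAME_ATTR.matcher(html);
        while (m.find()) {
            // Exactly one of the three groups is set, depending on the quoting style.
            String found = m.group(1) != null ? m.group(1)
                         : m.group(2) != null ? m.group(2)
                         : m.group(3);
            if (name.equals(found)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String html = "<h1><a name=\"top\">Top</a></h1><A HREF=\"x.html\" NAME=baz>z</A>";
        System.out.println("top: " + definesName(html, "top"));
        System.out.println("someNameOrOther: " + definesName(html, "someNameOrOther"));
    }
}
```

A reference such as foo/bar#baz would be reported as an error when definesName returns false for "baz" on the text of foo/bar.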

If the box is not checked, then files will be scanned only if required by "Check all local pages", above.

Check for offsite files

If this box is checked, as it is by default, PressMyButtons will scan files that are not on the same site, when it may be necessary for validating #name references.

Check absolute refs

When scanning all the HTML files found as references in other HTML files, it's fairly important not to end up trying to scan everything on the whole Web. One way is to follow only relative references, which have no host specified. When this option is off, as it is by default, that's the action.

When this option is on, pages are also checked if the reference is absolute, provided that the host is the same as the one on which the checking started. This can be problematic with large ISPs that host numerous Web sites, but it's unlikely to cause serious trouble, maybe.
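The rule above can be sketched roughly as follows. This is an illustration under assumptions, not the program's code: the names SameHostSketch and shouldCheck are made up, and the sketch uses java.net.URI, which is newer than the JDK the program was written for.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration only; not part of PressMyButtons.
final class SameHostSketch {
    /**
     * Decides whether a referenced page should be scanned.
     * Relative references are always followed; absolute ones only when
     * the option is on and the host matches the starting page's host.
     */
    static boolean shouldCheck(URI start, String href, boolean checkAbsoluteRefs) {
        URI ref;
        try {
            ref = new URI(href);
        } catch (URISyntaxException e) {
            return false; // this sketch simply skips hrefs it cannot parse
        }
        boolean relative = !ref.isAbsolute() && ref.getHost() == null;
        if (relative) {
            return true;
        }
        if (!checkAbsoluteRefs) {
            return false;
        }
        String host = ref.getHost();
        return host != null && host.equalsIgnoreCase(start.getHost());
    }

    public static void main(String[] args) {
        URI start = URI.create("http://www.dandrake.com/wlv/wlv.html");
        System.out.println(shouldCheck(start, "../foo/bar#baz", false));
        System.out.println(shouldCheck(start, "http://www.dandrake.com/index.html", false));
        System.out.println(shouldCheck(start, "http://www.dandrake.com/index.html", true));
    }
}
```

Comparing host names is exactly what makes shared hosts at large ISPs problematic: two unrelated sites on one host look like the same site to this test.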

Mirror files

When this option is on, which by default it is not, PressMyButtons will load a copy of every local file to which it finds a reference. They will be loaded into a subdirectory of the current working directory, so watch out.

If the reference to a file is via http: , the file will be loaded into a directory tree named after the host. If it's from file: , and it's on a DOS-like file system, the base of the tree will be named after the unit letter, as in CCoLoN . There is no option to condense the directory tree structure and load all files into one directory.
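As a rough sketch of that naming scheme, the mapping from a URL to a path under the working directory might look like this. The class and method names are hypothetical, and the rule "unit letter plus CoLoN" for letters other than C is my extrapolation from the single CCoLoN example.

```java
import java.net.URI;

// Hypothetical illustration only; not part of PressMyButtons.
final class MirrorPathSketch {
    /** Returns the mirror location, relative to the working directory, using '/' separators. */
    static String mirrorPath(URI uri) {
        String path = uri.getPath() == null ? "" : uri.getPath();
        if ("file".equalsIgnoreCase(uri.getScheme())) {
            // DOS-like path such as /C:/site/index.html: render "C:" as "CCoLoN".
            // Assumption: other unit letters follow the same pattern.
            if (path.matches("/[A-Za-z]:/.*")) {
                return path.charAt(1) + "CoLoN" + path.substring(3);
            }
            // Otherwise mirror the local directory structure as it stands.
            return path.startsWith("/") ? path.substring(1) : path;
        }
        // http: the base of the tree is named after the host.
        return uri.getHost() + path;
    }

    public static void main(String[] args) {
        System.out.println(mirrorPath(URI.create("http://www.dandrake.com/wlv/wlv.html")));
        System.out.println(mirrorPath(URI.create("file:/C:/site/index.html")));
    }
}
```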

Mirroring overwrites files

By default, the mirror operation will not download a file with a name that's already present. This option overrides that choice. (The program in any case will not load the same file twice in the same scan.)
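The decision can be sketched in a few lines, assuming a per-scan set of targets already handled. The names here (ClobberSketch, shouldDownload) are invented for illustration and are not taken from the program.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not part of PressMyButtons.
final class ClobberSketch {
    private final Set<String> seenThisScan = new HashSet<>();
    private final boolean overwrite; // the "Mirroring overwrites files" option

    ClobberSketch(boolean overwrite) {
        this.overwrite = overwrite;
    }

    /**
     * Returns true if the target should be downloaded now.
     * A target is never loaded twice in one scan, and an existing file
     * is replaced only when the overwrite option is on.
     */
    boolean shouldDownload(String target, boolean existsOnDisk) {
        if (!seenThisScan.add(target)) {
            return false; // already dealt with in this scan
        }
        return overwrite || !existsOnDisk;
    }

    public static void main(String[] args) {
        ClobberSketch cautious = new ClobberSketch(false);
        System.out.println(cautious.shouldDownload("www.dandrake.com/index.html", true));
        ClobberSketch clobbering = new ClobberSketch(true);
        System.out.println(clobbering.shouldDownload("www.dandrake.com/index.html", true));
        System.out.println(clobbering.shouldDownload("www.dandrake.com/index.html", true));
    }
}
```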

Mirror offsite files

By default, files will be downloaded only if they are local to the host on which the search started. This option allows loading of things from other sites.

The things to be loaded are found in the checking process. Since it won't look at an off-site page to find references, there's no danger of trying to mirror the entire Web.

Mirror absolute refs

The idea is explained under Check absolute refs. Many sites have images, for instance, in a directory separate from HTML pages, and use absolute references for them. Hence, this option is on by default.

Redundant Scans and Infinite Loops

It doesn't do them. Each file is fully scanned at most once, is validity checked at most once, and is not scanned until an actual need arises. Just thought I'd mention it, and solicit bug reports for any apparent failures.
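The usual way to get this guarantee is to remember what has been scanned, as in the sketch below. It is an assumption about the general technique, not a copy of the program's code; ScanOnceSketch and scanOrder are hypothetical names, and an in-memory map of page to links stands in for actually reading pages.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of PressMyButtons.
final class ScanOnceSketch {
    /** Visits every page reachable from start exactly once, even when links form cycles. */
    static List<String> scanOrder(String start, Map<String, List<String>> links) {
        Set<String> scanned = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(start);
        while (!pending.isEmpty()) {
            String page = pending.pop();
            if (!scanned.add(page)) {
                continue; // already scanned: this is what breaks the loop
            }
            for (String target : links.getOrDefault(page, Collections.<String>emptyList())) {
                if (!scanned.contains(target)) {
                    pending.push(target);
                }
            }
        }
        return new ArrayList<>(scanned);
    }
}
```

Two pages that point at each other are each scanned once and the traversal then stops, which is the whole point.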

Why Not an Applet?

PressMyButtons has been cleverly coded to run as either an application or an applet. The applet works perfectly under the JDK applet viewer. It kills NetScape Navigator® in the version I have, which of course means it also kills the desktop, because that's the way NN likes to do things. I don't have time to run lots of experiments to see where the key problem is (though I can pretty well guess) with a reboot after each one. That's why I don't offer it here as an applet.

WebLinkValid

This is the really crude command-line form of the program. You give it one or more names, and it checks them out.

As of November, 2000, it also has a mirroring capability. See under the -m option. See, more relevantly, the PressMyButtons documentation.

The names can be proper URLs (of type http: or file:) or simple file names. It generates two outputs:

stdout: A list of error messages, showing any bad links. You probably want to redirect this to a file.
stderr: Junky progress messages. The reason for this is that it takes a long time to read all the stuff over the net, if your site has any complexity at all. And the ones that don't feel like responding really slow things down. This program should be multi-threaded some day, but it isn't now. If you have a shell that's not a toy, you can redirect this output to /dev/null or somewhere, but be prepared for a lot of dead time on the screen.

So what does it do? It scans the file to find all href and name attributes in <A> tags (ordinary hyperlinks) and <AREA> tags (maps). Wherever an href is of a type that the JDK URL class understands (a local reference or file: or http:), it tries to validate the reference as follows:

Try to access the thing to be sure it exists.
If the reference included a #name spec, and the file is *.html or *.htm, scan it to check that the name is defined.
If the reference is local (i.e., there's no protocol: prefix), find and test all its references, recursively.

Yes, it is reasonably optimized to prevent multiple scans of the same file.
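The first step, checking that the thing exists, might be sketched like this for the two protocols the program handles. This is an assumption about one reasonable approach, not the program's method: the names are invented, and the use of an HTTP HEAD request and ten-second timeouts are my choices.

```java
import java.io.File;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical illustration only; not part of WebLinkValid.
final class ExistsSketch {
    /** Returns true if the target of a file: or http: URL appears to exist. */
    static boolean exists(URL url) {
        try {
            if ("file".equals(url.getProtocol())) {
                return new File(url.toURI()).exists();
            }
            if ("http".equals(url.getProtocol())) {
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                conn.setRequestMethod("HEAD");   // ask for headers only, not the body
                conn.setConnectTimeout(10_000);  // slow hosts should not stall the scan forever
                conn.setReadTimeout(10_000);
                try {
                    return conn.getResponseCode() < 400;
                } finally {
                    conn.disconnect();
                }
            }
            return false; // ftp: and everything else is not handled
        } catch (IOException | URISyntaxException | IllegalArgumentException e) {
            return false; // unreachable or malformed counts as a bad link here
        }
    }
}
```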

It doesn't know how to check the validity of a reference to a directory (http://www.nobody.glump/foo/bar/) because JDK 1.0 doesn't know how, so it ignores them. If you leave off the / it will complain that the file doesn't exist. It's a worse bummer that the program can't verify ftp: links, but that's the way it is.

Suggestions will be appreciated unless they're obvious known problems, like "You should give it a decent interface and maybe offer it as an applet."

Mirroring: the `-m` option

If you give the -m option on the command line before any URL spec, the program will download every referenced file on the same host. Specifically,

It reads through the file you name on the command line. If you named more than one, it completes all the steps below before going to the next one.
It checks for the existence of each file to which it finds a reference.
When it finds a local reference (without explicit host name) to a file of type .html or .htm , it examines that file in the same way. The order in which it looks at things is unspecified.
Each file checked for, including the original one, is downloaded if the -m option is on and the file is on the same host as the original file. Note that the reference need not be local in the sense just given, and the file may or may not be scanned for contents. The file may be of any type.
Downloaded files go into a subtree of the working directory.

If the download is from an Internet host, the base of the subtree is named after the host.
If the load is from the client machine (file: protocol), the local directory structure is mirrored. On DOS-ish filesystems, names like C: are rendered as CCoLoN .

No file is downloaded twice in the same search.
No file will overwrite an existing file unless the -c (clobber) option is on.
The program is too stupid to do anything at all about ftp: files.
It refuses to run if it finds itself on a filesystem that doesn't accept long names. If you find a problem with this, please report it as a SaneFile bug.
Hrefs that contain a % sign are considered to be queries of some kind, and are stripped to the part preceding the % . This is a naïve approach, but works for the moment. Its crudeness may cause problems on Macs. If so, let us reason together.
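The % rule in the last item amounts to something like the following sketch. The class and method names are hypothetical.

```java
// Hypothetical illustration only; not part of WebLinkValid.
final class PercentStripSketch {
    /** Cuts an href at the first % sign, if there is one. */
    static String stripAtPercent(String href) {
        int i = href.indexOf('%');
        return i < 0 ? href : href.substring(0, i);
    }

    public static void main(String[] args) {
        System.out.println(stripAtPercent("foo/bar.html"));
        System.out.println(stripAtPercent("foo/my%20page.html"));
    }
}
```

The second call shows the crudeness: a perfectly ordinary escaped space in a file name gets the href cut down to foo/my.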

Credits and Copyright

This program incorporates adc/parser, an HTML tokenizing package written by Arthur Do, available at http://www-cs-students.stanford.edu/~do/. The classes are © 1997 and are used under the terms of the license.

The WebLinkValid program itself is freeware, but distribution is restricted under the copyright in the preceding paragraph. Enjoy.

Date last modified: March 26, 2002.
Built April 2, 2002

Dan Drake's Home Page
Mail to dd@dandrake.com

Copyright (C) 1998-2001 Daniel Drake. A royalty-free license to reproduce this document in whole or in part is hereby granted provided (i) all additions, omissions, and other changes are clearly marked; (ii) the work is not reproduced as, or as part of, a work for which payment is charged; (iii) this notice is reproduced without change. Quotations for critical or polemical purposes, with proper attribution, are permitted in any case, being obviously fair use.

Document: http://www.dandrake.com/wlv/wlv.html

Begin JavaGeekery.

If you have Java 2 (that is, JDK 1.2) in a recent release, this program should run. If you have JDK 1.1.7 or higher (or maybe 1.1.5; I happen to use 1.1.8), then all you need is the Swing classes. If you don't have them, get them at http://java.sun.com/products/jfc/#download-swing. They really are pure machine-independent Java.

If you don't have the right JDK level, you can get it for free somewhere.

End JavaGeekery.

Anyway, you can download these Java programs and unzip the package; or you can put the zip file in your class path without unzipping. If you do unzip it, you'll get a nice collection of subdirectories--and you'll need a decent non-FAT file system and an unzip program that isn't brain-damaged, namely InfoZip.

Since everything is written in Java version 1.0, it ought to run on anything that supports Java. We'll see. (Make that version 1.1: I haven't tested with 1.0 for a long time.)
