Gary McGath

Professional Software Developer

JHOVE usage notes

Now available on Smashwords: JHOVE Tips for Developers! It's free if you don't want to pay for it, but if you'd like to support JHOVE development with money, this is the way.

JHOVE is a widely used application among libraries and archives for file validation, but I've noticed some misconceptions about its use that keep popping up and some areas that need better explanation. These notes may be helpful in putting it to the best use.

Please use the latest version!

Some people are still using old versions of JHOVE, either just because they've never seen a reason to upgrade or because they're afraid of compatibility problems. Old versions have some serious bugs, and the compatibility issues are very few. If you're still using Java 1.4, you can't upgrade to the latest version, but why on earth are you using Java 1.4?? All releases of JHOVE will run under Java 1.5 or higher.

This wouldn't be important, except that older versions of JHOVE hammer on Harvard's servers every time they need to get a copy of the JHOVE schema, which happens once per file processed. Newer versions use a local copy of the schema. The servers at hul.harvard.edu are getting a big load of requests from a small number of sites, and they might eventually decide to block or limit access. JHOVE 1.9 is the latest as of May 2013, and there's no reason not to use it.

What JHOVE does and doesn't do

JHOVE is not a full validation application. Due to limitations in resources, Stephen Abrams and I made some decisions about its functionality, and I haven't changed these. Specifically, it doesn't look at code streams, that is, at the actual encoding of pixels or audio samples. It examines the structure of a file and reports any violations of the specification that it knows how to find.

It reports whether a file is "well-formed" and "valid." These terms come from XML nomenclature and don't always have clear meaning for other formats. I had to make decisions as I went on what defects made a document ill-formed and what made it invalid. In general, a document that isn't well-formed is one that's unusable; it will break an application that makes the assumptions in the spec. An invalid document is one with errors that reduce its functionality.

With JHOVE, the published spec rules. Good applications are forgiving of errors when they can make reasonable assumptions, but JHOVE is based on a desert island hypothesis: You've been washed ashore with some files, a computer, and a copy of the spec. You need to write an application that will read the files, maybe because they'll tell you how to build a radio from coconuts and signal for help. Strict adherence to the spec may or may not be the best choice, but it's what JHOVE does.

JHOVE identifies profiles, which are restrictions of the specification for particular purposes. It does this in the course of validating a file; if particular requirements are present and the file doesn't do anything forbidden by the profile, it reports that the file satisfies the profile. Failure to satisfy the profile isn't considered an error; if the file doesn't pass every test, the output just doesn't list the profile and doesn't tell you why. The PDF/A profile test is particularly shaky; the requirements are very complicated, and checking them as an afterthought to a module checking PDF conformity doesn't work very well. JHOVE 1.7 has some serious bugs that make it fail to report PDF/A compliance in a lot of cases; some of these will be fixed in 1.8.

Interpreting JHOVE output

JHOVE output can be either in XML or in plaintext list format. The XML is formatted to be human-readable. The information is equivalent either way.

The first thing to look at is the module. If you don't pick a specific module, JHOVE will always report some module, since as a last resort it will report a Bytestream. If you get Bytestream, that means either the file is of a format that JHOVE doesn't know about, or JHOVE thinks it's ill-formed in its intended format. The module's release and date don't matter unless you're using a third-party module.

If it finds a file is "well-formed" but not "valid" in a format, JHOVE will report that in its status. As mentioned above, the line between "well-formed" and "valid" isn't clear for all formats. Don't pay too much attention to SignatureMatches; it may sometimes give an indication of what the format was really supposed to be, but signature checking is inconsistent across the modules.

There may be a list of Profiles. These tell you what profiles JHOVE thinks the file satisfies.

The metadata section will vary greatly from one module to another. The more you know about the format, the more you'll be able to get out of the metadata.
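If you're processing a lot of XML output, the fields described above can be pulled out with the standard Java DOM parser. What follows is only a sketch: the element names reportingModule, status, and profile, the sample report in main, and the module name BYTESTREAM are my assumptions about what typical JHOVE XML output looks like, and the class name JhoveReportSummary is made up for this example. Check them against the output of your own JHOVE version before relying on it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: module, status, and profiles from one JHOVE XML report. */
final class JhoveReportSummary {
    final String module;          // text of <reportingModule> (assumed element name)
    final String status;          // text of <status> (assumed element name)
    final List<String> profiles;  // text of each <profile> (assumed element name)

    private JhoveReportSummary(String module, String status, List<String> profiles) {
        this.module = module;
        this.status = status;
        this.profiles = profiles;
    }

    /** True when the report names the last-resort module (assumed to be "BYTESTREAM"). */
    boolean isBytestreamFallback() {
        return "BYTESTREAM".equalsIgnoreCase(module);
    }

    /** Parses the report text; a missing element yields null or an empty list. */
    static JhoveReportSummary parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> profiles = new ArrayList<>();
            NodeList nodes = doc.getElementsByTagName("profile");
            for (int i = 0; i < nodes.getLength(); i++) {
                profiles.add(nodes.item(i).getTextContent().trim());
            }
            return new JhoveReportSummary(
                    firstText(doc, "reportingModule"), firstText(doc, "status"), profiles);
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a readable XML report", e);
        }
    }

    private static String firstText(Document doc, String tag) {
        NodeList nodes = doc.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        // Made-up report, trimmed to the three fields this sketch reads.
        String sample = "<jhove><repInfo uri=\"sample.pdf\">"
                + "<reportingModule>PDF-hul</reportingModule>"
                + "<status>Well-Formed and valid</status>"
                + "<profiles><profile>Linearized PDF</profile></profiles>"
                + "</repInfo></jhove>";
        JhoveReportSummary s = parse(sample);
        System.out.println(s.module + " | " + s.status + " | " + s.profiles);
    }
}
```

A report whose module comes back as the Bytestream fallback is a good candidate for a second run with a specific module selected, since that run will say why the file failed.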

Making the best use of JHOVE

In most cases, you don't need all the format modules. If a format isn't one you ever expect to see, you might want to disable it. To do this, edit the file jhove/conf/jhove.conf in your home directory and remove unwanted module declarations. They look like this:

 <module>
   <class>edu.harvard.hul.ois.jhove.module.HtmlModule</class>
 </module>

I recommend always disabling the HTML module. It's extremely slow and not very good. If you need HTML validation, there are better validators out there. The strict-conformity philosophy of JHOVE and the laxness of HTML in practice just don't play well together, especially since almost everything in HTML is optional.
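If you maintain several installations, the same edit can be scripted. This is a rough sketch using only the standard Java XML classes; it assumes nothing about the configuration file beyond the module and class elements shown above. JhoveConfEditor and removeModule are names invented for this example, not part of JHOVE. Try it on a copy of your configuration first, since re-serializing the file may change its whitespace and XML declaration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: drops one module declaration from configuration text. */
final class JhoveConfEditor {

    /** Returns confXml without any module element whose class is className. */
    static String removeModule(String confXml, String className) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(confXml)));
            NodeList modules = doc.getElementsByTagName("module");
            // Walk backwards: the NodeList is live and shrinks as nodes are removed.
            for (int i = modules.getLength() - 1; i >= 0; i--) {
                Element module = (Element) modules.item(i);
                NodeList classes = module.getElementsByTagName("class");
                if (classes.getLength() > 0
                        && className.equals(classes.item(0).getTextContent().trim())) {
                    module.getParentNode().removeChild(module);
                }
            }
            // Serialize the edited document back to text.
            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not edit configuration", e);
        }
    }
}
```

Calling removeModule with the configuration text and "edu.harvard.hul.ois.jhove.module.HtmlModule" returns the configuration without the HTML module; writing the result back to the file is left to the caller.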

If you're checking for a particular format, select only that format. The advantage of this is that if a file doesn't validate, JHOVE will tell you why. If you let it use all modules, it will just tell you that it's a bytestream. On the command line, use a command like this:

jhove -m AIFF-hul filebelievedtobe.aiff

In the GUI version, select the module from the Edit menu.
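If you're launching the command-line version from another Java program, it's safer to build the arguments as a list than to paste together one long string. Here's a minimal sketch; the only JHOVE-specific thing in it is the -m option shown above. The class JhoveCommand is made up for the example, and it assumes the jhove launcher script is on your PATH.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: builds a "jhove -m MODULE FILE..." argument list. */
final class JhoveCommand {

    static List<String> forModule(String jhovePath, String module, String... files) {
        List<String> cmd = new ArrayList<>();
        cmd.add(jhovePath);   // launcher script or full path to it
        cmd.add("-m");        // restrict the run to a single module
        cmd.add(module);
        cmd.addAll(Arrays.asList(files));
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = forModule("jhove", "AIFF-hul", "filebelievedtobe.aiff");
        System.out.println(String.join(" ", cmd));
        // prints: jhove -m AIFF-hul filebelievedtobe.aiff

        // To actually run it (assumes "jhove" is on the PATH):
        //   Process p = new ProcessBuilder(cmd).inheritIO().start();
        //   int exitCode = p.waitFor();
    }
}
```

Passing each file name as its own list element means names with spaces in them don't need any quoting.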

The names of the modules included with the SourceForge distribution are:

Using the JHOVE API

JHOVE can be used as a library by other Java applications. Among its users are Harvard's FITS, the Planets Testbed, and the Plato preservation planning tool. The API isn't very well documented on the JHOVE website, I'm afraid; here's a quick introduction.

The main entry point for the API is edu.harvard.hul.ois.jhove.JhoveBase. The initialization code in Jhove.main() provides an example of how to use it. Here's a minimal example of how to invoke it:

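    // Requires: import edu.harvard.hul.ois.jhove.*;
    // NAME, RELEASE, DATE, USAGE, RIGHTS, logLevel, configFile, saxClass,
    // and outputFile come from the caller; fileToTest is a String[] of
    // path names, as in Jhove.main().
    try {
        App app = new App (NAME, RELEASE, DATE, USAGE, RIGHTS);
        JhoveBase je = new JhoveBase ();
        je.setLogLevel (logLevel);
        je.init (configFile, saxClass);

        // Look up the handler only after init() has read the configuration.
        OutputHandler handler = je.getHandler ("XML");
        Module module = null;       // check all modules

        je.setEncoding ("utf-8");
        // Java doesn't expand "~", so build the path explicitly.
        je.setTempDirectory (System.getProperty ("user.home") + "/temp");
        je.setBufferSize (4096);
        je.setChecksumFlag (false);
        je.setShowRawFlag (false);
        je.setSignatureFlag (false);

        // The third argument is the "about" handler; null means none.
        je.dispatch (app, module, null, handler, outputFile, fileToTest);
    }
    catch (Exception e) {
        e.printStackTrace();
    }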
 

Conclusions

By now you've probably guessed that I don't agree with all the design decisions behind JHOVE, or even with all the programming decisions I made myself. The high-level decisions weren't mine to make, and the preservation community has learned a lot since 2005, as have I. I think it's best to keep JHOVE good at what it was designed to do, rather than change its purpose.