JEP draft: Better hash codes

OwnerJohn Rose
TypeFeature
ScopeJDK
StatusDraft
Componenthotspot
Created2018/04/12 01:23
Updated2018/04/19 19:02
Issue8201462

Summary

Introduce a modernized general-purpose hash function into the JVM, usable by arbitrary classes and arrays.

Enable new classes to use it as a back-end for the standard Object.hashCode API.

Supply APIs for legacy types (including arrays and collections) to begin to make use of this hash code.

Goals

Non-goals

Success Metrics

Motivation

Existing standard implementations of the Object.hashCode API have well-known flaws, leading to excessive hash collisions and poor use of CPU cycles and excessive memory footprint in hashed structures.

In many cases every bit of the input affects at most one bit of the output (Integer.hashCode), and/or can be trivially canceled by another bit (Long.hashCode). Even for variably-sized types, such as strings and lists, this observation often applies to the last elements of the type.

Even if the input bits are mixed a little, in most cases they are not mixed thoroughly, because the usual mixing function is exclusive or between parallel inputs, or the simple recurrence h=h*31+x for sequences of inputs. In these functions there are many funnels between inputs and dead bits in the output.

The existing hash codes are insufficient for industrial strength hashing that must avoid of pathological collision behaviors, even in the presence of overloaded or subverted inputs.

Meanwhile, the JVM has a special role to play here. For any given object layout (or value type layout), it knows how to load the hashed input with as few memory operations as possible, and it knows whether special-purpose instructions (such as AES steps) are available to accelerate the hash algorithm while retaining good mixing properties. The JVM can even sense when a vectorized hashing algorithm might be appropriate. All of these considerations are inconsistent with an externally-specified hash algorithm such as XOR or h*31+x, and all of them are important for obtaining the best possible trade-off between hashing speed and quality. It follows that we don't need just another library function for a few use cases, but rather a deeply embedded facility that the JVM can optimize for (potentially) any Java object, value, or array.

Description

The low-level API point System.primitiveHashCode will be defined alongside the API point System.identityHashCode, as a way to obtain a fast, high quality hash code for the publicly readable contents of an object, value, or array. A variant entry point (see below) will direct the JVM to privately readable contents on an opt-in basis.

The goals given above will be met by all instances of this algorithm. (Exception instances of salt selected non-randomly by an adversary will not serve as proof of failure.) The JVM is encouraged to use appropriate technology, possibly including built-in cryptographic instructions or specialized vector hardware, to implement these goals.

Any non-public fields, and any fields or elements of reference type in the object value, or array will be ignored by primitiveHashCode. This may seem to be a deep usability problem, but in fact allows primitiveHashCode to be used as building block for a range of useful hashcode behaviors, including those which treat references as object identities or as pointers to more hashable content, or some mix of both.

The API point System.primitiveHashCode(Object,Lookup) will provide a version of the hash algorithm which inspects all fields in the given object, under the requirement that the Lookup argument validates access to these fields. This API point is intended for use as a building block with classes whose internal structure has primitives.

Any fields of value type declared in the object or value class are recursively scanned for non-reference inputs to the hash code algorithm. Likewise, arrays of value types are recursively scanned. Arrays of Object or interface types will not be scanned, even if some of the elements might refer to value types.

The class of the operand passed to System.primitiveHashCode is significant to the hash code. That is, two objects with equivalent fields but different classes will produce independent hash codes. Two arrays of the same length and numeric contents, but with different types (short vs. long) will produce independent hash codes. Likewise, the length of an array is significant to the hash code, as well as all of its non-reference (primitive or value) elements.

As a special case, reference arrays could distribute the hash algorithm to each of their reference elements, instead of ignoring them. But this special case seems to work better as a special method; see below. For consistency, passing a reference array to System.primitiveHashCode will simply hash the array's class and length.

Since public fields are relatively rare in standard Java APIs, there is also an API to adapt primitiveHashCode to values or objects with private fields. This API includes a permission check based on a second argument, a Lookup which must exactly match the class of the operand. It is System.primitiveHashCode(Object,Lookup). It is not caller sensitive.

Internally, the JVM will make two key choices to implement System.primitiveHashCode. First, it will statically sense any configuration parameters that add "salt" to the hashing algorithm, for this JVM instance. Second, for each operand, it will sense the class and vector (if necessary for efficiency) to an appropriately specialized implementation of the hash function. This second step will often be omitted if JIT-time type information is available.

If the lookup parameter is present, a third step is required, which is comparing the lookup class against the object class, and ensuring that the lookup has the required "private" bit set which allows full access.

The API point MethodHandles.primitiveHashCode(String) will provide an instance of the long hash code algorithm, using the given string as "salt". If the string is empty, the algorithm will behave the same as that of System.primitiveHashCode.

The combined API point that takes both the string and the Lookup object may also be provided, as well as (perhaps) API points which allow selection of a set of relevant fields and other inputs to the hash function, such as hashes recursively derived from reference fields.

There will be a way to specify a salt string at JVM start-up for use only the in that instance of the JVM. There will be way to ask the JVM to generate such a salt string randomly at start-up, and in this case the salt will be as unpredictable as possible.

For object and value classes, the set of non-reference fields which are public, or the set of all fields, is often only a first approximation to the set of inputs required for a correct hash function. In those cases, user code (or a method handle generated by a bootstrap method) can use primitiveHashCode to extract all of the primitive fields at once, and then mix them with hashes derived from other sources, such as affiliated objects or arrays. This mixing can be performed on an intermediate array of type long[].

Additional API points may be built on top as needed:

Publicly available block data hash algorithms, such as xxHash or entries in Google's CityHash are likely to be useful for building the above functionality.

An important advantage of defining hashes with JVM assistance is accelerating the acquisition of initial inputs to the hash, the primitive bits from the original object, value, or array. On modern hardware, this can often be performed most efficiently by loading an entire vector worth of data from the input instance. A vector masking operation (either with the load, or after it) can suppress irrelevant portions of the object layout, such object header, fragmentation overhead, nearby objects, private fields (if suppressed) and fields of reference type. If the object is dense enough with primitive fields, it is likely that a single masked vector load is competitive with a series of scalar loads, and the vector processing unit is often capable of performing significant hash steps, immediately after the load.

As a specific algorithm for hash steps, note that user-mode instructions for a single step of the 128-bit AES encode algorithm is available on many significant platforms supported today by Java, and are well integrated with vector units. (In some cases, the vector unit can run several steps in parallel.) A 128-bit AES step is likely to be about as cheap as many traditiional hashing steps, such as 64-bit multiply followed by bit swizzling. The AES step operates on 128 bits at a time, which gives it an edge on some platforms. Early experiments with SMHasher suggest that two AES steps (but not one) are usually enough to produce robust mixing of bits.

Thus, an AES-based algorithm something like this would seem to work well for the above APIs:

primitiveHashCode(x) {
  cls = x.getClass()
  mask1 = vectorMaskForInstance(cls)
  bits = unsafe_vector_load(x, mask1)
  bits = aes_step(bits, SALT1)
  if (cls.isArray()) {
    mask2 = vectorMaskForArrayBody(cls)
    for (off in vectorOffsetsFor(cls, x.length)) {
      bits2 = unsafe_vector_load(x + off, mask2)
      bits = aes_step(bits, bits2)
      bits2 = aes_step(bits2, SALT2)
      bits = aes_step(bits, bits2)
    }
  }
  bits = aes_step(bits, SALT3)
  return bits
}

Indeed, the CityHash project appears to have experimented briefly with the approach of using AES as a mixing step.

The SALT values above can be any pseudo-random 128-bit value. They determine which instance of the hash algorithm is in use. They could be derived from a random number generator, or from the hashing (cryptographic or otherwise) of the salt string mentioned earlier.

For machines which can issue several AES steps at once, for larger arrays, it would be a routine adjustment to unroll the above algorithm and rearrange it to accumulate larger state vectors, in parallel. The choice of whether to do this depends sensitively on the the array length and the processor, which means it must be made on-line by the JVM and not by a prior decision, such as a mandate from a library API.

A JVM instance which runs on hardware that doesn't accelerate AES could fall back to a standard ALU-intensive hash algorithm, such as MurmurHash or xxHash. The freedom to back off like this is integral to getting the best-quality hash available on any given platform.

Collision resistance can be increased by adding more instructions, including XOR steps to carry forward duplicate values of previous steps (Davies-Mayer and similar constructions) and/or additional AES round instructions. This construction is not designed to be a cryptographically strong hash, and collisions may occur. The use of randomly selected salt values will make it more difficult, though not impossible, for adversaries to engineer collisions.

Java currently taps out at 64 bits on primitive size, but future value types will easily grow larger. At that point, the APIs sketched above could be expanded to provided "extra long" hashes of arbitrary size, if this seems desirable. Such jumbo hashes are relatively simple to derive from the above hashing algorithms, by using distinct salt values for each channel of the jumbo hash.

Alternatives

The special case for reference arrays could be removed, and instead special API points reserved for those uses: Arrays.deepLongHashCode

The various API points suggested above could be deferred until proven useful.

We could continue to live with hash collisions from h*31+x.

Library writers could continue to create their own hash functions, which will not be optimized well by the JVM and/or will have poorer hashing quality than what the JVM could do.

It is an unusual API design to mix array-element hashing with instance-field hashing, all in one place. Perhaps the concerns should be separated. Perhaps polymorphism should be expressed statically, as method handle factories, rather than via dynamic type tests performed by a single magic entry point. Note that the magic entry point has precedents such as System.arraycopy and Class.newInstance. The common thread handled by the proposed System.primitiveHashCode is that only the JVM can know the fastest and best thing to do for each distinct object, value, or array layout (and for each selection of public fields vs. all fields).