JEP draft: Vector API

Authors: Vladimir Ivanov, Razvan Lupusoru, Paul Sandoz, Sandhya Viswanathan
Type: Feature
Scope: SE
Status: Submitted
Component: hotspot / compiler
Effort: M
Duration: M
Reviewed by: John Rose
Created: 2018/04/06 22:58
Updated: 2018/06/08 23:54
Issue: 8201271

Summary

Provide an initial iteration of an incubating module, named jdk.incubator.vector, to express vector computations that reliably compile at runtime to optimal vector hardware instructions on supported CPU architectures and thus achieve performance superior to equivalent scalar computations.

Goals

Non-Goals

Motivation

Vector computations consist of a sequence of operations on vectors. A vector comprises a (usually) fixed sequence of scalar values, where the number of scalar values corresponds to the number of hardware-defined vector lanes. A binary operation applied to two vectors with the same number of lanes would, for each lane, apply the equivalent scalar operation on the corresponding two scalar values from each vector. This is commonly referred to as Single Instruction Multiple Data (SIMD).

Vector operations express a degree of parallelism that enables more work to be performed in a single CPU cycle and thus can result in significant performance gains. For example, given two vectors each covering a sequence of eight integers (eight lanes), then the two vectors can be added together using a single hardware instruction. The vector addition hardware instruction operates on sixteen integers, performing eight integer additions, in the time it would ordinarily take to operate on two integers, performing one integer addition.

HotSpot supports auto-vectorization, where scalar operations are transformed into superword operations, which are then mapped to vector hardware instructions. The set of transformable scalar operations is limited and fragile with respect to changes in code shape. Furthermore, only a subset of the available vector hardware instructions might be utilized, limiting the performance of generated code.

A developer wishing to write scalar operations that are reliably transformed into superword operations needs to understand HotSpot's auto-vectorization support and its limitations to achieve reliable and sustainable performance.

In some cases it may not be possible for the developer to write scalar operations that are transformable. For example, HotSpot does not transform the simple scalar operations for calculating the hash code of an array (see the Arrays.hashCode method implementations in the OpenJDK source code), nor can it auto-vectorize code to lexicographically compare two arrays (which is why an intrinsic was added to perform lexicographical comparison, see JDK-8033148).

The Vector API aims to address these issues by providing a mechanism to write complex vector algorithms in Java, using pre-existing support in HotSpot for vectorization, but with a user model which makes vectorization far more predictable and robust. Hand-coded vector loops can express high-performance algorithms (such as vectorized hashCode or specialized array comparison) which an auto-vectorizer may never optimize. There are numerous domains where this explicitly vectorizing API may be applicable such as machine learning, linear algebra, cryptography, finance, and usages within the JDK itself.

Description

In the Vector API, a vector is represented by the abstract class Vector<T, S>. The type variable T corresponds to the boxed type of the scalar primitive integral or floating point element type covered by the vector. The type variable S corresponds to the shape of the vector. The shape defines the size (in bits, commonly fixed) of a vector, and governs how an instance of Vector<T, S> is mapped to a vector hardware register when vector computations are compiled by the HotSpot C2 compiler (see later for a mapping from instances to x86 vector registers). The length of a vector (its number of lanes or elements) is the vector size divided by the element size.

Here is a simple scalar computation over elements of arrays:

void scalarComputation(float[] a, float[] b, float[] c) {
    for (int i = 0; i < a.length; i++) {
        c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
    }
}

(It is assumed that the array arguments will be of the same size.)

An explicit way to implement the equivalent vector computation using the Vector API is as follows:

static final FloatVector.FloatSpecies<Shapes.S256Bit> SPECIES =
        FloatVector.species(Shapes.S_256_BIT);

void vectorComputation(float[] a, float[] b, float[] c) {
    int i = 0;
    for (; i < (a.length & ~(SPECIES.length() - 1));
           i += SPECIES.length()) {
        // FloatVector<Shapes.S256Bit> va, vb, vc;
        var va = SPECIES.fromArray(a, i);
        var vb = SPECIES.fromArray(b, i);
        var vc = va.mul(va).
                    add(vb.mul(vb)).
                    neg();
        vc.intoArray(c, i);
    }

    for (; i < a.length; i++) {
        c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
    }
}

In this example the shape is hardcoded to a vector size of 256 bits. A species (factory) is obtained from a static method and stored in a static final field, so the runtime compiler will treat the field's value as a constant and is therefore able to better optimize the vector computation.

The vector computation features a main loop kernel iterating over the arrays in strides of vector lanes (the species length). The species is used to load float vectors from arrays a and b at the corresponding index, then the operations are performed (fluently), and finally the result is stored into array c.

The scalar computation after the vector computation is required to process the tail of elements, the length of which is smaller than the species length. We shall see later how this example compiles to vector hardware instructions and how the Java implementation can be improved upon.

The set of element types (T) supported will be Byte, Short, Int, Long, Float and Double corresponding to the scalar primitive types byte, short, int, long, float and double, respectively.

The set of shapes (S) supported will be the concrete shape classes corresponding to vector sizes of 64, 128, 256, and 512 bits. A shape corresponding to a size of 512 bits can pack bytes into 64 lanes or pack ints into 16 lanes, and a vector of such a shape can operate on 64 bytes at a time, or 16 ints at a time.
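
As a sketch of this relationship (assuming ByteVector and IntVector species factories analogous to the FloatVector.species factory used in the example above; the exact names may change during incubation), the lane count is simply the shape size divided by the element size:

// A sketch only; ByteVector.species and IntVector.species are assumed to
// mirror the FloatVector.species factory shown earlier in this JEP.
ByteVector.ByteSpecies<Shapes.S512Bit> byteSpecies =
        ByteVector.species(Shapes.S_512_BIT);
IntVector.IntSpecies<Shapes.S512Bit> intSpecies =
        IntVector.species(Shapes.S_512_BIT);

int byteLanes = byteSpecies.length();  // 512 / 8  = 64 lanes
int intLanes  = intSpecies.length();   // 512 / 32 = 16 lanes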

The following table presents a mapping from Vector to x86 registers:

| Vector                     | x86 register |
|----------------------------|--------------|
| Vector<?, Shapes.S64Bit>   | xmm?         |
| Vector<?, Shapes.S128Bit>  | xmm?         |
| Vector<?, Shapes.S256Bit>  | ymm?         |
| Vector<?, Shapes.S512Bit>  | zmm?         |

(Note: We believe that these simple shapes are generic enough to be useful on all platforms supporting the Vector API. However, as we experiment beyond this JEP with future platforms, we may further modify the design of the shape parameter. In principle it could express additional properties besides bit-size, such as whether a vector is dense or not, whether it possesses an intrinsic mask, whether and how it may be dynamically sized, etc. Such work is not in the scope of this JEP, but these possibilities partly inform the present role of shapes in the Vector API.)

An instance of Vector<T, S> is immutable, and Vector<T, S> is a value-based type that retains, by default, object identity invariants (see later for a relaxation of these invariants).

Vector<T, S> declares a set of methods for common vector operations supported by all element types. Such operations can be classified into groups of unary (e.g. negation), binary (e.g. addition), and comparison (e.g. lessThan), with corresponding familiar scalar operational equivalents. Further operations are more specific to vectors, such as intra-lane operations (e.g. permuting elements), converting to a vector of a different type and/or shape (e.g. casting), or storing the vector elements into a data container (e.g. a byte[] array).

To support operations specific to an element type there are six public abstract sub-classes of Vector<T, S>, one for each supported element type, ByteVector<S>, ShortVector<S>, IntVector<S>, LongVector<S>, FloatVector<S>, and DoubleVector<S>. There are operations that are common to the integral sub-types, such as bitwise operations (e.g. logical xor), and operations common to the floating point types, such as mathematical operations (e.g. transcendental functions like cosine). Other operations are bound to the element type since the method signature refers to the element type (or the equivalent array type), such as reduction operations (e.g. sum all elements to a scalar value), or storing the vector elements to an array.
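
For example (a sketch; the xor, cos, and addAll method names are illustrative assumptions for the operation categories just described, while fromArray and intoArray are as used in the examples above):

// Sketch only: xor, cos, and addAll are illustrative names for a bitwise
// operation, a transcendental operation, and a reduction, respectively.
IntVector.IntSpecies<Shapes.S256Bit> ispecies = IntVector.species(Shapes.S_256_BIT);
FloatVector.FloatSpecies<Shapes.S256Bit> fspecies = FloatVector.species(Shapes.S_256_BIT);

int[] ints = new int[ispecies.length()];
float[] floats = new float[fspecies.length()];

IntVector<Shapes.S256Bit> vi = ispecies.fromArray(ints, 0);
FloatVector<Shapes.S256Bit> vf = fspecies.fromArray(floats, 0);

IntVector<Shapes.S256Bit> vx = vi.xor(vi);    // bitwise op: integral vectors only
FloatVector<Shapes.S256Bit> vc = vf.cos();    // transcendental op: floating point vectors only
int sum = vi.addAll();                        // reduction: sum all lanes to a scalar int
vi.intoArray(ints, 0);                        // store lanes into an int[] array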

Concrete sub-types for concrete type parameters of both S and T are not public since there is no need to provide operations specific to the type and shape. This reduces the API surface to a sum of concerns rather than a product.

As a result, instances of Vector<T, S> cannot be constructed directly. Instead, instances are obtained via factories referred to as species, specifically via the abstract class Vector.Species<T, S> and the element type-specific abstract sub-classes IntVector.IntSpecies<S> etc., which mirror the type variables and class hierarchy of Vector<T, S>.

A species provides operations to obtain vector instances, such as the vector instance whose elements are initialized to default values (the zero vector), or a vector loaded from an array, in addition to providing the canonical support for converting between vectors of different types and/or shapes (e.g. casting).
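
A minimal sketch of these species operations (the zero and cast method names below are illustrative assumptions; fromArray is as used in the examples above):

// Sketch only: zero() and cast() are illustrative names for the factory and
// conversion operations described above.
FloatVector.FloatSpecies<Shapes.S256Bit> fs = FloatVector.species(Shapes.S_256_BIT);
float[] a = new float[fs.length()];

FloatVector<Shapes.S256Bit> vz = fs.zero();           // all lanes set to the default value 0.0f
FloatVector<Shapes.S256Bit> va = fs.fromArray(a, 0);  // lanes loaded from the float[] array

// Conversion to a different element type and/or shape goes through the target species.
IntVector.IntSpecies<Shapes.S256Bit> is = IntVector.species(Shapes.S_256_BIT);
IntVector<Shapes.S256Bit> vi = is.cast(va);           // cast float lanes to int lanes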

To support control flow, relevant vector operations will optionally accept masks, represented by the public abstract class Mask<T, S>. An instance of Mask<T, S> will have the same length as an instance of Vector<T, S> for the same S and T, and therefore the same number of lanes. Each element in a mask, a boolean value or bit, corresponds to a vector lane. When a mask is an input to an operation it governs whether the operation is applied to each lane: the operation is applied if the mask bit for the lane is set (is true), and alternative behavior occurs if the mask bit is not set (is false). Comparison operations produce masks, which can then be input to other operations to selectively disable the operation on certain lanes and thereby emulate flow control. Masks can also be obtained from species (just like vectors).
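
For example (a sketch; lessThan is the comparison named above, and blend is the lane-selection operation discussed in the implementation section below, shown here with illustrative semantics):

// Sketch: compute, lane-wise and without branches, the minimum of two vectors.
FloatVector.FloatSpecies<Shapes.S256Bit> fs = FloatVector.species(Shapes.S_256_BIT);
float[] a = new float[fs.length()];
float[] b = new float[fs.length()];
float[] c = new float[fs.length()];

FloatVector<Shapes.S256Bit> va = fs.fromArray(a, 0);
FloatVector<Shapes.S256Bit> vb = fs.fromArray(b, 0);

Vector.Mask<Float, Shapes.S256Bit> m = va.lessThan(vb);  // one boolean per lane
FloatVector<Shapes.S256Bit> vmin = vb.blend(va, m);      // take the lane from va where the mask
                                                         // bit is set, from vb otherwise
vmin.intoArray(c, 0);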

It is anticipated that masks will play an important role in the development of vector computations that are generic to shape. (This is based on the central importance of predicate registers, the equivalent of masks, in the ARM Scalable Vector Extensions as well as in Intel's AVX-512.)

Example

Continuing with the example presented at the beginning of the description section, the HotSpot compiler should generate machine code similar to the following:

  0.43%  / │  0x0000000113d43890: vmovdqu 0x10(%r8,%rbx,4),%ymm0
  7.38%  │ │  0x0000000113d43897: vmovdqu 0x10(%r10,%rbx,4),%ymm1
  8.70%  │ │  0x0000000113d4389e: vmulps %ymm0,%ymm0,%ymm0
  5.60%  │ │  0x0000000113d438a2: vmulps %ymm1,%ymm1,%ymm1
 13.16%  │ │  0x0000000113d438a6: vaddps %ymm0,%ymm1,%ymm0
 21.86%  │ │  0x0000000113d438aa: vxorps -0x7ad76b2(%rip),%ymm0,%ymm0
  7.66%  │ │  0x0000000113d438b2: vmovdqu %ymm0,0x10(%r9,%rbx,4)
 26.20%  │ │  0x0000000113d438b9: add    $0x8,%ebx
  6.44%  │ │  0x0000000113d438bc: cmp    %r11d,%ebx
         \ │  0x0000000113d438bf: jl     0x0000000113d43890

This is actual output from a JMH micro-benchmark for the example code under test using a prototype of the Vector API and implementation (the vectorIntrinsics branch of Project Panama's development repository).

The hot areas of C2-generated machine code are presented. There is a clear translation to vector registers and vector hardware instructions. (Note: loop unrolling was disabled to make the translation clearer; otherwise HotSpot should be able to unroll using existing C2 loop optimization techniques.) All Java object allocations are elided.

It is an important goal to support more complex non-trivial vector computations that translate clearly into generated machine code.

There are, however, a few issues with this particular vector computation:

  1. The loop is hardcoded to a concrete vector shape, so the computation cannot adapt dynamically to a maximal shape supported by the architecture (which may be smaller or larger than 256 bits). Therefore the code is less portable and may be less performant.

  2. Calculation of the loop upper bounds, although simple here, can be a common source of programming error.

  3. A scalar loop is required at the end, duplicating code.

The first two issues will be addressed by this JEP. A preferred species can be obtained whose shape is optimal for the current architecture, the vector computation can then be written with a generic shape, and a method on the species can round down the array length, for example:

static final FloatVector.FloatSpecies<?> SPECIES =
    FloatVector.preferredSpecies();

<S extends Shape>
void vectorComputation(float[] a, float[] b, float[] c,
        FloatVector.FloatSpecies<S> species) {
    int i = 0;
    int upperBound = species.roundDown(a.length);
    for (; i < upperBound; i += species.length()) {
        //FloatVector<S> va, vb, vc;
        var va = species.fromArray(a, i);
        var vb = species.fromArray(b, i);
        var vc = va.mul(va).
                    add(vb.mul(vb)).
                    neg();
        vc.intoArray(c, i);
    }

    for (; i < a.length; i++) {
        c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
    }
}

vectorComputation(a, b, c, SPECIES);

The last issue will not be fully addressed by this JEP and will be the subject of follow-on work. One solution could be easily expressed in the API by using a mask as an additional loop variable, for example:

<S extends Shape>
void vectorComputation(float[] a, float[] b, float[] c,
        FloatVector.FloatSpecies<S> species) {
    Vector.Mask<Float, S> m = species.maskFromBounds(0, a.length);
    for (int i = 0;
         i < a.length;
         i += species.length(), m = species.maskFromBounds(i, a.length)) {
        //FloatVector<S> va, vb, vc;
        var va = species.fromArray(a, i, m);
        var vb = species.fromArray(b, i, m);
        var vc = va.mul(va).
                    add(vb.mul(vb)).
                    neg();
        vc.intoArray(c, i, m);
    }
}

The mask will ensure that loads and stores will not result in out-of-bounds exceptions (in this case the calculation is such that masks do not need to be provided to the other vector operations). It is anticipated that such masked loops will work well for a range of architectures, including x86 and ARM, but will require additional runtime compiler support to generate maximally efficient code. Such work on masked loops, though important, is beyond the scope of this JEP.

HotSpot C2 compiler implementation details

The Vector API has two implementations in order to adhere to the project goals. The first implements the operations in Java, and is thus functional but not optimal. The second makes those operations intrinsic to the HotSpot C2 compiler, with special treatment for Vector API types. This allows proper translation to x86 registers and instructions in the cases where architecture support, and an implementation of the translation, exist.

The intrinsification process for the Vector API will work by translating Vector API method calls to C2 IR nodes that represent the intended semantics. For example, for FloatVector<Shapes.S256Bit>.add, the C2 compiler will replace the call with an AddVF node plus a VectorBox node. The AddVF represents the addition of two float vectors, while the VectorBox represents the boxing needed to create a valid object. (Thus add on two Vector objects will produce a resulting Vector object.) This way, object creation (if any) is submerged under the vector operation, so in cases where the object does not need to exist, it can be eliminated.
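
For instance (a sketch of the user-level code to which this translation applies; AddVF and VectorBox are the C2 IR node names given above, not part of the Java API):

FloatVector<Shapes.S256Bit> addKernel(FloatVector<Shapes.S256Bit> va,
                                      FloatVector<Shapes.S256Bit> vb) {
    // C2 replaces this call with an AddVF node (lane-wise float addition)
    // wrapped in a VectorBox node; the boxing is elided when the result
    // does not escape, for example when it is immediately stored with intoArray.
    return va.add(vb);
}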

The IR nodes generated by intrinsification will overlap with the IR nodes used by the auto-vectorizer. However, because the Vector API will support a much larger set of operations, additional IR nodes will be added as needed. In order to keep the newly added nodes to a minimum, new nodes will no longer encode the element type in the node name. For example, the VectorBlend node supports blending and masking operations; there is no VectorBlendI node for int vectors. Instead, the extra type information is encoded using the existing type system (TypeVect), which encodes the element type along with the shape.

It is intended that, for all of the vector operations defined by the API, there will be a translation implemented that will allow use of x86 instructions on some x86 architectures. For example, ByteVector<Shapes.S256Bit>.blend will translate to vpblendvb (AVX2), whereas ByteVector<Shapes.S512Bit>.blend will translate to vpblendmb (AVX-512). The translation may be non-optimal. If ByteVector<Shapes.S512Bit>.blend is used on a system that only supports AVX2, no translation will occur and instead the default Java implementation will be used. That said, the Vector API provides preferred-species factory methods (such as FloatVector.preferredSpecies, used above) in order to query for the appropriate vector size to use. Behind the scenes, this query is implemented by calling into Matcher::vector_width_in_bytes so that the value is dynamically computed depending on the system. The returned species can be used for generically sized vector computations, so no concrete shape needs to be declared.

The set of operations on Vector, Species and Mask will be selected for their applicability to C2 intrinsification on x86 architectures. Additional non-intrinsified operations may be placed off to the side in helper classes. In future work, these divisions may be adjusted in order to provide a more fully platform-agnostic API.

To avoid an explosion of intrinsics added to C2, a set of intrinsics will be defined that correspond to operation kinds, such as binary, unary, comparison, and so on, where constant arguments are passed describing operation specifics. Approximately ten new intrinsics will be needed to support intrinsification of all parts of the API.
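
To illustrate the idea (a purely hypothetical sketch; the actual intrinsic entry points, names, and parameters are internal implementation details of the prototype): instead of one intrinsic per concrete operation, a single entry point per operation kind takes constant arguments describing the specific operation:

// Hypothetical sketch only; not the actual intrinsic entry point.
static <T, S extends Shape> Vector<T, S> binaryOp(
        int opCode,            // constant operation code (e.g. an ADD or MUL constant)
        Class<?> vectorClass,  // constant concrete vector class (element type plus shape)
        Vector<T, S> a,
        Vector<T, S> b) {
    // A Java fallback implementation would run here whenever C2 does not
    // intrinsify the call (for example, in the interpreter or C1).
    throw new UnsupportedOperationException("sketch only");
}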

The C2 compiler will have special knowledge of the Vector, Species and Mask types and all their sub-types. This will enable C2 to map instances of Vector to vector registers and to aggressively elide allocations when such instances do not escape. C2 will also have knowledge of how to treat vector registers and vector objects at safepoints so that it can safely save them and also safely reconstruct Vector objects. Special attention will be taken to ensure that, by default, object semantics (such as identity) are preserved when an instance escapes or needs to be materialized as a reference to a Vector object.

Vector instances are value-based: morally they are values, and identity-sensitive operations should be avoided. This potentially limits the set of applicable optimizations, specifically due to the limitations of escape analysis. A flag will be provided to enable Vector instances to have no guaranteed identity and thereby support more aggressive optimizations, such as lazy materialization at a safepoint. When value types are fully supported by the Java language and runtime (see Project Valhalla), then concrete Vector classes can be made value types, and it is anticipated that such a flag, and many of these special optimizations, will no longer be required.

Mask support will require careful attention on x86 architectures since there are two kinds of representation, a vector register representation or an opmask register representation (for AVX-512), and different instructions will take one or the other. In the initial implementation, it is expected that all masks will be represented as vector registers even for AVX-512. This means that native masking via opmask (or k) registers will not be supported in the first implementation. Platforms like AVX-512 and ARM SVE motivate our treatment of Mask as a special type rather than as an ordinary combination of Vector and boolean types.

Future Work

The Vector API will benefit significantly from value types when they are ready (see Project Valhalla). Instances of Vector<T, S> can then be values, whose concrete classes are value types. This will make it easier to optimize and express vector computations. Sub-types of Vector<T, S> for specific types, such as IntVector<S>, will no longer be required once generic specialization over values and type-specific method declarations are available. A shift to value types is thought to be backward compatible, perhaps after recompilation of Vector API code. Some abstract classes may need conversion to interfaces if they are supers of value types.

A future version of the Vector API may make use of enhanced generics, as noted above.

It is expected that the API will incubate over multiple releases of the JDK and will adapt as dependent features such as value types become available in a future JDK release and newer CPU architectures become more established in the industry.

Alternatives

HotSpot's auto-vectorization is an alternative approach but it would require significant enhancement and would likely still be fragile and limited compared to using the Vector API, since auto-vectorization with complex control flow is very hard to perform.

In general, and even after decades of research (especially for FORTRAN and C array loops), it seems that auto-vectorization of scalar code is not a reliable tactic for optimizing ad hoc user-written loops, unless the user pays unusually careful attention to unwritten contracts about exactly which loops a compiler is prepared to auto-vectorize. It is too easy to write a loop that fails to auto-vectorize, for a reason that only the optimizer, and not the human reader, can detect. Years of work on auto-vectorization (even in HotSpot) have left us with lots of optimization machinery that works only on special occasions. We want to enjoy the use of this machinery more often!

Testing

Combinatorial unit tests will be developed to ensure coverage for all operations, for all supported types and shapes, over various data sets. The tests will be implemented with TestNG and will be exercisable via jtreg.
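
A sketch of the intended test shape (hypothetical TestNG code, assuming the draft API used in the examples above and a hypothetical data provider named floatArrays):

// Hypothetical test shape: each vector operation is checked lane by lane
// against its scalar equivalent for a variety of input arrays.
// (Assumes org.testng.annotations.Test and org.testng.Assert.)
@Test(dataProvider = "floatArrays")
public void testAddFloat256(float[] a, float[] b) {
    FloatVector.FloatSpecies<Shapes.S256Bit> s = FloatVector.species(Shapes.S_256_BIT);
    float[] actual = new float[s.length()];
    for (int i = 0; i < s.roundDown(a.length); i += s.length()) {
        s.fromArray(a, i).add(s.fromArray(b, i)).intoArray(actual, 0);
        for (int j = 0; j < s.length(); j++) {
            Assert.assertEquals(actual[j], a[i + j] + b[i + j], 0.0f);  // lane-wise check
        }
    }
}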

Performance tests will be developed to ensure performance goals are met and vector computations map efficiently to vector hardware instructions. This will likely consist of JMH micro-benchmarks, but more realistic examples of useful algorithms will also be required.
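
A sketch of such a micro-benchmark (hypothetical JMH code, assuming the scalarComputation and vectorComputation methods and the SPECIES field shown earlier are in scope):

// Hypothetical JMH benchmark comparing the scalar and vector kernels shown above.
@State(Scope.Thread)
public class MulAddNegBenchmark {
    @Param("1024") int size;
    float[] a, b, c;

    @Setup
    public void setup() {
        a = new float[size]; b = new float[size]; c = new float[size];
    }

    @Benchmark
    public void scalar() { scalarComputation(a, b, c); }

    @Benchmark
    public void vector() { vectorComputation(a, b, c, SPECIES); }
}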

As a backup to the performance tests, we will create white-box tests to force the JIT to report to us that Vector API source code did, in fact, trigger vectorization.

Risks and Assumptions

There is a risk that the API will be biased to the SIMD functionality supported on x86 architectures. This applies mainly to the explicitly fixed set of supported shapes, which biases against coding algorithms in a shape-generic fashion. We consider the majority of other operations of the Vector API to bias toward portable algorithms. To mitigate that risk, other architectures will be taken into account, specifically the ARM Scalable Vector Extension (SVE) architecture, whose programming model adjusts dynamically to the single fixed shape supported by the hardware. We welcome and encourage OpenJDK engineers working on the ARM-specific areas of HotSpot, and ARM engineers, to participate in this effort.

The Vector API uses box types (like Integer) as proxies for primitive types (like int). This decision is forced by the current limitations of Java generics (which are hostile to primitive types). When Project Valhalla eventually introduces more capable generics, the current decision will seem awkward, and may need changing. We assume that such changes will be possible without excessive backwards incompatibility.

Dependencies

The 'Future Work' sub-section describes the enhancement of this JEP via use of value types. However, that is not a blocking dependency but merely an enhancement. There are no other dependencies for this work.