JEP draft: Vector API

AuthorsVladimir Ivanov, Razvan Lupusoru, Paul Sandoz, Sandhya Viswanathan
Created2018/04/06 22:58
Updated2018/04/19 18:35
TypeFeature
StatusDraft
Componenthotspot
ScopeSE
EffortM
DurationM
Priority4
Issue8201271

Summary

Provide an incubating module, named jdk.incubator.vector, to express vector computations that reliably compile at runtime to optimal vector hardware instructions on supported CPU architectures and thus achieve superior performance than equivalent scalar computations.

Vector computations consist of a sequence of operations on vectors. A vector covers a (usually) fixed sequence of scalar values, where the number of scalars covered corresponds to the number of vector lanes. A binary operation applied on two vectors with the same number of lanes would, for each lane, apply the equivalent scalar operation on the corresponding two scalar values from each vector. This is commonly referred to as Single Instruction Multiple Data (SIMD).

Goals

Non-Goals

Motivation

Vector operations express a degree of parallelism that enables more work to be performed in a single instruction and thus can result in significant performance gains. For example, given two vectors each covering a sequence of eight integers (eight lanes), then the two vectors can be added together using a single hardware instruction. The vector addition hardware instruction operates on sixteen integers, performing eight integer additions, in the time it would ordinarily take to operate on two integers, performing one integer addition.

HotSpot supports auto-vectorization where scalar operations are transformed into superword operations, which are then mapped to vector hardware instructions. The set of transformable scalar operations are limited and fragile to changes in the code shape. Furthermore, only a subset of available vector hardware instructions might be utilized limiting the performance of generated code.

A developer wishing to write scalar operations that are reliably transformed into superword operations needs to understand HotSpot's auto-vectorization support and its limitations to achieve reliable and sustainable performance.

In some cases it may not be possible for the developer to write scalar operations that are transformable, for example HotSpot does not transform the simple scalar operations for calculating the hash code of an array (see the Arrays.hashCode method implementations in OpenJDK source code.), nor can it auto-vectorize code to lexicographically compare two arrays (which why an intrinsic was added to perform lexicographical comparison, see JDK-8033148).

The Vector API aims to address these issues by providing a mechanism to write complex vector algorithms in Java. There are also numerous domains where this API may be applicable such as machine learning, linear algebra, cryptography, finance, and usages within the JDK itself (two such examples were previously given).

Description

A vector is represented by the Vector API by the abstract class Vector<T, S>. The type variable T corresponds to the boxed type of scalar primitive integral or floating point element types covered by the vector. The type variable S corresponds to the shape of the vector. The shape defines the size (in bits, commonly fixed) of a vector, and will govern how an instance of Vector<T, S> is mapped to a vector hardware register when vector computations are compiled by the HotSpot C2 compiler (see later for a mapping from instances to x86 vector registers). The length of a vector (number of lanes or elements) will be the vector size divided by the element size.

The set of T supported will be Byte, Short, Int, Long, Float and Double corresponding to the scalar primitive types byte, short, int, long, float and double, respectively.

The set of S supported will be the concrete shape classes corresponding to vector sizes of 64, 128, 256, and 512 bits. A shape corresponding to a size of 512 bits can pack bytes into 64 lanes or pack ints into 16 lanes, and a vector of such a shape can operate on 64 bytes at a time, or 16 ints at a time.

The following table presents a mapping from Vector to x86 registers:

| Vector | x86 register | |------------------------------|--------------| | Vector<?, Shapes.S64Bit> | xmm? | | Vector<?, Shapes.S128Bit> | xmm? | | Vector<?, Shapes.S256Bit> | ymm? | | Vector<?, Shapes.S512Bit> | zmm? |

An instance of Vector<T, S> is immutable and is a value-based type that retains, by default, object identity invariants (see later for relaxation of these invariants).

Vector<T, S> declares a set of methods for common vector operations supported by all element types. Such operations can be classified into groups of unary (e.g. negation), binary (e.g. addition), and comparison (e.g. lessThan), with corresponding familiar scalar operational equivalents. Further operations are more specific to vectors such as intra-lane operations (e.g. permuting elements) , converting to a vector of a different type and/or shape (e.g. casting), or storing the vector elements into a byte[] array.

To support operations specific to an element type there are six public abstract sub-classes of Vector<T, S>, one for each supported element type, ByteVector<S>, ShortVector<S>, IntVector<S>, LongVector<S>, FloatVector<S>, and DoubleVector<S>. There are operations that are common to the integral sub-types, such as bitwise operations (e.g. logical xor), and operations common to the floating point types, such as mathematical operations (e.g. transcendental functions like cosine). Other operations are bound to the element type since the method signature refers to the element type (or the equivalent array type), such as reduction operations (e.g. sum all elements to a scalar value), or storing the vector elements to an array.

Concrete sub-types for concrete type parameters of both S and T are not public since there is no need to provide operations specific to the type and shape, further it reduces the API surface.

As a result instances of Vector<T, S> cannot be constructed directly. Instead instances are obtained via factories referred to as species, specifically via the abstract class Vector.Species<T, S> and the element type-specific abstract sub-classes IntVector.IntSpecies<S> etc, that mirror the type variables and class hierarchy of Vector<T, S>.

Species provides operations to obtain vector instances, such as the a vector instance whose elements are initiated to default values (the zero vector), or a vector from an array, in addition to providing the canonical support for converting to a vector of a different type and/or shape (e.g. casting) that a vector defers to.

To support control flow relevant vector operations will optionally accept masks, represented by the public abstract class Mask<T, S>. An instance of Mask<T, S> will have the same length as an instance of Vector<T, S> for the same S and T, and therefore the same number of lanes. Each element, of boolean value or bit, in a mask corresponds to a vector lane. When a mask is input to an operation it governs whether the operation is applied to lane, if the mask bit for the lane is set (is true), or alternatively other behaviour if the mask bit is not set (is false). Comparison operations produce masks, which can then be input to other operations to selectively disable the operation on certain lanes and thereby emulate flow control. Masks can also be obtained from species (just like vectors).

It is anticipated that masks will likely play an important role in the development of vector computations that are generic to shape (this is based on central importance of predicate registers, the equivalent of masks, in ARM Scalable Vector Extension)

Example

Here is a simple scalar computation over elements of arrays:

void scalarComputation(float[] a, float[] b, float[] c) {
   for (int i = 0; i < a.length; i++) { 
        c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
   }
}

(It is assumed that the array arguments will be of the same size.)

An explicit way to implement the equivalent vector computation using the Vector API is as follows:

static final FloatVector.FloatSpecies<Shapes.S256Bit> SPECIES =
        FloatVector.speciesInstance(Shapes.S_256_BIT);

void vectorComputation(float[] a, float[] b, float[] c) {
    int i = 0;
    for (; i < (a.length & ~(SPECIES.length() - 1)); i += SPECIES.length()) {
        FloatVector<Shapes.S256Bit> va = SPECIES.fromArray(a, i);
        FloatVector<Shapes.S256Bit> vb = SPECIES.fromArray(b, i);
        FloatVector<Shapes.S256Bit> vc = va.
                mul(va).
                add(vb.mul(vb)).
                neg();
        vc.intoArray(c, i);
    }

    for (; i < a.length; i++) {
        c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
    }
}

In this example the shape is hardcoded to vector size of 256 bits. The species (factory) is obtained from a static method and stored in a static final field, so the runtime compiler will treat the field's value as a constant and therefore able to better optimize the vector computation.

The vector computation comprises of a loop kernel iterating over the arrays in strides of vector lanes (the species length). The species is used to load float vectors from arrays a and b at the corresponding index, then the operations are performed (fluently), and finally the result is stored into array c.

The scalar computation after the vector computation is required to process the tail of elements, the length of which is smaller than the species length.

The HotSpot compiler should generate machine code similar to the following:

0.43%  ↗ │  0x0000000113d43890: vmovdqu 0x10(%r8,%rbx,4),%ymm0
  7.38%  │ │  0x0000000113d43897: vmovdqu 0x10(%r10,%rbx,4),%ymm1
  8.70%  │ │  0x0000000113d4389e: vmulps %ymm0,%ymm0,%ymm0
  5.60%  │ │  0x0000000113d438a2: vmulps %ymm1,%ymm1,%ymm1
 13.16%  │ │  0x0000000113d438a6: vaddps %ymm0,%ymm1,%ymm0
 21.86%  │ │  0x0000000113d438aa: vxorps -0x7ad76b2(%rip),%ymm0,%ymm0
  7.66%  │ │  0x0000000113d438b2: vmovdqu %ymm0,0x10(%r9,%rbx,4)
 26.20%  │ │  0x0000000113d438b9: add    $0x8,%ebx
  6.44%  │ │  0x0000000113d438bc: cmp    %r11d,%ebx
         ╰ │  0x0000000113d438bf: jl     0x0000000113d43890

This is actual output from a JMH micro-benchmark for the example code under test using a prototype of the Vector API and implementation (the vectorIntrinsics branch of Project Panama's development repository).

The hot areas of C2 generated machine code are presented. There is a clear translation to vector registers and vector hardware instructions. (Note loop unrolling was disabled make the translation clearer, otherwise HotSpot should be able to unroll using existing C2 loop optimization techniques.). All Java object allocations are elided.

It is an important goal to support more complex non-trivial vector computations that translate clearly into generated machine code.

There are, however, a few issues with this particular vector computation:

  1. hardcoded to a concrete vector shape, so the computation cannot adapt dynamically to maximal shape supported by the architecture (which may be smaller or larger than 256 bits), therefore the code is less portable and maybe less performant.

  2. calculation of the loop upper bounds, which although simple can be a common source of programming error.

  3. a scalar loop is required at the end, duplicating code.

The first two issues will be addressed by this JEP. A preferred species can be obtained whose shape is optimal for the current architecture, the vector computation can then be written with a generic shape, and a method on the species can round down the array length, for example:

static final FloatVector.FloatSpecies<?> SPECIES =
    FloatVector.speciesPreferredInstance();

<S extends Shape>
void vectorComputation(float[] a, float[] b, float[] c,
        FloatVector.FloatSpecies<S> species) {
    int i = 0;
    int upperBound = species.roundDown(a.length);
    for (; i < upperBound; i += species.length()) {
        FloatVector<S> va = species.fromArray(a, i);
        FloatVector<S> vb = species.fromArray(b, i);
        FloatVector<S> vc = va.
                mul(va).
                add(vb.mul(vb)).
                neg();
        vc.intoArray(c, i);
    }

    for (; i < a.length; i++) {
        c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
    }
}

vectorComputation(a, b, c, SPECIES);

The last issue will not be fully addressed by this JEP and will be the subject of follow on work. The solution is could be easily expressed in the API leveraging a mask as an additional loop variable, for example:

<S extends Shape>
void vectorComputation(float[] a, float[] b, float[] c,
        FloatVector.FloatSpecies<S> species) {
    int i = 0;
    Vector.Mask<Float, S> m = species.maskFromBounds(0, length);
    for (int i = 0;
         i < a.length;
         i += species.length(), m = species.maskFromBounds(i, length)) {
        FloatVector<S> va = species.fromArray(a, i, m);
        FloatVector<S> vb = species.fromArray(b, i, m);
        FloatVector<S> vc = va.
                mul(va).
                add(vb.mul(vb)).
                neg();
        vc.intoArray(c, i, m);
    }
}

The mask will ensure that loads and stores will not result in out of bounds exceptions (in this case the calculation is such that masks do not need to be provided to the other vector operations). However, it anticipated this will require additional runtime compiler support to generate efficient code across a variety of x86 architectures.

HotSpot C2 compiler implementation details

The Vector API has two implementations in order to adhere to the project goals. The first implements operations in Java, thus is functional but not optimal. The second makes intrinsic, for the HotSpot C2 compiler, those operations with special treatment for Vector API types. This allows for proper translation to x86 registers and instructions for the case where architecture support and implementation for translation exists.

The intrinsification process for the Vector API will work by translating Vector API method calls to C2 IR Nodes that represent appropriate intended semantics. For example, for FloatVector<Shapes.S256Bit>.add, the C2 compiler will replace the call with nodes AddVFNode + VectorBox. The AddVFNode represents addition of two float vectors while the VectorBox represents the boxing portion to create a valid object (namely add on two Vector objects will produce a resulting Vector object). This way, the vector operation can be decoupled from the object creation, so in cases where the object does not need to exist, it can be eliminated.

The IR nodes generated by intrinsification will overlap with the IR nodes used by vectorizer. However, because the Vector API will support a much larger set of operations, additional IR nodes will be added as needed. In order to keep the newly added nodes to a minimum, new nodes will no longer encode the type in the operand name. For example, the VectorBlend node supports blending and masking operations. There is no VectorBlendI node for int vectors, all type information is simply encoded using existing type system (TypeVect) which encodes element type along with shape.

It is intended that for all of the vector operations defined by the API, there will be translation implemented that will allow use of x86 instructions on some x86 architectures. For example, ByteVector<Shapes.S256Bit>.blend will translate to vpblendvb (AVX2) where as ByteVector<Shapes.S512Bit>.blend will translate to vpblendmb (AVX512). The translation may be non-optimal. If ByteVector<Shapes.S512Bit>.blend is used on a system that only supports AVX2, no translation will occur and instead the default Java implementation will be used. That said, the Vector API provides the Vector.preferredSpeciesInstance method in order to query for the appropriate vector size to use. Behind the scenes, this method is implemented by calling into Matcher::vector_width_in_bytes so that this value is dynamically computed depending on the system. The returned species of this method can be used for generically sized vector computations so no concrete shape needs be declared.

The set of operations on Vector, Species and Mask will be selected for their applicability for C2 intrinsification on x86 architectures. Additional non-intrinsified operations may be placed off to the side in helper classes.

To avoid an explosion of intrinsics added to C2, a set of intrinsics will be defined that correspond to operation kinds, such as binary, unary, comparison, and so on, where constant arguments are passed describing operation specifics. Approximately ten new intrinsics will be needed to support intrinsification of all parts of the API.

The C2 compiler will have special knowledge of the Vector, Species and Mask types and all the sub-types. This will enable C2 to map instances of Vector, to vector registers, and aggressively elide allocations when such instances do not escape. C2 will also have knowledge for treatment of vector registers and vector objects at safepoints so that it can safely save them and also safely reconstruct Vector objects. Special attention will taken to ensure, by default, object semantics (such as identity) are preserved when an instance escapes or needs to be materialized as Vector object.

Vector instances are value-based, morally values where identity-sensitive operations should be avoided. This potentially limits the set of applicable optimizations, specifically due to the limitations of escape analysis. A flag will be provided to enable Vector instances to have no identity and thereby support more aggressive optimizations such as lazy materialization at a safepoint. When value types are fully supported by the Java language and runtime (see Project Valhalla) then concrete Vector classes can be made value types and it is anticipated such a flag and many optimizations will no longer be required.

Mask support will require careful attention on x86 architectures since there are two kinds of representation, a vector register representation or an opmask register representation (for AVX512), and different instructions will take one or the other. In the initial implementation, it is expected that all masks will be represented as vector registers even for AVX512. This means that native masking via opmask (or k) registers will not be supported in the first implementation.

Alternatives

HotSpot's auto-vectorization is an alternative approach but it would require significant enhancement and would likely still be fragile and limited compared to using the Vector API, since auto-vectorizations with complex control flow is very hard to perform.

Testing

Combinatorial unit tests will be developed to ensure coverage for all operations, for all supported types and shapes, over various data sets. The tests will be implemented with TestNG and will be exercisable via jtreg.

Performance tests will be developed to ensure performance gaols are met and vector computations map efficiently to vector hardware instructions. This will likely consistent of JMH micro-benchmarks but more realistic examples of useful algorithms will also be required.

Risks and Assumptions

There is a risk that the API will be biased to the SIMD functionality supported on x86 architectures, more so to the explicitly fixed set of supported shapes, as opposed to the set of operations, since that will bias against coding algorithms generically to shape. To mitigate that risk other architectures will be taken into account, specifically the ARM Scalar Vector Extension architecture whose programming model adjusts dynamically to the singular fixed shape supported by the hardware.

Dependences

The Vector API will benefit significantly from value types when ready (see Project Valhalla). Instances of a Vector<T, S> can be values, whose concrete classes are value types. This will make it easier to optimize and express vector computations. Sub-types of Vector<T, S> for specific types, such as IntVector<S>, will no longer be required with generic specialization over values and type-specific method declaration.