JEP 338: Vector API (Incubator)

Authors: Vladimir Ivanov, Razvan Lupusoru, Paul Sandoz, Sandhya Viswanathan
Owner: Vivek Deshpande
Component: hotspot / compiler
Discussion: panama dash dev at openjdk dot java dot net
Reviewed by: John Rose
Created: 2018/04/06 22:58
Updated: 2019/07/16 22:27

Summary


Provide an initial iteration of an incubator module, jdk.incubator.vector, to express vector computations that reliably compile at runtime to optimal vector hardware instructions on supported CPU architectures and thus achieve superior performance to equivalent scalar computations.

Motivation




Vector computations consist of a sequence of operations on vectors. A vector comprises a (usually) fixed sequence of scalar values, where the number of scalar values corresponds to the number of hardware-defined vector lanes. A binary operation applied to two vectors with the same number of lanes would, for each lane, apply the equivalent scalar operation on the corresponding two scalar values from each vector. This is commonly referred to as Single Instruction Multiple Data (SIMD).

Vector operations express a degree of parallelism that enables more work to be performed in a single CPU cycle and thus can result in significant performance gains. For example, given two vectors each covering a sequence of eight integers (eight lanes), then the two vectors can be added together using a single hardware instruction. The vector addition hardware instruction operates on sixteen integers, performing eight integer additions, in the time it would ordinarily take to operate on two integers, performing one integer addition.
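The lane-wise semantics described above can be modeled in plain Java. The following sketch is not the Vector API: it is a scalar model in which an int[] of length eight stands in for an eight-lane vector, and the names LaneWiseModel and add are hypothetical.

```java
import java.util.Arrays;

final class LaneWiseModel {
    // Scalar model of a lane-wise binary operation: lane i of the result
    // depends only on lane i of each input vector.
    static int[] add(int[] va, int[] vb) {
        if (va.length != vb.length) {
            throw new IllegalArgumentException("vectors must have the same number of lanes");
        }
        int[] vc = new int[va.length];
        for (int lane = 0; lane < va.length; lane++) {
            vc[lane] = va[lane] + vb[lane];
        }
        return vc;
    }

    public static void main(String[] args) {
        int[] va = {1, 2, 3, 4, 5, 6, 7, 8};
        int[] vb = {10, 20, 30, 40, 50, 60, 70, 80};
        // Eight additions; vector hardware could perform them with one instruction.
        System.out.println(Arrays.toString(add(va, vb)));
    }
}
```

Because no lane depends on any other lane, the eight additions in the loop body are independent, which is what allows hardware to execute them together.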

HotSpot supports auto-vectorization where scalar operations are transformed into superword operations, which are then mapped to vector hardware instructions. The set of transformable scalar operations is limited and fragile to changes in the code shape. Furthermore, only a subset of available vector hardware instructions might be utilized, limiting the performance of generated code.

A developer wishing to write scalar operations that are reliably transformed into superword operations needs to understand HotSpot's auto-vectorization support and its limitations to achieve reliable and sustainable performance.

In some cases it may not be possible for the developer to write scalar operations that are transformable. For example, HotSpot does not transform the simple scalar operations for calculating the hash code of an array (see the Arrays.hashCode method implementations in the JDK source code), nor can it auto-vectorize code to lexicographically compare two arrays (which is why an intrinsic was added to perform lexicographical comparison, see JDK-8033148).
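To make the hash code case concrete, the sketch below shows the shape of such a scalar loop. It follows the polynomial documented for Arrays.hashCode(int[]); the class and method names are hypothetical. Each iteration needs the value of h produced by the previous one, which is plausibly part of what makes the loop hard to transform automatically.

```java
import java.util.Arrays;

final class HashLoopSketch {
    // Same polynomial as the one documented for Arrays.hashCode(int[]):
    // h = 31 * h + element, starting from 1 (and 0 for a null array).
    static int hash(int[] a) {
        if (a == null) {
            return 0;
        }
        int h = 1;
        for (int x : a) {
            h = 31 * h + x; // depends on h from the previous iteration
        }
        return h;
    }

    public static void main(String[] args) {
        int[] data = {3, 1, 4, 1, 5, 9, 2, 6};
        System.out.println(hash(data) == Arrays.hashCode(data));
    }
}
```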

The Vector API aims to address these issues by providing a mechanism to write complex vector algorithms in Java, using pre-existing support in HotSpot for vectorization, but with a user model which makes vectorization far more predictable and robust. Hand-coded vector loops can express high-performance algorithms (such as vectorized hashCode or specialized array comparison) which an auto-vectorizer may never optimize. There are numerous domains where this explicitly vectorizing API may be applicable such as machine learning, linear algebra, cryptography, finance, and usages within the JDK itself.

Description


A vector will be represented by the abstract class Vector<E>. The type variable E corresponds to the boxed type of scalar primitive integral or floating point element types covered by the vector. A vector also has a shape, which defines the size, in bits, of the vector. The shape of the vector will govern how an instance of Vector<E> is mapped to a vector hardware register when vector computations are compiled by the HotSpot C2 compiler (see later for a mapping from instances to x64 vector registers). The length of a vector (number of lanes or elements) will be the vector size divided by the element size.

The set of element types (E) supported will be Byte, Short, Integer, Long, Float and Double corresponding to the scalar primitive types byte, short, int, long, float and double, respectively.

The set of shapes supported will correspond to vector sizes of 64, 128, 256, and 512 bits. A shape corresponding to a size of 512 bits can pack bytes into 64 lanes or pack ints into 16 lanes, and a vector of such a shape can operate on 64 bytes at a time, or 16 ints at a time.
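The relationship between shape, element size and vector length is simple arithmetic. The following sketch computes lane counts in plain Java; LaneCount and lanes are hypothetical names, not part of the API.

```java
final class LaneCount {
    // Vector length (number of lanes) = vector size in bits / element size in bits.
    static int lanes(int shapeBits, int elementBits) {
        if (elementBits <= 0 || shapeBits % elementBits != 0) {
            throw new IllegalArgumentException("shape must be a multiple of the element size");
        }
        return shapeBits / elementBits;
    }

    public static void main(String[] args) {
        int[] shapes = {64, 128, 256, 512};
        for (int bits : shapes) {
            System.out.println(bits + "-bit shape: "
                    + lanes(bits, Byte.SIZE) + " byte lanes, "
                    + lanes(bits, Integer.SIZE) + " int lanes, "
                    + lanes(bits, Double.SIZE) + " double lanes");
        }
    }
}
```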

(Note: We believe that these simple shapes are generic enough to be useful on all platforms supporting the Vector API. However, as we experiment during the incubation of this JEP with future platforms, we may further modify the design of the shape parameter. Such work is not in the early scope of this JEP, but these possibilities partly inform the present role of shapes in the Vector API. See the "Future Work" section.)

The combination of element type and shape determines the vector's species, represented by VectorSpecies<E>.

An instance of Vector<E> is immutable and is a value-based type that retains, by default, object identity invariants (see later for relaxation of these invariants).

Operations on vectors can be classified as lane-wise and cross-lane. Lane-wise operations can be further classified as unary, binary, ternary and comparison. Cross-lane operations can be classified as permutation, conversion and reduction. To reduce the surface of the API, we will define collective methods for each class of operation which take an operator as input. The supported operators are instances of the Operator class and are defined as static final fields in the VectorOperators class. Some common operations (e.g. add, mul), called full-service operations, will have dedicated methods which can be used in place of the generic methods.
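The split between collective methods that take an operator and dedicated full-service methods can be illustrated with a plain-Java model. This is only a sketch of the design idea: int[] stands in for a vector, java.util.function.IntBinaryOperator stands in for the operator type, and the names OperatorModel, lanewise, reduceLanes and add are chosen for illustration rather than taken from the API.

```java
import java.util.Arrays;
import java.util.function.IntBinaryOperator;

final class OperatorModel {
    // Collective lane-wise binary method: one entry point, many operators.
    // Assumes both inputs have the same number of lanes.
    static int[] lanewise(IntBinaryOperator op, int[] va, int[] vb) {
        int[] vc = new int[va.length];
        for (int lane = 0; lane < va.length; lane++) {
            vc[lane] = op.applyAsInt(va[lane], vb[lane]);
        }
        return vc;
    }

    // Collective cross-lane reduction: combines all lanes into one scalar.
    static int reduceLanes(IntBinaryOperator op, int identity, int[] v) {
        int acc = identity;
        for (int x : v) {
            acc = op.applyAsInt(acc, x);
        }
        return acc;
    }

    // "Full-service" convenience method layered on the collective one.
    static int[] add(int[] va, int[] vb) {
        return lanewise(Integer::sum, va, vb);
    }

    public static void main(String[] args) {
        int[] va = {1, 5, 3, 7};
        int[] vb = {4, 2, 6, 0};
        System.out.println(Arrays.toString(lanewise(Math::max, va, vb)));
        System.out.println(Arrays.toString(add(va, vb)));
        System.out.println(reduceLanes(Integer::sum, 0, va));
    }
}
```

With this arrangement a new operator adds a constant, not a method, which is how the API surface stays small.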

Certain operations on vectors, such as lane-wise cast and reinterpret, can be said to be inherently "shape-changing". Having shape-changing operations in a vector computation could have unintended effects on portability and performance. For this reason, wherever applicable, the API will define an additional shape-invariant flavor of such an operation. Users are encouraged to write shape-invariant code using the shape-invariant flavor of operations. Additionally, shape-changing operations will be clearly called out in the Javadoc.

Vector<E> declares a set of methods for common vector operations supported by all element types.
To support operations specific to an element type there are six abstract sub-classes of Vector<E>, one for each supported element type, ByteVector, ShortVector, IntVector, LongVector, FloatVector, and DoubleVector. These sub-classes define additional operations which are bound to the element type since the method signature refers to the element type (or the equivalent array type), such as reduction operations (e.g. sum all elements to a scalar value), or storing the vector elements to an array. They also define additional full-service operations that are specific to the integral sub-types, such as bitwise operations (e.g. logical or), and operations specific to the floating point types, such as mathematical operations (e.g. transcendental functions like pow()).

These classes are further extended by concrete sub-classes defined for different shapes (sizes) of vectors.

The following table presents the concrete vector classes and their mapping to x64 registers:

| Vector                                                                                       | x64 register |
| Byte64Vector,  Short64Vector,  Int64Vector,  Long64Vector,  Float64Vector,  Double64Vector   | xmm?         |
| Byte128Vector, Short128Vector, Int128Vector, Long128Vector, Float128Vector, Double128Vector  | xmm?         |
| Byte256Vector, Short256Vector, Int256Vector, Long256Vector, Float256Vector, Double256Vector  | ymm?         |
| Byte512Vector, Short512Vector, Int512Vector, Long512Vector, Float512Vector, Double512Vector  | zmm?         |

These classes are non-public since there is no need to provide operations specific to the type and shape. This reduces the API surface to a sum of concerns rather than a product. As a result, instances of concrete Vector classes cannot be constructed directly. Instead, instances are obtained via factory methods defined in the base Vector<E> and its type-specific sub-classes. These methods take as input the species of the desired vector instance. The factory methods provide different ways to obtain vector instances, such as the vector instance whose elements are initialized to default values (the zero vector), or a vector from an array, in addition to providing the canonical support for converting between vectors of different types and/or shapes (e.g. casting).

To support control flow, relevant vector operations will optionally accept masks, represented by the public abstract class VectorMask<E>. Each element in a mask, a boolean value or bit, corresponds to a vector lane. When a mask is an input to an operation it governs whether the operation is applied to each lane; the operation is applied if the mask bit for the lane is set (is true). Alternative behavior occurs if the mask bit is not set (is false).
Similar to vectors, instances of VectorMask<E> are instances of (private) concrete sub-classes defined for each element type and length combination. The instance of VectorMask<E> used in an operation should have the same type and length as the instance(s) of Vector<E> involved in the operation. Comparison operations produce masks, which can then be input to other operations to selectively disable the operation on certain lanes and thereby emulate flow control. Another way of creating masks is to use static factory methods in VectorMask<E>.
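The way a comparison feeds a masked operation can be modeled in plain Java. In this sketch a boolean[] stands in for a mask and an int[] for a vector; the names are hypothetical. The "alternative behavior" for unset lanes is assumed here to be "keep the lane of the first input", which is an assumption of this model and not something the text above specifies.

```java
import java.util.Arrays;

final class MaskModel {
    // Comparison: produces one mask bit per lane.
    static boolean[] lessThan(int[] va, int[] vb) {
        boolean[] m = new boolean[va.length];
        for (int lane = 0; lane < va.length; lane++) {
            m[lane] = va[lane] < vb[lane];
        }
        return m;
    }

    // Masked add: applied only where the mask bit is set.
    // Assumption of this model: unset lanes keep the value from va.
    static int[] addMasked(int[] va, int[] vb, boolean[] m) {
        int[] vc = new int[va.length];
        for (int lane = 0; lane < va.length; lane++) {
            vc[lane] = m[lane] ? va[lane] + vb[lane] : va[lane];
        }
        return vc;
    }

    public static void main(String[] args) {
        int[] va = {1, 5, 3, 7};
        int[] vb = {4, 2, 6, 0};
        // Emulates: if (a[i] < b[i]) a[i] += b[i]; without a per-lane branch.
        boolean[] m = lessThan(va, vb);
        System.out.println(Arrays.toString(addMasked(va, vb, m)));
    }
}
```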

It is anticipated that masks will likely play an important role in the development of vector computations that are generic to shape. (This is based on the central importance of predicate registers, the equivalent of masks, in the ARM Scalable Vector Extensions as well as in Intel's AVX-512.)


Here is a simple scalar computation over elements of arrays:

void scalarComputation(float[] a, float[] b, float[] c) {
    for (int i = 0; i < a.length; i++) {
        c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
    }
}

(It is assumed that the array arguments will be of the same size.)

An explicit way to implement the equivalent vector computation using the Vector API is as follows:

Example 1:

static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_256;

void vectorComputation(float[] a, float[] b, float[] c) {
    for (int i = 0; i < a.length; i += SPECIES.length()) {
        // Mask off lanes that would fall past the end of the arrays.
        var m = SPECIES.indexInRange(i, a.length);
        // FloatVector va, vb, vc;
        var va = FloatVector.fromArray(SPECIES, a, i, m);
        var vb = FloatVector.fromArray(SPECIES, b, i, m);
        var vc = va.mul(va).
                    add(vb.mul(vb)).
                    neg();
        vc.intoArray(c, i, m);
    }
}

In this example, a species for a 256-bit wide vector of floats is obtained from FloatVector. The species is stored in a static final field so the runtime compiler will treat the field's value as a constant and therefore be able to better optimize the vector computation.

The vector computation features a main loop kernel iterating over the arrays in strides of vector length (the species length). The static method fromArray() loads float vectors of the given species from arrays a and b at the corresponding index. Then the operations are performed (fluently), and finally the result is stored into array c.

We use masks, generated by indexInRange(), to prevent reading/writing past the array length. The first floor(a.length / SPECIES.length()) iterations will have a mask with all lanes set. Only the final iteration (if a.length is not a multiple of SPECIES.length()) will have a mask with only the first a.length % SPECIES.length() lanes set.
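The mask pattern described above can be checked with a small plain-Java model. The sketch assumes that a lane is set exactly when its array index, offset + lane, lies in [0, limit); the class name and the extra lanes parameter are hypothetical and stand in for the species length.

```java
import java.util.Arrays;

final class IndexInRangeModel {
    // Model of a range mask: lane is set when offset + lane is a valid index.
    static boolean[] indexInRange(int offset, int limit, int lanes) {
        boolean[] m = new boolean[lanes];
        for (int lane = 0; lane < lanes; lane++) {
            int index = offset + lane;
            m[lane] = index >= 0 && index < limit;
        }
        return m;
    }

    // Counts the set lanes of a mask.
    static int trueCount(boolean[] m) {
        int n = 0;
        for (boolean b : m) {
            if (b) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        int length = 19; // array length, not a multiple of the lane count
        int lanes = 8;   // e.g. a 256-bit vector of floats
        for (int i = 0; i < length; i += lanes) {
            System.out.println(i + ": " + Arrays.toString(indexInRange(i, length, lanes)));
        }
    }
}
```

For an array length of 19 and eight lanes, the first two iterations see a full mask and the last one sees 19 % 8 = 3 set lanes.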

Since a mask is used in all iterations, the above implementation may not achieve optimal performance (for large array lengths). The same computation can be implemented without masks as follows:

Example 2:

static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_256;

void vectorComputation(float[] a, float[] b, float[] c) {
    int i = 0;
    for (; i < (a.length & ~(SPECIES.length() - 1));
           i += SPECIES.length()) {
        // FloatVector va, vb, vc;
        var va = FloatVector.fromArray(SPECIES, a, i);
        var vb = FloatVector.fromArray(SPECIES, b, i);
        var vc = va.mul(va).
                    add(vb.mul(vb)).
                    neg();
        vc.intoArray(c, i);
    }

    // Scalar tail for the remaining elements.
    for (; i < a.length; i++) {
        c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
    }
}

The tail elements, the length of which is smaller than the species length, are processed using the scalar computation after the vector computation. Another way to process the tail elements is using a single masked vector computation.

When operating on large arrays, the implementation above achieves optimal performance.
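The upper bound of the vector loop in example 2, a.length & ~(SPECIES.length() - 1), rounds the array length down to a multiple of the species length. The plain-Java sketch below (hypothetical names) shows that arithmetic next to the equivalent remainder form. Note that the bit-mask form is only valid when the lane count is a power of two.

```java
final class LoopBoundModel {
    // Round down using a bit mask; requires lanes to be a power of two.
    static int boundByMask(int length, int lanes) {
        if (Integer.bitCount(lanes) != 1) {
            throw new IllegalArgumentException("lanes must be a power of two");
        }
        return length & ~(lanes - 1);
    }

    // Round down using the remainder; works for any positive lane count.
    static int boundByRemainder(int length, int lanes) {
        return length - length % lanes;
    }

    public static void main(String[] args) {
        int lanes = 8;
        for (int length : new int[]{0, 7, 8, 19, 64}) {
            System.out.println(length + " -> " + boundByMask(length, lanes)
                    + " vector elements, " + (length - boundByMask(length, lanes))
                    + " tail elements");
        }
    }
}
```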

Continuing with example 2 presented above, the HotSpot compiler should generate machine code similar to the following:

  0.43%  / │  0x0000000113d43890: vmovdqu 0x10(%r8,%rbx,4),%ymm0
  7.38%  │ │  0x0000000113d43897: vmovdqu 0x10(%r10,%rbx,4),%ymm1
  8.70%  │ │  0x0000000113d4389e: vmulps %ymm0,%ymm0,%ymm0
  5.60%  │ │  0x0000000113d438a2: vmulps %ymm1,%ymm1,%ymm1
 13.16%  │ │  0x0000000113d438a6: vaddps %ymm0,%ymm1,%ymm0
 21.86%  │ │  0x0000000113d438aa: vxorps -0x7ad76b2(%rip),%ymm0,%ymm0
  7.66%  │ │  0x0000000113d438b2: vmovdqu %ymm0,0x10(%r9,%rbx,4)
 26.20%  │ │  0x0000000113d438b9: add    $0x8,%ebx
  6.44%  │ │  0x0000000113d438bc: cmp    %r11d,%ebx
         \ │  0x0000000113d438bf: jl     0x0000000113d43890

This is actual output from a JMH micro-benchmark for the example code under test using a prototype of the Vector API and implementation (the vectorIntrinsics branch of Project Panama's development repository).

The hot areas of C2 generated machine code are presented. There is a clear translation to vector registers and vector hardware instructions. (Note that loop unrolling was disabled to make the translation clearer; otherwise HotSpot should be able to unroll using existing C2 loop optimization techniques.) All Java object allocations are elided.
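One step of the translation may be less obvious than the others: the multiplication by -1.0f in the scalar code appears in the listing as a vxorps rather than a multiply. A likely reading, assuming the memory operand of that instruction holds a sign-bit mask in each lane (the listing does not show the constant), is that negation is done by flipping the IEEE 754 sign bit. The plain-Java sketch below demonstrates that identity for a single float; the class name is hypothetical.

```java
final class NegateBySignBit {
    // Negates a float by flipping only its sign bit (bit 31).
    static float negate(float x) {
        return Float.intBitsToFloat(Float.floatToRawIntBits(x) ^ 0x8000_0000);
    }

    public static void main(String[] args) {
        System.out.println(negate(1.5f));
        System.out.println(negate(-2.0f));
        System.out.println(negate(0.0f));
    }
}
```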

It is an important goal to support more complex non-trivial vector computations that translate clearly into generated machine code.

There are, however, a few issues with this particular vector computation:

  1. The loop is hardcoded to a concrete vector shape, so the computation cannot adapt dynamically to a maximal shape supported by the architecture (which may be smaller or larger than 256 bits). Therefore the code is less portable and may be less performant.

  2. Calculation of the loop upper bounds, although simple here, can be a common source of programming error.

  3. A scalar loop is required at the end, duplicating code.

The first two issues will be addressed by this JEP. A preferred species can be obtained whose shape is optimal for the current architecture, the vector computation can then be written with a generic shape, and a method on the species can round down the array length, for example:

static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

void vectorComputation(float[] a, float[] b, float[] c,
        VectorSpecies<Float> species) {
    int i = 0;
    int upperBound = species.loopBound(a.length);
    for (; i < upperBound; i += species.length()) {
        // FloatVector va, vb, vc;
        var va = FloatVector.fromArray(species, a, i);
        var vb = FloatVector.fromArray(species, b, i);
        var vc = va.mul(va).
                    add(vb.mul(vb)).
                    neg();
        vc.intoArray(c, i);
    }

    // Scalar tail for the remaining elements.
    for (; i < a.length; i++) {
        c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
    }
}

vectorComputation(a, b, c, SPECIES);

The last issue will not be fully addressed by this JEP and will be the subject of follow-on work. As shown in example 1 earlier, you can use masks to implement a vector computation without tail processing. It is anticipated that such masked loops will work well for a range of architectures, including x64 and ARM, but will require additional runtime compiler support to generate maximally efficient code. Such work on masked loops, though important, is beyond the scope of this JEP.

HotSpot C2 compiler implementation details

The Vector API has two implementations in order to adhere to the project goals. The first implements operations in Java, thus it is functional but not optimal. The second intrinsifies those operations for the HotSpot C2 compiler, with special treatment for Vector API types. This allows for proper translation to x64 registers and instructions in cases where architecture support and an implementation of the translation exist.

The intrinsification process for the Vector API will work by translating Vector API method calls to C2 IR nodes that represent the intended semantics. For example, for Float256Vector.add, the C2 compiler will replace the call with an AddVF node plus a VectorBox node. The AddVF represents addition of two float vectors while the VectorBox represents the boxing portion to create a valid object. (Thus add on two Vector objects will produce a resulting Vector object.) This way, object creation (if any) is submerged under the vector operation, so in cases where the object does not need to exist, it can be eliminated.

The IR nodes generated by intrinsification will overlap with the IR nodes used by the vectorizer. However, because the Vector API will support a much larger set of operations, additional IR nodes will be added as needed. In order to keep the newly added nodes to a minimum, new nodes will no longer encode the type in the operand name. For example, the VectorBlend node supports blending and masking operations. There is no VectorBlendI node for int vectors. Instead, the extra type information is simply encoded using the existing type system (TypeVect), which encodes element type along with shape.

It is intended that for all of the vector operations defined by the API, there will be a translation implemented that will allow use of x64 instructions on some x64 architectures. For example, Byte256Vector.blend will translate to vpblendvb (AVX2) whereas Byte512Vector.blend will translate to vpblendmb (AVX-512). The translation may be non-optimal. If Byte512Vector.blend is used on a system that only supports AVX2, no translation will occur and instead the default Java implementation will be used. That said, the type-specific vector classes provide the SPECIES_PREFERRED field corresponding to the appropriate vector size to use. Behind the scenes, this field is set by calling into Matcher::vector_width_in_bytes so that this value is dynamically computed depending on the system. This species can be used for generically sized vector computations so no concrete species needs to be declared.

The set of operations on Vector, VectorSpecies and VectorMask will be selected for their applicability for C2 intrinsification on x64 architectures. Additional non-intrinsified operations may be placed off to the side in helper classes. In future work, these divisions may be adjusted in order to provide a more fully platform-agnostic API.

To avoid an explosion of intrinsics added to C2, a set of intrinsics will be defined that correspond to operation kinds, such as binary, unary, comparison, and so on, where constant arguments are passed describing operation specifics. Approximately ten new intrinsics will be needed to support intrinsification of all parts of the API.

The C2 compiler will have special knowledge of the Vector, VectorSpecies and VectorMask types and all the sub-types. This will enable C2 to map instances of Vector to vector registers and aggressively elide allocations when such instances do not escape. C2 will also have knowledge for treatment of vector registers and vector objects at safepoints so that it can safely save them and also safely reconstruct Vector objects. Special attention will be taken to ensure that, by default, object semantics (such as identity) are preserved when an instance escapes or needs to be materialized as a reference to a Vector object.

Vector instances are value-based, morally values where identity-sensitive operations should be avoided. This potentially limits the set of applicable optimizations, specifically due to the limitations of escape analysis. A flag will be provided to enable Vector instances to have no guaranteed identity and thereby support more aggressive optimizations such as lazy materialization at a safepoint. When value types are fully supported by the Java language and runtime (see Project Valhalla) then concrete Vector classes can be made value types and it is anticipated such a flag and many optimizations will no longer be required.

Mask support will require careful attention on x64 architectures since there are two kinds of representations, a vector register representation or an opmask register representation (for AVX-512), and different instructions will take one or the other. In the initial implementation, it is expected that all masks will be represented as vector registers even for AVX-512. This means that native masking via opmask (or k) registers will not be supported in the first implementation. Platforms like AVX-512 and ARM SVE motivate our treatment of Mask as a special type rather than as an ordinary combination of Vector and boolean types.
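The two kinds of mask representation can be modeled in plain Java to show that they carry the same information. In this sketch (hypothetical names, not a description of how C2 implements masks) the vector-register form is an int[] whose lanes are all ones (-1) or all zeros, and the opmask form is one bit per lane packed into a long. A blend over the vector-register form can then be written with bitwise operations alone.

```java
import java.util.Arrays;

final class MaskRepresentations {
    // Opmask-style bits -> vector-register-style lanes (-1 = set, 0 = unset).
    // Assumes lanes <= 64.
    static int[] toLaneMask(long bits, int lanes) {
        int[] m = new int[lanes];
        for (int lane = 0; lane < lanes; lane++) {
            m[lane] = ((bits >>> lane) & 1L) != 0 ? -1 : 0;
        }
        return m;
    }

    // Vector-register-style lanes -> opmask-style bits.
    static long toBitMask(int[] laneMask) {
        long bits = 0L;
        for (int lane = 0; lane < laneMask.length; lane++) {
            if (laneMask[lane] != 0) {
                bits |= 1L << lane;
            }
        }
        return bits;
    }

    // Blend: take b where the mask lane is set, a elsewhere.
    static int[] blend(int[] a, int[] b, int[] laneMask) {
        int[] r = new int[a.length];
        for (int lane = 0; lane < a.length; lane++) {
            r[lane] = (a[lane] & ~laneMask[lane]) | (b[lane] & laneMask[lane]);
        }
        return r;
    }

    public static void main(String[] args) {
        int[] m = toLaneMask(0b0101L, 4);
        System.out.println(Arrays.toString(m));
        System.out.println(Long.toBinaryString(toBitMask(m)));
        System.out.println(Arrays.toString(
                blend(new int[]{1, 2, 3, 4}, new int[]{10, 20, 30, 40}, m)));
    }
}
```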

Future Work

The Vector API will benefit significantly from value types when ready (see Project Valhalla). Instances of a Vector<E> can be values, whose concrete classes are value types. This will make it easier to optimize and express vector computations. Sub-types of Vector<E> for specific types, such as IntVector, will no longer be required with generic specialization over values and type-specific method declaration. A shift to value types is thought to be backward compatible, perhaps after recompilation of Vector API code. Some abstract classes may need conversion to interfaces, if they are supers of value types.

A future version of the Vector API may make use of enhanced generics, as noted above.

It is expected that the API will incubate over multiple releases of the JDK and will adapt as dependent features such as value types become available in a future JDK release and newer CPU architectures become more established in the industry.

API points for loop control, loop boundary processing, and active set maintenance are likely to be added or refined in a future version of this API. Additional vector shapes with intrinsic masks or lengths, or synthetic tandem vector types (vector pairs) may be introduced if they are found to help with loop management. Methods for alignment control may also be introduced, if they show benefits in portability or performance.

Scatter and gather operations which can traverse managed heap pointers may be introduced in the future, if a portable and type-safe way can be found to express them (such as VarHandles). This would allow workloads to be accessed directly in Java objects, instead of being buffered through Java arrays or byte buffers.

Additional vector sizes and shapes may be supported in a future version of this API, in a follow-on JEP or perhaps during incubation. In principle the API could express additional vector shape properties besides bit-size, such as whether a vector is dense or not, whether it possesses an intrinsic mask, whether and how it may be dynamically sized, whether the size is a power of two, etc.

A future version of this API may introduce additional, non-primitive lane types such as short floats (useful for machine learning) or very long integers (useful for cryptography), along with relevant specialized operations. Such types tend to be hardware-specific, and so a challenge of specifying such API points is either making them portable, or else properly scoping them to machine-specific instances of the JDK.

Alternatives


HotSpot's auto-vectorization is an alternative approach but it would require significant enhancement and would likely still be fragile and limited compared to using the Vector API, since auto-vectorization with complex control flow is very hard to perform.

In general, and even after decades of research (especially for FORTRAN and C array loops), it seems that auto-vectorization of scalar code is not a reliable tactic for optimizing ad hoc user-written loops, unless the user pays unusually careful attention to unwritten contracts about exactly which loops a compiler is prepared to auto-vectorize. It's too easy to write a loop that fails to auto-vectorize, for a reason that only the optimizer can detect, and not the human reader. Years of work on auto-vectorization (even in HotSpot) have left us with lots of optimization machinery that works only on special occasions. We want to enjoy the use of this machinery more often!

Testing


Combinatorial unit tests will be developed to ensure coverage for all operations, for all supported types and shapes, over various data sets. The tests will be implemented with TestNG and will be exercisable via jtreg.

Performance tests will be developed to ensure performance goals are met and vector computations map efficiently to vector hardware instructions. These will likely consist of JMH micro-benchmarks, but more realistic examples of useful algorithms will also be required.

As a backup to performance tests, we will create white-box tests to force the JIT to report to us that vector API source code did, in fact, trigger vectorization.

Risks and Assumptions

There is a risk that the API will be biased to the SIMD functionality supported on x64 architectures. This applies mainly to the explicitly fixed set of supported shapes, which bias against coding algorithms in a shape-generic fashion. We consider the majority of other operations of the Vector API to bias toward portable algorithms. To mitigate that risk other architectures will be taken into account, specifically the ARM Scalable Vector Extension architecture whose programming model adjusts dynamically to the singular fixed shape supported by the hardware. We welcome and encourage OpenJDK contributors working on the ARM specific areas of HotSpot to participate in this effort.

The Vector API uses box types (like Integer) as proxies for primitive types (like int). This decision is forced by the current limitations of Java generics (which are hostile to primitive types). When Project Valhalla eventually introduces more capable generics, the current decision will seem awkward, and may need changing. We assume that such changes will be possible without excessive backwards incompatibility.