JEP draft: JEP xxx: Vector API (Second Incubator)

Owner: Paul Sandoz
Type: Feature
Scope: JDK
Status: Draft
Component: hotspot / compiler
Discussion: panama dash dev at openjdk dot java dot net
Effort: M
Duration: M
Reviewed by: John Rose, Maurizio Cimadamore
Endorsed by: John Rose, Maurizio Cimadamore
Created: 2021/02/12 17:06
Updated: 2021/04/09 06:49
Issue: 8261663

Summary

Provide a second iteration of an incubator module, jdk.incubator.vector, to express vector computations that reliably compile at runtime to optimal vector hardware instructions on supported CPU architectures and thus achieve superior performance to equivalent scalar computations.

History

The Vector API was first proposed by JEP 338 and was integrated into JDK 16 as an incubating API. This JEP proposes to incorporate enhancements to the Vector API in response to feedback, together with performance improvements and other significant implementation enhancements, such as optimizing masked vector operations on supporting hardware.

Goals

Motivation

The primary motivation of the Vector API remains unchanged, as described in JEP 338.

This JEP has three specific motivations. The first is to improve the Vector API by incorporating feedback, which involves some minor enhancements and adjustments. The second is to improve the performance of the Vector API with enhancements to HotSpot, specifically by enhancing vector support in the C2 compiler on the currently supported architectures, Intel x64 and ARM NEON. Where possible, this work may also enhance, or enable future enhancements to, HotSpot's auto-vectorizer. The third is to broaden support for the Vector API to new CPU architectures, specifically ARM SVE.

Description

API enhancements

The following API enhancements are proposed:

Implementation enhancements

Implementation enhancements are detailed in the following subsections.

Intel SVML intrinsics

The Vector API supports transcendental and trigonometric lanewise operations. Currently, such operations are not optimized, since there are no associated vector hardware instructions available, nor intrinsic implementations consisting of vector hardware instructions.

For x86, the Intel Short Vector Math Library (SVML) can be leveraged to provide optimized intrinsic implementations for such operations.

The assembly source files of SVML operations are placed in the jdk.incubator.vector module under OS-specific directories. The JDK build process compiles the assembly source files for the target OS platform into an SVML-specific shared object library. Note that if a JDK image built using jlink omits the jdk.incubator.vector module, then the SVML library will not be present in that image.

The supported OS platforms are Linux and Windows. macOS support will be considered later, since it is a non-trivial amount of work to provide assembler source files with the required OS-specific directives.

The HotSpot runtime will attempt to load the SVML library and, if the library is present, bind the operations in it to named stub routines. The C2 compiler generates code that calls the appropriate stub routine based on the operation and vector species (element type and shape).
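
The dispatch described above can be pictured as a table lookup. The following is a minimal sketch written in Java, not HotSpot's actual implementation; the class, the key layout, and the stub name are all hypothetical and chosen only for illustration. It also assumes that, when no stub is bound (for example because the SVML library could not be loaded), the operation falls back to the existing non-intrinsic path.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class SvmlStubTableSketch {
    // Hypothetical table: (operation, element type, shape in bits) -> stub name.
    private final Map<String, String> stubs = new HashMap<>();

    private static String key(String op, Class<?> elementType, int shapeBits) {
        return op + "/" + elementType.getName() + "/" + shapeBits;
    }

    // Models binding one library operation to a named stub routine.
    void bind(String op, Class<?> elementType, int shapeBits, String stubName) {
        stubs.put(key(op, elementType, shapeBits), stubName);
    }

    // An empty result means no stub is bound; the caller is assumed to use
    // the default, non-intrinsic implementation in that case.
    Optional<String> lookup(String op, Class<?> elementType, int shapeBits) {
        return Optional.ofNullable(stubs.get(key(op, elementType, shapeBits)));
    }

    public static void main(String[] args) {
        SvmlStubTableSketch table = new SvmlStubTableSketch();
        // Hypothetical stub name, for illustration only.
        table.bind("SIN", double.class, 512, "hypothetical_sin_double_512");
        System.out.println(table.lookup("SIN", double.class, 512).isPresent());
        System.out.println(table.lookup("SIN", float.class, 256).isPresent());
    }
}
```

Keying on both element type and shape reflects the point made above: the same operation needs a distinct routine for each vector species.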

ARM SVE

The C2 compiler is enhanced to support the Vector API on ARM SVE. Such support will leverage general ARM SVE support in C2, which is proposed and integrated separately from this JEP.

Masking

Vector operations that accept masks are not optimally supported on architectures that support masking in hardware. Currently, such operations are implemented by composing the non-masked operation with a blend operation. For example, the masked lanewise operation on DoubleVector is implemented as follows:

@ForceInline
public final
DoubleVector lanewise(VectorOperators.Binary op,
                      Vector<Double> v,
                      VectorMask<Double> m) {
     return blend(lanewise(op, v), m);
}
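
The effect of this composition can be illustrated without the incubator module. The following is a minimal scalar sketch, not the actual implementation: arrays stand in for vectors, a boolean array stands in for the mask, and the names MaskedBlendSketch, lanewise, and blend are chosen here for illustration only. It shows that the unmasked operation is computed for every lane, and that the mask is applied only afterwards, by selection.

```java
import java.util.Arrays;
import java.util.function.DoubleBinaryOperator;

public class MaskedBlendSketch {
    // Unmasked lanewise operation: applies op to every lane.
    static double[] lanewise(DoubleBinaryOperator op, double[] a, double[] b) {
        double[] r = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = op.applyAsDouble(a[i], b[i]);
        }
        return r;
    }

    // Blend: takes the lane from b where the mask is set, otherwise from a.
    static double[] blend(double[] a, double[] b, boolean[] m) {
        double[] r = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = m[i] ? b[i] : a[i];
        }
        return r;
    }

    // Masked operation composed from the two steps above.
    static double[] lanewise(DoubleBinaryOperator op, double[] a, double[] b, boolean[] m) {
        return blend(a, lanewise(op, a, b), m);
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4};
        double[] b = {10, 20, 30, 40};
        boolean[] m = {true, false, true, false};
        // prints [11.0, 2.0, 33.0, 4.0]
        System.out.println(Arrays.toString(lanewise(Double::sum, a, b, m)));
    }
}
```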

On hardware that supports mask registers, such as AVX-512 and SVE, the blend operation is not required. Instead, the mask m can be compiled to a mask register, and the vector operation compiled to a vector hardware instruction that is predicated on that mask register.

For example, consider the following code that loads a vector and mask, then performs a masked lanewise operation:

var vec  = IntVector.fromArray(SPECIES_512, int_arr, 0);
var mask = VectorMask.fromArray(SPECIES_512, mask_arr, 0);
var res  = vec.lanewise(VectorOperators.ABS, mask);

On AVX-512 hardware the sequence of instructions generated by C2 is:

// LoadVector (IntVector.fromArray)
vmovdqu32 0x10(%r9),%zmm0          
// LoadVector (VectorMask.fromArray)
vmovdqu 0x10(%r12,%r8,8),%xmm1  
// AbsV   (IntVector.lanewise)
vpabsd %zmm0,%zmm2                    
// VectorLoadMask (VectorMask.fromArray)
vpxord %zmm3,%zmm3,%zmm3        
vpsubb %zmm1,%zmm3,%zmm3       
vpmovsxbd %xmm3,%zmm3               
// VectorBlend (IntVector.blend)  
vpcmpeqd -0xeb539(%rip),%zmm3,%k7 
vpblendmd %zmm2,%zmm0,%zmm0{%k7}

With hardware masking support the ideal sequence of instructions generated is:

// LoadVector (IntVector.fromArray)
vmovdqu32 0x10(%r9),%zmm1 
// LoadVector (VectorMask.fromArray) 
vmovdqu 0x10(%r12,%r8,8),%xmm0
// VectorLoadMask (VectorMask.fromArray)
vpcmpb $0x0,-0xee9e1(%rip),%xmm0,%k7 
// VectorMaskedOper(IntVector.lanewise)
vpabsd %zmm1,%zmm1{%k7}

A single predicated vector hardware instruction is generated, using a hardware mask register. Fewer instructions are generated and performance is improved.
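
Both instruction sequences compute the same result; they differ only in how the mask is applied. As a rough scalar analogy, and not a description of what C2 emits, the sketch below contrasts the two strategies for the masked ABS example. The class and method names are hypothetical.

```java
import java.util.Arrays;

public class PredicatedAbsSketch {
    // Blend-based form: compute abs for every lane, then select by mask.
    static int[] absViaBlend(int[] v, boolean[] m) {
        int[] all = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            all[i] = Math.abs(v[i]);
        }
        int[] r = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = m[i] ? all[i] : v[i];
        }
        return r;
    }

    // Predicated form: a single pass that updates only lanes whose mask is set.
    static int[] absPredicated(int[] v, boolean[] m) {
        int[] r = v.clone();
        for (int i = 0; i < r.length; i++) {
            if (m[i]) {
                r[i] = Math.abs(r[i]);
            }
        }
        return r;
    }

    public static void main(String[] args) {
        int[] v = {-1, 2, -3, -4};
        boolean[] m = {true, true, false, true};
        // Both lines print [1, 2, -3, 4]
        System.out.println(Arrays.toString(absViaBlend(v, m)));
        System.out.println(Arrays.toString(absPredicated(v, m)));
    }
}
```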

The Vector API implementation and generic components of C2 are enhanced to support efficient masked operations, rather than composing them explicitly using blend. In addition, special attention will be required for masked loads and stores of vectors to ensure that no out-of-bounds access occurs. Such support will leverage general enhancements to HotSpot for mask registers and their allocation, which are proposed and integrated separately from this JEP (see JDK-8262355).
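
The out-of-bounds concern arises most commonly at the tail of an array, where a full vector would extend past the end. The following scalar sketch illustrates the required behavior under stated assumptions: it is modeled on the semantics of VectorSpecies.indexInRange and the masked fromArray methods of the Vector API, in which lanes with an unset mask are not read and receive zero. The class and its helpers are hypothetical and use plain arrays, so the code runs without the incubator module.

```java
import java.util.Arrays;

public class MaskedTailLoadSketch {
    // Builds a mask whose lane i is set only if offset + i lies in [0, limit).
    static boolean[] indexInRange(int lanes, int offset, int limit) {
        boolean[] m = new boolean[lanes];
        for (int i = 0; i < lanes; i++) {
            int index = offset + i;
            m[i] = index >= 0 && index < limit;
        }
        return m;
    }

    // Masked load: unset lanes never touch the array and are filled with zero.
    static int[] maskedLoad(int[] src, int offset, boolean[] m) {
        int[] r = new int[m.length];
        for (int i = 0; i < m.length; i++) {
            r[i] = m[i] ? src[offset + i] : 0;
        }
        return r;
    }

    public static void main(String[] args) {
        int[] src = {1, 2, 3, 4, 5, 6};
        int lanes = 4;
        // The second vector starts at offset 4 and would overrun by two lanes.
        boolean[] tail = indexInRange(lanes, 4, src.length);
        // prints [5, 6, 0, 0]
        System.out.println(Arrays.toString(maskedLoad(src, 4, tail)));
    }
}
```

A blend-based implementation cannot simply perform an unmasked load and discard the unwanted lanes, because the unmasked load itself would read past the end of the array.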

Care is taken to ensure C2's masking support allows for efficient generation of code on AVX-512 and SVE, requiring a common intermediate representation that is expressive enough to abstract over the underlying architectural differences.

Further, care is taken to ensure masking support does not unduly increase the following: number of instruction selection patterns; the size of ad files; and the size of the resulting libjvm shared library.

Testing

Existing tests will be updated to test enhancements to the Vector API.

Existing tests are considered sufficient to cover enhancements to HotSpot. Testing on ARM SVE and AVX-512 hardware will be aided by the contributors, since such hardware may not be widely available.

Risks and Assumptions

Two features may be deferred to a future JEP if they are not ready in a timely manner and risk delaying the progress of this JEP and its other features. Specifically, if masking, ARM SVE support, or both are not considered ready, then this JEP will be updated to remove the related details.