JEP draft: Enable execution of Java methods on GPU

Created: 2014/06/17 15:05
Updated: 2016/09/08 15:57
Component: hotspot / runtime
Discussion: sumatra dash dev at openjdk dot java dot net
Reviewed by: Mikael Vidstedt

Summary

This is the JEP draft for OpenJDK project Sumatra:


Enable Java applications to take advantage of GPUs, using JDK 8 Stream API parallel streams and lambdas as the programming model.


Goals

Enable seamless offload of Java 8 parallel Stream API operations to GPGPUs when possible.

By seamless we mean:

Non Goals


Success Metrics

An initial success metric would be to offload a parallel workload using the Stream API and observe better performance in that part of the application.


Motivation

Java workloads keep growing in size. For some workloads, GPUs offer computing power that is more efficient in both power consumption and performance, but earlier Java/GPU offload solutions such as Aparapi or JOCL are not integrated into the JDK and require their own programming model.

With Sumatra, we plan to offer seamless offload of some Stream API parallel lambda functions. The Stream API is designed to simplify parallel programming and Sumatra is a natural extension of the parallel capability already in the Stream API. Since Sumatra will be integrated into the JDK, it will simplify both development and deployment of offloadable applications compared to existing Java/GPU solutions.


Description

Our implementation uses the Heterogeneous System Architecture (HSA) supported in certain AMD APUs with a related software stack, and uses the Graal JVM, which includes an HSAIL back end. The JDK is modified such that, for certain Stream API operations, the application's lambda function is extracted from the stream and compiled into an HSA kernel. The stream data structures are examined to extract the lambda arguments, which are then passed to the HSA kernel.

Current GPUs have hundreds to thousands of stream cores. Ideally, for parallelizable workloads, all the stream cores can operate on the input data at the same time. We use the Stream API parallel() method as the indicator that it is safe to offload the following part of the stream, since the programmer explicitly requested parallel execution there. For example, we have implemented offloadable versions of parallel().forEach() and some parallel().reduce() operations in the Stream API.
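As an illustration of the kind of parallel().reduce() call discussed above, here is a minimal sketch. The class and method names are hypothetical, and the draft does not say which reduce shapes are actually offloaded, so this only shows a reduction whose result does not depend on how the work is split: integer addition is associative and 0 is its identity.

```java
import java.util.stream.IntStream;

// Hypothetical example class; not part of Sumatra or the JDK.
final class ReduceSketch {

    // Sums the squares of 0..n-1 with a parallel reduce.
    // Because addition is associative and 0 is its identity, the result
    // is the same whether the stream runs sequentially or is split
    // across many workers.
    static int sumOfSquares(int n) {
        return IntStream.range(0, n)
                .parallel()
                .map(i -> i * i)
                .reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        // 0 + 1 + 4 + 9 + 16 + 25 + 36 + 49 = 140
        System.out.println(sumOfSquares(8));
    }
}
```

A reduction with a non-associative operator (for example subtraction) would not meet this criterion, because its result would depend on how the range is partitioned.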

Work sent to a GPU is generally in the form of an array. The length of the input array is sometimes called the "range" in GPU terms, and it indicates how many "work items" are in the task. In the GPU programming model it is common for each stream core to use its work item id as an index into an array to fetch the data that stream core will process. In Sumatra, we find the source Java array in the stream, pass the array to the kernel, and use the work item id to retrieve the array element for that stream core. Each stream core processes one array element, which corresponds to one execution of the lambda for one value of the iteration variable in the Stream API.
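The work item model described above can be sketched in plain Java. This is a CPU-only stand-in, not Sumatra's implementation: addKernel and dispatch are hypothetical names, and a parallel IntStream plays the role of the GPU dispatch. The point is only that the kernel body receives a work item id and uses it as the array index.

```java
import java.util.stream.IntStream;

// Hypothetical CPU-side model of a GPU dispatch; illustrative only.
final class WorkItemSketch {

    // The "kernel": executed once per work item. The work item id is
    // the index of the single array element this invocation handles.
    static void addKernel(int workItemId, int[] ina, int[] inb, int[] out) {
        out[workItemId] = ina[workItemId] + inb[workItemId];
    }

    // The "dispatch": the range is the input array length, and one
    // kernel invocation is issued for each work item id in [0, range).
    static int[] dispatch(int[] ina, int[] inb) {
        int range = ina.length;
        int[] out = new int[range];
        IntStream.range(0, range)
                .parallel()
                .forEach(id -> addKernel(id, ina, inb, out));
        return out;
    }
}
```

Each invocation writes only to out[workItemId], so the invocations are independent of one another and can run in any order.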

Note that with HSA the GPU operates on main memory and has direct access to the Java heap, so no data is copied. Thus we can operate on Java objects and are not limited to arrays of primitive types.
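To make the point about Java objects concrete, the following sketch shows a parallel forEach over an array of heap objects rather than a primitive array. The Body class and the step method are hypothetical, and the draft does not state whether this particular stream shape is offloaded; the sketch only shows a lambda that reads and writes object fields, with each invocation touching only its own element.

```java
import java.util.Arrays;

// Hypothetical example; the types and names are illustrative only.
final class ObjectArraySketch {

    // An ordinary heap object with mutable fields.
    static final class Body {
        double position;
        double velocity;

        Body(double position, double velocity) {
            this.position = position;
            this.velocity = velocity;
        }
    }

    // Advances every body by one time step. Each lambda invocation
    // updates only the object it was handed, so iterations are
    // independent of one another.
    static void step(Body[] bodies, double dt) {
        Arrays.stream(bodies)
                .parallel()
                .forEach(b -> b.position += b.velocity * dt);
    }
}
```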

Garbage collection cannot occur while a kernel is executing. Our prototype executes the kernels from inside the JVM and does not use JNI, so no extra object pinning is required.

We support deoptimization of HSA kernels back to CPU execution, and handle safepoints by deoptimizing back to the CPU. In this way the CPU execution of the application is not blocked or delayed by execution of a kernel.

Here is a simple use of the parallel Stream API, showing examples of what can and cannot be offloaded:

package simple;

import java.util.stream.IntStream;

public class Simple {

    public static void main(String[] args) {
        final int length = 8;
        int[] ina = new int[length];
        int[] inb = new int[length];
        int[] out = new int[length];

        // Initialize the input arrays - this is offloadable.
        // Each iteration of this lambda is independent and
        // always produces the same answer whether executed single-threaded,
        // by CPU thread pool or GPU kernel.
        IntStream.range(0, length).parallel().forEach(p -> {
            ina[p] = 1;
            inb[p] = 2;
        });

        // Sum each pair of elements into out[] - this is offloadable.
        // Meets the same criteria as the above example.
        IntStream.range(0, length).parallel().forEach(p -> {
            out[p] = ina[p] + inb[p];
        });

        // Print results - this is not offloadable since it is calling
        // native code etc. Also it is not really parallelizable even
        // on the CPU since it is printing messages that might become garbled.
        IntStream.range(0, length).forEach(p -> {
            System.out.println(out[p] + ", " + ina[p] + ", " + inb[p]);
        });
    }
}


Alternatives

There are several open source packages available to offload some Java methods to GPUs with OpenCL or CUDA. They generally require their own programming model, their own jars in the classpath and native libraries.


Risks and Assumptions