JEP 331: Low-Overhead Heap Profiling

Owner: Jean Christophe Beyler
Type: Feature
Scope: JDK
Status: Targeted
Release: 11
Component: hotspot / jvmti
Discussion: hotspot dash dev at openjdk dot java dot net
Effort: L
Reviewed by: Mikael Vidstedt, Robbin Ehn, Serguei Spitsyn
Endorsed by: Mikael Vidstedt, Vladimir Kozlov
Created: 2016/12/12 21:31
Updated: 2018/06/20 06:26
Issue: 8171119

Summary

Provide a low-overhead way of sampling Java heap allocations, accessible via JVMTI.

Goals

Provide a way to get information about Java object heap allocations from the JVM that:

  - Is low-overhead enough to be enabled by default continuously,
  - Is accessible via a well-defined, programmatic interface,
  - Can sample all allocations (i.e., is not limited to allocations that are in one particular heap region or that were allocated in one particular manner),
  - Can be defined in an implementation-independent way (i.e., without relying on any particular GC algorithm or VM implementation), and
  - Can give information about both live and dead Java objects.

Motivation

There is a deep need for users to understand the contents of their heaps. Poor heap management can lead to problems such as heap exhaustion and GC thrashing. As a result, a number of tools have been developed to allow users to introspect into their heaps, such as the Java Flight Recorder, jmap, YourKit, and VisualVM tools.

One piece of information that is lacking from most of the existing tooling is the call site for particular allocations. Heap dumps and heap histograms do not contain this information. It can be critical for debugging memory issues, because it tells developers the exact locations in their code where particular (and particularly bad) allocations occurred.

There are currently two ways of getting this information out of HotSpot:

  - Instrument every allocation in the application with a bytecode rewriter such as ASM. This works, but the overhead of instrumenting every allocation site makes it unsuitable for always-on use.
  - Use Java Flight Recorder, which samples allocations on TLAB refills. However, JFR does not allow the sampling interval to be configured, can lose allocations when its buffers are exhausted, and does not track whether sampled objects have since been garbage collected.

This proposal mitigates these problems by providing an extensible JVMTI interface that allows the user to define the sampling rate and returns a set of live stack traces.

Description

New JVMTI event and method

The user-facing API for the heap sampling feature proposed here consists of an extension to JVMTI that allows for heap profiling. It relies on JVMTI's event notification mechanism and provides a new callback of the form:

void JNICALL
SampledObjectAlloc(jvmtiEnv *jvmti_env,
            JNIEnv* jni_env,
            jthread thread,
            jobject object,
            jclass object_klass,
            jlong size)

where:

  - thread is the thread allocating the jobject,
  - object is the reference to the sampled jobject,
  - object_klass is the class of the jobject, and
  - size is the size of the allocation.

The new API also includes a single new JVMTI method:

jvmtiError  SetHeapSamplingRate(jvmtiEnv* env, jint sampling_rate)

where sampling_rate is the average number of bytes allocated between two samples.

Note that the sampling rate is not precise. Each time a sample occurs, the number of bytes before the next sample will be chosen as a geometric variable with the given average. This is to avoid sampling bias; for example, if the same allocations happen every 512KB, a 512KB sampling interval will always sample the same allocations.

Use-case example

To enable this, a user would use the usual event notification call:

(*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE, JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)

The event will be sent when the allocation is initialized and set up correctly, so slightly after the actual code performs the allocation. By default, the average sampling rate is 512KB.

The minimum required to enable the sampling event system is to call SetEventNotificationMode with JVMTI_ENABLE and the event type JVMTI_EVENT_SAMPLED_OBJECT_ALLOC. To modify the sampling rate, the user calls the SetHeapSamplingRate method.

To disable the system, call:

(*jvmti)->SetEventNotificationMode(jvmti, JVMTI_DISABLE, JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)

This disables the event notifications and automatically disables the sampler.

Calling SetEventNotificationMode again with JVMTI_ENABLE re-enables the sampler with whatever sampling rate is currently set (either the 512KB default or the last value passed by the user via SetHeapSamplingRate).

New capability

To protect the new feature and make it optional for VM implementations, a new capability named can_generate_sampled_object_alloc_events is introduced into the jvmtiCapabilities.

Global / thread level sampling

The notification system provides a direct means to send events only for specific threads: instead of NULL, pass the thread of interest as the thread argument of SetEventNotificationMode.

A full example

The following section provides code snippets to illustrate the sampler's API. First, the capability and the event notification are enabled:

jvmtiEventCallbacks callbacks;
memset(&callbacks, 0, sizeof(callbacks));
callbacks.SampledObjectAlloc = &SampledObjectAlloc;

jvmtiCapabilities caps;
memset(&caps, 0, sizeof(caps));
caps.can_generate_sampled_object_alloc_events = 1;
if (JVMTI_ERROR_NONE != (*jvmti)->AddCapabilities(jvmti, &caps)) {
  return JNI_ERR;
}

if (JVMTI_ERROR_NONE != (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE,
                                       JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)) {
  return JNI_ERR;
}

if (JVMTI_ERROR_NONE != (*jvmti)->SetEventCallbacks(jvmti, &callbacks, sizeof(jvmtiEventCallbacks))) {
  return JNI_ERR;
}

// Set the sampling rate to 1MB.
if (JVMTI_ERROR_NONE != (*jvmti)->SetHeapSamplingRate(jvmti, 1024 * 1024)) {
  return JNI_ERR;
}

To disable the sampler (disables events and the sampler):

if (JVMTI_ERROR_NONE != (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_DISABLE,
                                       JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)) {
  return JNI_ERR;
}

To re-enable the sampler with the 1024 * 1024 byte sampling rate, simply enable the event again:

if (JVMTI_ERROR_NONE != (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE,
                                       JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)) {
  return JNI_ERR;
}

User storage of sampled allocations

When an event is generated, the callback can capture a stack trace using the JVMTI GetStackTrace method. The jobject reference obtained by the callback can also be wrapped into a JNI weak reference to help determine when the object has been garbage collected. This approach lets the user gather data on which objects were sampled and which of them are still live, which is a good way to understand an application's memory behavior.

For example, something like this could be done:

extern "C" JNIEXPORT void JNICALL SampledObjectAlloc(jvmtiEnv *env,
                                                     JNIEnv* jni,
                                                     jthread thread,
                                                     jobject object,
                                                     jclass klass,
                                                     jlong size) {
  jvmtiFrameInfo frames[32];
  jint frame_count;
  jvmtiError err;

  err = global_jvmti->GetStackTrace(NULL, 0, 32, frames, &frame_count);
  if (err == JVMTI_ERROR_NONE && frame_count >= 1) {
    jweak ref = jni->NewWeakGlobalRef(object);
    internal_storage.add(jni, ref, size, thread, frames, frame_count);
  }
}

where internal_storage is a data structure that stores the sampled objects and, if needed, cleans up entries whose objects have been garbage collected. The internals of that implementation are usage-specific and out of scope for this JEP.

The sampling rate can be used as a means to mitigate profiling overhead. With a sampling rate of 512KB, the overhead should be low enough that a user could reasonably leave the system on by default.

Implementation details

The current prototype implementation proves the feasibility of the approach. It contains five parts:

  1. Architecture dependent changes due to a change of a field name in the ThreadLocalAllocationBuffer (TLAB) structure. These changes are minimal as they are just name changes.
  2. The TLAB structure is augmented with a new allocation_end pointer, to complement the existing end pointer. If the sampling is disabled, the two pointers are always equal and the code performs as before. If the sampling is enabled, end is modified to be where the next sample point is requested. Then, any fast path will "think" the TLAB is full at that point and go down the slow path, which is explained in (3).
  3. The gc/shared/collectedHeap code is changed due to its usage as an entry point to the allocation slow path. When a TLAB is considered full (because allocation has passed the end pointer), the code enters collectedHeap and tries to allocate a new TLAB. At this point, the TLAB is set back to its original size and an allocation is attempted. If the allocation succeeds, the code samples the allocation, and then returns. If it does not, allocation has reached the real end of the TLAB, and a new TLAB is needed. The code path continues its normal allocation of a new TLAB and determines if that allocation requires a sample. If the allocation is considered too big for the TLAB, the system samples it as well, thus covering both in-TLAB and out-of-TLAB allocations.
  4. When a sample is requested, there is a collector object set on the stack in a place safe for sending the information to the native agent. The collector keeps track of sampled allocations and, at destruction of its own frame, sends a callback to the agent. This mechanism ensures the object is initialized correctly.
  5. If a JVMTI agent has registered a callback for the SampledObjectAlloc event, the event will be triggered and it will obtain sampled allocations. An example implementation can be found in the libHeapMonitorTest.c file, which is used for JTreg testing.

Alternatives

There are multiple alternatives to the system presented in this JEP. The motivation section presented two already: bytecode instrumentation and Java Flight Recorder. JFR is an interesting alternative, but the approach proposed here has several advantages over it. First, JFR does not allow the sampling rate to be set, nor does it provide a callback. Next, JFR's use of a buffer system can lead to lost allocations when the buffer is exhausted. Finally, the JFR event system does not provide a means to track objects that have been garbage collected, which means it cannot provide information about both live and garbage-collected objects.

Another alternative is bytecode instrumentation using ASM, but its overhead makes it prohibitive as an always-on solution.

This JEP adds a new feature into JVMTI, which is an important API/framework for various development and monitoring tools. With it, a JVMTI agent can use a low overhead heap profiling API along with the rest of the JVMTI functionality, which provides great flexibility to the tools. For instance, it is up to the agent to decide if a stack trace needs to be collected at each event point.

Testing

There are 16 tests in the JTreg framework for this feature that test: turning on/off with multiple threads, multiple threads allocating at the same time, testing if the data is being sampled at the right rate, and if the gathered stacks reflect the correct program information.

Risks and Assumptions

There are no performance penalties or risks with the feature disabled. A user who does not enable the system will not perceive a performance difference.

However, there is a potential performance/memory penalty with the feature enabled. In the initial prototype implementation, the overhead was minimal (<2%). This used a more heavyweight mechanism that modified JIT’d code. In the final version presented here, the system piggy-backs on the TLAB code, and should not experience that regression.

Current evaluation using the DaCapo benchmark suite puts the overhead at: