JEP 331: Low-Overhead Heap Profiling

Author: JC Beyler
Owner: Jean Christophe Beyler
Created: 2016/12/12 21:31
Updated: 2018/04/05 22:04
Type: Feature
Status: Candidate
Component: hotspot / jvmti
Scope: JDK
Discussion: hotspot dash dev at openjdk dot java dot net
Priority: 4
Reviewed by: Robbin Ehn, Serguei Spitsyn
Issue: 8171119

Summary

Provide a low-overhead way of sampling Java heap allocations, accessible via JVMTI.

Goals

Provide a way to get information about Java object heap allocations from the JVM that:

  - Is low-overhead enough to be enabled by default continuously,
  - Is accessible via a well-defined, programmatic interface,
  - Can sample all allocations (i.e., is not limited to allocations in one particular heap region or allocated in one particular way),
  - Can be defined in an implementation-independent way (i.e., without relying on any particular GC algorithm or VM implementation), and
  - Can give information about both live and dead Java objects.

Motivation

There is a deep need for users to understand the contents of their heaps. Poor heap management can lead to problems such as heap exhaustion and GC thrashing. As a result, a number of tools have been developed to allow users to introspect into their heaps, such as the Java Flight Recorder, jmap, YourKit, and VisualVM tools.

One piece of information that is lacking from most of the existing tooling is the call site for particular allocations. Heap dumps and heap histograms do not contain this information. This information can be critical to debugging memory issues, because it tells developers the exact locations in their code where particular (and particularly bad) allocations occurred.

There are currently two ways of getting this information out of HotSpot: first, instrumenting all allocations in the application with a bytecode rewriter such as ASM and having the instrumentation take a stack trace where desired, which is prohibitively expensive in both time and memory; second, using Java Flight Recorder, which takes a stack trace on TLAB refills and on allocations directly into the old generation, but whose sampling interval is not configurable and which provides no callback.

This proposal mitigates those problems by providing an extensible JVMTI interface that allows the user to define the sampling rate and returns a set of live stack traces.

Description

A) New Event and new method to JVMTI

The user-facing API for the heap sampling feature proposed by this JEP consists of an extension to JVMTI that allows for heap profiling. It relies on JVMTI's event notification system, providing a callback such as:

void JNICALL
SampledObjectAlloc(jvmtiEnv *jvmti_env,
            JNIEnv* jni_env,
            jthread thread,
            jobject object,
            jclass object_klass,
            jlong size)

where thread is the thread allocating the jobject object, object_klass is the class of the object, and size is the size of the allocation.

The new API also includes a single new JVMTI method:

jvmtiError  SetHeapSamplingRate(jvmtiEnv* env, jint sampling_rate)

where sampling_rate is the average number of bytes allocated between two samples; the default rate is 512 KB.

B) Use-case example

To enable this, a user uses the usual event notification call:

jvmti->SetEventNotificationMode(jvmti, JVMTI_ENABLE, JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)

The event is sent once the allocation has been initialized and set up correctly, so slightly after the actual code performs the allocation. By default, the sampling rate is 512 KB. In essence, the minimum required to enable the sampling event system is a call to SetEventNotificationMode with JVMTI_ENABLE and the event type JVMTI_EVENT_SAMPLED_OBJECT_ALLOC. To modify the sampling rate, the user calls the SetHeapSamplingRate method.

To disable the system, a single call suffices:

jvmti->SetEventNotificationMode(jvmti, JVMTI_DISABLE, JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)

which disables the event notifications and disables the sampler automatically.

Calling SetEventNotificationMode again re-enables the sampler with whatever sampling rate is currently set (either the 512 KB default or the last value passed by a user via SetHeapSamplingRate).

C) New Capability

To protect the new feature and make it optional for VM implementations, a new capability called can_generate_sampled_alloc_events is introduced into the jvmtiCapabilities.

D) Global/Thread level sampling

Using the notification system provides a direct means to send events only for specific threads. This is done via SetEventNotificationMode by providing a jthread parameter with the thread to be modified.

E) What the JVMTI agent can do

The user of the callback can then obtain a stack trace at the moment of the callback, using the JVMTI GetStackTrace method for example. The object reference obtained by the callback can also be wrapped into a JNI weak reference to help determine when the object has been garbage collected. The idea is to provide data on which sampled objects are still live and which have been collected, which can be a good means of understanding the application's behavior.

The sampling rate trades sampling precision against profiling overhead. At the default rate of 512 KB, the overhead should be low enough that a user could reasonably leave the system on by default.

F) A Full Example

The following section provides code snippets to illustrate the sampler's API. First, the capability and the event notification are enabled:

jvmtiEventCallbacks callbacks;
memset(&callbacks, 0, sizeof(callbacks));
callbacks.SampledObjectAlloc = &SampledObjectAlloc;

jvmtiCapabilities caps;
memset(&caps, 0, sizeof(caps));
caps.can_generate_sampled_alloc_events = 1;
if (JVMTI_ERROR_NONE != (*jvmti)->AddCapabilities(jvmti, &caps)) {
  return JNI_ERR;
}

if (JVMTI_ERROR_NONE != (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE,
                                       JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)) {
  return JNI_ERR;
}

if (JVMTI_ERROR_NONE != (*jvmti)->SetEventCallbacks(jvmti, &callbacks, sizeof(jvmtiEventCallbacks))) {
  return JNI_ERR;
}

// Set the sampler to 1MB.
if (JVMTI_ERROR_NONE != (*jvmti)->SetHeapSamplingRate(jvmti, 1024 * 1024)) {
  return JNI_ERR;
}

To disable the sampler (disables events and the sampler):

if (JVMTI_ERROR_NONE != (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_DISABLE,
                                       JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)) {
  return JNI_ERR;
}

To re-enable the sampler with the 1024 * 1024 byte sampling rate, a simple call enabling the event is sufficient:

if (JVMTI_ERROR_NONE != (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE,
                                       JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)) {
  return JNI_ERR;
}

User Storage of Sampled Allocations

Once the callback is set up, the agent can create a weak reference to the sampled object and track it to determine whether the object has been garbage collected. A stack trace, obtained via the JVMTI GetStackTrace method, can be added to the data to help users profile their code.

For example, something like this could be done:

extern "C" JNIEXPORT void JNICALL SampledObjectAlloc(jvmtiEnv *env,
                                                     JNIEnv* jni,
                                                     jthread thread,
                                                     jobject object,
                                                     jclass klass,
                                                     jlong size) {
  jvmtiFrameInfo frames[32];
  jint frame_count;
  jvmtiError err;

  err = global_jvmti->GetStackTrace(NULL, 0, 32, frames, &frame_count);
  if (err == JVMTI_ERROR_NONE && frame_count >= 1) {
    jweak ref = jni->NewWeakGlobalRef(object);
    internal_storage.add(jni, ref, size, thread, frames, frame_count);
  }
}

where internal_storage is a data structure that handles the sampled objects and, for example, cleans up samples whose objects have been garbage collected. The internals of that implementation are out of the scope of this JEP, since it is up to the user to define and implement the system using the data provided by the callback.

Alternatives

There are multiple alternatives to the system presented in this JEP; the introduction presented two already. The Java Flight Recorder system is an interesting alternative, but it is not perfect: it does not allow the sampling interval to be set and does not provide a callback.

The JFR system does use TLAB creation as a means to track memory allocation but, instead of a callback, JFR events use a buffer system that can lead to missing some sampled allocations. Finally, the JFR event system does not provide a means to track objects that have been garbage collected, so it is currently not possible to provide information about live versus garbage-collected objects using JFR events.

Bytecode instrumentation using ASM is another alternative, but its overhead makes it prohibitive and not a workable solution.

This JEP adds a new feature into the JVMTI which is an important API/framework for various development and monitoring tools. With it, a JVMTI agent can use a low overhead heap profiling API along with the rest of JVMTI functionality, which provides great flexibility to the tools. For instance, it is up to the agent to decide if a stack trace needs to be collected at each event point.

Testing

There are 16 tests in the JTreg framework for this feature that test: turning on/off with multiple threads, multiple threads allocating at the same time, testing if the data is being sampled at the right rate, and if the stacks are coherent to what is expected.

Risks and Assumptions

There are no performance hits or risks with the feature disabled. A general user not enabling the system would not perceive a difference with or without the feature.

However, there is a potential performance/memory hit with the feature enabled. In the prototype implementation, the overhead was minimal (<2%), but that prototype used a mechanism that modified JIT'd code. The version presented here piggy-backs on the TLAB code and should not have that regression.

Current evaluation on the DaCapo benchmark suite puts the overhead at:

Prototype Implementation Details

The current prototype and implementation prove the feasibility of the approach. It consists, in essence, of five parts:

  1. Architecture-dependent changes due to the renaming of a field in the ThreadLocalAllocBuffer (TLAB) structure. These changes are minimal, as they are just name changes.

  2. The TLAB structure is augmented with a new allocation_end pointer alongside the current_end pointer. If sampling is disabled, the two pointers are always equal and the code performs as before. If sampling is enabled, current_end is moved to where the next sample point is requested. Any fast path will then "think" the TLAB is full at that point and take the slow path, which is explained in (3).

  3. The gc/shared/collectedHeap code is changed, since it is the entry point to the allocation slow path. When a TLAB is considered full, the code enters the CollectedHeap and tries to allocate a new TLAB. At this point, the TLAB is set back to its original size and an allocation is attempted. If the allocation succeeds, the code samples it and returns. If it does not, the TLAB really is exhausted and a new TLAB is needed; the code path continues its normal allocation of a new TLAB and determines whether that allocation requires a sample. If the allocation is too big for a TLAB, the system samples it as well, thus covering both in-TLAB and out-of-TLAB allocations.

  4. When a sample is requested, a collector object is set on the stack in a place safe for sending the information to the native agent. The collector keeps track of sampled allocations and, at the destruction of its own frame, sends a callback to the agent. This mechanism ensures the object is initialized correctly before it is exposed.

  5. Though not in the implementation due to its out-of-JDK nature, the native agent can then register a callback and obtain sampled allocations. The allocations can be associated with a stack trace using a JVMTI method and then wrapped into a weak reference, which provides liveness information. An example implementation can be found in the libHeapMonitorTest.c file of the webrev, which is used for the JTreg testing.