JEP draft: Provide a low-overhead way of sampling Java heap allocations, accessible via JVMTI

Author: JC Beyler
Owner: Chuck Rasbold
Created: 2016/12/12 21:31
Updated: 2017/01/10 09:57
Type: Feature
Status: Draft
Component: hotspot / jvmti
Scope: JDK
Priority: 4
Reviewed by: Staffan Larsen
Issue: 8171119

Summary

Provide a low-overhead way of sampling Java heap allocations, accessible via JVMTI.

Goals

The overall goal of this proposal is to provide a way of getting information about Java object heap allocations from the JVM that:

- Is low-overhead enough to be enabled by default continuously,
- Is accessible via a well-defined, programmatic interface,
- Can sample all allocations (i.e., is not limited to allocations that are in one particular heap region or that were allocated in one particular way),
- Can be defined in an implementation-independent way (i.e., without relying on any particular GC algorithm or VM implementation), and
- Can give information about both live and dead Java objects.

Motivation

There is a deep need for users to understand the contents of their heaps. Poor heap management can lead to problems such as heap exhaustion and GC thrashing. As a result, a number of tools have been developed to let users introspect their heaps, such as Java Flight Recorder, jmap, YourKit, and VisualVM.

One piece of information that is lacking from most of the existing tooling is the call site for particular allocations. Heap dumps and heap histograms do not contain this information. This information can be critical to debugging memory issues, because it tells developers the exact location in their code where particular (and particularly bad) allocations occurred.

There are currently two ways of getting this information out of HotSpot:

First, you can instrument all of the allocations in your application using a bytecode rewriter (like the one at https://github.com/google/allocation-instrumenter). You can then have the instrumentation take a stack trace (when you want one).

Second, you can use Java Flight Recorder, which takes a stack trace on TLAB refills and when allocating directly into the old generation. The downsides of this are that a) it is tied to a particular allocation implementation (TLABs), and misses allocations that don't meet that pattern; b) it doesn't allow the user to customize the sampling rate; c) it only logs allocations, so you cannot distinguish between live and dead objects; and d) it is proprietary, so it cannot be user-extended.

This proposal mitigates those problems by providing an extensible JVMTI interface that lets the user define the sampling rate and returns sets of sampled stack traces for both live and recently garbage-collected objects.

Description

The user facing API for the heap sampling feature proposed by this JEP consists of an extension to JVMTI that allows for heap profiling. The following structure represents a single heap sample:

struct StackTraceData {
  jvmtiStackInfo *trace;
  jint byte_size;
  jlong thread_id;
  const jbyte *name;
  jint name_length;
  jlong uid;
  void *context;
};

where trace is the stack trace where the allocation event happened; byte_size is the size of the allocation (in bytes); thread_id is the Java thread id; name is the name of the class being allocated, and name_length the length of that name; uid is a unique identifier for this allocation; and context is a user-supplied piece of context information.

The new API also includes several new JVMTI methods. The first method added by the API enables tracing:

jvmtiError StartHeapSampling(
    jvmtiEnv *env, jlong (*sampling_interval)(const StackTraceData *));

The function sampling_interval is user-provided; when called, it will return the number of bytes to be allocated before the next sample is taken. It is passed the sample currently being taken (or NULL). Note that JNI cannot be called during the sampling_interval call.
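As an illustration, a minimal interval function and its registration might look like the following sketch. The fixed 512 KiB interval, the function names, and the direct call to StartHeapSampling are assumptions for the example; a real agent would resolve StartHeapSampling through the JVMTI extension-function mechanism.

#include <jvmti.h>

/* A user-provided interval function: always sample after another
 * 512 KiB of allocation (an arbitrary choice for this sketch).
 * `sample` is the sample just taken, or NULL. No JNI calls are
 * allowed in here. */
static jlong fixed_sampling_interval(const StackTraceData *sample) {
  (void)sample;  /* a fixed policy ignores the sample */
  return 512 * 1024;
}

/* Enable sampling; `jvmti` is assumed to come from Agent_OnLoad. */
static void enable_sampling(jvmtiEnv *jvmti) {
  jvmtiError err = StartHeapSampling(jvmti, &fixed_sampling_interval);
  if (err != JVMTI_ERROR_NONE) {
    /* log and continue without sampling */
  }
}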

In order to keep the interface simple, a second call to StartHeapSampling can replace the existing callback. This also provides a means to disable the sampling callback: the code can pass a null pointer as the callback mechanism.
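Continuing the sketch above (other_sampling_interval is another hypothetical interval function):

/* Replace the current interval function with a different one. */
StartHeapSampling(jvmti, &other_sampling_interval);

/* Disable sampling by passing a null callback. */
StartHeapSampling(jvmti, NULL);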

In its initial form, the proposal has the JVM handle the bookkeeping, and two more functions provide a means to inspect the current allocation behavior:

jvmtiError GetLiveTraces(
    jvmtiEnv *env, StackTraceData **stack_traces_ptr, int *num_traces_ptr);
jvmtiError GetGarbageTraces(
    jvmtiEnv *env, StackTraceData **stack_traces_ptr, int *num_traces_ptr);

These functions get the information about sampled objects. GetLiveTraces gets sampled information associated with objects that have not been garbage collected yet. GetGarbageTraces returns some of the objects that have recently been garbage collected.
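For example, an agent might periodically dump the live samples along the following lines. This is a sketch: it assumes the agent owns the returned array and releases it with JVMTI's Deallocate, which the JEP does not yet pin down.

#include <stdio.h>
#include <jvmti.h>

/* Print the size and class of every currently live sampled object. */
static void dump_live_samples(jvmtiEnv *jvmti) {
  StackTraceData *traces = NULL;
  int num_traces = 0;
  if (GetLiveTraces(jvmti, &traces, &num_traces) != JVMTI_ERROR_NONE)
    return;
  for (int i = 0; i < num_traces; i++) {
    printf("sample %lld: %d bytes of %.*s on thread %lld\n",
           (long long)traces[i].uid,
           (int)traces[i].byte_size,
           (int)traces[i].name_length,
           (const char *)traces[i].name,
           (long long)traces[i].thread_id);
  }
  (*jvmti)->Deallocate(jvmti, (unsigned char *)traces);
}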

Internally, the system remembers the last X garbage-collected sampled objects. In our local implementation, we have set X to 200 and have gotten good results. We have used two replacement policies, recently garbage collected and statistically sampled garbage collected, where:

- Recently garbage collected keeps the traces in a ring buffer: it simply discards the oldest sampled trace when a new one arrives.
- Statistically sampled garbage collected evicts an old trace for a new one with diminishing probability over time: it replaces a random entry with probability 1/samples_seen (sketched below). This strategy tends towards preserving the most frequently occurring traces over time.
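A sketch of the statistical policy, assuming a fixed-size table of saved traces and a counter of garbage-collected samples seen so far; the names are illustrative, not the actual implementation.

#include <stdlib.h>

#define GARBAGE_TRACES 200   /* the X above; 200 in our implementation */

static StackTraceData *garbage_traces[GARBAGE_TRACES];
static long samples_seen = 0;

/* Once the table is full, a new garbage-collected trace replaces a
 * random saved entry with probability 1/samples_seen, so frequently
 * occurring traces tend to survive. */
static void record_garbage_trace(StackTraceData *trace) {
  samples_seen++;
  if (samples_seen <= GARBAGE_TRACES) {
    garbage_traces[samples_seen - 1] = trace;   /* still filling up */
  } else if (rand() % samples_seen == 0) {      /* 1/samples_seen */
    garbage_traces[rand() % GARBAGE_TRACES] = trace;
  }
  /* otherwise the new trace is dropped */
}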

Internal Object Allocation Tracking

The following section describes how the JVM tracks this information internally.

A. Object Allocation

Object allocation is sampled using the threshold provided by the StartHeapSampling API explained above. The threshold is stored in a thread-local variable, so each thread maintains its own counter. The interpreter and JIT compilers are modified to emit, at every allocation, a subtraction from the threshold and a comparison. If the threshold is <= 0, the slow path is taken to gather a stack trace, perform the callback, and reset the threshold.
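In C, the emitted logic amounts to the following sketch; ThreadLocalState, take_heap_sample, and user_sampling_interval are illustrative names, not HotSpot internals.

/* The counter is thread-local, so the fast path is one subtraction
 * and one compare, with no synchronization. */
static void note_allocation(ThreadLocalState *thread, jint object_size) {
  thread->bytes_until_sample -= object_size;
  if (thread->bytes_until_sample <= 0) {
    /* Slow path: gather the stack trace, invoke the user callback,
     * and reset the countdown from the interval it returns. */
    StackTraceData *sample = take_heap_sample(thread, object_size);
    thread->bytes_until_sample = user_sampling_interval(sample);
  }
}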

The system adds the stack trace, obtained via AsyncGetCallTrace, and a reference to the object to an internal list, for future garbage-collection statistics and information.
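One plausible shape for an entry on that internal list is sketched below; the record layout and the use of a weak handle are assumptions for illustration, not the actual implementation.

/* Bookkeeping record for one sampled allocation: the sample data
 * (including the AsyncGetCallTrace stack trace) plus a weak handle
 * to the object, so its death can be detected later. */
typedef struct SampledObject {
  StackTraceData data;          /* trace, size, thread, class name */
  jweak object;                 /* weak handle to the sampled object */
  struct SampledObject *next;   /* singly linked live list */
} SampledObject;

static SampledObject *live_list = NULL;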

B. Garbage Collection

During reference processing, the system walks the list of internally sampled objects and checks whether each object is still live. If an object is no longer live, the system removes it from the list of currently live objects and pushes it onto a garbage-collected list, as sketched below.
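A sketch of that walk, reusing the illustrative SampledObject record and record_garbage_trace policy from the earlier sketches, and assuming an is_still_live query into the collector:

/* Assumed query into the collector: nonzero while the sampled
 * object is still reachable. */
extern int is_still_live(jweak object);

/* Dead entries are unlinked from the live list and handed to a
 * garbage-collected list, e.g. via record_garbage_trace above. */
static void process_sampled_objects(void) {
  SampledObject **link = &live_list;
  while (*link != NULL) {
    SampledObject *entry = *link;
    if (is_still_live(entry->object)) {
      link = &entry->next;                 /* keep on the live list */
    } else {
      *link = entry->next;                 /* unlink the dead entry */
      record_garbage_trace(&entry->data);  /* push to a garbage list */
    }
  }
}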

In our implementation, there are two different lists maintained for user perusal:

- Recently garbage collected
- Frequently garbage collected

Alternatives

There are multiple alternatives to the system presented in this JEP. The introduction presented two already:

- The JFR system provides an alternative, but a licensing issue/limitation makes it unusable by some interested parties.
- Bytecode instrumentation using ASM is an alternative, but its overhead makes it prohibitive and not a workable solution.

The JFR system also uses TLAB creation as a means to track memory allocation. A third alternative could leverage/expose the creation of new TLABs through a callback system, as explained by Tony Printezis in the mail thread: http://mail.openjdk.java.net/pipermail/serviceability-dev/2015-June/017543.html. However, sampling would then depend on the changing TLAB size and might suffer statistical bias as well.

Finally, a fourth alternative would, instead of implementing the bookkeeping internally, have the JVM simply expose a set of callbacks when allocations/GC happen and let the user handle all of the bookkeeping. The advantage is that the user controls the extent of what is maintained.

The disadvantages are the potential overhead of extra calls to the outside world and the risk of user error: many things are not possible during an allocation, such as the creation of weak references. To enable such a callback system, the documentation would have to be crystal clear and provide sufficient warning to reduce the risk of complex JVM crashes. Due to the error-prone nature of providing a callback, the flexibility advantage might be outweighed by the risks of user mistakes. A study will be conducted to assess what real risks exist, how the Java toolchain could mitigate them, and what the extra overhead would be.

Testing

We have an implementation that we have validated on x86 Linux and are using in production at Google. Further testing needs to be done in four parts:

- Overhead testing (performance non-regression) with benchmarks such as SPECjvm98, SPECjvm2008, and SPECjbb2005.
- Testing of the stack-trace retrieval process on non-x86 architectures; we should support as many architectures and OSs supported by the JDK as possible.
- Testing on non-Linux machines.
- Tests on the default size of the object lists, to assess memory/CPU usage.

Risks and Assumptions

There is a performance/memory cost to the feature. In the prototype implementation at Google, the overhead is minimal (<2%), but this should be measured with a set of agreed-upon benchmarks.

Trusting the data is a risk: stack-trace gathering has proven to be tricky, especially when trying to handle every corner case, and it becomes more complex still across different architectures.

Dependences

None known.