JEP draft: Region Pinning in G1

Owner: Hamlin Li
Type: Feature
Scope: Implementation
Status: Draft
Component: hotspot / gc
Effort: L
Duration: M
Reviewed by: Thomas Schatzl
Created: 2021/10/28 08:05
Updated: 2021/11/26 14:29
Issue: 8276094

Summary

Support region pinning in G1 to avoid the need to disable garbage collection during JNI critical regions, eliminating the latency this currently adds.

Goals

Motivation

For interoperability with unmanaged programming languages, JNI provides functions to obtain and release raw pointers to Java objects (e.g. GetXXXCritical and ReleaseXXXCritical). Code running between a pair of these calls is treated as running in a "critical region". While any Java thread is in such a critical region, the JVM must take care not to move the object whose pointer was handed out (the "critical object") during garbage collection.

G1, the current default garbage collector, implements critical region support by disabling garbage collection while any Java thread is in such a critical region.

This way of handling JNI critical regions has a significant undesirable latency impact on Java threads. The severity depends on the number of Java threads that use these JNI functions and on the frequency and duration of the critical regions, but users report critical regions blocking garbage collection, and with it the whole application, for minutes, as well as spurious out-of-memory conditions due to starvation. The latter can lead to premature VM shutdown.

Description

The current mechanism to disable garbage collections in G1 works as follows: G1 records which Java threads are in a critical region. If a Java thread requests a garbage collection while any thread is in a JNI critical region, G1 suspends the requesting thread until all Java threads currently in a JNI critical region have exited it. While the request is pending, G1 also records and suspends all subsequent Java threads that try to enter a JNI critical region, perform selected virtual machine mode transitions, or request further garbage collections. G1 uses a global mutex called GCLocker to implement this suspend/resume mechanism. Only after all JNI critical regions have been exited does G1 execute the pending garbage collection; the VM subsequently resumes execution of all previously suspended threads.
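The protocol above can be sketched as a small state machine. The following is a single-threaded toy model, not HotSpot code: the class GcLockerModel and all of its members are hypothetical names chosen for illustration, and real thread suspension is modelled simply by adding a thread name to a list.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the GCLocker-style protocol described above (all names hypothetical). */
class GcLockerModel {
    private int threadsInCritical = 0;   // Java threads currently inside a JNI critical region
    private boolean gcPending = false;   // a collection was requested while the count was non-zero
    private final List<String> stalled = new ArrayList<>(); // threads "suspended" behind the pending GC
    private int collectionsRun = 0;

    /** Returns true if the thread entered its critical region, false if it was stalled. */
    boolean enterCritical(String thread) {
        if (gcPending) {                 // a GC is waiting: do not let new critical regions start
            stalled.add(thread);
            return false;
        }
        threadsInCritical++;
        return true;
    }

    /** Leaves a critical region; the last thread out triggers the pending collection. */
    void exitCritical() {
        if (threadsInCritical == 0) {
            throw new IllegalStateException("no thread is in a critical region");
        }
        threadsInCritical--;
        if (threadsInCritical == 0 && gcPending) {
            runCollection();
        }
    }

    /** Returns true if the collection ran immediately, false if the requester was stalled. */
    boolean requestGc(String thread) {
        if (threadsInCritical > 0 || gcPending) {
            gcPending = true;
            stalled.add(thread);
            return false;
        }
        runCollection();
        return true;
    }

    private void runCollection() {
        collectionsRun++;
        gcPending = false;
        stalled.clear();                 // model: all previously suspended threads resume
    }

    int collectionsRun() { return collectionsRun; }
    int stalledThreads() { return stalled.size(); }
}
```

In this model a single long-running critical region keeps every later GC request, and every later attempt to enter a critical region, in the stalled list, which is the latency problem described in the Motivation section.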

The main idea presented in this JEP is to keep collecting garbage in heap regions that do not contain a critical object, instead of disabling garbage collection completely.

G1 is a region-based incremental collector: it can already collect parts of the heap at the granularity of a heap region. Further, some of these regions may already be treated as locked in place (marked as "pinned") during any garbage collection. This JEP aims to extend this capability to any type of region during any kind of garbage collection.
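As a rough illustration of this idea, the sketch below filters a set of candidate regions down to those that may be moved. It assumes a per-region pin counter that is incremented when a critical object in the region is obtained and decremented when it is released; the counter, the Region class and the evacuable method are hypothetical and are not taken from the G1 sources.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: collect everything except regions that currently hold a critical object. */
class PinnedRegionSketch {
    static final class Region {
        final int index;
        int pinCount;                    // hypothetical: number of critical objects held in this region

        Region(int index) { this.index = index; }

        void pin()   { pinCount++; }     // would be called when a critical object is obtained
        void unpin() { pinCount--; }     // would be called when it is released
        boolean isPinned() { return pinCount > 0; }
    }

    /** Returns the candidate regions whose contents may be moved in this collection. */
    static List<Region> evacuable(List<Region> candidates) {
        List<Region> result = new ArrayList<>();
        for (Region r : candidates) {
            if (!r.isPinned()) {
                result.add(r);           // unpinned: safe to evacuate as usual
            }                            // pinned: skipped, its objects stay in place
        }
        return result;
    }
}
```

The point of the sketch is that a pinned region only removes itself from the set of regions to evacuate; the collection as a whole still proceeds.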

There is existing generic support to notify the JVM of Java threads obtaining and releasing critical objects.

Existing Support for Region Pinning in G1

There already exist a few mechanisms that we intend to exploit for support of pinning of arbitrary regions in the G1 collector.

Modifications to G1 Garbage Collection Algorithms

The existing region pinning support described above suggests the following modifications to the G1 garbage collection algorithms to achieve the desired effect:

Reusing Evacuation Failure Handling

When G1 is unable to find space to evacuate an object during a minor collection, an evacuation failure occurs for that object. The object is kept in place and recorded, and both the object and its containing region are marked as "failed". After evacuation, a separate fixup phase clears the recorded marks, formats the space around the objects that failed evacuation as empty, and relabels these regions as if they were Old regions.

The current implementation assumes that evacuation failure is very rare: typically G1 avoids evacuation failures completely by proper generation sizing or preventive garbage collections. Even if a garbage collection incurs an evacuation failure, the number of affected objects is typically extremely small.

Once this mechanism is repurposed for handling pinned Young regions, neither assumption holds: a still low but expectedly larger number of regions will incur evacuation failure, and at a higher frequency. Further, the number of affected objects is bounded only by the size of the regions, as G1 needs to keep in place not only the objects that actually failed evacuation but all live objects in the region.
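The difference between a normal evacuation and the retained-in-place handling of a pinned Young region can be sketched as follows. This is a simplified model under stated assumptions, not the G1 implementation: objects are reduced to a size in words and a liveness flag, "formatting as empty" is modelled by counting filler words, and all class, field and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: a pinned Young region is handled like a region whose live objects all failed evacuation. */
class PinnedEvacuationSketch {
    enum Kind { YOUNG, OLD, FREE }

    static final class Obj {
        final int words;
        final boolean live;
        Obj(int words, boolean live) { this.words = words; this.live = live; }
    }

    static final class Region {
        Kind kind = Kind.YOUNG;
        boolean pinned;
        int fillerWords;                 // space formatted as empty around retained objects
        final List<Obj> objects = new ArrayList<>();
    }

    /** Collects one Young region, copying live objects into survivorSpace unless the region is pinned. */
    static void collect(Region r, List<Obj> survivorSpace) {
        if (!r.pinned) {
            for (Obj o : r.objects) {
                if (o.live) survivorSpace.add(o);   // normal case: live objects move out
            }
            r.objects.clear();
            r.kind = Kind.FREE;                     // the whole region can be reused
            return;
        }
        // Pinned case: every live object stays where it is, not just the critical one.
        List<Obj> retained = new ArrayList<>();
        for (Obj o : r.objects) {
            if (o.live) {
                retained.add(o);
            } else {
                r.fillerWords += o.words;           // dead space is formatted as empty
            }
        }
        r.objects.clear();
        r.objects.addAll(retained);
        r.kind = Kind.OLD;                          // relabelled as if it were an Old region
    }
}
```

In the pinned branch the work is proportional to the number of live objects in the region, which is why the "very few affected objects" assumption of the existing evacuation failure code no longer applies.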

The original assumptions led to the following design decisions that require significant improvement:

A blog post summarizes the necessary work in detail; see also the linked JIRA issues tagged with the gc-g1-pinned-regions label.

Alternatives

Implementation alternatives for support of critical regions correspond to the ones mentioned in the JNI specification:

The first option is to always copy JNI critical objects to a place where they do not move (e.g. the C heap) and to copy them back afterwards. This has been discarded in the past for being very inefficient in time and space, and nothing has substantially changed about the effort needed for this mechanism. A small optimization could be to only copy objects in regions G1 does not support pinning for, limiting copying to critical objects in Young regions. We do not expect this to improve the situation significantly: many heuristics in the garbage collection area assume that a large fraction of object modification and use occurs in the young generation. This is generally true given the efficiency of existing collection algorithms. We expect that the same applies to JNI critical functions.
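To make the cost of this alternative concrete, the following sketch mimics copy-out and copy-back in plain Java, using a direct ByteBuffer as a stand-in for the C heap. It is only an analogy for what a JVM would do internally: the class CopyInCopyOutSketch and the method withNativeCopy are hypothetical, and the Consumer plays the role of the native code running in the critical region.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;
import java.util.function.Consumer;

/** Sketch: hand native code a copy outside the Java heap, then copy the result back. */
class CopyInCopyOutSketch {
    static void withNativeCopy(int[] array, Consumer<IntBuffer> nativeWork) {
        // Extra space proportional to the array size, allocated outside the Java heap.
        IntBuffer offHeap = ByteBuffer.allocateDirect(array.length * Integer.BYTES)
                .order(ByteOrder.nativeOrder())
                .asIntBuffer();

        offHeap.put(array);              // copy out: touches every element once
        offHeap.rewind();

        nativeWork.accept(offHeap);      // "critical region": works on memory that never moves

        offHeap.rewind();
        offHeap.get(array);              // copy back: touches every element a second time
    }
}
```

A call such as `CopyInCopyOutSketch.withNativeCopy(data, buf -> buf.put(0, 7))` changes a single element but still copies the whole array twice, which illustrates the time and space overhead mentioned above.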

Another option is to pin objects individually. However, G1 can only evacuate whole regions and can only allocate into completely free regions. Since a pinned object keeps its region from being freed (the region is trivially in use), pinning individual objects offers no advantage and only adds code complexity for tracking pinned objects on a per-object basis.

Of course, we could keep and refine the existing mechanism of disabling garbage collection during critical regions using the GCLocker. However, disabling garbage collection fundamentally causes latency problems, and as far as we are aware this approach cannot improve on the status quo.

Apart from those, we have not found other reasonable ideas for implementing region pinning that provide extra benefit (performance, simplicity, ...) over the approach suggested in this JEP.

Testing

Besides functional tests, we especially need benchmarking and performance measurements to collect performance data.

Risks and Assumptions

We assume that there are no changes to the expected usage of JNI critical regions: they are still to be used "sparingly" and these JNI critical regions are "short".

The existing evacuation failure handling mechanisms G1 uses are well understood, so the risk in reusing them seems manageable. As stated before, using them as they are causes some performance problems, but initial prototypes of changes show very good promise.

There is a risk that the application pins many regions at the same time, in the extreme case the entire heap, which leads to an out-of-memory situation. There is currently no solution for this case, but practice suggests that it does not occur: the Shenandoah collector already uses region pinning for JNI critical regions.

One good mitigation for this problem could be to allow allocation in regions that were pinned and are sparsely occupied, using a first-fit algorithm with a linked list of free space around critical objects. This technique may be further improved by tracking pinning on a per-object basis. However, we do not see any of these changes as necessary for this JEP, for the reason mentioned above.
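A minimal sketch of such a first-fit allocator over a linked list of free chunks is shown below. It is not part of this JEP's proposed implementation: the class FirstFitFreeList, the word-offset addressing and the method names are assumptions made for illustration, and the sketch ignores alignment, coalescing and concurrency.

```java
import java.util.Iterator;
import java.util.LinkedList;

/** Sketch: first-fit allocation from the free space left around pinned objects in a region. */
class FirstFitFreeList {
    private static final class Chunk {
        int start;                       // offset of the free chunk within the region, in words
        int size;                        // length of the free chunk, in words
        Chunk(int start, int size) { this.start = start; this.size = size; }
    }

    private final LinkedList<Chunk> free = new LinkedList<>();

    /** Registers a gap between (or around) pinned objects as available space. */
    void addFree(int start, int size) {
        free.add(new Chunk(start, size));
    }

    /** Returns the start offset of the allocation, or -1 if no chunk is large enough. */
    int allocate(int size) {
        Iterator<Chunk> it = free.iterator();
        while (it.hasNext()) {
            Chunk c = it.next();
            if (c.size >= size) {        // first chunk that fits wins
                int result = c.start;
                c.start += size;         // shrink the chunk from the front
                c.size -= size;
                if (c.size == 0) {
                    it.remove();         // chunk fully used up
                }
                return result;
            }
        }
        return -1;
    }
}
```

First-fit keeps the allocation path simple at the cost of possible fragmentation, which seems acceptable for a mitigation that is only expected to matter when many regions are pinned at once.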

Dependencies

The work for this JEP is based on several existing and completed features in G1: