JEP 345: NUMA-Aware Memory Allocation for G1

Owner: Sangheon Kim
Type: Feature
Scope: JDK
Status: Candidate
Component: hotspot / gc
Discussion: hotspot dash gc dash dev at openjdk dot java dot net
Effort: M
Duration: M
Duplicates: JEP 157: G1 GC: NUMA-Aware Allocation
Reviewed by: Mikael Vidstedt, Stefan Johansson, Thomas Schatzl
Endorsed by: Mikael Vidstedt
Created: 2018/09/06 22:46
Updated: 2018/11/10 14:26
Issue: 8210473

Summary

Improve G1 performance on large machines by implementing NUMA-aware memory allocation.

Non-Goals

Motivation

Modern multi-socket machines increasingly exhibit non-uniform memory access (NUMA), that is, memory is not equidistant from every socket or core. Memory accesses to different sockets have different performance characteristics, with accesses to more-distant sockets typically having higher latency.

The parallel collector (enabled by -XX:+UseParallelGC) has been NUMA-aware for many years, and this has helped improve the performance of configurations that run a single JVM across multiple sockets. Other HotSpot collectors have not had this feature, which means they have not been able to take advantage of such vertical multi-socket NUMA scaling. Large enterprise applications in particular tend to run with large heap configurations on multiple sockets, yet they want the manageability advantage of running within a single JVM. Customers using the G1 collector increasingly run up against this scaling bottleneck.

Description

G1's heap is organized as a collection of fixed-size regions. A region is typically a set of physical pages, although when using large pages (via -XX:+UseLargePages) several regions may make up a single physical page.
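To make the size relationship concrete, here is a small illustrative calculation; the 2 MB region size and the 4 KB / 1 GB page sizes are example values only (region size is chosen by G1 or via -XX:G1HeapRegionSize, and page sizes depend on the platform), not figures taken from this JEP.

    #include <cstddef>
    #include <cstdio>

    int main() {
        // Illustrative sizes only; not queried from the JVM or the OS.
        const std::size_t region_size = 2 * 1024 * 1024;        // 2 MB region
        const std::size_t small_page  = 4 * 1024;               // 4 KB base page
        const std::size_t large_page  = 1024UL * 1024 * 1024;   // 1 GB huge page

        // With base pages, one region spans many pages, so each region can be
        // placed on a NUMA node independently of the others.
        std::printf("pages per region (4 KB pages):  %zu\n", region_size / small_page);

        // With 1 GB large pages, many regions share a single physical page, so
        // those regions necessarily reside on the same NUMA node.
        std::printf("regions per page (1 GB pages):  %zu\n", large_page / region_size);
        return 0;
    }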

Most modern OSes provide interfaces through which the memory topology of the platform can be queried and physical memory preferentially mapped from a specific locality group (henceforth, "lgrp"); the collector can use these interfaces. When the JVM is initialized, all regions will be evenly split across the available lgrps and touched by threads bound to the corresponding lgrp, so that each region is preferentially allocated on its lgrp. Fixing the lgrp of every region at initialization is somewhat inflexible, but this drawback can be mitigated by the enhancements described below.
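On Linux, for example, the libnuma API exposes such interfaces. The following sketch (not HotSpot code) reserves a heap, divides it into fixed-size regions, and distributes the regions evenly across the available NUMA nodes. For simplicity it sets an explicit per-region memory policy with numa_tonode_memory and touches the pages from a single thread, rather than binding threads to each lgrp and relying on first touch as the JEP describes; the region and heap sizes are illustrative.

    // Minimal Linux-only sketch (not HotSpot code) of the initialization step
    // described above.  Build with: g++ -O2 numa_init.cpp -lnuma
    #include <numa.h>        // libnuma: numa_available, numa_tonode_memory, ...
    #include <sys/mman.h>    // mmap, munmap
    #include <cstddef>
    #include <cstdio>
    #include <cstring>

    int main() {
        if (numa_available() < 0) {
            std::fprintf(stderr, "NUMA API not available on this system\n");
            return 1;
        }

        const std::size_t region_size = 2 * 1024 * 1024;   // illustrative region size
        const std::size_t heap_size   = 64 * region_size;  // illustrative heap size
        const int nodes = numa_num_configured_nodes();

        // Reserve the whole heap as one anonymous mapping; no physical pages
        // are committed to a particular node yet.
        void* heap = mmap(nullptr, heap_size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (heap == MAP_FAILED) {
            std::perror("mmap");
            return 1;
        }

        // Evenly distribute regions across the available nodes: give each region
        // a preferred node, then touch it so its pages are faulted in there.
        char* base = static_cast<char*>(heap);
        for (std::size_t i = 0; i < heap_size / region_size; i++) {
            char* region = base + i * region_size;
            int   node   = static_cast<int>(i % nodes);
            numa_tonode_memory(region, region_size, node);  // set per-region policy
            std::memset(region, 0, region_size);            // first touch
        }

        std::printf("Bound %zu regions across %d NUMA node(s)\n",
                    heap_size / region_size, nodes);
        munmap(heap, heap_size);
        return 0;
    }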

Regions are used to allocate memory for mutators and to copy surviving objects during GC. When such a request occurs, G1 preferentially selects a free region from the lgrp of the requesting thread, i.e., the object will be kept on the same lgrp while it is in the young generation. If a mutator finds no free region on its lgrp, G1 will trigger a garbage collection. An alternative idea to be evaluated is to search other lgrps for free regions in order of distance, closest first.
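A rough sketch of such a selection policy follows; the FreeRegionLists type and claim_free_region function are invented for illustration and are not taken from G1. It prefers the calling thread's lgrp and, as in the alternative idea above, falls back to other lgrps ordered by numa_distance; in G1 as proposed, a mutator that finds no free region on its own lgrp would instead trigger a GC.

    // Hypothetical sketch, not G1's implementation.
    // Build with: g++ -O2 region_select.cpp -lnuma
    #include <numa.h>      // numa_available, numa_node_of_cpu, numa_distance
    #include <sched.h>     // sched_getcpu (glibc)
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Hypothetical per-lgrp free lists; in G1 these would hold heap regions.
    struct FreeRegionLists {
        std::vector<std::vector<int>> per_node;   // region indices, one list per lgrp
    };

    // Claim a free region, preferring the lgrp of the calling thread.  Returns -1
    // if no free region exists anywhere (the point at which G1 would start a GC).
    int claim_free_region(FreeRegionLists& fr) {
        int my_node = numa_node_of_cpu(sched_getcpu());

        // Visit lgrps in order of distance from the current one, closest first.
        std::vector<int> order(fr.per_node.size());
        for (int n = 0; n < (int)order.size(); n++) order[n] = n;
        std::sort(order.begin(), order.end(), [my_node](int a, int b) {
            return numa_distance(my_node, a) < numa_distance(my_node, b);
        });

        for (int node : order) {
            std::vector<int>& list = fr.per_node[node];
            if (!list.empty()) {
                int region = list.back();
                list.pop_back();
                return region;             // closest available region wins
            }
        }
        return -1;                         // nothing free: caller triggers a GC
    }

    int main() {
        if (numa_available() < 0) return 1;
        FreeRegionLists fr;
        fr.per_node.resize(numa_num_configured_nodes());
        for (int n = 0; n < (int)fr.per_node.size(); n++)
            fr.per_node[n] = {n * 100, n * 100 + 1};   // fake region indices
        std::printf("claimed region %d\n", claim_free_region(fr));
        return 0;
    }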

There is no particular attempt to keep objects on the same lgrp in the old generation.

Humongous regions are excluded from this allocation policy. Nothing special will be done for these regions.

Testing

Existing tests with the option -XX:+UseNUMA should flush out any correctness issues. This JEP assumes the use of NUMA hardware for testing.

There should be no performance difference, relative to the original code, when NUMA-aware allocation is turned off.

Risks and Assumptions

We assume that most short-lived objects are often accessed by the thread that allocated them. This is certainly true for the majority of short-lived objects in most object-oriented programs. However, there are some programs where this assumption does not quite hold, so there may be performance regressions in some cases. In addition, the benefit depends on the interplay between the extent of the NUMA-ness of the underlying system and how frequently threads migrate between lgrps on such systems, especially when load is high.