JEP 157: G1 GC: NUMA-Aware Allocation
|Author||Y. Srinivas Ramakrishna|
|Component||hotspot / gc|
|Discussion||hotspot dash gc dash dev at openjdk dot java dot net|
|Reviewed by||Igor Veresov, Jesper Wilhelmsson, Jon Masamitsu, Paul Hohensee, Tony Printezis|
|Endorsed by||Mikael Vidstedt|
Enhance G1 to improve allocation performance on NUMA memory systems.
It is not a goal to extend NUMA-awareness to any OS other than Linux and Solaris, which provide appropriate NUMA interfaces.
Modern multi-socket machines are increasingly NUMA, with not all memory equidistant from each socket or core. The more traditional SMPs using conventional dance-hall architectures are increasingly rare, except perhaps at the very high end, probably because of the cost and difficulty of scaling up such architectures and the resulting latency and bandwidth limitations of their interconnects. Most modern OSes, starting with Solaris about a decade ago, now offer interfaces through which the memory topology of the platform can be queried and physical memory preferentially mapped from a specific locality group. HotSpot's ParallelScavengeHeap has been NUMA-aware for many years now, and this has helped scale the performance of configurations that run a single JVM over multiple sockets, presenting a NUMA platform to the JVM. Certain other HotSpot collectors, most notably the concurrent ones, have not had the benefit of this feature and have not been able to take advantage of such vertical multi-socket NUMA scaling. As large enterprise applications run in large heap configurations and need the power of multiple sockets, yet want the manageability advantage of running within a single JVM, we expect customers using our concurrent collectors to run up against this scaling bottleneck increasingly often.
This JEP aims to extend NUMA-awareness to the heap managed by the G1 garbage collector.
G1's heap is organized as a collection of fixed-size regions from what currently happens to be a convex interval of the virtual address space. Generations, or individual logical spaces (such as Eden, Survivor, and Old), are then formed as dynamic disjoint subsets of this collection of regions. A region is typically a set of physical pages, although when using very large pages (say 256M superpages on SPARC), several regions may make up a single physical page.
To make G1's allocation NUMA-aware we shall initially focus on the so-called Eden regions. Survivor regions may be considered in a second enhancement phase, but are not within the scope of this JEP. At a very high level, we want to fix the Eden regions to come from a set of physical pages that are allocated at specific locality groups (henceforth, "lgrps"). The idea is analogous to the NUMA spaces used by ParallelScavengeHeap. Let's call these "per-lgrp region pools", for lack of a better phrase.
We envisage the lifetime of an Eden region to be roughly as follows:
Each region starts off as an untouched region with no allocated physical pages.
Eden regions have backing pages allocated in specific locality groups.
Initially a region is untouched and is not associated with any specific locality group.
Each thread, when it starts out, queries and records its home lgrp (henceforth the "thread's lgrp", for short).
When a TLAB request is made by a thread whose lgrp is L, we look in the per-lgrp region pool for L. If there is a current allocation region in L, it is used to satisfy the TLAB allocation request. If the current allocation region is NULL, or the free space in it is too small to satisfy the TLAB request, then a new region is allocated out of the region pool for L, and becomes the current allocation region which will supply that and subsequent TLAB requests. A region taken from the pool for L has been previously touched and already has pages allocated to it from the lgrp L. If the region pool for L is empty, we check the global pool to see if a free Eden region is available, and this region is then assigned to pool L. At this point the region is untouched and has no pages allocated to it (or was most recently madvised to free). An appropriate lgrp API (either prescriptive or descriptive) is used to ensure that physical pages for this region are allocated in the local lgrp L.
If there are no available regions in the global (untouched) Eden pool, and Eden cannot be grown (for policy or other reasons), a scavenge will be done. An alternative is to steal already biased but unallocated regions from another lgrp and migrate them to this lgrp, but the suggested policy above follows the policy implemented in PS, where such migration-on-demand was found to be less efficient than adaptive migration following a scavenge (see below).
At each scavenge, the occupancy of the per-lgrp pools is assessed and an appropriately weighted medium-term or moving-window average is used to determine if there are unused or partially-used regions that must be madvised to free so as to adaptively resize the per-lgrp pools.
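The region lifetime described above can be modeled in a few lines of Java. This is only a toy sketch under stated assumptions, not HotSpot code (which is written in C++): the class NumaEdenPoolsSketch, its methods, the weight ALPHA, and the choice of an exponentially weighted average as the "moving-window average" are all hypothetical. Setting a region's lgrp field stands in for the lgrp API call that binds pages, and resetting it to -1 stands in for madvising the region to free.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model of per-lgrp Eden region pools. All names are hypothetical. */
class NumaEdenPoolsSketch {
    static final double ALPHA = 0.5; // assumed weight of the newest sample

    static final class Region {
        int lgrp = -1; // -1: untouched, no pages biased to any lgrp
        int free;      // free space left for TLABs
        Region(int size) { free = size; }
    }

    private final int regionSize;
    private final Deque<Region> globalPool = new ArrayDeque<>();      // untouched regions
    private final List<Deque<Region>> lgrpPools = new ArrayList<>();  // biased, unused regions
    private final List<List<Region>> handedOut = new ArrayList<>();   // regions used since last scavenge
    private final Region[] current;                                   // current allocation region per lgrp
    private final double[] avgUsed;                                   // weighted average of regions used

    NumaEdenPoolsSketch(int lgrps, int regions, int regionSize) {
        this.regionSize = regionSize;
        current = new Region[lgrps];
        avgUsed = new double[lgrps];
        for (int l = 0; l < lgrps; l++) {
            lgrpPools.add(new ArrayDeque<>());
            handedOut.add(new ArrayList<>());
        }
        for (int i = 0; i < regions; i++) globalPool.add(new Region(regionSize));
    }

    int globalPoolSize() { return globalPool.size(); }

    /** Returns the region that served the TLAB, or null if a scavenge is needed. */
    Region allocateTlab(int lgrp, int tlabSize) {
        if (tlabSize > regionSize) throw new IllegalArgumentException("TLAB larger than a region");
        Region r = current[lgrp];
        if (r == null || r.free < tlabSize) {
            r = lgrpPools.get(lgrp).poll();   // previously touched, already biased to lgrp
            if (r == null) {
                r = globalPool.poll();        // untouched region from the global pool
                if (r == null) return null;   // nothing left: the caller would scavenge
                r.lgrp = lgrp;                // stands in for the lgrp API call
            }
            current[lgrp] = r;
            handedOut.get(lgrp).add(r);
        }
        r.free -= tlabSize;
        return r;
    }

    /** Empties Eden, updates the averages, and releases surplus biased regions. */
    void afterScavenge() {
        for (int l = 0; l < current.length; l++) {
            Deque<Region> pool = lgrpPools.get(l);
            List<Region> used = handedOut.get(l);
            for (Region r : used) { r.free = regionSize; pool.add(r); }
            avgUsed[l] = ALPHA * used.size() + (1 - ALPHA) * avgUsed[l];
            used.clear();
            current[l] = null;
            int keep = (int) Math.ceil(avgUsed[l]);
            while (pool.size() > keep) {
                Region r = pool.poll();
                r.lgrp = -1;                  // stands in for madvising the region to free
                globalPool.add(r);
            }
        }
    }

    public static void main(String[] args) {
        NumaEdenPoolsSketch pools = new NumaEdenPoolsSketch(2, 3, 100);
        pools.allocateTlab(0, 60);
        pools.allocateTlab(0, 60);
        pools.allocateTlab(1, 10);
        System.out.println("scavenge needed: " + (pools.allocateTlab(0, 60) == null));
        pools.afterScavenge();
        System.out.println("regions back in global pool: " + pools.globalPoolSize());
    }
}
```

Note that the sketch never steals a biased region from another lgrp on demand. Regions change pools only in afterScavenge, which mirrors the preference stated above for adaptive migration following a scavenge.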
Humongous regions are naturally eliminated from this allocation policy since such regions are not considered part of Eden anyway, so nothing special will need to be done for such regions. (A reasonable policy for such regions may be to interleave or randomly allocate pages uniformly across all lgrps to optimize the worst-case performance assuming uniform random access from each lgrp.)
ParallelScavengeHeap allocates pages from a survivor space in round-robin fashion. As mentioned above, NUMA-biasing of survivor regions is not a goal of this JEP.
When using large pages, where multiple regions map to the same physical page, things get a bit complicated. For now, we will finesse this by disabling NUMA optimizations as soon as the page size exceeds some small multiple of region size (say 4), and deal with the more general case in a separate later phase. When the page size is below this threshold, we shall allocate and bias contiguous sets of regions into the per-lgrp Eden pools. This author is not sufficiently familiar with current region allocation policy, but believes that this will likely require some small changes to existing region allocation policy in G1 to allow allocating a set of regions at a time.
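The large-page rule above amounts to a simple threshold test. The following is a minimal sketch, assuming the example multiple of 4 from the text; the class and method names are hypothetical, and treating a page size of exactly 4 regions as still eligible is also an assumption, since the text does not say which side of the threshold that case falls on.

```java
/** Hypothetical helper illustrating the large-page threshold. */
class NumaLargePagePolicySketch {
    static final long MAX_PAGE_TO_REGION_RATIO = 4; // "say 4" in the text

    /**
     * Returns 0 if NUMA optimizations should be disabled for this page size,
     * otherwise the number of contiguous regions to allocate and bias together
     * so that each biased unit covers whole physical pages.
     */
    static long regionsPerBiasUnit(long pageSize, long regionSize) {
        if (pageSize <= 0 || regionSize <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        if (pageSize > MAX_PAGE_TO_REGION_RATIO * regionSize) {
            return 0; // page spans too many regions: fall back to non-NUMA allocation
        }
        // Ceiling division: one region when pages are smaller than a region.
        return (pageSize + regionSize - 1) / regionSize;
    }
}
```

With 1 MB regions, for example, 4 KB pages give a unit of one region, 4 MB pages give a unit of four contiguous regions, and 256 MB pages disable the optimization.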
The -XX:+UseNUMA command line switch should enable the feature for G1
if -XX:+UseG1GC is also used. If the option is found to perform well
for a large class of programs, we may enable it by default on NUMA
platforms (as I think is the case for ParallelScavenge today). Other
options related to NUMA adaptation and features should be supported in
the same manner as for ParallelScavenge heap. We should avoid any
collector-specific options for NUMA to the extent possible.
Normal testing (with -XX:+UseNUMA as appropriate) should flush out any
correctness issues. This JEP assumes the use of NUMA hardware for
testing. Targeted performance testing will be done, using a variety of
benchmarks and applications on a variety of NUMA and non-NUMA platforms.
Risks and Assumptions
As in the case of the ParallelScavenge collector, an assumption of the implementation here is that most short-lived objects are accessed most often by the thread that allocated them. This is certainly true of the majority of short-lived objects in most object-oriented programs, as experience with ParallelScavenge has already shown us. There is, however, some small class of programs where this assumption does not quite hold. The benefits also depend on the interplay of the extent of NUMA-ness of the underlying system and the overheads associated with migrating pages on such systems, especially in the face of frequent thread migrations when load is high. Finally, there may be platforms for which the appropriate lgrp interfaces are either not publicly accessible or available, or have not been implemented for other reasons.
There is some risk that the assignment of regions to specific lgrp pools will reduce some flexibility in terms of moving regions between various logical spaces, but we do not consider this a serious impediment.
Somewhat more seriously, the assignment of regions to lgrp pools will cause some internal fragmentation within these pools, which is not dissimilar to the case of ParallelScavengeHeap. This is a known issue and, to the extent that the unit of lgrp-allocation in ParallelScavengeHeap is a page and that of G1 is a region which may be several (smaller) pages, we would not typically expect the G1 implementation to perform any better than the ParallelScavengeHeap one.