JEP draft: Thread-local handshakes

Owner: Robbin Ehn
Created: 2017/08/01 13:30
Updated: 2017/10/17 20:28
Component: hotspot / runtime
Discussion: hotspot dash dev at openjdk dot java dot net
Reviewed by: Mikael Vidstedt


Introduce a way to execute a callback on threads without performing a global VM safepoint. Make it both possible and cheap to stop individual threads and not just all threads or none.


Success metrics


Being able to stop individual threads has a multitude of applications:

  1. Improving biased lock revocation to only stop individual threads for revoking biases, rather than all of them.

  2. Reducing the overall VM latency impact of different types of serviceability queries, such as acquiring stack traces for all threads, which can be a slow operation on a VM with a large number of Java threads.

  3. Performing safer stack trace sampling by reducing reliance on signals.

  4. Eliding some memory barriers using so-called Asymmetric Dekker Synchronization techniques, by performing handshakes with Java threads. For example, the conditional card mark code, which is inherently required by G1 and used by CMS, will not need memory barriers. As a result, the G1 post-write barrier can be optimized, and branches that try to avoid the memory barrier can be removed.


In the initial implementation there will be a limitation of at most one handshake operation in flight at any given time. The operation can, however, involve any subset of all JavaThreads. The VM thread will coordinate the handshake operation through a VM operation, which will in effect prevent global safepoints from occurring during the handshake operation.
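The coordination described above can be illustrated with a small user-level model. This is only a sketch under stated assumptions, not HotSpot code: the names `HandshakeSketch`, `Worker`, `poll` and `handshake` are hypothetical, plain Java threads stand in for JavaThreads, and a lock stands in for the VM operation that serializes handshakes.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReentrantLock;

final class HandshakeSketch {
    /** Hypothetical stand-in for a JavaThread that polls at its own safe points. */
    static final class Worker extends Thread {
        // Pending operation for this thread only; null means nothing is requested.
        private final AtomicReference<Runnable> pending = new AtomicReference<>();
        private volatile boolean running = true;

        Worker() { setDaemon(true); }

        @Override public void run() {
            while (running) {
                poll();              // stands in for a safepoint poll
                Thread.onSpinWait(); // stands in for application work
            }
        }

        // Run the pending operation, if any, on this thread.
        void poll() {
            Runnable op = pending.getAndSet(null);
            if (op != null) op.run();
        }

        void shutdown() { running = false; }
    }

    // Assumption: a lock models "at most one handshake operation in flight".
    private static final ReentrantLock oneInFlight = new ReentrantLock();

    /** Runs callback once on each target thread; other threads are never stopped. */
    static void handshake(List<Worker> targets, Runnable callback) throws InterruptedException {
        oneInFlight.lock();
        try {
            CountDownLatch done = new CountDownLatch(targets.size());
            for (Worker w : targets) {
                w.pending.set(() -> { callback.run(); done.countDown(); });
            }
            done.await(); // the coordinator waits only for the chosen subset
        } finally {
            oneInFlight.unlock();
        }
    }

    /** Starts three workers, handshakes with two of them, returns the callback count. */
    static int demo() {
        List<Worker> all = List.of(new Worker(), new Worker(), new Worker());
        all.forEach(Thread::start);
        AtomicInteger visited = new AtomicInteger();
        try {
            handshake(all.subList(0, 2), visited::incrementAndGet);
            for (Worker w : all) { w.shutdown(); w.join(); }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return visited.get();
    }

    public static void main(String[] args) {
        System.out.println("threads visited: " + demo());
    }
}
```

In this model `demo()` returns 2: the callback runs on the two targeted workers, while the third keeps running and is never asked to stop.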

Description compiled code fast path

The current safepointing scheme is modified to perform an indirection through a per-thread pointer, which will allow a single thread's execution to be forced to trap on the guard page. Essentially, at all times there will be two polling pages: one which is always guarded, and one which is always readable.

In order to force a thread to yield, the VM updates the per-thread pointer for the corresponding thread to point to the guarded page.
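The indirection can be modeled in a few lines. The following is a simulation under stated assumptions, not the generated machine code: `Page`, `ThreadState`, `arm` and `TrapException` are hypothetical names, an exception stands in for the hardware trap on the guarded page, and the sketch assumes the trap handler re-points the thread at the readable page once it has responded.

```java
final class PollingPageSketch {
    /** Models the trap taken when a thread reads the guarded page. */
    static final class TrapException extends RuntimeException {}

    /** Stand-in for a memory page; reading a guarded page "traps". */
    static final class Page {
        final boolean guarded;
        Page(boolean guarded) { this.guarded = guarded; }
        void read() {
            if (guarded) throw new TrapException();
        }
    }

    // The two process-wide pages shared by all threads.
    static final Page READABLE = new Page(false);
    static final Page GUARDED = new Page(true);

    /** Hypothetical per-thread state holding the polling pointer. */
    static final class ThreadState {
        volatile Page pollingPage = READABLE;
        int traps;

        // A poll is a load through the per-thread pointer.
        void poll() {
            try {
                pollingPage.read();
            } catch (TrapException e) {
                traps++;                // the handshake callback would run here
                pollingPage = READABLE; // assumption: disarm after responding
            }
        }
    }

    /** Forces exactly one thread to trap on its next poll. */
    static void arm(ThreadState target) {
        target.pollingPage = GUARDED;
    }

    /** Arms thread a only; returns the trap counts of a and b after three polls each. */
    static int[] demo() {
        ThreadState a = new ThreadState();
        ThreadState b = new ThreadState();
        arm(a);
        for (int i = 0; i < 3; i++) {
            a.poll();
            b.poll();
        }
        return new int[] { a.traps, b.traps };
    }

    public static void main(String[] args) {
        int[] traps = demo();
        System.out.println("a trapped " + traps[0] + " time(s), b trapped " + traps[1] + " time(s)");
    }
}
```

Here `demo()` returns `{1, 0}`: only the armed thread traps, and only once, while the other thread's polls keep reading the readable page at the cost of one extra pointer load.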


There are multiple other alternatives. Here are a few:

  1. Emit conditional branches instead. This consumes branch predictor state and is not as tight as just a load. Experiments in this area have shown that the performance of conditional branches can be highly dependent on the specific microarchitecture of the target CPU. Another drawback with the conditional branches approach is that each conditional branch safepoint would need a corresponding stub to be output to take care of returning to the location of the poll.

  2. Sacrifice another register and have each poll load from the address held in that register into the register itself, assuming the register initially holds the address of its own thread-local field. A thread-local handshake would start by changing the field to NULL. On the next poll the register would be set to NULL, and on the second poll the load would trap. This requires sacrificing a register globally, the traps are more expensive, and on average it takes twice as many polls to reach the safepoint once the request is made for a thread to stop. The benefit is that it theoretically has a lower impact on application execution.

  3. A previous prototype left the global polling page as-is, but only the actual target thread(s) were caught in the VM code. Threads which were not targets of the handshake would simply return from the signal handler and continue executing. A drawback with this approach is that if a target thread is slow to respond, this can cause a signal storm for other Java threads, since the polling page cannot be disarmed until the target thread has responded.
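The two-poll behavior of the register-based alternative (item 2 above) can be traced with a toy model. This is a sketch only: a small `int` array stands in for memory, index 0 plays the role of NULL, and `FIELD_ADDR`, `reg` and `requestHandshake` are hypothetical names introduced for illustration.

```java
final class SelfPointerPollSketch {
    static final int NULL = 0;       // a load from this address would trap
    static final int FIELD_ADDR = 1; // hypothetical address of the thread-local field

    final int[] memory = new int[2];
    int reg = FIELD_ADDR;            // the sacrificed register

    SelfPointerPollSketch() {
        memory[FIELD_ADDR] = FIELD_ADDR; // the field initially holds its own address
    }

    /** One poll: reg = load(reg). Returns true if the load would trap. */
    boolean poll() {
        if (reg == NULL) {
            return true;   // loading through NULL faults
        }
        reg = memory[reg]; // otherwise the register reloads itself
        return false;
    }

    /** Starts a handshake by overwriting the thread-local field with NULL. */
    void requestHandshake() {
        memory[FIELD_ADDR] = NULL;
    }

    /** Counts the polls executed after the request, up to and including the trap. */
    static int pollsUntilTrap() {
        SelfPointerPollSketch thread = new SelfPointerPollSketch();
        thread.poll(); // polls before the request leave the register unchanged
        thread.poll();
        thread.requestHandshake();
        int polls = 0;
        do {
            polls++;
        } while (!thread.poll());
        return polls;
    }

    public static void main(String[] args) {
        System.out.println("polls until trap: " + pollsUntilTrap());
    }
}
```

`pollsUntilTrap()` returns 2: the first poll after the request only moves NULL into the register, and the second poll is the one that traps. With the per-thread pointer scheme, the first poll after arming already traps.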