JEP 279: Improve Test-Failure Troubleshooting
|Discussion||hotspot dash dev at openjdk dot java dot net, core dash libs dash dev at openjdk dot java dot net|
|Reviewed by||Aleksandre Iline, Brian Goetz|
|Endorsed by||Mikael Vidstedt|
|Relates to||JEP 228: Add More Diagnostic Commands|
|JEP 102: Process API Updates|
Automatically collect diagnostic information which can be used for further troubleshooting in case of test failures and timeouts.
Gather the following information to help diagnose test failures and timeouts:
- For Java processes which are still running on a host after test failure or timeout:
  - C and Java stacks
  - Core dumps (minidumps on Windows)
  - Heap statistics
- Environment information:
  - Running processes
  - CPU and I/O loads
  - Open files and sockets
  - Free disk space and memory
  - Most recent system messages and events
We will develop a library that provides this functionality and co-locate the library sources with the product code.
It is difficult to troubleshoot intermittent test failures when there is no information about the testing environment. Such failures often depend on test execution order and concurrency, which makes them extremely difficult to reproduce.
Currently, there are two extension points in the jtreg test harness. The first one is the timeout handler, which jtreg runs when a test times out. The second one is the observer, which implements the observer design pattern to track different events in a test run.
We will use these extension points to gather diagnostic information and develop a custom observer and timeout handler for jtreg.
Information about environment and non-Java processes will be collected by running platform-specific commands.
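As a rough illustration of how platform-specific commands might be selected, the sketch below maps an os.name value to a small set of command lines. The class name EnvCommands, the method commandsFor, and the particular commands chosen (ps, df, free, vm_stat, tasklist, systeminfo) are assumptions made for this example; they are not taken from the library this JEP describes.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: chooses environment-gathering commands by OS name. */
final class EnvCommands {
    private EnvCommands() {}

    /**
     * Returns illustrative command lines for the given os.name value,
     * or an empty list when the platform is not recognized.
     */
    static List<List<String>> commandsFor(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.startsWith("windows")) {
            return List.of(
                    List.of("tasklist"),      // running processes
                    List.of("systeminfo"));   // general host information
        }
        if (os.contains("mac")) {
            return List.of(
                    List.of("ps", "-ef"),     // running processes
                    List.of("df", "-h"),      // free disk space
                    List.of("vm_stat"));      // memory statistics
        }
        if (os.contains("linux")) {
            return List.of(
                    List.of("ps", "-ef"),     // running processes
                    List.of("df", "-h"),      // free disk space
                    List.of("free", "-m"));   // free memory
        }
        return List.of();                     // unknown platform: nothing to run
    }

    public static void main(String[] args) {
        // Print, rather than run, the commands selected for the current host.
        for (List<String> cmd : commandsFor(System.getProperty("os.name"))) {
            System.out.println(String.join(" ", cmd));
        }
    }
}
```

Keeping the selection separate from the execution makes it possible to check the per-platform command table without actually launching anything.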
Gathering information about Java processes will be done via available diagnostic commands which are heavily extended by JEP 228, e.g., the print_vm_state command which collects information similar to
The information gathered will be stored for later inspection together with test results.
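A minimal sketch of how diagnostic commands could be invoked and their output stored next to the test results, assuming the jcmd tool is on the PATH. It uses the Thread.print and GC.class_histogram diagnostic commands as stand-ins for Java stacks and heap statistics. The class name JcmdPlan, its methods, and the log-file naming scheme are hypothetical.

```java
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper: prepares jcmd invocations whose output goes to a results directory. */
final class JcmdPlan {
    private JcmdPlan() {}

    /** Diagnostic commands used in this sketch (Java stacks and a class histogram). */
    static final List<String> COMMANDS = List.of("Thread.print", "GC.class_histogram");

    /** Builds the command line "jcmd <pid> <diagnosticCommand>". */
    static List<String> commandLine(long pid, String diagnosticCommand) {
        return List.of("jcmd", Long.toString(pid), diagnosticCommand);
    }

    /** Chooses a log file name inside resultDir (naming scheme is an assumption). */
    static Path outputFile(Path resultDir, long pid, String diagnosticCommand) {
        return resultDir.resolve("jcmd-" + pid + "-" + diagnosticCommand + ".log");
    }

    /** Prepares, but does not start, a process that writes stdout and stderr to the log file. */
    static ProcessBuilder plan(Path resultDir, long pid, String diagnosticCommand) {
        return new ProcessBuilder(commandLine(pid, diagnosticCommand))
                .redirectErrorStream(true)
                .redirectOutput(outputFile(resultDir, pid, diagnosticCommand).toFile());
    }

    public static void main(String[] args) {
        long pid = ProcessHandle.current().pid();
        for (String dcmd : COMMANDS) {
            System.out.println(String.join(" ", plan(Path.of("results"), pid, dcmd).command()));
        }
    }
}
```

Calling start() on the returned ProcessBuilder would run the command; the results directory must exist beforehand.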
The observer will collect the information on finishedTest events when tests fail.
Since tests may create other processes, information about test processes and their child processes will be collected. To find such processes, the library will create a process tree with the original test process at the root.
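One way the process tree could be built is with the ProcessHandle API introduced by JEP 102. The sketch below is illustrative only: the class name ProcessTree and its methods are hypothetical, and it uses the current JVM as the root where the library would use the test process. It needs Java 11 or later because of String.repeat.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: walks a process and its descendants via ProcessHandle. */
final class ProcessTree {
    private ProcessTree() {}

    /** Returns the root followed by all of its live descendants. */
    static List<ProcessHandle> collect(ProcessHandle root) {
        List<ProcessHandle> all = new ArrayList<>();
        all.add(root);
        root.descendants().forEach(all::add);
        return all;
    }

    /** Renders the tree as indented "pid command" lines, root first. */
    static String render(ProcessHandle root) {
        StringBuilder out = new StringBuilder();
        render(root, 0, out);
        return out.toString();
    }

    private static void render(ProcessHandle node, int depth, StringBuilder out) {
        out.append("  ".repeat(depth))
           .append(node.pid())
           .append(' ')
           .append(node.info().command().orElse("<unknown>"))
           .append('\n');
        // Recurse into direct children so that indentation reflects parentage.
        node.children().forEach(child -> render(child, depth + 1, out));
    }

    public static void main(String[] args) {
        System.out.print(render(ProcessHandle.current()));
    }
}
```

The tree is only a snapshot: processes may start or exit while it is being walked, so a collector built this way should tolerate handles that are no longer alive.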
Library sources will be placed in the test directory in the top-level repository, and makefiles will be updated to build them and bundle them as a part of test bundles.
We will schedule regular testing which uses this library. When the results and test execution become stable, we will extend the use of the library to other components.
Risks and Assumptions
- Commands may hang: To minimize this risk, each command will be executed only for an allotted time and interrupted once that time has elapsed.
- Running out of disk space on a host: The plan is to archive information, restrict the amount of saved information, and check free disk space before information collection.
- Tools unavailable on a platform or host: If a tool is not available on a particular host or platform, the commands which depend on the missing tools will be skipped and a warning message will be added to the log file. Another possible solution is to download required tools from a known tools repository.
- System resource exhaustion: Some failures can exhaust different types of system resources (CPU, memory, disk space, etc.) or be caused by a lack of resources. Since it won't be possible to run commands to gather information in these situations, command execution will be skipped to prevent further system degradation.
- Getting process trees in Java: Getting the process tree in Java requires the new process API described in JEP 102. Using the JDK under test as the stable JDK (i.e., the JDK which runs the jtreg test harness) may interfere with test results. To mitigate this, we will develop an alternative process-tree implementation. That implementation will also simplify backporting this project to JDK 8.
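Several of the mitigations above (a time limit per command, a free-disk-space check, and skipping tools that are missing) can be sketched in a few lines of Java. This is a sketch under assumptions, not the library's implementation: the class name GuardedRunner, its method names, and the warning text are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.OptionalInt;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs a command with a time limit and basic safety checks. */
final class GuardedRunner {
    private GuardedRunner() {}

    /** True if the file store holding dir reports at least minBytes usable; false on any I/O error. */
    static boolean hasFreeSpace(Path dir, long minBytes) {
        try {
            return Files.getFileStore(dir).getUsableSpace() >= minBytes;
        } catch (IOException e) {
            return false;
        }
    }

    /**
     * Runs the command, writing its output to log. Returns the exit code, or
     * an empty value if the tool could not be started or exceeded the timeout.
     */
    static OptionalInt run(List<String> command, Path log, long timeout, TimeUnit unit) {
        Process process;
        try {
            process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(log.toFile())
                    .start();
        } catch (IOException e) {
            // Typically the tool is not installed: warn and skip it.
            System.err.println("WARNING: skipping " + command + ": " + e.getMessage());
            return OptionalInt.empty();
        }
        try {
            if (!process.waitFor(timeout, unit)) {
                process.destroyForcibly();     // allotted time elapsed: interrupt the command
                return OptionalInt.empty();
            }
            return OptionalInt.of(process.exitValue());
        } catch (InterruptedException e) {
            process.destroyForcibly();
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return OptionalInt.empty();
        }
    }
}
```

A caller would check hasFreeSpace on the results directory first and call run only when enough space remains, so that gathering diagnostics does not make a disk-space problem worse.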