JEP draft: Use UTF-8 as default Charset

Owner: Alan Bateman
Created: 2017/08/31 13:16
Updated: 2018/02/22 06:15
Component: core-libs / java.nio.charsets


Summary

Use UTF-8 as the Java virtual machine's default charset so that APIs that depend on the default charset behave consistently across all platforms.


Goals

The goal of this JEP is for APIs that use the default charset to behave consistently across platforms and not depend on the user's locale and configuration.


Non-Goals

It is not a goal of this JEP to define new Java SE or JDK-specific APIs, although the effort may identify opportunities where convenience methods might make existing APIs more approachable or easier to use.


Motivation

APIs that use the default charset are a hazard for developers who are new to the Java platform. They are also a bugbear for experienced developers. Consider an application that creates a java.io.FileWriter with its 1-arg constructor and uses it to write some text to a file. Writing the text encodes it into a sequence of bytes using the default charset. Another application, run on a different machine or by a different user on the same machine, creates a java.io.FileReader with its 1-arg constructor and uses it to read the text from the file. Reading the file decodes the bytes to a sequence of characters using the default charset. If the default charset is different when reading, then the resulting text may be silently corrupted or incomplete (these APIs replace erroneous input rather than fail).
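The hazard can be reproduced in a single JVM by passing explicit charsets that stand in for the default charsets of two machines. The following is a minimal sketch; the class name CharsetMismatchDemo and the method writeThenRead are hypothetical, and the choice of ISO-8859-1 for the writing side is only an assumption made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // Simulates "write on one machine, read on another": the text is encoded
    // with the writer's charset and decoded with the reader's charset.
    static String writeThenRead(String text, Charset writerCharset, Charset readerCharset) {
        byte[] onDisk = text.getBytes(writerCharset);   // what a writer would store
        return new String(onDisk, readerCharset);       // what a reader would see
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "cafe" with an acute accent on the e

        String same = writeThenRead(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String mixed = writeThenRead(text, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

        System.out.println(text.equals(same));   // true: both sides agree on the charset
        System.out.println(text.equals(mixed));  // false: the accented character is silently replaced
    }
}
```

No exception is thrown in the mismatched case, which is what makes the corruption easy to miss.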

Developers who are familiar with the hazard may choose to use methods that specify the charset (either by charset name or Charset), but the resulting code is more verbose. Furthermore, using APIs that specify the charset may inhibit the use of some Java language features (method references in particular). Sometimes developers attempt to set the default charset by means of the system property file.encoding, but this has never been a supported mechanism (and may not actually be effective, especially when changed after the Java virtual machine has been initialized).
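As an illustration of the verbosity point, the sketch below compares a method reference that implicitly uses the default charset with the lambda needed to name the charset explicitly. The class and method names are hypothetical; String.getBytes() is used only as one convenient example of an API with both variants.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

public class ExplicitCharsetDemo {

    // Concise method reference, but it binds to String.getBytes(), which
    // encodes with whatever the default charset happens to be.
    static final Function<String, byte[]> ENCODE_DEFAULT = String::getBytes;

    // Naming the charset requires a lambda because the extra argument
    // cannot be supplied through a plain method reference.
    static final Function<String, byte[]> ENCODE_UTF8 =
            s -> s.getBytes(StandardCharsets.UTF_8);

    static byte[] encodeDefault(String s) {
        return ENCODE_DEFAULT.apply(s);
    }

    static byte[] encodeUtf8(String s) {
        return ENCODE_UTF8.apply(s);
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // a single accented character
        System.out.println("default charset bytes: " + encodeDefault(text).length);
        System.out.println("UTF-8 bytes: " + encodeUtf8(text).length);
    }
}
```

If the default charset were always UTF-8, the two functions would produce the same bytes and the shorter form would be safe to use.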


Description

The default charset is currently determined when the Java virtual machine starts. On macOS it is UTF-8; on other platforms it depends on the user's locale and the default encoding. The determination of the default charset results in the creation of two JDK internal (and undocumented) system properties, file.encoding and sun.jnu.encoding.

The values of these system properties can be overridden on the command line, although doing so has never been supported.
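To see what a particular JVM has settled on, the default charset and the file.encoding and sun.jnu.encoding properties can be printed at run time. This is a small diagnostic sketch, not part of the proposal; the class name DefaultCharsetInfo is hypothetical, and because the properties are JDK internal their presence and values may vary between releases and platforms.

```java
import java.nio.charset.Charset;

public class DefaultCharsetInfo {

    // Builds a one-line summary of the charset-related settings of this JVM.
    static String describe() {
        return "default charset = " + Charset.defaultCharset().name()
                + ", file.encoding = " + System.getProperty("file.encoding")
                + ", sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding");
    }

    public static void main(String[] args) {
        // Run once normally and once with -Dfile.encoding=UTF-8 to compare.
        System.out.println(describe());
    }
}
```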

The default charset is used by several Java SE APIs, e.g. the java.io.FileReader and java.io.FileWriter constructors that do not take a charset parameter.

Note that the APIs in java.nio.file.Files do not use the default charset. The methods that read or write character streams without a Charset parameter are specified to use UTF-8 rather than the default charset. (That newer APIs use UTF-8 is arguably a hazard for applications that use a mix of old and new APIs.)
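The behavior of java.nio.file.Files can be sketched with a round trip through a temporary file. The class name FilesUtf8Demo and the method roundTrip are hypothetical; Files.write and Files.readAllLines are the existing overloads that take no Charset parameter.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class FilesUtf8Demo {

    // Writes the lines to a temporary file and reads them back using the
    // Files methods that take no Charset and are specified to use UTF-8.
    static List<String> roundTrip(List<String> lines) {
        try {
            Path tmp = Files.createTempFile("utf8-demo", ".txt");
            try {
                Files.write(tmp, lines);         // encodes as UTF-8
                return Files.readAllLines(tmp);  // decodes as UTF-8
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("caf\u00e9", "\u65e5\u672c\u8a9e");
        // The result does not depend on the default charset of this JVM.
        System.out.println(lines.equals(roundTrip(lines)));
    }
}
```

Replacing either call with a java.io.FileWriter or java.io.FileReader created with its 1-arg constructor would make the result depend on the default charset again, which is the mixed old and new API hazard mentioned above.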

The specification of the Charset.defaultCharset() API will be changed to specify that the default charset is UTF-8 unless configured otherwise by implementation-specific means. All APIs that use the default charset, including those listed above, will link to Charset.defaultCharset() if they don't already do so.

To mitigate the compatibility impact, the file.encoding property will be documented (in an implementation note) so that it can be set on the command line to the value "SYSTEM" (i.e. -Dfile.encoding=SYSTEM). When started with this value, the default charset will be determined based on the locale and default encoding, as in the long-standing behavior.

In addition, the file.encoding property will also be documented to allow it to be set on the command line with the value "UTF-8", essentially a no-op.

The system property sun.jnu.encoding and its value will be unchanged. It will remain undocumented.


Testing

Significant testing will be required to understand the extent of the compatibility impact. Testing from developers or organizations with geographically diverse user populations will be needed.

Developers can check for issues with existing JDK releases by running with -Dfile.encoding=UTF-8 in advance of any early access or JDK release with the change.

Some existing unit/regression tests may need to be updated.


Risks and Assumptions

There is no risk in environments where the default charset is already UTF-8, such as macOS.

In other environments, the risk of changing the default charset to UTF-8 after 20+ years may be significant. We expect the main impact will be to users of Microsoft Windows in Asian locales and maybe some server environments in Asian/other locales.