JEP 400: UTF-8 by Default

Author: Alan Bateman
Owner: Naoto Sato
Type: Feature
Scope: SE
Status: Candidate
Component: core-libs / java.nio.charsets
Discussion: core dash libs dash dev at openjdk dot java dot net
Effort: XS
Duration: XS
Reviewed by: Alex Buckley, Brian Goetz
Created: 2017/08/31 13:16
Updated: 2021/03/30 04:33
Issue: 8187041

Summary

Specify UTF-8 as the default charset for the Java SE APIs, so that APIs which depend on the default charset behave consistently across all JDK implementations and independently of the user’s operating system, locale, and configuration.

Non-Goals

It is not a goal to define new Java SE or JDK-specific APIs, although this effort may identify opportunities where new convenience methods might make existing APIs more approachable or easier to use.

Motivation

Several Java SE APIs allow a charset to be specified when reading and writing files and processing text. Supported charsets include US-ASCII, UTF-8, and ISO-8859-1. However, developers often overlook the choice of charset, so APIs are usually capable of functioning without one being specified. Typically, APIs will use the default charset in this case. The JDK chooses a charset to serve as the default charset, based on the operating system, locale, and other factors known at startup.

Since the default charset is not the same everywhere, APIs that use the default charset pose many non-obvious hazards, even to experienced developers.

Consider an application that creates a java.io.FileWriter with its one-argument constructor and then uses it to write some text to a file. The resulting file will contain a sequence of bytes encoded using the default charset of the JDK running the application. A second application, run on a different machine or by a different user on the same machine, creates a java.io.FileReader with its one-argument constructor and uses it to read the bytes in that file. The resulting text contains a sequence of characters decoded using the default charset of the JDK running the second application. If the default charset differs between the JDK of the first application and the JDK of the second application, then the resulting text may be silently corrupted or incomplete, since these APIs replace erroneous input rather than fail.
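The hazard described above can be reproduced in memory. The following is a minimal sketch, not taken from the JEP: it assumes a writer whose default charset is ISO-8859-1 and a reader whose default charset is UTF-8, and it simulates the two with String.getBytes and the String constructor rather than with FileWriter and FileReader. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {

    // Encodes text with one charset and decodes the bytes with another,
    // standing in for a writer and a reader whose default charsets differ.
    // (Hypothetical helper; the JEP's scenario uses FileWriter and FileReader.)
    static String roundTrip(String text, Charset writerCharset, Charset readerCharset) {
        byte[] bytes = text.getBytes(writerCharset);
        // This constructor replaces malformed input instead of throwing.
        return new String(bytes, readerCharset);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "cafe" with an acute accent on the final letter

        String same = roundTrip(original, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1);
        String mixed = roundTrip(original, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

        System.out.println(original.equals(same));  // prints true
        System.out.println(original.equals(mixed)); // prints false
    }
}
```

No exception is thrown in the mismatched case; the text simply comes back different, which is why this kind of corruption tends to go unnoticed.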

Developers familiar with such hazards can use methods that take a charset argument explicitly. However, having to pass an argument prevents the methods from being used via method references (::) in Java 8-style streams.
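As an illustration of the trade-off, the sketch below uses String.getBytes as a stand-in for any method that has both a no-argument and a charset-taking overload. It is an example chosen for brevity, not one given in the JEP, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

class MethodReferenceDemo {

    // Concise method reference, but it resolves to getBytes(), which
    // encodes with the default charset of the running JDK.
    static List<byte[]> encodeWithDefault(List<String> lines) {
        return lines.stream()
                .map(String::getBytes)
                .collect(Collectors.toList());
    }

    // An explicit charset cannot be supplied through a method reference
    // here, so a lambda is needed instead.
    static List<byte[]> encodeWithUtf8(List<String> lines) {
        return lines.stream()
                .map(s -> s.getBytes(StandardCharsets.UTF_8))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = List.of("caf\u00e9");
        System.out.println(encodeWithDefault(lines).get(0).length); // depends on the default charset
        System.out.println(encodeWithUtf8(lines).get(0).length);    // prints 4
    }
}
```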

Sometimes developers attempt to set the default charset via the system property file.encoding, but this has never been supported and may not actually work, especially if the property is modified after the Java virtual machine has been initialized.

Not all Java SE APIs defer to the JDK's choice of default charset. For example, the methods in java.nio.file.Files that read or write files without a Charset argument are specified to always use UTF-8. The fact that newer APIs default to using UTF-8 while older APIs default to using the default charset is a hazard for applications that use a mix of APIs.
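A small sketch of the newer behavior, assuming Java 11 or later for Files.writeString: the bytes written without a Charset argument match the UTF-8 encoding of the text, whatever the default charset happens to be. The class name, method name, and temporary file prefix are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class FilesUtf8Demo {

    // Writes text with Files.writeString (no Charset argument) and returns
    // the raw bytes that ended up in the file.
    static byte[] bytesWrittenWithoutCharset(String text) {
        try {
            Path tmp = Files.createTempFile("charset-demo", ".txt");
            try {
                Files.writeString(tmp, text); // specified to use UTF-8
                return Files.readAllBytes(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9";
        byte[] written = bytesWrittenWithoutCharset(text);
        System.out.println(Arrays.equals(written, text.getBytes(StandardCharsets.UTF_8))); // prints true
    }
}
```

By contrast, a one-argument FileWriter writing the same text would produce bytes that depend on the default charset, which is the mix-of-APIs hazard noted above.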

The entire Java ecosystem would benefit if the default charset were specified to be the same everywhere: applications that are not concerned with portability will see little impact, while applications that embrace portability by specifying charsets will see no impact. Since UTF-8 is standard for the XML and JSON files processed by vast numbers of Java programs, and since Java's own APIs increasingly favor UTF-8, e.g., in the NIO API and for properties files, it makes sense to specify UTF-8 as the default charset.

Description

The default charset is currently determined when the Java virtual machine starts. On macOS, it is UTF-8 except in the POSIX C locale; on other platforms, it depends upon the user's locale and the default encoding. The method java.nio.charset.Charset.defaultCharset() reports which charset was chosen as the default. Several Java SE APIs use the default charset, including:

We propose to change the specification of Charset.defaultCharset() to say that the default charset is UTF-8 unless configured otherwise by an implementation-specific means. This explicit support for non-standard configuration means that Java programs may detect something other than UTF-8 as the default charset. The UTF-8 charset is specified by RFC 2279; the transformation format upon which it is based is specified in Amendment 2 of ISO 10646-1 and is also described in the Unicode Standard. It is not to be confused with "Modified UTF-8".
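Because the default may still be configured to something other than UTF-8, a program that cares can check at run time. The following is a minimal sketch; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetCheck {

    // Returns true when the running JDK's default charset is UTF-8.
    static boolean defaultIsUtf8() {
        return StandardCharsets.UTF_8.equals(Charset.defaultCharset());
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + Charset.defaultCharset().name());
        if (!defaultIsUtf8()) {
            System.out.println("Default charset is not UTF-8; pass charsets explicitly for portable I/O.");
        }
    }
}
```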

We will update the specifications of all Java SE APIs that use the default charset, including those listed above, to cross-reference Charset.defaultCharset(). The choice of UTF-8 applies only to Java SE APIs and not to the Java language, which will continue to use UTF-16.

There are four system properties related to the default charset:

The values of these system properties can be set on the command line, although doing so has never been supported and often has no effect. To mitigate the compatibility impact of this JEP, we will revise the treatment of the system property file.encoding so that setting it on the command line is a supported means of configuring the default charset (as envisaged by the specification of Charset.defaultCharset()). This will be documented by an implementation note in System.getProperties, as follows:

The other system properties (sun.stdout.encoding, sun.stderr.encoding, sun.jnu.encoding) will remain unspecified and unsupported.
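For diagnosis, the values of these properties can be inspected at run time. The sketch below only reads them. It assumes nothing about which values a given JDK sets, so any of them may be null. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class EncodingProperties {

    // Property names as discussed in the text above.
    private static final String[] NAMES = {
        "file.encoding",
        "sun.stdout.encoding",
        "sun.stderr.encoding",
        "sun.jnu.encoding"
    };

    // Collects the current value of each property; a value is null when
    // the running JDK does not define that property.
    static Map<String, String> snapshot() {
        Map<String, String> values = new LinkedHashMap<>();
        for (String name : NAMES) {
            values.put(name, System.getProperty(name));
        }
        return values;
    }

    public static void main(String[] args) {
        snapshot().forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```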

Testing

Alternatives

Risks and Assumptions

The risk of specifying UTF-8 as the default charset is that applications may behave incorrectly when processing data that was produced while the default charset was unspecified. However, this risk is not wholly new: applications that are inattentive to charsets (for example, by not passing an explicit charset to APIs) have always run the risk of incorrect behavior, data corruption, or both.

Fortunately, applications in many environments can expect very low risk from Java's choice of UTF-8:

In other environments, the risk of changing the default charset to UTF-8 after more than twenty years may be significant. We expect the main impact will be to users of Windows in Asian locales, and possibly some server environments in Asian and other locales. Possible scenarios include: