JEP 400: UTF-8 by Default

AuthorsAlan Bateman, Naoto Sato
OwnerNaoto Sato
TypeFeature
ScopeSE
StatusIntegrated
Release18
Componentcore-libs / java.nio.charsets
Discussioncore dash libs dash dev at openjdk dot java dot net
EffortXS
DurationXS
Reviewed byAlex Buckley, Brian Goetz
Endorsed byBrian Goetz
Created2017/08/31 13:16
Updated2021/10/05 15:23
Issue8187041

Summary

Specify UTF-8 as the default charset of the standard Java APIs. With this change, APIs that depend upon the default charset will behave consistently across all implementations, operating systems, locales, and configurations.

Goals

  1. Make Java programs more predictable and portable when their code relies on the default charset.
  2. Clarify where the standard Java API uses the default charset.
  3. Standardize on UTF-8 throughout the standard Java APIs, except for console I/O.

Non-Goals

  1. It is not a goal to define new standard Java APIs or supported JDK APIs, although this effort may identify opportunities where new convenience methods might make existing APIs more approachable or easier to use.
  2. There is no intent to deprecate or remove standard Java APIs that rely on the default charset rather than taking an explicit charset parameter.

Motivation

Standard Java APIs for reading and writing files and for processing text allow a charset to be passed as an argument. A charset governs the conversion between raw bytes and the 16-bit char values of the Java programming language. Supported charsets include, for example, US-ASCII, UTF-8, and ISO-8859-1.

If a charset argument is not passed, then standard Java APIs typically use the default charset. The JDK chooses the default charset at startup based upon the run-time environment: the operating system, the user's locale, and other factors.

Because the default charset is not the same everywhere, APIs that use the default charset pose many non-obvious hazards, even to experienced developers.

Consider an application that creates a java.io.FileWriter without passing a charset, and then uses it to write some text to a file. The resulting file will contain a sequence of bytes encoded using the default charset of the JDK running the application. A second application, run on a different machine or by a different user on the same machine, creates a java.io.FileReader without passing a charset and uses it to read the bytes in that file. The resulting text contains a sequence of characters decoded using the default charset of the JDK running the second application. If the default charset differs between the JDK of the first application and the JDK of the second application then the resulting text may be silently corrupted or incomplete, since the FileReader cannot tell that it decoded the text using the wrong charset relative to the FileWriter. Here is an example of this hazard, where a Japanese text file encoded in UTF-8 on macOS is corrupted when read on Windows in US-English or Japanese locales:

java.io.FileReader(“hello.txt”) -> “こんにちは” (macOS)
java.io.FileReader(“hello.txt”) -> “ã?“ã‚“ã?«ã?¡ã? ” (Windows (en-US))
java.io.FileReader(“hello.txt”) -> “縺ォ縺。縺ッ” (Windows (ja-JP)

Developers familiar with such hazards can use methods and constructors that take a charset argument explicitly. However, having to pass an argument prevents methods and constructors from being used via method references (::) in stream pipelines.

Developers sometimes attempt to configure the default charset by setting the system property file.encoding on the command line (i.e., java -Dfile.encoding=...), but this has never been supported. Furthermore, attempting to set the property programmatically (i.e., System.setProperty(...)) after the Java runtime has started does not work.

Not all standard Java APIs defer to the JDK's choice of default charset. For example, the methods in java.nio.file.Files that read or write files without a Charset argument are specified to always use UTF-8. The fact that newer APIs default to using UTF-8 while older APIs default to using the default charset is a hazard for applications that use a mix of APIs.

The entire Java ecosystem would benefit if the default charset were specified to be the same everywhere. Applications that are not concerned with portability will see little impact, while applications that embrace portability by passing charset arguments will see no impact. UTF-8 has long been the most common charset on the World Wide Web. UTF-8 is standard for the XML and JSON files processed by vast numbers of Java programs, and Java's own APIs increasingly favor UTF-8 in, e.g., the NIO API and for property files. It therefore makes sense to specify UTF-8 as the default charset for all Java APIs.

We recognize that this change could have a widespread compatibility impact on programs that migrate to JDK 18. For this reason, it will always be possible to recover the pre-JDK 18 behavior, where the default charset is environment-dependent.

Description

In JDK 17 and earlier, the default charset is determined when the Java runtime starts. On macOS, it is UTF-8 except in the POSIX C locale. On other operating systems, it depends upon the user's locale and the default encoding, e.g., on Windows, it is a codepage-based charset such as windows-1252 or windows-31j. The method java.nio.charsets.Charset.defaultCharset() returns the default charset. A quick way to see the default charset of the current JDK is with the following command:

java -XshowSettings:properties -version 2>&1 | grep file.encoding

Several standard Java APIs use the default charset, including:

We propose to change the specification of Charset.defaultCharset() to say that the default charset is UTF-8 unless configured otherwise by an implementation-specific means. (See below for how to configure the JDK.) The UTF-8 charset is specified by RFC 2279; the transformation format upon which it is based is specified in Amendment 2 of ISO 10646-1 and is also described in the Unicode Standard. It is not to be confused with Modified UTF-8.

We will update the specifications of all standard Java APIs that use the default charset to cross-reference Charset.defaultCharset(). Those APIs include the ones listed above, but not System.out and System.err, whose charset will be as specified by Console.charset().

The file.encoding and native.encoding system properties

As envisaged by the specification of Charset.defaultCharset(), the JDK will allow the default charset to be configured to something other than UTF-8. We will revise the treatment of the system property file.encoding so that setting it on the command line is the supported means of configuring the default charset. We will specify this in an implementation note of System.getProperties() as follows:

Prior to deploying on a JDK where UTF-8 is the default charset, developers are strongly encouraged to check for charset issues by starting the Java runtime with java -Dfile.encoding=UTF-8 ... on their current JDK (8-17).

JDK 17 introduced the native.encoding system property as a standard way for programs to obtain the charset chosen by the JDK's algorithm, regardless of whether the default charset is actually configured to be that charset. In JDK 18, if file.encoding is set to COMPAT on the command line, then the run-time value of file.encoding will be the same as the run-time value of native.encoding; if file.encoding is set to UTF-8 on the command line, then the run-time value of file.encoding may differ from the run-time value of native.encoding.

In Risks and Assumptions below, we discuss how to mitigate the possible incompatibilities that arise from this change to file.encoding, as well as the native.encoding system property and recommendations for applications.

There are three charset-related system properties used internally by the JDK. They remain unspecified and unsupported, but are documented here for completeness:

Source file encoding

The Java language allows source code to express Unicode characters in a UTF-16 encoding, and this is unaffected by the choice of UTF-8 for the default charset. However, the javac compiler is affected because it assumes that .java source files are encoded with the default charset, unless configured otherwise by the -encoding option. If source files were saved with a non-UTF-8 encoding and compiled with an earlier JDK, then recompiling on JDK 18 or later may cause problems. For example, if a non-UTF-8 source file has string literals that contain non-ASCII characters, then those literals may be misinterpreted by javac in JDK 18 or later unless -encoding is used.

Prior to compiling on a JDK where UTF-8 is the default charset, developers are strongly encouraged to check for charset issues by compiling with javac -encoding UTF-8 ... on their current JDK (8-17). Alternatively, developers who prefer to save source files with a non-UTF-8 encoding can prevent javac from assuming UTF-8 by setting the -encoding option to the value of the native.encoding system property on JDK 17 and later.

The legacy default charset

In JDK 17 and earlier, the name default is recognized as an alias for the US-ASCII charset. That is, Charset.forName("default") produces the same result as Charset.forName("US-ASCII"). The default alias was introduced in JDK 1.5 to ensure that legacy code which used sun.io converters could migrate to the java.nio.charset framework introduced in JDK 1.4.

It would be extremely confusing for JDK 18 to preserve default as an alias for US-ASCII when the default charset is specified to be UTF-8. It would also be confusing for default to mean US-ASCII when the user configures the default charset to its pre-JDK 18 value by setting -Dfile.encoding=COMPAT on the command line. Redefining default to be an alias not for US-ASCII but rather for the default charset (whether UTF-8 or user-configured) would cause subtle behavioral changes in the (few) programs that call Charset.forName("default").

We believe that continuing to recognize default in JDK 18 would be prolonging a poor decision. It is not defined by the Java SE Platform, nor is it recognized by IANA as the name or alias of any character set. In fact, for ASCII-based network protocols, IANA encourages use of the canonical name US-ASCII rather than just ASCII or obscure aliases such as ANSI_X3.4-1968 -- plainly, use of the JDK-specific alias default goes counter to that advice. Java programs can use the enum constant StandardCharsets.US_ASCII to make their intent clear, rather than passing a string to Charset.forName(...).

Accordingly, in JDK 18, Charset.forName("default") will throw an UnsupportedCharsetException. This will give developers a chance to detect use of the idiom and migrate to either US-ASCII or to the result of Charset.defaultCharset().

Testing

Risks and Assumptions

We assume that applications in many environments will see no impact from Java's choice of UTF-8:

In other environments, the risk of changing the default charset to UTF-8 after more than 20 years may be significant. The most obvious risk is that applications which implicitly depend on the default charset (e.g., by not passing an explicit charset argument to APIs) will behave incorrectly when processing data produced when the default charset was unspecified. A further risk is that data corruption may silently occur. We expect the main impact will be to users of Windows in Asian locales, and possibly some server environments in Asian and other locales. Possible scenarios include:

Where application code can be changed, then we recommend it is changed to pass a charset argument to constructors. If an application has no particular preference among charsets, and is satisfied with the traditional environment-driven selection for the default charset, then the following code can be used on all Java releases to obtain the charset determined from the environment:

String encoding = System.getProperty("native.encoding");  // Populated on Java 18 and later
Charset cs = (encoding != null) ? Charset.forName(encoding) : Charset.defaultCharset();
var reader = new FileReader("file.txt", cs);

If neither application code nor Java startup can be changed, then it will be necessary to inspect the application code to determine manually whether it will run compatibly on JDK 18.

Alternatives