JEP 326: Raw String Literals

Owner: Jim Laskey
Type: Feature
Scope: SE
Status: Candidate
Component: specification / language
Discussion: amber dash dev at openjdk dot java dot net
Effort: M
Duration: M
Reviewed by: Alex Buckley
Endorsed by: Brian Goetz
Created: 2018/01/23 15:40
Updated: 2018/07/11 11:16
Issue: 8196004

Summary

Add raw string literals to the Java programming language. A raw string literal can span multiple lines of source code and does not interpret escape sequences, such as \n, or Unicode escapes, of the form \uXXXX.

Goals

Non-Goals

Motivation

Escape sequences have been defined in many programming languages, including Java, to represent characters that cannot be easily represented directly. As an example, the escape sequence \n represents the ASCII newline control character. To print "hello" and "world" on separate lines, the string "hello\nworld\n" can be used:

System.out.print("hello\nworld\n");

Output:

hello
world

Besides suffering from readability issues, this example hard-codes the Unix newline convention, whereas other operating systems use alternate newline representations, such as \r\n (Windows). In Java, we use a higher-level method such as println to provide the platform-appropriate newline sequence:

System.out.println("hello");
System.out.println("world");

If "hello" and "world" are being displayed using a GUI library, control characters may not have any significance at all.

The escape sequence indicator, backslash, is represented in Java string literals as \\. This doubling up of backslashes leads to the Leaning Toothpick Syndrome, where strings become difficult to interpret because of excessive backslashes. Java developers are familiar with examples such as:

Path path = Paths.get("C:\\Program Files\\foo");

Escape sequences, such as \" to represent the double-quote character, also lead to interpretation issues when used in non-Java grammars. For example, searching for a double-quote within a string requires:

Pattern pattern = Pattern.compile("\\\"");

The reality is that escape sequences are often the exception and not the rule in everyday Java development. We use control characters less, and the presence of escapes adversely affects the readability and maintainability of our code. Once we come to this realization, the notion of a non-interpreted string literal becomes a well-reasoned result.

Real-world Java code, which frequently embeds fragments of other programs (SQL, JSON, XML, regex, etc) in Java programs, needs a mechanism for capturing literal strings as-is, without special handling of Unicode escaping, backslash, or new lines.

This JEP proposes a new kind of literal, a raw string literal, which sets aside both Java escapes and Java line terminator specifications, to provide character sequences that under many circumstances are more readable and maintainable than the existing traditional string literal.

File Paths Example

Traditional String Literals:

Runtime.getRuntime().exec("\"C:\\Program Files\\foo\" bar");

Raw String Literals:

Runtime.getRuntime().exec(`"C:\Program Files\foo" bar`);

Multi-line Example

Traditional String Literals:

String html = "<html>\n" +
              "    <body>\n" +
              "        <p>Hello World.</p>\n" +
              "    </body>\n" +
              "</html>\n";

Raw String Literals:

String html = `<html>
                   <body>
                       <p>Hello World.</p>
                   </body>
               </html>
              `;

Regular Expression Example

Traditional String Literals:

System.out.println("this".matches("\\w\\w\\w\\w"));

Raw String Literals:

System.out.println("this".matches(`\w\w\w\w`));

Output:

true

Polyglot Example

Traditional String Literals:

String script = "function hello() {\n" +
                "   print(\'\"Hello World\"\');\n" +
                "}\n" +
                "\n" +
                "hello();\n";
ScriptEngine engine = new ScriptEngineManager().getEngineByName("js");
Object obj = engine.eval(script);

Raw String Literals:

String script = `function hello() {
                    print('"Hello World"');
                 }

                 hello();
                `;
ScriptEngine engine = new ScriptEngineManager().getEngineByName("js");
Object obj = engine.eval(script);

Output:

"Hello World"

Database Example

Traditional String Literals:

String query = "SELECT `EMP_ID`, `LAST_NAME` FROM `EMPLOYEE_TB`\n" +
               "WHERE `CITY` = 'INDIANAPOLIS'\n" +
               "ORDER BY `EMP_ID`, `LAST_NAME`;\n";

Raw String Literals:

String query = ``
                 SELECT `EMP_ID`, `LAST_NAME` FROM `EMPLOYEE_TB`
                 WHERE `CITY` = 'INDIANAPOLIS'
                 ORDER BY `EMP_ID`, `LAST_NAME`;
               ``;

Description

A raw string literal is a new form of literal.

Literal:
  IntegerLiteral
  FloatingPointLiteral
  BooleanLiteral
  CharacterLiteral
  StringLiteral
  RawStringLiteral
  NullLiteral

RawStringLiteral:
  RawStringDelimiter RawInputCharacter {RawInputCharacter} RawStringDelimiter

RawStringDelimiter:
    ` {`}

A raw string literal consists of one or more characters enclosed in sequences of backticks ` (\u0060) (backquote, accent grave). A raw string literal will open with a sequence of one or more backticks. The raw string literal will close when an equal number of backticks is encountered. Any other sequence of backticks is treated as part of the string body.

Embedding of backticks in a raw string literal can be accomplished by increasing or decreasing the number of backticks in the open/close sequences to mismatch any embedded sequences.

Characters in a raw string literal are never interpreted, with the exception of CR and CRLF, which are platform-specific line terminators. CR (\u000D) and CRLF (\u000D\u000A) sequences are always translated to LF (\u000A). This translation provides the least-surprising behavior across platforms.
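The line-terminator translation described above can be sketched with existing String methods. This is only an illustration of the rule, written against traditional string literals; the class and method names are hypothetical and are not part of this proposal.

```java
// Hypothetical helper illustrating the CR/CRLF-to-LF rule; not part of the JEP.
final class LineTerminators {

    /** Translates every CRLF pair and every lone CR to a single LF. */
    static String normalize(String body) {
        // Replace CRLF first so that its CR is not translated on its own.
        return body.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String mixed = "one\r\ntwo\rthree\nfour";
        System.out.print(normalize(mixed).replace("\n", "<LF>"));
    }
}
```

Handling CRLF before CR matters: translating CR first would turn each CRLF into two LF characters.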

It is a compile-time error to have an open backtick sequence and no corresponding close backtick sequence before the end of the compilation unit.
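The delimiter rule can be illustrated with a small scanner over ordinary strings. The sketch below is an assumption about how such a scan might look, not the javac implementation; RawStringScanner, backtickRun and scanBody are hypothetical names, and an exception stands in for the compile-time error.

```java
// Hypothetical sketch of the backtick-delimiter rule; not the javac lexer.
final class RawStringScanner {

    /** Returns the length of the run of backticks starting at index i. */
    static int backtickRun(String src, int i) {
        int n = 0;
        while (i + n < src.length() && src.charAt(i + n) == '`') {
            n++;
        }
        return n;
    }

    /**
     * Scans a raw string literal that opens at index start and returns its body.
     * Throws if no backtick run of the same length as the opening run follows.
     */
    static String scanBody(String src, int start) {
        int open = backtickRun(src, start);
        if (open == 0) {
            throw new IllegalArgumentException("no opening backtick at index " + start);
        }
        int bodyStart = start + open;
        int i = bodyStart;
        while (i < src.length()) {
            int run = backtickRun(src, i);
            if (run == open) {
                return src.substring(bodyStart, i);  // matching close delimiter
            }
            i += Math.max(run, 1);                   // any other run belongs to the body
        }
        throw new IllegalArgumentException("unterminated raw string literal");
    }

    public static void main(String[] args) {
        System.out.println(scanBody("``can`t``", 0)); // prints can`t
    }
}
```

Because a whole run of backticks is measured at once, a run that is longer or shorter than the opening sequence is skipped as body text, which is what allows embedded backticks.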

The Java Language Specification stipulates there are two kinds of escapes used in traditional string literals: Unicode escapes and escape sequences. Raw string literals never interpret escapes. That is, the individual characters that make up the escape are used as-is.

Unicode escapes, in the form \uXXXX, are processed as part of character input prior to interpretation by the lexer. To support the raw string literal as-is requirement, Unicode escape processing is disabled when the lexer encounters an opening backtick and re-enabled when it encounters a closing backtick. For consistency, the Unicode escape \u0060 may not be used as a substitute for the opening backtick.

The following are examples of raw string literals:

`"`                // a string containing " alone
``can`t``          // a string containing 'c', 'a', 'n', '`' and 't'
`This is a string` // a string containing 16 characters
`\n`               // a string containing '\' and 'n'
`\u2022`           // a string containing '\', 'u', '2', '0', '2' and '2'
`This is a
two-line string`   // a single string constant

In a class file, a string constant does not record whether it was derived from a raw string literal or a traditional string literal.

Like a traditional string literal, a raw string literal is always of type java.lang.String. Strings derived from raw string literals are treated in the same manner as strings derived from traditional string literals.

Escapes

A developer may well want a string that spans multiple lines but still has its escape sequences interpreted. To meet this requirement, instance methods will be added to the String class to support the run-time interpretation of escape sequences. Primarily,

public String unescape()

will translate each character sequence beginning with \ that has the same spelling as a sequence defined in the JLS (3.3 Unicode Escapes, 3.10.6. Escape Sequences for Character and String Literals) to the character represented by that sequence.

Examples (b0 through b3 are true):

boolean b0 = `\n`.equals("\\n");
boolean b1 = `\n`.unescape().equals("\n");
boolean b2 = `\n`.length() == 2;
boolean b3 = `\n`.unescape().length() == 1;
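To make the intended translation concrete, here is a minimal sketch of an unescape routine written as a static helper over traditional strings. It is an assumption about the behavior, not the proposed implementation: it covers only the single-character escapes and four-digit Unicode escapes, and it omits octal escapes and repeated 'u' markers. The class name UnescapeSketch is hypothetical.

```java
// Hypothetical sketch of unescape(); covers a subset of the JLS escapes only.
final class UnescapeSketch {

    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i++);
            if (c != '\\' || i == s.length()) {
                out.append(c);  // ordinary character, or a trailing lone backslash
                continue;
            }
            char e = s.charAt(i++);
            switch (e) {
                case 'b':  out.append('\b'); break;
                case 't':  out.append('\t'); break;
                case 'n':  out.append('\n'); break;
                case 'f':  out.append('\f'); break;
                case 'r':  out.append('\r'); break;
                case '"':  out.append('"');  break;
                case '\'': out.append('\''); break;
                case '\\': out.append('\\'); break;
                case 'u':
                    // Unicode escape: backslash, 'u', then exactly four hex digits.
                    if (i + 4 > s.length()) {
                        throw new IllegalArgumentException("truncated Unicode escape");
                    }
                    out.append((char) Integer.parseInt(s.substring(i, i + 4), 16));
                    i += 4;
                    break;
                default:
                    throw new IllegalArgumentException("unsupported escape: \\" + e);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("\\n").length()); // prints 1
    }
}
```

A doubled backslash is consumed as a unit, so the 'u' that follows it is left as an ordinary character, mirroring how the JLS treats a backslash that is itself escaped.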

Other methods will provide finer control over which escapes are translated.

There will also be a provision for tools to invert escapes. The following method will also be added to the String class:

public String escape()

which will convert all characters less than ' ' into Unicode or character escape sequences, characters above '~' to Unicode escape sequences, and the characters ", ', \ to escape sequences.

Examples (b0 through b3 are true):

boolean b0 = "\n".escape().equals(`\n`);
boolean b1 = `•`.escape().equals(`\u2022`);
boolean b2 = "•".escape().equals(`\u2022`);
boolean b3 = !"•".escape().equals("\u2022");
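The inverse translation can be sketched in the same way. The helper below is an illustrative assumption rather than the proposed implementation: it prefers the short character escapes where they exist, falls back to four-digit lowercase Unicode escapes for other characters below ' ' or above '~', and works on UTF-16 code units. The class name EscapeSketch is hypothetical.

```java
// Hypothetical sketch of escape(); not the proposed JDK implementation.
final class EscapeSketch {

    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\b': out.append("\\b");  break;
                case '\t': out.append("\\t");  break;
                case '\n': out.append("\\n");  break;
                case '\f': out.append("\\f");  break;
                case '\r': out.append("\\r");  break;
                case '"':  out.append("\\\""); break;
                case '\'': out.append("\\'");  break;
                case '\\': out.append("\\\\"); break;
                default:
                    if (c < ' ' || c > '~') {
                        // Anything outside printable ASCII becomes a Unicode escape.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a\tb")); // prints a\tb
    }
}
```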

Source Encoding

If a source file contains non-ASCII characters, ensure use of the correct encoding on the javac command line (see javac -encoding). Alternatively, supply the appropriate Unicode escapes in the raw string and then use one of the provided library routines described above to translate Unicode escapes to the desired non-ASCII characters.

Margin Management

One of the issues with multi-line strings is whether to format the string against the left margin (as in heredoc) or, ideally, blend with the indentation used by surrounding code. The question then becomes, how to manage this incidental indentation.

For example, some developers may choose to code as

String s = `
this is my
    embedded string
`;

while other developers may not like the outdenting style and choose to embed relative to the indentation of the code

String html = `
                       this is my
                           embedded string
                      `;

In the latter case, the developer probably intends that the line "this is my" be left-justified and the line "embedded string" be indented by four spaces relative to it. We surely want to support this, but we are reluctant to try to read the developer's mind and assume that this white space is incidental.

To allow for contrasting coding styles while providing a flexible and enduring solution, raw string literals are scanned with the incidental indentation intact; i.e., raw. A consequence of this design is that a developer who chooses the former style above needs no further processing. Otherwise, the developer will have access to easy-to-use library support for a variety of alternate coding styles. This will permit coding styles to change without affecting the JLS.

We believe the most common case will be the latter case above. For that reason, we will provide the following String instance method:

public String align()

which, after removing all leading and trailing blank lines, left-justifies each line without loss of relative indentation, thus stripping away all incidental indentation and line spacing.

Example:

String html = `
                       <html>
                           <body>
                               <p>Hello World.</p>
                           </body>
                       </html>
                  `.align();
System.out.print(html);

Output:

<html>
    <body>
        <p>Hello World.</p>
    </body>
</html>
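One way to read the align() description is sketched below as a static helper over traditional strings. The details are assumptions made for illustration: the margin is taken to be the smallest indentation among non-blank lines, interior blank lines are kept as empty lines, and every resulting line is terminated with LF. AlignSketch and indentOf are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the align() behavior; not the proposed JDK implementation.
final class AlignSketch {

    /** Counts the leading white space characters of a line. */
    static int indentOf(String line) {
        int n = 0;
        while (n < line.length() && Character.isWhitespace(line.charAt(n))) {
            n++;
        }
        return n;
    }

    static String align(String s) {
        List<String> lines = new ArrayList<>(Arrays.asList(s.split("\n", -1)));
        // Drop leading and trailing blank lines.
        while (!lines.isEmpty() && lines.get(0).trim().isEmpty()) {
            lines.remove(0);
        }
        while (!lines.isEmpty() && lines.get(lines.size() - 1).trim().isEmpty()) {
            lines.remove(lines.size() - 1);
        }
        // The incidental margin is the smallest indentation of any non-blank line.
        int margin = Integer.MAX_VALUE;
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                margin = Math.min(margin, indentOf(line));
            }
        }
        StringBuilder out = new StringBuilder();
        for (String line : lines) {
            // Interior blank lines are kept as empty lines.
            out.append(line.trim().isEmpty() ? "" : line.substring(margin)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "\n"
                + "        <html>\n"
                + "            <body>\n"
                + "            </body>\n"
                + "        </html>\n"
                + "    ";
        System.out.print(align(html));
    }
}
```

Removing the same margin from every line is what preserves relative indentation: only the white space shared by all non-blank lines is treated as incidental.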

Further, generalized control of indentation will be provided with the following String instance method:

public String indent(int n)

where n specifies the number of white spaces to add or remove from each line of the string; a positive n adds n spaces (U+0020) and negative n removes n white spaces.

Example:

String html = `
                       <html>
                           <body>
                               <p>Hello World.</p>
                           </body>
                       </html>
                  `.align().indent(4);
System.out.print(html);

Output:

    <html>
        <body>
            <p>Hello World.</p>
        </body>
    </html>
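The add-or-remove rule for indent(int n) can be sketched as follows. This is an assumed reading, not the proposed implementation: a negative n removes at most |n| leading white space characters from each line, lines are split on LF, and each output line is terminated with LF. The class name IndentSketch is hypothetical.

```java
// Hypothetical sketch of indent(int n); not the proposed JDK implementation.
final class IndentSketch {

    static String indent(String s, int n) {
        if (s.isEmpty()) {
            return "";
        }
        // Padding is only non-empty when n is positive.
        StringBuilder pad = new StringBuilder();
        for (int i = 0; i < n; i++) {
            pad.append(' ');
        }
        StringBuilder out = new StringBuilder();
        // Note: split drops trailing empty lines.
        for (String line : s.split("\n")) {
            // For negative n, skip up to -n leading white space characters.
            int remove = 0;
            while (remove < -n && remove < line.length()
                    && Character.isWhitespace(line.charAt(remove))) {
                remove++;
            }
            out.append(pad).append(line, remove, line.length()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(indent("<html>\n    <body>\n</html>\n", 4));
    }
}
```

Removal stops at the first non-white-space character, so a line with less indentation than requested is simply left-justified rather than truncated.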

In the cases where align() is not what the developer wants, we expect the preponderance of cases to be align().indent(n). Therefore, an additional variation of align will be provided:

public String align(int n)

where n is the indentation applied to the string after alignment.

Example:

String html = `
                       <html>
                           <body>
                               <p>Hello World.</p>
                           </body>
                       </html>
                  `.align(4);
System.out.print(html);

Output:

    <html>
        <body>
            <p>Hello World.</p>
        </body>
    </html>

Customizable margin management will be provided by the String instance method:

<R> R transform(Function<String, R> f)

where the supplied function f is called with this string as the argument.

Example:

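import java.util.stream.Collectors;

public class MyClass {
    private static final String MARKER = "| ";

    // Strips white space around each line, then drops the margin marker if present.
    // Static so that it can be passed as MyClass::stripMargin.
    public static String stripMargin(String string) {
        return string.lines().map(String::strip)
                     .map(s -> s.startsWith(MARKER) ? s.substring(MARKER.length()) : s)
                     .collect(Collectors.joining("\n", "", "\n"));
    }

    public static void main(String[] args) {
        String stripped = `
                              | The content of
                              | the string
                          `.transform(MyClass::stripMargin);
        System.out.print(stripped);
    }
}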

Output:

The content of
the string

It should be noted that concern for class file size and runtime impact by this design is addressed by the constant folding features of JEP 303.

Alternatives

Choice of Delimiters

A traditional string literal and a raw string literal both enclose their character sequence with delimiters. A traditional string literal uses the double-quote character as both the opening and closing delimiter. This symmetry makes the literal easy to read and parse. A raw string literal will also adopt symmetric delimiters, but it must use a different delimiter because the double-quote character may appear unescaped in the character sequence. The choice of delimiters for a raw string literal is informed by the following considerations:

We assume that the string-literal delimiter choice includes only the three Latin1 quote characters: single-quote, double-quote, and backtick. Any other choice would affect clarity and be inconsistent with traditional string literals.

Still, it is necessary to differentiate a raw string literal from a traditional string literal. For example, double-quote could be combined with other characters or custom phrases to form a kind of compound delimiter for raw string literals. For example, $"xyz"$ or abcd"xyz"abcd. These compound delimiters meet the basic requirements, but lack a clean and simple embedding of the closing delimiter. Also, there is a temptation in the custom phrases case to assign semantic meaning to the phrase, heralding another industry similar to Java annotations.

There is the possibility to use quote repetition: """xyz""". Here we have to be cautious to avoid ambiguity. Example: "" + x + "" can be parsed as the concatenation of a traditional string literal with a variable and another traditional string literal, or as a raw string literal for the seven-character string " + x + ".

The advantage of the backtick is that it does not require repurposing. We can also avoid the ambiguity created by quote repetition and the empty string. It is a new delimiter in terms of the Java Language Specification. It meets all the delimiter requirements, including a simple embedding rule.

Another consideration for choice of delimiters is the potential for future technologies. With raw and traditional string literals both using simple delimiters, any future technology could be applied symmetrically.

This JEP proposes to use the backtick character. It is distinct from existing quotes in the language but conveys a similar purpose.

Multi-line Traditional String Literals

Even though this option has been set aside as a raw string literal solution, it may still be reasonable to allow multi-line traditional string literals in addition to raw string literals. Enabling such a feature would affect tools and tests that assume that a multi-line traditional string literal is an error.

Other Languages

Java remains one of a small group of contemporary programming languages that do not provide language-level support for raw strings.

The following programming languages were surveyed for their delimiters and use of raw and multi-line strings: C, C++, C#, Dart, Go, Groovy, Haskell, Java, JavaScript, Kotlin, Perl, PHP, Python, R, Ruby, Scala and Swift. The Unix tools bash, grep and sed were also examined for string representations.

A multi-line literal solution could have been simply achieved by changing the Java specification to allow CR and LF in the body of a double-quote traditional string literal. However, the use of double quote implies that escapes must be interpreted.

A different delimiter was required to signify different interpretation behavior. Other languages chose a variety of delimiters:

Delimiters        Language/Tool
"""..."""         Groovy, Kotlin, Python, Scala, Swift
`...`             Go, JavaScript
@"..."            C#
R"..."            Groovy (old style)
R"xxx(...)xxx"    C/C++
%(...)            Ruby
qq{...}           Perl

Python, Kotlin, Groovy and Swift have opted to use triple double quotes to indicate raw strings. This choice reflects the connection with existing string literals.

Go and JavaScript use the backtick. This choice uses a character that is not commonly used in strings. This is not ideal for use in Markdown documents, but addresses a majority of cases.

A unique meta-tag such as @"..." used in C# provides similar functionality to the backticks proposed here. However, @ suggests annotations in Java. The use of another meta-tag limits the use of that meta-tag for future purposes.

Heredoc

An alternative to quoting for raw strings is using "here" documents or heredocs. Heredocs were first used in Unix shells and have found their way into programming languages such as Perl. A heredoc has a placeholder and an end marker. The placeholder indicates where the string is to be inserted in the code and also describes the end marker. The end marker comes after the body of the string. For example,

System.out.println(<<HTML);
<html>
    <body>
        <p>Hello World.</p>
    </body>
</html>
HTML

Heredocs provide a solution for raw strings, but are thought by many to be an anachronism. They are also obtrusive and complicate margin management.

Testing

String test suites should be extended to duplicate existing tests, replacing traditional string literals with raw string literals.

Negative tests should be added to test corner cases for line terminators and end of compilation unit.

Tests should be added to test escape and margin management methods.

Tests should be added to ensure we can embed Java-in-Java and Markdown-in-Java.