JEP draft: Raw String Literals

Owner: Jim Laskey
Created: 2018/01/23 15:40
Updated: 2018/02/18 23:30
Type: Feature
Status: Submitted
Component: specification / language
Scope: SE
Discussion: amber dash dev at openjdk dot java dot net
Effort: M
Duration: M
Priority: 3
Reviewed by: Alex Buckley
Endorsed by: Brian Goetz
Issue: 8196004

Summary

Add a new kind of literal, a raw string literal, to the Java programming language. Like the traditional string literal, a raw string literal produces a String, but does not interpret string escapes and can span multiple lines of source code.

Definitions

Goals

Non-Goals

Motivation

Escape sequences have been used in many programming languages, including Java, to represent characters that cannot easily be represented directly. As an example, the escape sequence \n represents the ASCII newline control character. To print "hello" and "world" on separate lines, the string "hello\nworld\n" can be used:

System.out.print("hello\nworld\n");

Output:

hello
world

Besides suffering from readability issues, this example hard-codes the Unix new line convention, whereas other operating systems use alternate new line representations, such as \r\n (Windows). In Java, we use a higher-level method such as println to provide the platform-appropriate new line sequence:

System.out.println("hello");
System.out.println("world");

If "hello" and "world" are being displayed using a GUI library, control characters may not have any significance at all.

The escape sequence indicator, backslash, is represented in Java string literals as \\. This doubling up of backslashes leads to Leaning Toothpick Syndrome, where strings become difficult to interpret because of excessive backslashes. Java developers are familiar with examples such as:

Path path = Paths.get("C:\\Program Files\\foo");

Escape sequences, such as \" to represent the double-quote character, also lead to interpretation issues when used in non-Java grammars. For example, searching for a double-quote within a string requires:

Pattern pattern = Pattern.compile("\\\"");
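The layering in that pattern can be checked with a small sketch. The class name QuoteSearch and the method firstQuote are illustrative only and are not part of this proposal. The traditional literal holds just two characters, a backslash and a double-quote, and the regex engine then reads that pair as a literal double-quote.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuoteSearch {
    // Four source characters between the quotes, but only two characters
    // in the resulting String: a backslash followed by a double-quote.
    static final String REGEX = "\\\"";

    // Returns the index of the first double-quote in text, or -1 if none is found.
    static int firstQuote(String text) {
        Matcher m = Pattern.compile(REGEX).matcher(text);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        System.out.println(REGEX.length());
        System.out.println(firstQuote("say \"hi\""));
    }
}
```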

The reality of escape sequences is that they are often the exception and not the rule in everyday Java development. We rarely need control characters, and the presence of escapes adversely affects the readability and maintainability of our code. Once we come to this realization, the notion of a non-interpreted string literal becomes a well-reasoned result.

Real-world Java code, which frequently embeds fragments of other programs (SQL, JSON, XML, Regex, etc) in Java programs, needs a mechanism for capturing literal strings as-is, without special handling of Unicode escaping, backslash, or new lines.

This JEP proposes a new kind of literal, a raw string literal, which sets aside both Java escapes and Java line terminator specifications, to provide character sequences that under many circumstances are more readable and maintainable than the existing traditional string literal.

File Paths Example

Traditional String Literals

Runtime.getRuntime().exec("\"C:\\Program Files\\foo\" bar");

Raw String Literals

Runtime.getRuntime().exec(`"C:\Program Files\foo" bar`);

Multi-line Example

Traditional String Literals

String html = "<html>\n" +
              "    <body>\n" +
              "        <p>Hello World.</p>\n" +
              "    </body>\n" +
              "</html>\n";

Raw String Literals

String html = `<html>
                   <body>
                       <p>Hello World.</p>
                   </body>
               </html>
              `;

Regular Expression Example

Traditional String Literals

System.out.println("this".matches("\\w\\w\\w\\w"));

Raw String Literals

System.out.println("this".matches(`\w\w\w\w`));

Output:

true

Polyglot Example

Traditional String Literals

String script = "function hello() {\n" +
                "   print(\'\"Hello World\"\');\n" +
                "}\n" +
                "\n" +
                "hello();\n";
ScriptEngine engine = new ScriptEngineManager().getEngineByName("js");
Object obj = engine.eval(script);

Raw String Literals

String script = `function hello() {
                    print('"Hello World"');
                 }

                 hello();
                `;
ScriptEngine engine = new ScriptEngineManager().getEngineByName("js");
Object obj = engine.eval(script);

Output:

"Hello World"

Database Example

Traditional String Literals

String query = "SELECT `EMP_ID`, `LAST_NAME` FROM `EMPLOYEE_TB`\n" +
               "WHERE `CITY` = 'INDIANAPOLIS'\n" +
               "ORDER BY `EMP_ID`, `LAST_NAME`;\n";

Raw String Literals

String query = ``
                 SELECT `EMP_ID`, `LAST_NAME` FROM `EMPLOYEE_TB`
                 WHERE `CITY` = 'INDIANAPOLIS'
                 ORDER BY `EMP_ID`, `LAST_NAME`;
               ``;

Description

A raw string literal is a new form of literal.

Literal:
  IntegerLiteral
  FloatingPointLiteral
  BooleanLiteral
  CharacterLiteral
  StringLiteral
  RawStringLiteral
  NullLiteral

RawStringLiteral:
  RawStringDelimiter RawInputCharacter {RawInputCharacter} RawStringDelimiter

RawStringDelimiter:
    ` {`}

A raw string literal consists of one or more characters enclosed in sequences of backticks ` (\u0060) (backquote, accent grave). A raw string literal will open with a sequence of one or more backticks. The raw string literal will close when an equal number of backticks is encountered. Any other sequence of backticks is treated as part of the string.

Backticks can be embedded in a raw string literal by choosing an open/close sequence whose length differs from the length of every backtick sequence inside the string. For example, a string containing a single backtick can be delimited by two backticks.
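The delimiter rule can be illustrated with a small scanner. This is a minimal sketch written in standard Java, not the javac implementation; the class RawStringScanner and its methods are hypothetical names. It counts the opening backticks, then treats any later backtick sequence of a different length as part of the body.

```java
final class RawStringScanner {
    // Counts consecutive backticks in src starting at index from.
    static int backtickRun(String src, int from) {
        int i = from;
        while (i < src.length() && src.charAt(i) == '`') {
            i++;
        }
        return i - from;
    }

    // Returns the body of the raw string literal that opens at index start.
    // Throws if there is no opening backtick or no matching close sequence.
    static String scanBody(String src, int start) {
        int open = backtickRun(src, start);
        if (open == 0) {
            throw new IllegalArgumentException("no opening backtick at " + start);
        }
        int bodyStart = start + open;
        int i = bodyStart;
        while (i < src.length()) {
            if (src.charAt(i) != '`') {
                i++;
                continue;
            }
            int run = backtickRun(src, i);
            if (run == open) {
                return src.substring(bodyStart, i); // matching close sequence
            }
            i += run; // a sequence of a different length belongs to the body
        }
        throw new IllegalArgumentException("unterminated raw string literal");
    }

    public static void main(String[] args) {
        System.out.println(scanBody("``can`t``", 0));
    }
}
```

The unterminated case in this sketch corresponds to the situation the proposal treats as a compile-time error.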

Characters in a raw string literal are never interpreted, with the exception of CR and CRLF.

CR (\u000D) and CRLF (\u000D\u000A) sequences are always translated to LF (\u000A). This translation provides the least surprising behavior across platforms.
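The translation can be sketched as a plain helper method. The class LineTerminators and the method normalize are illustrative names; in the proposal this step is performed by the compiler while scanning the literal, not by library code.

```java
final class LineTerminators {
    // Translates CRLF pairs and lone CR characters to LF.
    static String normalize(String body) {
        StringBuilder out = new StringBuilder(body.length());
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '\r') {
                // Skip the LF of a CRLF pair so the pair yields a single LF.
                if (i + 1 < body.length() && body.charAt(i + 1) == '\n') {
                    i++;
                }
                out.append('\n');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("hello\r\nworld\r").equals("hello\nworld\n"));
    }
}
```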

It is a compile-time error to have an open backtick sequence and no corresponding close backtick sequence before the end of the compilation unit.

The Java specification stipulates that there are two kinds of escapes used in traditional string literals: Unicode escapes and escape sequences. Raw string literals never interpret escapes. That is, the individual characters that make up the escape are used as-is.

Unicode escapes, in the form \uxxxx, are processed as part of character input prior to interpretation by the lexer. To support the raw string literal as-is requirement, Unicode escape processing is disabled when the lexer encounters an opening backtick and reenabled when encountering a closing backtick. For consistency, Unicode escape \u0060 may not be used as a substitute for the opening backtick.

The following are examples of raw string literals:

`"`                // a string containing " alone
``can`t``          // a string containing 'c', 'a', 'n', '`' and 't'
`This is a string` // a string containing 16 characters
`\n`               // a string containing '\' and 'n'
`\u2022`           // a string containing '\', 'u', '2', '0', '2' and '2'
`This is a
two-line string`   // a single string constant

Once parsed, Strings from raw string literals are treated exactly as Strings from traditional string literals. The resulting class file does not preserve the original literal form.

Escapes

It is highly probable that a developer may want a string that is multi-line but has interpreted escape sequences. To facilitate this requirement, instance methods will be added to the String class to support run-time interpretation of escape sequences. Primarily,

public String unescape()

which translates each character sequence beginning with \ that has the same spelling as a sequence defined in the JLS (3.3 Unicode Escapes, 3.10.6. Escape Sequences for Character and String Literals) to the character represented by that sequence.

Examples (b0 thru b3 are true):

boolean b0 = `\n`.equals("\\n");
boolean b1 = `\n`.unescape().equals("\n");
boolean b2 = `\n`.length() == 2;
boolean b3 = `\n`.unescape().length() == 1;
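The behavior in these examples can be approximated in standard Java. The following is a minimal sketch under stated assumptions, not the proposed String.unescape() implementation: the class UnescapeSketch is hypothetical, octal escapes are omitted, a Unicode escape is recognized only as a backslash, a single u, and exactly four hex digits, and unrecognized sequences are copied unchanged. Traditional literals with doubled backslashes stand in for the raw string literals.

```java
final class UnescapeSketch {
    // Translates a subset of the JLS escapes; anything unrecognized is copied as-is.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c != '\\' || i + 1 >= s.length()) {
                out.append(c);
                i++;
                continue;
            }
            char next = s.charAt(i + 1);
            // The two strings below are aligned: escape letter -> translated character.
            int simple = "btnfr\"'\\".indexOf(next);
            if (simple >= 0) {
                out.append("\b\t\n\f\r\"'\\".charAt(simple));
                i += 2;
            } else if (next == 'u' && i + 6 <= s.length() && isHex(s, i + 2, i + 6)) {
                // Backslash, 'u', and four hex digits: emit the code unit they name.
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(c); // unknown escape: keep the backslash
                i++;
            }
        }
        return out.toString();
    }

    // True if every character of s in [from, to) is a hexadecimal digit.
    static boolean isHex(String s, int from, int to) {
        for (int i = from; i < to; i++) {
            if (Character.digit(s.charAt(i), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("\\n".length());
        System.out.println(unescape("\\n").length());
    }
}
```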

Other methods provide finer control over which escapes are translated.

There will also be a provision for tools to invert the translation, that is, to turn characters back into escapes. The following method will also be added to the String class,

public String escape()

which converts all characters less than ' ' into Unicode or character escape sequences, characters above '~' into Unicode escape sequences, and the characters ", ', and \ into escape sequences.

Examples (b0 thru b3 are true):

boolean b0 = "\n".escape().equals(`\n`);
boolean b1 = `•`.escape().equals(`\u2022`);
boolean b2 = "•".escape().equals(`\u2022`);
boolean b3 = !"•".escape().equals("\u2022");
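The inverse direction can be sketched the same way. This is an illustration of the rule stated above rather than the proposed String.escape() implementation; the class EscapeSketch is a hypothetical name, and lowercase hex digits in the generated Unicode escapes are an assumption of this sketch.

```java
final class EscapeSketch {
    // Replaces control characters, quotes, backslashes, and characters above '~'
    // with the escape sequences that would denote them in a traditional literal.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\b': out.append("\\b"); break;
                case '\t': out.append("\\t"); break;
                case '\n': out.append("\\n"); break;
                case '\f': out.append("\\f"); break;
                case '\r': out.append("\\r"); break;
                case '"':  out.append("\\\""); break;
                case '\'': out.append("\\'"); break;
                case '\\': out.append("\\\\"); break;
                default:
                    if (c < ' ' || c > '~') {
                        // Backslash, 'u', then the code unit as four hex digits.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("tab\there"));
    }
}
```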

Margin Management

One of the issues with multi-line strings is whether to format the string against the left margin (as in heredoc) or with the indentation used by surrounding code. Ideally, the string should blend with surrounding code. The question then becomes, what to do with the extraneous left spacing.

To provide a flexible solution, raw string literals are scanned with the margin intact. String methods are supplied to trim extraneous left spacing.
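One possible trimming strategy is to remove the indentation shared by all non-blank lines. The sketch below shows that idea only; the class MarginSketch and the method stripCommonIndent are hypothetical names, and this draft does not specify the actual String methods or their exact behavior.

```java
final class MarginSketch {
    // Counts the leading spaces and tabs on a line.
    static int indentOf(String line) {
        int n = 0;
        while (n < line.length() && (line.charAt(n) == ' ' || line.charAt(n) == '\t')) {
            n++;
        }
        return n;
    }

    // Removes the indentation common to all non-blank lines.
    // Lines containing only white space are emitted as empty lines.
    static String stripCommonIndent(String text) {
        String[] lines = text.split("\n", -1);
        int margin = Integer.MAX_VALUE;
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                margin = Math.min(margin, indentOf(line));
            }
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                out.append('\n');
            }
            if (!lines[i].trim().isEmpty()) {
                out.append(lines[i].substring(margin));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "    <html>\n        <body>\n        </body>\n    </html>\n  ";
        System.out.print(stripCommonIndent(html));
    }
}
```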

Alternatives

Choice of Delimiters

A traditional string literal and a raw string literal both enclose their character sequence with delimiters. A traditional string literal uses the double-quote character as both the opening and closing delimiter. This symmetry makes the literal easy to read and parse. A raw string literal will also adopt symmetric delimiters, but it must use a different delimiter because the double-quote character may appear unescaped in the character sequence. The choice of delimiters for a raw string literal is informed by the following considerations:

It is assumed that string literal delimiter choice involves only the three Latin1 quote characters: single-quote, double-quote, and backtick. Any other choice would affect clarity and be inconsistent with traditional string literals.

Still, it is necessary to differentiate a raw string literal from a traditional string literal. One option is to combine double-quote with other characters or custom phrases to form a kind of compound delimiter for raw string literals, such as $"xyz"$ or abcd"xyz"abcd. These compound delimiters meet the basic requirements, but lack a clean and simple embedding of the closing delimiter. Also, there is a temptation in the custom phrases case to assign semantic meaning to the phrase, heralding another industry similar to Java annotations.

There is the possibility to use quote repetition """xyz""". Here we have to be cautious to avoid ambiguity. Example: "" + x + "" can be parsed as the concatenation of a traditional string literal with a variable and another traditional string literal; or as a raw string literal for the seven-character sequence + x +.

The advantage of the backtick is that it does not require repurposing. We can also avoid the ambiguity created by quote repetition and the empty string. It is a new delimiter in Java Specification terms. It meets all the delimiter requirements, including a simple embedding rule.

Another consideration for choice of delimiters is the potential for future technologies. With raw and traditional string literals both using simple delimiters, any future technology could be applied symmetrically.

This JEP recommends the use of the backtick character. It is distinct from the other Java quotes but conveys similar purpose.

Multi-line Traditional String Literals

Even though this option has been set aside as a raw string literal solution, it may still be reasonable to allow multi-line traditional string literals in addition to raw string literals. Enabling such a feature would affect tools and tests that assume a multi-line traditional string literal is an error.

Other Languages

Java remains one of a small group of contemporary programming languages that do not provide language-level support for raw strings.

The following programming languages were surveyed for their delimiters and their use of raw and multi-line strings: C, C++, C#, Dart, Go, Groovy, Haskell, Java, JavaScript, Kotlin, Perl, PHP, Python, R, Ruby, Scala and Swift. The Unix tools bash, grep and sed were also probed for string representations.

A multi-line literal solution could have been simply achieved by changing the Java specification to allow CR and LF in the body of a double quote traditional string literal. However, the use of double quote implies that Java interpretation of escapes must take place.

A different delimiter was required to signify different interpretation behavior. Other languages chose a variety of delimiters:

Delimiters        Language/Tool
"""..."""         Python, Kotlin, Groovy, Swift
`...`             Go, JavaScript
@"..."            C#
R"..."            Groovy (old style)
R"xxx(...)xxx"    C/C++
raw"..."          Scala
%(...)            Ruby
qq{...}           Perl

Python, Kotlin, Groovy and Swift have opted to use triple double quotes to indicate a raw string. This choice reflects the connection with the existing string literal.

Go and JavaScript use the backtick. This choice uses a character that is not commonly used in strings. This is not ideal for use in markdown documents, but addresses a majority of cases.

A unique meta-tag such as @"..." used in C# provides similar functionality to the proposed backticks. However, @ implies annotations in Java. The use of another meta-tag limits the use of that meta-tag for future purposes.

Heredoc

An alternative to quoting for raw strings is using here documents or heredocs. Heredocs were first used in Unix shells and have found their way into programming languages such as Perl. A heredoc has a placeholder and an end marker. The placeholder indicates where the string is to be inserted in the code as well as providing the description of end marker. The end marker comes after the body of the string. For example,

System.out.println(<<HTML);
<html>
    <body>
        <p>Hello World.</p>
    </body>
</html>
HTML

Heredocs provide a solution for raw strings, but are thought to be an anachronism. They are also obtrusive and complicate margin management.

Testing

String test suites should be extended to duplicate existing tests replacing traditional string literals with raw string literals.

Negative tests should be added to test corner cases for line terminators and end of compilation unit.

Tests should be added to test escape and margin management methods.

Risks and Assumptions

There is an assumption that raw string literals containing Markdown, Go or JavaScript will infrequently use backticks and that the use of repeating backtick delimiters is less intrusive than other delimiters.

Dependencies