JEP 111: Additional Unicode Constructs for Regular Expressions

Owner: Xueming Shen
Discussion: core dash libs dash dev at openjdk dot java dot net
Endorsed by: Brian Goetz
Created: 2011/07/26 20:00
Updated: 2016/01/18 04:55


Adopt further regular-expression constructs from Unicode TR#18.


The primary motivation is to enrich Unicode support so that developers can write sophisticated Unicode-aware regular expressions on the Java platform. This is important for keeping the Java platform competitive with other languages that already offer more complete support for Unicode regular expressions.


Java regular expressions are derived from Perl regular expressions and are intended to give Java developers most of the Perl-style regular-expression features. Over the past few years, Perl regular expressions have evolved rapidly to follow Unicode TR#18, Unicode Regular Expressions. Java regular expressions claim conformance with Level 1 of TR#18, plus RL2.1 Canonical Equivalents; Level 1 is the lowest level of conformance. Given that the Unicode Standard is widely accepted as the de facto standard for development platforms, and that Java uses Unicode as its internal encoding scheme, higher-level Unicode support is desirable for developers working on Unicode-aware applications. The following new constructs and features are proposed to provide better Unicode support in Java regular expressions:


All of the new regex constructs listed here will be covered by new unit tests, which will run under the existing test framework.
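
For reference, the existing baseline that such tests would build on (Level 1 property support plus RL2.1 canonical equivalence) can already be exercised through the current java.util.regex API. The following is a minimal sketch, assuming Java 7 or later; the class and method names are illustrative only and are not taken from the JDK test suite, and the sketch does not cover any of the newly proposed constructs.

```java
import java.util.regex.Pattern;

// Illustrative helper class (hypothetical name) exercising Unicode
// features that java.util.regex already provides.
class UnicodeRegexBaseline {

    // Level 1 property support: match a run of characters by Unicode script.
    static boolean isGreekWord(String s) {
        return Pattern.matches("\\p{IsGreek}+", s);
    }

    // RL2.1 Canonical Equivalents: with CANON_EQ, a decomposed pattern
    // also matches the canonically equivalent precomposed input.
    static boolean matchesCanonically(String regex, String input) {
        return Pattern.compile(regex, Pattern.CANON_EQ).matcher(input).matches();
    }

    // UNICODE_CHARACTER_CLASS makes predefined classes such as \w
    // follow their Unicode definitions instead of ASCII-only ones.
    static boolean isUnicodeWord(String s) {
        return Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s).matches();
    }

    public static void main(String[] args) {
        // Greek small letters alpha, beta, gamma
        System.out.println(isGreekWord("\u03b1\u03b2\u03b3"));
        // Pattern: 'a' + combining ring above; input: precomposed a-ring
        System.out.println(matchesCanonically("a\u030A", "\u00E5"));
        // French word with accented letters
        System.out.println(isUnicodeWord("\u00e9t\u00e9"));
    }
}
```

Tests for the new constructs would presumably follow the same shape: compile a pattern that uses the construct, then assert on matching and non-matching inputs.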