New string features in ECMAScript 6

This blog post covers new features of strings in ECMAScript 6 (ES6).




Unicode code point escapes

Unicode “characters” (code points) are 21 bits long [2]. JavaScript strings are (roughly) sequences of 16-bit characters, encoded as UTF-16. Therefore, code points beyond the first 16 bits of the code point range (the Basic Multilingual Plane, BMP) are represented by two JavaScript characters. Until now, if you wanted to specify such code points via numbers, you needed two so-called Unicode escapes. As an example, the following statement prints a rocket (code point 0x1F680) to most consoles:



console.log('\uD83D\uDE80');

In ECMAScript 6, there is a new kind of Unicode escape that lets you specify any code point:



console.log('\u{1F680}');
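
To make the relationship between the two notations concrete, here is a minimal sketch of the UTF-16 arithmetic that turns a non-BMP code point into its two escapes. The helper name toSurrogatePair is made up for this illustration; it is not part of ECMAScript 6.

```javascript
// Hypothetical helper (not a built-in): split a non-BMP code point
// into the two 16-bit values that UTF-16 uses to encode it.
function toSurrogatePair(codePoint) {
    if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
        throw new RangeError('Expected a code point between 0x10000 and 0x10FFFF');
    }
    const offset = codePoint - 0x10000;     // 20 bits remain
    const high = 0xD800 + (offset >> 10);   // upper 10 bits
    const low = 0xDC00 + (offset & 0x3FF);  // lower 10 bits
    return [high, low];
}

const [high, low] = toSurrogatePair(0x1F680);
console.log(high.toString(16), low.toString(16)); // d83d de80
console.log('\u{1F680}' === '\uD83D\uDE80');      // true
```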

String interpolation, multi-line string literals and raw string literals

Template strings [3] provide three interesting features.


First, template strings support string interpolation:



let first = 'Jane';
let last = 'Doe';
console.log(`Hello ${first} ${last}!`);
// Hello Jane Doe!
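
Interpolation is not limited to variable names. As a small sketch (the variable items is made up for this example), any expression can appear inside ${}:

```javascript
const first = 'Jane';
const last = 'Doe';
const items = ['tea', 'milk'];

// Method calls and arithmetic are evaluated, then converted to strings.
const greeting = `Hello ${first.toUpperCase()} ${last}, you have ${items.length + 1} items.`;
console.log(greeting);
// Hello JANE Doe, you have 3 items.
```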

Second, template strings can contain multiple lines:



let multiLine = `
This is
a string
with multiple
lines`;
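
One detail that is easy to miss: the line break directly after the opening backtick is part of the string. The following sketch checks that by splitting the same string into lines:

```javascript
const multiLine = `
This is
a string
with multiple
lines`;

const lines = multiLine.split('\n');
console.log(lines.length);    // 5
console.log(lines[0] === ''); // true: the line break after the opening backtick

// trim() removes the leading line break if it is not wanted.
console.log(multiLine.trim().split('\n').length); // 4
```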

Third, template strings are “raw” if you prefix them with the tag String.raw – the backslash is not a special character and escapes such as \n are not interpreted:



let raw = String.raw`Not a newline: \n`;
console.log(raw === 'Not a newline: \\n'); // true
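
Note that “raw” only affects backslashes. As the following sketch shows, substitutions via ${} are still evaluated (the variable n is made up for this example):

```javascript
const n = 2;
const raw = String.raw`\t${n + 1}\n`;

console.log(raw);               // \t3\n
console.log(raw === '\\t3\\n'); // true
console.log(raw.length);        // 5 (backslash, t, 3, backslash, n)
```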

Iterating over strings

Strings are iterable [4], which means that you can use for-of to iterate over their characters:



for (let ch of 'abc') {
    console.log(ch);
}
// Output:
// a
// b
// c

And you can use the spread operator (...) to turn strings into arrays:



let chars = [...'abc'];
// ['a', 'b', 'c']

Handling Unicode code points

The string iterator splits strings along code point boundaries, which means that the strings it returns comprise one or two characters:



for (let ch of 'x\uD83D\uDE80y') {
    console.log(ch.length);
}
// Output:
// 1
// 2
// 1

That gives you a quick way to count the Unicode code points in a string:



> [...'x\uD83D\uDE80y'].length
3
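
If you need the count in several places, you can wrap it in a function. The following is only a sketch; countCodePoints is a made-up name, not a built-in. It uses for-of instead of spreading, so no intermediate array is created:

```javascript
// Hypothetical helper: count code points by iterating over the string.
function countCodePoints(str) {
    let count = 0;
    for (const _ of str) {
        count++;
    }
    return count;
}

const s = 'x\uD83D\uDE80y';
console.log(s.length);           // 4 (JavaScript characters)
console.log(countCodePoints(s)); // 3 (code points)
```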

It also helps with reversing strings that contain non-BMP code points:



let str = 'x\uD83D\uDE80y';

// ES5: \uD83D\uDE80 are (incorrectly) reversed
console.log(str.split('').reverse().join(''));
// 'y\uDE80\uD83Dx'

// ES6: order of \uD83D\uDE80 is preserved
console.log([...str].reverse().join(''));
// 'y\uD83D\uDE80x'

[Figure: The two reversed strings in the Firefox console.]
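
The ES6 approach can be packaged as a small function. This is a sketch under one assumption: it only keeps surrogate pairs together and makes no attempt to keep combining marks with their base characters. The name reverseCodePoints is made up for this example.

```javascript
// Hypothetical helper: reverse a string code point by code point.
// Array.from() uses the string iterator, just like the spread operator.
function reverseCodePoints(str) {
    return Array.from(str).reverse().join('');
}

const reversed = reverseCodePoints('x\uD83D\uDE80y');
console.log(reversed === 'y\uD83D\uDE80x');        // true
console.log(reversed.codePointAt(1).toString(16)); // 1f680
```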


Numeric values of code points

The new method codePointAt() returns the numeric value of a code point at a given index in a string:



let str = 'x\uD83D\uDE80y';
console.log(str.codePointAt(0).toString(16)); // 78
console.log(str.codePointAt(1).toString(16)); // 1f680
console.log(str.codePointAt(3).toString(16)); // 79
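
The example above skips index 2 for a reason: positions are still counted in JavaScript characters, so index 2 points into the middle of the rocket. The following sketch shows what happens there and at an index beyond the end of the string:

```javascript
const str = 'x\uD83D\uDE80y';

console.log(str.codePointAt(1).toString(16)); // 1f680 (whole code point)
console.log(str.codePointAt(2).toString(16)); // de80 (only the second half)
console.log(str.charCodeAt(1).toString(16));  // d83d (ES5 method: first half only)
console.log(str.codePointAt(4));              // undefined (out of range)
```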

This method works well when combined with iteration over strings:



for (let ch of 'x\uD83D\uDE80y') {
    console.log(ch.codePointAt(0).toString(16));
}
// Output:
// 78
// 1f680
// 79

The opposite of codePointAt() is String.fromCodePoint():



> String.fromCodePoint(0x78, 0x1f680, 0x79) === 'x\uD83D\uDE80y'
true
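
To see why a new method was needed, it helps to compare it with the ES5 method String.fromCharCode(), which keeps only the lower 16 bits of each number. The sketch below also round-trips a string through its code points:

```javascript
const rocket = String.fromCodePoint(0x1F680);
console.log(rocket.length);             // 2
console.log(rocket === '\uD83D\uDE80'); // true

// String.fromCharCode() truncates 0x1F680 to 0xF680.
console.log(String.fromCharCode(0x1F680) === '\uF680'); // true

// Round trip: string -> code points -> string
const codePoints = Array.from('x\uD83D\uDE80y', ch => ch.codePointAt(0));
console.log(codePoints.length); // 3
console.log(String.fromCodePoint(...codePoints) === 'x\uD83D\uDE80y'); // true
```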

Checking for containment and repeating strings

Three new methods check whether a string exists within another string:



> 'hello'.startsWith('hell')
true
> 'hello'.endsWith('ello')
true
> 'hello'.includes('ell')
true

Each of these methods has a position as an optional second parameter, which specifies where the string to be searched starts or ends:



> 'hello'.startsWith('ello', 1)
true
> 'hello'.endsWith('hell', 4)
true

> 'hello'.includes('ell', 1)
true
> 'hello'.includes('ell', 2)
false
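
One way to think about the position parameter is in terms of ES5 operations. The following sketch compares each call above with an equivalent expression based on slice() or indexOf():

```javascript
const s = 'hello';

// startsWith(x, pos): check the part of the string that begins at pos.
console.log(s.startsWith('ello', 1) === s.slice(1).startsWith('ello')); // true

// endsWith(x, endPos): check the first endPos characters only.
console.log(s.endsWith('hell', 4) === s.slice(0, 4).endsWith('hell'));  // true

// includes(x, pos): same result as an indexOf() search starting at pos.
console.log(s.includes('ell', 2) === (s.indexOf('ell', 2) !== -1));     // true
```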

The repeat() method repeats strings:



> 'doo '.repeat(3)
'doo doo doo '
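
A typical use of repeat() is padding. The following is a minimal sketch; padLeft is a made-up helper that assumes fill is a single character. It also shows two edge cases: a count of 0 produces the empty string and a negative count throws a RangeError.

```javascript
// Hypothetical helper: pad str on the left until it is width characters long.
function padLeft(str, width, fill = ' ') {
    const missing = Math.max(0, width - str.length);
    return fill.repeat(missing) + str;
}

console.log(padLeft('42', 5, '0')); // 00042
console.log(padLeft('hello', 3));   // hello (already long enough)

console.log('abc'.repeat(0) === ''); // true
try {
    'abc'.repeat(-1);
} catch (e) {
    console.log(e instanceof RangeError); // true
}
```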

All new methods

Template strings:



  • String.raw(callSite, ...substitutions) : string
    Template string tag for “raw” content (backslashes are not interpreted).


Unicode and code points:



  • String.fromCodePoint(...codePoints : number[]) : string
    Turns numbers denoting Unicode code points into a string.

  • String.prototype.codePointAt(pos) : number
    Returns the number of the code point starting at position pos (comprising one or two JavaScript “characters”).

  • String.prototype.normalize(form? : string) : string
    Different combinations of code points may look the same. Unicode normalization changes them all to the same value(s), their so-called canonical representation. That helps with comparing and searching for strings. The 'NFC' form is recommended for general text.
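
As a brief illustration of normalize(), the letter ñ can be written either as a single code point or as n followed by a combining tilde. The sketch below shows that the two spellings only compare as equal after normalization:

```javascript
const composed = '\u00F1';    // ñ as one code point
const decomposed = 'n\u0303'; // n followed by a combining tilde

console.log(composed === decomposed); // false
console.log(composed.normalize('NFC') === decomposed.normalize('NFC')); // true

console.log(composed.normalize('NFD').length); // 2
console.log(decomposed.normalize().length);    // 1 (the default form is 'NFC')
```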


Finding strings:



  • String.prototype.startsWith(searchString, position=0) : boolean
    Does the receiver start with searchString? position lets you specify where the string to be checked starts.

  • String.prototype.endsWith(searchString, endPosition=this.length) : boolean
    Does the receiver end with searchString? endPosition lets you specify where the string to be checked ends.

  • String.prototype.includes(searchString, position=0) : boolean
    Does the receiver contain searchString? position lets you specify where the string to be searched starts.


Repeating strings:



  • String.prototype.repeat(count) : string
    Returns the receiver, concatenated count times.


Further reading


  1. Using ECMAScript 6 today [an early draft of my book on ECMAScript 6]

  2. Chapter 24, “Unicode and JavaScript” of “Speaking JavaScript”; includes an introduction to Unicode.

  3. Template strings: embedded DSLs in ECMAScript 6

  4. Iterators and generators in ECMAScript 6

