Function std.utf.byCodeUnit
Iterate a range of char, wchar, or dchars by code unit.
auto byCodeUnit(R)
(
R r
)
if (isConvertibleToString!R && !isStaticArray!R || isInputRange!R && isSomeChar!(ElementEncodingType!R));
The purpose is to bypass the special case decoding that front does to
character arrays. As a result, using ranges with byCodeUnit can be nothrow
while front throws when it encounters invalid Unicode sequences.
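The difference can be sketched as follows. This is a minimal illustration, not part of the original documentation: the helper name countSpaces and the sample input are hypothetical, and the throwing branch is guarded because it only applies when autodecoding is enabled.

```d
import std.exception : assertThrown;
import std.range.primitives : front;
import std.traits : isAutodecodableString;
import std.utf : byCodeUnit, UTFException;

// Hypothetical helper: walks the code units without decoding,
// so it can be marked nothrow and @nogc.
size_t countSpaces(string s) nothrow @nogc
{
    size_t n;
    foreach (c; s.byCodeUnit)
        if (c == ' ')
            ++n;
    return n;
}

void main()
{
    assert(countSpaces("a b c") == 2);

    // 0xFF can never appear in well-formed UTF-8.
    string bad = "\xFF";

    // byCodeUnit hands back the raw code unit without validating it.
    assert(bad.byCodeUnit.front == 0xFF);

    static if (isAutodecodableString!string)
    {
        // With autodecoding, front decodes and rejects the invalid sequence.
        assertThrown!UTFException(bad.front);
    }
}
```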
A code unit is a building block of the UTF encodings. Generally, an
individual code unit does not represent what's perceived as a full
character (a.k.a. a grapheme cluster in Unicode terminology). Many characters
are encoded with multiple code units. For example, the UTF-8 code units for
ø are 0xC3 0xB8. That means an individual element of byCodeUnit often does
not form a character on its own. Attempting to treat it as one while
iterating over the resulting range will give nonsensical results.
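As a small sketch of the ø case above (the variable names are illustrative only), the same character occupies two code units in UTF-8 but a single code unit in UTF-16 and UTF-32:

```d
import std.utf : byCodeUnit;

void main()
{
    // U+00F8 is ø, written as an escape so the source encoding does not matter.
    string  s8  = "\u00F8";
    wstring s16 = "\u00F8"w;
    dstring s32 = "\u00F8"d;

    // UTF-8: two code units, neither of which is a character on its own.
    assert(s8.byCodeUnit.length == 2);
    assert(s8.byCodeUnit[0] == 0xC3);
    assert(s8.byCodeUnit[1] == 0xB8);

    // UTF-16 and UTF-32: a single code unit holding the code point itself.
    assert(s16.byCodeUnit.length == 1);
    assert(s16.byCodeUnit[0] == 0x00F8);
    assert(s32.byCodeUnit.length == 1);
}
```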
Parameters
Name | Description |
---|---|
r | an input range of characters (including strings) or a type that implicitly converts to a string type. |
Returns
If r is not an auto-decodable string (i.e. a narrow string or a user-defined type that implicitly converts to a string type), then r is returned.
Otherwise, r is converted to its corresponding string type (if it's not already a string) and wrapped in a random-access range where the element encoding type of the string (its code unit) is the element type of the range, and that range is returned. The range has slicing.
If r is quirky enough to be a struct or class which is an input range of characters on its own (i.e. it has the input range API as member functions), and it's implicitly convertible to a string type, then r is returned, and no implicit conversion takes place.
If r is wrapped in a new range, then that range has a source property for returning the string that's currently contained within that range.
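The cases above can be checked at compile time. The following is a rough sketch rather than an exhaustive test; it only covers built-in string types and guards the wrapped case, since wrapping only happens when autodecoding is enabled.

```d
import std.range.primitives : hasSlicing, isRandomAccessRange;
import std.traits : isAutodecodableString;
import std.utf : byCodeUnit;

void main()
{
    // A dstring is never auto-decoded, so it comes back as a dstring.
    auto wide = "abc"d.byCodeUnit;
    static assert(is(typeof(wide) == dstring));

    auto narrow = "abc".byCodeUnit;
    static if (isAutodecodableString!string)
    {
        // Narrow strings are wrapped in a new random-access range type...
        static assert(!is(typeof(narrow) == string));
        static assert(isRandomAccessRange!(typeof(narrow)));
        static assert(hasSlicing!(typeof(narrow)));

        // ...whose source property yields the string it currently holds.
        assert(narrow[1 .. 3].source == "bc");
    }
    else
    {
        // Without autodecoding there is nothing to bypass.
        static assert(is(typeof(narrow) == string));
    }
}
```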
See Also
Refer to the std.uni docs for a reference on Unicode terminology.
For a range that iterates by grapheme cluster (written character) see
byGrapheme.
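To make the contrast concrete, the sketch below counts the same text at three levels of granularity. It assumes byGrapheme from std.uni and byDchar from std.utf; the sample string is chosen only for illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

void main()
{
    // "noël" spelled with e followed by a combining diaeresis (U+0308).
    string noel = "noe\u0308l";

    // Six UTF-8 code units: U+0308 alone takes two of them.
    assert(noel.byCodeUnit.length == 6);

    // Five code points: n, o, e, U+0308, l.
    assert(noel.byDchar.walkLength == 5);

    // Four grapheme clusters, matching what a reader sees.
    assert(noel.byGrapheme.walkLength == 4);
}
```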
Example
import std.range.primitives;
import std.traits : isAutodecodableString;
auto r = "Hello, World!".byCodeUnit();
static assert(hasLength!(typeof(r)));
static assert(hasSlicing!(typeof(r)));
static assert(isRandomAccessRange!(typeof(r)));
static assert(is(ElementType!(typeof(r)) == immutable char));
// contrast with the range capabilities of standard strings (with or
// without autodecoding enabled).
auto s = "Hello, World!";
static assert(isBidirectionalRange!(typeof(s)));
static if (isAutodecodableString!(typeof(s)))
{
    // with autodecoding enabled, strings are non-random-access ranges of
    // dchar.
    static assert(is(ElementType!(typeof(s)) == dchar));
    static assert(!isRandomAccessRange!(typeof(s)));
    static assert(!hasSlicing!(typeof(s)));
    static assert(!hasLength!(typeof(s)));
}
else
{
    // without autodecoding, strings are normal arrays.
    static assert(is(ElementType!(typeof(s)) == immutable char));
    static assert(isRandomAccessRange!(typeof(s)));
    static assert(hasSlicing!(typeof(s)));
    static assert(hasLength!(typeof(s)));
}
Example
byCodeUnit does no Unicode decoding.
import std.stdio : writeln;
string noel1 = "noe\u0308l"; // noël using e + combining diaeresis
assert(noel1.byCodeUnit[2] != 'ë');
writeln(noel1.byCodeUnit[2]); // 'e'
string noel2 = "no\u00EBl"; // noël using a precomposed ë character
// Because string is UTF-8, the code unit at index 2 is just
// the first of a sequence that encodes 'ë'
assert(noel2.byCodeUnit[2] != 'ë');
Example
byCodeUnit exposes a source property when wrapping narrow strings.
import std.algorithm.comparison : equal;
import std.range : popFrontN;
import std.stdio : writeln;
import std.traits : isAutodecodableString;
{
    auto range = byCodeUnit("hello world");
    range.popFrontN(3);
    assert(equal(range.save, "lo world"));
    static if (isAutodecodableString!string) // only enabled with autodecoding
    {
        string str = range.source;
        writeln(str); // "lo world"
    }
}
// source only exists if the range was wrapped
{
    auto range = byCodeUnit("hello world"d);
    static assert(!__traits(compiles, range.source));
}