std.encoding
Category | Functions |
---|---|
Decode | codePoints decode decodeReverse safeDecode |
Conversion | codeUnits sanitize transcode |
Classification | canEncode isValid isValidCodePoint isValidCodeUnit |
BOM | BOM BOMSeq getBOM utfBOM |
Length & Index | firstSequence encodedLength index lastSequence validLength |
Encoding schemes | encodingName EncodingScheme EncodingSchemeASCII EncodingSchemeLatin1 EncodingSchemeLatin2 EncodingSchemeUtf16Native EncodingSchemeUtf32Native EncodingSchemeUtf8 EncodingSchemeWindows1250 EncodingSchemeWindows1252 |
Representation | AsciiChar AsciiString Latin1Char Latin1String Latin2Char Latin2String Windows1250Char Windows1250String Windows1252Char Windows1252String |
Exceptions | INVALID_SEQUENCE EncodingException |
An encoder/decoder can be constructed at run time by name:
    auto e = EncodingScheme.create("utf-8");
This library supplies EncodingScheme subclasses for ASCII, ISO-8859-1 (also known as LATIN-1), ISO-8859-2 (LATIN-2), WINDOWS-1250, WINDOWS-1252, UTF-8, and (on little-endian architectures) UTF-16LE and UTF-32LE; or (on big-endian architectures) UTF-16BE and UTF-32BE. It also provides a mechanism whereby other modules may add EncodingScheme subclasses for any other encoding.
Source: std/encoding.d
- enum dchar INVALID_SEQUENCE; - Special value returned by safeDecode.
- enum AsciiChar : ubyte; alias AsciiString = immutable(AsciiChar)[]; - Defines various character sets.
- enum Latin1Char : ubyte; - Defines a Latin1-encoded character.
- alias Latin1String = immutable(Latin1Char)[]; - Defines a Latin1-encoded string (as an array of immutable(Latin1Char)).
- enum Latin2Char : ubyte; - Defines a Latin2-encoded character.
- alias Latin2String = immutable(Latin2Char)[]; - Defines a Latin2-encoded string (as an array of immutable(Latin2Char)).
- enum Windows1250Char : ubyte; - Defines a Windows1250-encoded character.
- alias Windows1250String = immutable(Windows1250Char)[]; - Defines a Windows1250-encoded string (as an array of immutable(Windows1250Char)).
- enum Windows1252Char : ubyte; - Defines a Windows1252-encoded character.
- alias Windows1252String = immutable(Windows1252Char)[]; - Defines a Windows1252-encoded string (as an array of immutable(Windows1252Char)).
- pure nothrow @nogc @safe bool isValidCodePoint(dchar c); - Returns true if c is a valid code point.
Note that this includes the non-character code points U+FFFE and U+FFFF, since these are valid code points (even though they are not valid characters).
Supersedes: This function supersedes std.utf.startsValidDchar().
Standards: Unicode 5.0, ASCII, ISO-8859-1, ISO-8859-2, WINDOWS-1250, WINDOWS-1252
Parameters:
dchar c: the code point to be tested
- @property string encodingName(T)(); - Returns the name of an encoding.
The type of encoding cannot be deduced. Therefore, it is necessary to explicitly specify the encoding type.
Standards: Unicode 5.0, ASCII, ISO-8859-1, ISO-8859-2, WINDOWS-1250, WINDOWS-1252
Examples:
    writeln(encodingName!(char)); // "UTF-8"
    writeln(encodingName!(wchar)); // "UTF-16"
    writeln(encodingName!(dchar)); // "UTF-32"
    writeln(encodingName!(AsciiChar)); // "ASCII"
    writeln(encodingName!(Latin1Char)); // "ISO-8859-1"
    writeln(encodingName!(Latin2Char)); // "ISO-8859-2"
    writeln(encodingName!(Windows1250Char)); // "windows-1250"
    writeln(encodingName!(Windows1252Char)); // "windows-1252"
- bool canEncode(E)(dchar c); - Returns true iff it is possible to represent the specified codepoint in the encoding.
The type of encoding cannot be deduced. Therefore, it is necessary to explicitly specify the encoding type.
Standards: Unicode 5.0, ASCII, ISO-8859-1, ISO-8859-2, WINDOWS-1250, WINDOWS-1252
Examples:
    assert( canEncode!(Latin1Char)('A'));
    assert( canEncode!(Latin2Char)('A'));
    assert(!canEncode!(AsciiChar)('\u00A0'));
    assert( canEncode!(Latin1Char)('\u00A0'));
    assert( canEncode!(Latin2Char)('\u00A0'));
    assert( canEncode!(Windows1250Char)('\u20AC'));
    assert(!canEncode!(Windows1250Char)('\u20AD'));
    assert(!canEncode!(Windows1250Char)('\uFFFD'));
    assert( canEncode!(Windows1252Char)('\u20AC'));
    assert(!canEncode!(Windows1252Char)('\u20AD'));
    assert(!canEncode!(Windows1252Char)('\uFFFD'));
    assert(!canEncode!(char)(cast(dchar) 0x110000));
Examples: How to check an entire string
    import std.algorithm.searching : find;
    import std.utf : byDchar;
    assert("The quick brown fox"
        .byDchar
        .find!(x => !canEncode!AsciiChar(x))
        .empty);
- bool isValidCodeUnit(E)(E c); - Returns true if the code unit is legal. For example, the byte 0x80 would not be legal in ASCII, because ASCII code units must always be in the range 0x00 to 0x7F.
Standards: Unicode 5.0, ASCII, ISO-8859-1, ISO-8859-2, WINDOWS-1250, WINDOWS-1252
Parameters:
E c: the code unit to be tested
Examples:
    assert(!isValidCodeUnit(cast(char) 0xC0));
    assert(!isValidCodeUnit(cast(char) 0xFF));
    assert( isValidCodeUnit(cast(wchar) 0xD800));
    assert(!isValidCodeUnit(cast(dchar) 0xD800));
    assert(!isValidCodeUnit(cast(AsciiChar) 0xA0));
    assert( isValidCodeUnit(cast(Windows1250Char) 0x80));
    assert(!isValidCodeUnit(cast(Windows1250Char) 0x81));
    assert( isValidCodeUnit(cast(Windows1252Char) 0x80));
    assert(!isValidCodeUnit(cast(Windows1252Char) 0x81));
- bool isValid(E)(const(E)[] s); - Returns true if the string is encoded correctly.
Supersedes: This function supersedes std.utf.validate(). Note, however, that this function returns a bool indicating whether the input was valid, whereas the older function throws an exception.
Standards: Unicode 5.0, ASCII, ISO-8859-1, ISO-8859-2, WINDOWS-1250, WINDOWS-1252
Parameters:
const(E)[] s: the string to be tested
Examples:
    assert( isValid("\u20AC100"));
    assert(!isValid(cast(char[3])[167, 133, 175]));
- size_t validLength(E)(const(E)[] s); - Returns the length of the longest possible substring, starting from the first code unit, which is validly encoded.
Standards: Unicode 5.0, ASCII, ISO-8859-1, ISO-8859-2, WINDOWS-1250, WINDOWS-1252
Parameters:
const(E)[] s: the string to be tested
- immutable(E)[] sanitize(E)(immutable(E)[] s); - Sanitizes a string by replacing malformed code unit sequences with valid code unit sequences. The result is guaranteed to be valid for this encoding.
If the input string is already valid, this function returns the original; otherwise it constructs a new string by replacing all illegal code unit sequences with the encoding's replacement character. Invalid sequences are replaced with the Unicode replacement character (U+FFFD) if the character repertoire contains it; otherwise they are replaced with '?'.
Standards: Unicode 5.0, ASCII, ISO-8859-1, ISO-8859-2, WINDOWS-1250, WINDOWS-1252
Parameters:
immutable(E)[] s: the string to be sanitized
Examples:
    writeln(sanitize("hello \xF0\x80world")); // "hello \xEF\xBF\xBDworld"
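The validation functions above compose naturally: isValid answers yes or no, validLength locates where the trouble starts, and sanitize repairs the input. The following is a minimal sketch of that workflow. The helper names validPrefix and invalidTail are hypothetical and not part of std.encoding; the input reuses the malformed sequence from the sanitize example.

```d
import std.encoding : isValid, sanitize, validLength;

/// Longest validly encoded prefix of `s`
/// (hypothetical helper, not part of std.encoding).
string validPrefix(string s)
{
    return s[0 .. validLength(s)];
}

/// Everything from the first invalid code unit onward
/// (hypothetical helper, not part of std.encoding).
string invalidTail(string s)
{
    return s[validLength(s) .. $];
}

// Run with: dmd -unittest -main -run <file>.d
unittest
{
    // "\xF0\x80" is a truncated, malformed UTF-8 sequence.
    string broken = "hello \xF0\x80world";

    assert(!isValid(broken));
    assert(validPrefix(broken) == "hello ");
    assert(invalidTail(broken) == "\xF0\x80world");

    // After sanitizing, the whole string validates.
    assert(isValid(sanitize(broken)));
}
```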
- size_t firstSequence(E)(const(E)[] s); - Returns the length of the first encoded sequence.
The input to this function MUST be validly encoded. This is enforced by the function's in-contract.
Standards: Unicode 5.0, ASCII, ISO-8859-1, ISO-8859-2, WINDOWS-1250, WINDOWS-1252
Parameters:
const(E)[] s: the string to be sliced
Examples:
    writeln(firstSequence("\u20AC1000")); // "\u20AC".length
    writeln(firstSequence("hel")); // "h".length
- size_t lastSequence(E)(const(E)[] s); - Returns the length of the last encoded sequence.
The input to this function MUST be validly encoded. This is enforced by the function's in-contract.
Standards: Unicode 5.0, ASCII, ISO-8859-1, ISO-8859-2, WINDOWS-1250, WINDOWS-1252
Parameters:
const(E)[] s: the string to be sliced
Examples:
    writeln(lastSequence("1000\u20AC")); // "\u20AC".length
    writeln(lastSequence("hellö")); // "ö".length
- ptrdiff_t index(E)(const(E)[] s, int n); - Returns the array index at which the (n+1)th code point begins.
The input to this function MUST be validly encoded. This is enforced by the function's in-contract.
Supersedes: This function supersedes std.utf.toUTFindex().
Standards: Unicode 5.0, ASCII, ISO-8859-1, ISO-8859-2, WINDOWS-1250, WINDOWS-1252
Parameters:
const(E)[] s: the string to be counted
int n: the current code point index
Examples:
    writeln(index("\u20AC100", 1)); // 3
    writeln(index("hällo", 2)); // 3
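Because firstSequence, lastSequence and index all return code unit offsets, they can be used directly as slice bounds. The sketch below shows one way to slice a UTF-8 string on code point boundaries. The helper names headSlice, tailSlice and skipCodePoints are hypothetical, and the inputs are assumed to be valid and non-empty, as the in-contracts require.

```d
import std.encoding : firstSequence, index, lastSequence;

/// The first code point of `s`, as a slice of `s` (hypothetical helper).
string headSlice(string s)
{
    return s[0 .. firstSequence(s)];
}

/// The last code point of `s`, as a slice of `s` (hypothetical helper).
string tailSlice(string s)
{
    return s[$ - lastSequence(s) .. $];
}

/// `s` with its first `n` code points removed (hypothetical helper).
string skipCodePoints(string s, int n)
{
    return s[index(s, n) .. $];
}

// Run with: dmd -unittest -main -run <file>.d
unittest
{
    // U+20AC (euro sign) occupies three UTF-8 code units.
    assert(headSlice("\u20AC1000") == "\u20AC");
    assert(tailSlice("1000\u20AC") == "\u20AC");

    // U+00E4 occupies two code units, so the third code point starts at offset 3.
    assert(skipCodePoints("h\u00E4llo", 2) == "llo");
}
```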
- dchar decode(S)(ref S s); - Decodes a single code point.
This function removes one or more code units from the start of a string, and returns the decoded code point which those code units represent. The input to this function MUST be validly encoded. This is enforced by the function's in-contract.
Supersedes: This function supersedes std.utf.decode(). Note, however, that the function codePoints() supersedes it more conveniently.
Standards: Unicode 5.0, ASCII, ISO-8859-1, ISO-8859-2, WINDOWS-1250, WINDOWS-1252
Parameters:
S s: the string whose first code point is to be decoded
- dchar decodeReverse(E)(ref const(E)[] s); - Decodes a single code point from the end of a string.
This function removes one or more code units from the end of a string, and returns the decoded code point which those code units represent. The input to this function MUST be validly encoded. This is enforced by the function's in-contract.
Standards: Unicode 5.0, ASCII, ISO-8859-1, ISO-8859-2, WINDOWS-1250, WINDOWS-1252
Parameters:
const(E)[] s: the string whose last code point is to be decoded
- dchar safeDecode(S)(ref S s); - Decodes a single code point. The input does not have to be valid.
This function removes one or more code units from the start of a string, and returns the decoded code point which those code units represent. This function will accept an invalidly encoded string as input. If an invalid sequence is found at the start of the string, this function will remove it, and return the value INVALID_SEQUENCE.
Standards: Unicode 5.0, ASCII, ISO-8859-1, ISO-8859-2, WINDOWS-1250, WINDOWS-1252
Parameters:
S s: the string whose first code point is to be decoded
- size_t encodedLength(E)(dchar c); - Returns the number of code units required to encode a single code point.
The input to this function MUST be a valid code point. This is enforced by the function's in-contract. The type of the output cannot be deduced. Therefore, it is necessary to explicitly specify the encoding as a template parameter.
Standards: Unicode 5.0, ASCII, ISO-8859-1, ISO-8859-2, WINDOWS-1250, WINDOWS-1252
Parameters:
dchar c: the code point to be encoded
- E[] encode(E)(dchar c); - Encodes a single code point.
This function encodes a single code point into one or more code units. It returns a string containing those code units. The input to this function MUST be a valid code point. This is enforced by the function's in-contract. The type of the output cannot be deduced. Therefore, it is necessary to explicitly specify the encoding as a template parameter.
Supersedes: This function supersedes std.utf.encode(). Note, however, that the function codeUnits() supersedes it more conveniently.
Standards: Unicode 5.0, ASCII, ISO-8859-1, ISO-8859-2, WINDOWS-1250, WINDOWS-1252
Parameters:
dchar c: the code point to be encoded
- size_t encode(E)(dchar c, E[] array); - Encodes a single code point into an array.
This function encodes a single code point into one or more code units. The code units are stored in a user-supplied fixed-size array, which must be passed by reference. The input to this function MUST be a valid code point. This is enforced by the function's in-contract. The type of the output cannot be deduced. Therefore, it is necessary to explicitly specify the encoding as a template parameter.
Supersedes: This function supersedes std.utf.encode(). Note, however, that the function codeUnits() supersedes it more conveniently.
Standards: Unicode 5.0, ASCII, ISO-8859-1, ISO-8859-2, WINDOWS-1250, WINDOWS-1252
Parameters:
dchar c: the code point to be encoded
E[] array: the destination array
Returns: the number of code units written to the array
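The two encode overloads described so far differ mainly in who owns the storage: the first allocates and returns an array, the second writes into a caller-supplied buffer and reports how many code units it used. The sketch below illustrates both, using encodedLength to size-check the buffer. The helper names toUtf8 and toUtf8Into are hypothetical. For the buffer overload the sketch lets the compiler infer the encoding from the buffer's element type, which is an assumption about how the overload resolves; if that does not hold in your Phobos version, pass the type explicitly as encode!(char).

```d
import std.encoding : encode, encodedLength;

/// Encodes `c` as UTF-8 into a freshly allocated array (hypothetical helper).
char[] toUtf8(dchar c)
{
    return encode!(char)(c);
}

/// Encodes `c` as UTF-8 into `buffer` and returns the part that was written
/// (hypothetical helper). The buffer must be large enough for the code point.
char[] toUtf8Into(dchar c, char[] buffer)
{
    assert(buffer.length >= encodedLength!(char)(c), "buffer too small");
    // Assumption: E is inferred as char from the buffer's element type.
    immutable written = encode(c, buffer);
    return buffer[0 .. written];
}

// Run with: dmd -unittest -main -run <file>.d
unittest
{
    // U+20AC needs three UTF-8 code units: 0xE2 0x82 0xAC.
    assert(encodedLength!(char)('\u20AC') == 3);
    assert(toUtf8('\u20AC') == "\xE2\x82\xAC");

    char[4] storage;
    assert(toUtf8Into('\u20AC', storage[]) == "\xE2\x82\xAC");
    assert(toUtf8Into('A', storage[]) == "A");

    // A code point outside the BMP needs a surrogate pair in UTF-16.
    assert(encodedLength!(wchar)('\U0001F600') == 2);
}
```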
- void encode(E)(dchar c, void delegate(E) dg); - Encodes a single code point to a delegate.
This function encodes a single code point into one or more code units. The code units are passed one at a time to the supplied delegate. The input to this function MUST be a valid code point. This is enforced by the function's in-contract. The type of the output cannot be deduced. Therefore, it is necessary to explicitly specify the encoding as a template parameter.
Supersedes: This function supersedes std.utf.encode(). Note, however, that the function codeUnits() supersedes it more conveniently.
Standards: Unicode 5.0, ASCII, ISO-8859-1, ISO-8859-2, WINDOWS-1250, WINDOWS-1252
Parameters:
dchar c: the code point to be encoded
void delegate(E) dg: the delegate to invoke for each code unit
- size_t encode(Tgt, Src, R)(in Src[] s, R range); - Encodes the contents of s in units of type Tgt, writing the result to an output range.
Returns: The number of Tgt elements written.
Parameters:
Tgt: Element type of range.
Src[] s: Input array.
R range: Output range.
- CodePoints!E codePoints(E)(immutable(E)[] s); - Returns a foreachable struct which can bidirectionally iterate over all code points in a string.
The input to this function MUST be validly encoded. This is enforced by the function's in-contract. You can foreach either with or without an index. If an index is specified, it will be initialized at each iteration with the offset into the string at which the code point begins.
Supersedes: This function supersedes std.utf.decode().
Standards: Unicode 5.0, ASCII, ISO-8859-1, ISO-8859-2, WINDOWS-1250, WINDOWS-1252
Parameters:
immutable(E)[] s: the string to be decoded
Example:
    string s = "hello world";
    foreach (c; codePoints(s))
    {
        // do something with c (which will always be a dchar)
    }
Note that, currently, foreach (c; codePoints(s)) is superior to foreach (c; s) in that the latter will fall over on encountering U+FFFF.
Examples:
    string s = "hello";
    string t;
    foreach (c; codePoints(s))
    {
        t ~= cast(char) c;
    }
    writeln(s); // t
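For comparison with codePoints, the lower-level decode, safeDecode and decodeReverse functions documented earlier can be driven by hand. Each call consumes code units from the string it is given, so a loop runs until the string is empty. This is a minimal sketch; the helper names decodeAll, countInvalid and lastDecoded are hypothetical.

```d
import std.encoding : decode, decodeReverse, safeDecode, INVALID_SEQUENCE;

/// Decodes every code point of a validly encoded string (hypothetical helper).
dchar[] decodeAll(string s)
{
    dchar[] result;
    while (s.length != 0)
        result ~= decode(s); // removes the decoded code units from the front of s
    return result;
}

/// Counts the invalid sequences in a possibly malformed string (hypothetical helper).
size_t countInvalid(string s)
{
    size_t n;
    while (s.length != 0)
    {
        // safeDecode never trips a validity contract; it reports bad input instead.
        if (safeDecode(s) == INVALID_SEQUENCE)
            ++n;
    }
    return n;
}

/// Decodes the last code point of a validly encoded, non-empty string
/// (hypothetical helper).
dchar lastDecoded(string s)
{
    const(char)[] t = s; // decodeReverse takes a ref const(E)[]
    return decodeReverse(t);
}

// Run with: dmd -unittest -main -run <file>.d
unittest
{
    assert(decodeAll("h\u00E4llo") == "h\u00E4llo"d);
    assert(countInvalid("a\xFFb") == 1); // 0xFF never occurs in well-formed UTF-8
    assert(countInvalid("hello") == 0);
    assert(lastDecoded("1000\u20AC") == '\u20AC');
}
```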
- CodeUnits!E codeUnits(E)(dchar c); - Returns a foreachable struct which can bidirectionally iterate over all code units in a code point.
The input to this function MUST be a valid code point. This is enforced by the function's in-contract. The type of the output cannot be deduced. Therefore, it is necessary to explicitly specify the encoding type in the template parameter.
Supersedes: This function supersedes std.utf.encode().
Standards: Unicode 5.0, ASCII, ISO-8859-1, ISO-8859-2, WINDOWS-1250, WINDOWS-1252
Parameters:
dchar c: the code point to be encoded
Examples:
    char[] a;
    foreach (c; codeUnits!(char)(cast(dchar)'\u20AC'))
    {
        a ~= c;
    }
    writeln(a.length); // 3
    writeln(a[0]); // 0xE2
    writeln(a[1]); // 0x82
    writeln(a[2]); // 0xAC
- void transcode(Src, Dst)(Src[] s, out Dst[] r); - Convert a string from one encoding to another.
Supersedes: This function supersedes std.utf.toUTF8(), std.utf.toUTF16() and std.utf.toUTF32() (but note that to!() supersedes it more conveniently).
Standards: Unicode 5.0, ASCII, ISO-8859-1, ISO-8859-2, WINDOWS-1250, WINDOWS-1252
Parameters:
Src[] s: Source string. Must be validly encoded. This is enforced by the function's in-contract.
Dst[] r: Destination string
Examples:
    wstring ws;
    // transcode from UTF-8 to UTF-16
    transcode("hello world", ws);
    writeln(ws); // "hello world"w
    Latin1String ls;
    // transcode from UTF-16 to ISO-8859-1
    transcode(ws, ls);
    writeln(ws); // "hello world"
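The example above uses ASCII-only text, where every encoding produces the same code units. The following sketch uses a non-ASCII character to make the change in representation visible: U+00E9 is two code units in UTF-8 but a single byte, 0xE9, in ISO-8859-1. The helper names toLatin1 and fromLatin1 are hypothetical, and the sketch assumes every character in the input is representable in Latin-1.

```d
import std.encoding : Latin1String, transcode;

/// Converts UTF-8 text to ISO-8859-1 (hypothetical helper).
/// Assumes every character of `s` is representable in Latin-1.
Latin1String toLatin1(string s)
{
    Latin1String r;
    transcode(s, r);
    return r;
}

/// Converts ISO-8859-1 text back to UTF-8 (hypothetical helper).
string fromLatin1(Latin1String s)
{
    string r;
    transcode(s, r);
    return r;
}

// Run with: dmd -unittest -main -run <file>.d
unittest
{
    string utf8 = "caf\u00E9";
    assert(utf8.length == 5); // U+00E9 takes two UTF-8 code units

    auto latin = toLatin1(utf8);
    assert(latin.length == 4); // one code unit per character in Latin-1
    assert(latin[3] == 0xE9);

    // Converting back restores the original UTF-8 text.
    assert(fromLatin1(latin) == utf8);
}
```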
- class EncodingException : object.Exception; - The base class for exceptions thrown by this module.
- abstract class EncodingScheme; - Abstract base class of all encoding schemes.
- void register(Klass : EncodingScheme)(); - Registers a subclass of EncodingScheme.
This function allows user-defined subclasses of EncodingScheme to be declared in other modules.
Parameters:
Klass: The subclass of EncodingScheme to register.
Example:
    class Amiga1251 : EncodingScheme
    {
        shared static this()
        {
            EncodingScheme.register!Amiga1251;
        }
    }
- static EncodingScheme create(string encodingName); - Obtains a subclass of EncodingScheme which is capable of encoding and decoding the named encoding scheme.
This function is only aware of EncodingSchemes which have been registered with the register() function.
Example:
    auto scheme = EncodingScheme.create("Amiga-1251");
- abstract const string toString(); - Returns the standard name of the encoding scheme.
- abstract const string[] names(); - Returns an array of all known names for this encoding scheme.
- abstract const bool canEncode(dchar c); - Returns true if the character c can be represented in this encoding scheme.
- abstract const size_t encodedLength(dchar c); - Returns the number of ubytes required to encode this code point.
The input to this function MUST be a valid code point.
Parameters:
dchar c: the code point to be encoded
Returns: the number of ubytes required.
- abstract const size_t encode(dchar c, ubyte[] buffer); - Encodes a single code point into a user-supplied, fixed-size buffer.
This function encodes a single code point into one or more ubytes. The supplied buffer must be code unit aligned. (For example, UTF-16LE or UTF-16BE must be wchar-aligned, UTF-32LE or UTF-32BE must be dchar-aligned, etc.) The input to this function MUST be a valid code point.
Parameters:
dchar c: the code point to be encoded
ubyte[] buffer: the destination array
Returns: the number of ubytes written.
- abstract const dchar decode(ref const(ubyte)[] s); - Decodes a single code point.
This function removes one or more ubytes from the start of an array, and returns the decoded code point which those ubytes represent. The input to this function MUST be validly encoded.
Parameters:
const(ubyte)[] s: the array whose first code point is to be decoded
- abstract const dchar safeDecode(ref const(ubyte)[] s); - Decodes a single code point. The input does not have to be valid.
This function removes one or more ubytes from the start of an array, and returns the decoded code point which those ubytes represent. This function will accept an invalidly encoded array as input. If an invalid sequence is found at the start of the string, this function will remove it, and return the value INVALID_SEQUENCE.
Parameters:
const(ubyte)[] s: the array whose first code point is to be decoded
- abstract const @property immutable(ubyte)[] replacementSequence(); - Returns the sequence of ubytes to be used to represent any character which cannot be represented in the encoding scheme.
Normally this will be a representation of some substitution character, such as U+FFFD or '?'.
- bool isValid(const(ubyte)[] s); - Returns true if the array is encoded correctly.
Parameters:
const(ubyte)[] s: the array to be tested
- size_t validLength()(const(ubyte)[] s); - Returns the length of the longest possible substring, starting from the first element, which is validly encoded.
Parameters:
const(ubyte)[] s: the array to be tested
- immutable(ubyte)[] sanitize()(immutable(ubyte)[] s); - Sanitizes an array by replacing malformed ubyte sequences with valid ubyte sequences. The result is guaranteed to be valid for this encoding scheme.
If the input array is already valid, this function returns the original; otherwise it constructs a new array by replacing all illegal sequences with the encoding scheme's replacement sequence.
Parameters:
immutable(ubyte)[] s: the string to be sanitized
- size_t firstSequence()(const(ubyte)[] s); - Returns the length of the first encoded sequence.
The input to this function MUST be validly encoded. This is enforced by the function's in-contract.
Parameters:
const(ubyte)[] s: the array to be sliced
- size_t count()(const(ubyte)[] s); - Returns the total number of code points encoded in a ubyte array.
The input to this function MUST be validly encoded. This is enforced by the function's in-contract.
Parameters:
const(ubyte)[] s: the string to be counted
- ptrdiff_t index()(const(ubyte)[] s, size_t n); - Returns the array index at which the (n+1)th code point begins.
The input to this function MUST be validly encoded. This is enforced by the function's in-contract.
Parameters:
const(ubyte)[] s: the string to be counted
size_t n: the current code point index
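The EncodingScheme members above work on raw ubyte arrays, which is what makes them usable when the encoding is only known at run time. The sketch below shows a round trip through a scheme looked up by name. The helper names encodeWith and decodeWith are hypothetical, and the sketch sticks to single-byte-unit schemes ("utf-8" and "ISO-8859-1") so that the alignment requirement on the buffer does not come into play.

```d
import std.encoding : EncodingScheme;

/// Encodes `text` with the scheme registered under `schemeName`
/// (hypothetical helper). Every code point is assumed to be encodable.
immutable(ubyte)[] encodeWith(string schemeName, dstring text)
{
    auto scheme = EncodingScheme.create(schemeName);
    ubyte[] result;
    ubyte[8] buffer; // larger than any single code point needs in these schemes
    foreach (dchar c; text)
    {
        immutable n = scheme.encode(c, buffer[]);
        result ~= buffer[0 .. n];
    }
    return result.idup;
}

/// Decodes validly encoded `bytes` with the scheme registered under
/// `schemeName` (hypothetical helper).
dstring decodeWith(string schemeName, const(ubyte)[] bytes)
{
    auto scheme = EncodingScheme.create(schemeName);
    dstring result;
    while (bytes.length != 0)
        result ~= scheme.decode(bytes); // consumes ubytes from the front
    return result;
}

// Run with: dmd -unittest -main -run <file>.d
unittest
{
    dstring text = "h\u00E4llo \u20AC"d;
    assert(decodeWith("utf-8", encodeWith("utf-8", text)) == text);

    // U+20AC is three bytes in UTF-8; U+00E4 is the single byte 0xE4 in Latin-1.
    assert(encodeWith("utf-8", "\u20AC"d).length == 3);
    assert(encodeWith("ISO-8859-1", "\u00E4"d)[0] == 0xE4);
}
```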
- class EncodingSchemeASCII : std.encoding.EncodingScheme; - EncodingScheme to handle ASCII.
This scheme recognises the following names: "ANSI_X3.4-1968", "ANSI_X3.4-1986", "ASCII", "IBM367", "ISO646-US", "ISO_646.irv:1991", "US-ASCII", "cp367", "csASCII", "iso-ir-6", "us"
- class EncodingSchemeLatin1 : std.encoding.EncodingScheme; - EncodingScheme to handle Latin-1.
This scheme recognises the following names: "CP819", "IBM819", "ISO-8859-1", "ISO_8859-1", "ISO_8859-1:1987", "csISOLatin1", "iso-ir-100", "l1", "latin1"
- class EncodingSchemeLatin2 : std.encoding.EncodingScheme; - EncodingScheme to handle Latin-2.
This scheme recognises the following names: "Latin 2", "ISO-8859-2", "ISO_8859-2", "ISO_8859-2:1999", "Windows-28592"
- class EncodingSchemeWindows1250 : std.encoding.EncodingScheme; - EncodingScheme to handle Windows-1250.
This scheme recognises the following names: "windows-1250"
- class EncodingSchemeWindows1252 : std.encoding.EncodingScheme; - EncodingScheme to handle Windows-1252.
This scheme recognises the following names: "windows-1252"
- class EncodingSchemeUtf8 : std.encoding.EncodingScheme; - EncodingScheme to handle UTF-8.
This scheme recognises the following names: "UTF-8"
- class EncodingSchemeUtf16Native : std.encoding.EncodingScheme; - EncodingScheme to handle UTF-16 in native byte order.
This scheme recognises the following names: "UTF-16LE" (little-endian architecture only), "UTF-16BE" (big-endian architecture only)
- class EncodingSchemeUtf32Native : std.encoding.EncodingScheme; - EncodingScheme to handle UTF-32 in native byte order.
This scheme recognises the following names: "UTF-32LE" (little-endian architecture only), "UTF-32BE" (big-endian architecture only)
- enum BOM : int; - Definitions of common Byte Order Marks. The elements of the enum can be used as indices into bomTable to get the matching BOMSeq.
none - no BOM was found
utf32be - [0x00, 0x00, 0xFE, 0xFF]
utf32le - [0xFF, 0xFE, 0x00, 0x00]
utf1 - [0xF7, 0x64, 0x4C]
utfebcdic - [0xDD, 0x73, 0x66, 0x73]
scsu - [0x0E, 0xFE, 0xFF]
bocu1 - [0xFB, 0xEE, 0x28]
gb18030 - [0x84, 0x31, 0x95, 0x33]
utf8 - [0xEF, 0xBB, 0xBF]
utf16be - [0xFE, 0xFF]
utf16le - [0xFF, 0xFE]
- alias BOMSeq = std.typecons.Tuple!(BOM, "schema", ubyte[], "sequence").Tuple; - The type stored inside bomTable.
- immutable Tuple!(BOM, "schema", ubyte[], "sequence")[] bomTable; - Mapping of a byte sequence to Byte Order Mark (BOM).
- immutable(BOMSeq) getBOM(Range)(Range input) if (isForwardRange!Range && is(Unqual!(ElementType!Range) == ubyte)); - Returns a BOMSeq for a given input. If no BOM is present, the BOMSeq for BOM.none is returned. The BOM sequence at the beginning of the range will not be consumed from the passed range. If you pass a reference type range, make sure that save creates a deep copy.
Parameters:
Range input: The sequence to check for the BOM
Returns: the found BOMSeq corresponding to the passed input.
Examples:
    import std.format : format;
    auto ts = dchar(0x0000FEFF) ~ "Hello World"d;
    auto entry = getBOM(cast(ubyte[]) ts);
    version(BigEndian)
    {
        writeln(entry.schema); // BOM.utf32be
    }
    else
    {
        writeln(entry.schema); // BOM.utf32le
    }
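Since getBOM leaves the input untouched, a caller that wants the payload without the mark has to skip it explicitly. The returned BOMSeq carries the matched byte sequence, so its length says how many bytes to drop. A minimal sketch follows; the helper name stripBOM is hypothetical, and it relies on the sequence for BOM.none being empty.

```d
import std.encoding : BOM, getBOM;

/// Returns `data` without a leading byte order mark, if one is present
/// (hypothetical helper, not part of std.encoding).
const(ubyte)[] stripBOM(const(ubyte)[] data)
{
    immutable bom = getBOM(data);
    // Assumption: the BOM.none entry has an empty sequence, so nothing is dropped.
    return data[bom.sequence.length .. $];
}

// Run with: dmd -unittest -main -run <file>.d
unittest
{
    // "\xEF\xBB\xBF" is the UTF-8 byte order mark.
    auto withBom = cast(const(ubyte)[]) "\xEF\xBB\xBFhi";
    auto plain = cast(const(ubyte)[]) "hi";

    assert(getBOM(withBom).schema == BOM.utf8);
    assert(getBOM(plain).schema == BOM.none);

    assert(stripBOM(withBom) == plain);
    assert(stripBOM(plain) == plain);
}
```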
- enum dchar utfBOM; - Constant defining a fully decoded BOM.