5.7.2 Unicode Representations

The procedures in this section implement transformations that convert between the internal representation of Unicode characters and several standard external representations. These external representations are all implemented as sequences of bytes, but they differ in their intended usage.

UTF-8
     Each character is written as a sequence of one to four bytes.
UTF-16
     Each character is written as a sequence of one or two 16-bit integers.
UTF-32
     Each character is written as a single 32-bit integer.

The UTF-16 and UTF-32 representations may be serialized to and from a byte stream in either big-endian or little-endian order. In big-endian order, the most significant byte comes first, followed by the next most significant byte, and so on. In little-endian order, the least significant byte comes first, and so on. All of the UTF-16 and UTF-32 representation procedures are available in both orders, indicated by names containing `utfNN-be' and `utfNN-le', respectively, where NN is 16 or 32. There are also procedures that implement host-endian order, which is either big-endian or little-endian depending on the underlying computer architecture; their names contain plain `utfNN' with no suffix.
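
The difference between the two byte orders can be seen by encoding a single character both ways. The following is a minimal sketch, assuming a release that provides the procedures documented in this section and that byte vectors are ordinary strings of 8-bit characters. The helper names `bytes' and `byte-values' are hypothetical and exist only for this illustration.

```scheme
;; Hypothetical helper: build a byte vector from a list of byte values.
(define (bytes . values)
  (list->string (map integer->char values)))

;; Hypothetical helper: list the byte values stored in a byte vector.
(define (byte-values byte-vector)
  (map char->integer (string->list byte-vector)))

;; U+20AC (EURO SIGN), decoded from its three-byte UTF-8 form.
(define euro (utf8-string->wide-string (bytes #xE2 #x82 #xAC)))

;; Big-endian: the most significant byte (#x20) should come first.
(byte-values (wide-string->utf16-be-string euro))

;; Little-endian: the least significant byte (#xAC) should come first.
(byte-values (wide-string->utf16-le-string euro))

;; Host-endian: matches one of the two results above, depending on
;; the machine running the code.
(byte-values (wide-string->utf16-string euro))
```

Code that exchanges data with other machines would normally pick one of the explicit `-be' or `-le' procedures, since the host-endian result is not portable.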

— procedure: utf8-string->wide-string string [start [end]]
— procedure: utf16-be-string->wide-string string [start [end]]
— procedure: utf16-le-string->wide-string string [start [end]]
— procedure: utf16-string->wide-string string [start [end]]
— procedure: utf32-be-string->wide-string string [start [end]]
— procedure: utf32-le-string->wide-string string [start [end]]
— procedure: utf32-string->wide-string string [start [end]]

Each of these procedures converts a byte vector to a wide string, treating the argument string as a stream of bytes encoded in the corresponding `utfNN' representation. The arguments start and end select the substring of string to convert; they default to zero and the length of string, respectively.
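
As an illustration, the sketch below decodes a short UTF-8 byte vector, first in full and then restricted by start and end. It assumes that byte vectors are ordinary strings of 8-bit characters, and it uses `wide-string-length' and `wide-string-ref' on the assumption that they behave as described in the Wide Strings section. The helper `bytes' is a hypothetical name.

```scheme
;; Hypothetical helper: build a byte vector from a list of byte values.
(define (bytes . values)
  (list->string (map integer->char values)))

;; Two characters in UTF-8: #x68 is "h", and #xC3 #xA9 is U+00E9.
(define utf8-bytes (bytes #x68 #xC3 #xA9))

;; Decode the whole byte vector: three bytes become two characters.
(define decoded (utf8-string->wide-string utf8-bytes))
(wide-string-length decoded)

;; start and end index into the byte vector, so bytes 1 and 2
;; decode to the single character U+00E9.
(define tail (utf8-string->wide-string utf8-bytes 1 3))
(char->integer (wide-string-ref tail 0))
```

Because start and end count bytes rather than characters, they should fall on character boundaries within the encoded data.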

— procedure: utf8-string-length string [start [end]]
— procedure: utf16-be-string-length string [start [end]]
— procedure: utf16-le-string-length string [start [end]]
— procedure: utf16-string-length string [start [end]]
— procedure: utf32-be-string-length string [start [end]]
— procedure: utf32-le-string-length string [start [end]]
— procedure: utf32-string-length string [start [end]]

Each of these procedures counts the number of Unicode characters in a byte vector, treating the argument string as a stream of bytes encoded in the corresponding `utfNN' representation. The arguments start and end select the substring of string to examine; they default to zero and the length of string, respectively.
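
The character count generally differs from the byte count, as the following sketch suggests. It assumes that byte vectors are ordinary strings of 8-bit characters, so that `string-length' reports the number of bytes. The helper `bytes' is a hypothetical name.

```scheme
;; Hypothetical helper: build a byte vector from a list of byte values.
(define (bytes . values)
  (list->string (map integer->char values)))

;; UTF-8 data: "h" (one byte) followed by U+00E9 (two bytes).
(define utf8-bytes (bytes #x68 #xC3 #xA9))

(string-length utf8-bytes)           ; number of bytes
(utf8-string-length utf8-bytes)      ; number of characters
(utf8-string-length utf8-bytes 1 3)  ; characters in bytes 1 and 2 only

;; UTF-16 data in big-endian order: U+1F600 is written as the two
;; 16-bit integers #xD83D and #xDE00, which together form one character.
(define utf16-bytes (bytes #xD8 #x3D #xDE #x00))

(utf16-be-string-length utf16-bytes)
```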

— procedure: wide-string->utf8-string string [start [end]]
— procedure: wide-string->utf16-be-string string [start [end]]
— procedure: wide-string->utf16-le-string string [start [end]]
— procedure: wide-string->utf16-string string [start [end]]
— procedure: wide-string->utf32-be-string string [start [end]]
— procedure: wide-string->utf32-le-string string [start [end]]
— procedure: wide-string->utf32-string string [start [end]]

Each of these procedures converts a wide string to a stream of bytes encoded in the corresponding `utfNN' representation, and returns that stream as a byte vector. The arguments start and end select the substring of the wide string string to convert; they default to zero and the length of string, respectively.
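
The sketch below encodes one character in each representation and compares the sizes of the results, then checks that encoding followed by decoding gives back a string of the same length. It assumes that byte vectors are ordinary strings of 8-bit characters and that `wide-string-length' behaves as described in the Wide Strings section. The helper `bytes' is a hypothetical name.

```scheme
;; Hypothetical helper: build a byte vector from a list of byte values.
(define (bytes . values)
  (list->string (map integer->char values)))

;; A one-character wide string holding U+20AC (EURO SIGN),
;; obtained by decoding its UTF-8 form.
(define euro (utf8-string->wide-string (bytes #xE2 #x82 #xAC)))

;; Sizes in bytes of the three encodings of this character:
;; three bytes, one 16-bit integer, and one 32-bit integer.
(string-length (wide-string->utf8-string euro))
(string-length (wide-string->utf16-be-string euro))
(string-length (wide-string->utf32-be-string euro))

;; Round trip: decoding the encoded bytes should give back
;; a wide string of the original length.
(define round-trip
  (utf32-le-string->wide-string (wide-string->utf32-le-string euro)))
(= (wide-string-length round-trip) (wide-string-length euro))
```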