Author: Patrik Nyblom <pan(at)erlang(dot)org>
Status: Draft
Type: Standards Track
Created: 07-may-2008
Erlang-Version: R12B-4
Post-History: 01-jan-1970
This EEP suggests a standard representation of Unicode characters in Erlang, as well as the basic functionality to deal with them.
As Unicode characters are more widely used, the need for a common representation of Unicode characters in Erlang arises. Up until now, an Erlang programmer writing Unicode programs has had to decide on his or her own representation and has had little or no help from the standard libraries.
Implementing functions in the standard libraries that deal with all possible combinations and variants of Unicode representation in Erlang is considered both extremely time-consuming and confusing to the future user of the standard library.
One common representation, covering both binaries and lists, is therefore desirable; it would make Unicode handling in the standard libraries easier to implement and give more consistent results.
Once the representation is agreed upon, implementation can be done incrementally. This EEP only outlines the most basic functionality the system should provide. Unicode support will by no means be complete when this EEP is implemented, but further implementation will be feasible.
The EEP also suggests library functions and bit syntax to deal with alternative encodings. However, one standard encoding is suggested, which will be what library functions in Erlang are expected to support, while other representations are supported only in terms of conversion.
Erlang traditionally represents text strings as lists of bytes (8-bit entities), where the characters are encoded in ISO-8859-1 (latin1).
As the use of Unicode characters becomes more widespread, the demand for a common view of how to represent Unicode characters in Erlang arises.
Unicode is a character encoding standard where all known, living and historical written languages are represented in one single character set, which of course results in characters demanding more than eight bits each for representation.
Regardless of the representation, the Unicode character set is a super-set of the latin1 character set, while latin1 in its turn is a super-set of the traditional 7-bit US-ASCII character set. Representing Unicode characters in Erlang lists is therefore quite naturally done by allowing characters in lists to take on values higher than 255.
Therefore a Unicode string can, in Erlang, be conveniently stored as a list where each element represents one single Unicode character. The following list:
_ex1:
[1050,1072,1082,1074,
1086,32,1077,32,85,110,105,99,111,100,101,32,63]
- would represent the Bulgarian translation of "What is Unicode ?" (which looks something like "KAKBO e Unicode ?", with only the last part in latin letters). The last part ([32,85,110,105,99,111,100,101,32,63]) is plain latin1, as the string "Unicode ?" is written in latin letters, while the first part contains characters that cannot be represented in a single byte. In essence, the string is encoded in the Unicode encoding scheme UTF-32: one 32-bit entity for each character, which is more than sufficient for one Unicode character per position.
However, the currently most common representation of Unicode characters is UTF-8, in which the characters are stored in one to four 8-bit entities, organized in such a way that plain 7-bit US-ASCII is untouched, while characters from 128 upwards are split over more than one byte. The advantage of this encoding is that e.g. characters having a meaning to the file/operating system are kept intact, and that many strings in western languages do not occupy more space when transformed into Unicode. In such an encoding, the above mentioned Bulgarian string (ex1_) would be represented as the list [208,154,208,176,208,186,208,178,208,190,32,208,181,32,85,110,105,99,111,100,101,32,63], where the first part, containing the Bulgarian script letters, occupies more bytes per character, while the trailing part "Unicode ?" is identical to the plain and more intuitive encoding of one character per list element.
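To make the one-to-four-byte split concrete, here is a minimal sketch of encoding a single code point according to the scheme described above (the function name utf8_bytes is hypothetical, and no validation of surrogates or other forbidden ranges is attempted)::

    utf8_bytes(C) when C < 16#80 ->                  % 7 bits fit in 1 byte
        [C];
    utf8_bytes(C) when C < 16#800 ->                 % up to 11 bits, 2 bytes
        [16#C0 bor (C bsr 6), 16#80 bor (C band 16#3F)];
    utf8_bytes(C) when C < 16#10000 ->               % up to 16 bits, 3 bytes
        [16#E0 bor (C bsr 12),
         16#80 bor ((C bsr 6) band 16#3F),
         16#80 bor (C band 16#3F)];
    utf8_bytes(C) when C =< 16#10FFFF ->             % up to 21 bits, 4 bytes
        [16#F0 bor (C bsr 18),
         16#80 bor ((C bsr 12) band 16#3F),
         16#80 bor ((C bsr 6) band 16#3F),
         16#80 bor (C band 16#3F)].

For example, utf8_bytes(1050) yields [208,154], the first two bytes of the list above.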
In spite of being less intuitive, the UTF-8 encoding is the one most widely spread and supported by operating systems and terminal emulators. UTF-8 is therefore the most convenient way to communicate text to external entities (files, drivers, terminals and so on).
When dealing with lists in Erlang, the advantages of using one list element per character seem to be greater than the advantage of not having to convert a UTF-8 character string before e.g. printing it on a terminal. This is especially true as the current Erlang implementation allows all current Unicode characters to occupy the same memory space as a latin1 character would (bearing in mind that each character is represented as an integer, and that a list element can contain integers up to 16#7ffffff on 32-bit implementations, which is far larger than the largest current Unicode character, 16#10ffff). A further advantage is that routines like io:format can easily cope with latin1 characters and Unicode characters alike, as the eight-bit characters of Unicode happen to correspond exactly to the latin1 character set. It would seem that lists have a very natural way of dealing with Unicode characters.
Binaries, on the other hand, would suffer greatly from a scheme where every character is encoded with a fixed width capable of representing numbers up to 16#10ffff. The standardized way of doing this would be what is commonly referred to as UTF-32, i.e. one 32-bit word for each character. Even a UTF-16 representation would be guaranteed to double the memory requirements for all text strings encoded in binaries, while UTF-8 would for the most common cases be the most space-saving representation.
Binaries are often used to represent data to be sent to external programs, which also speaks in favor of the UTF-8 representation.
There are however problems with the UTF-8 representation, most obviously the fact that characters occupy a variable number of positions (bytes) in the binary, so that traversal is somewhat more tedious. An extension to the bit syntax where UTF-8 characters can be conveniently matched in the head of a binary would ease the situation, but as of today, no such primitives are present. UTF-8 encoded characters are also only backward compatible with 7-bit US-ASCII, and there are only probabilistic approaches to determining whether a sequence of bytes represents Unicode characters encoded as UTF-8 or plain latin1. A library function in Erlang therefore needs to be informed about the way characters are encoded in a binary to be able to interpret them correctly. A latin1 character above 128 will be displayed incorrectly if written to a terminal set to display UTF-8 encoded Unicode, and vice versa. As a common example, io:format("~s~n",[MyBinaryString]) would need to be informed whether the string is encoded in UTF-8 or latin1 to display it correctly on a terminal. The formatting functions actually present a whole set of challenges regarding Unicode characters. New formatting controls will be needed to inform the formatting functions in the io and io_lib modules that strings are in Unicode or that input is in UTF-8. This is however solvable, as discussed below.
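As an illustration of the tedium, a sketch of decoding one leading UTF-8 character using only today's bit syntax might look as follows (the function name decode_one is hypothetical, and overlong forms are not rejected)::

    decode_one(<<0:1, C:7, Rest/binary>>) ->
        {C, Rest};
    decode_one(<<2#110:3, A:5, 2#10:2, B:6, Rest/binary>>) ->
        {(A bsl 6) bor B, Rest};
    decode_one(<<2#1110:4, A:4, 2#10:2, B:6, 2#10:2, C:6, Rest/binary>>) ->
        {(A bsl 12) bor (B bsl 6) bor C, Rest};
    decode_one(<<2#11110:5, A:3, 2#10:2, B:6, 2#10:2, C:6,
                 2#10:2, D:6, Rest/binary>>) ->
        {(A bsl 18) bor (B bsl 12) bor (C bsl 6) bor D, Rest}.

With the bit syntax extension suggested below, all of this collapses into a single match against <<C/utf8, Rest/binary>>.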
My conclusion so far is that, as binaries are often used to save space and are commonly utilized when communicating with external entities, the UTF-8 advantages seem to outweigh the disadvantages in the binary case. It therefore seems sensible to commonly encode Unicode characters in binaries as UTF-8. Of course any representation is possible, but UTF-8 would be the most common case and can therefore be regarded as the Erlang standard representation.
To further complicate things, Erlang has the concept of iolists (or iodata). An iolist is any (or almost any) combination of integers and binaries representing a sequence of bytes, like e.g. [[85],110,[105,[99]],111,<<100,101>>] as a representation of the string "Unicode". When sending data to drivers and in many BIFs this rather convenient representation is accepted (convenient when constructing, less convenient when traversing).
When dealing with Unicode strings, a similar abstraction would be desirable. With the conventions suggested above, that would mean that a Unicode character string could be a list with any combination of integers ranging from 0 to 16#10ffff and binaries with Unicode characters encoded as UTF-8. Converting such data to a plain list or a plain UTF-8 binary would be easily done as long as one knows how the characters are encoded to begin with. It would however not necessarily be an iolist. Furthermore, conversion functions need to be aware of the original intention of the list to behave correctly. If one wants to convert an iolist containing latin1 characters in both the list part and the binary part to UTF-8, the list part cannot be misinterpreted, as latin1 and Unicode coincide for all latin1 characters; but the binary part can, as latin1 characters above 127 are encoded in two bytes if the binary contains UTF-8 encoded characters, but in only one byte when latin1 encoding is used. The same of course holds for other encodings: if a binary encoded in UTF-32 were converted to UTF-8, the process would also differ from the process of converting latin1 characters.
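As a small sketch of the ambiguity, using the conversion functions proposed below, the very same binary yields different characters depending on the assumed encoding::

    1> unicode:characters_to_list(<<195,165>>, unicode).
    [229]              % one character ("å"), UTF-8 encoded in two bytes
    2> unicode:characters_to_list(<<195,165>>, latin1).
    [195,165]          % two latin1 characters ("Ã¥")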
If we stick with the idea of representing Unicode as one character per list element in lists and as UTF-8 in binaries, we could have the following definitions: a latin1 list is a list of integers, each in the range 0..255, while a mixed latin1 list may also, at any depth, contain binaries with latin1-encoded bytes (i.e. ordinary iodata()). Correspondingly, a Unicode list is a list of integers, each a Unicode code point in the range 0..16#10ffff, while a mixed Unicode list may also, at any depth, contain binaries with UTF-8 encoded characters.
Conversion functions between latin1 lists and latin1 binaries as well as from mixed latin1 lists to latin1 binaries are already present in the system as list_to_binary, binary_to_list, and iolist_to_binary.
Conversion between Unicode lists and Unicode binaries, as well as from mixed Unicode lists, could in a similar way be provided by functions like:
unicode:list_to_utf8(UM) -> Bin
Where UM is a mixed Unicode list and the result is a UTF-8 binary, and:
unicode:utf8_to_list(Bin) -> UL
Where Bin is a binary consisting of Unicode characters encoded as UTF-8 and UL is a plain list of Unicode characters.
To allow for conversion to and from latin1 the functions:
unicode:latin1_list_to_utf8(LM) -> Bin
and:
unicode:latin1_list_to_unicode_list(LM) -> UL
would do the same job. Actually, latin1_list_to_unicode_list is not strictly necessary in this context, as it is more of an iolist function, but it should be present for completeness.
The fact that lists of integers representing latin1 characters are a subset of the lists containing Unicode characters might however be more confusing than useful to exploit when converting from mixed lists to UTF-8 coded binaries. I think a good approach would be to differentiate the functions dealing with latin1 from those dealing with Unicode, so that where the binaries are expected to contain latin1 bytes, the mixed lists are expected to contain only integers 0..255 as well. For functions like io:format, the same thing should hold, i.e. ~s means latin1 mixed lists and ~ts means Unicode mixed lists (with binaries in UTF-8). Passing a list with an integer > 255 to ~s would be an error with this approach, just like passing the same thing to latin1_list_to_utf8/1. See below for more discussion about the io system.
The unicode_list_to_utf8/1 and latin1_list_to_utf8/1 functions can be combined into the single function characters_to_binary/2, like this:
unicode:characters_to_binary(ML,InEncoding) -> binary()
ML := A mixed Unicode list or a mixed latin1 list
InEncoding := {latin1 | unicode}
The word "characters" is used to denote a possibly complex representation of characters in the encoding concerned, like a short word for "a possibly mixed and deep list of characters and/or binaries in either latin1 representation or Unicode".
Giving latin1 as the encoding means that all of ML is to be interpreted as latin1 characters, implying that integers > 255 in the list are an error. Giving unicode as the encoding means that all integers 0..16#10ffff are accepted and that the binaries are expected to already be UTF-8 coded.
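A hedged example of the intended behavior (return values as proposed)::

    1> unicode:characters_to_binary([1050,1072,1082,1074,1086], unicode).
    <<208,154,208,176,208,186,208,178,208,190>>
    2> unicode:characters_to_binary([229, <<"kr">>], latin1).
    <<195,165,107,114>>

In the first call the integers are taken as Unicode code points; in the second, both the integer and the binary are interpreted as latin1 and re-encoded as UTF-8.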
In the same way, conversion to lists of Unicode characters could be done with a function::
unicode:characters_to_list(ML, InEncoding) -> list()
ML := A mixed Unicode list or a mixed latin1 list
InEncoding := {latin1 | unicode}
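Again a sketch of the intended behavior, where the binary parts are decoded according to InEncoding while integers are taken as code points::

    3> unicode:characters_to_list([<<208,154>>, 1072, "kvo?"], unicode).
    [1050,1072,107,118,111,63]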
I think the approach of two simple conversion functions, characters_to_binary/2 and characters_to_list/2, is attractive, despite the fact that certain combinations of in-data would be somewhat harder to convert (e.g. combinations of Unicode characters > 255 in a list with binaries in latin1). Extending the bit syntax to cope with UTF-8 would make it easy to write special conversion functions to handle those rare situations where the above mentioned functions cannot do the job.
To accommodate other encodings, the characters_to_binary functionality could be extended to handle those as well. A more general functionality could be provided with the following functions (preferably placed in their own module, the module name 'unicode' being a good candidate):
characters_to_binary(ML) -> binary() | {error, Encoded, Rest} | {incomplete, Encoded, Rest}
Same as characters_to_binary(ML, unicode, unicode).
characters_to_binary(ML, InEncoding) -> binary() | {error, Encoded, Rest} | {incomplete, Encoded, Rest}
Same as characters_to_binary(ML, InEncoding, unicode).
characters_to_binary(ML, InEncoding, OutEncoding) -> binary() | {error, Encoded, Rest} | {incomplete, Encoded, Rest}
Types:
ML := A mixed list of characters (integers) and binaries in InEncoding
InEncoding := {latin1 | unicode | utf8 | utf16 | utf32}
OutEncoding := {latin1 | unicode | utf8 | utf16 | utf32}
The option 'unicode' is an alias for utf8, as this is the preferred encoding for Unicode characters in binaries. Error tuples are returned when the data cannot be encoded/decoded due to errors in the in-data, and incomplete tuples when the in-data is possibly correct but truncated.
characters_to_list(ML) -> list() | {error, Encoded, Rest} | {incomplete, Encoded, Rest}
Same as characters_to_list(ML, unicode).
characters_to_list(ML, InEncoding) -> list() | {error, Encoded, Rest} | {incomplete, Encoded, Rest}
Types:
ML := A mixed list of characters (integers) and binaries in InEncoding
InEncoding := {latin1 | unicode | utf8 | utf16 | utf32}
Here also the option 'unicode' denotes the default Erlang encoding of utf8 in binaries and is therefore an alias for utf8. Error and incomplete tuples are returned in the same way as for characters_to_binary.
Note that as the datatypes returned upon success are well defined, guard tests exist (is_list/1 and is_binary/1), which is why I suggest not returning the clunky {ok, Data} tuples even though the error and incomplete tuples can be returned. This makes the functions simpler to use when the encoding is known to be correct, while return values can still be checked easily.
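A sketch of the proposed return conventions, with byte values chosen for illustration: 16#FF can never occur in UTF-8, while 16#D0 is a valid first byte of a two-byte sequence that here lacks its continuation byte::

    1> unicode:characters_to_binary(<<"abc",16#FF>>, unicode).
    {error,<<"abc">>,<<255>>}
    2> unicode:characters_to_binary(<<"abc",16#D0>>, unicode).
    {incomplete,<<"abc">>,<<208>>}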
Using Erlang bit syntax on binaries containing Unicode characters in UTF-8 could be facilitated by a new type. The type name utf8 would be preferable to utf-8, as dashes ("-") have special meaning in bit syntax, separating type, signedness, endianness and units.
The utf8 type in bit syntax matching would convert a UTF-8 coded character in the binary to an integer regardless of how many bytes it occupies, leaving the trailing part of the binary to be matched against the rest of the bit syntax matching expression.
When constructing binaries, an integer converted to UTF-8 could consequently occupy between one and four bytes in the resulting binary.
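With such a type, traversing a UTF-8 binary becomes trivial; a sketch::

    codepoints(<<C/utf8, Rest/binary>>) -> [C | codepoints(Rest)];
    codepoints(<<>>)                    -> [].

Compare this to the decode_one sketch earlier, which had to spell out the byte structure by hand.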
As bit syntax is often used to interpret data from various external sources, it would be useful to have corresponding utf16 and utf32 types as well. While UTF-8, UTF-16 and UTF-32 are easily interpreted with the current bit syntax implementation, the suggested specific types would be convenient for the programmer. Also, Unicode imposes restrictions in terms of range and has some forbidden ranges which are best handled using a built-in bit syntax type.
The utf16 and utf32 types need to have an endianness option, as UTF-16 and UTF-32 can be stored as big- or little-endian entities.
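For example (a sketch; the four bytes are the UTF-16 big-endian encoding of the first two characters of ex1_)::

    <<First/utf16-big, Second/utf16-big, Rest/binary>> = <<4,26,4,48>>.
    %% First = 1050, Second = 1072, Rest = <<>>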
Given a default Unicode character representation in Erlang, let's dig deeper into the formatting functions. I suggest the concept of formatting control sequence modifiers, an extra character between the "~" and the control character, denoting Unicode input/output. The letter "t" (for translate) is not used in any formatting functions today, making it a good candidate. The meaning of the modifier should be such that e.g. the formatting control "~ts" means a string in Unicode, while "~s" means a string in iso-latin-1. The reason for not simply introducing a new single control character is that the suggested modifier can be applied to various control characters, like e.g. "p" or even "w", while a new single control character for Unicode strings would only be a replacement for the current "s" control character.
Although the io-protocol in Erlang did not from the beginning impose any limit on what characters could be transferred between a client and an io-server, demands for better performance from the io-system in Erlang have made later implementations use binaries for communication, which in practice has made the io-protocol contain bytes, not general characters.
Furthermore, the fact that the io-system currently works with characters that can be represented as bytes has been exploited in numerous applications, so that output from io-functions (i.e. io_lib:format) has been sent directly to entities only accepting byte input (like sockets), and io-servers have been implemented assuming only character ranges of 0..255. Of course this can be changed, but such a change might mean lower performance from the io-system as well as large changes to code already in production (a.k.a. "customer code").
The io-system in Erlang is currently built around the assumption that data is always a stream of bytes. Although this was not the original intention, this is how it is used today. It means that a latin1 string can be sent to a terminal or a file in much the same way; there will never be any conversion needed. This might not always hold for terminals, but in the case of terminals there is always one single conversion needed, namely that from the byte-stream to whatever the terminal likes. A disk-file is a stream of bytes just as a terminal is, at least as far as the Erlang io-system is concerned. Furthermore, the io_lib formatting functions always return (possibly) deep lists of integers, each representing one character, making it hard to differentiate between different encodings. The result is then sent as is by functions like io:format to the io-server, where it is finally put on the disk. The servers also accept binaries, but they are never produced by io_lib:format.
When Erlang starts supporting Unicode characters, the world changes a little. A file might contain text in UTF-8 or in iso-latin-1, and there is no telling from the list produced by e.g. io_lib:format what the user originally intended.
Suggested solution
..................
To make a solution that as far as possible does not break current code and also keeps (or reverts to) the original intention of the io-system protocol, I suggest a scheme where the formatting functions that return lists, keep to the current behavior as far as possible.
So the io_lib:format function returns a (possibly deep) list of integers 0..255 (latin1, which can be viewed as a subset of Unicode) if used without translation modifiers. If the translation modifiers are used, it will however return a possibly deep list of integers in the complete Unicode range. Going back to the Bulgarian string (ex1_), let's look at the following::
1> UniString = [1050,1072,1082,1074,
1086,32,1077,32,85,110,105,99,111,100,101,32,63].
2> io_lib:format("~s",[UniString]).
- here the Unicode string violates the mixed latin1 list property and a badarg exception will be raised. This behavior should be retained. On the other hand::
3> io_lib:format("~ts",[UniString]).
- would return a (deep) list with the Unicode string as a list of integers::
[[1050,1072,1082,1074,1086,32,1077,32,85,110,105,99,111,100,
101,32,63]]
The downside of introducing integers > 255 in the result list is of course that the return value of the function is no longer valid iodata(), but on the other hand, the following code::
lists:flatten(io_lib:format("~ts",[UniString]))
will give a result similar to that of a non-Unicode version.
As the format modifier "t" is new, the possibility to get integers > 255 in the resulting deep list will not break old code. To get iodata() in UTF-8, one could simply do::
unicode:characters_to_binary(io_lib:format("~ts",[UniString]),
unicode, unicode)
As before, directly formatting (with ~s) a list of characters > 255 would be an error, but with the "t" modifier it would work.
When it comes to range checking and backward compatibility::
6> io:format(File,"~s",[UniString]).
- would as before throw the badarg exception, while::
7> io:format(File,"~ts",[UniString]).
- would be accepted.
The corresponding behavior of io:fread/2,3 would be to expect Unicode data in this call::
11> io:fread(File,'',"~ts").
- but expect latin1 in this::
12> io:fread(File,'',"~s").
The actual io-protocol, on the other hand, should deal only with Unicode, meaning that when data is converted to binaries for sending, all data should be translated into UTF-8. When lists of integers are used in communication, the latin1 and Unicode representations are the same, so no conversion or restrictions apply. Recall that the io-system is built so that characters should have one interpretation regardless of the io-server. The only possible such encoding is a Unicode one.
As we are communicating heavily between processes (the client and server processes in the io-system), converting the data to Unicode binaries (UTF-8) is the most efficient strategy for larger amounts of data.
Generally, writing to an io-server using the file-module will only be possible with byte-oriented data, while using the io-module will work on Unicode characters. Calling the function file:write/2 will send the bytes to the file as is, as files are byte-oriented, but when writing to a file using the io-module, Unicode characters are expected and handled.
The io-protocol will make conversions of bytes into Unicode when sending to io-servers, but if the file is byte-oriented, the conversion back will make this transparent to the user. All bytes are representable in UTF-8 and can be converted back and forth without hassle.
The incompatible change will have to be to the put_chars function in io. It should only allow Unicode data, not iodata() as it is documented to do now. The big change is that any binaries provided to the function need to be in UTF-8. However, most usage of this function is restricted to lists, so this incompatible change is not expected to cause trouble for users.
To handle possible Unicode text data in a file, one should be able to provide encoding parameters when opening the file. A file should by default be opened for byte (or latin1) encoding, while the option to open it for e.g. utf8 translation should be available.
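A sketch of how opening and reading such a file might look (the exact shape of the encoding option is open for discussion, see also Example 2 below)::

    {ok, F} = file:open("text.txt", [read, {encoding, utf8}]),
    Line = io:get_line(F, '').   % a list of Unicode code points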
Let's look at some examples:
Example 1 - common byte-oriented writing
........................................
A file is opened as usual with file:open. We then want to write bytes to it:
Using file:write with iodata() (bytes), the data is converted into UTF-8 by the io-protocol, but the io-server will convert it back to latin1 before actually putting the bytes on file. For better performance, the file could be opened in raw mode, avoiding all conversion.
Using file:write with data already converted to UTF-8 by the user, the io-protocol will embed this in yet another layer of UTF-8 encoding, the file-server will unpack it, and we will end up with the UTF-8 bytes written to the file as expected.
Using io:put_chars, the io-server will return an error if any of the Unicode characters sent cannot be represented in one byte. Characters representable in latin1 will however be written nicely, even though they might be encoded as UTF-8 in binaries sent to io:put_chars. As long as the io_lib:format function is used without the translation modifier, everything will be valid latin1 and all return values will be lists, so the result is both valid Unicode *and* possible to write on a default file. Old code will function as before, except when feeding io:put_chars with latin1 binaries, in which case the call should be replaced with a file:write call.
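A condensed sketch of the behavior described in this example (file name hypothetical)::

    {ok, F} = file:open("bytes.txt", [write]),   % default byte/latin1 file
    ok = file:write(F, [1,2,3,255]),             % bytes pass through unchanged
    ok = io:put_chars(F, "abc"),                 % latin1-representable: ok
    %% io:put_chars(F, [1050]) would return an error: not one byte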
Example 2 - Unicode-oriented writing
....................................
A file is opened with a parameter saying that Unicode data should be written in a defined encoding, in this case UTF-16/bigendian, to avoid mix-ups with the native UTF-8 encoding. We open the file with file:open(Name,[write,{encoding,utf16,bigendian}]).
Using file:write with iodata(), the io-protocol will convert it into the default Unicode representation (UTF-8) and send the data to the io-server, which will in turn convert the data to UTF-16 and put it on the file. The file is to be regarded as a text file, and all iodata() sent to it will be regarded as text.
If the data is already in a Unicode representation (say UTF-8), it should not be written to this type of file using file:write; io:put_chars is expected to be used instead (which is not a problem, as Unicode data should not exist in old code, and this is only an issue when the file is opened for translation).
If the data is in the Erlang default Unicode format, it can be written to the file using io:put_chars. This works for all types of lists with integers and for binaries in UTF-8; for other representations (most notably latin1 in binaries) the data should be converted using unicode:characters_to_XXX(Data, latin1) prior to sending. For latin1 mixed lists (iodata()), file:write can also be used directly.
To sum up this case: Unicode strings (including latin1 lists) are written to a converting file using io:put_chars, but pure iodata() can also be implicitly converted to the encoding by using file:write.
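A sketch of this case, with the option written as above::

    {ok, F} = file:open("text.txt", [write, {encoding, utf16, bigendian}]),
    ok = io:put_chars(F, [1050,1072,1082,1074,1086]),  % stored as UTF-16
    ok = file:write(F, "abc").                         % iodata, also converted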
Example 3 - raw writing
.......................
A file opened for raw access will only handle bytes; it cannot be used together with io:put_chars.
Data formatted with io_lib:format can still be written to a raw file using file:write. The data will end up being written as is. If the translation modifier is consistently used when formatting, the file will get the native UTF-8 encoding; if no translation modifiers are used, the file will have latin1 encoding (each character in the list returned from io_lib:format will be representable as a latin1 byte). If data is generated in different ways, the conversion functions will have to be used.
Data written with file:write will be put on the file directly, no conversion to and from Unicode representation will happen.
Example 4 - byte-oriented reading
.................................
When a file is opened for reading, much the same things apply as for writing.
file:read on any file will expect the io-protocol to deliver data as Unicode. Each byte will be converted to Unicode by the io-server and turned back into a byte by file:read.
If the file actually contains Unicode characters, they will be byte-wise converted to Unicode and then back, giving file:read the original encoding. If read as (or converted to) binaries, they can then easily be converted back to the Erlang default representation by means of the conversion routines.
If the file is read with io:get_chars, all characters will be returned in a list as expected. All characters will be latin1, but that is a subset of Unicode, and there will be no difference from reading a translating file. If the file however contains Unicode-converted characters and is read in this way, the return value from io:get_chars will be hard to interpret, but that is to be expected. If such functionality is desired, the list can be converted to a binary with list_to_binary and then explored as a Unicode entity in the encoding the file actually has.
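In code, the byte-oriented reading described above might look like this sketch::

    {ok, F} = file:open("bytes.txt", [read]),   % default byte/latin1 file
    {ok, Bytes} = file:read(F, 4),              % bytes in, bytes out
    Chars = io:get_chars(F, '', 4).             % a list of latin1 characters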
Example 5 - Unicode file reading
................................
As when writing, reading Unicode converting files is best done with the io-module. Let's once again assume UTF-16 on the file.
When reading using file:read, the UTF-16 data will be converted into the Unicode representation native to Erlang and sent to the client, which will translate the data back to bytes in the same way as bytes were translated to Unicode for the protocol when writing. If everything is representable as bytes, the function will succeed, but if any Unicode character larger than 255 is present, it will fail with a decoding error.
Unicode data above code point 255 cannot be retrieved by use of the file-module. The io-module should be used instead.
io:get_chars and io:get_line will work on the Unicode data provided by the io-protocol. Everything returned will be Unicode lists, as expected. The fread function will return lists with integers > 255 only when the translation modifier is supplied.
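A sketch of reading the UTF-16 file from Example 2::

    {ok, F} = file:open("text.txt", [read, {encoding, utf16, bigendian}]),
    Line = io:get_line(F, '').   % a list of Unicode code points
    %% file:read/2 on this file only succeeds while every character fits in a byte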
Example 6 - raw reading
.......................
As with writing, only the file-module can be used, and only byte-oriented data is read. If the data on the file is encoded, it keeps its encoding when read and written through raw files.
Conclusions from the examples
.............................
With this solution, the file module is consistent with latin1 io-servers (a.k.a. common files) and raw files. A new file type, a translating file, is added so that the io-module can get implicit conversion of its Unicode data (another example of such an io-server with implicit conversion would of course be the terminal). Interface-wise, common files behave as before, and we only get added functionality.
The downsides are the subtly changed behavior of io:put_chars and the performance impact of the conversion to and from Unicode representations when using the file module on non-raw files with default (latin1/byte) encoding. The latter may be possible to address by extending the io-protocol to tag whole chunks of data as bytes (latin1) or Unicode, but using raw files for writing large amounts of data is often the better solution in those cases.
I suggest the convention of letting the Unicode representation in lists be one character per element, in binaries UTF-8 and in mixed Unicode entities a combination of those.
I also suggest a module 'unicode', containing functions for converting between representations of Unicode. The default format for all functions should be utf8 in binaries to point out this as the preferred internal representation of Unicode characters in binaries.
The two main conversion functions should be characters_to_binary/3 and characters_to_list/2, as described above.
I suggest an extension to the bit syntax, allowing matching and construction in UTF-8 coding, e.g::
<<Ch/utf8,_/binary>> = BinString
as well as::
MyBin = <<Ch/utf8,More/binary>>
Optionally UTF-16 could be supported in a similar way for binaries, e.g::
<<Ch/utf16-little,_/binary>> = BinString
UTF-32 will need to be supported in a similar way as UTF-16, both for completeness and for the range-checking that will be involved when converting Unicode characters.
I finally suggest the "t" modifier to the control sequences in the formatting functions, which expects mixed lists of integers 0..16#10ffff and binaries with UTF-8 coded Unicode characters. The functions in io and io_lib will retain their current functionality for code not using the translation modifier, but will return Unicode characters when ordered to.
The fread function should in the same way accept Unicode data only when the "t" modifier is used.
The io-protocol needs to be changed to always handle Unicode characters. Options given when opening a file will allow for implicit conversion of text files.
This document has been placed in the public domain.