Author: Patrik Nyblom <pan(at)erlang(dot)org>,
Fredrik Svahn <Fredrik(dot)Svahn(at)gmail>
Status: Draft
Type: Standards Track
Created: 29-Sep-2010
Erlang-Version: R14B
Post-History:
Replaces: 9
This EEP contains developed suggestions regarding the module binary_string first suggested in EEP 9. The module name is now however changed to bstring.
EEP 9 suggests several modules and is partially superseded by later EEPs (namely EEP 11 and EEP 31), while still containing valuable suggestions not yet implemented. The last remaining module suggested in EEP 9 will therefore appear in this separate EEP. This is done in agreement with the original author of EEP 9.
The module bstring is suggested to contain functions for convenient manipulation of textual data stored in binaries, i.e. binary strings. It somewhat resembles the string module (which is list oriented), but is not to be viewed simply as a string module for binaries.
The module suggested handles binary character encoding in both the standard character encodings of Erlang, namely ISO-Latin-1 and UTF-8.
Text strings are traditionally represented as lists of integers in Erlang. While this is convenient and more or less built into the syntax of the language (i.e. "ABC" is syntactic sugar for [$A,$B,$C]), a more compact representation is often desired. Also, in some circumstances binaries can be more efficient to manipulate in terms of algorithm complexity than lists are (especially in the fixed character width case of ISO-Latin-1).
More modules have been added to the standard libraries lately to aid the usage of binaries for text strings, both as representing ISO-Latin-1 characters and Unicode strings encoded in UTF-8. Most notably the re library, but also the unicode module, are fairly new additions to stdlib which will make life easier for the programmer when it comes to manipulating binary encoded strings. Also a module for fast searching and replacing in byte oriented binaries is present (the module binary), but no traditional string manipulation module is yet in the libraries. To ease use of binary encoded strings, such a module is needed.
The module string for text oriented operations on lists has been present in the standard libraries for so long that most programmers don't remember a time when it wasn't there. It is said to originally be a merge of two different string modules, written and designed by two different programmers with possibly slightly different goals and definitely slightly different views on function naming. While sometimes criticized for duplicated functionality and inconsistent function naming, among other things, the module has remained useful throughout the entire lifespan of Erlang/OTP. The string representation used has also withstood the evolution of Unicode.
It is worth noting that the only functions in the string module that actually are language or region dependent are later additions to the module. Those functions (like to_upper, to_lower, to_integer and to_float), or their binary equivalents, are not part of the module interface I suggest for bstring, for the simple reason that they need language support not yet present in Erlang. A future EEP might suggest such language support (i.e. some kind of "locale" support), but that is future work not covered by this EEP.
So, however criticized, the string module is very useful for manipulating lists, and the same functionality for binary strings is desirable. While a lot of the functionality will be similar, there are some major issues to consider when implementing a module for manipulating strings encoded in binaries:
Unicode - Binaries can have different encodings. A character encoded as UTF-8 might take more than one (up to four) byte positions, and even the same character can have different encodings in ISO-Latin-1 and UTF-8 (all codepoints from 128 to 255). The functions need to be informed of the character encoding explicitly, as the encoding information is not present in the binaries.
Mixed character encodings - As characters can be encoded in different ways, two strings in the same program could have different encodings. Supplying the functions with non-homogeneous string encoding data should be consistently solved throughout the module, as should the selection of returned encoding where applicable.
Default character encoding - As functions will take extra arguments to specify encoding, a consistent default might be useful. Choosing the default is not entirely simple, as the tradition states ISO-Latin-1, while the future suggests UTF-8.
Languages - Erlang has no notion of "Locale" or preferred number format. A general string module can assume neither a specific notion of uppercase or lowercase letters, nor a specific number encoding format (especially true for floating point numbers).
Word separators - The space character is certainly not the only word separator for textual data (in any language). The notion of words separated by spaces imposes a restriction on the relevant languages.
Left to right or right to left - Notions like left or right to denote the beginning or end of a string are certainly not language independent. While strings in a language have a beginning and an end, that beginning and end may be placed to the left, the right or even at the top, bottom or center of the graphical representation. A string manipulation module should not use naming implying a left-to-right script, or any other type of script.
Naming and duplicated functionality - The original string module has been accused of having somewhat inconsistent naming and functionality duplicated. In fact the only duplicated functions are substr and sub_string. Some cleanup of the interface might be needed.
Byte oriented versus character oriented return values - When dealing with Unicode data, a character may take more than one byte, so that, for example, counting the number of characters in a string tells you very little about the actual size of the string in bytes. Furthermore, later processing of a binary might require byte-oriented manipulation of a string rather than character oriented (i.e. you want to manipulate the string using the binary module or with bit-syntax), while characters are actually what constitutes a string, not bytes. You would want both.
New or replaced functionality - New functionality has been suggested from several sources, most notably EEP 9. For example, the function split suggested in EEP 9 is very similar to tokens; are both needed anyway?
I'll address the different issues below.
The interface has to support both ISO-Latin-1 and UTF-8. The unicode module supports even more encodings, but Erlang/OTP uses UTF-8 for all "internal" interfaces and UTF-8 is the expected encoding of a binary Unicode string. Even though UTF-8 is compatible with ISO-Latin-1 in the 7-bit ASCII range, characters with codepoints between 128 and 255 are encoded differently in the "plain" ISO-Latin-1 encoding and in UTF-8. This means that all functions in the bstring module need to have the actual encoding as one or more extra parameters.
One could invent a more abstract binary string format where the data is for example represented as a tuple with the string and the encoding packed together. However no other module supports such a string construct and I don't think that would really add anything, neither functionality nor readability. Consider code like:
bstring:tokens(Bin,latin1,[$ ,$\n])
compared to:
bstring:tokens({Bin,latin1}, [$ ,$\n]).
or even:
bstring:tokens(#bstring{data = Bin, encoding = latin1}, [$ ,$\n]).
In many cases the extra information needs to be added in connection to the call, making the code no more readable or simple to write than with the separate extra argument. Consider if we had a default value for encoding. The code:
f(Data) ->
    bstring:tokens(Data,[$ ,$\n]).
would not in any way indicate if Data was supposed to be a binary with the default encoding or some kind of complex data structure indicating both the actual string and its encoding.
I think the extra argument for the encoding is straightforward and simple, and it makes programming easier when using the binary string in other modules as well (i.e. re, binary, file etc). I think we should simply not have a special string datatype for this module; character encoding should be supplied as a separate argument.
To ease transition between character encodings, I think the interface should accept different encodings for both different parameters and the return value. This makes it possible to convert on the fly and for the functions to decide on the most efficient character conversion path for the supplied arguments and the return value.
The downside of this approach is that some functions will take a lot of parameters telling different character encodings, for example a string concatenation routine could look like:
concat(BString1, Encoding1, BString2, Encoding2, Encoding3) -> BString3
being called like:
US = bstring:concat(SA,latin1, SB, latin1, unicode),
which might look a little awkward to write. On the other hand, conversion is made on the fly and you will not need to explicitly call the unicode module to convert the result.
I think implicit conversion is so useful that it is worth the extra arguments. For example a concat function would be more or less useless without it, as the bit syntax is much easier to use when no conversion is needed.
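As a minimal sketch of what on-the-fly conversion could amount to, the following hypothetical module builds a concat/5 on top of the existing unicode module. The module name bstring_concat_sketch and the helper to_chars/2 are illustrative assumptions and not part of the suggested interface; a real implementation would presumably choose a cheaper path when all encodings agree, whereas this version always decodes to code points.

```erlang
-module(bstring_concat_sketch).
-export([concat/5]).

%% Hypothetical sketch of the suggested bstring:concat/5.
%% Decodes both inputs to code points and re-encodes the result.
concat(B1, Enc1, B2, Enc2, OutEnc) when is_binary(B1), is_binary(B2) ->
    Chars = [to_chars(B1, Enc1), to_chars(B2, Enc2)],
    case unicode:characters_to_binary(Chars, unicode, OutEnc) of
        Result when is_binary(Result) -> Result;
        %% A code point that cannot be represented in OutEnc.
        _ -> erlang:error(badarg)
    end;
concat(_, _, _, _, _) ->
    erlang:error(badarg).

%% Illustrative helper: decode a binary, failing on bad encoding.
to_chars(Bin, Enc) ->
    case unicode:characters_to_list(Bin, Enc) of
        List when is_list(List) -> List;
        _ -> erlang:error(badarg)
    end.
```

With this sketch, concatenating the ISO-Latin-1 binaries <<"sm",246>> and <<"rg",229,"s">> with unicode as the output encoding yields the UTF-8 encoding of "smörgås".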
Choosing a default character encoding is not obvious. While ISO-Latin-1 is the default in Erlang (i.e. <<"korvsmörgås">> gives an ISO-Latin-1 encoded binary string), UTF-8 usage is expected to grow in the future.
Although it's tempting to select UTF-8 as the default encoding, I think we should stick to ISO-Latin-1 as the default even for this module. There are several reasons:
We need not, as a rule, impose new standards in every module we add to the standard library. Consistency certainly adds value, and the bit-syntax, the source code encoding and things like the io:format routine all have ISO-Latin-1 as default. Let's not make this module inconsistent with the others.
The string module is often used to manipulate arbitrary lists of integers, not always actually representing textual data. In the same way, bstring can probably be used to manipulate arbitrary blobs of bytes if ISO-Latin-1 versions are used. ISO-Latin-1 is actually the raw bytes uninterpreted, so any binary data can be worked on in an ISO-Latin-1 oriented routine. Using UTF-8 encoding as default would narrow the use for the default functions to only work on real text data.
The pure ISO-Latin-1 implementations of the functions will be the most efficient ones as no data checking at all is needed. Any byte value is acceptable in any version. Some functions are usable on UTF-8 strings even though they expect ISO-Latin-1 data, the only difference between the ISO-Latin-1 version and the UTF-8 version being the validation of input data. If the data given to, for example, bstring:concat is already checked for correct UTF-8, the simpler ISO-Latin-1 version of the function is both more efficient and guaranteed to give output as correct as the input:
CorrectUtf8_1 = give_me_good_string(),
CorrectUtf8_2 = give_me_another_good_string(),
CorrectUtf8_3 = bstring:concat(CorrectUtf8_1, latin1, CorrectUtf8_2, latin1, latin1),
...
Simply put, ISO-Latin-1 versions of the functions are more generally useful than pure UTF-8 versions and are also more efficient.
A wrapper module providing pure UTF-8 interfaces can easily be written. The overhead of going via a wrapper would be relatively lower for a UTF-8 wrapper than for an ISO-Latin-1 one, as the overhead of character decoding/encoding of UTF-8 strings in the module would be quite high. Simply put, a wrapper would cost very little compared to the cost of checking the data for UTF-8 correctness.
I actually suggest a module ubstring that has the part of the bstring interface where a default encoding is implied, but with the difference that UTF-8 is expected. For example, a function ubstring:tokens/2 would look like this:
tokens(S,L) -> bstring:tokens(S,unicode,L).
Quite simple.
To conclude, I think all functions should exist in a version where no encoding is supplied and ISO-Latin-1 encoded data is expected.
Even though Unicode characters can be used to express text in most known, living and dead scripts, language and region knowledge is a completely different thing. String interfaces often impose language specific properties of the string, like left-to-right writing direction, the notion of words built up by space separated groups of characters, ways of representing numbers and decimal points etc. As Erlang does not (yet) have a way of specifying such language-, or region-specific properties of a string, the interface should not contain language-dependent functionality. The string module did not originally contain such functions (except that character alignment functions were named left and right), but unfortunately functions like to_float and to_upper have been added.
I think that having language-dependent functions in the string module was a mistake and I do not want to make that mistake again. Hence I have not included such functions or names in bstring.
I rather suggest "Locale" functionality as a subject of a future EEP. For those who consider that simple, try to write a correct to_upper function for just all European languages, make sure it works on all platforms that can run Erlang... Maybe not rocket science, but a lot of metadata is required. Data that is not always available in the underlying OS, but probably needs to be distributed with Erlang/OTP for consistent functionality. Definitely worth its own EEP.
In connection with language independence, I think we should drop the notion of words as a group of characters separated by space. The word "token" is more general and does not in the same way indicate language constructs. The string module has the ASCII space character as a default for word separation, which I think should be dropped in bstring. Whatever should separate tokens should be supplied, possibly as alternatives. I therefore suggest the functions bstring:num_tokens and bstring:nth_token to fulfill the functionality of string:words and string:sub_word.
As in EEP 9, I suggest a new function split to handle the case of multi-character separators for tokens. A combination of split and join makes a convenient replace function too.
As mentioned earlier, I don't think direction of the graphical representation should be implied in the interface, so I suggest using notions like leading and trailing (meaning leading and trailing characters in the binary) rather than any directional notions. I also think aligning strings (like in string:right etc.) could be solved in one function align, taking one of the atoms leading, trailing or center as a parameter, if it should at all be implemented.
I definitely do not think we should have all interfaces from string duplicated to bstring. Especially interfaces that are aliases should not be carried along to the bstring module. Most functions in the string module however have short and fairly descriptive names, often similar to names found in other languages. I think using an r prefix for functionality working from the end of the string towards the beginning is a good choice, as is c for complement.
Some functions in string, that are certainly useful, return numbers denoting character positions. The same functions should definitely be present in the bstring module and the return values should definitely be character oriented. However byte offsets are definitely useful, for example if we use a function like span to find the first character not in a set of characters, we might want the byte offset of that first character too.
I suggest adding some interfaces returning byte offsets, or part() values like the ones used in the binary module and by re, to cope with the need for byte offsets and lengths in some circumstances. A b suffix to the function name could denote such functionality, so that bstring:span returns a character position while bstring:spanb returns a byte position and bstring:str returns a character position and bstring:strb returns a part(). Although this will in the end give rise to more functions in the interface, having return-type-changing options in an option list is not the way to go (I know, I have them in re, but it's still not generally a good idea...).
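The difference between a character oriented and a byte oriented return value can be illustrated with a minimal sketch. The module below is hypothetical (the name bstring_span_sketch and the helper scan/4 are assumptions made for illustration), handles UTF-8 input only, and omits the Encoding argument of the suggested interface.

```erlang
-module(bstring_span_sketch).
-export([span/2, spanb/2]).

%% Hypothetical UTF-8-only sketches of span and spanb.
%% span/2 counts characters, spanb/2 counts bytes.
span(Bin, Chars) ->
    {CharCount, _ByteCount} = scan(Bin, Chars, 0, 0),
    CharCount.

spanb(Bin, Chars) ->
    {_CharCount, ByteCount} = scan(Bin, Chars, 0, 0),
    ByteCount.

%% Walk the leading segment, tracking both counts at once.
scan(<<C/utf8, Rest/binary>> = Bin, Chars, N, B) ->
    case lists:member(C, Chars) of
        true ->
            Width = byte_size(Bin) - byte_size(Rest),
            scan(Rest, Chars, N + 1, B + Width);
        false ->
            {N, B}
    end;
scan(<<>>, _Chars, N, B) ->
    {N, B};
scan(_BadlyEncoded, _Chars, _N, _B) ->
    %% Only the scanned part of the binary is validated.
    erlang:error(badarg).
```

For a binary starting with two two-byte characters followed by a space, all of them in the set, span would return 3 while spanb would return 5, and the latter can be used directly in a bit syntax expression or with binary:part/3.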
When writing a general string module, there is no end to the new, more or less esoteric, functionality one could add. I think we, at least in an initial implementation, should stick to the functionality outlined in EEP 9, namely extending str and friends to optionally take a list of alternative strings to search for, add a function split to take care of multi-character separators (as opposed to single character separators in the function tokens) and a substitution function, which I think should be named replace as in other modules.
The use of pre-compiled matches from the binary module is however not a good idea, as the binary module has no notion of character encoding. Search strings need to be given in defined character encodings, and the encodings of both the "haystack" and the "needles" need to be known when doing an efficient search. So, no pre-compiled search expressions.
As made obvious above, I prefer the name bstring for a binary string module in favor of the more verbose name binary_string originally suggested. In that module bstring, I suggest the following interfaces, expressed as in a manual page of OTP.
encoding() = latin1 | unicode | utf8
- The encoding of characters in the binary data, both input and output
bstring()
- Binary with characters encoded either in ISO-Latin-1 or UTF-8
unicode_char() = non_negative_integer()
- An integer representing a valid unicode codepoint
non_negative_integer()
- An integer >= 0
align(BString, Alignment, Number, Char) -> Result
align(BString, Encoding, Alignment, Number, Char) -> Result
Types:
BString = Result = bstring()
Encoding = encoding()
Alignment = leading | trailing | center
Number = non_negative_integer()
Char = unicode_char()
Aligns the characters in BString in a Result of Number characters according to the Alignment parameter. Alignment is done by inserting the character Char in the beginning or end (or both) of the binary string.
The resulting binary string will contain exactly Number characters. The string is truncated if it contains more characters than Number: at the end if Alignment is leading, at the beginning if Alignment is trailing, or at both ends if Alignment is center. If Encoding is unicode, the Result may well contain more bytes than Number, as one character may require several bytes.
Example:
> bstring:align(<<"Hello">>, latin1, center, 10, $.).
<<"..Hello...">>
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if BString does not contain characters encoded according to the Encoding parameter, Encoding or Alignment has an invalid value, the character Char cannot be encoded in the character encoding given as Encoding or any of the parameters are of the wrong type.
chr(BString, Character) -> Position
chr(BString, Encoding, Character) -> Position
rchr(BString, Character) -> Position
rchr(BString, Encoding, Character) -> Position
Types:
BString = bstring()
Encoding = encoding()
Character = unicode_char()
Position = integer()
Returns the (zero-based) character position of the first/last occurrence of Character in BString. -1 is returned if Character does not occur.
Note that the character position is not the same as the byte position. Use the chrb and rchrb functions to get the byte positions.
If Character cannot be represented in the encoding, it is not an error; you are just certain to get -1 as a return value.
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if the searched part of BString does not contain characters encoded according to the Encoding parameter, the Encoding has an invalid value, or any of the parameters are of the wrong type.
chrb(BString, Character) -> {BytePosition, ByteLength}
chrb(BString, Encoding, Character) -> {BytePosition, ByteLength}
rchrb(BString, Character) -> {BytePosition, ByteLength}
rchrb(BString, Encoding, Character) -> {BytePosition, ByteLength}
Types:
BString = bstring()
Encoding = encoding()
Character = unicode_char()
BytePosition = integer()
ByteLength = non_negative_integer()
Works as chr and rchr respectively, but returns the byte position and byte length of the character.
If the character is not found, {-1,0} is returned.
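A minimal sketch of how chr and chrb could relate is given below. It is a hypothetical, UTF-8-only illustration (the module name bstring_chr_sketch and the helper find/4 are assumptions, and the Encoding argument is left out); it only shows the two return conventions side by side.

```erlang
-module(bstring_chr_sketch).
-export([chr/2, chrb/2]).

%% Hypothetical UTF-8-only sketches of chr and chrb.
chr(Bin, Char) ->
    {CharPos, _Part} = find(Bin, Char, 0, 0),
    CharPos.

chrb(Bin, Char) ->
    {_CharPos, Part} = find(Bin, Char, 0, 0),
    Part.

%% Returns {CharPosition, {BytePosition, ByteLength}}.
find(<<C/utf8, Rest/binary>> = Bin, Char, CharPos, BytePos) ->
    Width = byte_size(Bin) - byte_size(Rest),
    case C of
        Char -> {CharPos, {BytePos, Width}};
        _ -> find(Rest, Char, CharPos + 1, BytePos + Width)
    end;
find(<<>>, _Char, _CharPos, _BytePos) ->
    %% Not found: same conventions as described above.
    {-1, {-1, 0}};
find(_BadlyEncoded, _Char, _CharPos, _BytePos) ->
    erlang:error(badarg).
```

The tuple returned by chrb in this sketch has the same shape as the part() values used by the binary module, so a found character can be extracted with binary:part/2.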
concat(BString1, BString2) -> BString3
concat(BString1, Encoding1, BString2, Encoding2, Encoding3) -> BString3
Types:
BString1 = BString2 = BString3 = bstring()
Encoding1 = Encoding2 = Encoding3 = encoding()
Concatenates two binary strings to form a new string. Returns the new binary string in the encoding given by Encoding3.
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if BString1 or BString2 does not contain characters encoded according to the Encoding1 and Encoding2 parameters respectively, the encoding parameters have an invalid value, the codepoints in the in-parameters cannot be represented in the output encoding or any of the parameters are of the wrong type.
equal(BString1, BString2) -> bool()
equal(BString1, Encoding1, BString2, Encoding2) -> bool()
Types:
BString1 = BString2 = bstring()
Encoding1 = Encoding2 = encoding()
Tests whether two binary strings are equal. Returns true if they are, otherwise false.
Encoding1 is the encoding of BString1 and Encoding2 is the encoding of BString2.
Note that the strings can have different encodings and that it is the character values encoded in the strings that are compared. The binary strings are scanned as long as they are equal, meaning that if the function returns true, both strings are correctly encoded, while a return value of false does not guarantee correct encoding in both binary strings. An exception is raised if faulty encoding is detected while comparing the strings, not if parts of the string not inspected contain encoding errors.
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if wrongly encoded characters, according to the encoding parameters, are encountered during comparison, the encoding parameters have an invalid value or any of the parameters are of the wrong type.
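The lazy scanning described for equal can be sketched as follows. This is a hypothetical illustration only (the module name bstring_equal_sketch and the helpers cmp/4 and next/2 are assumptions): characters are decoded one at a time from each binary and the comparison stops at the first difference, so encoding errors beyond that point go unnoticed. In this simplified version an encoding atom is only checked when a character is actually decoded.

```erlang
-module(bstring_equal_sketch).
-export([equal/4]).

%% Hypothetical sketch of equal/4: compare code points, not bytes.
equal(B1, Enc1, B2, Enc2) ->
    cmp(next(B1, Enc1), Enc1, next(B2, Enc2), Enc2).

%% Both strings ended at the same time: equal.
cmp(eof, _Enc1, eof, _Enc2) ->
    true;
%% Same code point C in both strings: keep scanning.
cmp({C, Rest1}, Enc1, {C, Rest2}, Enc2) ->
    cmp(next(Rest1, Enc1), Enc1, next(Rest2, Enc2), Enc2);
%% Different code points or different lengths: stop here.
cmp(_, _Enc1, _, _Enc2) ->
    false.

%% Decode one character according to the given encoding.
next(<<>>, _Enc) ->
    eof;
next(<<C, Rest/binary>>, latin1) ->
    {C, Rest};
next(<<C/utf8, Rest/binary>>, Enc) when Enc =:= unicode; Enc =:= utf8 ->
    {C, Rest};
next(_BadlyEncodedOrBadEncoding, _Enc) ->
    erlang:error(badarg).
```

With this sketch, an ISO-Latin-1 binary and a UTF-8 binary holding the same characters compare equal, while a UTF-8 binary with an invalid byte after the first differing character yields false rather than an exception.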
join(BStringList, Separator) -> Result
join(BStringList, BStringListEncoding, Separator, SeparatorEncoding, ResultEncoding) -> Result
Types:
BStringList = [bstring()]
BStringListEncoding = SeparatorEncoding = ResultEncoding = encoding()
Separator = bstring()
Result = bstring()
Returns a binary string with the elements of BStringList separated by the binary string in Separator.
All the binary strings in BStringList should have the same encoding (given as BStringListEncoding). The Separator can however have a different encoding (given as SeparatorEncoding), as can the Result (given as ResultEncoding).
Example:
> bstring:join([<<"one">>, <<"two">>, <<"three">>], latin1, <<", ">>, latin1, latin1).
<<"one, two, three">>
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if binary strings in BStringList or the Separator do not contain characters encoded according to the BStringListEncoding and SeparatorEncoding parameters respectively, the encoding parameters have an invalid value, the codepoints in the in-parameters cannot be represented in the output encoding ResultEncoding or any of the parameters are of the wrong type.
len(BString) -> Length
len(BString, Encoding) -> Length
Types:
BString = bstring()
Encoding = encoding()
Length = non_negative_integer()
Returns the number of characters in the binary string.
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if BString does not contain characters encoded according to the Encoding parameter, the Encoding has an invalid value or any of the parameters are of the wrong type.
nth_token(BString, N, CharList) -> Result
nth_token(BString, Encoding, N, CharList) -> Result
Types:
BString = Result = bstring()
Encoding = encoding()
CharList = [ unicode_char() ]
N = non_negative_integer()
Returns the token number N of BString (zero-based). Tokens are separated by the characters in CharList.
The returned token will have the same encoding as BString.
For example:
> bstring:nth_token(<<" Hello old boy !">>,latin1,3,[$o, $ ]).
<<"y">>
CharList is to be viewed as a set of characters, order is not significant. Codepoints given in CharList that cannot be represented by the Encoding are not an error.
Values of N >= number of tokens in BString will result in the empty binary string <<>> being returned.
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if BString does not contain characters encoded according to the Encoding parameter, the Encoding has an invalid value, or any of the parameters are of the wrong type.
num_tokens(BString, CharList) -> Count
num_tokens(BString, Encoding, CharList) -> Count
Types:
BString = bstring()
Encoding = encoding()
CharList = [ unicode_char() ]
Count = non_negative_integer()
Returns the number of tokens in BString, separated by the characters in CharList.
The result is the same as for length(bstring:tokens(BString,Encoding,CharList)), but avoids building the result.
For example:
> bstring:num_tokens(<<" Hello old boy!">>, latin1, [$o, $ ]).
4
CharList is to be viewed as a set of characters, order is not significant. Codepoints given in CharList that cannot be represented by the Encoding are not an error.
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if BString does not contain characters encoded according to the Encoding parameter, the Encoding has an invalid value, or any of the parameters are of the wrong type.
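Counting tokens without building them can be sketched with a single pass that only remembers whether the scan is currently inside a token. The module below is a hypothetical illustration for the ISO-Latin-1 (byte per character) case; the name bstring_tokens_sketch and the helper count/4 are assumptions, and it presumes that consecutive separators do not produce empty tokens, as in string:tokens.

```erlang
-module(bstring_tokens_sketch).
-export([num_tokens/2]).

%% Hypothetical latin1-only sketch of num_tokens.
num_tokens(Bin, CharList) when is_binary(Bin), is_list(CharList) ->
    count(Bin, CharList, false, 0);
num_tokens(_, _) ->
    erlang:error(badarg).

%% InToken tells whether the previous byte belonged to a token.
count(<<C, Rest/binary>>, Seps, InToken, N) ->
    case lists:member(C, Seps) of
        true ->
            count(Rest, Seps, false, N);
        false when InToken ->
            count(Rest, Seps, true, N);
        false ->
            %% First byte of a new token.
            count(Rest, Seps, true, N + 1)
    end;
count(<<>>, _Seps, _InToken, N) ->
    N.
```

Applied to the example above, the sketch counts the tokens "Hell", "ld", "b" and "y!".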
span(BString, Chars) -> Length
span(BString, Encoding, Chars) -> Length
rspan(BString, Chars) -> Length
rspan(BString, Encoding, Chars) -> Length
cspan(BString, Chars) -> Length
cspan(BString, Encoding, Chars) -> Length
rcspan(BString, Chars) -> Length
rcspan(BString, Encoding, Chars) -> Length
Types:
BString = bstring()
Encoding = encoding()
Chars = [ integer() ]
Length = non_negative_integer()
Returns the length (in characters) of the maximum initial (span and cspan) or trailing (rspan and rcspan) segment of BString, which consists entirely of characters from (span and rspan), or not from (cspan and rcspan) Chars.
Chars is to be viewed as a set of characters, order is not significant. Codepoints given in Chars that cannot be represented by the Encoding are not an error.
For example:
> bstring:span(<<"\t    abcdef">>,latin1," \t").
5
> bstring:cspan(<<"\t    abcdef">>,latin1, " \t").
0
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if the searched part of BString does not contain characters encoded according to the Encoding parameter, the Encoding has an invalid value, or any of the parameters are of the wrong type.
spanb(BString, Chars) -> ByteLength
spanb(BString, Encoding, Chars) -> ByteLength
rspanb(BString, Chars) -> ByteLength
rspanb(BString, Encoding, Chars) -> ByteLength
cspanb(BString, Chars) -> ByteLength
cspanb(BString, Encoding, Chars) -> ByteLength
rcspanb(BString, Chars) -> ByteLength
rcspanb(BString, Encoding, Chars) -> ByteLength
Types:
BString = bstring()
Encoding = encoding()
Chars = [ integer() ]
ByteLength = non_negative_integer()
Work exactly like the functions span, rspan, cspan and rcspan respectively, but return the number of bytes rather than the number of characters.
split(BString, Separators, Where) -> Tokens
split(BString, Encoding, Separators, SepEncoding, Where, ReturnEncoding) -> Tokens
Types:
BString = bstring()
Encoding = SepEncoding = ReturnEncoding = encoding()
Separators = [ bstring() ]
Where = first | last | all
Tokens = [bstring()]
Returns a list of tokens in BString, separated by the binary strings in Separators.
The Tokens returned are encoded according to ReturnEncoding.
Example:
> bstring:split(<<"abc defxxghix jkl">>, latin1, [<<"x">>,<<" ">>], latin1, all, latin1).
[<<"abc">>, <<"def">>, <<"ghi">>, <<"jkl">>]
Separators is to be viewed as a set of binary strings, order is not significant. Codepoints given in Separators that cannot be represented by the Encoding are not an error.
The Where parameter specifies at which occurrence of any of the Separators the binary string is to be split, either at the first occurrence, the last occurrence or at all occurrences, in which case the Tokens may be an arbitrarily long list.
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if BString or Separators does not contain characters encoded according to the Encoding and SepEncoding parameters respectively, the resulting tokens cannot be encoded in the ReturnEncoding, the Encoding has an invalid value, or any of the parameters are of the wrong type.
str(BString, SubBStrings) -> Position
str(BString, Encoding, SubBStrings, SubEnc) -> Position
rstr(BString, SubBStrings) -> Position
rstr(BString, Encoding, SubBStrings, SubEnc) -> Position
Types:
BString = bstring()
SubBStrings = bstring() | [ bstring() ]
Encoding = SubEnc = encoding()
Position = integer()
Returns the (zero-based) character position where the first/last occurrence of any of the SubBStrings begins in BString. -1 is returned if none of the SubBStrings occurs in BString.
Note that the character position is not the same as the byte position. Use the strb and rstrb functions to get the byte positions.
The encoding need not be the same for BString and SubBStrings, however all strings in SubBStrings need to have the same encoding.
If the codepoints in SubBStrings cannot be represented in the encoding of BString, that is not an error, but will always result in the return value -1.
Example:
> bstring:str(<<" Hello Hello World World ">>,latin1,<<"Hello World">>,latin1).
7
Note that if both encodings are the same and repeated searches with the same SubBStrings are to be performed, it is more efficient to use the binary:match/{2,3} functions with a precompiled pattern on the raw binary data.
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if the searched part of BString or SubBStrings does not contain characters encoded according to the Encoding and SubEnc parameters respectively, the Encoding has an invalid value, or any of the parameters are of the wrong type.
strb(BString, SubBStrings) -> {BytePosition, ByteLength}
strb(BString, Encoding, SubBStrings, SubEnc) -> {BytePosition, ByteLength}
rstrb(BString, SubBStrings) -> {BytePosition, ByteLength}
rstrb(BString, Encoding, SubBStrings, SubEnc) -> {BytePosition, ByteLength}
Types:
BString = bstring()
SubBStrings = bstring() | [ bstring() ]
Encoding = SubEnc = encoding()
BytePosition = integer()
ByteLength = non_negative_integer()
Works as str and rstr respectively, but returns the byte position and byte length of the found substring.
Note that ByteLength is the length the found substring has in BString, regardless of the encoding of SubBStrings, so ByteLength may be either larger or smaller than the byte size of the matching element of SubBStrings, depending on the encodings involved.
If the substring is not found, {-1,0} is returned.
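The following sketch shows why ByteLength can differ from the byte size of the search string. It is an illustration under fixed assumptions, not the proposed implementation: the module name strb_sketch is hypothetical, BString is assumed to be UTF-8, and the single search string is assumed to be Latin-1. The search string is first converted to the encoding of BString with unicode:characters_to_binary/3.

```erlang
-module(strb_sketch).
-export([strb/2, rstrb/2]).

%% ASSUMPTIONS: BString is valid UTF-8, Sub is a non-empty Latin-1 binary.

%% First occurrence as {BytePosition, ByteLength}, or {-1,0}.
strb(BString, Sub) when is_binary(BString), is_binary(Sub), byte_size(Sub) > 0 ->
    Needle = unicode:characters_to_binary(Sub, latin1, utf8),
    case binary:match(BString, Needle) of
        nomatch -> {-1, 0};
        Found -> Found
    end.

%% Last occurrence, found by trying start offsets from the end backwards.
rstrb(BString, Sub) when is_binary(BString), is_binary(Sub), byte_size(Sub) > 0 ->
    Needle = unicode:characters_to_binary(Sub, latin1, utf8),
    last_match(BString, Needle, byte_size(BString) - byte_size(Needle)).

last_match(_BString, _Needle, Pos) when Pos < 0 ->
    {-1, 0};
last_match(BString, Needle, Pos) ->
    Len = byte_size(Needle),
    case BString of
        <<_:Pos/binary, Needle:Len/binary, _/binary>> -> {Pos, Len};
        _ -> last_match(BString, Needle, Pos - 1)
    end.
```

Searching <<"sm", 246/utf8, "r">> for the one-byte Latin-1 string <<246>> yields {2,2}: the found substring is two bytes long in BString although the search string itself is a single byte.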
strip(BString, Which, CharList) -> Result
strip(BString, Encoding, Which, CharList) -> Result
Types:
BString = Result = bstring()
Encoding = encoding()
Which = leading | trailing | both
CharList = [ unicode_char() ]
Removes leading (Which = leading), trailing (Which = trailing) or both leading and trailing (Which = both) characters belonging to the set indicated by CharList from the binary string BString.
This is essentially the same as using spanb and/or rspanb in combination with bit syntax to remove the characters.
Example:
> bstring:strip(<<"...He.llo.....">>, latin1, both, [$.]).
<<"He.llo">>
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if the scanned part of BString does not contain characters encoded according to the Encoding parameter, Encoding or Which has an invalid value, or any of the parameters are of the wrong type.
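A minimal sketch of the stripping behaviour for the Latin-1 case, where every character is exactly one byte. The module name strip_sketch is hypothetical, and the scan is written directly with bit syntax and binary:at/2 rather than with the proposed spanb and rspanb functions.

```erlang
-module(strip_sketch).
-export([strip/3]).

%% ASSUMPTION: BString is Latin-1, so one byte equals one character.
strip(BString, leading, CharList) ->
    drop_leading(BString, CharList);
strip(BString, trailing, CharList) ->
    drop_trailing(BString, CharList, byte_size(BString));
strip(BString, both, CharList) ->
    Rest = drop_leading(BString, CharList),
    drop_trailing(Rest, CharList, byte_size(Rest)).

%% Skip bytes from the front while they belong to CharList.
drop_leading(<<C, Rest/binary>> = BString, CharList) ->
    case lists:member(C, CharList) of
        true -> drop_leading(Rest, CharList);
        false -> BString
    end;
drop_leading(<<>>, _CharList) ->
    <<>>.

%% Shrink the kept length N while the last kept byte belongs to CharList.
drop_trailing(_BString, _CharList, 0) ->
    <<>>;
drop_trailing(BString, CharList, N) ->
    case lists:member(binary:at(BString, N - 1), CharList) of
        true -> drop_trailing(BString, CharList, N - 1);
        false -> binary:part(BString, 0, N)
    end.
```

Calling strip_sketch:strip(<<"...He.llo.....">>, both, [$.]) returns <<"He.llo">>, matching the example above; the dot inside the string is kept because only leading and trailing characters are examined.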
replace(BString, Separators, Replacement, Where) -> Result
replace(BString, Encoding, Separators, SeparatorsEncoding, Replacement, ReplacementEncoding, Where, ResultEncoding) -> Result
Types:
BString = bstring()
Encoding = SeparatorsEncoding = ReplacementEncoding = ResultEncoding = encoding()
Separators = [ bstring() ]
Replacement = bstring()
Where = first | last | all
Result = bstring()
Produces the same result as
bstring:join(bstring:split(BString,Encoding,Separators,SeparatorsEncoding,Where,
unicode),
unicode,Replacement,ReplacementEncoding,ResultEncoding)
but with less overhead.
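When all strings share one encoding, similar behaviour can be approximated with the existing binary module. The sketch below is an assumption-laden illustration, not the proposed implementation: the module name replace_sketch is hypothetical, no encoding conversion is done, and for Where = last it relies on binary:matches/2, which reports non-overlapping matches only and so may differ from a true rightmost match when separators overlap.

```erlang
-module(replace_sketch).
-export([replace/4]).

%% ASSUMPTION: BString, Separators and Replacement share one encoding.
replace(BString, Separators, Replacement, first) ->
    %% binary:replace/3 replaces only the first match.
    binary:replace(BString, Separators, Replacement);
replace(BString, Separators, Replacement, all) ->
    binary:replace(BString, Separators, Replacement, [global]);
replace(BString, Separators, Replacement, last) ->
    case binary:matches(BString, Separators) of
        [] ->
            BString;
        Matches ->
            {Pos, Len} = lists:last(Matches),
            <<Before:Pos/binary, _:Len/binary, After/binary>> = BString,
            <<Before/binary, Replacement/binary, After/binary>>
    end.
```

For instance, replace_sketch:replace(<<"a,b;c">>, [<<",">>, <<";">>], <<"-">>, all) returns <<"a-b-c">>.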
substr(BString, Start, Length) -> SubBString
substr(BString, Encoding, Start, Length) -> SubBString
Types:
BString = SubBString = bstring()
Encoding = encoding()
Start = integer()
Length = non_negative_integer() | infinity
Returns a substring of BString, starting at the zero-based character position Start, and either ending at the end of the binary string (if Length is infinity) or extending up to, but not including, the character position Start+Length (if Length is a non-negative integer).
The returned SubBString will have the same encoding as BString.
Example:
> bstring:substr(<<"Hello World">>, latin1, 3, 5).
<<"lo Wo">>
A negative value of Start denotes abs(Start) characters from the end of BString, so that -1 is the last character position in the binary string.
Example:
> bstring:substr(<<"Hello World">>, latin1, -3, 3).
<<"rld">>
As the true length of a UTF-8 encoded binary string is quite costly to determine (O(N), where N is the number of bytes in the binary), the function is very forgiving about positions given outside of the string, for both Start and Length. Character positions outside of the string in either direction are ignored, so a range lying entirely outside the string yields the empty binary string.
Examples:
> bstring:substr(<<"01234">>, latin1, 5, 5).
<<>>
> bstring:substr(<<"01234">>, latin1, 4, 5).
<<"4">>
> bstring:substr(<<"01234">>, latin1, -5, 100).
<<"01234">>
> bstring:substr(<<"01234">>, latin1, -6, 1).
<<>>
> bstring:substr(<<"01234">>, latin1, -6, 2).
<<"0">>
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if the searched part of BString does not contain characters encoded according to the Encoding parameter, Encoding has an invalid value, or any of the parameters are of the wrong type.
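The forgiving position handling can be sketched for the Latin-1 case as a clamping of the requested range to the actual string. The module name substr_sketch is hypothetical, and the sketch covers only one-byte characters; a UTF-8 version would have to walk the characters instead of computing byte offsets directly.

```erlang
-module(substr_sketch).
-export([substr/3]).

%% ASSUMPTION: BString is Latin-1, so character and byte positions coincide.
substr(BString, Start, Length) when is_binary(BString), is_integer(Start) ->
    Size = byte_size(BString),
    %% A negative Start counts from the end of the string.
    From0 = if Start < 0 -> Size + Start;
               true -> Start
            end,
    To0 = case Length of
              infinity -> Size;
              L when is_integer(L), L >= 0 -> From0 + L
          end,
    %% Clamp the half-open range [From0, To0) to [0, Size).
    From = max(0, min(From0, Size)),
    To = max(From, min(To0, Size)),
    binary:part(BString, From, To - From).
```

With this clamping, substr_sketch:substr(<<"01234">>, -6, 2) returns <<"0">> and substr_sketch:substr(<<"01234">>, 5, 5) returns <<>>, in line with the examples above.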
tokens(BString, SeparatorList) -> Tokens
tokens(BString, Encoding, SeparatorList) -> Tokens
Types:
BString = bstring()
Encoding = encoding()
SeparatorList = [ non_negative_integer() ]
Tokens = [bstring()]
Returns a list of tokens in BString, separated by the characters in SeparatorList.
The Tokens returned are encoded in the same character encoding as BString.
Example:
> bstring:tokens(<<"abc defxxghix jkl">>, latin1, [$x,$ ]).
[<<"abc">>, <<"def">>, <<"ghi">>, <<"jkl">>]
SeparatorList is to be viewed as a set of characters; order is not significant. Codepoints given in SeparatorList that cannot be represented in the Encoding are not an error.
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if the searched part of BString does not contain characters encoded according to the Encoding parameter, Encoding has an invalid value, or any of the parameters are of the wrong type.
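For Latin-1 data, the tokenizing behaviour can be approximated with binary:split/3. This is a sketch under assumptions: the module name tokens_sketch is hypothetical, the trim_all option requires a reasonably recent OTP release (it is not present in R14B), and separators above 255 are silently dropped since they cannot occur in a Latin-1 string.

```erlang
-module(tokens_sketch).
-export([tokens/2]).

%% ASSUMPTION: BString is Latin-1, so each separator is a single byte.
tokens(BString, SeparatorList) when is_binary(BString), is_list(SeparatorList) ->
    %% Keep only separators representable in Latin-1; others are not an error.
    case [<<C>> || C <- SeparatorList, is_integer(C), C >= 0, C =< 255] of
        [] when BString =:= <<>> ->
            [];
        [] ->
            [BString];
        Separators ->
            %% trim_all drops the empty parts produced by adjacent separators.
            binary:split(BString, Separators, [global, trim_all])
    end.
```

Calling tokens_sketch:tokens(<<"abc defxxghix jkl">>, [$x, $ ]) returns the four tokens shown in the example above.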
This module can, and probably should, be implemented entirely in
Erlang; no BIFs or NIFs are needed. Both the binary and unicode
modules can be utilized to speed up conversion and input
checking. The Unicode versions will definitely be slower than the
ISO-Latin-1 versions, as character encoding, decoding and checking are
bound to produce overhead.
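As an illustration of input checking done purely in Erlang, the sketch below validates a binary string against its declared encoding using unicode:characters_to_binary/2. The module and function names (enc_check_sketch, check/2) are hypothetical, and accepting utf8 as an alias for unicode is an assumption made here, not something the proposal states.

```erlang
-module(enc_check_sketch).
-export([check/2]).

%% Returns ok or raises badarg, mirroring the error behaviour described
%% for the bstring functions.
check(BString, latin1) when is_binary(BString) ->
    %% Any byte sequence is valid Latin-1; nothing to verify.
    ok;
check(BString, Encoding) when is_binary(BString),
                              (Encoding =:= unicode orelse Encoding =:= utf8) ->
    case unicode:characters_to_binary(BString, utf8) of
        Valid when is_binary(Valid) -> ok;
        _ErrorOrIncomplete -> erlang:error(badarg)  % {error,..} | {incomplete,..}
    end;
check(_BString, _Encoding) ->
    erlang:error(badarg).
```

The Latin-1 clause does no work at all, whereas the UTF-8 clause has to scan the whole binary, which reflects the overhead difference mentioned above.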
The suggested wrapper ubstring should not impose any significant
cost compared to calling bstring with all encoding arguments set
to unicode.
The idea is to make string manipulation using binaries convenient as it has a great positive impact on systems memory-wise. Increased speed compared to list-oriented strings is not the goal, although it may well be a side-effect.
No specific reference implementation exists yet; the code will, however, be made available on GitHub during development.
This document is licensed under the Creative Commons license.