Author: Patrik Nyblom <pan(at)erlang(dot)org>,
        Fredrik Svahn <Fredrik(dot)Svahn(at)gmail>
Status: Draft
Type: Standards Track
Created: 29-Sep-2010
Erlang-Version: R14B
Post-History:
Replaces: 9

EEP 35: Binary string module(s)

Abstract

This EEP develops the suggestions for the module binary_string first proposed in EEP 9. The module name has, however, been changed to bstring.

EEP 9 suggests several modules and is partially superseded by later EEPs (namely EEP 11 and EEP 31), while still containing valuable suggestions not yet implemented. This last remaining module suggested in EEP 9 will therefore appear in this separate EEP. This is done in agreement with the original author of EEP 9.

The module bstring is suggested to contain functions for convenient manipulation of textual data stored in binaries, i.e. binary strings. It somewhat resembles the string module (which is list oriented), but is not to be viewed simply as a string module for binaries.

The module suggested handles binary character encoding in both the standard character encodings of Erlang, namely ISO-Latin-1 and UTF-8.

Motivation

Text strings are traditionally represented as lists of integers in Erlang. While this is convenient and more or less built into the syntax of the language (i.e. "ABC" is syntactic sugar for [$A,$B,$C]), a more compact representation is often desired. Also, in some circumstances binaries can be more efficient to manipulate in terms of algorithm complexity than lists are (especially in the fixed character width case of ISO-Latin-1).

More modules have been added to the standard libraries lately to aid the usage of binaries for text strings, both as representing ISO-Latin-1 characters and Unicode strings encoded in UTF-8. Most notably the re library, but also the unicode module are fairly new additions to stdlib which will make life easier for the programmer when it comes to manipulating binary encoded strings. Also a module for fast searching and replacing in byte oriented binaries is present (the module binary), but no traditional string manipulation module is yet in the libraries. To ease use of binary encoded strings, such a module is needed.

Rationale

The module string for text oriented operations on lists has been present in the standard libraries for so long that most programmers don't remember a time when it wasn't there. It is said to originally be a merge of two different string modules, written and designed by two different programmers with possibly slightly different goals and definitely slightly different views on function naming. While sometimes criticized for duplicated functionality and inconsistent function naming, among other things, the module has remained useful throughout the entire lifespan of Erlang/OTP. The string representation used has also withstood the evolution of Unicode.

It is worth noting that the only functions in the string module that actually are language or region dependent are later additions to the module. Those functions (like to_upper, to_lower, to_integer and to_float), or their binary equivalents, are not part of the module interface I suggest for bstring, for the simple reason that they need language support not yet present in Erlang. A future EEP might suggest such language support (i.e. some kind of "locale" support), but that is future work not covered by this EEP.

So, however criticized, the string module is very useful for manipulating lists, and the same functionality for binary strings is desirable. While a lot of the functionality will be similar, there are some major issues to consider when implementing a module for manipulating strings encoded in binaries:

  • Unicode - Binaries can have different encodings. A character encoded as UTF-8 might take more than one (up to four) byte positions, and even the same character can have different encodings in ISO-Latin-1 and UTF-8 (all codepoints from 128 to 255). The functions need to be informed of the character encoding explicitly, as the encoding information is not present in the binaries.

  • Mixed character encodings - As characters can be encoded in different ways, two strings in the same program could have different encodings. Supplying the functions with non-homogeneous string encoding data should be consistently solved throughout the module, as should the selection of returned encoding where applicable.

  • Default character encoding - As functions will take extra arguments to specify encoding, a consistent default might be useful. Choosing the default is not entirely simple, as the tradition states ISO-Latin-1, while the future suggests UTF-8.

  • Languages - Erlang has no notion of "Locale" or preferred number format. A general string module can assume neither a specific notion of uppercase or lowercase letters, nor a specific number encoding format (especially true for floating point numbers).

  • Word separators - The space character is certainly not the only word separator for textual data (in any language). The notion of words separated by spaces imposes a restriction of the relevant languages.

  • Left to right or right to left - Notions like left or right to denote the beginning or end of a string are certainly not language independent. While strings in a language have a beginning and an end, that beginning and end may be placed both to the left, the right or even at the top, bottom or center of the graphical representation. A string manipulation module should not use naming implying a left-to-right script, or any other type of script.

  • Naming and duplicated functionality - The original string module has been accused of having somewhat inconsistent naming and functionality duplicated. In fact the only duplicated functions are substr and sub_string. Some cleanup of the interface might be needed.

  • Byte oriented versus character oriented return values - When dealing with Unicode data, a character may take more than one byte, which is why, for example, counting the number of characters in a string tells you very little about the actual size of the string in bytes. Furthermore, later processing of a binary might require byte-oriented manipulation of a string rather than character-oriented (e.g. you want to manipulate the string using the binary module or with bit-syntax), while characters, not bytes, are actually what constitutes a string. You would want both.

  • New or replaced functionality - New functionality has been suggested from several sources, most notably EEP 9. For example the function split suggested in EEP 9 is very similar to string:tokens/2. Should we keep tokens anyway, for example?

I'll address the different issues below.

Unicode

The interface has to support both ISO-Latin-1 and UTF-8. The unicode module supports even more encodings, but Erlang/OTP uses UTF-8 for all "internal" interfaces and UTF-8 is the expected encoding of a binary Unicode string. Even though UTF-8 is compatible with ISO-Latin-1 in the 7-bit ASCII range, characters with codepoints between 128 and 255 are encoded differently in the "plain" ISO-Latin-1 encoding and in UTF-8. This means that all functions in the bstring module need to have the actual encoding as one or more extra parameters.
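The difference between the two encodings for codepoints 128 to 255 can be seen with the existing unicode module. The following is a minimal sketch; the module name bstring_enc_demo and its two functions are hypothetical helpers for illustration and are not part of the suggested interface.

```erlang
-module(bstring_enc_demo).
-export([encode/2, byte_sizes/1]).

%% Encode a list of codepoints in the given encoding using the existing
%% unicode module. Raises badarg if a codepoint cannot be represented.
encode(Chars, Encoding) ->
    case unicode:characters_to_binary(Chars, unicode, Encoding) of
        Bin when is_binary(Bin) -> Bin;
        _Error -> erlang:error(badarg)
    end.

%% Return {Latin1Bytes, Utf8Bytes} for a single codepoint below 256.
byte_sizes(Char) when Char >= 0, Char < 256 ->
    {byte_size(encode([Char], latin1)),
     byte_size(encode([Char], unicode))}.
```

Under these assumptions, codepoint 246 (ö) occupies one byte as ISO-Latin-1 but two bytes as UTF-8, while a 7-bit ASCII character such as $A occupies one byte in both.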

One could invent a more abstract binary string format where the data is, for example, represented as a tuple with the string and the encoding packed together. However, no other module supports such a string construct and I don't think it would really add anything, in either functionality or readability. Consider code like:

bstring:tokens(Bin,latin1,[$ ,$\n])

compared to:

bstring:tokens({Bin,latin1}, [$ ,$\n]).

or even:

bstring:tokens(#bstring{data = Bin, encoding = latin1}, [$ ,$\n]).

In many cases the extra information needs to be added in connection to the call, making the code no more readable or simple to write than with the separate extra argument. Consider if we had a default value for encoding. The code:

f(Data) ->
       bstring:tokens(Data,[$ ,$\n]).

would not in any way indicate if Data was supposed to be a binary with the default encoding or some kind of complex data structure indicating both the actual string and its encoding.

I think the extra argument for the encoding is straightforward and simple, and it makes programming easier when using the binary string in other modules as well (e.g. re, binary, file etc). I think we should simply not have a special string datatype for this module; character encoding should be supplied as a separate argument.

Mixed character encodings

To ease transition between character encodings, I think the interface should accept different encodings for both different parameters and the return value. This makes it possible to convert on the fly and for the functions to decide on the most efficient character conversion path for the supplied arguments and the return value.

The downside of this approach is that some functions will take a lot of parameters telling different character encodings, for example a string concatenation routine could look like:

  concat(BString1, Encoding1, BString2, Encoding2, Encoding3) -> BString3

being called like:

  US = bstring:concat(SA,latin1, SB, latin1, unicode),

which might look a little awkward to write. On the other hand, conversion is made on the fly and you will not need to explicitly call the unicode module to convert the result.

I think implicit conversion is so useful that it is worth the extra arguments. For example, a concat function would be more or less useless without it; the bit syntax would be much easier to use if no conversion were allowed.
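As an illustration of conversion on the fly, a concat/5 with the signature above could be sketched on top of the existing unicode module. This is only a sketch under stated assumptions: the module name bstring_concat_sketch and the helper decode/2 are hypothetical, and a real implementation would presumably pick a more efficient conversion path than going via a list of codepoints.

```erlang
-module(bstring_concat_sketch).
-export([concat/5]).

%% concat(BString1, Encoding1, BString2, Encoding2, Encoding3) -> BString3
%% Decodes both inputs to codepoints and encodes the result as Encoding3.
concat(B1, E1, B2, E2, E3) when is_binary(B1), is_binary(B2) ->
    Chars = decode(B1, E1) ++ decode(B2, E2),
    case unicode:characters_to_binary(Chars, unicode, E3) of
        Bin when is_binary(Bin) -> Bin;
        _Error -> erlang:error(badarg)  % not representable in Encoding3
    end.

%% Hypothetical helper: binary to list of codepoints, badarg on bad input.
decode(Bin, Encoding) ->
    case unicode:characters_to_list(Bin, Encoding) of
        Chars when is_list(Chars) -> Chars;
        _Error -> erlang:error(badarg)
    end.
```

With this sketch, two ISO-Latin-1 strings can be concatenated directly into a UTF-8 result, and asking for an ISO-Latin-1 result containing a codepoint above 255 raises badarg.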

Default character encoding

Choosing a default character encoding is not obvious. While ISO-Latin-1 is the default in Erlang (i.e. <<"korvsmörgås">> gives an ISO-Latin-1 encoded binary string), UTF-8 usage is expected to grow in the future.

Although it's tempting to select UTF-8 as the default encoding, I think we should stick to ISO-Latin-1 as the default even for this module. There are several reasons:

  • We need not, as a rule, impose new standards in every module we add to the standard library. Consistency certainly adds value, and the bit-syntax, the source code encoding and things like the io:format routine all have ISO-Latin-1 as default. Let's not make this module inconsistent with the others.

  • The string module is often used to manipulate arbitrary lists of integers, not always actually representing textual data. In the same way, bstring can probably be used to manipulate arbitrary blobs of bytes if the ISO-Latin-1 versions are used. ISO-Latin-1 is actually the raw bytes uninterpreted, which is why any binary data can be worked on in an ISO-Latin-1 oriented routine. Using UTF-8 encoding as default would narrow the use of the default functions to only work on real text data.

  • The pure ISO-Latin-1 implementations of the functions will be the most efficient ones as no data checking at all is needed. Any byte value is acceptable in any version. Some functions are usable on UTF-8 strings even though they expect ISO-Latin-1 data. The only difference between the ISO-Latin-1 version and the UTF-8 version is the validation of input data. If the data given to, for example, bstring:concat is already checked for correct UTF-8, the simpler ISO-Latin-1 version of the function is both more efficient and guaranteed to give output as correct as the input:

        CorrectUtf8_1 = give_me_good_string(),
        CorrectUtf8_2 = give_me_another_good_string(),
        CorrectUtf8_3 = bstring:concat(CorrectUtf8_1, latin1, CorrectUtf8_2, latin1, latin1),
        ...
    

    Simply put, ISO-Latin-1 versions of the functions are more generally useful than pure UTF-8 versions and are also more efficient.

  • A wrapper module providing pure UTF-8 interfaces can easily be written. The overhead of going via a wrapper would be relatively lower for a UTF-8 wrapper than for an ISO-Latin-1 ditto, as the overhead of character decoding/encoding of UTF-8 strings in the module would be quite high. Simply put, a wrapper would cost very little compared to the cost of checking the data for UTF-8 correctness.

    I actually suggest a module ubstring that has the part of the bstring interface where a default encoding is implied, but with the difference that UTF-8 is expected. For example, a function ubstring:tokens/2 would look like this:

    tokens(S,L) -> bstring:tokens(S,unicode,L).
    

    Quite simple.

To conclude, I think all functions should exist in a version where no encoding is supplied and ISO-Latin-1 encoded data is expected.

Languages

Even though Unicode characters can be used to express text in most known, living and dead scripts, language and region knowledge is a completely different thing. String interfaces often impose language specific properties of the string, like left-to-right writing direction, the notion of words built up by space separated groups of characters, ways of representing numbers and decimal points etc. As Erlang does not (yet) have a way of specifying such language-, or region-specific properties of a string, the interface should not contain language-dependent functionality. The string module did not originally contain such functions (except that character alignment functions were named left and right), but unfortunately functions like to_float and to_upper have been added.

I think that having language-dependent functions in the string module was a mistake and I do not want to make that mistake again. Hence I have not included such functions or names in bstring.

I rather suggest "Locale" functionality as a subject of a future EEP. For those who consider that simple, try to write a correct to_upper function for just all European languages, and make sure it works on all platforms that can run Erlang... Maybe not rocket science, but a lot of metadata is required. Data that is not always available in the underlying OS, but probably needs to be distributed with Erlang/OTP for consistent functionality. Definitely worth its own EEP.

Word separators

In connection with language independence, I think we should drop the notion of words as a group of characters separated by space. The word "token" is more general and does not in the same way indicate language constructs. The string module has the ASCII space character as a default for word separation, which I think should be dropped in bstring. Whatever should separate tokens should be supplied, possibly as alternatives. I therefore suggest the functions bstring:num_tokens and bstring:nth_token to fulfill the functionality of string:words and string:sub_word.

As in EEP 9, I suggest a new function split to handle the case of multi-character separators for tokens. A combination of split and join makes a convenient replace function too.
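How split and join combine into a replace can be sketched on raw bytes (that is, the ISO-Latin-1 case) with the existing binary:split/3. The module name bstring_replace_sketch and the simplified arities join/2 and replace/3 are assumptions made for illustration; the suggested interface takes additional encoding and Where arguments.

```erlang
-module(bstring_replace_sketch).
-export([join/2, replace/3]).

%% Join a list of binaries, putting Separator between the elements.
join([], _Separator) ->
    <<>>;
join([First | Rest], Separator) ->
    iolist_to_binary([First | [[Separator, B] || B <- Rest]]).

%% Replace every occurrence of any of the (possibly multi-byte)
%% Separators with Replacement: split at all occurrences, then join.
replace(BString, Separators, Replacement) ->
    Parts = binary:split(BString, Separators, [global]),
    join(Parts, Replacement).
```

Note that binary:split/3 works on bytes only, so this sketch says nothing about how mixed encodings would be handled.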

Left-to-right or Right-to-left

As mentioned earlier, I don't think direction of the graphical representation should be implied in the interface, which is why I suggest using notions like leading and trailing (meaning leading and trailing characters in the binary) rather than any directional notions. I also think aligning strings (like in string:right etc.) could be solved in one function align, taking one of the atoms leading, trailing or center as a parameter, if it should be implemented at all.

Naming and duplicated functionality

I definitely do not think we should have all interfaces from string duplicated to bstring. Especially interfaces that are aliases should not be carried along to the bstring module. Most functions in the string module however have short and fairly descriptive names, often similar to names found in other languages. I think using an r prefix for functionality working from the end of the string towards the beginning is a good choice, as is c for complement.

Byte oriented versus character oriented return values

Some functions in string, that are certainly useful, return numbers denoting character positions. The same functions should definitely be present in the bstring module and the return values should definitely be character oriented. However byte offsets are definitely useful, for example if we use a function like span to find the first character not in a set of characters, we might want the byte offset of that first character too.

I suggest adding some interfaces returning byte offsets, or part() values like the ones used in the binary module and by re, to cope with the need for byte offsets and lengths in some circumstances. A b suffix to the function name could denote such functionality, so that bstring:span returns a character position while bstring:spanb returns a byte position, and bstring:str returns a character position while bstring:strb returns a part(). Although this will in the end give rise to more functions in the interface, having return-type-changing options in an option list is not the way to go (I know, I have them in re, but it's still not generally a good idea...).
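The relationship between a character-oriented and a byte-oriented return value can be sketched with the bit syntax, using span and spanb as the example. The module name bstring_span_sketch and the helpers scan/5, next_char/2 and strip_leading/3 are hypothetical; this is an illustration of the idea, not a proposed implementation.

```erlang
-module(bstring_span_sketch).
-export([span/3, spanb/3, strip_leading/3]).

%% Number of leading characters that belong to the set Chars.
span(BString, Encoding, Chars) ->
    {NChars, _NBytes} = scan(BString, Encoding, Chars, 0, 0),
    NChars.

%% Same scan, but the result is counted in bytes.
spanb(BString, Encoding, Chars) ->
    {_NChars, NBytes} = scan(BString, Encoding, Chars, 0, 0),
    NBytes.

%% The byte count is what the bit syntax needs to cut the binary.
strip_leading(BString, Encoding, Chars) ->
    N = spanb(BString, Encoding, Chars),
    <<_:N/binary, Rest/binary>> = BString,
    Rest.

scan(<<>>, _Enc, _Chars, NChars, NBytes) ->
    {NChars, NBytes};
scan(Bin, Enc, Chars, NChars, NBytes) ->
    {Char, Size, Rest} = next_char(Bin, Enc),
    case lists:member(Char, Chars) of
        true -> scan(Rest, Enc, Chars, NChars + 1, NBytes + Size);
        false -> {NChars, NBytes}
    end.

%% Decode one character; returns {Codepoint, SizeInBytes, Rest}.
next_char(<<C, Rest/binary>>, latin1) ->
    {C, 1, Rest};
next_char(<<C/utf8, Rest/binary>> = Bin, Enc)
  when Enc =:= unicode; Enc =:= utf8 ->
    {C, byte_size(Bin) - byte_size(Rest), Rest};
next_char(_Bin, _Enc) ->
    erlang:error(badarg).
```

For a UTF-8 string starting with two ö characters, the character count is 2 while the byte count is 4, and only the latter can be used directly in a binary pattern.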

New or replaced functionality

When writing a general string module, there is no end to the new, more or less esoteric, functionality one could add. I think we, at least in an initial implementation, should stick to the functionality outlined in EEP 9, namely extending str and friends to optionally take a list of alternative strings to search for, add a function split to take care of multi-character separators (as opposed to single character separators in the function tokens) and a substitution function, which I think should be named replace as in other modules.

The use of pre-compiled matches from the binary module is however not a good idea, as the binary module has no notion of character encoding. Search strings need to be given in defined character encodings and both the "haystacks" and the "needles" encoding need to be known when doing an efficient search. So - no pre-compiled search expressions.

Excerpt of a suggested manual page

As made obvious above, I prefer the name bstring for a binary string module in favor of the more verbose name binary_string originally suggested. In that module bstring, I suggest the following interfaces, expressed as in a manual page of OTP.

DATA TYPES

    encoding() = latin1 | unicode | utf8
      - The encoding of characters in the binary data, both input and output
    bstring() 
      - Binary with characters encoded either in ISO-Latin-1 or UTF-8
    unicode_char() = non_negative_integer() 
      - An integer representing a valid unicode codepoint
    non_negative_integer()
      - An integer >= 0

EXPORTS

align(BString, Alignment, Number, Char) -> Result

align(BString, Encoding, Alignment, Number, Char) -> Result

Types:

BString = Result = bstring()
Encoding = encoding()
Alignment = leading | trailing | center
Number = non_negative_integer()
Char = unicode_char()

Aligns the characters in BString in a Result of Number characters according to the Alignment parameter. Alignment is done by inserting the character Char in the beginning or end (or both) of the binary string.

The resulting binary string will contain exactly Number characters. The string is truncated if it contains more characters than Number: at the end if Alignment is leading, at the beginning if Alignment is trailing, or at both ends if Alignment is center. If Encoding is unicode, the Result may well contain more bytes than Number, as one character may require several bytes.

Example:

> bstring:align(<<"Hello">>, latin1, center, 10, $.).
<<"..Hello...">>

If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.

Raises a badarg exception if BString does not contain characters encoded according to the Encoding parameter, Encoding or Alignment has an invalid value, the character Char cannot be encoded in the character encoding given as Encoding or any of the parameters are of the wrong type.

chr(BString, Character) -> Position

chr(BString, Encoding, Character) -> Position

rchr(BString, Character) -> Position

rchr(BString, Encoding, Character) -> Position

Types:

BString = bstring()
Encoding = encoding()
Character = unicode_char()
Position = integer()

Returns the (zero-based) character position of the first/last occurrence of Character in BString. -1 is returned if Character does not occur.

Note that the character position is not the same as the byte position. Use the chrb and rchrb functions to get the byte positions.

If Character cannot be represented in the encoding, it is not an error; you are simply certain to get -1 as a return value.

If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.

Raises a badarg exception if the searched part of BString does not contain characters encoded according to the Encoding parameter, the Encoding has an invalid value, or any of the parameters are of the wrong type.

chrb(BString, Character) -> {BytePosition, ByteLength}

chrb(BString, Encoding, Character) -> {BytePosition, ByteLength}

rchrb(BString, Character) -> {BytePosition, ByteLength}

rchrb(BString, Encoding, Character) -> {BytePosition, ByteLength}

Types:

BString = bstring()
Encoding = encoding()
Character = unicode_char()
BytePosition = integer()
ByteLength = non_negative_integer()

Works as chr and rchr respectively, but returns the byte position and byte length of the character.

If the character is not found, {-1,0} is returned.
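The difference between chr and chrb can be illustrated by a small sketch that tracks the character position and the byte position in the same scan. The module name bstring_chr_sketch and the helpers find/5 and next_char/2 are hypothetical, and only the forward-searching variants are shown.

```erlang
-module(bstring_chr_sketch).
-export([chr/3, chrb/3]).

%% Zero-based character position of the first occurrence, or -1.
chr(BString, Encoding, Char) ->
    {CharPos, _BytePos, _ByteLen} = find(BString, Encoding, Char, 0, 0),
    CharPos.

%% {BytePosition, ByteLength} of the first occurrence, or {-1,0}.
chrb(BString, Encoding, Char) ->
    {_CharPos, BytePos, ByteLen} = find(BString, Encoding, Char, 0, 0),
    {BytePos, ByteLen}.

find(<<>>, _Enc, _Char, _CharPos, _BytePos) ->
    {-1, -1, 0};
find(Bin, Enc, Char, CharPos, BytePos) ->
    case next_char(Bin, Enc) of
        {Char, Size, _Rest} ->
            {CharPos, BytePos, Size};
        {_Other, Size, Rest} ->
            find(Rest, Enc, Char, CharPos + 1, BytePos + Size)
    end.

%% Decode one character; returns {Codepoint, SizeInBytes, Rest}.
next_char(<<C, Rest/binary>>, latin1) ->
    {C, 1, Rest};
next_char(<<C/utf8, Rest/binary>> = Bin, Enc)
  when Enc =:= unicode; Enc =:= utf8 ->
    {C, byte_size(Bin) - byte_size(Rest), Rest};
next_char(_Bin, _Enc) ->
    erlang:error(badarg).
```

Because only the scanned part is decoded, encoding errors after the first occurrence go unnoticed in this sketch, which matches the "searched part" wording of the badarg description.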

concat(BString1, BString2) -> BString3

concat(BString1, Encoding1, BString2, Encoding2, Encoding3) -> BString3

Types:

BString1 = BString2 = BString3 = bstring()
Encoding1 = Encoding2 = Encoding3 = encoding()

Concatenates two binary strings to form a new string. Returns the new binary string in the encoding given by Encoding3.

If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.

Raises a badarg exception if BString1 or BString2 does not contain characters encoded according to the Encoding1 and Encoding2 parameters, the encoding parameters have an invalid value, the codepoints in the in-parameters cannot be represented in the output encoding or any of the parameters are of the wrong type.

equal(BString1, BString2) -> bool()

equal(BString1, Encoding1, BString2, Encoding2) -> bool()

Types:

BString1 = BString2 = bstring()
Encoding1 = Encoding2 = encoding()

Tests whether two binary strings are equal. Returns true if they are, otherwise false.

Encoding1 is the encoding of BString1 and Encoding2 is the encoding of BString2.

Note that the strings can have different encoding and that it is the character values encoded in the strings that are compared. The binary strings are scanned as long as they are equal, meaning that if the function returns true, both strings are correctly encoded, while a return value of false does not guarantee correct encoding in both binary strings. An exception is raised if faulty encoding is determined while comparing the strings, not if parts of the string not inspected contain encoding errors.

If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.

Raises a badarg exception if wrongly encoded characters, according to the encoding parameters, are encountered during comparison, the encoding parameters have an invalid value or any of the parameters are of the wrong type.
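The comparison of character values across encodings, stopping at the first difference, can be sketched as follows. The module name bstring_equal_sketch and the helper next_char/2 are hypothetical. As a simplification, this sketch does not validate the encoding atoms when one of the strings is exhausted.

```erlang
-module(bstring_equal_sketch).
-export([equal/4]).

%% Compare codepoint by codepoint; decoding stops at the first mismatch,
%% so encoding errors beyond that point are never inspected.
equal(<<>>, _E1, <<>>, _E2) -> true;
equal(<<>>, _E1, _B2, _E2) -> false;
equal(_B1, _E1, <<>>, _E2) -> false;
equal(B1, E1, B2, E2) ->
    {C1, Rest1} = next_char(B1, E1),
    {C2, Rest2} = next_char(B2, E2),
    C1 =:= C2 andalso equal(Rest1, E1, Rest2, E2).

%% Decode one character; returns {Codepoint, Rest}.
next_char(<<C, Rest/binary>>, latin1) ->
    {C, Rest};
next_char(<<C/utf8, Rest/binary>>, Enc)
  when Enc =:= unicode; Enc =:= utf8 ->
    {C, Rest};
next_char(_Bin, _Enc) ->
    erlang:error(badarg).
```

With this sketch, an ISO-Latin-1 string and a UTF-8 string holding the same characters compare as equal even though their bytes differ.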

join(BStringList, Separator) -> Result

join(BStringList, BStringListEncoding, Separator, SeparatorEncoding, ResultEncoding) -> Result

Types:

BStringList = [bstring()]
BStringListEncoding = SeparatorEncoding = ResultEncoding = encoding()
Separator = bstring()
Result = bstring()

Returns a binary string with the elements of BStringList separated by the binary string in Separator.

All the binary strings in BStringList should have the same encoding (given as BStringListEncoding). The Separator can however have a different encoding (given as SeparatorEncoding), as can the Result (given as ResultEncoding).

Example:

> bstring:join([<<"one">>, <<"two">>, <<"three">>], latin1, <<", ">>, latin1, latin1).
<<"one, two, three">>

If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.

Raises a badarg exception if binary strings in BStringList or the Separator do not contain characters encoded according to the BStringListEncoding and SeparatorEncoding parameters respectively, the encoding parameters have an invalid value, the codepoints in the in-parameters cannot be represented in the output encoding ResultEncoding or any of the parameters are of the wrong type.

len(BString) -> Length

len(BString, Encoding) -> Length

Types:

BString = bstring()
Encoding = encoding()
Length = non_negative_integer()

Returns the number of characters in the binary string.

If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.

Raises a badarg exception if BString does not contain characters encoded according to the Encoding parameter, the Encoding has an invalid value or any of the parameters are of the wrong type.
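A minimal sketch of len shows why the ISO-Latin-1 version needs no data checking while the UTF-8 version has to decode every character. The module name bstring_len_sketch and the helper count/2 are hypothetical.

```erlang
-module(bstring_len_sketch).
-export([len/1, len/2]).

%% Default encoding is latin1: every byte is one character.
len(BString) ->
    len(BString, latin1).

len(BString, latin1) when is_binary(BString) ->
    byte_size(BString);
len(BString, Enc) when is_binary(BString),
                       (Enc =:= unicode orelse Enc =:= utf8) ->
    count(BString, 0);
len(_BString, _Enc) ->
    erlang:error(badarg).

%% Count UTF-8 encoded characters, raising badarg on invalid data.
count(<<>>, N) -> N;
count(<<_/utf8, Rest/binary>>, N) -> count(Rest, N + 1);
count(_Invalid, _N) -> erlang:error(badarg).
```

The same five bytes can thus be five characters when read as ISO-Latin-1 and four characters when read as UTF-8.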

nth_token(BString, N, CharList) -> Result

nth_token(BString, Encoding, N, CharList) -> Result

Types:

BString = Result = bstring()
Encoding = encoding()
CharList = [ unicode_char() ]
N = non_negative_integer()

Returns the token number N of BString (zero-based). Tokens are separated by the characters in CharList.

The returned token will have the same encoding as BString.

For example:

> bstring:nth_token(<<" Hello old boy !">>,latin1,3,[$o, $ ]).
<<"y">>

CharList is to be viewed as a set of characters; order is not significant. It is not an error to give codepoints in CharList that cannot be represented by the Encoding.

Values of N >= number of tokens in BString will result in the empty binary string <<>> being returned.

If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.

Raises a badarg exception if BString does not contain characters encoded according to the Encoding parameter, the Encoding has an invalid value, or any of the parameters are of the wrong type.

num_tokens(BString, CharList) -> Count

num_tokens(BString, Encoding, CharList) -> Count

Types:

BString = bstring()
Encoding = encoding()
CharList = [ unicode_char() ]
Count = non_negative_integer()

Returns the number of tokens in BString, separated by the characters in CharList.

The result is the same as for length(bstring:tokens(BString,Encoding,CharList)), but avoids building the result.

For example:

> bstring:num_tokens(<<" Hello old boy!">>, latin1, [$o, $ ]).
4

CharList is to be viewed as a set of characters; order is not significant. It is not an error to give codepoints in CharList that cannot be represented by the Encoding.

If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.

Raises a badarg exception if BString does not contain characters encoded according to the Encoding parameter, the Encoding has an invalid value, or any of the parameters are of the wrong type.
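For the ISO-Latin-1 case, the token functions can be sketched with the existing binary:split/3, whose trim_all option drops the empty parts that arise between adjacent separators. The module name bstring_tokens_sketch and the reduced arities (no Encoding argument) are assumptions for illustration. The sketch requires a non-empty CharList with codepoints below 256, and unlike the suggested num_tokens it does build the token list.

```erlang
-module(bstring_tokens_sketch).
-export([tokens/2, num_tokens/2, nth_token/3]).

%% Split at every separator character and drop empty tokens.
tokens(BString, CharList) ->
    Separators = [<<C>> || C <- CharList],
    binary:split(BString, Separators, [global, trim_all]).

%% Simplification: counts by building the list of tokens.
num_tokens(BString, CharList) ->
    length(tokens(BString, CharList)).

%% Zero-based token lookup; <<>> if N >= number of tokens.
nth_token(BString, N, CharList) when is_integer(N), N >= 0 ->
    Tokens = tokens(BString, CharList),
    case N < length(Tokens) of
        true -> lists:nth(N + 1, Tokens);
        false -> <<>>
    end.
```

In this sketch num_tokens/2 is by construction equal to the length of tokens/2, which is the relationship stated for the suggested functions.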

span(BString, Chars) -> Length

span(BString, Encoding, Chars) -> Length

rspan(BString, Chars) -> Length

rspan(BString, Encoding, Chars) -> Length

cspan(BString, Chars) -> Length

cspan(BString, Encoding, Chars) -> Length

rcspan(BString, Chars) -> Length

rcspan(BString, Encoding, Chars) -> Length

Types:

BString = bstring()
Encoding = encoding()
Chars = [ integer() ]
Length = non_negative_integer()

Returns the length (in characters) of the maximum initial (span and cspan) or trailing (rspan and rcspan) segment of BString, which consists entirely of characters from (span and rspan), or not from (cspan and rcspan) Chars.

Chars is to be viewed as a set of characters; order is not significant. It is not an error to give codepoints in Chars that cannot be represented by the Encoding.

For example:

> bstring:span(<<"\t    abcdef">>,latin1," \t").
5
> bstring:cspan(<<"\t    abcdef">>,latin1, " \t").
0

Codepoints in Chars that cannot be represented by Encoding are not considered an error.

If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.

Raises a badarg exception if the searched part of BString does not contain characters encoded according to the Encoding parameter, the Encoding has an invalid value, or any of the parameters are of the wrong type.

spanb(BString, Chars) -> ByteLength

spanb(BString, Encoding, Chars) -> ByteLength

rspanb(BString, Chars) -> ByteLength

rspanb(BString, Encoding, Chars) -> ByteLength

cspanb(BString, Chars) -> ByteLength

cspanb(BString, Encoding, Chars) -> ByteLength

rcspanb(BString, Chars) -> ByteLength

rcspanb(BString, Encoding, Chars) -> ByteLength

Types:

BString = bstring()
Encoding = encoding()
Chars = [ integer() ]
ByteLength = non_negative_integer()

Work exactly as the functions span, rspan, cspan and rcspan respectively, but return the number of bytes rather than the number of characters.

split(BString, Separators, Where) -> Tokens

split(BString, Encoding, Separators, SepEncoding, Where, ReturnEncoding) -> Tokens

Types:

BString = bstring()
Encoding = SepEncoding = ReturnEncoding = encoding()
Separators = [ bstring() ]
Where = first | last | all
Tokens = [bstring()]

Returns a list of tokens in BString, separated by the binary strings in Separators.

The Tokens returned are encoded according to ReturnEncoding.

Example:

> bstring:split(<<"abc defxxghix jkl">>, latin1, [<<"x">>,<<" ">>], latin1, all, latin1).
[<<"abc">>, <<"def">>, <<"ghi">>, <<"jkl">>]

Separators is to be viewed as a set of binary strings; order is not significant. It is not an error to give codepoints in Separators that cannot be represented by the Encoding.

The Where parameter specifies at which occurrence of any of the Separators the binary string is to be split, either at the first occurrence, the last occurrence or at all occurrences, in which case the Tokens may be an arbitrarily long list.

If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.

Raises a badarg exception if BString or Separators do not contain characters encoded according to the Encoding and SepEncoding parameters respectively, the resulting tokens cannot be encoded in the ReturnEncoding, any of the encoding parameters has an invalid value, or any of the parameters are of the wrong type.

str(BString, SubBStrings) -> Position

str(BString, Encoding, SubBStrings, SubEnc) -> Position

rstr(BString, SubBStrings) -> Position

rstr(BString, Encoding, SubBStrings, SubEnc) -> Position

Types:

BString = bstring()
SubBStrings = bstring() | [ bstring() ]
Encoding = SubEnc = encoding()
Position = integer()

Returns the (zero-based) character position where the first/last occurrence of any of the SubBStrings begins in BString. -1 is returned if none of the SubBStrings occurs in BString.

Note that the character position is not the same as the byte position. Use the strb and rstrb functions to get the byte positions.

The encoding need not be the same for BString and SubBStrings; however, all strings in SubBStrings need to have the same encoding.

If the codepoints in SubBStrings cannot be represented in the encoding of BString, that is not an error, but will always result in the return value -1.

Example:

> bstring:str(<<" Hello Hello World World ">>,latin1,<<"Hello World">>,latin1).
7

Note that if both encodings are the same and repeated searches with the same SubBStrings are to be performed, it is more efficient to use the binary:match/{2,3} functions with a precompiled pattern on the raw binary data.

If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.

Raises a badarg exception if the searched part of BString or SubBStrings does not contain characters encoded according to the Encoding and SubEnc parameters, the Encoding or SubEnc has an invalid value, or any of the parameters are of the wrong type.

strb(BString, SubBStrings) -> {BytePosition, ByteLength}

strb(BString, Encoding, SubBStrings, SubEnc) -> {BytePosition, ByteLength}

rstrb(BString, SubBStrings) -> {BytePosition, ByteLength}

rstrb(BString, Encoding, SubBStrings, SubEnc) -> {BytePosition, ByteLength}

Types:

BString = bstring()
SubBStrings = bstring() | [ bstring() ]
Encoding = SubEnc = encoding()
BytePosition = integer()
ByteLength = non_negative_integer()

Works as str and rstr respectively, but returns the byte position and byte length of the found substring.

Note that ByteLength is the length the found substring has in BString, regardless of the encoding of SubBStrings, so ByteLength may be either larger or smaller than the byte_size of the matching SubBString, depending on the encoding of the binary string.

If the substring is not found, {-1,0} is returned.
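
The difference between byte positions and character positions only shows up in UTF-8. The sketch below illustrates it for the case where BString and a single SubBString are both UTF-8 encoded. It is an illustration under that assumption, not the proposed implementation; the module strb_sketch and the helper char_pos/2 are hypothetical.

```erlang
-module(strb_sketch).
-export([strb/2, char_pos/2]).

%% Hypothetical stand-in for strb when both strings are UTF-8.
%% A valid UTF-8 needle can only match at a character boundary
%% of a valid UTF-8 subject, so a plain byte search is enough.
strb(BString, SubBString) ->
    case binary:match(BString, SubBString) of
        {BytePos, ByteLen} -> {BytePos, ByteLen};
        nomatch            -> {-1, 0}
    end.

%% Converts a byte position into a character position by decoding
%% the prefix. This is O(N) in the size of the prefix and fails
%% with an exception if the prefix is not valid UTF-8.
char_pos(_BString, -1) ->
    -1;
char_pos(BString, BytePos) ->
    Prefix = binary:part(BString, 0, BytePos),
    length(unicode:characters_to_list(Prefix, utf8)).
```

For a subject that starts with three two-byte characters followed by a space, a match directly after the space is at byte position 7 but character position 4.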

strip(BString, Which, CharList) -> Result

strip(BString, Encoding, Which, CharList) -> Result

Types:

BString = Result = bstring()
Encoding = encoding()
Which = leading | trailing | both
CharList = [ unicode_char() ]

Removes leading (Which = leading), trailing (Which = trailing) or both leading and trailing (Which = both) characters belonging to the set indicated by CharList from the binary string BString.

This is essentially the same as using spanb and/or rspanb in combination with bit syntax to remove the characters.

Example:

> bstring:strip(<<"...He.llo.....">>, latin1, both, [$.]).
<<"He.llo">>

If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.

Raises a badarg exception if the scanned part of BString does not contain characters encoded according to the Encoding parameter, Encoding or Which has an invalid value, or any of the parameters are of the wrong type.
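
For latin1 the stripping logic can be sketched with bit syntax alone. The code below is a minimal illustration under that assumption; the module strip_sketch is hypothetical, and unlike the proposed function it fails with function_clause rather than badarg when Which is invalid.

```erlang
-module(strip_sketch).
-export([strip/3]).

%% Hypothetical latin1-only stand-in for the proposed bstring:strip/3.
strip(BString, leading, Chars) when is_binary(BString) ->
    strip_leading(BString, Chars);
strip(BString, trailing, Chars) when is_binary(BString) ->
    strip_trailing(BString, Chars);
strip(BString, both, Chars) when is_binary(BString) ->
    strip_trailing(strip_leading(BString, Chars), Chars).

%% Drop bytes from the front while they belong to Chars.
strip_leading(<<C, Rest/binary>> = B, Chars) ->
    case lists:member(C, Chars) of
        true  -> strip_leading(Rest, Chars);
        false -> B
    end;
strip_leading(<<>>, _Chars) ->
    <<>>.

%% Drop bytes from the back while they belong to Chars.
strip_trailing(<<>>, _Chars) ->
    <<>>;
strip_trailing(B, Chars) ->
    case lists:member(binary:last(B), Chars) of
        true  -> strip_trailing(binary:part(B, 0, byte_size(B) - 1), Chars);
        false -> B
    end.
```

Characters from CharList that occur inside the string, such as the dot in "He.llo", are left untouched because scanning stops at the first character outside the set.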

replace(BString, Separators, Replacement, Where) -> Result

replace(BString, Encoding, Separators, SeparatorsEncoding, Replacement, ReplacementEncoding, Where, ResultEncoding) -> Result

Types:

BString = bstring()
Encoding = SeparatorsEncoding = ReplacementEncoding = ResultEncoding = encoding()
Separators = [ bstring() ]
Replacement = bstring()
Where = first | last | all
Result = bstring()

Produces the same result as

bstring:join(bstring:split(BString,Encoding,Separators,SeparatorsEncoding,Where,
                           unicode),
             unicode,Replacement,ReplacementEncoding,ResultEncoding)

but with less overhead.
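
The split-then-join equivalence can be illustrated with functions that already exist in stdlib. The sketch below covers only the latin1 case with Where = all, and it assumes that the proposed split behaves like binary:split/3 with the global option for single-byte data. The module replace_sketch is hypothetical, and lists:join/2 requires OTP 19 or later.

```erlang
-module(replace_sketch).
-export([replace_all/3]).

%% Hypothetical latin1-only stand-in for the proposed replace with
%% Where = all: split on every separator, then join the parts with
%% the replacement in between.
replace_all(BString, Separators, Replacement) when is_binary(BString) ->
    Parts = binary:split(BString, Separators, [global]),
    iolist_to_binary(lists:join(Replacement, Parts)).
```

For this restricted case the result matches binary:replace/4 with the global option, which does the same work without building the intermediate list.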

substr(BString, Start, Length) -> SubBString

substr(BString, Encoding, Start, Length) -> SubBString

Types:

BString = SubBString = bstring()
Encoding = encoding()
Start = integer()
Length = non_negative_integer() | infinity

Returns a substring of BString, starting at the zero-based character position Start, and ending at the end of the binary string (if Length is infinity) or up to, but not including, the character position Start+Length (if Length is a non-negative integer).

The returned SubBString will have the same encoding as BString.

Example:

> bstring:substr(<<"Hello World">>, latin1, 3, 5).
<<"lo Wo">>

A negative value of Start denotes abs(Start) characters from the end of BString, so that -1 is the last character position in the binary string.

Example:

> bstring:substr(<<"Hello World">>, latin1, -3, 3).
<<"rld">>

As the true length of a UTF-8 encoded binary string is quite costly to determine (O(N), where N is the number of bytes in the binary), the function is very forgiving about positions given outside of the string, for both Start and Length. Character positions outside of the string in either direction are collapsed to the empty binary string.

Examples:

> bstring:substr(<<"01234">>, latin1, 5, 5).
<<>>
> bstring:substr(<<"01234">>, latin1, 4, 5).
<<"4">>
> bstring:substr(<<"01234">>, latin1, -5, 100).
<<"01234">>
> bstring:substr(<<"01234">>, latin1, -6, 1).
<<>>
> bstring:substr(<<"01234">>, latin1, -6, 2).
<<"0">>

If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.

Raises a badarg exception if the searched part of BString does not contain characters encoded according to the Encoding parameter, the Encoding has an invalid value, or any of the parameters are of the wrong type.
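
The forgiving treatment of out-of-range positions amounts to clipping a half-open character range to the string. The following is a minimal latin1-only sketch of that idea, not the proposed implementation; the module substr_sketch is hypothetical, and an invalid Length fails with case_clause here rather than badarg.

```erlang
-module(substr_sketch).
-export([substr/3]).

%% Hypothetical latin1-only stand-in for the proposed bstring:substr/3.
substr(BString, Start, Length) when is_binary(BString), is_integer(Start) ->
    Size = byte_size(BString),
    %% A negative Start counts from the end; -1 is the last character.
    From = if Start < 0 -> Size + Start;
              true      -> Start
           end,
    To = case Length of
             infinity                     -> Size;
             L when is_integer(L), L >= 0 -> From + L
         end,
    %% Clip the half-open range [From, To) to [0, Size).
    ClippedFrom = max(0, min(From, Size)),
    ClippedTo = max(ClippedFrom, min(To, Size)),
    binary:part(BString, ClippedFrom, ClippedTo - ClippedFrom).
```

Because the range is computed before clipping, a Start one position before the string with Length 1 yields the empty binary, while the same Start with Length 2 yields the first character.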

tokens(BString, SeparatorList) -> Tokens

tokens(BString, Encoding, SeparatorList) -> Tokens

Types:

BString = bstring()
Encoding = encoding()
SeparatorList = [ non_negative_integer() ]
Tokens = [bstring()]

Returns a list of tokens in BString, separated by the characters in SeparatorList.

The Tokens returned are encoded in the same character encoding as BString.

Example:

> bstring:tokens(<<"abc defxxghix jkl">>, latin1, [$x,$ ]).
[<<"abc">>, <<"def">>, <<"ghi">>, <<"jkl">>]

SeparatorList is to be viewed as a set of characters; order is not significant. It is not an error for SeparatorList to contain codepoints that cannot be represented in the Encoding.

If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.

Raises a badarg exception if the searched part of BString does not contain characters encoded according to the Encoding parameter, the Encoding has an invalid value, or any of the parameters are of the wrong type.
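
For latin1 the tokenizing behaviour can be approximated with binary:split/3. The sketch below is only an illustration under that assumption; the module tokens_sketch is hypothetical, and the trim_all option requires OTP 18 or later.

```erlang
-module(tokens_sketch).
-export([tokens/2]).

%% Hypothetical latin1-only stand-in for the proposed bstring:tokens/2.
tokens(BString, SeparatorList) when is_binary(BString), is_list(SeparatorList) ->
    %% Codepoints above 255 cannot occur in a latin1 string, so they
    %% are dropped from the separator set instead of raising an error.
    case [<<C>> || C <- SeparatorList, C =< 255] of
        [] when BString =:= <<>> ->
            [];
        [] ->
            %% No usable separators: the whole string is one token.
            [BString];
        Patterns ->
            %% trim_all removes the empty parts that adjacent
            %% separators would otherwise leave behind.
            binary:split(BString, Patterns, [global, trim_all])
    end.
```

Each separator character becomes its own one-byte pattern, which is what makes SeparatorList behave as a set rather than as a single multi-character delimiter.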

Performance

This module can, and probably should, be implemented entirely in Erlang; no BIFs or NIFs are needed. Both the binary and unicode modules can be utilized to speed up conversion and input checking. The Unicode versions will definitely be slower than the ISO-Latin-1 versions, as character encoding, decoding and checking are bound to produce overhead.

The suggested wrapper ubstring should not impose any significant cost compared to calling bstring with all encoding arguments set to unicode.

The idea is to make string manipulation using binaries convenient as it has a great positive impact on systems memory-wise. Increased speed compared to list-oriented strings is not the goal, although it may well be a side-effect.

Reference implementation

No specific reference implementation has been made. The code will, however, be made available on GitHub during any development.

Copyright

This document is licensed under the Creative Commons license.