Author: Fredrik Svahn <Fredrik(dot)Svahn(at)gmail>
Status: Draft
Type: Standards Track
Created: 28-Dec-2007
Erlang-Version: R12B-2
Post-History:

EEP 9: Library for working with binaries

Abstract

This EEP suggests the addition of two binary help libraries with built-in functions for time critical activities such as searching and splitting erlang binaries as well as library functions for common operations on binaries. The EEP also suggest the addition of a regular expressions library using built in functions.

Rationale

For the lists data type there is a help library providing functions for common operations such as searching and splitting lists. This EEP suggests that a similar set of library functions should be created for binaries. Many of the proposed functions are based on answers to questions regarding binaries on the erlang-questions mailing list, e.g. "how do I convert a number to a binary?".

Motivation

Since binaries are typically used for time critical activities on larger amounts of data it is suggested that some operations on binaries are implemented as built-in functions, BIF:s.

Specifically there seems to be a huge interest in the community for an efficient regexp implementation for searching binaries. Also for maximum performance when searching and splitting binaries it is suggested that the the regexp search function is complemented by a high performance function for simple searches, e.g. locating and splitting binaries on newline characters. Tests show that e.g. the Boyer-Moore algorithm may be significantly faster than regular expression algorithms for such purposes.

When reviewing the EEP it is clear that there is also a strong demand for string operations on binaries for better performance.

The reference implementation sent separately to the OTP team gives an indication of the expected performance improvements compared to e.g. the current regular expression module for searching on lists. Some results are available at the end of this EEP.

Suggested Changes

This EEP suggests the addition of two new modules; one module named binary and one called binary_string.

The EEP also suggests a new regular expression library based on Perl Compatible Regular Expressions (PCRE). The library should be able to operate both on binary_strings and on strings.

Finally, the following functions should be added to the erlang module:

binary_to_atom(Binary) -> Atom
atom_to_binary(Atom) -> Binary
binary_to_existing_atom(Binary) -> Atom

Not Included

At the moment the following is not included in the EEP:
- Support for different encodings, e.g. UTF-8
- Changes to the string module

The "binary_string" Module

The binary_string module should be based on the current string module but should operate on strings represented by binaries as opposed to the current strings module which operates on strings represented by lists.

Apart from operating on binaries the interface of binary_string should be the same as for string with the following exceptions:

  1. str/2 and rstr/2 should be modified to optionally take a list of binaries or a MatchSpec such as the one returned by binary:match_compile/2 as second argument. If the Keys argument corresponds to several keys the function should return a tuple indicating the Key that matched and the matching Index, i.e.

    str(Binary, Keys) -> Return
    rstr(Binary, Keys) -> Return
    

    Binary = binary() Keys = Key | [ Key ] | MatchSpec Key = string() | binary() MatchSpec = tuple() as returned by binary:match_compile/1 Return = Index | {NeedleNumber, Index} Index = integer()

    str/rstr should be implemented as built-in functions using efficient algorithms such as Boyer-Moore, Aho-Corasick or similar. Typically the function could be built on binary:match/2.

  2. A new function split should be added. It should behave as tokens/2 but take a list of separator binaries/strings instead of a list of separator characters.

    split(Binary, SplitKeys) -> List
    
    
    Binary = binary()
    SplitKeys = Key | [ Key ] | MatchSpec
    Key = string() | binary()
    MatchSpec = tuple() as returned by binary:match_compile/1
    List = [ binary() ]
    

    Splits Binary into a list of binaries based on matching the pattern specified in the SplitKeys binary.

    Examples:

    binary_string:split(<<"cat and dog">>, <<"and">>). [<<"cat ">>, <<" dog">>]

    > binary_string:split(<<"cat and dog">>, "and").
    [<<"cat ">>, <<" dog">>] 
    
    
    > binary_string:split(<<"cat and dog">>,["a","n",<<"d">>]).
    [<<"c">>,<<"t ">>,<<" ">>,<<"og">>]
    

    The resulting list should be the same as for regexp:split/2 (with the obvious exception for special characters such as "*", ".", "^", etc).

    Please note that the third example should give the same result as binary_string:tokens(<<"cat and dog">>, "and").

  3. The new functions substitute and globally_substitute should be added.

     substitute(OldBinary, Key, Replacement)-> NewBinary
    
    
     OldBinary, NewBinary, Replacement = binary()
     Keys = binary() | [ binary() ] | MatchSpec
     MatchSpec = tuple() as returned by binary:match_compile/1
    

    Creates a binary NewBinary from OldBinary by substituting the first occurence of any of the binaries in Keys in OldBinary with the Replacement binary.

    The Replacement binary need not have the same size as the matched Key.

    Example:

    binary_string:substitute(<<"cat anf dog">>,<<"anf">>,<<"and">>). [<<"cat and dog">>]

    globally_substitute(OldBinary, Key, Replacement)-> NewBinary
    
    
    OldBinary, NewBinary, Replacement = binary()
    Keys = binary() | [ binary() ] | MatchSpec
    MatchSpec = tuple() as returned by binary:match_compile/1
    

    Same as substitute except that all non-overlapping occurrences of a subbinary in OldBinary are replaced by the Replacement binary.

    It is suggested that the same functions are also added to the string module, but this is out of the scope of this EEP.

The "binary" Module

The interface of the binary module should have the following exported functions (please note that some functions are intentionally the same as in binary_string since it is believed they can be useful both for string and binary data manipulation):

match(Binary, Keys) -> Return
match(Binary, Keys, {StartIndex, EndIndex}) -> Return

Binary = binary()
Keys = binary() | [ binary() ] | MatchSpec
MatchSpec = tuple() as returned by binary:match_compile/1
StartIndex = EndIndex = integer()

Return = Index | {KeyNumber, Index}
Index = KeyNumber = integer()

Returns position of first occurence in Binary of the first matching binary in Keys or 0 if no match. If a list of keys is given, the function will return a tuple with the KeyNumber of the matched Key and the position in Binary where it was found.

There has been a discussion on whether the function should return the matched Key instead of the KeyNumber. Returning the KeyNumber should be slightly more efficient, and since the matched key can easily be retrieved by lists:nth(KeyNumber, Keys) if needed it is suggested that the function returns the KeyNumber.

Binary is searched from StartIndex to EndIndex. If StartIndex and EndIndex are not specified the default is to search Binary from the beginning to the end.

Example:

binary:match(<<1,2,3,0,0,0,4>>, <<0,0,0>>). 4

> binary:match(<<1,2,255,0,0,0,4,255>>, [<<0,0,0>>, <<255>>]).
{2, 3}

Suggestions on implementation: Should be implemented as one or more BIF:s using e.g. Boyer-Moore, Aho-Corasick or similar efficient algorithms.

matches(Binary, Keys) -> Return
matches(Binary, Keys, {StartIndex, EndIndex}) -> Return

Binary = binary()
Keys = binary() | [ binary() ] | MatchSpec
MatchSpec = tuple() as returned by binary:match_compile/1
StartIndex = EndIndex = integer() 

Return = [ Index ] | [ {KeyNumber, Index} ]
Index = KeyNumber = integer()

Finds all matches of the Keys in Haystack. Returns a list of the indexes for all non-overlapping ocurrences of the key or keys.

split(Binary, SplitKeys) -> List
split(Binary, SplitKeys, {StartIndex, EndIndex}) -> List

Binary = binary()
SplitKeys = binary() | [ binary() ] | MatchSpec
MatchSpec = tuple() as returned by binary:match_compile/1
StartIndex = EndIndex = integer()

List = [ binary() ]

Splits Binary into a list of binaries based on matching the pattern specified in SplitKeys.

Example:

binary:split(<<1,255,4,0,0,0,2,3>>, <<0,0,0>>). [<<1,255,4>>, <<2,3>>]

> binary:split(<<0,1,0,0,4,255,255,9>>, [<<0,0>>, <<255,255>>]).
[<<0,1>>,<<4>>,<<9>>]

The resulting list should basically be the same as for regexp:split/2 (with the obvious exception for special characters such as "*", ".", "^", etc).

The binaries in List are all subbinaries of Binary meaning that the data in Binary is not actually copied to new binaries.

substitute(OldBinary, Key, Replacement)-> NewBinary

OldBinary, NewBinary, Replacement = binary()
Keys = binary() | [ binary() ] | MatchSpec
MatchSpec = tuple() as returned by binary:match_compile/1

Creates a binary NewBinary from OldBinary by substituting the first occurence of any of the binaries in Keys in OldBinary with the Replacement binary.

The Replacement binary need not have the same size as the matched Key.

globally_substitute(OldBinary, Key, Replacement)-> NewBinary

OldBinary, NewBinary, Replacement = binary()
Keys = binary() | [ binary() ] | MatchSpec
MatchSpec = tuple() as returned by binary:match_compile/1

Same as substitute except that all non-overlapping occurrences of a subbinary in OldBinary are replaced by the Replacement binary.

match_compile(Keys) -> MatchSpec

Keys = binary() | [ binary() ] 
MatchSpec = tuple()

Builds an internal structure representing one or more search keys. The MatchSpec structure can be used to speed up searching if multiple searches with binary:match/2 or binary_string:str/2 are to be performed with the same search keywords.

binary:from_unsigned(Integer)-> Binary
binary:to_unsigned(Binary)-> Integer

Converts a positive integer the smallest possible representation in the binary data type format and vice versa.

Example:
> binary:from_unsigned(11111111). 
<<169,138,199>>

> binary:to_unsigned(<<169,138,199>>).
11111111


first(Binary1)-> Binary2
first(SizeBytes, Binary1)-> Binary2

Returns a subbinary with the first byte or the SizeBytes first bytes in Binary1.

Example:
> binary:first(2, <<"abc">>).                                  
<<"ab">>


last(Binary1)-> Binary2.
last(SizeBytes, Binary1)-> Binary2

Returns a subbinary with the last byte or the SizeBytes last bytes in Binary1.

Example:

binary:last(2, <<"abc">>).
<<"bc">>

nth(N, Binary) -> Value

N = integer(), 1 =< N =< size(Binary)
Value = integer()

Extracts a byte at position N from Binary. Same as

T = N-1,
<<_:T/binary, Value:Size/binary, _/binary>> = Binary, 
Value.

although this function is somewhat shorter and easier to write.

extract(N, Size, Binary) -> SubBinary

N = integer(), 1 =< N =< size(Binary)
Size = integer()
SubBinary = subbinary()

Returns a subbinary of size Size starting at position N from Binary. No data is copied in this operation.

It has been discussed if there should be a function for copying a part of a binary rather than getting a subbinary. This would make it possible to get a small part of a binary and let the rest be garbage collected. Since it is possible to achieve the same result by converting the extracted part to a list and then back again to a binary and it is a very specialized operation which may confuse new users it has been excluded at this stage.

When talking to designers many seem to prefer the name extract over the name subbinary for this function.

duplicate(N, Byte)-> Binary

Similar to lists:duplicate/2. Creates a new binary consisting of Byte repeated N times.

Example:

binary:duplicate(5, $a). <<"aaaaa">>

The Regular Expressions Library

It is suggested that a new regular expression library based on built in functions is added. It should have the following interface functions (name of the module to be decided, for backwards compatibility reasons it should probably exists in parallell with the old regexp module):

During a first round of feedback it has been suggested that the final implementation should be a built in function based on the Perl Compatible Regular Expressions (PCRE) library. It is optimised, well supported, and is more or less considered a standard today. It is used in a number of prominent products and projects, e.g. Apples Safari, Apache, KDE, PHP, Postfix and Nmap.

It is suggested that the module has the following exported functions:

compile(Regex) -> MatchSpec

Regex = string()
MatchSpec = tuple()

Builds an internal structure representing one or more search keys. The MatchSpec structure can be used to speed up searching if multiple searches are to be performed with the same search keywords.

match(BinOrString, RegExp)-> Return
match(BinOrString, RegExp, {StartIndex, EndIndex})-> Return

BinOrString = binary() | string()
RegExp = string() | MatchSpec
MatchSpec = tuple() as returned by match_compile/1
StartIndex = EndIndex = integer()
Return = 0 | {Start, Length, [CapturedPatterns]}

Finds the first, longest match of the regular expression RegExp in BinOrString. This function searches for the longest possible match and returns the first one found if there are several expressions of the same length.

The function supports pattern capturing. Patterns captured (if any) are returned in a list in the Return tuple.

Examples:
> binary:regex_match(<<"abcde">>, "b?cd").
{2,3,[]}

> binary:regex_match(<<"127.0.0.1">>, "(\d*)\.(\d*)\.").
{1,6,[<<"127">>, <<"0">>]}

Open questions: - It might be a good idea to add an Options parameter (optional of course), e.g. to specify that the partial matching feature should be activated - handling of Encodings.

matches(BinOrString, RegExp)-> Return
matches(BinOrString, RegExp, {StartIndex, EndIndex})-> Return

BinOrString = binary() | string()
RegExp = string() | MatchSpec
MatchSpec = tuple() as returned by match_compile/1
StartIndex = EndIndex = integer()

Return = 0 | [ {Start, Length, [CapturedPatterns]} ]

Finds all matches of the regular expression RegExp in BinOrString.

Example:

binary:regex_matches(<<"aaa">>, "a"). [{1,1,[]},{2,1,[]},{3,1,[]}]

sub(BinOrString, RegExp, Replacement)-> NewStringOrBinary

BinOrString = NewStringOrBinary = binary() | string()
RegExp = string() | MatchSpec
MatchSpec = tuple() as returned by match_compile/1
Replacement = string()

Substitutes the first occurence of a substring or subbinary matching RegExp in BinOrString with Replacement. A & in the Replacement string is replaced by the matched substring or subbinary of BinOrString. \& puts a literal & into the replacement string or binary. The type of NewStringOrBinary will be the same as the type of BinOrString.

gsub(BinOrString, RegExp, Replacement)-> Binary2

Same as sub except that all non-overlapping occurrences of a substring or subbinary matching RegExp in BinOrString are replaced by the string Replacement.

split(BinOrString, RegExp) -> List
split(BinOrString, RegExp, {StartIndex, EndIndex}) -> List

BinOrString = binary() | string()
RegExp = string() | MatchSpec
MatchSpec = tuple() as returned by match_compile/1
StartIndex = EndIndex = integer()

List = [ binary() ]

Splits Binary into a list of binaries based on the pattern specified in RegExp.

The resulting list should basically be the same as for regexp:split/2.

Performance

Performance was measured for the functions considered most important using the reference implementation. Some examples:

  1. Searching for a non-existing 1 and 3 byte binary in a ~1 Mb binary. Notice how binary:match/2 gets faster the longer the needle is thanks to the O(n/m) algorithm. All times in microseconds.

        Search for:        1 byte   3 bytes
    ---------------------------------------   
    binary:match/2:        17598      6045 
    binary:regex_first/2:  47299     46701
    string:str/2:          68969     69637
    regexp:first_match/2: 460858    887485
    
  2. Splitting a ~1 Mb binary on newline chars. This particular binary contained a newline every 60 chars on average.

    binary:split/2:  89142 microseconds
    regexp:split/2: 564911 microseconds
    
  3. Regex-DNA benchmark from computer language shootout

    prototype regexp bif:    1.9 seconds
    regexp module in R12B:  99.1 seconds
    

In the examples at the computer language shootout PCRE has a slightly lower performance compared to other algorithms such as the one in the reference implementation or in particular the one featured in TCL. This may not necessarily mean that this is true for all types of patterns.

Reference implementation

A reference implementation has been provided to the OTP team.

Copyright

This document is licensed under the Creative Commons license.