Author: Richard A. O'Keefe <ok(at)cs(dot)otago(dot)ac(dot)nz>
Status: Draft
Type: Standards Track
Erlang-Version: R12B-4
Created: 27-Aug-2008
Post-History:

EEP 22: Range checking for binaries

Abstract

A module may request that bit fields be range checked.

Specification

A new directive is added.

-bit_range_check(Wanted).

where Wanted is 'false' or 'true'.

Recall that a segment of a bit string (or binary) has the form

Value [':' Size] ['/' Type_Specifier_List]

where Type_Specifier_List includes such things as 'integer', 'signed', and 'unsigned'. Currently the documentation states that

"Signedness ... Only matters for matching and when the type
 is integer.  The default is unsigned."

Multiplying the Size by the Unit gives a Size_In_Bits. Section 6.16 of the on-line Erlang manual does not say so, but when a bit string is constructed only the bottom Size_In_Bits bits of an integer are used; the rest are quietly ignored.

The directive -bit_range_check(false) makes explicit the programmer's intention that this C-like truncation should happen.
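As an illustration of the truncation just described, here is a minimal sketch. The module name truncation_demo and its two functions are made up for this example; only the bit syntax itself is standard Erlang.

```erlang
-module(truncation_demo).
-export([byte_of/1, field/2]).

%% Build one unsigned 8-bit segment from any integer.
%% Only the low 8 bits of Value end up in the result.
byte_of(Value) when is_integer(Value) ->
    <<Value:8/unsigned-integer>>.

%% Build a Size-bit unsigned segment with unit 1.
%% Only the low Size bits of Value end up in the result.
field(Value, Size) when is_integer(Value), is_integer(Size), Size >= 0 ->
    <<Value:Size/unsigned-integer-unit:1>>.
```

With today's semantics, truncation_demo:byte_of(256) returns <<0>>, truncation_demo:byte_of(-1) returns <<255>>, and truncation_demo:field(5, 2) returns <<1:2>>. No error is reported in any of these cases.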

The directive -bit_range_check(true) says that it is a checked run-time error in

Value:Size/unsigned-integer-unit:1

or constructions otherwise equivalent to it if Value does not lie in the range 0 <= Value < 2**Size, and it is a checked run-time error in

Value:Size/signed-integer-unit:1

or constructions otherwise equivalent to it if Value does not lie in the range -(2**(Size-1)) <= Value < 2**(Size-1).

The error that is raised is like the error that would be raised for (1 div 0):Size/Type_Specifier_List except for using 'badrange' instead of 'badarith'.
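The two range conditions can be written out in ordinary Erlang. This is only a sketch of the intended check, not the proposed BEAM instruction: the module bit_range_sketch, the functions in_range/3 and checked_segment/3, and the argument list attached to the error are all assumptions made for illustration.

```erlang
-module(bit_range_sketch).
-export([in_range/3, checked_segment/3]).

%% True when Value fits in Size bits under the given signedness.
%% unsigned: 0 =< Value < 2**Size
%% signed:   -(2**(Size-1)) =< Value < 2**(Size-1)
%% Signed segments of size 0 are not handled here, because
%% 2**(Size-1) is not an integer in that case.
in_range(Value, Size, unsigned)
  when is_integer(Value), is_integer(Size), Size >= 0 ->
    Value >= 0 andalso Value < (1 bsl Size);
in_range(Value, Size, signed)
  when is_integer(Value), is_integer(Size), Size >= 1 ->
    Half = 1 bsl (Size - 1),
    Value >= -Half andalso Value < Half.

%% Build the segment only if Value is in range; otherwise raise
%% an error whose reason is 'badrange' (assumed error shape).
checked_segment(Value, Size, Signedness) ->
    case in_range(Value, Size, Signedness) of
        true  -> <<Value:Size/integer-unit:1>>;
        false -> erlang:error(badrange, [Value, Size, Signedness])
    end.
```

For example, bit_range_sketch:checked_segment(255, 8, unsigned) returns <<255>>, while bit_range_sketch:checked_segment(256, 8, unsigned) raises an error with reason badrange.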

The behaviour of integer bit syntax segments in the absence of a -bit_range_check directive is implementation defined and subject to change.

The BEAM system is extended with a new instruction or instructions similar to the existing instruction or instructions for integer segments but checking the range. The compiler is extended to generate them for <<...>> expressions within the scope of a -bit_range_check(true) directive.

A -bit_range_check directive may not appear after a bit syntax pattern or expression or after another -bit_range_check directive.
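A module using the directive might look like the following sketch. The module name checked_packets and the function header/2 are invented for the example. Note that a compiler which does not implement this proposal accepts the line as an ordinary user-defined attribute and ignores it, so the comments describe the proposed behaviour, not what current systems do.

```erlang
-module(checked_packets).
%% Proposed directive: it comes before any bit syntax in the module.
-bit_range_check(true).
-export([header/2]).

%% Pack a 4-bit version and a 12-bit length into two bytes.
%% Under this proposal, header(16, 0) would raise 'badrange',
%% because 16 does not fit in 4 unsigned bits.
%% Without the proposal, the same call quietly yields <<0,0>>.
header(Version, Length) ->
    <<Version:4/unsigned-integer, Length:12/unsigned-integer>>.
```

In-range arguments behave the same either way: checked_packets:header(1, 300) returns <<17,44>>.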

Motivation

It keeps on coming as an unpleasant surprise to Erlang programmers that this truncation happens. Quiet destruction of information is otherwise alien to Erlang: integer arithmetic is unbounded, not wrapped as in some (but not all) C systems; element/2 doesn't take indices modulo tuple size but raises an exception if the index is out of range, and so on.

In any case where the truncation is wanted, an Erlang programmer can already write

(Value band 255):8/unsigned-integer

and the Erlang compiler could notice this and optimise the 'band' operation away, so the truncation is not only unusual in Erlang, it is also unexpected in this particular case.
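A small sketch of explicit truncation follows. The module explicit_truncation and its functions are hypothetical names. It uses a mask with 'band', which yields a value in 0..255 for every integer; 'rem' keeps the sign of its left operand, so for example -1 rem 256 is -1, which would itself be out of range for an unsigned segment.

```erlang
-module(explicit_truncation).
-export([low_byte/1, same_as_implicit/1]).

%% Keep only the low 8 bits, and say so in the source.
%% The masked value is always in 0..255, so it would also pass
%% a range check on an 8-bit unsigned segment.
low_byte(Value) when is_integer(Value) ->
    <<(Value band 255):8/unsigned-integer>>.

%% Compare explicit masking with today's implicit truncation.
same_as_implicit(Value) when is_integer(Value) ->
    low_byte(Value) =:= <<Value:8/unsigned-integer>>.
```

Under today's semantics explicit_truncation:same_as_implicit/1 returns true for any integer, negative ones included, so code that wants truncation loses nothing by writing the mask.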

It is not only unexpected, it removes a chance to find mistakes, so it would seem to be undesirable.

Edwin Fine asked "How difficult could it be to add optional run-time checking to detect this condition without a serious risk of adverse effects on the correctness of Erlang run-time execution?"

Björn Gustavsson replied "it would be better to add optional support in the compiler to turn on checks (either for an entire module, or for individual segments of a binary). If someone writes an EEP, we will consider implementing it."

This is that EEP.

Rationale

The Erlang/OTP team regard the old behaviour as a feature, and wish to retain it. In particular, they wish modules that were written expecting the old behaviour to continue to work (for now) without modification.

One alternative would be to add new syntax, such as having a new 'checked' specifier, so that

Value/checked-unsigned-integer

would require a value in the range 0..255. But many Erlang programmers will want to use this as the normal case, and will not like the safe version being so much more effort to write than the unsafe version.

It appears that "truncation wanted/not wanted" is not a matter of this expression or that, but of this programmer or that, and we can expect that each module will be written by someone expecting only one behaviour or expecting only the other.

Adding a

-bit_range_check(true).

directive to a module is more work than doing nothing at all, but programmers who want this behaviour should be able to set up their editing environment to have this line in their template for creating new Erlang modules.

There are several questions:

  • Should this apply to bit strings as well as integers?
  • What should the name of the directive be?
  • What should the argument(s) of the directive be?
  • Should multiple instances of the directive be allowed in a module?

Bit strings: Assume X = <<5:3/unsigned-integer-unit:1>>. Currently, <<X:2/bits>> quietly truncates X, dropping bits from the right of X and giving <<2:2>>. If bit strings worked the same way as integers, you would expect <<1:2>>. This is certainly very odd.

Since integers are truncated on the left and padded on the left, we naturally expect bit strings, which are truncated on the right, to be padded on the right as well. But <<X:4/bits>> isn't <<10:4>>, it's a run-time exception. All very odd indeed.

It would certainly be desirable to have an easy way for the programmer to indicate whether they wanted truncation on the left or the right and padding on the left or the right. Perhaps a new built-in function

set_bit_size(Bit_String, Desired_Width,
             Truncation, Padding, Fill)

Bit_String : a bit string
Desired_Width : a non-negative integer, the width wanted
Truncation: 'left' | 'right' | 'error';
    if bit_size(Bit_String) > Desired_Width
        truncate on the left/truncate on the right/
        report an error
Padding: 'left' | 'right' | 'error';
    if bit_size(Bit_String) < Desired_Width
        pad on the left/pad on the right/report an error
Fill: 0 | 1 | 'copy';
    pad with 0/pad with 1/pad with a copy of the
    last bit at the end where padding is done.

However, that idea is only partly baked, and is not part of the current proposal. As things currently stand, using the bit syntax and relying on implicit truncation is the simplest way to extract the leading bits of a bit string.
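To make the idea concrete, here is one way set_bit_size/5 could be sketched in plain Erlang. Everything in it beyond the argument list above is an assumption: the module name, the use of 'badarg' for the 'error' cases, and padding an empty bit string with 0 when Fill is 'copy'. 'left' truncation is taken to mean that the leftmost bits are dropped, as happens for integers.

```erlang
-module(set_bit_size_sketch).
-export([set_bit_size/5]).

set_bit_size(Bits, Width, Truncation, Padding, Fill)
  when is_bitstring(Bits), is_integer(Width), Width >= 0 ->
    Have = bit_size(Bits),
    if
        Have =:= Width -> Bits;
        Have > Width   -> truncate(Bits, Have - Width, Width, Truncation);
        Have < Width   -> pad(Bits, Width - Have, Padding, Fill)
    end.

%% Drop Extra bits from the named end, keeping Width bits.
truncate(Bits, Extra, Width, left) ->
    <<_:Extra/bits, Kept:Width/bits>> = Bits,
    Kept;
truncate(Bits, _Extra, Width, right) ->
    <<Kept:Width/bits, _/bits>> = Bits,
    Kept;
truncate(_Bits, _Extra, _Width, error) ->
    erlang:error(badarg).              % assumed error reason

%% Add Missing fill bits at the named end.
pad(Bits, Missing, left, Fill) ->
    Pad = fill_bits(Missing, fill_bit(Bits, left, Fill)),
    <<Pad/bits, Bits/bits>>;
pad(Bits, Missing, right, Fill) ->
    Pad = fill_bits(Missing, fill_bit(Bits, right, Fill)),
    <<Bits/bits, Pad/bits>>;
pad(_Bits, _Missing, error, _Fill) ->
    erlang:error(badarg).              % assumed error reason

%% Resolve Fill to a concrete bit; 'copy' repeats the bit at the
%% end where padding is done.
fill_bit(_Bits, _Side, 0) -> 0;
fill_bit(_Bits, _Side, 1) -> 1;
fill_bit(<<>>, _Side, copy) -> 0;      % assumption: nothing to copy
fill_bit(<<B:1, _/bits>>, left, copy) -> B;
fill_bit(Bits, right, copy) ->
    Skip = bit_size(Bits) - 1,
    <<_:Skip/bits, B:1>> = Bits,
    B.

%% Missing copies of one bit: all zeros or all ones.
fill_bits(Missing, 0) -> <<0:Missing>>;
fill_bits(Missing, 1) -> <<((1 bsl Missing) - 1):Missing>>.
```

With X = <<5:3>>, set_bit_size(X, 2, right, error, 0) gives <<2:2>>, which is what <<X:2/bits>> gives today, while set_bit_size(X, 2, left, error, 0) gives the integer-like <<1:2>>. Padding on the right with set_bit_size(X, 4, error, right, 0) gives <<10:4>>, the result one might have expected from <<X:4/bits>>.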

As long as the name of the directive is intention-revealing, it doesn't matter very much what it is. I proposed bit_range_check because it is all about checking ranges in bit syntax, but since in this draft it does NOT apply to bit string segments, perhaps bit_integer_range_check would be better.

The arguments false and true seem clear enough. Alternatives would be something like

-bit_integer_range(check).
-bit_integer_range(no_check).

That would be fine too.

Classical Pascal compilers let you do things like

{$I-}   (* disable index checks *)
(* code with no index checks *)
{$I+}   (* re-enable index checks *)

Allowing multiple -bit_range_check directives in a module could let you use code written for the old approach inside a module that otherwise uses the new approach. I don't believe that we want to encourage that sort of thing: it is MUCH easier when reading a module if all of it follows the same rule.

A single module-wide setting is also easier for an Erlang compiler that expects to be able to process function definitions in any order. The compiler can check for one of these directives anywhere in a module before it handles any bit syntax forms anywhere. However, it is easier for people reading a module if, when they first see a <<...>> construction, they have already seen any directive that might affect what it means.

The restrictions on the number and placement of these directives can always be relaxed later if necessary.

Backwards Compatibility

All existing Erlang code remains acceptable with unchanged semantics.

Reference Implementation

None, because I still can't find my way around the compiler.

Copyright

This document has been placed in the public domain.