Author: Richard A. O'Keefe <ok(at)cs(dot)otago(dot)ac(dot)nz>
Status: Draft
Type: Standards Track
Created: 09-Feb-2010
Erlang-Version: R13B-3
Post-History:

EEP 32: Module-local process names

Abstract

The process registry in Erlang is convenient, but counts as a global shared mutable variable, with two major defects: the possibility of data races (shared mutable variable) and the impossibility of encapsulation (global). This EEP resurrects the old (1997 or earlier) proposal of module-local process-valued variables, providing a replacement for node-local uses of the registry with encapsulation and without races.

Specification

A module (or an instance of a parameterized module) may have one or more top level pid-valued variables, and if so, has a lock associated with them. The directive has the form

-pid_name(Atom).

where Atom is an atom. To avoid confusing programmers who still have to deal with the registry, this Atom may not be 'undefined'.

If there is at least one such directive in a module, the compiler automatically generates a function called pid_name/1. In the scope of directives

-pid_name(pn_1).
...
-pid_name(pn_k).

the pid_name/1 function is rather like

pid_name(pn_1) ->
    with_module_lock(read) -> X = *pn_1 end, X;
...
pid_name(pn_k) ->
    with_module_lock(read) -> X = *pn_k end, X.

except that we expect there to be a VM instruction get_pid_safely(Address), and we expect the compiler to inline calls to pid_name(Atom) when Atom is known. On a machine like the X86 or X86_64, this could be a single locked load instruction.

The value of a -pid_name is always a process id.
There is a special process id value which at all times represents a dead process. So within a module,

pid_name(X) ! Message

is legal if and only if X is one of the pid-names declared in the module, whether or not the process it names has died.

If there is a need to discover whether a -pid_name has within the recent but unpredictable past been associated with a live process, that can be found out by combining pid_name/1 with process_info/2.
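As a rough illustration in today's Erlang, the liveness check can be sketched with an ordinary pid standing in for the result of pid_name/1, which does not exist yet. The module name liveness_demo and the helper is_live/1 are made up for this sketch; only process_info/2 is real.

```erlang
-module(liveness_demo).
-export([is_live/1, run/0]).

%% process_info/2 returns 'undefined' for a local process that has exited.
%% The answer is only a snapshot: the process may die right afterwards.
is_live(Pid) when is_pid(Pid) ->
    process_info(Pid, status) =/= undefined.

run() ->
    %% Pid plays the role of pid_name(Name) (hypothetical).
    Pid = spawn(fun () -> receive stop -> ok end end),
    Before = is_live(Pid),
    MRef = erlang:monitor(process, Pid),
    Pid ! stop,
    receive {'DOWN', MRef, process, Pid, _} -> ok end,
    After = is_live(Pid),
    {Before, After}.
```

Here liveness_demo:run() checks the same pid once while the process is waiting and once after its exit has been observed through a monitor.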

As with the registry, a process may have at most one pid_name. For debugging purposes, I suppose that process_info could be extended to return a {pid_name,{Module,Name}} tuple.

When a process exits, it is automatically unregistered. That is, if it was bound to a -pid_name, that -pid_name now refers to the conventional dead process. This draft of this EEP includes no other way for a process to be unregistered.

The important thing about registering a process is that it should be atomic. So there are two new functions

pid_name_spawn(Name, Fun)
pid_name_spawn_link(Name, Fun)

We can understand them as

pid_name_spawn(Name, Fun)
  when is_atom(Name), is_function(Fun, 0) ->
    with_module_lock(write) ->
    P = *Name,
    if P is a live process ->
        P
     ; P is a dead process ->
        Q = spawn(Fun),
        *Name := Q,
        Q
    end
    end.

pid_name_spawn_link(Name, Fun)
  when is_atom(Name), is_function(Fun, 0) ->
    with_module_lock(write) ->
    P = *Name,
    if P is a live process ->
        P
     ; P is a dead process ->
        Q = spawn_link(Fun),
        *Name := Q,
        Q
    end
    end.

Here, as earlier, with_module_lock is pseudo-code, meant to suggest some sort of reader-writer locking on a private lock, existing only inside a module that has declared a -pid_name.

These two functions are automatically declared inside the module, like pid_name/1. The three functions are not functions automatically inherited from the erlang: module but functions that are logically inside the module, however they might be actually implemented. There doesn't seem to be any good reason for a module to export any of these functions, and the compiler should at least warn if that is attempted.
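The intended behaviour of pid_name/1 and pid_name_spawn/2 can be approximated in current Erlang. This is only a sketch under loud assumptions: a helper process (a "cell") stands in for one -pid_name variable together with the module lock, and the names pid_cell, new/0, lookup/1 and spawn_named/2 are invented here. It says nothing about how a real implementation would work, and it is far slower than the single load instruction hoped for above.

```erlang
-module(pid_cell).
-export([new/0, lookup/1, spawn_named/2]).

%% The cell handles one request at a time, so "check, spawn, store"
%% is atomic with respect to every other user of the same cell.
new() ->
    spawn(fun () -> loop(dead_pid()) end).

%% Stand-in for pid_name(Name): always yields a pid, possibly a dead one.
lookup(Cell) -> call(Cell, lookup).

%% Stand-in for pid_name_spawn(Name, Fun).
spawn_named(Cell, Fun) when is_function(Fun, 0) ->
    call(Cell, {spawn, Fun}).

call(Cell, Request) ->
    Ref = make_ref(),
    Cell ! {self(), Ref, Request},
    receive {Ref, Reply} -> Reply end.

%% Plays the part of the conventional dead process: a pid whose
%% process is known to have exited already.
dead_pid() ->
    {Pid, MRef} = spawn_monitor(fun () -> ok end),
    receive {'DOWN', MRef, process, Pid, _} -> Pid end.

loop(P) ->
    receive
        {From, Ref, lookup} ->
            From ! {Ref, P},
            loop(P);
        {From, Ref, {spawn, Fun}} ->
            Q = case is_process_alive(P) of
                    true  -> P;
                    false -> spawn(Fun)
                end,
            From ! {Ref, Q},
            loop(Q)
    end.
```

With this emulation, pid_cell:lookup(Cell) ! Message never raises, and two callers racing on spawn_named/2 get the same pid back, because the cell serialises their requests.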

Motivation

Encapsulation.

The process registry is often used when clients of a module need to communicate with one or more servers managed by the module, but the interface code is inside the module. There is no advantage, and much risk, in exposing the process. A big reason for this proposal is to get the benefit of having mutable process variables without the loss of encapsulation.

Efficiency.

As a shared mutable data structure, the registry has to be accessed within the scope of suitable locks. With this approach, each module has its own lock, contention ought to be pretty nearly zero, and the commonest use case of the registry can, I believe, be a simple load instruction.

Safety.

It is actually surprisingly hard to register a process safely, and the use of registered names is oddly inconsistent with the use of direct process ids. This interface is meant to be simpler to use safely.

Rationale

The old Erlang book describes four functions for dealing with registered process names. There are two more main interfaces.

Name ! Message when is_atom(Name) ->
  % Also available as erlang:send(Name, Message).
  % A 'badarg' exception results if Name is an atom that is
  % not the registered name of a live local process or port.
    whereis(Name) ! Message.

register(Name, Pid) when is_atom(Name), is_pid(Pid) ->
  % A 'badarg' exception results if Pid is not a live local
  % process or port, if Name is not an atom or is already in
  % use, if Pid already has a registered name, or if Name is
  % 'undefined'.
    "whereis(Name) := Pid".

unregister(Name) when is_atom(Name) ->
  % A 'badarg' exception results if Name is not an atom
  % currently in use as the registered name of some process
  % or port.  'undefined' is always an error.
    "whereis(Name) := undefined".

whereis(Name) when is_atom(Name) ->
  % A 'badarg' exception results if Name is not an atom.
  % in effect, a global mutable hash table with
  % atom keys and pid-or-'undefined' values.

registered() ->
    % yes, I know this is not executable Erlang.
    [Name || is_atom(Name), is_pid(whereis(Name))].

process_info(Pid, registered_name) when is_pid(Pid) ->
    % yes, I know this is not executable Erlang.
    case [Name || is_atom(Name), whereis(Name) =:= Pid]
      of [N] -> {registered_name,N}
       ; []  -> []
    end.

When a process terminates, for whatever reason, it does the equivalent of

case process_info(self(), registered_name)
  of {_,Name} -> unregister(Name)
   ; []       -> ok
end.

This has an astonishing consequence.

Suppose I do

Pid = spawn(Fun),
...
Pid ! Message

and between the time the process was created and the time I send the message to it, the process dies. In Erlang this is perfectly ok, and the message just disappears.

Now suppose I do

register(Name, spawn(Fun)),
...
Name ! Message

and between the time the process was created and the time I send the message to it, the process dies. Anyone would expect the result to be exactly the same: because the Name pointed to a process which has died, this amounts to sending a message to a dead process, which is perfectly ok, and the message just disappears. Most confusingly, that is not what happens, and instead you get a 'badarg' exception.
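The difference between the two cases can be observed directly in current Erlang. In this small sketch, send_demo and the atom eep32_demo_name are arbitrary names chosen for the demonstration, and the atom is assumed not to be registered already.

```erlang
-module(send_demo).
-export([run/0]).

run() ->
    {Pid, MRef} = spawn_monitor(fun () -> receive stop -> ok end end),
    true = register(eep32_demo_name, Pid),
    Pid ! stop,
    %% Wait until the process is known to be dead.
    receive {'DOWN', MRef, process, Pid, _} -> ok end,
    %% Sending to the pid of a dead process succeeds silently;
    %% the send expression evaluates to the message itself.
    ByPid = (Pid ! hello),
    %% Sending to the name that used to refer to it raises badarg.
    ByName = try eep32_demo_name ! hello
             catch error:badarg -> badarg
             end,
    {ByPid, ByName}.
```

The same dead process is addressed both times; only the way it is named differs.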

Now suppose I do

send(Pid, Message) when is_pid(Pid) ->
    Pid ! Message;
send(Name, Message) when is_atom(Name) ->
    case whereis(Name)
      of undefined -> ok
       ; Pid when is_pid(Pid) -> Pid ! Message
    end.
...
    register(Name, spawn(Fun)),
    ...
    send(Name, Message)

This works the way we would expect, but why is it necessary?

In Erlang as it stands, Name ! Message will raise an error if Name would have referred to the right process but that process has died. It might be argued that this is a useful debugging aid, but nothing helps us if Name now refers to the WRONG process. Right now, consider

whereis(Name) ! Message

This will raise an exception if the named process had died before whereis/1 was called, but consider this timing:

live           dies
   whereis runs      message sent

A slight change in timing can unpredictably change the behaviour from silence-on-late-death to error-on-early-death and vice versa.

pid_name(Name) ! Message

is consistently silent.

The current process registry is also used for ports, which act in many ways like processes.

The old Erlang book is absolutely right that sometimes you need a way to talk to a process you haven't been previously introduced to. However, it is not true that this must be done by means of a global hash table. You could always ask a module for the information.

Let's take program 5.5 from the book.

-module(number_analyser).
-export([start/0,server/1]).
-export([add_number/2,analyse/1]).

start() ->
    register(number_analyser,
             spawn(number_analyser, server, [nil])).

%% The interface functions.

add_number(Seq, Dest) ->
    request({add_number,Seq,Dest}).

analyse(Seq) ->
    request({analyse,Seq}).

request(Req) ->
    number_analyser ! {self(), Req},
    receive
        {number_analyser,Reply} ->
            Reply
    end.

%% The server.

server(Analyser_Table) ->
    receive
        {From, {analyse, Seq}} ->
            Result = lookup(Seq, Analyser_Table),
            From ! {number_analyser, Result},
            server(Analyser_Table)
      ; {From, {add_number, Seq, Dest}} ->
            From ! {number_analyser, ack},
            server(insert(Seq, Dest, Analyser_Table))
    end.

The first thing we notice about this is that the registry is used to allow a process that is a client of this module to communicate with a process managed by this module through interface functions in this module. There is no reason why the process should be given a GLOBALLY visible name, and every reason why it should NOT. We would like to ensure that all communication with the server process goes through the interface functions, and as long as the process is in a global registry, anything could happen. The global process registry thus defeats its own purpose.

Similarly, because the reply messages to the interface functions are tagged, not with the server's identity, but with its public name, they are easy to forge. Both of these problems also apply to Program 5.6 in the old book.
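To see how little a forger needs, here is a small sketch in current Erlang. The module forge_demo is made up; the receive pattern is the one used by request/1 in Program 5.5, and no real server is involved at all.

```erlang
-module(forge_demo).
-export([run/0]).

run() ->
    Client = self(),
    %% An unrelated process that knows only the public name can
    %% manufacture something that looks like a server reply.
    spawn(fun () -> Client ! {number_analyser, forged} end),
    %% The client cannot tell this from a genuine reply.
    receive {number_analyser, Reply} -> Reply end.
```

If replies were tagged with the server's pid instead, the forger would first have to obtain that pid, which is exactly what a global registry hands out to anyone who asks.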

But there is worse. It is NEVER safe to call register/2 or unregister/1. Recall that the precondition for register/2 requires that the Name not be in use. But there is no way to ever be sure of that. For example, you might try

spawn_if_necessary(Name, Fun) ->
    case whereis(Name)              % T1
      of undefined ->
         Pid = spawn(Fun),          % T2
         register(Name, Pid)        % T3
       ; Pid when is_pid(Pid) ->
         ok
    end,
    Pid.

Unfortunately, between time T1, when whereis/1 reports that the Name is not in use, and time T3, when we try to assign it, some other process might have been registered. Also, between time T2, when the new process is created, and T3, when we use the Pid, the process might have died.
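Both failure modes can be provoked deterministically in current Erlang by forcing the unlucky ordering by hand. In this sketch, register_race_demo and the atom eep32_race_demo are arbitrary names assumed to be unused; note that the caller sees the same 'badarg' in both cases and cannot tell them apart.

```erlang
-module(register_race_demo).
-export([run/0]).

run() ->
    Idle = fun () -> receive stop -> ok end end,
    %% Case 1: somebody else takes the name between T1 and T3.
    Owner = spawn(Idle),
    true = register(eep32_race_demo, Owner),
    Late = spawn(Idle),
    R1 = try register(eep32_race_demo, Late)
         catch error:badarg -> badarg
         end,
    Late ! stop,
    %% Free the name again and wait until Owner is really gone.
    MRef = erlang:monitor(process, Owner),
    Owner ! stop,
    receive {'DOWN', MRef, process, Owner, _} -> ok end,
    %% Case 2: the new process dies between T2 and T3.
    {Dead, DRef} = spawn_monitor(fun () -> ok end),
    receive {'DOWN', DRef, process, Dead, _} -> ok end,
    R2 = try register(eep32_race_demo, Dead)
         catch error:badarg -> badarg
         end,
    {R1, R2}.
```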

Because the registry is global, it is no use searching existing code to see whether the Name is clobbered; the bug might be introduced in future code.

There appears to be no way to protect against the possibility of a process dying between T2 and T3. The obvious hack,

Pid = spawn(Fun),
erlang:suspend_process(Pid),
register(Name, Pid),
erlang:resume_process(Pid)

won't work because erlang:suspend_process/1 is documented as having the same 'badarg if Pid is not the pid of a live local process' snafu as register/2. The only really safe way around the issue would be for the new process to be born suspended, and there's no way to do that. There is no 'suspended' option allowed in the options list of spawn_opt/[2-5].

In practice, of course, the new process WON'T die, typically because it goes into a loop waiting for a message. Even so, this amount of fragility in a primitive is a bit worrying.

Let's take a quick check to see how real all this is.

sounder.erl has

start() ->
    case whereis(sounder) of
        undefined ->
            case file:read_file_info('/dev/audio') of
                {ok, FI} when FI#file_info.access==read_write ->
                    register(sounder, spawn(sounder,go,[])),
                    ok;
                _Other ->
                    register(sounder, spawn(sounder,nosound,[])),
                    silent
            end;
        _Pid ->
            ok
    end.

Here's a curious thing: the first time sounder:start/0 is called, it will return different values (ok, silent) depending on whether sound (is, is not) supported. Later calls always return ok. This contradicts the documentation. Whoops! Apart from that, it's a straightforward spawn_if_necessary.

man.erl has

start() ->
    case whereis(man) of
        undefined ->
            register(man,Pid=spawn(man,init,[])),
            Pid;
        Pid ->
            Pid
    end.

This is precisely

start() -> spawn_if_necessary(man, fun () -> man:init() end).

tv_table_owner has

start() ->
    case whereis(?REGISTERED_NAME) of
        undefined ->
            ServerPid = spawn(?MODULE, init, []),
            case catch register(?REGISTERED_NAME, ServerPid) of
                true ->
                    ok;
                {'EXIT', _Reason} ->
                    exit(ServerPid, kill),
                    timer:sleep(500),
                    start()
            end;
        Pid when is_pid(Pid) ->
            ok
    end.

Let's repackage that to see what's going on:

spawn_if_necessary(Name, Fun) ->
    case whereis(Name)
      of undefined ->
         Pid = spawn(Fun),          
         case catch register(Name, Pid)
           of true ->
              Pid
            ; {'EXIT', _} ->
              exit(Pid, kill),
              timer:sleep(500),
              spawn_if_necessary(Name, Fun)
         end
       ; Pid when is_pid(Pid) ->
         Pid
    end.

If there is a live local process registered under Name, return its Pid. Of course, there is no guarantee, after the function returns, that there is STILL a live local process registered under Name, but that's just as true of whereis/1.

If there is not, then create a new process, regardless of whether that turns out to be useful. Try to register it. The Pid will be the pid of a live local process that is not registered under any other name, and Name must be an atom other than 'undefined', or whereis/1 would have crashed. So it should be that the only thing that can go wrong is that some other process has snuck in and swiped the registry slot. In that case, kill the process, wait a long time, and try again.

In theory, it is possible for this to loop forever, with just the right malevolent timing by an adversary. In practice, I'm sure it works very well.

The thing is, if the 'primitives' are this fragile, I would rather not expose beginners to them. Or for that matter, most people: there are plenty of uses of register/2 in the Erlang/OTP sources that are not this well protected.

The simplest fix to the 'registration race' problem would be to verify that spawn_if_necessary/2 is sound, correct it if necessary, and put it in a library. However, that does nothing to fix the globality of the registry.
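One possible shape for such a library function, offered only as a sketch and not as a verified design, is to let the new process register itself. A process calling register(Name, self()) cannot be dead at that moment, which removes the window between spawning and registering, and a lost race is reported back to the parent instead of needing a kill and a sleep. The module name safe_reg is invented here.

```erlang
-module(safe_reg).
-export([spawn_if_necessary/2]).

%% Returns whatever is registered under Name, running Fun in a new
%% process only if that process wins the registration.
spawn_if_necessary(Name, Fun)
  when is_atom(Name), Name =/= undefined, is_function(Fun, 0) ->
    case whereis(Name) of
        undefined ->
            Parent = self(),
            Tag = make_ref(),
            {Child, MRef} = spawn_monitor(fun () ->
                case catch register(Name, self()) of
                    true -> Parent ! {Tag, won}, Fun();
                    _    -> Parent ! {Tag, lost}
                end
            end),
            receive
                {Tag, won} ->
                    erlang:demonitor(MRef, [flush]),
                    Child;
                {Tag, lost} ->
                    %% Somebody else took the name first; look again.
                    erlang:demonitor(MRef, [flush]),
                    spawn_if_necessary(Name, Fun);
                {'DOWN', MRef, process, Child, _} ->
                    %% The child was killed before it could report.
                    spawn_if_necessary(Name, Fun)
            end;
        Existing ->
            Existing
    end.
```

As with whereis/1, the value returned may refer to a process that has died by the time the caller uses it, and the name is still visible to every other module on the node.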

There is no analogue of registered(). Inside a module, you can see what names are available; outside the module, you have no right to know.

This EEP does not propose abolishing the old registry. There is a lot of code, and a lot of training material, that still uses or mentions it. Above all, the old registry can do one thing that this EEP cannot do and isn't meant to, and that is to provide names that can be used in other nodes, in {Name,Node} form. The aim of this proposal is to provide something that can replace MOST uses of the registry with something safer, and in particular to allow gradual migration to per-module registration.

Backwards Compatibility

The only modules that are affected by the new feature are those that visibly contain an explicit -pid_name directive.

Reference Implementation

None.

Example

Here is the old book's Program 5.5 again, brought up to date.

-module(number_analyser). 
-export([
    add_number/2,
    analyse/1,
    start/0,
    stop/0
 ]).
-pid_name(server).

start() ->
    pid_name_spawn(server, fun () -> server(nil) end).

stop() ->
    pid_name(server) ! stop.

add_number(Seq, Dest) ->
    request({add_number,Seq,Dest}).

analyse(Seq) ->
    request({analyse,Seq}).

request(Request) ->
    P = pid_name(server),
    P ! {self(), Request},
    receive {P,Reply} -> Reply end.

server(Analyser_Table) ->
    receive
        {From, {analyse, Seq}} ->
            From ! {self(), lookup(Seq, Analyser_Table)},
            server(Analyser_Table)
      ; {From, {add_number, Seq, Dest}} ->
            From ! {self(), ok},
            server(insert(Seq, Dest, Analyser_Table))
      ; stop ->
            ok    % sent by stop/0; the server simply terminates
    end.

It is now possible to use a programming convention where the -pid_name of every server is 'server'.
It is no longer possible for code outside the module to send messages to the server process.
It is no longer possible (well, no longer embarrassingly easy) for an outsider to forge responses from the server.

Copyright

This document has been placed in the public domain.