Friday, 28 September 2018 - Björn Gustavsson
This blog post looks back on the development of the SSA-based intermediate representation from the beginning of this year to the end of August when the branch was merged.
In January this year we realized that we have reached the limit of the optimizations that we could do working on BEAM code.
John had finished the work on extending beam_bsm
(a pass that attempts to delay creation of sub
binaries). The extended beam_bsm
pass could apply the
optimization in a few more cases than it could before, but the amount
of code in beam_bsm
to achieve that modest improvement of the
optimization was insane.
John, Lukas, and I discussed what we should do about it. Clearly, we needed a better intermediate format. But what should it be? Could we use the existing BEAM code but with variables instead of BEAM registers and do register allocation later? That would solve some of the problems but not all of them. The irregular nature of BEAM instructions makes it cumbersome to traverse and analyze BEAM code.
So we decided to do like most modern compilers and use an SSA-based intermediate format.
Introducing a new intermediate format would require rewriting at least some parts of the compiler. The problem with rewrites is that they always take longer time than expected and that they often get abandoned before they are finished.
To increase the odds that this rewrite would be successful, we come up with this plan to do the minimum amount of work to get something working as soon as possible:
Write a new pass that translates from Kernel Erlang to SSA code.
Write a new pass that translates from SSA code to BEAM code.
Keep all existing optimization passes.
Rewrite the optimization passes one at a time.
It didn't quite work out according to the plan, as will soon be evident.
I made the first the commit February 1 this year.
The first pass I wrote was the translator from Kernel Erlang to SSA code.
We named it beam_kernel_to_ssa
.
My first thought was to write the pass from scratch, as opposed to
base it on v3_codegen
. After all, there are fundamental differences
between BEAM code and SSA code. BEAM code is a flat list of instructions.
SSA code consists of blocks of numbered blocks stored in a map, and there
are also the phi nodes.
On the other hand, the input for both v3_codegen
and
beam_kernel_to_ssa
was Kernel Erlang. There was nothing wrong with
the code that handled the Kernel Erlang records and I didn't want to
rewrite that code from scratch. Instead, I rewrote the part of the
code that produced a list of BEAM instructions to produce a list of
SSA instructions. I then wrote a simple pass (about 100 lines of
code) that packaged the SSA instructions into blocks and added the
phi nodes.
I prefer to test the code I write a soon as possible after writing it. It is much easier to find and fix bugs in code that has been recently written.
How can one test beam_kernel_to_ssa
before the code generator for
BEAM code has been written?
One cannot, not completely, but there are ways to find major problems.
One such way is smoke testing. I modified
the compiler so that it would first run beam_kernel_to_ssa
but
discard its output, then run v3_codegen
and the rest of the compiler
passes. That allowed me to run the entire compiler test suite, and
if the beam_kernel_to_ssa
pass crashed, I've had found a bug.
Another way was to write a validator or linter of the SSA code.
John wrote the beam_ssa_lint
pass (actually called
beam_ssa_validator
at that time and later renamed), which would
verify that a variable was only defined once, that variables were
defined before they were used, that labels in terminators and phi
nodes referred to defined blocks, and so on. It helped me find a
few bugs.
Dialyzer also helped me find some bugs. I made sure
that I added types for all fields in all new records and
specifications for all exported functions. Dialyzer pointed out
some bugs when I ran it and thinking about the types when writing
the -type
declarations was also useful.
I am not sure exactly how long time I spent on the initial
implementation of beam_kernel_to_ssa
, but it was probably less
than two weeks. There were a few snags along the way, most of
them bugs in v3_kernel
that did not cause any problems
with the old v3_codegen
.
Here is an example. I chose to fix it in OTP 21 even though it was harmless in that release:
v3_kernel: Stop ensuring one return value in #k_try{}
Next up was the translation from SSA code to BEAM code.
I have already decided that the translation was sufficiently complicated that to better be split into two major passes.
The first pass of those passes,
beam_ssa_pre_codegen
, would work on the SSA
code, rewriting it, and adding annotations for another pass that would
generate the BEAM code, but the output would still be valid SSA code
so that ssa_lint
could be used to validate the output. The pretty-printed
SSA code also includes the annotations to facilitate debugging.
The dprecg
option can be used to produce a pretty-printed listing of
the SSA code. The following command will create the file blog.precodegen
:
erlc +dprecg blog.erl
The next section will dig deeper into the workings of beam_ssa_pre_codegen
.
On a first reading, you might want to skip that section and jump ahead to
the section about beam_ssa_codegen.
To provide some context for the description of beam_ssa_pre_codegen
,
we will first look at some BEAM code and talk about stack frames and
Y registers.
Here is the example in Erlang:
foo(C, L) ->
Sum = lists:sum(L),
C + Sum.
The BEAM code looks like this:
{allocate,1,2}.
{move,{x,0},{y,0}}.
{move,{x,1},{x,0}}.
{line,[{location,"blog.erl",5}]}.
{call_ext,1,{extfunc,lists,sum,1}}.
{line,[{location,"blog.erl",6}]}.
{gc_bif,'+',{f,0},1,[{y,0},{x,0}],{x,0}}.
{deallocate,1}.
return.
As usual, we will walk through the code one or a few lines at a time.
{allocate,1,2}.
The allocate
instruction allocates a stack frame. The 1
operand
means that there should be room for one slot in the stack frame for
storing one value. The slots in the stack frame are called Y
registers.
The 2
operand means that two X registers ({x,0}
and {x,1}
) are
live and must be preserved if allocate
needs to do a garbage
collection in order to allocate space for the stack frame.
{move,{x,0},{y,0}}.
The C
argument for foo/2
is in {x,0}
. The move
instruction
copies the value of {x,0}
to {y,0}
, which is the zeroth slot
in the stack frame. The reason for doing this copy will soon become
clear.
{move,{x,1},{x,0}}.
Preparing for the call of lists:sum/1
, the value of L
in {x,1}
is copied to {x,0}
.
{line,[{location,"blog.erl",5}]}.
{call_ext,1,{extfunc,lists,sum,1}}.
Here lists:sum/1
is called. The argument is in {x,0}
. The result
(the sum of all numbers in the list) is returned in {x,0}
. Also,
the contents of all other X registers are destroyed. That means that
any value that is to be used after a function call must be saved to
a Y register.
{gc_bif,'+',{f,0},1,[{y,0},{x,0}],{x,0}}.
This instruction calculates the sum of C
(in {y,0}
) and Sum
(in {x,0}
),
storing the result in {x,0}
.
{deallocate,1}.
Preparing to return from the function, the deallocate
instruction
removes the stack frame that allocate
created.
return.
return
returns from the function. The return value is in {x,0}
.
Here is the SSA code for the function:
function blog:foo(_0, _1) {
0:
%% blog.erl:5
_2 = call remote (literal lists):(literal sum)/1, _1
%% blog.erl:5
_3 = bif:'+' _0, _2
@ssa_bool = succeeded _3
br @ssa_bool, label 3, label 1
3:
ret _3
1:
@ssa_ret = call remote (literal erlang):(literal error)/1, literal badarg
ret @ssa_ret
}
After running beam_ssa_pre_codegen
, the SSA code looks like this:
function blog:foo(x0/_0, x1/_1) {
%% _0: 0..1
%% _1: 0..1 0..3
%% #{frame_size => 1,yregs => [0]}
0:
%% _0:4: 1..5
[1] y0/_0:4 = copy x0/_0
%% blog.erl:5
%% _2: 3..5
[3] x0/_2 = call remote (literal lists):(literal sum)/1, x1/_1
%% blog.erl:5
%% _3: 5..11
[5] x0/_3 = bif:'+' y0/_0:4, x0/_2
%% @ssa_bool: 7..9
[7] z0/@ssa_bool = succeeded x0/_3
[9] br z0/@ssa_bool, label 3, label 1
3:
[11] ret x0/_3
1:
%% @ssa_ret: 13..15
[13] x0/@ssa_ret = call remote (literal erlang):(literal error)/1, literal badarg
[15] ret x0/@ssa_ret
}
We will describe what the important (for this example) sub passes of
beam_ssa_pre_codegen
do, and point to the relevant part of code while
doing so.
The sub pass place_frames determines where stack frames should be allocated. In the example, block 0 needs a stack frame.
The sub pass find_yregs determines which variables that are to be
placed in Y registers. The result will be a yregs
annotation added
to each block that allocates a stack frame. For this example, the
annotation will look like:
Variable _0
is C
from the Erlang code. It needs to be saved across the
call to lists:sum/1
.
The sub pass reserve_yregs uses the yregs
annotations and inserts copy
instructions
to copy each variable that needs saving to a new variable. For the example,
the following instruction will be added
[1] y0/_0:4 = copy x0/_0
It copies the value of _0
to _0:4
.
The sub pass number_instructions numbers all instructions as a preparation for register
allocation. In the listing, those numbers are in brackets before each instruction:
[1]
, [3]
, [5]
, and so on.
The sub pass live_intervals calculates the intervals in which each variable is live. In the listing, the live intervals are shown as comments before the definition of the variable:
%% _0:4: 1..5
[1] y0/_0:4 = copy x0/_0
The variable _0:4
is live from instruction [1]
(its definition) to
[5]
(its last use).
The sub pass linear_scan uses the linear scan algorithm to allocate registers for each variable. The result is saved as annotation for the function. In the listing of the SSA code, the register will be added to the definition and each use of a variable. For example:
[1] y0/_0:4 = copy x0/_0
Variable _0
(the argument L
) is in {x,0}
. Its copy in _0:4
is in
{y,0}
.
But what is z0
?
succeeded
is not a BEAM instruction. It will be combined with the previous
instruction (bif:+
in this example) and the br
instruction that follows it
to the following BEAM instruction:
{gc_bif,'+',{f,0},1,[{y,0},{x,0}],{x,0}}.
Thus, the value @ssa_bool
is never explicitly stored in a BEAM
register. Before I invented Z registers, @ssa_bool
would have been
assigned to an X register. That worked most of the time, but sometimes
an X register would seem to be occupied when it was not, and prevent
another instruction from using that register.
Here are the references that I used when implementing linear scan.
The sub pass frame_size uses the information from the linear scan pass to calculate the size of each stack frame. The result is stored as an annotation:
<pre class="highlight"> %% #{<b>frame_size => 1</b>,yregs => [0]} </pre>The beam_ssa_codegen
pass generates BEAM code
from the annotated SSA code. Testing of this pass was easier, because
I could compile some sample code and try to run it.
Often I did not even have to run the code to know that it was wrong. The compiler would tell me, loudly:
{% raw %}
blog: function bar/2+4:
Internal consistency check failed - please report this bug.
Instruction: {test_heap,2,3}
Error: {{x,2},not_live}:
{% endraw %}
It's time to introduce the beam_validator
pass.
The beam_validator
pass was introduced in one of
the R10B releases (probably in 2006). It is run directly before the
BEAM code is packaged into a binary and written to a BEAM file. The
purpose of beam_validator
is to find unsafe instructions that
could crash the runtime system or cause it to misbehave in other
ways.
Let's look at a simple example:
bar(H, T) ->
[H|T].
Here is the BEAM code, but edited by me to contain an unsafe instruction:
<pre class="highlight"> {label,4}. {test_heap,2,<b>3</b>}. {put_list,{x,0},{x,1},{x,0}}. return. </pre>The number of live registers is here given as 3
instead of 2
.
That means that {x,0}
, {x,1}
, and {x,2}
are supposed to contain
valid Erlang terms. Because bar/2
is only called with two arguments,
{x,2}
can contain any old garbage.
When running this code, it could crash the runtime system, or it could
be completely harmless. It depends on whether there will be a garbage
collection during execution of the test_heap
instruction, and on the
exact nature of the garbage in {x,2}
. For example, if the garbage
happens to be an atom nothing bad will happen. That means that this
type of compiler bug is difficult to reliably catch in a test case.
beam_validator
will find this bug immediately. It keeps track of
which registers are initialized at any point in the function. If it
finds a reference to a register that is not initialized it will
complain:
{% raw %}
blog: function bar/2+4:
Internal consistency check failed - please report this bug.
Instruction: {test_heap,2,3}
Error: {{x,2},not_live}:
{% endraw %}
During the implementation of beam_ssa_codegen
, the beam_validator
pass pointed out many bugs for me. It was my friend.
It was also my foe, sort of. It would complain that some perfectly safe
code was unsafe. When that happened, I had to thoroughly investigate the
code to make doubly sure it was safe, and then extend beam_validator
to make it smarter so that it would understand that the code was safe.
Here is one example where I had to make beam_validator
smarter.
Consider this code:
{move,{x,0},{y,0}}.
{test,is_map,{f,777},[{x,0}]}.
{put_map_assoc,{f,0},{y,0},...}.
The move
instruction stores a copy of {x,0}
in {y,0}
(a location
on the stack). The following test
instruction tests whether {x,0}
is a map and branches to label 777 if not. The put_map_assoc
instruction
updates the map in {y,0}
.
The put_map_assoc
instruction will crash if its source argument is
not a map. Therefore, beam_validator
complains if put_map_assoc
is
used with a source argument that is not a map. In this example,
beam_validator
had not seen a test
instruction that ensured that
{y,0}
was a map, so it complained. It is obvious (for a human) that
{y,0}
is a map because it is a copy of {x,0}
, which is a map.
v3_codegen
never generated such code; in fact, it explicitly
avoided generating such code.
I did not want to add similar kludges to the new code generator, so
beam_validator
had to become smarter.
Some of the unsafe code that beam_validator
found was really unsafe,
but it was not the fault of my new compiler passes, but of the
optimization passes that optimized the generated BEAM code.
The problem was that some of the optimization passes had implicit
assumptions of the kind of code that v3_codegen
would generate
(or, rather, would not generate). The new code generator broke
those assumptions.
At first, when I saw those bugs, I removed the broken part of the optimization pass. Making the optimizations safe would be non-trivial and ultimately wasted work because we intended to rewrite all those optimization passes to work on SSA code.
When I have seen a few too many of those unsafe optimizations, I ripped out all of the unsafe optimization passes.
That meant that we would have to re-implement all of the optimizations before the generated code would be as good as the code from the old compiler. I had also noticed that the new BEAM code generator in a few ways generated better code than the old one, but in other ways the code was worse. For example, the generated code used more stack space and did a lot of register shuffling. Eventually, that had to be addressed in some way.
Meanwhile, I had worse problems to worry about.
On March 14 I presented my progress on the new compiler passes for the OTP team. One of my slides had the following text:
- Can compile all modules in OTP (and run many of them correctly)
Yes, I had finished the initial implementation of beam_ssa_codegen
so that I could compile all code in OTP.
The problem that I only at hinted in the slide was that Erlang could crash and dump core when running test suites. Not every time, and never in the same test case twice. It only happened when I have compiled OTP with the new compiler.
The crash didn't seem related to the test cases themselves, but to the
writing of log files. I soon narrowed it down to that the crash could
happen if file_io_server
had been compiled with
the new compiler passes. However, that was not much help. The module
contains complicated code that uses the binary syntax, try
/catch
,
and receive
, all of which are complicated instructions that might
not be correctly translated by the new compiler passes.
beam_validator
was supposed to catch those kinds of bugs before
they can cause a crash. Either there was some kind of bug that
beam_validator
didn't look for, or there was a bug in the
implementation of some of the instructions in the BEAM interpreter.
I ended up spending the rest of March trying to hunt down that bug.
At the beginning of April, the bug still eluded me. I had narrowed
it down somewhat. I was pretty sure it had something to do with
receive
.
Then Rickard gave me some information that I could connect to another piece of information that I had absorbed during my hunt for the bug.
This section is somewhat advanced, and if you wish you can skip to the fix.
Reading about the Erlang Garbage Collector can give some background to better understand this section.
Rickard reminded me about the message_queue_data
option that
had been added to process_flag/2
in OTP 19. After
calling process_flag(message_queue_data, off_heap)
all messages
that have not yet been received would be stored outside the process
heap. Storing the messages outside the heap means that the garbage
collector doesn't have to spend time copying the unreceived messages
during garbage collection, which can be a huge win for processes that
have many messages in its message queue.
The implementations details of messages outside the heap are
crucial. Consider this selective receive
:
receive
{tagged_message,Message} -> Message
end.
When the BEAM interpreter executes this code, it will retrieve a reference message from the external message queue and match it against the tuple pattern. If the message does not match, the next message will be processed in the same way, and so on.
If a message does not match, there must not be any remaining
references to it stored on the stack. The reason is that if there is a
garbage collection, the garbage collector will copy the message (or
part of the message) to the heap, and, even worse, it will destroy the
original message during the copy operation. The message is still in
the external message queue, but it has now been corrupted by the
garbage collector. If the message is later matched out in a receive
,
it will likely cause a crash.
When Rickard first implemented off-heap messages, he asked me whether the compiler could ever store references to unreceived messages on the stack. I assured him that it could not happen.
Yes, that was true, it could not happen because of the way
v3_codegen
generated the code for receive
.
With the new compiler passes, it could happen. When I first discussed the bug with Richard in March, he did mention that it is forbidden to store references to off-heap messages on the stack. At that time, I was not aware that the compiler could store references to off-heap messages on the stack.
When Rickard reminded me about that for the second time in April, I remember seeing during my bug hunt generated code that stored off-heap message references on the stack.
After finding the reason for the bug, I first taught beam_validator
to complain about "fragile references" on the stack.
I included that commit in OTP 21.
I then added a sub pass to beam_ssa_pre_codegen
to rewrite receive
.
It introduces new variables and copy
instructions to ensure that
any references to the message being matched are kept in X registers.
With no known bugs in the code generator, I could start rewriting the optimization passes I had removed.
beam_ssa_recv
is a replacement for the unsafe
beam_receive
pass. The purpose is to optimize a
receive
that can only match a newly created reference. The
optimization avoids scanning the messages that were placed in the
message queue before the reference was created.
I actually wrote beam_ssa_recv
at the beginning of March as an
experiment to see how easy it would be to write optimizations of SSA code.
It turned out to be pretty easy. beam_ssa_recv
can apply the optimization
in more places than beam_receive
could, using slightly less code.
In the old beam_receive
pass, a lot of code is needed to handle
the many variants of BEAM instructions. For example, in
opt_update_regs/3
there are three
clauses just to handle three variants of a call
instruction (calling
a local function, calling an external function, and calling a fun).
Here is an example of a function that beam_receive
did not optimize, but
beam_ssa_recv
can optimize.
The beam_ssa_opt
pass runs a number of
optimizations. Many of the optimizations are replacements
for the optimizations I removed earlier.
beam_ssa_type
replaces the unsafe beam_type
pass.
The beam_type
pass did a local type analysis (basically for extended basic blocks),
and tried to simplify the code, for example by removing unnecessary type tests.
The beam_ssa_type
pass analyzes the types in an entire function and
simplifies the code, for example by removing unnecessary type
tests. It finds many more opportunities for optimizations than
beam_type
did.
At the beginning of May, John started working on what was to become this pull request:
#1958: Rewrite BSM optimizations in the new SSA-based intermediate format
I continued to write optimizations and fix bugs that John found while developing his optimizations.
While working on his binary optimizations, John realized that the SSA
instructions for binary matching were difficult to optimize. The
binary match instructions I had designed were close to the semantics
of the BEAM instructions. John suggested that the bs_get
instruction
should be broken up into a bs_match
instruction and a bs_extract
instruction to simplify optimizations.
The breaking up of the instructions meant that beam_ssa_pre_codegen
would have to work harder to combine
them, but it vastly simplified John's
optimizations. It turned out that it also enabled other optimizations:
the liveness optimizations could remove unused
instructions more aggressively.
On the first day of Code BEAM STO 2018 May 31, I didn't know of any bugs in the new compiler passes and my list of optimizations to re-implement was shrinking steadily. I met Michał Muskała (a frequent contributor to Erlang/OTP and a member of the Elixir Core Team) there and told him about my work on the compiler and that it was stable enough be tested outside OTP, for example to compile Elixir code...
I received an email from Michał in the middle of June. He had tried out my compiler branch. He wrote:
First impression is that it took a loooong time to compile Elixir's unicode module, so long that I had to shut it down after about 10 minutes.
He sent me an Erlangified version of Elixir's unicode module. The size of the Erlang source for the module was almost 82,000 lines or about 3,700,000 bytes. Based on the size, compilation could be expected to be a little bit slow, but not that slow. On his computer, OTP 21.0-RC2 finished the compilation in 16 seconds.
I compiled the module using the time
option. The slowest pass was beam_ssa_type
.
After some further profiling, I found the bottleneck in the joining of two maps.
Here is the corrected code. The original code didn't compare
the size of maps and swap them as needed. I might have done some other improvements,
too. Anyway, that took care of that bottleneck. Now beam_ssa_pre_codegen
was the
slowest pass.
I fixed several bottlenecks in the linear scan sub pass, and after
that some other bottleneck in beam_ssa_pre_codegen
. I think that reduced the
compilation time to well under one minute.
After having finished the re-implementation of the last optimization
pass (I think it was the optimization of floating point
operations, as previously done by the unsafe
beam_type
pass), I started to compare the code
generated by OTP 21 with code from the new compiler passes.
I used scripts/diffable, which compiles about 1000
modules from OTP to BEAM code and massages the BEAM code to make it more friendly
for diffing. I then ran diff -u old new
to compare the new code to
the old code.
In the last part of June and the first week of July, I then improved
beam_ssa_pre_codegen
and beam_ssa_codegen
to address the issues that I
noticed when reading the diff.
beam_ssa_pre_codegen
improvementsI did not change the linear scan sub pass of beam_ssa_pre_codegen
itself. Instead I added transformations of the SSA code that would help
linear scan do a better job of allocating registers.
The most obvious issue I noticed was unnecessary move
instructions.
Here are two of the sub passes I added to address that issue:
reserve_xregs gives hints to the linear scan sub pass that a certain X register should be used for a certain variable, if possible.
opt_get_list tries to eliminate the extra move
instruction that
is frequently added when matching out elements from a list. See the
comments in the code for an example and an explanation.
Another frequent issue was that the code generated from the new code generator used more stack space because two variables that were not strictly live at the same time were allocated different Y registers (slots on the stack) instead of re-using the same Y register. I addressed that issue in copy_retval. See the comments in the code for an example.
beam_ssa_codegen
improvementsMichał noticed that when a value was stored in both an X register and a Y register (on the stack), instructions using the value would always use the Y register. The old code generator would use the X register. The new code could be slower because the BEAM interpreter is generally optimized for operands being in X registers.
I added prefer_xregs to address that issue. See the comments in the code for examples.
Vacation.
Before I left for vacation, it seemed that the new compiler passes generally generated code at least as good as the old compiler passes. In some cases, the code would be much better.
Back after my vacation, I did some final polishing.
On Aug 17 I created a pull request.
Before merging the pull request, I sneaked in a few final optimizations.
On Aug 24 I merged the pull request.
The SSA-based intermediate representation provides a solid framework for future improvements of the compiler. After the merging of the pull request in August, several pull requests have already added further improvements:
Here is a list of possible further improvements that could be implemented either by OTP members or external contributors before OTP 22 is released:
Rewrite sys_core_dsetel
to be SSA-based.
Rewrite the guard optimizing sub pass guard_opt/2
in v3_kernel
to an SSA-based optimization pass.
Rewrite beam_trim
. It would probably have to be a part of beam_ssa_codegen
.
Optimize switch
branches. If two branches jump to blocks
that do the same thing, let both branches jump to the same
block. beam_jump
does this kind of optimization, but doing it
earlier in the SSA representation could speed up compilation of functions
with many clauses.
Get rid of the beam_utils
module, especially the is_killed()
and
is_not_used()
family of functions. The functions in beam_utils
used by beam_jump
could be moved into beam_jump
.
Rewrite beam_bs
to be SSA-based. This rewrite might not improve
the generated code, but it might speed up compilation of modules
with heavy use of the binary syntax.