Setting the Stage
In this post we will focus on C++ inline functions. The problem described here may apply to other cases as well, but we won’t focus on those here.
C++ inline functions are usually defined in header files that are included in multiple translation units. The intent is to inline them at all call sites, but some call sites may not actually inline these functions (to cap code size, for instance), in which case an out-of-line copy of the definition is required. To have an out-of-line definition available when needed, compilers typically emit a copy of the function body in each translation unit that includes the header defining the inline function; at link time the linker chooses one of the (potentially multiple) bodies it has access to as the definitive copy of the function and links all un-inlined call sites to call into that. There is some more detail about the process at http://www.airs.com/blog/archives/52.
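For concreteness, here is a minimal sketch of that setup (the file and function names are mine):

```cpp
// widget.h -- included by several translation units
inline int answer() {
  return 42;
}

// a.cpp and b.cpp both #include "widget.h" and call answer().  Each
// translation unit that fails to inline answer() at some call site
// emits its own out-of-line copy of the body; at link time the linker
// keeps exactly one of those copies and points all un-inlined call
// sites at it.
int call_answer() {
  return answer();
}
```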
Since the compiler can see the definition of a C++ inline function when optimizing its callers, it is tempting to do some inter-procedural optimization (IPO) even if we don’t inline the function body. For instance, if we know the call target does not read or write memory, we can do dead store elimination across the call.
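A sketch of that temptation (names are mine): if the optimizer can prove the callee never touches memory, the first store below is dead.

```cpp
int global;

// An inline function whose body the compiler can see, and which
// provably does not read or write memory.
inline int identity(int x) {
  return x;
}

void caller() {
  global = 1;           // dead store *if* identity() cannot read memory:
  int v = identity(2);  // nothing can observe global == 1 here
  global = v;           // ...because global is overwritten here
}
```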
This post will show that IPO like the above is generally not sound for C++ inline functions:

1. Given two differently (but correctly!) optimized versions of the same source level function, we may be able to prove things about one of the copies that are not valid for the other.

2. From (1) it follows that inter-procedural optimizations are generally not valid if the call site can ultimately link to a differently optimized version of the same source level function. This means IPO across call sites calling C++ inline functions is unsound if we link two differently optimized object files together (e.g. an object file compiled at -O2 with one compiled at -O0).

3. “Differently optimized” does not necessarily mean “different optimization pipeline”. It is possible for the same function to be differently optimized just by virtue of being in a different context. Therefore IPO across C++ inline function call boundaries is incorrect even if every object file being linked together is optimized identically (i.e. by the same compiler with the same flags).
As-if Execution and Refinement
Many optimizations involve the compiler “choosing” one valid execution from many possible valid executions.
For instance, consider this snippet:
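A sketch of the kind of snippet meant here, assuming f performs two relaxed atomic loads from the same pointer (as the discussion below implies):

```cpp
#include <atomic>

bool f(std::atomic<int> *ptr) {
  int a = ptr->load(std::memory_order_relaxed);
  int b = ptr->load(std::memory_order_relaxed);
  // A racing writer may change *ptr between the two loads, so this
  // comparison can legitimately come out false.
  return a == b;
}
```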
f can either return true or false: there could be a concurrent thread writing different values to *ptr while f executes, and the two loads from ptr may or may not observe different values. However, a compiler is allowed (expected, even) first to reuse the value of the first load for the second, and then to fold the comparison to true; call the resulting function f_opt.
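Sketched concretely (assuming the two-load f described above):

```cpp
#include <atomic>

// Step 1: reuse the first loaded value for the second load.  This picks
// the execution in which both loads observe the same value.
bool f_step1(std::atomic<int> *ptr) {
  int a = ptr->load(std::memory_order_relaxed);
  int b = a;
  return a == b;
}

// Step 2: the comparison is now trivially true and the remaining load
// is dead, so the whole body folds away.
bool f_opt(std::atomic<int> *) {
  return true;
}
```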
This is legal since, even though f is allowed to return false, it does not have to return false even in the face of racing writes. That is, it does not have to see different values from the two reads of ptr – we can always optimize it as if both reads from ptr saw the same value.
The key observation here is that f and f_opt are not equivalent – it is legal to replace f with f_opt (that is, f_opt is a valid implementation of f), but not the other way around. Henceforth, let’s call going from f to f_opt (and similar cases) refinement, and going the other way, from f_opt to f (and similar cases), derefinement. As stated, derefinement is not a valid optimization in the traditional sense.
Refinement is not specific to atomic operations and multi-threaded programs. Given a function whose source exhibits undefined behavior on some inputs, we can produce an optimized version that picks a defined result for those inputs – refining away the undefined behavior.
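One example of this kind of refinement (my example; not necessarily the one originally shown):

```cpp
// Signed integer overflow is undefined behavior in C++, so in every
// *defined* execution x + 1 is greater than x.
bool greater_after_increment(int x) {
  return x + 1 > x;
}

// The compiler may refine away the undefined (overflowing) executions
// and produce:
bool greater_after_increment_opt(int) {
  return true;
}
```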
Clang will also do as-if optimizations on calls to malloc (though I personally find this specific transform somewhat questionable).
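One such transform (my example): a malloc/free pair whose result is never otherwise used can be removed “as if” the allocation never happened, even though this changes the program’s observable heap usage:

```cpp
#include <cstdlib>

int with_unused_allocation() {
  void *p = std::malloc(16);  // result is only ever passed to free
  std::free(p);
  return 42;
}

// Clang/LLVM can optimize the above to:
int without_allocation() {
  return 42;
}
```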
Talking about LLVM IR directly, many basic arithmetic optimizations are refinement operations, since the IR specifies an undef value. For instance, replacing x - x with 0 refines the expression from “0 or undef”1 (undef when x is itself undef) to “0 for all values of x”.
Derefinement breaks static analysis
Given the optimization transforming f into f_opt, a static analysis result proved about f_opt is not necessarily valid for f: f_opt does not read or write memory while f does, and if we optimize a caller of f_opt based on the assumption that its callee does not read or write memory, then changing the call site to call f instead of f_opt will retroactively invalidate the optimization.
This demonstrates claim (1): given two differently optimized versions
of the same source level function, we may be able to prove things
about one of the copies that are not valid for the other. Note that
f is a trivially “optimized” version of itself.
“Differently optimized” does not mean “different optimization pipeline”
To see why claim (3) is true, consider something like:
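A sketch of such a setup (the names F::f, F::g, read_and_write_memory, and external_caller_0 come from the discussion below; the concrete bodies are my reconstruction, with the “files” shown inline so the sketch is self-contained):

```cpp
// ---- F.h (shared header) ----
void read_and_write_memory();  // declaration only; no definition here

struct F {
  // Inline functions: every translation unit that includes this header
  // emits its own optimized copy, and the linker later picks one.
  void f() { g(); }
  void g() { read_and_write_memory(); }
};

// ---- X.cpp ----
// This translation unit happens to also define read_and_write_memory,
// trivially.  Within X.cpp the optimizer can therefore inline the chain
// F::f -> F::g -> read_and_write_memory and reduce F::f to an empty
// function that provably touches no memory.
void read_and_write_memory() {}

// ---- a caller optimized with IPO against X.cpp's copy of F::f ----
int global;

void external_caller_0(F *obj) {
  global = 1;  // looks like a dead store if F::f cannot read memory
  obj->f();
  global = 2;
}

// ---- Y.cpp ----
// Y.cpp includes F.h but cannot see read_and_write_memory's definition,
// so its copy of F::f retains a call to an unknown external function.
```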
In X.cpp it is possible for F::g to get inlined into F::f, and then for F::f to ultimately be optimized to an empty function. This could then justify an inter-procedural optimization that optimizes away a store in external_caller_0 across its call to F::f, which would be bad if the call site ultimately called the copy of F::f present in the Y.cpp translation unit. The copy of F::f present in the Y.cpp translation unit will not have inlined read_and_write_memory (since it can’t even see its definition) and thus will contain a viable call to read_and_write_memory (which presumably reads and writes memory). Therefore the store elimination we did in external_caller_0 will be rendered invalid retroactively if we redirect the call site in external_caller_0 to the copy of F::f from the Y.cpp translation unit.
You can find an “exploit” based on this concept at https://github.com/sanjoy/comdat-ipo, but note that this bug is now fixed in both clang and GCC.
Firstly, I’ll note that this problem disappears in “whole program optimization”-like situations, where the optimizer runs after the definitive copy of each C++ inline function has been selected. Here are some ways to solve this in a traditional “optimize translation units into .o files, and then link” setting:
1. Link a call site calling a C++ inline function only to the exact definition that the compiler saw when optimizing the caller. This will require some smartness in the linker to avoid excessive code bloat, but it allows unrestricted IPO across C++ inline function call boundaries.

2. Don’t do IPO over call sites that call C++ inline functions. This is the fix implemented in clang today. I don’t know what fix was implemented in GCC for this.

3. Do IPO across C++ inline function call boundaries, but only before the inline candidates have had any refinement.
I want to expand on the third option a little, since it raises an interesting question: which optimizations don’t involve refinement? Here are some examples of non-refining optimizations:
Since racing on non-atomic reads and writes is undefined behavior, optimizing two non-atomic loads of the same location down to one (as we did for f above) is not refinement. Transforms whose source and target have the exact same set of behaviors – say, replacing x * 2 with x + x – aren’t refinement either.
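The first case, sketched as the non-atomic analogue of f (my example):

```cpp
bool f_nonatomic(int *ptr) {
  int a = *ptr;  // plain, non-atomic loads
  int b = *ptr;
  return a == b;
}

// Optimizing this to "return true;" is not refinement: any execution in
// which a racing write could make a != b is a data race, hence undefined
// behavior, so the original and transformed programs have the same set
// of defined behaviors.
bool f_nonatomic_opt(int *) {
  return true;
}
```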
Informally, to decide whether a transform involves refinement or not, we can ask the following question: is the inverse of that transform a correct optimization? Since refinement transforms remove behaviors, their inverses add behaviors, something a correct optimization typically2 cannot do. Therefore optimizations whose inverses are correct optimizations cannot be refining optimizations.