This post describes infrastructure that has gone into LLVM piecemeal over the last couple of months. All of the information in this post is scattered throughout commit messages on llvm-commits and email threads on llvm-dev. This post is intended to present a coherent story for people not actively involved in the original discussions and without the spare time to stitch together the big picture from individual commits and emails.

Motivation: Checks in Managed Languages

In “safe” languages like Java, it is the virtual machine’s job to ensure that illegal operations (like dereferencing bad memory or unsound type coercions) do not lead the program into arbitrarily bad states. This is typically enforced by adding runtime checks to certain operations. Field accesses elicit a null check, array loads and stores have a range check (and in some cases a type check), type casts check that the cast is well-formed, etc. For this post we’ll focus on range checks only, but the general idea is applicable to any kind of runtime check.

Let’s start with a simple example:

void foo(int[] arr) {
  arr[0] = 0;
  arr[1] = 1;
  arr[2] = 0;
  arr[3] = 1;
}

Pretending there are no null checks (for brevity, this also applies to the rest of the post), the code above will be lowered into:

void _foo(int[] arr) {
  if (!(0 u< arr.length)) throw OutOfBounds;
  arr[0] = 0;
  if (!(1 u< arr.length)) throw OutOfBounds;
  arr[1] = 1;
  if (!(2 u< arr.length)) throw OutOfBounds;
  arr[2] = 0;
  if (!(3 u< arr.length)) throw OutOfBounds;
  arr[3] = 1;
}

Notationally, I’ll use _<function> to denote a “lowered” version of <function> that has some checks made explicit. u< denotes the “unsigned less than” comparison: lhs u< rhs is equal to (but cheaper than) 0 s<= lhs && lhs s< rhs given rhs is non-negative, where s< denotes the “two’s complement signed less than” comparison.
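For intuition, here is a small sketch (in Python, which is not the post’s pseudocode) of why one unsigned comparison can stand in for the two signed ones, modeling u< by masking to a fixed bit width:

```python
# Model a fixed-width "u<": reinterpret both operands as unsigned by masking.
def u_lt(lhs, rhs, bits=32):
    mask = (1 << bits) - 1
    return (lhs & mask) < (rhs & mask)

# For a non-negative rhs, "lhs u< rhs" matches "0 s<= lhs && lhs s< rhs":
# a negative lhs reinterprets as a huge unsigned value and fails the check.
for lhs in range(-16, 16):
    for rhs in range(0, 16):
        assert u_lt(lhs, rhs) == (0 <= lhs < rhs)
```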

The codegen above is pretty bad. Predictable branches are cheap, but not free, and we’d like to do better than emit four compare-branches for something as simple as foo.

If, for a moment, we focus solely on guaranteeing safety, we notice some redundancy; we could optimize the above to:

void _foo(int[] arr) { /// version_a
  if (!(3 u< arr.length)) throw OutOfBounds;
  arr[0] = 0;
  arr[1] = 1;
  arr[2] = 0;
  arr[3] = 1;
}

and be “just as safe”. No array store will actually write to invalid memory, and if we do end up throwing an OutOfBounds exception (ArrayIndexOutOfBoundsException really, but I don’t like typing) then we know that at least one of the following array accesses would have been illegal.

However, in Java, exceptions like ArrayIndexOutOfBounds are recoverable. Someone could have called foo as

void caller() {
  int[] arr = new int[3];
  try {
    foo(arr);
  } catch (Exception e) {
    assert(arr[1] == 1);
  }
}

and version_a would be observably different for such a caller (some languages explicitly allow this difference, though¹) – it would throw an exception without writing anything to arr[0], arr[1] or arr[2], when the original program would have written to arr[0], arr[1] and arr[2], and then thrown an exception.

Does this mean we’re forever stuck generating the fully checked version of _foo above?² An important observation is that an actual out-of-bounds array access is rare. Despite out-of-bounds exceptions being recoverable, in 99.99% (anecdotal ratio :) ) of cases an out-of-bounds exception is a bug in the application (and gets fixed), so it is okay if we’re slow (but correct!) in cases like caller above.

Let’s make a change in how we represent range checks. Instead of throwing, we “deoptimize”:

void _foo(int[] arr) {
  // loc_0:
  if (!(0 u< arr.length)) deoptimize() [ "deopt"(<values_0>) ];
  arr[0] = 0;
  // loc_1:
  if (!(1 u< arr.length)) deoptimize() [ "deopt"(<values_1>) ];
  arr[1] = 1;
  // loc_2:
  if (!(2 u< arr.length)) deoptimize() [ "deopt"(<values_2>) ];
  arr[2] = 0;
  // loc_3:
  if (!(3 u< arr.length)) deoptimize() [ "deopt"(<values_3>) ];
  arr[3] = 1;
}

Deoptimization³ discards the current runtime frame, and resumes execution in the interpreter from a “safe” state. In the lowering shown above, if arr.length is 2 then we won’t actually throw an exception from compiled code, but will restart execution in the interpreter from loc_2, which will then immediately throw an exception. <values_2>, attached to the call instruction as an “operand bundle”, contains all of the information required to construct the interpreter frame(s) to restart execution from loc_2, as a sequence of SSA values.
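As a toy model (in Python, purely illustrative – the exception-based encoding and all names here are made up, not LLVM’s or any JVM’s representation), deoptimization can be pictured as an unwinding event that carries the resume point and live values, caught by a trampoline that re-enters the interpreter:

```python
# Hypothetical sketch: deoptimization modeled as an exception carrying
# the resume point and the live values needed to rebuild the frame.
class Deoptimize(Exception):
    def __init__(self, resume_at, live_values):
        self.resume_at = resume_at      # "bytecode" index to resume from
        self.live_values = live_values  # snapshot of live values

# The four checked stores of _foo, as (index, value) pairs.
STORES = [(0, 0), (1, 1), (2, 0), (3, 1)]

def compiled_foo(arr):
    for loc, (idx, val) in enumerate(STORES):
        if not idx < len(arr):
            # Bail out; the interpreter re-executes from loc and throws
            # the precise user-visible exception itself.
            raise Deoptimize(resume_at=loc, live_values={"arr": arr})
        arr[idx] = val

def interpret_from(loc, arr):
    for idx, val in STORES[loc:]:
        arr[idx] = val  # raises IndexError once idx >= len(arr)

def run_foo(arr):
    try:
        compiled_foo(arr)
    except Deoptimize as d:
        interpret_from(d.resume_at, **d.live_values)
```

With a length-2 array, compiled_foo bails at loc_2 after the first two stores, and the replay in interpret_from is what actually raises the user-visible exception.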

The “failure paths” are no longer explicit exception throws, but bailouts to the interpreter; and we can optimize the above code to:

void _foo(int[] arr) {
  if (!(3 u< arr.length)) deoptimize() [ "deopt"(<values_0>) ];
  arr[0] = 0;
  arr[1] = 1;
  arr[2] = 0;
  arr[3] = 1;
}

because if we do have a caller like

void caller() {
  int[] arr = new int[3];
  try {
    foo(arr);
  } catch (Exception e) {
    assert(arr[1] == 1);
  }
}

then _foo will jump back to the interpreter, and the interpreter will execute the stores to arr[0], arr[1] and arr[2] and then throw an exception. As far as the caller is concerned, nothing observably changed, but _foo will be faster in the very likely case that there aren’t any range check violations. In the very unlikely case that we do have a range check violation, _foo will have to pay the penalty of deoptimization followed by the penalty of staying in the interpreter till the end of the method⁴.

This technique can be applied to range checks with non-constant indices too. If we have

void bar(int[] arr, int i) {
  arr[i]     = 10;
  arr[i + 1] = 20;
  arr[i + 2] = 30;
  arr[i + 3] = 40;
}

and the math works out, then we can lower and optimize this to:

void _bar(int[] arr, int i) {
  if (!(i       u< arr.length) ||
      !((i + 3) u< arr.length))
    deoptimize();
  
  // No more range checks.
  arr[i]     = 10;
  arr[i + 1] = 20;
  arr[i + 2] = 30;
  arr[i + 3] = 40;
}
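Whether “the math works out” hinges on overflow: a valid array length fits in the non-negative half of the integer range, so once i u< arr.length holds, i + 3 cannot wrap past it. A brute-force sanity check over a small 8-bit model (in Python; a sketch, not how a compiler would prove this):

```python
BITS = 8
MASK = (1 << BITS) - 1

def u_lt(a, b):
    # Compare a and b as BITS-wide unsigned values (wraparound included).
    return (a & MASK) < (b & MASK)

# For every valid array length n (n fits in the non-negative signed half)
# and every possible index i, the two widened checks imply all four
# original checks -- i.e. dropping the middle checks is safe.
for n in range(1 << (BITS - 1)):
    for i in range(1 << BITS):
        if u_lt(i, n) and u_lt(i + 3, n):
            assert all(u_lt(i + k, n) for k in range(4))
```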

The term “widening” describes this transform: the earlier check is made to “fail more often”, on a “broader” range of values.

Why Guards?

The Problem with Control Flow

There is a problem with the above approach. Consider a case like this:

void _foo(int[] arr) {
  bool condition = 0 u< arr.length;
  // loc_0:
  if (!(0 u< arr.length)) deoptimize() [ "deopt"(condition, ...) ];
  arr[0] = 0;
  // loc_1:
  if (!(1 u< arr.length)) deoptimize() [ "deopt"(<values_1>) ];
  arr[1] = 1;
}

Above, condition feeds into the interpreter state. LLVM is allowed (expected, even) to optimize this to:

void _foo(int[] arr) {
  bool condition = 0 u< arr.length;
  // loc_0:
  if (!(0 u< arr.length)) deoptimize() [ "deopt"(false, ...) ];
  arr[0] = 0;
  // loc_1:
  if (!(1 u< arr.length)) deoptimize() [ "deopt"(<values_1>) ];
  arr[1] = 1;
}

but that is a problem if we widen the transformed program:

void _foo(int[] arr) {
  bool condition = 0 u< arr.length;
  // loc_0:
  if (!(1 u< arr.length)) deoptimize() [ "deopt"(false, ...) ];
  arr[0] = 0;
  arr[1] = 1;
}

Now if arr.length is 1 then we’ll deoptimize to the interpreter with false as the value of 0 u< arr.length! This can be arbitrarily bad. E.g. say instead of foo we were optimizing:

void strangeLove(int[] arr) {
  bool condition = 0 u< arr.length;
  arr[0] = 0;
  if (!condition) {
    /* FIXME: TODO: remove dead code. */
    launch_nukes();
  }
  arr[1] = 1;
}

As written above, strangeLove will never call launch_nukes, since if condition is false (and we would have called launch_nukes), we are guaranteed to throw an exception when accessing arr[0]. But if we optimize strangeLove to something like _foo above, then when passed in an array of length 1, we are going to enter the interpreter with condition set to false, but will not fail the range check on arr[0]. This will spuriously call launch_nukes, which is probably not a good thing.
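To make the hazard concrete, here is a toy reproduction (in Python, illustrative only; the encoding is invented): fold the deopt state’s condition to False – valid before widening – then widen the branch, and the resumed interpreter observes a wrong condition:

```python
class Deoptimize(Exception):
    """Carries the (pre-computed) interpreter state."""
    def __init__(self, state):
        self.state = state

def strangelove_compiled_bad(arr):
    # After widening, one check guards both arr[0] and arr[1], but the
    # captured deopt state still says condition == False -- a stale fold.
    if not (1 < len(arr)):
        raise Deoptimize({"condition": False})
    arr[0] = 0
    arr[1] = 1

def interpreter_resume(arr, state):
    # Replay strangeLove from after `condition` was computed, using the
    # captured value; return True iff the nukes get launched.
    arr[0] = 0          # in bounds when len(arr) == 1, so no exception
    if not state["condition"]:
        return True     # spurious launch: condition should have been True
    arr[1] = 1
    return False

def run(arr):
    try:
        strangelove_compiled_bad(arr)
        return False
    except Deoptimize as d:
        return interpreter_resume(arr, d.state)
```

On a length-1 array, the widened check fails, the interpreter resumes with the folded (wrong) condition, the store to arr[0] succeeds, and the nukes launch.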

Enter Guards

What bit us above is that when we let the optimizer loose on _foo for the first time, we told it that the deoptimizing path for the first range check is executed if and only if !(0 u< arr.length). However, only half of that is true! If we allow range check widening, the deoptimizing path is executed if, but not only if, !(0 u< arr.length). To be correct, we’ll have to write our code in a way that correctly expresses the “if but not only if” sentiment.

The most obvious approach is to express the not-only-if’ness of the condition directly in the expression feeding into the branch:

void _foo(int[] arr) {
  bool condition = 0 u< arr.length;
  bool unknown_0 = create_unknown();
  // loc_0:
  if (!(0 u< arr.length) || unknown_0)
    deoptimize() [ "deopt"(condition, ...) ];
  arr[0] = 0;
  // loc_1:
  bool unknown_1 = create_unknown();
  if (!(1 u< arr.length) || unknown_1)
    deoptimize() [ "deopt"(<values_1>) ];
  arr[1] = 1;
}

Given the control flow above, the use of condition in the first deoptimize call cannot be optimized to false: we know that if the branch to deoptimize was not taken then 0 u< arr.length is definitely true (so we know the store to arr[0] is safe), but we can’t assume that if deoptimize was called, 0 u< arr.length was false. Widening the first range check to include the second can then be semantically seen as choosing a suitable value for unknown_0.

However, this extra create_unknown call can get messy quickly by creating incidental complexity. For instance, we’ll have to state its memory effects. If it is readnone or readonly then what happens when some of them get CSE’ed? If it is readwrite then does it inhibit too much optimization? To avoid all of that, and to have a generally cleaner representation, we introduced guard intrinsics to LLVM IR. A guard intrinsic takes an i1 as a parameter, and expresses the following semantics internally:

define void @guard(i1 %pred) {
  %realPred = and i1 %pred, undef
  br i1 %realPred, label %continue, label %leave

leave:
  call void @deoptimize() [ "deopt"(...) ]
  ret void

continue:
  ret void
}

A call to @guard(i1 %condition) guarantees that %condition is true if it returns (if %condition is false, we’ll break out of the compiled code into the interpreter), but there is no guarantee that if a guard deoptimizes then %condition was false – a guard can deoptimize “spuriously” by choosing false for the value of undef. Normally this kind of bitwise operation (and i1 %pred, undef) would immediately be folded to false⁵, but the “implementation” of @guard shown above is really only a specification; we don’t actually provide a body for the intrinsic, but instead lower it directly to a call to @deoptimize guarded by an explicit conditional branch.
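A toy executable model of these semantics (in Python; the real intrinsic is llvm.experimental.guard, and this is just an illustration): a guard that returns guarantees its predicate, and widening amounts to letting an earlier guard fail on extra inputs:

```python
class Deoptimize(Exception):
    pass

def guard(pred):
    # If guard returns, pred was true. If it raises, pred may have been
    # true or false -- the "and i1 %pred, undef" leaves that unspecified.
    if not pred:
        raise Deoptimize()

# Indices and lengths are non-negative here, so plain < models u<.
def foo_guarded(arr):
    guard(0 < len(arr)); arr[0] = 1
    guard(1 < len(arr)); arr[1] = 0
    guard(2 < len(arr)); arr[2] = 1
    guard(3 < len(arr)); arr[3] = 0

def foo_widened(arr):
    # Legal widening: the first guard now also fails for lengths 1..3,
    # i.e. it deoptimizes "spuriously" on those inputs.
    guard(3 < len(arr))
    arr[0] = 1; arr[1] = 0; arr[2] = 1; arr[3] = 0
```

Both versions deoptimize on short arrays; the widened one just does so earlier, which is fine because deoptimizing with the right interpreter state is always correct.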

Using guards also benefits us indirectly, by avoiding unnecessary basic blocks that don’t express user-level logic. A simpler control flow graph is better for compile time and memory usage.

All in all, this is how the widening optimization can be expressed using guards. First

void foo(int[] arr) {
  arr[0] = 1;
  arr[1] = 0;
  arr[2] = 1;
  arr[3] = 0;
}

gets lowered to:

void _foo(int[] arr) {
  guard(0 u< arr.length);
  arr[0] = 1;
  guard(1 u< arr.length);
  arr[1] = 0;
  guard(2 u< arr.length);
  arr[2] = 1;
  guard(3 u< arr.length);
  arr[3] = 0;
}

Note: we don’t negate the condition, since guard deoptimizes if the condition passed to it is false.

and then optimized and widened to:

void _foo(int[] arr) {
  guard(3 u< arr.length);
  arr[0] = 1;
  arr[1] = 0;
  arr[2] = 1;
  arr[3] = 0;
}

and finally lowered to:

void _foo(int[] arr) {
  if (!(3 u< arr.length)) deoptimize();
  arr[0] = 1;
  arr[1] = 0;
  arr[2] = 1;
  arr[3] = 0;
}

Putting it all Together

If you want to use check widening in LLVM for your compiler, you have to

  • Implement some form of deoptimization (this is a fundamental change to your runtime), and use that to implement exception throws.

  • Represent checks that you want to widen as predicates passed to guard intrinsics.

  • Schedule your pass pipeline to run the GuardWidening pass, and at some later point run the LowerGuardIntrinsic pass (these are still experimental passes, so use at your own risk, don’t use in production, no backward compatibility guarantees etc.).

  • Yell at me when things fall over. :)

  1. For instance, see “VI.Annex F Imprecise faults” in “Standard ECMA-335 Common Language Infrastructure (CLI)”

  2. According to Betteridge’s law of headlines the answer is obviously no. :)

  3. Hölzle, Urs, Craig Chambers, and David Ungar. “Debugging optimized code with dynamic deoptimization.” ACM Sigplan Notices. Vol. 27. No. 7. ACM, 1992.

  4. Typically a JVM will have ways to heal such widened range checks back into normal range checks if the assumption of out of bounds access being “very unlikely” no longer holds for a particular check.

  5. This fold will be a performance issue, since we’ll branch to the interpreter unnecessarily, but it would be correct. Deoptimizing with the right interpreter state is always correct.