GSoC Finals

The end of my Google Summer of Code is approaching, so it is time to take a breath and summarize what I’ve done over the last few months. I’ll take the opportunity to lead you, step by step, through the various stages of my project and outline the tasks I’ve been working on (for a better understanding you may also have a look at my original project proposal).

Making first contact

My very first activities involved finding a way to make Polly’s optimization capabilities available to Julia programs. For this purpose I introduced a new macro @polly that can be used to explicitly annotate functions which you want Polly to optimize:
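For illustration, an annotated function might look like this (a hypothetical kernel, not taken from the original post):

```julia
# Hypothetical example: ask Polly to optimize this simple scaling loop.
@polly function scale!(A, c)
    for i = 1:length(A)
        A[i] = c * A[i]
    end
    return A
end
```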

However, since polyhedral optimization in Julia is still at an experimental stage, this feature will not be enabled by default. You may therefore have to build Julia from source in order to activate it. The simplest way is to set up a Make.user file that contains the following two lines:
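Concretely, the file would contain:

```make
USE_POLLY:=1
LLVM_VER:=svn
```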

With USE_POLLY:=1 you tell the build system to activate all the parts that are necessary to use @polly. Currently, this also requires building Julia with the latest version of LLVM, hence LLVM_VER:=svn is needed. The build will then download and compile the latest versions of LLVM and Polly to build your Julia executable.

Alternatively, if you want to use a local build of LLVM you may use a Make.user that looks like the following:
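A plausible sketch of such a file (apart from USE_POLLY, the variable names here are assumptions about the Julia build system of that time and may have differed):

```make
USE_POLLY:=1
USE_SYSTEM_LLVM:=1
LLVM_CONFIG:=YOUR_LLVM_BUILD_DIR/bin/llvm-config
```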

This way, the build system knows you want to use your own LLVM build located at YOUR_LLVM_BUILD_DIR. If you use a static LLVM build you may also want to add the configuration USE_LLVM_SHLIB:=0.

Additionally, I added a command line option --polly=no that deactivates Polly without having to manually remove all the @polly annotations from your program, which can, for example, simplify debugging or performance tracking.

The PRs responsible for bringing you this functionality are the following:

Simplifying boolean expressions

For the representation of truth values Julia provides a Bool datatype, which it internally represents as an 8-bit value instead of a single bit. This has interesting consequences when it comes to the lowering of branch conditions to LLVM IR. For example, consider the following code snippet:
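For instance, a condition that combines two comparisons (a hypothetical snippet, for illustration):

```julia
# Hypothetical example: the branch condition combines two comparisons.
function f(x, y)
    if (x < 10) & (y < 10)
        return x + y
    end
    return 0
end
```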

When translated to LLVM IR we would get code that corresponds to the following:
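A hand-written sketch of the relevant part of the lowered IR (not actual compiler output; register names are made up):

```llvm
%c1   = icmp slt i64 %x, 10
%b1   = zext i1 %c1 to i8
%c2   = icmp slt i64 %y, 10
%b2   = zext i1 %c2 to i8
%b    = and i8 %b1, %b2
%cond = trunc i8 %b to i1
br i1 %cond, label %then, label %else
```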

Since icmp instructions deliver results of type i1 (which, as its name suggests, has a bit-width of one), Julia inserts additional zext instructions to make them conform to the 8-bit Bool representation. The noise introduced by these cast operations also limits Polly’s ability to optimize such code parts. For example, consider the following loop:
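For instance, a loop of the following shape (again a hypothetical example):

```julia
# Hypothetical loop whose branch condition involves a combined comparison.
function g!(A, n)
    for i = 1:n
        if (i > 1) & (i < n)
            A[i] = 0
        end
    end
    return A
end
```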

Due to the presence of the casts, Polly would not be able to analyze the branch condition. Thus, when the polyhedral representation is constructed, this code region has to be conservatively overapproximated, which leads to a single atomic polyhedral statement for the whole loop body.

Having a second look at the above IR code, one might wonder why these cast operations can’t simply be removed by LLVM, since they seem redundant in this context. After investigating this a bit it turned out that this is actually a missed optimization in LLVM’s InstCombine pass. I therefore took the chance to enhance InstCombine with a corresponding transformation. On our example above, it pushes the casts on the icmp results past the and, which brings us the following code:
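Sketched in IR, the zext now sits behind the and (again hand-written, not actual compiler output):

```llvm
%c1   = icmp slt i64 %x, 10
%c2   = icmp slt i64 %y, 10
%c    = and i1 %c1, %c2
%b    = zext i1 %c to i8
%cond = trunc i8 %b to i1
br i1 %cond, label %then, label %else
```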

Finally, InstCombine is then able to remove the remaining zext entirely, leading us to:
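Sketched in IR (hand-written):

```llvm
%c1   = icmp slt i64 %x, 10
%c2   = icmp slt i64 %y, 10
%cond = and i1 %c1, %c2
br i1 %cond, label %then, label %else
```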

With this change, Polly is now also able to correctly analyze the branch condition and extracts two separate polyhedral statements for the given code region.

The relevant patches for this change are the following:

Supporting iteration

Julia provides a flexible way to iterate sequentially over a range of values, based on its iteration interface, which is defined by the three methods start(), next(), and done(). This interface is also the conceptual basis of the concise for loop syntax (examples adopted from the Julia manual):
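For an iterable object I, a for loop is written as (following the manual’s generic example):

```julia
for i = I
    # body
end
```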

which is, in effect, only syntactic sugar for
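In terms of the three interface methods, following the Julia manual of that time:

```julia
state = start(I)
while !done(I, state)
    (i, state) = next(I, state)
    # body
end
```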

Two very essential types that implement this interface are UnitRange and StepRange, which enable you to write for i = 1:n to iterate over the values from 1 to n, or for i = 1:2:n if you only want to visit every second value in this range. Actually, 1:n is just a shortcut for UnitRange(1,n) and 1:2:n for StepRange(1,2,n). However, their current implementation makes it difficult for Polly to derive the bounds of loops that iterate over such ranges. Therefore, I supplied the alternative types Polly.UnitRange and Polly.StepRange, which are implemented in a way that helps Polly derive loop bounds more easily. It is not necessary to use these ranges explicitly, though, i.e. to write for i = Polly.UnitRange(1,n) or for i = Polly.StepRange(1,2,n). Instead, loops that are part of a @polly annotated function will automatically be rewritten to use the new ranges (i.e. 1:n is automatically replaced by Polly.UnitRange(1,n)).

The PR that provides this functionality can be viewed at https://github.com/JuliaLang/julia/pull/17965 (unfortunately it has not yet been merged into Julia’s main repository).

Eliminating bounds checks

By default, Julia creates a bounds check for each array access to make sure that no memory outside the array is touched. That means that for an access as simple as A[i], Julia would actually produce code like the following:
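Conceptually, something like this (a simplified sketch in Julia syntax, not the actual lowered form):

```julia
# Sketch of the implicit bounds check that guards a read of A[i].
if !(1 <= i <= length(A))
    throw(BoundsError(A, i))
end
A[i]
```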

These checks come at a cost, since they introduce additional overhead at run-time, especially when they occur inside loops. By analyzing the conditions of such checks, Polly is able to detect and eliminate bounds checks, or at least hoist them out of a loop if executing them over and over again would be redundant. At the moment, however, Polly’s ability to do so is limited by the way these checks are lowered to LLVM IR. The error branches of such bounds checks result in basic blocks that are terminated by unreachable instructions, and these instructions currently prevent LLVM’s RegionInfo analysis from finding single-entry single-exit regions in such programs. There already exists a patch that tackles this problem (https://reviews.llvm.org/D12676, authored by Tobias Grosser) but unfortunately it has not yet been upstreamed.

However, I based my further work on this patch in order to proceed with enabling the actual bounds check elimination (my rebased version of the patch can be found at the top of the following branch of my LLVM fork: https://github.com/MatthiasJReisinger/llvm/commits/post-dom-fix). Furthermore, experimenting with this patch helped to identify a bug in Polly’s schedule building algorithm, which I resolved on the following branch of my Polly fork: https://github.com/MatthiasJReisinger/polly/tree/boundchecks.

With these adaptations it is already possible to effectively eliminate bounds checks for accesses to one-dimensional arrays. However, when approaching the elimination of bounds checks for array accesses that involve multiple dimensions, another obstacle shows up. For example, an access A[i,j,k] to a three-dimensional array A involves the following bounds checks:
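Sketched in Julia syntax (the emitted checks differ in form; the final linearized expression is the one discussed below):

```julia
# Sketch of the checks guarding A[i, j, k].
1 <= i <= size(A, 1) || throw(BoundsError(A, (i, j, k)))
1 <= j <= size(A, 2) || throw(BoundsError(A, (i, j, k)))
1 <= (k * size(A, 2) + j) * size(A, 1) + i <= length(A) || throw(BoundsError(A, (i, j, k)))
```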

The last of these checks is problematic since it involves the non-affine expression ((k * size(A,2) + j) * size(A,1) + i). Polly is not able to analyze expressions of this form, which prevents it from applying its bounds check elimination to such code parts. I therefore adapted Julia’s bounds check emission to generate simpler code of the following form:
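That is, one affine check per dimension, sketched as:

```julia
# Sketch of the simplified per-dimension checks for A[i, j, k].
1 <= i <= size(A, 1) || throw(BoundsError(A, (i, j, k)))
1 <= j <= size(A, 2) || throw(BoundsError(A, (i, j, k)))
1 <= k <= size(A, 3) || throw(BoundsError(A, (i, j, k)))
```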

The corresponding PR for this adaption can be found at https://github.com/JuliaLang/julia/pull/18064.

PolyBench.jl

As I already described in my previous blog post, as part of my work I ported the PolyBench 4.1 benchmark suite to Julia; it is now available as the Julia package PolyBench.jl. Like the original suite, it contains 30 programs for numerical computations, each in the form of a Static Control Part (SCoP). One of the major objectives of my work was to maximize the number of SCoPs that Polly fully detects. The table below shows how my contributions increased this number for PolyBench.jl:

                         Start of GSoC    Now
bounds checks enabled                0     27
bounds checks disabled              19     28

The increase in detected SCoPs in the presence of bounds checks is due to the simplification of the bounds check emission in Julia’s code generator described above. With bounds checks deactivated, the achieved gains are mostly due to the refinements of Julia’s LLVM pass pipeline and the introduction of Polly.UnitRange and Polly.StepRange.

Further contributions

Besides the actual work on the project itself, I also fixed some minor compatibility issues between Julia and newer LLVM versions:

Wrapping up

Julia is now able to make use of Polly’s optimization abilities, but there are still problems to be solved in order to leverage its full power across a wide range of application scenarios. I hope that my contributions have cleared the way for further work, and I would certainly be glad to keep helping, also in the future, to improve the experience of using polyhedral optimization within Julia.

Porting PolyBench to Julia

In my previous post I gave a short introduction to my GSoC project. I described how Polly can now be used to optimize Julia code and provided a vague outlook on the kind of problems that I would have to solve. In this post I want to take up this last thought to give an overview of my recent activities.

One of the major tasks in my work is to identify situations in which Polly cannot unfold its optimization potential. That means it’s crucial to find concrete pieces of Julia code which Polly fails to optimize. However, Polly is not intended to optimize arbitrary programs – it has definite requirements on the code for which its optimizations are requested. Only certain code regions, referred to as Static Control Parts, or SCoPs for short, are amenable to Polly, and to polyhedral optimization in general (a detailed description of SCoPs can, for example, be found in Tobias Grosser’s diploma thesis).

Fortunately, with PolyBench there exists a dedicated collection of programs containing such SCoPs. This benchmark suite has also been used in the past to analyze Polly and its optimization abilities. However, the benchmarks that are part of PolyBench are C programs (to be exact, there also exists a Fortran version). It therefore stood to reason to port this benchmark suite to Julia – the tentative result can be reviewed at my github repository.

To illustrate the performance that can currently be achieved when using Polly to optimize SCoPs in Julia, I performed a simplistic benchmark run of this PolyBench port – the results are depicted in Figure 1 below. It shows the average speedups over 10 benchmark runs, for which I extracted the execution times via Julia’s Base.time_ns() utility. It is already possible to achieve notable speedups on some of these programs. The slowdown of others, however, may indicate problems in the optimization process.

polybenchjulia
Figure 1: Speedup of the ported PolyBench programs with Polly enabled in Julia.

In order to get a feeling for what can be expected for these SCoPs in general, I also ran the original PolyBench/C benchmark suite. Figure 2 below compares the optimized benchmarks, compiled via clang -O3 -mllvm -polly, to the unoptimized benchmarks, compiled via clang -O3 (additionally, I configured PolyBench to use cycle-accurate profiling to enable exact measurements via its POLYBENCH_CYCLE_ACCURATE_TIMER option).

On the one hand, these results might highlight optimization potential that could not yet be leveraged by Polly within Julia; on the other hand, they might illustrate current restrictions of Polly itself. (Some differences in the slowdowns for individual benchmarks may also stem from the differing measurement approaches.)

polybenchc
Figure 2: Speedup of original PolyBench/C programs when optimized via Polly.

What’s next

This benchmark set has enabled me to systematically analyze SCoPs in Julia and continuously helps to identify characteristics of Julia’s internals that currently prevent Polly from performing optimizations. In my next posts I will highlight some of these in more detail and, as last time, I’m looking forward to letting you know about prospective findings.

Enabling Polyhedral Optimizations in Julia

Over the next few months I will be working on an exciting project as one of this year’s Google Summer of Code students. My work aims at making the power of polyhedral optimization available to the programming language Julia (I also cordially invite you to have a look at my project proposal for an overview of my planned activities). When a Julia program is executed, its source code is translated, at run-time, to native machine code via the LLVM compiler framework. Fortunately, with Polly there already exists a polyhedral optimizer for LLVM-based compilers – so theoretically I will “just” have to integrate Polly into Julia’s compiler tool-chain. As so often, however, it’s the technical details that make this practically challenging. GSoC already started last week and the very first steps have been taken, so I would like to give a little insight into my initial activities.

First steps towards Polly

In LLVM, the basic means of organizing functionality is a pass – a component that fulfills a certain (ideally coherent) responsibility within the whole compilation process. It is possible to decide quite freely which of LLVM’s functionality to use by instructing it to include only particular passes. The same holds true for Julia – when it initializes LLVM it chooses a certain set of passes that it wants to leverage when compiling programs to machine code. Polly, too, is essentially a collection of such passes, providing all the necessary analyses and transformations that together make up a polyhedral optimizer.

LLVM makes it relatively easy to configure its pass infrastructure, which also makes it possible to add Polly’s passes to Julia’s pass-pipeline in a relatively non-intrusive manner. The required changes can be viewed in the corresponding pull request, which has already been merged into Julia’s official github repository. Furthermore, this pull request gives Julia programmers the possibility to explicitly run Polly’s optimizations for particular functions – for this purpose a new macro @polly was introduced to annotate those functions for which Polly should be run.

How to use the new features

This functionality is still at an early stage and using it requires a few preparatory steps. First of all it is necessary to download and build LLVM and Polly from source (for example as described here). Furthermore, the development version of Julia has to be downloaded from its github repository, and Julia’s build process has to be configured to build against the previously compiled version of LLVM/Polly. For this, a Make.user file has to be created in the directory where Julia will be built, containing the following lines (where YOUR_LLVM_BUILD_DIR has to point to the previously compiled LLVM build):
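A plausible sketch of such a Make.user (both variable names are assumptions about the build system of that time; the original post’s exact lines may have differed):

```make
USE_SYSTEM_LLVM:=1
LLVM_CONFIG:=YOUR_LLVM_BUILD_DIR/bin/llvm-config
```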

It will then be possible to annotate functions in your Julia programs in order to optimize them via Polly:
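For example (a hypothetical function, for illustration):

```julia
# Hypothetical example: sum all entries of a three-dimensional array.
@polly function sum3d(A)
    s = zero(eltype(A))
    for k = 1:size(A, 3), j = 1:size(A, 2), i = 1:size(A, 1)
        s += A[i, j, k]
    end
    return s
end
```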

What’s next

One of the next goals is to ease this build process by removing the need to manually build LLVM/Polly. Instead, it should be possible to retrieve and build Polly automatically when building Julia. In addition, to give you an idea of what I will be doing for the rest of the summer, let’s have a look at the following code sample. It multiplies the square matrices A and B, writing the result to C.
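A reconstruction of such a sample (the original listing may have differed in details):

```julia
# Square matrix multiplication: C = A * B, with bounds checks skipped.
@polly function matmul_square!(C, A, B)
    n = size(A, 1)
    @inbounds for i = 1:n, j = 1:n
        acc = zero(eltype(C))
        for k = 1:n
            acc += A[i, k] * B[k, j]
        end
        C[i, j] = acc
    end
    return C
end
```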

Provided the input data is not too small, Polly will make this function run significantly faster. However, you might notice the @inbounds macro in front of the for loop. Usually, Julia emits bounds checks for array accesses. With this macro, however, it skips these checks and assumes that the accesses are safe. Normally you should only use this macro for code regions where you can be sure that no out-of-bounds accesses happen. Even though we cannot guarantee all accesses to be safe, there is a particular reason for the presence of the macro – in its current state Polly is not able to optimize the code that Julia generates when bounds checks are enabled. This means that removing @inbounds here also removes the ability to optimize this function.

Furthermore, you might ask yourself why we restricted ourselves to square matrices. So let’s at least remove this restriction and allow the multiplication of non-square matrices as well:
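Again as a reconstruction, the generalized version might look like this:

```julia
# General matrix multiplication: C = A * B for compatible rectangular matrices.
@polly function matmul!(C, A, B)
    @inbounds for i = 1:size(A, 1), j = 1:size(B, 2)
        acc = zero(eltype(C))
        for k = 1:size(A, 2)
            acc += A[i, k] * B[k, j]
        end
        C[i, j] = acc
    end
    return C
end
```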

Unfortunately, this change too will cause Julia to emit code that cannot be optimized by Polly.

So as these examples suggest, my work does not end here. In order to leverage the full power of Polly from within Julia it will be necessary to identify the reasons that prevent Polly from optimizing programs like the ones above. Then, of course, it will be necessary to either adapt Julia’s LLVM code generator to emit code that is amenable to Polly, or to directly adapt Polly’s analysis and transformation passes and improve their applicability.

Either way, I’m looking forward to letting you know about any prospective findings.

Acknowledgements

Special thanks go to Karthik Senthil, who provided a first patch that experimentally integrated Polly into an earlier version of Julia. It built the basis for my first changes and helped me find my way into Julia’s internals.

Cold Open

There are many great developers out there and there are many great writers out there. Even though I do not consider myself part of the intersection of these two sets, I want to use this blog to share things that I come across while working, studying, and so on, and that I find noteworthy. This probably won’t happen on a regular basis, but I’ll do my best to come up with something new every now and then.