LLVM, a toolbox for code portability

LLVM logo emerging from a computer

A compiler front end supplies LLVM with the IR (Intermediate Representation) code it produces. This language is quite similar to assembly, and the optimizations LLVM applies to it, which would be difficult to write and maintain for a high-level language, give the code the required speed. The IR can then be executed by a Just-In-Time compiler that translates it almost directly into machine code.

LLVM (originally Low Level Virtual Machine) is more than a virtual machine: it is a compiler infrastructure, a set of tools written in C++ that run on Linux, other Unix systems, and Windows. These tools work with LLVM bitcode (not bytecode), which is the packaging of IR code into a distributable module.
The tools include the Clang compiler, which supports four languages and produces either bitcode or a binary executable, a code optimizer, the LLDB debugger, a linker, JIT virtual machines, and an interpreter.
LLVM was created at the University of Illinois and received contributions from Apple, which notably added Clang.

LLVM gives new life to old programs written in C, C++ or other languages (all statically typed languages can be compiled into IR). They can run on new systems. They can be converted into JavaScript (with Emscripten) and run in browsers. Or, once translated into IR, they can target Portable Native Client, which also works on any system.
LLVM is an alternative to Java with the addition of the Web as a target: its IR code may be compiled to asm.js or to WebAssembly.

There are, however, some disadvantages. The LLVM runtime does not include a garbage collector; one must be supplied by the runtime of the compiled language. In addition, the intermediate code is not portable: it has to be produced for a specific processor architecture. That is why WebAssembly was invented. LLVM is also described as a moving target, because the code it produces evolves over time.
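The portability limit can be illustrated with a small C sketch. It assumes a C11 compiler; the `record` type and the three functions are hypothetical names chosen for the illustration. The size of `long`, the layout of a structure and the branch selected by the preprocessor are all decided by the front end for one target, so they typically reach the IR as plain constants. For example, `long` is 8 bytes on 64-bit Linux but 4 bytes on 64-bit Windows.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* CHAR_BIT */
#include <stddef.h>   /* size_t, offsetof */

/* The C standard only guarantees a minimum width for long. */
static_assert(sizeof(long) * CHAR_BIT >= 32, "long has at least 32 bits");

/* Hypothetical record: its padding and size depend on the target ABI. */
struct record {
    char tag;
    long value;
};

/* Resolved at compile time for the chosen target, not when the code runs. */
size_t long_size(void)
{
    return sizeof(long);
}

/* The offset of a field is also fixed by the front end for one target. */
size_t value_offset(void)
{
    return offsetof(struct record, value);
}

/* The preprocessor picks one branch before any IR exists. */
const char *target_family(void)
{
#if defined(_WIN32)
    return "windows";
#else
    return "other";
#endif
}
```

Compiling this file for two different targets should give two different IR modules, which is why one IR file cannot simply be shipped to every platform.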

In 2018, contributors unexpectedly left the project, saying that it had become too political and put a social agenda before competence. This raised doubts about the future quality of the code.

Diagram of how LLVM works

LLVM can generate bitcode from many statically typed languages: C and Objective-C with Clang; Java, Ada and Fortran with GCC; and other languages with other compilers, as soon as they support bitcode output.

This bitcode is optimized, and it can then be used directly by an LLVM virtual machine. With a linker and static compilation, it can become a binary executable. You can also use Emscripten to convert it to JavaScript or asm.js, allowing the program to run in the browser.

LLVM includes a JIT virtual machine that is used by Mono, Julia and many other projects.

Difference between Java and LLVM code

The best way to see the difference between the two kinds of generated code is with an example:

Here is a simple function in C or Java to compile:

int arith(int x, int y, int z) {
    return x * y + z;
}

LLVM produced this IR code:

define i32 @arith(i32 %x, i32 %y, i32 %z) {
entry:
  %tmp = mul i32 %x, %y
  %tmp2 = add i32 %tmp, %z
  ret i32 %tmp2
}

While Java produced this bytecode (it can be seen with the javap tool in the JDK):

public class demo.Demo {
  public static int arith(int, int, int);
    Code:
       0: iload_0
       1: iload_1
       2: imul
       3: iload_2
       4: iadd
       5: ireturn
}

We see that the bytecode, besides looking closer to machine language, uses an operand stack to hold data and operate on it, whereas IR uses named virtual registers and memory.
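The two evaluation styles can be sketched in C. This is only an illustration: the `opcode` enum, the `insn` structure and the function names are hypothetical, and the tiny interpreter models just the four operations that appear in the listing above, not the real JVM. The first version pushes and pops values on a stack as the bytecode does. The second names each intermediate value once, as the IR does with `%tmp` and `%tmp2`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mini instruction set mirroring the opcodes in the javap listing. */
enum opcode { OP_ILOAD, OP_IMUL, OP_IADD, OP_IRETURN };

struct insn {
    enum opcode op;
    int slot;                 /* local variable index, used only by OP_ILOAD */
};

/* Stack style: operations pop their operands and push their result. */
int run_stack(const struct insn *code, size_t len, const int *locals)
{
    int stack[8];
    size_t sp = 0;            /* number of values currently on the stack */

    for (size_t pc = 0; pc < len; pc++) {
        switch (code[pc].op) {
        case OP_ILOAD:
            assert(sp < 8);
            stack[sp++] = locals[code[pc].slot];
            break;
        case OP_IMUL:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_IADD:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_IRETURN:
            assert(sp >= 1);
            return stack[sp - 1];
        }
    }
    assert(0 && "program ended without OP_IRETURN");
    return 0;
}

/* Same sequence as the listing: iload_0, iload_1, imul, iload_2, iadd, ireturn. */
int arith_stack(int x, int y, int z)
{
    static const struct insn code[] = {
        { OP_ILOAD, 0 }, { OP_ILOAD, 1 }, { OP_IMUL, 0 },
        { OP_ILOAD, 2 }, { OP_IADD, 0 }, { OP_IRETURN, 0 },
    };
    const int locals[3] = { x, y, z };
    return run_stack(code, sizeof code / sizeof code[0], locals);
}

/* Register style: each intermediate value has its own name, assigned once. */
int arith_registers(int x, int y, int z)
{
    const int tmp = x * y;    /* %tmp  = mul i32 %x, %y   */
    const int tmp2 = tmp + z; /* %tmp2 = add i32 %tmp, %z */
    return tmp2;
}
```

Both functions return the same result for the same arguments. The difference is only in where intermediate values live: in anonymous stack slots, or in named registers that an optimizer can track individually.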

Difference between bitcode and bytecode

What is the difference between the IR and bitcode? Why do we speak of bitcode and not bytecode, as is the case for Java?

In both cases, the code can be executed by a virtual machine, with or without a JIT compiler. The name bytecode comes from the fact that each instruction's opcode was originally encoded in one byte. That is no longer necessarily the case, but the stream is still a stream of bytes (bytestream), whereas the term bitcode signals that the stream is expressed in bits (bitstream): its fields are not whole bytes but units of varying bit widths.

IR (Intermediate Representation) is a language designed to be consumed by a virtual machine or a compiler. In LLVM, the file that encapsulates it is called bitcode: a bitstream composed of blocks and records.
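To make the idea of a bitstream concrete, here is a minimal C sketch of a reader that addresses a buffer bit by bit. It follows the conventions described in the LLVM bitcode format documentation as understood here: bits are consumed least significant first, variable-width integers are stored in fixed-size chunks whose top bit signals a continuation, and a raw bitcode file starts with the bytes 'B', 'C', 0xC0, 0xDE. The `bit_reader` structure and the function names are hypothetical and are not part of any LLVM API. Real bitcode adds abbreviations, blocks and records on top of these primitives.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cursor over a buffer that is addressed in bits rather than bytes. */
struct bit_reader {
    const uint8_t *data;
    size_t size_bits;         /* total number of readable bits */
    size_t pos;               /* index of the next unread bit */
};

/* Read an unsigned field of `width` bits (1..32), least significant bit first. */
uint32_t read_fixed(struct bit_reader *r, unsigned width)
{
    assert(width >= 1 && width <= 32);
    assert(r->pos + width <= r->size_bits);

    uint32_t value = 0;
    for (unsigned i = 0; i < width; i++, r->pos++) {
        uint32_t bit = (r->data[r->pos / 8] >> (r->pos % 8)) & 1u;
        value |= bit << i;
    }
    return value;
}

/* Read a variable-width integer stored as `width`-bit chunks, low chunk first.
 * The top bit of a chunk means "another chunk follows"; the other bits are payload. */
uint32_t read_vbr(struct bit_reader *r, unsigned width)
{
    assert(width >= 2 && width <= 32);

    const uint32_t more = 1u << (width - 1);
    uint32_t value = 0;
    unsigned shift = 0;

    for (;;) {
        uint32_t chunk = read_fixed(r, width);
        assert(shift < 32);   /* this sketch only handles 32-bit results */
        value |= (chunk & (more - 1)) << shift;
        if (!(chunk & more))
            return value;
        shift += width - 1;
    }
}

/* Check the four magic bytes expected at the start of a raw bitcode file. */
int has_bitcode_magic(const uint8_t *data, size_t size)
{
    return size >= 4 && data[0] == 'B' && data[1] == 'C'
        && data[2] == 0xC0 && data[3] == 0xDE;
}
```

With 4-bit chunks, the value 27 occupies two chunks, 1011 then 0011, which fit in the single byte 0x3B. A bytecode reader would never need to split a byte in this way.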

Tools

On Windows, you can use Visual Studio or Eclipse CDT with the LLVM plugin. Qt Creator can also use Clang.