Why is Julia so fast?
I have always wanted to understand what makes one programming language faster than another. The problem is that in order to know this, one must understand in detail how the specific language one wishes to study is implemented. This is often a complicated task: there are countless books that explain the basics of a programming language's syntax, but hardly any that explain the technical and theoretical details behind a language's design. Generally, to acquire this knowledge, one must spend a lot of time programming in that language, get to know the project from within, read thousands of blog posts and Stack Overflow questions and answers, and so on.
I’ve always found this a pity, because I believe the fundamental technical aspects behind a programming language can be summarized in about a hundred pages, which would drastically shorten the path to becoming an ‘expert’ in that language. I don’t aim for such a thing with this blog, but in the case of Julia, although no such book exists, it is possible to find references that concentrate and describe the fundamental aspects behind the language design. Moreover, the fact that Julia is a language oriented towards performance makes it ideal for this kind of study.
Julia Implementation
Julia is designed to generate efficient native code at runtime. The Julia compiler is an optimizing just-in-time (JIT) compiler structured in three phases:
- Source code is first parsed into abstract syntax trees [2].
- Those trees are then lowered into an intermediate representation that is used for Julia-level optimizations.
- Once those optimizations are complete, the code is translated into LLVM IR and machine code is generated by LLVM.
With the exception of the standard library, which is pre-compiled, all Julia code executed by a running program is compiled on demand. The compiler is relatively simple: it is a method-based JIT without compilation tiers; once methods are compiled they are not changed, as Julia does not support deoptimization with on-stack replacement.
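These phases can be observed from the REPL with the introspection macros in the `InteractiveUtils` standard library, which display the code a given call produces at each stage. A minimal sketch with a hypothetical `add` function:

```julia
using InteractiveUtils  # standard library: @code_lowered, @code_typed, @code_llvm, @code_native

add(x, y) = x + y

@code_lowered add(1, 2)   # the lowered IR used for Julia-level optimizations
@code_typed add(1, 2)     # the IR after type inference and Julia-level optimization
@code_llvm add(1, 2)      # the LLVM IR handed to LLVM
@code_native add(1, 2)    # the final machine code generated by LLVM
```

Each macro looks up the specialization that the given call would dispatch to and prints the corresponding stage of its compilation.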
Memory is managed by a stop-the-world, non-moving, mark-and-sweep garbage collector [3,4]. The mark phase can be executed in parallel. The collector has a single old generation for objects that survive a number of collection cycles. It uses a shadow stack [5] to record pointers, which makes it a precise collector.
Method Specialization
Julia’s compilation strategy relies on runtime type information: every time a method is called with a new tuple of argument types, the method is specialized to those specific types. Such a strategy has the following advantages:
- Specializing methods at invocation time provides the Just-In-Time (JIT) compiler with important information, such as the known memory layout of all arguments. This enables optimizations like unboxing and direct field access.
- Specialization allows for devirtualization, which replaces dynamic method dispatch with direct calls to a specialized method. This replacement reduces dispatch overhead and enables inlining, leading to improved performance.
The biggest disadvantage of method specialization is that the compilation process can be slow. To address this issue, the results are cached, and methods are only compiled the first time they are called with a new type.
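This compile-once-per-type-tuple behavior is easy to observe with `@time`, which on recent Julia versions also reports the fraction of time spent compiling. A sketch with a hypothetical `square` function:

```julia
square(x) = x * x

@time square(3)    # first call with (Int,): triggers specialization + compilation
@time square(4)    # same type tuple: reuses the cached native code, no compilation
@time square(3.0)  # new type tuple (Float64,): compiles a fresh specialization
```

The first and third calls pay the compilation cost; the second hits the method cache and runs at full native speed.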
Type inference
The design of the type system is one of the most subtle aspects of Julia and a key reason for its efficiency. On the one hand, it possesses a rich typing structure; at the same time, it is not overly complex, as that would hinder and complicate the compiler’s task of optimizing code.
Type information plays a crucial role in optimizing code in Julia. The compiler analyzes the flow of data and applies constraints to determine the types involved. Type requirements must be satisfied at function calls and assignments. Through intra- and interprocedural analysis, types are inferred based on the specific arguments provided, even in cases involving recursion. However, to ensure the analysis process doesn’t become infinite for methods that could potentially have an infinite number of argument or return types, Julia imposes a limit on the size of inferred types. This limitation ensures that the set of possible types remains finite and guarantees termination of the analysis.
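As an illustrative sketch, `Base.return_types` (a diagnostic tool, not a stable API) shows what inference concludes for a given tuple of argument types, using two hypothetical functions:

```julia
f(x) = x + 1             # type-stable: output type follows from input type
g(x) = x > 0 ? 1 : 1.0   # type-unstable: return type depends on a runtime value

Base.return_types(f, (Int,))   # inference yields a single concrete type, Int64
Base.return_types(g, (Int,))   # only Union{Float64, Int64} can be inferred
```

When inference produces a concrete type, as for `f`, the compiler can unbox values and devirtualize calls; for `g` it must fall back on runtime checks.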
Method Inlining
Inlining in Julia refers to replacing a function call with the body of the called function. Inlining is particularly effective in Julia because of its synergy with specialization and type inference: type-stable method bodies can be inlined effectively, and inlining in turn aids type inference by providing additional context. Inlined code can be further optimized by eliminating unnecessary branches and propagating precise type information. However, inlining may incur high memory costs and additional compilation time, so pragmatic heuristics limit its application.
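The `@inline` and `@noinline` annotations let the programmer nudge these heuristics in either direction. A minimal sketch with hypothetical helpers:

```julia
# Small, type-stable helpers are prime inlining candidates:
halve(x) = x / 2

@inline double(x) = 2x         # request inlining even if the heuristics would decline
@noinline trace(x) = (println("called"); x)  # forbid inlining, e.g. to keep a stack frame

compute(x) = double(halve(x))  # after inlining, reduces to 2 * (x / 2)
```

In `compute`, both calls are typically inlined away, leaving straight-line arithmetic with no call overhead.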
Object Unboxing
In Julia, due to its dynamic nature, a variable can hold values of multiple types. Consequently, in the general case, values are allocated on the heap with a tag that specifies their type. However, the unboxing optimization allows values to be manipulated directly. Several design choices contribute to this optimization. First, concrete types in Julia are final: they specify both the size and the layout of a value. This differs from languages like Java or TypeScript, which allow subtyping of concrete types. Additionally, Julia does not have a null value, which eliminates the need for an extra tag on primitive values. As a result, values such as integers and floats can always be stored unboxed. Repeated boxing and unboxing can be expensive, and unboxing may not be possible for recursive data structures even when type information is available, so, as with inlining, heuristics are used to determine when to perform this optimization.
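The `isbitstype` predicate identifies types that can always live unboxed; a hypothetical `Point` struct illustrates the layout consequences (sizes assume a 64-bit machine):

```julia
struct Point        # concrete and final: size and layout are fully determined
    x::Float64
    y::Float64
end

isbitstype(Int64)   # true: machine integers are stored unboxed
isbitstype(Point)   # true: a flat 16-byte struct

# A Vector{Point} therefore stores its elements inline, not as heap pointers:
sizeof(Vector{Point}(undef, 4))  # 4 * 16 = 64 bytes of inline data
sizeof(Vector{Any}(undef, 4))    # 4 * 8 = 32 bytes: one pointer per boxed element
```

The `Vector{Any}` variant must box every element on the heap, which is exactly the overhead unboxing avoids.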
Multiple dispatch
Julia uses multiple dispatch extensively, allowing functionality to be extended by means of overloading. Each function can consist of an arbitrarily large number of methods, each of which declares what types it can handle, and Julia dispatches to whichever method is most specific for a given call. Multiple dispatch is omnipresent: virtually every operation in Julia involves dispatch.
Multiple dispatch is the standout feature of Julia’s design. Its synergy with specialization is essential for understanding the language’s performance and its efficient ability to devirtualize and inline code. Furthermore, one of the promises of multiple dispatch is that it allows extending existing behavior with new implementations.
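A classic sketch, adapted from a well-known community example [6], of dispatch on the types of all arguments:

```julia
abstract type Pet end
struct Dog <: Pet; name::String; end
struct Cat <: Pet; name::String; end

# The method chosen depends on the runtime types of *all* arguments:
meets(a::Dog, b::Dog) = "sniffs"
meets(a::Dog, b::Cat) = "chases"
meets(a::Cat, b::Dog) = "hisses"
meets(a::Cat, b::Cat) = "slinks"

encounter(a::Pet, b::Pet) = "$(a.name) meets $(b.name) and $(meets(a, b))"

encounter(Dog("Rex"), Cat("Whiskers"))  # "Rex meets Whiskers and chases"
```

Since every specialization of `encounter` knows the concrete types of both pets, the inner `meets` call can be devirtualized and inlined, and anyone can later extend the behavior by adding a new `meets` method for a new `Pet` subtype without touching existing code.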
By default, Julia employs LLVM at optimization level O2, and disabling LLVM optimizations slows code down noticeably (between 1.1x and 7.1x slower). However, turning off type inference or devirtualization has a far bigger impact on performance (between 5x and 2000x slower). Clearly, type inference, devirtualization, and inlining, together with multiple dispatch, are the most important optimizations in Julia.
References
This is a summary of the ideas developed in the work Bezanson, Jeff, et al. “Julia: Dynamism and performance reconciled by design.” Proceedings of the ACM on Programming Languages 2.OOPSLA (2018): 1-23.
Some ideas are also inspired by the following blog posts and papers:
[1] Bezanson, Jeff, et al. “Julia: Dynamism and performance reconciled by design.” Proceedings of the ACM on Programming Languages 2.OOPSLA (2018): 1-23.
[2] https://www.abstractsyntaxseed.com/blog/antlr4-autocomplete-library/introduction
[3] https://www.geeksforgeeks.org/mark-and-sweep-garbage-collection-algorithm/
[4] Richard Jones, Antony Hosking, and Eliot Moss. 2011. The Garbage Collection Handbook: The Art of Automatic Memory Management (1st. ed.). Chapman & Hall/CRC.
[5] https://en.wikipedia.org/wiki/Shadow_stack
[6] https://scientificcoder.com/the-art-of-multiple-dispatch
[7] https://viralinstruction.com/posts/badjulia/