Behind the F# editor tootling. Part 1: introduction to compilers.
Behind the F# editor tootling. Part 1: introduction to compilers.
F# editor tooling, while it’s not on the level of mainstream languages like C# or Java, has been in a unique spot among Functional Programming languages. F# tooling is available cross-platform, it’s feature-rich, stable, innovative and, what’s also essential, it’s commercially supported and actively developed. Moreover, all this is available while communities of some other FP languages are discussing whether displaying simple tooltip is not too expensive.
In the series, I’d like to dive behind what we see in the editor and show some of the magic that lets us have this excellent developer tooling experience. In this post I’d like to do a gentle introduction to compiler design, introduce some concepts and vocabulary that will be useful later, and give a general description of the potential approaches to the editor tooling in context of compilers.
Please keep in mind that this post is a high-level introduction to the concepts with some simplifications and without too many details. If you’re one of a few people that work on compilers or editor tooling this post is probably not for you ;-)
Compilers 101
If you ever took university course about compilers or read any introductory book to the topic you probably know that compilers are often treated as a black box - compiler takes a list of files, some options, do magic, and then it outputs some executable or library.
Typically inside the compiler, several things are happening in order:
- Lexing - it’s the process that transforms text (source code) into a list of tokens that can be understood by the compiler
- Parsing - it takes the list of tokens produced by lexing and transforms it into Abstract Syntax Tree (AST). AST is a tree that represents the structure of your code. Each node of the tree denotes a construct occurring in the source code. The syntax is “abstract” in the sense that it does not represent every detail appearing in the real syntax, but rather just the structural or content-related details. For example, construct like an if-condition-then expression may be denoted as a single node with three branches. Parsing can produce errors if your code contains structural errors (for example, missing parentheses, or wrong indentation)
- Typechecking - In a statically typed language, it’s a process of verifying and enforcing the constraints of the types. Usually it takes as an input AST and any additional context (for example external references) and ensures that everything is fine with your code regarding type system - for example, it checks what types are variables you use, if they contain the members you use if your function returned declared type. A result of type checking is a different type of the syntax tree - in F# compiler we call it Typed Abstract Syntax Tree (TAST). TAST is built on a higher level of abstraction - it contains fewer details about the structure of the code, but more information about entities (variables, objects, functions, etc.) including full type information. Type checking can produce a lot of different types of errors and is one of the primary sources of them.
- Optimisations - is a process where compiler perform a different type of optimisations on your code such as removing dead code or inlining. Every optimisation is taking TAST as an input and returns new TAST as an output, and the process is repeated as long as the output TAST is different from input one.
- Code generation - it’s the final step of the process, in which compiler takes final TAST and generates output code from it (either in the form of the byte code of some form - IL in case of F# - or just native code)
Compiler 101 vs editor tooling
Unfortunately, the naive, black-box approach to the building compilers is not enough. With such design, you can only start getting information about code only after code is built, inspecting the final result. It is problematic for multiple reasons:
- You don’t have any information connected to the structure of code - you only have final results. It may be suitable enough to get some tooltip information or maybe a list of usages of the given Symbol, but it’s not enough in case of any editor features depending on code structure (refactorings, autocomplete lists, formatters etc.)
- You only have up-to-date information about your code after you’ve saved the file and built project - for example, you don’t get updated errors in your code while you’re typing and making changes in the file
- Saving the file and building a project usually takes a while - if our compiler process is entirely stateless it needs to perform all phases. Also, everything has its costs. Especially optimisations and code generation is not interesting for us from editor tooling point of view - we want to provide tooling in a context of the source code we edit, not in the context of optimised, final artefact we get from the compiler.
Due to those problems editor tooling for languages using a black-box design of the compiler usually have huge problems - either they’re limited in what’s possible in such editor tooling, or attempt to work around those problems by re-inventing what compiler does by using technologies such as tree-sitter or just by writing “presentation compiler”.
However, such workarounds are never as accurate and fully featured as systems based on the compiler directly. While building performant, accurate and productive editor tooling, we need to have direct access to the internal state of a compiler (such as AST, TAST, and Symbol information) and that’s why we need a different approach to building compilers.
Modern compiler design
In recent years many mainstream programming languages have moved towards compiler service approach to the editor tooling. In such approach, editor tooling is powered by internal capabilities of the compiler which ensures that editor functionality. Examples of such approach are [C# Roslyn API] or [F# Compiler Service] (on which we will focus more in the next parts of this series).
The modern compiler needs to be:
- API - compiler needs to be accessible as a normal library or some other kind of API. We need to be able to run it inside another process, and we need to be able to communicate with different layers of a compiler (usually lexer, parser and type-checker)
- Server - it needs to be ready to run as a long-running, stateful application that gives us the ability to do partial operations (for example single file parsing) in the context of a whole compilation unit. The fact it’s the stateful, long-running and asynchronous process has vast implications on the design of compiler and the performance optimisations.
- Database - as mentioned above compiler is a long-running stateful process. Additionally, we need to have a way of getting what is usually internal information of the compiler - Tokens, AST, TAST, list of known by the compiler symbols and more.
With such compiler capabilities, we can host it directly in editor process (for example F# or C# tools in VS) or in another application providing higher-level API that is also communication layer which can be used from editors (for example language server using Language Server Protocol)
Summary
I hope this blog post is good enough, the high-level introduction of compiler design, concepts and vocabulary that you’ll need to know in the future posts from this series.