Thursday, February 25, 2021

Understanding how Leela Chess Zero works

I wrote the following article in May 2018, now almost 3 years ago. During an interview with Albert Silver (an author affiliated with ChessBase, whom the article was supposedly for) about Leela Chess Zero, there were questions about where to find more information, in an easy to understand manner, how the Alpha Zero ideas worked in chess and how they were implemented in Leela. At that point, most of the easily accessible information was still referring to the Go engine, so I did an attempt to clear up some confusion.

The article and interview were never published (despite being announced), but ChessBase would go on to re-publish Leela Chess Zero as "Fat Fritz 1" a year later. 

As I already spent the time writing it, and it might still be interesting to my audience, I'll put it here on my blog, in the hope that someone finds it useful.

Understanding how Leela Zero Chess works (2018)


How the neural network works, and what it calculates.

The neural network takes as input a stack of 8 x 8 bitmaps, enough to fully represent the state of a chess game. Most obviously, this will be the position of the pieces ("is there a pawn belonging to the player to move on this square"), but also some non-visible things, such as "is the player to move still allowed to castle" and "is there an en-passant move" possible?

These inputs are passed through a deep stack of image filters. These stacks are typically 20 to 80 layers deep, and typically have 128 to 256 outputs per layer, which then also form the input for the next layer. In every layer, every output square is calculated by taking the corresponding square and the surrounding ones (in total a 3x3 square) from all outputs from the previous layer, and applying an image filter to them, to compute a new output. These allow the network to gradually compute concepts such as "is this pawn isolated" or "is there an opposing pawn one rank up" and combine them. For example now the next layer can compute "is there an opposing pawn one or two ranks up", until it arrives at higher level concepts "is this pawn on an open file", which it can then combine with previously discovered ones, to form features such as "is this pawn isolated on an open file". Note that no-one is explaining these features to the program or programming them: it has to discover them all by itself by analyzing millions of chess games. Because some processing in the stack (of which we'll not go into detail about) is only done every 2 layers, they are typically thought of as a unit and referred to as a single "residual block".

There are 2 final layers: in the first one, all outputs from the above "stack" are combined and reworked so they map to the possible moves in the position. The network produces a probability for each move that it is the best. This is called the "policy prior".

In the other output, all outputs from the stack are combined to calculate a single value: how likely is it that the player to move wins the game. This is, in essence, the evaluation of the position.

Now, we move on to the tree search.

The search algorithm used by Leela Zero is Monte Carlo Tree Search (referred to as MCTS). It works as follows: first, we evaluate the neural network on the root position. This gives us both an initial evaluation for the position, as well as a list of possible moves with an associated (policy) prior. Now, we investigate the possible moves in
this root node. We will try to find the move that still has odds of being the best (the one with the highest upper confidence bound). If we have not looked at the position before, this will be the one with the highest prior. If we have, this will most of the time be the move that has been scoring best (leading to the position that seems best for the player to move), but sometimes also a move that does not have a terrible score but that we have not looked at very much yet and has a reasonable policy prior (so which could still turn out to be good). Perhaps surprisingly, the mathematics underlying this procedure come from the optimal strategy to play at slot machines! Once we have found the most interesting move, we will play it and evaluate the network on the new position, adding it to our search tree. This procedure, finding the most promising sequence of moves along the search tree, and expanding the final leaf node with the neural network, is repeated until time runs out. Every time we evaluate the network to expand the search tree, we also back up the new evaluation for the final position, and add it to the nodes one level higher, i.e. closer to the root of the tree. This way, at every node in the tree, we keep an average score of the most promising lines of play following that position. As the program homes in on the best moves, this average will gradually shift to reflect the score along the best line of play for both players, converging towards the so called mini-max value that classical programs compute.

Once time expires, we play the move we investigated most of all. This is almost always the move with the highest average score, but it could happen that a new promising move appears with a high score, yet we run out of time before we are able to investigate it more deeply.

Note that despite the name, no Monte Carlo playouts are done at all, i.e. the program does not play out the position till the end during the search. The evaluation from the neural network serves as a high quality approximation for the score the playouts would have produced, based on studying millions of completed training games.

The training.

Leela Zero plays a lot of games against itself, running on the computers of the volunteers who have made their machines available for this. (And you can do so too!). On every move, Leela records what the best moves were after some amount of searching ahead, and at the end of the game, she notes who won. Note that this data corresponds to what the neural network produces! This then gets sent off to a central server. A training machine with fast GPUs collects the data, and adjusts the neural network to more accurately match it. The new network is then sent out back again to the volunteers.

Because searching ahead improves the strength of the moves selected, a given network can produce training data than is stronger than its own output, and we get a cycle of gradual, continuous improvement.

To increase the diversity of the games and to give the network higher chances to discover new things, the engine is very occasionally forced to play a sub-optimal move during the training games, and some noise is added to the policy predictions.

The differences with a classical engine.

The evaluation of Leela Zero represents how often the side to move has been able to win from similar positions. This value effectively represents the program's own experience in actually playing those positions out, and it is natively expressed as a percentage. In a classical engine, the evaluation is expressed in pawn-equivalents, and
the programmers spend their time painstakingly adding rules and making adjustments such as "is an isolated pawn on an open file worth -0.10 pawns or -0.15 pawns?" and "should it be -0.20 pawns if the opponent still has both rooks?". If we are talking about things such as how much a compromised king's position is worth after an exchange sacrifice, it is clear trying to come up with good "pawn equivalents" or rules is a soul crushing experience for the programmers.

For compatibility with existing Chess GUI's such as Fritz or ChessBase that expect the program to be such a bean counter, Leela Zero will convert her percentage to a pawn-equivalent score, based on database statistics on how well players score with various material advantages.

This difference in evaluation, i.e. the reflection of game playing experience versus hard-coded rules, is one of the main reasons why Leela Zero has a fundamentally different understanding of compensation and dynamic advantages in chess.

In a classical engine, the engine starts the search with a nominal search depth, which it gradually increases 1 half-move at a time, and investigates every sequence of moves up to this depth. Of course, modern engines like Stockfish are so strong because they do not exactly do this. They will not investigate unpromising moves deep in the tree (pruning) or decide to investigate them much less deeply (reductions), based on heuristic rules or statistics gathered during the search. Nevertheless, every move sequence will tend to have gotten some minimal amount of search, a brute force safeguard, if you will. This is much less so in Leela Zero. If the network decides that a particular move is very unlikely to be good, it can essentially ignore it even if the main line has been investigated for 25 half moves or more. This is why the engine can occasionally still make tactical mistakes from lack of experience. Note that unlike the heuristics for the classical engines, Leela Zero uses the full neural network, i.e. all of its chess knowledge, for every decision where to search. This is also why, despite searching literally thousands of times slower than a classical engine, she can still be competitive.

If you read the description of Leela Zero's search, you will see that there is no such thing as a nominal depth with which to start investigating each move sequence. Again, for compatibility with existing software, Leela will report a search depth in half-moves, though this value is essentially made up.

In a classical engine, the search and evaluation heuristics are a sequence of rules. IF this AND this AND this THEN IF this ELSE... and so on. Modern engines such as Stockfish use 64-bit integer bitmaps to ask such questions about the whole board at once at high speed. Nevertheless, the execution flow inside the program jumps around a lot as every rule and exception is evaluated. These kind of "branching" flows are the forte of the classical CPU as found in desktop computer but also in mobile phones. In contrast, the processing of Leela Zero's neural network, after some mathematical trickery, essentially boils down to doing a lot of straightforward matrix multiplications. This is a highly parallel task with no branching or divergence, the forte of the Graphics Processing Unit (GPU), found in a video card. The raw computing power of the GPU is much higher for these simpler tasks, and this is why Leela runs comparatively faster on a GPU. Because the neural network is so large, even a fast GPU will run at a much lower nodes per second than you might be used to from a classical engine. She has to make up the difference by having a superior understanding of chess, something which she is getting better at every day.