2017 MasteringtheGameofGoWithoutHumanKnowledge


Subject Headings: AlphaGo Zero, Reinforcement Learning Algorithm, Self-Play Reinforcement Learning.

Notes

Cited By

Quotes

Abstract

A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo’s own move selections and also the winner of AlphaGo’s games. This neural network improves the strength of the tree search, resulting in higher-quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.

Introduction

Much progress towards artificial intelligence has been made using supervised learning systems that are trained to replicate the decisions of human experts 1–4. However, expert data is often expensive, unreliable, or simply unavailable. Even when reliable data is available it may impose a ceiling on the performance of systems trained in this manner 5. In contrast, reinforcement learning systems are trained from their own experience, in principle allowing them to exceed human capabilities, and to operate in domains where human expertise is lacking. Recently, there has been rapid progress towards this goal, using deep neural networks trained by reinforcement learning. These systems have outperformed humans in computer games such as Atari 6, 7 and 3D virtual environments 8–10. However, the most challenging domains in terms of human intellect – such as the game of Go, widely viewed as a grand challenge for artificial intelligence 11 – require precise and sophisticated lookahead in vast search spaces. Fully general methods have not previously achieved human-level performance in these domains.

AlphaGo was the first program to achieve superhuman performance in Go. The published version 12, which we refer to as AlphaGo Fan, defeated the European champion Fan Hui in October 2015. AlphaGo Fan utilised two deep neural networks: a policy network that outputs move probabilities, and a value network that outputs a position evaluation. The policy network was trained initially by supervised learning to accurately predict human expert moves, and was subsequently refined by policy-gradient reinforcement learning. The value network was trained to predict the winner of games played by the policy network against itself. Once trained, these networks were combined with a Monte-Carlo Tree Search (MCTS) 13–15 to provide a lookahead search, using the policy network to narrow down the search to high-probability moves, and using the value network (in conjunction with Monte-Carlo rollouts using a fast rollout policy) to evaluate positions in the tree. A subsequent version, which we refer to as AlphaGo Lee, used a similar approach (see Methods), and defeated Lee Sedol, the winner of 18 international titles, in March 2016.

Our program, AlphaGo Zero, differs from AlphaGo Fan and AlphaGo Lee (12) in several important aspects. First and foremost, it is trained solely by self-play reinforcement learning, starting from random play, without any supervision or use of human data. Second, it only uses the black and white stones from the board as input features. Third, it uses a single neural network, rather than separate policy and value networks. Finally, it uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte-Carlo rollouts. To achieve these results, we introduce a new reinforcement learning algorithm that incorporates lookahead search inside the training loop, resulting in rapid improvement and precise and stable learning. Further technical differences in the search algorithm, training procedure and network architecture are described in Methods.

1 Reinforcement Learning in AlphaGo Zero

Our new method uses a deep neural network f_θ with parameters θ. This neural network takes as an input the raw board representation s of the position and its history, and outputs both move probabilities and a value, (p, v) = f_θ(s). The vector of move probabilities p represents the probability of selecting each move a (including pass), p_a = Pr(a | s). The value v is a scalar evaluation, estimating the probability of the current player winning from position s. This neural network combines the roles of both policy network and value network 12 into a single architecture. The neural network consists of many residual blocks 4 of convolutional layers 16, 17 with batch normalisation 18 and rectifier non-linearities 19 (see Methods).
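To make the architecture concrete, the following is a minimal PyTorch sketch of such a combined policy-and-value network, assuming a 19×19 board and a stack of input feature planes. The class names (ResidualBlock, PolicyValueNet), the number of blocks and channels, and the 17-plane input encoding are illustrative assumptions, not the exact network described in the paper's Methods.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualBlock(nn.Module):
        """One residual block: two 3x3 convolutions, batch norm, ReLU, skip connection."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)

        def forward(self, x):
            out = F.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return F.relu(out + x)  # skip connection

    class PolicyValueNet(nn.Module):
        """Single network returning move logits p and a scalar value v for a position s."""
        def __init__(self, in_planes=17, channels=64, blocks=6, board=19):
            super().__init__()
            self.stem = nn.Sequential(
                nn.Conv2d(in_planes, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels), nn.ReLU())
            self.tower = nn.Sequential(*[ResidualBlock(channels) for _ in range(blocks)])
            # Policy head: a distribution over board*board moves plus the pass move.
            self.policy_head = nn.Sequential(
                nn.Conv2d(channels, 2, 1), nn.BatchNorm2d(2), nn.ReLU(), nn.Flatten(),
                nn.Linear(2 * board * board, board * board + 1))
            # Value head: a scalar evaluation of the current player's winning chances.
            self.value_head = nn.Sequential(
                nn.Conv2d(channels, 1, 1), nn.BatchNorm2d(1), nn.ReLU(), nn.Flatten(),
                nn.Linear(board * board, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())

        def forward(self, s):
            h = self.tower(self.stem(s))
            return self.policy_head(h), self.value_head(h)  # (policy logits, value in [-1, 1])

The essential design point is the shared residual tower feeding two heads: a policy head producing logits over all board points plus pass, and a value head squashed into [-1, 1] by a tanh.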

The neural network in AlphaGo Zero is trained from games of self-play by a novel reinforcement learning algorithm. In each position s, an MCTS search is executed, guided by the neural network f_θ. The MCTS search outputs probabilities π of playing each move. These search probabilities usually select much stronger moves than the raw move probabilities p of the neural network f_θ(s); MCTS may therefore be viewed as a powerful policy improvement operator 20, 21. Self-play with search – using the improved MCTS-based policy to select each move, then using the game winner z as a sample of the value – may be viewed as a powerful policy evaluation operator. The main idea of our reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure 22, 23: the neural network’s parameters are updated to make the move probabilities and value (p, v) = f_θ(s) more closely match the improved search probabilities and self-play winner (π, z); these new parameters are used in the next iteration of self-play to make the search even stronger. Figure 1 illustrates the self-play training pipeline.
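At a high level, this policy-iteration loop alternates between generating self-play games with the search and fitting the network to the resulting (s, π, z) samples. The sketch below assumes caller-supplied helpers: a new_game() environment, an mcts_search function returning the search probabilities π as a sequence indexed by move, and a train_step implementing the update of Equation 1. All of these names are hypothetical and only illustrate the data flow.

    import random

    def self_play_game(new_game, mcts_search, temperature=1.0):
        """Play one self-play game and return (state, pi, z) training triples.

        new_game() must return a game object; mcts_search(state, temperature)
        must return the search probabilities pi over moves (assumed helpers)."""
        game, trajectory = new_game(), []
        while not game.is_over():
            pi = mcts_search(game.state(), temperature)            # improved policy from search
            trajectory.append((game.state(), list(pi), game.to_play()))
            move = random.choices(range(len(pi)), weights=pi)[0]   # sample a_t ~ pi_t
            game.play(move)
        winner = game.winner()                                     # identity of the winning player
        # Label every stored position with the outcome z from that player's perspective.
        return [(s, pi, 0 if winner is None else (1 if player == winner else -1))
                for s, pi, player in trajectory]

    def policy_iteration(new_game, mcts_search, train_step, iterations, games_per_iteration):
        """Alternate evaluation and improvement: self-play with search, then fit the network."""
        replay = []
        for _ in range(iterations):
            for _ in range(games_per_iteration):
                replay.extend(self_play_game(new_game, mcts_search))
            train_step(replay)          # move (p, v) toward (pi, z), as in Equation 1
            replay = replay[-500000:]   # keep only a sliding window of recent self-play data
        return replay

In the real system self-play, training and evaluation ran in parallel rather than strictly in sequence; the sequential loop above is only meant to show how the search operators plug into policy iteration.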

The Monte-Carlo tree search uses the neural network f_θ to guide its simulations (see Figure 2). Each edge (s, a) in the search tree stores a prior probability P(s, a), a visit count N(s, a), and an action-value Q(s, a). Each simulation starts from the root state and iteratively selects moves that maximise an upper confidence bound Q(s, a) + U(s, a), where U(s, a) ∝ P(s, a) / (1 + N(s, a)) 12, 24, until a leaf node s′ is encountered.

Figure 1: Self-play reinforcement learning in AlphaGo Zero. a The program plays a game s_1, ..., s_T against itself. In each position s_t, a Monte-Carlo tree search (MCTS) α_θ is executed (see Figure 2) using the latest neural network f_θ. Moves are selected according to the search probabilities computed by the MCTS, a_t ∼ π_t. The terminal position s_T is scored according to the rules of the game to compute the game winner z. b Neural network training in AlphaGo Zero. The neural network takes the raw board position s_t as its input, passes it through many convolutional layers with parameters θ, and outputs both a vector p_t, representing a probability distribution over moves, and a scalar value v_t, representing the probability of the current player winning in position s_t. The neural network parameters θ are updated so as to maximise the similarity of the policy vector p_t to the search probabilities π_t, and to minimise the error between the predicted winner v_t and the game winner z (see Equation 1). The new parameters are used in the next iteration of self-play a.

Figure 2: Monte-Carlo tree search in AlphaGo Zero. a Each simulation traverses the tree by selecting the edge with maximum action-value Q, plus an upper confidence bound U that depends on a stored prior probability P and visit count N for that edge (which is incremented once traversed). b The leaf node is expanded and the associated position s is evaluated by the neural network (P(s, ·), V(s)) = f_θ(s); the vector of P values is stored in the outgoing edges from s. c Action-values Q are updated to track the mean of all evaluations V in the subtree below that action. d Once the search is complete, search probabilities π are returned, proportional to N^(1/τ), where N is the visit count of each move from the root state and τ is a parameter controlling temperature.

This leaf position is expanded and evaluated just once by the network to generate both prior probabilities and evaluation, (P(s′, ·), V(s′)) = f_θ(s′). Each edge (s, a) traversed in the simulation is updated to increment its visit count N(s, a), and to update its action-value to the mean evaluation over these simulations, Q(s, a) = 1/N(s, a) · Σ_{s′ | s,a→s′} V(s′), where s, a → s′ indicates that a simulation eventually reached s′ after taking move a from position s. MCTS may be viewed as a self-play algorithm that, given neural network parameters θ and a root position s, computes a vector of search probabilities recommending moves to play, π = α_θ(s), proportional to the exponentiated visit count for each move, π_a ∝ N(s, a)^(1/τ), where τ is a temperature parameter.
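The per-edge statistics and the select / expand / back-up cycle can be sketched as follows. The Node class, the exploration constant c_puct, the evaluate callback (which would wrap the network f_θ and return a dictionary of priors plus a value), and the immutable state.play(move) interface are all illustrative assumptions; terminal positions and other details from the paper's Methods are omitted.

    import math

    class Node:
        """A tree node; the statistics stored here describe the edge (s, a) leading into it."""
        def __init__(self, prior):
            self.P = prior        # prior probability P(s, a) from the network
            self.N = 0            # visit count N(s, a)
            self.W = 0.0          # total value of all simulations through this edge
            self.children = {}    # move -> Node

        def Q(self):
            return self.W / self.N if self.N else 0.0   # mean action-value Q(s, a)

    def select_child(node, c_puct=1.5):
        """Choose the move maximising Q(s, a) + U(s, a), with U proportional to P / (1 + N)."""
        total_visits = sum(child.N for child in node.children.values())
        def score(child):
            u = c_puct * child.P * math.sqrt(total_visits + 1) / (1 + child.N)
            return child.Q() + u
        return max(node.children.items(), key=lambda item: score(item[1]))

    def run_simulation(root, state, evaluate, c_puct=1.5):
        """One simulation: select down to a leaf, expand and evaluate it, back the value up."""
        path, node = [root], root
        while node.children:                     # selection
            move, node = select_child(node, c_puct)
            state = state.play(move)             # assumed immutable game-state interface
            path.append(node)
        priors, value = evaluate(state)          # (P(s', .), V(s')) = f_theta(s')
        for move, p in priors.items():           # expansion
            node.children[move] = Node(p)
        for n in reversed(path):                 # back-up: N += 1, W += v, so Q tracks the mean
            value = -value                       # flip perspective at every ply
            n.N += 1
            n.W += value

    def search_probabilities(root, tau=1.0):
        """pi_a proportional to N(s, a)^(1/tau) over the moves available at the root."""
        counts = {a: child.N ** (1.0 / tau) for a, child in root.children.items()}
        total = sum(counts.values())
        return {a: c / total for a, c in counts.items()}

Running a fixed budget of simulations from the root and then calling search_probabilities yields the π that is used both to select the move to play and as a training target for the network.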

The neural network is trained by a self-play reinforcement learning algorithm that uses MCTS to play each move. First, the neural network is initialised to random weights θ_0. At each subsequent iteration i ≥ 1, games of self-play are generated (Figure 1a). At each time-step t, an MCTS search π_t = α_{θ_{i−1}}(s_t) is executed using the previous iteration of neural network f_{θ_{i−1}}, and a move is played by sampling the search probabilities π_t. A game terminates at step T when both players pass, when the search value drops below a resignation threshold, or when the game exceeds a maximum length; the game is then scored to give a final reward of r_T ∈ {−1, +1} (see Methods for details). The data for each time-step t is stored as (s_t, π_t, z_t), where z_t = ±r_T is the game winner from the perspective of the current player at step t. In parallel (Figure 1b), new network parameters θ_i are trained from data (s, π, z) sampled uniformly among all time-steps of the last iteration(s) of self-play. The neural network (p, v) = f_{θ_i}(s) is adjusted to minimise the error between the predicted value v and the self-play winner z, and to maximise the similarity of the neural network move probabilities p to the search probabilities π. Specifically, the parameters θ are adjusted by gradient descent on a loss function l that sums over mean-squared error and cross-entropy losses respectively,

(p, v) = f_θ(s),   l = (z − v)² − π⊤ log p + c‖θ‖²   (1)

where c is a parameter controlling the level of L2 weight regularisation (to prevent overfitting).

2 Empirical Analysis of AlphaGo Zero Training

We applied our reinforcement learning pipeline to train our program AlphaGo Zero. Training started from completely random behaviour and continued without human intervention for approximately 3 days.

Over the course of training, 4.9 million games of self-play were generated, using 1,600 simulations for each MCTS, which corresponds to approximately 0.4s thinking time per move. Parameters were updated from 700,000 mini-batches of 2,048 positions. The neural network contained 20 residual blocks (see Methods for further details).
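As a concrete illustration of the update in Equation 1, the following PyTorch sketch computes the combined value and policy loss for a mini-batch of (s, π, z) samples and applies one gradient-descent step. The network is assumed to return policy logits and a tanh value as in the earlier sketch, and the c·‖θ‖² term is expressed through the optimiser's weight_decay; the hyperparameter values shown are illustrative, not the paper's schedule.

    import torch
    import torch.nn.functional as F

    def alphazero_loss(net, states, target_pi, target_z):
        """l = (z - v)^2 - pi^T log p; the c * ||theta||^2 term is handled by weight decay."""
        logits, v = net(states)                                   # (p as logits, v) from one network
        value_loss = F.mse_loss(v.squeeze(-1), target_z)          # (z - v)^2
        policy_loss = -(target_pi * F.log_softmax(logits, dim=1)).sum(dim=1).mean()  # -pi^T log p
        return value_loss + policy_loss

    def train_step(net, optimiser, batch):
        """One mini-batch update on positions sampled from recent self-play."""
        states, target_pi, target_z = batch   # tensors of shape (B, planes, 19, 19), (B, 362), (B,)
        optimiser.zero_grad()
        loss = alphazero_loss(net, states, target_pi, target_z)
        loss.backward()
        optimiser.step()
        return loss.item()

    # Example optimiser with L2 regularisation expressed as weight decay (illustrative values):
    # optimiser = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)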

Figure 3a shows the performance of AlphaGo Zero during self-play reinforcement learning, as a function of training time, on an Elo scale 25. Learning progressed smoothly throughout training, and did not suffer from the oscillations or catastrophic forgetting suggested in prior literature.

Figure 3: Empirical evaluation of AlphaGo Zero. a Performance of self-play reinforcement learning. The plot shows the performance of each MCTS player α_{θ_i} from each iteration i of reinforcement learning in AlphaGo Zero. Elo ratings were computed from evaluation games between different players, using 0.4 seconds of thinking time per move (see Methods). For comparison, a similar player trained by supervised learning from human data, using the KGS data-set, is also shown. b Prediction accuracy on human professional moves. The plot shows the accuracy of the neural network f_{θ_i}, at each iteration of self-play i, in predicting human professional moves from the GoKifu data-set. The accuracy measures the percentage of positions in which the neural network assigns the highest probability to the human move. The accuracy of a neural network trained by supervised learning is also shown. c Mean-squared error (MSE) on human professional game outcomes. The plot shows the MSE of the neural network f_{θ_i}, at each iteration of self-play i, in predicting the outcome of human professional games from the GoKifu data-set. The MSE is between the actual outcome z ∈ {−1, +1} and the neural network value v, scaled by a factor of 1/4 to the range 0–1.

References

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis (2017). "Mastering the Game of Go Without Human Knowledge." In: Nature, 550.