r/MachineLearning · June 3, 2026 · 2 min read

Analysis of AlphaZero training data [D]

I am trying to train an AlphaZero model for Othello on a 6x6 board.

Having been warned that too little exploration during data generation can leave models overconfident and trapped in a narrow region of the search tree, I started with c_puct = 4.0 and reduced it to 3.5 after a few generations. I also added fairly peaked Dirichlet noise (alpha = 0.15) to the prior predictions at the root of each tree search, with mixing weight epsilon = 0.25. The temperature was initially set to 1.0 and then reduced to 0.8 after 20 generations.
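For readers who want to see where these hyperparameters enter, here is a minimal sketch of the three places they usually appear in an AlphaZero-style setup. It assumes the standard formulation (PUCT exploration term, Dirichlet noise mixed into the root priors over legal moves, visit counts raised to 1/temperature); the post does not show the actual implementation, so the function names and details below are illustrative assumptions, not the author's code.

```python
import numpy as np

def add_root_noise(priors, legal_mask, alpha=0.15, epsilon=0.25, rng=None):
    """Mix Dirichlet noise into the root priors, over legal moves only (assumed convention)."""
    rng = np.random.default_rng() if rng is None else rng
    priors = np.asarray(priors, dtype=float)
    legal = np.asarray(legal_mask, dtype=bool)
    noisy = np.zeros_like(priors)
    p = priors[legal] / priors[legal].sum()          # renormalize over legal moves
    noise = rng.dirichlet(np.full(int(legal.sum()), alpha))
    noisy[legal] = (1.0 - epsilon) * p + epsilon * noise
    return noisy

def puct_scores(q, priors, visit_counts, c_puct=4.0):
    """Q + U with U = c_puct * P * sqrt(sum of child visits) / (1 + N)."""
    n = np.asarray(visit_counts, dtype=float)
    u = c_puct * np.asarray(priors, dtype=float) * np.sqrt(n.sum()) / (1.0 + n)
    return np.asarray(q, dtype=float) + u

def visit_policy(visit_counts, temperature=1.0):
    """Distribution proportional to N^(1/temperature); lower temperature gives sharper output."""
    n = np.asarray(visit_counts, dtype=float)
    scaled = n ** (1.0 / temperature)
    return scaled / scaled.sum()
```

With alpha = 0.15 a single noise sample tends to put most of its mass on one or two moves, so at epsilon = 0.25 a random legal move can receive a large share of the noise weight in any given search. Whether the temperature affects only move selection or also the stored policy targets depends on the implementation, which the post does not specify.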

Now, the models do improve in the sense that later models consistently beat earlier ones, but there is no significant improvement against the two benchmarks I use: classical MCTS, and a greedy agent. Against the latter, the models have a deplorably low win rate of less than 10%.

As can be seen from the curve for the value loss on the validation data, the models don't seem to learn to predict values (which is why I have been hesitant to reduce c_puct further), but the policy prediction loss seems to behave more or less as it should.

https://preview.redd.it/gjby4omfp35h1.png?width=640&format=png&auto=webp&s=4d2ba4716ade6ec4ce9b7f16605a2e6bd74c6baf

I decided to test if the prediction targets become strongly peaked early on. For this, I compute the normalized entropies of these predictions, meaning that I divide the entropy by the log of the number of legal moves at the given game state. The plot below shows the mean values of these normalized entropies for the data sets created by the different generations of agents.

https://preview.redd.it/5yk216zjp35h1.png?width=640&format=png&auto=webp&s=538f59f5da3671a20c0ef2e1afc1ec96da237107
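The normalized entropy described above can be sketched as follows. This is an illustration under assumptions: each policy target is taken to be a probability vector over all board actions with a boolean mask of legal moves, and states with fewer than two legal moves are skipped because log(1) = 0 makes the ratio undefined. The post does not say how such states were handled.

```python
import numpy as np

def normalized_entropy(policy, legal_mask):
    """Entropy of the policy over legal moves, divided by log(number of legal moves)."""
    p = np.asarray(policy, dtype=float)[np.asarray(legal_mask, dtype=bool)]
    k = p.size
    if k < 2:
        return float("nan")        # undefined for a forced move (assumed: skip these)
    p = p / p.sum()
    nz = p[p > 0]                  # treat 0 * log(0) as 0
    return float(-(nz * np.log(nz)).sum() / np.log(k))

def mean_normalized_entropy(policies, legal_masks):
    """Mean over a data set, ignoring states where the value is undefined."""
    vals = [normalized_entropy(p, m) for p, m in zip(policies, legal_masks)]
    return float(np.nanmean(vals))
```

A value of 1 means the target is uniform over the legal moves, and a value of 0 means all mass sits on a single move.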

Finally, I tested how the policy predictions on a fixed set of random game states vary across models. Here, I use the second model as a benchmark and compute the average Kullback-Leibler divergence between the predictions of the benchmark model and those of later models. This is displayed in the final plot. (The KL divergence between a model and its successor stabilizes very quickly at around 0.08.)

https://preview.redd.it/cha5ra8sp35h1.png?width=640&format=png&auto=webp&s=9fb0c07f2148b6c6436e75e4cde728f1a3e0895b
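A minimal sketch of the averaged KL computation, assuming both models' predictions are stored as arrays of shape (number of states, number of actions) with each row summing to 1. The direction KL(benchmark || later model) is an assumption, since the post does not state which distribution comes first, and the `eps` clipping is a hypothetical safeguard against log(0).

```python
import numpy as np

def mean_kl(p_ref, p_new, eps=1e-12):
    """Average over states of KL(p_ref || p_new), computed row by row."""
    p = np.asarray(p_ref, dtype=float)
    q = np.clip(np.asarray(p_new, dtype=float), eps, None)
    log_ratio = np.log(np.clip(p, eps, None)) - np.log(q)
    terms = np.where(p > 0, p * log_ratio, 0.0)   # entries with p == 0 contribute 0
    return float(terms.sum(axis=1).mean())
```

Because KL divergence is asymmetric, swapping the arguments generally gives a different curve, so it is worth stating the direction when comparing numbers such as the 0.08 between successive models.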

Now, I wonder if the above statistical properties of the training data can help explain anything about the pathological behaviour of my agents. In particular, I wonder why the value predictions on the validation data do not improve. Are any of my hyperparameters chosen unwisely, and could I have avoided this development by better choices?

submitted by /u/YamEnvironmental4720