
Supervised learning and samples #140

Open
StepHaze opened this issue Jun 19, 2022 · 15 comments

@StepHaze

StepHaze commented Jun 19, 2022

This idea was suggested by Jonathan:
"I guess what you've have to do is generate many samples of the kind that are stored in AlphaZero's memory buffer. You can take these samples either from human play data or have other players play against each other to generate data. If you do so, be careful to add some exploration so that the same game is not played again and again and that you get some diversity in your data. Once you've got the data, you can either use the Trainer utility in learning.jl or just write your training procedure yourself in Flux."

Did anyone implement it? I still don't understand in which format the games and moves are stored in the memory buffer.
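
(For reference, here is a minimal sketch of the "write your training procedure yourself in Flux" option from the quote above. The two-headed network, the dimensions, and the `states` / `policies` / `outcomes` arrays are hypothetical placeholders, not the AlphaZero.jl Trainer API.)

```julia
using Flux
using Random: shuffle

# Hypothetical two-headed network: shared trunk, policy head, value head.
state_dim, num_actions = 82, 6                # placeholder dimensions; adapt to your game
trunk = Chain(Dense(state_dim, 128, relu), Dense(128, 128, relu))
phead = Dense(128, num_actions)
vhead = Dense(128, 1, tanh)
model(s) = (h = trunk(s); (softmax(phead(h)), vhead(h)))

# AlphaZero-style loss: cross-entropy against the target policy π
# plus mean squared error against the game outcome z.
function loss(s, π, z)
  p, v = model(s)
  return Flux.crossentropy(p, π) + Flux.mse(vec(v), z)
end

# `states` is a state_dim×N matrix, `policies` a num_actions×N matrix
# (one column per sample), `outcomes` a length-N vector of results in [-1, 1].
function train!(states, policies, outcomes; epochs = 10, batchsize = 64)
  ps = Flux.params(trunk, phead, vhead)
  opt = ADAM(1e-3)                            # `Adam` in newer Flux versions
  for _ in 1:epochs
    for idx in Iterators.partition(shuffle(1:size(states, 2)), batchsize)
      s, p, z = states[:, idx], policies[:, idx], outcomes[idx]
      gs = gradient(() -> loss(s, p, z), ps)
      Flux.update!(opt, ps, gs)
    end
  end
end
```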

@jonathan-laurent
Owner

> will it ever be implemented?

Doing so is not my priority right now, but I would be happy to welcome contributions here.
Note that I am working on a rewrite of AlphaZero.jl that should be ready by the end of the summer, so you might want to wait a little bit if your intent is to submit a PR.

> I still don't understand in which format the games and moves are stored in the memory buffer.

See src/memory.jl.

@StepHaze
Author

Thanks!
If you're rewriting AlphaZero.jl, will our old code still work?

@jonathan-laurent
Owner

Previous versions will still be accessible from git or the package manager.
But the new version will indeed break compatibility with existing code.

@StepHaze
Author

What's the main reason for rewriting AlphaZero.jl?
The existing version already allows us to create pretty strong bots.

@StepHaze
Author

Thanks!

@StepHaze
Author

Could you please add a supervised learning feature in the next release, so we can feed in human-played games instead of self-play games?
We can pay a reasonable price.

@jonathan-laurent
Owner

I will keep this in mind, although I cannot make any promises right now.

@StepHaze
Author

StepHaze commented Jun 23, 2022

Please! We really need it.

Your AlphaZero.jl is a WONDERFUL project. I must say you're a genius.
I spent months trying to train my bots using Python projects, and that was very slow and inefficient.
With your masterpiece I trained my bot in just a couple of days.

@jonathan-laurent
Owner

Can you tell me more about how you or your company are using AlphaZero.jl and for what game/environment?
It is always interesting for me to get this kind of feedback.

@StepHaze
Author

It's a non-commercial, educational project. I teach kids to play a board game (of the mancala family). We don't have good software, so sometimes we don't even know where a player made a mistake. With AlphaZero.jl I created a bot that plays pretty strongly, and the "explore" function gives us an idea of which moves are good and which are bad.
Thanks for AlphaZero.jl!

@jonathan-laurent
Owner

Thanks for the feedback. It is great to hear that AlphaZero.jl is being used successfully in an educational project.

@StepHaze
Author

StepHaze commented Jun 26, 2022

The bot plays pretty strongly, but still leaves much to be desired.
And when I tried to make vectorize_state more complex (82x1x22), I started getting an "Out of memory" error.

So I was thinking about supervised learning. I have thousands of games played by masters.
I looked at src/memory.jl and noticed the following:
TrainingSample{State}
Type of a training sample. A sample features the following fields:

  • s::State is the state
  • π::Vector{Float64} is the recorded MCTS policy for this position
  • z::Float64 is the discounted reward cumulated from state s
  • t::Float64 is the (average) number of moves remaining before the end of the game
  • n::Int is the number of times the state s was recorded

How can I define these values? All I have is thousands of games with moves and results. They weren't played using MCTS, so I don't know π or the other values. Frankly, I'm very confused.

@StepHaze
Author

StepHaze commented Jul 1, 2022

I'm not a professional Julia programmer. I had to learn Julia to create a bot based on AlphaZero.jl.

@jonathan-laurent
Owner

jonathan-laurent commented Jul 6, 2022

First of all, a word of warning. I understand that you are not a trained programmer and it is all the nicer for me to learn that you were still able to use this package on your own game.

That being said, an algorithm such as AlphaZero can hardly be used as a black box, and the moment you try to do something a bit unusual, there is no escaping understanding the codebase and the underlying algorithm. In the long run, you may want to take the time to improve your Julia skills, read a bit about machine learning and AlphaZero, and then try to understand the codebase as a whole.

Regarding your current question: if you have a database of games played by humans, you can extract samples from it in the following way. For a state s, you would set π to a distribution that puts weight 1 on the action chosen by the human player and 0 elsewhere. Moreover, you would set z using the final outcome of the game that s is part of.
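
(A minimal sketch of that recipe, for a single recorded game. `apply_move` is a hypothetical stand-in for your own game rules, and the sign convention for z may need to be adapted to how rewards are defined for your game; this is not an AlphaZero.jl API.)

```julia
# Turn one human-played game into (s, π, z) samples.
# `initial_state` is the starting position, `moves` the sequence of action indices
# actually played, `outcome` the final result (+1 / 0 / -1) from the first player's
# point of view, `num_actions` the size of the action space.
function game_to_samples(initial_state, moves, outcome, num_actions)
  samples = []
  state = initial_state
  side = 1.0                        # flips every ply in a two-player game
  for a in moves
    π = zeros(Float64, num_actions)
    π[a] = 1.0                      # all the weight on the move the human chose
    z = side * outcome              # final result seen from the player to move
    push!(samples, (s = deepcopy(state), π = π, z = z))
    state = apply_move(state, a)    # hypothetical rule function for your game
    side = -side
  end
  return samples
end
```

These samples can then be fed to a training loop like the Flux sketch earlier in the thread, or adapted to the sample format used in src/memory.jl.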
