
Supervised learning and samples #140

Open
StepHaze opened this issue Jun 19, 2022 · 15 comments

@StepHaze

StepHaze commented Jun 19, 2022

This idea was suggested by Jonathan:
"I guess what you've have to do is generate many samples of the kind that are stored in AlphaZero's memory buffer. You can take these samples either from human play data or have other players play against each other to generate data. If you do so, be careful to add some exploration so that the same game is not played again and again and that you get some diversity in your data. Once you've got the data, you can either use the Trainer utility in learning.jl or just write your training procedure yourself in Flux."

Did anyone implement it? I still don't understand in which format the games and moves are stored in the memory buffer.
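
(For reference, here is a minimal sketch of the "write your training procedure yourself in Flux" option from the quote above. The two-headed network, the dimensions, and the `states` / `policies` / `outcomes` arrays are hypothetical placeholders, not the AlphaZero.jl Trainer API.)

```julia
using Flux
using Random: shuffle

# Hypothetical two-headed network: shared trunk, policy head, value head.
state_dim, num_actions = 82, 6                # placeholder dimensions; adapt to your game
trunk = Chain(Dense(state_dim, 128, relu), Dense(128, 128, relu))
phead = Dense(128, num_actions)
vhead = Dense(128, 1, tanh)
model(s) = (h = trunk(s); (softmax(phead(h)), vhead(h)))

# AlphaZero-style loss: cross-entropy against the target policy π
# plus mean squared error against the game outcome z.
function loss(s, π, z)
  p, v = model(s)
  return Flux.crossentropy(p, π) + Flux.mse(vec(v), z)
end

# `states` is a state_dim×N matrix, `policies` a num_actions×N matrix
# (one column per sample), `outcomes` a length-N vector of results in [-1, 1].
function train!(states, policies, outcomes; epochs = 10, batchsize = 64)
  ps = Flux.params(trunk, phead, vhead)
  opt = ADAM(1e-3)                            # `Adam` in newer Flux versions
  for _ in 1:epochs
    for idx in Iterators.partition(shuffle(1:size(states, 2)), batchsize)
      s, p, z = states[:, idx], policies[:, idx], outcomes[idx]
      gs = gradient(() -> loss(s, p, z), ps)
      Flux.update!(opt, ps, gs)
    end
  end
end
```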

@jonathan-laurent
Owner

> will it ever be implemented?

Doing so is not my priority right now, but I would be happy to welcome contributions here.
Note that I am working on a rewrite of AlphaZero.jl that should be ready by the end of the summer, so you might want to wait a little bit if your intent is to submit a PR.

> I still don't understand in which format the games and moves are stored in the memory buffer.

See src/memory.jl.

@StepHaze
Author

Thanks!
If you're rewriting AlphaZero.jl, will our old code still work?

@jonathan-laurent
Owner

Previous versions will still be accessible from git or the package manager.
But the new version will indeed break compatibility with existing code.

@StepHaze
Author

What's the main reason for rewriting AlphaZero.jl?
The existing version already allows us to create pretty strong bots.

@StepHaze
Author

Thanks!

@StepHaze
Author

Could you please add a supervised learning feature in the next release, so we can feed in human-played games instead of self-play games?
We can pay a reasonable price.

@jonathan-laurent
Owner

I will keep this in mind, although I cannot make any promises right now.

@StepHaze
Author

StepHaze commented Jun 23, 2022

Please! We really need it.

Your AlphaZero.jl is a WONDERFUL project. I must say you're a genius.
I spent months trying to train my bots using Python projects, and that was very slow and inefficient.
With your masterpiece I trained my bot in just a couple of days.

@jonathan-laurent
Owner

Can you tell me more about how you or your company are using AlphaZero.jl and for what game/environment?
It is always interesting for me to get this kind of feedback.

@StepHaze
Author

It's a non-commercial, educational project. I teach kids to play a board game (of the mancala family). We don't have good software, so sometimes we don't even know where a player made a mistake. With AlphaZero.jl I created a bot that plays pretty strongly, and the "explore" function gives us an idea of which moves are good and which are bad.
Thanks for AlphaZero.jl!

@jonathan-laurent
Owner

Thanks for the feedback. It is great to hear that AlphaZero.jl is being used successfully in an educational project.

@StepHaze
Author

StepHaze commented Jun 26, 2022

The bot plays pretty strongly, but still leaves much to be desired.
And when I tried to make vectorize_state more complex (82x1x22), I started getting an "Out of memory" error.

So I was thinking about supervised learning. I have thousands of games played by masters.
I looked at src/memory.jl and noticed the following:
TrainingSample{State}
Type of a training sample. A sample features the following fields:

  • s::State is the state
  • π::Vector{Float64} is the recorded MCTS policy for this position
  • z::Float64 is the discounted reward cumulated from state s
  • t::Float64 is the (average) number of moves remaining before the end of the game
  • n::Int is the number of times the state s was recorded

How can I define these values? All I have is thousands of games with moves and results. They weren't played using MCTS, so I don't know π or the other values. Frankly, I'm very confused.

@StepHaze
Author

StepHaze commented Jul 1, 2022

I'm not a professional Julia programmer. I had to learn Julia to create a bot based on AlphaZero.jl.

@jonathan-laurent
Owner

jonathan-laurent commented Jul 6, 2022

First of all, a word of warning. I understand that you are not a trained programmer and it is all the nicer for me to learn that you were still able to use this package on your own game.

That being said, an algorithm such as AlphaZero can hardly be used as a black box, and the moment you try to do something a bit unusual, there is no escaping understanding the codebase and the underlying algorithm. In the long run, you may want to take the time to improve your Julia skills, read a bit about machine learning and AlphaZero, and then try to understand the codebase as a whole.

Regarding your current question: if you have a database of games played by humans, you can extract samples from it in the following way. For a state s, you would set π to a distribution that puts weight 1 on the action chosen by the human player and 0 elsewhere. Moreover, you would set z using the final outcome of the game that s is part of.
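
(A minimal sketch of that recipe, for a single recorded game. `apply_move` is a hypothetical stand-in for your own game rules, and the sign convention for z may need to be adapted to how rewards are defined for your game; this is not an AlphaZero.jl API.)

```julia
# Turn one human-played game into (s, π, z) samples.
# `initial_state` is the starting position, `moves` the sequence of action indices
# actually played, `outcome` the final result (+1 / 0 / -1) from the first player's
# point of view, `num_actions` the size of the action space.
function game_to_samples(initial_state, moves, outcome, num_actions)
  samples = []
  state = initial_state
  side = 1.0                        # flips every ply in a two-player game
  for a in moves
    π = zeros(Float64, num_actions)
    π[a] = 1.0                      # all the weight on the move the human chose
    z = side * outcome              # final result seen from the player to move
    push!(samples, (s = deepcopy(state), π = π, z = z))
    state = apply_move(state, a)    # hypothetical rule function for your game
    side = -side
  end
  return samples
end
```

These samples can then be fed to a training loop like the Flux sketch earlier in the thread, or adapted to the sample format used in src/memory.jl.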
