Lecture 09 - Policy gradient methods for the Breakout game

MachineLearningCourse.Lecture09 (Module)
Lecture09

Policy gradient methods for the Breakout game.

Available Functions

  • demo_reinforce(): Train agent using REINFORCE algorithm
  • reinforce_action(): Get action from trained REINFORCE policy

Usage

using MachineLearningCourse
policy, logger = Lecture09.demo_reinforce(max_episodes=500)
MachineLearningCourse.Lecture09.ActorCritic (Method)
ActorCritic(env; kwargs...) -> Flux.Chain

Train an agent using the Actor-Critic algorithm.

Actor-Critic combines policy gradients (actor) with value function learning (critic). Updates are performed at each step using TD error as the advantage estimate.
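The per-step TD-error advantage can be sketched as follows (a minimal sketch with made-up values; `td_error` is a hypothetical helper, not part of the module, which computes this inside `ActorCritic`):

```julia
# TD error δ = r + γ V(s') - V(s), used as the advantage estimate for the
# actor update and as the error the critic regresses toward zero.
td_error(r, v_s, v_sp; γ = 0.99) = r + γ * v_sp - v_s

δ = td_error(1.0, 0.5, 0.6)   # reward 1.0, V(s) = 0.5, V(s') = 0.6
# critic loss ∝ δ^2;  actor loss ∝ -δ * log π(a|s)
```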

Arguments

  • env: Environment implementing CommonRLInterface
  • hidden_layers=[64, 32]: Architecture of hidden layers for both actor and critic
  • η=1e-4: Learning rate for actor
  • η_critic=1e-3: Learning rate for critic
  • γ=0.99: Discount factor for TD error
  • T=20_000: Maximum steps per episode
  • max_episodes=1000: Maximum number of episodes to train
  • batch_size=32: Number of steps to collect before updating networks
  • callback=EpisodeLogger(): Function called after each episode

Returns

  • Trained policy network (Flux.Chain)

Example

env = BreakoutEnv()
policy = ActorCritic(env, max_episodes=1000)
MachineLearningCourse.Lecture09.REINFORCE (Method)
REINFORCE(env; kwargs...) -> Flux.Chain

Train an agent using the REINFORCE policy gradient algorithm.

Arguments

  • env: Environment implementing CommonRLInterface
  • hidden_layers=[64, 32]: Architecture of hidden layers
  • η=1e-3: Learning rate
  • γ=0.99: Discount factor for returns
  • T=20_000: Maximum steps per episode
  • max_episodes=1000: Maximum number of episodes to train
  • callback=EpisodeLogger(): Function called after each episode

Returns

  • Trained policy network (Flux.Chain)

Example

env = BreakoutEnv()
policy = REINFORCE(env, max_episodes=1000)
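The discounted returns that REINFORCE uses to weight each log-probability gradient can be sketched as follows (an assumption about the internals; `returns_to_go` is not exported by the module):

```julia
# Returns-to-go G_t = r_t + γ r_{t+1} + γ² r_{t+2} + ..., computed backwards
# over one episode's rewards. REINFORCE scales ∇ log π(a_t|s_t) by G_t.
function returns_to_go(rewards; γ = 0.99)
    out = zeros(length(rewards))
    G = 0.0
    for t in length(rewards):-1:1
        G = rewards[t] + γ * G
        out[t] = G
    end
    return out
end

returns_to_go([1.0, 1.0, 1.0]; γ = 0.5)   # [1.75, 1.5, 1.0]
```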
MachineLearningCourse.Lecture09.demo (Function)
demo(; algorithm=:REINFORCE, max_episodes=100_000, plot=true)

Run policy gradient training demo on Breakout environment.

Arguments

  • algorithm=:REINFORCE: Algorithm to use (:REINFORCE or :ActorCritic)
  • max_episodes=100_000: Maximum number of training episodes
  • plot=true: Whether to display live plots during training

Returns

  • (policy, logger): Trained network and logger

Usage

# REINFORCE (default)
policy, logger = Lecture09.demo(max_episodes=500)

# Actor-critic
policy, logger = Lecture09.demo(algorithm=:ActorCritic, max_episodes=500)
MachineLearningCourse.Lecture09.policy_agent (Method)
policy_agent(game_state::GameState, policy) -> Any

Get action from discrete policy network for Breakout.

Arguments

  • game_state: Current Breakout game state
  • policy: Trained policy network (Flux.Chain with softmax output)

Returns

  • Discrete action from Breakout action space
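How the policy's softmax output becomes a discrete action can be sketched without Flux (the logits below are hypothetical; the real network and state encoding are defined by the module):

```julia
# Numerically stable softmax over raw network outputs (logits).
softmax_sketch(z) = (e = exp.(z .- maximum(z)); e ./ sum(e))

logits = [0.2, 1.5, -0.3]        # hypothetical network output for one state
probs  = softmax_sketch(logits)  # valid distribution: sums to 1
greedy = argmax(probs)           # greedy action index (here: 2)
```

In practice the agent samples from `probs` rather than always taking the argmax, which keeps exploration alive during training.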
MachineLearningCourse.Lecture09.sample_action (Method)
sample_action(probs::Vector)

Sample action from probability distribution using cumulative distribution.

Arguments

  • probs: Action probabilities (should sum to 1.0)

Returns

  • Action index (Int) sampled according to probabilities

Example

probs = [0.1, 0.7, 0.2]  # Action probabilities
action_idx = sample_action(probs)  # Returns 1, 2, or 3
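A possible implementation of the cumulative-distribution sampling described above (a sketch only; the module's `sample_action` may handle RNG seeding and edge cases differently):

```julia
using Random

# Inverse-CDF sampling: draw u ~ U(0,1) and return the first index whose
# cumulative probability exceeds u.
function sample_action_sketch(probs; rng = Random.default_rng())
    u = rand(rng)
    c = 0.0
    for (i, p) in enumerate(probs)
        c += p
        u < c && return i
    end
    return length(probs)   # guard against floating-point shortfall
end
```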