"""Example on how to define and run with an RLModule with a dependent action space.

This example:
    - Shows how to write a custom RLModule outputting autoregressive actions.
    The RLModule class used here implements a prior distribution for the first couple
    of actions and then uses the sampled actions to compute the parameters for and
    sample from a posterior distribution.
    - Shows how to configure a PPO algorithm to use the custom RLModule.
    - Stops the training after 100k steps or when the mean episode return
    exceeds -0.012 in evaluation, i.e. if the agent has learned to
    synchronize its actions.
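The prior/posterior sampling described above can be illustrated with a minimal,
self-contained sketch. This is NOT the actual `AutoregressiveActionsRLM`
implementation; the toy "networks" below are hypothetical closed-form stand-ins
for learned models, chosen only to show the autoregressive data flow (sample the
first action, then condition the second action's distribution on it):

```python
import random


def sample_autoregressive_action(obs):
    # 1) Prior: score the first (discrete) action from obs alone. A toy
    #    quadratic rule stands in for a learned prior network here.
    prior_logits = [-(a - (1.0 - obs)) ** 2 for a in (0, 1, 2)]
    # Greedy "sampling" keeps the sketch deterministic for the first action.
    a1 = max(range(3), key=lambda a: prior_logits[a])

    # 2) Posterior: the sampled a1 is fed back in to parameterize the
    #    distribution of the second (continuous) action. Again, a toy
    #    closed-form mean stands in for a learned posterior network.
    posterior_mean = -obs - a1
    a2 = random.gauss(posterior_mean, 0.01)

    return a1, a2
```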

For details on the environment used, take a look at the `CorrelatedActionsEnv`
class. Note that all rewards in this env are <= 0.0 (see the table below); to
reach an episode return close to 0.0, the agent must learn how to synchronize
its actions.


How to run this script
----------------------
`python [script file name].py --enable-new-api-stack --num-env-runners 2`

Control the number of `EnvRunner`s with the `--num-env-runners` flag. This
will increase the sampling speed.

For debugging, use the following additional command line options
`--no-tune --num-env-runners=0`
which should allow you to set breakpoints anywhere in the RLlib code and
have the execution stop there for inspection and debugging.

For logging to your WandB account, use:
`--wandb-key=[your WandB API key] --wandb-project=[some project name]
--wandb-run-name=[optional: WandB run name (within the defined project)]`


Results to expect
-----------------
You should quickly reach an episode return better than -0.5 with a plain PPO
policy. The logic for beating the env is roughly:

OBS:  optimal a1:   r1:  optimal a2:   r2:
-1      2            0      -1.0        0
-0.5    1/2       -0.5   -0.5/-1.5      0
0       1            0      -1.0        0
0.5     0/1       -0.5   -0.5/-1.5      0
1       0            0      -1.0        0

Meaning, most of the time, you would receive a reward better than -0.5, but worse than
0.0.
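Reading the table off: the optimal first action is the integer closest to
`1 - obs`, r1 penalizes the distance to that target, and r2 is 0 exactly when
the actions synchronize, i.e. when `a1 + a2 == -obs`. The following sketch is
an interpretation of the table, not the actual `CorrelatedActionsEnv` source;
in particular the absolute-distance penalty shapes are assumptions (the table
only pins down where the rewards are 0):

```python
def table_rewards(obs, a1, a2):
    # r1: penalize the distance of a1 from its target 1 - obs. For half-integer
    # obs, the two nearest integer actions both yield r1 = -0.5, matching the
    # "1/2" and "0/1" rows of the table.
    r1 = -abs(a1 - (1.0 - obs))
    # r2: 0 when the actions "synchronize", i.e. when a1 + a2 == -obs
    # (assumed absolute-distance penalty otherwise).
    r2 = -abs(a1 + a2 - (-obs))
    return r1, r2
```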

+--------------------------------------+------------+--------+------------------+
| Trial name                           | status     |   iter |   total time (s) |
|                                      |            |        |                  |
|--------------------------------------+------------+--------+------------------+
| PPO_CorrelatedActionsEnv_6660d_00000 | TERMINATED |     76 |          132.438 |
+--------------------------------------+------------+--------+------------------+
+------------------------+------------------------+------------------------+
|    episode_return_mean |   num_env_steps_sample |   ...env_steps_sampled |
|                        |             d_lifetime |   _lifetime_throughput |
|------------------------+------------------------+------------------------|
|                  -0.43 |                 152000 |                1283.48 |
+------------------------+------------------------+------------------------+
    )	PPOConfig)RLModuleSpec)CorrelatedActionsEnv)AutoregressiveActionsRLM)add_rllib_example_script_args#run_rllib_example_script_experimenti  i gܿ)default_itersdefault_timestepsdefault_rewardT)enable_new_api_stack__main__PPOzKThis example script only runs with PPO! Set --algo=PPO on the command line.i        g{Gzt?ga2U0*3?)train_batch_size_per_learner
num_epochsminibatch_sizeentropy_coefflr)module_class)rl_module_specN)__doc__ray.rllib.algorithms.ppor   "ray.rllib.core.rl_module.rl_moduler   6ray.rllib.examples.envs.classes.correlated_actions_envr   @ray.rllib.examples.rl_modules.classes.autoregressive_actions_rlmr   ray.rllib.utils.test_utilsr   r   parserset_defaults__name__
parse_argsargsalgo
ValueErrorenvironmenttraining	rl_modulebase_config r(   r(   e/home/ubuntu/.local/lib/python3.10/site-packages/ray/rllib/examples/actions/autoregressive_actions.py<module>   sB    ?

