"""Example of using a count-based curiosity mechanism to learn in sparse-reward envs.

This example:
    - demonstrates how to define your own count-based curiosity ConnectorV2 piece
    that computes intrinsic rewards based on simple observation counts and adds these
    intrinsic rewards to the "main" (extrinsic) rewards.
    - shows how this connector piece overrides the main (extrinsic) rewards in the
    episode and thus demonstrates how to do reward shaping in general with RLlib.
    - shows how to plug this connector piece into your algorithm's config.
    - uses Tune and RLlib to learn the env described below and compares two variants,
    one that uses curiosity vs one that does not.

We use a FrozenLake (sparse reward) environment with a map size of 8x8 and a time step
limit of 14 to make it almost impossible for a non-curiosity-based policy to learn.


How to run this script
----------------------
`python [script file name].py --enable-new-api-stack`

Use the `--no-curiosity` flag to disable curiosity learning and force your policy
to be trained on the task w/o the use of intrinsic rewards. With this option, the
algorithm should NOT succeed.

For debugging, use the following additional command line options
`--no-tune --num-env-runners=0`
which should allow you to set breakpoints anywhere in the RLlib code and
have the execution stop there for inspection and debugging.

For logging to your WandB account, use:
`--wandb-key=[your WandB API key] --wandb-project=[some project name]
--wandb-run-name=[optional: WandB run name (within the defined project)]`


Results to expect
-----------------
In the console output, you can see that only a PPO policy that uses curiosity can
actually learn.

Policy using count-based curiosity:
+-------------------------------+------------+--------+------------------+
| Trial name                    | status     |   iter |   total time (s) |
|-------------------------------+------------+--------+------------------+
| PPO_FrozenLake-v1_109de_00000 | TERMINATED |     48 |            44.46 |
+-------------------------------+------------+--------+------------------+
+------------------------+-------------------------+------------------------+
|    episode_return_mean |   num_episodes_lifetime |   num_env_steps_traine |
|                        |                         |             d_lifetime |
|------------------------+-------------------------+------------------------|
|                   0.99 |                   12960 |                 194000 |
+------------------------+-------------------------+------------------------+

Policy NOT using curiosity:
[DOES NOT LEARN AT ALL]
"""
from ray.rllib.connectors.env_to_module import FlattenObservations
from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig
from ray.rllib.examples.connectors.classes.count_based_curiosity import (
    CountBasedCuriosity,
)
from ray.rllib.utils.test_utils import (
    add_rllib_example_script_args,
    run_rllib_example_script_experiment,
)
from ray.tune.registry import get_trainable_cls

parser = add_rllib_example_script_args(
    default_reward=0.99,
    default_iters=200,  # value assumed (not recoverable from the compiled module)
    default_timesteps=1000000,
)
parser.set_defaults(enable_new_api_stack=True)
parser.add_argument(
    "--intrinsic-reward-coeff",
    type=float,
    default=1.0,
    help="The weight with which to multiply intrinsic rewards before adding them "
    "to the extrinsic ones (default is 1.0).",
)
parser.add_argument(
    "--no-curiosity",
    action="store_true",
    help="Whether to NOT use count-based curiosity.",
)

ENV_OPTIONS = {
    "is_slippery": False,
    # 8x8 map with many holes (H) to fall into and only a few direct paths from
    # the start state (S) to the goal state (G).
    "desc": [
        "SFFHFFFH",
        "FFFHFFFF",
        "FFFHHFFF",
        "FFFFFFFH",
        "HFFHFFFF",
        "HHFHFFHF",
        "FFFHFHHF",
        "FHFFFFFG",
    ],
    # Time step limit of 14 (the minimum number of steps needed to reach the
    # goal on this map), making the task nearly unsolvable without curiosity.
    "max_episode_steps": 14,
}


if __name__ == "__main__":
    args = parser.parse_args()

    base_config = (
        get_trainable_cls(args.algo)
        .get_default_config()
        .environment(
            "FrozenLake-v1",
            env_config=ENV_OPTIONS,
        )
        .env_runners(
            num_envs_per_env_runner=5,  # value assumed; adjust as needed
            # Flatten (one-hot) the discrete observations for the default
            # MLP-based RLModule.
            env_to_module_connector=lambda env, spaces, device: FlattenObservations(),
        )
        .training(
            # Plug the custom curiosity ConnectorV2 piece into the Learner's
            # connector pipeline (unless --no-curiosity is set).
            learner_connector=(
                None if args.no_curiosity else lambda *ags, **kw: CountBasedCuriosity()
            ),
            num_epochs=10,  # value assumed; adjust as needed
            vf_loss_coeff=0.01,
        )
        .rl_module(
            model_config=DefaultModelConfig(vf_share_layers=True),
        )
    )

    run_rllib_example_script_experiment(base_config, args)
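The core idea behind the `CountBasedCuriosity` connector (defined in `ray.rllib.examples.connectors.classes.count_based_curiosity`) can be sketched without any RLlib dependency: keep a counter per observation and add an intrinsic bonus that shrinks as an observation becomes familiar. The class name `CountBasedBonus`, the method `shaped_reward`, and the `coeff / count` bonus formula below are illustrative choices for this sketch, not RLlib's actual implementation.

```python
from collections import Counter


class CountBasedBonus:
    """Toy stand-in for a count-based curiosity piece (illustration only)."""

    def __init__(self, intrinsic_reward_coeff=1.0):
        self.coeff = intrinsic_reward_coeff
        # Maps each (hashable) observation to the number of times it was seen.
        self.counts = Counter()

    def shaped_reward(self, obs, extrinsic_reward):
        # Count this observation, then add an intrinsic bonus that decays
        # with the visit count, rewarding novel states.
        self.counts[obs] += 1
        intrinsic = self.coeff / self.counts[obs]
        return extrinsic_reward + intrinsic


if __name__ == "__main__":
    bonus = CountBasedBonus(intrinsic_reward_coeff=1.0)
    # First visit to state 3: bonus of 1/1; second visit: bonus of 1/2.
    print(bonus.shaped_reward(obs=3, extrinsic_reward=0.0))
    print(bonus.shaped_reward(obs=3, extrinsic_reward=0.0))
```

As in the full example, the shaped reward would replace the extrinsic reward stored in the episode before the loss is computed, which is why disabling the connector (`--no-curiosity`) removes all learning signal on this sparse-reward map.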