The aim of this project is to explore whether an instruction-tuned LLM (Gemma-2B) can be fine-tuned into a decision-making policy for the Frozen Lake environment using only expert Q-values as supervision. The model is not asked to predict Q-values directly; instead, it generates a reasoning trace (<think>) and selects a final action (<answer>). Expert Q-values are then used to evaluate how good that chosen action is. The motivation is to test whether preference-based reinforcement learning (GRPO) can push a language model to behave like an optimal policy purely from scalar rewards derived from Q-value differences, without any supervised labels or regression losses.
For each state, the model is prompted with a text version of the grid and asked to produce a chain-of-thought and a final action. After the model selects an action, we compute a reward from the expert Q-values:
- Let the expert Q-values for the current state be $[Q_{\text{left}}, Q_{\text{down}}, Q_{\text{right}}, Q_{\text{up}}]$.
- Let $\hat{a}$ be the action predicted by the model.
- We take the expert's Q-value for that action, $Q_{\text{expert}}(\hat{a})$, and convert it into a normalized advantage:

$$A(\hat{a}) = \frac{Q_{\text{expert}}(\hat{a}) - Q_{\min}}{Q_{\max} - Q_{\min}}$$

where $Q_{\min}$ and $Q_{\max}$ are the smallest and largest Q-values for that state, so $A(\hat{a}) \in [0, 1]$.
- A small bonus is added if the model exactly matches the expert's optimal action.
- A separate format reward checks whether the model uses proper `<think>` and `<answer>` tags.
- The final reward passed to GRPO is the average of the format reward and the normalized advantage.
This reward is then used by GRPO: multiple sampled completions are compared, higher-reward completions are favored, and LoRA adapter weights are updated so the model becomes more likely to choose high-advantage actions in the future.
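As a concrete illustration, here is a minimal sketch of that reward computation. The function name `compute_reward`, the bonus value, and the exact tag-parsing regex are my own assumptions rather than the repository's implementation.

```python
import re
import numpy as np

def compute_reward(completion: str, q_values: np.ndarray, bonus: float = 0.1) -> float:
    """Sketch of the combined reward: format reward averaged with a
    min-max normalized Q-value advantage (names and weights are illustrative)."""
    # Format reward: 1.0 if the completion uses <think>...</think> followed by <answer>...</answer>.
    has_format = bool(re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", completion, re.DOTALL))
    format_reward = 1.0 if has_format else 0.0

    # Parse the chosen action index (0: LEFT, 1: DOWN, 2: RIGHT, 3: UP) from the <answer> tag.
    match = re.search(r"<answer>\s*([0-3])\s*</answer>", completion)
    if match is None:
        return 0.5 * format_reward  # no parseable action: only the format term contributes

    a_hat = int(match.group(1))

    # Min-max normalized advantage of the chosen action's expert Q-value.
    q_min, q_max = float(q_values.min()), float(q_values.max())
    advantage = (float(q_values[a_hat]) - q_min) / (q_max - q_min + 1e-8)

    # Small bonus for exactly matching the expert's optimal action.
    if a_hat == int(q_values.argmax()):
        advantage = min(advantage + bonus, 1.0)

    # Final reward: average of format reward and normalized advantage.
    return 0.5 * (format_reward + advantage)
```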
Across training runs, the model did not develop a stable or improving policy: its chosen actions did not consistently align with higher expert Q-values. I experimented with several alternative reward designs and shaping strategies to strengthen the learning signal, but these modifications did not lead to meaningful improvement. In parallel, a simpler method that did not require fine-tuning, later adopted by my professor's team, achieved more reliable results on this task. This outcome suggests that, for small control environments like Frozen Lake, lightweight inference-time methods can outperform preference-based fine-tuning approaches such as GRPO.
While my fine-tuning approach did not converge, my professor and his team were working in parallel on a different idea that proved successful: using in-context learning to predict actions. Their method worked well, and the paper was accepted at NeurIPS 2025.
The team introduced Prompted Policy Search (ProPS), which continually appends the agent's experience to the prompt without any fine-tuning; after several iterations, the LLM effectively learns a policy on its own in context.
The post on the project website illustrates the policy search process of ProPS in various environments.
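For intuition only, here is a minimal, hypothetical sketch of the prompt-appending idea (it is not the ProPS implementation): past transitions are kept in the prompt and a frozen LLM is re-queried each step, with `query_llm` standing in for whatever inference call is actually used.

```python
import random
import gymnasium as gym

def query_llm(prompt: str) -> int:
    # Stand-in for a call to a frozen LLM that returns an action index 0-3.
    # Replaced with random sampling here so the sketch runs end to end.
    return random.randrange(4)

env = gym.make("FrozenLake-v1", is_slippery=False)
experience_log = []  # accumulated (state, action, reward) tuples kept in the prompt

for episode in range(20):
    state, _ = env.reset()
    done = False
    while not done:
        # Build a prompt containing the task description plus all prior experience.
        prompt = (
            "You are controlling an agent on a 4x4 Frozen Lake grid.\n"
            f"Past experience: {experience_log}\n"
            f"Current state: {state}. Reply with an action index 0-3."
        )
        action = query_llm(prompt)  # no weight updates; the model only sees the prompt
        next_state, reward, terminated, truncated, _ = env.step(action)
        experience_log.append((state, action, reward))
        state, done = next_state, terminated or truncated
```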
The following code illustrates how to fine-tune an LLM using GRPO for RL tasks in OpenAI Gym environments. It requires:
- Python 3.8+
- PyTorch
- Transformers
- TRL (Transformer Reinforcement Learning)
- PEFT (Parameter-Efficient Fine-Tuning)
- Pandas
- NumPy
- TensorBoard (for monitoring)
```bash
# Create and activate a conda environment
conda create -n gemma-rl python=3.8
conda activate gemma-rl

# Install dependencies
pip install torch transformers trl peft datasets numpy pandas
pip install tensorboard
```

The training relies on a dataset of expert demonstrations for the Frozen Lake environment, stored in `expert_demos_batched_avg_q.csv`. This file should contain:

- `state_str`: String representation of the Frozen Lake state
- `action`: Expert action (0: LEFT, 1: DOWN, 2: RIGHT, 3: UP)
- `q_value_left`, `q_value_down`, `q_value_right`, `q_value_up`: Q-values for each possible action
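As a quick sketch of how this CSV can be turned into prompts for training, assuming the column names above (the prompt template itself is an illustrative assumption, not the one used in the repository):

```python
import pandas as pd
from datasets import Dataset

# Load the expert demonstrations (columns as described above).
df = pd.read_csv("expert_demos_batched_avg_q.csv")

def to_example(row):
    # Illustrative prompt template; the repository's actual template may differ.
    prompt = (
        "You are navigating a Frozen Lake grid:\n"
        f"{row['state_str']}\n"
        "Think step by step inside <think>...</think>, then give the action index "
        "(0: LEFT, 1: DOWN, 2: RIGHT, 3: UP) inside <answer>...</answer>."
    )
    q_values = [row["q_value_left"], row["q_value_down"],
                row["q_value_right"], row["q_value_up"]]
    return {"prompt": prompt, "q_values": q_values, "expert_action": int(row["action"])}

dataset = Dataset.from_list([to_example(row) for _, row in df.iterrows()])
```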
To train the model, run:

```bash
python train_script.py
```

Alternatively, if using a SLURM environment, use the provided batch script:

```bash
sbatch train_script.sh
```

The training configuration uses:
- Gemma 2B Instruct model
- LoRA fine-tuning (rank 16)
- BFloat16 precision
- Batch size of 4 with gradient accumulation steps of 4
- Learning rate of 3e-5 with cosine scheduler
- Combined reward function based on formatting and Q-value optimization
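The sketch below shows how these settings could be wired together with TRL's `GRPOTrainer` and PEFT; it reuses `compute_reward` and `dataset` from the earlier sketches, and the LoRA target modules and other unstated details are assumptions rather than the repository's exact configuration.

```python
import numpy as np
from trl import GRPOConfig, GRPOTrainer
from peft import LoraConfig

model_name = "google/gemma-2b-it"  # Gemma 2B Instruct

# Rank-16 LoRA adapter (target modules are an assumption).
peft_config = LoraConfig(
    r=16, lora_alpha=32, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

training_args = GRPOConfig(
    output_dir="./frozen-lake-gemma-2b-it-grpo",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=3e-5,
    lr_scheduler_type="cosine",
    bf16=True,
    report_to="tensorboard",
)

def q_value_reward(completions, q_values, **kwargs):
    # Extra dataset columns (here "q_values") are passed to reward functions by GRPOTrainer.
    return [compute_reward(c, np.array(q)) for c, q in zip(completions, q_values)]

trainer = GRPOTrainer(
    model=model_name,            # TRL loads the model from the hub id
    reward_funcs=q_value_reward,
    args=training_args,
    train_dataset=dataset,       # needs a "prompt" column (see the dataset sketch above)
    peft_config=peft_config,
)
trainer.train()
```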
Training progress can be monitored with TensorBoard:
```bash
tensorboard --logdir ./frozen-lake-gemma-2b-it-grpo
```

If you use this code in your research, please cite:
```bibtex
@software{continuous_control_gemma,
  author = {Aravind-11},
  title = {Continuous-Control-using-Gemma},
  year = {2025},
  url = {https://github.com/Aravind-11/Continuous-Control-using-Gemma}
}
```
- Google for the Gemma model
- Hugging Face for the Transformers library
- The TRL and PEFT libraries for reinforcement learning with transformers