The aim of this project is to explore whether an instruction-tuned LLM (Gemma-2B) can be fine-tuned into a decision-making policy for the Frozen Lake environment using only expert Q-values as supervision. The model is not asked to predict Q-values directly; instead, it generates a reasoning trace (<think>) and selects a final action (<answer>). Expert Q-values are then used to evaluate how good that chosen action is. The motivation is to test whether preference-based reinforcement learning (GRPO) can push a language model to behave like an optimal policy purely from scalar rewards derived from Q-value differences, without any supervised labels or regression losses.
For each state, the model is prompted with a text version of the grid and asked to produce a chain-of-thought and a final action. After the model selects an action, we compute a reward from the expert Q-values:
- Let the expert Q-values for the current state be $[Q_{\text{left}}, Q_{\text{down}}, Q_{\text{right}}, Q_{\text{up}}]$.
- Let $\hat{a}$ be the action predicted by the model.
- We take the expert's Q-value for that action, $Q_{\text{expert}}(\hat{a})$, and convert it into a normalized advantage:

$$A(\hat{a}) = \frac{Q_{\text{expert}}(\hat{a}) - Q_{\min}}{Q_{\max} - Q_{\min}}$$

where $Q_{\min}$ and $Q_{\max}$ are the smallest and largest Q-values for that state, so $A(\hat{a}) \in [0, 1]$.
- A small bonus is added if the model exactly matches the expert's optimal action.
- A separate format reward checks whether the model uses proper `<think>` and `<answer>` tags.
- The final reward passed to GRPO is the average of the format reward and the normalized advantage.
This reward is then used by GRPO: multiple sampled completions are compared, higher-reward completions are favored, and LoRA adapter weights are updated so the model becomes more likely to choose high-advantage actions in the future.
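As a concrete illustration, here is a minimal sketch of that reward computation. The function name `compute_reward`, the bonus value, and the exact tag-parsing regex are my own assumptions rather than the repository's implementation.

```python
import re
import numpy as np

def compute_reward(completion: str, q_values: np.ndarray, bonus: float = 0.1) -> float:
    """Sketch of the combined reward: format reward averaged with a
    min-max normalized Q-value advantage (names and weights are illustrative)."""
    # Format reward: 1.0 if the completion uses <think>...</think> followed by <answer>...</answer>.
    has_format = bool(re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", completion, re.DOTALL))
    format_reward = 1.0 if has_format else 0.0

    # Parse the chosen action index (0: LEFT, 1: DOWN, 2: RIGHT, 3: UP) from the <answer> tag.
    match = re.search(r"<answer>\s*([0-3])\s*</answer>", completion)
    if match is None:
        return 0.5 * format_reward  # no parseable action: only the format term contributes

    a_hat = int(match.group(1))

    # Min-max normalized advantage of the chosen action's expert Q-value.
    q_min, q_max = float(q_values.min()), float(q_values.max())
    advantage = (float(q_values[a_hat]) - q_min) / (q_max - q_min + 1e-8)

    # Small bonus for exactly matching the expert's optimal action.
    if a_hat == int(q_values.argmax()):
        advantage = min(advantage + bonus, 1.0)

    # Final reward: average of format reward and normalized advantage.
    return 0.5 * (format_reward + advantage)
```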
Across training runs, the model did not develop a stable or improving policy: its chosen actions did not consistently align with higher expert Q-values. I experimented with several alternative reward designs and shaping strategies to strengthen the learning signal, but these modifications did not lead to meaningful improvement. In parallel, a simpler method that did not require fine-tuning, later adopted by my professor's team, achieved more reliable results on this task. This outcome suggests that, for small control environments like Frozen Lake, lightweight inference-time methods can outperform preference-based fine-tuning approaches such as GRPO.
While my fine-tuning approach did not converge, my professor and his team were working in parallel on a different idea that proved successful: using in-context learning to predict actions. Their method worked well, and the paper was accepted at NeurIPS 2025.
The team introduced Prompted Policy Search (ProPS), which continually appends the agent's experience to the prompt without any fine-tuning; after several iterations, the LLM effectively learns a policy on its own in context.
The post on the project website illustrates the policy search process of ProPS in various environments.
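For intuition only, here is a minimal, hypothetical sketch of the prompt-appending idea (it is not the ProPS implementation): past transitions are kept in the prompt and a frozen LLM is re-queried each step, with `query_llm` standing in for whatever inference call is actually used.

```python
import random
import gymnasium as gym

def query_llm(prompt: str) -> int:
    # Stand-in for a call to a frozen LLM that returns an action index 0-3.
    # Replaced with random sampling here so the sketch runs end to end.
    return random.randrange(4)

env = gym.make("FrozenLake-v1", is_slippery=False)
experience_log = []  # accumulated (state, action, reward) tuples kept in the prompt

for episode in range(20):
    state, _ = env.reset()
    done = False
    while not done:
        # Build a prompt containing the task description plus all prior experience.
        prompt = (
            "You are controlling an agent on a 4x4 Frozen Lake grid.\n"
            f"Past experience: {experience_log}\n"
            f"Current state: {state}. Reply with an action index 0-3."
        )
        action = query_llm(prompt)  # no weight updates; the model only sees the prompt
        next_state, reward, terminated, truncated, _ = env.step(action)
        experience_log.append((state, action, reward))
        state, done = next_state, terminated or truncated
```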
The following code illustrates how to fine-tune an LLM using GRPO for RL tasks in OpenAI Gym environments. It requires:
- Python 3.8+
- PyTorch
- Transformers
- TRL (Transformer Reinforcement Learning)
- PEFT (Parameter-Efficient Fine-Tuning)
- Pandas
- NumPy
- TensorBoard (for monitoring)
```bash
# Create and activate a conda environment
conda create -n gemma-rl python=3.8
conda activate gemma-rl

# Install dependencies
pip install torch transformers trl peft datasets numpy pandas
pip install tensorboard
```

The training relies on a dataset of expert demonstrations for the Frozen Lake environment, stored in `expert_demos_batched_avg_q.csv`. This file should contain:

- `state_str`: String representation of the Frozen Lake state
- `action`: Expert action (0: LEFT, 1: DOWN, 2: RIGHT, 3: UP)
- `q_value_left`, `q_value_down`, `q_value_right`, `q_value_up`: Q-values for each possible action
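As a quick sketch of how this CSV can be turned into prompts for training, assuming the column names above (the prompt template itself is an illustrative assumption, not the one used in the repository):

```python
import pandas as pd
from datasets import Dataset

# Load the expert demonstrations (columns as described above).
df = pd.read_csv("expert_demos_batched_avg_q.csv")

def to_example(row):
    # Illustrative prompt template; the repository's actual template may differ.
    prompt = (
        "You are navigating a Frozen Lake grid:\n"
        f"{row['state_str']}\n"
        "Think step by step inside <think>...</think>, then give the action index "
        "(0: LEFT, 1: DOWN, 2: RIGHT, 3: UP) inside <answer>...</answer>."
    )
    q_values = [row["q_value_left"], row["q_value_down"],
                row["q_value_right"], row["q_value_up"]]
    return {"prompt": prompt, "q_values": q_values, "expert_action": int(row["action"])}

dataset = Dataset.from_list([to_example(row) for _, row in df.iterrows()])
```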
To train the model, run:

```bash
python train_script.py
```

Alternatively, if using a SLURM environment, use the provided batch script:

```bash
sbatch train_script.sh
```

The training configuration uses:
- Gemma 2B Instruct model
- LoRA fine-tuning (rank 16)
- BFloat16 precision
- Batch size of 4 with gradient accumulation steps of 4
- Learning rate of 3e-5 with cosine scheduler
- Combined reward function based on formatting and Q-value optimization
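The sketch below shows how these settings could be wired together with TRL's `GRPOTrainer` and PEFT; it reuses `compute_reward` and `dataset` from the earlier sketches, and the LoRA target modules and other unstated details are assumptions rather than the repository's exact configuration.

```python
import numpy as np
from trl import GRPOConfig, GRPOTrainer
from peft import LoraConfig

model_name = "google/gemma-2b-it"  # Gemma 2B Instruct

# Rank-16 LoRA adapter (target modules are an assumption).
peft_config = LoraConfig(
    r=16, lora_alpha=32, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

training_args = GRPOConfig(
    output_dir="./frozen-lake-gemma-2b-it-grpo",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=3e-5,
    lr_scheduler_type="cosine",
    bf16=True,
    report_to="tensorboard",
)

def q_value_reward(completions, q_values, **kwargs):
    # Extra dataset columns (here "q_values") are passed to reward functions by GRPOTrainer.
    return [compute_reward(c, np.array(q)) for c, q in zip(completions, q_values)]

trainer = GRPOTrainer(
    model=model_name,            # TRL loads the model from the hub id
    reward_funcs=q_value_reward,
    args=training_args,
    train_dataset=dataset,       # needs a "prompt" column (see the dataset sketch above)
    peft_config=peft_config,
)
trainer.train()
```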
Training progress can be monitored with TensorBoard:
```bash
tensorboard --logdir ./frozen-lake-gemma-2b-it-grpo
```

If you use this code in your research, please cite:
```bibtex
@software{continuous_control_gemma,
  author = {Aravind-11},
  title = {Continuous-Control-using-Gemma},
  year = {2025},
  url = {https://github.com/Aravind-11/Continuous-Control-using-Gemma}
}
```
- Google for the Gemma model
- Hugging Face for the Transformers library
- The TRL and PEFT libraries for reinforcement learning with transformers