Finetuning IBM Granite models using RLHF in Watsonx.ai — Part 2 : Proximal Policy Optimization (PPO)

Anantvir Singh
7 min read · Sep 28, 2024


This article is a continuation of my previous article; here, we will implement Stage 4 of the RLHF pipeline.

Policy optimization step [source]

Target Audience

This article assumes you have basic knowledge of Transformers, language modeling, deep learning, Python, Hugging Face Transformers, and policy gradient methods.

Key Takeaways

  1. How to use a reward model to optimize policy parameters
  2. A very brief introduction to PPO (Proximal Policy Optimization)
  3. How to implement PPO using Hugging Face TRL in watsonx.ai

Optimizing for Human Preferences

Fine-tuning an LM with PPO [source]

In reinforcement learning, the term “policy” describes an agent's decision-making strategy; in this case, the policy is the LLM that we are fine-tuning with RLHF. The main idea is that we want to change the parameters of our policy to maximize the expected reward. Since the reward is an external signal, i.e. a scalar that comes from the environment, there is no direct relationship between the policy's actions and the rewards, so the reward function is non-differentiable w.r.t. our policy's parameters. We therefore do some math to derive a weight-update rule and use gradient ascent to maximize the expected reward. If you want to explore the derivation further, I found CS224n and Spinning Up extremely helpful. Other sources that helped me understand the key concepts are in the references section.
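For reference, the vanilla policy-gradient (REINFORCE) estimator that those derivations arrive at can be written as below. This is the textbook form rather than TRL's exact implementation; here π_θ is the policy, a_t the tokens it generates, s_t the context so far, R(τ) the scalar reward for the full generation τ, and α the learning rate:

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right], \qquad \theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)

PPO builds on this by constraining each update so that the new policy stays close to the old one.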

What happens during training?

Training process

During training, we keep a total of three models in memory.

Policy: The LLM we are fine-tuning with RLHF.

Reference Model: The same model as the policy, but its weights remain frozen throughout training.

Reward Model: The model that produces a scalar score for every [prompt + generation] pair.

Steps:

  1. We take the prompt, which in this case is a multi-turn conversation between a Human and an AI Assistant. According to the paper, the prompt should always begin and end on the Human side of the conversation. This is handled in the preprocessing phase: we take the “chosen” column and make sure the last turn is from the Human.
  2. We pass the prompt to our policy and let it generate a response.
  3. The [prompt + response] is then passed through the reward model to produce a scalar reward score.
  4. The [prompt + response] is passed through the reference model to get output log probabilities.
  5. We also take the log probabilities from the policy.
  6. The log probabilities from both models are used to compute a KL divergence. This ensures that the policy's generations do not stray too far from the generations of the SFT model, preventing large gradient updates that would wreck the model and cause it to produce gibberish.
  7. The reward and the KL divergence are then used to update the weights of our policy as per the PPO algorithm (see the sketch after this list). The main difference between PPO and Vanilla Policy Gradient (VPG) is how they update the model weights and stabilize training: in VPG, large gradients can cause large weight updates that destabilize training, while bounding the update, for example through a KL penalty, is one of the main ways PPO differentiates itself from VPG.
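To make steps 6 and 7 concrete, here is a minimal sketch of the KL-penalized reward that PPO-style RLHF typically maximizes. The tensors and the kl_coef value are purely illustrative, and TRL computes this internally, so you never write it yourself:

import torch

# Illustrative per-token log-probs of the generated response under the policy
# and under the frozen reference (SFT) model, plus the reward-model score.
logprobs_policy = torch.tensor([-1.2, -0.8, -2.1])   # hypothetical values
logprobs_ref = torch.tensor([-1.0, -0.9, -1.8])      # hypothetical values
reward_score = torch.tensor(0.73)                    # scalar from the reward model
kl_coef = 0.2                                        # KL penalty coefficient

# Per-token penalty that keeps the policy close to the SFT reference model
kl_penalty = kl_coef * (logprobs_policy - logprobs_ref)

# The quantity PPO actually maximizes: reward-model score minus the KL penalty
total_reward = reward_score - kl_penalty.sum()

TRL applies the penalty per token and adds the reward-model score at the final token, which sums to the same total signal per sequence.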

Luckily for us, Hugging Face has already implemented PPO in its TRL (Transformer Reinforcement Learning) library, which comes preinstalled with watsonx.ai and which we will use to train our model.

Implementation

I have added comments along the way. If you are familiar with Hugging Face, this implementation is pretty straightforward. If not, I would suggest taking a quick tour of Transformers first.

Setup and data loading

from datasets import load_dataset
import torch
from transformers import (
    AutoTokenizer,
    pipeline,
    DataCollatorWithPadding,
)
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer
from peft import LoraConfig, get_peft_model
from tqdm import tqdm

# 1. Create checkpoints for model, tokenizer, and data
data_checkpoint = "Anthropic/hh-rlhf"
base_model_checkpoint = "ibm-granite/granite-7b-base"
reward_model_checkpoint = "<your-trained-reward-model-checkpoint>"
MAX_LENGTH = 512

# 2. Load datasets
train_dataset = load_dataset(data_checkpoint, split="train")
test_dataset = load_dataset(data_checkpoint, split="test")

# 3. Create tokenizer from checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model_checkpoint, padding_side="left")
# Ensure the tokenizer has a pad_token; set it to eos_token if not already set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

"""
Training dataset schema:
Dataset({
    features: ['chosen', 'rejected'],
    num_rows: 160800
})
"""

Data preprocessing

# 4. Create the preprocessing function
def preprocess_create_prompts_from_conversation(batch):
    """
    We need to create prompts from the "chosen" column. We should return
    prompts = {
        "input_ids": [[prompt1], [prompt2], ...]
    }
    """
    prompts = {
        "input_ids": [],
        "attention_mask": []
    }
    string_to_search = "Assistant:"
    for example in batch["chosen"]:
        # Make sure the last turn in the multi-turn conversation is from the Human (as explained above)
        last_index = example.rfind(string_to_search)

        prompt_ending_with_human_response = example[:last_index].rstrip("\n")

        prompt_ending_with_human_response_tokenized = tokenizer(
            prompt_ending_with_human_response,
            max_length=MAX_LENGTH,
            truncation=True)

        prompts["input_ids"].append(prompt_ending_with_human_response_tokenized["input_ids"])
        prompts["attention_mask"].append(prompt_ending_with_human_response_tokenized["attention_mask"])

    return prompts

final_train_dataset = train_dataset.map(
    preprocess_create_prompts_from_conversation,
    batched=True,
    remove_columns=train_dataset.column_names)
final_test_dataset = test_dataset.map(
    preprocess_create_prompts_from_conversation,
    batched=True,
    remove_columns=test_dataset.column_names)
"""
final_train_dataset:
Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 160800
})
"""

Setup PEFT using LoRA

# 5. Setup PEFT config for training with LoRA. We keep the rank small to fit on a single GPU
peft_lora_config = LoraConfig(
    r=4,  # Rank of LoRA matrices
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# 6. Get the PEFT model (we don't want to fine-tune the entire model with PPO; we don't have that much compute!)
# peft_lora_model = get_peft_model(base_model, peft_lora_config)
"""get_peft_model() does not work here because it returns a model without a value head, while our PPOTrainer expects
a model with a value head. get_peft_model() returns a model of type <class 'peft.peft_model.PeftModelForCausalLM'>:
PEFT does add the LoRA matrices, but it does not add a value head to the model. The solution is to pass the peft_config
directly to the from_pretrained() method of AutoModelForCausalLMWithValueHead, like:
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    rl_model_name,
    peft_config=lora_config,
)
"""
# Get the base model (granite-7b or any other model) from the hub and create the PEFT model directly
# by passing the peft_config object to from_pretrained()
peft_lora_model = AutoModelForCausalLMWithValueHead.from_pretrained(
    base_model_checkpoint,
    peft_config=peft_lora_config,
    load_in_4bit=True)
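As an optional sanity check, you can count the trainable parameters to confirm that only a small fraction of the network (the LoRA matrices plus the value head) will actually be updated. This uses only standard PyTorch, so it makes no extra assumptions about the TRL/PEFT API:

# Optional: compare trainable vs. total parameters of the wrapped model
trainable_params = sum(p.numel() for p in peft_lora_model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in peft_lora_model.parameters())
print(f"trainable: {trainable_params:,} / {total_params:,} ({100 * trainable_params / total_params:.2f}%)")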

Setup reward model pipeline

We can use Hugging Face pipelines, specifically the sentiment-analysis (text classification) pipeline, since we just need a scalar score from the reward model's single output head.

# Define the arguments to pass to the sentiment-analysis pipeline.
# We set `return_all_scores` to True to get the score for every output label (our reward model has a single one).
sentiment_kwargs = {
    "return_all_scores": True,
    "function_to_apply": "none",
    "batch_size": 16,
    "truncation": True,
}
# 7. Set up the reward model pipeline through hf pipeline()
sentiment_pipe = pipeline(
    "sentiment-analysis",
    model=reward_model_checkpoint,
    tokenizer=tokenizer,
    return_token_type_ids=False
)

if sentiment_pipe.model.config.pad_token_id is None:
    sentiment_pipe.model.config.pad_token_id = sentiment_pipe.model.config.eos_token_id
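For reference, with the kwargs above and a reward model that has a single output label, the pipeline output for a batch looks roughly like this (the label name and scores are made up), which is why the training loop below reads output[0]["score"]:

# Hypothetical output for a batch of two [prompt + response] strings:
# sentiment_pipe(["prompt+response 1", "prompt+response 2"], **sentiment_kwargs)
# -> [[{'label': 'LABEL_0', 'score': 0.42}],
#     [{'label': 'LABEL_0', 'score': -1.07}]]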

Setup PPO Configuration and PPOTrainer

# 8. Set generation kwargs for response generation to our prompts
generation_kwargs = {
    # "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.pad_token_id,
    "eos_token_id": 100_000,
    "max_new_tokens": 300
}

# 9. Create the configuration for the PPO Trainer
ppo_config = PPOConfig(
    steps=10,
    model_name=base_model_checkpoint,
    learning_rate=1e-6,
    optimize_cuda_cache=True,
    early_stopping=True,
    ppo_epochs=3,
    batch_size=4,
    mini_batch_size=2
)

# 10. Build the PPOTrainer, passing the model, the reference model, and the tokenizer
ppo_trainer = PPOTrainer(
    ppo_config,
    peft_lora_model,
    ref_model=None,
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(padding=True, max_length=512, tokenizer=tokenizer),
    dataset=final_train_dataset,
)

Setup RLHF training loop as described above

# 11. For each batch, follow the RLHF process described in the 7 steps above
for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):

    print("----- Entering epoch ----- :", epoch)

    if epoch >= ppo_config.ppo_epochs:
        break

    # batch is a dict { "input_ids" : tensor of shape (batch_size, batch_sequence_length) }
    # 1. Get input prompts
    prompts_batch = batch["input_ids"]  # shape (batch_size, batch_sequence_length)

    # 2. Convert input_ids to sentences so that they can be concatenated with the responses later
    batch["prompts_text"] = tokenizer.batch_decode(prompts_batch, skip_special_tokens=True)  # list of decoded strings

    # generate() requires that the input be either a tensor of shape (seq_length) containing query tokens,
    # or a list of tensors where each tensor has shape (seq_length)
    prompts_batch_list = [seq for seq in prompts_batch]  # list of tensors: [tensor 1, tensor 2, tensor 3, ...]

    # 3. Generate responses for this batch of prompts
    responses = ppo_trainer.generate(
        prompts_batch_list,
        return_prompt=False,
        **generation_kwargs)

    batch["responses_text"] = tokenizer.batch_decode(responses, skip_special_tokens=True)  # list of decoded strings

    # Combine (prompt + generation) so it can be scored by the reward model
    combined_prompt_generation = [prompt + generation for prompt, generation in zip(batch["prompts_text"], batch["responses_text"])]

    # 4. Send this (prompt + generation) to the reward model to score
    pipeline_outputs = sentiment_pipe(combined_prompt_generation, **sentiment_kwargs)

    # 5. Extract reward scores from the pipeline output
    rewards = [torch.tensor(output[0]["score"]) for output in pipeline_outputs]
    # print(rewards)

    # Run the PPO step
    stats = ppo_trainer.step(prompts_batch_list, responses, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)
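Once the loop finishes, you will want to persist the result. Here is a minimal sketch, assuming you want to write the LoRA-adapted model and tokenizer to a local directory of your choice (the path below is hypothetical). If your TRL version does not expose save_pretrained on the trainer, calling it on peft_lora_model and tokenizer directly achieves the same thing:

# 12. Save the fine-tuned model and the tokenizer (output path is up to you)
output_dir = "granite-7b-rlhf-ppo"  # hypothetical local path
ppo_trainer.save_pretrained(output_dir)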

Concluding thoughts

Libraries like Hugging Face TRL have made it extremely easy to implement RLHF. This technique can be leveraged for any use case provided we have human-labelled data. In my experience, techniques like RAG and its variants, prompt engineering, and prompt tuning are sufficient for many low-stakes use cases and automation scenarios. If models need to be aligned to domain-specific datasets but you don't have a lot of labelled data and you want to do that across multiple models, techniques like InstructLab give great results.

In an enterprise setting, RLHF is mostly used in high-stakes scenarios that might impact revenue or NPS, for example customer service, where empathy, factual accuracy, and human sentiment are critical. It is also used in highly regulated industries and niche domains where domain knowledge and human preference are absolutely critical, for example medical science, financial services, pharmaceuticals, and legal.
