raghavc/LLM-RLHF-Tuning-with-PPO-and-DPO

by raghavc · RLHF & Preference · updated 3mo ago

A toolkit for Reinforcement Learning from Human Feedback (RLHF) training. It covers instruction fine-tuning, reward model training, and both PPO and DPO, with configurations for the Alpaca, LLaMA, and LLaMA2 models.

191 stars · rank #90
