
0bserver07/Study-Reinforcement-Learning

by 0bserver07 · RLHF & Preference · updated 28d ago

RL study guide — foundations through RLHF, DPO, GRPO, RLVR, agentic RL, and offline RL. Hand-written CS294 notes, 19 lecture drafts, 5 tested exercises, citations that resolve.

159 stars · rank #51

agentic-rl, constitutional-ai, deep-learning, deepseek-r1, dpo, grpo, lecture-notes, llm-alignment, machine-learning, policy-gradient, ppo, q-learning


