0bserver07/Study-Reinforcement-Learning
by 0bserver07 · RLHF & Preference · updated 28d ago
RL study guide covering foundations through RLHF, DPO, GRPO, RLVR, agentic RL, and offline RL. Hand-written CS294 notes, 19 lecture drafts, 5 tested exercises, citations that resolve.
Momentum: 52 · Stars: 159 · Forks: 37 · Rank: #51
Tags: agentic-rl · constitutional-ai · deep-learning · deepseek-r1 · dpo · grpo · lecture-notes · llm-alignment · machine-learning · policy-gradient · ppo · q-learning