waltonfuture/MM-UPT
by waltonfuture · RLHF & Preference · updated 7mo ago
[NeurIPS 2025] First SFT, Second RL, Third UPT: Continual Improving Multi-Modal LLM Reasoning via Unsupervised Post-Training
22 momentum · 87 stars · 2 forks · rank #156
grpo · mllm · unsupervised-learning