waltonfuture/MM-UPT

by waltonfuture · RLHF & Preference · updated 7mo ago

[NeurIPS 2025] First SFT, Second RL, Third UPT: Continual Improving Multi-Modal LLM Reasoning via Unsupervised Post-Training

22 momentum · 87 stars · 2 forks · rank #156
grpo · mllm · unsupervised-learning