Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning

1 State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; 2 Beijing Academy of Artificial Intelligence; 3 Institute of Automation, Chinese Academy of Sciences; 4 School of Artificial Intelligence, University of Chinese Academy of Sciences
*Equal contribution · Project leaders · Corresponding author

Roadmap for Reason-RFT x RoboBrain

Please refer to the Reason-RFT GitHub repository for our full roadmap.


Reason-RFT Overview

Overview of Reason-RFT. Compared to traditional SFT-based methods, our proposed Reason-RFT framework demonstrates superior generalization in visual reasoning tasks, excelling in reasoning improvement, out-of-domain performance, and data efficiency.

Abstract

Visual reasoning abilities play a crucial role in understanding complex multimodal data, advancing both domain-specific applications and artificial general intelligence (AGI). Existing methods improve the reasoning of Vision-Language Models (VLMs) through Chain-of-Thought (CoT) supervised fine-tuning on meticulously annotated training data. However, this training paradigm may lead to overfitting and cognitive rigidity, restricting the model's ability to transfer visual reasoning skills across domains and limiting its real-world applicability. To address these limitations, we propose Reason-RFT, a novel reinforcement fine-tuning framework that significantly enhances generalization in visual reasoning tasks. Reason-RFT introduces a two-phase training framework: (1) Supervised Fine-Tuning (SFT) with curated CoT data activates the reasoning potential of VLMs, followed by (2) Group Relative Policy Optimization (GRPO)-based reinforcement learning, which samples multiple reasoning-response pairs per query and optimizes the policy with their group-relative rewards, significantly enhancing generalization in visual reasoning tasks. To evaluate Reason-RFT's visual reasoning capabilities, we reconstructed a comprehensive dataset spanning visual counting, structure perception, and spatial transformation, serving as a benchmark to systematically assess visual cognition, geometric understanding, and spatial generalization. Experimental results demonstrate Reason-RFT's three key advantages: (1) Performance Enhancement: it achieves state-of-the-art results across multiple tasks, outperforming both open-source and proprietary models; (2) Generalization Superiority: it consistently maintains robust performance across diverse tasks and domains, outperforming alternative training paradigms; (3) Data Efficiency: it excels in few-shot learning scenarios and surpasses full-dataset SFT baselines.
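For readers unfamiliar with GRPO, the minimal Python sketch below (not taken from the Reason-RFT codebase) illustrates the group-relative advantage computation underlying the stage-2 reinforcement learning: several reasoning-response pairs are sampled for the same query, each is scored by a reward function, and each response's advantage is its reward standardized within its sampling group, so no learned critic is required.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Core of GRPO: advantages are rewards standardized within the group
    of responses sampled from the same prompt (no value/critic model)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled reasoning-response pairs for one visual-reasoning query;
# the reward values here are purely illustrative.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.5]))
# Responses with above-average reward receive positive advantages and are
# reinforced; below-average responses are suppressed.
```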

Reason-RFT Pipeline

Framework of Reason-RFT. Reason-RFT introduces a two-phase training framework for visual reasoning. In Stage 1, Supervised Fine-Tuning (SFT) with Chain-of-Thought (CoT) reasoning activates the model's domain-specific reasoning capabilities on a high-quality visual reasoning dataset. In Stage 2, Group Relative Policy Optimization (GRPO) further strengthens these capabilities, pushing the model's reasoning limits and yielding superior generalization. Reward evaluation combines a format reward with three different types of accuracy reward. A hedged reward sketch is given below.
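The following sketch shows one plausible shape of such a reward function. The response template tags (`<think>`/`<answer>`), the per-task matching rules, and the reward weights are all assumptions for illustration and may differ from what Reason-RFT actually uses; only the overall structure (format reward plus task-specific accuracy rewards) follows the description above.

```python
import re

# Assumed response template: reasoning inside <think>...</think>, final answer
# inside <answer>...</answer>. The exact tags used by Reason-RFT may differ.
FORMAT_PATTERN = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)

def format_reward(response: str) -> float:
    """1.0 if the response follows the expected CoT/answer template, else 0.0."""
    return 1.0 if FORMAT_PATTERN.search(response) else 0.0

def accuracy_reward(task: str, prediction: str, ground_truth: str) -> float:
    """Task-dependent accuracy reward; matching rules here are illustrative."""
    if task == "visual_counting":
        # Exact numeric match on the predicted count.
        return 1.0 if prediction.strip() == ground_truth.strip() else 0.0
    if task == "structure_perception":
        # Case-insensitive exact match on the structural answer.
        return 1.0 if prediction.strip().lower() == ground_truth.strip().lower() else 0.0
    if task == "spatial_transformation":
        # Step-by-step match on the predicted transformation sequence.
        pred_steps = [s.strip() for s in prediction.split(",")]
        gt_steps = [s.strip() for s in ground_truth.split(",")]
        return 1.0 if pred_steps == gt_steps else 0.0
    raise ValueError(f"unknown task: {task}")

def total_reward(task, response, prediction, ground_truth,
                 w_format=0.5, w_acc=1.0):
    """Weighted sum of format and accuracy rewards (weights are assumptions)."""
    return (w_format * format_reward(response)
            + w_acc * accuracy_reward(task, prediction, ground_truth))
```

Each sampled reasoning-response pair is scored this way, and the resulting rewards feed the group-relative advantage computation sketched in the Abstract section.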

Evaluation Results

Results on three visual reasoning tasks. The best results among different training paradigms are highlighted in bold, while the second-best results are underlined. “ID” denotes in-domain test data, and “OOD” denotes out-of-domain test data.

Training Insights

Case Study on Visual Counting Task

Case Study on Structure Perception Task

Case Study on Spatial Transformation Task