On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training

Xueqing Wu Yu-Chi Lin Kai-Wei Chang Nanyun Peng
University of California, Los Angeles
TL;DR. VLM post-training (both SFT and RL) improves reasoning much more than perception, leaving perception as the bottleneck for end-to-end visual reasoning. We trace this asymmetry to two distinct mechanisms, token imbalance in SFT and reward coupling in RL, and show that targeted interventions improve end-to-end accuracy by up to 18.2 points.
Teaser figure: asymmetric optimization in SFT and RL
Asymmetric optimization of perception and reasoning. Both SFT (left) and RL (right) improve reasoning (R) substantially more than perception (P), but for different reasons, and each admits a different mitigation.

Abstract

Post-training has greatly improved reasoning in frontier vision-language models, yet its gains for perception remain comparatively limited, creating a bottleneck for end-to-end visual reasoning. To investigate this gap, we introduce a controlled diagnostic framework with two synthetic tasks that disentangle perception from reasoning. Our analysis reveals a consistent perception-reasoning asymmetry: post-training improves reasoning more substantially than perception, though the underlying mechanism differs by training paradigm.

For supervised fine-tuning (SFT), this asymmetry stems from token imbalance in chain-of-thought supervision, where perception occupies fewer tokens and thus receives a weaker training signal. Dynamically reweighting the loss mitigates this imbalance and boosts end-to-end accuracy by up to 18.2 points. For reinforcement learning (RL), the asymmetry instead arises from reward coupling: outcome rewards correlate more strongly with reasoning than with perception, weakening the signal for perception learning. Adding a perception-aware reward alleviates the imbalance and improves end-to-end accuracy by up to 6.0 points; even without ground-truth perception rewards, a reliable surrogate reward provides useful signal, yielding gains of up to 3.2 points.

Key Findings

1. A consistent asymmetry across SFT and RL

Across two model families (Qwen3-VL, InternVL3.5) and two tasks (Graph Coloring, Sudoku), both SFT and RL improve reasoning far more than perception, leaving perception as the dominant bottleneck for end-to-end visual reasoning.

SFT training curves: reasoning rises sharply while perception stays flat
SFT: reasoning lifts dramatically; perception barely moves.
RL training curves: reasoning rises faster than perception
RL: the same pattern, with reasoning outpacing perception.

The two paradigms share a symptom but, as we show below, it arises from different mechanisms and requires different mitigations.

2. SFT: token imbalance → loss reweighting

Mechanism. In CoT supervision, perception occupies only ~2.5% of tokens and contributes just ~1.3% of the loss, so token-averaged cross-entropy implicitly starves perception of gradient.

Mitigation. Reweight the loss to upweight perception tokens. A fixed weight already helps; dynamic multi-task balancing (NGDiff) lifts end-to-end accuracy by up to +18.2.

3. RL: reward coupling → perception-aware reward

Mechanism. The outcome reward correlates strongly with reasoning (r = 0.65–1.00) but only weakly with perception (r = 0.34–0.43), so perception receives a noisy credit signal.

Mitigation. Add a perception term to the reward. Ground-truth perception rewards give up to +6.0; a well-chosen surrogate reward still yields +3.2 without any perception labels.

A Controlled Diagnostic Framework

Real visual reasoning benchmarks rarely admit a clean separation of perception from reasoning. We build a synthetic testbed where the visual input has a canonical textual representation p*, enabling three orthogonal metrics:

  • End-to-end accuracy: does the model solve the task from the image?
  • Perception accuracy: does the model recover p* from the image?
  • Counterfactual reasoning accuracy: given oracle perception p*, can the model reason correctly?
Graph coloring and Sudoku task illustration
Two synthetic tasks. Graph Coloring requires coloring all nodes using a number of colors equal to the chromatic number, with adjacent nodes assigned different colors; p* is the edge list. Sudoku requires completing the 9×9 grid; p* is the partially filled grid. Outputs are structured as perception followed by reasoning.
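For Graph Coloring, the three metrics can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the exact-match criterion for perception, the representation of p* as a list of undirected edges, and the brute-force chromatic-number search (practical only for small graphs) are all assumptions.

```python
from itertools import product

def normalize_edges(edges):
    """Treat edges as undirected: (u, v) and (v, u) are the same edge."""
    return {tuple(sorted(e)) for e in edges}

def perception_correct(predicted_edges, true_edges):
    """Perception check for one example: exact match with p* (assumed criterion)."""
    return normalize_edges(predicted_edges) == normalize_edges(true_edges)

def is_proper_coloring(coloring, nodes, edges):
    """Every node is colored and adjacent nodes receive different colors."""
    if set(coloring) != set(nodes):
        return False
    return all(coloring[u] != coloring[v] for u, v in edges)

def chromatic_number(nodes, edges):
    """Smallest k admitting a proper coloring, found by brute force."""
    nodes = list(nodes)
    for k in range(1, len(nodes) + 1):
        for assignment in product(range(k), repeat=len(nodes)):
            if is_proper_coloring(dict(zip(nodes, assignment)), nodes, edges):
                return k
    return 0  # no nodes, or no proper coloring exists

def answer_correct(coloring, nodes, true_edges):
    """Proper coloring that uses exactly chromatic-number many colors."""
    edges = normalize_edges(true_edges)
    return (is_proper_coloring(coloring, nodes, edges)
            and len(set(coloring.values())) == chromatic_number(nodes, edges))

# Hypothetical example: a triangle needs three colors.
nodes = [0, 1, 2]
triangle = [(0, 1), (1, 2), (0, 2)]
print(perception_correct([(1, 0), (2, 1), (0, 2)], triangle))
print(answer_correct({0: "red", 1: "green", 2: "blue"}, nodes, triangle))
```

Under this sketch, end-to-end accuracy applies `answer_correct` to the answer produced from the image, while counterfactual reasoning accuracy applies the same check when the model is given p* as text instead.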

SFT: Token Imbalance Drives the Asymmetry

In CoT supervision, perception is a compact transcription while reasoning sprawls across verification, reflection, and self-correction. Token-averaged cross-entropy implicitly weights each part by its token count, so perception is starved of gradient.
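To make the imbalance concrete, the sketch below works through the arithmetic using the approximate shares reported in the key findings (perception at about 2.5% of tokens and about 1.3% of the loss). The helper name is hypothetical, and the only modeling assumption is that a span's share of a token-averaged loss equals its token count times its mean per-token loss, divided by the total.

```python
def per_token_loss_ratio(token_share, loss_share):
    """Mean per-token loss of perception relative to reasoning.

    With n_p, n_r token counts and l_p, l_r mean per-token losses:
        loss_share = n_p * l_p / (n_p * l_p + n_r * l_r)
    which rearranges to
        l_p / l_r = [loss_share / (1 - loss_share)] / [token_share / (1 - token_share)]
    """
    loss_odds = loss_share / (1.0 - loss_share)
    token_odds = token_share / (1.0 - token_share)
    return loss_odds / token_odds

ratio = per_token_loss_ratio(token_share=0.025, loss_share=0.013)
print(f"per-token loss ratio (perception / reasoning): {ratio:.2f}")
# prints: per-token loss ratio (perception / reasoning): 0.51
```

Under these approximate figures, a perception token carries roughly half the loss of a reasoning token, and perception tokens are also about forty times rarer, which is why their combined contribution to the token-averaged loss is so small.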

SFT perception vs reasoning gap
Reasoning improves dramatically; perception barely moves.
Perception-reasoning trade-off in SFT
Upweighting perception exposes a clean trade-off, and end-to-end accuracy peaks when perception is modestly upweighted.
Intervention: Loss Reweighting

Replace the standard token-averaged loss with L = λ · L_p + (1 − λ) · L_r, where L_p and L_r are the token-averaged losses over the perception and reasoning spans, respectively.
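A minimal sketch of this reweighted loss, written in plain Python over per-token loss values so that it runs without a deep-learning framework. The function names and the boolean perception mask are assumptions for illustration; in a real training loop the same computation would be applied to per-token cross-entropy tensors.

```python
def token_averaged_loss(token_losses):
    """Standard objective: average the loss over all supervised tokens."""
    return sum(token_losses) / len(token_losses)

def reweighted_loss(token_losses, is_perception, lam):
    """L = lam * L_p + (1 - lam) * L_r, each term averaged within its own span."""
    perception = [l for l, m in zip(token_losses, is_perception) if m]
    reasoning = [l for l, m in zip(token_losses, is_perception) if not m]
    if not perception or not reasoning:
        raise ValueError("need at least one token in each span")
    loss_p = sum(perception) / len(perception)
    loss_r = sum(reasoning) / len(reasoning)
    return lam * loss_p + (1.0 - lam) * loss_r

# Hypothetical sequence: 2 perception tokens followed by 8 reasoning tokens.
losses = [2.0, 4.0] + [1.0] * 8
mask = [True, True] + [False] * 8

print(token_averaged_loss(losses))         # standard objective
print(reweighted_loss(losses, mask, 0.2))  # lam equal to perception's token share
print(reweighted_loss(losses, mask, 0.5))  # perception upweighted
```

Setting λ to perception's share of the tokens (here 2/10) recovers the standard token-averaged loss, so any larger λ upweights perception relative to the default.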

A fixed reweighting already gives up to +13.8 points; dynamic balancing with NGDiff, which adjusts λ on the fly to equalize perception and reasoning gradients, gives up to +18.2.
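The idea of choosing λ to equalize the two gradient contributions can be illustrated with a simple norm-balancing rule. This page does not give the NGDiff update itself, so the rule below is a stand-in and not NGDiff: it assumes that "equalize" means matching the norms of the two weighted gradients, and every name in it is hypothetical.

```python
import math

def grad_norm(grad):
    """Euclidean norm of a gradient given as a flat list of floats."""
    return math.sqrt(sum(g * g for g in grad))

def balancing_lambda(grad_p, grad_r, eps=1e-12):
    """Choose lam so that lam * ||g_p|| equals (1 - lam) * ||g_r||."""
    norm_p = grad_norm(grad_p)
    norm_r = grad_norm(grad_r)
    return norm_r / (norm_p + norm_r + eps)

def combined_gradient(grad_p, grad_r):
    """Recompute lam for the current step and mix the two gradients."""
    lam = balancing_lambda(grad_p, grad_r)
    mixed = [lam * gp + (1.0 - lam) * gr for gp, gr in zip(grad_p, grad_r)]
    return lam, mixed

# Hypothetical gradients: the reasoning gradient is 10x larger in norm.
g_p = [0.3, 0.4]
g_r = [3.0, 4.0]
lam, mixed = combined_gradient(g_p, g_r)
print(f"lambda = {lam:.3f}")
```

Because λ is recomputed from the current gradients at each step, the weight on perception rises automatically whenever its gradient is small relative to reasoning's.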

RL: Reward Coupling Drives the Asymmetry

Under GRPO with an outcome reward, perception only influences the reward when the downstream reasoning also happens to be correct. The result is a credit-assignment problem: the reward signal is well-aligned with reasoning, but only loosely aligned with perception.
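A toy illustration of this gating effect, using the group-normalized advantage that GRPO computes from rollout rewards. The rollouts are hypothetical, the binary outcome reward is assumed to require both correct perception and correct reasoning, and the population standard deviation is used for normalization (implementations differ on this detail).

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: standardize rewards within one prompt's rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four hypothetical rollouts for the same prompt.
rollouts = [
    {"perception_ok": True,  "reasoning_ok": True},
    {"perception_ok": True,  "reasoning_ok": False},
    {"perception_ok": False, "reasoning_ok": True},
    {"perception_ok": False, "reasoning_ok": False},
]

# Outcome reward: 1 only when the final answer is right (assumed to need both parts).
rewards = [float(r["perception_ok"] and r["reasoning_ok"]) for r in rollouts]
advantages = group_advantages(rewards)

for rollout, adv in zip(rollouts, advantages):
    print(rollout, f"advantage = {adv:+.3f}")
```

The second rollout perceived the image correctly but receives exactly the same negative advantage as the fourth, which did not: correct perception earns no credit unless reasoning also succeeds. The toy shows only this gating, not the size of the asymmetry measured in the experiments.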

RL perception vs reasoning gap
Reasoning gains consistently outpace perception gains across settings.
Perception-reasoning trade-off frontier in RL
Adding a perception reward exposes a clean trade-off frontier: end-to-end accuracy peaks at a non-zero perception-reward weight.
Intervention: Perception-Aware Reward

Optimize against R = α · a_p + (1 − α) · a, where a_p is perception accuracy and a is end-to-end accuracy.
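The reward can be sketched per rollout as below. The function name is hypothetical, and scoring each rollout with binary perception and end-to-end correctness is an assumption; a graded perception score would plug into the same formula.

```python
def perception_aware_reward(perception_acc, end_to_end_acc, alpha):
    """R = alpha * a_p + (1 - alpha) * a for a single rollout."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * perception_acc + (1.0 - alpha) * end_to_end_acc

# Hypothetical rollout: perception is right, but the final answer is wrong.
print(perception_aware_reward(1.0, 0.0, alpha=0.0))  # outcome reward only
print(perception_aware_reward(1.0, 0.0, alpha=0.3))  # perception now earns partial credit
```

With α = 0 the reward reduces to the plain outcome reward; any α > 0 gives a rollout with correct perception a higher reward than one with incorrect perception, even when both end in a wrong answer.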

Ground-truth perception rewards improve end-to-end accuracy by up to 6.0 points. When ground-truth perception isn't available, surrogate rewards still help, yielding up to +3.2, and the reward–perception correlation is a reliable diagnostic for which surrogate will work.
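One way to apply the correlation diagnostic is to rank candidate surrogates by their Pearson correlation with perception correctness. The sketch below is illustrative only: the surrogate names and scores are made up, and it assumes perception correctness is available on a small held-out sample for the diagnostic, even when it is not used as a training reward.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5

def rank_surrogates(surrogate_scores, perception_correct):
    """Sort candidate surrogates by correlation with perception, highest first."""
    scored = {name: pearson(scores, perception_correct)
              for name, scores in surrogate_scores.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)

# Hypothetical held-out sample: 1 = perception correct, 0 = incorrect.
perception = [1, 1, 0, 0, 1, 0]
candidates = {
    "surrogate_a": [0.9, 0.8, 0.2, 0.1, 0.7, 0.3],
    "surrogate_b": [0.5, 0.1, 0.6, 0.2, 0.4, 0.3],
}

ranking = rank_surrogates(candidates, perception)
for name, r in ranking:
    print(f"{name}: r = {r:+.2f}")
```

A surrogate that ranks near the top here tracks perception closely and is the more promising candidate; one with weak or negative correlation is unlikely to supply a useful perception signal.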

Takeaways for VLM Post-Training

  • Don't trust end-to-end accuracy alone. A model can post-train into much better reasoning without its perception meaningfully improving, and perception is what bottlenecks the next round of gains.
  • The default objectives are biased. Token-averaged SFT and outcome-reward RL each have a built-in preference for reasoning over perception.
  • The fix is paradigm-specific. SFT needs loss reweighting; RL needs perception-aware (or perception-correlated surrogate) rewards. Both are simple drop-in changes.
