Accepted to ECCV 2026

Wake up for Touch!
Mask-isolated Tactile Alignment Learning in MLLMs

Teaching compact multimodal models to feel without forgetting how to see.

Yoonhyung Park*,  Minji Kim*,  Sungwon Moon,  Jiyoung Lee
Ewha Womans University
* Equal contribution (co-first authors) ·  Corresponding author
Figure: The dormant subspace, waking up for touch. Legend: frozen critical (vision); trainable dormant (touch).

Abstract

Touch supplies the physical grounding needed to perceive intrinsic material properties, such as friction and compliance, that vision alone often cannot resolve. Equipping multimodal large language models (MLLMs) with this tactile sense, alongside their pretrained vision-language ability, is a natural step toward reasoning about the physical world. In real-world deployment, however, doing so exposes a zero-sum trade-off: the limited parameter budget of compact models forces a choice between acquiring the new sensory modality and preserving the established vision-language reasoning. We present SPLASH, a mask-isolated tactile alignment learning framework for MLLMs. SPLASH quantifies the significance of each pretrained parameter and partitions the parameter space into a dormant subspace and a critical subspace. While the frozen critical subspace acts as a stable anchor to safeguard general visual knowledge, SPLASH updates the isolated dormant subspace to internalize tactile alignment with the LLM. This mechanism effectively prevents catastrophic forgetting and ensures non-destructive modality expansion. Extensive experiments show that SPLASH achieves state-of-the-art performance on visuo-tactile benchmarks, including SSVTP, TVL, and TacQuad, while preserving the backbone's original general-purpose capabilities.


The Problem: Catastrophic Forgetting

When a small MLLM is tuned on tactile data, it can start hallucinating about what it sees. The same model describes an image correctly (top), then — after a naive tactile branch is added — invents objects that aren't there.

Catastrophic forgetting example: MLLM produces hallucinated descriptions after naive tactile tuning.
Figure 1. Tactile tuning without isolation distorts pretrained visual features — a small amount of tactile data degrades vision-language reasoning.

Key Contributions

🧊 Parameter Isolation

A visual-relative importance metric finds a dormant subspace and confines all tactile updates to it — zero added inference overhead and forgetting is prevented by construction.

⚡ Single-Stage Training

Frozen critical weights act as stable vision-language anchors, so the tactile front-end and dormant weights train together in one pass — no multi-stage pipeline.

📦 Works at Small Scale

On Qwen2.5-VL-3B and InternVL-1B, SPLASH reaches state-of-the-art tactile reasoning while preserving general-purpose ability — outperforming prior 7B methods.


Method

Locate the dormant subspace, then wake it up for touch.

Step 1 — Locate

Score every weight against vision
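
As a rough illustration of the locate step, the sketch below scores each weight by its magnitude times the norm of the vision activations that feed it, then marks the lowest-scoring fraction as dormant. The score formula, the `dormant_ratio` parameter, and all function names are assumptions made for this example; this page does not spell out SPLASH's exact vision-relative importance metric.

```python
import math


def importance_scores(weights, vision_activations):
    """Hypothetical score: |w_ij| * L2 norm of input feature j over vision samples."""
    n_in = len(weights[0])
    # Norm of each input feature across the vision calibration samples.
    act_norm = [
        math.sqrt(sum(sample[j] ** 2 for sample in vision_activations))
        for j in range(n_in)
    ]
    return [[abs(w) * act_norm[j] for j, w in enumerate(row)] for row in weights]


def dormant_mask(scores, dormant_ratio):
    """Return a mask where True marks a dormant (trainable) weight.

    The lowest-scoring `dormant_ratio` fraction is treated as dormant.
    Ties at the threshold are all included, so slightly more than the
    requested fraction may be selected.
    """
    flat = sorted(s for row in scores for s in row)
    k = int(len(flat) * dormant_ratio)
    if k == 0:
        return [[False] * len(row) for row in scores]
    threshold = flat[k - 1]
    return [[s <= threshold for s in row] for row in scores]


# Toy 2x2 layer and two vision calibration samples (made-up numbers).
weights = [[0.5, -0.01], [0.02, 1.2]]
vision_activations = [[1.0, 2.0], [3.0, 0.5]]

scores = importance_scores(weights, vision_activations)
mask = dormant_mask(scores, dormant_ratio=0.5)
print(mask)
```

In this toy layer the two small-magnitude weights end up dormant, while the two large weights that vision relies on are marked critical.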

Step 2 — Align

Train only the dormant weights for touch
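
A minimal sketch of what mask-isolated training means at the level of a single update, assuming a plain SGD step and a precomputed boolean mask (True = dormant). The optimizer choice, the learning rate, and the name `masked_sgd_step` are illustrative assumptions, not details taken from the paper.

```python
def masked_sgd_step(weights, grads, mask, lr):
    """Apply an SGD step only where mask is True; other weights are copied unchanged."""
    return [
        [w - lr * g if m else w for w, g, m in zip(w_row, g_row, m_row)]
        for w_row, g_row, m_row in zip(weights, grads, mask)
    ]


# Toy 2x2 layer; the gradients stand in for a tactile alignment loss (made-up numbers).
weights = [[0.5, -0.01], [0.02, 1.2]]
grads = [[1.0, 1.0], [1.0, 1.0]]
mask = [[False, True], [True, False]]  # True = dormant, False = frozen critical

updated = masked_sgd_step(weights, grads, mask, lr=0.1)

# Critical weights are bit-for-bit identical after the step.
print(updated[0][0] == weights[0][0], updated[1][1] == weights[1][1])
```

Because the update only rewrites existing entries, the layer keeps its original shape and parameter count, which is consistent with the zero added inference overhead described under Key Contributions.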

SPLASH framework overview: locating the dormant subspace from weight and activation importance, then mask-guided tactile alignment training.
Figure 2. The SPLASH framework. A vision-relative importance score partitions the LLM into frozen critical and trainable dormant weights; the dormant subspace and a lightweight tactile front-end are then trained jointly in a single stage.

Results

Visuo-Tactile-Language benchmarks (LLM-judge, 1–10 scale)

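| Method    | Backbone      | SSVTP | TVL  | TacQuad | Avg  |
|-----------|---------------|-------|------|---------|------|
| Zero-shot | InternVL-1B   | 3.61  | 3.54 | 3.45    | 3.53 |
| Zero-shot | Qwen2.5-VL-3B | 3.74  | 3.40 | 3.57    | 3.57 |
| UniTouch  | LLaMA-7B      | 3.73  | 4.04 | 3.44    | 3.74 |
| TVL       | LLaMA-7B      | 5.27  | 4.38 | 3.29    | 4.31 |
| TVL       | Qwen2.5-VL-3B | 4.98  | 4.29 | 4.22    | 4.50 |
| SPLASH-1B | InternVL-1B   | 5.69  | 4.34 | 5.02    | 5.01 |
| SPLASH-3B | Qwen2.5-VL-3B | 5.48  | 4.39 | 4.86    | 4.91 |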

Vision-Language preservation benchmarks

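| Method    | Backbone      | MMMU | MathVista | MME sum | MMBench-EN | MMBench-CN |
|-----------|---------------|------|-----------|---------|------------|------------|
| Zero-shot | InternVL-1B   | 40.9 | 43.2      | 1950    | 70.7       | 66.3       |
| Zero-shot | Qwen2.5-VL-3B | 53.1 | 62.3      | 2157    | 79.1       | 78.1       |
| UniTouch  | LLaMA-7B      | 26.6 | 9.0       | 27      | 8.5        | 8.6        |
| TVL       | LLaMA-7B      | 26.0 | 11.6      | 582     | 8.6        | 5.5        |
| TVL       | Qwen2.5-VL-3B | 50.0 | 52.8      | 2168    | 76.8       | 76.5       |
| SPLASH-1B | InternVL-1B   | 37.3 | 41.0      | 1786    | 70.2       | 60.4       |
| SPLASH-3B | Qwen2.5-VL-3B | 55.3 | 65.3      | 2155    | 78.0       | 76.9       |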

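Bold = best, underline = second best. SPLASH-3B preserves, and on MMMU and MathVista exceeds, its pretrained Qwen2.5-VL-3B backbone. SPLASH-1B trails InternVL-1B by 0.5 to 5.9 points on MMMU, MathVista, and MMBench, and by 164 on MME sum.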
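
To make the preservation comparison concrete, the following sketch expresses each SPLASH score as a percentage of its zero-shot backbone, using the values from the preservation table above. The `retention` helper and the percentage-of-backbone view are conveniences introduced here, not a metric reported by the authors.

```python
BENCHMARKS = ["MMMU", "MathVista", "MME sum", "MMBench-EN", "MMBench-CN"]

# Scores copied from the Vision-Language preservation table.
backbone = {
    "InternVL-1B": [40.9, 43.2, 1950, 70.7, 66.3],
    "Qwen2.5-VL-3B": [53.1, 62.3, 2157, 79.1, 78.1],
}
splash = {
    "InternVL-1B": [37.3, 41.0, 1786, 70.2, 60.4],
    "Qwen2.5-VL-3B": [55.3, 65.3, 2155, 78.0, 76.9],
}


def retention(tuned, base):
    """Tuned score as a percentage of the backbone score, per benchmark."""
    return [round(100.0 * t / b, 1) for t, b in zip(tuned, base)]


# Values above 100 mean the tuned model scores higher than its backbone.
for name in backbone:
    ratios = retention(splash[name], backbone[name])
    print(name, dict(zip(BENCHMARKS, ratios)))
```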


BibTeX

@inproceedings{park2026splash,
  title     = {Wake up for Touch! Mask-isolated Tactile
               Alignment Learning in MLLMs},
  author    = {Park, Yoonhyung and Kim, Minji and
               Moon, Sungwon and Lee, Jiyoung},
  booktitle = {Proceedings of the European Conference
               on Computer Vision (ECCV)},
  year      = {2026}
}