SPLASH — Mask-isolated Tactile Alignment Learning in MLLMs

Abstract

Touch supplies the physical grounding needed to perceive intrinsic material properties, such as friction and compliance, that vision alone often cannot resolve. Equipping multimodal large language models (MLLMs) with this tactile sense, alongside their pretrained vision-language ability, is a natural step toward reasoning about the physical world. In real-world deployment, however, doing so exposes a zero-sum trade-off: the limited parameter budget of compact models forces a choice between acquiring the new sensory modality and preserving the established vision-language reasoning. We present SPLASH, a mask-isolated tactile alignment learning framework for MLLMs. SPLASH quantifies the significance of each pretrained parameter and partitions the parameter space into a dormant subspace and a critical subspace. The frozen critical subspace acts as a stable anchor that safeguards general visual knowledge, while SPLASH updates only the isolated dormant subspace to align tactile input with the LLM. This mechanism prevents catastrophic forgetting and makes modality expansion non-destructive. Extensive experiments show that SPLASH achieves state-of-the-art performance on visuo-tactile benchmarks, including SSVTP, TVL, and TacQuad, while preserving the backbone's original general-purpose capabilities.

The Problem: Catastrophic Forgetting

When a small MLLM is tuned on tactile data, it can start hallucinating about what it sees. The same model describes an image correctly (top), then — after a naive tactile branch is added — invents objects that aren't there.

**Figure 1.** Catastrophic forgetting example: the MLLM produces hallucinated descriptions after naive tactile tuning. Tactile tuning without isolation distorts pretrained visual features; a small amount of tactile data is enough to degrade vision-language reasoning.

Key Contributions

🧊 Parameter Isolation

A visual-relative importance metric finds a dormant subspace and confines all tactile updates to it, adding zero inference overhead and preventing forgetting by construction.

⚡ Single-Stage Training

Frozen critical weights act as stable vision-language anchors, so the tactile front-end and dormant weights train together in one pass — no multi-stage pipeline.

📦 Works at Small Scale

On Qwen2.5-VL-3B and InternVL-1B, SPLASH reaches state-of-the-art tactile reasoning while preserving general-purpose ability, outperforming prior 7B methods on the average visuo-tactile score.

Method

Locate the dormant subspace, then wake it up for touch.

Step 1 — Locate

Score every weight against vision

Step 2 — Align

Train only the dormant weights for touch
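
The two steps above do not spell out the scoring rule on this page. As a minimal sketch, assume each weight is scored by its magnitude times the norm of the visual activation feeding it, and that a fixed fraction of the lowest-scoring weights is declared dormant. The names `importance_scores`, `dormant_mask`, and `dormant_ratio` are hypothetical, and the actual SPLASH metric may differ.

```python
import math


def importance_scores(weights, visual_activations):
    """Score W[i][j] as |W[i][j]| * ||x_j||_2 over visual calibration samples.

    This scoring rule is an assumption made for illustration only.
    weights: list of rows, shape (out_features, in_features)
    visual_activations: list of samples, each with in_features values
    """
    n_in = len(weights[0])
    # L2 norm of each input feature across the visual samples.
    act_norm = [
        math.sqrt(sum(sample[j] ** 2 for sample in visual_activations))
        for j in range(n_in)
    ]
    return [[abs(w) * act_norm[j] for j, w in enumerate(row)] for row in weights]


def dormant_mask(scores, dormant_ratio):
    """Return a boolean mask: True = dormant (trainable), False = critical (frozen).

    Weights tied with the threshold score are also marked dormant.
    """
    flat = sorted(s for row in scores for s in row)
    k = int(len(flat) * dormant_ratio)
    if k == 0:
        return [[False] * len(row) for row in scores]
    threshold = flat[k - 1]
    return [[s <= threshold for s in row] for row in scores]


# Toy layer with 2 outputs and 3 inputs, plus two visual samples.
weights = [[0.9, -0.1, 0.5], [0.05, 0.8, -0.02]]
visual_activations = [[1.0, 2.0, 0.5], [1.0, 0.0, 0.5]]

scores = importance_scores(weights, visual_activations)
mask = dormant_mask(scores, dormant_ratio=0.5)
print(mask)
```

In this toy layer, the three weights that contribute least to the visual activations end up in the dormant set, and the rest are treated as critical.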

**Figure 2.** The SPLASH framework: the dormant subspace is located from weight and activation importance, then trained with mask-guided tactile alignment. A vision-relative importance score partitions the LLM into frozen **critical** and trainable **dormant** weights; the dormant subspace and a lightweight tactile front-end are then trained jointly in a single stage.
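
One way to read the mask-guided training in Figure 2 is as an ordinary gradient step whose update is zeroed wherever the mask marks a weight as critical. The sketch below assumes plain SGD and a precomputed boolean mask; `masked_sgd_step` and its arguments are hypothetical names, not the authors' code.

```python
def masked_sgd_step(weights, grads, dormant, lr):
    """Apply an SGD update only where dormant[i][j] is True.

    Critical weights (dormant[i][j] is False) are returned unchanged,
    so the pretrained values at those positions are preserved exactly.
    """
    return [
        [w - lr * g if d else w for w, g, d in zip(w_row, g_row, d_row)]
        for w_row, g_row, d_row in zip(weights, grads, dormant)
    ]


# Toy layer: the mask is assumed to come from a prior importance analysis.
weights = [[0.9, -0.1, 0.5], [0.05, 0.8, -0.02]]
grads = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
dormant = [[False, True, False], [True, False, True]]

updated = masked_sgd_step(weights, grads, dormant, lr=0.5)
print(updated)
```

Because the update is written into existing weight positions, the layer keeps its original shape and parameter count, which is consistent with the page's claim of zero added inference overhead.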

Results

Visuo-Tactile-Language benchmarks (LLM-judge, 1–10 scale)

Method	Backbone	SSVTP	TVL	TacQuad	Avg
Zero-shot	InternVL-1B	3.61	3.54	3.45	3.53
Zero-shot	Qwen2.5-VL-3B	3.74	3.40	3.57	3.57
UniTouch	LLaMA-7B	3.73	4.04	3.44	3.74
TVL	LLaMA-7B	5.27	4.38	3.29	4.31
TVL	Qwen2.5-VL-3B	4.98	4.29	4.22	4.50
SPLASH-1B	InternVL-1B	5.69	4.34	5.02	5.01
SPLASH-3B	Qwen2.5-VL-3B	5.48	4.39	4.86	4.91

Vision-Language preservation benchmarks

Method	Backbone	MMMU	MathVista	MME_sum	MMBench-EN	MMBench-CN
Zero-shot	InternVL-1B	40.9	43.2	1950	70.7	66.3
Zero-shot	Qwen2.5-VL-3B	53.1	62.3	2157	79.1	78.1
UniTouch	LLaMA-7B	26.6	9.0	27	8.5	8.6
TVL	LLaMA-7B	26.0	11.6	582	8.6	5.5
TVL	Qwen2.5-VL-3B	50.0	52.8	2168	76.8	76.5
SPLASH-1B	InternVL-1B	37.3	41.0	1786	70.2	60.4
SPLASH-3B	Qwen2.5-VL-3B	55.3	65.3	2155	78.0	76.9

Bold = best, underline = second best. SPLASH largely preserves the pretrained backbone, and SPLASH-3B exceeds it on MMMU and MathVista.
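
To make the preservation comparison concrete, the following sketch computes each tuned model's change relative to its own zero-shot backbone, using only numbers copied from the table above. The helper name `delta_vs_backbone` is illustrative. The LLaMA-7B rows are left out because the table lists no zero-shot LLaMA-7B baseline.

```python
BENCHMARKS = ["MMMU", "MathVista", "MME_sum", "MMBench-EN", "MMBench-CN"]

# Zero-shot backbone scores, copied from the preservation table.
ZERO_SHOT = {
    "InternVL-1B": [40.9, 43.2, 1950, 70.7, 66.3],
    "Qwen2.5-VL-3B": [53.1, 62.3, 2157, 79.1, 78.1],
}

# Tuned models: (backbone name, scores), copied from the same table.
TUNED = {
    "TVL (Qwen2.5-VL-3B)": ("Qwen2.5-VL-3B", [50.0, 52.8, 2168, 76.8, 76.5]),
    "SPLASH-1B": ("InternVL-1B", [37.3, 41.0, 1786, 70.2, 60.4]),
    "SPLASH-3B": ("Qwen2.5-VL-3B", [55.3, 65.3, 2155, 78.0, 76.9]),
}


def delta_vs_backbone(name):
    """Return the per-benchmark score change of a tuned model vs. its backbone."""
    backbone, scores = TUNED[name]
    baseline = ZERO_SHOT[backbone]
    return {b: round(s - z, 1) for b, s, z in zip(BENCHMARKS, scores, baseline)}


for name in TUNED:
    print(name, delta_vs_backbone(name))
```

A positive value means the tuned model scores above its backbone on that benchmark. Note that MME_sum is on a different scale from the other four columns, so its deltas are not directly comparable to theirs.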

BibTeX

@inproceedings{park2026splash,
  title     = {Wake up for Touch! Mask-isolated Tactile
               Alignment Learning in MLLMs},
  author    = {Park, Yoonhyung and Kim, Minji and
               Moon, Sungwon and Lee, Jiyoung},
  booktitle = {Proceedings of the European Conference
               on Computer Vision (ECCV)},
  year      = {2026}
}