Teaching compact multimodal models to feel without forgetting how to see.
Touch supplies the physical grounding needed to perceive intrinsic material properties, such as friction and compliance, that vision alone often cannot resolve. Equipping multimodal large language models (MLLMs) with this tactile sense, alongside their pretrained vision-language ability, is a natural step toward reasoning about the physical world. In real-world deployment, however, doing so exposes a zero-sum trade-off: the limited parameter budget of compact models forces a choice between acquiring the new sensory modality and preserving established vision-language reasoning. We present SPLASH, a mask-isolated tactile alignment learning framework for MLLMs. SPLASH quantifies the significance of each pretrained parameter and partitions the parameter space into a dormant subspace and a critical subspace. The critical subspace is frozen and acts as a stable anchor that safeguards general visual knowledge, while SPLASH updates only the isolated dormant subspace to align tactile inputs with the language model. This mechanism prevents catastrophic forgetting and makes the modality expansion non-destructive. Extensive experiments show that SPLASH achieves state-of-the-art performance on visuo-tactile benchmarks, including SSVTP, TVL, and TacQuad, while preserving the backbone's original general-purpose capabilities.
When a small MLLM is tuned on tactile data, it can start hallucinating about what it sees. The same model describes an image correctly (top), then, after a naive tactile branch is added, invents objects that aren't there.
SPLASH uses a visual-relative importance metric to find a dormant subspace and confines all tactile updates to it, which adds no inference overhead and prevents forgetting by construction.
Frozen critical weights act as stable vision-language anchors, so the tactile front-end and dormant weights train together in one pass, with no multi-stage pipeline.
On Qwen2.5-VL-3B and InternVL-1B, SPLASH reaches state-of-the-art tactile reasoning while preserving general-purpose ability, and it outperforms prior 7B methods.
Locate the dormant subspace, then wake it up for touch.
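The two steps named in this heading can be illustrated with a minimal NumPy sketch. It is a toy under stated assumptions, not the SPLASH implementation: the importance score `|w * g|` is a generic first-order saliency proxy standing in for the visual-relative importance metric, and `dormant_fraction`, `dormant_mask`, and `masked_sgd_step` are hypothetical names chosen for illustration.

```python
import numpy as np


def dormant_mask(weights, visual_grads, dormant_fraction=0.25):
    """Return a boolean mask that is True for the least important entries.

    ASSUMPTION: importance is |w * g| computed from gradients on visual data.
    This is a generic saliency proxy, not the metric used by SPLASH.
    """
    importance = np.abs(weights * visual_grads)
    k = int(round(dormant_fraction * importance.size))
    mask = np.zeros(importance.size, dtype=bool)
    if k > 0:
        # Indices of the k smallest importance scores in the flattened array.
        dormant_idx = np.argsort(importance, axis=None, kind="stable")[:k]
        mask[dormant_idx] = True
    return mask.reshape(importance.shape)


def masked_sgd_step(weights, tactile_grads, mask, lr=0.1):
    """Apply a plain SGD step only where mask is True.

    Entries outside the mask (the 'critical' subspace) are returned unchanged.
    """
    return np.where(mask, weights - lr * tactile_grads, weights)


# Toy stand-ins for one weight matrix and its gradients (random, hypothetical).
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 5))
g_visual = rng.normal(size=(4, 5))   # gradients from a vision-language loss
g_tactile = rng.normal(size=(4, 5))  # gradients from a tactile alignment loss

mask = dormant_mask(w, g_visual, dormant_fraction=0.25)
w_new = masked_sgd_step(w, g_tactile, mask)

print("dormant entries:", int(mask.sum()), "of", mask.size)
print("critical entries unchanged:", bool(np.array_equal(w_new[~mask], w[~mask])))
```

Because the update is applied in place to a subset of existing weights, the sketch adds no parameters, which matches the claim of no added inference overhead.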
Visuo-Tactile-Language benchmarks (LLM-judge, 1–10 scale)
| Method | Backbone | SSVTP | TVL | TacQuad | Avg |
|---|---|---|---|---|---|
| Zero-shot | InternVL-1B | 3.61 | 3.54 | 3.45 | 3.53 |
| Zero-shot | Qwen2.5-VL-3B | 3.74 | 3.40 | 3.57 | 3.57 |
| UniTouch | LLaMA-7B | 3.73 | 4.04 | 3.44 | 3.74 |
| TVL | LLaMA-7B | 5.27 | 4.38 | 3.29 | 4.31 |
| TVL | Qwen2.5-VL-3B | 4.98 | 4.29 | 4.22 | 4.50 |
| SPLASH-1B | InternVL-1B | 5.69 | 4.34 | 5.02 | 5.01 |
| SPLASH-3B | Qwen2.5-VL-3B | 5.48 | 4.39 | 4.86 | 4.91 |
Vision-Language preservation benchmarks
| Method | Backbone | MMMU | MathVista | MME (sum) | MMBench-EN | MMBench-CN |
|---|---|---|---|---|---|---|
| Zero-shot | InternVL-1B | 40.9 | 43.2 | 1950 | 70.7 | 66.3 |
| Zero-shot | Qwen2.5-VL-3B | 53.1 | 62.3 | 2157 | 79.1 | 78.1 |
| UniTouch | LLaMA-7B | 26.6 | 9.0 | 27 | 8.5 | 8.6 |
| TVL | LLaMA-7B | 26.0 | 11.6 | 582 | 8.6 | 5.5 |
| TVL | Qwen2.5-VL-3B | 50.0 | 52.8 | 2168 | 76.8 | 76.5 |
| SPLASH-1B | InternVL-1B | 37.3 | 41.0 | 1786 | 70.2 | 60.4 |
| SPLASH-3B | Qwen2.5-VL-3B | 55.3 | 65.3 | 2155 | 78.0 | 76.9 |
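Bold = best, underline = second best. SPLASH-3B preserves the pretrained Qwen2.5-VL-3B backbone and exceeds it on MMMU (55.3 vs. 53.1) and MathVista (65.3 vs. 62.3); SPLASH-1B scores somewhat below its InternVL-1B backbone.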
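One way to read the preservation table is as a per-benchmark change relative to each zero-shot backbone. The short script below does that arithmetic using only the Zero-shot and SPLASH rows of the table; the helper name `relative_change` is hypothetical, and the percentages come from the rounded table entries rather than from raw evaluation outputs.

```python
# Scores copied from the Zero-shot and SPLASH rows of the preservation table.
BENCHMARKS = ("MMMU", "MathVista", "MME", "MMBench-EN", "MMBench-CN")

ZERO_SHOT = {
    "InternVL-1B": (40.9, 43.2, 1950, 70.7, 66.3),
    "Qwen2.5-VL-3B": (53.1, 62.3, 2157, 79.1, 78.1),
}
SPLASH = {
    "InternVL-1B": (37.3, 41.0, 1786, 70.2, 60.4),
    "Qwen2.5-VL-3B": (55.3, 65.3, 2155, 78.0, 76.9),
}


def relative_change(backbone):
    """Percent change of the SPLASH row over the zero-shot row, per benchmark."""
    return {
        name: 100.0 * (after - before) / before
        for name, before, after in zip(
            BENCHMARKS, ZERO_SHOT[backbone], SPLASH[backbone]
        )
    }


for backbone in SPLASH:
    changes = relative_change(backbone)
    formatted = ", ".join(f"{name} {value:+.1f}%" for name, value in changes.items())
    print(f"{backbone}: {formatted}")
```

Expressing each score as a percentage of its zero-shot value puts MME, which is reported on a different scale, on the same footing as the accuracy benchmarks.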
@inproceedings{park2026splash,
  title     = {Wake up for Touch! Mask-isolated Tactile Alignment Learning in MLLMs},
  author    = {Park, Yoonhyung and Kim, Minji and Moon, Sungwon and Lee, Jiyoung},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2026}
}