MyungJin Lee*, Eunji Lee*, Jiyoung Lee+
Ewha Womans University
+ Corresponding author: lee.jiyoung@ewha.ac.kr
Modern zero-shot text-to-speech (TTS) models offer unprecedented expressivity but also pose serious misuse risks, as they can synthesize the voices of individuals who never consented. Existing unlearning approaches rely on retraining, which is costly and limited to speakers seen in the training set. We present TruS, a training-free speaker unlearning framework that shifts the paradigm from data deletion to inference-time control. TruS steers identity-specific hidden activations to suppress target speakers while preserving other attributes (e.g., prosody and emotion). Experiments on F5-TTS show that TruS effectively forgets both seen and unseen speakers without retraining, establishing a scalable safeguard for speech synthesis.
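The core mechanism, steering identity-specific hidden activations at inference time, can be sketched as projecting a speaker-identity direction out of a layer's hidden states via a forward hook. This is a minimal illustrative sketch, not the authors' implementation: the class name, the `alpha` strength parameter, the toy linear layer, and the way the identity direction is obtained are all assumptions.

```python
# Minimal sketch of inference-time activation steering, the general idea
# behind suppressing a target speaker's identity. All names here
# (ActivationSteering, alpha, the toy model) are illustrative assumptions.
import torch
import torch.nn as nn

class ActivationSteering:
    """Registers a forward hook that removes the component of a layer's
    hidden activations along a target-speaker identity direction."""
    def __init__(self, layer: nn.Module, speaker_dir: torch.Tensor, alpha: float = 1.0):
        # Unit-normalize the identity direction (e.g., estimated from hidden
        # states with vs. without the target speaker's reference prompt).
        self.dir = speaker_dir / speaker_dir.norm()
        self.alpha = alpha
        self.handle = layer.register_forward_hook(self._steer)

    def _steer(self, module, inputs, output):
        # Subtract (scaled by alpha) the identity component from the hidden
        # states; directions orthogonal to it (prosody, emotion) are untouched.
        proj = (output @ self.dir).unsqueeze(-1) * self.dir
        return output - self.alpha * proj

    def remove(self):
        self.handle.remove()

# Toy usage: steer a linear layer's output away from a random "identity" direction.
torch.manual_seed(0)
layer = nn.Linear(8, 8)
direction = torch.randn(8)
steer = ActivationSteering(layer, direction, alpha=1.0)
h = layer(torch.randn(2, 8))
# With alpha=1, the steered activations have no component along the direction.
residual = h @ (direction / direction.norm())
steer.remove()
```

With `alpha=1.0` the hook fully ablates the identity component; intermediate values would trade off forgetting strength against fidelity.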
Reference / TruS (ours) / FT / SGU / TGU
| Case | Reference | TruS (ours) [Forget] | FT [Forget] | SGU [Forget] | TGU [Forget] |
|---|---|---|---|---|---|
| Case 1 | | | | | |
| Case 2 | | | | | |
| Case 3 | | | | | |
| Case 4 | | | | | |
Reference / TruS (ours) / FT / SGU / TGU

| Case | Reference | TruS (ours) [Forget] | FT [Remain] | SGU [Remain] | TGU [Remain] |
|---|---|---|---|---|---|
| Case 1 | | | | | |
| Case 2 | | | | | |
| Case 3 | | | | | |
| Case 4 | | | | | |
Reference Emotion vs. TruS (ours)
| Emotion | Reference Emotion | TruS (ours) |
|---|---|---|
| Happy | | |
| Sad | | |
| Angry | | |
| Neutral | | |
| Disgust | | |
| Fear | | |