Publications | Multimodal AI Lab

Conference: 29, Journal: 4, Workshop: 6, Preprint: 4

2026

[C29] Hear What Matters! Text-conditioned Selective Video-to-Audio Generation

Junwon Lee, Juhan Nam, and Jiyoung Lee^†

In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

Code Project Paper

Multimodal Generation Audiovisual Speech
[C28] Erasing Your Voice Before It’s Heard: Training-Free Speaker Unlearning For Zero-Shot Text-To-Speech

Myungjin Lee^*, Eunji Shin^*, and Jiyoung Lee^†

In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2026

Code Project Paper

Multimodal Audiovisual Unlearning Speech
[C27] Learning What To Hear: Boosting Sound-Source Association For Robust Audiovisual Instance Segmentation

Jinbae Seo, Hyeongjun Kwon, Kwonyoung Kim, Jiyoung Lee^†, and Kwanghoon Sohn^†

In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2026

Code Paper

Multimodal Audiovisual Instance Segmentation

2025

[P4] Referee: Reference-aware Audiovisual Deepfake Detection

Hyemin Boo, Eunsang Lee, and Jiyoung Lee^†

arXiv preprint arXiv:2510.27475, 2025

Code Paper

Multimodal Audiovisual Deepfake
[W6] Seeing What You Say: Expressive Image Generation from Speech

Jiyoung Lee^†, Song Park, Sanghyuk Chun, and Soo-Whan Chung

In IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Generative AI for Audio-Visual Content Creation, 2025

Code Project Paper

Multimodal Generation Audiovisual Speech
[J4] Language-guided Recursive Spatiotemporal Graph Modeling for Video Summarization

Jungin Park, Jiyoung Lee^†, and Kwanghoon Sohn^†

International Journal of Computer Vision (IJCV), 2025

Paper

Multimodal Video summarization Graph modeling
[P3] Descriptive Image-Text Matching with Graded Contextual Similarity

Jinhyun Jang, Jiyoung Lee^†, and Kwanghoon Sohn^†

arXiv preprint arXiv:2505.09997, 2025

Paper

Preprint Multimodal Vision-language Matching
[C26] Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations

Jungin Park, Jiyoung Lee^†, and Kwanghoon Sohn^†

In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

Code Paper

Vision Video
[W5, C25] Read, Watch and Scream! Sound Generation from Text and Video

Yujin Jeong, Yunji Kim, Sanghyuk Chun, and Jiyoung Lee^†

In AAAI Conference on Artificial Intelligence (AAAI), 2025

Code Project Paper

Multimodal Generation Audiovisual

2024

[J3] Prototype-Guided Attention Distillation for Discriminative Person Search

Hanjae Kim, Jiyoung Lee, and Kwanghoon Sohn^†

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024

Paper

Vision Person search Distillation
[J2] Discriminative action tubelet detector for weakly-supervised action detection

Jiyoung Lee, Seungryong Kim, Sunok Kim, and Kwanghoon Sohn^†

Pattern Recognition, 2024

Paper

Vision Video Action detection Weakly-supervised
[C24] Bridging Vision and Language Spaces with Assignment Prediction

Jungin Park, Jiyoung Lee^†, and Kwanghoon Sohn^†

In International Conference on Learning Representations (ICLR), 2024

Code Paper

Multimodal LLM Vision-language Multimodal Alignment
[C23] Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation

Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Hyeonsu Kim, Jaehoon Ko, Junho Kim, Jin-Hwa Kim^†, Jiyoung Lee^†, and Seungryong Kim^†

In International Conference on Learning Representations (ICLR), 2024

Code Project Paper

Multimodal 3D Generation

2023

[C22] Robust camera pose refinement for multi-resolution hash encoding

Hwan Heo, Taekyung Kim, Jiyoung Lee, Jaewon Lee, Soohyun Kim, Hyunwoo J Kim^†, and Jin-Hwa Kim^†

In International Conference on Machine Learning (ICML), 2023

Paper

Vision 3D Generation
[C21] MIDMs: Matching interleaved diffusion models for exemplar-based image translation

Junyoung Seo, Gyuseong Lee, Seokju Cho, Jiyoung Lee, and Seungryong Kim^†

In AAAI Conference on Artificial Intelligence (AAAI), 2023

Code Project Paper

Vision Generation I2I
[C20] Imaginary voice: Face-styled diffusion model for text-to-speech

Jiyoung Lee, Joon Son Chung, and Soo-Whan Chung

In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023

Code Project Paper

Multimodal Generation Audiovisual Speech
[P2] Panoramic Image-to-Image Translation

Soohyun Kim, Junho Kim, Taekyung Kim, Hwan Heo, Seungryong Kim^†, Jiyoung Lee^†, and Jin-Hwa Kim^†

arXiv preprint arXiv:2304.04960, 2023

Paper

Preprint Generation I2I
[P1] Semi-parametric video-grounded text generation

Sungdong Kim, Jin-Hwa Kim, Jiyoung Lee, and Minjoon Seo

arXiv preprint arXiv:2301.11507, 2023

Paper

Preprint Vision Video Captioning
[C19] Dense text-to-image generation with attention modulation

Yunji Kim, Jiyoung Lee, Jin-Hwa Kim, Jung-Woo Ha, and Jun-Yan Zhu

In IEEE/CVF International Conference on Computer Vision (ICCV), 2023

Code Demo Paper

Text-to-Image Multimodal Generation
[C18] Hierarchical visual primitive experts for compositional zero-shot learning

Hanjae Kim, Jiyoung Lee, Seongheon Park, and Kwanghoon Sohn^†

In IEEE/CVF International Conference on Computer Vision (ICCV), 2023

Code Paper

Vision Multimodal CZSL
[W4] Three recipes for better 3D pseudo-GTs of 3D human mesh estimation in the wild

Gyeongsik Moon, Hongsuk Choi, Sanghyuk Chun, Jiyoung Lee, and Sangdoo Yun

In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2023

Project Paper

Vision 3D Generation
[C17] Dual-path adaptation from image to video transformers

Jungin Park^*, Jiyoung Lee^*, and Kwanghoon Sohn

In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

Paper

Vision Video
[C16] Language-free training for zero-shot video grounding

Dahye Kim, Jungin Park, Jiyoung Lee, Seongheon Park, and Kwanghoon Sohn^†

In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023

Paper

Vision Video grounding Weakly-supervised

2022

[C15] PointFix: Learning to fix domain bias for robust online stereo adaptation

Kwonyoung Kim, Jungin Park, Jiyoung Lee, Dongbo Min, and Kwanghoon Sohn^†

In European Conference on Computer Vision (ECCV), 2022

Paper

Vision 3D
[C14] CausalCity: Complex simulations with agency for causal discovery and reasoning

Daniel McDuff, Yale Song, Jiyoung Lee, Vibhav Vineet, Sai Vemprala, Nicholas Alexander Gyde, Hadi Salman, Shuang Ma, Kwanghoon Sohn, and Ashish Kapoor

In Conference on Causal Learning and Reasoning, 2022

Project Paper
[C13] Mutual information divergence: A unified metric for multimodal generative models

Jin-Hwa Kim^†, Yunji Kim, Jiyoung Lee, Kang Min Yoo, and Sang-Woo Lee

In Advances in Neural Information Processing Systems (NeurIPS), 2022

Code Paper

Multimodal Generation
[C12] Multi-domain unsupervised image-to-image translation with appearance adaptive convolution

Somi Jeong, Jiyoung Lee, and Kwanghoon Sohn^†

In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022

Paper

Vision Generation
[C11] Pin the memory: Learning to generalize semantic segmentation

Jin Kim, Jiyoung Lee, Jungin Park, Dongbo Min^†, and Kwanghoon Sohn^†

In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

Code Paper

Vision
[C10] Probabilistic representations for video contrastive learning

Jungin Park, Jiyoung Lee, Ig-Jae Kim, and Kwanghoon Sohn^†

In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

Paper

Vision Video

2021

[C9] Wide and Narrow: Video Prediction from Context and Motion

Jaehoon Cho, Jiyoung Lee, Changjae Oh, Wonil Song, and Kwanghoon Sohn^†

In British Machine Vision Conference (BMVC), 2021

Paper

Vision Video prediction
[C8] Self-balanced learning for domain generalization

Jin Kim, Jiyoung Lee, Jungin Park, Dongbo Min, and Kwanghoon Sohn^†

In IEEE International Conference on Image Processing (ICIP), 2021

Paper

Vision Domain generalization
[C7] Bridge to answer: Structure-aware graph interaction network for video question answering

Jungin Park, Jiyoung Lee, and Kwanghoon Sohn^†

In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

Paper

Multimodal Video
[C6] Looking into your speech: Learning cross-modal affinity for audio-visual speech separation

Jiyoung Lee^*, Soo-Whan Chung^*, Sunok Kim, Hong-Goo Kang^†, and Kwanghoon Sohn^†

In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

Paper

Multimodal Audiovisual Speech

2020

[J1] Multi-modal recurrent attention networks for facial expression recognition

Jiyoung Lee, Sunok Kim, Seungryong Kim, and Kwanghoon Sohn^†

IEEE Transactions on Image Processing, 2020

Paper

Multimodal Affective Computing
[C5] SumGraph: Video summarization via recursive graph modeling

Jungin Park^*, Jiyoung Lee^*, Ig-Jae Kim, and Kwanghoon Sohn^†

In European Conference on Computer Vision (ECCV), 2020

Paper

Vision Video summarization Graph modeling

2019

[C4] Graph regularization network with semantic affinity for weakly-supervised temporal action localization

Jungin Park, Jiyoung Lee, Sangryul Jeon, Seungryong Kim, and Kwanghoon Sohn^†

In IEEE International Conference on Image Processing (ICIP), 2019

Paper

Vision Video Action localization Weakly-supervised
[W3] Video summarization by learning relationships between action and scene

Jungin Park, Jiyoung Lee, Sangryul Jeon, and Kwanghoon Sohn

In IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2019

Paper

Vision Video summarization

3rd place in the ICCV Challenge on Comprehensive Video Understanding in the Wild (CoVieW 2019)
[C3] Context-aware emotion recognition networks

Jiyoung Lee, Seungryong Kim, Sunok Kim, Jungin Park, and Kwanghoon Sohn^†

In IEEE/CVF International Conference on Computer Vision (ICCV), 2019

Project Paper

Vision Affective Computing

2018

[W2] Audio-visual attention networks for emotion recognition

Jiyoung Lee, Sunok Kim, Seungryong Kim, and Kwanghoon Sohn

In Proceedings of the 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, 2018

Multimodal Affective Computing
[W1] Learning to detect, associate, and recognize human actions and surrounding scenes in untrimmed videos

Jungin Park, Sangryul Jeon, Seungryong Kim, Jiyoung Lee, Sunok Kim, and Kwanghoon Sohn^†

In Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, 2018

Vision Video Action recognition
[C2] Spatiotemporal attention based deep neural networks for emotion recognition

Jiyoung Lee, Sunok Kim, Seungryong Kim, and Kwanghoon Sohn^†

In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018

Paper

Vision Affective Computing

2017

[C1] Automatic 2D-to-3D conversion using multi-scale deep neural network

Jiyoung Lee, Hyungjoo Jung, Youngjung Kim, and Kwanghoon Sohn^†

In IEEE International Conference on Image Processing (ICIP), 2017

Vision 3D Stereo