Publications | Multimodal AI Lab

In AI, unlike many other disciplines, papers published at top conferences (CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML, AAAI, and ICASSP) are generally regarded as more important and influential than those in most SCI journals. For example, CVPR ranks 2nd among publication venues across all academic fields according to Google Scholar Metrics.

Conference: 34, Journal: 4, Workshop: 6, Preprint: 3

2026

[C34] Wake up for Touch! Mask-isolated Tactile Alignment Learning in MLLMs

Yoonhyung Park^*, Minji Kim^*, Sungwon Moon, and Jiyoung Lee^†

In European Conference on Computer Vision (ECCV), 2026

Code Project Paper

Multimodal Tactile Alignment
[C33] Referee: Reference-aware Audiovisual Deepfake Detection

Hyemin Boo^*, Eunsang Lee^*, and Jiyoung Lee^†

In Conference of the International Speech Communication Association (Interspeech), 2026

Code Paper

Multimodal Audiovisual Deepfake
[C32] Cross-Lingual Compositional Learning for Code-Switched Lip Reading

Jeonghyeon Joo and Jiyoung Lee^†

In Conference of the International Speech Communication Association (Interspeech), 2026

Multimodal Audiovisual Speech Lip reading
[C31] Saliency-Aware Model Merging

Jungin Park, Jiyoung Lee^†, and Kwanghoon Sohn^†

In International Conference on Machine Learning (ICML), 2026

Paper

Multimodal Model Merging
[C30] V-LynX: Token Interface Alignment for Video+X LLMs

Jungin Park, Jiyoung Lee^†, and Kwanghoon Sohn^†

In International Conference on Machine Learning (ICML), 2026

Code Paper

Multimodal VideoLLM Alignment
[C29] Hear What Matters! Text-conditioned Selective Video-to-Audio Generation

Junwon Lee, Juhan Nam, and Jiyoung Lee^†

In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

Code Project Paper

Multimodal Generation Audiovisual Speech
[C28] Erasing Your Voice Before It’s Heard: Training-Free Speaker Unlearning For Zero-Shot Text-To-Speech

Myungjin Lee^*, Eunji Shin^*, and Jiyoung Lee^†

In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2026

Code Project Paper

Multimodal Audiovisual Unlearning Speech
[C27] Learning What To Hear: Boosting Sound-Source Association For Robust Audiovisual Instance Segmentation

Jinbae Seo, Hyeongjun Kwon, Kwonyoung Kim, Jiyoung Lee^†, and Kwanghoon Sohn^†

In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2026

Code Paper

Multimodal Audiovisual Instance Segmentation

2025

[W6] Seeing What You Say: Expressive Image Generation from Speech

Jiyoung Lee^†, Song Park, Sanghyuk Chun, and Soo-Whan Chung

In IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Generative AI for Audio-Visual Content Creation, 2025

Code Project Paper

Multimodal Generation Audiovisual Speech
[J4] Language-guided Recursive Spatiotemporal Graph Modeling for Video Summarization

Jungin Park, Jiyoung Lee^†, and Kwanghoon Sohn^†

International Journal of Computer Vision (IJCV), 2025

Paper

Multimodal Video summarization Graph modeling
[P3] Descriptive Image-Text Matching with Graded Contextual Similarity

Jinhyun Jang, Jiyoung Lee^†, and Kwanghoon Sohn^†

arXiv preprint arXiv:2505.09997, 2025

Paper

Preprint Multimodal Vision-language Matching
[C26] Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations

Jungin Park, Jiyoung Lee^†, and Kwanghoon Sohn^†

In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

Code Paper

Vision Video
[W5, C25] Read, watch and scream! sound generation from text and video

Yujin Jeong, Yunji Kim, Sanghyuk Chun, and Jiyoung Lee^†

In AAAI Conference on Artificial Intelligence (AAAI), 2025

Code Project Paper

Multimodal Generation Audiovisual

2024

[J3] Prototype-Guided Attention Distillation for Discriminative Person Search

Hanjae Kim, Jiyoung Lee, and Kwanghoon Sohn^†

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024

Paper

Vision Person search Distillation
[J2] Discriminative action tubelet detector for weakly-supervised action detection

Jiyoung Lee, Seungryong Kim, Sunok Kim, and Kwanghoon Sohn^†

Pattern Recognition, 2024

Paper

Vision Video Action detection Weakly-supervised
[C24] Bridging Vision and Language Spaces with Assignment Prediction

Jungin Park, Jiyoung Lee^†, and Kwanghoon Sohn^†

In International Conference on Learning Representations (ICLR), 2024

Code Paper

Multimodal LLM Vision-language Multimodal Alignment
[C23] Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation

Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Hyeonsu Kim, Jaehoon Ko, Junho Kim, Jin-Hwa Kim^†, Jiyoung Lee^†, and Seungryong Kim^†

In International Conference on Learning Representations (ICLR), 2024

Code Project Paper

Multimodal 3D Generation

2023

[C22] Robust camera pose refinement for multi-resolution hash encoding

Hwan Heo, Taekyung Kim, Jiyoung Lee, Jaewon Lee, Soohyun Kim, Hyunwoo J Kim^†, and Jin-Hwa Kim^†

In International Conference on Machine Learning (ICML), 2023

Paper

Vision 3D Generation
[C21] Midms: Matching interleaved diffusion models for exemplar-based image translation

Junyoung Seo, Gyuseong Lee, Seokju Cho, Jiyoung Lee, and Seungryong Kim^†

In AAAI Conference on Artificial Intelligence (AAAI), 2023

Code Project Paper

Vision Generation I2I
[C20] Imaginary voice: Face-styled diffusion model for text-to-speech

Jiyoung Lee, Joon Son Chung, and Soo-Whan Chung

In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023

Code Project Paper

Multimodal Generation Audiovisual Speech
[P2] Panoramic Image-to-Image Translation

Soohyun Kim, Junho Kim, Taekyung Kim, Hwan Heo, Seungryong Kim^†, Jiyoung Lee^†, and Jin-Hwa Kim^†

arXiv preprint arXiv:2304.04960, 2023

Paper

Preprint Generation I2I
[P1] Semi-parametric video-grounded text generation

Sungdong Kim, Jin-Hwa Kim, Jiyoung Lee, and Minjoon Seo

arXiv preprint arXiv:2301.11507, 2023

Paper

Preprint Vision Video Captioning
[C19] Dense text-to-image generation with attention modulation

Yunji Kim, Jiyoung Lee, Jin-Hwa Kim, Jung-Woo Ha, and Jun-Yan Zhu

In IEEE/CVF International Conference on Computer Vision (ICCV), 2023

Code Demo Paper

Text-to-Image Multimodal Generation
[C18] Hierarchical visual primitive experts for compositional zero-shot learning

Hanjae Kim, Jiyoung Lee, Seongheon Park, and Kwanghoon Sohn^†

In IEEE/CVF International Conference on Computer Vision (ICCV), 2023

Code Paper

Vision Multimodal CZSL
[W4] Three recipes for better 3d pseudo-gts of 3d human mesh estimation in the wild

Gyeongsik Moon, Hongsuk Choi, Sanghyuk Chun, Jiyoung Lee, and Sangdoo Yun

In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2023

Project Paper

Vision 3D Generation
[C17] Dual-path adaptation from image to video transformers

Jungin Park^*, Jiyoung Lee^*, and Kwanghoon Sohn

In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

Paper

Vision Video
[C16] Language-free training for zero-shot video grounding

Dahye Kim, Jungin Park, Jiyoung Lee, Seongheon Park, and Kwanghoon Sohn^†

In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023

Paper

Vision Video grounding Weakly-supervised

2022

[C15] Pointfix: Learning to fix domain bias for robust online stereo adaptation

Kwonyoung Kim, Jungin Park, Jiyoung Lee, Dongbo Min, and Kwanghoon Sohn^†

In European Conference on Computer Vision (ECCV), 2022

Paper

Vision 3D
[C14] Causalcity: Complex simulations with agency for causal discovery and reasoning

Daniel McDuff, Yale Song, Jiyoung Lee, Vibhav Vineet, Sai Vemprala, Nicholas Alexander Gyde, Hadi Salman, Shuang Ma, Kwanghoon Sohn, and Ashish Kapoor

In Conference on Causal Learning and Reasoning, 2022

Project Paper
[C13] Mutual information divergence: A unified metric for multimodal generative models

Jin-Hwa Kim^†, Yunji Kim, Jiyoung Lee, Kang Min Yoo, and Sang-Woo Lee

In Advances in Neural Information Processing Systems (NeurIPS), 2022

Code Paper

Multimodal Generation
[C12] Multi-domain unsupervised image-to-image translation with appearance adaptive convolution

Somi Jeong, Jiyoung Lee, and Kwanghoon Sohn^†

In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022

Paper

Vision Generation
[C11] Pin the memory: Learning to generalize semantic segmentation

Jin Kim, Jiyoung Lee, Jungin Park, Dongbo Min^†, and Kwanghoon Sohn^†

In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

Code Paper

Vision
[C10] Probabilistic representations for video contrastive learning

Jungin Park, Jiyoung Lee, Ig-Jae Kim, and Kwanghoon Sohn^†

In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

Paper

Vision Video

2021

[C9] Wide and Narrow: Video Prediction from Context and Motion

Jaehoon Cho, Jiyoung Lee, Changjae Oh, Wonil Song, and Kwanghoon Sohn^†

In British Machine Vision Conference (BMVC), 2021

Paper

Vision Video prediction
[C8] Self-balanced learning for domain generalization

Jin Kim, Jiyoung Lee, Jungin Park, Dongbo Min, and Kwanghoon Sohn^†

In IEEE International Conference on Image Processing (ICIP), 2021

Paper

Vision Domain generalization
[C7] Bridge to answer: Structure-aware graph interaction network for video question answering

Jungin Park, Jiyoung Lee, and Kwanghoon Sohn^†

In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

Paper

Multimodal Video
[C6] Looking into your speech: Learning cross-modal affinity for audio-visual speech separation

Jiyoung Lee^*, Soo-Whan Chung^*, Sunok Kim, Hong-Goo Kang^†, and Kwanghoon Sohn^†

In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

Paper

Multimodal Audiovisual Speech

2020

[J1] Multi-modal recurrent attention networks for facial expression recognition

Jiyoung Lee, Sunok Kim, Seungryong Kim, and Kwanghoon Sohn^†

IEEE Transactions on Image Processing, 2020

Paper

Multimodal Affective Computing
[C5] Sumgraph: Video summarization via recursive graph modeling

Jungin Park^*, Jiyoung Lee^*, Ig-Jae Kim, and Kwanghoon Sohn^†

In European Conference on Computer Vision (ECCV), 2020

Paper

Vision Video summarization Graph modeling

2019

[C4] Graph regularization network with semantic affinity for weakly-supervised temporal action localization

Jungin Park, Jiyoung Lee, Sangryul Jeon, Seungryong Kim, and Kwanghoon Sohn^†

In IEEE International Conference on Image Processing (ICIP), 2019

Paper

Vision Video Action localization Weakly-supervised
[W3] Video summarization by learning relationships between action and scene

Jungin Park, Jiyoung Lee, Sangryul Jeon, and Kwanghoon Sohn

In IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2019

3rd place Paper

Vision Video summarization

3rd place in the ICCV Challenge on Comprehensive Video Understanding in the Wild (CoVieW 2019)
[C3] Context-aware emotion recognition networks

Jiyoung Lee, Seungryong Kim, Sunok Kim, Jungin Park, and Kwanghoon Sohn^†

In IEEE/CVF International Conference on Computer Vision (ICCV), 2019

Project Paper

Vision Affective Computing

2018

[W2] Audio-visual attention networks for emotion recognition

Jiyoung Lee, Sunok Kim, Seungryong Kim, and Kwanghoon Sohn

In Proceedings of the 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, 2018

Multimodal Affective Computing
[W1] Learning to detect, associate, and recognize human actions and surrounding scenes in untrimmed videos

Jungin Park, Sangryul Jeon, Seungryong Kim, Jiyoung Lee, Sunok Kim, and Kwanghoon Sohn^†

In Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, 2018

Vision Video Action recognition
[C2] Spatiotemporal attention based deep neural networks for emotion recognition

Jiyoung Lee, Sunok Kim, Seungryong Kim, and Kwanghoon Sohn^†

In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018

Paper

Vision Affective Computing

2017

[C1] Automatic 2d-to-3d conversion using multi-scale deep neural network

Jiyoung Lee, Hyungjoo Jung, Youngjung Kim, and Kwanghoon Sohn^†

In IEEE International Conference on Image Processing (ICIP), 2017

Vision 3D Stereo