Publications

$^\star$ equal contribution, $^\dagger$ corresponding author(s)

Conference: 28, Journal: 4, Workshop: 6, Preprint: 5

2026

  1. [C28] Erasing Your Voice Before It’s Heard: Training-Free Speaker Unlearning For Zero-Shot Text-To-Speech
    Myungjin Lee, Eunji Shin, and Jiyoung Lee
    In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2026
    Multimodal Audiovisual Unlearning Speech
  2. [C27] Learning What To Hear: Boosting Sound-Source Association For Robust Audiovisual Instance Segmentation
    Jinbae Seo, Hyeongjun Kwon, Kwonyoung Kim, Jiyoung Lee, and Kwanghoon Sohn
    In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2026
    Multimodal Audiovisual Instance Segmentation

2025

  1. [P5] Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
    Junwon Lee, Juhan Nam, and Jiyoung Lee
    arXiv preprint arXiv:2512.02650, 2025
    Multimodal Generation Audiovisual
  2. [P4] Referee: Reference-aware Audiovisual Deepfake Detection
    Hyemin Boo, Eunsang Lee, and Jiyoung Lee
    arXiv preprint arXiv:2510.27475, 2025
    Multimodal Audiovisual Deepfake
  3. [W6] Seeing What You Say: Expressive Image Generation from Speech
    Jiyoung Lee, Song Park, Sanghyuk Chun, and Soo-Whan Chung
    In IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Generative AI for Audio-Visual Content Creation, 2025
    Multimodal Generation Audiovisual Speech
  4. [J4] Language-guided Recursive Spatiotemporal Graph Modeling for Video Summarization
    Jungin Park, Jiyoung Lee, and Kwanghoon Sohn
    International Journal of Computer Vision (IJCV), 2025
    Multimodal Video summarization Graph modeling
  5. [P3] Descriptive Image-Text Matching with Graded Contextual Similarity
    Jinhyun Jang, Jiyoung Lee, and Kwanghoon Sohn
    arXiv preprint arXiv:2505.09997, 2025
    Preprint Multimodal Vision-language Matching
  6. [C26] Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations
    Jungin Park, Jiyoung Lee, and Kwanghoon Sohn
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
    Vision Video
  7. [W5, C25] Read, Watch and Scream! Sound Generation from Text and Video
    Yujin Jeong, Yunji Kim, Sanghyuk Chun, and Jiyoung Lee
    In AAAI Conference on Artificial Intelligence (AAAI), 2025
    Multimodal Generation Audiovisual

2024

  1. [J3] Prototype-Guided Attention Distillation for Discriminative Person Search
    Hanjae Kim, Jiyoung Lee, and Kwanghoon Sohn
    IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
    Vision Person search Distillation
  2. [J2] Discriminative Action Tubelet Detector for Weakly-supervised Action Detection
    Jiyoung Lee, Seungryong Kim, Sunok Kim, and Kwanghoon Sohn
    Pattern Recognition, 2024
    Vision Video Action detection Weakly-supervised
  3. [C24] Bridging Vision and Language Spaces with Assignment Prediction
    Jungin Park, Jiyoung Lee, and Kwanghoon Sohn
    In International Conference on Learning Representations (ICLR), 2024
    Multimodal LLM Vision-language Multimodal Alignment
  4. [C23] Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation
    Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Hyeonsu Kim, Jaehoon Ko, Junho Kim, Jin-Hwa Kim, Jiyoung Lee, and Seungryong Kim
    In International Conference on Learning Representations (ICLR), 2024
    Multimodal 3D Generation

2023

  1. [C22] Robust Camera Pose Refinement for Multi-Resolution Hash Encoding
    Hwan Heo, Taekyung Kim, Jiyoung Lee, Jaewon Lee, Soohyun Kim, Hyunwoo J. Kim, and Jin-Hwa Kim
    In International Conference on Machine Learning (ICML), 2023
    Vision 3D Generation
  2. [C21] MIDMs: Matching Interleaved Diffusion Models for Exemplar-based Image Translation
    Junyoung Seo, Gyuseong Lee, Seokju Cho, Jiyoung Lee, and Seungryong Kim
    In AAAI Conference on Artificial Intelligence (AAAI), 2023
    Vision Generation I2I
  3. [C20] Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech
    Jiyoung Lee, Joon Son Chung, and Soo-Whan Chung
    In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
    Multimodal Generation Audiovisual Speech
  4. [P2] Panoramic Image-to-Image Translation
    Soohyun Kim, Junho Kim, Taekyung Kim, Hwan Heo, Seungryong Kim, Jiyoung Lee, and Jin-Hwa Kim
    arXiv preprint arXiv:2304.04960, 2023
    Preprint Generation I2I
  5. [P1] Semi-parametric Video-grounded Text Generation
    Sungdong Kim, Jin-Hwa Kim, Jiyoung Lee, and Minjoon Seo
    arXiv preprint arXiv:2301.11507, 2023
    Preprint Vision Video Captioning
  6. [C19] Dense Text-to-Image Generation with Attention Modulation
    Yunji Kim, Jiyoung Lee, Jin-Hwa Kim, Jung-Woo Ha, and Jun-Yan Zhu
    In IEEE/CVF International Conference on Computer Vision (ICCV), 2023
    Text-to-Image Multimodal Generation
  7. [C18] Hierarchical Visual Primitive Experts for Compositional Zero-shot Learning
    Hanjae Kim, Jiyoung Lee, Seongheon Park, and Kwanghoon Sohn
    In IEEE/CVF International Conference on Computer Vision (ICCV), 2023
    Vision Multimodal CZSL
  8. [W4] Three Recipes for Better 3D Pseudo-GTs of 3D Human Mesh Estimation in the Wild
    Gyeongsik Moon, Hongsuk Choi, Sanghyuk Chun, Jiyoung Lee, and Sangdoo Yun
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2023
    Vision 3D Generation
  9. [C17] Dual-path Adaptation from Image to Video Transformers
    Jungin Park*, Jiyoung Lee*, and Kwanghoon Sohn
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
    Vision Video
  10. [C16] Language-free Training for Zero-shot Video Grounding
    Dahye Kim, Jungin Park, Jiyoung Lee, Seongheon Park, and Kwanghoon Sohn
    In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023
    Vision Video grounding Weakly-supervised

2022

  1. [C15] PointFix: Learning to Fix Domain Bias for Robust Online Stereo Adaptation
    Kwonyoung Kim, Jungin Park, Jiyoung Lee, Dongbo Min, and Kwanghoon Sohn
    In European Conference on Computer Vision (ECCV), 2022
    Vision 3D
  2. [C14] CausalCity: Complex Simulations with Agency for Causal Discovery and Reasoning
    Daniel McDuff, Yale Song, Jiyoung Lee, Vibhav Vineet, Sai Vemprala, Nicholas Alexander Gyde, Hadi Salman, Shuang Ma, Kwanghoon Sohn, and Ashish Kapoor
    In Conference on Causal Learning and Reasoning (CLeaR), 2022
  3. [C13] Mutual Information Divergence: A Unified Metric for Multimodal Generative Models
    Jin-Hwa Kim, Yunji Kim, Jiyoung Lee, Kang Min Yoo, and Sang-Woo Lee
    In Advances in Neural Information Processing Systems (NeurIPS), 2022
    Multimodal Generation
  4. [C12] Multi-domain Unsupervised Image-to-Image Translation with Appearance Adaptive Convolution
    Somi Jeong, Jiyoung Lee, and Kwanghoon Sohn
    In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022
    Vision Generation
  5. [C11] Pin the Memory: Learning to Generalize Semantic Segmentation
    Jin Kim, Jiyoung Lee, Jungin Park, Dongbo Min, and Kwanghoon Sohn
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
    Vision
  6. [C10] Probabilistic Representations for Video Contrastive Learning
    Jungin Park, Jiyoung Lee, Ig-Jae Kim, and Kwanghoon Sohn
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
    Vision Video

2021

  1. [C9] Wide and Narrow: Video Prediction from Context and Motion
    Jaehoon Cho, Jiyoung Lee, Changjae Oh, Wonil Song, and Kwanghoon Sohn
    In British Machine Vision Conference (BMVC), 2021
    Vision Video prediction
  2. [C8] Self-balanced Learning for Domain Generalization
    Jin Kim, Jiyoung Lee, Jungin Park, Dongbo Min, and Kwanghoon Sohn
    In IEEE International Conference on Image Processing (ICIP), 2021
    Vision Domain generalization
  3. [C7] Bridge to Answer: Structure-aware Graph Interaction Network for Video Question Answering
    Jungin Park, Jiyoung Lee, and Kwanghoon Sohn
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
    Multimodal Video
  4. [C6] Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation
    Jiyoung Lee*, Soo-Whan Chung*, Sunok Kim, Hong-Goo Kang, and Kwanghoon Sohn
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
    Multimodal Audiovisual Speech

2020

  1. [J1] Multi-modal recurrent attention networks for facial expression recognition
    Jiyoung Lee, Sunok Kim, Seungryong Kim, and Kwanghoon Sohn
    IEEE Transactions on Image Processing, 2020
    Multimodal Affective Computing
  2. [C5] SumGraph: Video Summarization via Recursive Graph Modeling
    Jungin Park*, Jiyoung Lee*, Ig-Jae Kim, and Kwanghoon Sohn
    In European Conference on Computer Vision (ECCV), 2020
    Vision Video summarization Graph modeling

2019

  1. [C4] Graph Regularization Network with Semantic Affinity for Weakly-supervised Temporal Action Localization
    Jungin Park, Jiyoung Lee, Sangryul Jeon, Seungryong Kim, and Kwanghoon Sohn
    In IEEE International Conference on Image Processing (ICIP), 2019
    Vision Video Action localization Weakly-supervised
  2. [W3] Video Summarization by Learning Relationships between Action and Scene
    Jungin Park, Jiyoung Lee, Sangryul Jeon, and Kwanghoon Sohn
    In IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2019
    Vision Video summarization
  3. [C3] Context-aware Emotion Recognition Networks
    Jiyoung Lee, Seungryong Kim, Sunok Kim, Jungin Park, and Kwanghoon Sohn
    In IEEE/CVF International Conference on Computer Vision (ICCV), 2019
    Vision Affective Computing

2018

  1. [W2] Audio-visual Attention Networks for Emotion Recognition
    Jiyoung Lee, Sunok Kim, Seungryong Kim, and Kwanghoon Sohn
    In Proceedings of the 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, 2018
    Multimodal Affective Computing
  2. [W1] Learning to Detect, Associate, and Recognize Human Actions and Surrounding Scenes in Untrimmed Videos
    Jungin Park, Sangryul Jeon, Seungryong Kim, Jiyoung Lee, Sunok Kim, and Kwanghoon Sohn
    In Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, 2018
    Vision Video Action recognition
  3. [C2] Spatiotemporal Attention Based Deep Neural Networks for Emotion Recognition
    Jiyoung Lee, Sunok Kim, Seungryong Kim, and Kwanghoon Sohn
    In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018
    Vision Affective Computing

2017

  1. [C1] Automatic 2D-to-3D Conversion Using Multi-Scale Deep Neural Network
    Jiyoung Lee, Hyungjoo Jung, Youngjung Kim, and Kwanghoon Sohn
    In IEEE International Conference on Image Processing (ICIP), 2017
    Vision 3D Stereo