Publications

$^\star$ equal contribution, $^\dagger$ corresponding author(s)

In AI, unlike many other disciplines, papers published at top conferences (CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML, AAAI, and ICASSP) are generally regarded as more important and influential than those in most SCI journals. For example, CVPR ranks 2nd among publication venues across all academic fields according to Google Scholar Metrics.

Conference: 31, Journal: 4, Workshop: 6, Preprint: 4

2026

  1. [C31] Saliency-Aware Model Merging
    Jungin Park, Jiyoung Lee, and Kwanghoon Sohn
    In International Conference on Machine Learning (ICML), 2026
  2. [C30] LynX: Token Interface Alignment for Video+X LLMs
    Jungin Park, Jiyoung Lee, and Kwanghoon Sohn
    In International Conference on Machine Learning (ICML), 2026
  3. [C29] Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
    Junwon Lee, Juhan Nam, and Jiyoung Lee
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026
    Multimodal Generation Audiovisual Speech
  4. [C28] Erasing Your Voice Before It’s Heard: Training-Free Speaker Unlearning For Zero-Shot Text-To-Speech
    Myungjin Lee*, Eunji Shin*, and Jiyoung Lee
    In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2026
    Multimodal Audiovisual Unlearning Speech
  5. [C27] Learning What To Hear: Boosting Sound-Source Association For Robust Audiovisual Instance Segmentation
    Jinbae Seo, Hyeongjun Kwon, Kwonyoung Kim, Jiyoung Lee, and Kwanghoon Sohn
    In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2026
    Multimodal Audiovisual Instance Segmentation

2025

  1. [P4] Referee: Reference-aware Audiovisual Deepfake Detection
    Hyemin Boo, Eunsang Lee, and Jiyoung Lee
    arXiv preprint arXiv:2510.27475, 2025
    Multimodal Audiovisual Deepfake
  2. [W6] Seeing What You Say: Expressive Image Generation from Speech
    Jiyoung Lee, Song Park, Sanghyuk Chun, and Soo-Whan Chung
    In IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Generative AI for Audio-Visual Content Creation, 2025
    Multimodal Generation Audiovisual Speech
  3. [J4] Language-guided Recursive Spatiotemporal Graph Modeling for Video Summarization
    Jungin Park, Jiyoung Lee, and Kwanghoon Sohn
    International Journal of Computer Vision (IJCV), 2025
    Multimodal Video summarization Graph modeling
  4. [P3] Descriptive Image-Text Matching with Graded Contextual Similarity
    Jinhyun Jang, Jiyoung Lee, and Kwanghoon Sohn
    arXiv preprint arXiv:2505.09997, 2025
    Preprint Multimodal Vision-language Matching
  5. [C26] Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations
    Jungin Park, Jiyoung Lee, and Kwanghoon Sohn
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
    Vision Video
  6. [W5, C25] Read, watch and scream! sound generation from text and video
    Yujin Jeong, Yunji Kim, Sanghyuk Chun, and Jiyoung Lee
    In AAAI Conference on Artificial Intelligence (AAAI), 2025
    Multimodal Generation Audiovisual

2024

  1. [J3] Prototype-Guided Attention Distillation for Discriminative Person Search
    Hanjae Kim, Jiyoung Lee, and Kwanghoon Sohn
    IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
    Vision Person search Distillation
  2. [J2] Discriminative action tubelet detector for weakly-supervised action detection
    Jiyoung Lee, Seungryong Kim, Sunok Kim, and Kwanghoon Sohn
    Pattern Recognition, 2024
    Vision Video Action detection Weakly-supervised
  3. [C24] Bridging Vision and Language Spaces with Assignment Prediction
    Jungin Park, Jiyoung Lee, and Kwanghoon Sohn
    In International Conference on Learning Representations (ICLR), 2024
    Multimodal LLM Vision-language Multimodal Alignment
  4. [C23] Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation
    Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Hyeonsu Kim, Jaehoon Ko, Junho Kim, Jin-Hwa Kim, Jiyoung Lee, and Seungryong Kim
    In International Conference on Learning Representations (ICLR), 2024
    Multimodal 3D Generation

2023

  1. [C22] Robust camera pose refinement for multi-resolution hash encoding
    Hwan Heo, Taekyung Kim, Jiyoung Lee, Jaewon Lee, Soohyun Kim, Hyunwoo J Kim, and Jin-Hwa Kim
    In International Conference on Machine Learning (ICML), 2023
    Vision 3D Generation
  2. [C21] MIDMs: Matching interleaved diffusion models for exemplar-based image translation
    Junyoung Seo, Gyuseong Lee, Seokju Cho, Jiyoung Lee, and Seungryong Kim
    In AAAI Conference on Artificial Intelligence (AAAI), 2023
    Vision Generation I2I
  3. [C20] Imaginary voice: Face-styled diffusion model for text-to-speech
    Jiyoung Lee, Joon Son Chung, and Soo-Whan Chung
    In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
    Multimodal Generation Audiovisual Speech
  4. [P2] Panoramic Image-to-Image Translation
    Soohyun Kim, Junho Kim, Taekyung Kim, Hwan Heo, Seungryong Kim, Jiyoung Lee, and Jin-Hwa Kim
    arXiv preprint arXiv:2304.04960, 2023
    Preprint Generation I2I
  5. [P1] Semi-parametric video-grounded text generation
    Sungdong Kim, Jin-Hwa Kim, Jiyoung Lee, and Minjoon Seo
    arXiv preprint arXiv:2301.11507, 2023
    Preprint Vision Video Captioning
  6. [C19] Dense text-to-image generation with attention modulation
    Yunji Kim, Jiyoung Lee, Jin-Hwa Kim, Jung-Woo Ha, and Jun-Yan Zhu
    In IEEE/CVF International Conference on Computer Vision (ICCV), 2023
    Text-to-Image Multimodal Generation
  7. [C18] Hierarchical visual primitive experts for compositional zero-shot learning
    Hanjae Kim, Jiyoung Lee, Seongheon Park, and Kwanghoon Sohn
    In IEEE/CVF International Conference on Computer Vision (ICCV), 2023
    Vision Multimodal CZSL
  8. [W4] Three recipes for better 3D pseudo-GTs of 3D human mesh estimation in the wild
    Gyeongsik Moon, Hongsuk Choi, Sanghyuk Chun, Jiyoung Lee, and Sangdoo Yun
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2023
    Vision 3D Generation
  9. [C17] Dual-path adaptation from image to video transformers
    Jungin Park*, Jiyoung Lee*, and Kwanghoon Sohn
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
    Vision Video
  10. [C16] Language-free training for zero-shot video grounding
    Dahye Kim, Jungin Park, Jiyoung Lee, Seongheon Park, and Kwanghoon Sohn
    In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023
    Vision Video grounding Weakly-supervised

2022

  1. [C15] PointFix: Learning to fix domain bias for robust online stereo adaptation
    Kwonyoung Kim, Jungin Park, Jiyoung Lee, Dongbo Min, and Kwanghoon Sohn
    In European Conference on Computer Vision (ECCV), 2022
    Vision 3D
  2. [C14] CausalCity: Complex simulations with agency for causal discovery and reasoning
    Daniel McDuff, Yale Song, Jiyoung Lee, Vibhav Vineet, Sai Vemprala, Nicholas Alexander Gyde, Hadi Salman, Shuang Ma, Kwanghoon Sohn, and Ashish Kapoor
    In Conference on Causal Learning and Reasoning, 2022
  3. [C13] Mutual information divergence: A unified metric for multimodal generative models
    Jin-Hwa Kim, Yunji Kim, Jiyoung Lee, Kang Min Yoo, and Sang-Woo Lee
    In Advances in Neural Information Processing Systems (NeurIPS), 2022
    Multimodal Generation
  4. [C12] Multi-domain unsupervised image-to-image translation with appearance adaptive convolution
    Somi Jeong, Jiyoung Lee, and Kwanghoon Sohn
    In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022
    Vision Generation
  5. [C11] Pin the memory: Learning to generalize semantic segmentation
    Jin Kim, Jiyoung Lee, Jungin Park, Dongbo Min, and Kwanghoon Sohn
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
    Vision
  6. [C10] Probabilistic representations for video contrastive learning
    Jungin Park, Jiyoung Lee, Ig-Jae Kim, and Kwanghoon Sohn
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
    Vision Video

2021

  1. [C9] Wide and Narrow: Video Prediction from Context and Motion
    Jaehoon Cho, Jiyoung Lee, Changjae Oh, Wonil Song, and Kwanghoon Sohn
    In British Machine Vision Conference (BMVC), 2021
    Vision Video prediction
  2. [C8] Self-balanced learning for domain generalization
    Jin Kim, Jiyoung Lee, Jungin Park, Dongbo Min, and Kwanghoon Sohn
    In IEEE International Conference on Image Processing (ICIP), 2021
    Vision Domain generalization
  3. [C7] Bridge to answer: Structure-aware graph interaction network for video question answering
    Jungin Park, Jiyoung Lee, and Kwanghoon Sohn
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
    Multimodal Video
  4. [C6] Looking into your speech: Learning cross-modal affinity for audio-visual speech separation
    Jiyoung Lee*, Soo-Whan Chung*, Sunok Kim, Hong-Goo Kang, and Kwanghoon Sohn
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
    Multimodal Audiovisual Speech

2020

  1. [J1] Multi-modal recurrent attention networks for facial expression recognition
    Jiyoung Lee, Sunok Kim, Seungryong Kim, and Kwanghoon Sohn
    IEEE Transactions on Image Processing, 2020
    Multimodal Affective Computing
  2. [C5] SumGraph: Video summarization via recursive graph modeling
    Jungin Park*, Jiyoung Lee*, Ig-Jae Kim, and Kwanghoon Sohn
    In European Conference on Computer Vision (ECCV), 2020
    Vision Video summarization Graph modeling

2019

  1. [C4] Graph regularization network with semantic affinity for weakly-supervised temporal action localization
    Jungin Park, Jiyoung Lee, Sangryul Jeon, Seungryong Kim, and Kwanghoon Sohn
    In IEEE International Conference on Image Processing (ICIP), 2019
    Vision Video Action localization Weakly-supervised
  2. [W3] Video summarization by learning relationships between action and scene
    Jungin Park, Jiyoung Lee, Sangryul Jeon, and Kwanghoon Sohn
    In IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2019
    Vision Video summarization
  3. [C3] Context-aware emotion recognition networks
    Jiyoung Lee, Seungryong Kim, Sunok Kim, Jungin Park, and Kwanghoon Sohn
    In IEEE/CVF International Conference on Computer Vision (ICCV), 2019
    Vision Affective Computing

2018

  1. [W2] Audio-visual attention networks for emotion recognition
    Jiyoung Lee, Sunok Kim, Seungryong Kim, and Kwanghoon Sohn
    In Proceedings of the 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, 2018
    Multimodal Affective Computing
  2. [W1] Learning to detect, associate, and recognize human actions and surrounding scenes in untrimmed videos
    Jungin Park, Sangryul Jeon, Seungryong Kim, Jiyoung Lee, Sunok Kim, and Kwanghoon Sohn
    In Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, 2018
    Vision Video Action recognition
  3. [C2] Spatiotemporal attention based deep neural networks for emotion recognition
    Jiyoung Lee, Sunok Kim, Seungryong Kim, and Kwanghoon Sohn
    In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018
    Vision Affective Computing

2017

  1. [C1] Automatic 2D-to-3D conversion using multi-scale deep neural network
    Jiyoung Lee, Hyungjoo Jung, Youngjung Kim, and Kwanghoon Sohn
    In IEEE International Conference on Image Processing (ICIP), 2017
    Vision 3D Stereo