Publications

$^\star$ equal contribution, $^\dagger$ corresponding author(s)

Conference: 26, Journal: 4, Workshop: 6, Preprint: 3

2025

  1. [W6] Seeing What You Say: Expressive Image Generation from Speech
    In IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Generative AI for Audio-Visual Content Creation, 2025
    Multimodal Generation Audiovisual Speech
  2. [J4] Language-guided Recursive Spatiotemporal Graph Modeling for Video Summarization
    International Journal of Computer Vision (IJCV), 2025
    Multimodal Video summarization Graph modeling
  3. [P3] Descriptive Image-Text Matching with Graded Contextual Similarity
    Jinhyun Jang, Jiyoung Lee, and Kwanghoon Sohn
    arXiv preprint arXiv:2505.09997, 2025
    Preprint Multimodal Vision-language Matching
  4. [C26] Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
    Vision Video
  5. [W5, C25] Read, Watch and Scream! Sound Generation from Text and Video
    In AAAI Conference on Artificial Intelligence (AAAI), 2025
    Multimodal Generation Audiovisual

2024

  1. [J3] Prototype-Guided Attention Distillation for Discriminative Person Search
    Hanjae Kim, Jiyoung Lee, and Kwanghoon Sohn
    IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
    Vision Person search Distillation
  2. [J2] Discriminative action tubelet detector for weakly-supervised action detection
    Pattern Recognition, 2024
    Vision Video Action detection Weakly-supervised
  3. [C24] Bridging Vision and Language Spaces with Assignment Prediction
    In International Conference on Learning Representations (ICLR), 2024
    Multimodal LLM Vision-language Multimodal Alignment
  4. [C23] Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation
    Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Hyeonsu Kim, Jaehoon Ko, Junho Kim, Jin-Hwa Kim, Jiyoung Lee, and Seungryong Kim
    In International Conference on Learning Representations (ICLR), 2024
    Multimodal 3D Generation

2023

  1. [C22] Robust camera pose refinement for multi-resolution hash encoding
    Hwan Heo, Taekyung Kim, Jiyoung Lee, Jaewon Lee, Soohyun Kim, Hyunwoo J Kim, and Jin-Hwa Kim
    In International Conference on Machine Learning (ICML), 2023
    Vision 3D Generation
  2. [C21] MIDMs: Matching interleaved diffusion models for exemplar-based image translation
    Junyoung Seo, Gyuseong Lee, Seokju Cho, Jiyoung Lee, and Seungryong Kim
    In AAAI Conference on Artificial Intelligence (AAAI), 2023
    Vision Generation I2I
  3. [C20] Imaginary voice: Face-styled diffusion model for text-to-speech
    Jiyoung Lee, Joon Son Chung, and Soo-Whan Chung
    In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
    Multimodal Generation Audiovisual Speech
  4. [P2] Panoramic Image-to-Image Translation
    Soohyun Kim, Junho Kim, Taekyung Kim, Hwan Heo, Seungryong Kim, Jiyoung Lee, and Jin-Hwa Kim
    arXiv preprint arXiv:2304.04960, 2023
    Preprint Generation I2I
  5. [P1] Semi-parametric video-grounded text generation
    Sungdong Kim, Jin-Hwa Kim, Jiyoung Lee, and Minjoon Seo
    arXiv preprint arXiv:2301.11507, 2023
    Preprint Vision Video Captioning
  6. [C19] Dense text-to-image generation with attention modulation
    Yunji Kim, Jiyoung Lee, Jin-Hwa Kim, Jung-Woo Ha, and Jun-Yan Zhu
    In IEEE/CVF International Conference on Computer Vision (ICCV), 2023
    Text-to-Image Multimodal Generation
  7. [C18] Hierarchical visual primitive experts for compositional zero-shot learning
    Hanjae Kim, Jiyoung Lee, Seongheon Park, and Kwanghoon Sohn
    In IEEE/CVF International Conference on Computer Vision (ICCV), 2023
    Vision Multimodal CZSL
  8. [W4] Three recipes for better 3D pseudo-GTs of 3D human mesh estimation in the wild
    Gyeongsik Moon, Hongsuk Choi, Sanghyuk Chun, Jiyoung Lee, and Sangdoo Yun
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2023
    Vision 3D Generation
  9. [C17] Dual-path adaptation from image to video transformers
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
    Vision Video
  10. [C16] Language-free training for zero-shot video grounding
    Dahye Kim, Jungin Park, Jiyoung Lee, Seongheon Park, and Kwanghoon Sohn
    In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023
    Vision Video grounding Weakly-supervised

2022

  1. [C15] PointFix: Learning to fix domain bias for robust online stereo adaptation
    In European Conference on Computer Vision (ECCV), 2022
    Vision 3D
  2. [C14] CausalCity: Complex simulations with agency for causal discovery and reasoning
    Daniel McDuff, Yale Song, Jiyoung Lee, Vibhav Vineet, Sai Vemprala, Nicholas Alexander Gyde, Hadi Salman, Shuang Ma, Kwanghoon Sohn, and Ashish Kapoor
    In Conference on Causal Learning and Reasoning, 2022
  3. [C13] Mutual information divergence: A unified metric for multimodal generative models
    Jin-Hwa Kim, Yunji Kim, Jiyoung Lee, Kang Min Yoo, and Sang-Woo Lee
    In Advances in Neural Information Processing Systems (NeurIPS), 2022
    Multimodal Generation
  4. [C12] Multi-domain unsupervised image-to-image translation with appearance adaptive convolution
    Somi Jeong, Jiyoung Lee, and Kwanghoon Sohn
    In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022
    Vision Generation
  5. [C11] Pin the memory: Learning to generalize semantic segmentation
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
    Vision
  6. [C10] Probabilistic representations for video contrastive learning
    Jungin Park, Jiyoung Lee, Ig-Jae Kim, and Kwanghoon Sohn
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
    Vision Video

2021

  1. [C9] Wide and Narrow: Video Prediction from Context and Motion
    Jaehoon Cho, Jiyoung Lee, Changjae Oh, Wonil Song, and Kwanghoon Sohn
    In British Machine Vision Conference (BMVC), 2021
    Vision Video prediction
  2. [C8] Self-balanced learning for domain generalization
    In IEEE International Conference on Image Processing (ICIP), 2021
    Vision Domain generalization
  3. [C7] Bridge to answer: Structure-aware graph interaction network for video question answering
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
    Multimodal Video
  4. [C6] Looking into your speech: Learning cross-modal affinity for audio-visual speech separation
    Jiyoung Lee$^\star$, Soo-Whan Chung$^\star$, Sunok Kim, Hong-Goo Kang, and Kwanghoon Sohn
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
    Multimodal Audiovisual Speech

2020

  1. [J1] Multi-modal recurrent attention networks for facial expression recognition
    IEEE Transactions on Image Processing, 2020
    Multimodal Affective Computing
  2. [C5] SumGraph: Video summarization via recursive graph modeling
    In European Conference on Computer Vision (ECCV), 2020
    Vision Video summarization Graph modeling

2019

  1. [C4] Graph regularization network with semantic affinity for weakly-supervised temporal action localization
    In IEEE International Conference on Image Processing (ICIP), 2019
    Vision Video Action localization Weakly-supervised
  2. [W3] Video summarization by learning relationships between action and scene
    Jungin Park, Jiyoung Lee, Sangryul Jeon, and Kwanghoon Sohn
    In IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2019
    Vision Video summarization
  3. [C3] Context-aware emotion recognition networks
    In IEEE/CVF International Conference on Computer Vision (ICCV), 2019
    Vision Affective Computing

2018

  1. [W2] Audio-visual attention networks for emotion recognition
    In Proceedings of the 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, 2018
    Multimodal Affective Computing
  2. [W1] Learning to detect, associate, and recognize human actions and surrounding scenes in untrimmed videos
    In Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, 2018
    Vision Video Action recognition
  3. [C2] Spatiotemporal attention based deep neural networks for emotion recognition
    In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018
    Vision Affective Computing

2017

  1. [C1] Automatic 2D-to-3D conversion using multi-scale deep neural network
    Jiyoung Lee, Hyungjoo Jung, Youngjung Kim, and Kwanghoon Sohn
    In IEEE International Conference on Image Processing (ICIP), 2017
    Vision 3D Stereo