Conference: 26, Journal: 4, Workshop: 6, Preprint: 3
2025
-
[W6] Seeing What You Say: Expressive Image Generation from Speech
In IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Generative AI for Audio-Visual Content Creation, 2025
Multimodal Generation Audiovisual Speech
-
[J4] Language-guided Recursive Spatiotemporal Graph Modeling for Video Summarization
International Journal of Computer Vision (IJCV), 2025
Multimodal Video summarization Graph modeling
-
[P3] Descriptive Image-Text Matching with Graded Contextual Similarity
arXiv preprint arXiv:2505.09997, 2025
Preprint Multimodal Vision-language Matching
-
[C26] Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
Vision Video
-
[W5, C25] Read, Watch and Scream! Sound Generation from Text and Video
In AAAI Conference on Artificial Intelligence (AAAI), 2025
Multimodal Generation Audiovisual
2024
-
[J3] Prototype-Guided Attention Distillation for Discriminative Person Search
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Vision Person search Distillation
-
[J2] Discriminative action tubelet detector for weakly-supervised action detection
Pattern Recognition, 2024
Vision Video Action detection Weakly-supervised
-
[C24] Bridging Vision and Language Spaces with Assignment Prediction
In International Conference on Learning Representations (ICLR), 2024
Multimodal LLM Vision-language Multimodal Alignment
-
[C23] Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation
In International Conference on Learning Representations (ICLR), 2024
Multimodal 3D Generation
2023
-
[C22] Robust camera pose refinement for multi-resolution hash encoding
Hwan Heo, Taekyung Kim, Jiyoung Lee, Jaewon Lee, Soohyun Kim, Hyunwoo J Kim†, and Jin-Hwa Kim†
In International Conference on Machine Learning (ICML), 2023
Vision 3D Generation
-
[C21] MIDMs: Matching interleaved diffusion models for exemplar-based image translation
In AAAI Conference on Artificial Intelligence (AAAI), 2023
Vision Generation I2I
-
[C20] Imaginary voice: Face-styled diffusion model for text-to-speech
In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
Multimodal Generation Audiovisual Speech
-
[P2] Panoramic Image-to-Image Translation
arXiv preprint arXiv:2304.04960, 2023
Preprint Generation I2I
-
[P1] Semi-parametric video-grounded text generation
Sungdong Kim, Jin-Hwa Kim, Jiyoung Lee, and Minjoon Seo
arXiv preprint arXiv:2301.11507, 2023
Preprint Vision Video Captioning
-
[C19] Dense text-to-image generation with attention modulation
In IEEE/CVF International Conference on Computer Vision (ICCV), 2023
Text-to-Image Multimodal Generation
-
[C18] Hierarchical visual primitive experts for compositional zero-shot learning
In IEEE/CVF International Conference on Computer Vision (ICCV), 2023
Vision Multimodal CZSL
-
[W4] Three recipes for better 3D pseudo-GTs of 3D human mesh estimation in the wild
In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2023
Vision 3D Generation
-
[C17] Dual-path adaptation from image to video transformers
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
Vision Video
-
[C16] Language-free training for zero-shot video grounding
In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023
Vision Video grounding Weakly-supervised
2022
-
[C15] PointFix: Learning to fix domain bias for robust online stereo adaptation
In European Conference on Computer Vision (ECCV), 2022
Vision 3D
-
[C14] CausalCity: Complex simulations with agency for causal discovery and reasoning
Daniel McDuff, Yale Song, Jiyoung Lee, Vibhav Vineet, Sai Vemprala, Nicholas Alexander Gyde, Hadi Salman, Shuang Ma, Kwanghoon Sohn, and Ashish Kapoor
In Conference on Causal Learning and Reasoning, 2022
-
[C13] Mutual information divergence: A unified metric for multimodal generative models
In Advances in Neural Information Processing Systems (NeurIPS), 2022
Multimodal Generation
-
[C12] Multi-domain unsupervised image-to-image translation with appearance adaptive convolution
In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022
Vision Generation
-
[C11] Pin the memory: Learning to generalize semantic segmentation
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
Vision
-
[C10] Probabilistic representations for video contrastive learning
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
Vision Video
2021
-
[C9] Wide and Narrow: Video Prediction from Context and Motion
In British Machine Vision Conference (BMVC), 2021
Vision Video prediction
-
[C8] Self-balanced learning for domain generalization
In IEEE International Conference on Image Processing (ICIP), 2021
Vision Domain generalization
-
[C7] Bridge to answer: Structure-aware graph interaction network for video question answering
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
Multimodal Video
-
[C6] Looking into your speech: Learning cross-modal affinity for audio-visual speech separation
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
Multimodal Audiovisual Speech
2020
-
[J1] Multi-modal recurrent attention networks for facial expression recognition
IEEE Transactions on Image Processing, 2020
Multimodal Affective Computing
-
[C5] SumGraph: Video summarization via recursive graph modeling
In European Conference on Computer Vision (ECCV), 2020
Vision Video summarization Graph modeling
2019
-
[C4] Graph regularization network with semantic affinity for weakly-supervised temporal action localization
In IEEE International Conference on Image Processing (ICIP), 2019
Vision Video Action localization Weakly-supervised
-
[W3] Video summarization by learning relationships between action and scene
In IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2019
Vision Video summarization
3rd place in the ICCV Challenge on Comprehensive Video Understanding in the Wild (CoVieW 2019)
-
[C3] Context-aware emotion recognition networks
In IEEE/CVF International Conference on Computer Vision (ICCV), 2019
Vision Affective Computing
2018
-
[W2] Audio-visual attention networks for emotion recognition
In Proceedings of the 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, 2018
Multimodal Affective Computing
-
[W1] Learning to detect, associate, and recognize human actions and surrounding scenes in untrimmed videos
In Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, 2018
Vision Video Action recognition
-
[C2] Spatiotemporal attention based deep neural networks for emotion recognition
In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018
Vision Affective Computing
2017
-
[C1] Automatic 2D-to-3D conversion using multi-scale deep neural network
In IEEE International Conference on Image Processing (ICIP), 2017
Vision 3D Stereo