[Paper Review] BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (ICML, 2022)


킹남지 2025. 9. 19. 01:50

Paper: https://arxiv.org/abs/2201.12086


GitHub: https://github.com/salesforce/BLIP


Introduction

Research on vision-language tasks is increasingly active, and training models for these tasks requires large-scale datasets of image-text pairs.

Crawling the web yields data at that scale, but the resulting pairs are very noisy.

What we really want, then, is to pre-train on a web dataset that is both large and clean. To get there, the authors use a Captioner that generates captions for images and a Filter that removes noisy captions.

Method

Multimodal Mixture of Encoder-Decoder (MED)

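BLIP is pre-trained on three objectives.

1) ITC (Image-Text Contrastive): encodes the image and the text separately and aligns the feature space so that matched pairs are pulled together and mismatched pairs are pushed apart.

2) ITM (Image-Text Matching): binary classification of whether an image and a text are a correct match.

3) LM (Language Modeling): learns to generate a caption given an image.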
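
To make the ITC objective a bit more concrete, here is a minimal sketch of a symmetric contrastive loss in plain Python. This is not the code from the BLIP repository: the function names, the toy 2-D embeddings, and the fixed `temperature=0.07` are all my own assumptions for illustration, and the sketch leaves out the momentum encoder and soft labels that the paper adopts from ALBEF.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target_index):
    """Numerically stable -log(softmax(logits)[target_index])."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


def itc_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric contrastive loss; pair i is (image_embeds[i], text_embeds[i]).

    The temperature value is an assumption for this toy example.
    """
    images = [l2_normalize(v) for v in image_embeds]
    texts = [l2_normalize(v) for v in text_embeds]
    n = len(images)
    # sim[i][j]: temperature-scaled cosine similarity of image i and text j
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # image-to-text: each image should pick its own text among all texts
    i2t = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # text-to-image: each text should pick its own image among all images
    t2i = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (i2t + t2i) / 2


# Toy embeddings (hypothetical): each text is close to the image with the same index.
aligned_images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]
# The same texts assigned to the wrong images.
swapped_texts = [aligned_texts[1], aligned_texts[0]]

loss_aligned = itc_loss(aligned_images, aligned_texts)
loss_swapped = itc_loss(aligned_images, swapped_texts)
print(loss_aligned, loss_swapped)
```

The loss is small when each image is closest to its own text and large when the pairs are swapped, which is exactly the "pull matched pairs together, push mismatched pairs apart" behaviour described for ITC.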

CapFilt

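The MED is first pre-trained with the three objectives above on a large, noisy web dataset, and then fine-tuned on a small, high-quality dataset such as COCO.

After this fine-tuning, the text encoder (the ITC/ITM side of the MED) serves as the Filter, and the text decoder (the LM side) serves as the Captioner.

The Captioner and Filter are then used to build a new dataset with the noise removed. The MED is pre-trained again on this dataset and then used for downstream tasks.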
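
The CapFilt data flow can be sketched roughly as follows. This is only a toy illustration based on my reading of the paper: `capfilt`, `toy_captioner`, and `toy_filter` are hypothetical names, images are represented by plain strings, and the real Captioner and Filter are fine-tuned MED modules rather than string functions. The sketch assumes the Filter is applied to both the original web caption and the synthetic caption, and that human-annotated pairs are kept as-is.

```python
def capfilt(web_pairs, human_pairs, captioner, filter_fn):
    """Build a bootstrapped dataset from noisy web pairs and clean human pairs.

    captioner(image) -> synthetic caption
    filter_fn(image, text) -> True if the text is judged to match the image
    """
    bootstrapped = list(human_pairs)  # human-annotated pairs are kept unchanged
    for image, web_text in web_pairs:
        synthetic_text = captioner(image)
        # Both the original web text and the synthetic caption go through the filter.
        for text in (web_text, synthetic_text):
            if filter_fn(image, text):
                bootstrapped.append((image, text))
    return bootstrapped


# Hypothetical stand-ins: an "image" is just a string naming what it shows.
def toy_captioner(image):
    return f"a photo of a {image}"


def toy_filter(image, text):
    # Keep the caption only if it mentions the image content.
    return image in text.split()


web_pairs = [
    ("dog", "my dog at the park"),        # usable web caption
    ("cat", "click here for discounts"),  # noisy web caption
]
human_pairs = [("bird", "a bird on a branch")]

dataset = capfilt(web_pairs, human_pairs, toy_captioner, toy_filter)
for image, text in dataset:
    print(image, "|", text)
```

In this toy run the noisy web caption for the cat image is dropped and replaced by the synthetic one, while the usable dog caption survives alongside its synthetic counterpart. The resulting dataset is what the MED would be pre-trained on in the second round.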
