[Paper Review] SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features (2025)


킹남지 2025. 9. 19. 01:34

Paper: https://arxiv.org/abs/2502.14786


I read this paper as a reference for applying self-distillation to an image encoder.

Accordingly, this review covers only the part of the overall training recipe that concerns the image encoder (the teacher-student network).

Motivation

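Earlier image-text models such as CLIP, ALIGN, and SigLIP achieved strong performance by training on large-scale datasets.

However, while these models excel at high-level semantic understanding, they are limited on localization and dense prediction tasks. The authors build on the SigLIP architecture and integrate several recent techniques to address these weaknesses.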

Method

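The paper trains the image encoder so that it learns not only strong semantic representations but also representations that work well for segmentation, localization, OCR, VQA, and similar tasks.

(The paper introduces and evaluates a wide range of methods, but I will look at only two topics here.)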

1. Training with Sigmoid loss and decoder

1) Uses a sigmoid loss, as in SigLIP

2) Decoder-based loss (LocCa)

- The image encoder is connected to a transformer decoder via cross-attention

- Training predicts captions for the whole image and captions for specific boxes

=> Learns both global and local semantic information
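As a rough illustration of item 1), here is a minimal sketch of a SigLIP-style pairwise sigmoid loss in plain Python. Every image-text pair in the batch is treated as an independent binary classification: matching pairs get label +1, all other pairs get label -1. The function names and the `temperature` and `bias` values are my own illustrative choices, not taken from the paper's code.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of image/text embeddings.

    image_embs[i] and text_embs[i] are assumed to be a matching pair.
    temperature and bias are illustrative values (learnable in practice).
    """
    n = len(image_embs)
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i in range(n):
        for j in range(n):
            logit = temperature * dot(imgs[i], txts[j]) + bias
            label = 1.0 if i == j else -1.0  # +1 on the diagonal, -1 elsewhere
            total -= log_sigmoid(label * logit)
    return total / n

aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = sigmoid_loss(aligned, aligned)
loss_shuffled = sigmoid_loss(aligned, shuffled)
```

In this toy batch the loss is lower when each image is paired with its own text than when the texts are swapped, which is the behavior the loss is meant to reward. Unlike a softmax contrastive loss, no normalization across the batch is needed.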

2. Training with self-distillation and masked prediction

These methods set up the image encoder itself in a teacher-student structure to perform local-to-global correspondence learning.

1) Self-distillation

- The student receives only local crops, while the teacher receives the full image. The two are trained so that their features match.

- Through this, the student learns to produce a global representation from local information.
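The self-distillation step above can be sketched roughly as follows. This is a minimal sketch under my own assumptions: the feature matching is written as a DINO-style cross-entropy between softened teacher and student outputs, and the teacher weights are an exponential moving average (EMA) of the student weights. The temperatures, the momentum, and all function names are hypothetical and chosen only for illustration.

```python
import math

def softmax(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def distillation_loss(student_logits, teacher_logits,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy from teacher (full image) to student (local crop).

    The teacher output is treated as a fixed target (no gradient).
    Temperatures are illustrative assumptions.
    """
    p_teacher = softmax(teacher_logits, teacher_temp)
    log_p_student = log_softmax(student_logits, student_temp)
    return -sum(t * ls for t, ls in zip(p_teacher, log_p_student))

def ema_update(teacher_params, student_params, momentum=0.99):
    # Teacher weights slowly track the student weights.
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

teacher_out = [2.0, 0.5, -1.0]       # teacher output for the full image
student_agree = [2.0, 0.5, -1.0]     # student output for a local crop
student_disagree = [-1.0, 0.5, 2.0]
loss_agree = distillation_loss(student_agree, teacher_out)
loss_disagree = distillation_loss(student_disagree, teacher_out)
new_teacher = ema_update([1.0], [0.0], momentum=0.9)
```

The loss is small when the student's output for a local crop agrees with the teacher's output for the full image, and large when it does not. Only the student is updated by gradient descent. The teacher changes only through `ema_update`.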

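2) Masked prediction

- The student receives a masked image, while the teacher receives the full image.

- Training matches the features at the masked locations.

- Through this, the student acquires semantic understanding that lets it recover information even from an image with missing regions.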
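To make the "match features only at masked locations" idea concrete, here is a small sketch. Note the substitution: I use a plain mean squared error between per-patch features as a stand-in for the actual matching loss, which keeps the example short. The function name and the data layout (one feature vector per patch, plus a boolean mask) are hypothetical.

```python
def masked_feature_loss(student_patches, teacher_patches, mask):
    """Average per-patch MSE, computed only where mask[i] is True.

    student_patches: per-patch features from the masked image (student).
    teacher_patches: per-patch features from the full image (teacher).
    mask: True for patches that were masked in the student's input.
    MSE is a stand-in loss chosen for illustration.
    """
    masked = [i for i, is_masked in enumerate(mask) if is_masked]
    if not masked:
        return 0.0
    total = 0.0
    for i in masked:
        s, t = student_patches[i], teacher_patches[i]
        total += sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)
    return total / len(masked)

teacher = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
student = [[0.0, 0.0], [2.0, 2.0], [3.0, 5.0]]
loss_last = masked_feature_loss(student, teacher, [False, False, True])
loss_first = masked_feature_loss(student, teacher, [True, False, False])
loss_none = masked_feature_loss(student, teacher, [False, False, False])
```

Unmasked patches contribute nothing to the loss, so the student is only pushed to reconstruct teacher-like features for the regions it could not see.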
