Paper: https://arxiv.org/abs/2201.12086

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Vision-Language Pre-training (VLP) has advanced performance on many vision-language tasks. However, most existing pre-trained models excel only at either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has b..