Google has announced PaliGemma 2, a new family of vision-language models. Launched on December 5th, it succeeds the original PaliGemma, the first vision-language model in the Gemma family, released seven months earlier. Built on Gemma 2, the models are designed to understand and interact with visual information, and, according to Google, are intended to simplify the integration of advanced vision-language capabilities into developers' applications.

PaliGemma 2 offers scalable performance, adaptable to various tasks through three model sizes (3B, 10B, and 28B parameters) and three input resolutions (224px, 448px, and 896px). A headline feature is long captioning: the models generate detailed, context-aware captions that go beyond simple object recognition to describe the actions, emotions, and overall narrative within an image.