S3A: Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment

Dec 12, 2023

Sheng Zhang

Muzammal Naseer

Guangyi Chen

Zhiqiang Shen

Salman Khan

Kun Zhang

Fahad Shahbaz Khan

Abstract

Large-scale pre-trained Vision-Language Models (VLMs) have proven effective for zero-shot classification. Despite this success, most traditional VLM-based methods assume either partial source supervision or an ideal target vocabulary, assumptions that rarely hold in open-world scenarios. In this paper, we target a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary. To address this new problem, we propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning. Our S3A framework adopts a unique Cluster-Vote-Prompt-Realign (CVPR) algorithm, which iteratively groups unlabeled data to derive structural semantics for pseudo-supervision. The CVPR algorithm comprises four steps: iteratively clustering images, voting within each cluster to identify initial class candidates from the vocabulary, generating discriminative prompts with large language models to distinguish confusing candidates, and realigning images with the vocabulary as structural semantic alignment. Finally, we propose to self-train the CLIP image encoder with both individual and structural semantic alignment through a teacher-student learning strategy. Our comprehensive experiments across various generic and fine-grained benchmarks demonstrate that S3A substantially improves over existing VLM-based approaches, raising accuracy over CLIP by more than 15% on average.
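To make the cluster, vote, and realign steps described in the abstract more concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation. It assumes hand-made 3-D vectors in place of CLIP image and text embeddings, a toy four-word vocabulary, and a small spherical k-means for the clustering step. The LLM prompt step and the teacher-student self-training are omitted, so the realign step here simply reuses the plain vocabulary embeddings. All function and variable names (`cluster`, `vote`, `realign`, `vocab`, and so on) are hypothetical.

```python
import numpy as np

def normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def cluster(image_emb, k, iters=10):
    """Tiny spherical k-means with deterministic farthest-point initialisation.
    Assumption: a stand-in for whatever clustering the real method uses."""
    centers = [image_emb[0]]
    while len(centers) < k:
        # Pick the image least similar to every center chosen so far.
        sim = (image_emb @ np.array(centers).T).max(axis=1)
        centers.append(image_emb[np.argmin(sim)])
    centers = np.array(centers)
    for _ in range(iters):
        assign = np.argmax(image_emb @ centers.T, axis=1)
        for j in range(k):
            members = image_emb[assign == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
        centers = normalize(centers)
    return np.argmax(image_emb @ centers.T, axis=1)

def vote(image_emb, text_emb, assign, n_candidates=2):
    """Each image votes for its nearest vocabulary entry; keep the top entries per cluster."""
    nearest = np.argmax(image_emb @ text_emb.T, axis=1)
    candidates = {}
    for c in np.unique(assign):
        counts = np.bincount(nearest[assign == c], minlength=len(text_emb))
        candidates[int(c)] = np.argsort(-counts, kind="stable")[:n_candidates]
    return candidates

def realign(image_emb, text_emb, assign, candidates):
    """Give every image in a cluster the candidate closest to the cluster mean.
    Assumption: the real method compares against LLM-generated prompts instead."""
    labels = np.empty(len(image_emb), dtype=int)
    for c, cand in candidates.items():
        mean = image_emb[assign == c].mean(axis=0)
        labels[assign == c] = cand[np.argmax(text_emb[cand] @ mean)]
    return labels

# Toy data: a broad vocabulary (4 words) and 6 unlabeled "images" from 3 true classes.
vocab = ["cat", "car", "tree", "dog"]
text_emb = normalize(np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0.8, 0.6, 0]], dtype=float))
image_emb = normalize(np.array([
    [1, 0.1, 0], [1, 0, 0.1],   # cat-like
    [0.1, 1, 0], [0, 1, 0.1],   # car-like
    [0, 0.1, 1], [0.1, 0, 1],   # tree-like
], dtype=float))

assign = cluster(image_emb, k=3)
candidates = vote(image_emb, text_emb, assign)
labels = realign(image_emb, text_emb, assign, candidates)
print([vocab[i] for i in labels])
```

In this sketch the pseudo-labels are shared by all images in a cluster, which is the sense in which the alignment is structural rather than per-image. The number of clusters `k` and the candidate count `n_candidates` are illustrative choices, not values taken from the paper.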

Publication

In *The Association for the Advancement of Artificial Intelligence, AAAI 2024*

Last updated on Sep 14, 2024

Authors

Muzammal Naseer

Asst. Professor, Khalifa University

My research interests include adversarial attacks and defenses, attention-based modeling, and out-of-distribution generalization.
