Publications

Srinivasa Rao Nandam, Sara Atito Ali Ahmed, Zhen-Hua Feng, Josef Kittler, Muhammad Awais (2024) Enhanced Weakly Supervised Few-shot Classification & Segmentation, In: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) - Proceedings, Institute of Electrical and Electronics Engineers (IEEE)

The emergence of vision-language foundation models has enabled the integration of textual information into vision-based applications. However, in few-shot classification and segmentation (FS-CS), this potential remains underutilised. Commonly, self-supervised vision models have been employed, particularly in weakly-supervised scenarios, to generate pseudo-segmentation masks, as ground-truth masks are typically unavailable and only image-level classification labels are provided. Despite their success, such models struggle to capture accurate semantics compared to vision-language models. To address this limitation, we propose a novel FS-CS approach that leverages the rich semantic alignment of vision-language models to generate more precise pseudo ground-truth masks. While current vision-language models excel at global visual-text alignment, they struggle with finer, patch-level alignment, which is crucial for detailed segmentation tasks. To overcome this, we introduce a method that enhances patch-level alignment without requiring additional training. In addition, existing FS-CS frameworks typically lack multi-scale information, limiting their ability to capture fine and coarse features simultaneously. To address this, we incorporate a module based on atrous convolutions that injects multi-scale information into the feature maps. Together, these contributions - text-enhanced pseudo-mask generation and improved multi-scale feature representation - significantly boost the performance of our model in weakly-supervised settings, surpassing state-of-the-art methods and demonstrating the importance of integrating multi-modal information for robust FS-CS solutions.
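The multi-scale module mentioned in this abstract is based on atrous convolutions. The following is a minimal, illustrative PyTorch sketch in the spirit of ASPP-style designs; the class name, dilation rates, and residual fusion are assumptions made for illustration and are not taken from the paper.

import torch
import torch.nn as nn

class MultiScaleAtrousModule(nn.Module):
    # Hypothetical ASPP-style block: parallel atrous (dilated) 3x3 convolutions
    # at several rates, concatenated and projected back to the input width.
    def __init__(self, channels, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3,
                          padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        # 1x1 projection to merge the concatenated multi-scale responses
        self.project = nn.Conv2d(channels * len(rates), channels, kernel_size=1)

    def forward(self, x):
        # x: (B, C, H, W) feature map; the output keeps the same shape
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.project(multi_scale) + x  # residual fusion keeps the original features

# Example: inject multi-scale context into a (2, 256, 24, 24) feature map
feats = torch.randn(2, 256, 24, 24)
out = MultiScaleAtrousModule(256)(feats)
print(out.shape)  # torch.Size([2, 256, 24, 24])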

Srinivasa Rao Nandam, Sara Atito, Zhenhua Feng, Josef Kittler, Muhammad Awais (2025) Investigating Self-Supervised Methods for Label-Efficient Learning, In: International Journal of Computer Vision, 133(7), pp. 4522-4537, Springer

Vision transformers combined with self-supervised learning have enabled the development of models that scale across large datasets for several downstream tasks, including classification, segmentation, and detection. However, the potential of these models for low-shot learning across several downstream tasks remains largely underexplored. In this work, we conduct a systematic examination of different self-supervised pretext tasks, namely contrastive learning, clustering, and masked image modelling, to assess their low-shot capabilities by comparing different pretrained models. In addition, we explore the impact of various collapse-avoidance techniques, such as centring, ME-MAX, and Sinkhorn normalisation, on these downstream tasks. Based on our detailed analysis, we introduce a framework that combines masked image modelling and clustering as pretext tasks. This framework demonstrates superior performance across all examined low-shot downstream tasks, including multi-class classification, multi-label classification, and semantic segmentation. Furthermore, when testing the model on large-scale datasets, we show performance gains in various tasks.
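One of the collapse-avoidance techniques compared in this work, Sinkhorn normalisation, can be summarised in a few lines. The sketch below follows the widely used Sinkhorn-Knopp formulation for cluster-assignment scores; the hyperparameter values and the function name are illustrative assumptions, not the paper's exact settings.

import torch

def sinkhorn(scores, eps=0.05, iters=3):
    # Sinkhorn-Knopp normalisation of cluster-assignment logits (illustrative values).
    # scores: (batch, num_prototypes) similarity logits.
    # Alternating row/column normalisation pushes all prototypes to be used roughly
    # equally, which prevents every sample collapsing onto a single cluster.
    q = torch.exp(scores / eps).t()        # (prototypes, batch)
    q /= q.sum()
    K, B = q.shape
    for _ in range(iters):
        q /= q.sum(dim=1, keepdim=True)    # rows: equalise prototype usage
        q /= K
        q /= q.sum(dim=0, keepdim=True)    # columns: one distribution per sample
        q /= B
    return (q * B).t()                     # (batch, prototypes), each row sums to 1

# Example: soft assignments for 8 embeddings over 16 prototypes
logits = torch.randn(8, 16)
assignments = sinkhorn(logits)
print(assignments.sum(dim=1))  # ~1.0 per sample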

Srinivasa Rao Nandam, Sara Atito Ali Ahmed, Zhen-Hua Feng, Josef Vaclav Kittler, Muhammad Awais (2025) Text Augmented Correlation Transformer for Few-shot Classification & Segmentation, In: IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025 - Proceedings, pp. 25357-25366, Institute of Electrical and Electronics Engineers (IEEE)

Foundation models like CLIP and ALIGN have transformed few-shot and zero-shot vision applications by fusing visual and textual data, yet the integrative few-shot classification and segmentation (FS-CS) task primarily leverages visual cues, overlooking the potential of textual support. In FS-CS scenarios, ambiguous object boundaries and overlapping classes often hinder model performance, as limited visual data struggles to fully capture high-level semantics. To bridge this gap, we present a novel multi-modal FS-CS framework that integrates textual cues into support data, facilitating enhanced semantic disambiguation and fine-grained segmentation. Our approach first investigates the unique contributions of exclusive text-based support, using only class labels to achieve FS-CS. This strategy alone achieves performance competitive with vision-only methods on FS-CS tasks, underscoring the power of textual cues in few-shot learning. Building on this, we introduce a dual-modal prediction mechanism that synthesizes insights from both textual and visual support sets, yielding robust multi-modal predictions. This integration significantly elevates FS-CS performance, with classification and segmentation improvements of +3.7/6.6% (1-way 1-shot) and +8.0/6.5% (2-way 1-shot) on COCO-20^i, and +2.2/3.8% (1-way 1-shot) and +4.3/4.0% (2-way 1-shot) on Pascal-5^i. Additionally, in weakly supervised FS-CS settings, our method surpasses visual-only benchmarks using textual support exclusively, further enhanced by our dual-modal predictions. By rethinking the role of text in FS-CS, our work establishes new benchmarks for multi-modal few-shot learning and demonstrates the efficacy of textual cues for improving model generalization and segmentation accuracy.
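To make the dual-modal idea concrete, the sketch below fuses a text prototype, obtained by embedding the class label with a CLIP text encoder, with a visual prototype pooled from the support set, then scores query patches against the fused prototype. This is only a hedged illustration: the prompt template, the averaging fusion with weight alpha, and all function names are assumptions; the paper's actual mechanism is a text-augmented correlation transformer.

import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPTokenizer

# Illustrative only: CLIP supplies a shared text/image embedding space.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch16")

def text_prototype(class_name):
    # Embed the class label with CLIP's text tower (assumed prompt template).
    tokens = tokenizer([f"a photo of a {class_name}"], return_tensors="pt")
    with torch.no_grad():
        t = model.get_text_features(**tokens)            # (1, 512)
    return F.normalize(t, dim=-1)

def dual_modal_scores(query_patches, visual_proto, class_name, alpha=0.5):
    # query_patches: (N, 512) L2-normalised patch embeddings of the query image.
    # visual_proto: (1, 512) prototype pooled from the visual support set.
    # Returns per-patch cosine scores in [-1, 1] against the fused prototype.
    proto = alpha * F.normalize(visual_proto, dim=-1) + (1 - alpha) * text_prototype(class_name)
    proto = F.normalize(proto, dim=-1)
    return query_patches @ proto.t()

# Example with random stand-ins for real patch features (14 x 14 grid)
patches = F.normalize(torch.randn(196, 512), dim=-1)
vis_proto = torch.randn(1, 512)
coarse_mask = dual_modal_scores(patches, vis_proto, "dog").view(14, 14)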