
Srinivasa Rao Nandam
Academic and research departments
Centre for Vision, Speech and Signal Processing (CVSSP), Surrey Institute for People-Centred Artificial Intelligence (PAI)
About
My research project
Foundation models for multimodal understanding
Foundation models for natural language processing (NLP) have already seen huge success following the seminal work of BERT (BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding) and GPT (Improving Language Understanding by Generative Pre-Training), both published in 2018 and recognised as the early foundation models for NLP.
Foundation models for computer vision, however, only started to emerge three years later, at the beginning of 2021, with the seminal work of SiT (SiT: Self-supervised vIsion Transformer (under review)), which proposed the idea of group masked model learning (GMML).
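As a rough illustration of the GMML idea, the sketch below masks contiguous groups of image-patch embeddings and trains a small transformer to reconstruct them. The masking geometry, model sizes, and helper names are illustrative assumptions for this sketch, not the SiT implementation.

```python
# Minimal sketch of GMML-style pretraining: mask contiguous groups of patches
# and reconstruct them from context. Sizes and names are illustrative only.
import torch
import torch.nn as nn

def group_mask(batch, grid=14, groups=4, group_size=3):
    """Mask contiguous square groups of patches; returns a bool mask per image."""
    B = batch.size(0)
    mask = torch.zeros(B, grid, grid, dtype=torch.bool)
    for b in range(B):
        for _ in range(groups):
            r = torch.randint(0, grid - group_size + 1, (1,)).item()
            c = torch.randint(0, grid - group_size + 1, (1,)).item()
            mask[b, r:r + group_size, c:c + group_size] = True
    return mask.flatten(1)                          # (B, grid*grid)

class MaskedReconstructor(nn.Module):
    def __init__(self, patch_dim=768, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(patch_dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, patch_dim))
        self.head = nn.Linear(patch_dim, patch_dim)  # reconstruct patch embeddings

    def forward(self, patches, mask):
        # Replace masked positions with a learnable mask token, then encode.
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(patches), patches)
        return self.head(self.encoder(x))

# Usage: reconstruction loss is computed on masked positions only.
patches = torch.randn(2, 196, 768)                  # (B, num_patches, dim), 14x14 grid
mask = group_mask(patches)
model = MaskedReconstructor()
recon = model(patches, mask)
loss = ((recon - patches)[mask]).pow(2).mean()
```

The point being illustrated is that whole neighbourhoods of patches are hidden together, so the model must recover local structure from wider context rather than from immediately adjacent visible patches.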
Equipped with both NLP and vision foundation models, the aim of the PhD is to study the role of these foundation models (e.g., NLP, vision, audio) in multimodal analysis and understanding. For example, the initial work of the current research team (CLMIU: Commonsense Learning in Multimodal Image Understanding (under review)) has already established that using vision foundation models for multimodal image understanding is more beneficial and removes the need for an object detector, which has been considered a critical pre-processing step for the visual input.
The PhD research will build upon these foundation models to develop more advanced multimodal and cross-modal algorithms suitable for several downstream applications. The explainability of the decisions and tasks performed by multimodal algorithms will also be a particular focus of the PhD study.
Supervisors
Publications
The emergence of vision-language foundation models has enabled the integration of textual information into vision-based applications. However, in few-shot classification and segmentation (FS-CS), this potential remains underutilised. Commonly, self-supervised vision models have been employed, particularly in weakly-supervised scenarios, to generate pseudo-segmentation masks, as ground-truth masks are typically unavailable and only the target classification is provided. Despite their success, such models struggle to capture accurate semantics compared to vision-language models. To address this limitation, we propose a novel FS-CS approach that leverages the rich semantic alignment of vision-language models to generate more precise pseudo ground-truth masks. While current vision-language models excel at global visual-text alignment, they struggle with finer, patch-level alignment, which is crucial for detailed segmentation tasks. To overcome this, we introduce a method that enhances patch-level alignment without requiring additional training. In addition, existing FS-CS frameworks typically lack multi-scale information, limiting their ability to capture fine and coarse features simultaneously. To overcome this, we incorporate a module based on atrous convolutions to inject multi-scale information into the feature maps. Together, these contributions - text-enhanced pseudo-mask generation and improved multi-scale feature representation - significantly boost the performance of our model in weakly-supervised settings, surpassing state-of-the-art methods and demonstrating the importance of integrating multi-modal information for robust FS-CS solutions.
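As a loose illustration of the atrous-convolution idea mentioned in this abstract, the sketch below builds an ASPP-style block over a 2D feature map. The dilation rates, channel sizes, and the class name AtrousMultiScale are assumptions made for the sketch, not the paper's exact design.

```python
# Sketch of a multi-scale module built from atrous (dilated) convolutions,
# applied to a patch feature map. Rates and channel sizes are illustrative.
import torch
import torch.nn as nn

class AtrousMultiScale(nn.Module):
    def __init__(self, in_ch=768, out_ch=256, rates=(1, 6, 12, 18)):
        super().__init__()
        # One branch per dilation rate; larger rates see coarser context.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):                           # x: (B, C, H, W)
        multi = torch.cat([b(x) for b in self.branches], dim=1)
        return self.project(multi)                  # fused fine and coarse context

# Usage on a 14x14 grid of ViT patch features reshaped into a 2D map.
feats = torch.randn(2, 768, 14, 14)
fused = AtrousMultiScale()(feats)                   # (2, 256, 14, 14)
```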
Vision transformers combined with self-supervised learning have enabled the development of models that scale across large datasets for several downstream tasks, including classification, segmentation, and detection. However, the potential of these models for low-shot learning across several downstream tasks remains largely underexplored. In this work, we conduct a systematic examination of different self-supervised pretext tasks, namely contrastive learning, clustering, and masked image modelling, to assess their low-shot capabilities by comparing different pretrained models. In addition, we explore the impact of various collapse-avoidance techniques, such as centring, ME-MAX, and Sinkhorn, on these downstream tasks. Based on our detailed analysis, we introduce a framework that combines masked image modelling and clustering as pretext tasks. This framework demonstrates superior performance across all examined low-shot downstream tasks, including multi-class classification, multi-label classification, and semantic segmentation. Furthermore, when testing the model on large-scale datasets, we show performance gains in various tasks.
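As a brief illustration of one of the collapse-avoidance techniques named above, the sketch below applies Sinkhorn-Knopp normalisation to prototype scores to obtain balanced soft cluster assignments, in the style used by clustering-based self-supervision. The epsilon, iteration count, and function name are illustrative defaults rather than the paper's configuration.

```python
# Sketch of Sinkhorn-Knopp normalisation for balanced cluster assignments.
import torch

@torch.no_grad()
def sinkhorn(scores, eps=0.05, iters=3):
    """Turn raw prototype scores (B, K) into a balanced assignment matrix."""
    Q = torch.exp(scores / eps).t()                 # (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(iters):
        Q /= Q.sum(dim=1, keepdim=True)             # rows: spread samples across prototypes
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)             # columns: each sample sums to one
        Q /= B
    return (Q * B).t()                              # (B, K), rows sum to one

# Usage: balanced soft targets for a clustering pretext loss.
scores = torch.randn(32, 10)                        # similarities to 10 prototypes
targets = sinkhorn(scores)
```

The alternating row and column normalisation is what discourages collapse: no single prototype can absorb all samples in a batch.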
Foundation models like CLIP and ALIGN have transformed few-shot and zero-shot vision applications by fusing visual and textual data, yet the integrative few-shot classification and segmentation (FS-CS) task primarily leverages visual cues, overlooking the potential of textual support. In FS-CS scenarios, ambiguous object boundaries and overlapping classes often hinder model performance, as limited visual data struggles to fully capture high-level semantics. To bridge this gap, we present a novel multi-modal FS-CS framework that integrates textual cues into support data, facilitating enhanced semantic disambiguation and fine-grained segmentation. Our approach first investigates the unique contributions of exclusive text-based support, using only class labels to achieve FS-CS. This strategy alone achieves performance competitive with vision-only methods on FS-CS tasks, underscoring the power of textual cues in few-shot learning. Building on this, we introduce a dual-modal prediction mechanism that synthesizes insights from both textual and visual support sets, yielding robust multi-modal predictions. This integration significantly elevates FS-CS performance, with classification and segmentation improvements of +3.7/6.6% (1-way 1-shot) and +8.0/6.5% (2-way 1-shot) on COCO-20^i, and +2.2/3.8% (1-way 1-shot) and +4.3/4.0% (2-way 1-shot) on Pascal-5^i. Additionally, in weakly supervised FS-CS settings, our method surpasses visual-only benchmarks using textual support exclusively, further enhanced by our dual-modal predictions. By rethinking the role of text in FS-CS, our work establishes new benchmarks for multi-modal few-shot learning and demonstrates the efficacy of textual cues for improving model generalization and segmentation accuracy.
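As a rough sketch of the dual-modal prediction idea described above, the code below fuses class scores computed from few-shot visual prototypes with scores from class-name text embeddings. The fixed fusion weight, temperature, and the assumption of precomputed CLIP-style features are illustrative choices, not the paper's exact mechanism.

```python
# Sketch of fusing visual-prototype and text-embedding class scores.
import torch
import torch.nn.functional as F

def dual_modal_logits(query_feat, visual_protos, text_embeds, alpha=0.5, temp=0.07):
    """query_feat: (D,); visual_protos, text_embeds: (num_classes, D)."""
    q = F.normalize(query_feat, dim=-1)
    v = F.normalize(visual_protos, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    vis_logits = q @ v.t() / temp                   # similarity to few-shot visual prototypes
    txt_logits = q @ t.t() / temp                   # similarity to class-name text embeddings
    return alpha * vis_logits + (1 - alpha) * txt_logits

# Usage with random stand-ins for precomputed image/text features (2-way episode).
query = torch.randn(512)
protos = torch.randn(2, 512)                        # mean support image feature per class
texts = torch.randn(2, 512)                         # embeddings of the two class names
probs = dual_modal_logits(query, protos, texts).softmax(dim=-1)
```

Here alpha simply balances the two modalities; when no visual support is available, the text term alone still yields class scores, mirroring the text-only setting discussed in the abstract.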