Dr. Sauradip Nag

Sauradip Nag

What Goes Around
Comes Around.

I completed my Doctor of Philosophy (PhD), focusing on Computer Vision and Deep Learning, in 2023 in Xiang's PhD Group at the Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, England, United Kingdom. I was advised by my primary supervisor, Prof. Tao (Tony) Xiang, and my co-supervisor, Prof. Yi-Zhe Song. I also worked closely with Dr. Xiatian (Eddy) Zhu.

Prior to this, I was a Project Associate in the Visualization and Perception Lab, IIT Madras, working with Prof. Sukhendu Das. During my undergraduate studies, I also collaborated on various Computer Vision problems with Prof. Umapada Pal of the Indian Statistical Institute (ISI), Kolkata, Prof. Palaiahnakote Shivakumara of the University of Malaya, Malaysia, and Prof. Partha Pratim Roy of the Indian Institute of Technology (IIT), Roorkee.

Website Not Updated Since 2023

Google Scholar / LinkedIn / GitHub / Twitter

Research Interests

I am broadly interested in Computer Vision and Deep Learning. In particular, I have mostly focused on Visual Scene Understanding (VSU) from images and videos, effective methods of Transfer Learning for VSU, and building systems that learn with minimal or no supervision in real and diverse scenarios. I have worked on Image Understanding Tasks (Object Detection/Semantic Segmentation/Depth Estimation), Video Understanding Tasks (Exo-Centric Videos/Ego-Centric Videos/Human Activity Detection/Video Classification), Vision-Language Modeling, and Low Resource Learning (Few-Shot/Zero-Shot/Meta-Learning/Semi-Supervised/Self-Supervised Learning). Recently, I have shifted my focus to Generative Modeling: 2D (Image/Video), 3D (Object/Human/Animal), and 4D (3D+Time) Generation and Editing.

Collaboration : I am always open to discussions and collaborations. Feel free to ping me on Email/LinkedIn if you are interested. Check this before contacting.

Updates

New 12/2023 : One paper accepted in AAAI 2024 on Diffusion for Audio, as an Oral paper

New 11/2023 : I successfully defended my PhD Thesis before Prof. Ioannis Patras and Prof. Simon Hadfield.

New 10/2023 : Congrats to Anindya and Jay on their ICCV-W and NeurIPS-W papers. Baby Steps !!

New 07/2023 : One paper accepted in ICCV 2023 on Diffusion for Videos

03/2023 : One paper accepted in CVPR 2023 on Video Action Post-Processing

06/2022 : Three papers (top 3% impactful authors) accepted in ECCV 2022 on Video Action Detection

06/2022 : Our team is 1st Runners Up in Fine-grained Retrieval Challenge in CVPR 2022

10/2021 : One paper accepted in ML4AD Workshop in NeurIPS 2021 on Autonomous Driving

10/2021 : One paper accepted in BMVC 2021 on Few-Shot Action Detection

08/2021 : Gave a presentation on Video Understanding using Fewer Labels at the Oxford ML Summer School.

06/2021 : One paper accepted in IEEE Transactions on Circuits and Systems for Video Technology

09/2020 : I have moved to South-West London to join Xiang's PhD Group, University of Surrey as a PhD student.

05/2020 : One paper got accepted in Pattern Recognition, Elsevier

Education


University of Surrey, United Kingdom
Position : Doctor of Philosophy (PhD) in Electrical Engineering
Under Prof. Tao Xiang and Prof. Yi-Zhe Song
Thesis : Towards Efficient Temporal Activity Detection
Jul 2020 - Nov 2023


Kalyani Government Engineering College, India
Position : Bachelor of Technology (B.Tech) in Computer Science and Engineering
Under Dr. Kousik Dasgupta
Thesis : Interacting with Softwares using Gestures
Jul 2014 - Jul 2018

Research Experience


Indian Institute of Technology Madras, India
Position : Project Associate in Visualization and Perception Lab and CAIR, DRDO
Under Prof. Sukhendu Das, Prof. K Mitra and Prof. B Ravindran
Jul 2018 - Feb 2020


Indian Statistical Institute Kolkata, India
Position : Research Associate in Computer Vision and Pattern Recognition Unit
Under Dr. Umapada Pal and Dr. P Shivakumara
Jul 2016 - Jul 2018

Journal Publications

A New Unified Method for Detecting Text from Marathon Runners and Sports Players in Video

Sauradip Nag, P Shivakumara, Umapada Pal, Tong Lu, Michael Blumenstein
Pattern Recognition, Elsevier [IF : 7.196]

Introduces a new way of Detecting Bib Numbers from sports video by taking Human Torso, Skin and Head into consideration.

Abstract / Code / BibTex

Detecting text located on the torsos of marathon runners and sports players in video is a challenging issue due to poor quality and adverse effects caused by flexible/colorful clothing, and different structures of human bodies or actions. This paper presents a new unified method for tackling the above challenges. The proposed method fuses gradient magnitude and direction coherence of text pixels in a new way for detecting candidate regions. Candidate regions are used for determining the number of temporal frame clusters obtained by K-means clustering on frame differences. This process in turn detects key frames. The proposed method explores Bayesian probability for skin portions using color values at both pixel and component levels of temporal frames, which provides fused images with skin components. Based on skin information, the proposed method then detects faces and torsos by finding structural and spatial coherences between them. We further propose adaptive pixels linking a deep learning model for text detection from torso regions. The proposed method is tested on our own dataset collected from marathon/sports video and three standard datasets, namely, RBNR, MMM and R-ID of marathon images, to evaluate the performance. In addition, the proposed method is also tested on the standard natural scene datasets, namely, CTW1500 and MS-COCO text datasets, to show the objectiveness of the proposed method. A comparative study with the state-of-the-art methods on bib number/text detection of different datasets shows that the proposed method outperforms the existing methods.

An Episodic Learning Network for Text Detection on Human Bodies in Sports Images

P Chowdhury, P Shivakumara, R Ramachandra, Sauradip Nag, Umapada Pal, Tong Lu, Daniel Lopresti
IEEE Transactions on CSVT [IF : 4.6]

Introduces a new improved Human Centric approach of Detecting Bib Numbers from sports video by taking motion influenced Human Clothing and Camera Pose into consideration.

Abstract / BibTex

Multi-Modal Few-Shot Temporal Action Detection

Sauradip Nag, Mengmeng Xu, Xiatian Zhu, Juan Perez-Rua, Bernard Ghanem, Yi-Zhe Song, Tao Xiang
ArXiv, 2023

This work introduced a novel Multi-Modal Few-Shot setting for the Human Activity Detection task, where each Support Set consists of both Videos and associated Captions/Text. This work also shows how Video-to-NullText inversion is done, similar to DreamBooth.

Abstract / Code / arXiv / BibTex

Preprints

PersonalTailor: Personalizing 2D Pattern Design from 3D Point Clouds

Sauradip Nag, Anran Qi, Xiatian Zhu, Ariel Shamir
ArXiv, 2023

This work introduced a multi-modal latent-space disentanglement pipeline for 2D Garment Pattern editing from 3D point clouds. Disentangling the latent space gives the flexibility to add/edit/delete the panel latents individually, and their composition forms new Garment Styles.

Abstract / Code / ArXiv / BibTex

Conference Publications

DiffSED: Diffusion-based Sound Event Detection

Swapnil Bhosale*, Sauradip Nag*, Diptesh Kanojia, Jiankang Deng, Xiatian Zhu
AAAI Conference on Artificial Intelligence (AAAI), 2024 (Oral Paper)
Vancouver, Canada [ H5-Index : 212 ]

This work reformulated the discriminative Sound-Event Detection task into a Generative Learning paradigm using Noise-to-Latent Denoising Diffusion.

Abstract / Code / ArXiv / BibTex

Sound Event Detection (SED) aims to predict the temporal boundaries of all the events of interest and their class labels, given an unconstrained audio sample. Taking either the split-and-classify (i.e., frame-level) strategy or the more principled event-level modeling approach, all existing methods consider the SED problem from the discriminative learning perspective. In this work, we reformulate the SED problem by taking a generative learning perspective. Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process, conditioned on a target audio sample. During training, our model learns to reverse the noising process by converting noisy latent queries to the groundtruth versions in the elegant Transformer decoder framework. Doing so enables the model to generate accurate event boundaries from even noisy queries during inference. Extensive experiments on the Urban-SED and EPIC-Sounds datasets demonstrate that our model significantly outperforms existing alternatives, with 40%+ faster convergence in training.

@article{bhosale2023diffsed,
  title={DiffSED: Sound Event Detection with Denoising Diffusion},
  author={Bhosale, Swapnil and Nag, Sauradip and Kanojia, Diptesh and Deng, Jiankang and Zhu, Xiatian},
  journal={arXiv preprint arXiv:2308.07293},
  year={2023}
}

DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion

Sauradip Nag, Xiatian Zhu, Jiankang Deng, Yi-Zhe Song, Tao Xiang
IEEE International Conference on Computer Vision (ICCV), 2023
Paris, France [ H5-Index : 239 ]

This work introduced the first DETR-based Diffusion framework for the Human Activity Detection task. It introduces a new Noise-to-Proposal denoising paradigm of Diffusion, with a Transformer Decoder as the denoiser. This can be extended to any detection task.

Abstract / Code / ArXiv / BibTex

We propose a new formulation of temporal action detection (TAD) with denoising diffusion, DiffTAD in short. Taking as input random temporal proposals, it can yield action proposals accurately given an untrimmed long video. This presents a generative modeling perspective, against previous discriminative learning manners. This capability is achieved by first diffusing the ground-truth proposals to random ones (i.e., the forward/noising process) and then learning to reverse the noising process (i.e., the backward/denoising process). Concretely, we establish the denoising process in the Transformer decoder (e.g., DETR) by introducing a temporal location query design with faster convergence in training. We further propose a cross-step selective conditioning algorithm for inference acceleration. Extensive evaluations on ActivityNet and THUMOS show that our DiffTAD achieves top performance compared to previous art alternatives.

@inproceedings{nag2023difftad,
  title={DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion},
  author={Nag, Sauradip and Zhu, Xiatian and Deng, Jiankang and Song, Yi-zhe and Xiang, Tao},
  booktitle=arxiv,
  year={2023}
}

Post-Processing Temporal Action Detection

Sauradip Nag, Xiatian Zhu, Yi-Zhe Song, Tao Xiang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023
Vancouver, Canada [ H5-Index : 389 ]

This work introduced a new Parameter-Free learnable Post-Processing technique for the Human Action Detection task. It uses Gaussian-based refinement of start/end points, where the refined shift is estimated using a Taylor expansion.

Abstract / Code / ArXiv / BibTex

Existing Temporal Action Detection (TAD) methods typically take a pre-processing step in converting an input varying-length video into a fixed-length snippet representation sequence, before temporal boundary estimation and action classification. This pre-processing step would temporally downsample the video, reducing the inference resolution and hampering the detection performance in the original temporal resolution. In essence, this is due to a temporal quantization error introduced during the resolution downsampling and recovery. This could negatively impact the TAD performance, but is largely ignored by existing methods. To address this problem, in this work we introduce a novel model-agnostic post-processing method without model redesign and retraining. Specifically, we model the start and end points of action instances with a Gaussian distribution for enabling temporal boundary inference at a sub-snippet level. We further introduce an efficient Taylor-expansion based approximation, dubbed as Gaussian Approximated Post-processing (GAP). Extensive experiments demonstrate that our GAP can consistently improve a wide variety of pre-trained off-the-shelf TAD models on the challenging ActivityNet (+0.2%∼0.7% in average mAP) and THUMOS (+0.2%∼0.5% in average mAP) benchmarks. Such performance gains are already significant and highly comparable to those achieved by novel model designs. Also, GAP can be integrated with model training for further performance gain. Importantly, GAP enables lower temporal resolutions for more efficient inference, facilitating low-resource applications.

@inproceedings{nag2022gap,
  title={Post-Processing Temporal Action Detection},
  author={Nag, Sauradip and Zhu, Xiatian and Song, Yi-zhe and Xiang, Tao},
  booktitle=arxiv,
  year={2022}
}

Proposal-Free Temporal Action Detection via Global Segmentation Mask

Sauradip Nag, Xiatian Zhu, Yi-Zhe Song, Tao Xiang
European Conference on Computer Vision (ECCV), 2022
Tel Aviv, Israel [ H5-Index : 187 ]

This is the first work that introduces a new Proposal-Free paradigm in the Human Action Detection task. It reformulates action start/end regression into an action-mask prediction problem. This makes it about 20x faster to train and about 1.6x more efficient at inference than existing approaches.

Abstract / Code / ArXiv / Project Page / BibTex

Existing temporal action localization (TAL) methods rely on generating an overwhelmingly large number of proposals per video. This leads to complex model designs due to proposal generation and/or per-proposal action instance evaluation and the resultant high computational cost. In this work, for the first time, a proposal-free TAL model is proposed. Our core idea is to learn a global segmentation mask (GSM) of each action instance jointly at the full video length. The GSM model differs significantly from the conventional proposal-based methods by focusing on global temporal representation learning to directly detect local start and end points of action instances without proposals. Further, by modeling TAL holistically rather than locally at the individual proposal level, GSM needs a much simpler model architecture with lower computational cost. Extensive experiments show that despite its simpler design, GSM outperforms existing TAL methods, achieving new state-of-the-art performance on two benchmarks. Importantly, it is ∼20× faster to train and ∼1.6× more efficient for inference.

@inproceedings{nag2022gsm,
  title={Temporal Action Localization with Global Segmentation Mask Learning},
  author={Nag, Sauradip and Zhu, Xiatian and Song, Yi-zhe and Xiang, Tao},
  booktitle=eccv,
  year={2022}
}

Zero-Shot Temporal Action Detection via Vision-Language Prompting

Sauradip Nag, Xiatian Zhu, Yi-Zhe Song, Tao Xiang
European Conference on Computer Vision (ECCV), 2022
Tel Aviv, Israel [ H5-Index : 187 ]

This is the first work that introduces Vision-Language models for the Zero-Shot Action Detection task. Off-the-shelf CLIP models are not meant for detection tasks; they need class-agnostic masking to generalize to the zero-shot setting, as illustrated in this work.

Abstract / Code / ArXiv / Project Page / BibTex

Existing temporal action localization (TAL) methods rely on a large number of training data with segment-level annotations, restricted to the training classes alone during inference. However, collecting and annotating a large training set for all the classes of interest is costly and hence unscalable over time. Zero-shot TAL (ZS-TAL) resolves this obstacle by enabling a pre-trained model to recognize any unseen action classes. Nonetheless, ZS-TAL is also much more challenging than its supervised counterpart, and consequently significantly under-studied. Inspired by the success of zero-shot image classification with the aid of recent vision-language (ViL) models such as CLIP, we also aim to capitalize on them for the complex TAL task. A recent ZS-TAL work of this kind directly combines an off-the-shelf proposal detector with CLIP style classification in a 2-stage design. Due to the sequential localization (e.g., proposal generation) and classification design, it is prone to localization error propagation. To overcome this problem, in this paper we propose a novel Parallel Classification and Localization with class-agnostic Feature Masking (PCL-FM) model. Such a novel design effectively eliminates the dependence between localization and classification by breaking the route for error propagation in-between. We further introduce an interaction mechanism between classification and localization for improved optimization. Extensive experiments on standard ZS-TAL video benchmarks show that our PCL-FM significantly outperforms state-of-the-art alternatives. Besides, our model also yields superior results on supervised TAL over recent strong competitors.

@inproceedings{nag2022pclfm,
  title={Language Guided Zero-Shot Temporal Action Localization with Feature Masking},
  author={Nag, Sauradip and Zhu, Xiatian and Song, Yi-zhe and Xiang, Tao},
  booktitle=eccv,
  year={2022}
}

Semi-Supervised Temporal Action Detection with Proposal-Free Masking

Sauradip Nag, Xiatian Zhu, Yi-Zhe Song, Tao Xiang
European Conference on Computer Vision (ECCV), 2022
Tel Aviv, Israel [ H5-Index : 187 ]

This work shows that a two-stage pipeline for the Human Action Detection task suffers from the Proposal Error-Propagation problem. It proposed a new single-stage framework coupled with a novel self-supervised pre-training task to curb this error.

Abstract / Code / ArXiv / Project Page / BibTex

Existing temporal action localization (TAL) methods rely on a large number of training data with segment-level annotations. Collecting and annotating such a training set is thus highly expensive and unscalable. Semi-supervised TAL (SSTAL) alleviates this problem by leveraging unlabeled videos freely available at scale. However, SSTAL is also a much more challenging problem than supervised TAL, and consequently much under-studied. Prior SSTAL methods directly combine an existing proposal-based TAL method and an SSL method. Due to their sequential localization (e.g., proposal generation) and classification design, they are prone to proposal error propagation. To overcome this limitation, in this work we propose a novel Proposal-Free Temporal Mask (PFTM) learning model with a parallel localization (mask generation) and classification architecture. Such a novel design effectively eliminates the dependence between localization and classification by cutting off the route for error propagation in-between. We further introduce an interaction mechanism between classification and localization for prediction refinement, and a new pretext task for self-supervised model pre-training. Extensive experiments on two standard benchmarks show that our PFTM outperforms state-of-the-art alternatives, often by a large margin.

@inproceedings{nag2022pftm,
  title={Semi-Supervised Temporal Action Localization with Proposal-Free Temporal Mask Learning},
  author={Nag, Sauradip and Zhu, Xiatian and Song, Yi-zhe and Xiang, Tao},
  booktitle=eccv,
  year={2022}
}

Few-Shot Temporal Action Localization with Query Adaptive Transformer

Sauradip Nag, Xiatian Zhu, Tao Xiang
British Machine Vision Conference (BMVC), 2021
Manchester, Virtual [ H5-Index : 66 ]

Existing Few-Shot Action Detection methods deal with trimmed videos and have different designs for different few-shot algorithms. This work proposed a Model-Agnostic approach that uses Untrimmed Videos and adapts Query Videos using Support Samples.

Abstract / Code / ArXiv / BibTex / Slides

Existing temporal action localization (TAL) works rely on a large number of training videos with exhaustive segment-level annotation, preventing them from scaling to new classes. As a solution to this problem, few-shot TAL (FS-TAL) aims to adapt a model to a new class represented by as few as a single video. Existing FS-TAL methods assume trimmed training videos for new classes. However, this setting is not only unnatural (actions are typically captured in untrimmed videos), but also ignores background video segments containing vital contextual cues for foreground action segmentation. In this work, we first propose a new FS-TAL setting by proposing to use untrimmed training videos. Further, a novel FS-TAL model is proposed which maximizes the knowledge transfer from training classes whilst enabling the model to be dynamically adapted to both the new class and each video of that class simultaneously. This is achieved by introducing a query adaptive Transformer in the model. Extensive experiments on two action localization benchmarks demonstrate that our method can outperform all the state-of-the-art alternatives significantly in both single-domain and cross-domain scenarios.

  title={Few-Shot Temporal Action Localization with Query Adaptive Transformer},
  author={Sauradip Nag and Xiatian Zhu and Tao Xiang},
  year={2021},
  eprint={2110.10552},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

What's There in the Dark

Sauradip Nag, Saptakatha Adak, Sukhendu Das
International Conference on Image Processing (ICIP), 2019 (Spotlight Paper)
Taipei, Taiwan [ H5-Index : 45 ]

This is the first work that introduced Semantic Segmentation for Night-Time scenes. This approach used CycleGANs as a means to generate Night-Time segmentations and used a comparator network as a discriminator to distinguish real vs fake night-time samples.

Abstract / Code / BibTex

Scene Parsing is an important cog for modern autonomous driving systems. Most of the works in semantic segmentation pertain to day-time scenes with favourable weather and illumination conditions. In this paper, we propose a novel deep architecture, NiSeNet, that performs semantic segmentation of night scenes using a domain mapping approach of synthetic to real data. It is a dual-channel network, where we designed a Real channel using DeepLabV3+ coupled with an MSE loss to preserve the spatial information. In addition, we used an Adaptive channel reducing the domain gap between synthetic and real night images, which also complements the failures of Real channel output. Apart from the dual channel, we introduced a novel fusion scheme to fuse the outputs of two channels. In addition to that, we compiled a new dataset Urban Night Driving Dataset (UNDD); it consists of 7125 unlabelled day and night images; additionally, it has 75 night images with pixel-level annotations having classes equivalent to Cityscapes dataset. We evaluated our approach on the Berkeley Deep Drive dataset, the challenging Mapillary dataset and UNDD dataset to exhibit that the proposed method outperforms the state-of-the-art techniques in terms of accuracy and visual quality.

@inproceedings{nag2019s,
  title={What’s There in the Dark},
  author={Nag, Sauradip and Adak, Saptakatha and Das, Sukhendu},
  booktitle={2019 IEEE International Conference on Image Processing (ICIP)},
  pages={2996--3000},
  year={2019},
  organization={IEEE}
}

Facial Micro-Expression Spotting and Recognition using Time Contrasted Feature with Visual Memory

Sauradip Nag, Ayan Kumar Bhunia, Aishik Konwer, Partha Pratim Roy
International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019
Brighton, United Kingdom [ H5-Index : 80 ]

This work introduced a new spatio-temporal network for the Facial Micro-Expression detection task. It introduced a time-contrasted feature extraction module that greatly improved the spotting of micro-expressions from inconspicuous facial movements.

Abstract / arXiv / BibTex

Facial micro-expressions are sudden involuntary minute muscle movements which reveal true emotions that people try to conceal. Spotting a micro-expression and recognizing it is a major challenge owing to its short duration and intensity. Many works pursued traditional and deep learning based approaches to solve this issue but compromised on learning low level features and higher accuracy due to unavailability of datasets. This motivated us to propose a novel joint architecture of spatial and temporal network which extracts time-contrasted features from the feature maps to contrast out micro-expression from rapid muscle movements. The usage of time contrasted features greatly improved the spotting of micro-expression from inconspicuous facial movements. Also, we include a memory module to predict the class and intensity of the micro-expression across the temporal frames of the micro-expression clip. Our method achieves superior performance in comparison to other conventional approaches on CASMEII dataset.

@inproceedings{nag2019facial,
  title={Facial Micro-Expression Spotting and Recognition using Time Contrasted Feature with Visual Memory},
  author={Nag, Sauradip and Bhunia, Ayan Kumar and Konwer, Aishik and Roy, Partha Pratim},
  booktitle={ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={2022--2026},
  year={2019},
  organization={IEEE}
}

CRNN based Jersey-Bib Number/Text Recognition in Sports and Marathon Images

Sauradip Nag, Raghavendra Ramachandra, Palaiahnakote Shivakumara, Umapada Pal, Tong Lu, Mohan Kankanhalli
International Conference on Document Analysis and Recognition (ICDAR), 2019
Sydney, Australia [ H5-Index : 26 ]

This work further improves Bib Detection from images. It uses 2D Human Pose Keypoints to identify different possible locations for Bib numbers and then extracts them individually using an LSTM-based text recognition pipeline.

Abstract / BibTex

The primary challenge in tracing the participants in sports and marathon video or images is to detect and localize the jersey/Bib number that may be present in different regions of their outfit captured in cluttered environment conditions. In this work, we proposed a new framework based on detecting the human body parts such that both Jersey Bib number and text are localized reliably. To achieve this, the proposed method first detects and localizes the human in a given image using Single Shot Multibox Detector (SSD). In the next step, different human body parts, namely Torso, Left Thigh and Right Thigh, that generally contain a Bib number or text region are automatically extracted. These detected individual parts are processed individually to detect the Jersey Bib number/text using a deep CNN network based on the 2-channel architecture based on the novel adaptive weighting loss function. Finally, the detected text is cropped out and fed to a CNN-RNN based deep model abbreviated as CRNN for recognizing jersey/Bib/text. Extensive experiments are carried out on the four different datasets including both bench-marking dataset and a new dataset. The performance of the proposed method is compared with the state-of-the-art methods on all four datasets, which indicates the improved performance of the proposed method on all four datasets.

@inproceedings{nag2019crnn,
  title={CRNN Based Jersey-Bib Number/Text Recognition in Sports and Marathon Images},
  author={Nag, Sauradip and Ramachandra, Raghavendra and Shivakumara, Palaiahnakote and Pal, Umapada and Lu, Tong and Kankanhalli, Mohan},
  booktitle={2019 International Conference on Document Analysis and Recognition (ICDAR)},
  pages={1149--1156},
  year={2019},
  organization={IEEE}
}

A New COLD Feature based Handwriting Analysis for Ethnicity/Nationality Identification

Sauradip Nag, Palaiahnakote Shivakumara, Wu Yirui, Umapada Pal, Tong Lu
International Conference on Frontiers in Handwriting Recognition (ICFHR), 2018
Niagara Falls, USA [ H5-Index : 18 ]

This is the first work that can identify Ethnicity from Handwriting in documents. It uses a Cloud-of-Line distribution based feature representation whose dimension is reduced by PCA to select the prominent differences, which is then used by an SVM for classification of ethnicity.

Abstract / arXiv / BibTex

Identifying crime for forensic investigating teams when crimes involve people of different nationals is challenging. This paper proposes a new method for ethnicity (nationality) identification based on Cloud of Line Distribution (COLD) features of handwriting components. The proposed method, at first, explores tangent angle for the contour pixels in each row and the mean of intensity values of each row in an image for segmenting text lines. For segmented text lines, we use tangent angle and direction of base lines to remove rule lines in the image. We use polygonal approximation for finding dominant points for contours of edge components. Then the proposed method connects the nearest dominant points of every dominant point, which results in line segments of dominant point pairs. For each line segment, the proposed method estimates angle and length, which gives a point in polar domain. For all the line segments, the proposed method generates dense points in polar domain, which results in COLD distribution. As character component shapes change, according to nationals, the shape of the distribution changes. This observation is extracted based on distance from pixels of distribution to Principal Axis of the distribution. Then the features are subjected to an SVM classifier for identifying nationals. Experiments are conducted on a complex dataset, which show the proposed method is effective and outperforms the existing method.

@inproceedings{nag2018new,
  title={New COLD Feature Based Handwriting Analysis for Enthnicity/Nationality Identification},
  author={Nag, Sauradip and Shivakumara, Palaiahnakote and Wu, Yirui and Pal, Umapada and Lu, Tong},
  booktitle={2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR)},
  pages={523--527},
  year={2018},
  organization={IEEE}
}
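
The Gaussian-plus-Taylor refinement described in the Post-Processing Temporal Action Detection (GAP) entry above can be illustrated with a short Python sketch. This is not the released GAP code. It is a minimal illustration that assumes the per-snippet boundary scores are roughly Gaussian around their peak, so that a second-order Taylor expansion of the log-scores at the discrete peak yields a sub-snippet shift. The function name `refine_peak` and the toy scores are hypothetical.

```python
import math


def refine_peak(scores):
    """Return a sub-snippet peak location for a 1-D list of non-negative scores.

    Hypothetical helper: expands log(score) to second order around the
    discrete argmax and moves to the maximum of that parabola.
    """
    m = max(range(len(scores)), key=scores.__getitem__)  # discrete argmax
    if m == 0 or m == len(scores) - 1:
        return float(m)  # no neighbour on one side; keep the discrete peak
    # In the log domain a Gaussian is a parabola, so finite differences
    # recover its first and second derivatives at the peak.
    left, mid, right = (math.log(max(scores[i], 1e-12)) for i in (m - 1, m, m + 1))
    d1 = (right - left) / 2.0        # central first difference
    d2 = right - 2.0 * mid + left    # central second difference
    if d2 >= 0:
        return float(m)  # not concave here; refinement is not meaningful
    return m - d1 / d2   # shift = -L'(m) / L''(m)


# Toy example: Gaussian scores whose true peak lies between snippets 10 and 11.
mu, sigma = 10.3, 2.0
scores = [math.exp(-((t - mu) ** 2) / (2 * sigma ** 2)) for t in range(20)]
print(round(refine_peak(scores), 3))  # 10.3
```

The discrete argmax alone would return 10 here. The refinement recovers the fractional offset without any retraining, which is the sense in which such a step can be applied on top of an already trained detector.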

Workshops and Challenges

Adaptive-Labeling for Enhancing Remote Sensing Cloud Understanding

Jay Gala, Sauradip Nag, Huichou Huang, Ruirui Liu, Xiatian Zhu
Tackling Climate Change with Machine Learning Workshop, NeurIPS 2023

This work introduces a new algorithm that iteratively improves existing noisy annotations and extracts the best performance from any model by combining Dynamic Thresholding with FixMatch-style optimization.

Abstract / Code / ArXiv / BibTex

Cloud analysis is a critical component of weather and climate science, impacting various sectors like disaster management. However, achieving fine-grained cloud analysis, such as cloud segmentation, in remote sensing remains challenging due to the inherent difficulties in obtaining accurate labels, leading to significant labeling errors in training data. Existing methods often assume the availability of reliable segmentation annotations, limiting their overall performance. To address this inherent limitation, we introduce an innovative model-agnostic Cloud Adaptive-Labeling (CAL) approach, which operates iteratively to enhance the quality of training data annotations and consequently improve the performance of the learned model. Our methodology commences by training a cloud segmentation model using the original annotations. Subsequently, it introduces a trainable pixel intensity threshold for adaptively labeling the cloud training images on-the-fly. The newly generated labels are then employed to fine-tune the model. Extensive experiments conducted on multiple standard cloud segmentation benchmarks demonstrate the effectiveness of our approach in significantly boosting the performance of existing segmentation models. Our CAL method establishes new state-of-the-art results when compared to a wide array of existing alternatives.

@inproceedings{gala2023cal,
  title={Adaptive-Labeling for Enhancing Remote Sensing Cloud Understanding},
  author={Gala, Jay and Nag, Sauradip and Huang, Huichou and Liu, Ruirui and Zhu, Xiatian},
  booktitle={NeurIPS 2023 Workshop on Tackling Climate Change with Machine Learning},
  year={2023}
}

	Actor-Agnostic Multi-Label Action Recognition with Multi-Modal Query Anindya Mondal* , Sauradip Nag* , Joaquin M. Prada , Xiatian Zhu , Anjan Dutta New Ideas in Vision Transformers Workshop (NIVT), ICCV 2023 This work showcases that Action-recognition is not Actor specific if we can make use of Language embeddings. Hence be it Animal or Human action, it is a unified model without any actor specific information requirement for recognition. Abstract / Code / ArXiv / BibTex Existing action recognition methods are typically actor-specific due to the intrinsic topological and apparent differences among the actors, (e.g., humans vs animals), leading to cumbersome model design complexity and high maintenance costs. Moreover, they often focus on learning the visual modality alone and single-label classification whilst neglecting other available information sources, (e.g., class name text). To address all these limitations, we first introduce a new problem of actor-agnostic multi-modal multi-label action recognition, a single model architecture generalising to different types of actors, such as humans and animals. We further formulate a novel Multi-modal Semantic Query Network (MSQNet) model in a transformer-based object detection framework, (e.g., DETR), characterised by leveraging both visual and textual modalities to represent the action classes more richly. Thus, we completely eliminate the conventional need for actor-specific model designs, (e.g., pose estimation). Extensive experiments on four publicly available benchmarks show that our MSQNet consistently outperforms the prior arts of actor-specific alternatives on both human and animal single- and multi-label action recognition tasks by a margin of up to 50% mean average precision score. 
@inproceedings{nag2022muppet, title={Multi-Modal Few-Shot Temporal Action Detection via Vision-Language Meta-Adaptation}, author={Nag, Sauradip and Xu, Mengmeng and Zhu, Xiatian and Perez-Rua, Juan-Manuel and Ghanem, Bernard and Song, Yi-zhe and Xiang, Tao}, booktitle={arXiv}, year={2022} }
	Large-Scale Product Retrieval with Weakly Supervised Representation Learning X Han, K.W Ng , Sauradip Nag , Z Qu eBay eProduct Visual Search Challenge Fine-Grained Visual Categorization Workshop (FGVC9), CVPR 2022 New Orleans, USA We achieved the runner-up position in this retrieval challenge. Additionally, we proposed novel solutions for mining pseudo-attributes and treating them as labels, along with some innovative training recipes and post-processing solutions for the large-scale product retrieval task. Abstract / Code / ArXiv / BibTex Large-scale weakly supervised product retrieval is a practically useful yet computationally challenging problem. This paper introduces a novel solution for the eBay Visual Search Challenge (https://eval.ai/challenge/1541/overview) held at the Ninth Workshop on Fine-Grained Visual Categorisation (FGVC9) at CVPR 2022. There are two main challenges to be addressed in this competition: (a) e-commerce is a super fine-grained domain, where there are many products with subtle visual differences; (b) instance-level labels are unavailable for training, and only coarse-grained category labels and product titles are available. To this end, we adopt a series of empirically effective techniques: (a) Instead of using text training data directly, thousands of pseudo-attributes are mined from product titles and used as ground truths for multi-label classification. (b) Several strong backbones and advanced training recipes are incorporated to learn more discriminative feature representations. (c) Several post-processing techniques including whitening, re-ranking and model ensemble are integrated to further refine model predictions. Consequently, the final score of our team (Involution King) reaches 71.53% MAR, ranking second on the leaderboard. 
title={Few-Shot Temporal Action Localization with Query Adaptive Transformer}, author={Sauradip Nag and Xiatian Zhu and Tao Xiang}, year={2021}, eprint={2110.10552}, archivePrefix={arXiv}, primaryClass={cs.CV} }
	How Far Can I Go ? : A Self-Supervised Approach for Deterministic Video Depth Forecasting Sauradip Nag* , Nisarg Shah* , Anran Qi* , R Ramachandra Machine Learning for Autonomous Driving Workshop (ML4AD), NeurIPS 2021 Australia, Virtual This work introduced the first self-supervised Video Depth Forecasting solution for autonomous driving. It proposed a new Feature Forecasting paradigm of generative modeling for generating future depth maps from RGB frames. Abstract / Code / BibTex In this paper we present a novel self-supervised method to anticipate the depth estimate for a future, unobserved real-world urban scene. This work is the first to explore self-supervised learning for estimation of monocular depth of future unobserved frames of a video. Existing works rely on a large number of annotated samples to generate the probabilistic prediction of depth for unseen frames. However, this is unrealistic because it requires a large amount of annotated video depth samples. In addition, the probabilistic nature of the case, where one past can have multiple future outcomes, often leads to incorrect depth estimates. Unlike previous methods, we model the depth estimation of the unobserved frame as a view-synthesis problem, which treats the depth estimate of the unseen video frame as an auxiliary task while synthesizing back the views using learned pose. This approach is not only cost-effective (we do not use any ground truth depth for training, hence it is practical) but also deterministic (a sequence of past frames maps to an immediate future). To address this task we first develop a novel depth forecasting network DeFNet which estimates the depth of the unobserved future by forecasting latent features. Second, we develop a channel-attention based pose estimation network that estimates the pose of the unobserved frame. Using this learned pose, the estimated depth map is reconstructed back into the image domain, thus forming a self-supervised solution. 
Our proposed approach shows significant improvements of ~5%/8% in the Abs Rel metric compared to state-of-the-art alternatives in both short- and mid-term forecasting settings, benchmarked on KITTI and Cityscapes.
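For readers unfamiliar with the Abs Rel figure quoted in the depth forecasting abstract above, here is a minimal sketch of the usual absolute relative error: the mean of |predicted - ground truth| / ground truth over pixels with valid ground truth. The function name `abs_rel` and the flat-list inputs are illustrative assumptions, not code from the paper.

```python
def abs_rel(pred, gt):
    """Mean absolute relative depth error over pixels with valid (positive) ground truth."""
    # Pixels without ground truth are commonly stored as 0 and skipped (assumption).
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


# Two pixels: one exact, one off by 1 m at 5 m depth (relative error 0.2).
example = abs_rel([2.0, 4.0], [2.0, 5.0])
```

Lower is better, so an improvement in Abs Rel means a smaller value.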

Academic Projects

	Motion-Aware 3D Object Texturing In this project we explore a technique that leverages diffusion models to seamlessly paint a given raw 3D input mesh of an articulated object. Existing methods iteratively render the object from different viewpoints, apply a depth-based painting scheme, and project it back to the mesh vertices or U-V atlas. Such optimization does not generate faithful texturing of all the mesh vertices or consistent textures for articulated objects like avatars. We instead leverage a motion prior in optimizing the consistency of the U-V atlas and then improve the consistency for mesh vertex coloring. Code (Coming Soon)
	Free-ViewPoint Rendering based 3D Clothed Human Human avatars will be key for future games and movies, mixed-reality, tele-presence and the “metaverse”. To build realistic and personalized avatars at scale, we need to faithfully reconstruct detailed 3D humans from color photos taken in the wild. A good reconstruction method must accurately capture these details, while also being robust to novel clothing and poses. The goal of this project is to create a 3D human that supports free-viewpoint rendering, with free-form garment reconstruction and appropriate texture. Code (Coming Soon)
	Temporal Action Localization Visualization Tool Impressive progress has been reported in recent literature for action recognition. This trend motivates another challenging topic, temporal action localization: given a long untrimmed video, “when does a specific action start and end?” This problem is important because real applications usually involve long untrimmed videos, which can be highly unconstrained in space and time, and one video can contain multiple action instances plus background scenes or other activities. However, there is practically no code available to visualize the results and compare them with the ground truth for a given video. The only thing currently available is the quantitative results, which are evaluated using the code provided with the respective dataset. This visualization tool is designed to bridge this gap and to inspect the performance of any PyTorch model on Temporal Action Localization. It has been built with HTML, CSS, JS and Python. Code
	Computer Vision based Robot Locomotion Vision-based robot navigation has long been a fundamental goal in both robotics and computer vision research. However, a robot does not need a full semantic labelling of its environment in order to move. Since we are dealing with ground-based robots, only the floor region of an environment is needed to plan a path. Depth information is particularly useful in predicting how far the robot can move in a particular direction. Hence, combining the two vision tasks gives us a free-space map which enables the robot to move freely in a given direction. On top of this, we have designed a novel motion control algorithm which enables the robot to navigate through obstacles in its path. This is implemented in PyTorch. Code
	Autonomous Robot Locomotion in prespecified path Robots have helped humans greatly in achieving everyday tasks. Robots are designed to work in any environment and perform tasks on behalf of humans. They operate under real-world and real-time constraints where sensors and effectors with specific physical characteristics have to be controlled. In many cases, those robots are controlled manually to move from one destination to another. An Unmanned Ground Vehicle (UGV) is a vehicle that operates while in contact with the ground and without an onboard human presence. We have used one such robot to demonstrate custom paths that a user can choose. Currently we have implemented two such custom paths. This has been implemented in Python and ROS. Code
	Light-weight Salient Object Detection Salient object detection is a prevalent computer vision task that has applications ranging from abnormality detection to abnormality processing. Context modelling is an important criterion in the domain of saliency detection. A global context helps in determining the salient object in a given image by contrasting away other objects in the global view of the scene. Local context features, however, detect the boundaries of the salient object with higher accuracy in a given region. To incorporate the best of both worlds, our proposed SaLite model uses both global and local contextual features. It is an encoder-decoder based architecture in which the encoder uses a lightweight SqueezeNet and the decoder is modelled using convolution layers. This has been implemented in PyTorch. Code / Arxiv
	Boundary Growing Algorithm for Recovery of Torso from Corrupt Face Automatic face detection has been intensively studied for human related recognition systems. However, there has been very little work on recovering a face from a corrupted face image. We have designed a boundary growing algorithm in which we incrementally grow the boundary of the corrupted face image and pass it to a Haar cascade classifier to get a confidence score. We repeat this until we reach the maximum confidence. After that, we use tailor measurements to recover the torso part of the human. This has been implemented using OpenCV and Python. Code
	BaseLine Remover from Document Images This is a small MATLAB implementation for removing baselines from document images using the Edge Directional Kernel described in the paper "Edge enhancement algorithm for low-dose X-ray fluoroscopic imaging" by Lee et al. Baseline removal is an important topic in Document Image Analysis. In their paper, Lee et al. proposed removing noise from X-ray images using an Edge Directional Kernel and a High Pass Filter, but since our only noise is the baseline, we implement just the Edge Directional Kernel, and the results are quite neat. This method works well for half-page documents and cropped line images; for full-page images it may also work well if they are preprocessed first. Code / Paper
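The boundary growing project above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `score_fn` is a hypothetical stand-in for the Haar cascade confidence, the box grows uniformly by `step` pixels on every side, and growth stops at the first drop in confidence, which is one assumed reading of "until we reach the maximum confidence".

```python
from typing import Callable, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixel coordinates


def grow_boundary(box: Box, score_fn: Callable[[Box], float],
                  image_size: Tuple[int, int], step: int = 5) -> Tuple[Box, float]:
    """Grow `box` outward while the (hypothetical) confidence keeps improving."""
    width, height = image_size
    best_box, best_score = box, score_fn(box)
    while True:
        x0, y0, x1, y1 = best_box
        # Expand every side by `step`, clamped to the image bounds.
        candidate = (max(0, x0 - step), max(0, y0 - step),
                     min(width, x1 + step), min(height, y1 + step))
        if candidate == best_box:  # nothing left to grow into
            break
        score = score_fn(candidate)
        if score <= best_score:  # confidence stopped improving
            break
        best_box, best_score = candidate, score
    return best_box, best_score


# Toy scorer (assumption): confidence peaks when the box is 40 pixels wide.
def toy_score(b: Box) -> float:
    return -abs((b[2] - b[0]) - 40)


best_box, best_score = grow_boundary((20, 20, 40, 40), toy_score, image_size=(100, 100))
```

In a real pipeline `score_fn` would crop the image to the candidate box and run a face detector on it; stopping at the first drop is a greedy choice and can miss a later, higher peak.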

Collaboration/Mentoring

Hiren Parmar (IIT Delhi), Summer Intern in UP-Lab
Topic : Physics Guided Character Animation

Pengjin Wu , Masters at University of Surrey under Prof. Xiatian Zhu
Topic : Multi-Modal Few-Shot Image Classification

Vysakh Ramakrishnan , Part-Time Intern in UP-Lab with Ext-Colab
Topic : 3D Asset Generation, 4D Generation/Editing

Abdul Wasi , Part-Time Intern in UP-Lab with Junting(CUHK)
Topic : Controlled Video Editing

Jay Gala , Part-Time Intern in UP-Lab
Topic : Remote Sensing (NeurIPS'23W), Editing via Diffusion Models, Digital Human

Anindya Mondal , PhD in Surrey AI institute under Prof. Anjan Dutta and Ext-Colab
Topic : Animal Action Understanding (ICCV'23W), Object Counting, 3D Animals

Silpa V.S , PhD in Surrey AI institute under Prof. Anjan Dutta and Prof. Serge Belongie
Topic : Debiasing Generative Models

Academic Services

Teaching:

COM3013: Computational Intelligence (2022), University of Surrey
EEEM004: Advanced Topics in Computer Vision (2023), University of Surrey

Technical Programme Committee:

ML for Autonomous Driving Workshop, NeurIPS

Conference Review Services:

CVPR, ICCV, ECCV, ICLR, NeurIPS, AAAI, ACCV

Journal Review Services:

Springer Nature Computer Science
International Journal of Human-Computer Interaction
IEEE Transactions on Circuits and Systems for Video Technology
IEEE Transactions on Pattern Analysis and Machine Intelligence
IEEE Transactions on Image Processing
Elsevier Computer Vision and Image Understanding

Invited Talks

Dec, 2023 at Eizen AI : "Modern Approaches in Video Understanding" [Slides]

Jan, 2024 at Adobe : "Future of Video Editing"

April, 2024 at CMU RI : "CreativeX: Future of Creative Generation/Editing"