Sauradip Nag

What Goes Around
Comes Around.

I am currently a third-year Doctor of Philosophy (PhD) student focusing on Computer Vision and Deep Learning in Prof. Xiang's PhD group at the Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, England, United Kingdom. My primary supervisor is Prof. Tao (Tony) Xiang, and my co-supervisor is Prof. Yi-Zhe Song. I also work closely with Dr. Xiatian (Eddy) Zhu.

Prior to this, I was a Project Associate in the Visualization and Perception Lab, IIT Madras, working under Prof. Sukhendu Das on a DRDO project that involves finding hidden locations in an unknown environment. I have also collaborated with Prof. Umapada Pal from the Indian Statistical Institute (ISI), Kolkata and Prof. Palaiahnakote Shivakumara of the University of Malaya, Malaysia on various Computer Vision research problems during my undergraduate studies.

Email  /  Resume  /  Google Scholar  /  LinkedIn  /  GitHub  /  Twitter

Research Interests

I am broadly interested in the field of Computer Vision and Deep Learning. In particular, I have mostly focused on Visual Scene Understanding (VSU) from images and videos, effective methods of Transfer Learning for VSU, and building systems that learn with minimal or no supervision in real and diverse scenarios. I have worked on Image Understanding tasks (Object Detection / Semantic Segmentation / Depth Estimation), Video Understanding tasks (Exo-Centric Videos / Ego-Centric Videos / Human Activity Detection / Video Classification), Vision-Language Modeling, and Low-Resource Learning (Few-Shot / Zero-Shot / Meta-Learning / Semi-Supervised Learning). Recently, I have shifted my focus to Generative Modeling and 3D Humans (Animation / Motion Synthesis / Texture Synthesis / Editing).

Collaboration: I am always open to discussions and collaborations; feel free to ping me via Email/LinkedIn if you are interested. Check this before contacting.


Research Experience
Noah's Ark Lab, London and MPI-IS, Germany

Position : Research Internship on Generative Modeling and 3D Clothed Avatars
Under Dr. Jiankang Deng
Dec 2022 - Aug 2023

Indian Institute of Technology Madras, India

Position : Project Associate in Visualization and Perception Lab and CAIR, DRDO
Under Prof. Sukhendu Das, Prof. K Mitra and Prof. B Ravindran
Jul 2018 - Jul 2020

Indian Statistical Institute Kolkata, India

Position : Research Associate in Computer Vision and Pattern Recognition Unit
Under Dr. Umapada Pal and Dr. P Shivakumara
Jul 2016 - Jul 2018

University of Surrey, United Kingdom

Position : Doctor of Philosophy (PhD) in Electrical Engineering
Under Prof. Tao Xiang and Prof. Yi-Zhe Song
Jul 2020 - Present

Kalyani Government Engineering College, India

Position : Bachelor of Technology (B.Tech) in Computer Science and Engineering
Under Dr. Kousik Dasgupta
Thesis : Interacting with Softwares using Gestures
Jul 2014 - Jul 2018

Journal Publications
A New Unified Method for Detecting Text from Marathon Runners and Sports Players in Video

Sauradip Nag , P Shivakumara , Umapada Pal , Tong Lu , Michael Blumenstein
Pattern Recognition, Elsevier [IF : 7.196]

Introduces a new way of Detecting Bib Numbers from sports video by taking Human Torso, Skin and Head into consideration.

Abstract / Code / BibTex

An Episodic Learning Network for Text Detection on Human Bodies in Sports Images

P Chowdhury , P Shivakumara , R Ramachandra , Sauradip Nag , Umapada Pal , Tong Lu , Daniel Lopresti
IEEE Transactions on CSVT [IF : 4.6]

Introduces a new, improved human-centric approach to detecting Bib Numbers in sports video by taking motion-influenced Human Clothing and Camera Pose into consideration.

Abstract / BibTex

Multi-Modal Few-Shot Temporal Action Detection

Sauradip Nag , Mengmeng Xu , Xiatian Zhu , Juan Perez-Rua , Bernard Ghanem , Yi-Zhe Song , Tao Xiang
IEEE Transactions on PAMI (Under Review) [IF : 24.3]

This work introduced a novel Multi-Modal Few-Shot setting for the Human Activity Detection task, where each Support Set consists of both Videos and associated Captions/Text. This work also shows how Video-to-NullText inversion is done, similar to DreamBooth.

Abstract / Code / arXiv / BibTex

DiffSED: Diffusion-based Sound Event Detection

Swapnil Bhosale* , Sauradip Nag* , Diptesh Kanojia , Jiankang Deng , Xiatian Zhu
ArXiv, 2023

This work reformulated the discriminative Sound-Event Detection task into a Generative Learning paradigm using Noise-to-Latent Denoising Diffusion.

Abstract / Code / ArXiv / BibTex

PersonalTailor: Personalizing 2D Pattern Design from 3D Point Clouds

Sauradip Nag , Anran Qi , Xiatian Zhu , Ariel Shamir
ArXiv, 2023

This work introduced a multi-modal latent-space disentanglement pipeline for 2D Garment Pattern editing from 3D point clouds. Disentangling the latent space gives the flexibility to add, edit, or delete panel latents individually, and their composition forms new Garment Styles.

Abstract / Code / ArXiv / BibTex

Conference Publications
DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion

Sauradip Nag , Xiatian Zhu , Jiankang Deng , Yi-Zhe Song , Tao Xiang
IEEE International Conference on Computer Vision (ICCV), 2023
Paris, France [ H5-Index : 239 ]

This work introduced the first DETR-based Diffusion framework for the Human Activity Detection task. It introduces a new Noise-to-Proposal denoising paradigm of Diffusion, using a Transformer Decoder as the denoiser. This can be extended to any detection task.

Abstract / Code / ArXiv / BibTex

Post-Processing Temporal Action Detection

Sauradip Nag , Xiatian Zhu , Yi-Zhe Song , Tao Xiang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023
Vancouver, Canada [ H5-Index : 389 ]

This work introduced a new Parameter-Free learnable Post-Processing technique for the Human Action Detection task. It uses Gaussian-based refinement of start/end points, where the refined shift is estimated using a Taylor expansion.

Abstract / Code / ArXiv / BibTex

Proposal-Free Temporal Action Detection via Global Segmentation Mask

Sauradip Nag , Xiatian Zhu , Yi-Zhe Song , Tao Xiang
European Conference on Computer Vision (ECCV), 2022
Tel Aviv, Israel [ H5-Index : 187 ]

This is the first work that introduces a new Proposal-Free paradigm for the Human Action Detection task. It reformulates action start/end regression as an action-mask prediction problem. This makes it 30x faster in training and 2x faster in inference than existing approaches.

Abstract / Code / ArXiv / Project Page / BibTex

Zero-Shot Temporal Action Detection via Vision-Language Prompting

Sauradip Nag , Xiatian Zhu , Yi-Zhe Song , Tao Xiang
European Conference on Computer Vision (ECCV), 2022
Tel Aviv, Israel [ H5-Index : 187 ]

This is the first work that introduces Vision-Language models for the Zero-Shot Action Detection task. Off-the-shelf CLIP models are not meant for detection tasks; they need class-agnostic masking to generalize to the zero-shot setting, as illustrated in this work.

Abstract / Code / ArXiv / Project Page / BibTex

Semi-Supervised Temporal Action Detection with Proposal-Free Masking

Sauradip Nag , Xiatian Zhu , Yi-Zhe Song , Tao Xiang
European Conference on Computer Vision (ECCV), 2022
Tel Aviv, Israel [ H5-Index : 187 ]

This work shows that a two-stage pipeline for the Human Action Detection task suffers from the Proposal Error-Propagation problem. It proposed a new single-stage framework coupled with a novel self-supervised pre-training task to curb this error.

Abstract / Code / ArXiv / Project Page / BibTex

Few-Shot Temporal Action Localization with Query Adaptive Transformer

Sauradip Nag , Xiatian Zhu , Tao Xiang
British Machine Vision Conference (BMVC), 2021
Manchester, Virtual [ H5-Index : 66 ]

Existing Few-Shot Action Detection methods deal with trimmed videos and have different designs for different few-shot algorithms. This work proposed a Model-Agnostic approach that uses Untrimmed Videos for adapting Query Videos using Support Samples.

Abstract / Code / ArXiv / BibTex / Slides

What's There in the Dark

Sauradip Nag , Saptakatha Adak , Sukhendu Das
International Conference on Image Processing (ICIP), 2019 (Spotlight Paper)
Taipei, Taiwan [ H5-Index : 45 ]

This is the first work that introduced Semantic Segmentation for Night-Time scenes. This approach used CycleGANs as a means to generate night-time segmentations and used a comparator network as a discriminator to distinguish real from fake night-time samples.

Abstract / Code / BibTex

Facial Micro-Expression Spotting and Recognition using Time Contrasted Feature with Visual Memory

Sauradip Nag , Ayan Kumar Bhunia , Aishik Konwer, Partha Pratim Roy
International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019
Brighton, United Kingdom [ H5-Index : 80 ]

This work introduced a new spatio-temporal network for the Facial Micro-Expression detection task. It introduced a time-contrasted feature extraction module that greatly improved the spotting of micro-expressions from inconspicuous facial movements.

Abstract / arXiv / BibTex

CRNN based Jersey-Bib Number/Text Recognition in Sports and Marathon Images

Sauradip Nag , Raghavendra Ramachandra , Palaiahnakote Shivakumara , Umapada Pal , Tong Lu , Mohan Kankanhalli
International Conference on Document Analysis and Recognition (ICDAR), 2019
Sydney, Australia [ H5-Index : 26 ]

This work further improves Bib Detection from images. It uses 2D Human Pose Keypoints to identify different possible locations for Bib numbers and then individually extracts them using an LSTM-based text recognition pipeline.

Abstract / BibTex

A New COLD Feature based Handwriting Analysis for Ethnicity/Nationality Identification

Sauradip Nag , Palaiahnakote Shivakumara , Wu Yirui , Umapada Pal , Tong Lu
International Conference on Frontiers in Handwriting Recognition (ICFHR), 2018
Niagara Falls, USA [ H5-Index : 18 ]

This is the first work that can identify Ethnicity from Handwriting in documents. It uses a Cloud-of-Line distribution based feature representation, whose dimension is reduced by PCA to select the prominent differences and which is then used by an SVM to classify ethnicity.

Abstract / arXiv / BibTex

Workshops and Challenges
Actor-Agnostic Multi-Label Action Recognition with Multi-Modal Query

Anindya Mondal* , Sauradip Nag* , Joaquin M. Prada , Xiatian Zhu , Anjan Dutta
New Ideas in Vision Transformers Workshop (NIVT), ICCV 2023

This work shows that Action Recognition need not be Actor-specific if we make use of Language embeddings. Hence, whether the action is Animal or Human, a single unified model can recognize it without requiring any actor-specific information.

Abstract / Code / ArXiv / BibTex

Large-Scale Product Retrieval with Weakly Supervised Representation Learning

X Han*, K.W Ng* , Sauradip Nag , Z Qu
eBay eProduct Visual Search Challenge
Fine-Grained Visual Categorization Workshop (FGVC9), CVPR 2022
New Orleans, USA

We achieved the runner-up position in this retrieval challenge. Additionally, we proposed novel solutions for mining pseudo-attributes and treating them as labels, along with some innovative training recipes and novel post-processing solutions for the large-scale product retrieval task.

Abstract / Code / ArXiv / BibTex

How Far Can I Go ? : A Self-Supervised Approach for Deterministic Video Depth Forecasting

Sauradip Nag* , Nisarg Shah* , Anran Qi* , R Ramachandra
Machine Learning for Autonomous Driving Workshop (ML4AD), NeurIPS 2021
Australia, Virtual

This work introduced the first self-supervised Video Depth Forecasting solution for autonomous driving. It proposed a new Feature Forecasting paradigm of generative modeling for generating future depth maps from RGB frames.

Abstract / Code / BibTex

Academic Projects
Motion-Aware 3D Object Texturing

In this project we explore a technique that leverages diffusion models to seamlessly paint a given raw 3D input mesh of an articulated object. Existing methods iteratively render the object from different viewpoints, apply a depth-based painting scheme, and project the result back to the mesh vertices or U-V atlas. Such optimization does not generate faithful texturing of all the mesh vertices or consistent textures for articulated objects like avatars. We instead leverage a motion prior in optimizing the consistency of the U-V atlas and then improve the consistency for mesh vertex coloring.

Code (Coming Soon)

Rendering-Free 3D Clothed Human

Human avatars will be key for future games and movies, mixed-reality, tele-presence and the “metaverse”. To build realistic and personalized avatars at scale, we need to faithfully reconstruct detailed 3D humans from color photos taken in the wild. A good reconstruction method must accurately capture these details, while also being robust to novel clothing and poses. The goal of this project is to create a rendering-free 3D human with free-form garment reconstruction and appropriate texture.

Code (Coming Soon)

Temporal Action Localization Visualization Tool

Impressive progress has been reported in recent literature for action recognition. This trend motivates another challenging topic, temporal action localization: given a long untrimmed video, “when does a specific action start and end?” This problem is important because real applications usually involve long untrimmed videos, which can be highly unconstrained in space and time, and one video can contain multiple action instances plus background scenes or other activities. However, there is practically no code available to visualize the results and compare them with the ground truth for a given video. The only things currently available are the quantitative results, which are evaluated via the code provided with the respective dataset. This visualization tool is designed to bridge this gap and observe the performance of any PyTorch model on Temporal Activity Localization. It has been built with HTML, CSS, JS and Python.


Computer Vision based Robot Locomotion

Vision-based robot navigation has long been a fundamental goal in both robotics and computer vision research. However, a robot does not need a full semantic labelling of its environment in order to move. Since we are dealing with ground-based robots, only the floor region of the environment is needed to navigate a path. Depth information is particularly useful in predicting how far the robot can move in a particular direction. Hence, the marriage of these two vision tasks provides us with a free-space map that enables the robot to move freely in a given direction. Building on this, we have designed a novel motion control algorithm that enables the robot to navigate through obstacles in its path. This is implemented in PyTorch.


Autonomous Robot Locomotion in prespecified path

Robots have helped humans greatly in achieving everyday tasks. Robots are designed to work in any environment and perform tasks on behalf of humans. They operate under real-world and real-time constraints where sensors and effectors with specific physical characteristics have to be controlled. In many cases, those robots are controlled manually to move from one destination to another. An Unmanned Ground Vehicle (UGV) is a vehicle that operates while in contact with the ground and without an onboard human presence. We have used one such robot to demonstrate custom paths that the user can choose. Currently we have implemented 2 such custom paths. This has been implemented in Python and ROS.


Light-weight Salient Object Detection

Salient object detection is a prevalent computer vision task that has applications ranging from abnormality detection to abnormality processing. Context modelling is an important criterion in the domain of saliency detection. A global context helps in determining the salient object in a given image by contrasting away other objects in the global view of the scene. Local context features, however, detect the boundaries of the salient object with higher accuracy in a given region. To incorporate the best of both worlds, our proposed SaLite model uses both global and local contextual features. It is an encoder-decoder based architecture in which the encoder uses a lightweight SqueezeNet and the decoder is modelled using convolution layers. This has been implemented in PyTorch.

Code / Arxiv

Boundary Growing Algorithm for Recovery of Torso from Corrupt Face

Automatic face detection has been intensively studied for human-related recognition systems. However, there has been very little work on recovering a face from a corrupted face image. We have designed a boundary growing algorithm in which we incrementally grow the boundary of the corrupted face image and pass it into a HAAR cascade classifier to get a confidence score. We repeat this until we reach the maximum confidence. After that, we use tailor measurements to recover the torso part of the human. This has been implemented using OpenCV and Python.
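The boundary-growing loop described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: `grow_boundary`, `score_fn`, `step` and `max_iters` are hypothetical names, `score_fn` stands in for whatever confidence the HAAR cascade classifier returns for a region, and stopping at the first drop in confidence is an assumption about how "maximum confidence" is detected.

```python
from typing import Callable, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def grow_boundary(
    box: Box,
    image_size: Tuple[int, int],
    score_fn: Callable[[Box], float],  # hypothetical stand-in for the classifier confidence
    step: int = 4,
    max_iters: int = 50,
) -> Tuple[Box, float]:
    """Grow `box` by `step` pixels per side until the confidence stops improving."""
    img_w, img_h = image_size
    x, y, w, h = box
    best_box, best_score = box, score_fn(box)
    for _ in range(max_iters):
        # Expand every side by `step`, clipped to the image bounds.
        left, top = max(0, x - step), max(0, y - step)
        right, bottom = min(img_w, x + w + step), min(img_h, y + h + step)
        grown = (left, top, right - left, bottom - top)
        if grown == (x, y, w, h):
            break  # the box already covers the whole image
        score = score_fn(grown)
        if score < best_score:
            break  # assumption: the first drop marks the confidence peak
        if score > best_score:
            best_box, best_score = grown, score
        x, y, w, h = grown
    return best_box, best_score
```

In practice `score_fn` would crop the image to the box and run the detector on the crop; it is kept abstract here so that the sketch does not depend on any particular OpenCV call.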


BaseLine Remover from Document Images

This is a small MATLAB implementation for removing baselines from document images using the Edge Directional Kernel described in the paper "Edge enhancement algorithm for low-dose X-ray fluoroscopic imaging" by Lee et al. Baseline removal is an important topic in Document Image Analysis. In this paper, Lee et al. proposed removing noise from X-ray images using an Edge Directional Kernel and a High Pass Filter, but since our noise is only the baseline, we used a clever trick and implemented only the Edge Directional Kernel, and the results are quite neat. This method works well for half-page documents and cropped line images; full-page images may also work well if they are preprocessed first.

Code / Paper

Students Supervised/Mentored
Academic Services
  • COM3013: Computational Intelligence (2022-23), University of Surrey

  • Technical Programme Committee:
  • ML for Autonomous Driving Workshop, NeurIPS

  • Conference Review Services:

  • Journal Review Services:
  • Springer Nature Computer Science
  • International Journal of Human-Computer Interaction
  • IEEE Transactions on Circuits and Systems for Video Technology
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
  • IEEE Transactions on Image Processing
  • Elsevier Computer Vision and Image Understanding


Copyright © Sauradip Nag. Last updated Aug 2022 | Template provided by Dr. Jon Barron