Proposal-Free Temporal Action Detection with Global Segmentation Mask Learning

1CVSSP, University of Surrey, UK
2iFlyTek-Surrey Joint Research Center on Artificial Intelligence, UK
3Surrey Institute for People-Centred Artificial Intelligence, UK

Published in ECCV 2022

TAGS converts the regression-based temporal action detection task into a classification-only problem.

Overview of existing approaches vs. our approach.

Abstract

Existing temporal action detection (TAD) methods rely on generating an overwhelmingly large number of proposals per video. This leads to complex model designs due to proposal generation and/or per-proposal action instance evaluation and the resultant high computational cost. In this work, for the first time, we propose a proposal-free Temporal Action detection model with Global Segmentation mask (TAGS). Our core idea is to learn a global segmentation mask of each action instance jointly at the full video length. The TAGS model differs significantly from the conventional proposal-based methods by focusing on global temporal representation learning to directly detect local start and end points of action instances without proposals. Further, by modeling TAD holistically rather than locally at the individual proposal level, TAGS needs a much simpler model architecture with lower computational cost. Extensive experiments show that despite its simpler design, TAGS outperforms existing TAD methods, achieving new state-of-the-art performance on two benchmarks. Importantly, it is ~20x faster to train and ~1.6x more efficient for inference.

Video

Anchor-Based Approaches

Motivated by anchor-based object detection (e.g., Faster R-CNN), these methods produce a large number (~8K) of action proposals (i.e., candidate temporal boundaries) for a given video, which are then adjusted and scored as foreground or background. The foreground proposals are subsequently used for action classification.

Anchor-Free Approaches

Motivated by anchor-free object detection (e.g., FCOS), these approaches directly predict the start/end points of action boundaries rather than adjusting pre-defined anchors, making them faster than anchor-based methods. The learned proposals are then used for action classification.


Proposal-Free Action Detection

We briefly describe the components that make our approach proposal-free.

Multi-Scale Snippet Embedding

We start with a Transformer-based multi-scale snippet embedding module, a step beyond the 1-D convolutional backbones used in existing methods. Transformers excel at capturing global context, which ordinarily benefits only the classification task, since localization is typically driven by local context around boundaries. How is this handled? Instead of regressing boundaries locally, we predict a global (video-level) segmentation mask, which genuinely requires global context. The multi-scale Transformer models this temporal context at multiple hierarchies; we discuss the mask formulation in the following sub-sections.
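Below is a minimal sketch of how such a multi-scale snippet encoder could look. It is an illustrative re-implementation under our own assumptions (pre-extracted snippet features of shape (batch, T, C); the scales, layer sizes and pooling/upsampling choices are hypothetical), not the released TAGS code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiScaleSnippetEncoder(nn.Module):
        """Self-attention over snippet features at several temporal scales."""
        def __init__(self, dim=512, heads=8, scales=(1, 2, 4)):
            super().__init__()
            self.scales = scales
            # One Transformer encoder layer per temporal scale (hierarchy level).
            self.encoders = nn.ModuleList(
                nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
                for _ in scales
            )
            self.fuse = nn.Linear(dim * len(scales), dim)

        def forward(self, x):                        # x: (B, T, C) snippet features
            B, T, C = x.shape
            outs = []
            for s, enc in zip(self.scales, self.encoders):
                # Pool the snippet sequence to a coarser temporal resolution.
                xs = F.avg_pool1d(x.transpose(1, 2), kernel_size=s, stride=s)
                xs = enc(xs.transpose(1, 2))         # global self-attention at this scale
                # Upsample back to length T so all scales can be fused per snippet.
                xs = F.interpolate(xs.transpose(1, 2), size=T, mode='linear',
                                   align_corners=False)
                outs.append(xs.transpose(1, 2))
            return self.fuse(torch.cat(outs, dim=-1))  # (B, T, C) multi-scale embedding

    emb = MultiScaleSnippetEncoder()(torch.randn(2, 100, 512))  # 2 videos, 100 snippets each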


Temporal Action Detection Heads

Action Classifier

The global context is used for snippet-level classification: each snippet of the video is classified into one of K action classes or the background class. This module consists of just a single 1-D convolution block followed by a softmax.
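As a concrete illustration, such a classification head can be written in a few lines. The sketch below assumes the background occupies the last of the K+1 output channels and uses an illustrative kernel size and feature dimension, not the paper's exact configuration.

    import torch
    import torch.nn as nn

    class SnippetClassifier(nn.Module):
        """Single 1-D convolution mapping each snippet embedding to K+1 class scores."""
        def __init__(self, dim=512, num_classes=200):
            super().__init__()
            self.conv = nn.Conv1d(dim, num_classes + 1, kernel_size=3, padding=1)

        def forward(self, x):                      # x: (B, T, C) snippet embeddings
            logits = self.conv(x.transpose(1, 2))  # (B, K+1, T)
            return logits.softmax(dim=1)           # per-snippet class probabilities P

    P = SnippetClassifier()(torch.randn(2, 100, 512))  # (2, 201, 100)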

Global Segmentation Mask

In general, the localization module estimates start/end points to predict action boundaries, a formulation that relies on local context for regression. In contrast, we estimate a video-level mask per action instance, so we use the global context from the Transformer for localization. Each global mask is action-instance specific and class agnostic. For a training video, all temporal snippets of a single action instance are assigned the same 1-D global mask. With this mask signal as learning supervision, TAGS facilitates context-aware representation learning, which brings a clear benefit in TAD accuracy. And because start/end regression is converted into a binary mask classification problem, we need neither anchor-based nor anchor-free proposals: the model is proposal-free.
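A minimal sketch of such a mask head is shown below. It assumes every video is rescaled to a fixed number of snippets T so that each snippet can output a T-dimensional mask; the layer choices are ours, for illustration only.

    import torch
    import torch.nn as nn

    class GlobalMaskHead(nn.Module):
        """Each snippet predicts a full video-length foreground mask."""
        def __init__(self, dim=512, num_snippets=100):
            super().__init__()
            # Map each snippet embedding to T mask logits, one per video position.
            self.conv = nn.Conv1d(dim, num_snippets, kernel_size=3, padding=1)

        def forward(self, x):                      # x: (B, T, C) snippet embeddings
            logits = self.conv(x.transpose(1, 2))  # (B, T, T)
            # M[b, :, t] is the global mask predicted by snippet t,
            # trained with per-position binary (foreground/background) labels.
            return torch.sigmoid(logits)

    M = GlobalMaskHead()(torch.randn(2, 100, 512))  # (2, 100, 100)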

Label Assignment and Inference

Ground Truth Formulation

To train TAGS, the ground truth needs to be arranged into the designed format. For a given video, we label all snippets of a single action instance (the orange or blue squares in the figure) with the same action class; all snippets outside action intervals are labeled as background. For an action snippet of a particular instance, its global mask is defined as the video-length binary mask of that action instance. Each mask is action-instance specific, and all snippets of a given instance share the same mask.
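The following sketch shows one way this label assignment could be implemented. It assumes ground-truth instances are given as (start, end, class) in snippet indices for a video rescaled to T snippets; the helper function is hypothetical.

    import numpy as np

    def build_targets(instances, T, num_classes):
        """instances: list of (start, end, action_class) with 0 <= start < end <= T."""
        cls_target = np.full(T, num_classes, dtype=np.int64)   # index K = background
        mask_target = np.zeros((T, T), dtype=np.float32)       # column t = mask for snippet t
        for start, end, k in instances:
            # Every snippet inside the instance gets the same action class ...
            cls_target[start:end] = k
            # ... and the same video-length binary mask of that instance.
            mask_target[start:end, start:end] = 1.0
        return cls_target, mask_target

    cls_t, mask_t = build_targets([(10, 30, 3), (60, 80, 7)], T=100, num_classes=200)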



TAGS Inference

Starting with the top-scoring snippets from the classifier branch, we obtain their segmentation mask predictions from the mask branch by thresholding the corresponding columns of M. To generate sufficient candidates, we apply multiple thresholds Θ, yielding action candidates of varying lengths and confidences. For each candidate, we compute its confidence score by multiplying the classification score (obtained from the corresponding top-scoring snippet in P) with the segmentation mask score (i.e., the mean predicted foreground score of the segment in M).
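A simplified decoding routine under these definitions could look as follows. Here P is the (K+1) x T classification output (background last) and M the T x T mask output for one video; the top-k value, the threshold set and the final (Soft-)NMS step are assumptions for illustration.

    import numpy as np

    def decode_candidates(P, M, top_k=20, thresholds=(0.3, 0.4, 0.5, 0.6, 0.7)):
        fg_scores = P[:-1].max(axis=0)                 # best action score per snippet
        fg_labels = P[:-1].argmax(axis=0)              # best action class per snippet
        candidates = []
        for t in np.argsort(fg_scores)[::-1][:top_k]:  # top-scoring snippets
            mask = M[:, t]                             # global mask predicted by snippet t
            for theta in thresholds:                   # multiple thresholds -> varied lengths
                fg = (mask >= theta).astype(np.int8)
                # Extract contiguous foreground runs above this threshold.
                edges = np.flatnonzero(np.diff(np.concatenate(([0], fg, [0]))))
                for s, e in zip(edges[::2], edges[1::2]):
                    score = fg_scores[t] * mask[s:e].mean()   # classification x mask score
                    candidates.append((int(s), int(e), int(fg_labels[t]), float(score)))
        return candidates                              # de-duplicated by (Soft-)NMS in practice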

Performance

TAGS achieves state-of-the-art results on two benchmark datasets, ActivityNet v1.3 and THUMOS14.



Qualitative Results

BibTeX


        @article{nag2022temporal,
          title={Temporal Action Detection with Global Segmentation Mask Learning},
          author={Nag, Sauradip and Zhu, Xiatian and Song, Yi-Zhe and Xiang, Tao},
          journal={arXiv preprint arXiv:2207.06580},
          year={2022}
        }
      