Per-Pixel Classification is Not All You Need for Semantic Segmentation

Objective

Key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner, using the exact same model, loss, and training procedure. The paper proposes MaskFormer, a simple mask classification model that predicts a set of binary masks, each associated with a single global class label prediction.

Summary

Introduction

Most deep learning-based semantic segmentation approaches, such as FCN, formulate semantic segmentation as per-pixel classification, applying a classification loss to each output pixel. Instead of classifying each pixel, mask classification-based methods predict a set of binary masks, each associated with a single class prediction. The more flexible mask classification already dominates the field of instance-level segmentation.
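
To make the contrast concrete, here is a minimal sketch of the two output formats (all tensor names and sizes are illustrative assumptions, not taken from the paper or its code):

```python
import torch

H, W, K, N = 128, 128, 150, 100  # illustrative image size, class count, query count

# Per-pixel classification (e.g., FCN-style): one K-way prediction per pixel,
# trained with a per-pixel cross-entropy loss.
per_pixel_logits = torch.randn(K, H, W)           # [K, H, W]
per_pixel_labels = torch.randint(0, K, (H, W))    # ground-truth class per pixel
per_pixel_loss = torch.nn.functional.cross_entropy(
    per_pixel_logits.unsqueeze(0), per_pixel_labels.unsqueeze(0))

# Mask classification (MaskFormer-style): N binary masks, each paired with a
# single class prediction over K classes plus a "no object" label.
mask_logits = torch.randn(N, H, W)     # N binary mask predictions
class_logits = torch.randn(N, K + 1)   # one class label prediction per mask
```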

MaskFormer vs Per-Pixel

Network Structure

MaskFormer overview: A backbone is used to extract image features F. A pixel decoder gradually upsamples the image features to extract per-pixel embeddings Epixel. A transformer decoder attends to the image features and produces N per-segment embeddings Q. The embeddings independently generate N class predictions with N corresponding mask embeddings Emask. The model then predicts N possibly overlapping binary masks via a dot product between the pixel embeddings Epixel and the mask embeddings Emask, followed by a sigmoid activation. For the semantic segmentation task, the final prediction is obtained by combining the N binary masks with their class predictions using a simple matrix multiplication (see the sketch after the module list below). The model contains three modules:

  1. a pixel-level module that extracts the per-pixel embeddings used to generate binary mask predictions. A backbone generates a (typically) low-resolution image feature map F. Then, a pixel decoder gradually upsamples the features to generate per-pixel embeddings Epixel. Any per-pixel classification-based segmentation model fits the pixel-level module design, including recent Transformer-based models; MaskFormer seamlessly converts such a model to mask classification.
  2. a transformer module, where a stack of Transformer decoder layers computes the N per-segment embeddings Q.
  3. a segmentation module, which generates the final predictions from these embeddings.
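
A minimal sketch of the segmentation module's inference-time combination described above. Tensor shapes and names here are illustrative assumptions, not the authors' implementation: mask logits come from a dot product between per-pixel and mask embeddings, and the semantic output from a matrix multiplication of class probabilities with the sigmoid masks.

```python
import torch

# Illustrative shapes (assumptions, not the paper's actual configuration):
# E_pixel: [C_e, H, W]      per-pixel embeddings from the pixel decoder
# E_mask:  [N, C_e]         per-segment mask embeddings from the transformer decoder
# class_logits: [N, K + 1]  per-segment class logits (K classes + "no object")
C_e, H, W, N, K = 256, 128, 128, 100, 150
E_pixel = torch.randn(C_e, H, W)
E_mask = torch.randn(N, C_e)
class_logits = torch.randn(N, K + 1)

# Binary mask predictions: dot product followed by sigmoid -> [N, H, W]
mask_probs = torch.einsum("nc,chw->nhw", E_mask, E_pixel).sigmoid()

# Class probabilities, dropping the "no object" column -> [N, K]
class_probs = class_logits.softmax(dim=-1)[:, :-1]

# Semantic segmentation: combine masks and class scores with a matrix
# multiplication, then take the most likely class per pixel -> [H, W]
semantic_scores = torch.einsum("nk,nhw->khw", class_probs, mask_probs)
semantic_map = semantic_scores.argmax(dim=0)
```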
MaskFormer vs Per-Pixel

Results

The new model solves both semantic- and instance-level segmentation tasks in a unified manner: no changes to the model, losses, or training procedure are required. While MaskFormer performs on par with per-pixel classification models on Cityscapes, which has only a small number of classes, it demonstrates superior performance on datasets with a larger vocabulary. The authors observe that the improvements in semantic segmentation indeed stem from the shift from per-pixel classification to mask classification. MaskFormer outperforms the best per-pixel classification-based models while having fewer parameters and faster inference time. This observation suggests that on datasets where class recognition is relatively easy, the main challenge for mask classification-based approaches is pixel-level accuracy (i.e., mask quality).

Further Reading

1. Paper
2. GitHub Implementation