Key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. MaskFormer is a simple mask classification model that predicts a set of binary masks, each associated with a single global class label prediction.
Most deep learning-based semantic segmentation approaches, like FCN, formulate semantic segmentation as per-pixel classification, applying a classification loss to each output pixel. Instead of classifying each pixel, mask classification-based methods predict a set of binary masks, each associated with a single class prediction. The more flexible mask classification dominates the field of instance-level segmentation.
MaskFormer overview: A backbone is used to extract image features F. A pixel decoder gradually upsamples image features to extract per-pixel embeddings Epixel. A transformer decoder attends to image features and produces N per-segment embeddings Q. These embeddings independently generate N class predictions together with N corresponding mask embeddings Emask. The model then predicts N possibly overlapping binary masks via a dot product between the pixel embeddings Epixel and the mask embeddings Emask, followed by a sigmoid activation. For the semantic segmentation task, the final prediction is generated by combining the N binary masks with their class predictions using a simple matrix multiplication. The model contains three modules:
- a pixel-level module, which extracts per-pixel embeddings used to generate binary mask predictions. A backbone generates a (typically) low-resolution image feature map F. Then, a pixel decoder gradually upsamples the features to generate per-pixel embeddings Epixel. Any per-pixel classification-based segmentation model fits the pixel-level module design, including recent Transformer-based models. MaskFormer seamlessly converts such a model to mask classification.
- a transformer module, where a stack of Transformer decoder layers computes N per-segment embeddings
- a segmentation module, which generates predictions from these embeddings.
The new model solves both semantic- and instance-level segmentation tasks in a unified manner: no changes to the model, losses, or training procedure are required. While MaskFormer performs on par with per-pixel classification models on Cityscapes, which has only a few diverse classes, it demonstrates superior performance on datasets with larger vocabularies. It has been observed that the improvements in semantic segmentation indeed stem from the shift from per-pixel classification to mask classification. MaskFormer outperforms the best per-pixel classification-based models while having fewer parameters and faster inference time. This observation suggests that on datasets where class recognition is relatively easy, the main challenge for mask classification-based approaches is pixel-level accuracy (i.e., mask quality).
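The semantic inference step mentioned above, combining the N binary masks with their class predictions via a matrix multiplication, can be sketched as follows. The inputs here are random placeholders with hypothetical sizes; the key operation is marginalizing over the N segments to get a per-pixel class distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, H, W = 100, 10, 32, 32  # hypothetical sizes for illustration

# Stand-ins for the model's outputs: N class distributions (including a
# final "no object" column) and N sigmoid mask probabilities.
class_probs = rng.random((N, K + 1))
class_probs /= class_probs.sum(-1, keepdims=True)
binary_masks = rng.random((N, H, W))

# Drop the "no object" column, then sum over segments:
# p(class k at pixel (h, w)) = sum_n p_n(k) * m_n(h, w)
semantic_probs = np.einsum("nk,nhw->khw", class_probs[:, :-1], binary_masks)
semantic_seg = semantic_probs.argmax(0)  # (H, W) label map
```

Viewed as matrices, this is just a product of the (N, K) class matrix with the (N, H*W) mask matrix, which is why the paper can describe the whole combination step as a single matrix multiplication.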