Key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. MaskFormer is a simple mask classification model that predicts a set of binary masks, each associated with a single global class label prediction.
Most deep learning-based semantic segmentation approaches, like FCN, formulate semantic segmentation as per-pixel classification, applying a classification loss to each output pixel. Instead of classifying each pixel, mask classification-based methods predict a set of binary masks, each associated with a single class prediction. The more flexible mask classification dominates the field of instance-level segmentation.
MaskFormer overview: A backbone is used to extract image features F. A pixel decoder gradually upsamples image features to extract per-pixel embeddings Epixel. A transformer decoder attends to image features and produces N per-segment embeddings Q. These embeddings independently generate N class predictions together with N corresponding mask embeddings Emask. The model then predicts N possibly overlapping binary masks via a dot product between the pixel embeddings Epixel and the mask embeddings Emask, followed by a sigmoid activation. For the semantic segmentation task, the final prediction is generated by combining the N binary masks with their class predictions using a simple matrix multiplication. The model contains three modules:
- a pixel-level module, which extracts per-pixel embeddings used to generate binary mask predictions. A backbone generates a (typically) low-resolution image feature map F. Then, a pixel decoder gradually upsamples the features to generate per-pixel embeddings Epixel. Any per-pixel classification-based segmentation model fits the pixel-level module design, including recent Transformer-based models. MaskFormer seamlessly converts such a model to mask classification.
- a transformer module, where a stack of Transformer decoder layers computes N per-segment embeddings
- a segmentation module, which generates predictions from these embeddings.
The new model solves both semantic- and instance-level segmentation tasks in a unified manner: no changes to the model, losses, or training procedure are required. While MaskFormer performs on par with per-pixel classification models on Cityscapes, which has only a few diverse classes, it demonstrates superior performance on datasets with larger vocabularies. It has been observed that the improvements in semantic segmentation indeed stem from the shift from per-pixel classification to mask classification. MaskFormer outperforms the best per-pixel classification-based models while having fewer parameters and faster inference time. This observation suggests that on datasets where class recognition is relatively easy, the main challenge for mask classification-based approaches is pixel-level accuracy (i.e., mask quality).
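The semantic inference step mentioned above, combining the N binary masks with their class predictions via a matrix multiplication, can be sketched as follows. The inputs here are random placeholders with hypothetical sizes; the key operation is marginalizing over the N segments to get a per-pixel class distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, H, W = 100, 10, 32, 32  # hypothetical sizes for illustration

# Stand-ins for the model's outputs: N class distributions (including a
# final "no object" column) and N sigmoid mask probabilities.
class_probs = rng.random((N, K + 1))
class_probs /= class_probs.sum(-1, keepdims=True)
binary_masks = rng.random((N, H, W))

# Drop the "no object" column, then sum over segments:
# p(class k at pixel (h, w)) = sum_n p_n(k) * m_n(h, w)
semantic_probs = np.einsum("nk,nhw->khw", class_probs[:, :-1], binary_masks)
semantic_seg = semantic_probs.argmax(0)  # (H, W) label map
```

Viewed as matrices, this is just a product of the (N, K) class matrix with the (N, H*W) mask matrix, which is why the paper can describe the whole combination step as a single matrix multiplication.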