Are deep learning models robust to partial object occlusion in visual recognition tasks?

Kaleb Kassaw, Francesco Luzi, Leslie M. Collins, Jordan M. Malof

Research output: Contribution to journal, Article, peer-review

Abstract

Image classification models, including convolutional neural networks (CNNs), perform well on a variety of classification tasks but struggle when relevant objects are partially occluded. Methods to improve performance under occlusion, including data augmentation, part-based clustering, and more inherently robust architectures such as Vision Transformer (ViT) models, have, to some extent, been evaluated on their ability to classify objects under partial occlusion. However, evaluations of these methods have largely relied on images containing artificial occlusion, since such images are inexpensive to generate and label. Additionally, these methods are compared to early, now outdated models, and rarely to each other. We contribute the Image Recognition Under Occlusion (IRUO) dataset, based on the OVIS dataset in [1]. IRUO uses real-world and artificially occluded images to test and benchmark leading methods' robustness to partial occlusion in visual recognition tasks. In addition, we contribute the design and results of a human study that uses images from IRUO to evaluate human classification performance across multiple levels and types of occlusion. We find that ViT-based models show higher recognition accuracy than modern CNN-based models, which are in turn more accurate than earlier CNN-based models, but that ViT models are still modestly below human accuracy. We also find that diffuse occlusion, in which relevant objects are seen through "holes" in occluders such as fences and leaves, can greatly reduce the accuracy of deep recognition models, especially CNNs, relative to humans.

Original language: English
Article number: 112215
Journal: Pattern Recognition
Volume: 171
State: Published - Mar 2026

Keywords

  • Computer vision
  • Deep learning
  • Machine learning
  • Occlusion
