TY - JOUR
T1 - Are deep learning models robust to partial object occlusion in visual recognition tasks?
AU - Kassaw, Kaleb
AU - Luzi, Francesco
AU - Collins, Leslie M.
AU - Malof, Jordan M.
N1 - Publisher Copyright: © 2025 Elsevier Ltd
PY - 2026/3
Y1 - 2026/3
N2 - Image classification models, including convolutional neural networks (CNNs), perform well on a variety of classification tasks but struggle under conditions of partial occlusion of relevant objects. Methods to improve performance under occlusion, including data augmentation, part-based clustering, and more inherently robust architectures, including Vision Transformer (ViT) models, have, to some extent, been evaluated on their ability to classify objects under partial occlusion. However, evaluations of these methods have largely relied on images containing artificial occlusion, since they are inexpensive to generate and label. Additionally, these methods are compared to early, now outdated models, and rarely to each other. We contribute the Image Recognition Under Occlusion (IRUO) dataset, based on the OVIS dataset in [1]. IRUO utilizes real-world and artificially occluded images to test and benchmark leading methods’ robustness to partial occlusion in visual recognition tasks. In addition, we contribute the design and results of a human study using images from IRUO evaluating human classification performance on multiple levels and types of occlusion. We find that ViT-based models show higher recognition accuracy than modern CNN-based models, which are more accurate than earlier CNN-based models, but that ViT models are still modestly below human accuracy. We also find that diffuse occlusion, in which relevant objects are seen through “holes” in occluders such as fences and leaves, can greatly reduce the accuracy of deep recognition models, especially CNNs, as compared to humans.
AB - Image classification models, including convolutional neural networks (CNNs), perform well on a variety of classification tasks but struggle under conditions of partial occlusion of relevant objects. Methods to improve performance under occlusion, including data augmentation, part-based clustering, and more inherently robust architectures, including Vision Transformer (ViT) models, have, to some extent, been evaluated on their ability to classify objects under partial occlusion. However, evaluations of these methods have largely relied on images containing artificial occlusion, since they are inexpensive to generate and label. Additionally, these methods are compared to early, now outdated models, and rarely to each other. We contribute the Image Recognition Under Occlusion (IRUO) dataset, based on the OVIS dataset in [1]. IRUO utilizes real-world and artificially occluded images to test and benchmark leading methods’ robustness to partial occlusion in visual recognition tasks. In addition, we contribute the design and results of a human study using images from IRUO evaluating human classification performance on multiple levels and types of occlusion. We find that ViT-based models show higher recognition accuracy than modern CNN-based models, which are more accurate than earlier CNN-based models, but that ViT models are still modestly below human accuracy. We also find that diffuse occlusion, in which relevant objects are seen through “holes” in occluders such as fences and leaves, can greatly reduce the accuracy of deep recognition models, especially CNNs, as compared to humans.
KW - Computer vision
KW - Deep learning
KW - Machine learning
KW - Occlusion
UR - https://www.scopus.com/pages/publications/105012485668
U2 - 10.1016/j.patcog.2025.112215
DO - 10.1016/j.patcog.2025.112215
M3 - Article
AN - SCOPUS:105012485668
SN - 0031-3203
VL - 171
JO - Pattern Recognition
JF - Pattern Recognition
M1 - 112215
ER -