Abstract

PURPOSE

To compare the performance and explainability of the vision transformer (ViT) and convolutional neural network (CNN) architectures in predicting genomic mutations from brain MRI.

METHODS

The ViT and CNN classification models were compared on their performance in predicting the isocitrate dehydrogenase (IDH) mutation status of gliomas. Both models were fine-tuned on The Cancer Imaging Archive (TCIA) dataset and then evaluated on the TCIA dataset and on an external, independent dataset, the Japanese Cohort (JC) dataset. To evaluate explainability, gradient-weighted class activation mapping (Grad-CAM) visualizations of the CNN model were compared with attention map visualizations of the ViT model.
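For readers unfamiliar with Grad-CAM, the following is a minimal sketch of the standard computation, not the authors' implementation. It assumes the feature maps of one convolutional layer and the gradients of the class score with respect to them have already been extracted. The function name `grad_cam` and the nested-list input layout are illustrative choices; the abstract does not state which layer or software was used.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap for one convolutional layer.

    activations: nested list shaped [K][H][W], the K feature maps.
    gradients:   nested list shaped [K][H][W], the gradient of the class
                 score with respect to each feature map entry.
    Returns an [H][W] heatmap scaled to [0, 1].
    (Input layout is an assumption made for this illustration.)
    """
    K = len(activations)
    H = len(activations[0])
    W = len(activations[0][0])

    # Channel weights: average each channel's gradients over all positions.
    weights = [sum(sum(row) for row in gradients[k]) / (H * W) for k in range(K)]

    # Weighted sum of the feature maps, then ReLU to keep positive evidence.
    cam = [
        [max(0.0, sum(weights[k] * activations[k][i][j] for k in range(K)))
         for j in range(W)]
        for i in range(H)
    ]

    # Scale to [0, 1] for display; an all-zero map is returned unchanged.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Toy example: channel 0 supports the class, channel 1 opposes it.
toy_activations = [[[1, 0], [0, 0]], [[0, 0], [0, 2]]]
toy_gradients = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
toy_heatmap = grad_cam(toy_activations, toy_gradients)
```

In the toy example only the top-left position is highlighted, because the channel that is active at the bottom-right has a negative weight and is removed by the ReLU. In practice the heatmap is upsampled to the input resolution and overlaid on the MRI slice.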

RESULTS

The ViT model outperformed the CNN model on both the TCIA and JC datasets, and the differences were statistically significant (p = 0.021 and p < 0.001, respectively). The attention map of the ViT model accurately highlighted the tumor, whereas the Grad-CAM of the CNN model sometimes highlighted non-tumor areas.

CONCLUSION

The ViT model was more robust than the CNN model to differences in the image domain, and its attention map offered superior explainability.

This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]