Devina Mohan, Anna M M Scaife, Fiona Porter, Mike Walmsley, Micah Bowles, Quantifying uncertainty in deep learning approaches to radio galaxy classification, Monthly Notices of the Royal Astronomical Society, Volume 511, Issue 3, April 2022, Pages 3722–3740, https://doi.org/10.1093/mnras/stac223
ABSTRACT
In this work we use variational inference to quantify the degree of uncertainty in deep learning model predictions of radio galaxy classification. We show that the level of model posterior variance for individual test samples is correlated with human uncertainty when labelling radio galaxies. We explore the model performance and uncertainty calibration for different weight priors and suggest that a sparse prior produces more well-calibrated uncertainty estimates. Using the posterior distributions for individual weights, we demonstrate that we can prune 30 per cent of the fully connected layer weights without significant loss of performance by removing the weights with the lowest signal-to-noise ratio. A larger degree of pruning can be achieved using a Fisher information based ranking, but both pruning methods affect the uncertainty calibration for Fanaroff–Riley type I and type II radio galaxies differently. Like other work in this field, we experience a cold posterior effect, whereby the posterior must be down-weighted to achieve good predictive performance. We examine whether adapting the cost function to accommodate model misspecification can compensate for this effect, but find that it does not make a significant difference. We also examine the effect of principled data augmentation and find that this improves upon the baseline but also does not compensate for the observed effect. We interpret this as the cold posterior effect being due to the overly effective curation of our training sample leading to likelihood misspecification, and raise this as a potential issue for Bayesian deep learning approaches to radio galaxy classification in future.
1 INTRODUCTION
A new generation of radio astronomy facilities around the world such as the Low-Frequency Array (LOFAR; van Haarlem et al. 2013), the Murchison Widefield Array (MWA; Beardsley et al. 2019), the MeerKAT telescope (Jarvis et al. 2016), and the Australian SKA Pathfinder (ASKAP) telescope (Johnston et al. 2008) are generating large volumes of data. In order to extract scientific impact from these facilities on reasonable timescales, a natural solution has been to automate the data processing as far as possible, and this has led to the increased adoption of machine learning methodologies.
In particular for new sky surveys, automated classification algorithms are being developed to replace the by-eye approaches that were possible historically. In radio astronomy specifically, studies looking at morphological classification using convolutional neural networks (CNNs) and deep learning have become increasingly common, especially with respect to the classification of radio galaxies.
The Fanaroff–Riley (FR) classification of radio galaxies was introduced over four decades ago (Fanaroff & Riley 1974), and has been widely adopted and applied to many catalogues since then. The morphological divide seen in this classification scheme has historically been explained primarily as a consequence of differing jet dynamics. Fanaroff–Riley type I (FR I) radio galaxies have jets that are disrupted at shorter distances from the central super-massive black hole host and are therefore centrally brightened, whilst Fanaroff–Riley type II (FR II) radio galaxies have jets that remain relativistic to large distances, resulting in bright termination shocks. These observed structural differences may be due to the intrinsic power in the jets, but will also be influenced by local environmental densities (Bicknell 1995; Kaiser & Best 2007).
Intrinsic and environmental effects are difficult to disentangle using radio luminosity alone as systematic differences in particle content, environmental effects, and radiative losses make radio luminosity an unreliable proxy for jet power (Croston, Ineson & Hardcastle 2018). Hence the use of morphology is important for gaining a better physical understanding of the FR dichotomy, and of the full morphological diversity of the population, which in turn is useful for inferring the environmental impact on radio galaxy populations (Mingo et al. 2019). It is hoped that the new generation of radio surveys, with improved resolution, sensitivity, and dynamic range, will play a key part in finally answering this question.
From a deep learning perspective, the groundwork for morphological classification in this field was done by Aniyan & Thorat (2017), who used CNNs to classify FR I, FR II, and bent-tail sources. This was followed by other works involving the use of deep learning in source classification (e.g. Banfield et al. 2015; Lukic et al. 2018; Wu et al. 2018). More recently, Bowles et al. (2021) showed that an attention-gated CNN could perform classification of radio galaxies with equivalent performance to other applications in the literature, but using ∼50 per cent fewer learnable parameters; Scaife & Porter (2021) showed that using group-equivariant convolutional layers that preserve the rotational and reflectional isometries of the Euclidean group resulted in improved overall model performance and stability of model confidence for radio galaxies at different orientations; and Bastien et al. (2021) generated synthetic populations of radio galaxies using structured variational inference.
Applying deep learning to radio astronomy comes with unique challenges. Unlike terrestrial labelled data sets such as MNIST, which contains ∼70 000 images, and ImageNet, which contains 14 million images, there is a dearth of labelled data in radio astronomy. The largest labelled data sets for Fanaroff–Riley classification contain of order $10^3$ labelled images. For deep learning applications, this creates the need to augment data sets. However, this augmentation can lead to any biases associated with these small data sets being propagated into the larger augmented data sets used to train deep learning models and hence into any analysis that uses the outputs of those models.
Another challenge is that of artefacts, misclassified objects, and ambiguity arising from how the morphologies in these data sets are defined. Underestimation and miscalibration of uncertainties associated with model outcomes for data samples that are peripheral to the main data mass are well documented in the machine learning literature (see e.g. Guo et al. 2017b), and it has been demonstrated that out-of-distribution data points will be misclassified with arbitrarily high precision by standard neural networks (Hein, Andriushchenko & Bitterwolf 2018).
To provide uncertainties on model outputs, probabilistic methods such as Bayesian (and approximately Bayesian) neural networks are required (MacKay 1992a,b). When properly calibrated, the uncertainty estimates from these approaches can serve as a diagnostic tool to mitigate the effect of increasingly distant data points and out-of-distribution examples. However, with the exception of Scaife & Porter (2021), to date there has been little work done on understanding the degree of confidence with which CNN models predict the class of individual radio galaxies. In modern radio astronomy, where astrophysical analysis is driven by population analyses, quantifying the confidence with which each object is assigned to a particular classification is crucial for understanding the propagation of uncertainties within that analysis.
In this work we use variational inference (VI) to implement a fully Bayesian CNN and quantify the degree of uncertainty in deep learning predictions of radio galaxy classifications. This differs from the approach of Scaife & Porter (2021) who used dropout as a Bayesian approximation to estimate model confidence (Gal & Ghahramani 2016). They studied one specific aspect of the model performance (variation with sample orientation), and as such it is not directly comparable to this work. We compare the variance of our posterior predictions to qualifications present in our test data that indicate the level of human confidence in assigning a classification label and show that model uncertainty is correlated with human uncertainty. We also investigate a number of the challenges that face the systematic use of Bayesian deep learning from the perspective of radio astronomy.
The structure of the paper is as follows: in Section 2 we introduce the variational inference method and its application to neural networks; in Section 3 we describe how different measures of uncertainty can be recovered from the learned variational posteriors using this approach; and in Section 4 we describe the data set being used in this work. In Section 5 we introduce the convolutional neural network that forms the primary model for this work, as well as how it is trained; in Section 6 we describe the results of that training in terms of model performance and uncertainty quantification in the context of the specific radio galaxy classification problem being addressed, as well as the wider machine learning literature; in Section 7 we discuss the cold posterior effect and hypotheses for mitigating it; and in Section 8 we draw our conclusions.
2 VARIATIONAL INFERENCE FOR DEEP LEARNING
The notion of ‘noisy weights’ that can adapt during training was first proposed by Hinton & van Camp (1993) to reduce the amount of information in network weights and prevent overfitting in neural networks. Graves (2011) developed a stochastic variational inference (SVI) method by applying stochastic gradient descent to VI using biased estimates of gradients. SVI allows VI to scale to large data sets by taking advantage of mini-batching and Graves (2011) considered various choices of standard prior and posterior distributions such as the Delta function, Gaussian, and Laplace distributions. Blundell et al. (2015) built on this work and proposed the Bayes by backprop (BBB) algorithm, which combines stochastic VI with the reparameterization trick (Kingma, Salimans & Welling 2015) to overcome the problems encountered while using backpropagation with SVI. Using this algorithm, one can calculate unbiased estimates of the gradients and use any tractable probability distribution to represent uncertainties in the weights.
To set up the problem of Bayesian inference, we consider a set of observations, $\mathbf{x}$, and a set of hypotheses, $\mathbf{z}$. For instance, for a neural network $\mathbf{z}$ are the parameters of the model.
2.1 The variational inference cost function
In variational inference, a parameterized probability distribution, $q(z)$, is defined as a variational approximation to the true posterior, $p(z|x)$. The family of probability distributions, $\mathbb{D}$, defines the complexity of the solution that can be modelled. For instance, a family of Gaussians parameterized by mean, μ, and variance, σ2, may be defined. The goal of VI is to find the selection of parameters that most closely approximates the exact posterior.
The name ELBO stems from the fact that the log evidence is bounded from below by this function such that $\log p(x) \ge \mathrm{ELBO}$. Consequently, variational inference reduces Bayesian inference to an optimization problem that can then be solved by standard deep learning optimization algorithms such as SGD and Adam.
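For reference, for a variational distribution $q(z)$ approximating the posterior $p(z|x)$, the ELBO takes its standard form

$$
\mathrm{ELBO}(q) = \mathbb{E}_{q(z)}\left[\log p(x|z)\right] - \mathrm{KL}\left[q(z)\,\|\,p(z)\right],
$$

and the log evidence decomposes as $\log p(x) = \mathrm{ELBO}(q) + \mathrm{KL}\left[q(z)\,\|\,p(z|x)\right]$, so that maximizing the ELBO is equivalent to minimizing the KL divergence between $q(z)$ and the true posterior.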
2.2 VI for neural networks
The cost function shown in equation (11) is composed of two components: the first term is a complexity cost that depends on the prior over the weights, P(w), and the second is a likelihood cost that depends on the data and describes how well the model fits to the data. The cost function also has a minimum description length interpretation according to which the best model is the one that minimizes the cost of describing the model and the misfit between the model and the data to a receiver (Hinton & van Camp 1993; Graves 2011).
The cost, $\mathcal{F}$, is an expectation of the function, f(w, θ), with respect to the variational posterior, q(w|θ). In order to optimize the cost function, we need to calculate its gradient with respect to the variational parameters, θ. To make $\mathcal{F}(D,\theta)$ differentiable, one must first employ the reparameterization trick (Kingma & Welling 2013; Kingma et al. 2015) to calculate samples from the variational posterior, q(w|θ), that are differentiable, and then use Monte Carlo (MC) estimates of the gradients to approximate the cost function.
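As a minimal PyTorch sketch of how the reparameterization trick is applied to the weights of a single fully connected layer, under the Gaussian variational posterior used in this work with the parametrization σ = log(1 + exp(ρ)): the class and parameter names below are our own illustration, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalLinear(nn.Module):
    """Fully connected layer with an independent Gaussian variational posterior
    over each weight, sampled via the reparameterization trick so that gradients
    of the MC-estimated cost flow to theta = (mu, rho)."""

    def __init__(self, in_features, out_features):
        super().__init__()
        # Variational parameters; sigma = softplus(rho) keeps sigma positive.
        self.w_mu = nn.Parameter(torch.empty(out_features, in_features).uniform_(-0.1, 0.1))
        self.w_rho = nn.Parameter(torch.empty(out_features, in_features).uniform_(-5.0, -4.0))
        self.b_mu = nn.Parameter(torch.empty(out_features).uniform_(-0.1, 0.1))
        self.b_rho = nn.Parameter(torch.empty(out_features).uniform_(-5.0, -4.0))

    def forward(self, x):
        # w = mu + sigma * eps, eps ~ N(0, 1): a differentiable sample from q(w|theta).
        w = self.w_mu + F.softplus(self.w_rho) * torch.randn_like(self.w_mu)
        b = self.b_mu + F.softplus(self.b_rho) * torch.randn_like(self.b_mu)
        return F.linear(x, w, b)
```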
2.2.1 Mini-batching
To take advantage of mini-batch optimization for BBB, a weighted complexity cost is used (Graves 2011). This is because the likelihood cost is calculated for each mini-batch to update the weights when the model sees new data, whereas the complexity cost, which involves calculating the prior and posterior over the weights of the entire network, should be calculated only once per epoch because it is independent of data.
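As an illustration, a simple uniform weighting assigns 1/M of the complexity cost to each of the M mini-batches, so that summing over an epoch counts the full complexity cost exactly once; other (non-uniform) weighting schemes are also possible, and the sketch below is not the authors' exact implementation.

```python
def minibatch_cost(nll, kl, num_batches):
    """Likelihood cost for the current mini-batch plus a 1/M share of the
    complexity cost, so that summing over all M mini-batches in an epoch
    counts the full KL term exactly once."""
    return nll + kl / num_batches
```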
Several published results have reported a cold posterior effect which involves further down-weighting of the complexity cost (e.g. Wenzel et al. 2020). This effect is discussed in more detail in Sections 5.3 and 7.
2.3 Variational posteriors
The reparameterization trick allows us to use a variety of family densities for the variational distribution. Kingma & Welling (2013) give some examples of q(w|θ) for which the reparameterization trick can be applied. These include tractable families of densities such as the exponential, logistic, and Cauchy distributions, as well as any location-scale family, such as the Gaussian, Laplace, or uniform densities, which can be used with the function t(·) = location + scale · ϵ.
Following Blundell et al. (2015), we can then calculate the gradient of the cost function with respect to the variational parameters θ = (μ, ρ) using the standard optimization algorithms that are used with neural networks.
2.4 Priors
In this work we also consider a Laplace prior which is parameterized by a location parameter, μ, and a scale parameter, b, and a Laplace Mixture Model (LMM) prior with two mixture components weighted by π, similar in form to the definition of the GMM prior.
Some regularization techniques used with point-estimate neural networks have theoretical justifications using Bayesian inference. For instance, it can be shown that maximum a posteriori (MAP) estimation of neural networks with some priors is equivalent to regularization (Jospin et al. 2020). For example, using a Gaussian prior over the weights is equivalent to weight decay regularization, whereas using a Laplace prior induces L1 regularization. Gal & Ghahramani (2016) showed that dropout can also be considered an approximation to variational inference, where the variational family is a Bernoulli distribution.
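As a brief illustration of this correspondence, for a zero-mean Gaussian prior with variance $\sigma^2$ the MAP objective can be written as

$$
\hat{w}_{\rm MAP} = \arg\max_{w}\left[\log p(D|w) + \log p(w)\right]
= \arg\min_{w}\left[-\log p(D|w) + \frac{1}{2\sigma^{2}}\lVert w \rVert_{2}^{2}\right],
$$

i.e. L2 weight decay with coefficient $1/(2\sigma^{2})$, while a Laplace prior, $p(w) \propto \exp(-\lVert w \rVert_{1}/b)$, yields an L1 penalty with coefficient $1/b$.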
2.5 Bayesian convolutional neural networks
The BBB algorithm can be extended to convolutional neural networks by sampling the weights from a variational distribution defined over the shared weights of the convolutional kernels. This is followed by fully connected layers that have weights with a variational distribution defined over them. For simplicity, our implementation differs from that proposed by Shridhar, Laumann & Liwicki (2019), in which the activations of each convolutional layer were sampled instead of the weights in order to accelerate convergence.
2.6 Posterior predictive distribution
From equation (24), we see how the posterior predictive distribution is an average of all possible variational parameters weighted by their posterior probability.
The posterior predictive distribution can be estimated using MC samples as follows:
(i) Sample variational parameters from the variational posterior distribution conditioned on data $D$: $w^{(i)} \sim q(w|D)$.
(ii) Sample a prediction $D^{*(i)}$ from $q(D^{*}|w^{(i)})$.
(iii) Repeat steps (i) and (ii) to construct an approximation to $q(D^{*}|D)$ using $N$ samples such that:
$$
\begin{aligned}
q(D^{*}|D) &= \mathbb{E}_{q(w|D)}\left[q(D^{*}|w)\right] &(25)\\
&= \mathbb{E}_{w^{(i)} \sim q(w|D)}\left[q\left(D^{*}|w^{(i)}\right)\right] &(26)\\
&\approx \frac{1}{N}\sum_{i=1}^{N} q\left(D^{*}|w^{(i)}\right). &(27)
\end{aligned}
$$
Thus BBB can be used to construct an approximate posterior predictive distribution, which can further be used to estimate uncertainties.
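A minimal sketch of this Monte Carlo estimate, assuming a PyTorch model whose forward pass re-samples the weights from q(w|θ) on every call and returns log-Softmax outputs (as in the architecture of Section 5); the function name and interface are illustrative rather than the authors' code.

```python
import torch

@torch.no_grad()
def posterior_predictive(model, x, n_samples=200):
    """Average the class probabilities over n_samples stochastic forward passes.
    Assumes `model` re-samples its weights from q(w|theta) on every call and
    returns log-Softmax outputs."""
    probs = torch.stack([model(x).exp() for _ in range(n_samples)])  # (N, batch, n_classes)
    return probs.mean(dim=0), probs  # predictive mean and the per-pass probabilities
```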
3 UNCERTAINTY QUANTIFICATION
The sources of uncertainty in the predictions of neural network models can broadly be divided into two categories: epistemic and aleatoric (Gal 2016; Abdar et al. 2021). Epistemic uncertainty quantifies how uncertain the model is in its predictions and this can be reduced with more data. Aleatoric uncertainty on the other hand represents the uncertainty inherent in the data and cannot be reduced. Uncertainty inherent in the input data along with model uncertainty is propagated to the output, which gives us predictive uncertainty (Abdar et al. 2021). BBB allows us to capture model uncertainty by defining distributions over model parameters.
3.1 Predictive entropy
We use the natural logarithm for all the equations described in this section and the values are reported in nats, the natural unit of information. For the binary classification problem considered here, the predictive entropy therefore attains a maximum value of ln 2 ≈ 0.693 nats when the model is maximally uncertain, and a minimum value close to zero when it is maximally confident.
3.2 Mutual information
3.3 Average entropy
It can be seen from equations (32) and (33) that the predictive uncertainty in equation (30) is a sum of epistemic uncertainty and aleatoric uncertainty.
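A sketch of how these three quantities can be computed from Monte Carlo samples of the model output, consistent with the decomposition above; this is illustrative code rather than the authors' implementation, and `probs` is assumed to hold the per-pass Softmax outputs for a single test image.

```python
import torch

def uncertainty_metrics(probs, eps=1e-12):
    """probs: (n_samples, n_classes) Softmax outputs from repeated stochastic
    forward passes for one test image. Returns predictive entropy (PE), mutual
    information (MI) and average entropy (AE) in nats, using MI = PE - AE."""
    mean_p = probs.mean(dim=0)
    pe = -(mean_p * (mean_p + eps).log()).sum()             # entropy of the mean prediction
    ae = -(probs * (probs + eps).log()).sum(dim=1).mean()   # mean per-pass entropy (aleatoric)
    mi = pe - ae                                            # epistemic component
    return pe.item(), mi.item(), ae.item()
```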
3.4 Overlap index
In addition to the uncertainty metrics described above, we also define two overlap indices: $\eta_{\rm soft}$, to quantify how much the distributions of predicted Softmax values for the two classes overlap; and $\eta_{\rm logits}$, to quantify how much the distributions of logits for the two classes overlap. A higher degree of overlap indicates a higher level of predictive uncertainty. The overlap parameters have contributions from both epistemic and aleatoric uncertainties.
The indices are evaluated on a discrete grid $\lbrace z_i\rbrace_{i=1}^{M_z}$, where $M_z$ defines the step size of $z$ such that the grid ranges from zero to one in $M_z$ steps.
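As an illustration only, one simple way to estimate such an overlap index from the per-class distributions of a scalar model output rescaled to [0, 1] is via normalized histograms on a common grid; the exact estimator used in this work may differ from this sketch.

```python
import numpy as np

def overlap_index(samples_a, samples_b, m_z=100):
    """Estimate the overlap of two distributions of a scalar model output
    (e.g. per-class Softmax values, or logits rescaled to [0, 1]) using
    normalized histograms on a common grid of m_z bins."""
    dz = 1.0 / m_z
    f_a, _ = np.histogram(samples_a, bins=m_z, range=(0.0, 1.0), density=True)
    f_b, _ = np.histogram(samples_b, bins=m_z, range=(0.0, 1.0), density=True)
    return float(np.sum(np.minimum(f_a, f_b)) * dz)
```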
3.5 Uncertainty calibration
Bayesian neural networks allow us to obtain uncertainty estimates on model predictions, but these estimates have often been shown to be poorly calibrated due to the use of approximate inference methods and model misspecification (Foong et al. 2020; Krishnan & Tickoo 2020). A model is considered to be well calibrated if the degree of uncertainty is correlated with the accuracy, i.e. low uncertainty predictions are more likely to be classified correctly and high uncertainty predictions are more likely to be misclassified. Therefore, when comparing different models one must also take the calibration of uncertainty metrics into account, in addition to the overall accuracy of a model.
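For illustration, a simple binned uncertainty calibration error of the kind summarized by the cUCE values reported later can be sketched as follows; the class-wise averaging and the exact binning used in this work are not shown, and the function below is an assumption rather than the paper's definition.

```python
import numpy as np

def uncertainty_calibration_error(uncertainty, misclassified, n_bins=10):
    """uncertainty: per-sample uncertainty rescaled to [0, 1];
    misclassified: 1.0 where the prediction is wrong, else 0.0.
    Returns a binned calibration error (as a percentage): the weighted mean
    absolute difference between uncertainty and error rate in each bin."""
    uncertainty = np.asarray(uncertainty, dtype=float)
    misclassified = np.asarray(misclassified, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    uce, n_total = 0.0, len(uncertainty)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (uncertainty >= lo) & ((uncertainty < hi) | (hi == edges[-1]))
        if mask.any():
            uce += mask.sum() / n_total * abs(misclassified[mask].mean() - uncertainty[mask].mean())
    return 100.0 * uce
```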
4 DATA
Radio galaxies are a sub-class of active galactic nuclei (AGN). These galaxies are characterized by large-scale jets and lobes that can extend up to mega-parsec distances from the central black hole and are observed in the radio spectrum. Fanaroff & Riley (1974) proposed a classification of such extended radio sources based on the ratio of the distance between the highest surface brightness regions on either side of the galaxy to the total extent of the radio source, $R_{\rm FR}$. Based on a threshold ratio of 0.5, the galaxies were classified into two classes as follows: if $R_{\rm FR} < 0.5$, the source was classified into Class I (FR I; edge-darkened), and if $R_{\rm FR} > 0.5$ it was classified into Class II (FR II; edge-brightened). Over the years, several other morphologies such as bent-tail (Rudnick & Owen 1976; O’Dea & Owen 1985), hybrid (Gopal-Krishna & Wiita 2000), and double-double (Schoenmakers et al. 2000) sources have also been observed and there is still a continuing debate about the exact interplay between extrinsic effects, such as the interaction between the jet and the environment, and intrinsic effects, such as differences in central engines and accretion modes, that give rise to the different morphologies. In this work we use only the binary FR I/FR II classification.
We have used the MiraBest data set which consists of 1256 images of radio galaxies pre-processed to be used specifically for deep learning tasks (e.g. Bowles et al. 2021; Scaife & Porter 2021). The data set was constructed using the sample selection and classification described in Miraghaei & Best (2017), who made use of the parent galaxy sample from Best & Heckman (2012). Optical data from data release 7 of Sloan Digital Sky Survey (SDSS DR7; Abazajian et al. 2009) was cross-matched with NRAO VLA Sky Survey (NVSS; Condon et al. 1998) and Faint Images of the Radio Sky at Twenty-Centimeters (FIRST; Becker, White & Helfand 1995) radio surveys. Parent galaxies were selected such that their radio counterparts had an AGN host rather than emission dominated by star formation. To enable classification of sources based on morphology, sources with multiple components in either of the radio catalogues were considered.
The morphological classification was done by visual inspection at three levels: (i) The sources were first classified as FR I/FR II based on the original classification scheme of Fanaroff & Riley (1974). Additionally, 35 Hybrid sources were identified as sources having FR I-like morphology on one side and FR II-like on the other. Of the 1329 extended sources inspected, 40 were determined to be unclassifiable. (ii) Each source was then flagged as ‘Confident’ or ‘Uncertain’ to represent the degree of belief in the human classification. (iii) Some of the sources which did not fit exactly into the standard FR I/FR II dichotomy were given additional tags to identify their sub-type. These sub-types include 53 Wide Angle Tail (WAT), nine Head Tail (HT), and five Double-Double (DD) sources. To represent these three levels of classification, each source was given a three-digit identifier as shown in Table 1.
| Digit 1 | Digit 2 | Digit 3 |
|---|---|---|
| 1: FR I | 0: Confident | 0: Standard |
| 2: FR II | 1: Uncertain | 1: Double Double |
| 3: Hybrid | | 2: Wide Angle Tail |
| 4: Unclassifiable | | 3: Diffuse |
| | | 4: Head Tail |
To construct the machine learning data set, several pre-processing steps were applied to the data following the approach described in Aniyan & Thorat (2017) and Tang, Scaife & Leahy (2019):
In order to minimize the background noise in the images, all pixels below the 3σ level of the background noise were set to zero. This threshold was chosen because, among the classifiers trained by Aniyan & Thorat (2017) on images with 2σ, 3σ, and 5σ cut-offs, the 3σ cut-off performed best.
The images were clipped to 150 × 150 pixels, centred on the source.
The images were normalized as follows:
$$
{\rm Output} = 255 \times \frac{{\rm Input} - {\rm Input_{min}}}{{\rm Input_{max}} - {\rm Input_{min}}}, \qquad (43)
$$
where Input refers to the input image, Input$_{\rm min}$ and Input$_{\rm max}$ are the minimum and maximum pixel values in the input image, and Output is the image after normalization.
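The pre-processing steps above can be sketched as follows, assuming the source is centred in the image and that `sigma_rms` estimates the background noise; this is illustrative code rather than the pipeline used to construct MiraBest.

```python
import numpy as np

def preprocess(image, sigma_rms, size=150):
    """Apply the pre-processing steps described above: set pixels below 3*sigma
    of the background noise to zero, crop to size x size pixels about the image
    centre (assumed to coincide with the source centre), and rescale to [0, 255]."""
    image = np.where(image < 3.0 * sigma_rms, 0.0, image)     # 3-sigma noise cut
    cy, cx = image.shape[0] // 2, image.shape[1] // 2
    half = size // 2
    image = image[cy - half:cy + half, cx - half:cx + half]   # central crop
    return 255.0 * (image - image.min()) / (image.max() - image.min())
```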
To ensure the integrity of the machine learning data set, the following 73 objects out of the 1329 extended sources identified in the catalogue were not included: (i) 40 unclassifiable objects; (ii) 28 objects with extent greater than the chosen image size of 150 × 150 pixels; (iii) four objects which were found in overlapping regions of the FIRST survey; (iv) one object in category 103 (FR I Confident Diffuse). Since this was the only instance of this category, it would not have been possible for the test set to be representative of the training set. The composition of the final data set is shown in Table 2. We do not include the sub-types in this table as we have not considered their classification.
| Class | Confidence | No. |
|---|---|---|
| FR I | Confident | 397 |
| FR I | Uncertain | 194 |
| FR II | Confident | 436 |
| FR II | Uncertain | 195 |
| Hybrid | Confident | 19 |
| Hybrid | Uncertain | 15 |
In this work, we use the MiraBest Confident subset to train the BBB models. Examples of FR I and FR II galaxies from the MiraBest Confident data set are shown in Fig. 1.

Examples of (a) FR I and (b) FR II galaxies from the MiraBest Confident subset.
Additionally, we use 49 samples from the MiraBest Uncertain subset and 30 samples from the MiraBest Hybrid class to test the trained model’s ability to correctly represent different measures of uncertainty, since these samples can be considered as being drawn from the same data generating distribution as the MiraBest Confident samples, but have a differing degree of belief in their classification. We note that there may be components of both epistemic and aleatoric uncertainty in the Uncertain and Hybrid samples, and this is discussed further in Section 6.
We note that all the previous work published using this data set uses some form of data augmentation. In this work we do not use any data augmentation, although the effect of data augmentation is discussed further in Section 7.2. The reasons for this are two-fold: firstly, that unprincipled data augmentation has been suggested to negatively affect the performance of Bayesian deep learning models (Nabarro et al. 2021); and second, that a noted advantage of Bayesian models is their ability to obtain good performance using only small data sets (e.g. Xiong, Barash & Frey 2011; Jospin et al. 2020; Semenova et al. 2020).
5 MODEL
5.1 Architecture
The architecture used to classify the MiraBest data set using BBB is shown in Table 3. We use a LeNet-5-style architecture (LeCun et al. 1998) with two additional convolutional layers. We found it essential to add two convolutional layers to the LeNet-5 architecture in order to obtain good model performance. Adding additional fully connected and convolutional layers beyond this resulted in no further improvement. The number of channels in the additional convolutional layers was also optimized. We used a kernel support size of 5 to be consistent with previous CNN-style architectures used with the MiraBest data set (e.g. Scaife & Porter 2021). ReLU activation functions are used for each layer with the exception of the output layer, and max-pooling is used to down-sample the feature data after each convolutional layer.
CNN architecture. Stride = 1 is used for all the convolutional and max pooling layers.
| Operation | Kernel | Channels | Padding |
|---|---|---|---|
| Convolution | 5 × 5 | 6 | 1 |
| ReLU | | | |
| Max pooling | 2 × 2 | | |
| Convolution | 5 × 5 | 16 | 1 |
| ReLU | | | |
| Max pooling | 2 × 2 | | |
| Convolution | 5 × 5 | 26 | 1 |
| ReLU | | | |
| Max pooling | 2 × 2 | | |
| Convolution | 5 × 5 | 32 | 1 |
| ReLU | | | |
| Max pooling | 2 × 2 | | |
| Fully connected | | 120 | |
| ReLU | | | |
| Fully connected | | 84 | |
| ReLU | | | |
| Fully connected | | 2 | |
| Log Softmax | | | |
The functional form of our priors is as defined in Section 2.4 and the hyper-parameters of the priors were tuned using the validation data set. We build models with four different priors: (i) a simple Gaussian prior with σ = 0.1, (ii) a GMM prior with $\{\pi, \sigma_1, \sigma_2\} = \{0.75, 1, 9 \times 10^{-4}\}$, (iii) a Laplace prior with b = 1, and (iv) an LMM prior with $\{\pi, b_1, b_2\} = \{0.75, 1, 10^{-3}\}$.
We use a Gaussian distribution as our variational approximation to the posterior over both the weights and biases in our network. Models using the BBB method are known to be highly sensitive to the initialization of this posterior and in this work we initialize the posterior means, μ, from a uniform distribution, $\mathcal{U}(-0.1, 0.1)$, and the posterior variance parametrization, ρ, from $\mathcal{U}(-5, -4)$.
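For concreteness, a deterministic skeleton of this architecture is sketched below; in the Bayesian model each convolutional and fully connected layer would instead carry a Gaussian variational posterior over its weights, initialized as described above. This sketch fixes the layer shapes only and is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BBBLeNetSketch(nn.Module):
    """Skeleton of the architecture in Table 3 (LeNet-5 style with two extra
    convolutional layers); in the BBB model each Conv2d/Linear is replaced by
    its variational counterpart with mu ~ U(-0.1, 0.1) and rho ~ U(-5, -4)."""

    def __init__(self, n_classes=2):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=1)
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=1)
        self.conv3 = nn.Conv2d(16, 26, kernel_size=5, stride=1, padding=1)
        self.conv4 = nn.Conv2d(26, 32, kernel_size=5, stride=1, padding=1)
        self.fc1 = nn.Linear(32 * 7 * 7, 120)  # 150x150 input -> 7x7x32 after four conv+pool blocks
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, n_classes)

    def forward(self, x):
        for conv in (self.conv1, self.conv2, self.conv3, self.conv4):
            x = F.max_pool2d(F.relu(conv(x)), 2)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return F.log_softmax(self.fc3(x), dim=1)
```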
5.2 Training
The MiraBest data set has a predefined training:test split. We further divide the training data in a ratio of 80:20 to create training and validation sets. The final split contains 584 training samples, 145 validation samples, and 104 test samples.
All the models are trained for 500 epochs, with mini-batches of size 50. We train the models using the Adam optimizer with a learning rate of $\ell = 5 \times 10^{-5}$ for all the priors. A learning rate scheduler is implemented which reduces the learning rate by 95 per cent if the validation likelihood cost does not improve for four consecutive epochs.
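An illustrative optimizer and scheduler configuration consistent with this description is sketched below; the multiplicative factor is our literal reading of "reduces the learning rate by 95 per cent" and, along with the placeholder model, is an assumption rather than the authors' code.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the Bayesian CNN described above.
model = nn.Linear(10, 2)

# Adam with learning rate 5e-5; the scheduler cuts the learning rate when the
# validation likelihood cost has not improved for four consecutive epochs.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.05, patience=4)

# Inside the training loop, called once per epoch with the validation likelihood cost:
# scheduler.step(val_likelihood_cost)
```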
After a model has been trained, the test error is calculated as the percentage of incorrectly classified galaxies by comparing the output of the model to the labels in the test set.
5.3 Cold posterior effect
We found it necessary to temper our posterior in order to get a good performance from the Bayesian neural network, without which the accuracy remains around 55 per cent. We tempered the posterior for a range of temperature values, T, between $[10^{-5}, 1)$ and chose the largest value of T for which the validation accuracy was improved significantly. Thus, for all experiments described in the following sections we use $T = 10^{-2}$ in equation (44).
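A sketch of how such tempering enters the mini-batched cost, with the temperature scaling the complexity (KL) term; the exact placement of T in equation (44) is assumed here rather than reproduced.

```python
def tempered_loss(nll, kl, num_batches, temperature=1e-2):
    """Mini-batch cost with a tempered complexity term: T = 1 recovers the
    untempered ELBO, while T << 1 down-weights the KL (complexity) cost."""
    return nll + temperature * kl / num_batches
```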
Several hypotheses have been proposed to explain the cold posterior effect including model or prior misspecification (Wenzel et al. 2020), and data augmentation or data set curation leading to likelihood misspecification (Aitchison 2021). These are discussed in detail and investigated further in Section 7.
5.4 Weight pruning
Variational inference based neural networks have several advantages over non-Bayesian neural networks; however, for a typical variational posterior such as the Gaussian distribution, the number of parameters in the network doubles compared to a non-Bayesian model with the same architecture because both the mean and standard deviation values need to be learned. This increases the computational and memory overhead at test time and during deployment. Thus, there is a need to develop network pruning approaches which can be used to remove the parameters that contain little or no useful information. Several authors have also considered pruning to improve the generalization performance of the network (LeCun, Denker & Solla 1989). Many of the pruning methods that have been developed can also be applied to non-Bayesian neural networks, but in this section we discuss a signal-to-noise ratio (SNR) based pruning criterion which can be applied naturally to a model trained with Gaussian variational densities (Graves 2011; Blundell et al. 2015).
We adapt SNR-based pruning for a convolutional Bayesian neural network by considering only the fully connected layer weights of the model for pruning, instead of all the weights of the network. This is because the convolutional layer weights are shared weights and removing even a small fraction may result in disastrous consequences for model performance. However, the fully connected layers make up ∼85 per cent of the total weights of our network, so pruning methods are still worth considering for convolutional BBB models. Pruning only the fully connected layers is also consistent with pruning methods developed for standard CNN models (e.g. Gong et al. 2014; Soulié, Gripon & Robert 2016; Tu et al. 2016).
For the model trained on the MiraBest data set with a Laplace prior, we find that up to 30 per cent of the fully connected layer weights can be pruned without a significant change in performance. This is discussed further in Section 6.4.
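A sketch of SNR-based pruning for a single fully connected BBB layer, ranking weights by |μ|/σ and masking out the lowest-ranked fraction; parameter names and the masking interface are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def snr_prune_mask(w_mu, w_rho, fraction=0.3):
    """Rank fully connected layer weights by SNR = |mu| / sigma, with
    sigma = softplus(rho), and zero out the lowest `fraction` of them.
    The returned mask multiplies the posterior means at test time."""
    sigma = F.softplus(w_rho)
    snr = w_mu.abs() / sigma
    k = max(1, int(fraction * snr.numel()))
    threshold = snr.flatten().kthvalue(k).values   # k-th smallest SNR value
    return (snr > threshold).float()
```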
6 RESULTS
6.1 Classification and calibration
The results of our classification experiments are shown in Table 4. The mean and standard deviation values of the test error are calculated by taking 100 samples from the posterior predictive distribution for each test data point in the MiraBest Confident data set.
Classification error and percentage cUCE on MiraBest Confident test set using BBB-CNN. The percentage cUCE is shown separately for predictive uncertainty as measured by predictive entropy (PE), epistemic uncertainty as measured by mutual information (MI), and aleatoric uncertainty as measured by average entropy (AE) as calculated on the MiraBest Confident test set. For a fuller explanation of these metrics, please see Section 3.
| Prior | Test error | cUCE PE (per cent) | cUCE MI (per cent) | cUCE AE (per cent) |
|---|---|---|---|---|
| Gaussian prior | 14.48 ± 3.40 per cent | 30.49 | 21.90 | 25.48 |
| GMM prior | 12.89 ± 2.23 per cent | 19.92 | 18.86 | 16.86 |
| Laplace prior | 11.62 ± 2.38 per cent | 9.69 | 16.37 | 10.84 |
| LMM prior | 17.29 ± 2.71 per cent | 21.02 | 26.05 | 17.69 |
Using the model trained with a Laplace prior, we recover a test error of 11.62 ± 2.38 per cent. We also recover a comparable test error of 12.89 ± 2.23 per cent using a GMM prior. The standard deviation values represent the spread in the overall test error and indicate the model’s confidence in its predictions on the test set. Bowles et al. (2021), who augment the MiraBest Confident samples by a factor of 72, report a test error of 8 per cent, whereas Scaife & Porter (2021), who use random rotations of the same data set as a function of epoch to augment the data, report a test error of 5.95 ± 1.37 per cent with a LeNet-5 style CNN with MC dropout, and 3.43 ± 1.29 per cent using a D16 group-equivariant CNN with MC dropout. We again emphasize that the test error values we report are without any data augmentation. If we include data augmentation using random rotations from 0 to 360 degrees, this improves the BBB test error using the Laplace prior to 7.41 ± 2.22 per cent, but at the cost of increased uncertainty calibration error. We note that the wider effect of using augmentation with Bayesian models is a subject of debate in the literature and this is discussed further in Section 7.2.
Whilst differences in performance are often used to choose a preferred model, it is also the case that more accurate point-wise models are overconfident in their predictions. This problem of overconfidence in standard NNs is well documented in the literature (see e.g. Nguyen, Yosinski & Clune 2015). In particular this effect has been shown to lead to miscalibrated uncertainty in predictions, especially for data samples that are less similar to canonical examples of a class (Guo et al. 2017b; Hein et al. 2018).
Table 4 shows the percentage cUCE values of the uncertainty metrics calculated for the MiraBest Confident test set (see Section 3.5). Among the four priors tested in this work, we find that the Laplace prior gives the most well-calibrated uncertainty metrics, followed by the GMM prior. Given the uncertainties on each test error due to the small size of our test set, we cannot draw any strong conclusions about which prior should be preferred. However, after analysing the uncertainty calibration error for each prior, we suggest that the Laplace prior produces the most well-calibrated uncertainties. We also find that the cold posterior effect is less pronounced in the case of the Laplace prior model (see Fig. 2). Similarly to the model accuracy, these results indicate that learning benefits most from a sparser prior and consequently in the following analysis we report results for the Laplace prior.

The ‘cold posterior’ effect for the MiraBest classification problem (see Section 7 for details). Data are shown for the BBB models with no data augmentation and the original ELBO cost function trained with a Laplace prior (solid blue line), and trained with a GMM prior (orange dashed line).
6.2 Uncertainty quantification
We first look at some illustrative examples of galaxies from the MiraBest Confident test set to build intuition about the uncertainty metrics used. For each test sample we make N = 200 forward passes through the trained model. This results in a distribution of model outputs following the learned predictive posterior, which allows us to estimate the different uncertainty measures described in Section 3.
We show some examples of galaxies that have been correctly classified with high confidence in Fig. 3. These galaxies correspond to typical FR I/FR II classifications. The corresponding uncertainty metrics are shown in Table 5. The predictive entropy and mutual information for all the galaxies shown are very low (<0.01 nats). The overlap indices ηsoft and ηlogits are ≪10−5, which indicates that there is virtually no overlap, as can be seen in the distributions of Softmax probabilities in Fig. 3.

Examples of galaxies correctly classified with high predictive confidence. Top: Softmax values for 200 forward passes through the trained model. Bottom: Input data images.
Predictive entropy (PE), mutual information (MI), and overlap indices for Softmax (ηsoft) and logit-space (ηlogits) for galaxies correctly classified with high confidence shown in Fig. 3.
| Galaxy | PE | MI | $\eta_{\rm soft}$ | $\eta_{\rm logits}$ |
|---|---|---|---|---|
| 3a | <0.01 | <0.01 | ≪10⁻⁵ | ≪10⁻⁵ |
| 3b | <0.01 | <0.01 | ≪10⁻⁵ | ≪10⁻⁵ |
| 3c | <0.01 | <0.01 | ≪10⁻⁵ | ≪10⁻⁵ |
| 3d | <0.01 | <0.01 | ≪10⁻⁵ | ≪10⁻⁵ |
We then consider galaxies for which the predictive uncertainty is high, as shown in Fig. 4. These galaxies have the highest predictive entropy among the test samples of the MiraBest Confident data set and large values of overlap indices in both the Softmax and logit space. These samples also have high mutual information, which indicates that the model’s confidence in its classification is very low. The values of uncertainty metrics corresponding to these galaxies are shown in Table 6.

Examples of galaxies classified with low predictive confidence. Top: Softmax values for 200 forward passes through the trained model. Bottom: Input data images.
Predictive entropy (PE), mutual information (MI), and overlap indices for Softmax (ηsoft) and logit-space (ηlogits) for galaxies classified with low confidence shown in Fig. 4.
| Galaxy | PE | MI | $\eta_{\rm soft}$ | $\eta_{\rm logits}$ |
|---|---|---|---|---|
| 4a | 0.68 | 0.25 | 0.72 | 0.10 |
| 4b | 0.69 | 0.20 | 0.90 | 0.08 |
| 4c | 0.67 | 0.27 | 0.70 | 0.08 |
| 4d | 0.69 | 0.14 | 0.88 | 0.12 |
Finally in Fig. 5, we show one example where the model has incorrectly classified a galaxy with high confidence. The galaxy has been labelled an FR II and the model incorrectly classifies it as an FR I. The predictive entropy, mutual information, and overlap indices are very low, as shown in Table 7, which means that the model’s confidence in its prediction is very high for this galaxy. We can see that the galaxy deviates from the typical FR II classification because it has additional bright components and its label is somewhat ambiguous. Thus, the bias introduced by the ambiguity in the definition of FR I/FR II and the ambiguity in the labels gives rise to uncertainty metrics that can potentially be misleading.

A galaxy that has been incorrectly classified with high predictive confidence. Top: Softmax values for 200 forward passes through the trained model. Bottom: Input data image.
Predictive entropy (PE), mutual information (MI), and overlap indices for Softmax (ηsoft) and logit-space (ηlogits) for a galaxy incorrectly classified with high confidence shown in Fig. 5.
| Galaxy | PE | MI | $\eta_{\rm soft}$ | $\eta_{\rm logits}$ |
|---|---|---|---|---|
| 5 | 0.10 | 0.02 | <0.01 | <0.01 |
In this section we saw how high or low values of predictive entropy, mutual information, and overlap indices indicate the model’s confidence in making predictions about individual galaxies; in the next section we analyse the distributions of uncertainty metrics for all the galaxies in the data set.
6.3 Analysis of uncertainty estimates
We test the trained model’s ability to capture different measures of uncertainty by calculating uncertainty metrics for (i) the MiraBest Uncertain test samples (49 objects), and (ii) the MiraBest Hybrid samples (30 objects), using the model trained on the MiraBest Confident samples. As before this is done by making N = 200 forward passes through the model for each test input. Overall distributions for each uncertainty metric as a function of the three different test sets are shown in Fig. 6.

Distributions of uncertainty metrics for MiraBest Confident (MBFR_Conf), Uncertain (MBFR_Uncert), and Hybrid (MBHybrid) data sets. (a) Predictive uncertainty as measured using predictive entropy; (b) Epistemic uncertainty as measured using mutual information; and (c) Aleatoric uncertainty as measured using average entropy. For a fuller explanation of these metrics, please see Section 3.
The MiraBest Uncertain samples are considered to be drawn from the same distribution as the MiraBest Confident samples, and from a machine learning perspective would therefore be denoted as in-distribution. The MiraBest Hybrid samples are a more complex case: in principle they are a separate class that was not considered when training the model and therefore might be denoted as being out-of-distribution by some measures; however, given that they are still a sub-population of the over-all radio galaxy population, and moreover that they are defined as amalgams of the two classes used to train the model, they could also be considered to be in-distribution. Consequently, in this work we treat the MiraBest Hybrid test sample as being in-distribution.
6.3.1 Analysis of uncertainty estimates on MiraBest Uncertain
Fig. 6 shows that the MiraBest Uncertain test set has on average higher measures of uncertainty across all estimators than the MiraBest Confident test set. In Fig. 7(a) we can see that this is also reflected in the median values of the predictive entropy distribution being higher for both FR I and FR II classes in the MiraBest Uncertain test set compared to the Confident test set, and that the interquartile range is also larger. This indicates that a larger number of galaxies are being classified with higher predictive entropy. The predictive entropy distribution has a higher median value for FR II objects for both MiraBest Confident and Uncertain samples. However, the distributions are wider for FR I objects.

Class-wise distributions of uncertainty metrics for MiraBest Confident and MiraBest Uncertain data sets. (a) Predictive uncertainty as measured using predictive entropy; (b) Epistemic uncertainty as measured using mutual information; and (c) Aleatoric uncertainty as measured using average entropy. For a fuller explanation of these metrics, please see Section 3.
The distribution of mutual information is shown in Fig. 6(b). The mutual information is also higher for samples from the MiraBest Uncertain test set. This indicates a higher epistemic uncertainty in classifying these samples, which is consistent with how the data sets are defined. We also find that FR IIs have a wider distribution of epistemic uncertainty than FR Is for MiraBest Uncertain samples (see Fig. 7b).
Using the average entropy of a test sample as a measure of aleatoric uncertainty, we see in Fig. 6(c) that the Uncertain samples have a higher median value and a larger interquartile range than the Confident samples. From Fig. 7(c) we note that for both FR I and FR II type galaxies, the interquartile range has shifted to a higher value and the median average entropy is also higher.
From Fig. 8, we can also see that FR II samples have lower uncertainty than FR I samples when combined across the Confident and Uncertain test sets, which could be because the training set contains ∼7 per cent more FR IIs than FR Is.

Morphology-wise distributions of uncertainty metrics for the MiraBest data set. (a) Predictive uncertainty as measured using predictive entropy; (b) Epistemic uncertainty as measured using mutual information; and (c) Aleatoric uncertainty as measured using average entropy. For a fuller explanation of these metrics, please see Section 3.
Thus in general we find that BBB can correctly represent model uncertainty in radio galaxy classification, and that this uncertainty is correlated with how human classifiers defined the MiraBest Confident and Uncertain qualifications.
6.3.2 Analysis of uncertainty estimates on MiraBest Hybrid
In Fig. 6 we can see that the interquartile range of the distributions of uncertainty metrics for the MiraBest Hybrid samples are well separated from the distributions for the MiraBest Confident samples.
The median predictive entropy of the Hybrid samples is higher than the MiraBest Confident samples by 0.5 nats, as shown in Fig. 6(a). This indicates that there is a high degree of predictive uncertainty associated with the hybrid samples, which is expected as the training set does not contain any hybrid samples. This behaviour is also echoed as a function of overall morphology (see Fig. 8). It can be seen that the MiraBest Hybrid samples have substantially higher median uncertainties than either the FR I or FR II objects (combined across the MiraBest Confident and MiraBest Uncertain samples).
In Fig. 6(b) we see that the median value for the distribution of mutual information for the Hybrid test set is higher than the upper quartile of the MiraBest Confident test set. This high degree of epistemic uncertainty could be because the model did not see any Hybrid samples during training. We also note that among the sub-classes of the Hybrid data set, the confidently labelled samples have higher epistemic uncertainty than the uncertainly labelled samples, as shown in Fig. 9(b). We suggest that this may be because the uncertain samples are more similar to the FR I/FR II galaxies than the model has seen during training, i.e. their classification as a Hybrid was considered uncertain by a human classifier because the morphology was biased towards one of the standard FR I or FR II classifications. In which case their epistemic uncertainty might be expected to be lower since the model was trained to predict those morphologies.

Class-wise distributions of uncertainty metrics for MiraBest Hybrid data set. (a) Predictive uncertainty as measured using predictive entropy; (b) Epistemic uncertainty as measured using mutual information; and (c) Aleatoric uncertainty as measured using average entropy. For a fuller explanation of these metrics, please see Section 3.
The distribution of the average entropy (aleatoric uncertainty) has a higher median value and the interquartile range has shifted to higher values of average entropy compared to the Confident and Uncertain test sets (see Fig. 6c). While it can be seen that Hybrid samples have higher aleatoric uncertainty on average, in Fig. 9(c) we can also see how the aleatoric uncertainty is distributed among the classes in the hybrid samples. The confidently labelled Hybrid samples span almost the entire range of the entropy function between (0, 0.693] nats. The uncertainly labelled samples also have a high degree of aleatoric uncertainty.
Thus we find that the Hybrid test set has an even higher degree of uncertainty than the Uncertain test set.
6.4 Alternative pruning approaches
In Section 5.4 we found that 30 per cent of the weights in the fully connected layers of our trained model could be pruned without a loss of performance using an SNR-based pruning approach. In this section, we discuss an alternative pruning approach based on Fisher information. We compare the performance of our BBB model trained on radio galaxies for different pruning methods and analyse the effect of pruning on uncertainty estimates.
A number of alternative methods for model pruning have been described in the literature. Hessian based methods have been proposed to rank model parameters by their importance, but in practice the calculation of a full Hessian for typical deep learning models is in general prohibitively expensive in terms of computation. Consequently, one of the most popular methods is to simply rank parameters by their magnitude, where magnitude refers to the absolute value of the weights. The SNR approach, where parameters with low magnitudes or high variances are removed (Section 5.4), may be considered a natural extension of this method. Tu et al. (2016) showed that for deterministic neural networks it is possible to improve on a simple magnitude-based pruning by using the Fisher information matrix (FIM) for a particular parameter of the network, θ. Here we implement the method of Tu et al. (2016) and compare it to the SNR-based pruning method.
These two approaches are based on fundamentally different methodologies: while SNR pruning takes into account ‘noisy’ weights that are either too small in magnitude or have large posterior variances, the Fisher-information based method removes parameters based on their contribution to the gradients. If a parameter has smaller FIM values, this indicates that the gradients of the parameter did not change much during training, i.e. that the parameter contained less information and was less relevant to producing the optimized model. Thus one method may be preferred to the other in specific applications.
The results of our pruning experiments are shown in Fig. 10. As also observed by Tu et al. (2016), pruning the weights based on Fisher information alone does not allow for a large number of parameters to be pruned effectively because many values in the FIM diagonal are close to zero. We find this to be true for our model as well, and that only 10 per cent of the fully connected layers can be pruned using Fisher information alone without incurring a significant penalty in performance.

Comparison of model performance for different pruning methods based on: SNR, Fisher information, and a combination of magnitude and Fisher information. The grey shaded portion indicates the standard deviation of test error for the unpruned model.
To remedy this, Tu et al. (2016) suggest combining Fisher pruning with magnitude-based pruning. Following their approach, we define a parameter, r, to determine the proportion of weights that are pruned by each of these methods. To prune P parameters from the network, we perform the following steps in order: (i) remove the P(1 − r) weights with the lowest magnitude; (ii) remove the $P\,r$ parameters with the lowest FIM values. The parameter r lies in the interval (0, 1) and is tuned like a hyper-parameter. We obtain optimal pruning performance with r = 0.5. We find that up to 60 per cent of the fully connected layer weights can be pruned using this method without a significant change in performance, which is double the fraction of weights pruned by the SNR-based method discussed in Section 5.4. For conciseness, we refer to this method as Fisher pruning in the following sections.
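A sketch of this combined magnitude and Fisher-information pruning for a single layer is given below; how the FIM diagonal is estimated (e.g. from accumulated squared gradients) is not shown, and the interface is illustrative rather than the authors' code.

```python
import torch

def combined_prune_mask(w_mu, fisher_diag, n_prune, r=0.5):
    """Step (i): remove the n_prune*(1 - r) weights with the smallest magnitude;
    step (ii): remove a further n_prune*r weights with the smallest diagonal
    Fisher information. Returns a 0/1 mask with the same shape as w_mu."""
    flat_mu = w_mu.abs().flatten()
    flat_fim = fisher_diag.flatten().clone().float()
    mask = torch.ones_like(flat_mu)

    n_mag = int(n_prune * (1.0 - r))
    n_fim = n_prune - n_mag

    # (i) magnitude-based pruning
    _, idx = torch.topk(flat_mu, n_mag, largest=False)
    mask[idx] = 0.0
    flat_fim[idx] = float("inf")   # exclude already-pruned weights from step (ii)

    # (ii) Fisher-information-based pruning of the remainder
    _, idx = torch.topk(flat_fim, n_fim, largest=False)
    mask[idx] = 0.0
    return mask.view_as(w_mu)
```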
6.4.1 Analysis of uncertainty estimates for different pruning methods
The effect of pruning on uncertainty quantification for the MiraBest Confident test set is shown in Fig. 11. We plot uncertainty metrics for the two pruning methods discussed in this work: (i) SNR pruning, with 30 per cent of weights pruned, and (ii) Fisher pruning, which combines weight magnitudes with Fisher information, with 60 per cent of weights pruned, and compare them to the metrics obtained for the unpruned model. We complement this with an analysis of the change in uncertainty calibration error for the two pruning methods.

Figure 11. Distributions of uncertainty metrics for different pruning methods for the MiraBest Confident data set. (a) Predictive uncertainty as measured using predictive entropy; (b) epistemic uncertainty as measured using mutual information; and (c) aleatoric uncertainty as measured using average entropy. For a fuller explanation of these metrics, please see Section 3.
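As a reminder of how these three metrics relate in practice (their formal definitions are given in Section 3), the sketch below shows the standard Monte Carlo decomposition computed from repeated stochastic forward passes; the function name and tensor layout are illustrative assumptions rather than our exact implementation.

```python
import torch

def uncertainty_metrics(softmax_samples, eps=1e-12):
    """Standard Monte Carlo decomposition of predictive uncertainty.

    softmax_samples: tensor of shape (T, n_classes) containing the softmax
    output of T forward passes through the network, each with a fresh weight
    sample drawn from the variational posterior.
    """
    mean_p = softmax_samples.mean(dim=0)
    # predictive entropy: entropy of the averaged prediction (total uncertainty)
    predictive_entropy = -(mean_p * (mean_p + eps).log()).sum()
    # average entropy: mean of the per-sample entropies (aleatoric component)
    average_entropy = -(softmax_samples * (softmax_samples + eps).log()).sum(dim=1).mean()
    # mutual information: their difference (epistemic component)
    mutual_information = predictive_entropy - average_entropy
    return predictive_entropy, mutual_information, average_entropy
```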
From Fig. 11(a) we note that predictive entropy increases with pruning, and the interquartile range increases for both pruning methods. This increase is mainly due to an increase in predictive entropy for FR I galaxies in the case of SNR pruning, and for FR II galaxies in the case of Fisher pruning (see Fig. 12a). The median predictive entropy and interquartile range increase for FR IIs with Fisher pruning.

Figure 12. Class-wise distributions of uncertainty metrics for different pruning methods for the MiraBest Confident data set. (a) Predictive uncertainty as measured using predictive entropy; (b) epistemic uncertainty as measured using mutual information; and (c) aleatoric uncertainty as measured using average entropy. For a fuller explanation of these metrics, please see Section 3.
We also find that the cUCE of predictive entropy increases with pruning, and that this increase is larger in the case of SNR pruning (see Table 8). SNR pruning has more of an adverse effect on the classification of FR I samples than FR II samples, whereas the effect of Fisher pruning is similar for both classes.
Table 8. Percentage cUCE on the MiraBest Confident test set for our BBB-CNN model trained with a Laplace prior, pruned to its threshold limit for SNR and Fisher pruning. The percentage cUCE is shown separately for the predictive entropy (PE), mutual information (MI), and average entropy (AE) as calculated on the MiraBest Confident test set.

| Prior | Pruning | PE (per cent cUCE) | MI (per cent cUCE) | AE (per cent cUCE) |
|---|---|---|---|---|
| Laplace | Unpruned | 9.69 | 16.37 | 10.84 |
| Laplace | SNR | 14.35 | 16.82 | 13.93 |
| Laplace | Fisher | 13.43 | 15.29 | 11.25 |
Pruning does not seem to have a large effect on the distributions of mutual information, which narrow by a small amount for both pruning methods (see Fig. 11b). The cUCE does not increase significantly with SNR pruning and reduces with Fisher pruning (see Table 8).
Looking at the class-wise distributions of mutual information in Fig. 12(b), we can see that both pruning methods narrow the distribution of mutual information for FR I galaxies. However, SNR pruning increases the uncertainty calibration error for FR Is and decreases it for FR IIs, whereas Fisher pruning decreases UCE for FR Is and increases it for FR IIs. Thus each pruning method adversely affects the uncertainty calibration for one of the two classes.
The average entropy increases with both pruning methods, and the distribution becomes broader in the case of Fisher pruning (see Fig. 11c). However, we also find that Fisher pruning does not significantly change the cUCE, whereas SNR pruning produces a larger increase (see Table 8).
Fig. 12(c) shows that average entropy increases with pruning for FR I galaxies under both pruning methods, and that Fisher pruning produces a greater increase in average entropy for FR II galaxies. We find that SNR pruning increases UCE for both FR I and FR II galaxies, with a more significant increase for FR Is, whereas Fisher pruning reduces UCE for FR Is and increases it for FR IIs.
Whilst we have described the differing effect of alternative pruning methods on a class-wise basis, we note that pruning itself cannot be applied selectively by class since it depends on the model parameters. However, based on the fundamental differences in the results from SNR and Fisher pruning, we suggest that FR I galaxies are less influential on the gradients of the learnable parameters during training compared to FR IIs and that the learned model weights are less noisy for FR IIs compared to FR Is.
Hüllermeier & Waegeman (2021) argue that epistemic and aleatoric uncertainty are mutable quantities as a function of model specification (see also Kiureghian & Ditlevsen 2009); specifically, that the uncertainty contributed by each is affected by model complexity and class separability. For example, embedding a data set into a higher dimensional feature space may result in greater separability of the target classes leading to lower aleatoric uncertainty, but the additional complexity from the higher dimensionality results in a model with higher epistemic uncertainty. In SNR-based pruning, we are reducing the dimensionality of the feature space, but at the same time we are also removing noisy weights (and hence noisy features) that may adversely affect class separability, and so it is expected that measures of both epistemic and aleatoric uncertainty will be changed as a function of pruning degree.
7 COLD POSTERIOR EFFECT
In Section 5.3 we found that we can improve the generalization performance of our BBB model significantly by cooling the posterior with a temperature T ≪ 1, deviating from the true Bayes posterior. This cold posterior effect is shown in detail in Fig. 13 for our model trained on radio galaxies with the Laplace prior. In Section 5.3 we modified our cost function to down-weight the posterior in equation (44) using a temperature term, T, with all subsequent experiments performed using T = 10^{-2}. In Fig. 13 we show the effect of varying T over a wide range of values.
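One common way to implement such tempering, and a plausible reading of the modification described above, is to re-weight the complexity (KL) term of the variational cost relative to the likelihood term; the minimal sketch below illustrates this scaling and is not a reproduction of equation (44).

```python
def tempered_cost(nll, kl_divergence, temperature, num_batches=1):
    """Tempered variational cost: the complexity (KL) term of the ELBO is
    scaled by the temperature T. T = 1 recovers the standard objective,
    while T << 1 (e.g. T = 1e-2) gives the 'cold posterior' regime."""
    return nll + temperature * kl_divergence / num_batches
```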

Figure 13. The ‘cold posterior’ effect for the MiraBest classification problem (see Section 7 for details). Data are shown for the BBB model trained with a Laplace prior with no data augmentation and the original ELBO cost function (solid blue line), the BBB model with no data augmentation and the Masegosa posterior cost function (red dashed line), the BBB model with data augmentation and the original ELBO function (orange dot–dashed line), and the BBB model with data augmentation and the Masegosa posterior cost function (green dotted line).
These results suggest that some component of the Bayesian framework is misspecified in the context of this application, and it becomes difficult to justify using a Bayesian approach to these models whilst artificially reducing the effect of the very components that make the learning Bayesian in the first place.
Finding an explanation for the cold posterior effect is an active area of research and several hypotheses have been proposed (Wenzel et al. 2020): the use of uninformative priors, such as the standard Gaussian, which may lead to prior misspecification (Fortuin 2021); model misspecification; and data augmentation or data set curation issues, which lead to likelihood misspecification (Aitchison 2021; Nabarro et al. 2021). Here we consider two approaches for investigating the effect of these misspecifications.
7.1 Model misspecification
We examine whether the cold posterior effect in our results is due to model misspecification by optimizing the model with a modified cost function, following the work of Masegosa (2019). The cost function is modified on the basis of Probably Approximately Correct (PAC)-Bayesian theory. PAC theory has its roots in Statistical Learning Theory and was first described in Valiant (1984) as a method for evaluating learnability, i.e. how well a machine learns hypotheses given a set of examples. Although PAC started out as a frequentist framework, it was soon combined with Bayesian principles. It has now evolved into a formalized mathematical theory used to give statistical guarantees on the performance of machine learning algorithms by placing bounds on their generalization performance.
McAllester (1999) presented PAC-Bayesian inequalities which combine PAC learning with Bayesian principles and provide guarantees on the performance of generalized Bayesian algorithms. These algorithms are referred to as generalized because the PAC-Bayes framework has components similar to the Bayesian framework: a prior, π, defined over a set of hypotheses, θ ∈ Θ, and a posterior, ρ, which is updated via Bayes-rule-style updates using samples from a data-generating distribution, ν(x). However, these bounds hold for all choices of prior, whereas Bayesian inference offers no guarantee on performance if the data set is not generated from the prior distribution, i.e. if the prior assumptions are incorrect. The bounds also hold for all choices of posterior, so in principle we can have model-free learning. However, traditionally most PAC-Bayes bounds are applicable only to bounded loss functions, which has made it difficult to apply them to the unbounded loss functions typically used to train neural networks. Fortunately, more recent works have introduced PAC-Bayes bounds for unbounded losses as well (e.g. Alquier, Ridgway & Chopin 2016; Germain et al. 2016; Shalaeva et al. 2020). We refer the reader to Guedj (2019) for an overview of the PAC-Bayesian framework.
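For reference, one commonly quoted form of a McAllester-style PAC-Bayes bound, for a loss bounded in [0, 1] and n training samples, states that with probability at least 1 − δ, simultaneously for all posteriors ρ,

$$\mathbb{E}_{\theta \sim \rho}\!\left[L(\theta)\right] \;\leq\; \mathbb{E}_{\theta \sim \rho}\!\left[\hat{L}(\theta)\right] \;+\; \sqrt{\frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln\!\left(2\sqrt{n}/\delta\right)}{2n}},$$

where L is the true (generalization) loss and $\hat{L}$ its empirical counterpart; the exact constants differ between variants of the bound.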
Since variational inference is an approximation to the Bayesian marginal likelihood, we can use PAC-Bayesian bounds to optimize VI-based Bayesian NNs. By applying these bounds to train Bayesian neural networks, one deviates from variational inference as defined in the Bayesian paradigm and moves towards a more generalized variational inference algorithm. Training a model with the modified cost function minimizes an upper bound on the test loss and provides a better learning strategy than optimizing the Bayesian posterior or its approximations such as the ELBO function. To do this, the cost function is modified so that it minimizes a second-order PAC-Bayesian bound on the cross-entropy loss (CE), rather than the standard ELBO. Following the literature, we refer to this new objective function as a ‘Masegosa posterior’.
7.1.1 Masegosa posteriors
The results of this modification for a range of temperatures are shown in Fig. 13, where it can be seen that the Masegosa posterior PAC bound (red dashed line) slightly reduces the cold posterior effect relative to the original BBB model (solid blue line), but does not fully compensate for the overall behaviour.
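To make the structure of such a modification concrete, the following is a schematic sketch only: it mirrors the general form of a variance-corrected (second-order) objective, in which the expected negative log-likelihood is complemented by a diversity term estimated from pairs of weight samples. It does not reproduce the exact Masegosa bound, its constants, or our implementation.

```python
import torch
import torch.nn.functional as F

def variance_corrected_cost(model, x, y, kl_divergence, num_batches, temperature=1e-2):
    """Schematic second-order (variance-corrected) variational cost.

    Two independent weight samples are drawn by calling the stochastic model
    twice; the data term is the average negative log-likelihood, and a
    two-sample variance estimate of the per-batch log-likelihood is
    subtracted, rewarding posteriors whose samples disagree.
    """
    log_like = torch.stack([
        -F.cross_entropy(model(x), y) for _ in range(2)   # each call resamples weights
    ])
    data_term = -log_like.mean()                           # expected NLL under q(w)
    diversity = 0.5 * (log_like[0] - log_like[1]) ** 2     # unbiased two-sample variance
    return data_term - diversity + temperature * kl_divergence / num_batches
```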
7.2 Likelihood misspecification
Aitchison (2021) suggested that rather than prior or model misspecification, the cold posterior effect might be caused by over-curation of training data sets leading to likelihood misspecification, where the training data are not statistically representative of the underlying data distribution. They showed that for highly curated data sets, such as CIFAR-10, the cold posterior effect could be mitigated by adding label noise to the training data. Other works in this area have suggested that unprincipled data augmentation could be a contributing factor to the cold posterior effect (e.g. Izmailov et al. 2021; Nabarro et al. 2021).
The MiraBest data set used in this work is likely to be subject to some over-curation in the same sense as CIFAR-10, as it was compiled using an average or consensus labelling scheme from multiple human classifiers. For CIFAR-10, Aitchison (2021) introduced label noise by augmenting the original data set with all individual classifications from 50 human classifiers, as provided by the CIFAR-10H data set (Peterson et al. 2019). We do not have access to the individual classifications for the MiraBest data set, but we are able to augment our data set in a more standard manner using rotations.
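A minimal example of the kind of rotation-based augmentation meant here, written with torchvision-style transforms, is shown below; the exact pipeline used in our released code may differ.

```python
from torchvision import transforms

# Principled augmentation for radio galaxies: class labels are assumed to be
# invariant under rotation, so arbitrary rotations generate valid new samples.
# (A random flip could be added in the same spirit to reflect invariance to
# chirality.) This is an illustrative pipeline, not the exact one used here.
train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=360),   # rotate by any angle
    transforms.ToTensor(),
])
```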
We find that the cold posterior effect observed in our work reduces slightly with data augmentation (see Fig. 13). We suggest that this is because we have augmented the MiraBest data set using principled methods that correspond to an informed prior on how radio galaxies are oriented, since radio galaxy class is assumed to be equivariant to orientation and chirality (see e.g. Ntwaetsile & Geach 2021; Scaife & Porter 2021). Fig. 13 shows the cold posterior effect for the radio galaxy classification problem addressed in this work both with (orange dot–dashed line) and without (blue solid line) data augmentation. There is an improvement in performance at temperatures below T = 0.01, causing the test accuracy to plateau at higher temperatures than for unaugmented data. Data augmentation also improves performance at temperatures above T = 0.01; however, we also find that the uncertainty calibration error increases for the model trained with augmented data at T = 0.01.
Fig. 13 also shows that combining a Masegosa posterior with data augmentation provides the most significant improvement to the cold posterior effect (green dotted line); however, it does not rectify it completely. Since the Masegosa posterior is a more complete test of model misspecification than the data augmentation used here is of likelihood misspecification, we suggest that a key element for exploring the problem of likelihood misspecification in future may be the availability of radio astronomy training sets that do not only present average or consensus target labels, but instead include all individual labels from human classifiers.
8 CONCLUSIONS
In this work we have presented the first application of a variational inference based approach to deep learning classification of radio galaxies, using a binary FR I/FR II classification. Using a Bayesian Convolutional Neural Network based on the Bayes by Backprop (BBB) algorithm, we have shown that posterior uncertainties on the predictions of the model can be estimated by making a variational approximation to the posterior probability distribution over the model parameters.
We have considered the use of four different prior distributions over the parameters of our model. We find that a model trained with a Laplace prior performs better than one using a Gaussian prior in terms of mean test error by ∼3 per cent, and better than one using a GMM prior by ∼1 per cent; a model trained with an LMM prior performs worse than those using Gaussian priors and the Laplace prior. We also find that the calibration of the posterior uncertainties for a model trained using a Laplace prior is better than for models trained using the other priors considered in this work. This suggests that learning in this case may benefit from sparser weights. However, given the uncertainties on these values we cannot yet draw statistically significant conclusions for prior selection. We also note that this work uses relatively simple priors and that future extensions will look more closely at prior specification and whether more informative priors can help learning.
We note that we obtain a larger error value than that obtained by other neural network based models trained on the MiraBest Confident data set, but emphasize that other works have used data augmentation to increase the size of the data set, whereas we have only used the original samples. This also allowed us to study the use of VI-based neural networks on small data sets. If we include data augmentation we obtain a performance comparable to previously published results; however, this comes at the cost of increased uncertainty calibration error. Thus there is a trade-off between standard models, which are somewhat more accurate, and BBB, which is reasonably accurate while providing more reliable posteriors and is therefore potentially more scientifically useful.
Our analysis of different measures of uncertainty for our deep learning model indicates that model uncertainty is correlated with the degree of belief of the human classifiers who originally assigned the labels in the MiraBest data set. We find that our BBB model trained on confidently labelled radio galaxies is able to reliably estimate its confidence in predictions when presented with radio galaxies that have been classified with a lower degree of confidence. Notably, all measures of uncertainty are higher for the samples labelled as Uncertain. The model also made predictions with higher uncertainty for a sample of Hybrid radio galaxies, which was expected as these samples were not present in the training data but nevertheless contain FR I/FR II like components. Looking more closely at the class-wise distributions of uncertainties, we found that FR II type objects are associated with a lower degree of uncertainty than FR Is. Among the Hybrid samples, we found that the uncertainty was higher for the confidently labelled samples than for the uncertainly labelled ones. We suggest that this may be because objects with uncertain labels are more similar to the FR I/FR II samples the model was trained on, which is why the human classifiers were uncertain in classifying them as Hybrid.
We have explored different weight pruning approaches with the motivation of reducing the storage and computation cost of these models at deployment. We find that an SNR-based method using posterior means and variances allows the fully connected layers of the model to be pruned by up to 30 per cent, but a method that combines Fisher information with weight magnitudes allows an even higher proportion of weights to be pruned, up to 60 per cent, without compromising model performance. The effect of removing these weights can also be seen in the uncertainty metrics: we found that both the uncertainty and the uncertainty calibration error increase with model pruning. However, the two pruning methods affect FR I and FR II samples differently. Future work in this area could compare these methods with augmented data to verify whether one method should be preferred over the other. Another possible extension would be to re-train a pruned model to test whether pruning improves the generalization performance of the network.
Finally, we consider the cold posterior effect and its implications for the use of Bayesian deep learning with radio galaxy data in future. We find that the cold posterior effect is worse when using a GMM prior than for a Laplace prior, and we consider the hypothesis that further model misspecification may be causing the observed cold posterior effect. We test this hypothesis by retraining our model with a modified cost function that provides a loose PAC-Bayes bound over the cross-entropy loss. We find that although the modified cost function improves model performance slightly, it does not compensate for the cold posterior effect completely.
We also consider the possibility of likelihood misspecification and test whether a principled data augmentation could improve the cold posterior effect. Similarly, we find that a small improvement is observed, but not a sufficiently large change to remove the effect entirely. Based on these results, we suggest that over-curation of the training data set may be responsible for the majority of the cold posterior effect in radio galaxy classification and recommend that future labelling schemes for radio astronomy data retain full details of labelling from all human classifiers in order to test and potentially mitigate this effect more fully.
In this work we have considered a binary classification of morphology, but a diverse and complex population of galaxies exists in the radio universe. Understanding how populations of radio galaxies are distributed gives us insight into the extrinsic and intrinsic factors that may have led to their morphologies, which in turn helps shape our understanding of radio-loud AGN, their excitation and accretion modes, how they evolve, and their relationship with their host galaxies and environments. Deep learning will play an important role in extracting scientific value from the next generation of radio facilities, and understanding how neural network models propagate uncertainties will be crucial for deploying these models scientifically.
ACKNOWLEDGEMENTS
We thank the anonymous referee for their useful comments that improved this work. AMS, MW, and MB gratefully acknowledge support from Alan Turing Institute AI Fellowship EP/V030302/1. FP gratefully acknowledges support from UK Science & Technology Facilities Council (STFC) and IBM through the iCASE studentship ST/P006795/1. MB gratefully acknowledges support from STFC.
DATA AVAILABILITY
Code for this work is available at https://github.com/devinamhn/RadioGalaxies-BBB. This work makes use of the MiraBest machine learning data set, which is publicly available under a Creative Commons 4.0 license at https://doi.org/10.5281/zenodo.4288837.
Footnotes
We note that the Radio Galaxy Zoo catalogue (Banfield et al. 2015) contains of order 10^4 objects, but these are not labelled by morphological type.
Here we refer only to the weights, w, of the network for simplicity, but the equations are applicable to the biases as well.
Logits are the unnormalized outputs of the network, i.e. the outputs of the final layer before the Softmax function is applied.