Rithwik Gupta, Daniel Muthukrishna, Michelle Lochner, A classifier-based approach to multiclass anomaly detection for astronomical transients, RAS Techniques and Instruments, Volume 4, 2025, rzae054, https://doi.org/10.1093/rasti/rzae054
ABSTRACT
Automating real-time anomaly detection is essential for identifying rare transients, with modern survey telescopes generating tens of thousands of alerts per night, and future telescopes, such as the Vera C. Rubin Observatory, projected to increase this number dramatically. Currently, most anomaly detection algorithms for astronomical transients rely either on hand-crafted features extracted from light curves or on features generated through unsupervised representation learning, coupled with standard anomaly detection algorithms. In this work, we introduce an alternative approach: using the penultimate layer of a neural network classifier as the latent space for anomaly detection. We then propose a novel method, Multi-Class Isolation Forests, which trains separate isolation forests for each class to derive an anomaly score for a light curve from its latent space representation. This approach significantly outperforms a standard isolation forest. We also use a simpler input method for real-time transient classifiers which circumvents the need for interpolation and helps the neural network handle irregular sampling and model inter-passband relationships. Our anomaly detection pipeline identifies rare classes including kilonovae, pair-instability supernovae, and intermediate luminosity transients shortly after trigger on simulated Zwicky Transient Facility light curves. Using a sample of our simulations matching the population of anomalies expected in nature (54 anomalies and 12 040 common transients), our method discovered |$41\pm 3$| anomalies (|$\sim 75~{{\rm per\ cent}}$| recall) after following up the top 2000 (|$\sim 15~{{\rm per\ cent}}$|) ranked transients. Our novel method shows that classifiers can be effectively repurposed for real-time anomaly detection.
1 INTRODUCTION
With the advancement of survey telescopes and the advent of large-scale transient surveys, we are entering a new paradigm for astronomical study. The Vera Rubin Observatory’s Legacy Survey of Space and Time (LSST) is expected to generate ten million transient alerts per night (Ivezić et al. 2019). The traditional approach of manual examination of astronomical data, which has led to some of the biggest discoveries in astronomy, is no longer feasible. As a result, there is a growing need to develop methods that can automate the serendipity that has so far played a pivotal role in scientific discovery. Furthermore, in the case of transients, it is also imperative that anomalies can be identified in real time so that new astrophysical phenomena can be discovered early and studied at each stage of their evolution. In particular, detailed early-time follow-up is necessary to understand the progenitor systems and explosion mechanisms of transients (e.g. Kasen 2010). Currently, the central engines of many rare transient classes, such as calcium-rich transients, kilonovae, and the newly discovered fast blue optical transients (FBOTs) (e.g. Coppejans et al. 2020) remain poorly understood. This, along with the considerable amount of human effort in the follow-up of GW170817 (e.g. Abbott et al. 2017), the first observed binary neutron star merger, has reinforced the need for automatic photometric identification of new astrophysical phenomena.
The announcement of LSST motivated the development of many real-time (e.g. Narayan et al. 2018; Muthukrishna et al. 2019; Möller & de Boissière 2020) and full light-curve classifiers (e.g. Lochner et al. 2016; Charnock & Moss 2017; Pasquet et al. 2019). An inherent quality of classifiers is the fundamental assumption that all observed objects belong to one of the predefined classes; however, this is not the case. New telescopes are finding interesting new types of astronomical objects by probing deeper, wider, and faster than ever before. For example, LSST will have an unprecedented point-source depth of |$r \sim 27.5$| (Ivezić et al. 2019), probing fainter than any other survey to date, and the Transiting Exoplanet Survey Satellite (TESS; Ricker et al. 2015) is exploring transients at a much shorter time-scale, from hours to minutes, using its wide field-of-view. Astronomers will need automatic tools to assist in identifying which potentially anomalous events to follow up.
In this regard, anomaly detection is most simply defined as identifying outlier samples. While this may be straightforward in low-dimensional spaces, it becomes considerably more challenging when dealing with multi-passband astronomical light curves, which typically feature a large and often variable number of inputs. Thus, most previous studies in anomaly detection attempt to find a lower dimensional latent space that is easier to cluster and identify anomalies. Previous works often use either user-defined feature extraction (e.g. Giles & Walkowicz 2019; Pruzhinskaya et al. 2019; Webb et al. 2020; Ishida et al. 2021; Malanchev et al. 2021; Perez-Carrasco et al. 2023) or deep learning (Solarz et al. 2017; Villar et al. 2021; Malanchev et al. 2021) to encode this latent space. In this work, we employ deep learning for feature extraction, which is quickly becoming the gold standard in the field. Deep learning is emerging as the preferred approach due to its ability to automatically learn complex, hierarchical features from raw data without the need for manual feature engineering. This is particularly advantageous in astronomy, where the underlying physical processes generating the data are often not fully understood, making it difficult to design comprehensive user-defined features. These data-driven models can also adapt to new data distributions and scale efficiently with increasing data volumes allowing for more generalizable models that can be applied across various astronomical objects and observational scenarios with minimal modification.
Throughout the literature on anomaly detection for astronomical transients, two different definitions of the problem are presented. Some approaches, categorized as unsupervised methods, focus on extracting anomalies from large data sets without relying on prior information (e.g. Giles & Walkowicz 2019; Webb et al. 2020; Villar et al. 2021). Numerous differing approaches exist for unsupervised anomaly detection. Villar et al. (2021) used an unsupervised recurrent variational autoencoder to find a representative latent space mapping of the light curves to then derive anomaly scores using an isolation forest. Webb et al. (2020) used user-defined feature extraction and then an isolation forest with active learning to find anomalies.
In contrast, our work, among others (e.g. Muthukrishna et al. 2022; Perez-Carrasco et al. 2023), utilizes previous transients, either simulated or real, to determine whether a new light curve is anomalous. This approach is often referred to as novelty detection or supervised anomaly detection. Previous novelty detection approaches (e.g. Soraisam et al. 2020; Muthukrishna et al. 2022) are often variations of one-class classification (Schölkopf et al. 1999). One-class classifiers attempt to model a set of normal samples and then classify new transients as either part of that sample or as outliers. One-class methods have been shown to be effective at anomaly detection (Ruff et al. 2018), but they do not capture the complexity of the population of known astronomical transients, which are grouped into numerous classes with intrinsically different qualities. It is challenging for an algorithm to collapse this diverse population of known transients into a single class and still identify anomalies. Perez-Carrasco et al. (2023) addressed this issue by extending the one-class classifier to multiple classes on features extracted from full multi-passband light-curve data. Their method adapts the single-class loss function to multiple classes by encouraging light curves of the same class to cluster together: the loss function of their encoder measures how tightly objects of the same class cluster in the latent space, and anomalies are extracted using the minimum distance to any cluster in the latent space.
The announcement of LSST has also made real-time anomaly detection more important (e.g. Feng et al. 2017; Bi, Feng & Yuan 2018; Soraisam et al. 2020; Villar et al. 2021; Muthukrishna et al. 2022). Villar et al. (2021) generalized their variational autoencoder to use a recurrent neural network, allowing real-time anomaly scores. Muthukrishna et al. (2022) used predictive modelling and derived the anomaly score as the deviation from real-time predictions. Soraisam et al. (2020) used magnitude changes in real time to assess the probability of a new transient being similar to the common transient sample.
In this work, we leverage a light-curve classifier to address the one-class challenge and distinguish between the various classes of transients. Our approach demonstrates promising clustering in the feature space, the penultimate layer of the classifier, and shows a substantial level of discrimination in anomaly scores. Notably, similar feature extraction methods have shown potential in the field of astronomical image analysis (e.g. Walmsley et al. 2022; Etsebeth et al. 2024).
Previous light-curve classification works exist in both the domain of real-time classification (e.g. Mahabal et al. 2008; Muthukrishna et al. 2019; Möller & de Boissière 2020) and full light-curve classification (e.g. Charnock & Moss 2017; Pasquet et al. 2019). Real-time classifiers predict a classification output at every new observation, while full light-curve classifiers retrospectively predict a classification output for a complete light curve. Most previous real-time approaches employ an interpolation technique that becomes the bottleneck for the model. Our real-time classifier, on the other hand, uses an input method novel in this domain, which avoids interpolation and helps the neural network learn the relationships between different passbands.
After identifying a feature space using one of the aforementioned methods, several prior works have employed an isolation forest (Liu, Ting & Zhou 2008) to generate anomaly scores. Isolation forests are one of the most popular anomaly detection architectures. They work by recursively partitioning data using random splits, with the idea that outliers are rare and different, thus requiring fewer partitions to isolate them. The algorithm assigns an anomaly score based on the average path length needed to isolate each object, with shorter paths indicating potential anomalies.
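As an illustration of this isolation principle, the following sketch (our own toy example using scikit-learn's `IsolationForest`, not code from the works cited) shows an obvious outlier receiving a lower `score_samples` value than points inside a dense cluster:

```python
# Toy example of isolation-forest anomaly scoring (illustrative data only).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(500, 2))   # one dense cluster of inliers
outlier = np.array([[8.0, 8.0]])               # a point far from the cluster

forest = IsolationForest(n_estimators=100, random_state=0).fit(normal)

# score_samples is higher for inliers; outliers need fewer random splits to
# isolate, giving shorter path lengths and therefore lower scores.
outlier_score = forest.score_samples(outlier)[0]
inlier_scores = forest.score_samples(normal)
```
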
While isolation forests have demonstrated success in previous research within the same domain (e.g. Pruzhinskaya et al. 2019; Ishida et al. 2021; Villar et al. 2021), they, too, face challenges when dealing with a complex latent space housing multiple clusters of intrinsically different transient classes. Consequently, a single isolation forest may be limited in its ability to accurately identify certain anomalies. Singh, Luo & Li (2023) recognized this problem in a general machine-learning context and introduced a pipeline in which an autoencoder is trained as an anomaly detector on observations from each class, treating all other observations as anomalous. The final anomaly score is the minimum score from all detectors. This method has shown promising results in comparison to other anomaly detection methods.
In response to this limitation and following a similar principle to Singh et al. (2023), we propose the use of Multi-Class Isolation Forests (MCIF): a method that involves training a separate isolation forest for each known class and extracting the minimum score among them as the final anomaly score for a given sample. Our experimental results suggest that MCIF can improve anomaly detection performance for certain types of optical transients.
The paper is organized as follows. In Section 2, we discuss the simulated data used and how it is preprocessed for real-time detection. In Section 3, we provide an overview of the proposed architecture, with Section 3.2 detailing the classifier and Section 3.3 introducing MCIF. Section 4 discusses the results of our approach, with Section 4.1 analysing the classifier and Section 4.3 evaluating the anomaly detection pipeline. We conclude in Section 5 by outlining avenues for future work. Finally, we show the advantages of MCIF over a standard isolation forest in Appendix A.
2 DATA
2.1 Simulated data
In this work, we use a collection of simulated light curves that match the observing properties of the Zwicky Transient Facility (ZTF; Bellm et al. 2018). This data set is described in section 2 of Muthukrishna et al. (2022) and is based on the simulations developed for PLAsTiCC (Kessler et al. 2019). Each transient in the data set has flux and flux error measurements in the |$g$| and |$r$| passbands with a median cadence of roughly 3 d in each passband.1 We briefly describe the 17 transient classes from Kessler et al. (2019) that are used in this work. Example light curves from each of these classes are illustrated in Appendix D, figs 1–3 of Kessler et al. (2019), and fig. 2 of Muthukrishna et al. (2019).
Type Ia Supernovae (SNIa) occur when a carbon–oxygen white dwarf accretes mass from a binary companion star. The white dwarf eventually reaches a critical mass that triggers an explosion.
Type Ia-91bg Supernovae (SNIa-91bg) tend to have fast-evolving light curves and often have lower luminosities when compared with typical SNIa.
Type Iax Supernovae (SNIax) tend to have lower luminosity and slower ejecta velocities than typical SNIa.
Type Ib and Ic Supernovae (SNIb and SNIc) are thought to be caused by the core collapse of highly dense stars. Both SNIb and SNIc have lost their hydrogen envelopes prior to collapse, but SNIc have also lost their helium envelopes. These SNe are characterized by the lack of hydrogen emissions in their spectra.
Type Ic-BL Supernovae (SNIc-BL) are SNIc with broad lines in their spectra, indicating much faster expansion velocities than typical SNIc.
Type II Supernovae (SNII) are also thought to be formed by the core collapse of highly dense stars. However, unlike SNIb and SNIc, SNII have hydrogen emission lines in their spectra.
Type IIb Supernovae (SNIIb) appear very similar to SNIb. They have rapidly fading hydrogen emission lines resulting in a very similar structure to SNIb.
Type IIn Supernovae (SNIIn) are SNII with very narrow hydrogen emission lines.
Type I Super Luminous Supernovae (SLSN-I) are poorly understood and very bright SNe events. SLSN-I lack hydrogen lines in their spectra.
Pair Instability Supernovae (PISN) are the result of the explosion of a massive star, much larger than a SNII or SNIb (130 to 250 solar masses).
Kilonovae (KN) are explosions resulting from the merging of two neutron stars or a neutron star and a black hole. Only one KN has been observed to date.
Active Galactic Nuclei (AGNs) are the very bright centres of galaxies where the supermassive black hole accretes material and emits significant radiation across the electromagnetic spectrum.
Tidal Disruption Events (TDE) occur when a star orbiting a black hole is pulled apart by the black hole’s tidal forces. The bright flare caused by this event can last up to a few years.
Intermediate Luminosity Optical Transients (ILOT) are a poorly understood class of transients. They occupy the luminosity gap between normal novae and supernovae.
Calcium Rich Transients (CaRT) are defined by their strong calcium emission lines. Their lower luminosity than SNIa and their rapidly evolving rise in luminosity after explosion make their light curves have a strong resemblance to core-collapse supernovae (SNIb and SNIc) and some SNIa-91bg.
|$\mu$|Lens-BSR (uLens-BSR) are a special case of microlensing events where a binary system in the foreground acts as the lens for a background star. The light curves from these events can exhibit asymmetries, multiple peaks, plateaus, and quasi-periodic behaviour.
Due to their low occurrence in nature,2 KN, ILOT, CaRT, PISN, and uLens-BSR are considered the anomalous classes in this work, and all remaining classes are considered the 'common' classes. However, note that the goal of this work is not to identify these specific anomalous classes, but rather to identify anomalies in general, as discussed further in Section 3.4.
2.2 Preprocessing
To preprocess our light curves, we first remove observations with a signal-to-noise ratio (S/N) below 1 (where the noise is simulated based on the observing properties of ZTF) as a rough threshold for observations that are not meaningful. We correct our light curves for Milky Way extinction, which depends solely on the position of the observation in the sky and hence is available in real time. We then define the trigger as the first measurement in a light curve that exceeds a |$5\sigma$| S/N. We remove all observations more than 70 d after the trigger or more than 30 d before the trigger, as these are likely not part of the transient phase of the light curve. Next, we correct all observation times for cosmological time dilation and scale the times in the range |$-30 \lt T - T_{\mathrm{trigger}} \lt 70$| d to values between 0 and 1. The scaled time is computed as follows,

|$t = \dfrac{(T - T_{\mathrm{trigger}})/(1+z) + 30}{100},$|

where |$z$| is the spectroscopic host redshift and |$T$| refers to the observer-frame time in Modified Julian Days (MJD). We found that directly incorporating time dilation through the spectroscopic redshift improved our results. However, we acknowledge that the host redshift may not always be available, and we discuss this limitation briefly in Section 3.2. We then compute the scaled flux (|$f$|) and flux error (|$\epsilon$|) values for each transient by dividing the measured flux and flux error by 500, a constant close to the mean flux in our data set that keeps the inputs to the neural network small without using any future observations. Because we aim to detect anomalies in real time, we cannot scale the light curves by the peak flux or by any other method that uses future observations.
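The preprocessing steps above can be sketched as follows. This is an illustrative reimplementation under our own assumptions, not the authors' released code: the extinction correction is omitted, we assume each light curve has at least one |$5\sigma$| detection, and we apply the trigger window in rest-frame days for simplicity.

```python
import numpy as np

def preprocess(times_mjd, flux, flux_err, z, flux_scale=500.0):
    """Illustrative sketch of the preprocessing described in the text
    (Milky Way extinction correction omitted)."""
    times_mjd, flux, flux_err = map(np.asarray, (times_mjd, flux, flux_err))

    keep = flux / flux_err >= 1.0                     # rough S/N >= 1 cut
    times_mjd, flux, flux_err = times_mjd[keep], flux[keep], flux_err[keep]

    t_trigger = times_mjd[flux / flux_err >= 5.0][0]  # first 5-sigma detection
    rest_days = (times_mjd - t_trigger) / (1.0 + z)   # time-dilation corrected
    window = (rest_days > -30.0) & (rest_days < 70.0)

    t_scaled = (rest_days[window] + 30.0) / 100.0     # maps -30..70 d to 0..1
    return t_scaled, flux[window] / flux_scale, flux_err[window] / flux_scale
```
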
We set aside 80 per cent of the data from the common transient classes for training, 10 per cent for validation, and 10 per cent for testing. We use 100 per cent of our anomalous data for testing and ensure that it is unseen by our model, as mentioned in Section 3.4. The number of transients in our data set from each class are listed in Table 1.
Number of transients in the training set, validation set, test set, and realistic samples (see Section 4.3.2) for each class. All anomalous data are reserved for evaluation.
Class | Training | Validation | Test | Total | Realistic sample|$^a$| |
---|---|---|---|---|---|
SNIa | 9314 | 1131 | 1142 | 11 587 | 1142 |
SNIa-91bg | 10 361 | 1318 | 1321 | 13 000 | 1318 |
SNIax | 10 413 | 1248 | 1339 | 13 000 | 1339 |
SNIb | 4197 | 507 | 563 | 5267 | 563 |
SNIc | 1279 | 169 | 135 | 1583 | 135 |
SNIc-BL | 1157 | 124 | 142 | 1423 | 142 |
SNII | 10 420 | 1279 | 1301 | 13 000 | 1301 |
SNIIn | 10 323 | 1359 | 1318 | 13 000 | 1318 |
SNIIb | 9882 | 1233 | 1208 | 12 323 | 1208 |
TDE | 9078 | 1162 | 1114 | 11 354 | 1114 |
SLSN-I | 10 285 | 1322 | 1273 | 12 880 | 1273 |
AGN | 8473 | 1046 | 1042 | 10 561 | 1042 |
CaRT | 0 | 0 | 10 353 | 10 353 | |$11 \pm 3$| |
KN | 0 | 0 | 11 166 | 11 166 | |$11 \pm 3$| |
PISN | 0 | 0 | 10 840 | 10 840 | |$11 \pm 3$| |
ILOT | 0 | 0 | 11 128 | 11 128 | |$10 \pm 3$| |
uLens-BSR | 0 | 0 | 11 244 | 11 244 | |$10 \pm 3$| |
|$^a$| The mean number of transients across the 50 test samples is shown. The errors refer to the standard deviation of the population size across the 50 sets. All common test data are part of every sample, hence errors are not shown.
3 METHODS
3.1 Overview and rationale
Fig. 1 summarizes our methodology. First, we train a recurrent neural network (RNN) to classify the common classes of transients. Then, we remove the final layer of the trained model and use the remaining architecture as an encoder. The penultimate layer of our classifier has 100 neurons, but we find that any reasonably large latent space size works well for anomaly detection (see Section 4.5). To effectively extract anomalies from a well-represented space, it is essential that transients from similar classes cluster together. In our encoder, the latent space is used directly for light-curve classification, which should naturally lead to clustering of similar transients.

Fig. 1. A visual summary of the architecture described in this work. Our approach first trains a classifier, then repurposes it as an encoder, and finally applies MCIF, proposed in this work, for anomaly detection.
Once we have established this representation space, we must extract anomalies from it. However, when dealing with multiple clusters, a single isolation forest may struggle to capture each cluster equally (for further details, refer to Appendix A). This challenge motivated our approach, MCIF, where we train an isolation forest for each class, representing a distinct cluster, and select the minimum anomaly score as the final score. This minimum score should come from the cluster to which the latent observation is closest, providing the desired functionality.
In this work, we chose a neural network-based architecture for anomaly detection. One advantage of a neural network over hand-selected features is that it is a data-driven model, which should make it more sensitive to out-of-distribution data. This inherent quality of neural networks makes them especially well suited to anomaly detection.
3.2 Classifier
We train a deep neural network (DNN) classifier that maps a matrix of multi-passband light-curve data |$\boldsymbol {X}_s$| for a transient |$s$| to a |$1 \times N_c$| vector of probabilities, reflecting the likelihood of the given light curve belonging to each of the aforementioned non-anomalous transient classes, where |$N_c$| is the number of classes.
The transient classifier utilizes an RNN with gated recurrent units (GRU; Cho et al. 2014) to handle the sequential time-series data. The input for each transient, |$\boldsymbol {X}_s$|, is a |$4 \times N_T$| matrix, where |$N_T$| is the maximum number of time-steps for any input sample. |$N_T$| is 656 in this work, but most transients have far fewer observations. Each row of the input matrix corresponds to one observation |$j$| and is composed of the following vector,

|$\boldsymbol {x}_{sj} = \left[ f_{sj}, \epsilon _{sj}, t_{sj}, \lambda _p \right],$|

where |$f_{sj}$| is the scaled flux for the |$j$|th observation of transient |$s$|, |$\epsilon _{sj}$| is the corresponding scaled uncertainty, |$t_{sj}$| is the scaled time at which the measurement was taken, and |$\lambda _p$| is the central wavelength of the passband from which the measurement comes. For the two passbands in ZTF, the central wavelengths used are {|$\lambda _g=0.4827 \, \mathrm{\mu m}$|, |$\lambda _r=0.6233 \, \mathrm{\mu m}$|}. We include the flux error as an input to the DNN to enable it to learn how to weigh individual flux measurements for more accurate classifications. To implement the variable-length input in tensorflow, we use a Masking layer and zero-pad the arrays to the length of the longest light curve (which is of size |$N_T$|).
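A minimal sketch of this input construction (our own illustration, not the authors' code) pads each light curve with zero rows so that a Masking layer can ignore them. We use the time-major |$N_T \times 4$| layout that Keras recurrent layers expect:

```python
import numpy as np

CENTRAL_WAVELENGTH = {"g": 0.4827, "r": 0.6233}  # micron, from the text

def build_input_matrix(f, err, t, bands, n_t=656):
    # Stack each observation as (f, err, t, lambda_p), then zero-pad to N_T
    # rows so a Keras Masking(mask_value=0.0) layer can skip padded steps.
    rows = np.array([[fi, ei, ti, CENTRAL_WAVELENGTH[b]]
                     for fi, ei, ti, b in zip(f, err, t, bands)])
    padded = np.zeros((n_t, 4))
    padded[: len(rows)] = rows
    return padded
```
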
This input format has two major advantages. First, it eliminates the need for interpolation methods to pass sequential data into our model. Interpolation or imputation is often needed in astronomical transient classifiers as observations are recorded at irregular intervals. For real-time light-curve tasks, linear interpolation is sometimes used (e.g. Muthukrishna et al. 2022) but yields poor results for sparse light curves or surveys with larger cadences, such as LSST. Linear interpolation may confuse models, particularly when applied to a light curve with prolonged intervals of missing data. In such cases, extended sequences of the interpolated light curve may represent only two original observations.
Gaussian process (GP) interpolation has shown success for light-curve data (e.g. Boone 2019; Villar et al. 2021) but does not help with real-time anomaly detection, as it requires the whole light curve to perform the interpolation. Villar et al. (2021) justify using GPs for real-time anomaly detection, stating that each new observation heavily anchors the GP and that the GP acts similarly to physical model priors. However, methods that avoid any reliance on future observations are more appropriate for real-time analysis and better suited for early anomaly detection.
The second advantage of our input method is the inclusion of wavelength information to represent different passbands. Previous works have typically passed the light curves from different passbands separately, resulting in even larger sequences of missing (thus interpolated or imputed) data in the common case where some passbands have significantly fewer observations than other passbands. In our work, the inclusion of the central wavelength helps the model learn the relationship between different passbands, infer parts of the light curve in bands where few observations are present, and allows for all data to be passed in one channel.
From ZTF, we only get data in the |$g$| and |$r$| passbands which makes the difference in passbands less significant. However, LSST and other large-scale transient surveys will have data in up to six different passbands, and giving the model insight into relationships between different passbands will be crucial. This learned relationship, along with some transfer learning (Iman, Arabnia & Rasheed 2023), may make it possible to consolidate data from multiple observatories with different passbands to train one model or allow for a model trained on data from one observatory to be quickly used on data from another observatory.
After the recurrent layers of the DNN, we pass some contextual information into the classifier, which has been shown to be helpful for light-curve classification (Foley & Mandel 2013). In this work, we use the Milky Way extinction and the host galaxy’s spectroscopic redshift as additional inputs to the network. However, we understand that spectroscopic redshift is not always available, and future work should include training a model with photometric redshift or without redshift entirely.
Fig. 2 illustrates the architecture of the neural network classifier. The classifier was implemented using keras and tensorflow (Abadi et al. 2015; Chollet 2015). We detail each layer of our classifier, and its activation function, as follows. The input stream that each layer is part of is shown in parentheses.
Input Layer 1 (Light Curve Stream) – Takes a matrix of shape |$4 \times N_T$| as input to the recurrent neural network.
Gated Recurrent Unit (Cho et al. 2014) (Light Curve Stream) – Two recurrent layers consisting of 100 gated recurrent units with tanh activation functions.
Dense (Light Curve Stream) – A dense layer consisting of 100 neurons with tanh activation functions.
Input Layer 2 (Contextual Stream) – Takes a vector of length 2 containing the Milky Way extinction and spectroscopic redshift.
Dense (Contextual Stream) – A dense layer consisting of 10 neurons with ReLU activation functions. This dense layer is connected to Input Layer 2.
Concatenate Layer – A layer to merge the 2 streams of input. This layer concatenates the final dense layers for each input stream into one layer with 110 neurons.
Dense – A layer with 100 neurons to act as the latent representation for the light curves with a ReLU activation function.
Dense – A layer with 12 neurons (1 for each common transient class). This layer has a softmax activation function to map the output values to a probability score.
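The layer list above can be sketched as a Keras functional model. The layer sizes and activations come from the text; the optimizer settings and other hyperparameters here are placeholders, not the authors' exact configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

N_T, N_CLASSES = 656, 12  # values stated in the text

# Light-curve stream: masked, variable-length sequences of (f, err, t, lambda).
lc_in = layers.Input(shape=(N_T, 4), name="light_curve")
x = layers.Masking(mask_value=0.0)(lc_in)          # skip zero-padded steps
x = layers.GRU(100, activation="tanh", return_sequences=True)(x)
x = layers.GRU(100, activation="tanh")(x)
x = layers.Dense(100, activation="tanh")(x)

# Contextual stream: Milky Way extinction and spectroscopic redshift.
ctx_in = layers.Input(shape=(2,), name="context")
c = layers.Dense(10, activation="relu")(ctx_in)

merged = layers.Concatenate()([x, c])              # 100 + 10 = 110 neurons
latent = layers.Dense(100, activation="relu", name="latent")(merged)
out = layers.Dense(N_CLASSES, activation="softmax")(latent)

model = Model([lc_in, ctx_in], out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```

The named `latent` layer is the penultimate layer that is later reused as the encoder output.
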

Fig. 2. A visualization of the neural network classifier used in this work. Our model has two input streams, one for real-time light-curve data and the other for contextual information. The light-curve data (first input stream) pass through multiple GRU layers and then a dense layer. The contextual information (second input stream) feeds through a dense layer. The final dense layers from both input streams are merged in a concatenate layer, which feeds a 100-neuron dense layer that serves as the latent space of the encoder. Finally, this dense layer feeds into the output layer, which provides classification scores.
In our final model architecture, we use GRU layers proposed and tested in Cho et al. (2014). They are shown to perform better than typical RNNs and have quicker training times than long short-term memory networks (LSTMs; Hochreiter & Schmidhuber 1997), another variant of RNNs3 (Chung et al. 2014).
To counteract imbalances in the distribution of classes in the data set, we use a weighted categorical cross-entropy (see equation 6 of Muthukrishna et al. 2019) as the loss function, with the weight |$w_{c}$| for each class |$c$| inversely proportional to the fraction of transients from that class in the training set,
|$w_{c} \propto \frac{T}{T_{c}},$|
where |$T_c$| is the number of transients from class |$c$| and |$T$| is the total number of samples in the training set. This weighting scheme ensures that classes with fewer samples have higher weights. To train the classifier, we ran it over 40 epochs using the Adam optimizer (Kingma & Ba 2017) with EarlyStopping implemented in keras.
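As a concrete sketch, inverse-frequency weights of the form |$T/T_c$| can be computed as below, so that rarer classes receive higher weights, consistent with the weighting scheme described above. The normalization (scaling weights to average one) is an illustrative choice, not taken from the paper.

```python
import numpy as np

def class_weights(counts):
    """Inverse-frequency class weights: w_c proportional to T / T_c.

    Normalized so the weights average to 1 (a balanced data set gives
    w_c = 1 for every class). The normalization is illustrative.
    """
    counts = np.asarray(counts, dtype=float)
    w = counts.sum() / counts          # w_c ∝ T / T_c
    return w * len(counts) / w.sum()   # rescale so weights average 1

# Rarer classes receive larger weights.
w = class_weights([900, 90, 10])
```

These weights would then be passed to the loss (e.g. as per-class weights in a weighted categorical cross-entropy).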
3.3 Multiclass isolation forests
Once the classifier is trained, we remove the last layer and use the remaining architecture to map any light curve to the latent space. We define this encoder as a function |$E(\boldsymbol {X}_s)$| that takes the aforementioned preprocessed light-curve data, |$\boldsymbol {X}_s$|, and maps it to a 100D latent space |$\boldsymbol {z}_s$|,
|$\boldsymbol {z}_s = E(\boldsymbol {X}_s).$|
For anomaly detection, we now want to compute the anomaly score,
|$a_s = A(\boldsymbol {z}_s),$|
where |$A(\boldsymbol {z}_s)$| is a function that evaluates the anomaly score |$a_s$| for a latent observation |$\boldsymbol {z}_s$|. The goal of this work is to generate relatively large anomaly scores for anomalous transients and smaller anomaly scores for non-anomalous transients.
Isolation forests are a simple yet effective anomaly detection algorithm, especially in the domain of astronomical time-series (e.g. Pruzhinskaya et al. 2019; Ishida et al. 2021; Villar et al. 2021). However, a single isolation forest performs poorly at recognizing some common classes as non-anomalous. This challenge arises from the complexity of our latent space, which contains several distinct clusters that a single isolation forest struggles to model. Thus, we propose a new framework, MCIF, in which a separate isolation forest is trained on the data from each common class, and the minimum anomaly score from any isolation forest is taken as the final anomaly score.
We define 12 isolation forests, |$I_c(\boldsymbol {z}_s)$|, each trained on latent space observations from one common transient class |$c$|. The final anomaly score is defined as
|$a_s = \min _{c} \left(-I_c(\boldsymbol {z}_s)\right),$|
The function |$I_c(\boldsymbol {z}_s)$| is positive for less anomalous transients and negative for anomalous ones, to be consistent with the sklearn implementation of isolation forests. We negate the scores as we prefer defining transients with higher anomaly scores to be more anomalous, but this makes no difference to the results. All isolation forests used in this work are trained with 200 estimators. The results of using a single isolation forest and the benefits of using multiclass isolation forest are explored further in Appendix A.
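A minimal sketch of MCIF using scikit-learn's IsolationForest; the wrapper class, the toy clusters, and the class labels are illustrative, not the paper's implementation. Each class gets its own 200-estimator forest, and because decision_function is positive for inliers and negative for outliers, negating the maximum per-class value is equivalent to taking the minimum per-forest anomaly score.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

class MultiClassIsolationForest:
    """Sketch of MCIF: one isolation forest per common class; the final
    score is the minimum per-forest anomaly score, negated so that larger
    values are more anomalous."""

    def __init__(self, n_estimators=200, random_state=0):
        self.n_estimators = n_estimators
        self.random_state = random_state
        self.forests = {}

    def fit(self, z, labels):
        # Train a separate isolation forest on each class's latent vectors.
        for c in np.unique(labels):
            forest = IsolationForest(n_estimators=self.n_estimators,
                                     random_state=self.random_state)
            self.forests[c] = forest.fit(z[labels == c])
        return self

    def score(self, z):
        # decision_function: positive for inliers, negative for outliers,
        # so a_s = min_c(-I_c(z_s)) = -max_c I_c(z_s).
        per_class = np.stack([f.decision_function(z)
                              for f in self.forests.values()])
        return -per_class.max(axis=0)

# Toy demo: two tight clusters stand in for two "common classes".
rng = np.random.default_rng(0)
z = np.vstack([rng.normal(0, 0.1, (100, 2)), rng.normal(5, 0.1, (100, 2))])
labels = np.array([0] * 100 + [1] * 100)
scores = MultiClassIsolationForest().fit(z, labels).score(
    np.array([[0.0, 0.0], [20.0, 20.0]]))
```

In this toy setup, the point far from both clusters receives a larger anomaly score than a point inside a cluster.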
3.4 Limitations of evaluating anomaly detection methods
Evaluating the performance of anomaly detection models is challenging because anomalies are rare, and it is difficult to build validation sets that account for the unknown. To simulate seeing anomalous data for the first time, the five aforementioned anomalous classes are not revealed to the model until final evaluation. This approach mimics the real-world situation where anomaly detectors (and astronomers) have limited knowledge of the anomalies they may discover.
The goal of this work is not to identify the specific anomalous classes mentioned, but rather to detect anomalies in general. Hence, using physical model priors or giving the model any information about anomalous classes beforehand would reveal too much about the specific rare transient classes used in this work. This could hinder the model’s performance on novel, real-world anomalies that may differ from the ones used in this study.
4 RESULTS AND ANALYSIS
In this section, we first evaluate the performance of our neural network classifier on distinguishing between common transient classes (Section 4.1). We then analyse how well our proposed anomaly detection pipeline, utilizing the classifier as an encoder (Section 4.2), is able to identify rare/anomalous transients (Section 4.3).
4.1 Classifier
In this section, we evaluate the performance of our classifier on the test set consisting of the 12 common transient classes.
The normalized confusion matrix in Fig. 3 [left] illustrates the classifier’s ability to accurately predict the correct transient class on the test data. Each cell indicates the fraction of transients from the true class that are classified into the predicted class. The high values along the diagonal, approaching 1.0, indicate strong performance. The misclassifications, indicated by the off-diagonal values, predominantly occur between subclasses of Type Ia supernovae (SNIa, SNIa-91bg, and SNIax) and between the core-collapse supernova types (SNIb, SNIc, SNII subtypes), which is expected given their observational similarities. These SNe have been shown to confuse previous models (see fig. 7 of Muthukrishna et al. 2019).
The normalized confusion matrix [left] and ROC curve [right] of the 12 common transient classes used for training given full light-curve data. Each cell in the confusion matrix signifies the fraction of transients from each True Class that was classified into the Predicted Class. The ROC curve illustrates the true positive rate against the false positive rate across various threshold probabilities for each class, with the Area Under ROC curve (AUROC) in parenthesis. The model’s evaluation is conducted on the test set consisting of 10 per cent of the data from the common classes.
The UMAP reduction of the latent space derived from the test set, which includes 10 per cent of the common transients reserved for testing the classifier [left] and randomly sampled anomalous transients from the unseen anomaly data set [right]. Despite not being trained on this data, the learned features still exhibit clear visual structure and anomalous transients from distinct clusters separate from the common classes. It is important to note that the UMAP reduction is used only for visualization purposes, and the actual anomaly detection (as seen in Fig. 5 and the remaining plots) is performed on the 100D latent space.

The median anomaly score (rounded to two decimal places) for each class extracted from the latent representations of full light curves. The scores come from the full, unseen anomalous data set for anomalous classes and the 10 per cent test data set for common classes. The five classes on the right (in bold) are anomalous. The separation between the scores of anomalous classes and common classes is evident, and the anomaly scores for the common classes are consistently low signifying they are not erroneously marked anomalous. For further analysis, Fig. 6 shows the full anomaly score distribution for each class.

The distribution of anomaly scores for each class, computed using MCIF on the latent representations derived from full light curves. The scores are plotted using 100 per cent of the anomalous data set (unseen during training) and the test data set of common classes. The anomalous classes (bottom five) generally show higher anomaly scores with positively skewed distributions. The common classes and CaRTs all have low anomaly scores on average.

The precision-recall curves for our anomaly detection pipeline for each anomalous class and a grey line indicating the average across all common classes. The precision and recall are plotted against the threshold anomaly score in the left sub-figures. Precision and recall are defined in equation (7) and equation (8), respectively, and calculated on a set comprising half of the transients from the tested anomalous class and half randomly sampled common transients (all coming from the test data that was unseen by the model). Promisingly, the Area Under the Precision-Recall Curve (AUCPR) for each anomalous class (except CaRTs) is very high. The AUCPR for the common classes is low indicating that they are not often predicted as anomalous by MCIF.
While the confusion matrix provides valuable insights into the accuracy of our model, it only considers the highest scoring predicted class and does not use the continuous probability scores that our classifier outputs for each possible class. The receiver operating characteristic (ROC) curve, shown in Fig. 3 [right], effectively uses these probabilities. It plots the true positive rate (the fraction of positive samples correctly identified as positive) against the false positive rate (the fraction of negative samples incorrectly identified as positive) for each class across a range of threshold probability values. This metric is particularly useful in a multiclass context, as it captures the model’s ability to assign low probabilities to several classes when it is uncertain. A key measure of performance, the Area Under the ROC Curve (AUROC), quantifies the overall ability of the classifier to discriminate between classes. In our study, high AUROC values, approaching 1 for all classes, underscore the robustness of our classifier.
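For illustration, the per-class one-vs-rest AUROC of the kind shown in Fig. 3 [right] can be computed with scikit-learn; the labels and softmax-like probabilities below are toy values, not the paper's data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy example: 3 classes, softmax-like probability scores per sample.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_prob = np.array([[0.8, 0.1, 0.1],
                   [0.6, 0.3, 0.1],
                   [0.1, 0.7, 0.2],
                   [0.2, 0.6, 0.2],
                   [0.1, 0.2, 0.7],
                   [0.3, 0.2, 0.5]])

# One-vs-rest AUROC per class: treat class c as positive, the rest as
# negative, and use the class-c probability as the ranking score.
per_class_auc = [roc_auc_score((y_true == c).astype(int), y_prob[:, c])
                 for c in range(3)]
```

Here each class is perfectly separated by its own probability column, so every per-class AUROC is 1.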
A brief comparison of the results in this work and those of a similar DNN light-curve classifier (RAPID; Muthukrishna et al. 2019) hints that our classifier can perform on par with a reasonable baseline and has better accuracies for many classes. This improvement can be attributed, in part, to the use of an improved simulated data set that has fixed some problems with the simulation of several events, including a lack of diversity of core-collapse supernovae (see Vincenzi et al. 2021, for details on the problems with the original PLAsTiCC simulations). RAPID also included some of the rare classes that we deliberately did not include in our classifier and instead designated as anomalous (see fig. 7 of Muthukrishna et al. 2019). While our classification methodology was very similar to RAPID, a key difference was our input method, which bypasses the missing-data problem. Future work should compare the effectiveness of this input method with traditional interpolation or imputation methods; such a comparison is beyond the scope of this work.
4.2 Latent representation
After repurposing the classifier as an encoder, we obtain a 100D latent space. We can visualize this latent space with UMAP (McInnes, Healy & Melville 2020), a manifold embedding technique, to determine if there is visible clustering.4 In Fig. 4 [left], we plot the UMAP representations of the test data. While it is difficult to examine some of the overlapping classes in this embedded space, there is clear clustering of many of the classes. In Fig. 4 [right], we colour all of the common classes grey and include a sample of transients from the anomalous classes. We see that the anomalous classes cluster together in the embedded space and separate from the common transients despite the model not being trained on these objects. This level of clustering suggests that our encoder may be discovering generalizable patterns within light curves, and this property may have potential use cases beyond anomaly detection in few-shot classification. It is important to note that we only use UMAP for visualization purposes and that the latent space used for anomaly detection is obtained directly from the penultimate layer of the classifier.
4.3 Anomaly detection
After training MCIF on the latent representations of the training data, we pass unseen test data and anomalous data through our pipeline for evaluation. In Fig. 5, we list the median anomaly score predicted by MCIF for each class. The anomalous classes have much higher median anomaly scores than the common classes, illustrating a significant distinction between the scores of all common and most anomalous classes. This difference is not as pronounced when using a single isolation forest and the advantages of employing MCIF are discussed further in Appendix A.
In Fig. 6, we plot the distribution of anomaly scores predicted by MCIF for each class. The plot further demonstrates the distinction in anomaly scores of common and anomalous transients. Notably, there is a significant skew towards larger anomaly scores for the anomalous classes, reaffirming our model’s performance. However, calcium rich transients (CaRTs), despite being one of our anomalous classes, tend to have lower anomaly scores. CaRTs are notoriously difficult to photometrically classify as anomalous due to their resemblance to other common supernova classes (see fig. 8 of Muthukrishna et al. 2019 for example). One of the most effective ways to detect CaRTs is to observe a calcium line in their spectra, and a robust anomaly detector would use photometric differences to discern this spectroscopic dissimilarity. However, ZTF is limited to only the |$g$| and |$r$| passbands, which lets this subtle spectroscopic difference go unnoticed. The upcoming Legacy Survey of Space and Time on the Rubin Observatory will observe data in six passbands and will likely mitigate this issue.

Anomalies detected in the 2000 top-ranked transients by MCIF anomaly score index, using a test sample reflecting the estimated frequency of anomalies in nature. In the sample of 12 040 common transients and 54 anomalous transients, the model recalls |$41\pm 3$| |$(\sim 75~{{\rm per\ cent}})$| of the anomalies after following up the top 2000 ranked transients. The left plot aggregates all anomalies and the right plot delineates per class. To control for the variance imposed by the small anomaly sample size, we repeat the sampling 50 times. The mean and standard deviation of detected anomalies are plotted as the solid lines and shaded regions, respectively.
4.3.1 Anomaly precision and recall
To identify anomalies with MCIF, a threshold anomaly score would need to be chosen such that the common transient classes are not flagged as anomalous, while only the anomalous classes are flagged as anomalous. This threshold score needs to lead to a high precision and a high recall of anomalies. Precision is a measure of how pure our anomaly predictions are, and recall is a measure of how many anomalies we can expect to find. We define anomaly precision and recall as
|$P_{c, \tau } = \frac{\sum _{L^s \in L^c} \mathbb {1}(a_s \ge \tau )}{\sum _{L^s \in L} \mathbb {1}(a_s \ge \tau )}, \qquad R_{c, \tau } = \frac{\sum _{L^s \in L^c} \mathbb {1}(a_s \ge \tau )}{|L^c|},$|
where |$P_{c, \tau }$| and |$R_{c, \tau }$| are the precision and recall of class |$c$|, the tested class, at threshold anomaly score |$\tau$|, |$L$| is the set of all transients, |$L^c$| is the set of all transients from class |$c$| in |$L$|, and |$L^s$| is a transient from either |$L^c$| or |$L$|. We further define the set |$L$| to be comprised of half class |$c$| and half of transients coming from the opposite type as |$c$| when computing the precision and recall for class |$c$|. For example, if the tested class was KN, the set |$L$| would contain half KN and half common transients. Note that only precision (and not recall) is influenced by the composition of |$L$|.
In other words, precision is calculated as the number of predicted anomalies from each class divided by the total number of predicted anomalies, and recall is calculated as the number of predicted anomalies from each class divided by the total number of transients from that class. Both are defined given a threshold anomaly score and over a deliberately defined sample (as described above).
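A small numpy sketch of these definitions, assuming a transient is flagged as anomalous when |$a_s \ge \tau$|; the function name and sample values are illustrative.

```python
import numpy as np

def anomaly_precision_recall(scores, is_class_c, tau):
    """Precision and recall for the tested class c at threshold tau.

    Precision: flagged class-c transients / all flagged transients.
    Recall: flagged class-c transients / all class-c transients.
    (Precision defaults to 1.0 when nothing is flagged, a convention
    chosen here for the sketch.)
    """
    scores = np.asarray(scores)
    is_class_c = np.asarray(is_class_c, dtype=bool)
    flagged = scores >= tau
    hits = (flagged & is_class_c).sum()
    precision = hits / flagged.sum() if flagged.sum() else 1.0
    recall = hits / is_class_c.sum()
    return precision, recall

# 50-50 sample: two class-c transients (high scores), two common ones.
p, r = anomaly_precision_recall([0.9, 0.8, 0.2, 0.1],
                                [True, True, False, False], tau=0.5)
```

At the lowest threshold (tau = 0 here) this sample reproduces the behaviour described below: precision is 0.5 and recall is 1.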
In Fig. 7 we plot the anomaly precision and recall at various threshold anomaly scores |$\tau$| for all anomalous classes and an average over all common classes. In this context, the precision is 0.5 at the lowest threshold because all transients are marked as anomalous and 50 per cent of them belong to the tested class. Recall is 1 at the lowest threshold, as all transients of the tested class are flagged as anomalous. This behaviour is the rationale for the 50–50 composition of the set |$L$|: it standardizes the baseline precision, and hence the AUC scores, across anomalous and common classes.
The Area Under the Precision-Recall Curve (AUCPR) is a good indicator of how often a class is marked anomalous. The AUCPR scores for the anomalous classes, even CaRTs, are significantly better than those for the common classes. However, the precision of the common classes jumps suddenly at high thresholds while it declines for the anomalous classes, signifying that the transients with the very highest anomaly scores are false positives. This is understandable given that most anomalies populate anomaly scores |$a_{s} \lesssim 0.1$| in Fig. 6: at thresholds |$\tau \gtrsim 0.1$|, the recall drops because these anomalous transients no longer meet the threshold, and since common transients are far more numerous, a few of their extreme latent-space representations dominate the precision at the highest thresholds. Fig. 8 confirms that the top 5–10 candidates are common transients, but beyond this, as |$\tau$| decreases, our model captures a much higher fraction of true anomalies.
Selecting a threshold anomaly score near the upper-right region of the precision-recall curve will be a good choice for identifying as many anomalies as possible while still having a pure sample with few false positives. This point represents a trade-off between high precision and high recall, often occurring where the curve begins to plateau or shows a notable change in slope.
4.3.2 Detection rates in a representative population
The previous results do not take into account that anomalous transients are inherently less frequent than common transients. While the frequency of anomalies in nature is not known, a good estimate for the expected population frequency was presented in Kessler et al. (2019) for the PLAsTiCC data set (The PLAsTiCC team 2018). The rate of common transients, as defined in this work, was roughly 220 times larger than that of anomalous transients, using PLAsTiCC frequencies for each class. We used this rate to randomly select a more realistic test data set that contained 12 040 normal transients and 54 anomalies. Randomly selecting a representative sample of only 54 anomalies is subject to significant variance. Therefore, we created 50 sample data sets to perform 50-fold cross-validation. The mean and standard deviation of the number of transients from each class present in our 50 test samples are listed in Table 1.
For each validation set, we ranked the transients by the anomaly scores predicted by MCIF. We followed up the top 2000 ranked transients (roughly 15 per cent of the data set) as the candidate pool. Across 50 repeated trials, we identified |$41\pm 3$| out of the 54 true anomalies in our data set (recalling |$\sim 75~{{\rm per\ cent}}$| of the anomalies). In Fig. 8, we plot the fraction of anomalies recalled5 and the total number of anomalies recovered for thresholds up to the top 2000 transients. MCIF recalls most anomalies within candidates having the highest anomaly scores, followed by a tapering as fewer anomalies remain.
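The follow-up experiment above amounts to computing recall at a fixed candidate budget. A minimal sketch (with toy scores, not the paper's data):

```python
import numpy as np

def recall_at_k(scores, is_anomaly, k):
    """Fraction of true anomalies among the k highest-scoring transients."""
    scores = np.asarray(scores)
    is_anomaly = np.asarray(is_anomaly, dtype=bool)
    top_k = np.argsort(scores)[::-1][:k]  # indices of the k largest scores
    return is_anomaly[top_k].sum() / is_anomaly.sum()

# Toy ranking: 3 anomalies, two of which land in the top 3 by score.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
is_anomaly = np.array([True, False, True, False, False, True])
frac = recall_at_k(scores, is_anomaly, k=3)
```

In the paper's setting, k = 2000 and the calculation is repeated over the 50 sampled test sets to obtain the quoted mean and standard deviation.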
Examining the detection rate for each anomaly class, we see that the model’s trouble in identifying CaRTs as anomalous brings down the overall anomaly recall. This plot and the precision-recall curves show a consistent pseudo-hierarchy of which anomalies are easiest to detect. If we exclude CaRTs from our sample of anomalies, our recall of anomalies increases to |$47\pm 2$| out of the 54 true anomalies in our data set (recalling |$\sim 87~{{\rm per\ cent}}$| of the anomalies).
It is worth emphasizing that the objective of this work is to identify anomalies in a general sense rather than tailoring to specific classes, and therefore, this work does not rely on specific information about the anomalous classes defined (see Section 3.4). The only specific attribute used is the estimated frequency of anomalies (220 times less frequent than the common classes), which serves as a reference as it is impossible to estimate a similar number for anomalies that have never been observed.
Given the complexity of our deep learning approach, it is important to examine how the sparsity of light-curve sampling affects anomaly detection performance. Sparsely sampled light curves from common classes could potentially be assigned large anomaly scores if the model struggles to accurately represent them in the latent space. To investigate this, we analysed the relationship between the number of observations in a light curve and its likelihood of being classified as anomalous. Our analysis revealed that while anomaly detection performance generally improves with more observations, sparsely sampled light curves are not disproportionately classified as false positives. This resilience to sparse sampling may be attributed to our RNN-based architecture and input method, which are designed to handle irregular time-series data. The ability of our model to maintain performance even with limited observations is particularly valuable for early detection of anomalies in ongoing surveys.
4.3.3 Real-time detection
Identifying anomalies in real-time is important for obtaining early-time follow-up observations, which is crucial for understanding their physical mechanisms and progenitor systems. However, directly assessing our architecture’s real-time performance is challenging due to the irregular sampling of light curves in our input format.
To assess the real-time performance of our architecture, we plot the median anomaly scores over time for a sample of 2000 common and 2000 anomalous transients in Fig. 9. To construct this plot without relying on interpolation, we calculate scores at discrete times |$l$| sampled at 1-d intervals from |$-30$| to 70 d relative to trigger, using only observations occurring before each time |$l$| to mimic a real-time scenario. To ensure sufficient information for robust scoring, we only consider transients where the final observation was recorded after time |$l-5$|. The results show a clear divergence where common transient scores tend to decline around trigger, while anomalous transient scores remain consistently high.
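A sketch of the real-time cut described above, assuming observation times are measured in days relative to trigger; the function and variable names are illustrative.

```python
import numpy as np

def observations_before(times, fluxes, l, window=5):
    """Real-time cut for scoring a transient at time l: keep only
    observations taken before l, and skip transients whose final
    observation was recorded at or before l - window (too stale)."""
    times = np.asarray(times)
    if times.max() <= l - window:
        return None  # final observation too old: do not score at time l
    mask = times < l
    return np.asarray(fluxes)[mask]

# Toy light curve: observations at -10, 0, 10, and 30 d relative to trigger.
kept = observations_before([-10, 0, 10, 30], [1, 2, 3, 4], l=15)
```

Sweeping l from -30 to 70 d in 1-d steps and re-scoring the retained observations at each step reproduces the real-time evaluation without any interpolation.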

Median MCIF anomaly score over time for a sample of transients from the test set. Real-time anomaly scores are calculated at intervals of 1 d for a sample of 2000 common and 2000 total anomalous light curves. The left plot shows the scores for the common and anomalous transients as a whole, while the right plot shows each anomalous class individually. The anomaly scores for the common transients decline before the trigger, while the anomalous transients remain at high scores throughout most of the transient’s evolution.
Fig. 9 reveals two notable irregularities. First, the anomaly scores for common transients decline before trigger, which is unexpected given that the pre-trigger phase of most transient classes should primarily consist of background noise. Further analysis of the pre-trigger classification results (Fig. C2) reveals that certain transients, most notably SLSN-I and AGN, are almost all classified before trigger, thereby lowering the average anomaly score for common transients. This can be attributed to the fact that redshift and pre-trigger information such as host galaxy colour and some AGN pre-trigger variability are particularly useful for classifying these transients before trigger (see fig. 16 of Muthukrishna et al. 2019).

Anomaly detection performance (AUROC) of models trained with different latent space sizes. A significant improvement is observed when increasing the latent size up to 50 dimensions, with performance plateauing thereafter.
Secondly, KN exhibit a significant dip around the time of trigger. Upon further analysis, we found that certain common transient classes also experienced a similar dip around trigger; however, unlike KN, they do not rebound back to higher anomaly scores. This dip is related to the inherent nature of the trigger of a light curve, which often marks the first real observation of the transient phase of a light curve, and serves as a reset for the anomaly score. A more detailed analysis of this phenomenon is provided in Appendix C.
These preliminary findings suggest the potential for enabling real-time identification of anomalous transients. While some known rare classes can be difficult to distinguish from the common classes without a significant amount of data, others can be detected surprisingly soon after trigger. The ability to flag unusual events early in their evolution could prove invaluable for optimizing the allocation of follow-up resources and maximizing the scientific returns from rare transient discoveries.
4.4 Comparison against other approaches
In the field of anomaly detection in time-domain astronomy, there is no comprehensive baseline on which to evaluate different detection methods. This is largely because of the vastly differing definitions of what anomaly detection is, for example, the difference between unsupervised and novelty detection methods as described in Section 1. Baselining all existing anomaly detection methods is a much-needed line of future work, especially as there is no consensus on which method will work best on the deluge of data that will be available when LSST is running.
Despite these challenges, Perez-Carrasco et al. (2023) evaluated five different approaches to anomaly detection (see Table 2 for all benchmarked approaches), and we use their data set (which was inspired by Sánchez-Sáez et al. 2021) to benchmark our classifier-based approach. In contrast to our data set of raw light-curve data, this data set consists of tabular features extracted from light curves. We evaluate three new techniques for anomaly detection on this data set: using a classifier with MCIF, a classifier with just a single isolation forest, and MCIF on its own.6 The data set is split into three hierarchical categories with 4–5 transient classes each. Evaluation is performed separately for each class, each time counting that transient class as anomalous and the rest of its hierarchical category as common. Full evaluation is performed across five folds of testing data for cross-validation.
Performance of each model when applied to the data set used in Perez-Carrasco et al. (2023). Each row represents a different anomaly detection algorithm and each column represents a different class being chosen as the anomalous class. The performance is evaluated using the AUROC score of detected anomalies. The top three metrics per class are marked in bold. The AUROC scores for the first five methods are taken directly from and are reported in Perez-Carrasco et al. (2023). A visual representation of this table is shown in Fig. B1.
Columns are grouped by category: SLSN–SNIbc are Transient, AGN–YSO are Stochastic, and CEP–LPV are Periodic classes.

| Method | SLSN | SNII | SNIa | SNIbc | AGN | Blazar | CV/Nova | QSO | YSO | CEP | DSCT | E | RRL | LPV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IForest (Liu et al. 2008) | 0.640 ± 0.014 | 0.721 ± 0.021 | 0.428 ± 0.032 | 0.490 ± 0.038 | 0.573 ± 0.017 | 0.710 ± 0.009 | **0.975 ± 0.001** | 0.468 ± 0.016 | **0.913 ± 0.003** | 0.359 ± 0.007 | 0.295 ± 0.012 | 0.469 ± 0.021 | 0.549 ± 0.033 | **0.971 ± 0.007** |
| OCSVM (Schölkopf et al. 1999) | 0.577 ± 0.014 | 0.587 ± 0.014 | 0.434 ± 0.021 | 0.492 ± 0.011 | 0.532 ± 0.008 | 0.443 ± 0.002 | 0.909 ± 0.001 | **0.517 ± 0.005** | 0.792 ± 0.005 | 0.432 ± 0.004 | **0.557 ± 0.005** | 0.555 ± 0.003 | 0.539 ± 0.004 | 0.943 ± 0.001 |
| AE (Rumelhart & McClelland 1986) | **0.736 ± 0.022** | **0.807 ± 0.021** | 0.438 ± 0.015 | 0.537 ± 0.019 | **0.701 ± 0.010** | **0.762 ± 0.006** | **0.980 ± 0.016** | 0.443 ± 0.004 | **0.990 ± 0.001** | 0.564 ± 0.024 | 0.367 ± 0.015 | **0.864 ± 0.009** | **0.907 ± 0.015** | **0.996 ± 0.000** |
| VAE (Kingma & Welling 2014) | 0.669 ± 0.015 | 0.690 ± 0.023 | 0.404 ± 0.018 | 0.522 ± 0.025 | 0.596 ± 0.007 | 0.597 ± 0.010 | 0.849 ± 0.028 | **0.500 ± 0.009** | 0.795 ± 0.009 | 0.442 ± 0.010 | 0.417 ± 0.007 | 0.561 ± 0.007 | 0.451 ± 0.006 | 0.936 ± 0.007 |
| Deep SVDD (Ruff et al. 2018) | 0.644 ± 0.043 | 0.731 ± 0.043 | 0.475 ± 0.040 | 0.507 ± 0.040 | 0.496 ± 0.025 | 0.607 ± 0.044 | 0.932 ± 0.015 | 0.411 ± 0.008 | 0.901 ± 0.022 | 0.707 ± 0.027 | 0.482 ± 0.054 | 0.636 ± 0.055 | 0.774 ± 0.068 | 0.785 ± 0.025 |
| MCDSVDD (Perez-Carrasco et al. 2023) | **0.686 ± 0.051** | **0.828 ± 0.024** | **0.624 ± 0.039** | **0.584 ± 0.032** | **0.706 ± 0.069** | 0.512 ± 0.113 | 0.770 ± 0.127 | 0.483 ± 0.080 | 0.854 ± 0.041 | **0.858 ± 0.025** | **0.819 ± 0.015** | **0.945 ± 0.006** | **0.953 ± 0.003** | 0.953 ± 0.008 |
| Classifier + IForest (This work) | **0.757 ± 0.047** | **0.811 ± 0.017** | **0.619 ± 0.073** | 0.556 ± 0.039 | **0.715 ± 0.028** | **0.720 ± 0.032** | 0.945 ± 0.015 | 0.456 ± 0.041 | **0.977 ± 0.003** | **0.766 ± 0.066** | 0.504 ± 0.111 | **0.811 ± 0.038** | **0.907 ± 0.026** | **0.969 ± 0.016** |
| Classifier + MCIF (This work) | 0.567 ± 0.091 | 0.699 ± 0.046 | **0.536 ± 0.061** | **0.560 ± 0.034** | 0.615 ± 0.048 | 0.701 ± 0.045 | 0.882 ± 0.050 | **0.605 ± 0.051** | 0.893 ± 0.025 | **0.875 ± 0.036** | **0.742 ± 0.044** | 0.773 ± 0.031 | 0.808 ± 0.046 | 0.779 ± 0.107 |
| MCIF (This work) | 0.503 ± 0.018 | 0.668 ± 0.008 | 0.532 ± 0.007 | **0.643 ± 0.005** | 0.614 ± 0.02 | **0.745 ± 0.008** | **0.966 ± 0.003** | 0.446 ± 0.007 | 0.907 ± 0.007 | 0.514 ± 0.013 | 0.433 ± 0.009 | 0.476 ± 0.021 | 0.447 ± 0.011 | 0.959 ± 0.004 |
. | Transient . | Stochastic . | Periodic . | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Method . | SLSN . | SNII . | SNIa . | SNIbc . | AGN . | Blazar . | CV/Nova . | QSO . | YSO . | CEP . | DSCT . | E . | RRL . | LPV . |
IForest | 0.640 | 0.721 | 0.428 | 0.490 | 0.573 | 0.710 | |$\mathbf {0.975}$| | 0.468 | |$\mathbf {0.913}$| | 0.359 | 0.295 | 0.469 | 0.549 | |$\mathbf {0.971}$| |
(Liu et al. 2008) | |$\pm 0.014$| | |$\pm 0.021$| | |$\pm 0.032$| | |$\pm 0.038$| | |$\pm 0.017$| | |$\pm 0.009$| | |$\mathbf {\pm 0.001}$| | |$\pm 0.016$| | |$\mathbf {\pm 0.003}$| | |$\pm 0.007$| | |$\pm 0.012$| | |$\pm 0.021$| | |$\pm 0.033$| | |$\mathbf {\pm 0.007}$| |
OCSVM | 0.577 | 0.587 | 0.434 | 0.492 | 0.532 | 0.443 | 0.909 | |$\mathbf {0.517}$| | 0.792 | 0.432 | |$\mathbf {0.557}$| | 0.555 | 0.539 | 0.943 |
(Schölkopf et al. 1999) | |$\pm 0.014$| | |$\pm 0.014$| | |$\pm 0.021$| | |$\pm 0.011$| | |$\pm 0.008$| | |$\pm 0.002$| | |$\pm 0.001$| | |$\mathbf {\pm 0.005}$| | |$\pm 0.005$| | |$\pm 0.004$| | |$\mathbf {\pm 0.005}$| | |$\pm 0.003$| | |$\pm 0.004$| | |$\pm 0.001$| |
AE | |$\mathbf {0.736}$| | |$\mathbf {0.807}$| | 0.438 | 0.537 | |$\mathbf {0.701}$| | |$\mathbf {0.762}$| | |$\mathbf {0.980}$| | 0.443 | |$\mathbf {0.990}$| | 0.564 | 0.367 | |$\mathbf {0.864}$| | |$\mathbf {0.907}$| | |$\mathbf {0.996}$| |
(Rumelhart & McClelland 1986) | |$\mathbf {\pm 0.022}$| | |$\mathbf {\pm 0.021}$| | |$\pm 0.015$| | |$\pm 0.019$| | |$\mathbf {\pm 0.010}$| | |$\mathbf {\pm 0.006}$| | |$\mathbf {\pm 0.016}$| | |$\pm 0.004$| | |$\mathbf {\pm 0.001}$| | |$\pm 0.024$| | |$\pm 0.015$| | |$\mathbf {\pm 0.009}$| | |$\mathbf {\pm 0.015}$| | |$\mathbf {\pm 0.000}$| |
VAE | 0.669 | 0.690 | 0.404 | 0.522 | 0.596 | 0.597 | 0.849 | |$\mathbf {0.500}$| | 0.795 | 0.442 | 0.417 | 0.561 | 0.451 | 0.936 |
(Kingma & Welling 2014) | |$\pm 0.015$| | |$\pm 0.023$| | |$\pm 0.018$| | |$\pm 0.025$| | |$\pm 0.007$| | |$\pm 0.010$| | |$\pm 0.028$| | |$\mathbf {\pm 0.009}$| | |$\pm 0.009$| | |$\pm 0.010$| | |$\pm 0.007$| | |$\pm 0.007$| | |$\pm 0.006$| | |$\pm 0.007$| |
Deep SVDD | 0.644 | 0.731 | 0.475 | 0.507 | 0.496 | 0.607 | 0.932 | 0.411 | 0.901 | 0.707 | 0.482 | 0.636 | 0.774 | 0.785 |
(Ruff et al. 2018) | |$\pm 0.043$| | |$\pm 0.043$| | |$\pm 0.040$| | |$\pm 0.040$| | |$\pm 0.025$| | |$\pm 0.044$| | |$\pm 0.015$| | |$\pm 0.008$| | |$\pm 0.022$| | |$\pm 0.027$| | |$\pm 0.054$| | |$\pm 0.055$| | |$\pm 0.068$| | |$\pm 0.025$| |
MCDSVDD | |$\mathbf {0.686}$| | |$\mathbf {0.828}$| | |$\mathbf {0.624}$| | |$\mathbf {0.584}$| | |$\mathbf {0.706}$| | 0.512 | 0.770 | 0.483 | 0.854 | |$\mathbf {0.858}$| | |$\mathbf {0.819}$| | |$\mathbf {0.945}$| | |$\mathbf {0.953}$| | 0.953 |
(Perez-Carrasco et al. 2023) | |$\mathbf {\pm 0.051}$| | |$\mathbf {\pm 0.024}$| | |$\mathbf {\pm 0.039}$| | |$\mathbf {\pm 0.032}$| | |$\mathbf {\pm 0.069}$| | |$\pm 0.113$| | |$\pm 0.127$| | |$\pm 0.080$| | |$\pm 0.041$| | |$\mathbf {\pm 0.025}$| | |$\mathbf {\pm 0.015}$| | |$\mathbf {\pm 0.006}$| | |$\mathbf {\pm 0.003}$| | |$\pm 0.008$| |
Classifier + IForest | |$\mathbf {0.757}$| | |$\mathbf {0.811}$| | |$\mathbf {0.619}$| | 0.556 | |$\mathbf {0.715}$| | |$\mathbf {0.720}$| | 0.945 | 0.456 | |$\mathbf {0.977}$| | |$\mathbf {0.766}$| | 0.504 | |$\mathbf {0.811}$| | |$\mathbf {0.907}$| | |$\mathbf {0.969}$| |
(This work) | |$\mathbf {\pm 0.047}$| | |$\mathbf {\pm 0.017}$| | |$\mathbf {\pm 0.073}$| | |$\pm 0.039$| | |$\mathbf {\pm 0.028}$| | |$\mathbf {\pm 0.032}$| | |$\pm 0.015$| | |$\pm 0.041$| | |$\mathbf {\pm 0.003}$| | |$\mathbf {\pm 0.066}$| | |$\pm 0.111$| | |$\mathbf {\pm 0.038}$| | |$\mathbf {\pm 0.026}$| | |$\mathbf {\pm 0.016}$| |
Classifier + MCIF | 0.567 | 0.699 | |$\mathbf {0.536}$| | |$\mathbf {0.560}$| | 0.615 | 0.701 | 0.882 | |$\mathbf {0.605}$| | 0.893 | |$\mathbf {0.875}$| | |$\mathbf {0.742}$| | 0.773 | 0.808 | 0.779 |
(This work) | |$\pm 0.091$| | |$\pm 0.046$| | |$\mathbf {\pm 0.061}$| | |$\mathbf {\pm 0.034}$| | |$\pm 0.048$| | |$\pm 0.045$| | |$\pm 0.050$| | |$\mathbf {\pm 0.051}$| | |$\pm 0.025$| | |$\mathbf {\pm 0.036}$| | |$\mathbf {\pm 0.044}$| | |$\pm 0.031$| | |$\pm 0.046$| | |$\pm 0.107$| |
MCIF | 0.503 | 0.668 | 0.532 | |$\mathbf {0.643}$| | 0.614 | |$\mathbf {0.745}$| | |$\mathbf {0.966}$| | 0.446 | 0.907 | 0.514 | 0.433 | 0.476 | 0.447 | 0.959 |
(This work) | |$\pm 0.018$| | |$\pm 0.008$| | |$\pm 0.007$| | |$\mathbf {\pm 0.005}$| | |$\pm 0.02$| | |$\mathbf {\pm 0.008}$| | |$\mathbf {\pm 0.003}$| | |$\pm 0.007$| | |$\pm 0.007$| | |$\pm 0.013$| | |$\pm 0.009$| | |$\pm 0.021$| | |$\pm 0.011$| | |$\pm 0.004$| |
Performance of each model when applied to the data set used in Perez-Carrasco et al. (2023). Each row represents a different anomaly detection algorithm and each column represents a different class being chosen as the anomalous class. The performance is evaluated using the AUROC score of detected anomalies. The top three metrics per class are marked in bold. The AUROC scores for the first five methods are taken directly from and are reported in Perez-Carrasco et al. (2023). A visual representation of this table is shown in Fig. B1.
As seen in Table 2 (and visually in Fig. B1), our classifier-based approach with an isolation forest is among the top approaches for most transient classes, demonstrating the power of using a classifier’s latent space for anomaly detection. Using the classifier with MCIF also performs promisingly, although on this data set it is sometimes worse than using the classifier with a single isolation forest. This is not the case on our own data set, as discussed further in Appendix A.
4.5 Scaling the latent space
Anomaly detection poses a unique challenge for model evaluation because of the nature of unsupervised learning: true anomalies are only revealed during the final testing phase. Consequently, we refrain from tuning hyperparameters for model selection and instead retrospectively analyse the effects of different hyperparameter choices, particularly the size of the latent space, without risking overfitting during the development process.
To assess the impact of latent space size on anomaly detection performance, we train multiple models with varying latent dimensions and evaluate them using the AUROC. As shown in Fig. 10, increasing the latent size beyond 50 leads to significant improvements in anomaly detection performance, with diminishing returns thereafter. Smaller models generally exhibit lower average performance and higher variance. Interestingly, we do not observe a performance drop in high-dimensional latent spaces, despite the presence of numerous correlated features. This robustness can be attributed to the effectiveness of isolation forests and tree-based algorithms in handling high-dimensional data by selectively identifying the most discriminative features for data partitioning.
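For each latent size in this sweep, the evaluation reduces to computing the AUROC over anomaly scores, treating the true anomalies as the positive class. A minimal sketch of that computation, using hypothetical score distributions (the Gaussians below are illustrative stand-ins, not our model's output):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical anomaly scores for held-out common and anomalous transients;
# a well-trained detector assigns higher scores to anomalies on average.
rng = np.random.default_rng(1)
common_scores = rng.normal(0.0, 1.0, 500)
anomaly_scores = rng.normal(2.0, 1.0, 50)

y_true = np.concatenate([np.zeros(500), np.ones(50)])  # 1 = true anomaly
y_score = np.concatenate([common_scores, anomaly_scores])
auroc = roc_auc_score(y_true, y_score)
assert 0.5 < auroc <= 1.0  # well above chance for separated populations
```

Repeating this over models trained with different latent sizes yields the curve of AUROC versus latent dimension summarized in Fig. 10.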
Based on Fig. 10, we select a latent size of 100 neurons for this work, as it yields a high anomaly detection AUROC score. However, this choice is somewhat arbitrary: any sufficiently large latent size beyond 50 dimensions appears to provide comparable performance.
Interestingly, while classifiers prove effective for anomaly detection, we do not find a significant correlation between classification accuracy and anomaly detection performance. This highlights a key interpretability challenge in both traditional neural networks and our approach, warranting further investigation.
5 CONCLUSION
The advent of large-scale wide-field surveys has revolutionized time-domain astronomy, unveiling an extraordinary diversity of astrophysical phenomena. Surveys such as the Zwicky Transient Facility and the upcoming Legacy Survey of Space and Time conducted by the Vera C. Rubin Observatory promise to discover transients in vast numbers, with LSST expected to generate millions of alerts each night. This deluge of data presents both a challenge and an opportunity: while the sheer volume of detections makes manual inspection infeasible, it also offers the potential to discover entirely new classes of transients that are rare or have remained hidden in previous surveys. To fully harness the potential of these wide-field surveys, automated methods for real-time anomaly detection are essential. By identifying and prioritizing the most unusual and interesting events amid the flood of data, these techniques can facilitate rapid follow-up and characterization, enabling us to deepen our understanding of the time-domain universe.
The most common approach to transient anomaly detection is to construct a feature extractor that maps light curves to a low-dimensional latent space, and then to apply clustering or outlier detection algorithms to identify anomalies within that space. Constructing a latent space that represents transients well for anomaly detection is difficult, and most previous approaches have relied on either user-defined features or unsupervised deep learning.
In this work, we have introduced a novel approach that leverages the latent space of a neural network classifier for identifying anomalous transients. Our pipeline, which combines a deep recurrent neural network classifier with our novel MCIF anomaly detection method, demonstrates promising performance on simulated data matched to the characteristics of the Zwicky Transient Facility.
The key advantages of our approach are:
The RNN classifier maps light curves into a low-dimensional latent space that naturally clusters similar transient classes together, providing an effective representation for anomaly detection. We repurposed the penultimate layer of this classifier as the feature space for anomaly detection.
Our novel MCIF method addresses the limitations of using a single isolation forest on the complex latent space by training separate isolation forests for each known transient class and taking the minimum score as the final anomaly score.
Our classifier input format eliminates the need for interpolation by incorporating time and passband information, enabling the model to learn inter-passband relationships and handle irregular sampling.
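The MCIF scoring rule summarized above (one isolation forest per known class, minimum score across classes) can be sketched as follows. This is a minimal illustration on a toy 2D latent space; the `n_estimators` value and class setup are assumptions for the example, not the exact configuration used in this work:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

class MultiClassIsolationForest:
    """Train one isolation forest per known class; score a sample by the
    minimum anomaly score across classes, so a sample is only flagged if
    it looks anomalous with respect to *every* known class."""

    def __init__(self, n_estimators=200, random_state=0):
        self.n_estimators = n_estimators
        self.random_state = random_state
        self.forests = {}

    def fit(self, latents, labels):
        for cls in np.unique(labels):
            forest = IsolationForest(n_estimators=self.n_estimators,
                                     random_state=self.random_state)
            forest.fit(latents[labels == cls])
            self.forests[cls] = forest
        return self

    def anomaly_score(self, latents):
        # score_samples is higher for normal points, so negate it to get
        # an anomaly score, then take the minimum over per-class forests.
        per_class = np.stack([-f.score_samples(latents)
                              for f in self.forests.values()])
        return per_class.min(axis=0)

# Toy usage: two well-separated "common" clusters in a 2D latent space.
rng = np.random.default_rng(42)
common = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])
labels = np.array([0] * 100 + [1] * 100)
mcif = MultiClassIsolationForest().fit(common, labels)
outlier = np.array([[4.0, 4.0]])  # far from both clusters
inlier = np.array([[0.0, 0.0]])   # centre of the first cluster
assert mcif.anomaly_score(outlier)[0] > mcif.anomaly_score(inlier)[0]
```

Taking the minimum is what lets MCIF respect a multi-cluster latent space: a point near any one class's cluster receives a low score, even if it is far from the global centroid of the training data.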
To mimic a real-world scenario, we evaluated our approach on a realistic simulated data set containing 12 040 common transients and 54 anomalous events. After following up MCIF’s top 2000 ranked transients, we accurately identified |$41 \pm 3$| out of the 54 true anomalies. That is, after following up the top 15 per cent highest ranked scores, we recovered 75 per cent of the true anomalies. CaRTs look very similar to common supernovae, and thus are difficult to identify. If we exclude CaRTs from our anomalous sample, our recovery of anomalies increases sharply to 87 per cent (|$47 \pm 2$| out of the 54 true anomalies) after following up the top 2000 (|$\sim 15~{{\rm per\ cent}}$|) highest scoring transients.
The learned latent space exhibits clear separation between common and anomalous transient classes, and our preliminary analysis suggests the potential for real-time anomaly detection using limited early-time observations. The pre-trigger information encoded by our RNN enables our model to identify anomalous transients at early stages in the light curve, and even by trigger, a significant separation between common and anomalous transients is captured. In particular, KN, PISN, and ILOT all stand out as anomalous shortly after the time of trigger.
Future work encompasses several promising directions. First, benchmarking our model against other similar approaches is important for a comprehensive performance assessment. Comparing models is currently difficult because a standard test data set of anomalies does not exist. Developing a realistic benchmark data set that encompasses a representative population of common transients and example anomalous transients would improve the quality of methods developed by the community and enable robust evaluation metrics. Moreover, a detailed comparative analysis of MCIF against previous class-by-class anomaly detection approaches should be carried out to better understand their relative strengths and limitations in this domain.
Secondly, integrating techniques from other anomaly detection methods, such as active learning (Lochner & Bassett 2021), could help us to distinguish new anomalies as interesting or not. Beyond direct anomaly detection, MCIF can be used to identify which known class an anomalous object most closely resembles based on the individual isolation forest scores. Additionally, we plan to apply the proposed architecture to real observational data, moving beyond simulations and testing the model’s effectiveness in a practical astronomical context.
A significant contribution of this work is the demonstration that a well-trained classifier can be effectively repurposed for anomaly detection by leveraging the clustering properties of its latent space. The flexibility of our approach allows any classifier to be adapted into an anomaly detector: for example, existing classifiers for spectra, images, or time series from other domains could be used as feature extractors to build effective anomaly detectors.
Another significant advantage of our approach is that the clustering properties of the latent space extend to unseen data, enabling few-shot classification of astronomical transients with limited labelled examples. This will be useful for the early observations from new surveys such as LSST. Furthermore, our input method lends itself well to transfer learning from one survey to another because it explicitly uses the passband wavelength. Future work should explore transfer learning from ZTF data to other surveys such as PanSTARRS or LSST simulations.
In conclusion, our novel approach to real-time anomaly detection in astronomical light curves, combining a deep neural network classifier with multiclass isolation forests, demonstrates the power of leveraging well-clustered latent space representations for identifying rare and unusual transients. As the era of large-scale astronomical surveys continues to produce unprecedented volumes of data, the development and refinement of such techniques will be crucial for making discoveries in time-domain astronomy.
ACKNOWLEDGEMENTS
We would like to thank the Cambridge Centre for International Research (CCIR) for fostering this collaboration. ML acknowledges support from the South African Radio Astronomy Observatory and the National Research Foundation (NRF) towards this research. Opinions expressed and conclusions arrived at, are those of the authors and are not necessarily to be attributed to the NRF.
This work made use of the python programming language and the following packages: numpy (Harris et al. 2020), matplotlib (Hunter 2007), seaborn (Waskom 2021), scikit-learn (Pedregosa et al. 2011), pandas (McKinney 2010), astropy (Astropy Collaboration 2022), umap-learn (Sainburg, McInnes & Gentner 2021), keras (Chollet 2015), and tensorflow (Abadi et al. 2015).
We acknowledge the use of the ilifu cloud computing facility – www.ilifu.ac.za, a partnership between the University of Cape Town, the University of the Western Cape, Stellenbosch University, Sol Plaatje University, and the Cape Peninsula University of Technology. The ilifu facility is supported by contributions from the Inter-University Institute for Data Intensive Astronomy (IDIA – a partnership between the University of Cape Town, the University of Pretoria, and the University of the Western Cape), the Computational Biology division at UCT and the Data Intensive Research Initiative of South Africa (DIRISA).
This work used Bridges-2 at Pittsburgh Supercomputing Center through allocation PHY240105 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program (Boerner et al. 2023), which is supported by U.S. National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.
DATA AVAILABILITY
The code used in this work is publicly available. The models used to create the simulations that generate the data used in this work were released in PLAsTiCC (Kessler et al. 2019) and are available at https://zenodo.org/record/2612896#.YYAz1NbMJhE. To generate light curves following ZTF observing properties, we use the SNANA software package (Kessler et al. 2009), developed for PLAsTiCC, with observing logs from the ZTF survey. A version of these simulations was first used in Muthukrishna et al. (2019) and has since been updated to resolve a known problem with core-collapse SNe. The data are available upon reasonable request to the corresponding author.
Footnotes
The public MSIP ZTF survey has since changed to a 2-d median cadence.
The expected relative frequencies of each class are taken from Kessler et al. (2019), developed for the PLAsTiCC data set.
We empirically find that there is little difference between an LSTM and GRU model, in both classification accuracy and anomaly detection.
We use the umap-learn implementation in python with the hyperparameters ‘minimum distance’ (min_dist) set to 0.5 and ‘number of neighbours’ (n_neighbors) set to 500.
This usage of the word recall refers to a different population distribution than the one defined in equation (8).
We can use MCIF on its own as this is a data set of features extracted from time-series, not the raw time-series.
APPENDIX A: ADVANTAGES OF MCIF
Before proposing the MCIF pipeline, we attempted to use a normal isolation forest to detect anomalies from the latent representation |$z_s$| of a light curve. We trained an isolation forest on all the common classes of our training data using 200 estimators. To account for the class imbalance in our training data, we weighted samples from underrepresented classes more heavily during the training of the isolation forest, using the same weighting scheme as in equation (3). The anomaly score function |$A(Z_s)$| was simply the negated anomaly score output from a single isolation forest trained on all the latent representations of the training data.
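A minimal sketch of this single-forest baseline, assuming the latent vectors and integer class labels are already available as arrays; the inverse-frequency weighting below is a simple stand-in for the scheme in equation (3):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def fit_weighted_iforest(latents, labels, n_estimators=200, random_state=0):
    """Single isolation forest over all common classes, with samples from
    under-represented classes up-weighted by inverse class frequency."""
    classes, counts = np.unique(labels, return_counts=True)
    freq = dict(zip(classes, counts))
    weights = np.array([len(labels) / freq[y] for y in labels])
    forest = IsolationForest(n_estimators=n_estimators,
                             random_state=random_state)
    forest.fit(latents, sample_weight=weights)
    return forest

# The anomaly score A(z_s) is the negated score_samples output:
rng = np.random.default_rng(0)
latents = rng.normal(0.0, 1.0, (300, 8))   # stand-in latent vectors
labels = np.array([0] * 250 + [1] * 50)    # imbalanced common classes
forest = fit_weighted_iforest(latents, labels)
far = -forest.score_samples(np.full((1, 8), 6.0))[0]  # far from training data
near = -forest.score_samples(np.zeros((1, 8)))[0]     # near the mode
assert far > near
```

Because a single forest models all common classes jointly, any common class that forms an isolated cluster in the latent space can itself receive a high anomaly score, which motivates the per-class forests of MCIF.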
As shown in Fig. A2, there is little distinction in the anomaly scores of most anomalous and common classes when using a single isolation forest. Surprisingly, the common classes SLSN-I and AGN are classified as relatively more anomalous than all the other classes. The distribution of anomaly scores in Fig. A1 reveals that although there is overall separation between common and anomalous classes, certain common classes are classified as very anomalous.
The UMAP reduction of the latent space, as depicted in Fig. 4, provides insight into this behaviour. The SLSN-I and AGN classes are significantly distant from the main cluster formed by other classes. This isolation from the central cluster may explain the high anomaly scores associated with these classes. This hypothesis is also supported by the near-perfect classification of these classes, shown in the confusion matrix and ROC curves in Fig. 3. In fact, the near-perfect classification hinted towards this poor result in anomaly detection, showing us that these transients are easy to separate from other classes, and hence are also easy to mark as anomalous. In summary, while an isolation forest is good at detecting anomalies, it struggles to capture the structure of a latent space with numerous clusters. This drawback of using a single isolation forest could explain why other works report high anomaly scores for SLSN-I and AGN (e.g. Villar et al. 2021). Using a class-by-class (or cluster-by-cluster) anomaly detector, such as MCIF, can mitigate this. Directly comparing Fig. 6 and Fig. A1 empirically demonstrates the advantages of MCIF.
Further analysis of MCIF’s performance on the comparative evaluation data set (Section 4.4) reveals that, contrary to the results shown in Fig. 6, a single isolation forest generally outperforms MCIF (Table 2). Investigating the UMAP representations of the latent space for classes exhibiting this discrepancy offers insight. When SNII is considered anomalous, the latent space (Fig. A3 [left]) lacks clear separation between SNIbc and SNIa, likely due to poor generalization caused by the limited number of SNIbc transients in the training set; this explains the single isolation forest’s superior performance. For the DSCT class (Fig. A3 [right]), by contrast, distinct visual clusters are present and MCIF achieves better results. These findings suggest that MCIF improves performance when the majority classes are well separated, a characteristic that appears inherent to the data set rather than to the classifier-based latent space: on most classes where a single isolation forest outperforms MCIF on the classifier’s latent space, it also does so on the raw data. Future research should explore how MCIF’s effectiveness depends on the separability of the raw data; the SNII case suggests a partial dependence on data quantity, as more data improves the DNN’s ability to generalize.

The distribution of anomaly scores for full light curves when using a single isolation forest for anomaly detection. The scores are derived from the unseen anomalous data and the common transient testing data. The bottom five classes are the anomalous classes. There is some separation between the anomaly scores of common and anomalous classes, but certain common classes are considered very anomalous (unlike when using MCIF as seen in Fig. 6).
The median anomaly score for each class computed for latent representations of transients obtained from full light curves when a single isolation forest is used for anomaly detection [bottom] and when MCIF is used [top] (identical to Fig. 5, reproduced here for convenience). The scores are derived from the unseen anomalous data and the common transient testing data. The five classes on the right (scores in bold) are anomalous. The common classes have somewhat lower median scores when using a single isolation forest, but the common classes SLSN-I and AGN (among others) are considered very anomalous, unlike when using MCIF.
The UMAP reduction of the training data in the latent space for classifiers trained with SNII [left] and DSCT [right] set aside as anomalous, using the data introduced in Perez-Carrasco et al. (2023) and used in Section 4.4. As the UMAP plots only the training data, it includes all the classes in the respective hierarchical category (see Table 2) except the one set aside as anomalous.
APPENDIX B: VISUAL COMPARISON TO OTHER APPROACHES
Fig. B1 is a visual representation of the results depicted in Table 2.

Visual representation of the comparative analysis depicted in Table 2. The AUROC is written for the top three models for each class.
APPENDIX C: THE KILONOVA DIP
As illustrated in Fig. 9, there is an unusual dip in the anomaly scores of KN around the trigger time. Further analysis reveals that most common classes also experience a similar dip at trigger, but they do not rebound to high anomaly scores afterwards. Examples are shown in Fig. C1.
The trigger of a light curve often corresponds to the first observation detected as part of the transient phase, and very few common classes can be classified effectively before trigger. Where effective pre-trigger classification does occur, it is likely due to host galaxy information (host redshift and Milky Way extinction) or to the periodic nature of certain transient events (e.g. AGNs), which are already midway through their evolution at trigger. Fig. C2 shows that classes with high pre-trigger classification accuracy (e.g. SLSN-I) have consistently declining anomaly scores before trigger. In contrast, classes with poor pre-trigger classification (e.g. SNIIn and SNIa) exhibit a slight increase in anomaly scores before trigger, followed by a sudden dip. This sudden dip resembles the behaviour of KN and coincides precisely with the sudden jump in classification performance. This suggests that our pipeline struggles to detect KN as anomalous before trigger for the same reasons it is unable to classify SNIIn before trigger. Although KN exhibits a slight upward trend before trigger, the new observation near trigger appears to carry far more weight (likely due to its high S/N).
For observations after trigger, we found that the anomaly score ‘resets’ to mark the true beginning of the transient phase. For example, in the case of KN, the high-S/N trigger observation signals the actual start of the transient and resets the anomaly score. Subsequent observations, characterized by a sudden decline back to the background level, quickly push KN into the anomalous category (as short-time-scale events are rare). However, in the case of poorly classified common classes, this reset is followed by a further decline in anomaly score as classification accuracy increases, making the dip appear normal. A similar effect can be seen in other anomalous classes, most notably in uLens-BSR transients. This effect is less pronounced because the first rise of uLens-BSR transients does not always coincide with trigger, leading to a distributed dip around trigger for uLens-BSR in Fig. 9 and a dip offset from trigger in Fig. C1.
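The ‘reset’ behaviour described above can be illustrated by re-scoring a light curve after every new observation. The sketch below is purely illustrative: `toy_latent` is a hypothetical stand-in for the classifier’s penultimate-layer embedding (crude summary statistics rather than a trained network), and a single isolation forest is fit on the latents of common transients.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def toy_latent(times, fluxes):
    """Hypothetical stand-in for the classifier's penultimate-layer
    embedding: summary statistics of the observations seen so far."""
    return np.array([fluxes.mean(), fluxes.std(),
                     fluxes.max(), times[-1] - times[0]])


def realtime_scores(times, fluxes, forest):
    """Re-encode and re-score the light curve after each new
    observation, mimicking the real-time anomaly-score curves:
    each new point can 'reset' the score by reshaping the latent."""
    return np.array([-forest.score_samples(
        toy_latent(times[:n], fluxes[:n])[None, :])[0]
        for n in range(2, len(times) + 1)])
```

With this setup, a light curve that declines rapidly after trigger moves its latent vector away from the common-transient locus observation by observation, so its anomaly score climbs after the dip rather than continuing to fall.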

Real-time anomaly scores for a sample KN, uLens-BSR, and SNIa. They all exhibit a significant dip near trigger, but the dip for the SNIa is followed by a further decline, whereas KN and uLens-BSR show a sharp increase after the dip. In the light curves in the top panels, the blue markers represent the g-band and the orange markers represent the r-band fluxes.
Real-time AUROC values for selected classes [left] and real-time anomaly scores for a different subset of classes than Fig. 9 [right]. Classes that are poorly classified pre-trigger (e.g. SNIIn and SNIa) exhibit a dip in anomaly score similar to KN, which coincides precisely with the sudden increase in classification accuracy.
APPENDIX D: EXAMPLE LIGHT CURVES
A sample light curve from each class is illustrated in Fig. D1.

Sample light curves from each transient class used in this work. We only plot transients with low signal-to-noise and low host redshift (|$z \lt 0.5$|) to help visually compare shapes. The dark circular markers represent the r band while the light triangular markers represent the g band. Flux errors are not plotted.