Rithwik Gupta, Daniel Muthukrishna, Michelle Lochner, A classifier-based approach to multiclass anomaly detection for astronomical transients, RAS Techniques and Instruments, Volume 4, 2025, rzae054, https://doi.org/10.1093/rasti/rzae054
ABSTRACT
Automating real-time anomaly detection is essential for identifying rare transients, with modern survey telescopes generating tens of thousands of alerts per night, and future telescopes, such as the Vera C. Rubin Observatory, projected to increase this number dramatically. Currently, most anomaly detection algorithms for astronomical transients rely either on hand-crafted features extracted from light curves or on features generated through unsupervised representation learning, coupled with standard anomaly detection algorithms. In this work, we introduce an alternative approach: using the penultimate layer of a neural network classifier as the latent space for anomaly detection. We then propose a novel method, Multi-Class Isolation Forests, which trains separate isolation forests for each class to derive an anomaly score for a light curve from its latent space representation. This approach significantly outperforms a standard isolation forest. We also use a simpler input method for real-time transient classifiers which circumvents the need for interpolation and helps the neural network handle irregular sampling and model inter-passband relationships. Our anomaly detection pipeline identifies rare classes including kilonovae, pair-instability supernovae, and intermediate luminosity transients shortly after trigger on simulated Zwicky Transient Facility light curves. Using a sample of our simulations matching the population of anomalies expected in nature (54 anomalies and 12 040 common transients), our method discovered |$41\pm 3$| anomalies (|$\sim 75~{{\rm per\ cent}}$| recall) after following up the top 2000 (|$\sim 15~{{\rm per\ cent}}$|) ranked transients. Our novel method shows that classifiers can be effectively repurposed for real-time anomaly detection.
1 INTRODUCTION
With the advancement of survey telescopes and the advent of large-scale transient surveys, we are entering a new paradigm for astronomical study. The Vera Rubin Observatory’s Legacy Survey of Space and Time (LSST) is expected to generate ten million transient alerts per night (Ivezić et al. 2019). The traditional approach of manual examination of astronomical data, which has led to some of the biggest discoveries in astronomy, is no longer feasible. As a result, there is a growing need to develop methods that can automate the serendipity that has so far played a pivotal role in scientific discovery. Furthermore, in the case of transients, it is also imperative that anomalies can be identified in real time so that new astrophysical phenomena can be discovered early and studied at each stage of their evolution. In particular, detailed early-time follow-up is necessary to understand the progenitor systems and explosion mechanisms of transients (e.g. Kasen 2010). Currently, the central engines of many rare transient classes, such as calcium-rich transients, kilonovae, and the newly discovered fast blue optical transients (FBOTs) (e.g. Coppejans et al. 2020) remain poorly understood. This, along with the considerable amount of human effort in the follow-up of GW170817 (e.g. Abbott et al. 2017), the first observed binary neutron star merger, has reinforced the need for automatic photometric identification of new astrophysical phenomena.
The announcement of LSST motivated the development of many real-time (e.g. Narayan et al. 2018; Muthukrishna et al. 2019; Möller & de Boissière 2020) and full light-curve classifiers (e.g. Lochner et al. 2016; Charnock & Moss 2017; Pasquet et al. 2019). An inherent quality of classifiers is the fundamental assumption that all observed objects belong to one of the predefined classes; however, this is not the case. New telescopes are finding interesting new types of astronomical objects by probing deeper, wider, and faster than ever before. For example, LSST will have an unprecedented point-source depth of |$r \sim 27.5$| (Ivezić et al. 2019), probing fainter than any other survey to date, and the Transiting Exoplanet Survey Satellite (TESS; Ricker et al. 2015) is exploring transients at a much shorter time-scale, from hours to minutes, using its wide field-of-view. Astronomers will need automatic tools to assist in identifying which potentially anomalous events to follow up.
In this regard, anomaly detection is most simply defined as identifying outlier samples. While this may be straightforward in low-dimensional spaces, it becomes considerably more challenging when dealing with multi-passband astronomical light curves, which typically feature a large and often variable number of inputs. Thus, most previous studies in anomaly detection attempt to find a lower dimensional latent space that is easier to cluster and identify anomalies. Previous works often use either user-defined feature extraction (e.g. Giles & Walkowicz 2019; Pruzhinskaya et al. 2019; Webb et al. 2020; Ishida et al. 2021; Malanchev et al. 2021; Perez-Carrasco et al. 2023) or deep learning (Solarz et al. 2017; Villar et al. 2021; Malanchev et al. 2021) to encode this latent space. In this work, we employ deep learning for feature extraction, which is quickly becoming the gold standard in the field. Deep learning is emerging as the preferred approach due to its ability to automatically learn complex, hierarchical features from raw data without the need for manual feature engineering. This is particularly advantageous in astronomy, where the underlying physical processes generating the data are often not fully understood, making it difficult to design comprehensive user-defined features. These data-driven models can also adapt to new data distributions and scale efficiently with increasing data volumes allowing for more generalizable models that can be applied across various astronomical objects and observational scenarios with minimal modification.
Throughout the literature on anomaly detection for astronomical transients, two different definitions of the problem are presented. Some approaches, categorized as unsupervised methods, focus on extracting anomalies from large data sets without relying on prior information (e.g. Giles & Walkowicz 2019; Webb et al. 2020; Villar et al. 2021). Numerous differing approaches exist for unsupervised anomaly detection. Villar et al. (2021) used an unsupervised recurrent variational autoencoder to find a representative latent space mapping of the light curves to then derive anomaly scores using an isolation forest. Webb et al. (2020) used user-defined feature extraction and then an isolation forest with active learning to find anomalies.
In contrast, our work, among others (e.g. Muthukrishna et al. 2022; Perez-Carrasco et al. 2023), utilizes previous transients, either simulated or real, to determine whether a new light curve is anomalous. This approach is often referred to as novelty detection or supervised anomaly detection. Previous novelty detection approaches (e.g. Soraisam et al. 2020; Muthukrishna et al. 2022) are often variations of one-class classification (Schölkopf et al. 1999). One-class classifiers attempt to model a set of normal samples and then classify new transients as either part of that sample or as outliers. One-class methods have been shown to be effective at anomaly detection (Ruff et al. 2018), but they do not capture the complexity of the population of known astronomical transients, which are grouped into numerous classes with intrinsically different qualities. It is challenging for an algorithm to collapse this diverse population of known transients into a single class and still identify anomalies. Perez-Carrasco et al. (2023) addressed this issue by extending the one-class classifier to multiple classes on features extracted from full multi-passband light-curve data. Their method adapts the single-class loss function to multiple classes by encouraging light curves of the same class to cluster together: the loss function of their encoder measures how tightly objects of the same class cluster in the latent space, and anomalies are extracted using the minimum distance to any cluster in the latent space.
The announcement of LSST has also made real-time anomaly detection more important (e.g. Feng et al. 2017; Bi, Feng & Yuan 2018; Soraisam et al. 2020; Villar et al. 2021; Muthukrishna et al. 2022). Villar et al. (2021) generalized their variational autoencoder to use a recurrent neural network, allowing real-time anomaly scores. Muthukrishna et al. (2022) used predictive modelling and derived the anomaly score as the deviation from real-time predictions. Soraisam et al. (2020) used magnitude changes in real time to assess the probability of a new transient being similar to the common transient sample.
In this work, we leverage a light-curve classifier to address the one-class challenge and distinguish between the various classes of transients. Our approach demonstrates promising clustering in the feature space, the penultimate layer of the classifier, and shows a substantial level of discrimination in anomaly scores. Notably, similar feature extraction methods have shown potential in the field of astronomical image analysis (e.g. Walmsley et al. 2022; Etsebeth et al. 2024).
Previous light-curve classification works exist in both the domain of real-time classification (e.g. Mahabal et al. 2008; Muthukrishna et al. 2019; Möller & de Boissière 2020) and full light-curve classification (e.g. Charnock & Moss 2017; Pasquet et al. 2019). Real-time classifiers predict a classification output at every new observation, while full light-curve classifiers retrospectively predict a classification output for a complete light curve. Most previous real-time approaches employ an interpolation technique that becomes the bottleneck for the model. Our real-time classifier, on the other hand, uses an input method novel in this domain, which avoids interpolation and helps the neural network learn the relationships between different passbands.
After identifying a feature space using one of the aforementioned methods, several prior works have employed an isolation forest (Liu, Ting & Zhou 2008) to generate anomaly scores. Isolation forests are one of the most popular anomaly detection architectures. They work by recursively partitioning data using random splits, with the idea that outliers are rare and different, thus requiring fewer partitions to isolate them. The algorithm assigns an anomaly score based on the average path length needed to isolate each object, with shorter paths indicating potential anomalies.
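As an illustration of this isolation principle, the following sketch (our own toy example using scikit-learn's `IsolationForest`, not code from the works cited) shows an obvious outlier receiving a lower `score_samples` value than points inside a dense cluster:

```python
# Toy example of isolation-forest anomaly scoring (illustrative data only).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(500, 2))   # one dense cluster of inliers
outlier = np.array([[8.0, 8.0]])               # a point far from the cluster

forest = IsolationForest(n_estimators=100, random_state=0).fit(normal)

# score_samples is higher for inliers; outliers need fewer random splits to
# isolate, giving shorter path lengths and therefore lower scores.
outlier_score = forest.score_samples(outlier)[0]
inlier_scores = forest.score_samples(normal)
```
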
While isolation forests have demonstrated success in previous research within the same domain (e.g. Pruzhinskaya et al. 2019; Ishida et al. 2021; Villar et al. 2021), they, too, face challenges when dealing with a complex latent space housing multiple clusters of intrinsically different transient classes. Consequently, a single isolation forest may be limited in its ability to accurately identify certain anomalies. Singh, Luo & Li (2023) recognized this problem in a general machine-learning context and introduced a pipeline in which an autoencoder is trained as an anomaly detector on observations from each class, treating all other observations as anomalous. The final anomaly score is the minimum score from all detectors. This method has shown promising results in comparison to other anomaly detection methods.
In response to this limitation and following a similar principle to Singh et al. (2023), we propose the use of Multi-Class Isolation Forests (MCIF): a method that involves training a separate isolation forest for each known class and extracting the minimum score among them as the final anomaly score for a given sample. Our experimental results suggest that MCIF can improve anomaly detection performance for certain types of optical transients.
The paper is organized as follows. In Section 2, we discuss the simulated data used and how it is preprocessed for real-time detection. In Section 3, we provide an overview of the proposed architecture, with Section 3.2 detailing the classifier and Section 3.3 introducing MCIF. Section 4 discusses the results of our approach, with Section 4.1 analysing the classifier and Section 4.3 evaluating the anomaly detection pipeline. We conclude in Section 5 by outlining avenues for future work. Finally, we show the advantages of MCIF over a standard isolation forest in Appendix A.
2 DATA
2.1 Simulated data
In this work, we use a collection of simulated light curves that match the observing properties of the Zwicky Transient Facility (ZTF; Bellm et al. 2018). This data set is described in section 2 of Muthukrishna et al. (2022) and is based on the simulations developed for PLAsTiCC (Kessler et al. 2019). Each transient in the data set has flux and flux error measurements in the |$g$| and |$r$| passbands with a median cadence of roughly 3 d in each passband.1 We briefly describe the 17 transient classes from Kessler et al. (2019) that are used in this work. Example light curves from each of these classes are illustrated in Appendix D, figs 1–3 of Kessler et al. (2019), and fig. 2 of Muthukrishna et al. (2019).
Type Ia Supernovae (SNIa) occur when a carbon–oxygen white dwarf accretes mass from a binary companion star. The white dwarf eventually reaches a critical mass that triggers an explosion.
Type Ia-91bg Supernovae (SNIa-91bg) tend to have fast-evolving light curves and often have lower luminosities when compared with typical SNIa.
Type Iax Supernovae (SNIax) tend to have lower luminosity and slower ejecta velocities than typical SNIa.
Type Ib and Ic Supernovae (SNIb and SNIc) are thought to be caused by the core collapse of highly dense stars. Both SNIb and SNIc have lost their hydrogen envelopes prior to collapse, but SNIc have also lost their helium envelopes. These SNe are characterized by the lack of hydrogen emissions in their spectra.
Type Ic-BL Supernovae (SNIc-BL) are SNIc with broad lines in their spectra, indicating much faster expansion velocities than typical SNIc.
Type II Supernovae (SNII) are also thought to be formed by the core collapse of highly dense stars. However, unlike SNIb and SNIc, SNII have hydrogen emission lines in their spectra.
Type IIb Supernovae (SNIIb) appear very similar to SNIb. They have rapidly fading hydrogen emission lines resulting in a very similar structure to SNIb.
Type IIn Supernovae (SNIIn) are SNII with very narrow hydrogen emission lines.
Type I Super Luminous Supernovae (SLSN-I) are poorly understood and very bright SNe events. SLSN-I lack hydrogen lines in their spectra.
Pair Instability Supernovae (PISN) are the result of the explosion of a massive star, much larger than a SNII or SNIb (130 to 250 solar masses).
Kilonovae (KN) are explosions resulting from the merging of two neutron stars or a neutron star and a black hole. Only one KN has been observed to date.
Active Galactic Nuclei (AGNs) are the very bright centres of galaxies where the supermassive black hole accretes material and emits significant radiation across the electromagnetic spectrum.
Tidal Disruption Events (TDE) occur when a star orbiting a black hole is pulled apart by the black hole’s tidal forces. The bright flare caused by this event can last up to a few years.
Intermediate Luminosity Optical Transients (ILOT) are a poorly understood class of transients. They occupy the luminosity gap between normal novae and supernovae.
Calcium Rich Transients (CaRT) are defined by their strong calcium emission lines. Their lower luminosity than SNIa and their rapidly evolving rise in luminosity after explosion make their light curves have a strong resemblance to core-collapse supernovae (SNIb and SNIc) and some SNIa-91bg.
|$\mu$|Lens-BSR (uLens-BSR) are a special case of microlensing events where a binary system in the foreground acts as the lens for a background star. The light curves from these events can exhibit asymmetries, multiple peaks, plateaus, and quasi-periodic behaviour.
Due to their low occurrence in nature,2 KN, ILOT, CaRT, PISN, and uLens-BSR are considered the anomalous classes in this work, and all remaining classes are considered the 'common' classes. However, note that the goal of this work is not to identify these specific anomalous classes, but rather to identify anomalies in general, as discussed further in Section 3.4.
2.2 Preprocessing
To preprocess our light curves, we first remove observations with a signal-to-noise ratio (S/N) below 1 (where the noise is simulated based on the observing properties of ZTF) as a rough threshold for observations that are not meaningful. We correct our light curves for Milky Way extinction, which depends solely on the position of the observation in the sky and hence is available in real time. We then define the trigger as the first measurement in a light curve that exceeds a |$5\sigma$| S/N. We remove all observations more than 70 d after the trigger or more than 30 d before the trigger, as these are likely not part of the transient phase of the light curve. Next, we correct all observation times for cosmological time dilation and scale the times in the range |$-30 \lt T - T_{\mathrm{trigger}} \lt 70$| d to values between 0 and 1. The scaled time is computed as follows,

|$t = \dfrac{(T - T_{\mathrm{trigger}})/(1+z) + 30}{100},$|

where |$z$| is the spectroscopic host redshift and |$T$| refers to the observer-frame time in Modified Julian Days (MJD). We found that directly incorporating time dilation through the spectroscopic redshift improved our results. However, we acknowledge that the host redshift may not always be available, and we discuss this limitation briefly in Section 3.2. We then compute the scaled flux (|$f$|) and flux error (|$\epsilon$|) values for each transient by dividing the measured flux and flux error by 500, a constant close to the mean flux in our data set that keeps the inputs to the neural network small without using any future observations. Because we aim to detect anomalies in real time, we cannot scale the light curves by the peak flux or by any other method that uses future observations.
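The preprocessing steps above can be sketched as follows. This is an illustrative reimplementation under our own assumptions, not the authors' released code: the extinction correction is omitted, we assume each light curve has at least one |$5\sigma$| detection, and we apply the trigger window in rest-frame days for simplicity.

```python
import numpy as np

def preprocess(times_mjd, flux, flux_err, z, flux_scale=500.0):
    """Illustrative sketch of the preprocessing described in the text
    (Milky Way extinction correction omitted)."""
    times_mjd, flux, flux_err = map(np.asarray, (times_mjd, flux, flux_err))

    keep = flux / flux_err >= 1.0                     # rough S/N >= 1 cut
    times_mjd, flux, flux_err = times_mjd[keep], flux[keep], flux_err[keep]

    t_trigger = times_mjd[flux / flux_err >= 5.0][0]  # first 5-sigma detection
    rest_days = (times_mjd - t_trigger) / (1.0 + z)   # time-dilation corrected
    window = (rest_days > -30.0) & (rest_days < 70.0)

    t_scaled = (rest_days[window] + 30.0) / 100.0     # maps -30..70 d to 0..1
    return t_scaled, flux[window] / flux_scale, flux_err[window] / flux_scale
```
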
We set aside 80 per cent of the data from the common transient classes for training, 10 per cent for validation, and 10 per cent for testing. We use 100 per cent of our anomalous data for testing and ensure that it is unseen by our model, as mentioned in Section 3.4. The number of transients in our data set from each class are listed in Table 1.
Number of transients in the training set, validation set, test set, and realistic samples (see Section 4.3.2) for each class. All anomalous data are reserved for evaluation.
Class | Training | Validation | Test | Total | Realistic sample|$^a$| |
---|---|---|---|---|---|
SNIa | 9314 | 1131 | 1142 | 11 587 | 1142 |
SNIa-91bg | 10 361 | 1318 | 1321 | 13 000 | 1318 |
SNIax | 10 413 | 1248 | 1339 | 13 000 | 1339 |
SNIb | 4197 | 507 | 563 | 5267 | 563 |
SNIc | 1279 | 169 | 135 | 1583 | 135 |
SNIc-BL | 1157 | 124 | 142 | 1423 | 142 |
SNII | 10 420 | 1279 | 1301 | 13 000 | 1301 |
SNIIn | 10 323 | 1359 | 1318 | 13 000 | 1318 |
SNIIb | 9882 | 1233 | 1208 | 12 323 | 1208 |
TDE | 9078 | 1162 | 1114 | 11 354 | 1114 |
SLSN-I | 10 285 | 1322 | 1273 | 12 880 | 1273 |
AGN | 8473 | 1046 | 1042 | 10 561 | 1042 |
CaRT | 0 | 0 | 10 353 | 10 353 | |$11 \pm 3$| |
KN | 0 | 0 | 11 166 | 11 166 | |$11 \pm 3$| |
PISN | 0 | 0 | 10 840 | 10 840 | |$11 \pm 3$| |
ILOT | 0 | 0 | 11 128 | 11 128 | |$10 \pm 3$| |
uLens-BSR | 0 | 0 | 11 244 | 11 244 | |$10 \pm 3$| |
|$^a$| The mean number of transients across the 50 test samples is shown. The errors refer to the standard deviation of the population size across the 50 sets. All common test data are part of every sample, hence errors are not shown.
3 METHODS
3.1 Overview and rationale
Fig. 1 summarizes our methodology. First, we train a recurrent neural network (RNN) to classify the common classes of transients. Then, we remove the final layer of the trained model and use the remaining architecture as an encoder. The penultimate layer of our classifier has 100 neurons, but we find that any reasonably large latent space size works well for anomaly detection (see Section 4.5). To effectively extract anomalies from a well-represented space, it is essential that transients from similar classes cluster together. In our encoder, the latent space is used directly for light-curve classification, which should naturally lead to clustering of similar transients.

Fig. 1. A visual summary of the architecture described in this work. Our approach first trains a classifier, then repurposes it as an encoder, and finally applies MCIF, proposed in this work, for anomaly detection.
Once we have established this representation space, we must extract anomalies from it. However, when dealing with multiple clusters, a single isolation forest may struggle to capture each cluster equally (for further details, refer to Appendix A). This challenge motivated our approach, MCIF, where we train an isolation forest for each class, representing a distinct cluster, and select the minimum anomaly score as the final score. This minimum score should come from the cluster to which the latent observation is closest, providing the desired functionality.
In this work, we chose a neural network-based architecture for anomaly detection. One advantage of a neural network over hand-selected features is that it is a data-driven model, which should make it more sensitive to out-of-distribution data. This inherent quality of neural networks makes them especially well suited to anomaly detection.
3.2 Classifier
We train a deep neural network (DNN) classifier that maps a matrix of multi-passband light-curve data |$\boldsymbol {X}_s$| for a transient |$s$| to a |$1 \times N_c$| vector of probabilities, reflecting the likelihood of the given light curve belonging to each of the aforementioned non-anomalous transient classes, where |$N_c$| is the number of classes.
The transient classifier utilizes an RNN with gated recurrent units (GRU; Cho et al. 2014) to handle the sequential time-series data. The input for each transient, |$\boldsymbol {X}_s$|, is a |$4 \times N_T$| matrix, where |$N_T$| is the maximum number of time-steps for any input sample. |$N_T$| is 656 in this work, but most transients have far fewer observations. Each row of the input matrix corresponds to one observation |$j$| and is composed of the following vector,

|$\boldsymbol {x}_{sj} = \left[ f_{sj}, \epsilon _{sj}, t_{sj}, \lambda _p \right],$|

where |$f_{sj}$| is the scaled flux for the |$j$|th observation of transient |$s$|, |$\epsilon _{sj}$| is the corresponding scaled uncertainty, |$t_{sj}$| is the scaled time at which the measurement was taken, and |$\lambda _p$| is the central wavelength of the passband from which the measurement comes. For the two passbands in ZTF, the central wavelengths used are {|$\lambda _g=0.4827 \, \mathrm{\mu m}$|, |$\lambda _r=0.6233 \, \mathrm{\mu m}$|}. We include the flux error as an input to the DNN to enable it to learn how to weigh individual flux measurements for more accurate classifications. To implement the variable-length input in tensorflow, we use a Masking layer and zero-pad the arrays to the length of the longest light curve (which is of size |$N_T$|).
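A minimal sketch of this input construction (our own illustration, not the authors' code) pads each light curve with zero rows so that a Masking layer can ignore them. We use the time-major |$N_T \times 4$| layout that Keras recurrent layers expect:

```python
import numpy as np

CENTRAL_WAVELENGTH = {"g": 0.4827, "r": 0.6233}  # micron, from the text

def build_input_matrix(f, err, t, bands, n_t=656):
    # Stack each observation as (f, err, t, lambda_p), then zero-pad to N_T
    # rows so a Keras Masking(mask_value=0.0) layer can skip padded steps.
    rows = np.array([[fi, ei, ti, CENTRAL_WAVELENGTH[b]]
                     for fi, ei, ti, b in zip(f, err, t, bands)])
    padded = np.zeros((n_t, 4))
    padded[: len(rows)] = rows
    return padded
```
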
This input format has two major advantages. First, it eliminates the need for interpolation methods to pass sequential data into our model. Interpolation or imputation is often needed in astronomical transient classifiers as observations are recorded at irregular intervals. For real-time light-curve tasks, linear interpolation is sometimes used (e.g. Muthukrishna et al. 2022) but yields poor results for sparse light curves or surveys with larger cadences, such as LSST. Linear interpolation may confuse models, particularly when applied to a light curve with prolonged intervals of missing data. In such cases, extended sequences of the interpolated light curve may represent only two original observations.
Gaussian process (GP) interpolation has shown success for light-curve data (e.g. Boone 2019; Villar et al. 2021) but does not help with real-time anomaly detection, as it requires the whole light curve to perform the interpolation. Villar et al. (2021) justify using GPs for real-time anomaly detection, stating that each new observation heavily anchors the GP and that the GP acts similarly to physical model priors. However, methods that avoid any reliance on future observations are more appropriate for real-time analysis and better suited for early anomaly detection.
The second advantage of our input method is the inclusion of wavelength information to represent different passbands. Previous works have typically passed the light curves from different passbands separately, resulting in even larger sequences of missing (thus interpolated or imputed) data in the common case where some passbands have significantly fewer observations than other passbands. In our work, the inclusion of the central wavelength helps the model learn the relationship between different passbands, infer parts of the light curve in bands where few observations are present, and allows for all data to be passed in one channel.
From ZTF, we only get data in the |$g$| and |$r$| passbands which makes the difference in passbands less significant. However, LSST and other large-scale transient surveys will have data in up to six different passbands, and giving the model insight into relationships between different passbands will be crucial. This learned relationship, along with some transfer learning (Iman, Arabnia & Rasheed 2023), may make it possible to consolidate data from multiple observatories with different passbands to train one model or allow for a model trained on data from one observatory to be quickly used on data from another observatory.
After the recurrent layers of the DNN, we pass some contextual information into the classifier, which has been shown to be helpful for light-curve classification (Foley & Mandel 2013). In this work, we use the Milky Way extinction and the host galaxy’s spectroscopic redshift as additional inputs to the network. However, we understand that spectroscopic redshift is not always available, and future work should include training a model with photometric redshift or without redshift entirely.
Fig. 2 illustrates the architecture of the neural network classifier. The classifier was implemented using keras and tensorflow (Abadi et al. 2015; Chollet 2015). We detail each layer of our classifier, and its activation function, as follows. The input stream that each layer is part of is shown in parentheses.
Input Layer 1 (Light Curve Stream) – Takes a matrix of shape |$4 \times N_T$| as input to the recurrent neural network.
Gated Recurrent Unit (Cho et al. 2014) (Light Curve Stream) – Two recurrent layers consisting of 100 gated recurrent units with tanh activation functions.
Dense (Light Curve Stream) – A dense layer consisting of 100 neurons with tanh activation functions.
Input Layer 2 (Contextual Stream) – Takes a vector of length 2 containing the Milky Way extinction and spectroscopic redshift.
Dense (Contextual Stream) – A dense layer consisting of 10 neurons with ReLU activation functions. This dense layer is connected to Input Layer 2.
Concatenate Layer – A layer to merge the 2 streams of input. This layer concatenates the final dense layers for each input stream into one layer with 110 neurons.
Dense – A layer with 100 neurons to act as the latent representation for the light curves with a ReLU activation function.
Dense – A layer with 12 neurons (1 for each common transient class). This layer has a softmax activation function to map the output values to a probability score.
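The layer list above can be sketched as a Keras functional model. The layer sizes and activations come from the text; the optimizer settings and other hyperparameters here are placeholders, not the authors' exact configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

N_T, N_CLASSES = 656, 12  # values stated in the text

# Light-curve stream: masked, variable-length sequences of (f, err, t, lambda).
lc_in = layers.Input(shape=(N_T, 4), name="light_curve")
x = layers.Masking(mask_value=0.0)(lc_in)          # skip zero-padded steps
x = layers.GRU(100, activation="tanh", return_sequences=True)(x)
x = layers.GRU(100, activation="tanh")(x)
x = layers.Dense(100, activation="tanh")(x)

# Contextual stream: Milky Way extinction and spectroscopic redshift.
ctx_in = layers.Input(shape=(2,), name="context")
c = layers.Dense(10, activation="relu")(ctx_in)

merged = layers.Concatenate()([x, c])              # 100 + 10 = 110 neurons
latent = layers.Dense(100, activation="relu", name="latent")(merged)
out = layers.Dense(N_CLASSES, activation="softmax")(latent)

model = Model([lc_in, ctx_in], out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```

The named `latent` layer is the penultimate layer that is later reused as the encoder output.
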

Fig. 2. A visualization of the neural network classifier used in this work. Our model has two input streams, one for real-time light-curve data and the other for contextual information. The light-curve data (first input stream) pass through multiple GRU layers and then a dense layer. The contextual information (second input stream) feeds through a dense layer. The final dense layers from both input streams are merged in a concatenate layer, which feeds a 100-neuron dense layer that serves as the latent space of the encoder. Finally, this dense layer feeds into the output layer, which provides classification scores.
In our final model architecture, we use GRU layers proposed and tested in Cho et al. (2014). They are shown to perform better than typical RNNs and have quicker training times than long short-term memory networks (LSTMs; Hochreiter & Schmidhuber 1997), another variant of RNNs3 (Chung et al. 2014).
To counteract imbalances in the distribution of classes in the data set, we use a weighted categorical cross-entropy (see equation 6 of Muthukrishna et al. 2019) as the loss function, with the weight |$w_{c}$| for each class |$c$| inversely proportional to the fraction of transients from that class in the training set,
|$w_{c} \propto \frac{T}{T_{c}},$|
where |$T_c$| is the number of transients from class |$c$| and |$T$| is the total number of samples in the training set. This weighting scheme ensures that classes with fewer samples have higher weights. To train the classifier, we ran it over 40 epochs using the Adam optimizer (Kingma & Ba 2017) with EarlyStopping implemented in keras.
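As a concrete sketch, inverse-frequency weights of the form |$T/T_c$| can be computed as below, so that rarer classes receive higher weights, consistent with the weighting scheme described above. The normalization (scaling weights to average one) is an illustrative choice, not taken from the paper.

```python
import numpy as np

def class_weights(counts):
    """Inverse-frequency class weights: w_c proportional to T / T_c.

    Normalized so the weights average to 1 (a balanced data set gives
    w_c = 1 for every class). The normalization is illustrative.
    """
    counts = np.asarray(counts, dtype=float)
    w = counts.sum() / counts          # w_c ∝ T / T_c
    return w * len(counts) / w.sum()   # rescale so weights average 1

# Rarer classes receive larger weights.
w = class_weights([900, 90, 10])
```

These weights would then be passed to the loss (e.g. as per-class weights in a weighted categorical cross-entropy).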
3.3 Multiclass isolation forests
Once the classifier is trained, we remove the last layer and use the remaining architecture to map any light curve to the latent space. We define this encoder as a function |$E(\boldsymbol {X}_s)$| that takes the aforementioned preprocessed light-curve data, |$\boldsymbol {X}_s$|, and maps it to a 100D latent space |$\boldsymbol {z}_s$|,
|$\boldsymbol {z}_s = E(\boldsymbol {X}_s).$|
For anomaly detection, we now want to compute the anomaly score,
|$a_s = A(\boldsymbol {z}_s),$|
where |$A(\boldsymbol {z}_s)$| is a function that evaluates the anomaly score |$a_s$| for a latent observation |$\boldsymbol {z}_s$|. The goal of this work is to generate relatively large anomaly scores for anomalous transients and smaller anomaly scores for non-anomalous transients.
Isolation forests are a simple yet effective anomaly detection algorithm, especially in the domain of astronomical time-series (e.g. Pruzhinskaya et al. 2019; Ishida et al. 2021; Villar et al. 2021). However, a single isolation forest performs poorly at recognizing some common classes as non-anomalous. This challenge arises from the complexity of our latent space, which contains several distinct clusters that a single isolation forest struggles to model. Thus, we propose a new framework, MCIF, in which a separate isolation forest is trained on the data from each common class, and the minimum anomaly score from any isolation forest is taken as the final anomaly score.
We define 12 isolation forests, |$I_c(\boldsymbol {z}_s)$|, each trained on latent space observations from one common transient class |$c$|. The final anomaly score is defined as
|$a_s = \min _{c} \left(-I_c(\boldsymbol {z}_s)\right),$|
The function |$I_c(\boldsymbol {z}_s)$| is positive for less anomalous transients and negative for anomalous ones, to be consistent with the sklearn implementation of isolation forests. We negate the scores as we prefer defining transients with higher anomaly scores to be more anomalous, but this makes no difference to the results. All isolation forests used in this work are trained with 200 estimators. The results of using a single isolation forest and the benefits of using multiclass isolation forest are explored further in Appendix A.
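A minimal sketch of MCIF using scikit-learn's IsolationForest; the wrapper class, the toy clusters, and the class labels are illustrative, not the paper's implementation. Each class gets its own 200-estimator forest, and because decision_function is positive for inliers and negative for outliers, negating the maximum per-class value is equivalent to taking the minimum per-forest anomaly score.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

class MultiClassIsolationForest:
    """Sketch of MCIF: one isolation forest per common class; the final
    score is the minimum per-forest anomaly score, negated so that larger
    values are more anomalous."""

    def __init__(self, n_estimators=200, random_state=0):
        self.n_estimators = n_estimators
        self.random_state = random_state
        self.forests = {}

    def fit(self, z, labels):
        # Train a separate isolation forest on each class's latent vectors.
        for c in np.unique(labels):
            forest = IsolationForest(n_estimators=self.n_estimators,
                                     random_state=self.random_state)
            self.forests[c] = forest.fit(z[labels == c])
        return self

    def score(self, z):
        # decision_function: positive for inliers, negative for outliers,
        # so a_s = min_c(-I_c(z_s)) = -max_c I_c(z_s).
        per_class = np.stack([f.decision_function(z)
                              for f in self.forests.values()])
        return -per_class.max(axis=0)

# Toy demo: two tight clusters stand in for two "common classes".
rng = np.random.default_rng(0)
z = np.vstack([rng.normal(0, 0.1, (100, 2)), rng.normal(5, 0.1, (100, 2))])
labels = np.array([0] * 100 + [1] * 100)
scores = MultiClassIsolationForest().fit(z, labels).score(
    np.array([[0.0, 0.0], [20.0, 20.0]]))
```

In this toy setup, the point far from both clusters receives a larger anomaly score than a point inside a cluster.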
3.4 Limitations of evaluating anomaly detection methods
Evaluating the performance of anomaly detection models is challenging because anomalies are rare, and it is difficult to build validation sets that account for the unknown. To simulate seeing anomalous data for the first time, the five aforementioned anomalous classes are not revealed to the model until final evaluation. This approach mimics the real-world situation where anomaly detectors (and astronomers) have limited knowledge of the anomalies they may discover.
The goal of this work is not to identify the specific anomalous classes mentioned, but rather to detect anomalies in general. Hence, using physical model priors or giving the model any information about anomalous classes beforehand would reveal too much about the specific rare transient classes used in this work. This could hinder the model’s performance on novel, real-world anomalies that may differ from the ones used in this study.
4 RESULTS AND ANALYSIS
In this section, we first evaluate the performance of our neural network classifier on distinguishing between common transient classes (Section 4.1). We then analyse how well our proposed anomaly detection pipeline, utilizing the classifier as an encoder (Section 4.2), is able to identify rare/anomalous transients (Section 4.3).
4.1 Classifier
In this section, we evaluate the performance of our classifier on the test set consisting of the 12 common transient classes.
The normalized confusion matrix in Fig. 3 [left] illustrates the classifier’s ability to accurately predict the correct transient class on the test data. Each cell indicates the fraction of transients from the true class that are classified into the predicted class. The high values along the diagonal, approaching 1.0, indicate strong performance. The misclassifications, indicated by the off-diagonal values, predominantly occur between subclasses of Type Ia supernovae (SNIa, SNIa-91bg, and SNIax) and between the core-collapse supernova types (SNIb, SNIc, SNII subtypes), which is expected given their observational similarities. These SNe have been shown to confuse previous models (see fig. 7 of Muthukrishna et al. 2019).
The normalized confusion matrix [left] and ROC curve [right] of the 12 common transient classes used for training given full light-curve data. Each cell in the confusion matrix signifies the fraction of transients from each True Class that was classified into the Predicted Class. The ROC curve illustrates the true positive rate against the false positive rate across various threshold probabilities for each class, with the Area Under ROC curve (AUROC) in parenthesis. The model’s evaluation is conducted on the test set consisting of 10 per cent of the data from the common classes.
The UMAP reduction of the latent space derived from the test set, which includes 10 per cent of the common transients reserved for testing the classifier [left] and randomly sampled anomalous transients from the unseen anomaly data set [right]. Despite not being trained on this data, the learned features still exhibit clear visual structure and anomalous transients from distinct clusters separate from the common classes. It is important to note that the UMAP reduction is used only for visualization purposes, and the actual anomaly detection (as seen in Fig. 5 and the remaining plots) is performed on the 100D latent space.

The median anomaly score (rounded to two decimal places) for each class extracted from the latent representations of full light curves. The scores come from the full, unseen anomalous data set for anomalous classes and the 10 per cent test data set for common classes. The five classes on the right (in bold) are anomalous. The separation between the scores of anomalous classes and common classes is evident, and the anomaly scores for the common classes are consistently low signifying they are not erroneously marked anomalous. For further analysis, Fig. 6 shows the full anomaly score distribution for each class.

The distribution of anomaly scores for each class, computed using MCIF on the latent representations derived from full light curves. The scores are plotted using 100 per cent of the anomalous data set (unseen during training) and the test data set of common classes. The anomalous classes (bottom five) generally show higher anomaly scores with positively skewed distributions. The common classes and CaRTs all have low anomaly scores on average.

The precision-recall curves for our anomaly detection pipeline for each anomalous class and a grey line indicating the average across all common classes. The precision and recall are plotted against the threshold anomaly score in the left sub-figures. Precision and recall are defined in equation (7) and equation (8), respectively, and calculated on a set comprising half of the transients from the tested anomalous class and half randomly sampled common transients (all coming from the test data that was unseen by the model). Promisingly, the Area Under the Precision-Recall Curve (AUCPR) for each anomalous class (except CaRTs) is very high. The AUCPR for the common classes is low indicating that they are not often predicted as anomalous by MCIF.
While the confusion matrix provides valuable insights into the accuracy of our model, it only considers the highest scoring predicted class and does not use the continuous probability scores that our classifier outputs for each possible class. The receiver operating characteristic (ROC) curve, shown in Fig. 3 [right], effectively uses these probabilities. It plots the true positive rate (the fraction of positive samples correctly identified as positive) against the false positive rate (the fraction of negative samples incorrectly identified as positive) for each class across a range of threshold probability values. This metric is particularly useful in a multiclass context, as it captures the model’s ability to assign low probabilities to several classes when it is uncertain. A key measure of performance, the Area Under the ROC Curve (AUROC), quantifies the overall ability of the classifier to discriminate between classes. In our study, high AUROC values, approaching 1 for all classes, underscore the robustness of our classifier.
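For illustration, the per-class one-vs-rest AUROC of the kind shown in Fig. 3 [right] can be computed with scikit-learn; the labels and softmax-like probabilities below are toy values, not the paper's data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy example: 3 classes, softmax-like probability scores per sample.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_prob = np.array([[0.8, 0.1, 0.1],
                   [0.6, 0.3, 0.1],
                   [0.1, 0.7, 0.2],
                   [0.2, 0.6, 0.2],
                   [0.1, 0.2, 0.7],
                   [0.3, 0.2, 0.5]])

# One-vs-rest AUROC per class: treat class c as positive, the rest as
# negative, and use the class-c probability as the ranking score.
per_class_auc = [roc_auc_score((y_true == c).astype(int), y_prob[:, c])
                 for c in range(3)]
```

Here each class is perfectly separated by its own probability column, so every per-class AUROC is 1.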
A brief comparison of the results in this work and those of a similar DNN light-curve classifier (RAPID; Muthukrishna et al. 2019) hints that our classifier can perform on par with a reasonable baseline and has better accuracies for many classes. This improvement can be attributed, in part, to the use of an improved simulated data set that has fixed some problems with the simulation of several events, including a lack of diversity of core-collapse supernovae (see Vincenzi et al. 2021, for details on the problems with the original PLAsTiCC simulations). RAPID also included some of the rare classes that we deliberately did not include in our classifier and instead designated as anomalous (see fig. 7 of Muthukrishna et al. 2019). While our classification methodology was very similar to RAPID, a key difference was our input method, which bypasses the missing-data problem. Future work should compare the effectiveness of this input method with traditional interpolation or imputation methods; such a comparison is beyond the scope of this work.
4.2 Latent representation
After repurposing the classifier as an encoder, we obtain a 100D latent space. We can visualize this latent space with UMAP (McInnes, Healy & Melville 2020), a manifold embedding technique, to determine if there is visible clustering.4 In Fig. 4 [left], we plot the UMAP representations of the test data. While it is difficult to examine some of the overlapping classes in this embedded space, there is clear clustering of many of the classes. In Fig. 4 [right], we colour all of the common classes grey and include a sample of transients from the anomalous classes. We see that the anomalous classes cluster together in the embedded space and separate from the common transients despite the model not being trained on these objects. This level of clustering suggests that our encoder may be discovering generalizable patterns within light curves, and this property may have potential use cases beyond anomaly detection in few-shot classification. It is important to note that we only use UMAP for visualization purposes and that the latent space used for anomaly detection is obtained directly from the penultimate layer of the classifier.
4.3 Anomaly detection
After training MCIF on the latent representations of the training data, we pass unseen test data and anomalous data through our pipeline for evaluation. In Fig. 5, we list the median anomaly score predicted by MCIF for each class. The anomalous classes have much higher median anomaly scores than the common classes, illustrating a significant distinction between the scores of all common and most anomalous classes. This difference is not as pronounced when using a single isolation forest and the advantages of employing MCIF are discussed further in Appendix A.
In Fig. 6, we plot the distribution of anomaly scores predicted by MCIF for each class. The plot further demonstrates the distinction in anomaly scores of common and anomalous transients. Notably, there is a significant skew towards larger anomaly scores for the anomalous classes, reaffirming our model’s performance. However, calcium rich transients (CaRTs), despite being one of our anomalous classes, tend to have lower anomaly scores. CaRTs are notoriously difficult to photometrically classify as anomalous due to their resemblance to other common supernova classes (see fig. 8 of Muthukrishna et al. 2019 for example). One of the most effective ways to detect CaRTs is to observe a calcium line in their spectra, and a robust anomaly detector would use photometric differences to discern this spectroscopic dissimilarity. However, ZTF is limited to only the |$g$| and |$r$| passbands, which lets this subtle spectroscopic difference go unnoticed. The upcoming Legacy Survey of Space and Time on the Rubin Observatory will observe data in six passbands and will likely mitigate this issue.

Anomalies detected in the 2000 top-ranked transients by MCIF anomaly score index, using a test sample reflecting the estimated frequency of anomalies in nature. In the sample of 12 040 common transients and 54 anomalous transients, the model recalls |$41\pm 3$| |$(\sim 75~{{\rm per\ cent}})$| of the anomalies after following up the top 2000 ranked transients. The left plot aggregates all anomalies and the right plot delineates per class. To control for the variance imposed by the small anomaly sample size, we repeat the sampling 50 times. The mean and standard deviation of detected anomalies are plotted as the solid lines and shaded regions, respectively.
4.3.1 Anomaly precision and recall
To identify anomalies with MCIF, a threshold anomaly score would need to be chosen such that the common transient classes are not flagged as anomalous, while only the anomalous classes are flagged as anomalous. This threshold score needs to lead to a high precision and a high recall of anomalies. Precision is a measure of how pure our anomaly predictions are, and recall is a measure of how many anomalies we can expect to find. We define anomaly precision and recall as
|$P_{c, \tau } = \frac{\sum _{L^s \in L^c} \mathbb {1}(a_s \ge \tau )}{\sum _{L^s \in L} \mathbb {1}(a_s \ge \tau )}, \qquad R_{c, \tau } = \frac{\sum _{L^s \in L^c} \mathbb {1}(a_s \ge \tau )}{|L^c|},$|
where |$P_{c, \tau }$| and |$R_{c, \tau }$| are the precision and recall of class |$c$|, the tested class, at threshold anomaly score |$\tau$|, |$L$| is the set of all transients, |$L^c$| is the set of all transients from class |$c$| in |$L$|, and |$L^s$| is a transient from either |$L^c$| or |$L$|. We further define the set |$L$| to be comprised of half class |$c$| and half of transients coming from the opposite type as |$c$| when computing the precision and recall for class |$c$|. For example, if the tested class was KN, the set |$L$| would contain half KN and half common transients. Note that only precision (and not recall) is influenced by the composition of |$L$|.
In other words, precision is calculated as the number of predicted anomalies from each class divided by the total number of predicted anomalies, and recall is calculated as the number of predicted anomalies from each class divided by the total number of transients from that class. Both are defined given a threshold anomaly score and over a deliberately defined sample (as described above).
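A small numpy sketch of these definitions, assuming a transient is flagged as anomalous when |$a_s \ge \tau$|; the function name and sample values are illustrative.

```python
import numpy as np

def anomaly_precision_recall(scores, is_class_c, tau):
    """Precision and recall for the tested class c at threshold tau.

    Precision: flagged class-c transients / all flagged transients.
    Recall: flagged class-c transients / all class-c transients.
    (Precision defaults to 1.0 when nothing is flagged, a convention
    chosen here for the sketch.)
    """
    scores = np.asarray(scores)
    is_class_c = np.asarray(is_class_c, dtype=bool)
    flagged = scores >= tau
    hits = (flagged & is_class_c).sum()
    precision = hits / flagged.sum() if flagged.sum() else 1.0
    recall = hits / is_class_c.sum()
    return precision, recall

# 50-50 sample: two class-c transients (high scores), two common ones.
p, r = anomaly_precision_recall([0.9, 0.8, 0.2, 0.1],
                                [True, True, False, False], tau=0.5)
```

At the lowest threshold (tau = 0 here) this sample reproduces the behaviour described below: precision is 0.5 and recall is 1.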
In Fig. 7 we plot the anomaly precision and recall at various threshold anomaly scores |$\tau$| for all anomalous classes and an average over all common classes. In this context, the precision is 0.5 at the lowest threshold because all transients are marked as anomalous and 50 per cent of them belong to the tested class. Recall is 1 at the lowest threshold, as all transients of the tested class are flagged as anomalous. This behaviour is the rationale for the 50–50 composition of the set |$L$|: it standardizes the baseline precision, and hence the AUC scores, across anomalous and common classes.
The Area Under the Precision-Recall Curve (AUCPR) is a good indicator of how often a class is marked anomalous. The AUCPR scores for the anomalous classes, even CaRTs, are significantly better than those for the common classes. However, the precision of the common classes jumps suddenly at high thresholds while it declines for the anomalous classes, signifying that the transients with the very highest anomaly scores are false positives. This is understandable given that most anomalies populate anomaly scores |$a_{s} \lesssim 0.1$| in Fig. 6: at thresholds |$\tau \gtrsim 0.1$|, the recall drops because these anomalous transients no longer meet the threshold, and since common transients are far more numerous, a few of their extreme latent-space representations dominate the precision at the highest thresholds. Fig. 8 confirms that the top 5–10 candidates are common transients, but beyond this, as |$\tau$| decreases, our model captures a much higher fraction of true anomalies.
Selecting a threshold anomaly score near the upper-right region of the precision-recall curve will be a good choice for identifying as many anomalies as possible while still having a pure sample with few false positives. This point represents a trade-off between high precision and high recall, often occurring where the curve begins to plateau or shows a notable change in slope.
4.3.2 Detection rates in a representative population
The previous results do not take into account that anomalous transients are inherently less frequent than common transients. While the frequency of anomalies in nature is not known, a good estimate for the expected population frequency was presented in Kessler et al. (2019) for the PLAsTiCC data set (The PLAsTiCC team 2018). The rate of common transients, as defined in this work, was roughly 220 times larger than that of anomalous transients, using PLAsTiCC frequencies for each class. We used this rate to randomly select a more realistic test data set that contained 12 040 normal transients and 54 anomalies. Randomly selecting a representative sample of only 54 anomalies is subject to significant variance. Therefore, we created 50 sample data sets to perform 50-fold cross-validation. The mean and standard deviation of the number of transients from each class present in our 50 test samples are listed in Table 1.
For each validation set, we ranked the transients by the anomaly scores predicted by MCIF. We followed up the top 2000 ranked transients (roughly 15 per cent of the data set) as the candidate pool. Across 50 repeated trials, we identified |$41\pm 3$| out of the 54 true anomalies in our data set (recalling |$\sim 75~{{\rm per\ cent}}$| of the anomalies). In Fig. 8, we plot the fraction of anomalies recalled5 and the total number of anomalies recovered for thresholds up to the top 2000 transients. MCIF recalls most anomalies within candidates having the highest anomaly scores, followed by a tapering as fewer anomalies remain.
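The follow-up experiment above amounts to computing recall at a fixed candidate budget. A minimal sketch (with toy scores, not the paper's data):

```python
import numpy as np

def recall_at_k(scores, is_anomaly, k):
    """Fraction of true anomalies among the k highest-scoring transients."""
    scores = np.asarray(scores)
    is_anomaly = np.asarray(is_anomaly, dtype=bool)
    top_k = np.argsort(scores)[::-1][:k]  # indices of the k largest scores
    return is_anomaly[top_k].sum() / is_anomaly.sum()

# Toy ranking: 3 anomalies, two of which land in the top 3 by score.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
is_anomaly = np.array([True, False, True, False, False, True])
frac = recall_at_k(scores, is_anomaly, k=3)
```

In the paper's setting, k = 2000 and the calculation is repeated over the 50 sampled test sets to obtain the quoted mean and standard deviation.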
Examining the detection rate for each anomaly class, we see that the model’s trouble in identifying CaRTs as anomalous brings down the overall anomaly recall. This plot and the precision-recall curves show a consistent pseudo-hierarchy of which anomalies are easiest to detect. If we exclude CaRTs from our sample of anomalies, our recall of anomalies increases to |$47\pm 2$| out of the 54 true anomalies in our data set (recalling |$\sim 87~{{\rm per\ cent}}$| of the anomalies).
It is worth emphasizing that the objective of this work is to identify anomalies in a general sense rather than tailoring to specific classes, and therefore, this work does not rely on specific information about the anomalous classes defined (see Section 3.4). The only specific attribute used is the estimated frequency of anomalies (220 times less frequent than the common classes), which serves as a reference as it is impossible to estimate a similar number for anomalies that have never been observed.
Given the complexity of our deep learning approach, it is important to examine how the sparsity of light-curve sampling affects anomaly detection performance. Sparsely sampled light curves from common classes could potentially be assigned large anomaly scores if the model struggles to accurately represent them in the latent space. To investigate this, we analysed the relationship between the number of observations in a light curve and its likelihood of being classified as anomalous. Our analysis revealed that while anomaly detection performance generally improves with more observations, sparsely sampled light curves are not disproportionately classified as false positives. This resilience to sparse sampling may be attributed to our RNN-based architecture and input method, which are designed to handle irregular time-series data. The ability of our model to maintain performance even with limited observations is particularly valuable for early detection of anomalies in ongoing surveys.
4.3.3 Real-time detection
Identifying anomalies in real-time is important for obtaining early-time follow-up observations, which is crucial for understanding their physical mechanisms and progenitor systems. However, directly assessing our architecture’s real-time performance is challenging due to the irregular sampling of light curves in our input format.
To assess the real-time performance of our architecture, we plot the median anomaly scores over time for a sample of 2000 common and 2000 anomalous transients in Fig. 9. To construct this plot without relying on interpolation, we calculate scores at discrete times |$l$| sampled at 1-d intervals from |$-30$| to 70 d relative to trigger, using only observations occurring before each time |$l$| to mimic a real-time scenario. To ensure sufficient information for robust scoring, we only consider transients where the final observation was recorded after time |$l-5$|. The results show a clear divergence where common transient scores tend to decline around trigger, while anomalous transient scores remain consistently high.
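A sketch of the real-time cut described above, assuming observation times are measured in days relative to trigger; the function and variable names are illustrative.

```python
import numpy as np

def observations_before(times, fluxes, l, window=5):
    """Real-time cut for scoring a transient at time l: keep only
    observations taken before l, and skip transients whose final
    observation was recorded at or before l - window (too stale)."""
    times = np.asarray(times)
    if times.max() <= l - window:
        return None  # final observation too old: do not score at time l
    mask = times < l
    return np.asarray(fluxes)[mask]

# Toy light curve: observations at -10, 0, 10, and 30 d relative to trigger.
kept = observations_before([-10, 0, 10, 30], [1, 2, 3, 4], l=15)
```

Sweeping l from -30 to 70 d in 1-d steps and re-scoring the retained observations at each step reproduces the real-time evaluation without any interpolation.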

Median MCIF anomaly score over time for a sample of transients from the test set. Real-time anomaly scores are calculated at intervals of 1 d for a sample of 2000 common and 2000 total anomalous light curves. The left plot shows the scores for the common and anomalous transients as a whole, while the right plot shows each anomalous class individually. The anomaly scores for the common transients decline before the trigger, while the anomalous transients remain at high scores throughout most of the transient’s evolution.
Fig. 9 reveals two notable irregularities. First, the anomaly scores for common transients decline before trigger, which is unexpected given that the pre-trigger phase of most transient classes should primarily consist of background noise. Further analysis of the pre-trigger classification results (Fig. C2) reveals that certain transients, most notably SLSN-I and AGN, are almost all classified before trigger, thereby lowering the average anomaly score for common transients. This can be attributed to the fact that redshift and pre-trigger information such as host galaxy colour and some AGN pre-trigger variability are particularly useful for classifying these transients before trigger (see fig. 16 of Muthukrishna et al. 2019).

Anomaly detection performance (AUROC) of models trained with different latent space sizes. A significant improvement is observed when increasing the latent size up to 50 dimensions, with performance plateauing thereafter.
Secondly, KN exhibit a significant dip around the time of trigger. Upon further analysis, we found that certain common transient classes also experienced a similar dip around trigger; however, unlike KN, they do not rebound back to higher anomaly scores. This dip is related to the inherent nature of the trigger of a light curve, which often marks the first real observation of the transient phase of a light curve, and serves as a reset for the anomaly score. A more detailed analysis of this phenomenon is provided in Appendix C.
These preliminary findings suggest the potential for enabling real-time identification of anomalous transients. While some known rare classes can be difficult to distinguish from the common classes without a significant amount of data, others can be detected surprisingly soon after trigger. The ability to flag unusual events early in their evolution could prove invaluable for optimizing the allocation of follow-up resources and maximizing the scientific returns from rare transient discoveries.
4.4 Comparison against other approaches
In the field of anomaly detection in time-domain astronomy, there is no comprehensive baseline on which to evaluate different detection methods. This is largely because of the vastly differing definitions of what anomaly detection is, for example, the difference between unsupervised and novelty detection methods as described in Section 1. Baselining all existing anomaly detection methods is a much-needed line of future work, especially as there is no consensus on which method will work best on the deluge of data that will be available when LSST is running.
Despite these challenges, Perez-Carrasco et al. (2023) evaluated five different approaches to anomaly detection (see Table 2 for all benchmarked approaches), and we use their data set (which was inspired by Sánchez-Sáez et al. 2021) to benchmark our classifier-based approach. In contrast to our data set of raw light-curve data, this data set consists of tabular features extracted from light curves. We evaluate three new techniques for anomaly detection on this data set: using a classifier with MCIF, a classifier with just a single isolation forest, and MCIF on its own.6 The data set is split into three hierarchical categories with 4–5 transient classes each. Evaluation is performed separately for each class, each time counting that transient class as anomalous and the rest of its hierarchical category as common. Full evaluation is performed across five folds of testing data for cross-validation.
Performance of each model when applied to the data set used in Perez-Carrasco et al. (2023). Each row represents a different anomaly detection algorithm and each column represents a different class being chosen as the anomalous class. The performance is evaluated using the AUROC score of detected anomalies. The top three metrics per class are marked in bold. The AUROC scores for the first five methods are taken directly from and are reported in Perez-Carrasco et al. (2023). A visual representation of this table is shown in Fig. B1.
Columns are grouped by category: SLSN–SNIbc are Transient, AGN–YSO are Stochastic, and CEP–LPV are Periodic classes.

| Method | SLSN | SNII | SNIa | SNIbc | AGN | Blazar | CV/Nova | QSO | YSO | CEP | DSCT | E | RRL | LPV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IForest (Liu et al. 2008) | 0.640 ± 0.014 | 0.721 ± 0.021 | 0.428 ± 0.032 | 0.490 ± 0.038 | 0.573 ± 0.017 | 0.710 ± 0.009 | **0.975 ± 0.001** | 0.468 ± 0.016 | **0.913 ± 0.003** | 0.359 ± 0.007 | 0.295 ± 0.012 | 0.469 ± 0.021 | 0.549 ± 0.033 | **0.971 ± 0.007** |
| OCSVM (Schölkopf et al. 1999) | 0.577 ± 0.014 | 0.587 ± 0.014 | 0.434 ± 0.021 | 0.492 ± 0.011 | 0.532 ± 0.008 | 0.443 ± 0.002 | 0.909 ± 0.001 | **0.517 ± 0.005** | 0.792 ± 0.005 | 0.432 ± 0.004 | **0.557 ± 0.005** | 0.555 ± 0.003 | 0.539 ± 0.004 | 0.943 ± 0.001 |
| AE (Rumelhart & McClelland 1986) | **0.736 ± 0.022** | **0.807 ± 0.021** | 0.438 ± 0.015 | 0.537 ± 0.019 | **0.701 ± 0.010** | **0.762 ± 0.006** | **0.980 ± 0.016** | 0.443 ± 0.004 | **0.990 ± 0.001** | 0.564 ± 0.024 | 0.367 ± 0.015 | **0.864 ± 0.009** | **0.907 ± 0.015** | **0.996 ± 0.000** |
| VAE (Kingma & Welling 2014) | 0.669 ± 0.015 | 0.690 ± 0.023 | 0.404 ± 0.018 | 0.522 ± 0.025 | 0.596 ± 0.007 | 0.597 ± 0.010 | 0.849 ± 0.028 | **0.500 ± 0.009** | 0.795 ± 0.009 | 0.442 ± 0.010 | 0.417 ± 0.007 | 0.561 ± 0.007 | 0.451 ± 0.006 | 0.936 ± 0.007 |
| Deep SVDD (Ruff et al. 2018) | 0.644 ± 0.043 | 0.731 ± 0.043 | 0.475 ± 0.040 | 0.507 ± 0.040 | 0.496 ± 0.025 | 0.607 ± 0.044 | 0.932 ± 0.015 | 0.411 ± 0.008 | 0.901 ± 0.022 | 0.707 ± 0.027 | 0.482 ± 0.054 | 0.636 ± 0.055 | 0.774 ± 0.068 | 0.785 ± 0.025 |
| MCDSVDD (Perez-Carrasco et al. 2023) | **0.686 ± 0.051** | **0.828 ± 0.024** | **0.624 ± 0.039** | **0.584 ± 0.032** | **0.706 ± 0.069** | 0.512 ± 0.113 | 0.770 ± 0.127 | 0.483 ± 0.080 | 0.854 ± 0.041 | **0.858 ± 0.025** | **0.819 ± 0.015** | **0.945 ± 0.006** | **0.953 ± 0.003** | 0.953 ± 0.008 |
| Classifier + IForest (This work) | **0.757 ± 0.047** | **0.811 ± 0.017** | **0.619 ± 0.073** | 0.556 ± 0.039 | **0.715 ± 0.028** | **0.720 ± 0.032** | 0.945 ± 0.015 | 0.456 ± 0.041 | **0.977 ± 0.003** | **0.766 ± 0.066** | 0.504 ± 0.111 | **0.811 ± 0.038** | **0.907 ± 0.026** | **0.969 ± 0.016** |
| Classifier + MCIF (This work) | 0.567 ± 0.091 | 0.699 ± 0.046 | **0.536 ± 0.061** | **0.560 ± 0.034** | 0.615 ± 0.048 | 0.701 ± 0.045 | 0.882 ± 0.050 | **0.605 ± 0.051** | 0.893 ± 0.025 | **0.875 ± 0.036** | **0.742 ± 0.044** | 0.773 ± 0.031 | 0.808 ± 0.046 | 0.779 ± 0.107 |
| MCIF (This work) | 0.503 ± 0.018 | 0.668 ± 0.008 | 0.532 ± 0.007 | **0.643 ± 0.005** | 0.614 ± 0.02 | **0.745 ± 0.008** | **0.966 ± 0.003** | 0.446 ± 0.007 | 0.907 ± 0.007 | 0.514 ± 0.013 | 0.433 ± 0.009 | 0.476 ± 0.021 | 0.447 ± 0.011 | 0.959 ± 0.004 |
. | Transient . | Stochastic . | Periodic . | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Method . | SLSN . | SNII . | SNIa . | SNIbc . | AGN . | Blazar . | CV/Nova . | QSO . | YSO . | CEP . | DSCT . | E . | RRL . | LPV . |
IForest | 0.640 | 0.721 | 0.428 | 0.490 | 0.573 | 0.710 | |$\mathbf {0.975}$| | 0.468 | |$\mathbf {0.913}$| | 0.359 | 0.295 | 0.469 | 0.549 | |$\mathbf {0.971}$| |
(Liu et al. 2008) | |$\pm 0.014$| | |$\pm 0.021$| | |$\pm 0.032$| | |$\pm 0.038$| | |$\pm 0.017$| | |$\pm 0.009$| | |$\mathbf {\pm 0.001}$| | |$\pm 0.016$| | |$\mathbf {\pm 0.003}$| | |$\pm 0.007$| | |$\pm 0.012$| | |$\pm 0.021$| | |$\pm 0.033$| | |$\mathbf {\pm 0.007}$| |
OCSVM | 0.577 | 0.587 | 0.434 | 0.492 | 0.532 | 0.443 | 0.909 | |$\mathbf {0.517}$| | 0.792 | 0.432 | |$\mathbf {0.557}$| | 0.555 | 0.539 | 0.943 |
(Schölkopf et al. 1999) | |$\pm 0.014$| | |$\pm 0.014$| | |$\pm 0.021$| | |$\pm 0.011$| | |$\pm 0.008$| | |$\pm 0.002$| | |$\pm 0.001$| | |$\mathbf {\pm 0.005}$| | |$\pm 0.005$| | |$\pm 0.004$| | |$\mathbf {\pm 0.005}$| | |$\pm 0.003$| | |$\pm 0.004$| | |$\pm 0.001$| |
AE | |$\mathbf {0.736}$| | |$\mathbf {0.807}$| | 0.438 | 0.537 | |$\mathbf {0.701}$| | |$\mathbf {0.762}$| | |$\mathbf {0.980}$| | 0.443 | |$\mathbf {0.990}$| | 0.564 | 0.367 | |$\mathbf {0.864}$| | |$\mathbf {0.907}$| | |$\mathbf {0.996}$| |
(Rumelhart & McClelland 1986) | |$\mathbf {\pm 0.022}$| | |$\mathbf {\pm 0.021}$| | |$\pm 0.015$| | |$\pm 0.019$| | |$\mathbf {\pm 0.010}$| | |$\mathbf {\pm 0.006}$| | |$\mathbf {\pm 0.016}$| | |$\pm 0.004$| | |$\mathbf {\pm 0.001}$| | |$\pm 0.024$| | |$\pm 0.015$| | |$\mathbf {\pm 0.009}$| | |$\mathbf {\pm 0.015}$| | |$\mathbf {\pm 0.000}$| |
VAE | 0.669 | 0.690 | 0.404 | 0.522 | 0.596 | 0.597 | 0.849 | |$\mathbf {0.500}$| | 0.795 | 0.442 | 0.417 | 0.561 | 0.451 | 0.936 |
(Kingma & Welling 2014) | |$\pm 0.015$| | |$\pm 0.023$| | |$\pm 0.018$| | |$\pm 0.025$| | |$\pm 0.007$| | |$\pm 0.010$| | |$\pm 0.028$| | |$\mathbf {\pm 0.009}$| | |$\pm 0.009$| | |$\pm 0.010$| | |$\pm 0.007$| | |$\pm 0.007$| | |$\pm 0.006$| | |$\pm 0.007$| |
Deep SVDD | 0.644 | 0.731 | 0.475 | 0.507 | 0.496 | 0.607 | 0.932 | 0.411 | 0.901 | 0.707 | 0.482 | 0.636 | 0.774 | 0.785 |
(Ruff et al. 2018) | |$\pm 0.043$| | |$\pm 0.043$| | |$\pm 0.040$| | |$\pm 0.040$| | |$\pm 0.025$| | |$\pm 0.044$| | |$\pm 0.015$| | |$\pm 0.008$| | |$\pm 0.022$| | |$\pm 0.027$| | |$\pm 0.054$| | |$\pm 0.055$| | |$\pm 0.068$| | |$\pm 0.025$| |
MCDSVDD | |$\mathbf {0.686}$| | |$\mathbf {0.828}$| | |$\mathbf {0.624}$| | |$\mathbf {0.584}$| | |$\mathbf {0.706}$| | 0.512 | 0.770 | 0.483 | 0.854 | |$\mathbf {0.858}$| | |$\mathbf {0.819}$| | |$\mathbf {0.945}$| | |$\mathbf {0.953}$| | 0.953 |
(Perez-Carrasco et al. 2023) | |$\mathbf {\pm 0.051}$| | |$\mathbf {\pm 0.024}$| | |$\mathbf {\pm 0.039}$| | |$\mathbf {\pm 0.032}$| | |$\mathbf {\pm 0.069}$| | |$\pm 0.113$| | |$\pm 0.127$| | |$\pm 0.080$| | |$\pm 0.041$| | |$\mathbf {\pm 0.025}$| | |$\mathbf {\pm 0.015}$| | |$\mathbf {\pm 0.006}$| | |$\mathbf {\pm 0.003}$| | |$\pm 0.008$| |
Classifier + IForest | |$\mathbf {0.757}$| | |$\mathbf {0.811}$| | |$\mathbf {0.619}$| | 0.556 | |$\mathbf {0.715}$| | |$\mathbf {0.720}$| | 0.945 | 0.456 | |$\mathbf {0.977}$| | |$\mathbf {0.766}$| | 0.504 | |$\mathbf {0.811}$| | |$\mathbf {0.907}$| | |$\mathbf {0.969}$| |
(This work) | |$\mathbf {\pm 0.047}$| | |$\mathbf {\pm 0.017}$| | |$\mathbf {\pm 0.073}$| | |$\pm 0.039$| | |$\mathbf {\pm 0.028}$| | |$\mathbf {\pm 0.032}$| | |$\pm 0.015$| | |$\pm 0.041$| | |$\mathbf {\pm 0.003}$| | |$\mathbf {\pm 0.066}$| | |$\pm 0.111$| | |$\mathbf {\pm 0.038}$| | |$\mathbf {\pm 0.026}$| | |$\mathbf {\pm 0.016}$| |
Classifier + MCIF | 0.567 | 0.699 | |$\mathbf {0.536}$| | |$\mathbf {0.560}$| | 0.615 | 0.701 | 0.882 | |$\mathbf {0.605}$| | 0.893 | |$\mathbf {0.875}$| | |$\mathbf {0.742}$| | 0.773 | 0.808 | 0.779 |
(This work) | |$\pm 0.091$| | |$\pm 0.046$| | |$\mathbf {\pm 0.061}$| | |$\mathbf {\pm 0.034}$| | |$\pm 0.048$| | |$\pm 0.045$| | |$\pm 0.050$| | |$\mathbf {\pm 0.051}$| | |$\pm 0.025$| | |$\mathbf {\pm 0.036}$| | |$\mathbf {\pm 0.044}$| | |$\pm 0.031$| | |$\pm 0.046$| | |$\pm 0.107$| |
MCIF | 0.503 | 0.668 | 0.532 | |$\mathbf {0.643}$| | 0.614 | |$\mathbf {0.745}$| | |$\mathbf {0.966}$| | 0.446 | 0.907 | 0.514 | 0.433 | 0.476 | 0.447 | 0.959 |
(This work) | |$\pm 0.018$| | |$\pm 0.008$| | |$\pm 0.007$| | |$\mathbf {\pm 0.005}$| | |$\pm 0.02$| | |$\mathbf {\pm 0.008}$| | |$\mathbf {\pm 0.003}$| | |$\pm 0.007$| | |$\pm 0.007$| | |$\pm 0.013$| | |$\pm 0.009$| | |$\pm 0.021$| | |$\pm 0.011$| | |$\pm 0.004$| |
Performance of each model when applied to the data set used in Perez-Carrasco et al. (2023). Each row represents a different anomaly detection algorithm and each column represents a different class being chosen as the anomalous class. The performance is evaluated using the AUROC score of detected anomalies. The top three metrics per class are marked in bold. The AUROC scores for the first five methods are taken directly from and are reported in Perez-Carrasco et al. (2023). A visual representation of this table is shown in Fig. B1.
As seen in Table 2 (and visually in Fig. B1), our classifier-based approach with an isolation forest is among the top approaches for most transient classes, demonstrating the power of using a classifier’s latent space for anomaly detection. Using the classifier with MCIF also performs promisingly, although on this data set it is sometimes worse than using the classifier with a single isolation forest. This is not the case on our own data set, as discussed further in Appendix A.
4.5 Scaling the latent space
Anomaly detection poses a unique challenge for model evaluation because of the nature of unsupervised learning: true anomalies are only revealed during the final testing phase. Consequently, we refrain from tuning hyperparameters for model selection and instead retrospectively analyse the effects of different hyperparameter choices, particularly the size of the latent space, without risking overfitting during the development process.
To assess the impact of latent space size on anomaly detection performance, we train multiple models with varying latent dimensions and evaluate them using the AUROC. As shown in Fig. 10, increasing the latent size beyond 50 leads to significant improvements in anomaly detection performance, with diminishing returns thereafter. Smaller models generally exhibit lower average performance and higher variance. Interestingly, we do not observe a performance drop in high-dimensional latent spaces, despite the presence of numerous correlated features. This robustness can be attributed to the effectiveness of isolation forests and tree-based algorithms in handling high-dimensional data by selectively identifying the most discriminative features for data partitioning.
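For each latent size in this sweep, the evaluation reduces to computing the AUROC over anomaly scores, treating the true anomalies as the positive class. A minimal sketch of that computation, using hypothetical score distributions (the Gaussians below are illustrative stand-ins, not our model's output):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical anomaly scores for held-out common and anomalous transients;
# a well-trained detector assigns higher scores to anomalies on average.
rng = np.random.default_rng(1)
common_scores = rng.normal(0.0, 1.0, 500)
anomaly_scores = rng.normal(2.0, 1.0, 50)

y_true = np.concatenate([np.zeros(500), np.ones(50)])  # 1 = true anomaly
y_score = np.concatenate([common_scores, anomaly_scores])
auroc = roc_auc_score(y_true, y_score)
assert 0.5 < auroc <= 1.0  # well above chance for separated populations
```

Repeating this over models trained with different latent sizes yields the curve of AUROC versus latent dimension summarized in Fig. 10.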
Based on Fig. 10, we select a latent size of 100 neurons for this work, as it yields a high anomaly detection AUROC score. However, this choice is somewhat arbitrary: any sufficiently large latent size beyond 50 dimensions appears to provide comparable performance.
Interestingly, while classifiers prove effective for anomaly detection, we do not find a significant correlation between classification accuracy and anomaly detection performance. This highlights a key interpretability challenge in both traditional neural networks and our approach, warranting further investigation.
5 CONCLUSION
The advent of large-scale wide-field surveys has revolutionized time-domain astronomy, unveiling an extraordinary diversity of astrophysical phenomena. Surveys such as the Zwicky Transient Facility and the upcoming Legacy Survey of Space and Time conducted by the Vera C. Rubin Observatory promise to discover transients in vast numbers, with LSST expected to generate millions of alerts each night. This deluge of data presents both a challenge and an opportunity: while the sheer volume of detections makes manual inspection infeasible, it also offers the potential to discover entirely new classes of transients that are rare or have remained hidden in previous surveys. To fully harness the potential of these wide-field surveys, automated methods for real-time anomaly detection are essential. By identifying and prioritizing the most unusual and interesting events amid the flood of data, these techniques can facilitate rapid follow-up and characterization, enabling us to deepen our understanding of the time-domain universe.
The most common approach to transient anomaly detection is to construct a feature extractor that maps light curves to a low-dimensional latent space, and then to apply clustering or outlier detection algorithms to identify anomalies within that space. Constructing a latent space that represents transients well for anomaly detection is difficult, and most previous approaches have relied on either user-defined features or unsupervised deep learning.
In this work, we have introduced a novel approach that leverages the latent space of a neural network classifier for identifying anomalous transients. Our pipeline, which combines a deep recurrent neural network classifier with our novel MCIF anomaly detection method, demonstrates promising performance on simulated data matched to the characteristics of the Zwicky Transient Facility.
The key advantages of our approach are:
The RNN classifier maps light curves into a low-dimensional latent space that naturally clusters similar transient classes together, providing an effective representation for anomaly detection. We repurposed the penultimate layer of this classifier as the feature space for anomaly detection.
Our novel MCIF method addresses the limitations of using a single isolation forest on the complex latent space by training separate isolation forests for each known transient class and taking the minimum score as the final anomaly score.
Our classifier input format eliminates the need for interpolation by incorporating time and passband information, enabling the model to learn inter-passband relationships and handle irregular sampling.
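The MCIF scoring rule summarized above (one isolation forest per known class, minimum score across classes) can be sketched as follows. This is a minimal illustration on a toy 2D latent space; the `n_estimators` value and class setup are assumptions for the example, not the exact configuration used in this work:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

class MultiClassIsolationForest:
    """Train one isolation forest per known class; score a sample by the
    minimum anomaly score across classes, so a sample is only flagged if
    it looks anomalous with respect to *every* known class."""

    def __init__(self, n_estimators=200, random_state=0):
        self.n_estimators = n_estimators
        self.random_state = random_state
        self.forests = {}

    def fit(self, latents, labels):
        for cls in np.unique(labels):
            forest = IsolationForest(n_estimators=self.n_estimators,
                                     random_state=self.random_state)
            forest.fit(latents[labels == cls])
            self.forests[cls] = forest
        return self

    def anomaly_score(self, latents):
        # score_samples is higher for normal points, so negate it to get
        # an anomaly score, then take the minimum over per-class forests.
        per_class = np.stack([-f.score_samples(latents)
                              for f in self.forests.values()])
        return per_class.min(axis=0)

# Toy usage: two well-separated "common" clusters in a 2D latent space.
rng = np.random.default_rng(42)
common = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])
labels = np.array([0] * 100 + [1] * 100)
mcif = MultiClassIsolationForest().fit(common, labels)
outlier = np.array([[4.0, 4.0]])  # far from both clusters
inlier = np.array([[0.0, 0.0]])   # centre of the first cluster
assert mcif.anomaly_score(outlier)[0] > mcif.anomaly_score(inlier)[0]
```

Taking the minimum is what lets MCIF respect a multi-cluster latent space: a point near any one class's cluster receives a low score, even if it is far from the global centroid of the training data.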
To mimic a real-world scenario, we evaluated our approach on a realistic simulated data set containing 12 040 common transients and 54 anomalous events. After following up MCIF’s top 2000 ranked transients, we accurately identified |$41 \pm 3$| out of the 54 true anomalies. That is, after following up the top 15 per cent highest ranked scores, we recovered 75 per cent of the true anomalies. CaRTs look very similar to common supernovae, and thus are difficult to identify. If we exclude CaRTs from our anomalous sample, our recovery of anomalies increases sharply to 87 per cent (|$47 \pm 2$| out of the 54 true anomalies) after following up the top 2000 (|$\sim 15~{{\rm per\ cent}}$|) highest scoring transients.
The learned latent space exhibits clear separation between common and anomalous transient classes, and our preliminary analysis suggests the potential for real-time anomaly detection using limited early-time observations. The pre-trigger information encoded by our RNN enables our model to identify anomalous transients at early stages in the light curve, and even by trigger, a significant separation between common and anomalous transients is captured. In particular, KN, PISN, and ILOT all stand out as anomalous shortly after the time of trigger.
Future work encompasses several promising directions. First, benchmarking our model against other similar approaches is important for a comprehensive performance assessment. Comparing models is currently difficult because a standard test data set of anomalies does not exist. Developing a realistic benchmark data set that encompasses a representative population of common transients and example anomalous transients would improve the quality of methods developed by the community and enable robust evaluation metrics. Moreover, a detailed comparative analysis of MCIF against previous class-by-class anomaly detection approaches should be carried out to better understand their relative strengths and limitations in this domain.
Secondly, integrating techniques from other anomaly detection methods, such as active learning (Lochner & Bassett 2021), could help us to distinguish new anomalies as interesting or not. Beyond direct anomaly detection, MCIF can be used to identify which known class an anomalous object most closely resembles based on the individual isolation forest scores. Additionally, we plan to apply the proposed architecture to real observational data, moving beyond simulations and testing the model’s effectiveness in a practical astronomical context.
A significant contribution of this work is the demonstration that a well-trained classifier can be effectively repurposed for anomaly detection by leveraging the clustering properties of its latent space. The flexibility of our approach allows any classifier to be adapted into an anomaly detector: for example, existing classifiers for spectra, images, or time series from other domains could be used as feature extractors to build effective anomaly detectors.
Another significant advantage of our approach is that the clustering properties of the latent space extend to unseen data, enabling few-shot classification of astronomical transients with limited labelled examples. This will be useful for the early observations from new surveys such as LSST. Furthermore, our input method lends itself well to transfer learning from one survey to another because it explicitly uses the passband wavelength. Future work should explore transfer learning from ZTF data to other surveys such as PanSTARRS or LSST simulations.
In conclusion, our novel approach to real-time anomaly detection in astronomical light curves, combining a deep neural network classifier with multiclass isolation forests, demonstrates the power of leveraging well-clustered latent space representations for identifying rare and unusual transients. As the era of large-scale astronomical surveys continues to produce unprecedented volumes of data, the development and refinement of such techniques will be crucial for making discoveries in time-domain astronomy.
ACKNOWLEDGEMENTS
We would like to thank the Cambridge Centre for International Research (CCIR) for fostering this collaboration. ML acknowledges support from the South African Radio Astronomy Observatory and the National Research Foundation (NRF) towards this research. Opinions expressed and conclusions arrived at, are those of the authors and are not necessarily to be attributed to the NRF.
This work made use of the python programming language and the following packages: numpy (Harris et al. 2020), matplotlib (Hunter 2007), seaborn (Waskom 2021), scikit-learn (Pedregosa et al. 2011), pandas (McKinney 2010), astropy (Astropy Collaboration 2022), umap-learn (Sainburg, McInnes & Gentner 2021), keras (Chollet 2015), and tensorflow (Abadi et al. 2015).
We acknowledge the use of the ilifu cloud computing facility – www.ilifu.ac.za, a partnership between the University of Cape Town, the University of the Western Cape, Stellenbosch University, Sol Plaatje University, and the Cape Peninsula University of Technology. The ilifu facility is supported by contributions from the Inter-University Institute for Data Intensive Astronomy (IDIA – a partnership between the University of Cape Town, the University of Pretoria, and the University of the Western Cape), the Computational Biology division at UCT and the Data Intensive Research Initiative of South Africa (DIRISA).
This work used Bridges-2 at Pittsburgh Supercomputing Center through allocation PHY240105 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program (Boerner et al. 2023), which is supported by U.S. National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.
DATA AVAILABILITY
The code used in this work is publicly available. The models used to create the simulations that generate the data used in this work were released in PLAsTiCC (Kessler et al. 2019) and are available at https://zenodo.org/record/2612896#.YYAz1NbMJhE. To generate light curves following ZTF observing properties, we use the SNANA software package (Kessler et al. 2009), developed for PLAsTiCC, with observing logs from the ZTF survey. A version of these simulations was first used in Muthukrishna et al. (2019) and has since been updated to resolve a known problem with core-collapse SNe. The data are available upon reasonable request to the corresponding author.
Footnotes
The public MSIP ZTF survey has since changed to a 2-d median cadence.
The expected relative frequencies of each class are taken from Kessler et al. (2019), developed for the PLAsTiCC data set.
We empirically find that there is little difference between an LSTM and GRU model, in both classification accuracy and anomaly detection.
We use the umap-learn implementation in python with the hyperparameters ‘minimum distance’ (min_dist) set to 0.5 and ‘number of neighbours’ (n_neighbors) set to 500.
This usage of the word recall refers to a different population distribution than the one defined in equation (8).
We can use MCIF on its own as this is a data set of features extracted from time-series, not the raw time-series.
APPENDIX A: ADVANTAGES OF MCIF
Before proposing the MCIF pipeline, we attempted to use a normal isolation forest to detect anomalies from the latent representation |$z_s$| of a light curve. We trained an isolation forest on all the common classes of our training data using 200 estimators. To account for the class imbalance in our training data, we weighted samples from underrepresented classes more heavily during the training of the isolation forest, using the same weighting scheme as in equation (3). The anomaly score function |$A(Z_s)$| was simply the negated anomaly score output from a single isolation forest trained on all the latent representations of the training data.
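A minimal sketch of this single-forest baseline, assuming the latent vectors and integer class labels are already available as arrays; the inverse-frequency weighting below is a simple stand-in for the scheme in equation (3):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def fit_weighted_iforest(latents, labels, n_estimators=200, random_state=0):
    """Single isolation forest over all common classes, with samples from
    under-represented classes up-weighted by inverse class frequency."""
    classes, counts = np.unique(labels, return_counts=True)
    freq = dict(zip(classes, counts))
    weights = np.array([len(labels) / freq[y] for y in labels])
    forest = IsolationForest(n_estimators=n_estimators,
                             random_state=random_state)
    forest.fit(latents, sample_weight=weights)
    return forest

# The anomaly score A(z_s) is the negated score_samples output:
rng = np.random.default_rng(0)
latents = rng.normal(0.0, 1.0, (300, 8))   # stand-in latent vectors
labels = np.array([0] * 250 + [1] * 50)    # imbalanced common classes
forest = fit_weighted_iforest(latents, labels)
far = -forest.score_samples(np.full((1, 8), 6.0))[0]  # far from training data
near = -forest.score_samples(np.zeros((1, 8)))[0]     # near the mode
assert far > near
```

Because a single forest models all common classes jointly, any common class that forms an isolated cluster in the latent space can itself receive a high anomaly score, which motivates the per-class forests of MCIF.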
As shown in Fig. A2, there is little distinction in the anomaly scores of most anomalous and common classes when using a single isolation forest. Surprisingly, the common classes SLSN-I and AGN are classified as relatively more anomalous than all the other classes. The distribution of anomaly scores in Fig. A1 reveals that although there is overall separation between common and anomalous classes, certain common classes are classified as very anomalous.
The UMAP reduction of the latent space, as depicted in Fig. 4, provides insight into this behaviour. The SLSN-I and AGN classes are significantly distant from the main cluster formed by other classes. This isolation from the central cluster may explain the high anomaly scores associated with these classes. This hypothesis is also supported by the near-perfect classification of these classes, shown in the confusion matrix and ROC curves in Fig. 3. In fact, the near-perfect classification hinted towards this poor result in anomaly detection, showing us that these transients are easy to separate from other classes, and hence are also easy to mark as anomalous. In summary, while an isolation forest is good at detecting anomalies, it struggles to capture the structure of a latent space with numerous clusters. This drawback of using a single isolation forest could explain why other works report high anomaly scores for SLSN-I and AGN (e.g. Villar et al. 2021). Using a class-by-class (or cluster-by-cluster) anomaly detector, such as MCIF, can mitigate this. Directly comparing Fig. 6 and Fig. A1 empirically demonstrates the advantages of MCIF.
Further analysis of MCIF’s performance on the comparative evaluation data set (Section 4.4) reveals that, contrary to the results shown in Fig. 6, a single isolation forest generally outperforms MCIF (Table 2). Investigating the UMAP representations of the latent space for classes exhibiting this discrepancy offers insight. When SNII is considered anomalous, the latent space (Fig. A3 [left]) lacks clear separation between SNIbc and SNIa, likely due to poor generalization caused by the limited number of SNIbc transients in the training set; this explains the single isolation forest’s superior performance. For the DSCT class (Fig. A3 [right]), by contrast, distinct visual clusters are present and MCIF achieves better results. These findings suggest that MCIF improves performance when the majority classes are well separated, a characteristic that appears inherent to the data set rather than to the classifier-based latent space: on most classes where a single isolation forest outperforms MCIF on the classifier’s latent space, it also does so on the raw data. Future research should explore how MCIF’s effectiveness depends on the separability of the raw data; the SNII case suggests a partial dependence on data quantity, as more data improves the DNN’s ability to generalize.

The distribution of anomaly scores for full light curves when using a single isolation forest for anomaly detection. The scores are derived from the unseen anomalous data and the common transient testing data. The bottom five classes are the anomalous classes. There is some separation between the anomaly scores of common and anomalous classes, but certain common classes are considered very anomalous (unlike when using MCIF as seen in Fig. 6).
The median anomaly score for each class computed for latent representations of transients obtained from full light curves when a single isolation forest is used for anomaly detection [bottom] and when MCIF is used [top] (identical to Fig. 5, reproduced here for convenience). The scores are derived from the unseen anomalous data and the common transient testing data. The five classes on the right (scores in bold) are anomalous. The common classes have somewhat lower median scores when using a single isolation forest, but the common classes SLSN-I and AGN (among others) are considered very anomalous, unlike when using MCIF.
The UMAP reduction of the training data in the latent space for classifiers trained with SNII [left] and DSCT [right] set aside as anomalous, using the data introduced in Perez-Carrasco et al. (2023) and used in Section 4.4. As the UMAP plots only the training data, it includes all the classes in the respective hierarchical category (see Table 2) except the one set aside as anomalous.
APPENDIX B: VISUAL COMPARISON TO OTHER APPROACHES
Fig. B1 is a visual representation of the results depicted in Table 2.

Visual representation of the comparative analysis depicted in Table 2. The AUROC is written for the top three models for each class.
APPENDIX C: THE KILONOVA DIP
As illustrated in Fig. 9, there is an unusual dip in the anomaly scores of KN around the trigger time. Further analysis reveals that most common classes also experience a similar dip at trigger, but they do not rebound to high anomaly scores afterwards. Examples are shown in Fig. C1.
The trigger of a light curve often corresponds to the first observation detected as part of the transient phase, and very few common classes can be classified effectively before trigger. Where effective pre-trigger classification does occur, it is likely due to host galaxy information (host redshift and Milky Way extinction) or to the periodic nature of certain transient events (e.g. AGNs), which are already midway through their evolution at trigger. Fig. C2 shows that classes with high pre-trigger classification accuracy (e.g. SLSN-I) have consistently declining anomaly scores before trigger. In contrast, classes with poor pre-trigger classification (e.g. SNIIn and SNIa) exhibit a slight increase in anomaly scores before trigger, followed by a sudden dip. This sudden dip resembles the behaviour of KN and coincides precisely with the sudden jump in classification performance. This suggests that our pipeline struggles to detect KN as anomalous before trigger for the same reasons it is unable to classify SNIIn before trigger. Although KN exhibits a slight upward trend before trigger, the new observation near trigger appears to carry far more weight (likely due to its high S/N).
For observations after trigger, we found that the anomaly score ‘resets’ to mark the true beginning of the transient phase. For example, in the case of KN, the high-S/N trigger observation signals the actual start of the transient and resets the anomaly score. Subsequent observations, characterized by a sudden decline back to the background level, quickly push KN into the anomalous category (as short-time-scale events are rare). However, in the case of poorly classified common classes, this reset is followed by a further decline in anomaly score as classification accuracy increases, making the dip appear normal. A similar effect can be seen in other anomalous classes, most notably in uLens-BSR transients. This effect is less pronounced because the first rise of uLens-BSR transients does not always coincide with trigger, leading to a distributed dip around trigger for uLens-BSR in Fig. 9 and a dip offset from trigger in Fig. C1.
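The ‘reset’ behaviour described above can be illustrated by re-scoring a light curve after every new observation. The sketch below is purely illustrative: `toy_latent` is a hypothetical stand-in for the classifier’s penultimate-layer embedding (crude summary statistics rather than a trained network), and a single isolation forest is fit on the latents of common transients.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def toy_latent(times, fluxes):
    """Hypothetical stand-in for the classifier's penultimate-layer
    embedding: summary statistics of the observations seen so far."""
    return np.array([fluxes.mean(), fluxes.std(),
                     fluxes.max(), times[-1] - times[0]])


def realtime_scores(times, fluxes, forest):
    """Re-encode and re-score the light curve after each new
    observation, mimicking the real-time anomaly-score curves:
    each new point can 'reset' the score by reshaping the latent."""
    return np.array([-forest.score_samples(
        toy_latent(times[:n], fluxes[:n])[None, :])[0]
        for n in range(2, len(times) + 1)])
```

With this setup, a light curve that declines rapidly after trigger moves its latent vector away from the common-transient locus observation by observation, so its anomaly score climbs after the dip rather than continuing to fall.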

Real-time anomaly scores for a sample KN, uLens-BSR, and SNIa. They all exhibit a significant dip near trigger, but the dip for the SNIa is followed by a further decline, whereas KN and uLens-BSR show a sharp increase after the dip. In the light curves in the top panels, the blue markers represent the g-band and the orange markers represent the r-band fluxes.
Real-time AUROC values for selected classes [left] and real-time anomaly scores for a different subset of classes than Fig. 9 [right]. Classes that are poorly classified pre-trigger (e.g. SNIIn and SNIa) exhibit a dip in anomaly score similar to KN, which coincides precisely with the sudden increase in classification accuracy.
APPENDIX D: EXAMPLE LIGHT CURVES
A sample light curve from each class is illustrated in Fig. D1.

Sample light curves from each transient class used in this work. We only plot transients with low signal-to-noise and low host redshift (|$z \lt 0.5$|) to help visually compare shapes. The dark circular markers represent the r band while the light triangular markers represent the g band. Flux errors are not plotted.